Protecting the Scala Stream Collector and Scala Stream Enrich against known security vulnerabilities


#1

We are using the Scala Stream Collector and Scala Stream Enrich to feed into Kinesis Analytics. We analyzed our setup for security vulnerabilities using ZAProxy and identified various issues: extra special characters are allowed through, which causes the data load to Redshift to fail.
I have 2 questions:

  1. Has anyone observed this before?
  2. Are there any known best practices we can use to protect ourselves against these issues?

Regards
Bhanu


#2

Thanks for raising this @kaushikbhanu. Is the problem covered by this ticket in RDB Loader:

Or is it a separate issue?


#3

It doesn’t seem like it.
We tried sending “\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W” as data in one of the fields. It passes through the Scala Stream Collector and Scala Stream Enrich as-is, but when we try to copy it into Redshift, the COPY fails and complains that the length exceeds the DDL.
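One thing worth checking here (a sketch on our side, not part of any Snowplow component): Redshift `VARCHAR(n)` limits are measured in *bytes*, not characters, so a multi-byte character such as “…” consumes three bytes of the DDL budget in UTF-8. A value that looks short can therefore still exceed the column definition. A minimal pre-load check along these lines, with a hypothetical `FieldLength` helper, could be:

```scala
// Hypothetical helper (not part of Snowplow): check whether a string fits
// a Redshift VARCHAR(n) column, whose limit is counted in UTF-8 *bytes*,
// and truncate it on a code-unit boundary if it does not.
import java.nio.charset.StandardCharsets

object FieldLength {
  // True if `value` fits a Redshift VARCHAR(maxBytes) column
  def fitsDdl(value: String, maxBytes: Int): Boolean =
    value.getBytes(StandardCharsets.UTF_8).length <= maxBytes

  // Shorten `value` until its UTF-8 byte length fits the DDL limit
  def truncateToBytes(value: String, maxBytes: Int): String = {
    var end = value.length
    while (end > 0 &&
           value.substring(0, end).getBytes(StandardCharsets.UTF_8).length > maxBytes)
      end -= 1
    value.substring(0, end)
  }
}
```

Note that “…” alone already occupies 3 bytes, so a run of escaped backslashes plus ellipses adds up much faster than the visible character count suggests.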


#4

To give more background on our setup: we have Scala Stream Collector -> Kinesis stream -> Scala Stream Enrich -> Kinesis stream -> Kinesis Analytics -> Kinesis Firehose -> Redshift. We are mostly tracking unstructured events with browser and mobile contexts, etc. We were able to protect our custom schemas with regexes, length checks, etc., but we don’t want to manage/maintain the Snowplow schemas ourselves in our repo. Hence the question!
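For the regex/length approach mentioned above, a guard applied between enrichment and the Firehose load might look like the following. This is a sketch under our own assumptions: the `FieldGuard` object, the allow-list pattern, and the 64-character cap are all hypothetical choices, not Snowplow definitions.

```scala
// Hypothetical guard (not part of Snowplow): validate free-text tracker
// fields such as `aid` (app ID) against an allow-list pattern and a
// length cap before they reach the downstream load.
object FieldGuard {
  // Allow alphanumerics, dot, dash and underscore, 1 to 64 characters
  private val AppIdPattern = "^[A-Za-z0-9._-]{1,64}$".r

  def isValidAppId(aid: String): Boolean =
    AppIdPattern.pattern.matcher(aid).matches()

  // Pass the value through, or replace it with a safe default
  def sanitizeAppId(aid: String, fallback: String = "unknown"): String =
    if (isValidAppId(aid)) aid else fallback
}
```

Replacing rather than dropping the field keeps the event loadable while flagging it for later inspection; whether that trade-off is acceptable depends on how the downstream consumers use the field.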


#5

Which field are you sending this through? I’m not sure this is a security issue so much as a value exceeding a database column length (which ideally shouldn’t happen, but having a non-zero MAXERROR on a Redshift load isn’t uncommon).


#6

That’s not a supported architecture - a Snowplow pipeline loading Redshift would use our Kinesis S3 Loader and our standard load process - so it’s difficult for us to treat this as an issue in Snowplow…


#7

I get that this is not part of the pipeline supported by Snowplow, but in my opinion this is a Stream Enrich issue.


#8

@mike all the fields were manipulated. The idea is to send garbage data, as shown in the example, and see how the pipeline behaves. For example, we sent this value in aid or eid; it passes the enricher, which then fails the pipeline downstream. If we wanted to protect against this, how would we do it? I don’t think it should matter what the consumer of the processed data is.


#9

I think this might fail with the way you are loading data via Firehose, but \\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\…\\W is a valid appid and should load into Redshift, or any other downstream target, correctly in the standard pipeline.


#10

But you are not seeing how the pipeline behaves, because you are not using the Snowplow pipeline downstream. If the same issue occurs with the actual Snowplow pipeline, we can file a bug against the offending Snowplow project.


#11

@Alex will try that out. Is Kinesis Analytics support somewhere on the roadmap?


#12

Not currently I’m afraid Bhanu.