RDBLoader PostgreSQL Error


#1

I’m receiving this error when my RDBLoader step runs:

ERROR: Data loading error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
  Where: COPY events, line 1
Following steps completed: [Discover,Download]

I’m running PostgreSQL 9.6.6 on AWS RDS.
rdb_loader: 0.14.0
rdb_shredder: 0.13.1
emr etl: 104 stoplesteinan

I see in the Snowplow docs it demonstrates installing PostgreSQL 9.1… is 9.6 compatible? I can’t go any lower than 9.4.7


#2

Hi @caleb_bertsch,

Yes, latest version of PostgreSQL should be supported - we’re using 42.0.0 JDBC Driver, which should be compatible with quite broad range of servers.

This is not something I’ve seen before (though, need to admit - Postgres is not something that we use/test actively). Is your shredder output gzipped?


#3

Yes, the shredded files are gzipped. Is it normal for there to be many empty shredded files? (see screenshot, the 20 byte objects are empty after decompression)


#4

Thanks @caleb_bertsch. No, empty files are fine, this is an artifact of Spark/Hadoop distribution model.

Just as an experiment can we try to unzip couple of files and run RDB Loader against it? Here’s the default load-statement that we use to load data, just in case:

COPY $tableName FROM STDIN
 WITH CSV ESCAPE E'\x02' QUOTE E'\x01'
 DELIMITER '\t'
 NULL ''

I suspect that there’s nothing in RDB Loader takes gzip output compression into account for Postgres. If this is the case then switching to none output compression should help. Let me know if it works - I’ll submit a bugreport.


#5

Thanks @anton

So yes the manual COPY command with ungzipped data worked.
I see now in another emr-etl config that Redshift is the only database that supports gzip compression. The issue is that stream mode also only supports gzip. So as of right now, would you agree that it’s impossible to use PostgreSQL and stream mode together?


#6

Thanks for update, @caleb_bertsch, I created an issue https://github.com/snowplow/snowplow-rdb-loader/issues/107

If I remember right, it is up to EmrEtlRunner to decide whether RDB Shredder should dump data in gz archive and EmrEtlRunner itself makes this decision based on output_compression setting in config.yml, so whole process should be decoupled from Stream/Batch mode.


#7

I can confirm that turning off gzip compression fixed the issue.

I believe that makes the comment on line #56 here incorrect and should be set to NONE for PostgreSQL.