I have an issue with my batch pipeline today.
The COPY to Redshift fails because some values of domain_sessionidx are overflowing. The type in the database is int2 which has its maximum at 32767. I have multiple entries with a domain_sessionidx above that limit (probably a bot or something?).
I’d like to change the type of that column to something like BIGINT, but since Redshift doesn’t allow altering a column’s type, I’ll probably need to add a new column. I was wondering if the column order matters for the data pipeline (the inserting part, specifically), or if recreating the column (it would end up as the last column) is enough. Otherwise I’ll have to duplicate all the data into a new table just to change the column type.
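For reference, the "duplicate into a new table" route is what Redshift calls a deep copy. A minimal sketch of the statement sequence, assuming the `atomic.events_new` table has already been created from the real atomic.events DDL with `domain_sessionidx` widened to BIGINT (the schema/table names and helper function here are illustrative, not part of any Snowplow tooling):

```python
def deep_copy_sql(schema: str, table: str) -> list[str]:
    """Build the Redshift 'deep copy' statement sequence: copy all rows
    into a pre-created *_new table (whose DDL has the widened column),
    then swap the table names so queries keep working unchanged."""
    old = f"{schema}.{table}"
    new = f"{schema}.{table}_new"
    return [
        f"INSERT INTO {new} SELECT * FROM {old};",
        f"ALTER TABLE {old} RENAME TO {table}_old;",
        f"ALTER TABLE {new} RENAME TO {table};",
    ]
```

On a table with billions of rows the INSERT … SELECT step is the expensive part, which is why this is not a "quick fix" at scale.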
I guess another solution would be to parse the files and replace the overflowing values.
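That file-rewriting idea can be sketched as a small filter over the enriched TSV lines. The column position below is an assumption for illustration; check it against your actual enriched-event format before running anything like this:

```python
INT2_MAX = 32767      # Redshift int2 (SMALLINT) maximum
SESSIONIDX_IDX = 15   # assumption: 0-based index of domain_sessionidx in the TSV

def clamp_sessionidx(line: str) -> str:
    """Rewrite one tab-separated event line, capping an overflowing
    domain_sessionidx at the int2 maximum so COPY no longer fails."""
    fields = line.rstrip("\n").split("\t")
    value = fields[SESSIONIDX_IDX]
    if value and int(value) > INT2_MAX:
        fields[SESSIONIDX_IDX] = str(INT2_MAX)
    return "\t".join(fields)
```

Capping rather than dropping the value keeps the row loadable while preserving the rest of the event; dropping the offending rows entirely would be the other obvious variant.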
The short answer is yes, unfortunately order does matter: atomic.events is loaded as a TSV, so the fields must appear in the order the loader expects when data is copied into the atomic.events table.
Copying the data into another table and then reloading into atomic.events is probably your best bet, though it’s a little odd for even a bot to rack up more than 32k sessions. If it’s a very unusual case, it may be easier to add some custom code to the JavaScript tracker to not send events for a certain useragent string (if that’s what it turns out to be).
Mmmh alright.
Yeah this is really unusual. First time in 6 months that I see this.
The useragent doesn’t look like a bot. But everything seems to be coming from the same user.
For now, I think I’ll just run a script to modify the shredded lines concerned. If it happens again, I’ll look into a longer-term solution.
I had the same issue and was looking for a quick fix, but as our events table is huge (several billion rows), I don’t think copying and reloading the data would be that quick.
The entries come from a bot, averaging ~15k domain_sessionidx increments per day for ~11 days.
Also, for the first time in years, another user (a bot?) is quickly approaching the maximum value of this field.
In the end, I remember I went with the approach @ihor presented and it worked just fine. IIRC, once the column type has been changed, you don’t need to update anything else.
Redshift doesn’t enforce constraints the way Postgres does. Although constraints are useful hints for the query planner in Redshift, they don’t do much beyond that.