Invalid Byte Encoding with Postgres JSONB columns
I've been running a fairly successful website with drizzle-orm, but have recently come across a problem that I cannot for the life of me figure out. Any insight into a solution, or even how to debug this further, would be helpful.
I have a pretty standard setup:
- Postgres 16 (Docker image)
- Web frontend using Nuxt; data used for the backend API is provided by drizzle
I have a few tables that store information in jsonb columns. This data contains strings (people's names) that often have characters outside the standard en-US character set. For a long time (almost a year) this worked well for me and I had not noticed any errors. Recently, though, I've started to receive errors like this from Postgres:
```
invalid byte sequence for encoding "UTF8": 0xf8
invalid byte sequence for encoding "UTF8": 0xe5 0x6b 0x61
```
What I found out from investigating this is that the data being sent to Postgres is using an encoding like UCS-2. The byte sequence reported in the error is the Unicode code point of the extended character, not its UTF-8 sequence. I determined this by running db.execute(sql`${JSON.stringify(entries)}::bytea`) and then dumping the resulting buffer with toString('hex').
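(A minimal sketch of that kind of check; `db` is the app's drizzle instance, `entries` is the object bound for the jsonb column, and the result shape shown assumes the node-postgres driver.)

```ts
import { sql } from 'drizzle-orm';

// Send the serialized payload down the same driver path, cast to bytea,
// and dump the bytes Postgres actually received. A correctly encoded é
// arrives as c3 a9; the failing requests showed the bare code point e9.
const payload = JSON.stringify(entries);
const result = await db.execute(sql`SELECT ${payload}::bytea AS raw`);
const raw = result.rows[0].raw as Buffer; // node-postgres returns bytea as a Buffer
console.log(raw.toString('hex'));
```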
What's really crazy about this is that I can only get it to happen on my production server after about 30 minutes of uptime. Visiting a page triggers an update of a row in the database; the update, with extended characters, works immediately after the service comes up, but will then consistently fail after some amount of time.
I started with the postgres.js driver and assumed that it was a bug in their implementation, so I switched to node-postgres and ran into the same issue.
Right now I have no real guesses on what is going wrong; all the code is pretty straightforward. Maybe something with server-side rendering is changing the default encoding output of JSON.stringify?
Any help is appreciated!
Oh, one additional thing to mention here: I cannot reproduce this at all on my machine. I can't even create dummy data that will cause this Postgres error. If anything, I'd like to understand how this is even possible with an update clause through the ORM.
I've investigated more and have begun dumping the hex output of the string produced by JSON.stringify and of the data going into my jsonb column.
It turns out that when this error begins to occur, that output has the raw Unicode code point encoded into it, NOT the UTF-8 byte sequence. I've checked this with a hex dump along the lines of the sketch below.
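(A minimal sketch, not the exact code from the thread; `entries` stands in for the object bound for the jsonb column.)

```ts
// Compare the UTF-8 bytes of the serialized payload with its per-character
// code points; whichever matches the bytes Postgres complains about tells
// you which encoding actually went over the wire.
const serialized = JSON.stringify(entries);

const utf8Hex = Buffer.from(serialized, 'utf8').toString('hex'); // é -> c3 a9
const codePointHex = [...serialized]
  .map((ch) => ch.codePointAt(0)!.toString(16).padStart(2, '0'))
  .join('');                                                      // é -> e9

console.log({ utf8Hex, codePointHex });
```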
So for the error `invalid byte sequence for encoding "UTF8": 0xe9 0x73 0x20`, the hex for the word that contains the incorrect symbol is `47 69 76 72 E9 73 20` instead of `47 69 76 72 C3 A9 73 20`.
Based on this, I think the error is actually in something hijacking the global JSON.stringify behavior.

I don't think drizzle would do that.
You can try a different serializer/parser to see if there's a difference.
Yea, I think it's something in one of the nuxt extensions I am using. Trying a different serializer is actually a great idea, I'll give that a shot tomorrow.
For completeness, I created the following type and used another serializer, which resolved the core issue.
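(Not the exact type from the thread; a minimal sketch of that approach using drizzle's customType, with a hypothetical serializer that escapes everything outside ASCII so the driver only ever sees 7-bit bytes.)

```ts
import { customType } from 'drizzle-orm/pg-core';

// jsonb column type with an explicit serializer. Escaping non-ASCII
// characters to \uXXXX means a broken text encoding further down the
// stack can no longer mangle accented names.
const safeJsonb = <TData>(name: string) =>
  customType<{ data: TData; driverData: string }>({
    dataType() {
      return 'jsonb';
    },
    toDriver(value: TData): string {
      return JSON.stringify(value).replace(
        /[\u0080-\uffff]/g,
        (ch) => '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'),
      );
    },
  })(name);

// Usage in a table definition, e.g.:
// details: safeJsonb<{ name: string }[]>('details'),
```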
Unfortunately, though, I am now seeing the same error when updating / inserting normal varchar columns. I feel something is deeply wrong outside of drizzle and will continue to investigate.
Please let us know
Minor update here - I've continued to be flummoxed by this. The errors have disappeared.
The only change I have in place currently is sanitizing all input I get from APIs with the following.
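(Not the exact code; a sketch of that kind of sanitation pass, run on parsed API responses before they are handed to drizzle.)

```ts
// Walk a parsed payload and round-trip every string through UTF-8 so any
// invalid sequences are replaced rather than passed on to the database.
function sanitize<T>(value: T): T {
  if (typeof value === 'string') {
    // Lone surrogates become U+FFFD instead of leaking invalid bytes to the
    // driver; NFC keeps accented characters in a stable form.
    return Buffer.from(value, 'utf8').toString('utf8').normalize('NFC') as unknown as T;
  }
  if (Array.isArray(value)) {
    return value.map((item) => sanitize(item)) as unknown as T;
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, item]) => [key, sanitize(item)]),
    ) as T;
  }
  return value;
}
```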
I'll likely stop posting updates here as this isn't a drizzle issue (that I know of), but one thing I never figured out was how to reproduce this locally. Even strings defined with Unicode escapes like \u00f8 were handled correctly locally.
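(A hypothetical local repro attempt; the table and column names are made up, and `db` is the app's drizzle instance.)

```ts
import { eq } from 'drizzle-orm';
import { pgTable, serial, jsonb } from 'drizzle-orm/pg-core';

// Hypothetical table used only for this repro attempt.
const people = pgTable('people', {
  id: serial('id').primaryKey(),
  details: jsonb('details'),
});

// Locally, the escaped character always reached Postgres as the correct
// UTF-8 bytes (0xC3 0xB8), never the bare code point 0xF8 from the error.
await db
  .update(people)
  .set({ details: { name: 'S\u00f8ren J\u00f8rgensen' } })
  .where(eq(people.id, 1));
```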
Anyway, thank you for the help, and if anyone finds this in the future and runs into the same problem, please let me know as I'd love to understand what's going on.

Sounds good, thanks for the follow-up!