Yeah I had a backend with poor support for anything that wasn’t ASCII. So my solution was turning everything into hex before storing it. I wonder if people are still using it.
Yeah I had a backend with poor support for anything that wasn’t ASCII
PHP is like this. Poor Unicode support, but it treats strings as raw bytes so it usually works well enough. It turns out a programming language can take data from a form, save it to a database, then later load and render it, without having to know what those bytes actually mean, as long as the app or browser knows it’s UTF-8, for example through a Content-Type header or meta tag.
The tricky thing is that all the standard string manipulation functions (strlen, substr, etc.) don’t handle Unicode properly at all: they deal with the number of bytes rather than the number of characters. You need to use the “multibyte” (Unicode-ready) equivalents like mb_substr, but a lot of PHP developers forget to do this and end up with string truncation code that cuts UTF-8 characters in half (e.g. if it’s truncating a long title with an emoji in it, it might cut the title off in the middle of the three or four bytes that represent the emoji and leave only some of them).