a solved problem unsolved itself: WordPress, MySQL, UTF-8

I have been running a WordPress blog hosted on a Linux-based shared host since WordPress became a thing. It has worked quite well from about 2004 up until a few weeks ago. Sadly, *something* recently decided my database encoding was wrong. And that something decided to "fix" it. It certainly "fixed it", but not in any way I could want. It also did the same for Catherine's blog. I know I didn't change any part of the config chain. As far as I can see: * the MySQL database still thinks the text is encoded in UTF-8; * Wordpress thinks the data is in UTF-8; * the web server is serving UTF-8. (NB: there's going to be some UTF-8 and hex chars in this message.) A typical post which shows this problem is https://scruss.com/blog/2016/02/27/t%c9%92k-b%c9%92ks-a-tiny-hardware-speech... When I should be seeing something like: [tɒk bɒks] 5b74 c992 6b20 62c9 926b 735d I'm seeing this in the page served up (and in the db text itself): [tÉ’k bÉ’ks] 5b74 c389 e280 996b 2062 c389 e280 996b 735d So the phonetic character U+0252 has been mangled into U+00C9 + U+2019. Every UTF-8 character seems to be affected this way. I wasn't expecting to wake up to a UTF-8 encoding problem this decade. There are a raft of "how to fix WP encoding issues" pages that show up in web searches, but the newest of them is from 2008 or so. I'm pretty much resigned to going through 16+ years of posts fixing this, but can mangled UTF-8 be recovered without rekeying? cheers, Stewart

| From: Stewart C. Russell via talk <talk@gtalug.org> | I have been running a WordPress blog hosted on a Linux-based shared host since | WordPress became a thing. It has worked quite well from about 2004 up until a | few weeks ago. Do you have shell access? I think you imply "yes". | Sadly, *something* recently decided my database encoding was wrong. And that | something decided to "fix" it. It certainly "fixed it", but not in any way I | could want. It also did the same for Catherine's blog. Does "fix it" mean "changed the raw data" or mangle the data somewhere downstream of the disk files? | I know I didn't change any part of the config chain. As far as I can see: | | * the MySQL database still thinks the text is encoded in UTF-8; | | * Wordpress thinks the data is in UTF-8; | | * the web server is serving UTF-8. And says so: <meta charset>"UTF-8" /> | I'm pretty much resigned to going through 16+ years of posts fixing this, but | can mangled UTF-8 be recovered without rekeying? Back-ups?

On 2021-11-27 18:04, D. Hugh Redelmeier wrote:
Do you have shell access? I think you imply "yes".
Yes, I do, but not to the database server. All I have for that is socket access and PHPMyAdmin (blecch).
Does "fix it" mean "changed the raw data" or mangle the data somewhere downstream of the disk files?
"fix it" meant "broke it". The MySQL DB tables seem to have been quietly reprocessed from one encoding to another.
Back-ups?
Would have been an excellent idea, yes. Stewart

| From: Stewart C. Russell via talk <talk@gtalug.org> | On 2021-11-27 18:04, D. Hugh Redelmeier wrote: | > | > Do you have shell access? I think you imply "yes". | | Yes, I do, but not to the database server. All I have for that is socket | access and PHPMyAdmin (blecch). Ahh. Kind of "no". | > Does "fix it" mean "changed the raw data" or mangle the data somewhere | > downstream of the disk files? | | "fix it" meant "broke it". I quoted "fix it" because I understood that. | The MySQL DB tables seem to have been quietly | reprocessed from one encoding to another. That seems (1) odd and (2) rude. Is there a chance that the problem is actually in presenting the data (due to some incorrect setting of a locale somewhere)? Can you ask you supplier just what happened, why, and if they can reverse it for you? There is a chance that the transformation was bijective (or at least injective) and thus reversible.

WordPress updates the database when it upgrades, so it entirely possible that the database is profoundly changed by an upgrade. I dump my WordPress database daily and back it up (I also use rsync to keep an updated collection of media (mostly photos) in the wp-content folder, because I hate losing or reconstructing my blogs. I don't have any insight into the issue mentioned by OP though. On Mon., Nov. 29, 2021, 13:54 D. Hugh Redelmeier via talk, <talk@gtalug.org> wrote:
| From: Stewart C. Russell via talk <talk@gtalug.org>
| On 2021-11-27 18:04, D. Hugh Redelmeier wrote: | > | > Do you have shell access? I think you imply "yes". | | Yes, I do, but not to the database server. All I have for that is socket | access and PHPMyAdmin (blecch).
Ahh. Kind of "no".
| > Does "fix it" mean "changed the raw data" or mangle the data somewhere | > downstream of the disk files? | | "fix it" meant "broke it".
I quoted "fix it" because I understood that.
| The MySQL DB tables seem to have been quietly | reprocessed from one encoding to another.
That seems (1) odd and (2) rude.
Is there a chance that the problem is actually in presenting the data (due to some incorrect setting of a locale somewhere)?
Can you ask you supplier just what happened, why, and if they can reverse it for you?
There is a chance that the transformation was bijective (or at least injective) and thus reversible. --- Post to this mailing list talk@gtalug.org Unsubscribe from this mailing list https://gtalug.org/mailman/listinfo/talk

On 27/11/2021 14:41, Stewart C. Russell via talk wrote:
I have been running a WordPress blog hosted on a Linux-based shared host since WordPress became a thing. It has worked quite well from about 2004 up until a few weeks ago. <snip> So the phonetic character U+0252 has been mangled into U+00C9 + U+2019. Every UTF-8 character seems to be affected this way.
I wasn't expecting to wake up to a UTF-8 encoding problem this decade. There are a raft of "how to fix WP encoding issues" pages that show up in web searches, but the newest of them is from 2008 or so.
I'm pretty much resigned to going through 16+ years of posts fixing this, but can mangled UTF-8 be recovered without rekeying?
Probably. If you've been running it for 10+ years, there is/was most certainly some latin1 data hanging around, that's likely been converted to UTF-8, or UTF-8 that's been double-encoded somewhere along the line. This page has a section on the possible incorrect casing issue and a fix: https://codex.wordpress.org/Converting_Database_Character_Sets#Variant:_3-st... The rest of the page has a lot of useful information as well that might apply to your situation. Another thing to try is using mysqli_set_charset("UTF8"); somewhere in your site's code. Substitute in different character sets until you find the correct one, and then you'll be able to figure out a way to migrate your tables to whatever WordPress wants. Cheers, Jamon

On 2021-11-29 16:25, Jamon Camisso via talk wrote:
Another thing to try is using mysqli_set_charset("UTF8"); somewhere in your site's code. Substitute in different character sets until you find the correct one ...
Thanks, Jamon, but there isn't a valid encoding for what my database seems to be holding. It was UTF-8, and now it's seemingly UTF-8 decoded to CP1252 bytes re-encoded to UTF-8 characters again. If WordPress were using Python (it's not), if my db held the 4 character, 6 byte UTF-8 string, the equivalent Python code to end up in the mess I'm in is: >>> bytes(bytes("côté",encoding='utf-8').decode(encoding='cp1252'), encoding='utf-8') b'c\xc3\x83\xc2\xb4t\xc3\x83\xc2\xa9' or 6 characters / 10 bytes of gibberish ('côté'). Since this happened in the last month or so, it's not really a legacy encoding issue. Perfectly good UTF-8 got destroyed with no input/changes from me. I'd been fairly careful with backups for the first decade of running this blog, but the process got wearing after a while, especially since every update went flawlessly so the manual backup process was a waste of time. Wordpress offers automatic updates without forcing a backup checkpoint, which I think is wrong. cheers, Stewart

On 01/12/2021 08:05, Stewart C. Russell via talk wrote:
On 2021-11-29 16:25, Jamon Camisso via talk wrote:
Another thing to try is using mysqli_set_charset("UTF8"); somewhere in your site's code. Substitute in different character sets until you find the correct one ...
Thanks, Jamon, but there isn't a valid encoding for what my database seems to be holding. It was UTF-8, and now it's seemingly UTF-8 decoded to CP1252 bytes re-encoded to UTF-8 characters again.
If WordPress were using Python (it's not), if my db held the 4 character, 6 byte UTF-8 string, the equivalent Python code to end up in the mess I'm in is:
>>> bytes(bytes("côté",encoding='utf-8').decode(encoding='cp1252'), encoding='utf-8') b'c\xc3\x83\xc2\xb4t\xc3\x83\xc2\xa9'
or 6 characters / 10 bytes of gibberish ('côté'). Since that encoding is reversible, can you attempt it on some of the corrupted posts/pages? e.g.
bytes(bytes('côté', encoding='utf-8').decode(), encoding='cp1252').decode() 'côté'
Since this happened in the last month or so, it's not really a legacy encoding issue. Perfectly good UTF-8 got destroyed with no input/changes from me.
I'd been fairly careful with backups for the first decade of running this blog, but the process got wearing after a while, especially since every update went flawlessly so the manual backup process was a waste of time. Wordpress offers automatic updates without forcing a backup checkpoint, which I think is wrong.
Is it a managed Wordpress? That's terribly bad sounding if it is. Worse I suppose if Wordpress itself just did it. Do any of the casting suggestions on that link that I sent fix it? Or are you going to have to dump each row and run them through that double-decoding process? Jamon

On Thu, Dec 02, 2021 at 09:11:38AM -0500, Stewart C. Russell via talk wrote:
I haven't had a chance to try them yet, but your note about the transformation being reversible gives me hope that it can be fixed.
Here is someone that hit problems latin1 vs utf8 encoding for wordpress this year when mysql was upgraded to version 8: https://www.bigmessowires.com/2021/10/15/wordpress-latin1-and-utf-8-part-2/ -- Len Sorensen

It is fixable. I’ve dealt with similar transformation issues in the past (with a rather different cause). You can’t use iconv to do the conversion of what the web server outputs, as there are a few bytes that show up that aren’t considered to be valid cp1252 code points. We’ll see if the bit of Perl I’ve attached makes it through; it works on the page content I paste in, even not mangling the actually UTF-8 post_title. It doesn’t include any of the DBD::MySQL bits that you would use to SELECT and UPDATE post_content FROM the wp_posts table (remember to take a full backup first).
On Dec 2, 2021, at 09:11, Stewart C. Russell via talk <talk@gtalug.org> wrote:
On 2021-12-01 21:53, Jamon Camisso via talk wrote:
Do any of the casting suggestions on that link that I sent fix it?
I haven't had a chance to try them yet, but your note about the transformation being reversible gives me hope that it can be fixed.
Seneca

On 27/11/2021 14:41, Stewart C. Russell via talk wrote:
I have been running a WordPress blog hosted on a Linux-based shared host since WordPress became a thing. It has worked quite well from about 2004 up until a few weeks ago. <snip> So the phonetic character U+0252 has been mangled into U+00C9 + U+2019. Every UTF-8 character seems to be affected this way.
I wasn't expecting to wake up to a UTF-8 encoding problem this decade. There are a raft of "how to fix WP encoding issues" pages that show up in web searches, but the newest of them is from 2008 or so.
I'm pretty much resigned to going through 16+ years of posts fixing this, but can mangled UTF-8 be recovered without rekeying?
Probably. If you've been running it for 10+ years, there is/was most certainly some latin1 data hanging around, that's likely been converted to UTF-8, or UTF-8 that's been double-encoded somewhere along the line. This page has a section on the possible incorrect casing issue and a fix: https://codex.wordpress.org/Converting_Database_Character_Sets#Variant:_3-st... The rest of the page has a lot of useful information as well that might apply to your situation. Another thing to try is using mysqli_set_charset("UTF8"); somewhere in your site's code. Substitute in different character sets until you find the correct one, and then you'll be able to figure out a way to migrate your tables to whatever WordPress wants. Cheers, Jamon
participants (6)
-
D. Hugh Redelmeier
-
Jamon Camisso
-
Lennart Sorensen
-
Seneca Cunningham
-
Stewart C. Russell
-
William Witteman