
Posting this in the hope that no one else wastes three hours, as I just did, tracking down such a pointless bug.

So a new laptop means a whole load of installing and updating, and as I start working on my website again I notice that there are some Unicode issues in some of my database data. I'm still stuck using MySQL (until I bring my CouchDB branch up to date), mainly because I lack the patience to move my stuff to Postgres. Unfortunately this means I'm stuck with the many nasty design choices made by the MySQL team.

I track the issue down to the fact that my data is being double encoded into UTF-8: bytes that are already UTF-8 are being treated as latin1 and encoded to UTF-8 a second time. I've made specific efforts to keep all my databases in UTF-8 because it is the only sane encoding. I seriously have less respect for anyone who chooses another character set as the default, because of the amount of my life that has been wasted chasing down the inevitable issues.
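For anyone who hasn't run into double encoding before, here is a minimal Python sketch of what I think is going on. It is only a simulation and doesn't touch MySQL at all. The function names are made up for illustration, and Python's "latin-1" codec stands in for MySQL's latin1, which is really closer to cp1252, so some characters may behave differently against a real server.

```python
def double_encode(text):
    """Simulate UTF-8 bytes being misread as latin1 and re-encoded as UTF-8."""
    utf8_bytes = text.encode("utf-8")       # what the application sends
    misread = utf8_bytes.decode("latin-1")  # each byte treated as one latin1 character
    return misread.encode("utf-8")          # encoded to UTF-8 a second time


def undo_double_encoding(data):
    """Reverse the simulation: decode once, map characters back to bytes, decode again."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


mangled = double_encode("caf\u00e9")  # "cafe" with an acute accent on the e
print(mangled)  # b'caf\xc3\x83\xc2\xa9'
print(undo_double_encoding(mangled) == "caf\u00e9")  # True
```

The single accented character goes in as two bytes and comes out as four, which is why the text displays as two junk characters wherever there should be one.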

So the database is in UTF-8, but when I run SET NAMES 'utf8'; in mysql to make sure the client is getting UTF-8, I get the same issues. Basically, unless I leave the client encoding as latin1, the data is going to be doubly encoded. Rather than tell you what I think of the developer who coded that, I'll just give you the solution that worked for me in Django. Put this line in your settings:

DATABASE_OPTIONS = dict(charset="latin1", use_unicode=False)

Update 2009-01-05

So admittedly that was a hacky solution, and it turns out it breaks the Django admin. After more hours looking through the Django source code trying to find a way to fix the hack, I give up.

The problem is that if you have latin1 data in your database, no matter what you do to the database the data will remain in that encoding. If you have the patience to go through every column in the database, converting each one to a blob and back, then this may be your solution. You could probably even script it. For me, I've spent far too long trying to fix this bug. It only shows up in my staging environment, so I'll live with it. And as soon as I can get away from the nastiness of MySQL I will.
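If you do want to script the blob round trip, something like the following might be a starting point. It is an untested sketch that only builds the SQL statements as strings. The helper name, the table and column names, and the assumption that every column ends up as TEXT in utf8 are all mine, not something I've run against my own database. Note that MODIFY replaces the whole column definition, so attributes such as NOT NULL or a default would need adding back by hand, and you'd want a backup before running anything it prints.

```python
def blob_roundtrip_sql(table, column, final_type="TEXT", charset="utf8"):
    """Build the two ALTER TABLE statements for one column (hypothetical helper).

    Step 1 turns the column into a BLOB so MySQL stops interpreting the bytes.
    Step 2 turns it back into a text column declared with the target charset.
    """
    # Quote identifiers with backticks, doubling any backtick inside the name.
    quoted_table = "`%s`" % table.replace("`", "``")
    quoted_column = "`%s`" % column.replace("`", "``")
    return [
        "ALTER TABLE %s MODIFY %s BLOB;" % (quoted_table, quoted_column),
        "ALTER TABLE %s MODIFY %s %s CHARACTER SET %s;"
        % (quoted_table, quoted_column, final_type, charset),
    ]


# Hypothetical list of (table, column) pairs to convert.
columns = [("blog_post", "title"), ("blog_post", "body")]

for table, column in columns:
    for statement in blob_roundtrip_sql(table, column):
        print(statement)
```

Printing the statements instead of executing them means you can read through the output and try it on a copy of the database first.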