Muffinresearch Labs by Stuart Colville

Working around uTidyLib’s unicode handling

Posted in Code, Linux/Unix on 29th July 2008, 9:16 pm by Stuart

A couple of weeks back I was giving uTidyLib a hard time for exploding when passed a unicode string (see FAIL of the week: uTidyLib unicode error).

Looking further into using uTidyLib for tidying HTML, I found there is a way to make uTidyLib handle unicode correctly. The workaround is quite simple: by passing an encoded string object and setting “char_encoding” to “utf8”, it’s possible to get unicode handled as expected.

The reason this works is that what’s being passed into uTidyLib is an encoded byte string (a str in Python 2) and not a unicode object.

>>> type('Iñtërnâtiônàlizætiøn')
<type 'str'>
>>> type(u'Iñtërnâtiônàlizætiøn')
<type 'unicode'>
>>> type('Iñtërnâtiônàlizætiøn'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> type(u'Iñtërnâtiônàlizætiøn'.encode('utf8'))
<type 'str'>
>>> import tidy
>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'show-body-only': 1}))
<strong>I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n</strong>

>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'char_encoding': "utf8", 'show-body-only': 1}))
<strong>Iñtërnâtiônàlizætiøn</strong>
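
The garbled entities in the first tidy call can be reproduced without tidy at all. Here is a minimal sketch, assuming that without “char_encoding” each input byte is read as a single Latin-1 character and every non-ASCII character is written out as a named entity (that is my reading of the output above, not something taken from the uTidyLib source). The names to_entities, misread and roundtrip are made up for illustration.

```python
# -*- coding: utf-8 -*-
try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2

text = u'Iñtërnâtiônàlizætiøn'
utf8_bytes = text.encode('utf8')  # each accented letter becomes two bytes

# Assumption: with no char_encoding set, every byte is treated as one
# Latin-1 character, so each two-byte sequence turns into two characters.
misread = utf8_bytes.decode('latin-1')


def to_entities(s):
    """Replace each non-ASCII character with an HTML entity (hypothetical helper)."""
    return u''.join(
        c if ord(c) < 128
        else u'&%s;' % codepoint2name.get(ord(c), u'#%d' % ord(c))
        for c in s
    )


as_entities = to_entities(misread)

# Decoding the same bytes as UTF-8 gives back the original text.
roundtrip = utf8_bytes.decode('utf8')

print(as_entities)
print(roundtrip == text)
```

The first line printed matches the entity soup inside the strong tags above, which suggests the bytes themselves were fine all along and only the assumed encoding was wrong.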

Using Loggerhead with mod_wsgi

Here’s a post I wrote over on the Project Fondue Blog about our use of Loggerhead with mod_wsgi under Apache. Loggerhead is the rather nice branch viewer for bazaar branches as used on Launchpad.net.

If you’re not already subscribed to the Project Fondue blog feed then I can recommend it, as there should be some interesting posts coming out of there in the coming months (yes, I’m unashamedly biased!).

Ubuntu: Turn off changing workspace with mouse wheel

I found changing the workspace with the mouse wheel really annoying. To disable it, go to System => Preferences => CompizConfig (available if the compizconfig-settings-manager package is installed) and uncheck “Viewport Switcher”, which is under the “Desktop” heading.


© Copyright 2004-10 Stuart Colville, all rights reserved.