Muffinresearch Labs by Stuart Colville

Working around uTidyLib’s unicode handling

Posted in Code, Linux/Unix on 29th July 2008, 9:16 pm by Stuart

A couple of weeks back I was giving uTidyLib a hard time for exploding when passed a unicode string (see “FAIL of the week: uTidyLib unicode error”).

Looking further into using uTidyLib for tidying HTML, I found there is a way to make it handle unicode correctly. The workaround is quite simple: pass an encoded string object and set “char_encoding” to “utf8”, and unicode gets handled as expected.

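The reason this works is that what’s being passed into uTidyLib is an encoded byte string (a str in Python 2) and not a unicode object, and “char_encoding” tells Tidy how to interpret those bytes. Without that option the bytes are still accepted, but the output comes back garbled.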

>>> type('Iñtërnâtiônàlizætiøn')
<type 'str'>
>>> type(u'Iñtërnâtiônàlizætiøn')
<type 'unicode'>
>>> type('Iñtërnâtiônàlizætiøn'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> type(u'Iñtërnâtiônàlizætiøn'.encode('utf8'))
<type 'str'>
>>> import tidy
>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'show-body-only': 1}))
<strong>I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n</strong>

>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'char_encoding': "utf8", 'show-body-only': 1}))
<strong>Iñtërnâtiônàlizætiøn</strong>
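
To see where the garbled entities in the first call come from, here is a minimal sketch that reproduces them without Tidy at all. It assumes Tidy, when not told the encoding, reads each UTF-8 byte as a separate Latin-1 character and then writes named entities for anything non-ASCII; that is inferred from the output above rather than from Tidy's source. The sketch is written for Python 3 (the session above is Python 2), and the helper names misread_as_latin1 and to_entities are made up for illustration.

```python
from html.entities import codepoint2name


def misread_as_latin1(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def to_entities(text):
    """Swap non-ASCII characters for named HTML entities where one exists."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 127 and cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append(ch)
    return "".join(out)


word = "Iñtërnâtiônàlizætiøn"

# Each accented letter is two bytes in UTF-8, so it turns into two
# Latin-1 characters, and each of those becomes its own entity.
garbled = to_entities(misread_as_latin1(word))
print(garbled)

# The damage is reversible as long as the bytes themselves were not altered.
restored = misread_as_latin1(word).encode("latin-1").decode("utf-8")
print(restored == word)
```

The first print should give the same entity soup as the body of the first tidy.parseString call, which is a fair hint that the bytes were fine all along and only the declared encoding was missing.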

Inspiring a Sense of Ownership

Former colleague Mike West talks about how inspiring a team’s sense of ownership around a project is the key to great things happening: http://mikewest.org/2008/11/the-inspiration-of-ownership. Quality stuff.

VMware Server: Convert Fixed Disk-images to Growable

Quick tip: if you ever want to convert a fixed disk image to an expandable one, the following command should do it:

sudo vmware-vdiskmanager -r source.vmdk -t 0 expandable.vmdk

Just replace “source” and “expandable” with your disk image file names. For more on what vmware-vdiskmanager can do for you, type vmware-vdiskmanager -h.

© Copyright 2004-08 Stuart Colville, all rights reserved. May contain traces of Muffin.