Working around uTidyLib's unicode handling

A couple of weeks back I was giving uTidyLib a hard time for exploding when passed a unicode string. (see FAIL of the week: uTidyLib unicode error).

Looking further into using uTidyLib for tidying HTML I found there is a way to make uTidyLib handle unicode correctly. The workaround is quite simple; by passing an encoded string object and setting "char_encoding" to "utf8" it's possible to get unicode handled as expected.

The reason this works is that what's being passed into uTidyLib is a string and not a unicode object.

>>> type('Iñtërnâtiônàlizætiøn')
<type 'str'>
>>> type(u'Iñtërnâtiônàlizætiøn')
<type 'unicode'>
>>> type('Iñtërnâtiônàlizætiøn'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> type(u'Iñtërnâtiônàlizætiøn'.encode('utf8'))
<type 'str'>
>>> import tidy
>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'show-body-only': 1}))
<strong>I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n</strong>

>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'char_encoding': "utf8", 'show-body-only': 1}))
<strong>Iñtërnâtiônàlizætiøn</strong>
comments powered by Disqus