Muffinresearch Labs by Stuart Colville

Working around uTidyLib’s unicode handling

Posted in Code, Linux/Unix on 29th July 2008, 9:16 pm by Stuart

A couple of weeks back I was giving uTidyLib a hard time for exploding when passed a unicode string (see FAIL of the week: uTidyLib unicode error).

Looking further into using uTidyLib for tidying HTML, I found there is a way to make it handle unicode correctly. The workaround is quite simple: pass an encoded string object (a str rather than a unicode object) and set “char_encoding” to “utf8”, and the unicode comes through as expected.

The reason this works is that what’s being passed into uTidyLib is a byte string and not a unicode object, and “char_encoding” tells it how those bytes are encoded.

>>> type('Iñtërnâtiônàlizætiøn')
<type 'str'>
>>> type(u'Iñtërnâtiônàlizætiøn')
<type 'unicode'>
>>> type('Iñtërnâtiônàlizætiøn'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> type(u'Iñtërnâtiônàlizætiøn'.encode('utf8'))
<type 'str'>
>>> import tidy
>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'show-body-only': 1}))
<strong>I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n</strong>

>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'char_encoding': "utf8", 'show-body-only': 1}))
<strong>Iñtërnâtiônàlizætiøn</strong>
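
To see where the garbled entities from the first tidy.parseString call come from, here is a minimal sketch that doesn't need uTidyLib at all. It assumes that, without “char_encoding”, the UTF-8 bytes get read one byte per character as Latin-1, which is what the &Atilde;&plusmn; pairs in the output suggest. The to_entities helper is hypothetical and only there for illustration; it is not part of uTidyLib.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the mojibake by decoding UTF-8 bytes as Latin-1.
try:
    from html.entities import codepoint2name  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def to_entities(text):
    """Hypothetical helper: swap non-ASCII characters for HTML entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append(u'&%s;' % codepoint2name[cp])
        else:
            out.append(u'&#%d;' % cp)
    return u''.join(out)


text = u'Iñtërnâtiônàlizætiøn'
utf8_bytes = text.encode('utf8')

# Assumed behaviour without char_encoding: each byte becomes one character.
misread = utf8_bytes.decode('latin-1')
print(to_entities(misread))

# With the right encoding the bytes round-trip back to the original text.
roundtrip = utf8_bytes.decode('utf8')
print(roundtrip == text)
```

The first print gives the same run of entities that sits inside the <strong> tags above: every accented character is two bytes in UTF-8, so it shows up as &Atilde; followed by a second entity.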

GNU screen: open tab in current working directory

A nice trick for having screen open a new tab in the same directory as the one you’re currently in. To use it, add it to your .screenrc:

# Open new window in current dir.
bind c stuff "screen -X chdir \$PWD;screen^M"
bind ^c stuff "screen -X chdir \$PWD;screen^M"

Hat tip: mteckert on SuperUser.com

Ubuntu: add-apt-repository: command not found

If the ‘add-apt-repository’ command is missing on a minimal Ubuntu install (it’s useful for adding PPAs and other repositories), simply run:

sudo apt-get install python-software-properties

© Copyright 2004-12 Stuart Colville, all rights reserved. May contain traces of Muffin.