Working around uTidyLib's unicode handling

Stuart Colville

29 Jul 2008

A couple of weeks back I was giving uTidyLib a hard time for exploding when passed a unicode string (see FAIL of the week: uTidyLib unicode error).

Looking further into using uTidyLib for tidying HTML, I found there is a way to make it handle unicode correctly. The workaround is quite simple: pass a UTF-8 encoded string rather than a unicode object, and set "char_encoding" to "utf8". With both in place, unicode is handled as expected.

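The reason this works is that what's being passed into uTidyLib is a byte string (str) and not a unicode object. Note that the encode has to be called on a unicode object: calling .encode('utf8') on a plain str that contains non-ASCII bytes fails, because Python 2 first tries to decode the str as ASCII.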

>>> type('Iñtërnâtiônàlizætiøn')
<type 'str'>
>>> type(u'Iñtërnâtiônàlizætiøn')
<type 'unicode'>
>>> type('Iñtërnâtiônàlizætiøn'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> type(u'Iñtërnâtiônàlizætiøn'.encode('utf8'))
<type 'str'>
>>> import tidy
>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'show-body-only': 1}))
<strong>I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n</strong>

>>> print str(tidy.parseString(u'<strong>Iñtërnâtiônàlizætiøn'.encode('utf8'), **{'char_encoding': "utf8", 'show-body-only': 1}))
<strong>Iñtërnâtiônàlizætiøn</strong>
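
The garbled entities in the first tidy call above look like what you get when UTF-8 bytes are read one byte at a time as Latin-1. The sketch below reproduces that output without uTidyLib. It assumes that is how the input is interpreted when "char_encoding" is not set, which I haven't confirmed against the tidy source; entity_escape is a hypothetical helper written for this illustration and is not part of uTidyLib.

```python
# -*- coding: utf-8 -*-
try:
    from html.entities import codepoint2name  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def entity_escape(text):
    """Hypothetical helper: swap each non-ASCII character for its named HTML entity."""
    return u''.join(
        u'&%s;' % codepoint2name[ord(ch)] if ord(ch) > 127 else ch
        for ch in text
    )


text = u'Iñtërnâtiônàlizætiøn'
utf8_bytes = text.encode('utf8')

# Assumption: with no char_encoding, every byte is treated as one Latin-1 character,
# so each two-byte UTF-8 sequence turns into two separate characters.
misread = utf8_bytes.decode('latin-1')
print(entity_escape(misread))
# I&Atilde;&plusmn;t&Atilde;&laquo;rn&Atilde;&cent;ti&Atilde;&acute;n&Atilde;&nbsp;liz&Atilde;&brvbar;ti&Atilde;&cedil;n

# Decoding the same bytes as UTF-8 gets the original text back.
print(utf8_bytes.decode('utf8') == text)
# True
```

Every non-ASCII character in the sample is two bytes in UTF-8, starting with 0xC3, which is why each one shows up as &Atilde; followed by a second entity.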
