John Gruber recently published a Perl script that converts strings to title case. It avoids capitalizing small words, following rules from the New York Times Manual of Style, and caters for several special cases.
Before porting the Perl script I tried out Python's built-in title() string method to see how well it handled text according to the various rules:
>>> test="this is zed's favorite outburst"
>>> test.title()
"This Is Zed'S Favorite Outburst"
As you can see, it capitalizes the letter after the apostrophe and makes no exception for small words, so it really can't cope with anything remotely complicated.
I originally knocked up a direct port of the script, but I found it a tad unwieldy and so I decided on a fresh approach which processes the text after splitting the strings on whitespace characters.
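To make the split-on-whitespace idea concrete, here is a minimal sketch of that general approach. It is not the code from titlecase.py: the function name simple_titlecase, the SMALL_WORDS list and the two regular expressions are all assumptions made for illustration, and it ignores quotes and sub-phrases after a colon entirely.

```python
import re

# Hypothetical small-word list for illustration; the real script's list may differ.
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "en", "for", "if",
               "in", "of", "on", "or", "the", "to", "v", "v.", "via", "vs", "vs."}


def simple_titlecase(text):
    """Title-case text word by word after splitting on whitespace."""
    words = text.split()
    result = []
    for index, word in enumerate(words):
        is_edge = index == 0 or index == len(words) - 1
        if re.search(r"[A-Z]", word[1:]) or re.search(r"\w\.\w", word):
            # Leave words with internal capitals (iTunes, AT&T) or
            # inline dots (example.com) exactly as they were typed.
            result.append(word)
        elif word.lower() in SMALL_WORDS and not is_edge:
            # Small words stay lowercase unless they start or end the title.
            result.append(word.lower())
        else:
            # Capitalize only the first character, so "zed's" becomes
            # "Zed's" rather than "Zed'S".
            result.append(word[0].upper() + word[1:])
    return " ".join(result)


print(simple_titlecase("this is zed's favorite outburst"))
# This Is Zed's Favorite Outburst
```

Working on whole words keeps each rule a simple test against one token. The cost is that runs of whitespace are collapsed to single spaces when the words are joined back together.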
Before starting on either approach, I wrote test cases for all of the examples John gave in his post, and later added a few of my own, some based on John's archive of posts. This made coding a lot easier, as it meant that once the tests stopped failing I could stop coding. Of course, with something like this there are likely to be more cases that need catering for; if you have ideas for new test cases, please add them in the comments.
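A tests-first workflow like this can be sketched with the standard unittest module, generating one test per (input, expected) pair. The helper names make_test_case and run_cases are hypothetical and not taken from titlecase.py, and the built-in str.title stands in for the function under test so the example runs on its own.

```python
import unittest


def make_test_case(func, cases):
    """Build a TestCase class with one test method per (input, expected) pair."""
    class GeneratedTests(unittest.TestCase):
        pass

    for number, (text, expected) in enumerate(cases):
        # Default arguments bind the current pair to this test method.
        def test(self, text=text, expected=expected):
            self.assertEqual(func(text), expected)

        setattr(GeneratedTests, "test_%02d" % number, test)
    return GeneratedTests


def run_cases(func, cases):
    """Run the generated tests and return the unittest.TestResult."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(make_test_case(func, cases))
    result = unittest.TestResult()
    suite.run(result)
    return result


# str.title is only a stand-in here; it fails the second case.
cases = [
    ("a thing", "A Thing"),
    ("this is zed's favorite outburst", "This Is Zed's Favorite Outburst"),
]
result = run_cases(str.title, cases)
print(result.testsRun, len(result.failures))
```

Keeping the examples in a plain list means a new case suggested by a reader is a one-line addition.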
The script is flexible in that you can import it and use the function directly, e.g.:
>>> from titlecase import titlecase
>>> titlecase('a thing')
'A Thing'
You can pass a file of text to stdin:
$ ./titlecase.py < ~/title-case-examples
Q&A With Steve Jobs: 'That's What Happens in Technology'
What Is AT&T's Problem?
Apple Deal With AT&T Falls Through
This v That
This vs That
This v. That
This vs. That
The SEC's Apple Probe: What You Need to Know
'By the Way, Small Word at the Start but Within Quotes.'
Small Word at End Is Nothing to Be Afraid of
Starting Sub-Phrase With a Small Word: A Trick, Perhaps?
Sub-Phrase With a Small Word in Quotes: 'A Trick, Perhaps?'
Sub-Phrase With a Small Word in Quotes: "A Trick, Perhaps?"
"Nothing to Be Afraid Of?"
"Nothing to Be Afraid Of?"
A Thing
2lmc Spool: 'Gruber on OmniFocus and Vapo(u)rware'
Finally, running the script from the command line without args results in the tests being run:
$ ./titlecase.py
Testing: a thing ... ok
Testing: Apple Deal With AT&T Falls Through ... ok
Testing: The SEC's Apple Probe: What You Need to Know ... ok
Testing: What Is AT&T's Problem? ... ok
Testing: this is just an example.com ... ok
Testing: this is something listed on an del.icio.us ... ok
Testing: Generalissimo Francisco Franco: Still Dead; Kieren McCarthy: Still a Jackass ... ok
Testing: iTunes should be unmolested ... ok
Testing: "Nothing to Be Afraid of?" ... ok
Testing: "Nothing to Be Afraid Of?" ... ok
Testing: Q&A With Steve Jobs: 'That's What Happens In ... ok
Testing: Seriously, ‘Repair Permissions’ Is Voodoo ... ok
Testing: Sub-Phrase With a Small Word in Quotes: "a Trick, ... ok
Testing: Small word at end is nothing to be afraid of ... ok
Testing: 'by the Way, small word at the start but within ... ok
Testing: Sub-Phrase With a Small Word in Quotes: 'a Trick, ... ok
Testing: Starting Sub-Phrase With a Small Word: a Trick, ... ok
Testing: this v that ... ok
Testing: this v. that ... ok
Testing: this vs that ... ok
Testing: this vs. that ... ok
Testing: Reading Between the Lines of Steve Jobs’s ‘Thoughts on ... ok
Testing: 2lmc Spool: 'Gruber on OmniFocus and Vapo(u)rware' ... ok
----------------------------------------------------------------------
Ran 23 tests in 0.019s
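One way a single script could offer both behaviours is to check whether stdin is attached to a terminal. The sketch below rests on that assumption; how titlecase.py actually decides is not shown in this post, and the names choose_mode, convert_stream and main are hypothetical.

```python
import io
import sys


def choose_mode(stream):
    """Return "convert" when input is piped or redirected, "test" at a terminal."""
    return "test" if stream.isatty() else "convert"


def convert_stream(stream, func, out):
    """Write func(line) for every line read from stream."""
    for line in stream:
        out.write(func(line.rstrip("\n")) + "\n")


def main(func, run_tests, stdin=None, stdout=None):
    """Convert piped text with func, or call run_tests when run interactively."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    if choose_mode(stdin) == "convert":
        convert_stream(stdin, func, stdout)
    else:
        run_tests()


# Simulate "./script.py < file" with an in-memory stream; str.title is a
# stand-in for the real conversion function.
piped = io.StringIO("a thing\nthis vs that\n")
main(str.title, run_tests=lambda: None, stdin=piped)
```

Passing the streams in as parameters keeps the dispatch logic testable without a real terminal.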
To get the code, check it out with bzr using the command:
bzr branch lp:titlecase.py
Or download it from here: https://launchpad.net/titlecase.py/trunk/0.2