<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Luke Loeffler &#187; language</title>
	<atom:link href="http://lukeloeffler.com/tag/language/feed/" rel="self" type="application/rss+xml" />
	<link>http://lukeloeffler.com</link>
	<description></description>
	<lastBuildDate>Thu, 29 Apr 2010 20:07:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Color of Rhyme</title>
		<link>http://lukeloeffler.com/2009/the-color-of-rhyme/</link>
		<comments>http://lukeloeffler.com/2009/the-color-of-rhyme/#comments</comments>
		<pubDate>Sat, 08 Aug 2009 02:42:57 +0000</pubDate>
		<dc:creator>luke</dc:creator>
				<category><![CDATA[blog]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://lukeloeffler.com/?p=344</guid>
		<description><![CDATA[Over the last few weeks I&#8217;ve been exploring language and words and how to deal with them algorithmically. Lately I&#8217;ve been thinking about ways to visualize various aspects of language, and one of the first that came to mind was the idea of representing the sound of words with color.
I am using the Carnegie Mellon [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last few weeks I&#8217;ve been exploring language and words and how to deal with them algorithmically. Lately I&#8217;ve been thinking about ways to visualize various aspects of language, and one of the first that came to mind was the idea of representing the sound of words with color.</p>
<p>I am using the Carnegie Mellon Pronouncing Dictionary to encode each word. The CMUPD provides a list of 39 phonemes, the distinct sounds that make up spoken English. These are as follows:</p>
<div style="height: 150px; width: 300px; overflow-x: hidden; overflow-y: scroll;">
<pre>Phoneme Example Translation
AA	odd	AA D
AE	at	AE T
AH	hut	HH AH T
AO	ought	AO T
AW	cow	K AW
AY	hide	HH AY D
B 	be	B IY
CH	cheese	CH IY Z
D 	dee	D IY
DH	thee	DH IY
EH	Ed	EH D
ER	hurt	HH ER T
EY	ate	EY T
F 	fee	F IY
G 	green	G R IY N
HH	he	HH IY
IH	it	IH T
IY	eat	IY T
JH	gee	JH IY
K 	key	K IY
L 	lee	L IY
M 	me	M IY
N 	knee	N IY
NG	ping	P IH NG
OW	oat	OW T
OY	toy	T OY
P 	pee	P IY
R 	read	R IY D
S 	sea	S IY
SH	she	SH IY
T 	tea	T IY
TH	theta	TH EY T AH
UH	hood	HH UH D
UW	two	T UW
V 	vee	V IY
W 	we	W IY
Y 	yield	Y IY L D
Z 	zee	Z IY
ZH	seizure	S IY ZH ER</pre>
</div>
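<p>As a minimal sketch of that encoding step, the snippet below parses lines in the dictionary layout (a word followed by its phonemes) and drops the stress digits so that only the 39 symbols above remain. The three sample entries and the names SAMPLE_ENTRIES and parse_entry are illustrative assumptions, not part of the dictionary or of my test application.</p>

```python
# Illustrative sample entries in the dictionary's "WORD  PH1 PH2 ..." layout.
SAMPLE_ENTRIES = [
    "GLUE  G L UW1",
    "FLU  F L UW1",
    "SNIFFING  S N IH1 F IH0 NG",
]

def parse_entry(line):
    """Split one dictionary line into (word, phonemes), dropping stress digits."""
    word, *pron = line.split()
    # Vowel phonemes end in a stress digit (0, 1 or 2); consonants have none.
    phonemes = [p.rstrip("012") for p in pron]
    return word, phonemes

encoded = dict(parse_entry(line) for line in SAMPLE_ENTRIES)
print(encoded["GLUE"])
```

<p>Stripping the digit keeps the palette at 39 entries; keeping it would treat each stress level of a vowel as a separate symbol.</p>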
<p>Initially I gave each phoneme a different hue (from 1 to 360 degrees) by spacing the phonemes evenly around the color wheel. Using this mapping, the phrase &#8220;Who knew sniffing glue could give you the flu?&#8221; translates to the following image: <a href="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/phrase1.png"><img class="alignnone size-full wp-image-345" title="phrase1" src="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/phrase1.png" alt="phrase1" width="215" height="23" /></a>. You can clearly see the phoneme UW, in magenta, at the end of each rhyming word (who, knew, glue, you, flu). Unfortunately, displaying all the other phonemes introduces a lot of visual noise.</p>
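<p>The even spacing can be sketched roughly as follows. This is an assumption-laden illustration rather than the code behind the image: it takes the phonemes in alphabetical order, starts the first one at 0 degrees, and uses full saturation and brightness. The function names phoneme_hues and hue_to_rgb are made up for the example.</p>

```python
import colorsys

# The 39 CMU phoneme symbols, in alphabetical order.
PHONEMES = ("AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
            "OW OY P R S SH T TH UH UW V W Y Z ZH").split()

def phoneme_hues(phonemes):
    """Give each phoneme a hue in degrees, evenly spaced around the wheel."""
    step = 360.0 / len(phonemes)
    return {p: i * step for i, p in enumerate(phonemes)}

def hue_to_rgb(hue_degrees):
    """Convert a hue to an 8-bit RGB triple (assumes full saturation and value)."""
    r, g, b = colorsys.hsv_to_rgb(hue_degrees / 360.0, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))

hues = phoneme_hues(PHONEMES)
print(hues["UW"], hue_to_rgb(hues["UW"]))
```

<p>With 39 phonemes, neighbours end up only about 9 degrees apart, which is one reason the full palette is hard to read.</p>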
<p>Realizing that not all phonemes would be equally represented in the corpus, I decided to find the distribution of each. The sound AH (the &#8220;uhh&#8221; in the word &#8220;hut&#8221;) is the most common, accounting for about 9.4% of phonemes (this may differ if only the most common 10% of words are analyzed; I haven&#8217;t looked). This also happens to answer the question I asked my cousin, who studies linguistics, at Christmas dinner some years back after doing imitations of other languages: &#8220;So, what does a non-English speaker hear when an American speaks?&#8221; It sounds like the answer may be &#8220;uhh&#8230; duh&#8230; buh&#8230; fuh.&#8221;</p>
<div style="height: 150px; width: 300px; overflow-x: hidden; overflow-y: scroll;">
<pre>Phon.   Count   Prob.
AH	70564	0.0938934151927
N	53577	0.0712902826622
L	44148	0.0587439274124
S	43349	0.0576807671786
T	41698	0.0554839241923
R	40794	0.0542810495348
K	38174	0.0507948420096
IH	33779	0.0449467954168
IY	30957	0.0411918039527
D	28491	0.0379105109157
M	26330	0.0350350550142
ER	25871	0.0344243033905
EH	24564	0.0326851914686
Z	23955	0.0318748478111
AA	22175	0.0295063556757
AE	19151	0.0254825802726
B	18943	0.0252058126523
P	17305	0.0230262676423
OW	17147	0.0228160306999
G	12248	0.0162973548733
F	12147	0.0161629629038
EY	11851	0.0157691012903
AO	10059	0.0133846417922
AY	9838	0.0130905761956
V	9349	0.0124399061651
NG	8692	0.0115656930567
UW	8579	0.0114153337245
HH	8439	0.0112290478262
W	7737	0.0102949571077
SH	7730	0.0102856428128
JH	5461	0.00726648064689
Y	4392	0.00584405475209
CH	4378	0.00582542616226
AW	2932	0.00390135895563
TH	2597	0.00345560341329
UH	2021	0.00268917000318
OY	1124	0.00149560964056
DH	504	0.000670629233846
ZH	482	0.000641355735543</pre>
</div>
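<p>A table like this can be produced with a short counting pass. The sketch below assumes the counts are taken over dictionary entries (one pronunciation per line) with stress digits stripped; the function name phoneme_distribution and the three-entry sample are hypothetical stand-ins for the full dictionary file.</p>

```python
from collections import Counter

def phoneme_distribution(entries):
    """Count stress-stripped phonemes over "WORD  PH1 PH2 ..." lines.

    Returns (phoneme, count, probability) tuples, most frequent first.
    """
    counts = Counter()
    for line in entries:
        pieces = line.split()
        if not pieces or pieces[0].startswith(";;;"):
            continue  # skip blank lines and ";;;" comment lines
        counts.update(p.rstrip("012") for p in pieces[1:])
    total = sum(counts.values())
    return [(p, n, n / total) for p, n in counts.most_common()]

# Hypothetical three-entry sample; the real input would be the whole dictionary.
sample = ["HUT  HH AH1 T", "ABOUT  AH0 B AW1 T", "BUT  B AH1 T"]
for phoneme, count, prob in phoneme_distribution(sample):
    print(phoneme, count, round(prob, 4))
```

<p>Counting over running text instead of dictionary entries would weight common words more heavily and would likely give a different ranking.</p>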
<p>The next step is to take the frequency distribution into account&#8230; Before diving any further into palette selection, I created a simple <a href="http://lukeloeffler.com/labs/WordDna.html">test application</a> that you can play with (be warned: it may be broken or out of date).</p>
<p>In this screenshot, syllable stress is taken into account: I experimented with height, a pre-attentive attribute, to aid visualization.<br />
<a href="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/Safari.png"><img src="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/Safari.png" alt="test with stress" title="test with stress" width="423" height="149" class="alignnone size-full wp-image-350" /></a></p>
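<p>One way to turn stress into height is sketched below. The dictionary marks each vowel with a digit (1 for primary stress, 2 for secondary, 0 for none); the pixel heights and the names STRESS_HEIGHT, CONSONANT_HEIGHT and bar_heights are invented for illustration and are not the values used in the screenshot.</p>

```python
# Hypothetical bar heights in pixels; the values are illustrative only.
STRESS_HEIGHT = {"1": 24, "2": 18, "0": 12}  # primary, secondary, unstressed
CONSONANT_HEIGHT = 8                         # consonants carry no stress digit

def bar_heights(pronunciation):
    """Map phonemes such as ["AE1", "L", "F"] to (symbol, height) pairs."""
    bars = []
    for p in pronunciation:
        if p[-1].isdigit():
            # Vowel: strip the digit and look up the height for its stress level.
            bars.append((p[:-1], STRESS_HEIGHT[p[-1]]))
        else:
            bars.append((p, CONSONANT_HEIGHT))
    return bars

print(bar_heights("AE1 L F AH0 B EH2 T".split()))
```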
<p>In the following version, the earlier phrase is rendered again with the palette restricted to vowels and colors assigned according to probability. Phonemes that occur more often received hues that were more distinct from the others.<br />
<a href="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/Safari1.png"><img src="http://lukeloeffler.com/wordpress/wp-content/uploads/2009/08/Safari1.png" alt="palette restriction" title="palette restriction" width="413" height="40" class="alignnone size-full wp-image-353" /></a></p>
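<p>The exact assignment is not spelled out here, so the following is only one plausible sketch of the idea: give each vowel a slice of the color wheel proportional to its count and use the hue at the centre of the slice, so that frequent vowels sit further from their neighbours. The function name weighted_hues and the slice scheme are assumptions; the counts are copied from the table above.</p>

```python
# Vowel counts copied from the frequency table above.
VOWEL_COUNTS = {
    "AH": 70564, "IH": 33779, "IY": 30957, "ER": 25871, "EH": 24564,
    "AA": 22175, "AE": 19151, "OW": 17147, "EY": 11851, "AO": 10059,
    "AY": 9838, "UW": 8579, "AW": 2932, "UH": 2021, "OY": 1124,
}

def weighted_hues(counts):
    """Assign each phoneme the hue (in degrees) at the centre of a wheel
    slice whose width is proportional to the phoneme's count."""
    total = sum(counts.values())
    hues, start = {}, 0.0
    # Most frequent first, so the widest slices are laid out first.
    for phoneme, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        width = 360.0 * n / total
        hues[phoneme] = start + width / 2
        start += width
    return hues

for phoneme, hue in weighted_hues(VOWEL_COUNTS).items():
    print(phoneme, round(hue, 1))
```

<p>A side effect of this scheme is that rare vowels such as OY and UH are squeezed into nearly identical hues, which may or may not be acceptable depending on the text being shown.</p>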
<p>This change appears to have improved legibility somewhat, but more experimentation is needed. I suspect this linear format will prove a failed experiment: it does little to improve recognition of rhyme, because the viewer must hold what amounts to a new representation of sound in working memory rather than rely on pre-attentive cues to match sounds quickly.</p>
]]></content:encoded>
			<wfw:commentRss>http://lukeloeffler.com/2009/the-color-of-rhyme/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Natural Language Processing</title>
		<link>http://lukeloeffler.com/2009/natural-language-processing/</link>
		<comments>http://lukeloeffler.com/2009/natural-language-processing/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 05:07:08 +0000</pubDate>
		<dc:creator>luke</dc:creator>
				<category><![CDATA[blog]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://lukeloeffler.com/?p=314</guid>
		<description><![CDATA[I have become increasingly interested in tools to perform natural language processing. So much of how we approach and see problems is tied up in the words we find to describe them. I am currently exploring ways to use language to help define and understand problems as well as get out of creative blocks.
Below are [...]]]></description>
			<content:encoded><![CDATA[<p>I have become increasingly interested in tools to perform natural language processing. So much of how we approach and see problems is tied up in the words we find to describe them. I am currently exploring ways to use language to help define and understand problems as well as get out of creative blocks.</p>
<p>Below are a few useful tools:</p>
<p>Princeton&#8217;s <a href="http://wordnet.princeton.edu/">WordNet</a> is a large lexical database that contains not just definitions but also how words relate to each other.</p>
<p>The Python <a href="http://www.nltk.org/">Natural Language Toolkit</a> provides tools for parsing natural language and analyzing its semantics.</p>
<p>The Carnegie Mellon <a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict">Pronouncing Dictionary</a> breaks words down into their phonemes, providing information on pronunciation, rhyme, and syllable counts.</p>
<p>Having found no simple resource to provide syllable counts for common words, I want to share a quick solution I wrote in Python, which uses the CMU dictionary. Simply download the dictionary to the same directory as this script, naming it cmu_pron.txt.</p>
<pre>
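# Load the CMU Pronouncing Dictionary (saved locally as cmu_pron.txt).
# Each entry looks like: ALPHABET  AE1 L F AH0 B EH2 T
# Assumption: latin-1 decoding, which accepts any byte the file may contain.
words = {}
with open("cmu_pron.txt", encoding="latin-1") as f:
	for line in f:
		pieces = line.split()
		# Skip blank lines and ";;;" comment lines.
		if not pieces or pieces[0].startswith(";;;"):
			continue
		words[pieces[0]] = pieces[1:]

def num_syllables(word):
	"""Count syllables: each vowel phoneme ends in a stress digit (0, 1 or 2)."""
	plist = words[word.upper()]
	return sum(1 for p in plist if p[-1].isdigit())

print("alphabet: %s" % num_syllables("alphabet"))</pre>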
<p>The result for alphabet is indeed 3. There are probably more efficient ways to do this, but it gets the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://lukeloeffler.com/2009/natural-language-processing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
