Conlang: Re: Word class marking in the wild... (Alex Fink, Nov 18 '08, 18:04)

Re: Word class marking in the wild...

From:	Alex Fink <000024@...>
Date:	Tuesday, November 18, 2008, 18:04

On Tue, 18 Nov 2008 15:10:05 +0100, Lars Mathiesen <thorinn@...> wrote:

>2008/11/18 Benct Philip Jonsson <bpj@...>:
>> Lars Mathiesen wrote:
>>> This article describes how even English seems to allow heuristic
>>> discrimination of nouns and verbs (with a low success rate, it seems,
>>> but better than random):
>>>
>>> http://www.livescience.com/strangenews/060809_word_sounds.html

I'm quite unsurprised by this. This is exactly what one should expect, given the existence of, well, morphology. Over the history of English there've been a variety of noun-forming and verb-forming morphological operations, some native, some borrowed, and on top of these there are forms we expect (native and borrowed) nouns to take and other forms we expect verbs to take because of the phonological shapes compatible with their (mostly dropped in ModE) former declensions and desinences and such. So if you take a noun (say) which has one of these marks of nominality, there'll be a bunch of other nouns which exhibit the same mark, and thus have a certain amount of formal similarity to it; and thus (bearing in mind their classifier algorithm, below) it'll probably look a little bit more like a noun than a verb. The cumulative effect of these seems plenty large enough to get an effect of the size portrayed in http://www.pnas.org/content/103/32/12203/F1.expansion.html

>> So what **is** the difference? one wonders! Intonation?
>
>Something about clustering analysis in a multidimensional parameter space
>based on occurrence of phonological features. There's not enough detail in
>the linked article to tell what's really going on.

Not even, it turns out, on investigating your reference

>Farmer, T.A., Christiansen, M.H. & Monaghan, P. (2006). Phonological
>typicality influences on-line sentence comprehension. Proceedings of the
>National Academy of Sciences, 103, 12203-12208.

No clustering. They just defined a distance measure between phonological strings, and then to classify a word W they just looked at the mean distance of W from all the nouns, and the mean distance of W from all the verbs; if the first mean is smaller then your word sounds nouny, if the second, it sounds verby. If you don't pay attention to the definition of the distance measure, this approach is sufficiently agnostic to word structure that there's no chance of extracting which features correlate with nouniness or verbiness.

[technical blather about distance measures follows]

I don't understand one aspect of their distance measure: they state the definition only for monosyllables, and then go ahead and use it on polysyllables, like "marble" and "insect" -- okay, maaaybe "marble" is some flavour of 'extended monosyllable' with coda /-l/ that gets realised syllabically, but I can't see a dodge of that form for "insect".

Anyway. Given two monosyllables, you find the best alignment of the two phoneme strings and then do some sort of Euclidean distance thing on each position. So to take their example, "kelp" and "street" line up as

  .  k  .   E  .  l  p  .
  s  t  r\  i  i  .  t  .

nucleus against nucleus, stop against stop, etc.; then the distance is sqrt(dist(0, s)^2 + dist(k, t)^2 + ...).

They do the alignments in an ad-hoc way for every pair of words: in particular, a given word can have different alignments when compared against different things. I find this kinda unsatisfying: their alignments involve no knowledge of English phonotactics, and the example they gave of variable alignments is a bad one when you take phonotactics into account. They say that when comparing "kelp" to "street" the onsets are /.k./ vs. /str\/, but when comparing "kelp" to "goat" they're /k../ vs. /g../, and oh look, "kelp" is aligned differently!

But, no: the only 3-term English onsets have the form sC[wrl] (and all 2-term onsets are subsequences of this), so why not set up the second comparison as /.k./ vs. /.g./? Only 'cause the code they wrote defaults to using the leftmost slots it can, that's why. Granted, there are problems with doing a more consistent alignment -- if your onset is /s/, do you align it with the second spot or the first? -- but still.

Alex
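
For concreteness, here is a rough Python sketch of the scheme as described above: a slot-by-slot Euclidean distance between aligned words, and a classifier that compares the mean distance to the nouns with the mean distance to the verbs. This is not the paper's code. The 8-slot template (3 onset, 2 nucleus, 3 coda) is inferred from the "kelp"/"street" example, the function names are made up for illustration, and `phoneme_dist` is a crude 0/1 placeholder rather than the paper's phoneme distance, whose details aren't given here.

```python
import math

# Assumed slot template (inferred from the "kelp"/"street" example):
# 3 onset slots, 2 nucleus slots, 3 coda slots; "." marks an empty slot.
KELP = (".", "k", ".", "E", ".", "l", "p", ".")
STREET = ("s", "t", "r\\", "i", "i", ".", "t", ".")


def phoneme_dist(a, b):
    """Placeholder per-slot distance: 0 if the slots match, 1 otherwise.

    This is a stand-in, NOT the measure used in the paper; swap in a
    feature-based distance if you have one.
    """
    return 0.0 if a == b else 1.0


def word_dist(w1, w2, dist=phoneme_dist):
    """sqrt of the summed squared per-slot distances of two aligned words."""
    if len(w1) != len(w2):
        raise ValueError("words must already be aligned to the same slots")
    return math.sqrt(sum(dist(a, b) ** 2 for a, b in zip(w1, w2)))


def mean_dist(word, others, dist=phoneme_dist):
    """Mean distance from `word` to every word in the non-empty list `others`."""
    return sum(word_dist(word, o, dist) for o in others) / len(others)


def classify(word, nouns, verbs, dist=phoneme_dist):
    """Call a word 'nouny' if it is on average closer to the nouns."""
    d_noun = mean_dist(word, nouns, dist)
    d_verb = mean_dist(word, verbs, dist)
    if d_noun < d_verb:
        return "nouny"
    if d_verb < d_noun:
        return "verby"
    return "tie"
```

With the placeholder distance, `word_dist(KELP, STREET)` is `math.sqrt(7)`, since seven of the eight slots differ. Note that the sketch takes alignments as given (each word is a fixed tuple), so it sidesteps the variable-alignment issue complained about above rather than reproducing it.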