Re: Word class marking in the wild...
From: Alex Fink <000024@...>
Date: Tuesday, November 18, 2008, 18:04
On Tue, 18 Nov 2008 15:10:05 +0100, Lars Mathiesen <thorinn@...> wrote:
>2008/11/18 Benct Philip Jonsson <bpj@...>:
>> Lars Mathiesen skrev:
>>> This article describes how even English seems to allow heuristic
>>> discrimination of nouns and verbs (with a low success rate, it seems,
>>> but better than random):
>>>
>>>
>>> http://www.livescience.com/strangenews/060809_word_sounds.html
I'm quite unsurprised by this. This is exactly what one should expect,
given the existence of, well, morphology.
Over the history of English there've been a variety of noun-forming and
verb-forming morphological operations, some native, some borrowed. On top of
these, there are shapes we expect (native and borrowed) nouns to take and
other shapes we expect verbs to take, because of the phonological forms
compatible with their former declensions and desinences and such (mostly
dropped in ModE). So if you take a noun (say) which has one of these marks of
nominality, there'll be a bunch of other nouns which exhibit the same mark,
and thus have a certain amount of formal similarity to it; so (bearing in
mind their classifier algorithm, below) it'll probably look a little bit more
like a noun than a verb.
The cumulative effect of these seems plenty large enough to produce a
difference of the size portrayed in
http://www.pnas.org/content/103/32/12203/F1.expansion.html
>> So what **is** the difference? one wonders! Intonation?
>
>Something about clustering analysis in a multidimensional parameter space
>based on occurrence of phonological features. There's not enough detail in
>the linked article to tell what's really going on.
Not even, it turns out, on investigating your reference
>Farmer, T.A., Christiansen, M.H. & Monaghan, P. (2006). Phonological
>typicality influences on-line sentence comprehension. Proceedings of the
>National Academy of Sciences, 103, 12203-12208.
No clustering. They just defined a distance measure between phonological
strings, and then to classify a word W they just looked at the mean distance
of W from all the nouns, and the mean distance of W from all the verbs; if
the first mean is smaller then your word sounds nouny, if the second, it
sounds verby.
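
A rough sketch of that classifier, assuming the class with the smaller mean
distance wins. The names classify and toy_dist and the word lists are made up
for illustration; toy_dist is only a stand-in, not the paper's measure.

```python
from statistics import mean

def classify(word, nouns, verbs, dist):
    """Label `word` by the class it is closer to on average."""
    mean_to_nouns = mean(dist(word, n) for n in nouns)
    mean_to_verbs = mean(dist(word, v) for v in verbs)
    # Assumption: smaller mean distance wins; ties fall to "verby" arbitrarily.
    return "nouny" if mean_to_nouns < mean_to_verbs else "verby"

def toy_dist(a, b):
    """Hypothetical stand-in: mismatched positions plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

# Invented toy lexicon, one character per phoneme.
nouns = ["kat", "mat", "hat"]
verbs = ["ron", "gon"]

print(classify("bat", nouns, verbs, toy_dist))  # nouny
print(classify("don", nouns, verbs, toy_dist))  # verby
```

Note that nothing in classify looks inside the words; everything about word
structure lives in whatever dist you hand it.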
If you don't pay attention to the definition of the distance measure, this
approach is sufficiently agnostic to word structure that there's no chance
of extracting which features correlate with nouniness or verbiness.
[technical blather about distance measures follows]
I don't understand one aspect of their distance measure: they state the
definition only for monosyllables, and then go ahead and use it on
polysyllables, like "marble" and "insect" -- okay, maaaybe "marble" is some
flavour of 'extended monosyllable' with coda /-l/ that gets realised
syllabically, but I can't see a dodge of that form for "insect".
Anyway. Given two monosyllables, you find the best alignment of the two
phoneme strings and then do some sort of Euclidean distance thing on each
position. So to take their example "kelp" and "street" line up as
    .   k   .   E   .   l   p   .
    s   t   r\  i   i   .   t   .
nucleus against nucleus, stop against stop, etc; then the distance is
sqrt(dist(0, s)^2 + dist(k, t)^2 + ...)
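
To make the formula concrete, here is a minimal sketch of the slot-by-slot
computation on the "kelp"/"street" alignment. The feature table is a toy of my
own (consonantal, voiced, continuant, sonorant), not the feature set the paper
uses, and I'm assuming an empty slot counts as the all-zero vector.

```python
import math

# Hypothetical toy features: (consonantal, voiced, continuant, sonorant).
FEATURES = {
    ".":   (0, 0, 0, 0),  # empty slot, assumed to be the zero vector
    "k":   (1, 0, 0, 0),
    "t":   (1, 0, 0, 0),
    "p":   (1, 0, 0, 0),
    "s":   (1, 0, 1, 0),
    "l":   (1, 1, 1, 1),
    "r\\": (1, 1, 1, 1),
    "E":   (0, 1, 1, 1),
    "i":   (0, 1, 1, 1),
}

def phoneme_dist(a, b):
    """Euclidean distance between two phonemes' feature vectors."""
    return math.dist(FEATURES[a], FEATURES[b])

def aligned_distance(slots_a, slots_b):
    """sqrt of the summed squared per-slot distances of two aligned words."""
    if len(slots_a) != len(slots_b):
        raise ValueError("alignments must have the same number of slots")
    return math.sqrt(sum(phoneme_dist(a, b) ** 2
                         for a, b in zip(slots_a, slots_b)))

kelp = [".", "k", ".", "E", ".", "l", "p", "."]
street = ["s", "t", "r\\", "i", "i", ".", "t", "."]
print(round(aligned_distance(kelp, street), 3))  # about 3.606 with these toy features
```

With features this coarse /k/ and /t/ come out identical, so all the distance
here comes from slots filled in one word and empty in the other.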
They do the alignments in an ad-hoc way for every pair of words: in
particular a given word can have different alignments when compared against
different things. I find this kinda unsatisfying: in particular, their
alignments involve no knowledge of English phonotactics, and the example
they gave of variable alignments is a bad one when you take phonotactics
into account. They say that when comparing "kelp" to "street" the onsets
are /.k./ vs. /str\/, but when comparing "kelp" to "goat" they're /k../ vs.
/g../, and oh look, "kelp" is aligned differently! But, no: the only 3-term
English onsets have the form sC[wrl] (and all 2-term onsets are subsequences
of this), and so why not set up the second comparison as /.k./ vs. /.g./?
Only 'cause the code they wrote defaults to using the leftmost slots it can,
that's why. Granted, there are problems with doing a more consistent
alignment -- if your onset is /s/, do you align it with the second spot or
the first? -- but still.
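
For what it's worth, a sketch of the two ways of slotting an onset: the
leftmost-fill default versus placement in an s / C / [wrl] template. Both
functions and the lone_s_slot parameter are my own invention for illustration,
not anything from the paper's code, and where a lone /s/ goes is left as an
arbitrary choice.

```python
APPROXIMANTS = {"w", "r\\", "l"}

def align_leftmost(onset, width=3):
    """Fill onset slots from the left, padding with '.'."""
    return list(onset) + ["."] * (width - len(onset))

def align_template(onset, lone_s_slot=1):
    """Place onset phonemes in the s / C / [wrl] slots.

    `onset` is a list of phoneme strings. A lone /s/ is ambiguous, so its
    slot (0 or 1) is a caller's choice; the default here is arbitrary.
    """
    slots = [".", ".", "."]
    rest = list(onset)
    if rest == ["s"]:
        slots[lone_s_slot] = "s"
        return slots
    if len(rest) > 1 and rest[-1] in APPROXIMANTS:
        slots[2] = rest.pop()       # final w/r/l goes in the third slot
    if rest and rest[0] == "s":
        slots[0] = rest.pop(0)      # initial /s/ of a cluster goes first
    if rest:
        slots[1] = rest.pop(0)      # whatever single consonant is left
    if rest:
        raise ValueError("onset does not fit the sC[wrl] template")
    return slots

print(align_leftmost(["k"]))                # ['k', '.', '.']
print(align_template(["k"]))                # ['.', 'k', '.']
print(align_template(["s", "t", "r\\"]))    # ['s', 't', 'r\\']
```

With the template version "kelp" and "goat" both come out as /.k./ and /.g./,
the same shape "kelp" gets against "street".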
Alex