From: Jim Henry <jimhenry1973@...>
Date: Sunday, April 22, 2007, 22:47
On 4/22/07, Henrik Theiling <theiling@...> wrote:
> So your main criterion would be predictability of semantics? If
> predictable => no new word, if not predictable => new word. This
> seems, well, very reasonable for composing a lexicon. Of course there
> will be difficult cases, but let's ignore them for now.
>
> This means that for counting a conlang's words, we probably should:
>
> - also count phrases ('bubble sort algorithm') and idioms
>
> - not count lexicon entries that are due to irregular forms
>   ('saw' cf. 'see')

I can see an argument for counting irregular and especially suppletive
forms as separate words -- from the POV of the learner they increase
the amount of vocabulary one has to learn. But in gauging a conlang's
completeness and expressivity by its lexicon size, one would of course
not count irregular and suppletive forms separately (unless perhaps
some of them have a special sense not shared by other forms of the
"same" word?).

> - count polysynthetically constructed words several times,
>   excluding structures that are semantically clear operations,
>   but counting all irregularly derived concepts

I don't think we need to treat polysynthetic words specially, as such.
Would it make sense just to count the number of morpheme boundaries in
a word and see how many of those result in a semantically opaque
compounding? Even with that qualification I'm not sure I agree with
you -- it seems to me that a semantically opaque compound built of 5
morphemes and another one built of 2 morphemes should both count as
one word in the lexicon.

> This seems quite reasonable. Do you also think it's a good way of
> counting? It also looks undoable since the lexicons are generally not
> structured like this.

Of course in starting a new lexicon for a new language one could
easily have a field for "semantic transparency", or perhaps an
integral field indicating how many words (or "lexical items") each
entry counts for (1 for root words and opaque compounds, 0 for
irregular forms and transparent compounds; 1 for idioms and stock
phrases?).

On the other hand, transparency/opacity is a continuous rather than a
boolean quality. Some "transparent" compounds are more transparent
than others, some "opaque" compounds are more opaque than others; and
the same is true of idiomatic phrases. So maybe the semantic
transparency field gets real numbers ranging from 0.0 to 1.0, and the
overall word count for the language would probably be non-integral.

On the gripping hand, maybe the "semantic transparency" needs to be
applied at the morpheme boundary level rather than the word level.
For instance, in E-o "el-don-ej-o" there are three morpheme
boundaries, one perfectly transparent (ej-o), one somewhat transparent
(between el-don and -ej), and one almost completely opaque (el-don).
We might assign them transparency (or rather opacity) scores of

    el-      don      -ej      -o
        0.95     0.20     0.0

or thereabouts. How would we combine these to get an overall opacity
score for the word? Not by simply averaging them; "eldonejo" is
slightly more opaque than "eldoni". Nor by adding, because we don't
want a score over 1.0. (One candidate rule is sketched in the P.S.
below.)

Another complicating factor is that we don't want the presence of both
"eldoni" and "eldonejo" in the lexicon to inflate the count too much,
since the latter builds on the former and is almost transparent if you
already know "eldoni".

--
Jim Henry
http://www.pobox.com/~jimhenry
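P.S. One combining rule that seems to fit both constraints -- never
going over 1.0, and making "eldonejo" only slightly more opaque than
"eldoni" -- would be the complement-product: take 1 minus the product
of (1 - opacity) over a word's boundaries. The Python sketch below is
only a hypothetical illustration of that idea (the scores, the tiny
sample lexicon, and the function names are all made up), not a settled
proposal:

from math import prod

def word_opacity(boundary_opacities):
    """Combine per-boundary opacity scores (each 0.0-1.0) into one
    score per word.  1 - prod(1 - o) never exceeds 1.0, ignores
    perfectly transparent boundaries (0.0), and a mildly opaque
    boundary only nudges the total up a little."""
    return 1.0 - prod(1.0 - o for o in boundary_opacities)

# Made-up mini-lexicon: each entry lists its morpheme-boundary opacities.
lexicon = {
    "hundo":    [],                  # root word, no boundaries
    "eldoni":   [0.95],              # el|don
    "eldonejo": [0.95, 0.20, 0.0],   # el|don, don|ej, ej|o
}

def entry_weight(boundaries):
    # A root word counts as one full lexical item; a compound counts
    # by its combined opacity.
    return 1.0 if not boundaries else word_opacity(boundaries)

print(f"eldoni:   {word_opacity([0.95]):.2f}")             # 0.95
print(f"eldonejo: {word_opacity([0.95, 0.20, 0.0]):.2f}")  # 0.96, slightly higher
print(f"lexicon:  {sum(entry_weight(b) for b in lexicon.values()):.2f}")  # 2.91
# To keep "eldonejo" from inflating the count once "eldoni" is already
# listed, one could charge it only for its new boundaries:
print(f"eldonejo given eldoni: {word_opacity([0.20, 0.0]):.2f}")  # 0.20

The last line is one possible answer to the inflation worry: a derived
entry only pays for the boundaries its base entry doesn't already
cover, so "eldonejo" adds just 0.20 to the count if "eldoni" is
already there.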