Re: THEORY: Parsing for meaning.
From: Gary Shannon <fiziwig@...>
Date: Monday, June 26, 2006, 20:07
--- Roger Mills <rfmilly@...> wrote:
<snip>
>
> I'm getting the impression that this would, in
> effect, be a model of the
> human brain's knowledge of the semantics + syntax of
> a given language.
I'm thinking that the database would also model a lot
of "common sense" real-world information that we take
for granted in mentally parsing a sentence.
Again, the database would contain only string patterns
to be matched in the sentence and result strings which
would be emitted as output or placed into the input
stream to replace the matched string. There would be
no "intelligence" or "logic", just
character-by-character pattern matching and
replacement.
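As a minimal sketch of that mechanism (names and the tiny example
database here are my own invention, not part of any real system):
plain substring replacement, repeated until no pattern in the
database fires.

```python
# No grammar, no logic: just repeated substring replacement until
# no pattern in the database matches.  A real version would need a
# guard against rules whose output re-triggers another rule forever.
def rewrite_to_fixpoint(sentence, database):
    changed = True
    while changed:
        changed = False
        for pattern, result in database:
            if pattern in sentence:
                sentence = sentence.replace(pattern, result)
                changed = True
    return sentence

print(rewrite_to_fixpoint("lots of cars", [("lots of", "quan(many)")]))
# -> quan(many) cars
```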
>
> (subject: pets) I'd like a little monkey.
> (subject: food) I'd like a little monkey.
Taken out of context, even a human would not be able to
disambiguate these. I know that linguists can dream up
all sorts of "tricky" sentences like these, but I
can't help but wonder if they are really all that
common in everyday discourse.
> plus all the synonyms, paraphrases, idioms etc. that
> could also occur.
Humans memorize these idioms and recognize them as a
single unit of meaning. The parser would have to do
likewise. For example the adjacent pair "parking lots"
could occur in the context "I've been parking lots of
cars today." or in the context "I don't recall which
of the parking lots I left my car in." But the pattern
in the database in this example would consider
"parking lots" to be the plural of the noun
"parking-lot" only if "parking" is preceeded by some
kind of article or quantifier or other word that can't
appear immediately before a verb.
Example patterns:
parking lot[|s] %{of} -> noun(parking_lot,[s|p])
lots of -> quan(many)
the lord of the rings -> story(lord_of_rings)
Example pattern matching step:
I worked all night parking lots of cars in the
parking lots at The Lord of the Rings.
-> I worked all night parking quan(many) cars in
the noun(parking_lot,p) at story(lord_of_rings).
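Those three patterns can be rendered as ordinary regular-expression
rewrites. In this sketch I read the %{of} guard as "not immediately
followed by 'of'" (my assumption); rule order matters, since "lots
of" must fire before the parking-lot rule.

```python
import re

# Pattern database as (regex, replacement), applied in order.
# The %{of} guard is rendered as a negative lookahead (assumption).
PATTERNS = [
    (re.compile(r"\blots of\b", re.I), "quan(many)"),
    (re.compile(r"\bthe lord of the rings\b", re.I), "story(lord_of_rings)"),
    (re.compile(r"\bparking lots\b(?!\s+of\b)", re.I), "noun(parking_lot,p)"),
    (re.compile(r"\bparking lot\b(?!\s+of\b)", re.I), "noun(parking_lot,s)"),
]

def rewrite(sentence):
    for pattern, replacement in PATTERNS:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(rewrite("I worked all night parking lots of cars in the "
              "parking lots at The Lord of the Rings."))
# -> I worked all night parking quan(many) cars in the
#    noun(parking_lot,p) at story(lord_of_rings).
```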
A preliminary vocabulary reduction pass would use
another database to replace idiomatic expressions like
"chew the fat", or "shoot the breeze" with "talk".
This replacement would take place even before the
"parsing" was started. The same pass would handle book
titles (To Kill a Mockingbird), famous names (Abraham
Lincoln), locations (New York), etc. etc.
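That first pass might look like the following sketch; the table
entries and replacement tokens are invented here purely for
illustration.

```python
import re

# Hypothetical vocabulary-reduction table: multi-word idioms and
# proper names collapsed to single tokens before any parsing begins.
IDIOMS = {
    "chew the fat": "talk",
    "shoot the breeze": "talk",
    "to kill a mockingbird": "title(to_kill_a_mockingbird)",
    "abraham lincoln": "name(abraham_lincoln)",
    "new york": "place(new_york)",
}

def reduce_vocabulary(sentence):
    # Longest phrases first, so a longer entry always wins over a
    # shorter entry it contains.
    for phrase in sorted(IDIOMS, key=len, reverse=True):
        sentence = re.sub(re.escape(phrase), IDIOMS[phrase],
                          sentence, flags=re.IGNORECASE)
    return sentence

print(reduce_vocabulary("We met in New York to chew the fat."))
# -> We met in place(new_york) to talk.
```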
> > 2. I probably wouldn't use "standard" categories,
> but
> > would allow the categories to emerge based on
> > interchangability. Also certain categories would
> be
> > necessary to validate the parsing process as it
> > proceeds. For example the category "person" might
> be
> > different from the category "Animate Object"
>
> Certainly. Assuming that lexical items, in addition
> to their semantic
> features/description, also have certain grammatical
> and syntactical
> features. In Chomskian ("Aspects...") terms, among
> these for "speak" would
> be:
> 1. it is a "verb" [+verb], i.e. it can occur as the
> head of a VP
> 2. its subject must usually be [+human]
> 3. in a VP, it may occur in the env. __ NP NP (where
> NP1 is by def. IO, NP2
> by def. DO)
> ..3a. if NP1, it must be at least [+anim], usually
> [+human] (and in Engl.,
> must undergo "dative movement" >>> a "to- phrase")
> ..3b. if NP2, it must be (semantically) marked as "a
> language, or something
> related to language"
My unsubstantiated hypothesis is that there is a
tradeoff between the "cleverness" of the rules and the
"hugeness" of the database. With a large enough
database of patterns the rules can be very simplistic.
> (In fact, I wonder if there isn't always a DO
> ("something related to
> language") that is typically deleted, even in the
> usual intransitive usage:
> "He spoke {...) (to me)". Thus it might even belong
> in that class of verbs
> with "understood" DO's-- eat, sing, dance, cook etc.
> "Labile" verbs
> IIRC???)
I think that rather than try to categorize "speak" as
being a particular type of verb it would be more
productive to treat "speak" as being "speak" with its
own collection of patterns. In that way complex
generalizations that may not be entirely correct are
avoided in favor of many, many, many entries in the
database.
But remember that each database entry covers only a
very few words in the sentence, not the entire
sentence pattern.
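To illustrate what a per-lexeme entry might look like (the pattern
list and the event(...) output notation below are my own invention):
"speak" carries its own small collection of patterns, each covering
only a few adjacent words, rather than inheriting behavior from a
general verb category.

```python
import re

# Hypothetical database entries owned by the lexeme "speak" itself.
SPEAK_PATTERNS = [
    (re.compile(r"\b(\w+) spoke to (\w+)\b"),
     r"event(speak,agent=\1,hearer=\2)"),
    (re.compile(r"\b(\w+) speaks (\w+)\b"),
     r"event(speak,agent=\1,language=\2)"),
]

def apply_speak(sentence):
    for pattern, result in SPEAK_PATTERNS:
        sentence = pattern.sub(result, sentence)
    return sentence

print(apply_speak("Mary speaks French"))
# -> event(speak,agent=Mary,language=French)
print(apply_speak("Rock spoke to Pebble"))
# -> event(speak,agent=Rock,hearer=Pebble)
```

Note that the second example happily matches a non-sentient agent:
nothing in the entry enforces a [+human] restriction.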
> Personification may be a feature that covers an
> entire narrative ("Coyote
> spoke..., Mama Bear spoke...; Rock spoke to
> Pebble...; Speak, Memory..."
> etc). A (human) language universal??
Perhaps the pattern works the other way around. Rather
than asserting that "speak" must apply to a sentient
being perhaps the presence of "speak" informs us that
the speaker, even if non-sentient, is being
personified in this context.
> Figurative uses may be more lang.specific-- "His
> silence spoke volumes",
> "His silence doesn't speak well of him". "Etruscan
> stones begin to speak" (a
> book title). This may be where "Speak, Rover!"
> belongs-- it means generally
> that Rover should emit a bark on command. (If Rover
> or Kitty is actually
> thought to be communicating, I at least would use
> "talk".) At a dinner
> party: "Ooh, this Beef Wellington really speaks to
> me!!" And "Speak up!!"
> And that peculiar phrase "well-spoken" (of a
> person), which implies so much
> more than just speaking well.
These would all be found in one of the earlier
"idiomatic pattern" passes, before actual parsing
began. I don't honestly believe that a human "parses"
these idioms in order to understand them, but simply
finds them in the storehouse of examples to which he
has been exposed.
>
> Can any mechanical translation system really be
> expected to handle all
> this???
Yes, I think it can. It would have to be built up in
stages, perhaps using long lists of sentences
beginning with the simplest structures and gradually
working up to the more complex. The goal of the parser
would be to break the complex sentence into a sequence
of simple sentences which collectively express the
same intent as the original complex sentence.
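As a deliberately naive illustration of that goal (assuming the only
complexity is a coordinated predicate sharing a single one-word
subject, which a real system would of course have to generalize):

```python
import re

# Toy splitter: break a coordinated predicate into a sequence of
# simple sentences that share the original one-word subject.
def simplify(sentence):
    subject, rest = sentence.rstrip(".").split(" ", 1)
    clauses = re.split(r"\s+(?:and|but)\s+", rest)
    return [f"{subject} {clause}." for clause in clauses]

print(simplify("John ran home and ate dinner."))
# -> ['John ran home.', 'John ate dinner.']
```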
A pilot project might use the first few hundred of
those 1200 graded sentences I discovered, perhaps
recasting the sentences to use a smaller vocabulary
(e.g. allowing the only animals mentioned to be cat,
dog, bird, fish, and the only names to be John and
Marsha) to begin with so that the dictionary would be
small and the database could be built by hand.
That should be enough for proof of concept. If it
can't be done within that small universe of discourse
then it can't be done. But if it can be done then
perhaps expanding the vocabulary and resolving the
tricky situations might well be possible.
--gary