Re: THEORY: Parsing for meaning.
From: Gary Shannon <fiziwig@...>
Date: Monday, June 26, 2006, 20:07
--- Roger Mills <rfmilly@...> wrote:
<snip>
>
> I'm getting the impression that this would, in
> effect, be a model of the
> human brain's knowledge of the semantics + syntax of
> a given language.
I'm thinking that the database would also model a lot
of "common sense" real-world information that we take
for granted in mentally parsing a sentence.
Again, the database would contain only string patterns
to be matched in the sentence and result strings which
would be emitted as output or placed into the input
stream to replace the matched string. There would be
no "intelligence" or "logic", just
character-by-character pattern matching and
replacement.
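As a minimal sketch of that mechanism (names and the tiny example
database here are my own invention, not part of any real system):
plain substring replacement, repeated until no pattern in the
database fires.

```python
# No grammar, no logic: just repeated substring replacement until
# no pattern in the database matches.  A real version would need a
# guard against rules whose output re-triggers another rule forever.
def rewrite_to_fixpoint(sentence, database):
    changed = True
    while changed:
        changed = False
        for pattern, result in database:
            if pattern in sentence:
                sentence = sentence.replace(pattern, result)
                changed = True
    return sentence

print(rewrite_to_fixpoint("lots of cars", [("lots of", "quan(many)")]))
# -> quan(many) cars
```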
>
> (subject: pets) I'd like a little monkey.
> (subject: food) I'd like a little monkey.
Taken out of context, even a human would not be able to
disambiguate these. I know that linguists can dream up
all sorts of "tricky" sentences like these, but I
can't help but wonder if they are really all that
common in everyday discourse.
> plus all the synonyms, paraphrases, idioms etc. that
> could also occur.
Humans memorize these idioms and recognize them as a
single unit of meaning. The parser would have to do
likewise. For example the adjacent pair "parking lots"
could occur in the context "I've been parking lots of
cars today." or in the context "I don't recall which
of the parking lots I left my car in." But the pattern
in the database in this example would consider
"parking lots" to be the plural of the noun
"parking-lot" only if "parking" is preceeded by some
kind of article or quantifier or other word that can't
appear immediately before a verb.
Example patterns:
parking lot[|s] %{of} -> noun(parking_lot,[s|p])
lots of -> quan(many)
the lord of the rings -> story(lord_of_rings)
Example pattern matching step:
I worked all night parking lots of cars in the
parking lots at The Lord of the Rings.
-> I worked all night parking quan(many) cars in
the noun(parking_lot,p) at story(lord_of_rings).
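Those three patterns can be rendered as ordinary regular-expression
rewrites. In this sketch I read the %{of} guard as "not immediately
followed by 'of'" (my assumption); rule order matters, since "lots
of" must fire before the parking-lot rule.

```python
import re

# Pattern database as (regex, replacement), applied in order.
# The %{of} guard is rendered as a negative lookahead (assumption).
PATTERNS = [
    (re.compile(r"\blots of\b", re.I), "quan(many)"),
    (re.compile(r"\bthe lord of the rings\b", re.I), "story(lord_of_rings)"),
    (re.compile(r"\bparking lots\b(?!\s+of\b)", re.I), "noun(parking_lot,p)"),
    (re.compile(r"\bparking lot\b(?!\s+of\b)", re.I), "noun(parking_lot,s)"),
]

def rewrite(sentence):
    for pattern, replacement in PATTERNS:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(rewrite("I worked all night parking lots of cars in the "
              "parking lots at The Lord of the Rings."))
# -> I worked all night parking quan(many) cars in the
#    noun(parking_lot,p) at story(lord_of_rings).
```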
A preliminary vocabulary reduction pass would use
another database to replace idiomatic expressions like
"chew the fat", or "shoot the breeze" with "talk".
This replacement would take place even before the
"parsing" was started. The same pass would handle book
titles (To Kill a Mockingbird), famous names (Abraham
Lincoln), locations (New York), etc. etc.
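That first pass might look like the following sketch; the table
entries and replacement tokens are invented here purely for
illustration.

```python
import re

# Hypothetical vocabulary-reduction table: multi-word idioms and
# proper names collapsed to single tokens before any parsing begins.
IDIOMS = {
    "chew the fat": "talk",
    "shoot the breeze": "talk",
    "to kill a mockingbird": "title(to_kill_a_mockingbird)",
    "abraham lincoln": "name(abraham_lincoln)",
    "new york": "place(new_york)",
}

def reduce_vocabulary(sentence):
    # Longest phrases first, so a longer entry always wins over a
    # shorter entry it contains.
    for phrase in sorted(IDIOMS, key=len, reverse=True):
        sentence = re.sub(re.escape(phrase), IDIOMS[phrase],
                          sentence, flags=re.IGNORECASE)
    return sentence

print(reduce_vocabulary("We met in New York to chew the fat."))
# -> We met in place(new_york) to talk.
```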
> > 2. I probably wouldn't use "standard" categories,
> but
> > would allow the categories to emerge based on
> > interchangability. Also certain categories would
> be
> > necessary to validate the parsing process as it
> > proceeds. For example the category "person" might
> be
> > different from the category "Animate Object"
>
> Certainly. Assuming that lexical items, in addition
> to their semantic
> features/description, also have certain grammatical
> and syntactical
> features. In Chomskian ("Aspects...") terms, among
> these for "speak" would
> be:
> 1. it is a "verb" [+verb], i.e. it can occur as the
> head of a VP
> 2. its subject must usually be [+human]
> 3. in a VP, it may occur in the env. __ NP NP (where
> NP1 is by def. IO, NP2
> by def. DO)
> ..3a. if NP1, it must be at least [+anim], usually
> [+human] (and in Engl.,
> must undergo "dative movement" >>> a "to- phrase")
> ..3b. if NP2, it must be (semantically) marked as "a
> language, or something
> related to language"
My unsubstantiated hypothesis is that there is a
tradeoff between the "cleverness" of the rules and the
"hugeness" of the database. With a large enough
database of patterns the rules can be very simplistic.
> (In fact, I wonder if there isn't always a DO
> ("something related to
> language") that is typically deleted, even in the
> usual intransitive usage:
> "He spoke {...) (to me)". Thus it might even belong
> in that class of verbs
> with "understood" DO's-- eat, sing, dance, cook etc.
> "Labile" verbs
> IIRC???)
I think that rather than try to categorize "speak" as
being a particular type of verb it would be more
productive to treat "speak" as being "speak" with its
own collection of patterns. In that way complex
generalizations that may not be entirely correct are
avoided in favor of many, many, many entries in the
database.
But remember that each database entry covers only a
very few words in the sentence, not the entire
sentence pattern.
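To illustrate what a per-lexeme entry might look like (the pattern
list and the event(...) output notation below are my own invention):
"speak" carries its own small collection of patterns, each covering
only a few adjacent words, rather than inheriting behavior from a
general verb category.

```python
import re

# Hypothetical database entries owned by the lexeme "speak" itself.
SPEAK_PATTERNS = [
    (re.compile(r"\b(\w+) spoke to (\w+)\b"),
     r"event(speak,agent=\1,hearer=\2)"),
    (re.compile(r"\b(\w+) speaks (\w+)\b"),
     r"event(speak,agent=\1,language=\2)"),
]

def apply_speak(sentence):
    for pattern, result in SPEAK_PATTERNS:
        sentence = pattern.sub(result, sentence)
    return sentence

print(apply_speak("Mary speaks French"))
# -> event(speak,agent=Mary,language=French)
print(apply_speak("Rock spoke to Pebble"))
# -> event(speak,agent=Rock,hearer=Pebble)
```

Note that the second example happily matches a non-sentient agent:
nothing in the entry enforces a [+human] restriction.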
> Personification may be a feature that covers an
> entire narrative ("Coyote
> spoke..., Mama Bear spoke...; Rock spoke to
> Pebble...; Speak, Memory..."
> etc). A (human) language universal??
Perhaps the pattern works the other way around. Rather
than asserting that "speak" must apply to a sentient
being perhaps the presence of "speak" informs us that
the speaker, even if non-sentient, is being
personified in this context.
> Figurative uses may be more lang.specific-- "His
> silence spoke volumes",
> "His silence doesn't speak well of him". "Etruscan
> stones begin to speak" (a
> book title). This may be where "Speak, Rover!"
> belongs-- it means generally
> that Rover should emit a bark on command. (If Rover
> or Kitty is actually
> thought to be communicating, I at least would use
> "talk".) At a dinner
> party: "Ooh, this Beef Wellington really speaks to
> me!!" And "Speak up!!"
> And that peculiar phrase "well-spoken" (of a
> person), which implies so much
> more than just speaking well.
These would all be found in one of the earlier
"idiomatic pattern" passes, before actual parsing
began. I don't honestly believe that a human "parses"
these idioms in order to understand them, but simply
finds them in the storehouse of examples to which he
has been exposed.
>
> Can any mechanical translation system really be
> expected to handle all
> this???
Yes, I think it can. It would have to be built up in
stages, perhaps using long lists of sentences
beginning with the simplest structures and gradually
working up to the more complex. The goal of the parser
would be to break the complex sentence into a sequence
of simple sentences which collectively express the
same intent as the original complex sentence.
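As a deliberately naive illustration of that goal (assuming the only
complexity is a coordinated predicate sharing a single one-word
subject, which a real system would of course have to generalize):

```python
import re

# Toy splitter: break a coordinated predicate into a sequence of
# simple sentences that share the original one-word subject.
def simplify(sentence):
    subject, rest = sentence.rstrip(".").split(" ", 1)
    clauses = re.split(r"\s+(?:and|but)\s+", rest)
    return [f"{subject} {clause}." for clause in clauses]

print(simplify("John ran home and ate dinner."))
# -> ['John ran home.', 'John ate dinner.']
```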
A pilot project might use the first few hundred of
those 1200 graded sentences I discovered, perhaps
recasting the sentences to use a smaller vocabulary
(e.g. allowing the only animals mentioned to be cat,
dog, bird, fish, and the only names to be John and
Marsha) to begin with so that the dictionary would be
small and the database could be built by hand.
That should be enough for proof of concept. If it
can't be done within that small universe of discourse
then it can't be done. But if it can be done then
perhaps expanding the vocabulary and resolving the
tricky situations might well be possible.
--gary