Re: TECH: Re: Summary, web based mailinglist archives
From: Paul Bennett <paul.bennett@...>
Date: Monday, October 25, 1999, 14:02
tal writes:
>>>>>>
* Paul.Bennett@xncorp.com (Paul.Bennett@xncorp.com) [991025 14:29]:
> Yes. I was never suggesting REing a "monolithic bag o' bits" (TM). What
> I feel it needs is a fairly compute- and space-intensive phase when a
> new message (set) is added. Indexes of indexes and all that funky stuff
> seems to be the order of the day, as well as a cute little trick that I
> call (after the guy who explained it to me) "Julian" Indexes (more on
> this is available for the terminally curious, it's not super techie, but
> if you've never come across it, it blows your mind at first).
Is your 'Julian Indexes' indexing on words, the words pointing to the
documents that contain the words? This is called an 'inverted
filesystem' in IR, and is precisely -not- what would suit conlang-l,
I've already built a simple ir-system ('twas for class), and used one
month of conlang-mail (may 98?) as the main dataset. No go.
I could write on vector-search and clustering techniques and stemming
algorithms too if you like :)
<warning>I *am* a perfectionist</warning>
<<<<<<
No, it isn't, though that's the "indexes of indexes" thing pretty well nailed
down from most directions. Why do you say it's not helpful for archiving
conlang-l? What happened with your sample dataset?
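
For reference, the word-to-document index tal describes (usually called an inverted index or inverted file in IR) can be sketched in a few lines of Python. This is only an illustration: the sample messages and the names build_inverted_index and search are made up here, and it is not tal's actual system.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every query word."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


# Made-up sample messages, standing in for a month of list mail.
docs = {
    1: "verb agreement in my conlang",
    2: "a new conlang sketch",
    3: "vowel harmony and verb roots",
}
index = build_inverted_index(docs)
matches = search(index, "verb", "conlang")
```

Splitting on whitespace with no stemming is the crudest possible tokenizer, which may be part of why such an index gives poor results on real list traffic.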
The J-index really needs a piece of paper and a pen to describe adequately. I
suppose with hindsight, it's an "n-tree" (after btree, quadtree, octree, et al
(about which I know next to nothing)), and is used (among other places) in the
UK Post Office directory CD. The index grows to a phenomenal size, being
potentially kn^x in size, where n is the number of symbols in the symbol set, x
is the maximum allowed number of symbols in an index entry and k is some
constant value, usually (n * sizeof(void far *) * 3) or something along
those lines (I don't recall clearly). A prudent programmer would obviously
limit this growth somewhat, and it's fairly straightforward to do so. For a
start, you can break the field into chunks, thus reducing the all-critical
exponential effect of x quite considerably. The one advantage is that it is
*FAST*: for a key of m symbols (m <= x), you follow m pointers and land straight on
the record(s) you require. This can be done at a much faster speed than a user
can type, giving a practically instant search time in purely client-side apps.
One thing of note is that having J-indexed a field, you no longer need to store
that field in the database, which very marginally (in comparison) reduces the
storage overhead.
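
A rough sketch of the idea in Python, under a loud assumption: it treats the J-index as a plain symbol tree (a trie) with one pointer per symbol at each node, which may or may not match the real thing. The names Node, SymbolTreeIndex and worst_case_nodes are invented for this sketch, and a dict stands in for the fixed n-slot pointer table that the size estimate above implies.

```python
class Node:
    """One index node: a child pointer per symbol, plus the records ending here."""
    __slots__ = ("children", "records")

    def __init__(self):
        self.children = {}  # symbol -> Node (dict instead of a dense n-slot table)
        self.records = []   # ids of records whose key ends at this node


class SymbolTreeIndex:
    def __init__(self, max_len):
        self.root = Node()
        self.max_len = max_len  # x: maximum number of symbols in an index entry

    def add(self, key, record_id):
        """Walk or create one node per symbol, then attach the record id."""
        node = self.root
        for symbol in key[:self.max_len]:
            node = node.children.setdefault(symbol, Node())
        node.records.append(record_id)

    def lookup(self, key):
        """Follow one pointer per symbol; return the records found there."""
        node = self.root
        for symbol in key[:self.max_len]:
            node = node.children.get(symbol)
            if node is None:
                return []
        return node.records


def worst_case_nodes(n, x):
    """Node count of a completely full tree: 1 + n + n^2 + ... + n^x."""
    return sum(n ** depth for depth in range(x + 1))


# Hypothetical keys and record ids, purely for illustration.
idx = SymbolTreeIndex(max_len=4)
idx.add("SW1A", 10)
idx.add("SW1A", 11)
idx.add("SW19", 12)
found = idx.lookup("SW1A")
```

The path from the root spells out the key, which is why the indexed field itself need not be stored again. Keys longer than max_len are truncated here, so they would share an entry; worst_case_nodes shows how quickly a fully populated tree grows with x.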
Please share what knowledge you have, though we're heading into deep water, and
probably left topic about the time I said "Perl". <G> Private email may be
better. It's up to you.
<warning rebuttal>http://www.c2.com/cgi/wiki?QualityPlateau</warning rebuttal>
Pb