
Re: TECH: Re: Summary, web based mailinglist archives

From: Paul Bennett <paul.bennett@...>
Date: Monday, October 25, 1999, 14:02
tal writes:
>>>>>>
* Paul.Bennett@xncorp.com (Paul.Bennett@xncorp.com) [991025 14:29]:
> Yes. I was never suggesting REing a "monolithic bag o' bits" (TM). What
> I feel it needs is a fairly compute- and space-intensive phase when a
> new message (set) is added. Indexes of indexes and all that funky stuff
> seems to be the order of the day, as well as a cute little trick that I
> call (after the guy who explained it to me) "Julian" Indexes (more on
> this is available for the terminally curious, it's not super techie, but
> if you've never come across it, it blows your mind at first).
Is your 'Julian Indexes' indexing on words, the words pointing to the documents that contain the words? This is called an 'inverted filesystem' in IR, and is precisely -not- what would suit conlang-l. I've already built a simple IR system ('twas for class), and used one month of conlang-mail (May 98?) as the main dataset. No go. I could write on vector-search and clustering techniques and stemming algorithms too if you like :)

<warning>I *am* a perfectionist</warning>
<<<<<<

No, it isn't, though that's the "indexes of indexes" thing pretty well nailed down from most directions. Why do you say it's not helpful for archiving conlang-l? What happened with your sample dataset?

The J-index really needs a piece of paper and a pen to describe adequately. I suppose with hindsight, it's an "n-tree" (after btree, quadtree, octree, et al. (about which I know next to nothing)), and is used (among other places) in the UK Post Office directory CD.

The index grows to a phenomenal size, being potentially kn^x in size, where n is the number of symbols in the symbol set, x is the maximum allowed number of symbols in an index entry and k is some constant value, usually (n * sizeof(long far *void) * 3) or something along those lines (I don't recall clearly). A prudent programmer would obviously limit this growth somewhat, and it's fairly straightforward to do so. For a start, you can break the field into chunks, thus reducing the all-critical exponential effect of x quite considerably.

The one advantage is that it is *FAST*: for an n-symbol key, you follow n pointers and are pointing straight at the record(s) you require. This can be done much faster than a user can type, giving a practically instant search time in purely client-side apps. One thing of note is that having J-indexed a field, you no longer need to store that field in the database, which very marginally (in comparison) reduces the storage overhead.
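For anyone without pen and paper to hand, here is a rough sketch of the "follow one pointer per symbol" idea. It assumes the J-index behaves roughly like what is usually called a trie, which may not match the real thing in every detail. The names (Node, add, lookup, count_nodes) and the sample keys are made up for illustration. A dict per node stands in for the fixed n-slot pointer table; a dense table at every level is where the kn^x growth would come from.

```python
# Sketch of a per-symbol pointer index (a trie); all names here are hypothetical.

class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node; stands in for an n-slot pointer table
        self.records = []    # record ids whose key ends exactly at this node


def add(root, key, record_id):
    """Index record_id under key, creating one node per symbol as needed."""
    node = root
    for symbol in key:
        node = node.children.setdefault(symbol, Node())
    node.records.append(record_id)


def lookup(root, key):
    """Follow one pointer per symbol of key; return the records found there."""
    node = root
    for symbol in key:
        node = node.children.get(symbol)
        if node is None:
            return []        # no entry was ever indexed under this key
    return list(node.records)


def count_nodes(node):
    """Count the nodes actually allocated, to show how the index grows."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = Node()
add(root, "SW1A", 1)
add(root, "SW1A", 2)
add(root, "SW1B", 3)
add(root, "E1", 4)

print(lookup(root, "SW1A"))   # two records share this key
print(lookup(root, "SW9"))    # nothing indexed under this key
print(count_nodes(root))      # shared prefixes are stored only once
```

The key itself is never stored in a record: it is spelled out by the path of pointers, which is presumably why the indexed field can be dropped from the database. Lookup cost depends only on the key length, not on how many records are indexed.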
Please share what knowledge you have, though we're heading into deep water, and probably left topic about the time I said "Perl". <G> Private email may be better. It's up to you.

<warning rebuttal>http://www.c2.com/cgi/wiki?QualityPlateau</warning rebuttal>

Pb