Re: CHAT: Summary, web based mailinglist archives
From: Boudewijn Rempt <bsarempt@...>
Date: Monday, October 25, 1999, 11:37
On Mon, 25 Oct 1999, Paul Bennett wrote:
> tal writes:
> >>>>>>
>
> Having author, date, threading-information and subject in a database,
> and grepping through the raw text would be a (quick and) workable
> solution. As for sizes, I've guesstimated that the list nets in at
> about 130 megs (unpacked) so far, growing by about 30 megs a year...
>
> <<<<<<
>
> If you're looking for "quick & dirty" interim fixes, I'm going to start a holy
> war by suggesting that a set of DBMs (with indices) plonked into some Perl
> hashes should do the trick, and Perl's RE engine outperforms both grep and awk
> considerably. You could then squirt the text of the files into html as you go.
>
> My apologies to any Python fans (You Know Who You Are), it's just that I'm
> learning Perl at the moment and have never studied Python at length. I'd agree
> that a free-SQL backend might be better for a post-alpha project, however.
>
I've never tried Perl (beyond trying to read the documentation), but I
fancy that, say, 200 MB of text is a bit too much to regexp easily, even
for Perl. Storing the header info and perhaps some keywords in a
database (it doesn't matter much whether you pick a real database or use
DBMs; both are equally easy), and indexing the text files themselves
with glimpse, should be much more workable. A few man-days from specs
to prototype, I guess.
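
For what it's worth, here is a rough sketch of the header-database half of
that idea, in Python. It assumes each message is available as raw RFC 822
text with a Message-ID header; the function names (index_headers, lookup)
and the JSON record layout are made up for illustration, and the full-text
side (glimpse) is not covered at all.

```python
import dbm
import email
import json


def index_headers(messages, db_path):
    """Store author, date, subject and threading info, keyed by Message-ID.

    `messages` is an iterable of raw message strings (an assumption about
    how the archive is stored); `db_path` is the DBM file to create.
    """
    with dbm.open(db_path, "c") as db:
        for raw in messages:
            msg = email.message_from_string(raw)
            msg_id = msg.get("Message-ID")
            if msg_id is None:
                continue  # nothing to key the record on, so skip it
            record = {
                "from": msg.get("From", ""),
                "date": msg.get("Date", ""),
                "subject": msg.get("Subject", ""),
                "in_reply_to": msg.get("In-Reply-To", ""),
            }
            # DBM values are flat strings/bytes, so serialise the record.
            db[msg_id] = json.dumps(record)


def lookup(db_path, msg_id):
    """Return the stored header record for one Message-ID as a dict."""
    with dbm.open(db_path, "r") as db:
        return json.loads(db[msg_id])
```

Something like lookup("headers", "<some-id@host>") would then give back the
stored fields without touching the message bodies; only the bodies would
need the separate text index.
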
> Just as soon as I can pick a linux and jump in with both feet, I'll start
> experimenting, if you like...
>
I'd advise you to stay clear of Suse - it tries to be clever, instead of
deferring to the sysadmin. When the next Slackware comes out, I'll be
returning to it.
Boudewijn Rempt | http://denden.conlang.org/~bsarempt