Re: A tale of a programming conlanger
From: Don Blaheta <dpb@...>
Date: Monday, October 4, 1999, 19:41
--6c2NcOVqGQ03X4Wi
Content-Type: text/plain; charset=us-ascii
Quoth Charles:
> taliesin the storyteller wrote:
> > > So, where can we download the source? Sounds mighty interesting.
> >
> > I scrapped it all and made a lettercounter (no digraphs) from scratch.
> > I'll probably expand on it, but I have to figure out what it is I really
> > want to use it for first...
>
> OK, I couldn't resist any longer, and spent 5 minutes writing this:
I spent a tad longer, and made a slightly configurable one. By the way:
> # all digraphs
> $jj = lc $zz; while ($jj =~ s/(.)(.)/$2/) { $cnt{ "digraph $1$2" } ++ }
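This works only because the replacement keeps $2: each pass drops one
character, so in "read" it counts "re", "ea" and "ad". With an empty
replacement it would miss the unaligned digraphs (only "re" and "ad"
would be counted, not "ea").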
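To make the aligned/unaligned distinction concrete, here is a minimal sketch, separate from both scripts in this thread. The sub names aligned_digraphs and all_digraphs are made up for the illustration: the first consumes two characters per match, the second uses a lookahead so that matches can overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: counts only aligned pairs (chars 1-2, 3-4, ...),
# because each /g match consumes both characters it captured.
sub aligned_digraphs {
    my ($text) = @_;
    my %count;
    $count{"$1$2"}++ while $text =~ /(.)(.)/g;
    return \%count;
}

# Hypothetical helper: counts every adjacent pair. The lookahead captures
# two characters without consuming them, so the next match starts one
# character further along.
sub all_digraphs {
    my ($text) = @_;
    my %count;
    $count{"$1$2"}++ while $text =~ /(?=(.)(.))/g;
    return \%count;
}

print join(" ", sort keys %{ aligned_digraphs("read") }), "\n";   # ad re
print join(" ", sort keys %{ all_digraphs("read") }), "\n";       # ad ea re
```

Neither sub lowercases its input or skips whitespace; that is left to the caller.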
Anyway, my script is attached, along with its output when run with -x
on 177K words of Wall Street Journal text. If anyone is exceptionally
interested, I could probably augment it in various ways.
--
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
Segun nusen savo, nusen komputile ha nulifoy have nondetektet erore.
-- Weisert
--6c2NcOVqGQ03X4Wi
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="letters.perl"
#!/usr/local/bin/perl -w
# © 1999 by Don Blaheta, licensed under GPL
use strict;
use vars qw(%opt %single %initial %digraph $single $initial $digraph $prev);
use Getopt::Std;
# Currently counts three distributions: single letters, digraphs, and
# letters that appear at the beginnings of words (that is, at the starts
# of lines and after whitespace). Non-letter non-whitespace characters
# are counted as letters, but not reported by default (-a to report them)
# Not normally case-sensitive (by current locale's definitions of case),
# but can be made so with -c. If -x option is given, script will filter
# out XML-like stuff---everything between <> on a single line.
getopts("xca", \%opt);
my $ignore_XML = $opt{x};
my $ignore_case = !$opt{c};
my $ignore_nonalph = !$opt{a};
while (<>) {
    s/\<.*?\>//g if $ignore_XML;
    $_ = lc $_ if $ignore_case;
    s/\s{2,}/ /g;
    $prev = ' ';
    $_ .= ' ';
    while (/(.)/g) {
        ++$single{$1} and ++$single;
        ++$initial{$1} and ++$initial if $prev eq " ";
        ++$digraph{$prev . $1} and ++$digraph;
        # add your own favourite distribution here, and print it as below
        $prev = $1;
    }
}
print "Single letter distribution: \n";
print_distribution (\%single, $single);
print "Word-initial letter distribution: \n";
print_distribution (\%initial, $initial);
print "Digraph distribution: \n";
print_distribution (\%digraph, $digraph, 40);
# prints a nice table of the distribution. First arg is a hash between
# strings and counts; second arg is a total of all counts (for calculating
# percentages), and third arg, if present, is max num of lines to print.
sub print_distribution {
    my $hash = shift;
    my $count = shift;
    my $length = shift || keys %$hash;
    print "      count       %\n== ======== =======\n";
    foreach my $dig (sort {$$hash{$b} <=> $$hash{$a}} keys %$hash) {
        next if ($dig =~ /[^A-Za-z]/ and $ignore_nonalph);
        last unless $length--;
        printf "%2s %8d %6.3f%%\n", $dig,
            $$hash{$dig}, 100.0*$$hash{$dig}/$count;
    }
}
--6c2NcOVqGQ03X4Wi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=counts
Single letter distribution:
      count       %
== ======== =======
 e    88326  9.097%
 t    65329  6.729%
 a    60890  6.271%
 i    55196  5.685%
 o    54482  5.611%
 n    53751  5.536%
 s    53110  5.470%
 r    50151  5.165%
 l    31368  3.231%
 h    29857  3.075%
 d    27959  2.880%
 c    26100  2.688%
 m    19885  2.048%
 u    19844  2.044%
 p    16750  1.725%
 f    16334  1.682%
 g    14382  1.481%
 y    12195  1.256%
 b    11797  1.215%
 w    11142  1.148%
 v     7261  0.748%
 k     5522  0.569%
 x     2023  0.208%
 j     1726  0.178%
 q      860  0.089%
 z      699  0.072%
Word-initial letter distribution:
      count       %
== ======== =======
 t    20967 11.763%
 a    16021  8.988%
 s    11072  6.212%
 i    10210  5.728%
 o     8895  4.990%
 c     8653  4.855%
 b     7818  4.386%
 m     7171  4.023%
 p     6699  3.758%
 w     6632  3.721%
 f     6408  3.595%
 r     4801  2.694%
 h     4780  2.682%
 d     4656  2.612%
 e     3700  2.076%
 l     3438  1.929%
 n     3162  1.774%
 g     2377  1.334%
 u     1857  1.042%
 y     1498  0.840%
 j     1160  0.651%
 v      898  0.504%
 k      535  0.300%
 q      319  0.179%
 z       62  0.035%
 x       25  0.014%
Digraph distribution:
      count       %
== ======== =======
th    15891  1.637%
in    15040  1.549%
he    13187  1.358%
er    11822  1.218%
re    11094  1.143%
an    11038  1.137%
on    10753  1.108%
es     8791  0.905%
ar     7788  0.802%
or     7758  0.799%
en     7497  0.772%
st     7382  0.760%
at     7374  0.759%
te     7079  0.729%
to     6932  0.714%
ti     6813  0.702%
ed     6627  0.683%
it     6586  0.678%
nd     6259  0.645%
al     6220  0.641%
nt     6201  0.639%
ng     5983  0.616%
co     5939  0.612%
ha     5717  0.589%
se     5387  0.555%
of     5203  0.536%
is     5099  0.525%
de     4971  0.512%
io     4948  0.510%
as     4948  0.510%
ne     4695  0.484%
ve     4651  0.479%
ll     4625  0.476%
ro     4523  0.466%
le     4490  0.462%
me     4359  0.449%
ra     4340  0.447%
ea     4320  0.445%
ou     4309  0.444%
li     4292  0.442%
--6c2NcOVqGQ03X4Wi--