Re: TECH: schcompile (Was: More Þrjótran)
From: Philip Newton <philip.newton@...>
Date: Sunday, April 23, 2006, 13:06
On 4/23/06, Benct Philip Jonsson <bpj@...> wrote:
> Henrik Theiling wrote:
>
> > Because the file is translated to Perl with the literal strings left
> > as is, you can use whatever your Perl installation accepts. I.e.,
> > basically anything.
>
> OK, but how do I make Perl know that an input file is in UTF-8?
One question is: do you need Perl to know that an input file is in UTF-8?
If the rules are just replacing bytes with other bytes, it doesn't really
matter whether Perl thinks of a string as "öå" (island stream?)
or as "Ã¶Ã¥" -- it's all just bytes to Perl. So if the rule file is in
UTF-8 and the input text is in UTF-8, too, it should just work, and
the output will be in UTF-8 as well, even if Perl is unaware of that
fact.
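
To illustrate the point, here is a minimal sketch of a byte-level replacement on UTF-8 data. The variable names are made up for the example, and the bytes are written as escapes so that it does not depend on how the script file itself is saved. It relies on the fact that in UTF-8 the byte sequence of one complete character never occurs inside another character:

```perl
use strict;
use warnings;

# UTF-8 byte sequences, spelled out as escapes (hypothetical rule data).
my $o_umlaut = "\xC3\xB6";   # "ö" as UTF-8 bytes
my $o_slash  = "\xC3\xB8";   # "ø" as UTF-8 bytes
my $a_ring   = "\xC3\xA5";   # "å" as UTF-8 bytes

# "öå" as Perl sees it without any decoding: four bytes.
my $text = $o_umlaut . $a_ring;

# Replace the two bytes of "ö" with the two bytes of "ø".
# Perl never needs to know these are characters.
(my $out = $text) =~ s/$o_umlaut/$o_slash/g;

# $out now holds the UTF-8 bytes for "øå".
print length($text), " bytes in, ", length($out), " bytes out\n";
```

This only holds for plain string replacement. Rules that use "." or character classes would see each byte of "ö" separately, and that is where telling Perl about the encoding starts to matter.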
> I tried with a simple program:
>
> while (<INFILE>) {
>     chomp;
>     print length($_) . "\n";
> }
>
> Which printed 12 when the line in the input file was really
> six UTF-8 characters! So I guess there must be some way of
> telling Perl in what encoding INFILE is.
With newer perls (5.8.0 or later, I believe; any 5.8.x should be fine),
I *think* that this could work:
    open INFILE, '<:utf8', 'filename';

and, for output,

    open OUTFILE, '>:utf8', 'otherfilename';
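
As a rough check of the difference, the sketch below writes six two-byte characters to a temporary file and reads the line back twice, once as raw bytes and once through the :utf8 layer. The temporary file and the variable names are only there for the example; File::Temp ships with perl 5.8:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test data: "öåöåöå" as UTF-8 bytes, plus a newline.
my ($fh, $filename) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "\xC3\xB6\xC3\xA5" x 3, "\n";
close $fh or die "close: $!";

# Read as raw bytes (what a plain open gives by default): length counts bytes.
open my $raw, '<:raw', $filename or die "open: $!";
my $line = <$raw>;
chomp $line;
my $byte_len = length $line;
close $raw;

# Read through the :utf8 layer: length counts characters.
open my $in, '<:utf8', $filename or die "open: $!";
$line = <$in>;
chomp $line;
my $char_len = length $line;
close $in;

print "$byte_len $char_len\n";   # prints "12 6"
```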
Alternatively, "binmode(INFILE, ':utf8');" on an already opened
filehandle may help, as may putting "use open ':utf8';" near the top of
the script. Running 'perldoc perluniintro' should also give some
pointers to further documentation.
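
A minimal sketch of the binmode variant, again using made-up temporary files so that it can be run as is. The handles are opened without any layer and switched to :utf8 afterwards:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input: "öåöåöå" as UTF-8 bytes, plus a newline.
my ($tmp, $infile) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xC3\xB6\xC3\xA5" x 3, "\n";
close $tmp or die "close: $!";

# A second temporary file to write to; only its name is needed.
my (undef, $outfile) = tempfile(UNLINK => 1);

# Open normally, then tell Perl the handle carries UTF-8.
open INFILE, '<', $infile or die "open: $!";
binmode(INFILE, ':utf8');
my $line = <INFILE>;
chomp $line;
my $chars = length $line;      # counts characters now, not bytes
close INFILE;

# Same on the way out: characters are encoded back to UTF-8 bytes.
open OUTFILE, '>', $outfile or die "open: $!";
binmode(OUTFILE, ':utf8');
print OUTFILE $line, "\n";
close OUTFILE or die "close: $!";

my $bytes = -s $outfile;       # size on disk, in bytes
print "$chars characters read, $bytes bytes written\n";
```

With "use open ':utf8';" in effect, the two binmode calls should not be needed for handles opened in that lexical scope, but I have not tested that combination here.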
Good luck,
--
Philip Newton <philip.newton@...>