Re: TECH: schcompile (Was: More Þrjótran)

From:	Philip Newton <philip.newton@...>
Date:	Sunday, April 23, 2006, 13:06


On 4/23/06, Benct Philip Jonsson <bpj@...> wrote:
> Henrik Theiling wrote:
>
> > Because the file is translated to Perl with the literal strings left
> > as is, you can use whatever your Perl installation accepts.  I.e.,
> > basically anything.
>
> OK, but how do I make Perl know that an input file is in UTF-8?

One question is: do you need Perl to know that an input file is in UTF-8?

If it's just replacing bytes with other bytes, it doesn't really
matter much whether Perl thinks of a string as "öå" (island stream?)
or as "Ã¶Ã¥" -- it's all just bytes to Perl. So if the rule file is in
UTF-8 and the input text is in UTF-8, too, it should just work, and
the output should be in UTF-8, too, even if Perl is unaware of that
fact.
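
As a minimal sketch of that byte-for-byte case (the sample word and the
variable names are made up for illustration, and the UTF-8 bytes are written
as escapes so the result does not depend on how the script itself is saved):

```perl
use strict;
use warnings;

# No "use utf8" and no I/O layers: every string here is a plain byte string.
my $input = "sm\xC3\xB6r";   # "smör" as UTF-8: 5 bytes, 4 characters
my $from  = "\xC3\xB6";      # "ö" as UTF-8
my $to    = "\xC3\xA5";      # "å" as UTF-8

# A literal substring replacement only compares bytes, so it does not
# care that Perl has never been told these bytes are UTF-8.
(my $output = $input) =~ s/\Q$from\E/$to/g;

# Hex dump of the result: the UTF-8 bytes of "smår".
print unpack("H*", $output), "\n";   # 736dc3a572
```

This only holds for literal replacements. As soon as a rule relies on
character classes, "." or length(), Perl will be counting bytes rather than
characters unless the data has been decoded.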

> I tried with a simple program:
>
> while(<INFILE>){
>      chomp;
>      print length($_) . "\n";
> }
>
> Which printed 12 when the line in the input file was really
> six UTF-8 characters!  So I guess there must be some way of
> telling Perl in what encoding INFILE is.

With newer perls (5.8.0 or later, which is where I/O layers such as
:utf8 arrived; 5.8.x should all be fine), I *think* that this could work:

    open INFILE, '<:utf8', 'filename';

and similarly, for output,

    open OUTFILE, '>:utf8', 'otherfilename';

Alternatively, calling binmode(INFILE, ':utf8') after opening the file
may help, as may putting "use open ':utf8';" near the top of the script.
Running 'perldoc perluniintro' may also help, and may provide some
pointers to further documentation.
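
A small self-contained sketch of the difference, assuming Perl 5.8 or later:
it writes six two-byte characters to a scratch file and reads the line back
with and without the layer. The temporary file, the lexical filehandles and
the "or die" checks are my own additions, not part of your program.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write "öåöåöå" as raw UTF-8 bytes (6 characters, 12 bytes) to a scratch file.
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xC3\xB6\xC3\xA5" x 3, "\n";
close $tmp or die "close: $!";

# Without a layer, Perl hands back the bytes as they are.
open my $raw, '<', $filename or die "open: $!";
chomp(my $bytes = <$raw>);
close $raw;

# With the :utf8 layer, the same line is decoded into characters.
open my $in, '<:utf8', $filename or die "open: $!";
chomp(my $chars = <$in>);
close $in;

print length($bytes), "\n";   # 12
print length($chars), "\n";   # 6
```

The :utf8 layer trusts that the input really is well-formed UTF-8; the
stricter ':encoding(UTF-8)' layer checks it while decoding, which may be
preferable for files of uncertain origin.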

Good luck,
--
Philip Newton <philip.newton@...>
