SAMPA to Unicode tool

From:	Julien Eychenne <je@...>
Date:	Monday, July 7, 2003, 13:55
Hi,

As I am writing a little website, I needed an efficient way to encode
Unicode IPA entities. However, since not everyone is familiar with Unicode
or has a recent browser, I also needed to write transcriptions in
SAMPA. So, I wrote a Perl script (unipa.pl) that can :
- interactively translate an (X-)SAMPA string into Unicode HTML entities,
when called without argument (e.g. './unipa.pl')
- convert all SAMPA transcriptions into Unicode HTML entities in a file,
when called with an argument (e.g. './unipa.pl -c myfile' or './unipa.pl
--convert myfile').

The script must be run from the command line.  It should work on all
platforms. Of course, you need Perl to use it. If you don't have it, have
a look at :
  http://www.activestate.com/Products/Download/Download.plex?id=ActivePerl

In a file, SAMPA code must be enclosed between '$_' and '_$' (to be
recognized by the script). For instance, the SAMPA pronunciation of
_sabyukà_ must be written [$_s6bj"uk@_$] (or $_[s6bj"uk@]_$). The script
makes a backup of your file (myfile.sampa), deletes the markup '$_'
and '_$' and translates the code into Unicode HTML entities
("s&#592;bj&#712;uk&#601;" in my example).
When the script is called without argument, you must NOT use '$_' and
'_$'; type the SAMPA code directly (e.g. 's6bj"uk@' without quotes).
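
As a minimal sketch of how the '$_' ... '_$' markup can be picked out of
a text, the Perl below replaces each marked span with its conversion and
leaves everything else untouched. The names toy_convert() and
convert_marked() and the two-symbol table are hypothetical and only serve
the illustration; the real script uses its full sampa2unicode() table.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real converter: it only knows two symbols.
my %toy_table = ('6' => '&#592;', '@' => '&#601;');

sub toy_convert {
    my ($sampa) = @_;
    my $out = '';
    for my $char (split //, $sampa) {
        # Known symbols become HTML entities, everything else is kept.
        $out .= exists $toy_table{$char} ? $toy_table{$char} : $char;
    }
    return $out;
}

# Replace every $_..._$ span with its conversion and drop the delimiters.
sub convert_marked {
    my ($text) = @_;
    $text =~ s/\$_(.*?)_\$/toy_convert($1)/ges;
    return $text;
}

print convert_marked('The word is [$_s6b@_$], not $5.'), "\n";
```

Text outside the delimiters, including a stray '$', passes through
unchanged, which is why the markup is needed in file mode.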

I tried to stick as close as possible to (X-)SAMPA as it is explained
here :  http://www.phon.ucl.ac.uk/home/sampa/home.htm
I did not include symbols that are not implemented in Unicode
(independent tones : _R ; _F ; _H_T ; _B_L ; _R_F).

However, there are a few things you should be aware of :
- you CANNOT use the symbols § and # (except the latter in a particular
case). If you really need them, you must exclude them from the code. For
instance, you could write $_sampa_$#$_ipa_$ to get 'sampa#ipa'.
- diacritics must be written AFTER the character they modify (e.g.
'p_?\a~' for a pharyngealized p followed by a nasalized a)
- syllabic consonants can be written either C_= or C= (where C is any
consonant); nasalised sounds can be written either C_~ or C~
- glottalized  sounds can be written C_?
- the tie bar is ugly, as it is not really implemented in Unicode. I
didn't use the standard '_', because it is used for most diacritics.
Thus, 'Q_O' could mean either "more rounded [Q]" or "diphthong [QO]". So,
tie bar must be written either '__' (double underscore) or '#' (sharp).
For instance, [t_S] can be written [$_t__S_$] or [$_t#S_$].
- near the end of the script, you can change what the single quote (')
stands for, to suit your needs, if you use more glottalized (ejective)
than palatalized sounds.
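
The tie-bar spellings and the "modifier after the character" rule can be
illustrated with a small sketch. It rewrites '__' and '#' to one internal
marker, then reads the string from right to left, so that a trailing
modifier is seen before the character it applies to. The name toy_sampa(),
the marker character and the tiny tables are hypothetical; this is a
simplified illustration under those assumptions, not the script's full
logic.

```perl
use strict;
use warnings;

# Hypothetical mini-tables; the real script covers far more symbols.
my %plain     = ('S' => '&#643;', 'Z' => '&#658;');
my %backslash = ('r' => '&#633;', 's' => '&#597;');

sub toy_sampa {
    my ($sampa) = @_;
    # Both tie-bar spellings become one internal marker character.
    $sampa =~ s/__|#/\x01/g;
    my @out;
    my $has_backslash = 0;
    # Read right to left: a trailing "\" is met before its base character.
    for my $char (reverse split //, $sampa) {
        if ($char eq '\\') { $has_backslash = 1; next; }
        my $code;
        if ($char eq "\x01") {
            $code = '&#865;';    # combining double inverted breve (tie)
        }
        elsif ($has_backslash) {
            $code = exists $backslash{$char} ? $backslash{$char} : $char;
        }
        else {
            $code = exists $plain{$char} ? $plain{$char} : $char;
        }
        unshift @out, $code;
        $has_backslash = 0;      # the modifier applies to one character only
    }
    return join '', @out;
}

print toy_sampa('t__S'), "\n";    # same result as toy_sampa('t#S')
print toy_sampa('r\\at'), "\n";
```

Reading backwards keeps the loop simple: by the time a base character is
reached, every modifier that follows it in the input is already known.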

So, I hope this will be useful for you (at least it is for me). I think
it might be of interest for people willing to translate their SAMPA
webpages : you just need to markup your transcriptions with '$_' and
'_$' and run the script on the page.

This is GPL'ed software, so feel free to modify it for your own purposes
:). Comments and feedback will be greatly appreciated :))). If you find
errors, please let me know. This is the very first version of the script,
so please be lenient.

Here is the script : you must copy it into a text file called 'unipa.pl'.

Best regards,

Julien.





#!/usr/bin/perl -w

#############################################################################
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
#
#############################################################################


# unipa.pl :
# From SAMPA to Unicode.
# version 0.1
# Author : Julien Eychenne <je@...>
# website : http://www.exuna.net
# Copyright 2003, Julien Eychenne
#
# (X-)SAMPA : http://www.phon.ucl.ac.uk/home/sampa/home.htm
# Unicode : http://www.unicode.org
#
# to do :
# - tie bar
# - help
# - tones
#
# absent in Unicode 4.0 : tie bar, linking mark, nasal release, tones
# nasal release was replaced with "superscript n"
# "tie bar" was replaced with an ugly tie-like bar.

# Sorry, variable and function names are in French...

use strict;
use File::Copy;

my ($chaine,$i,$car,$code,$unicode,$fichier,$file);
my (@input,@output);
my $barre = 0;
my $retro = 0;
my $pseudo = 0;
my $diese = 0;



if (! exists($ARGV[0])) { interaction() }
elsif (($ARGV[0] eq "-h") or ($ARGV[0] eq "--help")) { aide() }
elsif (($ARGV[0] eq "-c") or ($ARGV[0] eq "--convert")) { conversion() }
else {
     print "$ARGV[0] is not a valid argument.\n"
        ."To display help, type --help or -h\n";
}

sub aide {
     print "To do...\n";
}

sub interaction {
     print "Write in the SAMPA code to be converted into Unicode\n";
     $chaine = <>;
     chomp($chaine);
     sampa2unicode($chaine);
     print "The Unicode translation of your code is :\n";
     print "$unicode\n";
     exit 0;
}

sub conversion {
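     $fichier = $ARGV[1];
     # stop early if no file name was given after -c / --convert
     defined($fichier) or die "Please give the name of the file to convert.\n";
     chomp($fichier);
     copy("$fichier","$fichier".".sampa") or die "I can't copy your file: $!\n";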
     print "\nI made a copy of your file : $fichier".".sampa\n\n";
     open(SAMPA,"$fichier".".sampa") or die "I can't open the copy"
        ." of your file $!\n";
     {
        local $/ = undef;
        $file = <SAMPA>;
     }
     my @portions = split(/(\$\_.*?\_\$)/,$file);
     close(SAMPA);
     open(IN,">$fichier") or die "I can't open your file $!\n";
     for (@portions) {
        if ($_ =~ /\$\_(.*?)\_\$/) {
            sampa2unicode($1);
            $_ = $unicode;
        }
        print IN $_;
     }
     close(IN);
     print "The job is done...\n\n";
     exit 0;
}


# sampa2unicode() : converts the SAMPA string into Unicode by reading it
# backwards, character by character, and stores it in the string $unicode.
# --> STRING
sub sampa2unicode {
     my $chaine = $_[0];
     undef $unicode;
     undef @input;
     undef @output;
     # conversion into pseudo-SAMPA
     # template :       $chaine =~ s///g;
     $chaine =~ s/\|\\\|\\/\|§/g;
     $chaine =~ s/\#/_\#/g;
     if ($chaine =~ /_/) {
        $chaine =~ s/__/_\#/g;
        $chaine =~ s/_0/0§/g;
        $chaine =~ s/_v/v§/g;
        $chaine =~ s/_h/h§/g;
        $chaine =~ s/_=/=/g;
        $chaine =~ s/_\>/\>§/g;
        $chaine =~ s/_\</\#/g;
        $chaine =~ s/_\~/\~/g;
        $chaine =~ s/_O/O§/g;
        $chaine =~ s/_c/c§/g;
        $chaine =~ s/_w/w§/g;
        $chaine =~ s/_j/j§/g;
        $chaine =~ s/_G/V§/g;
        $chaine =~ s/_\?\\/\?§/g;
        $chaine =~ s/_\?/\?\#/g;
        $chaine =~ s/([^a-zA-Z])\`/$1\`\#/g;
        $chaine =~ s/_\^/\^§/g;
        $chaine =~ s/_\+/\+§/g;
        $chaine =~ s/_-/-§/g;
        $chaine =~ s/_\"/\"§/g;
        $chaine =~ s/_t/t§/g;
        $chaine =~ s/_k/k§/g;
        $chaine =~ s/_N/N§/g;
        $chaine =~ s/_d/d§/g;
        $chaine =~ s/_a/a§/g;
        $chaine =~ s/_m/m§/g;
        $chaine =~ s/_r/r§/g;
        $chaine =~ s/_o/o§/g;
        $chaine =~ s/_\}/\}§/g;
        $chaine =~ s/_A/A§/g;
        $chaine =~ s/_q/q§/g;
        $chaine =~ s/_e/e§/g;
        $chaine =~ s/_X/X§/g;
        $chaine =~ s/_n/n§/g;
        $chaine =~ s/_l/l§/g;
        $chaine =~ s/--\>/\>\#/g;
        $chaine =~ s/\<R\>/R\#/g;
        $chaine =~ s/\<F\>/F\#/g;
        $chaine =~ s/_T/T\#/g;
        $chaine =~ s/_H/H\#/g;
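        # _L (low tone) and _M (mid tone) must keep their own letters:
        # L# is mapped to the grave accent and M# to the macron below
        $chaine =~ s/_L/L\#/g;
        $chaine =~ s/_M/M\#/g;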
        $chaine =~ s/_B/B\#/g;
     }

     @input = split(//, $chaine);

     # Unicode conversion
     for ($i=$#input ; $i>=0 ; $i--) {
        $car = $input[$i];
        # check for the \ modifier
        if ($car eq "\\") {
            $barre = 1;
            next;
        }
        # check for the ` modifier; when it is followed by # it marks a
        # rhotic vowel and is handled further down
        if (($car eq "\`") and ($diese == 0)) {
            $retro = 1;
            next;
        }
        # check for the § modifier
        if ($car eq "§") {
            $pseudo = 1;
            next;
        }
        # check for the # modifier
        if ($car eq "\#") {
            $diese = 1;
            next;
        }


        # template :  elsif ($car eq "") { $code = "\&\#;" }
        # rhotic vowels
        if (($car eq "\`") and ($diese == 1))
        { $code = "\&\#734;" }
        # with the # marker
        elsif ($diese == 1) {
            if ($car eq "b") { $code = "\&\#595;" }
            # implosive d is U+0257 (599); 598 is the retroflex d
            elsif ($car eq "d") { $code = "\&\#599;" }
            elsif ($car eq "S") { $code = "\&\#644;" }
            elsif ($car eq "g") { $code = "\&\#608;" }
            elsif ($car eq "G") { $code = "\&\#667;" }
            elsif ($car eq "\>"){ $code = "\&\#8594;"}
            elsif ($car eq "_") { $code = "\&\#865;" }
            elsif ($car eq "?") { $code = "\&\#704;" }
            elsif ($car eq "R") { $code = "\&\#8599;"}
            elsif ($car eq "F") { $code = "\&\#8600;"}
            elsif ($car eq "T") { $code = "\&\#779;" }
            elsif ($car eq "H") { $code = "\&\#769;" }
            elsif ($car eq "M") { $code = "\&\#772;" }
            elsif ($car eq "L") { $code = "\&\#768;" }
            elsif ($car eq "B") { $code = "\&\#783;" }
            else { print "\n\nYour chain has unknown or not implemented "
                       ."symbols near $car\n\n"}
        }
        # with the § marker
        elsif ($pseudo == 1) {
            if ($car eq "|") { $code = "\&\#449;" }
            elsif ($car eq "h") { $code = "\&\#688;" }
            elsif ($car eq "v") { $code = "\&\#812;" }
            elsif ($car eq "0") { $code = "\&\#805;" }
            elsif ($car eq "O") { $code = "\&\#825;" }
            elsif ($car eq "c") { $code = "\&\#796;" }
            elsif ($car eq ">") { $code = "\&\#700;" }
            elsif ($car eq "w") { $code = "\&\#695;" }
            elsif ($car eq "j") { $code = "\&\#690;" }
            elsif ($car eq "V") { $code = "\&\#736;" }
            elsif ($car eq "\?"){ $code = "\&\#740;" }
            elsif ($car eq "\^"){ $code = "\&\#815;" }
            elsif ($car eq "+") { $code = "\&\#799;" }
            elsif ($car eq "-") { $code = "\&\#800;" }
            elsif ($car eq "\""){ $code = "\&\#776;" }
            elsif ($car eq "t") { $code = "\&\#804;" }
            elsif ($car eq "k") { $code = "\&\#816;" }
            elsif ($car eq "N") { $code = "\&\#828;" }
            elsif ($car eq "a") { $code = "\&\#826;" }
            elsif ($car eq "m") { $code = "\&\#827;" }
            elsif ($car eq "d") { $code = "\&\#810;" }
            elsif ($car eq "}") { $code = "\&\#794;" }
            elsif ($car eq "r") { $code = "\&\#797;" }
            elsif ($car eq "o") { $code = "\&\#798;" }
            elsif ($car eq "A") { $code = "\&\#792;" }
            elsif ($car eq "q") { $code = "\&\#793;" }
            elsif ($car eq "e") { $code = "\&\#820;" }
            elsif ($car eq "X") { $code = "\&\#774;" }
            elsif ($car eq "n") { $code = "\&\#8319;"}
            elsif ($car eq "l") { $code = "\&\#737;" }
            else { print "\n\nYour chain has unknown or not implemented "
                       ."symbols near $car\n\n"}
        }
        # retroflex approximant r (r\`): must be tested before the
        # generic backslash branch, otherwise it is never reached
        elsif (($barre == 1) and ($retro == 1) and ($car eq 'r'))
        { $code = "\&\#635;" }
        # symbols with a backslash
        elsif ($barre == 1) {
            if ($car eq "J") { $code = "\&\#607;" }
            elsif ($car eq "G") { $code = "\&\#610;" }
            elsif ($car eq "N") { $code = "\&\#628;" }
            elsif ($car eq "B") { $code = "\&\#665;" }
            elsif ($car eq "R") { $code = "\&\#640;" }
            elsif ($car eq "p") { $code = "\&\#632;" }
            elsif ($car eq "j") { $code = "\&\#669;" }
            elsif ($car eq "X") { $code = "\&\#295;" }
            elsif ($car eq ">") { $code = "\&\#673;" }
            elsif ($car eq "<") { $code = "\&\#674;" }
            elsif ($car eq "?") { $code = "\&\#661;" }
            elsif ($car eq "H") { $code = "\&\#668;" }
            elsif ($car eq "h") { $code = "\&\#614;" }
            elsif ($car eq "K") { $code = "\&\#622;" }
            elsif ($car eq "v") { $code = "\&\#651;" }
            elsif ($car eq "r") { $code = "\&\#633;" }
            elsif ($car eq "M") { $code = "\&\#624;" }
            elsif ($car eq "L") { $code = "\&\#671;" }
            elsif ($car eq "\@"){ $code = "\&\#600;" }
            elsif ($car eq "3") { $code = "\&\#606;" }
            elsif ($car eq "s") { $code = "\&\#597;" }
            elsif ($car eq "z") { $code = "\&\#657;" }
            elsif ($car eq "l") { $code = "\&\#634;" }
            elsif ($car eq "x") { $code = "\&\#615;" }
            elsif ($car eq "O") { $code = "\&\#664;" }
            elsif ($car eq "|") { $code = "\&\#448;" }
            elsif ($car eq "!") { $code = "\&\#451;" }
            elsif ($car eq "=") { $code = "\&\#450;" }
            elsif ($car eq ":") { $code = "\&\#721;" }
            else { print "\n\nYour chain has unknown or not implemented "
                       ."symbols near $car\n\n"}
        }
        # retroflexes
        elsif ($retro == 1) {
            if ($car eq "t") { $code = "\&\#648;" }
            elsif ($car eq "d") { $code = "\&\#598;" }
            elsif ($car eq "n") { $code = "\&\#627;" }
            elsif ($car eq "r") { $code = "\&\#637;" }
            elsif ($car eq "s") { $code = "\&\#642;" }
            elsif ($car eq "z") { $code = "\&\#656;" }
            elsif ($car eq "l") { $code = "\&\#621;" }
            else { print "\n\nYour chain has unknown or not implemented "
                       ."symbols near $car\n\n"}
        }
        # consonants
        elsif ($car eq "?") { $code = "\&\#660;" }
        elsif ($car eq "F") { $code = "\&\#625;" }
        elsif ($car eq "J") { $code = "\&\#626;" }
        elsif ($car eq "N") { $code = "\&\#331;" }
        elsif ($car eq "4") { $code = "\&\#638;" }
        elsif ($car eq "B") { $code = "\&\#946;" }
        elsif ($car eq "T") { $code = "\&\#952;" }
        elsif ($car eq "D") { $code = "\&\#240;" }
        elsif ($car eq "S") { $code = "\&\#643;" }
        elsif ($car eq "Z") { $code = "\&\#658;" }
        elsif ($car eq "C") { $code = "\&\#231;" }
        elsif ($car eq "G") { $code = "\&\#611;" }
        elsif ($car eq "X") { $code = "\&\#967;" }
        elsif ($car eq "R") { $code = "\&\#641;" }
        elsif ($car eq "K") { $code = "\&\#620;" }
        elsif ($car eq "P") { $code = "\&\#651;" }
        elsif ($car eq "L") { $code = "\&\#654;" }
        elsif ($car eq "W") { $code = "\&\#653;" }
        elsif ($car eq "H") { $code = "\&\#613;" }
        elsif ($car eq "5") { $code = "\&\#619;" }
        # vowels
        elsif ($car eq "1") { $code = "\&\#616;" }
        elsif ($car eq "}") { $code = "\&\#649;" }
        elsif ($car eq "M") { $code = "\&\#623;" }
        elsif ($car eq "I") { $code = "\&\#618;" }
        elsif ($car eq "Y") { $code = "\&\#655;" }
        elsif ($car eq "U") { $code = "\&\#650;" }
        elsif ($car eq "2") { $code = "\&\#248;" }
        elsif ($car eq "8") { $code = "\&\#629;" }
        elsif ($car eq "7") { $code = "\&\#612;" }
        elsif ($car eq "\@"){ $code = "\&\#601;" }
        elsif ($car eq "E") { $code = "\&\#603;" }
        elsif ($car eq "9") { $code = "\&\#339;" }
        elsif ($car eq "3") { $code = "\&\#604;" }
        elsif ($car eq "V") { $code = "\&\#652;" }
        elsif ($car eq "O") { $code = "\&\#596;" }
        elsif ($car eq "{") { $code = "\&\#230;" }
        elsif ($car eq "6") { $code = "\&\#592;" }
        elsif ($car eq "\&"){ $code = "\&\#630;" }
        elsif ($car eq "A") { $code = "\&\#593;" }
        elsif ($car eq "Q") { $code = "\&\#594;" }
        # suprasegmental
        elsif ($car eq "\""){ $code = "\&\#712;" }
        elsif ($car eq "\%"){ $code = "\&\#716;" }
        elsif ($car eq ":") { $code = "\&\#720;" }
        elsif ($car eq "\^"){ $code = "\&\#8593;"}
        elsif ($car eq "!") { $code = "\&\#8595;"}
        # diacritics
        elsif ($car eq "\'"){ $code = "\&\#690;" }
# If you want to use the single quote for ejectives rather than palatalized
# consonants, you must comment out the line above with a '#' and uncomment
# the line below.
#       elsif ($car eq "\'"){ $code = "\&\#700;" }
        elsif ($car eq "~") { $code = "\&\#771;" }
        elsif ($car eq "=") { $code = "\&\#809;" }

        else { $code = $car }

        unshift(@output, $code);

        # done with this character : reset the modifier flags
        $barre = 0;
        $retro = 0;
        $pseudo = 0;
        $diese = 0;
     }
     $unicode = join("", @output);
}