SAMPA to Unicode tool
From: | Julien Eychenne <je@...> |
Date: | Monday, July 7, 2003, 13:55 |
Hi,
As I am writing a little website, I needed an efficient way to encode
Unicode IPA entities. However, since not everyone is familiar with Unicode
or has a recent browser, I also needed to write transcriptions in
SAMPA. So, I wrote a Perl script (unipa.pl) that can:
- interactively translate an (X-)SAMPA string into Unicode HTML entities,
when called without an argument (e.g. './unipa.pl')
- convert all SAMPA transcriptions in a file into Unicode HTML entities,
when called with an argument (e.g. './unipa.pl -c myfile' or './unipa.pl
--convert myfile').
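The choice between the two modes can be pictured with the following small sketch. It is only an illustration: the function name mode_for is made up for this example, and the '-h' / '--help' case reflects the help option that the script also checks for.

```perl
use strict;
use warnings;

# Hypothetical helper: report which mode a given argument list selects.
sub mode_for {
    my @args = @_;
    return 'interactive' unless @args;                     # no argument at all
    return 'help'    if $args[0] eq '-h' || $args[0] eq '--help';
    return 'convert' if $args[0] eq '-c' || $args[0] eq '--convert';
    return 'invalid';                                      # anything else
}

print mode_for(@ARGV), "\n";
```

Saved under any name and run without arguments, this sketch reports the interactive mode; run with '-c myfile', it reports the conversion mode.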
The script must be run from the command line. It should work on all
platforms. Of course, you need Perl to use it. If you do not have it,
have a look at:
http://www.activestate.com/Products/Download/Download.plex?id=ActivePerl
In a file, SAMPA code must be enclosed between '$_' and '_$' (to be
recognized by the script). For instance, the SAMPA pronunciation of
_sabyukà_ must be written [$_s6bj"uk@_$] (or $_[s6bj"uk@]_$). The script
makes a backup of your file (myfile.sampa), deletes the markup '$_'
and '_$', and translates the code into Unicode HTML entities
("sɐbjˈukə" in my example).
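As a rough illustration of the markup step (this is a simplified sketch, not the script's own code, which first splits the file into portions), replacing every marked span could look like this. convert_stub is a hypothetical stand-in for the real converter: it only uppercases its input so that the replaced spans are easy to spot.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real SAMPA converter (uppercases its input).
sub convert_stub { my ($sampa) = @_; return uc $sampa }

# Replace every $_..._$ span with its converted text and drop the markers.
sub convert_marked {
    my ($text, $convert) = @_;
    $text =~ s/\$_(.*?)_\$/$convert->($1)/ge;
    return $text;
}

my $line = 'The word is [$_s6bj"uk@_$], not $_abc_$.';
print convert_marked($line, \&convert_stub), "\n";
# prints: The word is [S6BJ"UK@], not ABC.
```

Text outside the markers is left untouched, which is why ordinary prose in the page survives the conversion.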
When the script is called without an argument, you must NOT use '$_' and
'_$'; type the SAMPA code directly (e.g. 's6bj"uk@' without quotes).
I tried to stick as closely as possible to (X-)SAMPA as it is explained
here: http://www.phon.ucl.ac.uk/home/sampa/home.htm
I did not include symbols that are not implemented in Unicode
(independent tones: _R ; _F ; _H_T ; _B_L ; _R_F).
However, there are a few things you should be aware of:
- you CANNOT use the symbols § and # (except the latter in a particular
case). If you really need them, you must keep them outside the marked-up
code. For instance, you could write $_sampa_$#$_ipa_$ to get 'sampa#ipa'.
- diacritics must be written AFTER the character they modify (e.g.
'p_?\a~' for a pharyngealized p followed by a nasalized a)
- syllabic consonants can be written either C_= or C= (where C is any
consonant); nasalised sounds can be written either C_~ or C~
- glottalized sounds can be written C_?
- the tie bar is ugly, as it is not really implemented in Unicode. I
didn't use the standard '_', because it is used for most diacritics.
Thus, 'Q_O' could mean either "more rounded [Q]" or "diphthong [QO]". So,
the tie bar must be written either '__' (double underscore) or '#' (sharp).
For instance, [t_S] can be written [$_t__S_$] or [$_t#S_$].
- near the end of the script, a commented-out line lets you change what
the apostrophe (') stands for. By default it marks palatalization; you can
switch it if you use more glottalized than palatalized sounds.
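To illustrate why modifiers have to follow the character they modify: the script reads the string from right to left, so it meets a modifier before its base character. Here is a minimal sketch with only the backslash modifier and a handful of symbols. The tables %plain and %backslash and the function name convert_backwards are made up for this example; the entity codes are the ones used in the script. Unlike the script, this sketch passes unknown symbols through silently.

```perl
use strict;
use warnings;

# Tiny example tables: a symbol on its own, and the same symbol followed by '\'.
my %plain     = ('?' => '&#660;', 'N' => '&#331;', 'S' => '&#643;');
my %backslash = ('?' => '&#661;', 'N' => '&#628;', 'r' => '&#633;');

sub convert_backwards {
    my ($sampa) = @_;
    my @out;
    my $modified = 0;                       # set when a '\' has just been seen
    for my $char (reverse split //, $sampa) {
        if ($char eq '\\') { $modified = 1; next }
        my $table = $modified ? \%backslash : \%plain;
        # Unknown symbols are copied unchanged.
        unshift @out, exists $table->{$char} ? $table->{$char} : $char;
        $modified = 0;                      # the modifier applies to one symbol only
    }
    return join '', @out;
}

print convert_backwards('r\\aN'), "\n";
# prints: &#633;a&#331;
```

Reading backwards means a single flag is enough to remember the modifier until its base character turns up.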
So, I hope this will be useful for you (at least it is for me). I think
it might be of interest to people who want to convert their SAMPA
webpages: you just need to mark up your transcriptions with '$_' and
'_$' and run the script on the page.
This is GPL'ed software, so feel free to modify it for your own purposes
:). Comments and feedback will be greatly appreciated :))). If you find
errors, please let me know. This is the very first version of the script,
so please be lenient.
Here is the script: you must copy it into a text file called 'unipa.pl'.
Best regards,
Julien.
#!/usr/bin/perl -w
#############################################################################
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
#############################################################################
# unipa.pl :
# From SAMPA to Unicode.
# version 0.1
# Author : Julien Eychenne <je@...>
# website : http://www.exuna.net
# Copyright 2003, Julien Eychenne
#
# (X-)SAMPA : http://www.phon.ucl.ac.uk/home/sampa/home.htm
# Unicode : http://www.unicode.org
#
# to do :
# - tie bar
# - help
# - tones
#
# absent in Unicode 4.0 : tie bar, linking mark, nasal release, tones
# nasal release was replaced with "superscript n"
# "tie bar" was replaced with an ugly tie-like bar.
# Sorry, variables, functions and comments are in French...
use strict;
use File::Copy;
my ($chaine,$i,$car,$code,$unicode,$fichier,$file);
my (@input,@output);
my $barre = 0;
my $retro = 0;
my $pseudo = 0;
my $diese = 0;
if (! exists($ARGV[0])) { interaction() }
elsif (($ARGV[0] eq "-h") or ($ARGV[0] eq "--help")) { aide() }
elsif (($ARGV[0] eq "-c") or ($ARGV[0] eq "--convert")) { conversion() }
else {
print "$ARGV[0] is not a valid argument.\n"
."To display help, type --help or -h\n";
}
sub aide {
print "To do...\n";
}
sub interaction {
print "Write in the SAMPA code to be converted into Unicode\n";
$chaine = <>;
chomp($chaine);
sampa2unicode($chaine);
print "The Unicode translation of your code is :\n";
print "$unicode\n";
exit 0;
}
sub conversion {
$fichier = $ARGV[1];
chomp($fichier);
copy("$fichier","$fichier".".sampa") or die "I can't open your file $!\n";
print "\nI made a copy of your file : $fichier".".sampa\n\n";
open(SAMPA,"$fichier".".sampa") or die "I can't open the copy"
." of your file $!\n";
{
local $/ = undef;
$file = <SAMPA>;
}
my @portions = split(/(\$\_.*?\_\$)/,$file);
close(SAMPA);
open(IN,">$fichier") or die "I can't open your file $!\n";
for (@portions) {
if ($_ =~ /\$\_(.*?)\_\$/) {
sampa2unicode($1);
$_ = $unicode;
}
print IN $_;
}
close(IN);
print "The job is done...\n\n";
exit 0;
}
# sampa2unicode(): converts the SAMPA string to Unicode by reading it
# backwards, character by character, and stores it in the string $unicode.
# --> STRING
sub sampa2unicode {
my $chaine = $_[0];
undef $unicode;
undef @input;
undef @output;
# conversion to pseudo-SAMPA
# template: $chaine =~ s///g;
$chaine =~ s/\|\\\|\\/\|§/g;
$chaine =~ s/\#/_\#/g;
if ($chaine =~ /_/) {
$chaine =~ s/__/_\#/g;
$chaine =~ s/_0/0§/g;
$chaine =~ s/_v/v§/g;
$chaine =~ s/_h/h§/g;
$chaine =~ s/_=/=/g;
$chaine =~ s/_\>/\>§/g;
$chaine =~ s/_\</\#/g;
$chaine =~ s/_\~/\~/g;
$chaine =~ s/_O/O§/g;
$chaine =~ s/_c/c§/g;
$chaine =~ s/_w/w§/g;
$chaine =~ s/_j/j§/g;
$chaine =~ s/_G/V§/g;
$chaine =~ s/_\?\\/\?§/g;
$chaine =~ s/_\?/\?\#/g;
# a ` after a non-letter marks a rhotic vowel; r\` stays a retroflex
$chaine =~ s/([^a-zA-Z\\])\`/$1\`\#/g;
$chaine =~ s/_\^/\^§/g;
$chaine =~ s/_\+/\+§/g;
$chaine =~ s/_-/-§/g;
$chaine =~ s/_\"/\"§/g;
$chaine =~ s/_t/t§/g;
$chaine =~ s/_k/k§/g;
$chaine =~ s/_N/N§/g;
$chaine =~ s/_d/d§/g;
$chaine =~ s/_a/a§/g;
$chaine =~ s/_m/m§/g;
$chaine =~ s/_r/r§/g;
$chaine =~ s/_o/o§/g;
$chaine =~ s/_\}/\}§/g;
$chaine =~ s/_A/A§/g;
$chaine =~ s/_q/q§/g;
$chaine =~ s/_e/e§/g;
$chaine =~ s/_X/X§/g;
$chaine =~ s/_n/n§/g;
$chaine =~ s/_l/l§/g;
$chaine =~ s/--\>/\>\#/g;
$chaine =~ s/\<R\>/R\#/g;
$chaine =~ s/\<F\>/F\#/g;
$chaine =~ s/_T/T\#/g;
$chaine =~ s/_H/H\#/g;
$chaine =~ s/_L/L\#/g;
$chaine =~ s/_M/M\#/g;
$chaine =~ s/_B/B\#/g;
}
@input = split(//, $chaine);
# Unicode conversion
for ($i=$#input ; $i>=0 ; $i--) {
$car = $input[$i];
# test for the \ modifier
if ($car eq "\\") {
$barre = 1;
next;
}
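# test for the ` modifier (retroflex); a ` already marked with # is a
# rhotic vowel and must fall through to the rhotic test further down
if (($car eq "\`") and ($diese == 0)) {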
$retro = 1;
next;
}
# test for the § modifier
if ($car =~ "§") {
$pseudo = 1;
next;
}
# test for the # modifier
if ($car =~ "\#") {
$diese = 1;
next;
}
# template: elsif ($car eq "") { $code = "\&\#;" }
# rhotic vowels
if (($car eq "\`") and ($diese == 1))
{ $code = "\&\#734;" }
# with the sharp sign (#)
elsif ($diese == 1) {
if ($car eq "b") { $code = "\&\#595;" }
elsif ($car eq "d") { $code = "\&\#598;" }
elsif ($car eq "S") { $code = "\&\#644;" }
elsif ($car eq "g") { $code = "\&\#608;" }
elsif ($car eq "G") { $code = "\&\#667;" }
elsif ($car eq "\>"){ $code = "\&\#8594;"}
elsif ($car eq "_") { $code = "\&\#865;" }
elsif ($car eq "?") { $code = "\&\#704;" }
elsif ($car eq "R") { $code = "\&\#8599;"}
elsif ($car eq "F") { $code = "\&\#8600;"}
elsif ($car eq "T") { $code = "\&\#779;" }
elsif ($car eq "H") { $code = "\&\#769;" }
elsif ($car eq "M") { $code = "\&\#772;" }
elsif ($car eq "L") { $code = "\&\#768;" }
elsif ($car eq "B") { $code = "\&\#783;" }
else { print "\n\nYour string has unknown or unimplemented "
."symbols near $car\n\n"; $code = $car }
}
# with §
elsif ($pseudo == 1) {
if ($car eq "|") { $code = "\&\#449;" }
elsif ($car eq "h") { $code = "\&\#688;" }
elsif ($car eq "v") { $code = "\&\#812;" }
elsif ($car eq "0") { $code = "\&\#805;" }
elsif ($car eq "O") { $code = "\&\#825;" }
elsif ($car eq "c") { $code = "\&\#796;" }
elsif ($car eq ">") { $code = "\&\#700;" }
elsif ($car eq "w") { $code = "\&\#695;" }
elsif ($car eq "j") { $code = "\&\#690;" }
elsif ($car eq "V") { $code = "\&\#736;" }
elsif ($car eq "\?"){ $code = "\&\#740;" }
elsif ($car eq "\^"){ $code = "\&\#815;" }
elsif ($car eq "+") { $code = "\&\#799;" }
elsif ($car eq "-") { $code = "\&\#800;" }
elsif ($car eq "\""){ $code = "\&\#776;" }
elsif ($car eq "t") { $code = "\&\#804;" }
elsif ($car eq "k") { $code = "\&\#816;" }
elsif ($car eq "N") { $code = "\&\#828;" }
elsif ($car eq "a") { $code = "\&\#826;" }
elsif ($car eq "m") { $code = "\&\#827;" }
elsif ($car eq "d") { $code = "\&\#810;" }
elsif ($car eq "}") { $code = "\&\#794;" }
elsif ($car eq "r") { $code = "\&\#797;" }
elsif ($car eq "o") { $code = "\&\#798;" }
elsif ($car eq "A") { $code = "\&\#792;" }
elsif ($car eq "q") { $code = "\&\#793;" }
elsif ($car eq "e") { $code = "\&\#820;" }
elsif ($car eq "X") { $code = "\&\#774;" }
elsif ($car eq "n") { $code = "\&\#8319;"}
elsif ($car eq "l") { $code = "\&\#737;" }
else { print "\n\nYour string has unknown or unimplemented "
."symbols near $car\n\n"; $code = $car }
}
# symbols with a backslash (r\` is left for the retroflex approximant test)
elsif (($barre == 1) and !(($retro == 1) and ($car eq 'r'))) {
if ($car eq "J") { $code = "\&\#607;" }
elsif ($car eq "G") { $code = "\&\#610;" }
elsif ($car eq "N") { $code = "\&\#628;" }
elsif ($car eq "B") { $code = "\&\#665;" }
elsif ($car eq "R") { $code = "\&\#640;" }
elsif ($car eq "p") { $code = "\&\#632;" }
elsif ($car eq "j") { $code = "\&\#669;" }
elsif ($car eq "X") { $code = "\&\#295;" }
elsif ($car eq ">") { $code = "\&\#673;" }
elsif ($car eq "<") { $code = "\&\#674;" }
elsif ($car eq ">") { $code = "\&\#673;" }
elsif ($car eq "?") { $code = "\&\#661;" }
elsif ($car eq "H") { $code = "\&\#668;" }
elsif ($car eq "h") { $code = "\&\#614;" }
elsif ($car eq "K") { $code = "\&\#622;" }
elsif ($car eq "v") { $code = "\&\#651;" }
elsif ($car eq "r") { $code = "\&\#633;" }
elsif ($car eq "M") { $code = "\&\#624;" }
elsif ($car eq "L") { $code = "\&\#671;" }
elsif ($car eq "\@"){ $code = "\&\#600;" }
elsif ($car eq "3") { $code = "\&\#606;" }
elsif ($car eq "s") { $code = "\&\#597;" }
elsif ($car eq "z") { $code = "\&\#657;" }
elsif ($car eq "l") { $code = "\&\#634;" }
elsif ($car eq "x") { $code = "\&\#615;" }
elsif ($car eq "O") { $code = "\&\#664;" }
elsif ($car eq "|") { $code = "\&\#448;" }
elsif ($car eq "!") { $code = "\&\#451;" }
elsif ($car eq "=") { $code = "\&\#450;" }
elsif ($car eq ":") { $code = "\&\#721;" }
else { print "\n\nYour string has unknown or unimplemented "
."symbols near $car\n\n"; $code = $car }
}
# retroflex approximant r (r\`)
elsif (($barre == 1) and ($retro == 1) and ($car eq 'r'))
{ $code = "\&\#635;" }
# retroflexes
elsif ($retro == 1) {
if ($car eq "t") { $code = "\&\#648;" }
elsif ($car eq "d") { $code = "\&\#598;" }
elsif ($car eq "n") { $code = "\&\#627;" }
elsif ($car eq "r") { $code = "\&\#637;" }
elsif ($car eq "s") { $code = "\&\#642;" }
elsif ($car eq "z") { $code = "\&\#656;" }
elsif ($car eq "l") { $code = "\&\#621;" }
else { print "\n\nYour string has unknown or unimplemented "
."symbols near $car\n\n"; $code = $car }
}
# consonants
elsif ($car eq "?") { $code = "\&\#660;" }
elsif ($car eq "F") { $code = "\&\#625;" }
elsif ($car eq "J") { $code = "\&\#626;" }
elsif ($car eq "N") { $code = "\&\#331;" }
elsif ($car eq "4") { $code = "\&\#638;" }
elsif ($car eq "B") { $code = "\&\#946;" }
elsif ($car eq "T") { $code = "\&\#952;" }
elsif ($car eq "D") { $code = "\&\#240;" }
elsif ($car eq "S") { $code = "\&\#643;" }
elsif ($car eq "Z") { $code = "\&\#658;" }
elsif ($car eq "C") { $code = "\&\#231;" }
elsif ($car eq "G") { $code = "\&\#611;" }
elsif ($car eq "X") { $code = "\&\#967;" }
elsif ($car eq "R") { $code = "\&\#641;" }
elsif ($car eq "K") { $code = "\&\#620;" }
elsif ($car eq "P") { $code = "\&\#651;" }
elsif ($car eq "L") { $code = "\&\#654;" }
elsif ($car eq "W") { $code = "\&\#653;" }
elsif ($car eq "H") { $code = "\&\#613;" }
elsif ($car eq "5") { $code = "\&\#619;" }
# vowels
elsif ($car eq "1") { $code = "\&\#616;" }
elsif ($car eq "}") { $code = "\&\#649;" }
elsif ($car eq "M") { $code = "\&\#623;" }
elsif ($car eq "I") { $code = "\&\#618;" }
elsif ($car eq "Y") { $code = "\&\#655;" }
elsif ($car eq "U") { $code = "\&\#650;" }
elsif ($car eq "2") { $code = "\&\#248;" }
elsif ($car eq "8") { $code = "\&\#629;" }
elsif ($car eq "7") { $code = "\&\#612;" }
elsif ($car eq "\@"){ $code = "\&\#601;" }
elsif ($car eq "E") { $code = "\&\#603;" }
elsif ($car eq "9") { $code = "\&\#339;" }
elsif ($car eq "3") { $code = "\&\#604;" }
elsif ($car eq "V") { $code = "\&\#652;" }
elsif ($car eq "O") { $code = "\&\#596;" }
elsif ($car eq "{") { $code = "\&\#230;" }
elsif ($car eq "6") { $code = "\&\#592;" }
elsif ($car eq "\&"){ $code = "\&\#630;" }
elsif ($car eq "A") { $code = "\&\#593;" }
elsif ($car eq "Q") { $code = "\&\#594;" }
# suprasegmental
elsif ($car eq "\""){ $code = "\&\#712;" }
elsif ($car eq "\%"){ $code = "\&\#716;" }
elsif ($car eq ":") { $code = "\&\#720;" }
elsif ($car eq "\^"){ $code = "\&\#8593;"}
elsif ($car eq "!") { $code = "\&\#8595;"}
# diacritics
elsif ($car eq "\'"){ $code = "\&\#690;" }
# If you want to use the single quote for ejectives rather than
# palatalized consonants, you must comment out the line above with a '#'
# and uncomment the line below.
# elsif ($car eq "\'"){ $code = "\&\#700;" }
elsif ($car eq "~") { $code = "\&\#771;" }
elsif ($car eq "=") { $code = "\&\#809;" }
else { $code = $car }
unshift(@output, $code);
# reset the modifier flags
$barre = 0;
$retro = 0;
$pseudo = 0;
$diese = 0;
}
$unicode = join("", @output);
}