A conlang database program
From: Boudewijn Rempt <bsarempt@...>
Date: Thursday, July 8, 1999, 12:12
Dear people,
I've been working on and off on ideas for a bit of software to help in
analyzing and describing a language, along with my ideas for an ideal
grammar (http://www.xs4all.nl/~bsarempt/conlang/dream.html), and I've
prepared a design draft (which is also intended for consumption by the
non-conlang linguistic community). I'd like you all to comment upon it:
-----------------------------------------------------------------------
Kura
an open-source multi-language multi-user linguistics database
application
Boudewijn Rempt
Introduction
Kura is an application intended to facilitate descriptive and analytic
linguistic work in the tradition described by Dixon as 'Basic Theory'.
It will be a multi-user, multi-language application with facilities
for linking data between languages. Since linguistic data can be
available in the form of sound files, manuscript scans and textual
data, Kura will be a true multi-media application. Kura will be an
extensible application, based on open standards, like SQL, XML and
Unicode.
Other projects
The Summer Institute of Linguistics has for years provided the gratis
software package Shoebox, which is a capable single-user linguistics
database and analysis tool. However, it is a closed-source
application that runs only on non-standard and volatile environments
like Microsoft Windows and Apple Macintosh. Also from SIL is CELLAR,
the Computing Environment for Linguistic, Literary and
Anthropological Research (Simons and Thomson 1998). This and related
projects are described in Nerbonne (1998). Lawler and Dry (1998) give
a general introduction to the subject. The Himalaya Languages Project
in Leyden started up a similar project at the instigation of the
present author, but was unable to pursue it due to a lack of funding.
Technical background
The language of development is Python, the back-end any SQL database.
The current implementation of the back-end is on MySQL, but
PostgreSQL, Oracle, Sybase, DB2, mSQL or Ingres should work too, as
long as there is a standard Python DB interface available. Two separate
interfaces are intended: a web-server based interface and a graphical
interface for the Unix KDE desktop. A modular design will facilitate
the development of other interface components.
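
As an illustration of the portability argument above, here is a minimal
sketch of code that talks to the back-end only through the standard
Python DB interface. It is not Kura code: the table name, column names
and function names are invented for the example, and the standard
library's sqlite3 module is used purely as a stand-in for the MySQL
driver so that the sketch runs without a server.

```python
import sqlite3  # stand-in driver; the draft's actual back-end is MySQL


def create_language_table(conn):
    """Create a hypothetical 'language' table using only DB-API calls."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE language (id INTEGER PRIMARY KEY, name VARCHAR(100))"
    )
    conn.commit()


def add_language(conn, lang_id, name):
    """Insert one language row."""
    cur = conn.cursor()
    # sqlite3 uses the 'qmark' parameter style; other drivers may expect
    # a different placeholder, which is one of the few things that would
    # have to change when switching back-ends.
    cur.execute("INSERT INTO language (id, name) VALUES (?, ?)", (lang_id, name))
    conn.commit()


def list_languages(conn):
    """Return all languages as (id, name) tuples, ordered by id."""
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM language ORDER BY id")
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
create_language_table(conn)
add_language(conn, 1, "Denden")
add_language(conn, 2, "Matraian")
languages = list_languages(conn)
print(languages)
```

Only the connect() call and the placeholder style are specific to the
driver; the functions themselves take any connection object that follows
the DB interface.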
Other choices would have been possible. The data could have been
stored in an XML format; this was not chosen because storing the data
as separate parsed records rather than as one compiled document
improves concurrent access, though there may be a module to export the
data to an XML or HTML file. The GUI client could have used a
multi-platform toolkit such as Tkinter or wxPython, but the first
doesn't do tables and the second uses the far from stable GTK widget
set on Unix. Also, with version 2.0, Qt is well on the way to pervasive
Unicode support. Due to constraints in terms of available development
time and resources, writing the application in C or a C-derived
language such as Java is out of the question.
Python has the added advantage of being very much a language designed
to be usable for the subject specialist who is not a software
engineer. Linguists can use Python to add their own modules to Kura.
At present, Unicode support is not yet feasible without
investing in expensive, closed-source components. However, Qt 2.0
already supports Unicode, the WChar extension package for Python
provides Unicode support for Python, and the maintainer of the
PyKDE/PyQt interface between Python and KDE/Qt is actively developing
a version that will support Qt 2.0 and Unicode. Of the back-end
servers, only Oracle is already capable of supporting Unicode text, to
my current knowledge. Unicode support will be an essential feature of
the finished application.
Design prerequisites
At the core of the application is the raw linguistic data. True
linguistic data comes in one of two forms: text and sound. Textual
data can either be in the form of original manuscript data or other
written materials. Sound data results from field work tapes. Together,
these data constitute the corpus for a language. Attributes of this
corpus are for instance: place of origin, date of origin, author,
recording technique, recording or transcribing linguist. Other
attributes are possible.
Linguistic data can be analysed on two levels: a phonetic /
phonological or graphical level, and a morphological / syntactic or
semantic level. An analysis of linguistic data is often hierarchical
in nature: texts divide into sentences, sentences into phrases, phrases
into words, words into morphemes and morphemes into sounds. On the
other hand, linguistic data is always linearly ordered: words and sounds
follow each other. The relational model is singularly ill-equipped for
storing linear data, but with some programming complexity this is
solvable. However, it is clear that this will be the most complex and
error-prone aspect of the Kura project.
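
One common way to reconcile hierarchy and linear order in a relational
table is to give every unit a reference to its parent plus an explicit
position among its siblings. The sketch below shows that technique; it
is only an assumption about how the problem might be approached, not the
Kura data model, which is not yet designed. The table 'unit', its
columns and the helper functions are invented, the example sentence is
English for readability, and sqlite3 again stands in for the real
back-end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: each unit knows its parent (hierarchy) and its
# position among its siblings (linear order).
conn.execute(
    """CREATE TABLE unit (
           id INTEGER PRIMARY KEY,
           parent_id INTEGER REFERENCES unit(id),
           position INTEGER NOT NULL,
           level TEXT NOT NULL,
           form TEXT NOT NULL
       )"""
)


def add_unit(conn, parent_id, position, level, form):
    """Insert one unit and return its generated id."""
    cur = conn.execute(
        "INSERT INTO unit (parent_id, position, level, form) VALUES (?, ?, ?, ?)",
        (parent_id, position, level, form),
    )
    return cur.lastrowid


def children(conn, parent_id):
    """Return the direct children of a unit in linear order."""
    cur = conn.execute(
        "SELECT id, level, form FROM unit WHERE parent_id = ? ORDER BY position",
        (parent_id,),
    )
    return cur.fetchall()


def leaves_in_order(conn, unit_id):
    """Walk the hierarchy depth-first and return the lowest-level forms
    in the order in which they follow each other."""
    kids = children(conn, unit_id)
    if not kids:
        row = conn.execute("SELECT form FROM unit WHERE id = ?", (unit_id,)).fetchone()
        return [row[0]]
    result = []
    for kid_id, _level, _form in kids:
        result.extend(leaves_in_order(conn, kid_id))
    return result


# Rows are inserted out of order on purpose: the 'position' column, not
# the insertion order, determines the linear sequence.
sentence = add_unit(conn, None, 1, "sentence", "the dogs bark")
add_unit(conn, sentence, 3, "word", "bark")
add_unit(conn, sentence, 1, "word", "the")
dogs = add_unit(conn, sentence, 2, "word", "dogs")
add_unit(conn, dogs, 2, "morpheme", "-s")
add_unit(conn, dogs, 1, "morpheme", "dog")

print(leaves_in_order(conn, sentence))
```

The cost of this design is visible in leaves_in_order: recovering the
linear sequence takes one query per node, and inserting a unit in the
middle of a sequence means renumbering its later siblings. That is the
kind of programming complexity referred to above.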
Linguistic endeavours are divided into projects, some large, like
'preparing a grammatical description of Denden', some small, like 'the
phonological status of centro-palatal stops in Matraian', and others
spanning more than one language: 'the development of the tense system
in the Charyan languages'. Projects are carried out by one or more
linguists. Proper attribution of the analyses presented by the Kura
system is important from the point of view of scholarly accountability.
Since linguistics as a scholarly discipline is a process-oriented
endeavour, it is important to preserve the history of analyses.
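
The two requirements in this paragraph, attribution and preserved
history, could for instance be met by never overwriting an analysis and
appending a new, signed revision instead. The following is a minimal
sketch under that assumption. The class names, field names, the form
'tan' and its glosses are all invented for the example and say nothing
about how Kura will actually store revisions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Analysis:
    """One immutable revision of an analysis, with its attribution."""
    form: str
    gloss: str
    linguist: str
    project: str
    recorded_at: datetime


@dataclass
class AnalysisHistory:
    """Append-only list of revisions; nothing is ever overwritten."""
    revisions: list = field(default_factory=list)

    def revise(self, form, gloss, linguist, project, when=None):
        """Record a new revision instead of changing an old one."""
        when = when or datetime.now(timezone.utc)
        self.revisions.append(Analysis(form, gloss, linguist, project, when))

    def current(self):
        """The most recently recorded revision."""
        return self.revisions[-1]

    def contributors(self):
        """Linguists who contributed, in order of first contribution."""
        seen = []
        for rev in self.revisions:
            if rev.linguist not in seen:
                seen.append(rev.linguist)
        return seen


history = AnalysisHistory()
history.revise("tan", "water", "Linguist A", "Denden grammar")
history.revise("tan", "river", "Linguist B", "Denden grammar")
history.revise("tan", "river, stream", "Linguist A", "Denden grammar")

print(history.current().gloss)
print(history.contributors())
```

Because earlier revisions stay in the list, it remains possible to show
who proposed which analysis and when it was superseded.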
Functions
* The administration of recordings
* The administration of scanned manuscript data
* The entering and administration of transcribed data, aided by the
recorded and scanned data.
* The semi-automated morphological and phonological analysis of
transcribed data, with reference to the underlying recorded and
scanned data.
* The entering and administration of general linguistic notes,
related to analysed data.
* The production of interlinear texts in XML, HTML and plaintext
formats.
* The production of bilingual lexicons, etymologies and comparative
lexicons.
* The querying of analysed data for phonological, morphological,
syntactical and lexical phenomena, within and across languages.
* The administration of attributions and references to work within
and outside the application.
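
To make one of the functions above concrete, here is a minimal sketch
of producing an interlinear text in plaintext format: each word is
padded so that it lines up with its gloss, and a free translation
follows. The function name, its parameters and the example glosses are
invented for illustration; the XML and HTML outputs are not covered.

```python
def interlinear(words, glosses, translation):
    """Return a three-line plaintext interlinear: words, aligned
    glosses, and a free translation in single quotes."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    word_line = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return "\n".join([word_line, gloss_line, "'" + translation + "'"])


text = interlinear(
    ["the", "dog-s", "bark"],
    ["DEF", "dog-PL", "bark"],
    "the dogs bark",
)
print(text)
```

This simple padding assumes a fixed-width font and one character per
column, so it would need rethinking for scripts with combining or
double-width characters.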
Entities
* Administrative Entities
+ Language
+ Linguist
+ Project
* Language Data
+ Sound files
+ Graphics files
+ Textual (transcribed) data
* Phonetic data
* Lexical data
* Structural (grammatical) data
The fieldworker provides language data for a language in the form
of manuscript data and recordings. The manuscript data and recordings
are transcribed and form the basis for the analysed data. The
transcribed data is analysed to produce linear phonetic data and
linearly and hierarchically ordered lexical and structural data.
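
The flow from fieldworker to transcription described above can be
sketched as a handful of plain Python classes. This is only a reading
aid, not the data model, which is still being designed: every class
name, field name and the file path are invented, and the phonetic,
lexical and structural entities are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Language:
    name: str


@dataclass
class Linguist:
    name: str


@dataclass
class Project:
    title: str
    languages: List[Language]
    linguists: List[Linguist]


@dataclass
class SourceFile:
    """A recording or a manuscript scan supplied by a fieldworker."""
    path: str
    kind: str  # "sound" or "graphics"
    language: Language
    provided_by: Linguist


@dataclass
class Transcription:
    """Textual data derived from one source file."""
    text: str
    source: SourceFile
    transcribed_by: Linguist


def provenance(transcription):
    """Describe where a transcription comes from and who produced it."""
    src = transcription.source
    return "%s (%s, %s), provided by %s, transcribed by %s" % (
        src.path,
        src.kind,
        src.language.name,
        src.provided_by.name,
        transcription.transcribed_by.name,
    )


denden = Language("Denden")
fieldworker = Linguist("Linguist A")
transcriber = Linguist("Linguist B")
project = Project("preparing a grammatical description of Denden",
                  [denden], [fieldworker, transcriber])
tape = SourceFile("tapes/denden-01.wav", "sound", denden, fieldworker)
transcript = Transcription("(transcribed text)", tape, transcriber)

print(provenance(transcript))
```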
References
Lawler, John M. and Helen Aristar Dry. 1998. Using Computers in
Linguistics. Routledge.
Nerbonne, John (ed.). 1998. Linguistic Databases. Stanford: CSLI
Publications.
Simons, Gary F. and John V. Thomson. 1998. 'Multilingual Data
Processing in the CELLAR Environment', in Nerbonne (1998), 203-234.
-----------------------------------------------------------------------
I'm presently designing the actual data model, that is, the structures used
to store the linguistic data, and making a prototype interface. All
comments, additions and so on will be helpful. At the moment I'm not really
looking for contributions to the code, since there's little code yet.
Boudewijn Rempt | http://www.xs4all.nl/~bsarempt