
A conlang database program

From:Boudewijn Rempt <bsarempt@...>
Date:Thursday, July 8, 1999, 12:12
Dear people,

I've been working on and off on ideas for a bit of software to help in
analyzing and describing a language, along with my ideas for an ideal
grammar, and I've prepared a design draft (which is also intended for
consumption by the non-conlang linguistic community). I'd like you all to
comment on it:



       an open-source multi-language multi-user linguistics database

                              Boudewijn Rempt


   Kura is an application intended to facilitate descriptive and analytic
   linguistic work in the tradition described by Dixon as 'Basic Theory'.
   It will be a multi-user, multi-language application with facilities
   for linking data between languages. Since linguistic data can be
   available in the form of sound files, manuscript scans and textual
   data, Kura will be a true multi-media application. Kura will be an
   extensible application, based on open standards, like SQL, XML and

Other projects

   The Summer Institute of Linguistics has for years provided the gratis
   software package Shoebox, which is a capable single-user linguistics
   database and analysis tool. However, this is a closed-source
   application that only runs on non-standard and volatile environments
   like Microsoft Windows and Apple Macintosh. Also from SIL is CELLAR,
   the Computing Environment for Linguistic, Literary and
   Anthropological Research (Simons and Thomson 1998). This and related
   projects are described in Nerbonne (1998). Lawler and Dry (1998) give
   a general introduction to the subject. The Himalaya Languages Project
   in Leyden started a similar project at the instigation of the present
   author, but was unable to pursue it due to a lack of funding.

Technical background

   The language of development is Python, the back-end any SQL database.
   The current implementation of the back-end is on MySQL, but
   PostgreSQL, Oracle, Sybase, DB2, mSQL or Ingres should work too, as
   long as there is a standard Python DB interface available. Two separate
   interfaces are intended: a web-server based interface and a graphical
   interface for the Unix KDE desktop. A modular design will facilitate
   the development of other interface components.
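
   As a minimal sketch of what relying on the standard Python DB
   interface could look like, the example below keeps all database
   access behind a connection object and uses only cursor(), execute(),
   fetchall() and commit(). The sqlite3 module stands in for the MySQL
   driver purely so the example runs without a server; the `language`
   table and the function names are assumptions for illustration, not
   part of the Kura design. Note that the parameter placeholder style
   (`?` here) differs between drivers.

```python
import sqlite3  # stand-in driver (assumption); other DB-API modules also expose connect()


def create_schema(conn):
    """Create a hypothetical one-table schema for the example."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.commit()


def add_language(conn, name):
    """Insert one language row using a parameterised statement."""
    cur = conn.cursor()
    # sqlite3 uses '?' placeholders; other drivers may use '%s' instead.
    cur.execute("INSERT INTO language (name) VALUES (?)", (name,))
    conn.commit()


def list_languages(conn):
    """Return all language names in alphabetical order."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM language ORDER BY name")
    return [row[0] for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
create_schema(conn)
add_language(conn, "Denden")
add_language(conn, "Matraian")
print(list_languages(conn))
```

   Because the functions only receive a connection, swapping the
   back-end would mostly mean changing the connect() call and the
   placeholder style.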

   Other choices were considered. The data could have been stored in an
   XML format; this was not chosen because storing the parsed data
   instead of the compiled data improves concurrent access (perhaps
   there will be a module to export the data to an XML or HTML file).
   The GUI client could have used a multi-platform toolkit such as
   Tkinter or wxPython, but the first doesn't do tables and the second
   uses the far from stable GTK widget set on Unix. Also, with version
   2.0, Qt is well on the way to pervasive Unicode support. Due to
   constraints in terms of available development time and resources,
   writing the application in C or a C-derived language such as Java is
   out of the question.

   Python has the added advantage of being very much a language designed
   to be usable for the subject specialist who is not a software
   engineer. Linguists can use Python to add their own modules to Kura.

   At present, Unicode support is not yet feasible without investing in
   expensive, closed-source components. However, Qt 2.0 already supports
   Unicode, the WChar extension package provides Unicode support for
   Python, and the maintainer of the PyKDE/PyQt interface between Python
   and KDE/Qt is actively developing a version that will support Qt 2.0
   and Unicode. Of the back-end servers, only Oracle is, to my current
   knowledge, already capable of supporting Unicode text. Unicode
   support will be an essential feature of the finished application.

Design prerequisites

   At the core of the application is the raw linguistic data. True
   linguistic data comes in one of two forms: text or sound. Textual
   data can be in the form of either original manuscripts or other
   written materials. Sound data comes from fieldwork tapes. Together,
   these data constitute the corpus for a language. Attributes of this
   corpus include, for instance: place of origin, date of origin, author,
   recording technique, and recording or transcribing linguist. Other
   attributes are possible.

   Linguistic data can be analysed on two levels: the phonetic /
   phonological or graphical level, and the morphological / syntactic or
   semantic level. An analysis of linguistic data is often hierarchical
   in nature: texts divide into sentences, sentences into phrases, phrases
   into words, words into morphemes and morphemes into sounds. On the
   other hand, linguistic data is always linearly ordered: words and sounds
   follow each other. The relational model is singularly ill-equipped for
   storing linear data, but with some programming complexity this is
   solvable. However, it is clear that this will be the most complex and
   error-prone aspect of the Kura project.
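
   One common way to hold data that is both hierarchical and linearly
   ordered in a relational table is to give every row a parent
   reference plus an explicit sequence number among its siblings. The
   sketch below illustrates that idea only; the `element` table, its
   columns and the helper functions are assumptions, not the Kura data
   model, and sqlite3 is used just to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE element (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES element(id),  -- hierarchy
        seq       INTEGER NOT NULL,                -- linear position under the parent
        level     TEXT NOT NULL,                   -- 'sentence', 'word', 'morpheme', ...
        form      TEXT NOT NULL,
        UNIQUE (parent_id, seq)
    )
    """
)


def add_element(parent_id, seq, level, form):
    """Insert one element and return its key."""
    cur = conn.execute(
        "INSERT INTO element (parent_id, seq, level, form) VALUES (?, ?, ?, ?)",
        (parent_id, seq, level, form),
    )
    return cur.lastrowid


def children(parent_id):
    """Return (id, form) pairs of the direct children, in linear order."""
    cur = conn.execute(
        "SELECT id, form FROM element WHERE parent_id = ? ORDER BY seq",
        (parent_id,),
    )
    return cur.fetchall()


def leaves(element_id):
    """Return the forms of the lowest-level elements below element_id, in order."""
    kids = children(element_id)
    if not kids:
        row = conn.execute("SELECT form FROM element WHERE id = ?", (element_id,)).fetchone()
        return [row[0]]
    forms = []
    for kid_id, _form in kids:
        forms.extend(leaves(kid_id))
    return forms


# Rows are deliberately inserted out of order; `seq` restores the linear order.
sentence = add_element(None, 1, "sentence", "dogs bark")
bark = add_element(sentence, 2, "word", "bark")
dogs = add_element(sentence, 1, "word", "dogs")
add_element(dogs, 2, "morpheme", "-s")
add_element(dogs, 1, "morpheme", "dog")
add_element(bark, 1, "morpheme", "bark")
print(leaves(sentence))
```

   The cost of this layout is that inserting an element in the middle
   of a sequence means renumbering its later siblings, which is part of
   the programming complexity mentioned above.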

   Linguistic endeavours are divided into projects. Some are large, like
   'preparing a grammatical description of Denden', some are small, like
   'the phonological status of centro-palatal stops in Matraian', and
   others span more than one language: 'the development of the tense
   system in the Charyan languages'. Projects are carried out by one or
   more linguists. Proper attribution of analyses presented by the Kura
   system is important from the point of view of scholarly
   accountability. Since linguistics as a scholarly discipline is a
   process-oriented endeavour, it is important to preserve the history
   of analyses.
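
   A simple way to keep both attribution and history is to treat
   analyses as append-only: a revision is never updated or deleted, and
   the newest row for a subject counts as the current analysis. The
   following is a sketch under that assumption; the `analysis` table,
   the linguist names and the analysis texts are all hypothetical
   placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analysis (
        id       INTEGER PRIMARY KEY,  -- grows with each appended revision
        subject  TEXT NOT NULL,        -- what is being analysed
        linguist TEXT NOT NULL,        -- who is credited with this revision
        project  TEXT NOT NULL,
        body     TEXT NOT NULL
    )
    """
)


def record_analysis(subject, linguist, project, body):
    """Append a new revision; earlier revisions are left untouched."""
    cur = conn.execute(
        "INSERT INTO analysis (subject, linguist, project, body) VALUES (?, ?, ?, ?)",
        (subject, linguist, project, body),
    )
    return cur.lastrowid


def history(subject):
    """Return every (linguist, body) revision for a subject, oldest first."""
    cur = conn.execute(
        "SELECT linguist, body FROM analysis WHERE subject = ? ORDER BY id",
        (subject,),
    )
    return cur.fetchall()


def current_analysis(subject):
    """Return the most recent (linguist, body) revision, or None."""
    revisions = history(subject)
    return revisions[-1] if revisions else None


project = "the phonological status of centro-palatal stops in Matraian"
subject = "centro-palatal stops"
record_analysis(subject, "Linguist A", project, "Treated as separate phonemes.")
record_analysis(subject, "Linguist B", project, "Reanalysed as allophones.")
print(current_analysis(subject))
print(len(history(subject)))
```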


     * The administration of recordings
     * The administration of scanned manuscript data
     * The entering and administration of transcribed data, aided by the
       recorded and scanned data.
     * The semi-automated morphological and phonological analysis of
       transcribed data, with reference to the underlying recorded and
       scanned data.
     * The entering and administration of general linguistical notes,
       related to analysed data.
     * The production of interlinear texts in XML, HTML and plaintext
     * The production of bilingual lexicons, etymologies and comparative
     * The querying of analysed data for phonological, morphological,
       syntactical and lexical phenomena, within and across languages.
     * The administration of attributions and references to work within
       and outside the application.


     * Administrative Entities
          + Language
          + Linguist
          + Project
     * Language Data
          + Sound files
          + Graphics files
          + Textual (transcribed) data
     * Phonetic data
     * Lexical data
     * Structural (grammatical) data

   The fieldworker provides language data for the language in the form
   of manuscript data and recordings. The manuscript data and recordings
   are transcribed and form the basis for the analysed data. The
   transcribed data is analysed and produces linear phonetic data and
   linear and hierarchically ordered lexical and structural data.


References

   Lawler, John M. and Helen Aristar Dry. 1998. Using Computers in
   Linguistics. Routledge.

   Nerbonne, John (ed.). 1998. Linguistic Databases. Stanford, CSLI.

   Simons, Gary F. and John V. Thomson. 1998. 'Multilingual Data
   Processing in the CELLAR Environment', in Nerbonne (1998), 203-234.


I'm presently designing the actual data model, that is, the structures used
to store the linguistic data, and making a prototype interface. All
comments, additions and so on will be helpful. At the moment I'm not really
looking for contributions to the code, since there's little code yet.

Boudewijn Rempt