Transforming the representation of lexical knowledge
Under this banner, SULTRY has embarked on a number of projects
relating to dictionaries, particularly dictionaries for Australian
Aboriginal languages. Some of this research relates to the traditional
concerns of lexical semantics and lexicography. For example, most
current dictionaries of Aboriginal languages are strong on words for
plants and animals but very weak on words and metaphorical usages of
words that describe thoughts, feelings, and emotions. This is not
because these words and usages do not exist, but just because it is
harder for a linguist to gather and understand them, and so Jane Simpson
and Christopher Manning (linguistics department staff) are working on a
project to improve dictionary coverage in such areas. Other work
relates to the use and usability of dictionaries of endangered
languages. Much of the work relates to the manipulation of dictionary
information on and by computers, and it is mainly this work that is
described below.
While there has been a reasonable amount of work in the lexicography and
second language acquisition communities on the use and usability of
dictionaries for the handful of most spoken and most taught languages
(English, French, Japanese, ...), there is almost no work on the use and
usability of dictionaries for languages with few speakers ("endangered
languages"). We saw collecting some data on this issue as important
background for our research into electronic dictionaries for Aboriginal
languages. Our first results are available here:
Miriam Corris, Christopher Manning, Susan Poetsch, and Jane Simpson. 1999.
Dictionaries and endangered languages. Available in Postscript or RTF (Word).
Paper presented at the Endangered Languages Workshop, La Trobe
University, 30 November 1999. (Earlier version presented at
1999 Perth Congress of the Applied Linguistics Association of Australia.)
To appear in David Bradley and Maya Bradley (eds), Language
Endangerment and Language Maintenance: an active approach. London:
Curzon Press.
Miriam Corris, Christopher Manning, Susan Poetsch, and Jane Simpson. 2000.
Bilingual Dictionaries for Australian Languages: User studies on the
place of paper and electronic dictionaries. Available in Postscript or
RTF (Word).
Proceedings of the Ninth Euralex International Congress (Euralex 2000),
Stuttgart, pp. 169-181.
There is a range of existing tools for gathering dictionary
information, most notably those developed by SIL (MacLex, Shoebox), many
of which are discussed in the Australian context in Nick Thieberger's
Notes
for the computer assisted language worker (available from
AIATSIS). However, it is notable that most of the people on the ground
assembling dictionaries do not actually use these tools, but rather keep
their dictionary entries as text files, and edit them with a word
processor or text editor. This has many disadvantages:
there is no control on the uniformity or layout of entries; the
interpretation of information in entries is often vague, even if an
attempt is made to use markup tags (see further below); there is no way
to put speech and pictures or drawings into the dictionary; there is no
support for linking between related words; and the purely textual
representation of the dictionary means that any changes in transcription
practices must be applied by hand to the entire dictionary. While the
text-file dictionary is a useful description of the lexicon, it cannot
be easily manipulated by computer, and cannot be easily used by
potential dictionary customers. (Some of these disadvantages, and the
corresponding advantages of keeping dictionary data in a relational
database, have been explored by Peter Austin and David Nathan, for
instance in their International Journal of Lexicography article
from 1992.)
Nevertheless, there are clearly reasons
why people use a word processor as a dictionary maker. The main ones
seem to be: a word processor is a very flexible, unconstrained
environment for a dictionary author; current dictionary-making programs
are too hard to learn and too hard to use; and current programs do
not provide adequate facilities for importing and exporting data. It is
hard to get your existing dictionary data into current dictionary-making
programs, and it is hard to get a print-ready dictionary out of the
material you have assembled. One of our research goals is to gain the
benefits of a more structured dictionary representation while making it
easy enough to use, and to import into and export from, that people will
actually choose to use it.
As a first step towards this goal, we're building a FileMakerPro
template for building and maintaining dictionaries. FileMakerPro is a
very easy to use database program, which runs on Macs and PCs, and which
provides the ability to store speech, pictures, and video, as well as
text. The database is simple to update and includes help files,
and we are building facilities for importing and exporting data. Our
aim is to be able to take a dictionary built within the
database and produce a typeset-quality printed dictionary
without any human intervention.
Work on the FileMakerPro dictionary templates has been done by
Brett Baker, who brings to the project
much experience in fieldwork and Aboriginal languages. We presented
a paper at Australex '98
describing a preliminary version. Work is continuing, with
the help of Tony Williams, on producing a version fit for public
distribution.
Intertranslatability of electronic dictionary formats
Most current dictionaries of Australian languages are
marked up in some version of Field Ordered Standard Format or FOSF.
This general system was proposed by SIL, and a version of it for
Australian languages was developed by Nash and Simpson (1989) in the
context of the AIAS Lexicography Project. In it, dictionary entries are
encoded into fields, each on a new line and begun with a backslash
code. Here's an example (from Kaurna):
\w kuiya
\x D.8
\d fish;
\i kuiya wika
\t fish net;
\i kuiyarnappendi
\t to fish
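The field-per-line layout shown above is simple to read by program. The following is a minimal Python sketch, not any of the tools mentioned here. It assumes that \w marks the start of a new entry (as it appears to in the Kaurna example) and that every field fits on one line; the function name parse_fosf is our own.

```python
from typing import List, Tuple

# An entry is kept as an ordered list of (code, value) pairs, because
# codes such as \i and \t can repeat and their order carries meaning.
Entry = List[Tuple[str, str]]

SAMPLE = r"""
\w kuiya
\x D.8
\d fish;
\i kuiya wika
\t fish net;
\i kuiyarnappendi
\t to fish
"""

def parse_fosf(text: str, headword_code: str = "w") -> List[Entry]:
    """Split backslash-coded text into entries, one per headword field."""
    entries: List[Entry] = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        code, _, value = line[1:].partition(" ")
        # Assumption: a headword field always opens a new entry.
        if code == headword_code or not entries:
            entries.append([])
        entries[-1].append((code, value.strip()))
    return entries

entries = parse_fosf(SAMPLE)
```

Keeping the fields as an ordered list rather than a dictionary preserves the pairing of each \i line with the \t line that follows it.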
The structured data of an FOSF file makes it far superior to completely
unstructured textual data, and it has been used very successfully, for
example, in the production of the Eastern and Central
Arrernte to English dictionary by John Henderson and Veronica Dobson
(IAD, 1994). With a well-structured dictionary, final typeset copy can
be produced from it automatically without any hand formatting. For
example, we've used such methods on Kaurna materials to produce the
sample formatted dictionary page shown on the right. (This pocket-sized
book version was produced specially for this picture simply by varying
the page size parameters.)
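To illustrate the idea of producing formatted output automatically from structured fields, here is a minimal Python sketch. It is not the typesetting process used for the Kaurna page; it emits simple HTML, and both the reading of the field codes and the styling chosen for each one are assumptions made for the example.

```python
from html import escape

# The Kaurna example entry as (code, value) pairs.
entry = [("w", "kuiya"), ("x", "D.8"), ("d", "fish;"),
         ("i", "kuiya wika"), ("t", "fish net;"),
         ("i", "kuiyarnappendi"), ("t", "to fish")]

# Hypothetical style sheet: one output template per field code.
TEMPLATES = {
    "w": "<b>{}</b>",   # headword in bold
    "d": "{}",          # definition in plain type
    "i": "<i>{}</i>",   # sub-entry form in italics
    "t": "{}",          # translation of the sub-entry
}

def render_entry(fields):
    """Format one entry as an HTML paragraph, field by field."""
    parts = []
    for code, value in fields:
        template = TEMPLATES.get(code)
        if template is None:
            continue  # fields without a template (here \x) are not printed
        parts.append(template.format(escape(value)))
    return "<p>" + " ".join(parts) + "</p>"

html_line = render_entry(entry)
```

Because the layout lives entirely in the TEMPLATES table, changing the look of every entry means editing one table rather than the dictionary itself.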
However, for our own projects, we are moving
to the use of SGML, and in particular XML, a simplified dialect of SGML
which has been developed for web applications. There are two main
reasons for this (note that these reasons are independent of the web --
the fact that XML makes web versions easier is just gravy):
Wide support: There is a great deal of free and commercial software
available that works with SGML and XML. SGML is widely used in
publishing and large companies, and XML is destined to be widely used
on the web. It's better to rely on a widely supported
format than to have to build all our own tools for a format
restricted to certain sections of the linguistics community.
Greater power: A problem with FOSF is that it provides only a single
level of structuring, given by the backslash codes at the beginning of a
line. But dictionaries often need more levels of structuring than this,
whether it is to say that certain words in the definition or gloss
should be in bold or italics or should appear in the finder list, or
to group the examples according to sense or dialect. Users of FOSF have
responded to these needs in various ad hoc ways (putting asterisks,
carets or other
special marks into the source file, trying to show grouping with
additional tags). But the non-standard and often insufficiently precise
nature of these annotations makes their interpretation and use by
computer programs difficult or impossible. In contrast, SGML/XML provides a
powerful and general way to mark up all forms of structure, including
cross-referencing and multiple levels of nested structure.
The same entry above in XML might look like this:
kuiyaD.8fishkuiya wikafish net;kuiyarnappendito fish
It looks more complicated admittedly (!), but the clearer and richer
structure pays off in the longer term.
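As a rough illustration of the nesting that XML allows, the following Python sketch builds an XML element from the fields of the Kaurna entry. The element names (entry, headword, ref, definition, subentry, form, gloss) are hypothetical and are not the markup scheme used in our dictionaries; the point is only that each sub-entry form and its translation end up grouped inside one element.

```python
import xml.etree.ElementTree as ET

# The Kaurna example entry as (code, value) pairs.
fields = [("w", "kuiya"), ("x", "D.8"), ("d", "fish;"),
          ("i", "kuiya wika"), ("t", "fish net;"),
          ("i", "kuiyarnappendi"), ("t", "to fish")]

def entry_to_xml(fields):
    """Build a nested <entry> element; all element names are illustrative."""
    entry = ET.Element("entry")
    sub = None  # the <subentry> currently being filled, if any
    for code, value in fields:
        if code == "w":
            ET.SubElement(entry, "headword").text = value
        elif code == "x":
            ET.SubElement(entry, "ref").text = value
        elif code == "d":
            ET.SubElement(entry, "definition").text = value
        elif code == "i":
            # Each \i field opens a new group for its own translation.
            sub = ET.SubElement(entry, "subentry")
            ET.SubElement(sub, "form").text = value
        elif code == "t" and sub is not None:
            ET.SubElement(sub, "gloss").text = value
    return entry

xml_text = ET.tostring(entry_to_xml(fields), encoding="unicode")
```

In the flat FOSF file the link between an \i line and the following \t line is only implied by their order; in the nested form it is explicit.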
On the other hand, interoperability is key. Often the inability to
translate between formats is what keeps people using one piece of
software. Thus we are also very interested in the translatability of
data between different representations.
Work has already begun on programs for translating existing dictionaries
between different formats. Manning has worked on a program to convert FOSF
dictionaries so that they can be loaded into the FileMakerPro
application discussed above, and has written another script that
transforms the Warlpiri dictionary (which is in its own format, which
looks a little like FOSF because of the use of backslashes,
but is actually structurally more similar to SGML)
into conformant XML. Ben Hutchinson (a linguistics honours student) has
worked on a general purpose Perl script for converting FOSF dictionaries
into XML.
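A conversion in the other direction, from structured entries to something a database program can import, can be sketched in the same way. The Python example below writes tab-separated text with one row per entry. It is not the conversion program mentioned above; the column list, the " | " joiner for repeated fields, and the choice of tab-separated text as the import format are all assumptions made for illustration.

```python
import csv
import io

# One entry (the Kaurna example) as (code, value) pairs.
entries = [
    [("w", "kuiya"), ("x", "D.8"), ("d", "fish;"),
     ("i", "kuiya wika"), ("t", "fish net;"),
     ("i", "kuiyarnappendi"), ("t", "to fish")],
]

COLUMNS = ["w", "x", "d", "i", "t"]  # hypothetical column order

def entries_to_tsv(entries, columns=COLUMNS, joiner=" | "):
    """Flatten each entry to one tab-separated row, joining repeated fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for fields in entries:
        row = []
        for code in columns:
            values = [value for c, value in fields if c == code]
            row.append(joiner.join(values))
        writer.writerow(row)
    return buffer.getvalue()

tsv_text = entries_to_tsv(entries)
```

Note that flattening to one row per entry loses the explicit pairing of each sub-entry with its translation, which is one reason to treat a nested format as the master copy and a flat table only as an export.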
Tools for using and navigating computer dictionaries: Kirrkirr
Dictionaries on computers and the web are now commonplace. There
are hundreds of them available. However, almost all of them are like
this: You go to a page and enter a search word in a form, and then it
shows you the entry for that word in a format that attempts to duplicate
what the printed dictionary looks like (but is normally far worse in
terms of layout, font quality and readability). Or you can often search
for entries anywhere in the dictionary where a word appears. Now, being
able to do text search over dictionaries is very useful for some purposes.
But having just this facility hardly pushes the envelope of possibilities
for dictionary user interfaces. If one thinks about intelligent and
flexible access to dictionary information, there are many possibilities
that are not being exploited. For instance, in a paper dictionary, you can
browse. When people look at a paper dictionary, they normally look
not only at the word they started off looking up, but also at other nearby
entries. Very few electronic dictionaries allow you to do that (but the
OED interface described below does). Similarly, paper dictionaries and
these electronic dictionaries only allow you to access a word by looking
it up alphabetically -- you have to be able to spell it. But one should
be able to search for words in other ways, such as according to their
semantic domain, or according to other words they're used with, or by
how they are pronounced.
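Two of these access methods, browsing nearby entries and looking words up by semantic domain, need very little machinery once the dictionary is structured. The Python sketch below is only an illustration: the word list and domain labels are placeholders, not entries from a real dictionary, and the function name neighbours is our own.

```python
import bisect
from collections import defaultdict

# Hypothetical mini-lexicon of (headword, semantic domain) pairs.
LEXICON = [("apple", "plant"), ("bream", "animal"), ("cod", "animal"),
           ("fern", "plant"), ("gull", "animal")]

HEADWORDS = sorted(word for word, _ in LEXICON)

def neighbours(query, before=1, after=1):
    """Return the headwords around the position where `query` would sort.

    The query does not have to be in the dictionary, so a misspelt word
    still lands the user in the right part of the alphabet.
    """
    pos = bisect.bisect_left(HEADWORDS, query)
    return HEADWORDS[max(0, pos - before): pos + after + 1]

# Index from semantic domain to the headwords filed under it.
BY_DOMAIN = defaultdict(list)
for word, domain in LEXICON:
    BY_DOMAIN[domain].append(word)
```

For example, neighbours("cat") returns the entries on either side of where "cat" would appear, and BY_DOMAIN["animal"] lists every headword tagged with that domain.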
In late 1997, Casey Whitelaw (another
linguistics/computer science undergraduate) did some preliminary work on
this topic, working on an interface to
Kaurna dictionary materials. This was built using Tcl/Tk, which
makes it portable to Macs, PCs, and Unix.
In 1998, Kevin Jansz,
a computer science honours student, together with Nitin Indurkhya and
Christopher Manning, worked intensively on
Kirrkirr, a
browsing interface for the Warlpiri dictionary. Using the XML version
of the Warlpiri dictionary, he wrote a Java browser. As well as
providing nice HTML-formatted output from the dictionary, and flexible
searching, which includes "fuzzy" spelling and regular expression
searches, the key feature of this software is a graphical network
interface for visualizing word relationships. This software has
continued to be developed by Jansz and Manning, and we have started
testing this interface in schools. The software also allows the
incorporation of audio files, pictures, and video into the dictionary.
A more detailed description of the software appears at the Kirrkirr
site. Some other material related to the Warlpiri dictionary appears on
the Warlpiri dictionary work page.
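The two kinds of search mentioned above can be sketched briefly in Python. This is not how Kirrkirr (a Java program) implements them: the fuzzy lookup here uses the generic string similarity measure in Python's difflib module as a stand-in for whatever spelling rules a real dictionary would use, and the headword list is just the forms from the Kaurna example.

```python
import difflib
import re

HEADWORDS = ["kuiya", "kuiya wika", "kuiyarnappendi"]

def fuzzy_lookup(query, n=3, cutoff=0.6):
    """Return up to n headwords that resemble the query, best match first."""
    return difflib.get_close_matches(query, HEADWORDS, n=n, cutoff=cutoff)

def regex_lookup(pattern):
    """Return every headword in which the regular expression matches."""
    compiled = re.compile(pattern)
    return [word for word in HEADWORDS if compiled.search(word)]
```

A call such as fuzzy_lookup("kuya") tolerates the missing letter, while regex_lookup(r"endi$") finds headwords with a given ending. A language-specific version would more likely compare words after normalising known spelling variants than rely on raw character similarity.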
Stephen Wilson has produced a beautiful web-based
Wagiman dictionary. This
was produced from a dictionary stored in a Microsoft Access database,
which in turn was produced from original materials stored in an FOSF
file. The material in the database was automatically transformed into
both the web edition and a paper edition through the use of Visual Basic
programs.
(English) Dictionaries on the web
We haven't been totally neglecting English. In some circumstances, the
web is an effective medium for delivering dictionary information. But
providing a flexible interface for browsing and finding information is
crucial, as was discussed above. In cooperation with
SETIS, the
University of Sydney Scholarly Electronic Text and Image Service, we've
built a web interface to the Oxford English Dictionary that
offers more convenient access, including the ability to browse forwards
and backwards in the dictionary, to find words in citations and
definitions and to link to related words. This system was mainly
developed by Michael Roper, an Arts-Science student.
A prototype of the new interface was launched in October 1998, and
has become the standard interface to the OED at the University of
Sydney. (The interface is only usable by people at USyd because of
licensing restrictions on the dictionary content, but we can make the
interface available to others.)