Transforming the representation of lexical knowledge

Under this banner, SULTRY has embarked on a number of projects relating to dictionaries, particularly dictionaries for Australian Aboriginal languages. Some of this research relates to the traditional concerns of lexical semantics and lexicography. For example, most current dictionaries of Aboriginal languages are strong on words for plants and animals but very weak on words and metaphorical usages of words that describe thoughts, feelings, and emotions. This is not because these words and usages do not exist, but just because it is harder for a linguist to gather and understand them, and so Jane Simpson and Christopher Manning (linguistics department staff) are working on a project to improve dictionary coverage in such areas. Other work relates to the use and usability of dictionaries of endangered languages. Much of the work relates to the manipulation of dictionary information on and by computers and it is mainly this work that is described below.

Dictionary use and usability

While there has been a reasonable amount of work in the lexicography and second language acquisition communities on the use and usability of dictionaries for the handful of most spoken and most taught languages (English, French, Japanese, ...), there is almost no work on the use and usability of dictionaries for languages with few speakers ("endangered languages"). We saw collecting some data on this issue as important background for our research into electronic dictionaries for Aboriginal languages. Our first results are available here:

Tools for making dictionaries

There is a range of existing tools for gathering dictionary information, most notably those developed by SIL (MacLex, Shoebox), many of which are discussed in the Australian context in Nick Thieberger's Notes for the computer assisted language worker (available from AIATSIS). However, it is notable that the majority of people on the ground assembling dictionaries do not actually use them, but rather keep their dictionary entries as text files, and edit them with a word processor or text editor. This has many disadvantages: there is no control over the uniformity or layout of entries; the interpretation of information in entries is often vague, even if an attempt is made to use markup tags (see further below); there is no way to put speech and pictures or drawings into the dictionary; there is no support for linking between related words; and the purely textual representation of the dictionary means that any changes in transcription practices must be applied by hand to the entire dictionary. While the text-file dictionary is a useful description of the lexicon, it cannot be easily manipulated by computer, and cannot be easily used by potential dictionary customers. (Some of these disadvantages, and the corresponding advantages of keeping dictionary data in a relational database, have been explored by Peter Austin and David Nathan, for instance in their International Journal of Lexicography article from 1992.)

Nevertheless, there are clearly reasons why people use a word processor as a dictionary maker. The main ones seem to be: a word processor is a very flexible, unconstrained environment for a dictionary author; current dictionary-making programs are too hard to learn and too hard to use; and current programs do not provide adequate facilities for importing and exporting data. It is hard to get your existing dictionary data into current dictionary-making programs, and it is hard to get a print-ready dictionary out of the material you have assembled. One of our research goals is to gain the benefits of a more structured dictionary representation while making the tools easy enough to use, and import and export simple enough, that people will actually choose to use them.

As a first step towards this goal, we're building a FileMakerPro template for building and maintaining dictionaries. FileMakerPro is a very easy-to-use database program, which runs on Macs and PCs, and which provides the ability to store speech, pictures, and video, as well as text. The database is very easy to use and update and includes help files, and we are building facilities for importing and exporting data. Our aim is to be able to take a dictionary built within the database and produce a typeset-quality printed dictionary without any human intervention. Work on the FileMakerPro dictionary templates has been carried out by Brett Baker, who brings to the project much experience in fieldwork and Aboriginal languages. We presented a paper at Australex '98 describing a preliminary version. Work is currently continuing, with the help of Tony Williams, on producing a version fit for public distribution.

Intertranslatability of electronic dictionary formats

Most current dictionaries of Australian languages are marked up in some version of Field Ordered Standard Format or FOSF. This general system was proposed by SIL, and a version of it for Australian languages was developed by Nash and Simpson (1989) in the context of the AIAS Lexicography Project. In it, dictionary entries are encoded into fields, each on a new line and begun with a backslash code. Here's an example (from Kaurna):

\w kuiya
\x D.8
\d fish;
\i kuiya wika
\t fish net;
\i kuiyarnappendi
\t to fish

The structured data of an FOSF file makes it far superior to completely unstructured textual data, and it has been used very successfully, for example, in the production of the Eastern and Central Arrernte to English dictionary by John Henderson and Veronica Dobson (IAD, 1994). With a well-structured dictionary, final typeset copy can be produced from it automatically without any hand formatting. For example, we've used such methods on Kaurna materials to produce the sample formatted dictionary page shown on the right. (This pocket-sized book version was produced specially for this picture by simply varying the page size parameters.)
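
To make the point about structure concrete, here is a minimal Python sketch of how a program might read the backslash-coded fields of the Kaurna entry above. It assumes one field per line and ignores anything that does not start with a backslash; the function name parse_fosf is hypothetical and is not part of any SIL or AIAS tool.

```python
def parse_fosf(text):
    """Return a list of (code, value) pairs, one per backslash-coded line.

    Assumption: every field sits on a single line of the form '\\code value'.
    Lines that do not start with a backslash are skipped in this sketch.
    """
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue
        # Split the code (e.g. 'w') from the rest of the line.
        code, _, value = line[1:].partition(" ")
        fields.append((code, value.strip()))
    return fields


# The Kaurna entry shown above, copied verbatim.
KAURNA = r"""\w kuiya
\x D.8
\d fish;
\i kuiya wika
\t fish net;
\i kuiyarnappendi
\t to fish"""

fields = parse_fosf(KAURNA)
for code, value in fields:
    print(code, "=", value)
```

Because every field is labelled, a program can pick out headwords, glosses, or examples without guessing from layout, which is exactly what a plain word-processor file does not allow.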

However, for our own projects, we are moving to the use of SGML, and in particular XML, a simplified dialect of SGML which has been developed for web applications. There are two main reasons for this (note that these reasons are independent of the web -- the fact that XML makes web versions easier is just gravy):

  1. Wide support: There is a great deal of free and commercial software available that works with SGML and XML. SGML is widely used in publishing and large companies, and XML is destined to be widely used on the web. It's better if we can leverage off a widely supported format rather than having to build all our own tools for a format restricted to certain sections of the linguistics community.
  2. Greater power: A problem with FOSF is that it provides only a single level of structuring, given by the backslash codes at the beginning of a line. But dictionaries often need more levels of structuring than this, whether to say that certain words in the definition or gloss should be in bold or italics, or should appear in the finder list, or to group the examples according to sense or dialect. Users of FOSF have responded to these needs in various ad hoc ways (putting asterisks, carets or other special marks into the source file, trying to show grouping with additional tags). But the non-standard and often insufficiently precise nature of these annotations makes their interpretation and use by computer programs difficult or impossible. In contrast, SGML/XML provides a powerful and general way to mark up all forms of structure, including cross-referencing and multiple levels of nested structure.

The Kaurna entry above might look like this in XML:

<entry>
<headword>kuiya</headword>
<semcode>D.8</semcode>
<sense>
<definition>fish</definition>
<example><source>kuiya wika</source>
<trans>fish net</trans></example>
<example><source>kuiyarnappendi</source>
<trans>to fish</trans></example>
</sense>
</entry>

It looks more complicated admittedly (!), but the clearer and richer structure pays off in the longer term. On the other hand, interoperability is key. Often the inability to translate between formats is what keeps people using one piece of software. Thus we are also very interested in the translatability of data between different representations.

Work has already begun on programs for translating existing dictionaries between different formats. Manning has worked on a program to convert FOSF dictionaries so that they can be loaded into the FileMakerPro application discussed above, and has written another script that transforms the Warlpiri dictionary (which is in its own format, which looks a little like FOSF because of the use of backslashes, but is actually structurally more similar to SGML) into conformant XML. Ben Hutchinson (a linguistics honours student) has worked on a general purpose Perl script for converting FOSF dictionaries into XML.
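
As an illustration of what such a conversion involves, here is a minimal Python sketch that turns the Kaurna FOSF entry into nested XML. It is not the Perl script mentioned above. The element names come from the XML sample earlier on this page; the mapping from backslash codes to elements (w, x, d, i, t) is inferred by lining up the two samples and should be read as an assumption, as should the choice to drop trailing semicolons. The function name fosf_to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Kaurna entry in FOSF, as shown earlier on this page.
KAURNA = r"""\w kuiya
\x D.8
\d fish;
\i kuiya wika
\t fish net;
\i kuiyarnappendi
\t to fish"""


def fosf_to_xml(text):
    """Convert one FOSF entry into an <entry> element (assumed code mapping)."""
    entry = ET.Element("entry")
    sense = None    # current <sense>, opened by a \d field
    example = None  # current <example>, opened by an \i field
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue
        code, _, value = line[1:].partition(" ")
        # Assumption: a trailing ';' is a separator, not part of the data.
        value = value.strip().rstrip(";").strip()
        if code == "w":
            ET.SubElement(entry, "headword").text = value
        elif code == "x":
            ET.SubElement(entry, "semcode").text = value
        elif code == "d":
            sense = ET.SubElement(entry, "sense")
            ET.SubElement(sense, "definition").text = value
        elif code == "i":
            if sense is None:
                sense = ET.SubElement(entry, "sense")
            example = ET.SubElement(sense, "example")
            ET.SubElement(example, "source").text = value
        elif code == "t" and example is not None:
            ET.SubElement(example, "trans").text = value
    return entry


entry = fosf_to_xml(KAURNA)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

The interesting part is the small amount of state (the current sense and example): the flat FOSF file only implies that a \t line belongs to the preceding \i line, whereas the XML output states that grouping explicitly.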

Tools for using and navigating computer dictionaries: Kirrkirr

Dictionaries on computers and the web are now commonplace. There are hundreds of them available. However, almost all of them are like this: you go to a page and enter a search word in a form, and then it shows you the entry for that word in a format that attempts to duplicate what the printed dictionary looks like (but is normally far worse in terms of layout, font quality and readability). Or you can often search for entries anywhere in the dictionary where a word appears. Now, being able to do text search over dictionaries is very useful for some purposes. But having just this facility is hardly pushing the envelope of possibilities for user interfaces for dictionaries. If one thinks about intelligent and flexible access to dictionary information, there are many possibilities that are not being exploited. For instance, in a paper dictionary, you can browse. When people look at a paper dictionary, they normally look not only at the word they started off looking up, but also at other nearby entries. Very few electronic dictionaries allow you to do that (but the OED interface described below does). Similarly, paper dictionaries and these electronic dictionaries only allow you to access a word by looking it up alphabetically -- you have to be able to spell it. But one should be able to search for words in other ways, such as according to their semantic domain, or according to other words they're used with, or by how they are pronounced.

In late 1997, Casey Whitelaw (another linguistics/computer science undergraduate) did some preliminary work on this topic, working on an interface to Kaurna dictionary materials. This was built using a Tcl/Tk interface, which allows it to be portable to Macs, PCs, and Unix.

In 1998, Kevin Jansz, a computer science honours student, together with Nitin Indurkhya and Christopher Manning, worked intensively on Kirrkirr, a browsing interface for the Warlpiri dictionary. Using the XML version of the Warlpiri dictionary, Jansz wrote a Java browser. As well as providing nice HTML-formatted output from the dictionary, and flexible searching, which includes "fuzzy" spelling and regular expression searches, the key feature of this software is a graphical network interface for visualizing word relationships. This software has continued to be developed by Jansz and Manning, and we have started testing this interface in schools. The software also allows the incorporation of audio files, pictures, and video into the dictionary. A more detailed description of the software appears at the Kirrkirr site. Some other Warlpiri dictionary related material appears on the Warlpiri dictionary work page.
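
To give a feel for what "fuzzy" spelling and regular expression searches mean in practice, here is a small Python sketch over a toy word list taken from the Kaurna example earlier on this page. It is only an illustration: Kirrkirr is written in Java, and its actual matching method is not described here. The sketch substitutes the generic string-similarity matcher from Python's standard difflib module; the function names and the cutoff value are assumptions.

```python
import difflib
import re

# Toy word list (forms from the Kaurna example on this page).
HEADWORDS = ["kuiya", "kuiya wika", "kuiyarnappendi"]


def fuzzy_lookup(query, headwords, limit=3, cutoff=0.6):
    """Return up to `limit` headwords similar to `query`, best match first.

    Uses difflib's similarity ratio as a stand-in for a 'fuzzy' spelling
    search; a real system might instead use language-specific spelling rules.
    """
    return difflib.get_close_matches(query, headwords, n=limit, cutoff=cutoff)


def regex_lookup(pattern, headwords):
    """Return the headwords that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [word for word in headwords if rx.search(word)]


# A user who is unsure of the spelling types 'kuya'.
fuzzy_hits = fuzzy_lookup("kuya", HEADWORDS)
# A user looks for every form ending in 'ndi'.
regex_hits = regex_lookup(r"ndi$", HEADWORDS)
print(fuzzy_hits)
print(regex_hits)
```

The two searches serve different users: fuzzy matching helps someone who cannot spell the word they heard, while regular expressions help someone who wants to explore patterns across the whole word list.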

Stephen Wilson has produced a beautiful web-based Wagiman dictionary. This was produced from a dictionary stored in a Microsoft Access database, which in turn was produced from original materials stored in an FOSF file. The material in the database was automatically transformed into both the web edition and a paper edition through the use of Visual Basic programs.

(English) Dictionaries on the web

We haven't been totally neglecting English. In some circumstances, the web is an effective medium for delivering dictionary information. But providing a flexible interface for browsing and finding information is crucial, as was discussed above. In cooperation with SETIS, the University of Sydney Scholarly Electronic Text and Image Service, we've built a web interface to the Oxford English Dictionary that offers more convenient access, including the ability to browse forwards and backwards in the dictionary, to find words in citations and definitions, and to link to related words. This system was mainly developed by Michael Roper, an Arts-Science student. A prototype of the new interface was launched in October 1998, and has become the standard interface to the OED at the University of Sydney. (The interface is only usable by people at USyd because of licensing restrictions on the dictionary content, but we can make the interface available to others.)


Christopher Manning -- <cmanning@sultry.arts.usyd.edu.au> -- revised 16 December 2001