WordNet conversion from Polaris format to VisDic XML file
Note: The following text describes the steps needed
to convert a WordNet from Polaris to VisDic. If you wish to convert
your data as quickly as possible, just download the
polaris2visdic.zip package, unzip it and run the ./polaris2visdic script
with the name of a Polaris file stored in import/export format, given without its extension.
The file is assumed to have an .ewn extension.
Example: ./polaris2visdic wn_ship converts the file
wn_ship.ewn to VisDic format. This file is included in the package
and contains the hyperonym/hyponym subtree of the synset ship:1.
- Convert Polaris format to XML format
First, it is necessary to make an XML formatted file.
For a non-ILI file, use the wn2xml script;
for an ILI file, use the ili2xml script.
The wn2xml script produces <NAME>VALUE</NAME> tags
in place of each LEVEL NAME VALUE Polaris entry.
Moreover, a unique ILI entry tag is added if EQ_SYNONYM is defined.
This ILI consists of the 8-digit WORDNET_OFFSET number or ADD_ON_ID number,
a minus sign ('-') and the PART_OF_SPEECH entry. If ADD_ON_ID is present,
the first digit is set to 9. If EQ_SYNONYM does not exist,
a default "xxxxxxxx-z" ILI is made, where xxxxxxxx is the position of the synset
in the WordNet. A correctly converted WordNet should not contain any "xxxxxxxx-z" tags.
The English and Czech WordNets, for example, are correct in this respect.
Example: wn2xml wn_en.ewn >wn_en.xml
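The ILI rule described above can be sketched in Python as follows. This is only an illustration, not the wn2xml source: the function name make_ili and its parameters are hypothetical, and it assumes that an ADD_ON_ID is zero-padded to seven digits and prefixed with 9.

```python
def make_ili(pos, wordnet_offset=None, add_on_id=None, position=0):
    """Build an ILI key such as '00003504-n' (illustrative sketch only)."""
    if add_on_id is not None:
        # Assumption: the leading digit of the 8-digit number becomes 9.
        return f"9{add_on_id:07d}-{pos}"
    if wordnet_offset is not None:
        # Zero-pad the WORDNET_OFFSET to 8 digits and append the POS.
        return f"{wordnet_offset:08d}-{pos}"
    # No EQ_SYNONYM: fall back to the synset position with 'z' as POS.
    return f"{position:08d}-z"


print(make_ili("n", wordnet_offset=3504))
```

For the "life" synset shown at the end of this document (WORDNET_OFFSET 3504, PART_OF_SPEECH "n"), the sketch yields the same key as the <ILI> tag in the XML example.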
- Change the internal structure of XML entry
The Polaris file contains many unnecessary entries and has an overly rich tree
representation. The dproc tool can change the internal XML structure
of each entry, so that the representation becomes clearer and faster.
There is a script for processing the WordNet XML format named wn.scr.
This step can remove information you need from the WordNet,
so if you wish to keep it, modify this script.
(In particular, EQ relations other than EQ_SYNONYM are removed.)
This process is very time consuming, so please be patient.
Example: dproc wn_en.xml -fwn.scr
- Speed up the XML representation even further
Now, every link in the XML file has the structure
<LITERAL>word<SENSE>number</SENSE></LITERAL>.
This is readable, but slow to search. A tool called wnlink
replaces each such entry with the ILI record corresponding to that word and sense number.
It can also detect two types of inconsistencies in the WordNet.
- A literal and sense appear in some relation, but they are
not defined in any <SYNONYM> tag.
- A relation refers to its own synset.
These inconsistencies are written to the standard error output.
For illustration:
the English WordNet has 134 inconsistencies and the Czech WordNet has 9.
Example: wn2link wn_en.xml 2>wn_en.err
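The link replacement and the two consistency checks can be sketched in Python as shown below. This is not the actual tool: the in-memory layout (a list of dicts with "ili", "synonyms" and "links" keys) and the function name resolve_links are assumptions made for illustration, the part of speech is ignored, and dropping inconsistent links is a choice of this sketch.

```python
import sys


def resolve_links(synsets, err=sys.stderr):
    """Replace (literal, sense) targets with ILI keys and report problems."""
    # Index every (literal, sense) pair defined in some SYNONYM block.
    index = {}
    for synset in synsets:
        for literal, sense in synset["synonyms"]:
            index[(literal, sense)] = synset["ili"]

    problems = 0
    for synset in synsets:
        for relation, targets in synset["links"].items():
            resolved = []
            for literal, sense in targets:
                ili = index.get((literal, sense))
                if ili is None:
                    # Inconsistency 1: target not defined in any synset.
                    print(f"{synset['ili']}: {relation} -> "
                          f"{literal}:{sense} is undefined", file=err)
                    problems += 1
                elif ili == synset["ili"]:
                    # Inconsistency 2: the relation points at its own synset.
                    print(f"{synset['ili']}: {relation} refers to itself",
                          file=err)
                    problems += 1
                else:
                    resolved.append(ili)
            synset["links"][relation] = resolved
    return problems


synsets = [
    {"ili": "00002728-n", "synonyms": [("being", 1)], "links": {}},
    {"ili": "00003504-n", "synonyms": [("life", 1)],
     "links": {"HYPERONYM": [("being", 1)], "HYPONYM": [("wildlife", 1)]}},
]
count = resolve_links(synsets)
```

In this small data set, "wildlife" is not defined in any synset, so it is reported on standard error and counted as one inconsistency, while the HYPERONYM link is resolved to the ILI of "being".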
- Make the new VisDic representation
Now, rebuild the XML again.
Example: dbuild wn_en.xml
- Change the .def file
Finally, it is necessary to make some changes in the .def file of the WordNet
in VisDic representation. Every line in this file represents one entry.
Each entry contains its level in the tree structure, its name,
the minimal number of entries (-1 is infinite), the maximal number of entries
(-1 is infinite) and the entry type.
The last value is the most important. 'N' means a normal tag, 'K' means
a key tag, 'L' means a link tag, 'R' means a reverse tag and 'E' means
an external file link tag.
- Change 'N' in the SYNSET.ILI line to 'K' and set its minimal and maximal values to 1.
- Change the type of every SYNSET.* line except SYNSET.SYNONYM to 'L'.
- Set SYNSET.HYPERONYM, and any other tags you like, to minimal value 0 and maximal value 1.
- You may notice that some entries such as SYNSET.HYPONYM are not present.
This information can be derived from the SYNSET.HYPERONYM tag, so they are
"reverse tags" in some sense. For these entries, add a line like
1 HYPONYM 0 -1 R SYNSET.HYPERONYM
at the end of the file.
- Finally, you can add something like
1 GLOSS 0 1 E WORD_MEANING.GLOSS wn_ili
if you wish to see an ILI definition for every entry. It is a link
to the "wn_ili" file and its WORD_MEANING.GLOSS tag. Make sure that this file and
this tag exist.
PLEASE DO NOT INSERT ANY LINES AMONG ALREADY DEFINED LINES!
If you wish to add new entries, do so at the end of the file.
Example: wn_en.def
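To make the layout of a .def line concrete, here is a small Python sketch that splits a line into the fields listed above. It assumes that fields are separated by whitespace and that anything after the entry type is an extra argument (such as the source tag of a reverse tag, or the tag and file of an external link); DefEntry and parse_def_line are hypothetical names, not part of VisDic.

```python
from typing import NamedTuple

# Entry types as described in the text above.
ENTRY_TYPES = {"N": "normal", "K": "key", "L": "link",
               "R": "reverse", "E": "external file link"}


class DefEntry(NamedTuple):
    level: int       # level in the tree structure
    name: str        # entry name
    min_count: int   # minimal number of entries (-1 is infinite)
    max_count: int   # maximal number of entries (-1 is infinite)
    kind: str        # one of the keys of ENTRY_TYPES
    args: tuple      # extra fields, e.g. the source tag of a reverse tag


def parse_def_line(line):
    """Split one .def line into its fields (whitespace-separated, assumed)."""
    fields = line.split()
    level, name, lo, hi, kind = fields[:5]
    if kind not in ENTRY_TYPES:
        raise ValueError(f"unknown entry type {kind!r}")
    return DefEntry(int(level), name, int(lo), int(hi), kind,
                    tuple(fields[5:]))


entry = parse_def_line("1 HYPONYM 0 -1 R SYNSET.HYPERONYM")
print(entry.name, ENTRY_TYPES[entry.kind], entry.args)
```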
All of these steps can also be performed at once by the wn2visdic script,
or by the ili2visdic script for an ILI file.
This script can handle every inter-language synset relation and every
external relation. It also ensures that every WordNet has
the same definition type
stored in its .def file.
Example: wn2visdic wn_en
- Change the .cfg file
Replace the .cfg file with the wn_en.cfg file.
If you like, you can change some things,
such as replacing the "English WordNet" text or the "ENG" text with your own.
You can ignore the <EQ_DICTx> tags; VisDic does not use them.
- Change the visdic.cfg file
If you wish to work with your WordNet in VisDic, add a line such as
<DICT><FILE>wn_en</FILE></DICT>
to your visdic.cfg file.
Finally, run VisDic.
If it does not work, please e-mail
tomaspavelek@lycos.co.uk
and describe what happens.
Please be patient.
Notice: Every file you specify must be given as a path relative to the VisDic
application. So, if you have VisDic in some directory and your WordNet
"file" in a subdirectory "wn/", then you must write "wn/file".
Notice: Every dictionary in VisDic representation must be specified by its name
without any extension. So, if the original XML file is called
"wn_en.xml", then the VisDic file is referenced as "wn_en",
even though no file with exactly that name exists.
Notice: Whenever you wish to make searching your WordNet more efficient,
use the dbuild -s command. This step can speed up access to entries,
especially after many changes to the dictionary.
Example of synsets:
Polaris format (before Step 1)
0 @3@ WORD_MEANING
1 PART_OF_SPEECH "n"
1 VARIANTS
2 LITERAL "life"
3 SENSE 1
3 DEFINITION "living things collectively; "there is no life on Mars""
3 EXTERNAL_INFO
4 SOURCE_ID 1
5 TEXT_KEY "00003504-n"
1 INTERNAL_LINKS
2 RELATION "has_hyperonym"
3 TARGET_CONCEPT
4 PART_OF_SPEECH "n"
4 LITERAL "being"
5 SENSE 1
2 RELATION "has_hyponym"
3 TARGET_CONCEPT
4 PART_OF_SPEECH "n"
4 LITERAL "wildlife"
5 SENSE 1
1 EQ_LINKS
2 EQ_RELATION "eq_synonym"
3 TARGET_ILI
4 PART_OF_SPEECH "n"
4 WORDNET_OFFSET 3504
XML format (after Step 1)
<WORD_MEANING>
<PART_OF_SPEECH>n</PART_OF_SPEECH>
<VARIANTS>
<LITERAL>life
<SENSE>1</SENSE>
<DEFINITION>living things collectively; "there is no life on Mars"</DEFINITION>
<EXTERNAL_INFO>
<SOURCE_ID>1
<TEXT_KEY>00003504-n</TEXT_KEY>
</SOURCE_ID>
</EXTERNAL_INFO>
</LITERAL>
</VARIANTS>
<INTERNAL_LINKS>
<RELATION>has_hyperonym
<TARGET_CONCEPT>
<PART_OF_SPEECH>n</PART_OF_SPEECH>
<LITERAL>being
<SENSE>1</SENSE>
</LITERAL>
</TARGET_CONCEPT>
</RELATION>
<RELATION>has_hyponym
<TARGET_CONCEPT>
<PART_OF_SPEECH>n</PART_OF_SPEECH>
<LITERAL>wildlife
<SENSE>1</SENSE>
</LITERAL>
</TARGET_CONCEPT>
</RELATION>
</INTERNAL_LINKS>
<EQ_LINKS>
<EQ_RELATION>eq_synonym
<TARGET_ILI>
<PART_OF_SPEECH>n</PART_OF_SPEECH>
<WORDNET_OFFSET>3504</WORDNET_OFFSET>
</TARGET_ILI>
</EQ_RELATION>
</EQ_LINKS>
<ILI>00003504-n</ILI>
</WORD_MEANING>
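The nesting rule behind Step 1 (each LEVEL NAME VALUE line becomes a tag nested under the nearest preceding line with a lower level) can be sketched in Python as follows. This is not the wn2xml script: the function name polaris_to_xml is hypothetical, it handles a single record, it assumes the record header has the form "0 @id@ NAME", and it does not add the ILI tag.

```python
import xml.etree.ElementTree as ET


def polaris_to_xml(lines):
    """Convert the lines of one Polaris record into an ElementTree element."""
    root = None
    stack = []  # (level, element) pairs for the currently open entries
    for raw in lines:
        parts = raw.split(None, 2)
        if not parts:
            continue  # skip blank lines
        level = int(parts[0])
        if level == 0:
            # Assumption: the header looks like '0 @3@ WORD_MEANING'.
            name, value = parts[-1], ""
        else:
            name = parts[1]
            value = parts[2].strip() if len(parts) > 2 else ""
            # Strip only the outermost pair of double quotes.
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
        # Close every open entry at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            element = ET.SubElement(stack[-1][1], name)
        else:
            element = ET.Element(name)
            root = element
        element.text = value
        stack.append((level, element))
    return root


record = """0 @3@ WORD_MEANING
1 PART_OF_SPEECH "n"
1 VARIANTS
2 LITERAL "life"
3 SENSE 1
""".splitlines()
root = polaris_to_xml(record)
print(ET.tostring(root, encoding="unicode"))
```

Using ElementTree rather than string concatenation means that characters such as & or < in a DEFINITION are escaped automatically when the tree is serialized.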
Reduced XML format (after Step 2)
<SYNSET>
<POS>n</POS>
<SYNONYM>
<LITERAL>life
<SENSE>1</SENSE>
</LITERAL>
</SYNONYM>
<ILI>00003504-n</ILI>
<HYPERONYM>
<LITERAL>being
<SENSE>1</SENSE>
</LITERAL>
</HYPERONYM>
</SYNSET>
Sped-up XML format (after Step 3)
<SYNSET>
<POS>n</POS>
<SYNONYM>
<LITERAL>life
<SENSE>1</SENSE>
</LITERAL>
</SYNONYM>
<ILI>00003504-n</ILI>
<HYPERONYM>00002728-n</HYPERONYM>
</SYNSET>
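In this final form a synset stores only the ILI of its hyperonym, which is why a "reverse tag" such as HYPONYM can be derived rather than stored. A minimal Python sketch of that derivation follows; the dictionary layout and the name derive_reverse are assumptions for illustration and do not describe how VisDic implements reverse tags internally.

```python
from collections import defaultdict


def derive_reverse(links):
    """Invert a mapping of synset ILI -> list of hyperonym ILI keys."""
    reverse = defaultdict(list)
    for child, parents in links.items():
        for parent in parents:
            # Each hyperonym link child -> parent gives a hyponym link
            # parent -> child.
            reverse[parent].append(child)
    return dict(reverse)


# The "life" synset above points to 00002728-n as its hyperonym.
hyperonyms = {"00003504-n": ["00002728-n"]}
hyponyms = derive_reverse(hyperonyms)
print(hyponyms)
```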