WordNet conversion from Polaris format to VisDic XML file

Note: The following text describes the steps needed to convert a WordNet from Polaris to VisDic. If you wish to convert your data as quickly as possible, just download the polaris2visdic.zip package, unzip it and run the ./polaris2visdic script with the name of a Polaris file stored in import/export format, given without its extension. The file is assumed to have the .ewn extension.

Example: ./polaris2visdic wn_ship converts the file wn_ship.ewn to VisDic format. This file is included in the package and contains the hyperonym/hyponym subtree of the synset ship:1.
  1. Convert Polaris format to XML format
    First, it is necessary to make an XML formatted file. For a non-ILI file use the wn2xml script, for an ILI file use the ili2xml script.
    The wn2xml script writes <NAME>VALUE</NAME> tags in place of each LEVEL NAME VALUE Polaris entry.
    Moreover, a unique ILI tag is added if EQ_SYNONYM is defined. The ILI consists of the WORDNET_OFFSET or ADD_ON_ID number written with 8 digits, a minus sign ('-') and the PART_OF_SPEECH value. If ADD_ON_ID is present, the first digit is set to 9. If EQ_SYNONYM does not exist, a default "xxxxxxxx-z" ILI is made, where xxxxxxxx is the position of the synset in the WordNet. You should not find any "xxxxxxxx-z" tag in your converted WordNet; the English and Czech WordNets, for example, are correct in this respect.
    Example: wn2xml wn_en.ewn >wn_en.xml
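
The ILI rule described in this step can be sketched in Python as follows. This is only an illustration of the rule, not the wn2xml code: the function name make_ili and its parameters are hypothetical, and it assumes that the trailing "z" of the default ILI is a literal placeholder and that "position" is a plain integer.

```python
def make_ili(pos, wordnet_offset=None, add_on_id=None, position=0):
    """Build an ILI key such as '00003504-n' (illustrative sketch only)."""
    if add_on_id is not None:
        # ADD_ON_ID: write the number with 8 digits, then set the first digit to 9.
        digits = "9" + f"{add_on_id:08d}"[1:]
        return f"{digits}-{pos}"
    if wordnet_offset is not None:
        # WORDNET_OFFSET: write the number with 8 digits and append the part of speech.
        return f"{wordnet_offset:08d}-{pos}"
    # No EQ_SYNONYM (assumption: neither number is known): default 'xxxxxxxx-z'.
    return f"{position:08d}-z"
```

For the synset shown in the examples at the end of this text, make_ili("n", wordnet_offset=3504) gives "00003504-n".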

  2. Change the internal structure of XML entry
    The Polaris file contains many unnecessary entries and has too rich a tree representation. The dproc tool can change the internal XML structure of each entry, so that the representation becomes clearer and faster.
    There is a script for processing the WordNet XML format named wn.scr.
    This step can remove some information you need from the WordNet, so if you wish to keep it, change this script. (In particular, the EQ relations other than EQ_SYNONYM are removed.)
    This process is very time consuming, so please be patient.
    Example: dproc wn_en.xml -fwn.scr

  3. Speed up the XML representation even more
    Now, every link in the XML file has the structure <LITERAL>word<SENSE>number</SENSE></LITERAL>. It is nice, but not very fast for searching. A tool called wnlink replaces each such entry with the ILI record that corresponds to the word and the sense number. It can also detect two types of inconsistencies in the WordNet:
    1. A literal and sense appear in some relation, but they are not defined in any <SYNONYM> tag.
    2. A relation refers to its own synset.
    These inconsistencies are written to the standard error output.
    For illustration: English WordNet has 134 inconsistencies, Czech WordNet has 9 inconsistencies.
    Example: wn2link wn_en.xml 2>wn_en.err
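
The link replacement and the two consistency checks of this step can be sketched as below. This is a minimal illustration, not the actual tool. It assumes that the synsets are available as ElementTree elements in the reduced format of Step 2, that every child other than POS, SYNONYM and ILI is a relation, and that an inconsistent link is simply left unchanged. The name resolve_links is hypothetical.

```python
import sys
import xml.etree.ElementTree as ET

# Assumption: every child of <SYNSET> other than these is a relation tag.
NON_LINK_TAGS = {"POS", "SYNONYM", "ILI"}

def _key(literal):
    """Return (word, sense) for <LITERAL>word<SENSE>n</SENSE></LITERAL>."""
    word = (literal.text or "").strip()
    sense = (literal.findtext("SENSE") or "").strip()
    return word, sense

def resolve_links(synsets, err=sys.stderr):
    """Replace literal/sense links by ILI values; return the number of problems."""
    # Pass 1: map every (word, sense) defined in a <SYNONYM> to its synset's ILI.
    ili_of = {}
    for synset in synsets:
        ili = synset.findtext("ILI")
        for literal in synset.findall("SYNONYM/LITERAL"):
            ili_of[_key(literal)] = ili
    # Pass 2: rewrite each relation target, reporting the two inconsistencies.
    problems = 0
    for synset in synsets:
        own_ili = synset.findtext("ILI")
        for rel in list(synset):
            literal = rel.find("LITERAL")
            if rel.tag in NON_LINK_TAGS or literal is None:
                continue
            key = _key(literal)
            target = ili_of.get(key)
            if target is None:
                print(f"{own_ili}: {rel.tag} target {key} is not defined", file=err)
                problems += 1
            elif target == own_ili:
                print(f"{own_ili}: {rel.tag} refers to its own synset", file=err)
                problems += 1
            else:
                rel.remove(literal)
                rel.text = target
    return problems
```

Building the lookup table first means every link is resolved with a single dictionary access, which is the same reason the ILI form is faster to search afterwards.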

  4. Make the new VisDic representation
    Now, rebuild the XML again.
    Example: dbuild wn_en.xml

  5. Change the .def file
    Finally, it is necessary to make some changes in the .def file of the WordNet in VisDic representation. Every line in this file represents one entry. An entry contains its level in the tree structure, its name, the minimal number of occurrences, the maximal number of occurrences (-1 means infinite in both cases) and the entry type.
    The last value is the most important. 'N' means a normal tag, 'K' means a key tag, 'L' means a link tag, 'R' means a reverse tag and 'E' means an external file link tag.
    PLEASE DO NOT INSERT ANY LINES BETWEEN ALREADY DEFINED LINES!
    If you wish to add new entries, do so at the end of the file.
    Example: wn_en.def
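
To make the five fields of a .def line concrete, here is a small Python sketch that reads one line. The field separator is not specified in this text, so the sketch assumes that the fields are separated by whitespace and appear in the order listed above. The names DefEntry and parse_def_line are hypothetical.

```python
from typing import NamedTuple

# Entry types as listed in this step.
ENTRY_TYPES = {
    "N": "normal tag",
    "K": "key tag",
    "L": "link tag",
    "R": "reverse tag",
    "E": "external file link tag",
}

class DefEntry(NamedTuple):
    level: int      # level in the tree structure
    name: str       # tag name
    min_count: int  # minimal number of occurrences, -1 is infinite
    max_count: int  # maximal number of occurrences, -1 is infinite
    kind: str       # one of the keys of ENTRY_TYPES

def parse_def_line(line):
    """Parse one .def line (assumed layout: LEVEL NAME MIN MAX TYPE)."""
    level, name, min_count, max_count, kind = line.split()
    if kind not in ENTRY_TYPES:
        raise ValueError(f"unknown entry type {kind!r}")
    return DefEntry(int(level), name, int(min_count), int(max_count), kind)
```

A check like this can be run over an edited .def file before starting VisDic, to catch a mistyped entry type early.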

    These steps can also be done all at once by the wn2visdic script, or by the ili2visdic script for an ILI file. The script can handle every inter-language synset relation and every external relation. It also ensures that every WordNet has the same definition type stored in its .def file.
    Example: wn2visdic wn_en

  6. Change the .cfg file
    Replace the .cfg file with the wn_en.cfg file. If you like, you can change some things, such as the "English WordNet" text or the "ENG" text, to your own. Do not bother with the <EQ_DICTx> tags, VisDic does not use them.

  7. Change the visdic.cfg file
    If you wish to work with your WordNet in VisDic, add a line such as
    <DICT><FILE>wn_en</FILE></DICT>
    to your visdic.cfg file.
Finally, run visdic and hope it works. If it does not, please send an e-mail to tomaspavelek@lycos.co.uk and describe what it does. Please be tolerant.

Notice: Every file you specify must be given with a path relative to the VisDic application. So, if you have VisDic in some directory and your WordNet "file" in its subdirectory "wn/", then you must write "wn/file".

Notice: Every dictionary in VisDic representation must be specified by its name without any extension. So, if the original XML file is called "wn_en.xml", then the VisDic dictionary is referenced as "wn_en", even though no file with exactly that name exists.
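
The two naming rules above (a path relative to VisDic, and no extension) can be combined in a short Python sketch that produces the line for visdic.cfg. The function name visdic_dict_line and the directories used to call it are hypothetical.

```python
from pathlib import PurePosixPath

def visdic_dict_line(xml_file, visdic_dir):
    """Return the <DICT> line for visdic.cfg that refers to xml_file."""
    # Rule 1: the path must be relative to the VisDic application directory.
    relative = PurePosixPath(xml_file).relative_to(PurePosixPath(visdic_dir))
    # Rule 2: the dictionary is named without its extension.
    return f"<DICT><FILE>{relative.with_suffix('')}</FILE></DICT>"
```

For example, with VisDic in a hypothetical directory /opt/visdic and the WordNet in /opt/visdic/wn/wn_en.xml, the function returns <DICT><FILE>wn/wn_en</FILE></DICT>.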

Notice: Whenever you wish to make your WordNet more efficient for searching, use the dbuild -s command. This step can speed up access to entries, especially after many changes in the dictionary.


Example of synsets:


Polaris format (before Step 1)
  0 @3@ WORD_MEANING
    1 PART_OF_SPEECH "n"
    1 VARIANTS
      2 LITERAL "life"
        3 SENSE 1
        3 DEFINITION "living things collectively; "there is no life on Mars""
        3 EXTERNAL_INFO
          4 SOURCE_ID 1
            5 TEXT_KEY "00003504-n"
    1 INTERNAL_LINKS
      2 RELATION "has_hyperonym"
        3 TARGET_CONCEPT
          4 PART_OF_SPEECH "n"
          4 LITERAL "being"
            5 SENSE 1
      2 RELATION "has_hyponym"
        3 TARGET_CONCEPT
          4 PART_OF_SPEECH "n"
          4 LITERAL "wildlife"
            5 SENSE 1
    1 EQ_LINKS
      2 EQ_RELATION "eq_synonym"
        3 TARGET_ILI
          4 PART_OF_SPEECH "n"
          4 WORDNET_OFFSET 3504
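
How a Polaris record like the one above maps to nested XML elements can be sketched in Python as follows. This is a minimal illustration of the LEVEL NAME VALUE to <NAME>VALUE</NAME> rule of Step 1, not the wn2xml script itself. It assumes that the optional @n@ field on a level 0 line can be ignored and that only the outermost pair of quotes is stripped from a value; it does not add the ILI tag. The name polaris_to_xml is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# LEVEL, an optional @n@ record number, NAME and an optional VALUE.
LINE_RE = re.compile(r'^\s*(\d+)\s+(?:@\d+@\s+)?(\w+)(?:\s+(.*?))?\s*$')

def polaris_to_xml(text):
    """Convert Polaris text into a list of ElementTree records (sketch only)."""
    records = []
    stack = []  # stack[i] is the currently open element at level i
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match is None:
            continue  # skip blank or unrecognized lines
        level, name, value = int(match.group(1)), match.group(2), match.group(3)
        # Strip only the outermost quotes, so quotes inside a definition survive.
        if value and len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        # Close every element at this level or deeper.
        del stack[level:]
        if stack:
            elem = ET.SubElement(stack[-1], name)
        else:
            elem = ET.Element(name)
            records.append(elem)
        elem.text = value
        stack.append(elem)
    return records
```

Each returned record can be written out with ET.tostring(record, encoding="unicode"). The nesting follows the LEVEL numbers, so a value and its child entries end up inside the same element, as in the XML format shown next.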

XML format (after Step 1)
  <WORD_MEANING>
    <PART_OF_SPEECH>n</PART_OF_SPEECH>
    <VARIANTS>
      <LITERAL>life
        <SENSE>1</SENSE>
        <DEFINITION>living things collectively; "there is no life on Mars"</DEFINITION>
        <EXTERNAL_INFO>
          <SOURCE_ID>1
            <TEXT_KEY>00003504-n</TEXT_KEY>
          </SOURCE_ID>
        </EXTERNAL_INFO>
      </LITERAL>
    </VARIANTS>
    <INTERNAL_LINKS>
      <RELATION>has_hyperonym
        <TARGET_CONCEPT>
          <PART_OF_SPEECH>n</PART_OF_SPEECH>
          <LITERAL>being
            <SENSE>1</SENSE>
          </LITERAL>
        </TARGET_CONCEPT>
      </RELATION>
      <RELATION>has_hyponym
        <TARGET_CONCEPT>
          <PART_OF_SPEECH>n</PART_OF_SPEECH>
          <LITERAL>wildlife
            <SENSE>1</SENSE>
          </LITERAL>
        </TARGET_CONCEPT>
      </RELATION>
    </INTERNAL_LINKS>
    <EQ_LINKS>
      <EQ_RELATION>eq_synonym
        <TARGET_ILI>
          <PART_OF_SPEECH>n</PART_OF_SPEECH>
          <WORDNET_OFFSET>3504</WORDNET_OFFSET>
        </TARGET_ILI>
      </EQ_RELATION>
    </EQ_LINKS>
    <ILI>00003504-n</ILI>
  </WORD_MEANING>

Reduced XML format (after Step 2)
  <SYNSET>
    <POS>n</POS>
    <SYNONYM>
      <LITERAL>life
        <SENSE>1</SENSE>
      </LITERAL>
    </SYNONYM>
    <ILI>00003504-n</ILI>
    <HYPERONYM>
      <LITERAL>being
        <SENSE>1</SENSE>
      </LITERAL>
    </HYPERONYM>
  </SYNSET>

Faster XML format (after Step 3)
  <SYNSET>
    <POS>n</POS>
    <SYNONYM>
      <LITERAL>life
        <SENSE>1</SENSE>
      </LITERAL>
    </SYNONYM>
    <ILI>00003504-n</ILI>
    <HYPERONYM>00002728-n</HYPERONYM>
  </SYNSET>