JRC Eurovoc Indexer - JEX - European Commission

Introduction
The EuroVoc Thesaurus
JEX usage conditions
Download JEX
More information on JEX
Acknowledgements

Introduction

Descriptors from the multilingual EuroVoc thesaurus are used by many European parliaments and documentation centres to manually index their large document collections. The assigned descriptors are then used to search and retrieve documents in the collection and to summarise document contents for users.

Because EuroVoc descriptors exist as one-to-one translations in almost thirty languages, they can be displayed in a language other than that of the text, giving users cross-lingual access to the information contained in each document. EuroVoc is therefore also well suited to searching in the user's own language and retrieving documents written in other languages.
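
The cross-lingual access described above can be illustrated with a small sketch. It assumes that every descriptor has a language-independent identifier with one label per language. The identifiers (`D1`, `D2`), the labels, the document names and the function names are all hypothetical and are not taken from the actual thesaurus or from JEX.

```python
# Hypothetical descriptor table: one language-independent ID, one label per language.
LABELS = {
    "D1": {"en": "fisheries policy", "fr": "politique de la pêche", "de": "Fischereipolitik"},
    "D2": {"en": "import", "fr": "importation", "de": "Einfuhr"},
}

# Hypothetical index: document ID -> set of assigned descriptor IDs.
# The documents themselves may be written in any language.
INDEX = {
    "doc_fr_1": {"D1"},
    "doc_de_1": {"D1", "D2"},
    "doc_en_1": {"D2"},
}


def display_descriptors(doc_id, lang):
    """Return the descriptors of a document, shown in the user's language."""
    return sorted(LABELS[d][lang] for d in INDEX[doc_id])


def search(label, lang):
    """Find documents in any language via a descriptor label in the user's language."""
    wanted = {d for d, names in LABELS.items() if names.get(lang) == label}
    return sorted(doc for doc, descs in INDEX.items() if descs & wanted)


# An English-speaking user sees English labels for a German document ...
german_doc_labels = display_descriptors("doc_de_1", "en")
# ... and an English query retrieves the French and German documents.
hits = search("fisheries policy", "en")
```

Because documents are linked to identifiers rather than to words, the same index serves every language for which a label exists.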

The European Commission's (EC) Joint Research Centre (JRC) has developed, and makes available, software that automatically assigns EuroVoc descriptors to documents in 22 languages. The system uses statistical machine learning methods that learn multi-label categorisation rules from documents that were previously indexed manually. The method can be described as profile-based category ranking. This software, called JRC EuroVoc Indexer, or JEX for short, is available for download from this site. Users can also re-train the software on their own data, even with their own, alternative classification systems.
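
To give a rough idea of what profile-based category ranking means, the toy sketch below builds one term profile per category from manually labelled documents and then ranks the categories for a new text by similarity to each profile. It is not the JEX implementation: the raw term counts, the cosine similarity, the whitespace tokeniser, the training sentences and all function names are assumptions made for illustration. The publications listed under More information on JEX describe the method actually used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokeniser (assumption): lower-case and split on whitespace."""
    return text.lower().split()


def build_profiles(training_docs):
    """Build one term profile per category from (text, labels) pairs.

    A profile here is simply the summed term counts of all training
    documents that carry the category (a simplifying assumption).
    """
    profiles = defaultdict(Counter)
    for text, labels in training_docs:
        counts = Counter(tokenize(text))
        for label in labels:  # multi-label: a document feeds every one of its categories
            profiles[label].update(counts)
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(text, profiles, top_n=3):
    """Rank all categories for a new text, best match first."""
    doc = Counter(tokenize(text))
    scored = [(label, cosine(doc, profile)) for label, profile in profiles.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Hypothetical, hand-made training data with made-up category names.
training = [
    ("fishing quota for cod and herring", {"fisheries"}),
    ("fishing fleet subsidies and trade", {"fisheries", "trade"}),
    ("import tariff on trade goods", {"trade"}),
]
profiles = build_profiles(training)
ranking = rank_categories("new fishing quota for herring", profiles)
```

The output is a ranked list of categories with scores rather than a yes/no decision per category, so a caller would typically keep the top few entries or apply a score threshold.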

The EuroVoc Thesaurus

The EuroVoc thesaurus was developed by the European Parliament (EP), in collaboration with the EU Publications Office (OP) and several national organisations, for the indexing (cataloguing / classification / categorisation) of document collections in several languages.

EuroVoc currently exists not only in 22 official EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), but also in Basque, Catalan, Croatian, Russian and Serbian. Further non-official translations exist.

The number of EuroVoc users and language versions is steadily increasing. The thesaurus covers the major interests of the institutions involved. It is hierarchically organised into 21 fields and, at the next level, into 127 micro-thesauri, with about 6,700 descriptor terms (classes) in total. The maximum depth of the hierarchy is 8 levels. To browse the thesaurus, see the EuroVoc web site.
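
The hierarchy described above (fields, then micro-thesauri, then descriptors) can be pictured as a tree of broader-term links. The sketch below is a minimal illustration with made-up identifiers. It assumes that each node has at most one broader node, which is a simplification and may not hold for the real thesaurus.

```python
# Hypothetical broader-term links: node -> its parent (None for a top-level field).
BROADER = {
    "field:A": None,
    "mt:A1": "field:A",          # a micro-thesaurus inside field A
    "desc:x": "mt:A1",           # a top descriptor of the micro-thesaurus
    "desc:y": "desc:x",          # a narrower descriptor
}


def path_to_root(node, broader=BROADER):
    """Return the chain from a node up to its field, starting with the node itself."""
    path = [node]
    while broader[path[-1]] is not None:
        path.append(broader[path[-1]])
    return path


def depth(node, broader=BROADER):
    """Number of levels from the field (level 1) down to the node."""
    return len(path_to_root(node, broader))


chain = path_to_root("desc:y")
max_depth = max(depth(node) for node in BROADER)
```

Walking up the broader-term chain in this way is what allows a search on a general descriptor to be widened to its narrower descriptors, or results to be grouped by field.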

JEX usage conditions

The JEX software can in principle be downloaded and used free of charge, but the detailed usage conditions in the EU Licence Agreement (EULA) must be followed. Scientific work using JEX, or scientific publications referring to JEX, should cite at least one of the publications listed in the section More information on JEX, below.

Download JEX

JEX has been trained for 22 languages. Each language version can be downloaded separately.

For each language version, there are two versions of JEX: (1) a basic version for the typical end user who wants either to test the software or to use it in a production environment; (2) an advanced version for technically trained IT specialists. The advanced version additionally allows users to re-train the software with a new document collection and to run scientific experiments. It also includes the data on which the software was trained, so its download packages are much larger. We suggest that you download the basic version first and download the advanced version only once you have confirmed that you really want to use it.

JEX is implemented in Java. It runs on the Windows operating system, on Unix-like operating systems and on Apple Mac. The software should run on most modern computers, but it requires at least 2 GB of memory.

By downloading JEX, you agree to the JEX usage conditions, as formulated in the EU Licence Agreement (EULA).

Language	Version	Indexing (basic)	Indexing and Training (advanced)
bg	1.0	download (18 MB)	download (89 MB)
cs	1.0	download (20 MB)	download (75 MB)
da	1.0	download (29 MB)	download (116 MB)
de	1.0	download (32 MB)	download (131 MB)
el	1.0	download (27 MB)	download (156 MB)
en	1.0	download (15 MB)	download (99 MB)
es	1.0	download (17 MB)	download (110 MB)
et	1.0	download (21 MB)	download (72 MB)
fi	1.0	download (35 MB)	download (121 MB)
fr	1.0	download (24 MB)	download (117 MB)
hu	1.0	download (14 MB)	download (72 MB)
it	1.0	download (25 MB)	download (117 MB)
lt	1.0	download (18 MB)	download (117 MB)
lv	1.0	download (19 MB)	download (72 MB)
mt	1.0	download (16 MB)	download (68 MB)
nl	1.0	download (25 MB)	download (117 MB)
pl	1.0	download (18 MB)	download (76 MB)
pt	1.0	download (24 MB)	download (116 MB)
ro	1.0	download (22 MB)	download (119 MB)
sk	1.0	download (18 MB)	download (75 MB)
sl	1.0	download (18 MB)	download (70 MB)
sv	1.0	download (28 MB)	download (115 MB)

More information on JEX

The user manual gives an easy-to-understand overview of the software and explains how to use it, step by step:

Ebrahim Mohamed, Ralf Steinberger & Marco Turchi. JEX Manual.

The following document, published in 2012, explains JEX, its history and possible uses. It describes the documents JEX was trained on, gives an overview of the indexing methodology and presents automatic evaluation results for all 22 languages. It also explains how to use JEX:

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc Indexer JEX - A freely available multi-label categorisation tool. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012), pp. 798-805, Istanbul, 21-27 May 2012.

This third document shows experimental JEX classification results for languages belonging to four different language families: Slavic, Finno-Ugric, Germanic and Romance. It furthermore explores the usefulness of linguistic pre-processing (lemmatisation, part-of-speech tagging):

Ebrahim Mohamed, Maud Ehrmann, Marco Turchi & Ralf Steinberger (2012). Multi-label Eurovoc classification for Eastern and Southern EU languages. In: Cristina Vertan & Walther v. Hahn (eds): Multilingual processing in Eastern and Southern EU languages - Low resources technologies and translation, pp. 370-394. Cambridge Scholars Publishing, Cambridge, UK.

This last document, mostly targeted at the scientific community, explains the categorisation algorithm in more depth and also describes the results of a manual evaluation of the automatic classification, performed by specialised human EuroVoc indexers, for English and Spanish documents.

Pouliquen Bruno, Steinberger Ralf, Camelia Ignat (2003). Automatic annotation of multilingual text collections with a conceptual thesaurus. In: Proceedings of the Workshop Ontologies and Information Extraction at the Summer School The Semantic Web and Language Technology - Its Potential and Practicalities (EUROLAN'2003). Bucharest, Romania, 28 July - 8 August 2003.

Acknowledgements

We would like to thank Bruno Pouliquen, who developed a major part of the main assignment method, and Mladen Kolar, who implemented an initial Java version of the tool. We would also like to acknowledge the support of Victoria Fernandez-Mera from the Spanish Congress of Deputies and Elisabet Lindkvist from the Swedish Riksdagen, who gave us a lot of advice on practices relating to manual EuroVoc indexing and who helped us to evaluate the software thoroughly. Finally, we are grateful to the Publications Office of the European Commission for having provided their collection of manually EuroVoc-indexed documents. The initial work on JEX was funded as a JRC Exploratory Research Project. The preparation of the first public release of JEX, in May 2012, was partially funded under the JRC’s Innovative Project Competition scheme.