Language Technology Resources

This page gives you an overview of Linguistic Resources and Tools (multilingual software, parallel corpora, and more) that are available for download from the webpages of the JRC's Competence Centre on Text Mining and Analysis.

The data releases are in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

The JRC has developed Language Technology (text mining, computational linguistics) tools for more than twenty languages and has been analysing up to 300,000 online news articles per day since 2004, thus creating valuable metadata. Some of this software and some of the resulting metadata have been released publicly, starting in 2006 with the large-scale multilingual parallel corpus JRC-Acquis, which covers twenty-two languages. The JRC also helps distribute linguistic resources produced by other European Union organisations. The most distinctive features of these resources are their high degree of multilinguality and the fact that the texts are parallel (i.e. the corpora consist of texts and their manually produced translations). For comparative details, see the journal publication An overview of the European Union’s highly multilingual parallel corpora.

The resources listed below are useful to academia and industry for research and development on highly multilingual text analysis tools, especially cross-lingual applications. The resources distributed here have already been used to train statistical machine translation systems, generate dictionaries, evaluate multilingual document summarisers and information extraction software, support librarians in their daily work, help improve name searches in large data repositories, and more.

To better understand the background of our work, you may want to have a look at a list of the publications produced by the Language Technology team of the JRC's Competence Centre on Text Mining and Analysis.

JRC-Acquis

The JRC-Acquis is a multilingual sentence-aligned parallel corpus in 22 languages, containing a total of over 1 billion words.

This collection of documents and their manually produced translations can be used for many purposes, including the training of statistical machine translation systems, the training and testing of text mining applications, and more.

Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.

Date of first release: May 2006.

ISLRN (International Standard Language Resource Number): 821-325-977-001-1.

DGT-Acquis

The DGT-Acquis is a multilingual paragraph-aligned parallel corpus in all 23 official EU languages, including documents from the Official Journal’s L and C series since the year 2004.

This collection of aligned full-text documents and their manually produced translations can be used for many purposes, including the training of statistical machine translation systems, the training and testing of text mining applications, and more.

Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.

Date of first release: December 2012.

ISLRN: 393-866-130-658-2.

DCEP - Digital Corpus of the European Parliament

The DCEP is a multilingual sentence-aligned parallel corpus in 23 official EU languages (253 language pairs) plus Turkish (altogether 24 languages and 276 language pairs when considering the small number of Turkish documents), consisting of European Parliament texts produced between 2001 and 2012 and containing over 1.3 billion words.
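The pair counts quoted above follow from counting unordered pairs of languages, n(n-1)/2. A minimal check in Python (the function name is chosen here for illustration only):

```python
from math import comb


def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs among n languages: n * (n - 1) / 2."""
    return comb(n_languages, 2)


print(language_pairs(23))  # 23 official EU languages -> 253
print(language_pairs(24))  # with Turkish added -> 276
```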

The corpus includes a variety of text types, including press releases, motions, minutes of plenary sessions, rules of procedure, reports and written questions to the parliament. This collection of sentence-aligned full-text documents and their manually produced translations can be used for many purposes, including the training of statistical machine translation systems, the training and testing of text mining applications, and more.

Date of first release: March 2015.

ISLRN: 823-807-024-162-2.

DGT-Translation Memory (DGT-TM)

DGT-TM is a 24-language Translation Memory of the Acquis Communautaire, i.e. the body of European legislation, including all the treaties, regulations and directives adopted by the European Union (EU) and the rulings of the European Court of Justice.

Translation memories are collections of small pieces of text and their manually produced translations. Translation memories are typically used to support human translators, but they can also be used to train statistical machine translation systems. DGT-TM consists of between 4 and 7 million units per language. It is distributed in the widely used TMX format.
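As a rough illustration of what a TMX translation memory looks like to a program, the sketch below reads translation units from a small TMX document with Python's standard library. The sample document, the function name read_pairs and the language codes in the sample are invented for this example and are not taken from DGT-TM; real files are far larger and are better processed in a streaming fashion.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-language sample, NOT taken from DGT-TM.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs for units that contain both languages.

    Languages are compared on the primary subtag only ("EN-GB" matches "en").
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


print(read_pairs(SAMPLE_TMX, "en", "de"))
```

Pairs extracted this way are the kind of input typically fed to a translation tool or a machine translation training pipeline.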

Languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.

Date of first release: November 2007, updated annually since 2011.

ISLRN: 710-653-952-884-4.

EAC-Translation Memory (EAC-TM)

EAC-TM is a Translation Memory (a collection of sentences and their manually produced translations) in 26 languages focusing on the subject domain of education, training, culture and youth.

The parallel corpus was provided by the European Commission’s Directorate General for Education and Culture (EAC) and the data has been processed further by the JRC. The EAC-TM is smaller than the other parallel corpora available here, but it has the advantage of covering a very different domain. EAC-TM consists of a total of over 32,000 units. It is distributed in the widely used TMX format.

Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

Date of first release: January 2013.

ISLRN: 589-927-543-547-4.

ECDC-Translation Memory (ECDC-TM)

ECDC-TM is a Translation Memory of the web pages of the European Centre for Disease Prevention and Control (ECDC).

Most of the documents cover health-related topics (anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of the web pages also describe the ECDC itself (e.g. its organisation and job opportunities) and its activities (e.g. epidemic intelligence, surveillance). ECDC-TM consists of up to 2,500 translation units per language. It is distributed in the widely used TMX format.

Languages (25): Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Icelandic, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.

Date of release: October 2012.

ISLRN: 476-596-396-497-8.

JRC-Names - a multilingual named entity resource

JRC-Names is a highly multilingual named entity resource for person and organisation names. It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). It has been compiled by analysing large volumes of multilingual news since 2004, combined with Wikipedia mining. It gets updated daily. As of March 2016, it contains 307,000 person and organisation names plus 333,000 spelling variants written in over 20 different scripts and in many more languages. At its initial release in September 2011, there were 205,000 distinct entities.

It can be used for a number of purposes, including improving name search in databases or on the internet, seeding machine learning systems that learn named entity recognition rules, improving machine translation results, and more.
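To make the name search use case concrete, here is a minimal sketch of variant-based query expansion: every known spelling is mapped to an entity identifier, so a search for one spelling can be widened to all the others. The records, the identifiers and the function names are hand-written for this example; the layout of the actual JRC-Names download is not assumed.

```python
import unicodedata
from collections import defaultdict

# Hypothetical (entity_id, name_variant) records written for this example;
# the identifiers are made up and the real file layout is not assumed.
RECORDS = [
    (1, "Angela Merkel"),
    (1, "Angela Dorothea Merkel"),
    (1, "Ангела Меркель"),
    (2, "Ban Ki-moon"),
    (2, "Ban Ki Moon"),
]


def normalise(name):
    """Unicode-normalise, case-fold and collapse whitespace for matching."""
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())


def build_index(records):
    """Build variant -> entity and entity -> variants lookup tables."""
    variant_to_entity = {}
    entity_to_variants = defaultdict(list)
    for entity_id, variant in records:
        variant_to_entity[normalise(variant)] = entity_id
        entity_to_variants[entity_id].append(variant)
    return variant_to_entity, entity_to_variants


def expand_query(name, variant_to_entity, entity_to_variants):
    """Return all known spellings of the entity, or just the name if unknown."""
    entity_id = variant_to_entity.get(normalise(name))
    if entity_id is None:
        return [name]
    return entity_to_variants[entity_id]


v2e, e2v = build_index(RECORDS)
print(expand_query("angela  MERKEL", v2e, e2v))
```

A search front end could then issue one query per returned spelling, which is how a Latin-script query can also retrieve documents that spell the name in Cyrillic.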

Languages: JRC-Names covers many different languages, including: Arabic, Bulgarian, Chinese, Danish, Dutch, English, Estonian, Farsi, French, Georgian, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swahili, Swedish, Thai and Turkish.

Date of release: September 2011. Updated daily.

ISLRN: 328-863-023-410-2.

JRC-Names as Linked Data

JRC-Names as Linked Data is an RDF representation of the JRC-Names resource. This new edition offers more information than the previous JRC-Names resource, including titles and function names that have historically been found next to mentions of the person, information about the time period during which name variants and their titles were found, and various frequency counts. It links to other linked datasets such as DBpedia, New York Times Open Data and Talk of Europe.

The JRC-Names RDF representation is based on lemon (the Lexicon Model for Ontologies), a model that allows lexical information to be expressed relative to ontologies.

JRC entities are modeled as instances of DBpedia classes (dbpedia:Person and dbpedia:Organisation) and the multilingual lexicalizations of their names and function names are represented as Lexical Entries of lemon Lexicons. Various other types of linguistic information and metadata are expressed using standardized vocabularies (LexInfo, OLiA, ISOCat, Lexvo, DCTerms, etc.). Where no existing vocabulary met the needs, in-house classes and properties were defined (see JRC data model for JRC-Names).
The JRC-Names schema gives an overview of how JRC-Names data is modeled.

This new linked data edition can be queried through a SPARQL endpoint on the European Union’s Open Data Portal, with example queries such as:

Given a person's name, retrieve all of its name variants
Given a person's name, retrieve all of its name variants in a language
Given a person's name, retrieve all of its titles/function names in a language
Given a variant and a language, retrieve the corresponding entity
Given a title and a language, retrieve all of the persons with this title
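
The first query type in the list above might be sketched as follows. The code only builds the text of a SPARQL query; it does not contact any endpoint. The properties lemon:canonicalForm, lemon:otherForm and lemon:writtenRep belong to the lemon core vocabulary, but the assumption that all spelling variants of an entity hang off the same lexical entry is ours and should be checked against the JRC-Names schema before use. The function names are hypothetical.

```python
LEMON = "http://lemon-model.net/lemon#"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def variants_query(name):
    """Build a SPARQL query for all written forms sharing a lexical entry with `name`.

    ASSUMPTION: variants are forms (canonical or other) of one lemon lexical
    entry. The real JRC-Names modelling may differ.
    """
    return f"""PREFIX lemon: <{LEMON}>
SELECT DISTINCT ?variant WHERE {{
  ?entry (lemon:canonicalForm|lemon:otherForm) ?knownForm .
  ?knownForm lemon:writtenRep ?knownRep .
  FILTER (STR(?knownRep) = "{escape_literal(name)}")
  ?entry (lemon:canonicalForm|lemon:otherForm) ?form .
  ?form lemon:writtenRep ?variant .
}}"""


print(variants_query("Angela Merkel"))
```

Comparing with STR(...) rather than with a plain literal lets the filter match written forms regardless of their language tag; restricting the result to one language would add a filter on LANG(?variant).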

The resource is also referenced on the datahub.io portal as JRC-Names.

Additional information is available on the EU Open Data Portal:

http://data.europa.eu/euodp/en/data/dataset/jrc-emm-jrc-names

A complete description of the Linked Data version of JRC-Names (version 1) was published in the paper below. Please use this publication as a reference when you refer to the resource:

Ehrmann, Maud, Guillaume Jacquet & Ralf Steinberger (2016). JRC-Names: Multilingual Entity Name variants and titles as Linked Data, Semantic Web Journal, March 2016.

Download JRC-Names as an RDF file from:

http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/EMM/JRC-Names/LATEST/jrcnames_uri.zip

Date of release: March 2016.

JEX - JRC EuroVoc Indexer

JEX is multi-label classification software that automatically assigns a ranked list of the over six thousand descriptors (classes) from the controlled vocabulary of the EuroVoc thesaurus to new texts. JEX has been trained for twenty-two EU languages.
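
To illustrate what assigning a ranked list of descriptors to a text means, the toy sketch below scores a text against per-descriptor term profiles using cosine similarity and returns the best-scoring labels. It is not JEX's implementation: the labels, the profile terms, the weights and the function names are all invented for this example.

```python
import math
from collections import Counter

# Invented term-weight profiles for three hypothetical descriptors.
PROFILES = {
    "fisheries policy": {"fish": 3.0, "catch": 2.0, "vessel": 2.0, "quota": 1.5},
    "public health": {"disease": 3.0, "vaccine": 2.5, "health": 2.0},
    "air transport": {"airline": 3.0, "airport": 2.5, "flight": 2.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_descriptors(text, profiles, top_n=2):
    """Return up to top_n labels with a non-zero score, best first."""
    counts = Counter(text.lower().split())
    scored = [(cosine(counts, profile), label) for label, profile in profiles.items()]
    scored = [(score, label) for score, label in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [label for _, label in scored[:top_n]]


print(rank_descriptors("quota for each fishing vessel and fish catch disease", PROFILES))
```

Because several labels can score above zero for the same text, the output is a ranked list rather than a single class, which is the multi-label behaviour described above.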

The software allows users to re-train the system with their own documents, or with a combination of their own documents and the data provided together with the software. JEX can also be trained using classification schemes other than EuroVoc.

Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish.

Date of release: May 2012.

Multilingual summary evaluation data

This is a manually annotated collection of document clusters of parallel texts in seven languages (Arabic, Czech, English, French, German, Russian and Spanish) that can be used to evaluate multi-document, or even single document, summarisation software.

The accompanying publication by Turchi et al. (2010), Using parallel corpora for multilingual (multi-document) Summarisation Evaluation (Proceedings of CLEF'2010, Springer LNCS series), suggests that precious annotation time can be saved by projecting the monolingual sentence selection annotation across languages, using the sentence alignment information in this parallel corpus. Various ways are proposed to make use of the varying degree of overlap between the manual annotations of four different annotators.

The downloadable zip file contains the full text of all documents in seven languages, sentence-split full texts, sentence alignment information for all language pairs involving English, as well as the annotations of the English documents. Important background information about the XML structure of the files can be found in the Readme file. The four document clusters each consist of five high-level commentaries selected from www.project-syndicate.org, on topics that can roughly be described as malaria, Israel-and-Palestine-Conflict, genetics and science-and-society.

You can download the manually annotated multilingual multi-document summary evaluation data at the URL: http://optima.jrc.it/Resources/2010_JRC_multilingual-summary-evaluation.zip.
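
The annotation projection described in this section can be sketched in a few lines: sentences selected in the English documents are mapped to their aligned sentences in another language. The sentence identifiers, the alignment table and the function names below are invented for this example and do not reflect the XML layout of the distributed files. The vote threshold is one plausible way of combining the four annotators, not necessarily one of those proposed in the publication.

```python
from collections import Counter


def select_by_agreement(annotations, min_votes):
    """Keep sentence ids chosen by at least min_votes annotators."""
    votes = Counter(sid for chosen in annotations for sid in set(chosen))
    return sorted(sid for sid, count in votes.items() if count >= min_votes)


def project_selection(selected_source_ids, alignment):
    """Map selected source sentence ids to aligned target sentence ids.

    alignment maps one source id to a list of target ids, so one-to-many
    alignments are preserved; unaligned source sentences are skipped.
    """
    projected = []
    for source_id in selected_source_ids:
        for target_id in alignment.get(source_id, []):
            if target_id not in projected:
                projected.append(target_id)
    return projected


# Invented toy data: four annotators' selections on an English document
# and an English-to-French sentence alignment.
ANNOTATIONS = [[1, 3], [1, 4], [1, 3], [2, 3]]
EN_TO_FR = {1: [1], 2: [2], 3: [3, 4], 4: [5]}

english_selection = select_by_agreement(ANNOTATIONS, min_votes=3)
print(english_selection)
print(project_selection(english_selection, EN_TO_FR))
```

Since only the English documents are annotated, the same projection can be repeated for each of the other six languages with the corresponding alignment table.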

Languages (7): Arabic, Czech, English, French, German, Russian and Spanish.

Date of release: September 2010.

ISLRN: 762-292-165-648-8.

Sentiment-annotated set of quotations

This is a set of 1590 English language quotations (reported speech) extracted automatically from the news and annotated manually for the sentiment expressed towards entities (persons or organisations) mentioned inside the quotation.

For each quote, the resource consists of the text found inside the quotation markers, the speaker (the person who issued the quotation), the entity mentioned inside the quotation, as well as two manually produced sentiment judgements. The data is distributed as an Excel file with three sheets: one containing important background information (the Readme), one containing the instructions given to the annotators, and one containing the main data. You can download the English language sentiment-annotated set of quotations at the URL: http://optima.jrc.it/Resources/2010_JRC_1590-Quotes-annotated-for-sentiment.zip.

Language: English.

Date of release: June 2010.

ISLRN: 574-735-957-886-6.

Named entity annotations of a Turkish tweet data set

This resource includes 1322 named entity annotations from a total of 868 Turkish tweets published on July 26 2013 between 12:00 and 13:00 GMT.

The named entity types considered are person, location, organisation, date, time, money, percent, and misc. You can download this manually annotated resource at the URL: http://optima.jrc.it/Resources/2014_JRC_Twitter_TR_NER-dataset.zip.

Language: Turkish.

Date of release: February 2014.

ISLRN: 764-177-227-350-7.