- Languages / File format
- Text types / Domain
- Statistics on the corpus
- Conditions for Use
- Further Translation Memories available here
- Download the EAC Translation Memory
- Referring to this resource
- Acknowledgements and Contact
In October 2012, the European Union's (EU) Directorate General for Education and Culture (DG EAC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-six languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC). Here we describe this resource, which is called the EAC Translation Memory, or EAC-TM for short.
Translation Memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
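The reuse idea described above can be illustrated with a minimal sketch. It assumes a translation memory held as a plain Python dictionary and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; real translation tools use their own matching methods. The function name `best_match`, the 0.8 threshold and the German translations are illustrative assumptions, not part of EAC-TM.

```python
from difflib import SequenceMatcher


def best_match(memory, segment, threshold=0.8):
    """Return (source, translation, score) for the closest stored segment, or None."""
    best = None
    for source, translation in memory.items():
        # Similarity in [0, 1]; 1.0 means the two strings are identical.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# Toy memory: English source segments with illustrative (assumed) translations.
memory = {
    "Please specify your home country": "Bitte geben Sie Ihr Heimatland an",
    "Country": "Land",
}

exact = best_match(memory, "Country")                      # identical segment
fuzzy = best_match(memory, "Please specify your country")  # near match
missing = best_match(memory, "Budget")                     # nothing similar enough
```

An exact hit lets the translator reuse the stored translation unchanged, while a near match only offers a starting point that still needs review.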
Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:
- training automatic systems for statistical machine translation (SMT);
- producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
- training and testing multilingual information extraction software;
- checking translation consistency automatically;
- testing and benchmarking alignment software (for sentences, words, etc.).
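As an illustration of one of the uses listed above, automatic consistency checking, the following sketch flags source segments that occur with more than one distinct translation. It is a simplified assumption of how such a check might look; the function name `inconsistent_translations` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def inconsistent_translations(pairs):
    """Return source segments that have more than one distinct translation."""
    seen = defaultdict(set)
    for source, target in pairs:
        seen[source.strip()].add(target.strip())
    # Keep only sources whose set of translations has several members.
    return {src: sorted(tgts) for src, tgts in seen.items() if len(tgts) > 1}


# Illustrative (source, translation) pairs; the translations are assumptions.
pairs = [
    ("Country", "Land"),
    ("Country", "Staat"),
    ("Education and Culture", "Bildung und Kultur"),
    ("Country", "Land"),
]

report = inconsistent_translations(pairs)
```

A flagged segment is not necessarily an error, since the same source text can legitimately be translated differently in different contexts.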
The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. The main advantage of the various parallel corpora available via our web pages, apart from being freely available, is the number of rare language pairs they cover (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).
The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. It also includes translation units for the languages Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB, or Norwegian, NO) and Turkish (TR).
EAC-TM covers up to 26 languages: 22 official languages of the EU (all except Irish) plus Icelandic, Croatian, Norwegian and Turkish. EAC-TM thus contains translations from English into the following 25 languages: Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.
All documents and sentences were originally written in English (source language is English) and then translated into the other languages. The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes.
They are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.
The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set.
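Since TMX is an XML format, the files can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and streams translation units one at a time. It assumes the usual TMX layout (`tu` elements containing `tuv` elements with a `seg` child) and accepts the language code either as `xml:lang` or as a plain `lang` attribute, because the exact attribute used in the EAC-TM files is not stated here. The function name `read_tmx` and the inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one dict per translation unit, mapping language code to segment text."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[lang.upper()] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # free the memory held by the processed unit


# Tiny illustrative TMX document (not taken from the real data).
sample = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="EN"><seg>Country</seg></tuv><tuv xml:lang="DE"><seg>Land</seg></tuv></tu>
</body></tmx>""".encode("utf-8")

units = list(read_tmx(io.BytesIO(sample)))
```

`read_tmx` accepts a file path as well as a file object, so the same function could be pointed at a downloaded TMX file.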
EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Lifelong Learning Programme (LLP) and the Youth in Action Programme.
The contents of the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data'). Because the two types of data differ, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples of reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc. To get an overview of the programmes managed by DG EAC, go to the website of DG EAC.
The data consists of translations carried out between the end of the year 2008 and July 2012.
The EAC Translation Memory is available in up to 26 languages: the English source language and its translations into up to 25 other languages. The following tables show the coverage, expressed as the approximate total number of translation units available for each language, the number of words and the number of characters.
The first table describes the 'Forms' Data; the second table describes the 'Reference Data'. For details, there are also files containing the statistics on the size of the EAC-TM per language pair for the Forms Data and the Reference Data.
No. of TUs | No. of words | No. of chars | No. of words per TU | No. of chars per TU
Table 1. Size of EAC's Translation Memory 'Forms Data', expressed as the total number of translation units (TU) per language for each of the 25 languages (22 out of the 23 official EU languages plus Icelandic, Norwegian Bokmål and Turkish).
No. of TUs | No. of words | No. of chars | No. of words per TU | No. of chars per TU
Table 2. Size of EAC's Translation Memory 'Reference Data', expressed as the total number of translation units (TU) per language for each of the 26 languages (22 out of the 23 official EU languages plus Croatian, Icelandic, Norwegian and Turkish).
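The five quantities reported per language in the tables (number of TUs, words and characters, plus words and characters per TU) could be recomputed along the following lines. This is a sketch under stated assumptions: translation units are represented as dictionaries mapping a language code to the segment text, and a word is taken to be any whitespace-separated token, which may differ from the counting method used for the published figures. The function name `corpus_statistics` and the sample data are hypothetical.

```python
from collections import Counter


def corpus_statistics(units):
    """Compute per-language counts of TUs, words and characters, plus averages per TU."""
    tus, words, chars = Counter(), Counter(), Counter()
    for unit in units:
        for lang, text in unit.items():
            tus[lang] += 1
            words[lang] += len(text.split())  # assumption: whitespace tokenisation
            chars[lang] += len(text)
    return {
        lang: {
            "tus": tus[lang],
            "words": words[lang],
            "chars": chars[lang],
            "words_per_tu": words[lang] / tus[lang],
            "chars_per_tu": chars[lang] / tus[lang],
        }
        for lang in sorted(tus)
    }


# Illustrative units; the second one has no German translation.
units = [
    {"EN": "Country", "DE": "Land"},
    {"EN": "Please specify your home country"},
]

stats = corpus_statistics(units)
```

Because not every unit is translated into every language, the number of TUs differs from language to language, which is why the tables report it per language.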
copyright notice applies
The public release of the EAC-Translation Memory follows the release of various other multilingual resources via the JRC's website.
These include the JRC-Acquis parallel corpus (since 2006, 22 languages); the DGT-Translation Memory (DGT-TM) (since 2007, 22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (since 2012, 22 languages); and the ECDC-Translation Memory (ECDC-TM) (since 2012, 25 languages). For details and other, smaller linguistic resources, see the
Further multilingual linguistic resources will be made available in the future. We also hope to make updates of the currently existing resources available.
The distribution of the EAC Translation Memory consists of a single zip file (EAC-TM-all.zip), which can be downloaded by clicking on the link below. The zip file contains: two TMX files (EAC_FORMS.tmx and EAC_REFERENCE_DATA.tmx) with the English sentences and their translations into up to 25 other languages; the DTD file, which should be kept in the same directory; two PDF files with the statistics on the corpora; and the Java utility CreateLanguagePair.jar, which allows you to extract a TMX file containing only a single language pair. The language codes used are those defined by the ISO 639-1 standard.
EAC-TM (August 2012)
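For readers who prefer not to use the Java utility, the idea of extracting a single language pair from a multilingual TMX file can be sketched in Python. This is not a reimplementation of CreateLanguagePair.jar, whose exact behaviour is not described here; it is an assumed approach that keeps only the units containing both requested languages and drops all other language variants. The function name `extract_pair` and the inline sample are hypothetical, and language codes are compared on their primary ISO 639-1 part only.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def primary_code(tuv):
    """Return the upper-cased primary language code of a <tuv>, e.g. 'EN' for 'en-GB'."""
    lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return lang.split("-")[0].upper()


def extract_pair(tmx_in, tmx_out, lang_a, lang_b):
    """Write a TMX with only the two given languages; return the number of units kept."""
    wanted = {lang_a.upper(), lang_b.upper()}
    tree = ET.parse(tmx_in)
    body = tree.getroot().find("body")
    kept = 0
    for tu in list(body):
        if tu.tag != "tu":
            continue
        for tuv in list(tu.findall("tuv")):
            if primary_code(tuv) not in wanted:
                tu.remove(tuv)  # drop variants in other languages
        if {primary_code(tuv) for tuv in tu.findall("tuv")} == wanted:
            kept += 1
        else:
            body.remove(tu)  # unit lacks one of the two languages
    tree.write(tmx_out, encoding="UTF-8", xml_declaration=True)
    return kept


# Tiny illustrative TMX document (not taken from the real data).
sample = b"""<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="EN"><seg>Country</seg></tuv><tuv xml:lang="DE"><seg>Land</seg></tuv><tuv xml:lang="FR"><seg>Pays</seg></tuv></tu>
<tu><tuv xml:lang="EN"><seg>Germany</seg></tuv><tuv xml:lang="FR"><seg>Allemagne</seg></tuv></tu>
</body></tmx>"""

out = io.BytesIO()
kept = extract_pair(io.BytesIO(sample), out, "en", "de")
result = out.getvalue().decode("utf-8")
```

Note that `ElementTree` does not write the DOCTYPE declaration back out, so the output of this sketch no longer references the DTD file.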
When referring to the EAC Translation Memory (EAC-TM) in publications, please use the following reference:
- Steinberger, Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.
For more information, you can contact the following persons:
Directorate General for Education and Culture
Mr Marek Przybyszewski (Email address format: Firstname.email@example.com)
European Commission - Directorate General for Education and Culture (DG EAC)
Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
The EAC Translation Memory was provided by the EC's Directorate General for Education and Culture (DG EAC). The original files, one for each of the 25 language pairs with English, were cleaned and combined into one by Mohamed Ebrahim from the European Commission's Joint Research Centre (JRC).
The Directorate General for Education and Culture (DG EAC) is a Directorate-General of the European Commission whose aim is to reinforce and promote lifelong learning, through policy cooperation with EU Member States on the one hand and through the implementation of the Lifelong Learning Programme on the other. For details, read the mission of DG EAC.
The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. It has contributed to the dissemination of the DGT Translation Memory and has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC Eurovoc Indexer (JEX), and a series of further linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from about 3000 news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. Because EMM not only displays news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:
- NewsBrief: breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
- MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
- NewsExplorer: summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; recognition of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.