EAC-Translation Memory

Introduction
Languages / File format
Text types / Domain
Statistics on the corpus
Conditions for Use
Further Translation Memories available here
Download the EAC Translation Memory
Referring to this resource
Acknowledgements and Contact

Introduction

In October 2012, the European Union's (EU) Directorate General for Education and Culture (DG EAC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-six languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC). Here we describe this resource, which bears the name EAC Translation Memory, or EAC-TM for short.

Translation memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

training automatic systems for statistical machine translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).
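
As an illustration of the consistency-checking use listed above, the following is a minimal sketch, assuming the translation units for one language pair have already been loaded as (source, target) string pairs. The function name and the sample pairs are hypothetical and are not taken from the EAC-TM.

```python
from collections import defaultdict


def inconsistent_sources(pairs):
    """Return source segments that have more than one distinct translation."""
    targets = defaultdict(set)
    for source, target in pairs:
        # Surrounding whitespace is ignored so that it does not count as a difference.
        targets[source.strip()].add(target.strip())
    return {src: sorted(tgts) for src, tgts in targets.items() if len(tgts) > 1}


# Hypothetical English-German pairs, for illustration only.
PAIRS = [
    ("Country", "Land"),
    ("Country", "Staat"),
    ("Germany", "Deutschland"),
    ("Germany", "Deutschland"),
]

if __name__ == "__main__":
    for source, variants in inconsistent_sources(PAIRS).items():
        print(source, "->", variants)
```

A source segment flagged this way is not necessarily an error, since the same wording may legitimately be translated differently in different contexts; the check only points a reviewer to candidates.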

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. The most outstanding advantage of the various parallel corpora available via our web pages - apart from them being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).

The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. Also, it includes translation units for the languages Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB, or Norwegian, NO) and Turkish (TR).

Languages / File Format

EAC-TM covers up to 26 languages: 22 official languages of the EU (all except Irish) plus Icelandic, Croatian, Norwegian and Turkish. EAC-TM thus contains translations from English into the following 25 languages: Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

All documents and sentences were originally written in English (source language is English) and then translated into the other languages. The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. They are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.

The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set.
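
To make the file format more concrete, here is a minimal sketch of how TMX translation units could be read with Python's standard library. The embedded sample is a hypothetical two-language excerpt written for this illustration, not a copy of the EAC-TM files, and the exact form of the language codes in the real files (for example lower or upper case, or with a region suffix) is an assumption that should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in TMX style; the segments are invented for illustration.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Country</seg></tuv>
      <tuv xml:lang="de"><seg>Land</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Education and Culture</seg></tuv>
      <tuv xml:lang="de"><seg>Bildung und Kultur</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text):
    """Return one dict per translation unit, mapping language code to segment text."""
    # Parse as UTF-8 bytes, matching the encoding declared in the file.
    root = ET.fromstring(tmx_text.encode("utf-8"))
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang] = "".join(seg.itertext())
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_units(SAMPLE_TMX):
        print(unit)
```

For the real files, which are larger, reading from disk with `ET.parse` or `ET.iterparse` would be the natural replacement for the in-memory string used here.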

Text types / Domain

EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Lifelong Learning Programme (LLP) and the Youth in Action Programme.

The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data'). Due to the different types of data, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples for reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc. To get an overview of the programmes managed by DG EAC, go to the website of DG EAC.

The data consists of translations carried out between the end of the year 2008 and July 2012.

Statistics for the EAC Translation Memory

The EAC Translation Memory is available in up to 26 languages: the English source language and its translations into up to 25 other languages. The following tables show the coverage for each language, expressed as the approximate total number of translation units, the number of words and the number of characters.

The first table describes the 'Forms' Data; the second table describes the 'Reference Data'. For details, there are also files containing the statistics on the size of the EAC-TM per language pair for the Forms Data and the Reference Data.

Language	No. of TUs	No. of words	No. of Chars	No. of words per TU	No. of chars per TU
bg	1503	15797	110066	10±14.74	73±92.35
cs	1035	8598	62157	8±13.35	60±87.14
da	1202	11588	80094	9±18.27	66±89.89
de	2193	22849	181676	10±16.61	82±95.05
el	1411	15098	110984	10±16.15	78±95.92
en	2512	29217	191728	11±15.78	76±95.29
es	2193	25954	171104	11±15.62	78±95.42
et	1137	8071	67376	7±15.21	59±94.15
fi	734	4490	41889	6±14.94	57±93.38
fr	2212	25711	175480	11±14.94	79±94.21
hu	1784	17966	142691	10±14.96	79±97.10
is	1188	12317	82957	10±14.88	69±96.58
it	881	8894	60987	10±14.85	69±96.62
lt	918	6413	52687	6±14.66	57±95.80
lv	1443	11272	92750	7±14.39	64±94.72
mt	605	4602	39170	7±14.29	64±94.48
nb	642	4925	36391	7±14.20	56±93.93
nl	642	5877	41562	9±14.15	64±93.71
pl	1478	14649	118217	9±14.07	79±94.08
pt	1434	16418	110630	11±14.09	77±94.14
ro	970	10444	72151	10±14.10	74±94.33
sk	643	5120	37449	7±14.03	58±93.96
sl	2061	19773	142018	9±13.90	68±93.30
sv	901	7734	57047	8±13.84	63±93.03
tr	918	6386	52180	6±13.74	56±92.65
ALL	32,640	320,163	2,331,441

Table 1. Size of EAC's Translation Memory 'Forms Data', expressed as the total number of translation units (TU) per language for each of the 25 languages (22 out of the 23 official EU languages plus Icelandic, Norwegian Bokmål and Turkish).
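
The per-unit averages in Table 1 can be related to the totals by simple division. The sketch below uses three rows copied from the table and suggests that the printed averages are the integer part of words per TU and characters per TU. This truncation is an inference from the figures, not something stated in the source. The value after the ± sign cannot be derived from the totals and is left aside.

```python
# (language, translation units, words, characters), copied from three rows of Table 1.
ROWS = [
    ("bg", 1503, 15797, 110066),
    ("en", 2512, 29217, 191728),
    ("fi", 734, 4490, 41889),
]


def per_unit_means(rows):
    """Return {language: (words per TU, chars per TU)} using integer division."""
    return {lang: (words // tus, chars // tus) for lang, tus, words, chars in rows}


if __name__ == "__main__":
    for lang, (words_per_tu, chars_per_tu) in per_unit_means(ROWS).items():
        print(lang, words_per_tu, chars_per_tu)
```

For these three rows the results agree with the table: 10 and 73 for bg, 11 and 76 for en, 6 and 57 for fi.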

Language	No. of TUs	No. of words	No. of Chars	No. of words per TU	No. of chars per TU
bg	2558	14416	105825	5±9.15	41±64.32
cs	2316	11146	84596	4±7.78	36±55.22
da	2555	12522	98937	4±8.12	38±57.75
de	2280	9482	84592	4±7.73	37±56.24
el	1407	7159	55225	5±7.66	39±55.82
en	2642	15497	111791	5±8.33	42±59.27
es	2110	11302	79744	5±8.22	37±58.03
et	1133	3897	33883	3±8.03	29±57.03
fi	724	2115	19391	2±7.92	26±56.61
fr	2264	12135	88323	5±7.86	39±55.81
hr	573	1894	14497	3±7.82	25±55.60
hu	1671	6430	54652	3±7.66	32±54.81
is	1018	3969	29474	3±7.61	28±54.40
it	1289	7112	73186	5±7.62	56±56.17
lt	2468	11691	97009	4±7.60	39±56.59
lv	2437	10125	85194	4±7.44	34±55.60
mt	1117	4606	38729	4±7.38	34±55.36
nl	1163	5118	41171	4±7.36	35±55.26
no	523	1809	13297	3±7.35	25±55.15
pl	2549	14074	114591	5±7.47	44±56.37
pt	2067	10705	76179	5±7.45	36±55.94
ro	2189	10390	77375	4±7.40	35±55.42
sk	2329	11065	86093	4±7.34	36±54.97
sl	2583	13345	103353	5±7.40	40±55.30
sv	2008	8455	69896	4±7.35	34±54.89
tr	2280	10126	80104	4±7.27	35±54.38
ALL	45,973	220,459	1,737,003

Table 2. Size of EAC's Translation Memory 'Reference Data', expressed as the total number of translation units (TU) per language for each of the 26 languages (22 out of the 23 official EU languages plus Croatian, Icelandic, Norwegian and Turkish).

Conditions for Use

The Commission's copyright notice applies.

Further Translation Memories available here

The public release of the EAC-Translation Memory follows the release of various other multilingual resources via the JRC's website.

These include the JRC-Acquis parallel corpus since 2006 (22 languages); the DGT-Translation Memory (DGT-TM) since 2007 (22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (22 languages) since 2012; and the ECDC-Translation Memory (ECDC-TM) since 2012 (25 languages). For details and other, smaller linguistic resources, see the JRC-Resources page.

Further multilingual linguistic resources will be made available in the future. We also hope to make updates of the currently existing resources available.

Download the EAC Translation Memory

The distribution of the EAC Translation Memory consists of a single zip file (EAC-TM-all.zip), which can be downloaded by clicking on the link below. In the zip file, you will find two TMX files (EAC_FORMS.tmx and EAC_REFERENCE_DATA.tmx) containing the English sentences and their translations into up to 25 other languages; the DTD file, which should be kept in the same directory; two PDF files with the statistics on the corpora; and the Java utility CreateLanguagePair.jar, which allows you to extract a TMX file containing a single language pair. The language codes used are those defined by the ISO 639-1 standard.
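
The idea of extracting one language pair from a multilingual TMX file can be sketched in Python as follows. This is not the CreateLanguagePair.jar utility and makes no claim about how that tool works; the function names and the sample data are hypothetical. The sketch also assumes that language codes can be compared on their primary subtag (so that a code such as "EN-GB" would match "en"), which should be verified against the real files.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical three-language sample; the segments are invented for illustration.
SAMPLE_TMX = b"""<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Country</seg></tuv>
      <tuv xml:lang="de"><seg>Land</seg></tuv>
      <tuv xml:lang="fr"><seg>Pays</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Germany</seg></tuv>
      <tuv xml:lang="fr"><seg>Allemagne</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tuv_lang(tuv):
    """Return the lower-cased primary language subtag of a <tuv> element."""
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.lower().split("-")[0]


def extract_pair(tmx_bytes, source, target):
    """Return TMX bytes keeping only units that contain both requested languages."""
    root = ET.fromstring(tmx_bytes)
    body = root.find("body")
    wanted = {source.lower(), target.lower()}
    # Iterate over copies of the child lists because elements are removed in place.
    for tu in list(body.findall("tu")):
        for tuv in list(tu.findall("tuv")):
            if tuv_lang(tuv) not in wanted:
                tu.remove(tuv)
        present = {tuv_lang(tuv) for tuv in tu.findall("tuv")}
        if present != wanted:
            # Drop units for which one side of the pair is missing.
            body.remove(tu)
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    print(extract_pair(SAMPLE_TMX, "en", "de").decode("utf-8"))
```

In the sample, the English-German extraction keeps only the first unit, because the second unit has no German segment, while the English-French extraction keeps both.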

EAC-TM (August 2012)	Download size
EAC-TM-all.zip	3.5 MB

Referring to this resource

When referring to the EAC Translation Memory (EAC-TM) in publications, please use the following reference:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.

Acknowledgements and Contact

For more information, you can contact the following persons:

Directorate General for Education and Culture

Mr Marek Przybyszewski (Email address format: Firstname[dot]lastname[at]ec[dot]europa[dot]eu)

European Commission - Directorate General for Education and Culture (DG EAC)

Brussels, Belgium

URL: http://ec.europa.eu/dgs/education_culture/

Joint Research Centre (JRC)

Ralf Steinberger (Email address format: Firstname[dot]Lastname[at]jrc[dot]ec[dot]europa[dot]eu)

IPSC - GlobeSec - OPTIMA

Via E. Fermi 2749, T.P. 267

I-21027 Ispra (VA)

The EAC Translation Memory was offered by the EC's Directorate General for Education and Culture (DG EAC). The original files - one for each of the 25 language pairs with English - were cleaned and combined into one by Mohamed Ebrahim from the European Commission's Joint Research Centre (JRC).

The Directorate General for Education and Culture (DG EAC) is a directorate of the European Commission which has the aim of reinforcing and promoting lifelong learning through policy cooperation with EU Member States on the one hand and through the implementation of the Lifelong Learning Programme on the other hand. For details, read the mission of DG EAC.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from about 3000 news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies the articles into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:

NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; recognition of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.