ECDC-Translation Memory

Introduction

In October 2012, the European Union (EU) agency 'European Centre for Disease Prevention and Control' (ECDC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-five languages. The data is distributed via the web pages of the European Commission's Joint Research Centre (JRC). This page describes the resource, which is called the ECDC Translation Memory, or ECDC-TM for short.

Translation Memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

training automatic systems for statistical machine translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. The most outstanding advantage of the various parallel corpora available via our web pages - apart from them being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).

The ECDC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of public health. It also includes translation units for Irish (Gaeilge, GA), Norwegian (Norsk, NO) and Icelandic (IS).

Languages / File Format

ECDC-TM covers 25 languages: the 23 official languages of the EU plus Norwegian (Norsk) and Icelandic. ECDC-TM was created by translating from English into the following 24 languages: Bulgarian, Czech, Danish, Dutch, Estonian, Irish (Gaeilge), German, Greek, Finnish, French, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian (Norsk), Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish. The JRC then combined these 24 translation memory files into one large translation memory, which also makes it possible to extract translation units for language pairs that do not include English.
All documents and sentences were thus originally written in English. They were then translated into the other languages by professional translators from the Translation Centre CdT in Luxembourg.

The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set. The files have the following structure:

<tu>
<tuv xml:lang="EN">
<seg>Vaccination against hepatitis C is not yet available.</seg>
</tuv>
<tuv xml:lang="BG">
<seg>Засега няма ваксина срещу хепатит С.</seg>
</tuv>
<tuv xml:lang="CS">
<seg>Očkování proti hepatitidě C zatím není k dispozici.</seg>
</tuv>
...

<tuv xml:lang="SV">
<seg>Det finns ännu inget vaccin mot hepatit C.</seg>
</tuv>
</tu>
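
The structure above can be read with any XML parser. The following is a minimal sketch in Python, not the tooling shipped with the resource. It assumes the translation units sit inside the usual TMX wrapper elements (tmx, header, body), and it uses a tiny inline sample instead of the real ECDC.tmx file. The function name extract_pairs and the second, incomplete translation unit are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny stand-in for ECDC.tmx. The second <tu> is hypothetical and only
# shows what happens when one language is missing from a unit.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu>
<tuv xml:lang="EN"><seg>Vaccination against hepatitis C is not yet available.</seg></tuv>
<tuv xml:lang="SV"><seg>Det finns ännu inget vaccin mot hepatit C.</seg></tuv>
</tu>
<tu>
<tuv xml:lang="EN"><seg>Hypothetical segment without a Swedish version.</seg></tuv>
</tu>
</body></tmx>"""


def extract_pairs(root, source_lang, target_lang):
    """Return (source, target) segment pairs for units containing both languages."""
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[tuv.get(XML_LANG, "").upper()] = "".join(seg.itertext())
        if source_lang in segs and target_lang in segs:
            pairs.append((segs[source_lang], segs[target_lang]))
    return pairs


root = ET.fromstring(SAMPLE_TMX)
pairs = extract_pairs(root, "EN", "SV")
for source, target in pairs:
    print(source, "|", target)
```

Units that lack one of the two requested languages are skipped, which is why language pairs other than those with English can be extracted from the combined file without any further alignment step.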

Text types / Domain

ECDC-TM was built from the website of the European Centre for Disease Prevention and Control (ECDC). Most of the documents cover health-related topics (anthrax, botulism, cholera, dengue fever, hepatitis, etc.), but some of the web pages describe the ECDC itself (e.g. its organisation, job opportunities) and its activities (e.g. epidemic intelligence, surveillance). The file ECDC-domains.xlsx gives further details.

Statistics for the ECDC Translation Memory

The following table shows the size of ECDC Translation Memory per language: the number of translation units, the number of words and characters of the whole corpus and the average number of words and characters per translation unit.

For details, there is also a file containing the statistics on the size of the ECDC-TM per language pair.

Language	No. of TUs	No. of words	No. of Chars	No. of words per TU	No. of chars per TU
BG	2567	53557	293635	20±37.02	114±100.02
CS	2562	45564	271290	17±32.44	105±93.31
DA	2577	41955	261529	16±28.41	101±90.24
DE	2560	43187	306148	16±25.99	119±92.17
EL	2530	50658	317722	20±24.85	125±93.88
EN	3919	72085	395269	18±24.12	100±92.98
ES	2564	52406	300495	20±25.06	117±93.49
ET	2581	39435	255112	15±28.36	98±92.87
FI	2617	38467	277958	14±27.62	106±92.10
FR	2561	50106	303936	19±26.88	118±92.49
GA	1356	22619	143006	16±26.40	105±91.99
HU	2571	45744	290470	17±28.39	112±92.69
IS	2511	42005	256966	16±27.68	102±91.99
IT	2534	47038	295964	18±27.08	116±92.13
LT	2545	102229	347591	40±83.88	136±129.47
LV	2542	48095	273604	18±82.86	107±128.24
MT	2539	61855	315865	24±80.75	124±126.89
NL	2510	46666	292721	18±78.82	116±125.35
NO	2537	40149	254315	15±76.83	100±123.45
PL	2546	91237	347955	35±83.69	136±128.72
PT	2531	49239	294449	19±81.78	116±127.24
RO	2555	46999	292453	18±80.00	114±125.90
SK	2525	88810	323179	35±85.24	127±129.73
SL	2545	84756	308808	33±89.63	121±132.99
SV	2527	39442	259710	15±87.91	102±131.39
ALL	63,912	1,344,303	7,280,150

Size of ECDC's Translation Memory (expressed as the number of translation units, number of words and number of characters) per language for each of the 25 European languages (all 23 official EU languages plus Icelandic and Norwegian).
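
As an illustration of how figures of this kind can be derived, here is a minimal Python sketch. It rests on assumptions that this page does not confirm: words are counted as whitespace-separated tokens, characters include spaces and punctuation, and the value after the ± sign is taken to be a population standard deviation. The function name size_stats and the second sample segment are hypothetical.

```python
import statistics


def size_stats(segments):
    """Return TU count, totals, and (mean, std) of words and chars per TU."""
    words = [len(s.split()) for s in segments]  # assumption: whitespace tokens
    chars = [len(s) for s in segments]          # assumption: every character counts
    return {
        "tus": len(segments),
        "words": sum(words),
        "chars": sum(chars),
        "words_per_tu": (statistics.mean(words), statistics.pstdev(words)),
        "chars_per_tu": (statistics.mean(chars), statistics.pstdev(chars)),
    }


segments = [
    "Vaccination against hepatitis C is not yet available.",
    "Hepatitis",  # hypothetical short segment, for illustration only
]
stats = size_stats(segments)
print(stats)

# Sanity check against the BG row of the table: the per-TU averages match
# the totals divided by the number of TUs, rounded down.
bg_tus, bg_words, bg_chars = 2567, 53557, 293635
print(bg_words // bg_tus, bg_chars // bg_tus)
```

For the rows checked here, the averages in the table appear to be the totals divided by the number of translation units and rounded down to a whole number.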

Terms of Use

By downloading or using the ECDC-Translation Memory, you are bound by the ECDC-TM usage conditions (PDF).

Further Translation Memories (and more) available on our site

The public release of the ECDC-Translation Memory follows the release of various other multilingual resources via the JRC's website. These include the JRC-Acquis parallel corpus since 2006 (22 languages); the DGT-Translation Memory (DGT-TM) since 2007 (22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); and the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (22 languages) since 2012. For details and other, smaller linguistic resources, see the JRC-Resources page.
Further multilingual linguistic resources will be made available in the future.

Download the ECDC Translation Memory

The distribution of the ECDC Translation Memory consists of a single zip file (ECDC-TM.zip), which can be downloaded by clicking on the link below.

The zip file contains: the main file ECDC.tmx, with the aligned translation units for all languages; the DTD file, which should be kept in the same directory as ECDC.tmx; a PDF file with statistics on the corpus; a PDF document describing the terms of use; a Java utility that extracts a TMX file for a single language pair and produces statistics on the number of translation units; and two readme files with explanations.
Should you be interested in the full-text version of the English files that were used to produce the translation memory, you can download these also (En_full-texts_2010-Dec-13.zip).
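
For readers who prefer not to use the bundled Java utility, the number of translation units per language pair can be approximated with a few lines of Python. This is a sketch under assumptions, not a description of how the utility works: the function name count_pairs is hypothetical, the inline sample stands in for ECDC.tmx, and its second translation unit is invented for illustration.

```python
import io
import itertools
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Stand-in for ECDC.tmx; the second <tu> is a hypothetical placeholder.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu>
<tuv xml:lang="EN"><seg>Vaccination against hepatitis C is not yet available.</seg></tuv>
<tuv xml:lang="BG"><seg>Засега няма ваксина срещу хепатит С.</seg></tuv>
<tuv xml:lang="SV"><seg>Det finns ännu inget vaccin mot hepatit C.</seg></tuv>
</tu>
<tu>
<tuv xml:lang="EN"><seg>(hypothetical)</seg></tuv>
<tuv xml:lang="SV"><seg>(hypothetical)</seg></tuv>
</tu>
</body></tmx>""".encode("utf-8")


def count_pairs(tmx_file):
    """Count translation units per unordered language pair, streaming the file."""
    counts = Counter()
    for _event, elem in ET.iterparse(tmx_file, events=("end",)):
        if elem.tag == "tu":
            langs = sorted(
                {tuv.get(XML_LANG).upper()
                 for tuv in elem.findall("tuv") if tuv.get(XML_LANG)}
            )
            # Every unordered pair of languages in this unit gains one TU.
            counts.update(itertools.combinations(langs, 2))
            elem.clear()  # drop the processed unit's children to save memory
    return counts


pair_counts = count_pairs(io.BytesIO(SAMPLE_TMX))
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

To run this on the real data, pass an open binary file handle for ECDC.tmx instead of the in-memory sample. Parsing incrementally keeps memory use low, although the full file is small enough to be loaded in one go.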

ECDC-TM (October 2012)	Download size
ECDC-TM.zip	3.7MB

Referring to this resource

When referring to the ECDC-TM in publications, please use the following reference:

Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.

Acknowledgement and Contact

For more information on ECDC-TM, you can contact the following persons:

Web Editor for Multilingual Content
Email address: webmaster[at]ecdc[dot]europa[dot]eu
European Centre for Disease Prevention and Control (ECDC)
Tomtebodavägen 11A
171 83 Stockholm, Sweden
URL: http://www.ecdc.europa.eu

Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname[dot]Lastname[at]jrc[dot]ec[dot]europa[dot]eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)

The ECDC Translation Memory was offered by the European Centre for Disease Prevention and Control (ECDC). The original files - one for each of the 24 language pairs - were cleaned and combined by Mohamed Ebrahim from the European Commission's Joint Research Centre JRC.
The European Centre for Disease Prevention and Control (ECDC) is an EU agency whose aim is to strengthen Europe's defences against infectious diseases. It was established in 2005 and is based in Stockholm, Sweden.
The ECDC's mission: According to Article 3 of the founding Regulation, ECDC's mission is to identify, assess and communicate current and emerging threats to human health posed by infectious diseases. In order to achieve this mission, ECDC works in partnership with national health protection bodies across Europe to strengthen and develop continent-wide disease surveillance and early warning systems. By working with experts throughout Europe, ECDC pools Europe's health knowledge, so as to develop authoritative scientific opinions about the risks posed by current and emerging infectious diseases.
The Joint Research Centre (JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. It has contributed to the dissemination of the DGT Translation Memory and has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of smaller linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM collects and aggregates about 150,000 online news articles per day in 50 languages from about 3500 news portals world-wide (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every ten minutes. EMM not only displays the news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items. As a result, EMM has many users from around the world, with up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:

NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; recognition of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.