Introduction

In October 2012, the European Union's (EU) Directorate General for Education and Culture (DG EAC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-six languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC). Here we describe this resource, which is named EAC Translation Memory, or EAC-TM for short.

Translation memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:
- training automatic systems for statistical machine translation (SMT);
- producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
- training and testing multilingual information extraction software;
- checking translation consistency automatically;
- testing and benchmarking alignment software (for sentences, words, etc.).

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs.
The most outstanding advantage of the various parallel corpora available via our web pages, apart from their being freely available, is the number of rare language pairs they cover (e.g. Maltese-Estonian, Slovenian-Finnish). The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. It also includes translation units for Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB, or Norwegian, NO) and Turkish (TR).

Languages / File Format

EAC-TM covers up to 26 languages: 22 official languages of the EU (all except Irish) plus Icelandic, Croatian, Norwegian and Turkish. EAC-TM thus contains translations from English into the following 25 languages: Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

All documents and sentences were originally written in English (the source language) and then translated into the other languages. The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. These staff are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.

The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set.

Text types / Domain

EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Lifelong Learning Programme (LLP) and the Youth in Action Programme.
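As an illustration of the TMX format and the translation units described above, the following minimal Python sketch reads translation units from TMX data and yields the segment pairs for one language pair. It is a sketch under stated assumptions, not the tooling shipped with the corpus: the function name iter_pairs is hypothetical, the sample TMX content (including the German and French translations) is made up for the example, and reducing language codes such as "EN-GB" to their two-letter prefix is an assumption about how the codes may be written in a given file.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src="en", tgt="de"):
    """Yield (source_text, target_text) for each TU containing both languages.

    `source` is a file name or a binary file object holding TMX data.
    Hypothetical helper: language codes are lower-cased and cut at the
    first hyphen (an assumption), so "EN-GB" is treated as "en".
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            code = tuv.get(XML_LANG) or tuv.get("lang") or ""
            lang = code.lower().split("-")[0]
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            yield segments[src], segments[tgt]
        elem.clear()  # release the finished TU to keep memory use low


# Made-up sample data in TMX layout; not taken from the EAC-TM files.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Country</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Land</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Germany</seg></tuv>
      <tuv xml:lang="de"><seg>Deutschland</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Education and Culture</seg></tuv>
      <tuv xml:lang="fr"><seg>Éducation et culture</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = list(iter_pairs(io.BytesIO(SAMPLE.encode("utf-8")), "en", "de"))
print(pairs)
```

Translation units that lack one of the two requested languages are skipped, which reflects the fact that not every unit is translated into every language. To try the sketch on a real file, a file name can be passed instead of the in-memory sample.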
The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms' Data) and (b) checkboxes (referred to as 'Reference Data'). Because the two types of data differ, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples of reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc. For an overview of the programmes managed by DG EAC, go to the website of DG EAC.

The data consists of translations carried out between the end of 2008 and July 2012.

Statistics for the EAC Translation Memory

The EAC Translation Memory is available in up to 26 languages: the English source language and its translations into up to 25 other languages (25 languages in the 'Forms' Data, 26 in the 'Reference Data'). The following tables show the coverage, expressed as the approximate total number of translation units available for each language, the number of words and the number of characters.

The first table describes the 'Forms' Data; the second table describes the 'Reference Data'. For details, there are also files containing the statistics on the size of the EAC-TM per language pair for the Forms Data and the Reference Data.

Language  No. of TUs  No. of words  No. of chars  No. of words per TU  No. of chars per TU
bg              1503         15797        110066             10±14.74             73±92.35
cs              1035          8598         62157              8±13.35             60±87.14
da              1202         11588         80094              9±18.27             66±89.89
de              2193         22849        181676             10±16.61             82±95.05
el              1411         15098        110984             10±16.15             78±95.92
en              2512         29217        191728             11±15.78             76±95.29
es              2193         25954        171104             11±15.62             78±95.42
et              1137          8071         67376              7±15.21             59±94.15
fi               734          4490         41889              6±14.94             57±93.38
fr              2212         25711        175480             11±14.94             79±94.21
hu              1784         17966        142691             10±14.96             79±97.10
is              1188         12317         82957             10±14.88             69±96.58
it               881          8894         60987             10±14.85             69±96.62
lt               918          6413         52687              6±14.66             57±95.80
lv              1443         11272         92750              7±14.39             64±94.72
mt               605          4602         39170              7±14.29             64±94.48
nb               642          4925         36391              7±14.20             56±93.93
nl               642          5877         41562              9±14.15             64±93.71
pl              1478         14649        118217              9±14.07             79±94.08
pt              1434         16418        110630             11±14.09             77±94.14
ro               970         10444         72151             10±14.10             74±94.33
sk               643          5120         37449              7±14.03             58±93.96
sl              2061         19773        142018              9±13.90             68±93.30
sv               901          7734         57047              8±13.84             63±93.03
tr               918          6386         52180              6±13.74             56±92.65
ALL           32,640       320,163     2,331,441

Table 1. Size of EAC's Translation Memory 'Forms Data', expressed as the total number of translation units (TU) per language for each of the 25 languages (22 out of the 23 official EU languages plus Icelandic, Norwegian Bokmål and Turkish).

Language  No. of TUs  No. of words  No. of chars  No. of words per TU  No. of chars per TU
bg              2558         14416        105825               5±9.15             41±64.32
cs              2316         11146         84596               4±7.78             36±55.22
da              2555         12522         98937               4±8.12             38±57.75
de              2280          9482         84592               4±7.73             37±56.24
el              1407          7159         55225               5±7.66             39±55.82
en              2642         15497        111791               5±8.33             42±59.27
es              2110         11302         79744               5±8.22             37±58.03
et              1133          3897         33883               3±8.03             29±57.03
fi               724          2115         19391               2±7.92             26±56.61
fr              2264         12135         88323               5±7.86             39±55.81
hr               573          1894         14497               3±7.82             25±55.60
hu              1671          6430         54652               3±7.66             32±54.81
is              1018          3969         29474               3±7.61             28±54.40
it              1289          7112         73186               5±7.62             56±56.17
lt              2468         11691         97009               4±7.60             39±56.59
lv              2437         10125         85194               4±7.44             34±55.60
mt              1117          4606         38729               4±7.38             34±55.36
nl              1163          5118         41171               4±7.36             35±55.26
no               523          1809         13297               3±7.35             25±55.15
pl              2549         14074        114591               5±7.47             44±56.37
pt              2067         10705         76179               5±7.45             36±55.94
ro              2189         10390         77375               4±7.40             35±55.42
sk              2329         11065         86093               4±7.34             36±54.97
sl              2583         13345        103353               5±7.40             40±55.30
sv              2008          8455         69896               4±7.35             34±54.89
tr              2280         10126         80104               4±7.27             35±54.38
ALL           45,973       220,459     1,737,003

Table 2. Size of EAC's Translation Memory 'Reference Data', expressed as the total number of translation units (TU) per language for each of the 26 languages (22 out of the 23 official EU languages plus Croatian, Icelandic, Norwegian and Turkish).

Conditions for Use

The Commission's copyright notice applies.

Further Translation Memories available here

The public release of the EAC Translation Memory follows the release of various other multilingual resources via the JRC's website. These include:
- the JRC-Acquis parallel corpus, since 2006 (22 languages);
- the DGT-Translation Memory (DGT-TM), since 2007 (22 languages);
- the JRC-Names multilingual and multi-script name variant list and related software, since 2011;
- the JRC Eurovoc Indexer (JEX) multilingual document categorisation software, since 2012 (22 languages);
- the ECDC-Translation Memory (ECDC-TM), since 2012 (25 languages).

For details and other, smaller linguistic resources, see the JRC-Resources page.
Further multilingual linguistic resources will be made available in the future. We also hope to make updates of the currently existing resources available.

Download the EAC Translation Memory

The distribution of the EAC Translation Memory consists of a single zip file (EAC-TM-all.zip), which can be downloaded by clicking on the link below. The zip file contains:
- two TMX files (EAC_FORMS.tmx and EAC_REFERENCE_DATA.tmx) containing the English sentences and their translations into up to 25 other languages;
- the DTD file, which should be kept in the same directory;
- two PDF files with the statistics on the corpora;
- the Java utility CreateLanguagePair.jar, which allows you to extract a TMX file containing only a single language pair.

The language codes used are those defined by the ISO 639-1 standard.

EAC-TM (August 2012)
Download: EAC-TM-all.zip
Size: 3.5 MB

Referring to this resource

When referring to the EAC Translation Memory (EAC-TM) in publications, please use the following reference:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.

Acknowledgements and Contact

For more information, you can contact the following persons:

Directorate General for Education and Culture
Mr Marek Przybyszewski (Email address format: Firstname[dot]lastname[at]ec[dot]europa[dot]eu)
European Commission - Directorate General for Education and Culture (DG EAC)
Brussels, Belgium
URL: http://ec.europa.eu/dgs/education_culture/

Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname[dot]Lastname[at]jrc[dot]ec[dot]europa[dot]eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)

The EAC Translation Memory was provided by the EC's Directorate General for Education and Culture (DG EAC). The original files, one for each of the 25 language pairs with English, were cleaned and combined into one by Mohamed Ebrahim from the European Commission's Joint Research Centre (JRC).

The Directorate General for Education and Culture (DG EAC) is a directorate of the European Commission that aims to reinforce and promote lifelong learning, on the one hand through policy cooperation with EU Member States and on the other hand through the implementation of the Lifelong Learning Programme. For details, read the mission of DG EAC.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. It has contributed to the dissemination of the DGT Translation Memory and has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from about 3000 news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. EMM not only displays the news articles but also groups related articles, classifies the articles into hundreds of news categories and displays automatically extracted meta-information together with the news items. As a result, EMM has many users from around the world, with up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools.
The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring.

JRC's publicly accessible media monitoring applications are:
- NewsBrief: breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
- MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
- NewsExplorer: summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.