
Introduction

In October 2012, the European Union's (EU) Directorate General for Education and Culture (DG EAC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, in twenty-six languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC). Here we describe this resource, which bears the name EAC Translation Memory, or EAC-TM for short.

Translation Memories are parallel texts, i.e. texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
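
The reuse idea can be illustrated with a minimal sketch of an exact-match lookup. It assumes the translation memory has already been loaded into a plain dictionary; the German targets, the `normalise` and `lookup` helpers and the matching rule (ignore case and extra whitespace) are our own illustrative choices and are not taken from EAC-TM or from any particular translation tool.

```python
from typing import Dict, Optional

# Hypothetical in-memory translation memory: English source -> German target.
# The entries are illustrative only and are not copied from EAC-TM.
translation_memory = {
    "Country": "Land",
    "Please specify your home country": "Bitte geben Sie Ihr Heimatland an",
}


def normalise(segment: str) -> str:
    """Collapse whitespace and ignore case so trivial variants still match."""
    return " ".join(segment.split()).casefold()


def build_index(memory: Dict[str, str]) -> Dict[str, str]:
    """Index the stored translation units by their normalised source text."""
    return {normalise(source): target for source, target in memory.items()}


def lookup(index: Dict[str, str], segment: str) -> Optional[str]:
    """Return the stored translation, or None if the segment is new."""
    return index.get(normalise(segment))


index = build_index(translation_memory)
print(lookup(index, "  please specify your HOME country "))  # already translated
print(lookup(index, "Date of birth"))                         # not in the memory
```

Real translation tools usually add fuzzy matching on top of such exact lookups; that is outside the scope of this sketch.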

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

  • training automatic systems for statistical machine translation (SMT);
  • producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
  • training and testing multilingual information extraction software;
  • checking translation consistency automatically;
  • testing and benchmarking alignment software (for sentences, words, etc.).

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. The most outstanding advantage of the various parallel corpora available via our web pages, apart from their being freely available, is the number of rare language pairs they cover (e.g. Maltese-Estonian, Slovenian-Finnish).

The EAC-TM is relatively small compared to the JRC-Acquis and to DGT-TM, but it has the advantage that it focuses on a very different domain, namely that of education and culture. Also, it includes translation units for the languages Croatian (HR), Icelandic (IS), Norwegian (Bokmål, NB or Norwegian, NO) and Turkish (TR).

Languages / File Format

EAC-TM covers up to 26 languages: 22 official languages of the EU (all except Irish) plus Icelandic, Croatian, Norwegian and Turkish. EAC-TM thus contains translations from English into the following 25 languages: Bulgarian, Czech, Danish, Dutch, Estonian, German, Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish and Turkish.

All documents and sentences were originally written in English (the source language) and then translated into the other languages. The texts were translated by staff of the National Agencies of the Lifelong Learning and Youth in Action programmes. These staff are typically professionals in the field of education/youth and EU programmes. They are thus not professional translators, but they are normally native speakers of the target language.

The documents are distributed in the widely used Translation Memory eXchange (TMX) format. They are encoded in the UTF-8 character set.
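
As a rough illustration of what reading such a file involves, the following sketch parses a tiny hand-written TMX sample with Python's standard library. The sample content is invented, and we assume the usual TMX layout in which each `tu` element holds one `tuv` element per language, with the language given in the `xml:lang` attribute (older files may use a plain `lang` attribute instead). Whether the EAC-TM files use lower-case or region-tagged codes is not stated here, so the sketch simply lower-cases whatever it finds.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; a real file would be read with ET.parse(path).getroot().
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Country</seg></tuv>
      <tuv xml:lang="de"><seg>Land</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_units(root):
    """Return one {language code: segment text} dict per translation unit."""
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang.lower()] = "".join(seg.itertext())
        units.append(unit)
    return units


# Parse from UTF-8 bytes, mirroring how the files are encoded on disk.
root = ET.fromstring(SAMPLE_TMX.encode("utf-8"))
units = read_units(root)
print(units)
```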

Text types / Domain

EAC-TM was built in the context of translating electronic forms: application and report forms for decentralised actions of EAC's Lifelong Learning Programme (LLP) and the Youth in Action Programme.

The contents in the electronic forms are technically split into two types: (a) the labels and contents of drop-down menus (referred to as 'Forms Data') and (b) checkboxes (referred to as 'Reference Data'). Due to the different types of data, the two collections are kept separate. For example, labels can be 'Country', 'Please specify your home country' etc., while examples for reference data are 'Germany', 'Basic/general programmes', 'Education and Culture' etc. To get an overview of the programmes managed by DG EAC, go to the website of DG EAC.

The data consists of translations carried out between the end of the year 2008 and July 2012.

Statistics for the EAC Translation Memory

The EAC Translation Memory is available in up to 26 languages: the English source language and its translations into up to 25 other languages. The 'Forms Data' cover 25 of these languages (Croatian is absent), while the 'Reference Data' cover all 26. The following tables show the coverage, expressed as the approximate total number of translation units available for each language, the number of words and the number of characters.

The first table describes the 'Forms Data'; the second table describes the 'Reference Data'. For details, there are also files containing the statistics on the size of the EAC-TM per language pair for the Forms Data and the Reference Data.

Language  No. of TUs  No. of words  No. of chars  No. of words per TU  No. of chars per TU
bg              1503         15797        110066             10±14.74             73±92.35
cs              1035          8598         62157              8±13.35             60±87.14
da              1202         11588         80094              9±18.27             66±89.89
de              2193         22849        181676             10±16.61             82±95.05
el              1411         15098        110984             10±16.15             78±95.92
en              2512         29217        191728             11±15.78             76±95.29
es              2193         25954        171104             11±15.62             78±95.42
et              1137          8071         67376              7±15.21             59±94.15
fi               734          4490         41889              6±14.94             57±93.38
fr              2212         25711        175480             11±14.94             79±94.21
hu              1784         17966        142691             10±14.96             79±97.10
is              1188         12317         82957             10±14.88             69±96.58
it               881          8894         60987             10±14.85             69±96.62
lt               918          6413         52687              6±14.66             57±95.80
lv              1443         11272         92750              7±14.39             64±94.72
mt               605          4602         39170              7±14.29             64±94.48
nb               642          4925         36391              7±14.20             56±93.93
nl               642          5877         41562              9±14.15             64±93.71
pl              1478         14649        118217              9±14.07             79±94.08
pt              1434         16418        110630             11±14.09             77±94.14
ro               970         10444         72151             10±14.10             74±94.33
sk               643          5120         37449              7±14.03             58±93.96
sl              2061         19773        142018              9±13.90             68±93.30
sv               901          7734         57047              8±13.84             63±93.03
tr               918          6386         52180              6±13.74             56±92.65
ALL           32,640       320,163     2,331,441

Table 1. Size of EAC's Translation Memory 'Forms Data', expressed as the total number of translation units (TU) per language for each of the 25 languages (22 out of the 23 official EU languages plus Icelandic, Norwegian Bokmål and Turkish).

Language  No. of TUs  No. of words  No. of chars  No. of words per TU  No. of chars per TU
bg              2558         14416        105825               5±9.15             41±64.32
cs              2316         11146         84596               4±7.78             36±55.22
da              2555         12522         98937               4±8.12             38±57.75
de              2280          9482         84592               4±7.73             37±56.24
el              1407          7159         55225               5±7.66             39±55.82
en              2642         15497        111791               5±8.33             42±59.27
es              2110         11302         79744               5±8.22             37±58.03
et              1133          3897         33883               3±8.03             29±57.03
fi               724          2115         19391               2±7.92             26±56.61
fr              2264         12135         88323               5±7.86             39±55.81
hr               573          1894         14497               3±7.82             25±55.60
hu              1671          6430         54652               3±7.66             32±54.81
is              1018          3969         29474               3±7.61             28±54.40
it              1289          7112         73186               5±7.62             56±56.17
lt              2468         11691         97009               4±7.60             39±56.59
lv              2437         10125         85194               4±7.44             34±55.60
mt              1117          4606         38729               4±7.38             34±55.36
nl              1163          5118         41171               4±7.36             35±55.26
no               523          1809         13297               3±7.35             25±55.15
pl              2549         14074        114591               5±7.47             44±56.37
pt              2067         10705         76179               5±7.45             36±55.94
ro              2189         10390         77375               4±7.40             35±55.42
sk              2329         11065         86093               4±7.34             36±54.97
sl              2583         13345        103353               5±7.40             40±55.30
sv              2008          8455         69896               4±7.35             34±54.89
tr              2280         10126         80104               4±7.27             35±54.38
ALL           45,973       220,459     1,737,003

Table 2. Size of EAC's Translation Memory 'Reference Data', expressed as the total number of translation units (TU) per language for each of the 26 languages (22 out of the 23 official EU languages plus Croatian, Icelandic, Norwegian and Turkish).
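
Figures of the kind shown in the two tables could be computed along the following lines. This is a sketch under assumptions and not the script that produced the tables: we assume that words are counted by splitting on whitespace, that characters are counted including spaces, and that the "±" columns give a mean followed by a standard deviation (the page does not say so explicitly, nor whether a population or sample deviation was used). The sample segments are invented.

```python
from statistics import mean, pstdev

# Hypothetical sample data: segments per language code (not taken from EAC-TM).
segments = {
    "en": ["Country", "Please specify your home country", "Education and Culture"],
    "de": ["Land", "Bitte geben Sie Ihr Heimatland an"],
}


def describe(texts):
    """Count TUs, words and characters, plus per-TU mean and spread."""
    words = [len(text.split()) for text in texts]  # whitespace tokenisation (assumption)
    chars = [len(text) for text in texts]          # characters including spaces (assumption)
    return {
        "tus": len(texts),
        "words": sum(words),
        "chars": sum(chars),
        # Population standard deviation is an arbitrary choice for this sketch.
        "words_per_tu": (mean(words), pstdev(words)),
        "chars_per_tu": (mean(chars), pstdev(chars)),
    }


for lang, texts in sorted(segments.items()):
    stats = describe(texts)
    w_mean, w_sd = stats["words_per_tu"]
    c_mean, c_sd = stats["chars_per_tu"]
    print(f"{lang}  {stats['tus']}  {stats['words']}  {stats['chars']}  "
          f"{w_mean:.0f}±{w_sd:.2f}  {c_mean:.0f}±{c_sd:.2f}")
```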

Conditions for Use

The Commission's copyright notice applies.

Further Translation Memories available here

The public release of the EAC-Translation Memory follows the release of various other multilingual resources via the JRC's website.

These include the JRC-Acquis parallel corpus since 2006 (22 languages); the DGT-Translation Memory (DGT-TM) since 2007 (22 languages); the JRC-Names multilingual and multi-script name variant list and related software (since 2011); the JRC Eurovoc Indexer (JEX) multilingual document categorisation software (22 languages) since 2012; and the ECDC-Translation Memory (ECDC-TM) since 2012 (25 languages). For details and other, smaller linguistic resources, see the JRC-Resources page.

Further multilingual linguistic resources will be made available in the future. We also hope to make updates of the currently existing resources available.

Download the EAC Translation Memory

The distribution of the EAC Translation Memory consists of a single zip file (EAC-TM-all.zip), which can be downloaded by clicking on the link below. The zip file contains: two TMX files (EAC_FORMS.tmx and EAC_REFERENCE_DATA.tmx) with the English sentences and their translations into up to 25 other languages; the DTD file, which should be kept in the same directory; two PDF files with the statistics on the corpora; and the Java utility CreateLanguagePair.jar, which allows you to extract a TMX file containing a single language pair. The language codes used are those defined by the ISO 639-1 standard.
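
For readers who would rather not run the Java utility, the idea of cutting a multilingual TMX down to one language pair can be sketched in Python. This is an illustration only and not the implementation of CreateLanguagePair.jar: the function names and the sample data are ours, and we assume that each `tuv` element carries its language in the `xml:lang` attribute (or in a plain `lang` attribute) and that translation units lacking either language should be dropped.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; with the real data one would use ET.parse(path).getroot().
SAMPLE = b"""<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Country</seg></tuv>
    <tuv xml:lang="de"><seg>Land</seg></tuv>
    <tuv xml:lang="fr"><seg>Pays</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Germany</seg></tuv>
    <tuv xml:lang="fr"><seg>Allemagne</seg></tuv></tu>
</body></tmx>"""


def primary_lang(tuv):
    """Lower-cased primary language subtag, so 'EN-GB' is treated as 'en'."""
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.lower().split("-")[0]


def extract_pair(root, source, target):
    """Return a copy of the TMX tree restricted to one language pair."""
    wanted = {source.lower(), target.lower()}
    result = copy.deepcopy(root)  # leave the caller's tree untouched
    body = result.find("body")
    for tu in list(body.findall("tu")):
        for tuv in list(tu.findall("tuv")):
            if primary_lang(tuv) not in wanted:
                tu.remove(tuv)
        present = {primary_lang(tuv) for tuv in tu.findall("tuv")}
        if present != wanted:
            body.remove(tu)  # drop units that lack one side of the pair
    return result


pair = extract_pair(ET.fromstring(SAMPLE), "en", "de")
print(ET.tostring(pair, encoding="unicode"))
```

A result built this way can be saved with `ET.ElementTree(pair).write(path, encoding="utf-8", xml_declaration=True)`. Note that ElementTree does not carry over a DOCTYPE declaration from the input, so a file written this way may need one added by hand if a tool insists on validating against the DTD.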

EAC-TM (August 2012)  Download size
EAC-TM-all.zip        3.5 MB

Referring to this resource

When referring to the EAC Translation Memory (EAC-TM) in publications, please use the following reference:

Acknowledgement and Contact

For more information, you can contact the following persons:

Directorate General for Education and Culture

Mr Marek Przybyszewski (Email address format: Firstname.lastname@ec.europa.eu)

European Commission - Directorate General for Education and Culture (DG EAC)

Brussels, Belgium

URL: http://ec.europa.eu/dgs/education_culture/

Joint Research Centre (JRC)

Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)

IPSC - GlobeSec - OPTIMA

Via E. Fermi 2749, T.P. 267

I-21027 Ispra (VA)

The EAC Translation Memory was offered by the EC's Directorate General for Education and Culture (DG EAC). The original files, one for each of the 25 language pairs with English, were cleaned and combined into one by Mohamed Ebrahim from the European Commission's Joint Research Centre (JRC).

The Directorate General for Education and Culture (DG EAC) is a directorate of the European Commission which has the aim of reinforcing and promoting lifelong learning through policy cooperation with EU Member States on the one hand and through the implementation of the Lifelong Learning Programme on the other hand. For details, read the mission of DG EAC.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from about 3000 news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:

  • NewsBrief: breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
  • MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
  • NewsExplorer: summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; recognition of quotations by and about people; individual, daily-updated pages for over one million names; automatic generation of social networks.