What are the Acquis Communautaire and the JRC-Acquis?

The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. This collection of legislative texts changes continuously and currently comprises selected texts written between the 1950s and now. As of the beginning of the year 2007, the EU had 27 Member States and 23 official languages. The Acquis Communautaire texts exist in these languages, although Irish translations are not currently available. The Acquis Communautaire is thus a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian and Swedish.

The data release by the JRC is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

The JRC did not receive an authoritative list of documents that belong to the Acquis Communautaire. In order to compile the document collection distributed here, we selected all those CELEX documents (see below) that were available in at least ten of the twenty EU-25 languages (the official languages of the EU before Bulgaria and Romania joined in 2007) and that additionally existed in at least three of the nine languages that became official languages with the Enlargement of the EU in 2004 (i.e. Czech, Estonian, Hungarian, Lithuanian, Latvian, Maltese, Polish, Slovak and Slovenian). The collection distributed here is thus an approximation of the Acquis Communautaire, which we call the JRC-Acquis. The JRC-Acquis must not be seen as a legal reference corpus. Instead, the purpose of the JRC-Acquis is to provide a large parallel corpus of documents for (computational) linguistics research purposes.
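The selection rule described above can be sketched as a simple set-based filter. This is only an illustration, not the JRC's actual tooling: the function name `qualifies`, the document identifiers and the language availability data are made up, and only the two language lists and the thresholds (ten of twenty, three of nine) come from the text.

```python
# Sketch of the document selection rule; not the JRC's actual code.
# The language sets and thresholds follow the description in the text.

# The twenty official languages of the EU-25 (i.e. without bg and ro).
EU25_LANGS = {
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
}

# The nine languages that became official with the 2004 Enlargement.
NEW_2004_LANGS = {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"}


def qualifies(available_langs, min_eu25=10, min_new=3):
    """Return True if a document meets both availability thresholds."""
    langs = set(available_langs)
    enough_eu25 = len(langs & EU25_LANGS) >= min_eu25
    enough_new = len(langs & NEW_2004_LANGS) >= min_new
    return enough_eu25 and enough_new


# Hypothetical documents with made-up language availability.
docs = {
    # 13 EU-25 languages, 3 of them new: selected.
    "doc-A": {"da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt",
              "cs", "pl", "hu"},
    # 12 EU-25 languages, but only 1 new language: rejected.
    "doc-B": {"da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt",
              "sv", "cs"},
    # All 9 new languages, but only 9 EU-25 languages in total: rejected.
    "doc-C": {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"},
}

selected = sorted(doc_id for doc_id, langs in docs.items() if qualifies(langs))
print(selected)
```

Note that the nine new languages are themselves part of the EU-25 set, so a document available in several of them counts towards both thresholds.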
The linguistic research interest of the JRC-Acquis

Generally speaking, parallel corpora are useful for all types of cross-lingual research. The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora exist in abundance for some language pairs, there are few or none for most others. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, if we take into consideration both its size and the large number of languages involved. The most outstanding advantage of the Acquis Communautaire, apart from being freely available, is the number of rare language pair combinations (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).

The AC and other Community legislation are publicly available on the European Commission's web sites. The Optima group of the Joint Research Centre (JRC) in Ispra, Italy, has attempted to identify the documents that are part of the AC, has downloaded them and converted them to XML format. The Bulgarian and Romanian documents were processed by the Romanian Academy of Sciences. In further processing steps, the texts were cleaned of their footers and annexes, and they were sentence-aligned twice: once using Vanilla and once using HunAlign. Instead of using a single pivot language, all 231 possible language pair combinations were aligned individually. This is useful due to the n-to-n relationship between aligned sentences, which often differs depending on the language pair involved.

For some of the documents, only preliminary translations were available. For the online texts in some of the languages, only the title has been translated, while the text displayed is English. An automatic language recognition tool was therefore used to filter out those texts that are presented as being in one language but are actually in English. No manual check was carried out.
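The figure of 231 language pairs follows from choosing 2 of the 22 corpus languages without regard to order. A minimal sketch to check the count and list the pairs is given below; the variable names are illustrative, and the ISO codes are those of the 22 languages named in the text.

```python
import math
from itertools import combinations

# ISO codes of the 22 languages of the corpus.
LANGS = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# Every unordered pair of distinct languages, each pair in alphabetical order.
pairs = list(combinations(sorted(LANGS), 2))

print(len(pairs))            # number of language pairs
print(math.comb(22, 2))      # same count from the binomial coefficient

# Rare pairs mentioned in the text are part of the enumeration.
print(("et", "mt") in pairs, ("fi", "sl") in pairs)

# Each language takes part in 21 pairs, e.g. Maltese:
maltese_pairs = [p for p in pairs if "mt" in p]
print(len(maltese_pairs))
```

With a single pivot language, only 21 pairs would have to be aligned; aligning all 231 pairs individually is what makes direct alignments for pairs such as Maltese-Estonian available.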
The Publications Office of the European Union (OP) manages the distribution rights of this aligned multilingual parallel corpus. OP agreed that the corpus can be given to research partners for non-commercial use. See the section on licensing issues, below.

Statistics for version 3.0 of the JRC-Acquis corpus

The JRC-Acquis corpus (version 3.0) is currently available in 22 languages with the following distribution. Columns: Language (ISO code); Texts (number of texts); Body words, Body chars and Avg words (total number of words, total number of characters and average number of words per text in the text body); Sign. words (total number of words in the signatures); Annex words (total number of words in the annexes); Total words (text body + signatures + annexes).

Language             Texts   Body words     Body chars Avg words Sign. words  Annex words    Total words
bg                   11384     16140819      104522671   1417.85     2170075     14114612       32425506
cs                   21438     22843279      148972981   1065.55     7225300     16763733       46832312
da                   23624     31459627      213468135   1331.68     2629786     16855213       50944626
de                   23541     32059892      232748675   1361.87     2542149     16327611       50929652
el                   23184     36453749      239583543   1572.37     2973574     16459680       55887003
en                   23545     34588383      210692059   1469.03     3198766     17750761       55537910
es                   23573     38926161      238016756    1651.3     3490204     19716243       62132608
et                   23541     24621625      192700704    1045.9     1336051     14995748       40953424
fi                   23284     24883012      212178964   1068.67     2677798     12547171       40107981
fr                   23627     39100499      234758290   1654.91     3021013     19978920       62100432
hu                   22801     28602380      213804614   1254.44     2529488     15056496       46188364
it                   23472     35764670      230677013   1523.72     3120797     18331535       57217002
lt                   23379     26937773      199438258   1152.22     2436585     15018484       44392842
lv                   22906     27592514      196452051    1204.6     1673124     15437969       44703607
mt                   10545     20926909      128906748   1984.53     1336042     15620611       37883562
nl                   23564     35265161      231963539   1496.57     3039580     18467115       56771856
pl                   23478     29713003      214464026   1265.57     2513141     17027393       49253537
pt                   23505     37221668      227499418   1583.56     3034308     19350227       59606203
ro                    6573      9186947       60537301   1397.68      514296     11185842       20887085
ro-19211 (readme)    19211     30832212      182631277   1604.92         ---          ---       30832212
sk                   21943     26792637      179920434   1221.01     3227852     16190546       46211035
sl                   20642     27702305      178651767   1342.04     3103193     16837717       47643215
sv                   20243     29433037      199004401   1453.99     2575771     14965384       46974192
Total              463,792  636,216,050  4,288,962,348   1387.23  60,368,893  358,999,011  1,055,583,954

Size of version 3.0 of the JRC collection of the Acquis Communautaire in 22 of the official languages of the European Union. Numbers are given separately for the text body (the main text), the signatures and the annexes.

Statistics on the alignment with Vanilla: total of 4,350,447 aligned documents (all language pairs); total of 243,187,303 links (all language pairs); average of 18,833 aligned documents per language pair; average of 1,052,759 links per language pair (average of all 231 language pairs); average of 85.43% of one-to-one links.

What is the difference between the JRC-Acquis and the other EU corpora?

JRC-Acquis, DGT-Acquis and DCEP are corpora consisting of full texts with additional information on which sentences are aligned with which others, while the Translation Memories DGT-TM, EAC-TM and ECDC-TM are collections of translation units (mostly sentences), from which the full text cannot be reproduced. Some of the resources overlap, while others are entirely different. JRC-Acquis documents are additionally accompanied by information on the manually assigned Eurovoc subject domain classes, so that the JRC-Acquis can also be used to train automatic multi-label classification software. For details and background information on each of the multilingual resources, read the overview article An overview of the European Union's highly multilingual parallel corpora.

Usage conditions / Licensing issues

I. Intellectual property and conditions of use of data

The JRC-Acquis data is the exclusive property of the European Commission.
The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42. Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data. The following usage conditions must be adhered to: The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series as well as charters and treaties and ECJ case-law to be in the public domain. Prior written permission is thus not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgement is given to the European Communities and to the source, and provided that the additional guidelines set out below are respected. Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read: 'Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.' For the reasons stated in the disclaimer above, it is advisable to ensure that translations are made from the printed, authentic version of the Official Journal. 
This precaution, while minimizing the risk of error, does not confer any legal status whatsoever to the translated text. The following notice shall accompany the translated text, printed below the acknowledgement: 'Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder].' Moreover, please note that we do not consider a "further commercial dissemination" the inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/thesis/studies/reports/books issued by third-party authors or publishers, whatever the means, and disseminated subject to payment.

II. Conditions for use of software

The JRC-Acquis data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said data and software. The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software.
Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

Eurovoc thesaurus

Unlike the AC corpus, the Eurovoc thesaurus must not be used or disseminated without prior written permission from the OP. If you want to get the rights to use Eurovoc and to receive a copy of the multilingual thesaurus, please contact OP at OP-INFO-COPYRIGHT[at]publications[dot]europa[dot]eu, mentioning the file reference number 2005-COP-395. To our knowledge, the licence is free of charge for research purposes. For a commercial licence, please contact OP.

Download the JRC-Acquis corpus

AC Corpus - version 3.0 (by language)
AC aligned corpus using Vanilla aligner
AC aligned corpus using HunAlign

By downloading these resources, you agree to the usage conditions.

Previous version: JRC-ACQUIS Multilingual Parallel Corpus, Version 2.2.
A history of changes regarding the preparation of this corpus is also available.

Acknowledgement / Reference publication

A description of the JRC-Acquis corpus (version 2.2) was published in the paper below. Please use this reference publication when referring to the JRC-Acquis.

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufis, Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006.

To compare JRC-Acquis with the other linguistic resources distributed by EU institutions, see:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.