What is the DGT-Acquis?

The DGT-Acquis is a family of several multilingual parallel corpora extracted from the Official Journal of the European Union (OJ) in Formex 4 (XML) format, consisting of documents from the middle of 2004 to the end of 2011 in up to 23 languages.

The following OJ series are included:

Year    L (Legislation)     C (Information and notices)
2004    L 2004              C 2004, CA 2004, CE 2004
2005    L 2005, LM 2005     C 2005, CA 2005, CE 2005
2006    L 2006, LM 2006     C 2006, CA 2006, CE 2006
2007    L 2007, LM 2007     C 2007, CA 2007, CE 2007
2008    L 2008, LM 2008     C 2008, CA 2008, CE 2008
2009    L 2009, LM 2009     C 2009, CA 2009, CE 2009
2010    L 2010, LM 2010     C 2010, CA 2010, CE 2010
2011    L 2011, LM 2011     C 2011, CA 2011, CE 2011

Description of the Data - Data Format

The original data of the OJ has been processed in several steps. In each step, the result of the previous step was refined to a finer granularity: (1) original data, (2) file level in Formex 4 format, (3) file level in plain text and (4) paragraph level. The result of each step is a corpus packaged as a self-contained Multilingual Dataset Format (muset) file. Even though the musets are independent, they are linked to each other so that, for example, one can find the source document of any given text segment. Data users can choose the data with the most appropriate processing level for their own needs.

The table in the next section (Statistics on the corpus) describes the data and provides some statistics. The original data (da1-ox) includes both the XML and the TIFF files. This opens the option to make use of the data for other types of applications (e.g. to work on optical character recognition, and more).
The original data also allows users to re-process the whole data set using their own tools and methods. The file level formats (da1-fx in Formex 4 format and da1-ft in plain text format) are relevant for users who need access to the full texts, e.g. to analyse the discourse structure, to consider the context of each sentence, etc. The paragraph level format (da1-pc) is relevant for people who do not need access to the full text, but who are mostly interested in smaller segments and their translations, e.g. to produce dictionaries or to work on (machine) translation. Unfortunately, at this time, we cannot provide any statistics on this data and we cannot provide more information on how the data was produced.

Statistics on the corpus

ID      Title                                  Granularity  Format       Structure  Zipped  Statistics          Comments
da1-ox  Original data                          original     formex4      tree       81 GB   3,901,048 files     original filenames; with TIFF files
da1-fx  File level in Formex4                  file         formex4      tree       9 GB    3,537,876 files     standardised filenames; without TIFF files
da1-ft  File level in plain text format        file         text         tree       5 GB    3,537,872 files     XML marking removed
da1-pc  Paragraph level in column-file format  paragraph    column-file  table      3 GB    4,900,254 segments  one table

ID: muset identity, for example da1-ox. It is composed of:
Title: The title of the Multilingual Dataset Format (muset). Using these links one can download individual language files.
Granularity: one of original, file, paragraph, sentence or sub-sentence. Details can be found in sections 10.4.5 and 10.5.4 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
Format:
  formex4: Formex 4 (XML) format.
  text: plain text.
  column-file: each column in the table is in one file. The file a1.txt[.zip] contains the filenames of the segment provenance.
Structure:
  tree: tree of directories and files; the data is in the original context.
  table: one table with all the data; the data is out of context. Details can be found in section 10.6 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
Zipped: Size of the muset zipped into one file; available on request from the DGT contact person mentioned at the end of this page.
Statistics: Main statistics, such as the number of files or segments.
Comments: General comments.

What is the difference between DGT-Acquis, JRC-Acquis and DGT-TM?

There is no simple answer to that question. For a detailed answer, you can read the following article:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0

Both JRC-Acquis and DGT-Acquis are paragraph-aligned parallel corpora, i.e. corpora consisting of full text documents with added meta-information on which paragraphs are aligned with which others in the other languages. Since the JRC-Acquis contains data from the 1950s up to the year 2006 and the DGT-Acquis contains data starting in 2004, the two corpora do not overlap for data up to 2003 or for data from 2007 onwards. There will be some overlap for the data covering the years 2004 to 2006. If you need to avoid overlapping document sets of both sources, try using the Eur-Lex document identifiers. The processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.

DGT-Acquis and the translation memory DGT-TM are of a different nature. While the DGT-Acquis parallel corpus contains full documents with additional segmentation information, DGT-TM is a translation memory, i.e. a collection of Translation Units (sentences and the like).
In parallel corpora, one can thus see each sentence in its context, while in translation memories, each sentence is in isolation, i.e. out of context. As for their overlap, DGT-TM is based exclusively on the L-Series of the Official Journal, while DGT-Acquis also contains the LM, C, CA and CE collections (see the table of documents included in DGT-Acquis, on this page under "What is the DGT-Acquis?"). Again, the processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.

Conditions for Use

I. Intellectual property and conditions of use of data

The DGT-Acquis data is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.

II. Conditions for use of software

The DGT-Acquis data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use.
The Commission does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said data and software. The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

Download the DGT-Acquis

There are three downloading options (see Section Description of the Data - Data Format above for details):

Full corpus (da1-ox and da1-fx): Please contact the DGT contact person mentioned at the end of the page if you want to receive this data.
These packages are very big (several GBs; see the table in Section Statistics on the corpus above).

One language per collection per zip file for "file level in plain text format" (da1-ft): You can browse and select the DGT-Acquis files you are interested in.

One language per zip file for "paragraph level in column-file format" (da1-pc): You can download these files by clicking on the links in the table below.

Paragraph level (da1-pc)  Size
data.a1.txt.zip           1 MB
data.bg.txt.zip           109 MB
data.cs.txt.zip           134 MB
data.da.txt.zip           128 MB
data.de.txt.zip           140 MB
data.el.txt.zip           180 MB
data.en.txt.zip           144 MB
data.es.txt.zip           136 MB
data.et.txt.zip           116 MB
data.fi.txt.zip           136 MB
data.fr.txt.zip           142 MB
data.ga.txt.zip           5 MB
data.hu.txt.zip           133 MB
data.it.txt.zip           137 MB
data.lt.txt.zip           125 MB
data.lv.txt.zip           125 MB
data.mt.txt.zip           106 MB
data.nl.txt.zip           134 MB
data.pl.txt.zip           137 MB
data.pt.txt.zip           138 MB
data.ro.txt.zip           92 MB
data.sk.txt.zip           136 MB
data.sl.txt.zip           123 MB
data.sv.txt.zip           124 MB
Total size                3 GB

How to produce bilingual extractions

Download the required zipped files listed above and unzip them. Each file contains one language. The file data.a1.txt.zip contains the filenames indicating the source of each segment. Here is a Unix example that produces a bilingual file with English and French content, separated by the character '|', and drops every line where the segment is empty in either language:

    paste -d'|' data.en.txt data.fr.txt | sed '/^|/d ; /|$/d' > bilang.txt

Acknowledgement and contact

For more information, you can contact the following persons:

Directorate-General for Translation (DGT)
M.T. Carrasco Benitez (Email address: manuel[dot]carrasco-benitez[at]ec[dot]europa[dot]eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
L-2920 Luxembourg
More information on DGT.
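As a complement to the Unix example under "How to produce bilingual extractions", the following is a minimal sketch of how the provenance column could be kept next to each segment pair. It rests on assumptions not stated on this page: that data.a1.txt is line-aligned with the language files (one entry per segment), and that no segment contains a tab character. The function name extract_pair, the output name bilang.tsv and the sample contents are hypothetical and only serve to make the sketch runnable on its own.

```shell
#!/bin/sh
# Hypothetical helper, not part of the DGT-Acquis distribution.
# Joins the provenance column with two language columns (tab-separated)
# and keeps only the rows where both language segments are non-empty.
# Assumes the three files are line-aligned and contain no tab characters.
extract_pair() {
    # $1 = provenance file, $2 = first language file, $3 = second language file
    paste "$1" "$2" "$3" | awk -F'\t' '$2 != "" && $3 != ""'
}

# Tiny made-up sample standing in for the unzipped data.a1.txt,
# data.en.txt and data.fr.txt files.
workdir=$(mktemp -d)
printf 'doc1.xml\ndoc1.xml\ndoc2.xml\n' > "$workdir/data.a1.txt"
printf 'Article 1\n\nAnnex\n' > "$workdir/data.en.txt"
printf 'Article premier\nVide\nAnnexe\n' > "$workdir/data.fr.txt"

extract_pair "$workdir/data.a1.txt" "$workdir/data.en.txt" "$workdir/data.fr.txt" \
    > "$workdir/bilang.tsv"
cat "$workdir/bilang.tsv"
```

A tab is used as the separator here instead of '|' only because a tab seems less likely to occur inside a segment; if the data does contain tabs, another separator would have to be chosen with paste -d.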
Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname[dot]Lastname[at]jrc[dot]ec[dot]europa[dot]eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)

When making reference to the DGT-Acquis in scientific publications, please quote the following paper:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0

The Directorate-General for Translation (DGT) is one of the biggest translation services in the world. It is also the largest single department in the European Commission with a total number of around 2500 staff members and a total production of some 2 million pages a year. Various computer tools are available to translators, who use them according to their translation needs and personal preferences. Irrespective of their preferred working methods, all translators need the possibility to reuse previously translated texts (translation memories, electronic archives, etc.). To perform its tasks, DG Translation has a wide variety of language resources at the disposal of its staff: terminology in many different forms (multilingual libraries, terminology databases, electronic dictionaries, etc.); translation memories enabling genuine data sharing; texts as such to be retrieved from internal archiving systems and other sources; and machine translation, which, at the European Commission, is used both as a browsing tool to view the gist of a text and as a genuine translation aid.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications.
The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from thousands of news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies the articles into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring.

JRC's four publicly accessible media monitoring applications are:

NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.

MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.

NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; quotation recognition; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.