- What is the DGT-Acquis?
- Description of the Data - Data Format
- Statistics on the corpus
- What is the difference with DGT-Acquis, JRC-Acquis and DGT-TM
- Conditions for Use
- Download the DGT-Acquis
- How to produce bilingual extractions
- Acknowledgement and contact
What is the DGT-Acquis?
The DGT-Acquis is a family of multilingual parallel corpora extracted from the Official Journal of the European Union (OJ) in Formex 4 (XML) format. It consists of documents from the middle of 2004 to the end of 2011 in up to 23 languages.
The following OJ series are included:
Year | L | LM | C | CA | CE |
---|---|---|---|---|---|
2004 | L 2004 | | C 2004 | CA 2004 | CE 2004 |
2005 | L 2005 | LM 2005 | C 2005 | CA 2005 | CE 2005 |
2006 | L 2006 | LM 2006 | C 2006 | CA 2006 | CE 2006 |
2007 | L 2007 | LM 2007 | C 2007 | CA 2007 | CE 2007 |
2008 | L 2008 | LM 2008 | C 2008 | CA 2008 | CE 2008 |
2009 | L 2009 | LM 2009 | C 2009 | CA 2009 | CE 2009 |
2010 | L 2010 | LM 2010 | C 2010 | CA 2010 | CE 2010 |
2011 | L 2011 | LM 2011 | C 2011 | CA 2011 | CE 2011 |
Description of the Data - Data Format
The original data of the OJ has been processed in several steps. In each step, the result of the previous step was refined to a finer granularity: (1) original data, (2) file level in Formex4 format, (3) file level in plain text and (4) paragraph level.
The result of each step is a corpus packaged as a self-contained Multilingual Dataset Format (muset) file. Even though the musets are independent, they are linked to each other so that, for example, one can find the source document of any given text segment. Data users can choose the data with the most appropriate processing level for their own needs.
The table in the next section (Statistics on the corpus) describes the data and provides some statistics.
The original data (da1-ox) includes both the XML and the TIFF files. This makes the data usable for other types of applications (e.g. work on optical character recognition). The original data also allows users to re-process the whole data set using their own tools and methods.
The file level formats (da1-fx in Formex 4 format and da1-ft in plain text format) are relevant for users who need access to the full texts, e.g. to analyse the discourse structure, to consider the context of each sentence, etc.
The paragraph level format (da1-pc) is relevant for people who do not need access to the full text, but who are mostly interested in smaller segments and their translations, e.g. to produce dictionaries or to work on (machine) translation.
Unfortunately, at this time, we cannot provide any statistics on this data and we cannot provide more information on how the data was produced.
Statistics on the corpus
ID | Title | Granularity | Format | Structure | Zipped | Statistics | Comments |
---|---|---|---|---|---|---|---|
da1-ox | Original data | original | formex4 | tree | 81 GB | 3,901,048 files | original filenames; with TIFF files |
da1-fx | File level in Formex4 | file | formex4 | tree | 9 GB | 3,537,876 files | standardised filenames; without TIFF files |
da1-ft | File level in plain text format | file | text | tree | 5 GB | 3,537,872 files | XML marking removed |
da1-pc | Paragraph level in column-file format | paragraph | column-file | table | 3 GB | 4,900,254 segments | one table |
- ID. muset identity. Example: da1-ox. It is composed of:
- Title. The title of the Multilingual Dataset Format (muset). Using these links one can download individual language files.
- Granularity. One of: original, file, paragraph, sentence, sub-sentence. Details on this can be found in sections 10.4.5 and 10.5.4 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
- Format.
  - formex4: Formex 4 (XML) format.
  - text: plain text.
  - column-file: each column in the table is in one file. The file a1.txt[.zip] contains the filenames of the segment provenance.
- Structure.
  - tree: tree of directories and files; the data is in the original context.
  - table: one table with all the data; the data is out of context (details in section 10.6 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008)).
- Zipped. Size of the muset zipped into one file; available on request from the DGT contact person mentioned at the end of this page.
- Statistics. Main statistics, such as the number of files or segments.
- Comments. General comments.
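In the column-file format, each column of the table is stored in its own file, which suggests that line N of every column file refers to the same segment. Under that assumption, a useful sanity check before combining columns is to confirm that all column files have the same number of lines. The helper below is a hypothetical sketch (the function name is invented for illustration) and is not part of the distributed software.

```shell
# Sketch: check that several column files have the same number of lines.
# Assumption: one segment per line, and all column files are line-aligned.
same_line_count() {
    first=""
    for f in "$@"; do
        # wc -l may pad its output with spaces on some systems; strip them.
        n=$(wc -l < "$f" | tr -d ' ')
        if [ -z "$first" ]; then
            first=$n
        elif [ "$n" != "$first" ]; then
            echo "mismatch: $f has $n lines, expected $first"
            return 1
        fi
    done
    echo "ok $first"
}
```

It could be called as `same_line_count data.a1.txt data.en.txt data.fr.txt`, using the unzipped file names that appear in the extraction example on this page.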
What is the difference with DGT-Acquis, JRC-Acquis and DGT-TM
There is no simple answer to that question. For a detailed answer, you can read the following article:
- Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0
Both JRC-Acquis and DGT-Acquis are paragraph-aligned parallel corpora, i.e. corpora consisting of full-text documents with added meta-information on which paragraphs are aligned with which others in the other languages. Since the JRC-Acquis contains data from the 1950s up to the year 2006 and DGT-Acquis contains data starting in 2004, there will be some overlap for the data covering the years 2004 to 2006; there is no overlap for data up to 2003 or from 2007 onwards. If you need to avoid overlapping document sets from both sources, try using the Eur-Lex document identifiers. The processing steps (data preparation and alignment) used to produce the two data sets were entirely different. The format is not the same, and the quality of the two resources is expected to differ as well.
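As a rough illustration of filtering by identifier, the sketch below assumes you have already collected the document identifiers of each corpus into two plain-text lists, one identifier per line. How the identifiers are obtained from each corpus is not covered here, and the function and file names are hypothetical.

```shell
# Sketch: print the identifiers from the second list that do not occur
# in the first list.
# Assumption: each file holds one document identifier per line.
drop_shared_ids() {
    # $1 = identifiers to exclude (e.g. documents already taken from JRC-Acquis)
    # $2 = identifiers to filter  (e.g. documents found in DGT-Acquis)
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
         !($0 in seen)' "$1" "$2"
}

# Example call with hypothetical file names:
# drop_shared_ids jrc_ids.txt dgt_ids.txt > dgt_only_ids.txt
```

Comparing on `FILENAME` rather than on the usual `NR == FNR` idiom keeps the filter correct even when the first list is empty.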
DGT-Acquis and the translation memory DGT-TM are of a different nature. While the DGT-Acquis parallel corpus contains full documents with additional segmentation information, DGT-TM is a translation memory, i.e. a collection of Translation Units (sentences and the like). In parallel corpora, one can thus see each sentence in its context, while in translation memories, each sentence is in isolation, i.e. out of context. As for their overlap, DGT-TM is based exclusively on the L-Series of the Official Journal, while DGT-Acquis also contains the LM, C, CA and CE collections (see the table of documents included in DGT-Acquis, on this page under "What is the DGT-Acquis?"). Again, the processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.
Conditions for Use
I. Intellectual property and conditions of use of data
The DGT-Acquis data is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which
comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.
Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that
the European Commission retains ownership of the data.
II. Conditions for use of software
The DGT-Acquis data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.
III. Responsibility
The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee
the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said data and software.
The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software.
Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of
the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.
IV. Definitions
Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:
Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330
of 14 December 2011, pages 39 to 42.
Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.
Download the DGT-Acquis
There are three downloading options (see Section Description of the Data - Data Format above for details):
- Full corpus (da1-ox and da1-fx): Please contact the DGT contact person mentioned at the end of the page if you want to receive this data. These packages are very big (several GB; see the table in Section Statistics on the corpus above).
- One language per collection per zip file for "file level in plain text format" (da1-ft): You can browse and select the DGT-Acquis files you are interested in.
- One language per zip file for "paragraph level in column-file format" (da1-pc): You can download these files by clicking on the links in the table below.
| Paragraph level (da1-pc) | Size |
|---|---|
| | 1MB |
| | 109MB |
| | 134MB |
| | 128MB |
| | 140MB |
| | 180MB |
| | 144MB |
| | 136MB |
| | 116MB |
| | 136MB |
| | 142MB |
| | 5MB |
| | 133MB |
| | 137MB |
| | 125MB |
| | 125MB |
| | 106MB |
| | 134MB |
| | 137MB |
| | 138MB |
| | 92MB |
| | 136MB |
| | 123MB |
| | 124MB |
| Total size | 3GB |
How to produce bilingual extractions
Download and unzip the required files listed above. Each file contains one language. The file data.a1.txt.zip contains the filenames indicating the source of each segment.
Here is a Unix example that produces a bilingual English-French file, with the two languages separated by the character '|' and with every line dropped where either language is empty:
paste -d'|' data.en.txt data.fr.txt | sed '/^|/d ; /|$/d' > bilang.txt
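A variation on the command above also keeps the provenance column, so that each bilingual pair stays linked to its source file. This is only a sketch: it assumes the unzipped data.a1.txt is line-aligned with the language files and that no segment contains a tab character, and the function name is hypothetical. Tabs are used as the separator so that a '|' inside a segment cannot be mistaken for a column boundary.

```shell
# Sketch: join the provenance column with two language columns and keep
# only the rows where both languages have text.
# Assumption: all three files have one line per segment, in the same order,
# and no segment contains a tab character.
bilingual_with_source() {
    # $1 = provenance file, $2 = first language file, $3 = second language file
    # paste joins the files line by line, separated by tabs (its default).
    paste "$1" "$2" "$3" | awk -F'\t' '$2 != "" && $3 != ""'
}

# Example call with the file names used on this page:
# bilingual_with_source data.a1.txt data.en.txt data.fr.txt > bilang-src.txt
```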
Acknowledgement and contact
For more information, you can contact the following persons:
Directorate-General for Translation (DGT)
M.T. Carrasco Benitez (Email address: manuel[dot]carrasco-benitez[at]ec[dot]europa[dot]eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
L-2920 Luxembourg
More information on DGT.
Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname[dot]Lastname[at]jrc[dot]ec[dot]europa[dot]eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
When making reference to the DGT-Acquis in scientific publications, please cite the following paper:
- Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0
The Directorate-General for Translation (DGT) is one of the biggest translation services in the world. It is also the largest single department in the European Commission, with around 2500 staff members and a total production of some 2 million pages a year. Various computer tools are available to translators, who use them according to their translation needs and personal preferences. Irrespective of their preferred working methods, all translators need to be able to reuse previously translated texts (translation memories, electronic archives, etc.). To perform its tasks, DG Translation has a wide variety of language resources at the disposal of its staff: terminology in many different forms (multilingual libraries, terminology databases, electronic dictionaries, etc.); translation memories enabling genuine data sharing; texts as such, to be retrieved from internal archiving systems and other sources; and machine translation, which, at the European Commission, is used both as a browsing tool to view the gist of a text and as a genuine translation aid.
The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from thousands of news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies the articles into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's four publicly accessible media monitoring applications are:
- NewsBrief: Breaking News detection and display of the very latest thematically organised news from around the world; Grouping of related news; breaking news detection; RSS feeds and automatic email alerting; 50 languages.
- MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
- NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; quotation recognition; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.