- Introduction
- Format and Structure of the Data
- Document types contained in DCEP
- Statistics on the DCEP
- Usage conditions
- Download the DCEP corpus
- How to produce bilingual corpora
- Acknowledgement and contact
- Reference publication
- International Standard Language Resource Number: 823-807-024-162-2
Introduction
The Digital Corpus of the European Parliament (DCEP) contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents that were produced between 2001 and 2012.
The EP's Directorate General for Translation has created this corpus, and made it publicly accessible, to contribute to the European Parliament's policy of multilingualism, designed to ensure the equal treatment of languages.
DCEP is a multilingual corpus including documents in all official EU languages and it can be used for various language processing and research purposes such as:
- Machine Translation;
- Creation of monolingual or multilingual linguistic resources;
- Translation studies, annotation projection for co-reference resolution, discourse analysis, comparative language studies;
- Improvement of sentence or word alignment algorithms;
- Cross-lingual information retrieval.
To avoid overlapping with the
Europarl corpus, DCEP does not contain the verbatim reports of the speeches made in the European Parliament's plenary (CRE documents). CRE is a French abbreviation that stands for "Compte Rendu in Extenso".
Format and Structure of the Data
DCEP is available as full-text documents and as sentence-aligned data. DCEP includes alignment information for the full documents, as well as for sentences, produced separately for each language pair. DCEP is accompanied by tools for producing sentence-aligned corpora separately for each of the 276 language pairs. The sentence-aligned data is in plain text format, i.e. XML/TMX output is not supported.
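As a quick check on the pair counts quoted on this page, the number of unordered language pairs among n languages is n(n-1)/2. The sketch below is purely illustrative (the function name is hypothetical and not part of the DCEP tools): 276 pairs corresponds to 24 languages, while the figure of 253 pairs given in the statistics section corresponds to 23 languages.

```python
from math import comb


def language_pairs(n_languages: int) -> int:
    """Return the number of unordered language pairs among n_languages."""
    return comb(n_languages, 2)


print(language_pairs(24))  # 276
print(language_pairs(23))  # 253
```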
The full-text documents are available in two formats: text-only and structured data (either XML or SGML).
The original structured data (to be found in the source directory) has the advantage that users can develop or use their own tools to process the XML and SGML files according to their needs. The plain text version (strip directory) is useful for faster processing because the markup has already been removed. There is an additional plain text version including sentence segmentation (sentences directory), where each line consists of one sentence.
The directories sentences, source and strip are each organised according to language, document type and original structured document format (XML or SGML).
The document alignment information can be found in the file
cross-lingual-index.txt.bz2. This index file can be used to create bilingual or multilingual corpora. Each line consists of a space-separated list of file names of corresponding documents. For example, if there is only one file name, it means
that the document is available only in one language.
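A minimal sketch of how the index could be turned into a list of bilingual document pairs is given below. It relies only on the line format described above (space-separated file names). The way the language is read from a file name is an assumption made for illustration: `lang_from_path` treats the first path component as the language code, and the sample paths are invented. Check the real file names in the index and pass a different `lang_of` function if needed.

```python
from typing import Callable, Iterable, Iterator, Tuple


def lang_from_path(name: str) -> str:
    """ASSUMPTION: the first path component is the language code,
    e.g. 'EN/REPORT/xml/a1.txt' -> 'EN'. Adapt to the real naming scheme."""
    return name.split("/", 1)[0]


def document_pairs(
    lines: Iterable[str],
    src: str,
    tgt: str,
    lang_of: Callable[[str], str] = lang_from_path,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_file, target_file) for every index line that lists
    a document in both the src and the tgt language."""
    for line in lines:
        names = line.split()  # space-separated list of file names
        by_lang = {lang_of(n): n for n in names}
        if src in by_lang and tgt in by_lang:
            yield by_lang[src], by_lang[tgt]


# Invented sample lines in the assumed layout (not real DCEP file names).
sample = [
    "EN/REPORT/xml/a1.txt FR/REPORT/xml/a1.txt DE/REPORT/xml/a1.txt",
    "EN/PRESS/sgml/b2.txt",  # only one file name: available in one language
    "FR/PV/xml/c3.txt DE/PV/xml/c3.txt",
]
print(list(document_pairs(sample, "EN", "FR")))
```

To run this on the real index, the lines can be read with `bz2.open("cross-lingual-index.txt.bz2", "rt", encoding="utf-8")` from the Python standard library and passed as `lines`.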
The sentence alignment information for each of the 276 language pairs (produced with a customised version of HunAlign) can be found in the folder
langpairs. This information consists of links between the aligned sentence-segmented documents and their respective line numbers.
The extraction tool provided with DCEP (contained in
DCEP-extract-scripts.tar.bz2) makes use of this sentence alignment information to produce bilingual sentence-aligned corpora for each of the language pairs. The
DCEP readme page provides examples of how to use the tool.
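To illustrate the idea of line-number links between two sentence-segmented documents, here is a small sketch. It is not the DCEP extraction tool and does not read the actual files in langpairs, whose format is not described here. The link representation (pairs of lists of zero-based line numbers, where one link may join several lines and an empty list marks an unaligned sentence) and the name `aligned_sentences` are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# ASSUMPTION: one link = (source line numbers, target line numbers), zero-based.
Link = Tuple[List[int], List[int]]


def aligned_sentences(
    src_lines: Sequence[str],
    tgt_lines: Sequence[str],
    links: Iterable[Link],
) -> Iterator[Tuple[str, str]]:
    """Yield (source_text, target_text) for each link, joining the lines of
    multi-line links with a space and skipping links with an empty side."""
    for src_idx, tgt_idx in links:
        src = " ".join(src_lines[i] for i in src_idx)
        tgt = " ".join(tgt_lines[j] for j in tgt_idx)
        if src and tgt:
            yield src, tgt


# Two tiny sentence-segmented documents (one sentence per line).
src_doc = ["Hello.", "This is", "a test.", "Unmatched."]
tgt_doc = ["Bonjour.", "Ceci est un test."]
links = [([0], [0]), ([1, 2], [1]), ([3], [])]

for pair in aligned_sentences(src_doc, tgt_doc, links):
    print(pair)
```

For real data, use the extraction tool from DCEP-extract-scripts.tar.bz2 as described in the readme.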
The release of DCEP includes a number of further tools and scripts (available via the
download page). These are not required to produce the parallel corpus, but are provided so that users can examine them for transparency reasons.
Document types
The following document types are included in the current version of the DCEP corpus:
Document type | Brief description |
AGENDA | Agenda of the plenary session meetings |
COMPARL | Draft Agenda of the part-session |
IM-PRESS and PRESS | General texts and articles on parliamentary news seen from a national angle, specific to one or several Member States, presentation of events in the EP |
IMP-CONTRIB | Various press documents including technical announcements, events (hearings, workshops) produced by the Parliamentary Committees |
MOTION | Motions for resolutions put to the vote in plenary |
PV | Minutes of plenary sittings |
REPORT | Reports of the parliamentary committees |
RULES-EP | The Rules of Procedure of the EP laying down the rules for the internal operation and organisation of EP |
TA (Adopted Texts) | The motions for resolutions and reports tabled by Members and by the parliamentary committees are put to the vote in plenary, with or without a debate. After the vote, the final texts as adopted are published and forwarded to the authorities concerned |
WQ (Written Question) | Written questions are texts from Members of the EP which request an answer in writing |
WQA (Written Question Answer) | The written answer to the parliamentary written questions |
OQ (Oral Question) | Oral questions are asked in plenary sitting and included in the day's debates |
QT (Questions for Question Time) | Questions for question time are asked during the period set aside for questions during plenary sittings. |
Statistics
Detailed statistics on the DCEP are provided in two tables, which summarise the corpus size in terms of:
- Number of documents;
- Number of words;
- Number of unique words.
Words were counted with the Unix wc utility after removing the mark-up. Unique words, by contrast, were counted on tokenised text, taking into account only words composed of alphabetical characters. The first table presents figures per language and document type, while the second contains statistics per language pair.
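The two counting methods can be approximated as follows. This is a sketch only, not the script used to produce the published figures: the tokeniser is a simple regular expression chosen for illustration, counting is assumed to be case-sensitive, and `str.split()` only approximates the whitespace-based counting of wc. The function names are hypothetical.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-delimited words, similar to `wc -w`."""
    return len(text.split())


def count_unique_alpha(text: str) -> int:
    """Count distinct tokens made up only of alphabetical characters.

    ASSUMPTION: a simple regex tokeniser that separates punctuation from
    words; the tokeniser actually used for DCEP is not specified here.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len({t for t in tokens if t.isalpha()})


text = "The Parliament adopted the report in 2012. The report was adopted."
print(count_words(text))         # 11
print(count_unique_alpha(text))  # 7
```

In this example "2012" and the full stops are ignored when counting unique words, while "The" and "the" count as two different words because no case folding is applied.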
DCEP is the largest single release of documents published by an institution of the European Union. It contains various document types in 23 languages (253 language pairs). Here are some statistics:
- Total number of documents: 1.5 million;
- Total number of words: 1.37 billion;
- Total number of English segments: 7.7 million;
- The best-represented language in terms of number of words is English (103,458,996);
- French and Spanish miss less than 10%.
More statistics are available in the publication
DCEP - Digital Corpus of the European Parliament (LREC'2014).
Usage Conditions
I. Intellectual property and conditions of use of data
The DCEP data is the exclusive property of the European Parliament. The Parliament cedes its non-exclusive rights free of charge and world-wide for research purposes for the entire duration of the protection of those rights to the re-user.
Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that
the European Parliament retains ownership of the data.
II. Conditions for use of software
The DCEP data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL license.
III. Responsibility
The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Parliament cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Parliament does not however guarantee
the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Parliament does not guarantee the on-going distribution of said data and software.
The Parliament cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software.
Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the
Parliament in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.
Download the DCEP corpus
Please choose the downloading options on the
DCEP download page.
How to produce bilingual corpora
To extract the bilingual corpus, which has been aligned at the sentence level using the HunAlign sentence aligner, please follow the instructions on the
readme page.
Acknowledgement and contact
DCEP has been created and published by the Machine Translation team of the
European Parliament's Directorate-General for Translation (DGTRAD), represented by
Najeh Hajlaoui (Machine Translation Expert at the European Parliament). DGTRAD was supported by Jaakko Väyrynen and Ralf Steinberger from the European Commission's
Joint Research Centre. The sentence alignment was produced by Dániel Varga, researcher at Budapest University of Technology and Economics, with a customised version of the HunAlign software.
For more information you can send an e-mail to
machinetranslation@ep.europa.eu .
The
Directorate-General for Translation ensures that the European Parliament's documents are available in all the official languages of the European Union, thus enabling Parliament to meet its commitment to the policy of multilingualism. By directly
enabling Parliament to practise multilingualism, DGTRAD plays an integral role in protecting the cultural and linguistic diversity of the Union. It facilitates transparency, understanding and the exchange of views.
DGTRAD's main tasks are:
- translating documents out of and into the 24 official languages of the European Union, thus providing all EU citizens with immediate access to European texts in their own language and the opportunity to communicate with the institutions in their own language;
- supplying a translation service which ensures both quality and efficiency, keeping costs at an acceptable level;
- developing the appropriate IT tools and terminology databases to aid translators and integrating them into the workflow;
- revising documents translated outside Parliament and monitoring the quality of external translations;
- managing translation traineeships.
The
Joint Research Centre (
JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has also contributed to the preparation and dissemination of a number of
linguistic resources, including the parallel corpora
JRC-Acquis and
DGT-Acquis, the Translation Memories
DGT-TM,
ECDC-TM and
EAC-TM, as well as
JRC-Names and the
JRC Eurovoc Indexer JEX.
References - Relevant publications
For a more detailed description of DCEP and when making reference to DCEP in scientific publications, please refer to:
- Hajlaoui Najeh, Kolovratnik David, Väyrynen Jaakko, Steinberger Ralf, and Varga Dániel (2014).
DCEP - Digital Corpus of the European Parliament. Proc. LREC 2014 (Language Resources and Evaluation Conference). Reykjavik, Iceland, May 26-31, 2014, pp. 3164-3171 (URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/943_Paper.pdf).
To compare DCEP with the other linguistic resources distributed by EU institutions, see:
- Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014).
An overview of the European Union's highly multilingual parallel corpora . Language Resources and Evaluation Journal (LRE). DOI: 10.1007/s10579-014-9277-0.