DCEP: Digital Corpus of the European Parliament

Introduction

The Digital Corpus of the European Parliament (DCEP) contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents that were produced between 2001 and 2012.

The EP's Directorate General for Translation has created this corpus, and made it publicly accessible, to contribute to the European Parliament's policy of multilingualism, designed to ensure the equal treatment of languages.

DCEP is a multilingual corpus including documents in all official EU languages and it can be used for various language processing and research purposes such as:

  • Machine Translation;
  • Creation of monolingual or multilingual linguistic resources;
  • Translation studies, annotation projection for co-reference resolution, discourse analysis, comparative language studies;
  • Improvement of sentence or word alignment algorithms;
  • Cross-lingual information retrieval.

To avoid overlapping with the Europarl corpus, DCEP does not contain the verbatim reports of the speeches made in the European Parliament's plenary (CRE documents). CRE is a French abbreviation that stands for "Compte Rendu in Extenso".

Format and Structure of the Data

DCEP is available as full-text documents and as sentence-aligned data. DCEP includes alignment information for the full documents, as well as for sentences, produced separately for each language pair. DCEP is accompanied by tools that produce sentence-aligned corpora separately for each of the 276 language pairs. The sentence-aligned data is in plain text format; XML/TMX output is not supported.

The full-text documents are available in two formats: text-only and structured data (either XML or SGML).

The original structured data (found in the source directory) has the advantage that users can develop or use their own tools to process the XML and SGML files according to their needs. The plain text version (strip directory) is useful for faster processing because the markup has already been removed. There is an additional plain text version with sentence segmentation (sentences directory), where each line consists of one sentence.

The directories sentences, source and strip are each organised according to language, document type and original structured document format (XML or SGML).

The document alignment information can be found in the file cross-lingual-index.txt.bz2. This index file can be used to create bilingual or multilingual corpora. Each line consists of a space-separated list of file names of corresponding documents. For example, a line with only one file name means that the document is available in only one language.
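
The index format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the DCEP tools: the function names are hypothetical, and it assumes the index is UTF-8 text in which file names contain no spaces.

```python
import bz2
from typing import Iterable, Iterator, List, Tuple


def read_document_groups(index_path: str) -> Iterator[List[str]]:
    """Yield one list of file names per non-empty line of the index.

    Assumption: the index is UTF-8 text and file names contain no spaces.
    """
    with bz2.open(index_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            names = line.split()
            if names:  # skip blank lines
                yield names


def split_by_availability(
    groups: Iterable[List[str]],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Separate single-file lines from lines listing several corresponding documents."""
    single: List[List[str]] = []
    multiple: List[List[str]] = []
    for names in groups:
        (single if len(names) == 1 else multiple).append(names)
    return single, multiple


# Usage (the path is the index file named in the text):
# single, multiple = split_by_availability(
#     read_document_groups("cross-lingual-index.txt.bz2"))
```

Lines in the second list are the candidates for building a bilingual or multilingual corpus; lines in the first list refer to documents that exist in one language only.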

The sentence alignment information for each of the 276 language pairs (produced with a customised version of HunAlign) can be found in the folder langpairs. This information consists of links between the aligned sentence-segmented documents and their respective line numbers.
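
To illustrate how line-number links turn into sentence pairs, here is a minimal sketch. It does not parse the files in langpairs, whose on-disk format is not described here. Instead it assumes a hypothetical in-memory form in which each link is a pair of line-number lists, and it assumes 1-based numbering. The function name is also an assumption.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of one alignment link: the line numbers
# (assumed 1-based) of the source and target sentences that it joins.
Link = Tuple[Sequence[int], Sequence[int]]


def aligned_pairs(
    src_sentences: Sequence[str],
    tgt_sentences: Sequence[str],
    links: Sequence[Link],
) -> List[Tuple[str, str]]:
    """Resolve line-number links into (source text, target text) pairs.

    Each input sequence holds one sentence per element, mirroring the
    one-sentence-per-line files of the sentences directory.
    """
    pairs = []
    for src_lines, tgt_lines in links:
        # A link may join several sentences on either side.
        src = " ".join(src_sentences[n - 1] for n in src_lines)
        tgt = " ".join(tgt_sentences[n - 1] for n in tgt_lines)
        pairs.append((src, tgt))
    return pairs
```

For actual extraction, the tool shipped with DCEP remains the reference; this sketch only shows the idea of following links back to lines.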

The extraction tool provided with DCEP (contained in DCEP-extract-scripts.tar.bz2) makes use of this sentence alignment information to produce bilingual sentence-aligned corpora for each of the language pairs. The DCEP readme page provides examples of how to use the tool.

The release of DCEP includes a number of further tools and scripts (available via the download page). They are not required to produce the parallel corpus, but can be examined for transparency reasons.

Document types

The following document types are included in the current version of the DCEP corpus:

| Document type | Brief description |
|---|---|
| AGENDA | Agenda of the plenary session meetings |
| COMPARL | Draft Agenda of the part-session |
| IM-PRESS and PRESS | General texts and articles on parliamentary news seen from a national angle, specific to one or several Member States, presentation of events in the EP |
| IMP-CONTRIB | Various press documents including technical announcements, events (hearings, workshops) produced by the Parliamentary Committees |
| MOTION | Motions for resolutions put to the vote in plenary |
| PV | Minutes of plenary sittings |
| REPORT | Reports of the parliamentary committees |
| RULES-EP | The Rules of Procedure of the EP laying down the rules for the internal operation and organisation of EP |
| TA (Adopted Texts) | The motions for resolutions and reports tabled by Members and by the parliamentary committees are put to the vote in plenary, with or without a debate. After the vote, the final texts as adopted are published and forwarded to the authorities concerned |
| WQ (Written Question) | Written questions are texts from Members of the EP which request an answer in writing |
| WQA (Written Question Answer) | The written answer to the parliamentary written questions |
| OQ (Oral Question) | Oral questions are asked in plenary sitting and included in the day's debates |
| QT (Questions for Question Time) | Questions for question time are asked during the period set aside for questions during plenary sittings |

Statistics

Detailed statistics on DCEP are provided in separate tables, which summarise the corpus size in terms of:

  • Number of documents;
  • Number of words;
  • Number of unique words.

Words were counted with the wc Unix utility after removing the mark-up. Unique words, by contrast, were counted on tokenised text, and only words composed of alphabetical characters were taken into consideration. The first table presents figures per language and document type, while the second contains statistics per language pair.
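
The two counting methods can be approximated as follows. This is a rough sketch and not the script used to produce the published figures. The tokeniser is an assumption (a simple regular expression that separates words from punctuation), as is the case-sensitive comparison of unique words. The function names are hypothetical.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-delimited words, comparable to `wc -w`."""
    return len(text.split())


def count_unique_alphabetic_words(text: str) -> int:
    """Count distinct tokens made up only of alphabetical characters.

    Assumption: tokens are runs of word characters or single punctuation
    marks, and "The" and "the" count as different words.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len({token for token in tokens if token.isalpha()})


sample = "The Parliament adopted the report. The report was adopted in 2012."
print(count_words(sample))                    # 11
print(count_unique_alphabetic_words(sample))  # 7
```

The sample shows why the two figures differ: the word count treats "report." and "2012." as words, whereas the unique-word count separates the punctuation and discards the number.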

DCEP is the largest single release of documents published by an institution of the European Union. It contains various document types in 23 languages (253 language pairs). Here are some statistics:

  • Total number of documents: 1.5 million;
  • Total number of words: 1.37 billion;
  • Total number of English segments: 7.7 million;
  • The best-represented language in terms of number of words is English (103,458,996);
  • French and Spanish are missing less than 10%.

More statistics are available in the publication DCEP - Digital Corpus of the European Parliament (LREC'2014).

Usage Conditions

I. Intellectual property and conditions of use of data

The DCEP data is the exclusive property of the European Parliament. The Parliament cedes its non-exclusive rights free of charge and world-wide for research purposes for the entire duration of the protection of those rights to the re-user.

Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that
the European Parliament retains ownership of the data.

II. Conditions for use of software

The DCEP data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL license.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Parliament cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Parliament does not however guarantee
the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Parliament does not guarantee the on-going distribution of said data and software.

The Parliament cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software.
Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the
Parliament in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

Download the DCEP corpus

Please choose the download options on the DCEP download page.

How to produce bilingual corpora

To extract the bilingual corpus, which has been aligned at the sentence level using the HunAlign sentence aligner, please follow the readme page.

Acknowledgement and contact

DCEP has been created and published by the Machine Translation team of the European Parliament's Directorate-General for Translation (DGTRAD), represented by Najeh Hajlaoui (Machine Translation Expert at the European Parliament). DGTRAD was supported by Jaakko Väyrynen and Ralf Steinberger from the European Commission's Joint Research Centre. The sentence alignment was produced by Dániel Varga, researcher at Budapest University of Technology and Economics, with a customised version of the HunAlign software.

For more information, you can send an e-mail to machinetranslation@ep.europa.eu.

The Directorate-General for Translation ensures that the European Parliament's documents are available in all the official languages of the European Union, thus enabling Parliament to meet its commitment to the policy of multilingualism. By directly enabling Parliament to practise multilingualism, DGTRAD plays an integral role in protecting the cultural and linguistic diversity of the Union. It facilitates transparency, understanding and the exchange of views.

DGTRAD's main tasks are:

  • translating documents out of and into the 24 official languages of the European Union, thus providing all EU citizens with immediate access to European texts in their own language and the opportunity to communicate with the institutions in their own
    language;
  • supplying a translation service which ensures both quality and efficiency, keeping costs at an acceptable level;
  • developing the appropriate IT tools and terminology databases to aid translators and integrating them into the workflow;
  • revising documents translated outside Parliament and monitoring the quality of external translations;
  • managing translation traineeships.

The Joint Research Centre (JRC) is a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has also contributed to the preparation and dissemination of a number of linguistic resources, including the parallel corpora JRC-Acquis and DGT-Acquis, the Translation Memories DGT-TM, ECDC-TM and EAC-TM, as well as JRC-Names and the JRC Eurovoc Indexer JEX.

References - Relevant publications

For a more detailed description of DCEP and when making reference to DCEP in scientific publications, please refer to:

To compare DCEP with the other linguistic resources distributed by EU institutions, see:
