JRC-Names


What is JRC-Names?

JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin,
Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). Since March 2016, JRC-Names has also been available as linked data, including additional information such as frequencies per language, titles found with the entities, and date ranges.


The named entity resource file with the list of spelling variants is accompanied by demonstrator software implemented in Java that (a) produces, for any input name, a list of known spelling variants, and (b) analyses UTF-8-encoded text files to find known entity mentions, returning the name variant found, the preferred display name for that entity, the unique identifier of that name, the position of the entity name in the text, and its length in characters.
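
The two demonstrator functions can be illustrated with a minimal Python sketch. It is not the Java demonstrator and it does not read the real resource file: the `TOY_RESOURCE` table, the entity identifiers and the function names `variants_of` and `find_mentions` are assumptions made purely for illustration.

```python
import re

# Hypothetical toy resource: entity id -> (preferred display name, known variants).
# Ids and contents are illustrative, not taken from the JRC-Names file.
TOY_RESOURCE = {
    1: ("Muammar Gaddafi", ["Muammar Gaddafi", "Moammar Gadhafi", "Муаммар Каддафи"]),
    2: ("United Nations", ["United Nations", "Nations Unies", "Vereinte Nationen"]),
}

# Reverse index: case-folded variant -> entity id.
VARIANT_TO_ID = {
    variant.casefold(): entity_id
    for entity_id, (_, variants) in TOY_RESOURCE.items()
    for variant in variants
}


def variants_of(name):
    """(a) Return all known spelling variants for an input name, or []."""
    entity_id = VARIANT_TO_ID.get(name.casefold())
    return list(TOY_RESOURCE[entity_id][1]) if entity_id is not None else []


def find_mentions(text):
    """(b) Scan text for known variants and describe each mention found."""
    mentions = []
    for entity_id, (preferred, variants) in TOY_RESOURCE.items():
        for variant in variants:
            # \b keeps matches on word boundaries; re.escape treats the name literally.
            pattern = r"\b" + re.escape(variant) + r"\b"
            for match in re.finditer(pattern, text):
                mentions.append({
                    "variant": match.group(0),
                    "preferred": preferred,
                    "id": entity_id,
                    "position": match.start(),
                    "length": len(match.group(0)),
                })
    return sorted(mentions, key=lambda m: m["position"])
```

Calling `find_mentions("Moammar Gadhafi addressed the United Nations.")` on this toy table yields two mentions, each carrying the five fields listed above. A linear scan over every variant is only workable for a tiny table; a resource with hundreds of thousands of names would need an indexed matcher.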

To see examples, go to any of the over one million entity pages on EMM-NewsExplorer (e.g. that for the United Nations), each of which lists the spelling variants automatically collected for that entity.

[Figure: known spelling variants for the person name Muammar Gaddafi]

The data release by the Joint Research Centre (JRC) is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

What can JRC-Names be used for?

JRC-Names is a technical resource that can be used to find names even if they are spelled differently, but it is also a useful ingredient for IT systems that process text, e.g. for text mining.


The tool serves many purposes and addresses various problems, including the following:

Proper names are a problem when searching databases, the internet and other repositories, because variants of searched names are often not found. This results in sub-optimal use and exploitation of repositories for documents, images and audio-visual content. JRC-Names makes it possible to standardise names and thus improve retrieval;
Names are a known problem for machine translation as they should not be translated like other words; names can be extracted before the translation process and the foreign language variant can be re-inserted in the target language to solve this problem;
Lists of names in two different scripts are often used to learn transliteration rules;
Names can be recognised and marked up in text to use as seeds when training a machine learning named entity recognition system;
Social networks are less biased by national viewpoints if produced using multi-national sources and entity lists;
Recognition of names is useful as input to the computational linguistics tasks of opinion mining, co-reference resolution, summarisation, topic detection and tracking, cross-lingual linking of related documents, and more.
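
The machine translation use case in the list above can be sketched as follows. The placeholder format, the `TARGET_VARIANTS` table and the `translate` argument are all hypothetical: a real system would take the target-language variants from the resource and call an actual translation engine.

```python
# Hypothetical table: source-language name -> target-language (French) variant.
TARGET_VARIANTS = {
    "United Nations": "Nations Unies",
}


def mask_names(text, table):
    """Replace known names with opaque placeholders before translation."""
    slots = {}
    for index, (source_name, target_name) in enumerate(table.items()):
        if source_name in text:
            placeholder = f"__NE{index}__"
            text = text.replace(source_name, placeholder)
            slots[placeholder] = target_name
    return text, slots


def unmask_names(text, slots):
    """Re-insert the target-language variant for each placeholder."""
    for placeholder, target_name in slots.items():
        text = text.replace(placeholder, target_name)
    return text


def translate_with_names(text, translate, table=TARGET_VARIANTS):
    """Mask names, run the supplied translate function, then restore names."""
    masked, slots = mask_names(text, table)
    return unmask_names(translate(masked), slots)
```

The names never pass through the translation step, so they cannot be translated like ordinary words. The sketch assumes the translation engine leaves the placeholders unchanged.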

How was JRC-Names produced?

JRC-Names is a by-product of the analysis of about 220,000 news reports per day by the Europe Media Monitor (EMM) family of applications.


It was mostly compiled automatically, by analysing hundreds of millions of news articles since the year 2004 in up to twenty-one languages, identifying names of entities (mostly persons, but also organisations, event names, and more), and detecting which
of these newly found names are variant spellings of each other. Most name variants in JRC-Names are thus spellings that were found in real-life text (including frequent spelling mistakes). Additionally, for a subset of the collection of entities,
software automatically extracted spelling variants in many further languages (e.g. Chinese, Thai and Japanese) from the cross-lingual links in Wikipedia. For highly frequent or otherwise important names, the named entity resource was additionally
manually verified. As JRC-Names was mostly produced automatically, it will contain some errors.
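
The text above does not say how EMM decides that two names are variant spellings of each other. As an illustration only, and explicitly not the method used for JRC-Names, the sketch below shows one simple way to compare two same-script names: strip diacritics, fold case, and measure string similarity with Python's standard `difflib`. The function names and the threshold of 0.8 are assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Case-fold a name and strip diacritics (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def similarity(a, b):
    """Similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def is_variant(a, b, threshold=0.8):
    """Treat two names as candidate spelling variants if they are similar enough."""
    return similarity(a, b) >= threshold
```

A comparison like this cannot match names across scripts (for example Latin against Cyrillic). That would first require a transliteration step, which is outside the scope of this sketch.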

For details, you can read the publication "JRC-Names: A freely available, highly multilingual named entity resource".

Statistics on JRC-Names

JRC-Names contains the most important names of the EMM name database, i.e. those names that were found frequently or that were verified manually or found on Wikipedia.


The first release of JRC-Names (September 2011) contained the names of about 205,000 distinct known entities, plus about the same number of variant spellings for these entities. It also contained a number of morphologically inflected variants of these names. By March 2016, the resource had grown to 307,000 distinct entities plus 333,000 variants.

EMM identifies new names every day, and a file that also includes the most recently found names and name spellings is available for daily download from the JRC's web pages.

As of July 2011, the database included names spelt in 27 different scripts. The most frequently used scripts are Latin (including English and most other European languages), Cyrillic (e.g. Russian and Bulgarian), Arabic (including Farsi), Japanese (Han,
Hiragana and Katakana) and Chinese Han (simplified variant).

64% of the names in JRC-Names do not have additional spelling variants. For 28% of the names, JRC-Names knows two or three spellings. There are 3,760 entities with ten spellings or more, and 37 entities with over 100 spelling variants. The names with the most spelling variants are Muammar Gaddafi (413 spellings), Mikhail Saakashvili (256) and Mahmoud Ahmadinejad (246), as of July 2011.
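
Percentages of this kind can be derived from a per-entity count of spellings. The sketch below assumes a hypothetical mapping from entity identifier to number of known spellings; the function name, the bucket labels and the toy input are illustrative and do not reproduce the JRC's own computation or data.

```python
from collections import Counter


def spelling_distribution(spellings_per_entity):
    """Return the share of entities falling into each spelling-count bucket."""
    buckets = Counter()
    for count in spellings_per_entity.values():
        if count == 1:
            buckets["1"] += 1      # no additional spelling variants
        elif count <= 3:
            buckets["2-3"] += 1
        elif count < 10:
            buckets["4-9"] += 1
        else:
            buckets["10+"] += 1
    total = sum(buckets.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in buckets.items()}


# Toy input: four made-up entities with 1, 1, 2 and 12 known spellings.
toy_counts = {"e1": 1, "e2": 1, "e3": 2, "e4": 12}
print(spelling_distribution(toy_counts))
```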

Related information

JRC-Names (version 1) is described in the first publication below. Information on the Linked Data version of JRC-Names can be found in the second. Please cite these publications when you refer to JRC-Names:

Steinberger Ralf, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot (2011). JRC-Names: A freely available, highly multilingual named entity resource. Proceedings of the 8th International Conference Recent Advances in Natural Language Processing (RANLP). Hissar, Bulgaria, 12-14 September 2011.
Ehrmann Maud, Guillaume Jacquet & Ralf Steinberger (2016). JRC-Names: Multilingual Entity Name Variants and Titles as Linked Data. Semantic Web Journal (March 2016).

JRC-Names Java demonstrator code: This .jar file analyses UTF-8-encoded text files to recognise known named entities. It can also generate a list of all known variants for any input name. It needs to be used in combination with the entity resource file.
JRC-Names named entity resource file: This file contains the list of names and their variants. It is planned that this file will be updated daily in order to include the most recently added entity names (filename: entities.gzip; zipped size: ca. 5.6 MB; unzipped: ca. 18 MB).
JRC-Names Java source code: You only need this if you want to integrate the resource into your own environment.
JRC-Names documentation: This is the documentation for the Java software.
JRC-Names linked data version: Accessible on the EU's Open Data portal, including as an RDF file.