Data and Methodology

Data

Methodology

This section summarises the main characteristics of the education programmes analysed, the data source used, and the main methodological steps followed to produce the final results. The work follows the methodology developed in Academic offer and demand for advanced profiles in the EU (López-Cobo et al., 2019) and revised in Academic Offer of Advanced Digital Skills in 2019-20. International Comparison (Righi and López-Cobo et al., 2020).

Main characteristics of the programmes

Technological domain. The study covers four advanced digital domains: artificial intelligence, high performance computing, cybersecurity, and data science. An education programme may be considered in more than one technological domain due to the existing overlap between these domains (e.g., a programme on parallel computing may belong to high performance computing and data science simultaneously).
Geographical area. Refers to the country in which the programme is offered. The study covers the 27 EU Member States and six additional countries: the United Kingdom, Norway, Switzerland, Canada, the United States, and Australia.
Education level. The study collects data on three education levels: master's degrees, bachelor's degrees and short professional courses.
Programme’s scope. Education programmes are classified as "specialised" or "broad", according to how closely they focus on the technological domain considered. Specialised programmes are those with a strong focus on the domain (e.g. a master's in supercomputing), while broad programmes address the domain in a more generic way (e.g. a bachelor's degree in biomedicine that includes a course on artificial intelligence). A programme has only one scope in a specific technological domain, but it may be a broad programme in one domain and a specialised one in another.
Programme’s field of education. This variable of analysis refers to the field of education or discipline in which the programme is taught, according to the Fields of education and training 2013 classification (e.g., “Engineering, manufacturing and construction”, “Business administration and Law”). A programme may be taught in several fields of education. In those cases, the programme is weighted using fractional counting.
Programme’s content areas. These refer to the subdomains covered by the programmes’ syllabi. For each of the four technological domains, specific content areas are defined, either following existing taxonomies or built by analysing the programmes’ descriptions.
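
The fractional counting mentioned for the field of education can be illustrated with a minimal Python sketch. It assumes that a programme taught in n fields contributes 1/n to each of them; the text does not state the exact weighting, so the equal split, the function name `fractional_counts` and the sample programme identifiers are illustrative assumptions.

```python
from collections import defaultdict


def fractional_counts(programme_fields):
    """Count programmes per field of education with fractional counting.

    programme_fields maps a programme id to the list of fields in which
    it is taught. Assumption: the weight of 1 is split equally across
    the programme's distinct fields.
    """
    counts = defaultdict(float)
    for fields in programme_fields.values():
        unique_fields = set(fields)
        if not unique_fields:
            continue  # a programme without a field contributes nothing
        weight = 1.0 / len(unique_fields)
        for field in unique_fields:
            counts[field] += weight
    return dict(counts)


# Hypothetical programmes, for illustration only.
programme_fields = {
    "P1": ["Engineering, manufacturing and construction"],
    "P2": [
        "Engineering, manufacturing and construction",
        "Business administration and Law",
    ],
}
counts = fractional_counts(programme_fields)
```

With this weighting, the per-field counts always add up to the number of programmes, so a programme taught in several fields is not counted more than once in the totals by field.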

The results are provided for each technological domain separately. If a programme belongs to more than one technological domain, it is fully counted within each of them. Two statistics are calculated: the number of programmes (by scope, field of education and content area), and the penetration rate, i.e., the share of programmes in the domain over the total number of programmes (of any type and with any type of content) offered in the geographical area considered.
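
As a hedged illustration of these two statistics, the sketch below counts each programme fully in every domain it belongs to and divides by the total number of programmes offered in the area. The function name `domain_statistics`, the data layout and the figures are assumptions made for the example, not part of the study.

```python
def domain_statistics(programme_domains, total_programmes):
    """Number of programmes and penetration rate per technological domain.

    programme_domains maps a programme id to the set of domains it
    belongs to; total_programmes is the number of programmes of any
    type offered in the geographical area considered.
    """
    if total_programmes <= 0:
        raise ValueError("total_programmes must be positive")
    counts = {}
    for domains in programme_domains.values():
        for domain in set(domains):
            # Full counting: a programme adds 1 to each of its domains.
            counts[domain] = counts.get(domain, 0) + 1
    return {
        domain: {"programmes": n, "penetration_rate": n / total_programmes}
        for domain, n in counts.items()
    }


# Hypothetical area offering 200 programmes, three of them in scope.
programme_domains = {
    "P1": {"high performance computing", "data science"},
    "P2": {"data science"},
    "P3": {"artificial intelligence"},
}
stats = domain_statistics(programme_domains, total_programmes=200)
```

Because of full counting across domains, the per-domain counts may add up to more than the number of distinct programmes identified.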

Data source: strengths and caveats

The study uses data from the Studyportals’ platform as the starting point. It includes programmes from 3,700 universities in over 120 countries. Out of the seven dedicated Studyportals’ websites, this study analyses the ones focused on master’s and bachelor’s degrees and short professional courses. These three repositories overall account for more than 150,000 programmes, out of which nearly 50,000 (in 2022) correspond to programmes taught in European universities or study centres.

This source offers the widest coverage among all the platforms identified. However, its coverage is still incomplete, as programmes taught in national languages are not tracked.

The main assumption of the study is that, even if the source does not cover the whole education offer in each country, it shows a representative part of it, so that the attributes of the programmes captured by our study can be extrapolated to the whole education offer. This assumption is considered valid on the basis of the previous study Academic offer and demand for advanced profiles in the EU (López-Cobo et al., 2019). In addition, the focus on the English language is considered pertinent in view of the highly technological and computer-related domains addressed by this study.

Another strong advantage of the data source is the amount of programme-related information available, which makes it possible to analyse the characteristics of the programmes covered. In particular, some of the most interesting attributes for our analysis relate to the programmes’ content (title of the programme, short and long description, and programme outline). We use them first to identify a programme as related to one or more of the four domains covered, and then to categorise the technological subdomains taught in the programme. The field of education in which the programme is taught is also a very valuable piece of information, which enables us to explore the diversification or concentration of the advanced digital education offer across disciplines.

Identification of domain boundaries and categories for the analysis

Since official classifications fail to identify transversal technological domains such as the ones examined, we use lists of representative keywords (one list per domain, see following section) to query the data source. The selection of keywords follows a semi-automatic process aimed at identifying a representative list of terms present in specialised scientific publications. The first selection is performed for each domain separately, as detailed in Academic offer and demand for advanced profiles in the EU (López-Cobo et al., 2019). In a second step, the programmes identified as specialised during the 2019 study are analysed to detect additional keywords.
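
The keyword lists mark some terms with an asterisk, meaning that they are queried in combination with the domain's core terms. The sketch below shows one possible reading of that rule: an asterisked term counts as a match only when a core term appears in the same text. The core terms themselves are not listed in this section, so the set used here is hypothetical, as are the function name `matched_keywords` and the simple case-insensitive substring matching.

```python
def matched_keywords(text, plain_terms, starred_terms, core_terms):
    """Return the keywords considered present in a text.

    Assumptions (not confirmed by the study): matching is a
    case-insensitive substring search, and an asterisked term is
    accepted only if at least one core term occurs in the same text.
    """
    lowered = text.lower()
    found = {term for term in plain_terms if term in lowered}
    has_core_term = any(term in lowered for term in core_terms)
    if has_core_term:
        found |= {term for term in starred_terms if term in lowered}
    return found


# Small excerpt of the AI list; the core term set is hypothetical.
plain_terms = {"artificial intelligence", "machine learning"}
starred_terms = {"ethics", "safety"}
core_terms = {"artificial intelligence"}

no_match = matched_keywords(
    "Ethics and safety in healthcare", plain_terms, starred_terms, core_terms
)
with_core = matched_keywords(
    "Ethics of Artificial Intelligence", plain_terms, starred_terms, core_terms
)
```

Under this reading, generic terms such as "ethics" or "safety" do not pull unrelated programmes into a domain on their own.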

After the programmes relevant to the technological domains under study have been identified, they are classified as “broad” or “specialised”. A programme is considered “specialised” in a technological domain if either its title or its short description includes at least one keyword representative of the technological domain, or if at least three different keywords are present in any other text field of the programme description. If neither of these conditions is met (i.e., only one or two keywords are found in the long description), the programme is considered “broad”.
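
The classification rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: keywords are matched as case-insensitive substrings, the function name `classify_scope` and its parameters are hypothetical, and a programme with no keyword at all is returned as `None` because it would not have been identified as relevant to the domain in the first place.

```python
def classify_scope(title, short_description, other_fields, keywords):
    """Classify a programme as 'specialised' or 'broad' in one domain.

    other_fields is a list with the remaining text fields (e.g. long
    description, programme outline). Returns None when no keyword is
    found anywhere.
    """
    def found_in(text):
        lowered = text.lower()
        return {kw for kw in keywords if kw in lowered}

    # Condition 1: at least one keyword in the title or short description.
    if found_in(title) or found_in(short_description):
        return "specialised"

    # Condition 2: at least three different keywords in the other fields.
    others = set()
    for text in other_fields:
        others |= found_in(text)
    if len(others) >= 3:
        return "specialised"
    if others:
        return "broad"  # only one or two keywords outside the title
    return None


keywords = {"machine learning", "deep learning", "neural network", "computer vision"}

broad = classify_scope(
    "Bachelor in Biomedicine",
    "A degree in the life sciences.",
    ["The third year includes a course on machine learning."],
    keywords,
)
specialised = classify_scope(
    "Master in Machine Learning", "", [], keywords
)
```

The function is applied once per technological domain with that domain's keyword list, which is how a programme can be broad in one domain and specialised in another.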

The keywords are also used to classify the programmes according to the content areas taught. In general, the categorisation of content areas is derived following the methodology proposed in the 2019 study (López-Cobo et al., 2019) and refined by analysing the syllabi of the most specialised programmes in the data source. When available, existing taxonomies have also been used.

For artificial intelligence (AI), we consider the AI taxonomy developed by the JRC in the framework of AI Watch: AI WATCH. Defining Artificial Intelligence. Towards an operational definition and taxonomy of artificial intelligence (Samoili and López-Cobo et al., 2020).
For cybersecurity (CS), we enrich the categorisation of content areas with a JRC report that aligns cybersecurity terminologies, definitions and domains into a coherent and comprehensive taxonomy, intended to facilitate the categorisation of cybersecurity capabilities in the EU: European Cybersecurity Centres of Expertise Map - Definitions and Taxonomy (Nai-Fovino et al., 2018).
For high performance computing (HPC) and data science (DS), the taxonomy is developed by the authors of this work, based on a review of several specialised master's programmes in these fields.

Keywords for programmes’ identification

Artificial intelligence

accountability *	deep learning	machine translation	sound synthesis
adaptive learning	deep neural network	multi-agent system	speaker identification
ai application	ethics *	narrow artificial intelligence	speech processing *
anomaly detection	expert system	natural language generation	speech recognition
artificial general intelligence	explainability *	natural language processing	speech synthesis
artificial intelligence	face recognition	natural language understanding	strong artificial intelligence
audio processing *	fairness *	neural network	supervised learning
automated vehicle	human computer interaction	pattern recognition	support vector machine
automatic translation	human-ai interaction	predictive analytics	swarm intelligence
autonomous system *	image processing	recommender system *	text mining
autonomous vehicle	image recognition	reinforcement learning	transfer learning
business intelligence *	inductive programming	robot system *	transparency *
chatbot	intelligence software	robotics	trustworthy ai
computational creativity *	intelligent agent *	safety *	uncertainty *
computational linguistics	intelligent control	security *	unsupervised learning
computational neuroscience *	intelligent software development	semantic web *	voice recognition
computer vision	intelligent system	sentiment analysis *	weak artificial intelligence
control theory	knowledge representation and reasoning	service robot *
cyber physical system	machine learning	social robot *
* Terms that are queried in combination with domain’s core terms.

High performance computing

accelerators *	distributed computing	hpc applications *	parallel programming *
cloud *	distributed systems *	hpcc	parallelisation *
cloud computing	energy efficiency	infiniband	performance analysis
cluster *	exascale *	manycore	performance evaluation
cluster computing *	field-programmable gate array	mapreduce *	performance modeling
compute unified device architecture *	fpga	massive parallelism *	performance optimisation
computer architecture *	gpgpu	message passing interface	reconfigurable computing *
computer modelling *	gpu	multi core	scalability
concurrent *	graphics processing unit	opencl	single instruction multiple data
cuda	grid computing	parallel algorithms *	supercomputer
data center	hadoop	parallel architectures *	supercomputer technology
data intensive computing	high performance computation	parallel computation *
* Terms that are queried in combination with domain’s core terms.

Cybersecurity

access control	cyber warfare	firewall *	phishing
access management	cybercrime	hacker	pseudonymity
activity monitoring	cybersecurity	hash function	public key
anonymity *	cybersecurity incident	identity access management	random number generation
anonymization	data anonymisation	identity management	security analysis
computer security	data sanitisation	information assurance	security protocol *
control system	data security	information protection	stuxnet
counterintelligence	digital evidence	information security	supervisory control data acquisition
cryptanalysis	digital forensics	intrusion detection	system security
cryptography	digital rights management	key management	vulnerability assessment
cryptology	digital signature	malware	web protocol
cyber attack	distributed systems	network attack	web protocol security
cyber risk	encryption	network security
cyber threat	fault tolerance	penetration testing
* Terms that are queried in combination with domain’s core terms.

Data science

ant colony optimisation	distributed computing	metaheuristic optimisation	reinforcement learning
automated machine learning	distributed processing	multiagent system	scalability
big data	ensemble method	natural language processing	semantic web
business intelligence	evolutionary algorithm	natural language understanding	semi-supervised learning
data analytics	genetic algorithm	neural network	sentiment analysis
data mining	gradient descent	nosql	spark *
data science	hadoop	parallel computing *	statistical learning
data visualisation	information extraction	parallel processing *	supervised learning
decision analytics	information retrieval	parallelisation *	support vector machine
decision support	k-nearest-neighbour	pattern recognition	transfer learning
decision tree	machine learning	predictive analytics	unstructured data
deep learning	mapreduce	recommender system	unsupervised learning
* Terms that are queried in combination with domain’s core terms.