Automatic term extraction (ATE) is a task that, despite receiving plenty of research attention over the past decades, remains very challenging. While terms are generally defined as “lexical items that represent concepts of a domain” (Kageura & Marshman, 2019), there appears to be a lack of agreement about the fundamental nature of terms. Since ATE is supposed to automatically identify terms in specialised text, the absence of a consensus about the basic characteristics of terms is problematic. The disagreement covers both practical aspects, such as term length and part-of-speech (POS) pattern, and theoretical considerations about the difference between words (or collocations/phrases) and terms. This poses great difficulties for all aspects of ATE, from data collection (1), to extraction methodology (2), to evaluation (3).

Data collection (term annotation) (1) is time-consuming and labour-intensive, inter-annotator agreement is notoriously low, and there is no consensus about an annotation protocol. This leads to a scarcity of available resources. Moreover, it means that the few available datasets are difficult to combine and compare, and often cover only a single language and domain. Two of the most widely used annotated datasets are GENIA (Kim, Ohta, Tateisi, & Tsujii, 2003), a collection of 2000 abstracts from the MEDLINE database in the domain of biomedicine, and the ACL RD-TEC 2.0 (Qasemizadeh & Schumann, 2016), which contains 300 annotated abstracts from the ACL Anthology Reference Corpus. Both are in English. There are other (smaller) examples as well (Bernier-Colborne, 2012; Daille, 2012; Hätty, Tannert, & Heid, 2017; Schumann & Fischer, 2016), but the fact remains that few large annotated resources are available for the task, and they are usually monolingual and cover only a single domain. Since term characteristics, and therefore also ATE performance, can vary greatly between languages and domains, this is a serious drawback.

The second problem caused by the lack of consensus about the nature of terms concerns ATE methodologies (2), more specifically, which terms the tools are designed to extract. As discussed, there is no agreement about features such as term length and POS pattern. This means that some tools extract only single-word terms (Amjadian, Inkpen, Paribakht, & Faez, 2016; Conrado, Pardo, & Rezende, 2013; Hätty & Schulte im Walde, 2018; Nokel, Bolshakova, & Loukachevitch, 2012), others extract only multi-word terms (Azé, Roche, Kodratoff, & Sebag, 2005; Karan, Snajder, & Dalbelo Basic, 2012; Loukachevitch, 2012; Patry & Langlais, 2005), and still others extract both, often with different upper limits on term length, ranging from bigrams (Loukachevitch & Nokel, 2013; Vivaldi & Rodríguez, 2001) to no restrictions at all (Kucza, Niehues, Zenkel, Waibel, & Stüker, 2018; Rigouts Terryn, Drouin, Hoste, & Lefever, 2019), and everything in between. A similar trend can be seen regarding POS patterns. Research often focuses only on nouns and noun phrases. Some studies include verbs, adjectives, and adverbs. Others do not use any restrictions, for example those that consider all n-grams potential term candidates (Wang, Liu, & McDonald, 2016). Still others obtain POS patterns from annotated data (Hätty, Dorna, & Schulte im Walde, 2017; Patry & Langlais, 2005; Rigouts Terryn, Drouin, et al., 2019). There are also those that focus only on very specific patterns, such as single-word compound terms (Hätty & Schulte im Walde, 2018) or noun+noun terms (Azé et al., 2005). These decisions regarding term length and POS are often motivated not by a belief that all terms conform to these limitations, but rather by the difficulties created by not imposing any such restrictions, both in terms of the explosion of (false) term candidates and the added effort to create and evaluate the data.
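
To make the candidate explosion concrete, the following minimal sketch enumerates all contiguous n-grams of a single tokenised sentence under different length limits. It is only an illustration: the function name `ngram_candidates`, the `max_len` parameter, and the example sentence are assumptions introduced here, and none of the cited systems is claimed to generate candidates in exactly this way.

```python
def ngram_candidates(tokens, max_len=None):
    """Return every contiguous n-gram of `tokens` as a candidate string.

    `max_len` is a hypothetical upper limit on term length;
    None means no restriction at all.
    """
    n_tokens = len(tokens)
    limit = n_tokens if max_len is None else min(max_len, n_tokens)
    return [
        " ".join(tokens[start:start + n])
        for n in range(1, limit + 1)
        for start in range(n_tokens - n + 1)
    ]


# Illustrative ten-token sentence (not taken from any dataset).
sentence = "the left ventricular ejection fraction was measured in all patients"
tokens = sentence.split()

for limit in (1, 2, None):
    label = "unrestricted" if limit is None else f"max length {limit}"
    print(label, len(ngram_candidates(tokens, limit)))
```

For a sentence of k tokens, the unrestricted setting yields k(k+1)/2 candidates, so the candidate pool grows quadratically with sentence length, while the number of true terms in the sentence stays small.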

This is related to a different problem in ATE, which is also linked to evaluation (3), namely the fact that most tools are particularly bad at extracting infrequent terms. Many tools discard all terms below a certain frequency threshold (Conrado et al., 2013; Ljubešić, Erjavec, & Fišer, 2018; Ramisch, Villavicencio, & Boitet, 2010b). While some have experimented with frequency thresholds (Drouin, 2003; Ramisch, Villavicencio, & Boitet, 2010a), rare terms remain difficult to find. This is mostly due to the fact that ATE often relies heavily on frequency-based termhood and unithood metrics (Kageura & Umino, 1996), which fail to detect rare terms. This is especially problematic, since one of the main applications of ATE is to keep up with terminology efficiently, which includes detecting new or rare terms that do not yet appear in existing lexicons (Kageura & Marshman, 2019). Therefore, any evaluation that does not take these infrequent terms into account does not necessarily represent the potential usefulness of the ATE in a real-world setting.

All of the previously mentioned problems also make evaluation extremely challenging. The most common evaluation metrics for ATE are precision (how many of the extracted term candidates are true terms), recall (how many of the true terms were extracted), and f-score (the weighted harmonic mean of the two). While most research does report precision, the calculation of recall and f-score is not as common, since they require a fully annotated corpus: all true terms in a corpus need to be identified to be able to calculate how many of them the ATE has found. This rarely happens because of the cited problems with term annotation and available datasets. Apart from the evaluation metrics, comparative evaluations are problematic as well. First, the lack of diversity in datasets does not allow for many cross-lingual or cross-domain comparisons. Second, the great differences between the term definitions used in the research do not promote fair and transparent comparisons. For instance, it would not be fair to compare a system with no limits on term length, term POS pattern, or term frequency to one with restrictions on all of those. Finally, even reported precision scores are not always informative, because of the varying strictness used in calculating the score, e.g. counting partial matches as correct. A proposed alternative is a more user- and application-oriented evaluation, but this comes with its own set of problems, such as measuring the impact of ATE on the task (Mustafa El Hadi, Timimi, & Dabbadie, 2004; Mustafa El Hadi et al., 2006; Nazarenko & Zargayouna, 2009).
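
As an illustration of the metrics discussed above, the following minimal sketch computes precision, recall, and f-score for a list of extracted candidates against a fully annotated gold standard, and contrasts strict matching with one possible lenient criterion. The function names, the token-overlap criterion for partial matches, and the toy term lists are assumptions made for this example; they do not describe the scoring used in any of the cited studies.

```python
def evaluate(extracted, gold, beta=1.0):
    """Strict (exact-match) precision, recall and f-score.

    Terms are lowercased and de-duplicated before comparison.
    `beta` weights recall against precision (beta=1 gives F1).
    """
    extracted = {t.lower() for t in extracted}
    gold = {t.lower() for t in gold}
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


def lenient_precision(extracted, gold):
    """Precision where a candidate counts as correct if it shares at
    least one token with any gold term (a hypothetical partial-match
    criterion, chosen here only for illustration)."""
    extracted = {t.lower() for t in extracted}
    gold_tokens = {tok for t in gold for tok in t.lower().split()}
    if not extracted:
        return 0.0
    hits = sum(1 for t in extracted if set(t.split()) & gold_tokens)
    return hits / len(extracted)


# Toy data, invented for this example.
gold = ["heart failure", "ejection fraction", "beta blocker", "patient"]
extracted = ["heart failure", "chronic heart failure",
             "ejection fraction", "study"]

print(evaluate(extracted, gold))           # strict scores
print(lenient_precision(extracted, gold))  # partial matches accepted
```

On this toy data, strict precision is 2/4, whereas the lenient variant also accepts "chronic heart failure" and reports 3/4. This shows how two precision scores computed on identical output can differ purely because of the matching criterion.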

In conclusion, despite the amount of research available on the subject, there is still surprisingly little consensus about ATE. This shared task is meant to address some of the major concerns. It introduces a dataset that covers three languages and four domains, manually annotated with four different term labels. This allows participants to train and test their systems on diverse and detailed data. Moreover, all participants get relevant information about the types of terms their system is supposed to find from the provided training/development data, and all systems are evaluated identically, on the same test data. Thus, with this dataset, all participant systems can be fairly and transparently evaluated. The aim of this shared task is both to introduce a valuable new resource and to obtain a detailed overview of the current state of the art. It is meant to identify the strengths and weaknesses of ATE and to inspire new ideas in the field.


Amjadian, E., Inkpen, D., Paribakht, T. S., & Faez, F. (2016). Local-Global Vectors to Improve Unigram Terminology Extraction. Proceedings of the 5th International Workshop on Computational Terminology, 2–11. Osaka, Japan.

Azé, J., Roche, M., Kodratoff, Y., & Sebag, M. (2005). Preference Learning in Terminology Extraction: A ROC-based approach. Proceedings of Applied Stochastic Models and Data Analysis, 209–2019. Retrieved from

Bernier-Colborne, G. (2012). Defining a Gold Standard for the Evaluation of Term Extractors. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey: ELRA.

Conrado, M. da S., Pardo, T. A. S., & Rezende, S. O. (2013). A Machine Learning Approach to Automatic Term Extraction using a Rich Feature Set. Proceedings of the NAACL HLT 2013 Student Research Workshop, 16–23. Atlanta, GA, USA: ACL.

Daille, B. (2012). Building Bilingual Terminologies from Comparable Corpora: The TTC TermSuite. Proceedings of the 5th Workshop on Building and Using Comparable Corpora with Special Topic “Language Resources for Machine Translation in Less-Resourced Languages and Domains”, Co-Located with LREC 2012. Istanbul, Turkey.

Drouin, P. (2003). Term Extraction Using Non-Technical Corpora as a Point of Leverage. Terminology, 9(1), 99–115.

Hätty, A., Dorna, M., & Schulte im Walde, S. (2017). Evaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction. Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics, 113–121.

Hätty, A., & Schulte im Walde, S. (2018). Fine-Grained Termhood Prediction for German Compound Terms Using Neural Networks. Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), 62–73. Santa Fe, New Mexico, USA.

Hätty, A., Tannert, S., & Heid, U. (2017). Creating a gold standard corpus for terminological annotation from online forum data. Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017).

Kageura, K., & Umino, B. (1996). Methods of automatic term recognition. Terminology, 3(2), 259–289.

Kageura, K., & Marshman, E. (2019). Terminology Extraction and Management. In M. O’Hagan (Ed.), The Routledge Handbook of Translation and Technology.

Karan, M., Snajder, J., & Dalbelo Basic, B. (2012). Evaluation of Classification Algorithms and Features for Collocation Extraction in Croatian. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 657–662. Istanbul, Turkey: ELRA.

Kim, J.-D., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus—A semantically annotated corpus for bio-textmining. Bioinformatics, 19(1), 180–182.

Kucza, M., Niehues, J., Zenkel, T., Waibel, A., & Stüker, S. (2018). Term Extraction via Neural Sequence Labeling a Comparative Evaluation of Strategies Using Recurrent Neural Networks. Interspeech 2018, 2072–2076.

Ljubešić, N., Erjavec, T., & Fišer, D. (2018). KAS-term and KAS-biterm: Datasets and baselines for monolingual and bilingual terminology extraction from academic writing. Digital Humanities, 7.

Loukachevitch, N. (2012). Automatic Term Recognition Needs Multiple Evidence. Proceedings of LREC 2012, 2401–2407. Istanbul, Turkey: ELRA.

Loukachevitch, N., & Nokel, M. (2013). An Experimental Study of Term Extraction for Real Information-Retrieval Thesauri. Proceedings of the 10th International Conference on Terminology and Artificial Intelligence TIA 2013, 69–76.

Mustafa El Hadi, W., Timimi, I., & Dabbadie, M. (2004). EVALDA-CESART Project: Terminological Resources Acquisition Tools Evaluation Campaign. Proceedings of LREC 2004, 515–518. Lisbon, Portugal.

Mustafa El Hadi, W., Timimi, I., Dabbadie, M., Choukri, K., Hamon, O., & Chiao, Y.-C. (2006). Terminological Resources Acquisition Tools: Toward a User-oriented Evaluation Model. Proceedings of LREC 2006, 945–948. Genoa, Italy: ELRA.

Nazarenko, A., & Zargayouna, H. (2009). Evaluating term extraction. Proceedings of the International Conference RANLP-2009, 299–304. Borovets, Bulgaria: ACL.

Nokel, M., Bolshakova, E. I., & Loukachevitch, N. (2012). Combining multiple features for single-word term extraction. Proceedings of Dialog 2012, 490–501.

Patry, A., & Langlais, P. (2005). Corpus-Based Terminology Extraction. Terminology and Content Development – Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, 313–321. Copenhagen, Denmark.

Qasemizadeh, B., & Schumann, A.-K. (2016). The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods. Proceedings of LREC 2016, 1862–1868. Portorož, Slovenia: ELRA.

Ramisch, C., Villavicencio, A., & Boitet, C. (2010a). Multiword Expressions in the wild? The mwetoolkit comes in handy. Coling 2010: Demonstration Volume, 57–60. Beijing, China.

Ramisch, C., Villavicencio, A., & Boitet, C. (2010b). mwetoolkit: A Framework for Multiword Expression Identification. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), 662–669. Valletta, Malta: ELRA.

Rigouts Terryn, A., Drouin, P., Hoste, V., & Lefever, E. (2019). Analysing the Impact of Supervised Machine Learning on Automatic Term Extraction: HAMLET vs TermoStat. Proceedings of RANLP 2019. Varna, Bulgaria.

Rigouts Terryn, A., Hoste, V., & Lefever, E. (2018). A Gold Standard for Multilingual Automatic Term Extraction from Comparable Corpora: Term Structure and Translation Equivalents. Proceedings of LREC 2018. Miyazaki, Japan: ELRA.

Rigouts Terryn, A., Hoste, V., & Lefever, E. (2019). In No Uncertain Terms: A Dataset for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora. Language Resources and Evaluation, 1–34.

Schumann, A.-K., & Fischer, S. (2016). Compasses, Magnets, Water Microscopes. Proceedings of LREC 2016, 3578–3584. Portorož, Slovenia: ELRA.

Steyaert, K., & Rigouts Terryn, A. (2019). Multilingual Term Extraction from Comparable corpora: Informativeness of Monolingual Term Extraction Features. Proceedings of BUCC.

Vivaldi, J., & Rodríguez, H. (2001). Improving term extraction by combining different techniques. Terminology, 7(1), 31–48.

Wang, R., Liu, W., & McDonald, C. (2016). Featureless Domain-Specific Term Extraction with Minimal Labelled Data. Proceedings of Australasian Language Technology Association Workshop, 103–112. Melbourne, Australia.