ACTER dataset

ACTER: Annotated Corpora for Term Extraction Research

The data can be downloaded from the GitHub repository, https://github.com/AylaRT/ACTER, which also contains an elaborate README.md with all necessary information.

The ACTER dataset (currently version 1.3) is publicly available with Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/). The dataset can be freely used and adapted for non-commercial purposes, provided any changes made to the data are clearly mentioned and the proper reference is cited: https://doi.org/10.1007/s10579-019-09453-9

Background & Numbers

The ACTER dataset was created specifically to address some of the hurdles faced by automatic term extraction (ATE). The corpora cover four domains: corruption, dressage (a branch of horse riding), heart failure and wind energy. For each domain, corpora have been collected in three languages (English, French, and Dutch), with a roughly equal number of tokens per domain in each language. The corpora within each domain are all comparable, in the sense that they contain original texts of the same types on the same subject. They are not parallel translations, so they cannot be aligned, neither at sentence level nor at document level. The corpus on corruption also contains a parallel component, with sentence-aligned, trilingual texts. More detailed information about the creation of these corpora, as well as the annotation process, can be found in Rigouts Terryn, Hoste, & Lefever (2018, 2019) and in the overview paper of the TermEval2020 shared task, which will appear in the CompuTerm workshop proceedings. Not all information about the annotations mentioned in previous work has been publicly released yet, so the remainder of this page focuses only on the information available in ACTER 1.2.

In each language and domain (for corruption, only the parallel corpus has been annotated), ±50k tokens were manually annotated, with a total of 596,058 annotated tokens. This resulted in 119,455 individual annotations or 19,002 unique annotations. The annotation guidelines are freely available online[1]. Table 1 gives a summary of the dataset, showing, for each corpus, the type (comparable or parallel), domain, language, number of tokens, number of documents and number of annotated tokens. For English, there are a total of 1,108,085 tokens, 194,403 of which are annotated[2]. The French corpora count 1,142,575 tokens in total, of which 206,859 are annotated. For Dutch, these numbers are 1,115,264 and 194,796 respectively. In total, the corpora contain 1,194 different documents.

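| Type | Domain | Language | # Tokens | # Documents | # Annotated Tokens |
|---|---|---|---|---|---|
| parallel | corruption | en | 176,314 | 24 | 45,234 |
| parallel | corruption | fr | 196,327 | 24 | 50,429 |
| parallel | corruption | nl | 184,541 | 24 | 47,305 |
| comparable | corruption | en | 468,711 | 44 | - |
| comparable | corruption | fr | 475,244 | 31 | - |
| comparable | corruption | nl | 470,242 | 49 | - |
| comparable | dressage | en | 102,654 | 89 | 51,470 |
| comparable | dressage | fr | 109,572 | 125 | 53,316 |
| comparable | dressage | nl | 103,851 | 125 | 50,882 |
| comparable | heart failure | en | 45,788 | 190 | 45,788 |
| comparable | heart failure | fr | 46,751 | 215 | 46,751 |
| comparable | heart failure | nl | 47,888 | 175 | 47,888 |
| comparable | wind energy | en | 314,618 | 38 | 51,911 |
| comparable | wind energy | fr | 314,681 | 12 | 56,363 |
| comparable | wind energy | nl | 308,742 | 29 | 49,582 |
| Total | | | 3,365,924 | 1,194 | 596,058 |

Table 1. Overview of the corpora per type, domain, and language, with number of tokens, number of documents and number of annotated tokens.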
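The per-language token totals and the overall document count quoted in the text can be re-derived from the rows of Table 1. The snippet below is only a sanity-check sketch: the dictionary layout and the names `CORPORA` and `totals_by_language` are made up for this illustration and are not part of the ACTER release.

```python
# Token and document counts per corpus, copied from Table 1.
# Keys are (type, domain, language); values are (tokens, documents).
# This layout is an assumption made for the example, not an ACTER file format.
CORPORA = {
    ("parallel", "corruption", "en"): (176_314, 24),
    ("parallel", "corruption", "fr"): (196_327, 24),
    ("parallel", "corruption", "nl"): (184_541, 24),
    ("comparable", "corruption", "en"): (468_711, 44),
    ("comparable", "corruption", "fr"): (475_244, 31),
    ("comparable", "corruption", "nl"): (470_242, 49),
    ("comparable", "dressage", "en"): (102_654, 89),
    ("comparable", "dressage", "fr"): (109_572, 125),
    ("comparable", "dressage", "nl"): (103_851, 125),
    ("comparable", "heart failure", "en"): (45_788, 190),
    ("comparable", "heart failure", "fr"): (46_751, 215),
    ("comparable", "heart failure", "nl"): (47_888, 175),
    ("comparable", "wind energy", "en"): (314_618, 38),
    ("comparable", "wind energy", "fr"): (314_681, 12),
    ("comparable", "wind energy", "nl"): (308_742, 29),
}


def totals_by_language(corpora):
    """Sum the token counts of all corpora that share a language."""
    totals = {}
    for (_type, _domain, lang), (tokens, _docs) in corpora.items():
        totals[lang] = totals.get(lang, 0) + tokens
    return totals


token_totals = totals_by_language(CORPORA)
document_total = sum(docs for _tokens, docs in CORPORA.values())

print(token_totals)                 # tokens per language
print(sum(token_totals.values()))   # tokens overall
print(document_total)               # documents overall
```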

While, for TermEval, ATE is conceived as a binary task (term or not term), there are actually four annotation labels: Specific Terms, Common Terms, Out-of-Domain Terms, and Named Entities. For TermEval, all labels are combined, with separate evaluations with and without Named Entities (see Task & Evaluation). No restrictions were placed on the annotations: all lexical items that were considered terms as they were used in the texts were annotated as such. This means that non-standard terms and misspellings are annotated as well, that there is no minimum frequency, and that terms can be of any length and follow any POS pattern.
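As an illustration of how the four labels collapse into a binary gold standard, with and without Named Entities, here is a minimal sketch. The label strings, the `gold_standard` helper and the toy annotations are all assumptions made for this example; the released files may use different label names and formats, so check the README.md before relying on them.

```python
# Hypothetical label names; the strings in the released files may differ.
TERM_LABELS = {"Specific_Term", "Common_Term", "OOD_Term"}
NE_LABEL = "Named_Entity"


def gold_standard(annotations, include_named_entities):
    """Collapse (term, label) pairs into a flat set of terms (binary task)."""
    allowed = set(TERM_LABELS)
    if include_named_entities:
        allowed.add(NE_LABEL)
    return {term for term, label in annotations if label in allowed}


# Invented toy annotations, for illustration only (not taken from ACTER).
toy_annotations = [
    ("ejection fraction", "Specific_Term"),
    ("heart", "Common_Term"),
    ("p-value", "OOD_Term"),
    ("New York Heart Association", "Named_Entity"),
]

with_nes = gold_standard(toy_annotations, include_named_entities=True)
without_nes = gold_standard(toy_annotations, include_named_entities=False)

print(sorted(with_nes))
print(sorted(without_nes))
```

Keeping the label on each annotation and filtering only at evaluation time means the same list can serve both evaluation settings.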

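| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 278 | 644 | 6 | 248 | 1174 |
| corruption (parallel) | fr | 300 | 678 | 5 | 236 | 1217 |
| corruption (parallel) | nl | 310 | 731 | 6 | 249 | 1295 |
| dressage | en | 780 | 309 | 71 | 421 | 1575 |
| dressage | fr | 705 | 238 | 26 | 221 | 1183 |
| dressage | nl | 1026 | 333 | 41 | 153 | 1546 |
| heart failure | en | 1885 | 320 | 158 | 228 | 2585 |
| heart failure | fr | 1714 | 505 | 59 | 147 | 2423 |
| heart failure | nl | 1561 | 450 | 66 | 182 | 2257 |
| wind energy | en | 781 | 296 | 14 | 444 | 1534 |
| wind energy | fr | 444 | 308 | 21 | 195 | 968 |
| wind energy | nl | 577 | 342 | 21 | 305 | 1245 |
| Total | | 10361 | 5154 | 494 | 3029 | 19002 |

Table 2. Number of unique annotations (type counts) per corpus and per label.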

Table 2 shows the number of unique annotations (type counts) per corpus and per label. These are also the numbers that can be found in ACTER 1.2. The same information is repeated in Table 3, but this time with token counts, i.e. each occurrence of an annotation is counted separately. This information (the annotations as spans in the text) is not yet available in ACTER 1.2.
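The difference between type counts and token counts can be made concrete with a small sketch. It assumes annotations are available as one (term, label) pair per annotated occurrence; that input format, the label strings and the `count_annotations` helper are hypothetical, and the toy data is invented.

```python
from collections import Counter


def count_annotations(occurrences):
    """Return (type_counts, token_counts) per label.

    occurrences: one (term, label) pair per annotated occurrence in the text.
    Type counts count each distinct (term, label) pair once;
    token counts count every occurrence.
    """
    token_counts = Counter(label for _term, label in occurrences)
    type_counts = Counter(label for _term, label in set(occurrences))
    return type_counts, token_counts


# Invented toy data: "ejection fraction" is annotated three times.
toy_occurrences = [
    ("ejection fraction", "Specific_Term"),
    ("ejection fraction", "Specific_Term"),
    ("ejection fraction", "Specific_Term"),
    ("heart", "Common_Term"),
    ("heart", "Common_Term"),
    ("p-value", "OOD_Term"),
]

type_counts, token_counts = count_annotations(toy_occurrences)
print(dict(type_counts))
print(dict(token_counts))
```

In this toy example the type count per label is 1 for each of the three labels, while the token counts are 3, 2 and 1 respectively.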

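| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 1078 | 5295 | 12 | 2373 | 8758 |
| corruption (parallel) | fr | 1077 | 4842 | 11 | 2186 | 8116 |
| corruption (parallel) | nl | 1041 | 4111 | 11 | 2334 | 7497 |
| dressage | en | 4729 | 5997 | 163 | 970 | 11859 |
| dressage | fr | 4969 | 4389 | 39 | 467 | 9864 |
| dressage | nl | 6259 | 4891 | 57 | 295 | 11502 |
| heart failure | en | 7402 | 5626 | 983 | 526 | 14537 |
| heart failure | fr | 5212 | 5336 | 253 | 319 | 11120 |
| heart failure | nl | 5428 | 4537 | 254 | 433 | 10652 |
| wind energy | en | 5151 | 4282 | 45 | 1429 | 10907 |
| wind energy | fr | 3208 | 5207 | 109 | 439 | 8963 |
| wind energy | nl | 2104 | 2805 | 135 | 636 | 5680 |
| Total | | 47658 | 57318 | 2072 | 12407 | 119455 |

Table 3. Number of annotations (token counts) per corpus and per label.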

While these human annotations are unavoidably subject to human error and subjectivity, every care has been taken to obtain high-quality annotations, which have been validated in various studies (Rigouts Terryn, Drouin, et al., 2019; Rigouts Terryn, Hoste, et al., 2019), though some errors may of course still remain. Since the purpose of this shared task is also to provide the community with a practical dataset for ATE, we highly encourage people who use the dataset and spot errors to notify us of these errors so we can continue to improve the data (an email address is provided in the README.md).


[1] http://hdl.handle.net/1854/LU-8503113

[2] While the gold standard list of annotated terms may contain many terms that also appear in the unannotated parts of the corpora, the unannotated parts of the corpora will undoubtedly also contain terms that do not appear in the gold standard lists.