ACTER: Annotated Corpora for Term Extraction Research
The data can be downloaded from the GitHub repository (https://github.com/AylaRT/ACTER), which also contains an elaborate README.md with all the necessary information.
The ACTER dataset (currently version 1.3) is publicly available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0; https://creativecommons.org/licenses/by-nc-sa/4.0/). The dataset can be freely used and adapted for non-commercial purposes, provided any changes made to the data are clearly indicated and the proper reference is cited: https://doi.org/10.1007/s10579-019-09453-9
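For a quick start, the sketch below shows one way to read a gold-standard annotation file into Python. The path and the tab-separated `term<TAB>label` format are assumptions based on typical term-extraction releases; the README.md in the repository documents the actual structure.

```python
import csv
from pathlib import Path

def load_gold_annotations(path):
    """Read a gold-standard annotation file into a {term: label} dict.

    Assumes one annotation per line in a tab-separated
    `term<TAB>label` format; check the repository's README.md for the
    actual layout of your ACTER release.
    """
    gold = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                gold[row[0]] = row[1]
    return gold

# Hypothetical path; adjust to the structure of the downloaded repository.
gold = load_gold_annotations(Path("ACTER/en/htfl/annotations/htfl_en_terms.tsv"))
print(f"{len(gold)} unique annotations loaded")
```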
Background & Numbers
The ACTER dataset was specifically created to address some of the hurdles faced by automatic term extraction (ATE) research. The corpora cover four domains: corruption, dressage (a branch of horse riding), heart failure, and wind energy. For each domain, corpora were collected in three languages – English, French, and Dutch – with a roughly equal number of tokens per domain in each language. The corpora within each domain are comparable, in the sense that they contain original texts of the same types and on the same subject; they are not parallel translations, so they cannot be aligned (neither at sentence level nor at document level). The corpus on corruption additionally contains a parallel component, with sentence-aligned, trilingual texts. More detailed information about the creation of these corpora, as well as the annotation process, can be found in Rigouts Terryn, Hoste, & Lefever (2018, 2019) and in the overview paper of the TermEval 2020 shared task, which will appear in the CompuTerm workshop proceedings. Not all information about the annotations mentioned in previous work has been publicly released yet, so the remainder of this page focuses only on the information available in ACTER 1.2.
In each language and domain (for corruption, only the parallel corpus has been annotated), around 50k tokens were manually annotated, for a total of 596,058 annotated tokens. This resulted in 119,455 individual annotations, or 19,002 unique annotations. The annotation guidelines are freely available online[1]. Table 1 summarises the dataset, showing, for each corpus, the domain, type (comparable or parallel), language, number of tokens, number of documents, and number of annotated tokens. For English, there are a total of 1,108,085 tokens, 194,403 of which are annotated[2]. The French corpora count 1,142,575 tokens in total, of which 206,859 are annotated. For Dutch, these numbers are 1,115,264 and 194,796, respectively. In total, the corpora contain 1,194 different documents.
Table 1: Overview of the ACTER corpora per type, domain, and language.

| Type | Domain | Language | # Tokens | # Documents | # Annotated Tokens |
|---|---|---|---|---|---|
| parallel | corruption | en | 176,314 | 24 | 45,234 |
| parallel | corruption | fr | 196,327 | 24 | 50,429 |
| parallel | corruption | nl | 184,541 | 24 | 47,305 |
| comparable | corruption | en | 468,711 | 44 | - |
| comparable | corruption | fr | 475,244 | 31 | - |
| comparable | corruption | nl | 470,242 | 49 | - |
| comparable | dressage | en | 102,654 | 89 | 51,470 |
| comparable | dressage | fr | 109,572 | 125 | 53,316 |
| comparable | dressage | nl | 103,851 | 125 | 50,882 |
| comparable | heart failure | en | 45,788 | 190 | 45,788 |
| comparable | heart failure | fr | 46,751 | 215 | 46,751 |
| comparable | heart failure | nl | 47,888 | 175 | 47,888 |
| comparable | wind energy | en | 314,618 | 38 | 51,911 |
| comparable | wind energy | fr | 314,681 | 12 | 56,363 |
| comparable | wind energy | nl | 308,742 | 29 | 49,582 |
| Total | | | 3,365,924 | 1,194 | 596,058 |
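As a quick sanity check, the per-language totals quoted above can be recomputed from the rows of Table 1; the snippet below verifies the English token total.

```python
# Token counts for the English corpora, copied from Table 1.
en_tokens = {
    "corruption (parallel)": 176_314,
    "corruption (comparable)": 468_711,
    "dressage": 102_654,
    "heart failure": 45_788,
    "wind energy": 314_618,
}

# Matches the English total of 1,108,085 tokens quoted in the text.
assert sum(en_tokens.values()) == 1_108_085
```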
While, for TermEval, ATE is conceived as a binary task (term or not term), there are actually four annotation labels: Specific Terms, Common Terms, Out-of-Domain Terms, and Named Entities. For TermEval, all labels are combined, with separate evaluations with and without Named Entities (see Task & Evaluation). No restrictions were placed on the annotations: all lexical items that were considered terms as they were used in the texts were annotated as such. This means that non-standard terms and misspellings were annotated as well, that there is no minimum frequency threshold, and that terms can be of any length and follow any part-of-speech pattern.
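Since TermEval collapses the four labels into a single binary gold standard, evaluated once with and once without Named Entities, a helper along the following lines can build both settings and score an extracted term list against them. This is a minimal sketch: the label strings are assumptions and may be spelled differently in the released annotation files.

```python
def binary_gold(annotations, include_named_entities=True):
    """Collapse the four annotation labels into one binary gold set.

    `annotations` maps each unique annotation to its label. The label
    names below are assumptions; check the released files for the
    exact spelling.
    """
    term_labels = {"Specific Term", "Common Term", "Out-of-Domain Term"}
    if include_named_entities:
        term_labels.add("Named Entity")
    return {t for t, label in annotations.items() if label in term_labels}


def precision_recall_f1(extracted, gold):
    """Score an extracted term list against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Example: evaluate a (hypothetical) extracted list in both settings.
annotations = {"anti-corruption": "Specific Term", "EU": "Named Entity"}
for with_nes in (True, False):
    gold = binary_gold(annotations, include_named_entities=with_nes)
    print(with_nes, precision_recall_f1(["anti-corruption"], gold))
```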
Table 2: Number of unique annotations (type counts) per corpus and per label.

| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 278 | 644 | 6 | 248 | 1,174 |
| corruption (parallel) | fr | 300 | 678 | 5 | 236 | 1,217 |
| corruption (parallel) | nl | 310 | 731 | 6 | 249 | 1,295 |
| dressage | en | 780 | 309 | 71 | 421 | 1,575 |
| dressage | fr | 705 | 238 | 26 | 221 | 1,183 |
| dressage | nl | 1,026 | 333 | 41 | 153 | 1,546 |
| heart failure | en | 1,885 | 320 | 158 | 228 | 2,585 |
| heart failure | fr | 1,714 | 505 | 59 | 147 | 2,423 |
| heart failure | nl | 1,561 | 450 | 66 | 182 | 2,257 |
| wind energy | en | 781 | 296 | 14 | 444 | 1,534 |
| wind energy | fr | 444 | 308 | 21 | 195 | 968 |
| wind energy | nl | 577 | 342 | 21 | 305 | 1,245 |
| Total | | 10,361 | 5,154 | 494 | 3,029 | 19,002 |
Table 2 shows the number of unique annotations (type counts) per corpus and per label; these are also the numbers that can be found in ACTER 1.2. Note that the per-row totals can be lower than the sum of the per-label counts, since the same unique annotation can receive different labels in different contexts. The same information is repeated in Table 3, but this time with token counts, i.e., each occurrence of an annotation is counted separately. This information (the annotations as spans in the text) is not yet available in ACTER 1.2.
Table 3: Number of annotations (token counts) per corpus and per label.

| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 1,078 | 5,295 | 12 | 2,373 | 8,758 |
| corruption (parallel) | fr | 1,077 | 4,842 | 11 | 2,186 | 8,116 |
| corruption (parallel) | nl | 1,041 | 4,111 | 11 | 2,334 | 7,497 |
| dressage | en | 4,729 | 5,997 | 163 | 970 | 11,859 |
| dressage | fr | 4,969 | 4,389 | 39 | 467 | 9,864 |
| dressage | nl | 6,259 | 4,891 | 57 | 295 | 11,502 |
| heart failure | en | 7,402 | 5,626 | 983 | 526 | 14,537 |
| heart failure | fr | 5,212 | 5,336 | 253 | 319 | 11,120 |
| heart failure | nl | 5,428 | 4,537 | 254 | 433 | 10,652 |
| wind energy | en | 5,151 | 4,282 | 45 | 1,429 | 10,907 |
| wind energy | fr | 3,208 | 5,207 | 109 | 439 | 8,963 |
| wind energy | nl | 2,104 | 2,805 | 135 | 636 | 5,680 |
| Total | | 47,658 | 57,318 | 2,072 | 12,407 | 119,455 |
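The difference between Tables 2 and 3 is the classic type/token distinction: Table 2 counts each unique annotation once, while Table 3 counts every occurrence. The sketch below illustrates the two counts on a hypothetical list of annotated spans (recall that span-level data is not yet part of ACTER 1.2).

```python
from collections import Counter

# Hypothetical list of annotated spans: one (surface form, label) pair
# per occurrence in the text (span-level data is not in ACTER 1.2).
occurrences = [
    ("ejection fraction", "Specific Terms"),
    ("ejection fraction", "Specific Terms"),
    ("heart", "Common Terms"),
]

# Token counts (Table 3 style): every occurrence counts.
token_counts = Counter(label for _, label in occurrences)

# Type counts (Table 2 style): each unique (form, label) pair counts once.
type_counts = Counter(label for _, label in set(occurrences))

print(token_counts)  # Counter({'Specific Terms': 2, 'Common Terms': 1})
print(type_counts)   # Counter({'Specific Terms': 1, 'Common Terms': 1})
```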
While these human annotations are unavoidably subject to some degree of error and subjectivity, every care has been taken to obtain high-quality annotations, and they have been validated in various studies (Rigouts Terryn, Drouin, et al., 2019; Rigouts Terryn, Hoste, et al., 2019). Since one purpose of this shared task is to provide the community with a practical dataset for ATE, we strongly encourage anyone who uses the dataset and spots errors to notify us, so we can continue to improve the data (an email address is provided in the README.md).
[1] http://hdl.handle.net/1854/LU-8503113
[2] While the gold standard list of annotated terms may contain many terms that also appear in the unannotated parts of the corpora, the unannotated parts of the corpora will undoubtedly also contain terms that do not appear in the gold standard lists.