ACTER: Annotated Corpora for Term Extraction Research
The data can be downloaded from the GitHub repository (https://github.com/AylaRT/ACTER), which also contains an elaborate README.md with all the necessary information.
The ACTER dataset (currently version 1.3) is publicly available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0; https://creativecommons.org/licenses/by-nc-sa/4.0/). The dataset can be freely used and adapted for non-commercial purposes, provided any changes made to the data are clearly indicated and the proper reference is cited: https://doi.org/10.1007/s10579-019-09453-9
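For a quick start, the sketch below shows one way to read a gold-standard annotation file into Python. The path and the tab-separated `term<TAB>label` format are assumptions based on typical term-extraction releases; the README.md in the repository documents the actual structure.

```python
import csv
from pathlib import Path

def load_gold_annotations(path):
    """Read a gold-standard annotation file into a {term: label} dict.

    Assumes one annotation per line in a tab-separated
    `term<TAB>label` format; check the repository's README.md for the
    actual layout of your ACTER release.
    """
    gold = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                gold[row[0]] = row[1]
    return gold

# Hypothetical path; adjust to the structure of the downloaded repository.
gold = load_gold_annotations(Path("ACTER/en/htfl/annotations/htfl_en_terms.tsv"))
print(f"{len(gold)} unique annotations loaded")
```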
Background & Numbers
The ACTER dataset was specifically created to address some of the hurdles faced by automatic term extraction (ATE) research. The corpora cover four domains: corruption, dressage (a branch of horse riding), heart failure, and wind energy. For each domain, corpora were collected in three languages – English, French, and Dutch – with a roughly equal number of tokens per domain in each language. The corpora within each domain are comparable, in the sense that they contain original texts of the same types and on the same subject; they are not parallel translations, so they cannot be aligned (neither at sentence level nor at document level). The corpus on corruption additionally contains a parallel component, with sentence-aligned, trilingual texts. More detailed information about the creation of these corpora, as well as the annotation process, can be found in Rigouts Terryn, Hoste, & Lefever (2018, 2019) and in the overview paper of the TermEval 2020 shared task, which will appear in the CompuTerm workshop proceedings. Not all information about the annotations mentioned in previous work has been publicly released yet, so the remainder of this page focuses only on the information available in ACTER 1.2.
In each language and domain (for corruption, only the parallel corpus has been annotated), around 50k tokens were manually annotated, for a total of 596,058 annotated tokens. This resulted in 119,455 individual annotations, or 19,002 unique annotations. The annotation guidelines are freely available online[1]. Table 1 summarises the dataset, showing, for each corpus, the domain, type (comparable or parallel), language, number of tokens, number of documents, and number of annotated tokens. For English, there are a total of 1,108,085 tokens, 194,403 of which are annotated[2]. The French corpora count 1,142,575 tokens in total, of which 206,859 are annotated. For Dutch, these numbers are 1,115,264 and 194,796, respectively. In total, the corpora contain 1,194 different documents.
Table 1: Overview of the ACTER corpora per type, domain, and language.

| Type | Domain | Language | # Tokens | # Documents | # Annotated Tokens |
|---|---|---|---|---|---|
| parallel | corruption | en | 176,314 | 24 | 45,234 |
| parallel | corruption | fr | 196,327 | 24 | 50,429 |
| parallel | corruption | nl | 184,541 | 24 | 47,305 |
| comparable | corruption | en | 468,711 | 44 | - |
| comparable | corruption | fr | 475,244 | 31 | - |
| comparable | corruption | nl | 470,242 | 49 | - |
| comparable | dressage | en | 102,654 | 89 | 51,470 |
| comparable | dressage | fr | 109,572 | 125 | 53,316 |
| comparable | dressage | nl | 103,851 | 125 | 50,882 |
| comparable | heart failure | en | 45,788 | 190 | 45,788 |
| comparable | heart failure | fr | 46,751 | 215 | 46,751 |
| comparable | heart failure | nl | 47,888 | 175 | 47,888 |
| comparable | wind energy | en | 314,618 | 38 | 51,911 |
| comparable | wind energy | fr | 314,681 | 12 | 56,363 |
| comparable | wind energy | nl | 308,742 | 29 | 49,582 |
| Total | | | 3,365,924 | 1,194 | 596,058 |
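As a quick sanity check, the per-language totals quoted above can be recomputed from the rows of Table 1; the snippet below verifies the English token total.

```python
# Token counts for the English corpora, copied from Table 1.
en_tokens = {
    "corruption (parallel)": 176_314,
    "corruption (comparable)": 468_711,
    "dressage": 102_654,
    "heart failure": 45_788,
    "wind energy": 314_618,
}

# Matches the English total of 1,108,085 tokens quoted in the text.
assert sum(en_tokens.values()) == 1_108_085
```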
While, for TermEval, ATE is conceived as a binary task (term or not term), there are actually four annotation labels: Specific Terms, Common Terms, Out-of-Domain Terms, and Named Entities. For TermEval, all labels are combined, with separate evaluations with and without Named Entities (see Task & Evaluation). No restrictions were placed on the annotations: all lexical items that were considered terms as they were used in the texts were annotated as such. This means that non-standard terms and misspellings were annotated as well, that there is no minimum frequency threshold, and that terms can be of any length and follow any part-of-speech pattern.
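Since TermEval collapses the four labels into a single binary gold standard, evaluated once with and once without Named Entities, a helper along the following lines can build both settings and score an extracted term list against them. This is a minimal sketch: the label strings are assumptions and may be spelled differently in the released annotation files.

```python
def binary_gold(annotations, include_named_entities=True):
    """Collapse the four annotation labels into one binary gold set.

    `annotations` maps each unique annotation to its label. The label
    names below are assumptions; check the released files for the
    exact spelling.
    """
    term_labels = {"Specific Term", "Common Term", "Out-of-Domain Term"}
    if include_named_entities:
        term_labels.add("Named Entity")
    return {t for t, label in annotations.items() if label in term_labels}


def precision_recall_f1(extracted, gold):
    """Score an extracted term list against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Example: evaluate a (hypothetical) extracted list in both settings.
annotations = {"anti-corruption": "Specific Term", "EU": "Named Entity"}
for with_nes in (True, False):
    gold = binary_gold(annotations, include_named_entities=with_nes)
    print(with_nes, precision_recall_f1(["anti-corruption"], gold))
```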
Table 2: Number of unique annotations (type counts) per corpus and per label.

| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 278 | 644 | 6 | 248 | 1,174 |
| corruption (parallel) | fr | 300 | 678 | 5 | 236 | 1,217 |
| corruption (parallel) | nl | 310 | 731 | 6 | 249 | 1,295 |
| dressage | en | 780 | 309 | 71 | 421 | 1,575 |
| dressage | fr | 705 | 238 | 26 | 221 | 1,183 |
| dressage | nl | 1,026 | 333 | 41 | 153 | 1,546 |
| heart failure | en | 1,885 | 320 | 158 | 228 | 2,585 |
| heart failure | fr | 1,714 | 505 | 59 | 147 | 2,423 |
| heart failure | nl | 1,561 | 450 | 66 | 182 | 2,257 |
| wind energy | en | 781 | 296 | 14 | 444 | 1,534 |
| wind energy | fr | 444 | 308 | 21 | 195 | 968 |
| wind energy | nl | 577 | 342 | 21 | 305 | 1,245 |
| Total | | 10,361 | 5,154 | 494 | 3,029 | 19,002 |
Table 2 shows the number of unique annotations (type counts) per corpus and per label; these are also the numbers that can be found in ACTER 1.2. Note that the per-row totals can be lower than the sum of the per-label counts, since the same unique annotation can receive different labels in different contexts. The same information is repeated in Table 3, but this time with token counts, i.e., each occurrence of an annotation is counted separately. This information (the annotations as spans in the text) is not yet available in ACTER 1.2.
Table 3: Number of annotations (token counts) per corpus and per label.

| Domain | Language | # Specific Terms | # Common Terms | # OOD Terms | # Named Entities | # Annotations (total) |
|---|---|---|---|---|---|---|
| corruption (parallel) | en | 1,078 | 5,295 | 12 | 2,373 | 8,758 |
| corruption (parallel) | fr | 1,077 | 4,842 | 11 | 2,186 | 8,116 |
| corruption (parallel) | nl | 1,041 | 4,111 | 11 | 2,334 | 7,497 |
| dressage | en | 4,729 | 5,997 | 163 | 970 | 11,859 |
| dressage | fr | 4,969 | 4,389 | 39 | 467 | 9,864 |
| dressage | nl | 6,259 | 4,891 | 57 | 295 | 11,502 |
| heart failure | en | 7,402 | 5,626 | 983 | 526 | 14,537 |
| heart failure | fr | 5,212 | 5,336 | 253 | 319 | 11,120 |
| heart failure | nl | 5,428 | 4,537 | 254 | 433 | 10,652 |
| wind energy | en | 5,151 | 4,282 | 45 | 1,429 | 10,907 |
| wind energy | fr | 3,208 | 5,207 | 109 | 439 | 8,963 |
| wind energy | nl | 2,104 | 2,805 | 135 | 636 | 5,680 |
| Total | | 47,658 | 57,318 | 2,072 | 12,407 | 119,455 |
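The difference between Tables 2 and 3 is the classic type/token distinction: Table 2 counts each unique annotation once, while Table 3 counts every occurrence. The sketch below illustrates the two counts on a hypothetical list of annotated spans (recall that span-level data is not yet part of ACTER 1.2).

```python
from collections import Counter

# Hypothetical list of annotated spans: one (surface form, label) pair
# per occurrence in the text (span-level data is not in ACTER 1.2).
occurrences = [
    ("ejection fraction", "Specific Terms"),
    ("ejection fraction", "Specific Terms"),
    ("heart", "Common Terms"),
]

# Token counts (Table 3 style): every occurrence counts.
token_counts = Counter(label for _, label in occurrences)

# Type counts (Table 2 style): each unique (form, label) pair counts once.
type_counts = Counter(label for _, label in set(occurrences))

print(token_counts)  # Counter({'Specific Terms': 2, 'Common Terms': 1})
print(type_counts)   # Counter({'Specific Terms': 1, 'Common Terms': 1})
```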
While these human annotations are unavoidably subject to some degree of error and subjectivity, every care has been taken to obtain high-quality annotations, and they have been validated in various studies (Rigouts Terryn, Drouin, et al., 2019; Rigouts Terryn, Hoste, et al., 2019). Since one purpose of this shared task is to provide the community with a practical dataset for ATE, we strongly encourage anyone who uses the dataset and spots errors to notify us, so we can continue to improve the data (an email address is provided in the README.md).
[1] http://hdl.handle.net/1854/LU-8503113
[2] While the gold standard list of annotated terms may contain many terms that also appear in the unannotated parts of the corpora, the unannotated parts of the corpora will undoubtedly also contain terms that do not appear in the gold standard lists.