{"id":177,"date":"2019-10-23T08:43:41","date_gmt":"2019-10-23T08:43:41","guid":{"rendered":"http:\/\/reacte.ugent.be\/?page_id=177"},"modified":"2021-02-16T17:39:45","modified_gmt":"2021-02-16T16:39:45","slug":"acter-dataset","status":"publish","type":"page","link":"https:\/\/termeval.ugent.be\/nl\/acter-dataset\/","title":{"rendered":"ACTER dataset"},"content":{"rendered":"\n<h2>ACTER Annotated Corpora for Term Extraction Research<\/h2>\n\n\n\n<p class=\"has-text-color has-medium-font-size has-very-dark-gray-color\"><strong>The data can be downloaded from the GitHub repository <a href=\"https:\/\/github.com\/AylaRT\/ACTER\">https:\/\/github.com\/AylaRT\/ACTER<\/a>, which also contains a detailed README.md with all necessary information.<\/strong><\/p>\n\n\n\n<p>The ACTER dataset (currently version 1.3) is publicly available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (<a href=\"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/\">CC BY-NC-SA 4.0<\/a>). The dataset can be freely used and adapted for non-commercial purposes, provided any changes made to the data are clearly mentioned and the proper reference is cited: <a href=\"https:\/\/doi.org\/10.1007\/s10579-019-09453-9\">https:\/\/doi.org\/10.1007\/s10579-019-09453-9<\/a><\/p>\n\n\n\n<h3>Background &amp; Numbers<\/h3>\n\n\n\n<p>The ACTER dataset has been specifically created to address some of the <a href=\"https:\/\/termeval.ugent.be\/?page_id=166\">hurdles faced by automatic term extraction (ATE)<\/a>. The corpora cover four domains: corruption, dressage (a branch of horse riding), heart failure and wind energy. For each domain, corpora have been collected in three languages \u2013 English, French, and Dutch \u2013 with a roughly equal number of tokens per domain in each language. 
The corpora within each domain are all comparable: they contain original texts of the same types on the same subject, but they are not translations of one another, so they cannot be aligned (neither at sentence level nor at document level). The corpus on corruption also contains a parallel component, with sentence-aligned, trilingual texts. More detailed information about the creation of these corpora, as well as the annotation process, can be found in Rigouts Terryn, Hoste, &amp; Lefever&nbsp;(2018, 2019) and in the overview paper of the TermEval2020 shared task, which will appear in the CompuTerm workshop proceedings. Not all information about the annotations as mentioned in previous work has been publicly released yet, so the remainder of this page will focus only on the information available in ACTER 1.2.<\/p>\n\n\n\n<p>In each language and domain (for corruption, only the parallel corpus has been annotated), \u00b150k tokens were manually annotated, with a total of 596,058 annotated tokens. This resulted in 119,455 individual annotations or 19,002 unique annotations. The annotation guidelines are freely available online<sup>[1]<\/sup>. Table 1 shows a summary of the dataset, showing, for each corpus, the type (comparable or parallel), domain, language, number of tokens, number of documents and number of annotated tokens. For English, there are a total of 1,108,085 tokens, 194,403 of which are annotated<sup>[2]<\/sup>. The French corpora count 1,142,575 tokens in total, of which 206,859 are annotated. For Dutch, these numbers are 1,115,264 and 194,796 respectively. 
In total, the corpora contain 1194 different documents.<\/p>\n\n\n\n<table id=\"tablepress-1\" class=\"tablepress tablepress-id-1\" aria-describedby=\"tablepress-1-description\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Type<\/th><th class=\"column-2\">Domain<\/th><th class=\"column-3\">Language<\/th><th class=\"column-4\"># Tokens<\/th><th class=\"column-5\"># Documents<\/th><th class=\"column-6\"># Annotated Tokens<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td rowspan=\"3\" class=\"column-1\">parallel<\/td><td rowspan=\"3\" class=\"column-2\">corruption<\/td><td class=\"column-3\">en<\/td><td class=\"column-4\"> 176,314 <\/td><td class=\"column-5\">24<\/td><td class=\"column-6\"> 45,234 <\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-3\">fr<\/td><td class=\"column-4\"> 196,327 <\/td><td class=\"column-5\">24<\/td><td class=\"column-6\"> 50,429 <\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-3\">nl<\/td><td class=\"column-4\"> 184,541 <\/td><td class=\"column-5\">24<\/td><td class=\"column-6\"> 47,305 <\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td rowspan=\"12\" class=\"column-1\">comparable<\/td><td rowspan=\"3\" class=\"column-2\">corruption<\/td><td class=\"column-3\">en<\/td><td class=\"column-4\"> 468,711 <\/td><td class=\"column-5\">44<\/td><td class=\"column-6\">-<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-3\">fr<\/td><td class=\"column-4\"> 475,244 <\/td><td class=\"column-5\">31<\/td><td class=\"column-6\">-<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-3\">nl<\/td><td class=\"column-4\"> 470,242 <\/td><td class=\"column-5\">49<\/td><td class=\"column-6\">-<\/td>\n<\/tr>\n<tr class=\"row-8 even\">\n\t<td rowspan=\"3\" class=\"column-2\">dressage<\/td><td class=\"column-3\">en<\/td><td class=\"column-4\"> 102,654 <\/td><td class=\"column-5\">89<\/td><td class=\"column-6\"> 51,470 <\/td>\n<\/tr>\n<tr class=\"row-9 odd\">\n\t<td 
class=\"column-3\">fr<\/td><td class=\"column-4\"> 109,572 <\/td><td class=\"column-5\">125<\/td><td class=\"column-6\"> 53,316 <\/td>\n<\/tr>\n<tr class=\"row-10 even\">\n\t<td class=\"column-3\">nl<\/td><td class=\"column-4\"> 103,851 <\/td><td class=\"column-5\">125<\/td><td class=\"column-6\"> 50,882<\/td>\n<\/tr>\n<tr class=\"row-11 odd\">\n\t<td rowspan=\"3\" class=\"column-2\">heart failure<\/td><td class=\"column-3\">en<\/td><td class=\"column-4\"> 45,788 <\/td><td class=\"column-5\">190<\/td><td class=\"column-6\"> 45,788 <\/td>\n<\/tr>\n<tr class=\"row-12 even\">\n\t<td class=\"column-3\">fr<\/td><td class=\"column-4\"> 46,751 <\/td><td class=\"column-5\">215<\/td><td class=\"column-6\"> 46,751 <\/td>\n<\/tr>\n<tr class=\"row-13 odd\">\n\t<td class=\"column-3\">nl<\/td><td class=\"column-4\"> 47,888 <\/td><td class=\"column-5\">175<\/td><td class=\"column-6\"> 47,888 <\/td>\n<\/tr>\n<tr class=\"row-14 even\">\n\t<td rowspan=\"3\" class=\"column-2\">wind energy<\/td><td class=\"column-3\">en<\/td><td class=\"column-4\"> 314,618 <\/td><td class=\"column-5\">38<\/td><td class=\"column-6\"> 51,911 <\/td>\n<\/tr>\n<tr class=\"row-15 odd\">\n\t<td class=\"column-3\">fr<\/td><td class=\"column-4\"> 314,681 <\/td><td class=\"column-5\">12<\/td><td class=\"column-6\"> 56,363 <\/td>\n<\/tr>\n<tr class=\"row-16 even\">\n\t<td class=\"column-3\">nl<\/td><td class=\"column-4\"> 308,742 <\/td><td class=\"column-5\">29<\/td><td class=\"column-6\"> 49,582 <\/td>\n<\/tr>\n<\/tbody>\n<tfoot>\n<tr class=\"row-17 odd\">\n\t<th class=\"column-1\"><\/th><th class=\"column-2\">Total<\/th><th class=\"column-3\"><\/th><th class=\"column-4\"> 3,365,924 <\/th><th class=\"column-5\">1194<\/th><th class=\"column-6\"> 596,058 <\/th>\n<\/tr>\n<\/tfoot>\n<\/table>\n<span id=\"tablepress-1-description\" class=\"tablepress-table-description tablepress-table-description-id-1\">Overview of the corpora per type, domain, and language; with number of tokens, number of documents and number of 
annotated tokens.<\/span>\n<!-- #tablepress-1 from cache -->\n\n\n\n<p><\/p>\n\n\n\n<p>Although TermEval treats ATE as a binary task (term or not a term), there are actually four annotation labels: Specific Terms, Common Terms, Out-of-Domain Terms, and Named Entities. For TermEval, all labels are combined, with separate evaluations with and without Named Entities (see <a href=\"http:\/\/reacte.ugent.be\/?page_id=178\">Task &amp; Evaluation<\/a>). No restrictions were placed on the annotations: all lexical items that were considered terms as they were used in the texts were annotated as such. This means that non-standard terms and misspellings are annotated as well, that there is no minimum frequency, and that terms can be of any length and follow any POS pattern.<\/p>\n\n\n\n<table id=\"tablepress-3\" class=\"tablepress tablepress-id-3\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Domain<\/th><th class=\"column-2\">Language<\/th><th class=\"column-3\"># Specific Terms<\/th><th class=\"column-4\"># Common Terms<\/th><th class=\"column-5\"># OOD Terms<\/th><th class=\"column-6\"># Named Entities<\/th><th class=\"column-7\"># Annotations (total)<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td rowspan=\"3\" class=\"column-1\">corruption (parallel)<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">278<\/td><td class=\"column-4\">644<\/td><td class=\"column-5\">6<\/td><td class=\"column-6\">248<\/td><td class=\"column-7\">1174<\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">300<\/td><td class=\"column-4\">678<\/td><td class=\"column-5\">5<\/td><td class=\"column-6\">236<\/td><td class=\"column-7\">1217<\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">310<\/td><td class=\"column-4\">731<\/td><td class=\"column-5\">6<\/td><td class=\"column-6\">249<\/td><td 
class=\"column-7\">1295<\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td rowspan=\"3\" class=\"column-1\">dressage<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">780<\/td><td class=\"column-4\">309<\/td><td class=\"column-5\">71<\/td><td class=\"column-6\">421<\/td><td class=\"column-7\">1575<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">705<\/td><td class=\"column-4\">238<\/td><td class=\"column-5\">26<\/td><td class=\"column-6\">221<\/td><td class=\"column-7\">1183<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">1026<\/td><td class=\"column-4\">333<\/td><td class=\"column-5\">41<\/td><td class=\"column-6\">153<\/td><td class=\"column-7\">1546<\/td>\n<\/tr>\n<tr class=\"row-8 even\">\n\t<td rowspan=\"3\" class=\"column-1\">heart failure<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">1885<\/td><td class=\"column-4\">320<\/td><td class=\"column-5\">158<\/td><td class=\"column-6\">228<\/td><td class=\"column-7\">2585<\/td>\n<\/tr>\n<tr class=\"row-9 odd\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">1714<\/td><td class=\"column-4\">505<\/td><td class=\"column-5\">59<\/td><td class=\"column-6\">147<\/td><td class=\"column-7\">2423<\/td>\n<\/tr>\n<tr class=\"row-10 even\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">1561<\/td><td class=\"column-4\">450<\/td><td class=\"column-5\">66<\/td><td class=\"column-6\">182<\/td><td class=\"column-7\">2257<\/td>\n<\/tr>\n<tr class=\"row-11 odd\">\n\t<td rowspan=\"3\" class=\"column-1\">wind energy<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">781<\/td><td class=\"column-4\">296<\/td><td class=\"column-5\">14<\/td><td class=\"column-6\">444<\/td><td class=\"column-7\">1534<\/td>\n<\/tr>\n<tr class=\"row-12 even\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">444<\/td><td class=\"column-4\">308<\/td><td class=\"column-5\">21<\/td><td 
class=\"column-6\">195<\/td><td class=\"column-7\">968<\/td>\n<\/tr>\n<tr class=\"row-13 odd\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">577<\/td><td class=\"column-4\">342<\/td><td class=\"column-5\">21<\/td><td class=\"column-6\">305<\/td><td class=\"column-7\">1245<\/td>\n<\/tr>\n<\/tbody>\n<tfoot>\n<tr class=\"row-14 even\">\n\t<th colspan=\"2\" class=\"column-1\">Total<\/th><th class=\"column-3\">10361<\/th><th class=\"column-4\">5154<\/th><th class=\"column-5\">494<\/th><th class=\"column-6\">3029<\/th><th class=\"column-7\">19002<\/th>\n<\/tr>\n<\/tfoot>\n<\/table>\n<!-- #tablepress-3 from cache -->\n\n\n\n<p>Table 2 shows the number of unique annotations (type counts) per corpus and per label. These are also the numbers that can be found in ACTER 1.2. Table 3 repeats the same information with token counts, i.e. every occurrence of an annotation is counted separately. This information (the annotations as spans in the text) is not yet available in ACTER 1.2.<\/p>\n\n\n\n<table id=\"tablepress-4\" class=\"tablepress tablepress-id-4\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Domain<\/th><th class=\"column-2\">Language<\/th><th class=\"column-3\"># Specific Terms<\/th><th class=\"column-4\"># Common Terms<\/th><th class=\"column-5\"># OOD Terms<\/th><th class=\"column-6\"># Named Entities<\/th><th class=\"column-7\"># Annotations (total)<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td rowspan=\"3\" class=\"column-1\">corruption (parallel)<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">1078<\/td><td class=\"column-4\">5295<\/td><td class=\"column-5\">12<\/td><td class=\"column-6\">2373<\/td><td class=\"column-7\">8758<\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">1077<\/td><td class=\"column-4\">4842<\/td><td class=\"column-5\">11<\/td><td class=\"column-6\">2186<\/td><td class=\"column-7\">8116<\/td>\n<\/tr>\n<tr 
class=\"row-4 even\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">1041<\/td><td class=\"column-4\">4111<\/td><td class=\"column-5\">11<\/td><td class=\"column-6\">2334<\/td><td class=\"column-7\">7497<\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td rowspan=\"3\" class=\"column-1\">dressage<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">4729<\/td><td class=\"column-4\">5997<\/td><td class=\"column-5\">163<\/td><td class=\"column-6\">970<\/td><td class=\"column-7\">11859<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">4969<\/td><td class=\"column-4\">4389<\/td><td class=\"column-5\">39<\/td><td class=\"column-6\">467<\/td><td class=\"column-7\">9864<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">6259<\/td><td class=\"column-4\">4891<\/td><td class=\"column-5\">57<\/td><td class=\"column-6\">295<\/td><td class=\"column-7\">11502<\/td>\n<\/tr>\n<tr class=\"row-8 even\">\n\t<td rowspan=\"3\" class=\"column-1\">heart failure<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">7402<\/td><td class=\"column-4\">5626<\/td><td class=\"column-5\">983<\/td><td class=\"column-6\">526<\/td><td class=\"column-7\">14537<\/td>\n<\/tr>\n<tr class=\"row-9 odd\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">5212<\/td><td class=\"column-4\">5336<\/td><td class=\"column-5\">253<\/td><td class=\"column-6\">319<\/td><td class=\"column-7\">11120<\/td>\n<\/tr>\n<tr class=\"row-10 even\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">5428<\/td><td class=\"column-4\">4537<\/td><td class=\"column-5\">254<\/td><td class=\"column-6\">433<\/td><td class=\"column-7\">10652<\/td>\n<\/tr>\n<tr class=\"row-11 odd\">\n\t<td rowspan=\"3\" class=\"column-1\">wind energy<\/td><td class=\"column-2\">en<\/td><td class=\"column-3\">5151<\/td><td class=\"column-4\">4282<\/td><td class=\"column-5\">45<\/td><td class=\"column-6\">1429<\/td><td 
class=\"column-7\">10907<\/td>\n<\/tr>\n<tr class=\"row-12 even\">\n\t<td class=\"column-2\">fr<\/td><td class=\"column-3\">3208<\/td><td class=\"column-4\">5207<\/td><td class=\"column-5\">109<\/td><td class=\"column-6\">439<\/td><td class=\"column-7\">8963<\/td>\n<\/tr>\n<tr class=\"row-13 odd\">\n\t<td class=\"column-2\">nl<\/td><td class=\"column-3\">2104<\/td><td class=\"column-4\">2805<\/td><td class=\"column-5\">135<\/td><td class=\"column-6\">636<\/td><td class=\"column-7\">5680<\/td>\n<\/tr>\n<\/tbody>\n<tfoot>\n<tr class=\"row-14 even\">\n\t<th colspan=\"2\" class=\"column-1\">Total<\/th><th class=\"column-3\">47658<\/th><th class=\"column-4\">57318<\/th><th class=\"column-5\">2072<\/th><th class=\"column-6\">12407<\/th><th class=\"column-7\">119455<\/th>\n<\/tr>\n<\/tfoot>\n<\/table>\n<!-- #tablepress-4 from cache -->\n\n\n\n<p>While these human annotations are unavoidably subject to human error and subjectivity, every care has been taken to obtain high-quality annotations, which have been validated in various studies&nbsp;(Rigouts Terryn, Drouin, et al., 2019; Rigouts Terryn, Hoste, et al., 2019), though some errors may of course still remain. 
Since the purpose of this shared task is also to provide the community with a practical dataset for ATE, we strongly encourage users of the dataset who spot errors to report them so we can continue to improve the data (an email address is provided in the README.md).<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p class=\"has-text-color has-cyan-bluish-gray-color\"><sup><span style=\"text-decoration: underline;\">[1]<\/span><\/sup><span style=\"text-decoration: underline;\">&nbsp;<\/span><a href=\"http:\/\/hdl.handle.net\/1854\/LU-8503113\"><span style=\"text-decoration: underline;\">http:\/\/hdl.handle.net\/1854\/LU-8503113<\/span><\/a><\/p>\n\n\n\n<p><sup>[2]<\/sup>&nbsp;While the gold standard list of annotated terms may contain many terms that also appear in the unannotated parts of the corpora, the unannotated parts of the corpora will undoubtedly also contain terms that do not appear in the gold standard lists.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>ACTER Annotated Corpora for Term Extraction Research The data can be downloaded from the Github repository: https:\/\/github.com\/AylaRT\/ACTER which also contains an elaborate README.md with all necessary information. The ACTER dataset (currently version 1.3) is publicly available with Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/). 
The dataset can be freely used and adapted for [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"translation":{"provider":"WPGlobus","version":"2.8.11","language":"nl","enabled_languages":["en","es","de","fr","nl"],"languages":{"en":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"de":{"title":false,"content":false,"excerpt":false},"fr":{"title":false,"content":false,"excerpt":false},"nl":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/pages\/177"}],"collection":[{"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/comments?post=177"}],"version-history":[{"count":36,"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/pages\/177\/revisions"}],"predecessor-version":[{"id":594,"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/pages\/177\/revisions\/594"}],"wp:attachment":[{"href":"https:\/\/termeval.ugent.be\/nl\/wp-json\/wp\/v2\/media?parent=177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}