MultiTACRED

Full Official Name: MultiTACRED
Submission date: Oct. 15, 2024, 10:01 p.m.

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab. It is a machine translation of the TAC Relation Extraction Dataset (TACRED, LDC2018T24) into twelve languages, with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over the English newswire and web text used in the NIST TAC KBP English slot filling evaluations from 2009 to 2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium. The TACRED training, development, and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was also back-translated into English to produce machine-translated English test data. TACRED annotations are specified by token offsets. For translation, the tokens were concatenated with white space, and the entity offsets were converted into XML-style markers that denote the relation arguments. Data is presented in JSON format, encoded in UTF-8.
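
The preprocessing step described above (joining tokens with white space and replacing token offsets with XML-style markers) can be illustrated with a minimal Python sketch. It is not the dataset creators' code. The function name `mark_entities`, the tag names `<H>` and `<T>`, the inclusive end offsets, and the assumption that the two argument spans do not overlap are all chosen here for illustration only.

```python
def mark_entities(tokens, subj_span, obj_span):
    """Join tokens with spaces, wrapping two argument spans in XML-style tags.

    Each span is a (start, end) pair of token offsets; the end offset is
    treated as inclusive (an assumption of this sketch). The spans are
    assumed not to overlap. The tag names <H> and <T> are hypothetical.
    """
    (s_start, s_end), (o_start, o_end) = subj_span, obj_span
    out = []
    for i, tok in enumerate(tokens):
        # Opening markers go before the first token of each span.
        if i == s_start:
            out.append("<H>")
        if i == o_start:
            out.append("<T>")
        out.append(tok)
        # Closing markers go after the last token of each span.
        if i == s_end:
            out.append("</H>")
        if i == o_end:
            out.append("</T>")
    return " ".join(out)


tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
print(mark_entities(tokens, (0, 1), (3, 3)))
# prints: <H> Bill Gates </H> founded <T> Microsoft </T> .
```

Carrying the markers through translation is what allows the argument spans to be located again in the translated sentence, where token positions generally differ from the English original.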

Creator(s)
Distributor(s)
Right Holder(s)