The "tydiqa" dataset targets question answering, specifically extractive question answering, where the answer is a span of text taken from a source document. It is multilingual, covering Arabic, Bengali, English, and eight other languages. It was created and annotated through crowdsourcing, and its source material is drawn from Wikipedia. The dataset metadata does not specify a size category. It is a resource for researchers and practitioners working on multilingual question answering and related fields, offering questions and documents in a varied set of languages.
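To make "extractive" concrete: the answer is copied out of the passage rather than generated. The following is a minimal sketch, assuming a SQuAD-style record that stores the answer text together with its character offset in the passage. The class name, field names, and sample record are illustrative assumptions, not the dataset's documented schema.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """Hypothetical record layout for one extractive QA example."""
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer within context


def extract_answer(example: QAExample) -> str:
    """Return the answer span by slicing the context at the stored offset."""
    end = example.answer_start + len(example.answer_text)
    return example.context[example.answer_start:end]


def is_consistent(example: QAExample) -> bool:
    """Check that the stored offset really points at the answer text."""
    return extract_answer(example) == example.answer_text


# Made-up sample for illustration only; not taken from the dataset.
context = "Helsinki is the capital of Finland."
sample = QAExample(
    question="What is the capital of Finland?",
    context=context,
    answer_text="Helsinki",
    answer_start=context.index("Helsinki"),
)
print(extract_answer(sample))
```

A consistency check like `is_consistent` can be useful when preprocessing multilingual text, since normalization or re-encoding steps may shift character offsets.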
The dataset is divided into a primary task with 170k rows and a secondary task with 55k rows. The listed splits contain 151k rows for training and 18.7k rows for validation; these figures sum to about 170k, which suggests they describe the primary task.
The "tydiqa" dataset can be used to develop and evaluate multilingual question-answering models, particularly extractive ones. Its language coverage makes it suitable for research on cross-lingual understanding and for applications that need multilingual support. Its structure and content may also be useful for linguistic studies and for natural language processing and machine learning work on models that understand questions and produce human-like answers.
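As an illustration of what evaluating an extractive model can involve, here is a minimal sketch of two metrics commonly used for span-based question answering: exact match and token-level F1. The dataset description does not prescribe these metrics, so treat them as an assumption. The whitespace normalizer is deliberately simplistic, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Helsinki", "helsinki"))
print(token_f1("the city of Helsinki", "Helsinki"))
```

Whitespace tokenization does not work for languages written without spaces between words, so a real multilingual evaluation would need language-appropriate tokenization in place of `normalize`.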
The dataset is released under the Apache-2.0 license.