TREC-QA
The TREC dataset contains roughly 5,500 labeled questions in the training set (5,452 in this release) and another 500 in the test set. Each question carries one of 6 coarse class labels and one of 50 fine class labels. Questions average about 10 words in length, and the vocabulary size is 8,700. The data are collected from several sources: English questions published by USC, manually constructed questions for rare classes, and TREC 8, 9, and 10 questions. The dataset is monolingual (English). The associated citations are Xin Li and Dan Roth (2002) and Eduard Hovy et al. (2001). Its two-level label scheme makes the dataset suitable for training and evaluating text classification models at either coarse or fine granularity.
Training set: 5,452 questions
Test set: 500 questions
Size of downloaded dataset files: 0.36 MB
Size of the generated dataset: 0.41 MB
Total amount of disk used: 0.78 MB
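To make the two-level label scheme concrete, here is a minimal Python sketch that splits a labeled line into its coarse label, fine label, and question text. It assumes the raw-file convention of one question per line in the form `COARSE:fine question text`, with the six coarse labels ABBR, DESC, ENTY, HUM, LOC, and NUM; other distributions of the dataset may expose the labels differently. The `Example` type, the `parse_line` helper, and the sample questions are hypothetical and written only for illustration.

```python
from collections import Counter
from typing import NamedTuple

# Assumed set of the 6 coarse classes used by the dataset.
COARSE_LABELS = {"ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"}


class Example(NamedTuple):
    coarse: str  # e.g. "LOC"
    fine: str    # e.g. "LOC:city" (coarse prefix kept so fine labels stay unique)
    text: str    # the question itself


def parse_line(line: str) -> Example:
    """Parse one line of the assumed 'COARSE:fine question text' format."""
    label, _, text = line.strip().partition(" ")
    coarse, sep, fine = label.partition(":")
    if not sep or not fine or coarse not in COARSE_LABELS:
        raise ValueError(f"unrecognized label: {label!r}")
    return Example(coarse, f"{coarse}:{fine}", text)


# Illustrative lines written for this sketch, not copied from the dataset.
sample_lines = [
    "LOC:city What is the capital of France ?",
    "NUM:date When was the telescope invented ?",
    "HUM:ind Who wrote the first dictionary ?",
    "NUM:date When did the war end ?",
]

examples = [parse_line(line) for line in sample_lines]

# Count questions per coarse and per fine class.
coarse_counts = Counter(ex.coarse for ex in examples)
fine_counts = Counter(ex.fine for ex in examples)
print(coarse_counts)
print(fine_counts)
```

Keeping the coarse prefix inside the fine label is a design choice of this sketch: it lets the same parsed examples be used for either the 6-way or the 50-way classification task without a second lookup.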
The TREC dataset is primarily used for text classification tasks, specifically multi-class classification. It can be employed in various natural language processing applications, such as question answering, semantic analysis, and information retrieval.
No license is explicitly stated for this dataset.