

Explore popular machine learning datasets.

DocLayNet Large

The DocLayNet dataset is a large human-annotated dataset for document-layout segmentation. It provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80,863 unique pages from 6 document categories. Compared to related work such as PubLayNet or DocBank, DocLayNet offers several unique features, including human annotation, large layout variability, a detailed label set, redundant annotations, and pre-defined train, test, and validation sets.

License: The DocLayNet dataset is published under the CDLA-Permissive-1.0 license.

Alpaca GPT4

This dataset contains 52,000 instances of instruction-following data, all unique, generated by GPT-4 using the same prompts as in Alpaca. The data includes `instruction`, `input`, `output`, and `text` fields, where `text` is a concatenation of the previous fields. The dataset was structured to be compatible with Hugging Face's datasets library.

License: This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
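As a rough illustration, the `text` column can be reconstructed from the other columns; the sketch below assumes the standard Alpaca prompt template (an assumption — consult the dataset card for the canonical wording):

```python
def build_text(instruction: str, input_: str, output: str) -> str:
    """Concatenate instruction, input, and output into a single Alpaca-style
    prompt string, as the dataset's `text` column does."""
    if input_:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_}\n\n"
            f"### Response:\n{output}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )

example = build_text("Name three primary colors.", "", "Red, blue, and yellow.")
```

Records without an `input` field use the shorter template, which is why the two branches differ.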

Dataset Card for AmbigQA: Answ...

AmbigQA is a new open-domain question answering task that consists of predicting a set of question-answer pairs, where each plausible answer is associated with a disambiguated rewriting of the original question. The dataset covers 14,042 open-ended NQ-open questions.

License: The dataset is licensed under CC BY-SA 3.0.

Apex Video Game

The apex-videogame dataset is an object detection dataset comprising images and their corresponding object annotations. It is a subset of the original COCO dataset, stored in `dataset.tar.gz`.

License: The apex-videogame dataset is released under the Creative Commons (CC) license.

AutoTrain Dataset for project:...

The AutoTrain Dataset for project disparities_pubmed_mit is a text classification dataset that has been automatically processed by AutoTrain. It contains textual data instances with associated target labels. The dataset is used for studying disparities in PubMed articles related to the MIT project. It includes samples with various topics and linguistic characteristics.

License: The license for the dataset is not mentioned in the given information.

Book Corpus

The BookCorpus dataset focuses on aligning books with their corresponding movie releases to provide comprehensive descriptive explanations for visual content. This dataset offers both fine-grained details and high-level semantics found in books, enriching visual understanding beyond current caption-based datasets. This dataset can be used for research and development in fields such as natural language processing, computer vision, and multimodal analysis. It enables the exploration of story-to-visual connections and the creation of richer image descriptions.

License: The books within this dataset have been sourced from https://www.smashwords.com, and users...

Brain Tumor Computer Vision Pr...

The brain-tumor-m2pbp dataset is an open-source dataset for object detection tasks. It consists of images with object annotations, specifically targeting brain tumors. The dataset is stored in `dataset.tar.gz`, with a file size of 202 MB. The dataset contains a total of 9,900 rows.

License: The dataset is licensed under the Creative Commons (CC) license.

Break

BREAK is a question understanding dataset, aiming to probe the ability of models to reason over complex questions. It consists of 83,978 examples, each containing a question, its decomposition into reasoning steps, and a full natural language explanation for arriving at the answer.

License: The dataset is licensed under the CC BY-SA 4.0 license.

Colossal, Cleaned version of C...

The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, available in different variants. It is designed by AllenAI for pretraining language models and word representations. The dataset is in English and is intended for tasks related to language understanding and generation.

License: The dataset is released under the terms of ODC-BY (Open Data Commons Attribution License)....

ChatGPT Jailbreak Prompts

ChatGPT Jailbreak Prompts is a collection of jailbreak-related prompts for ChatGPT. This dataset is intended to provide a valuable resource for understanding and generating text in the context of jailbreaking ChatGPT.

License: The dataset does not provide explicit details for the license.

Climate Detection

The Climate Detection dataset is an expert-annotated dataset that aims to detect climate-related paragraphs in corporate disclosures. It contains paragraphs extracted from corporate annual reports and sustainability reports and is labeled with 0 for not climate-related and 1 for climate-related paragraphs.

License: The climate_detection dataset is licensed under the Creative Commons Attribution-NonCommer...

CNNovel125K

CNNovel125K is a dataset consisting of around 125,000 Chinese novels that have been downloaded from the Chinese novel hosting site ibiquw.com. The dataset contains the full text of each novel, including all chapters. It is primarily intended for unsupervised training of text generation models, but may also be useful for other text-related tasks.

License: The CNNovel125K dataset is licensed under the Apache 2.0 license. However, it should be n...

Commonsense QA

The Commonsense QA dataset is a multiple-choice question-answering dataset that requires a system to have an understanding of commonsense knowledge. It is designed to probe a machine's commonsense reasoning capabilities. The dataset contains questions constructed to have clear answers, designed to be answerable by humans without any context.

License: The dataset is licensed under the CC BY-SA 4.0 license.

Construction Site Safety Image...

This dataset contains a collection of images from construction sites annotated with various safety objects such as hardhats, safety vests, gloves, masks, barricades, and more. The dataset includes images in different formats and sizes, along with corresponding bounding box annotations. It can be used for object detection tasks related to construction site safety.

License: CC BY 4.0

Conv AI3

The ConvAI3 dataset is part of the Conversational Intelligence Challenge 3, which focuses on the task of conversational response ranking. It's a collection of dialogues that are designed to evaluate conversational models. The dataset is built around the PersonaChat dataset, and the conversations are based on the personas provided. The dataset includes a training set with 164,356 examples and a validation set with 7,801 examples.

License: The licensing information for the ConvAI3 dataset is not explicitly mentioned.

CoQA

The "coqa" dataset is designed for the task of Question Answering, specifically focusing on extractive question answering. It is a monolingual dataset in English.

License: The dataset has a combination of licenses based on where the data is collected: Literatur...

Cornell Movie Dialogue Corpus

The Cornell Movie Dialogs dataset is a rich set of movie character dialogues. It contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. The dataset includes 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.

License: The dataset is licensed under the Creative Commons Attribution 4.0 International License (...

Covid Fake News

The "COVID-19 Fake News Dataset" is a collection of text data that aims to address the problem of fake news related to the COVID-19 pandemic. The dataset was created to help researchers and data scientists develop models and algorithms for detecting and combating fake news. It consists of 10,700 rows of text data, with each row representing a tweet about COVID-19.

License: The dataset was released as part of the AI in Medicine workshop AI4COVID19.

Drop

The "drop" dataset is designed for question answering and text-to-text generation tasks. It includes sub-tasks like extractive question answering and abstractive question answering. The dataset is in English and is monolingual.

License: The dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 Internationa...

DuReader2.0

DuReader 2.0 is a large-scale, open-domain Chinese dataset for reading comprehension (RC) and question answering (QA). It contains over 300K questions, 1.4M evidence documents, and corresponding human-generated answers.

License: The dataset is licensed under the Apache License 2.0.

Empathetic Dialogues

This dataset accompanies the paper "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset" and ships with the paper's original PyTorch implementation. It is primarily used for tasks involving conversation and question answering.

License: The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 Internat...

Fashionpedia 4 categories

Fashionpedia_4_categories is a variation of the fashionpedia dataset with only 4 categories: Clothing, Shoes, Bags, and Accessories. The dataset was created to make the object detection task easier compared to the full fashionpedia dataset, which contains 46 categories. The dataset includes new splits for training, validation, and testing, with 90% of the images allocated for training, 5% for validation, and 5% for testing.

License: The dataset is licensed under cc-by-4.0.

Financial Phrasebank

The financial_phrasebank dataset is a polar sentiment dataset consisting of 4,840 sentences from English language financial news. Each sentence is categorized based on sentiment, and the dataset is divided by agreement rate of 5-8 annotators. The dataset is designed to be used for tasks related to sentiment classification in the financial domain.

License: The financial_phrasebank dataset is licensed under the Creative Commons Attribution-NonCo...

hansard_speech

The Hansard speech dataset is a collection of every speech in the House of Commons from May 1979 to July 2020. It contains transcriptions of speeches made by Members of Parliament, providing valuable historical text data. The dataset homepage acknowledges that the integrity of the public Hansard record may be questionable at times, but efforts have been made to improve the data. The dataset is presented "as is" and any errors should be reported.

License: The Hansard speech dataset is licensed under the Creative Commons Attribution 4.0 Internat...

Hotpot QA

The "hotpot_qa" dataset is designed for the task of Question Answering in English. It is monolingual and falls under the size category of 100K to 1M. The dataset is original and has been crowdsourced.

License: The dataset is licensed under the CC BY-SA 4.0 License.

IMDB

The dataset is IMDb, a text classification dataset used for sentiment classification. It contains expert-generated annotations and was originally sourced from IMDb. This version of the dataset includes additional columns for unique identifiers and embeddings from three different sentence-transformer models. The labels column has been renamed to "labels" for compatibility with the transformers library.

**Dataset Size**: The dataset has a total size of 519 MB, including the downloaded dataset files as well as the auto-generated Parquet files.

**Dataset License**: The dataset is licensed as "other"; the specific details of the license are not provided.

**Dataset Use Cases**: The IMDb dataset can be used for text classification tasks, specifically sentiment classification. It is a monolingual, English-only dataset, suitable for training machine learning models for sentiment analysis and other related natural language processing tasks.

License: The dataset is licensed as "other"; the specific details of the license are not provided.

KLUE

KLUE is a dataset that consists of 8 tasks to assess the natural language understanding ability of Korean language models. The tasks include Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.

License: The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 Internation...

License Plate Object Detection

The License Plate Object Detection Dataset is a collection of images with annotations for license plates. The dataset includes 8,823 images in COCO format. The images have undergone auto-orientation and no image augmentation techniques were applied. The dataset can be accessed and used through the Roboflow Universe platform.

License: The dataset is licensed under CC BY 4.0.

LIFD_Magnetic_Field_Data

The LIFD_Magnetic_Field_Data dataset describes the gufm1 model, a global geomagnetic model based on spherical harmonics. It covers the period from 1590 to 1990 and provides information about the radial magnetic field at the core-mantle boundary. The dataset dimensions are (181, 361, 401), representing co-latitude, longitude, and time, and the values are in nT (nanotesla).

License: The dataset is licensed under the MIT license.

LLaVA-Instruct-150K

LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data, primarily intended for research on large multimodal models and chatbots. The dataset was collected in April 2023 by prompting the GPT-4-0314 API. More information about the dataset can be found on the [LLaVA website](https://llava-vl.github.io/).

License: The dataset is licensed under Attribution-NonCommercial 4.0 International (cc-by-nc-4.0)....

Mask Wearing 608pr

The mask-wearing-608pr dataset is an object detection dataset that consists of images and their object annotations for detecting mask wearing. It is a monolingual dataset, primarily in English, and contains a total of 149 data instances.

License: The mask-wearing-608pr dataset is licensed under the Creative Commons (CC) license.

Math QA

MathQA is a large-scale dataset of math word problems. It is gathered by using a new representation language to annotate over the AQuA-RAT dataset with fully-specified operational programs.

License: Apache License, Version 2.0.

MMLU

This dataset is a massive multitask test comprising multiple-choice questions from diverse branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and more. For a model to attain high accuracy on this test, it must possess extensive world knowledge and problem-solving ability. The dataset covers 57 tasks including elementary mathematics, US history, computer science, law, and many more. This dataset is intended to bridge the gap between the wide-ranging knowledge that models absorb during pretraining and the existing measures of success.

License: MIT License

MOSS 002 SFT Dataset

This open-source conversational dataset was used to train MOSS-002. The user prompts were expanded from a small set of human-written seed prompts, similar to Self-Instruct, and the responses were generated with `text-davinci-003`. The user prompts for the `en_harmlessness` subset are sourced from Anthropic red-teaming data.

License: The dataset is under the Creative Commons license, specifically cc-by-4.0.

MultiWOZ

The MultiWOZ v2.2 dataset is a collection of dialogues that cover various tasks and sub-tasks. It includes dialogue modeling, multi-class classification, and parsing. The dataset is monolingual and is available in English. It consists of dialogues that involve services like restaurants, hotels, taxis, trains, buses, police, attractions, and hospitals. The dialogues are structured with different dialogue acts, slots, and values, providing a rich context for various natural language processing tasks.

License: The MultiWOZ v2.2 dataset is licensed under the Apache-2.0 License.
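As a rough sketch of how such dialogues are often consumed, dialogue state in MultiWOZ-style data is commonly represented as a domain → slot → value mapping (the domain and slot names below are hypothetical examples, not the dataset's exact schema):

```python
# Hypothetical dialogue state after two user turns.
state = {
    "restaurant": {"food": "italian", "area": "centre"},
    "taxi": {"destination": "museum"},
}

def update_state(state: dict, domain: str, slot: str, value: str) -> dict:
    """Accumulate a new slot-value pair into the nested dialogue state."""
    state.setdefault(domain, {})[slot] = value
    return state

update_state(state, "restaurant", "pricerange", "cheap")
```

Each dialogue turn contributes zero or more such slot-value updates, which is what makes the corpus useful for dialogue state tracking.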

NarrativeQA

NarrativeQA is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. It includes two tasks proposed in the paper: "summaries only" and "stories only," depending on whether the human-generated summary or the full story text is used to answer the question. A typical data point consists of a question and answer pair along with a summary/story that can be used to answer the question.

License: The dataset is released under the Apache-2.0 License.

Natural Questions

The Natural Questions (NQ) corpus is designed for the task of reading comprehension and question answering (QA). It contains questions from real users and requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. This makes NQ a more realistic and challenging task compared to prior QA datasets.

License: The dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported Lic...

NewsQA

NewsQA is a challenging machine comprehension dataset containing over 100,000 human-generated question-answer pairs. The dataset is based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. The dataset is designed for extractive question answering tasks.

License: The dataset is licensed under the MIT License.

NLS Chapbook Illustrations

The National Library of Scotland Chapbook Illustrations dataset contains images from chapbooks held by the National Library of Scotland. Chapbooks were short printed booklets that were popular in the 17th to 19th centuries and often contained illustrations. This dataset includes annotations for a subset of these chapbooks, providing bounding boxes around illustrations printed on the pages. The dataset can be used for object detection and image classification tasks.

License: The dataset is in the public domain.

Oktober Fest Food Dataset Plus

The OktoberfestFoodDatasetPlus dataset consists of three categories: drinkServed, foodServed, and person. It contains real camera footage that has been annotated by hand, as well as synthetically generated and annotated data, and can be used for tasks such as object detection. A demo space is available to view results after training with YOLOv8.

License: The dataset is licensed under the BSD (Berkeley Software Distribution) license.

OpenBook QA

OpenBookQA is a dataset for open book question answering. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small "book" of 1,326 core science facts. The questions are designed to be solved using the facts in the "book," along with common knowledge.

License: The dataset is licensed under the CC BY-SA 3.0 license. This means that you can share and ...

People in Paintings

The people-in-paintings dataset is an open-source dataset that contains images of paintings with object annotations. The dataset is originally based on the COCO dataset and is stored in `dataset.tar.gz` format.

License: The dataset is licensed under the Creative Commons (CC) license.

Pills SXDHT

The "pills-sxdht" dataset is an object detection dataset which consists of images and their corresponding object annotations. The dataset contains 451 data instances, with each instance consisting of an image and the associated object annotations. The dataset was created by Roboflow users and is stored in the original COCO dataset format. The dataset size is approximately 24.6 MB.

License: The dataset is licensed under the Creative Commons (CC) license.

Polemo2 Official

The PolEmo2.0 dataset is a collection of online consumer reviews from four domains: medicine, hotels, products, and university. It contains 8,216 reviews with a total of 57,466 sentences. Each review and sentence has been manually annotated with sentiment labels in the 2+1 scheme, resulting in a total of 197,046 annotations. The dataset is publicly available under a Creative Commons license.

License: The dataset is publicly available under the Attribution-NonCommercial-ShareAlike 4.0 Inter...

Pubmed QA

Unfortunately, the text provides insufficient information to summarize the dataset accurately.

License: MIT License

Python Code Instructions 18k A...

The dataset contains problem descriptions and corresponding Python code. It is derived from sahil2801/code_instructions_120k, with a prompt column added in Alpaca style.

License: Not provided.

QASC

The QASC dataset is designed for question-answering tasks, specifically for multiple-choice questions. It includes sub-tasks like extractive question-answering and multiple-choice question-answering. The dataset is monolingual and is available in English.

License: The dataset is licensed under the Creative Commons Attribution 4.0 License (cc-by-4.0).

QuAC

The "quac" dataset is designed for tasks related to Question Answering, Text Generation, Fill-Mask, dialogue modeling, and extractive question answering. It's a monolingual dataset in English.

License: The dataset is licensed under the MIT license.

Red Pajama

The RedPajama dataset is an open-source implementation of the LLaMa dataset, incorporating multiple data sources. It contains around 1.2 trillion tokens from diverse sources like Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange. The dataset primarily consists of English text, and its structure includes content, metadata, and a subset identifier. It aims to reproduce the LLaMa recipe closely.

License: The dataset has multiple licenses based on its source data: Common Crawl, C4, The Pile, Ar...

Refined Web

Falcon RefinedWeb is a substantial English web dataset developed by TII, released under an ODC-By 1.0 license. It is meticulously created through filtering and deduplication of CommonCrawl, resulting in models achieving performance comparable to or better than those trained on curated datasets.

License: Falcon RefinedWeb is released under an ODC-By 1.0 license and aligns with the CommonCrawl ...

RU Goemotions

The RuGoEmotions dataset is a collection of 34k Reddit comments in Russian, each labeled with one or more of 9 emotion categories (joy, interest, surprise, sadness, anger, disgust, fear, guilt, and neutral). It includes predefined train/val/test splits.

License: The RuGoEmotions dataset is licensed under the MIT License.
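Since each comment can carry more than one of the 9 emotion labels, a common way to prepare such data for multi-label classification is a multi-hot encoding; a minimal sketch (the label ordering here is an assumption, not the dataset's canonical order):

```python
EMOTIONS = ["joy", "interest", "surprise", "sadness", "anger",
            "disgust", "fear", "guilt", "neutral"]

def multi_hot(labels: list) -> list:
    """Encode a comment's emotion labels as a 9-dimensional multi-hot vector."""
    present = set(labels)
    return [1 if e in present else 0 for e in EMOTIONS]

vec = multi_hot(["joy", "surprise"])
# vec -> [1, 0, 1, 0, 0, 0, 0, 0, 0]
```

A single-label setup would instead pick one index per comment; the multi-hot form preserves co-occurring emotions.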

Schema Guided Dialogue (SGD)

The ConvLab/sgd dataset is a conversational dataset that focuses on dialogue interactions. The dataset includes dialogues that involve tasks such as finding restaurants, making reservations, and requesting specific details like address and phone numbers. The dialogues are structured with various intents, slots, and values, reflecting real-world conversational scenarios.

License: The dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 Internationa...

Secondary Chains

The secondary-chains dataset is an object detection dataset that comprises images and their corresponding object annotations. The dataset is stored in the COCO format and contains information about the image ID, width, height, and object metadata such as annotation ID, area, bbox, and category. It is a monolingual dataset in English and falls under the size category of 1K<n<10K. The dataset is created by Francesco and the annotations are crowdsourced.

License: The dataset is licensed under the Creative Commons (CC) license.
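The COCO-style fields described above (image id, width, and height, plus per-object annotation id, area, bbox, and category) can be grouped per image in a few lines; the sketch below uses a tiny hypothetical annotation dict, not the actual dataset contents:

```python
from collections import defaultdict

# A tiny, hypothetical COCO-style annotation structure, inlined for illustration.
coco = {
    "images": [{"id": 1, "width": 640, "height": 480, "file_name": "chain_001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "area": 1200.0,
         "bbox": [34.0, 50.0, 40.0, 30.0]},   # COCO bbox is [x, y, width, height]
        {"id": 11, "image_id": 1, "category_id": 2, "area": 800.0,
         "bbox": [200.0, 120.0, 20.0, 40.0]},
    ],
    "categories": [{"id": 2, "name": "chain"}],
}

def boxes_per_image(coco: dict) -> dict:
    """Group bounding boxes by image id."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann["bbox"])
    return dict(grouped)

boxes = boxes_per_image(coco)  # {1: [[34.0, 50.0, 40.0, 30.0], [200.0, 120.0, 20.0, 40.0]]}
```

The same grouping applies to any of the COCO-format detection datasets listed on this page.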

Ship Detection

The Ship Detection dataset is publicly accessible but requires agreement to the conditions set by the Data-Driven Science competition. By accepting these conditions, users gain access to files and content related to ship detection tasks. The dataset is organized by Data-Driven Science, and acceptance includes consent to the use of contact information for communication purposes.

License: The dataset license is not explicitly mentioned in the provided text.

Spotify Tracks Dataset

This dataset contains information about Spotify tracks from various genres. Each track in the dataset is associated with audio features such as danceability, energy, loudness, and more. The data is provided in CSV format and can be easily loaded for analysis.

License: The dataset is licensed under the BSD license.
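Loading such a CSV for analysis needs nothing beyond the standard library; a minimal sketch with a hypothetical two-row sample (the column names are assumptions based on the description above):

```python
import csv
import io

# Hypothetical sample mirroring the described audio-feature columns.
sample = """track_name,genre,danceability,energy,loudness
Track A,pop,0.71,0.80,-5.2
Track B,jazz,0.45,0.30,-12.4
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Filter to tracks with danceability above 0.5.
danceable = [r["track_name"] for r in rows if float(r["danceability"]) > 0.5]
# danceable -> ['Track A']
```

In practice you would pass the dataset's CSV file path to `open()` instead of the inlined `io.StringIO` sample.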

SQuAD

SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

License: The license for this dataset is cc-by-4.0.
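Because every answer is a span of the passage, SQuAD records typically store a character offset alongside the answer text, and a span can be verified in one line (field semantics here follow the common SQuAD schema):

```python
def check_span(context: str, answer_text: str, answer_start: int) -> bool:
    """Check that the annotated span actually matches the passage text."""
    return context[answer_start:answer_start + len(answer_text)] == answer_text

context = "The Norman conquest of England took place in 1066."
assert check_span(context, "1066", context.index("1066"))
```

This offset-plus-text representation is what makes extractive QA models trainable as span-prediction tasks.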

SQuAD 2.0

SQuAD (Stanford Question Answering Dataset) v2 is an extension of the SQuAD dataset, which is designed for the task of question answering. The dataset consists of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. SQuAD v2 introduces unanswerable questions, making the task more challenging as models must determine whether a question is answerable based on the provided passage.

License: The dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 Internationa...

Stanford Sentiment Treebank 2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. It consists of 11,855 single sentences extracted from movie reviews and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. The corpus is widely used for sentiment classification tasks.

License: The dataset license is unknown.

Starcoder Data

The StarCoder Training Dataset is used to train StarCoder and StarCoderBase, encompassing 783GB of code in 86 programming languages. It includes 54GB of GitHub Issues + 13GB Jupyter notebooks in script and text-code pairs, as well as 32GB of GitHub commits, equivalent to around 250 billion tokens.

License: The dataset's license can be viewed after logging in to Hugging Face.

SVHN

The Street View House Numbers (SVHN) dataset is a real-world image dataset that is used for developing machine learning and object recognition algorithms. It is similar to the MNIST dataset but contains an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem of recognizing digits and numbers in natural scene images. The dataset is obtained from house numbers in Google Street View images and is available in two formats: original images with character-level bounding boxes and MNIST-like 32-by-32 images centered around a single character.

License: The SVHN dataset is available for non-commercial use only.

TREC QA

The TREC dataset contains 5,500 labeled questions in the training set and another 500 for the test set. It consists of 6 coarse class labels and 50 fine class labels. The average length of each sentence is 10 words, with a vocabulary size of 8,700. The data are collected from various sources, including English questions published by USC, manually constructed questions for rare classes, and TREC 8, 9, and 10 questions. The dataset is available in English and is monolingual. It has been used in research and has associated citation information from works by Xin Li and Dan Roth (2002) and Eduard Hovy et al. (2001). The TREC dataset offers a valuable resource for researchers and practitioners working on text classification and related NLP tasks. Its diverse set of questions and detailed labeling make it suitable for training and evaluating models on complex classification problems.

License: The licensing information is not explicitly mentioned.

Truthful QA

TruthfulQA is a benchmark designed to measure whether a language model is truthful in generating answers to questions. The benchmark is made up of 817 questions that span 38 categories, including health, law, finance, and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts.

License: This dataset is licensed under the Apache License, Version 2.0.

Twitter Financial News Sentime...

The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. It consists of 11,932 documents annotated with three sentiment labels: "Bearish," "Bullish," and "Neutral." The data was collected using the Twitter API and is used for the task of sentiment analysis in the finance domain.

License: The Twitter Financial Dataset (sentiment) version 1.0.0 is released under the MIT License...

Tydi QA

The "tydiqa" dataset is designed for the task of question answering, specifically focusing on extractive question answering. It is a multilingual dataset, encompassing languages such as Arabic, Bengali, English, and eight others. The dataset's size categories are unknown, and it has been created and annotated through crowdsourcing. The source datasets are extended from Wikipedia. Overall, the "tydiqa" dataset represents a rich resource for researchers and practitioners working on multilingual question answering and related fields, offering a diverse set of questions and documents in various languages.

License: The dataset is licensed under the Apache-2.0 license.

Ubuntu Dialogue Corpus

The Ubuntu Dialogue Corpus is a dataset containing almost one million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This dataset provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets and the unstructured nature of interactions from microblog services such as Twitter. Overall, the Ubuntu Dialogue Corpus offers a substantial and valuable resource for those working on dialogue systems and conversational AI. Its extensive collection of multi-turn dialogues provides a rich foundation for research and development in these areas.

License: The licensing information for the Ubuntu Dialogue Corpus is not explicitly mentioned.

Wiki QA

The WikiQA dataset is a question and answer corpus extracted from Wikipedia. It consists of question and sentence pairs, where the sentence may or may not answer the question. The dataset is designed to reflect real-world information-seeking questions and is intended for the task of answer sentence selection. The WikiQA corpus contains 3,047 questions and a total of 29,258 answer sentences, where 1,473 sentences are correct answers.

License: The dataset is licensed under the MIT License. This license allows for free use, modificat...

Wikipedia

The Wikipedia dataset comprises cleaned articles from various languages. Derived from Wikipedia dumps, each language has a separate split. Articles are preprocessed to remove markup and unwanted sections, and parsed using the mwparserfromhell tool. This dataset enables language modeling and offers subsets per language and date.

License: Wikipedia text and images are primarily co-licensed under the Creative Commons Attribution...

X Stance

The x-stance dataset consists of more than 150 political questions and 67,000 comments written by candidates in response to those questions. This dataset can be used to train and evaluate stance detection systems.

License: The x-stance dataset is licensed under CC BY-NC 4.0.

XGLUE

The XGLUE dataset is a benchmark dataset for evaluating the performance of cross-lingual pre-trained models in cross-lingual natural language understanding and generation. It consists of 11 tasks and covers 19 languages. The dataset includes tasks such as named entity recognition, part-of-speech tagging, question answering, summarization, text classification, and more. The training data is available only in English, and the validation and test data are available in multiple languages.

License: The dataset is provided for non-commercial research purposes only and is available free of...

GAIA dataset

GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities from added tooling, efficient prompting, access to search, etc.). It is a benchmark dataset for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: human respondents obtain 92% accuracy versus 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans; its authors posit that the advent of Artificial General Intelligence (AGI) hinges on a system exhibiting robustness similar to that of the average human on such questions. Using GAIA's methodology, the authors devised 466 questions and their answers.

License: MIT


Copyright © 2023 AGIE AI Technology Pvt. Ltd. All rights reserved.