Text, Text Generation

Book Corpus

Summary

The BookCorpus dataset focuses on aligning books with their corresponding movie releases to provide rich descriptive explanations for visual content. Books offer both fine-grained detail and high-level semantics, which supports visual understanding beyond what current caption-based datasets provide. The dataset can be used for research and development in natural language processing, computer vision, and multimodal analysis, including the exploration of story-to-visual connections and the creation of richer image descriptions.

Size

The downloaded dataset files total approximately 1.18 GB and generate a dataset of around 4.85 GB, for a total disk usage of approximately 6.03 GB.

Use cases

Models such as BERT, RoBERTa, XLNet, and ALBERT have been trained on this dataset. For a complete list of models trained on it, see https://huggingface.co/models?dataset=dataset:bookcorpus

License

The books within this dataset have been sourced from https://www.smashwords.com, and users should refer to the Smashwords terms of service for licensing details.

Download from source

https://huggingface.co/datasets/bookcorpus
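
As a rough illustration of how the dataset might be read from the source above, the following Python sketch streams the corpus with the Hugging Face `datasets` library instead of downloading everything first. It rests on several assumptions that are not stated on this page: that the dataset is published under the identifier `bookcorpus` with a `train` split, that each record exposes a `text` field, and that the installed `datasets` version can still load it (some releases require `trust_remote_code=True` for script-based datasets, and newer ones may not support them). The helper names `stream_bookcorpus` and `preview` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_bookcorpus():
    """Return the train split as a streaming dataset (needs network access)."""
    # Imported lazily so that preview() below works without the library.
    from datasets import load_dataset

    # Assumption: the identifier "bookcorpus" and the "train" split exist and
    # the installed `datasets` version is able to load them.
    return load_dataset("bookcorpus", split="train", streaming=True)


def preview(records: Iterable[dict], n: int = 3, width: int = 80) -> Iterator[str]:
    """Yield the first n non-empty "text" values, cut to width characters."""
    # Assumption: every record is a mapping with a string "text" field.
    texts = (record["text"].strip() for record in records)
    non_empty = (text for text in texts if text)
    for text in islice(non_empty, n):
        yield text[:width]
```

With network access, `for line in preview(stream_bookcorpus()): print(line)` would print a few lines of the corpus. Streaming is used here so that a quick inspection does not need the full disk footprint listed under Size.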

Copyright © 2023 AGIE AI Technology Pvt. Ltd. All rights reserved.