The BookCorpus dataset focuses on aligning books with their corresponding movie releases to provide comprehensive descriptive explanations for visual content. It offers both the fine-grained details and the high-level semantics found in books, enriching visual understanding beyond current caption-based datasets. It can be used for research and development in fields such as natural language processing, computer vision, and multimodal analysis, and it enables the exploration of story-to-visual connections and the creation of richer image descriptions.
The downloaded dataset files are approximately 1.18 GB, generating a dataset of around 4.85 GB, with a total disk usage of approximately 6.03 GB.
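As a rough illustration of the figures above, the following sketch adds the download size to the generated size and checks whether a target directory has enough free space before fetching anything. The function names `required_bytes` and `has_enough_space` are hypothetical helpers, not part of any dataset tooling, and treating 1 GB as 10^9 bytes is an assumption, since the page does not say which unit convention it uses.

```python
import shutil

# Approximate sizes in GB, taken from the figures quoted above.
DOWNLOAD_GB = 1.18
GENERATED_GB = 4.85


def required_bytes(download_gb=DOWNLOAD_GB, generated_gb=GENERATED_GB):
    """Return the total space needed in bytes (assumes 1 GB = 10**9 bytes)."""
    return round((download_gb + generated_gb) * 10**9)


def has_enough_space(path=".", needed=None):
    """Return True if the filesystem holding `path` has at least `needed` free bytes."""
    if needed is None:
        needed = required_bytes()
    return shutil.disk_usage(path).free >= needed


if __name__ == "__main__":
    total_gb = required_bytes() / 10**9
    print(f"Approximate space required: {total_gb:.2f} GB")
    print("Enough free space here:", has_enough_space("."))
```

Because the quoted sizes are approximate, it may be sensible to leave some headroom beyond the computed total.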
Models such as BERT, RoBERTa, XLNet, and ALBERT have been trained on this dataset. For a complete list of models trained on it, see https://huggingface.co/models?dataset=dataset:bookcorpus
The books within this dataset have been sourced from https://www.smashwords.com, and users should refer to their terms of service for licensing details.