
Text, Text Generation

Colossal, Cleaned Version of Common Crawl (C4)

Summary

The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, available in several variants. The copy hosted on Hugging Face was prepared by AllenAI as a processed version of Google's C4 dataset, and it is intended for pretraining language models and word representations. The text is in English and suits tasks related to language understanding and generation.

Size

The C4 dataset is a collection of about 750GB of English-language text.

Use cases

The C4 dataset is primarily used for pretraining language models and word representations. It provides a large amount of English text from the web, making it suitable for tasks involving language understanding and generation.

License

The dataset is released under the terms of ODC-BY (Open Data Commons Attribution License). Users are also bound by the Common Crawl terms of use for the content within the dataset.

Download from source

https://huggingface.co/datasets/c4
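
Because the corpus is far too large to download casually, it can help to stream a few records first. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed, that the URL above resolves to the `allenai/c4` repository, and that each record carries a `text` field. The helper names `preview_texts` and `stream_c4` are hypothetical and chosen only for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, List


def preview_texts(records: Iterable[Dict], n: int = 3, max_chars: int = 200) -> List[str]:
    """Return the text of the first `n` records, each cut to `max_chars` characters."""
    return [rec["text"][:max_chars] for rec in islice(records, n)]


def stream_c4(config: str = "en", split: str = "train"):
    """Open one C4 variant in streaming mode, so nothing is downloaded up front.

    Assumption: the dataset is published as "allenai/c4" with variants
    such as "en" and "realnewslike".
    """
    # Imported lazily so preview_texts can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("allenai/c4", config, split=split, streaming=True)


# Example usage (requires network access):
#     for snippet in preview_texts(stream_c4("realnewslike", "validation")):
#         print(snippet)
```

Streaming keeps disk usage near zero, which is usually the safer default for a first look. For full pretraining runs a local copy of the chosen variant is likely to be faster.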

Copyright © 2023 AGIE AI Technology Pvt. Ltd. All rights reserved.