The C4 dataset is a colossal, cleaned version of Common Crawl's web crawl corpus, available in several variants. This release was prepared by AllenAI for pretraining language models and word representations. The text is in English and drawn from the web, which makes the dataset suitable for language understanding and generation tasks.
The C4 dataset is a collection of about 750 GB of English-language text.
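At roughly 750 GB, the corpus is often easier to read lazily than to download in full. The sketch below shows one way to do that, assuming the Hugging Face `datasets` library is installed, that the corpus is hosted under the repository id `allenai/c4` with an `en` configuration, and that each record carries a `text` field. The function names `stream_c4` and `take_texts` are hypothetical helpers introduced for illustration, not part of any library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def stream_c4(config: str = "en", split: str = "train") -> Iterable[Dict[str, Any]]:
    """Return a lazily streamed C4 split instead of downloading it in full."""
    # Imported here so take_texts() stays usable without the library installed.
    from datasets import load_dataset

    # Assumption: the repository id "allenai/c4" and the "en" configuration
    # match the hosted dataset; verify both against the dataset page.
    return load_dataset("allenai/c4", config, split=split, streaming=True)


def take_texts(records: Iterable[Dict[str, Any]], n: int) -> List[str]:
    """Collect the "text" field (assumed field name) of the first n records."""
    return [record["text"] for record in islice(records, n)]
```

With network access, `take_texts(stream_c4(), 3)` would fetch only the first three documents. Streaming trades random access for a small local footprint, which is usually the right choice for a first look at the data.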
The dataset is released under the terms of ODC-BY (Open Data Commons Attribution License). Users are also bound by the Common Crawl terms of use for the content within the dataset.