DocLayNet-large
The DocLayNet dataset is a large human-annotated dataset for document-layout segmentation. It provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80,863 unique pages from 6 document categories. Compared to related work such as PubLayNet or DocBank, DocLayNet offers several unique features, including human annotation, large layout variability, a detailed label set, redundant annotations, and pre-defined train, test, and validation sets.
The DocLayNet dataset consists of 80,863 unique pages from 6 document categories. It is distributed in three sizes:
- DocLayNet small: approximately 1% of DocLayNet, with fewer than 1,000 document images (691 train, 64 val, 49 test)
- DocLayNet base: approximately 10% of DocLayNet, with fewer than 10,000 document images (6,910 train, 648 val, 499 test)
- DocLayNet large: approximately 100% of DocLayNet, with fewer than 100,000 document images (69,103 train, 6,480 val, 4,994 test)
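As a rough illustration, the splits can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the repository identifier `pierreguillou/DocLayNet-large` is an assumption about how this variant is hosted on the Hub, so adjust it to the actual repository name before running.

```python
# Minimal sketch: loading the DocLayNet-large splits with the Hugging Face `datasets` library.
# The dataset identifier below is an assumption; replace it with the actual Hub repository name.
from datasets import load_dataset

dataset = load_dataset("pierreguillou/DocLayNet-large")  # assumed Hub ID

# The pre-defined train/validation/test splits described above.
for split_name, split in dataset.items():
    print(split_name, len(split))

# Inspect the annotation schema (field names vary between DocLayNet distributions).
print(dataset["train"].features)
```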
The DocLayNet dataset can be used for various tasks, including object detection, image segmentation, and token classification. Its applications include document analysis, layout analysis, and document-understanding tasks. The dataset has been used for fine-tuning layout models such as LayoutLMv3 or LiLT, and it is also suitable for evaluating and benchmarking performance on layout segmentation tasks. Additionally, the DocLayNet dataset was used in the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.
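As a sketch of how the layout annotations could feed an object-detection workflow, the snippet below turns one page's bounding boxes and class labels into simple per-block target dictionaries. The field names `image`, `bboxes_block`, and `categories` are assumptions about the annotation schema, not confirmed by this card; check the dataset's `features` for the actual keys.

```python
# Hypothetical sketch: turning one annotated page into detection-style targets.
# Field names ("image", "bboxes_block", "categories") are assumptions about the schema;
# verify them against the dataset's actual features before use.
from datasets import load_dataset

dataset = load_dataset("pierreguillou/DocLayNet-large", split="test")  # assumed Hub ID

example = dataset[0]
image = example["image"]         # page image (assumed field)
boxes = example["bboxes_block"]  # one bounding box per layout block (assumed field)
labels = example["categories"]   # class index per block, 11 classes (assumed field)

targets = [
    {"bbox": box, "category_id": label}
    for box, label in zip(boxes, labels)
]
print(f"{len(targets)} annotated layout blocks on a page of size {image.size}")
```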
The DocLayNet dataset is published under the CDLA-Permissive-1.0 license.