PD12M: High-Quality Public Domain Image-Caption Dataset for AI Training

Categories: Datasets

PD12M is a dataset of 12.4 million high-quality public domain images paired with synthetic captions, designed to support AI training while minimizing copyright risk. The captions are generated with Florence-2, and the images are curated through aesthetic and safety filtering from an initial superset of 34 million images. PD12M provides a foundation for training and fine-tuning models, from vision transformers to multimodal systems, on data that is in the public domain or CC0-licensed. The dataset is accessible through the Source.Plus platform, which introduces community-driven dataset governance aimed at improving transparency, reproducibility, and harm reduction.

PD12M image data is hosted independently on AWS S3, which preserves dataset integrity and avoids straining the original image hosts. The accompanying metadata includes image dimensions, MIME types, license URLs, and embeddings. The dataset’s CC0 and public domain licensing, combined with dedicated quality filtering, makes it a useful resource for researchers and developers who need high-quality, captioned training data.
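
As an illustration of how the metadata described above might be used, the following minimal sketch selects records by image dimensions and MIME type before any images are downloaded. The field names (`url`, `width`, `height`, `mime_type`, `license_url`) and the `select_records` helper are hypothetical and chosen for this example only; the actual PD12M schema may differ and should be checked first.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Iterator


@dataclass(frozen=True)
class ImageRecord:
    # Hypothetical field names; verify against the real PD12M metadata schema.
    url: str
    width: int
    height: int
    mime_type: str
    license_url: str


def select_records(
    records: Iterable[ImageRecord],
    min_side: int = 512,
    allowed_mime: FrozenSet[str] = frozenset({"image/jpeg", "image/png"}),
) -> Iterator[ImageRecord]:
    """Yield records whose shorter side is at least min_side pixels
    and whose MIME type is in allowed_mime."""
    for rec in records:
        if min(rec.width, rec.height) >= min_side and rec.mime_type in allowed_mime:
            yield rec


if __name__ == "__main__":
    # Made-up sample rows, used only to exercise the filter.
    sample = [
        ImageRecord("https://example.org/a.jpg", 1024, 768, "image/jpeg", "https://example.org/cc0"),
        ImageRecord("https://example.org/b.jpg", 300, 900, "image/jpeg", "https://example.org/cc0"),
        ImageRecord("https://example.org/c.tif", 800, 800, "image/tiff", "https://example.org/cc0"),
    ]
    for rec in select_records(sample):
        print(rec.url)
```

Filtering on metadata first keeps the selection step cheap, since only the images that pass need to be fetched from storage.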

Key Features:

  • 12.4 Million Image-Caption Pairs: The largest public domain image-text dataset to date, suited to training foundation models.

  • Synthetic Captions via Florence-2: Captions are generated by Florence-2 in a consistent, descriptive style.

  • Aesthetic and Safety Filtering: Images filtered for content quality, aesthetics, and safety, excluding NSFW or harmful content.

  • Accessible Metadata and Embeddings: Detailed metadata and CLIP ViT-L/14 embeddings are provided for enhanced model training and reproducibility.

  • Governed by Source.Plus: Community-driven governance promotes dataset integrity, harm reduction, and adaptability.
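
One common use of precomputed image embeddings such as those listed above is similarity search. The sketch below ranks stored vectors by cosine similarity to a query vector. It uses toy 3-dimensional vectors for readability (CLIP ViT-L/14 embeddings are 768-dimensional), and the `cosine_similarity` and `top_k` helpers are illustrative names, not part of any PD12M tooling.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between a and b (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    embeddings: Mapping[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (image_id, score) pairs most similar to query, best first."""
    scored = [(image_id, cosine_similarity(query, vec)) for image_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    store = {
        "img_a": [1.0, 0.0, 0.0],
        "img_b": [0.0, 1.0, 0.0],
        "img_c": [1.0, 1.0, 0.0],
    }
    for image_id, score in top_k([1.0, 0.0, 0.0], store, k=2):
        print(image_id, round(score, 3))
```

A linear scan like this is only practical for small subsets; at the scale of millions of images an approximate nearest-neighbor index would normally replace it.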

Use Cases:

  • AI Model Training: Provides a reliable foundation for training vision, multimodal, and image captioning models.

  • Research and Development: Supports reproducible research and model evaluation with a high-quality public dataset.

  • Educational and Creative Applications: Offers a legally safe and accessible resource for educational projects, creative tools, and academic research.

PD12M offers captioned, quality-filtered images in a carefully curated, copyright-safe format, making it a practical resource for AI development wherever public domain training data is needed.

© 2024 Opendemo