Image Datasets

Editor: Troy Ni

Data Crawling

Datasets

Images

| Category | Name | Link | Size | Notes |
| --- | --- | --- | --- | --- |
| Web | cc3m | ai.google.com | 3M | |
| Web | cc12m | github.com | 12M | |
| Web | laion-400m | laion.ai | 400M | |
| Web | laion-5b | laion.ai | 5B | |
| Web | laion-aesthetics | laion.ai | 625K - 1.2B | split into tiers by aesthetic score |
| Web | 220k-GPT4Vision-captions | huggingface.co | 22K | GPT4V captions |
| Web | ye-pop | huggingface.co | 100K < n < 1M | Laion-POP alternative |
| Web | gpt4v-emotion-dataset | huggingface.co | 134 | facial expressions |
| Web | gpt4v-dataset | huggingface.co | 12.4K | |
| MJ | journeydb | journeydb.github.io | 4M | |
| MJ | journey-db-000 | huggingface.co | 20K | subset in HF format |
| Web | sa-1b | ai.meta.com | 11M | faces are blurred |
| Anime | danbooru2023 | huggingface.co | 1.1M | 8 TB; the 2022 and 2021 editions can also be found |
| Anime | game_character_skins | huggingface.co | 4K | anime character skins; the same organization also hosts the CyberHarem data |
| Anime | anime_aesthetic_full | huggingface.co | 6.6M | anime images with high aesthetic scores, no captions |

Dataset Download

Downloaders
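
img2dataset

The web-scale sets above (cc3m, cc12m, the LAION family) are distributed as URL + caption metadata rather than as image files, and are usually materialized with img2dataset. A minimal sketch, assuming the cc3m metadata has been saved locally as cc3m.tsv with url and caption columns (the file name and column names are assumptions; check the actual schema of the release you downloaded):

from img2dataset import download

download(
    url_list="cc3m.tsv",         # assumed local metadata file
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",  # tar shards, convenient for streaming training
    output_folder="cc3m-images",
    image_size=256,
    processes_count=8,
    thread_count=64,
)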

HF Datasets

import os
from pathlib import Path

# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# The endpoint is read when huggingface_hub is imported, so set it first.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import HfApi, logging

repo_id = "PixArt-alpha/SAM-LLaVA-Captions10M"
repo_type = "dataset"

# Download into a local directory named after the repo.
local_name = repo_id.split("/")[1]

logging.set_verbosity_debug()
hf = HfApi()
hf.snapshot_download(
    repo_id=repo_id,
    repo_type=repo_type,
    # revision="refs/convert/parquet",  # uncomment to fetch the auto-converted Parquet branch
    local_dir=Path("./") / local_name,
    local_dir_use_symlinks=False,
    resume_download=True,
    cache_dir="./cache",
    max_workers=16,
)
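
The commented-out HF_HUB_ENABLE_HF_TRANSFER line enables the Rust-based hf_transfer download accelerator and requires pip install hf_transfer. Note that in recent huggingface_hub releases, resume_download and local_dir_use_symlinks are deprecated no-ops (downloads always resume when possible, and local_dir no longer uses symlinks), so both arguments can simply be dropped.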

  • Extract the file list of an HF repo and save it as a TXT file, for downloaders such as IDM:
import aiohttp
import asyncio
import json


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def main():
    repo_id = "PixArt-alpha/PixArt-alpha"
    repo_type = "model"
    urls = []

    # The HF API and download routes differ between models and datasets.
    type_route_api = "models/"
    type_route_dl = ""
    if repo_type == "dataset":
        type_route_api = "datasets/"
        type_route_dl = "datasets/"

    # Query the repo metadata; "siblings" lists every file in the repo.
    url = f'https://hf-mirror.com/api/{type_route_api}{repo_id}'
    async with aiohttp.ClientSession() as session:
        response_text = await fetch(session, url)
        response_json = json.loads(response_text)

        for sibling in response_json["siblings"]:
            download_url = f"https://hf-mirror.com/{type_route_dl}{repo_id}/resolve/main/" + sibling["rfilename"] + "?download=true"
            urls.append(download_url)

    # One URL per line, ready to import into a download manager.
    with open('output_urls.txt', 'w') as file:
        for item in urls:
            file.write(item + '\n')

asyncio.run(main())
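
The resulting output_urls.txt can be imported into IDM as a batch job, or fed straight to a command-line downloader, e.g. aria2c -i output_urls.txt -x 16. (The final line uses asyncio.run, which replaces the deprecated get_event_loop / run_until_complete pattern.)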

Data Annotation

Annotation with Multimodal Models
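
A minimal sketch of captioning a single image with an off-the-shelf multimodal model, here BLIP via transformers (the model ID, input file, and generation length are assumptions, not a recommendation from this document):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed captioner; any image-to-text model works similarly.
model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("sample.jpg").convert("RGB")  # assumed local image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))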

Tagging

Aesthetic Scoring
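
Aesthetic scorers such as the LAION aesthetic predictor (used to build laion-aesthetics above) typically run a small regression head on CLIP image embeddings. A minimal sketch of that approach; the head architecture and checkpoint file below are assumptions, not the published predictor:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(clip_id)
processor = CLIPProcessor.from_pretrained(clip_id)

# Hypothetical scoring head mapping a 768-dim CLIP embedding to a 1-10 score.
head = torch.nn.Sequential(
    torch.nn.Linear(768, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
head.load_state_dict(torch.load("aesthetic_head.pt"))  # assumed checkpoint
head.eval()

image = Image.open("sample.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # score on the normalized embedding
    score = head(emb).item()
print(f"aesthetic score: {score:.2f}")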
