图像数据集

数据爬取

二次元：https://deepghs.github.io/waifuc/main/tutorials-CN/installation/index.html

数据集

图片

分类	名字	链接	大小	备注
互联网	cc3m	ai.google.com	3M
互联网	cc12m	github.com	12M
互联网	laion-400m	laion.ai	400M
互联网	laion-5b	laion.ai	5B
互联网	laion-aesthetics	laion.ai	625K - 1.2B	不同评分分层
互联网	220k-GPT4Vision-captions	huggingface.co	22K	GPT4V Caption
互联网	ye-pop	huggingface.co	100K < n < 1M	Laion-POP alternative
互联网	gpt4v-emotion-dataset	huggingface.co	134	表情
互联网	gpt4v-dataset	huggingface.co	12.4K
MJ	journeydb	journeydb.github.io	4M
MJ	journey-db-000	huggingface.co	20K	HF 格式的子集
互联网	sa-1b	ai.meta.com	11M	人脸有打码
二次元	danbooru2023	huggingface.co	1.1M	8TB，可找到 2022、2021 的项目
二次元	game_character_skins	huggingface.co	4K	二次元角色皮肤，同组织还有赛博后宫的数据
二次元	anime_aesthetic_full	huggingface.co	6.6M	二次元高美学评分图片，无 caption

数据集下载

下载器

img2dataset：https://github.com/rom1504/img2dataset

HF Datasets

HF 镜像：https://hf-mirror.com
一篇教程：https://zhuanlan.zhihu.com/p/663712983
使用 hf api 下载（使用 cli 同理，这就是 cli 的内部实现）

import os
from pathlib import Path

# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import HfApi, logging

repo_id = "PixArt-alpha/SAM-LLaVA-Captions10M"
repo_type = "dataset"

local_name = repo_id.split("/")[1]

logging.set_verbosity_debug()
hf = HfApi()
hf.snapshot_download(
    repo_id=repo_id,
    repo_type=repo_type,
    # revision="refs/convert/parquet",
    local_dir=Path("./") / local_name,
    local_dir_use_symlinks=False,
    resume_download=True,
    cache_dir="./cache",
    max_workers=16,
)

提取 hf repo 的文件列表，并存至 TXT，供 IDM 等工具下载

import aiohttp
import asyncio
import json


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def main():

    repo_id = "PixArt-alpha/PixArt-alpha"
    repo_type = "model"
    urls = []

    type_route_api = "models/"
    type_route_dl = ""
    if repo_type == "dataset":
        type_route_api = "datasets/"
        type_route_dl = "datasets/"

    url = f'https://hf-mirror.com/api/{type_route_api}{repo_id}'
    async with aiohttp.ClientSession() as session:
        response_text = await fetch(session, url)
        response_json = json.loads(response_text)

        for sibling in response_json["siblings"]:
            download_url = f"https://hf-mirror.com/{type_route_dl}{repo_id}/resolve/main/" + sibling["rfilename"] + "?download=true"
            urls.append(download_url)

    with open('output_urls.txt', 'w') as file:
        for item in urls:
            file.write(item + '\n')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

数据标注

多模态模型标注

LLaVA：https://llava-vl.github.io/
CogVLM：https://huggingface.co/spaces/THUDM/CogVLM-CogAgent

打标

美学评分

20 万动漫：https://huggingface.co/spaces/Laxhar/anime-thetic