XS-VID: An Extra Small Object Video Detection Dataset

XS-VID is a comprehensive dataset for Extra Small Object Video Detection, including diverse day and night scenes such as rivers, forests, skyscrapers, and streets.

Update

[20250303] XS-VIDv2 is now available as a new download that merges in the v1 data to form a single dataset; convenient command-line download instructions are given below!
[20250301] We have released XS-VIDv2, containing 374 videos with 186446 frames for training and 36478 frames for testing!
[20241124] We will soon be releasing XS-VIDv2, incorporating many new videos and scenarios!
[20241011] Our paper has been released at https://arxiv.org/abs/2407.18137
[20240811] Annotations in YOLO format released!
[20240530] The quantitative results of several mainstream methods on the XS-VID test set are reported!
[20240530] We added visualizations of images in XS-VID.
[20240528] Our homepage for the XS-VID benchmark is open!

XS-VID

XS-VID contains a diverse array of scenes featuring multiple categories and sizes of targets. Notably, XS-VID achieves unprecedented breadth and depth in covering and quantifying minuscule targets (< $32^2$ pixels). Some example images are shown below.
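
As a rough illustration of the size criterion above, the following sketch checks whether a bounding box falls under the $32^2$-pixel area threshold. The helper name and the pixel-unit width and height inputs are assumptions made for this example; they are not part of the official toolkit.

```python
# Area threshold taken from the text above: targets smaller than 32^2 pixels.
SMALL_AREA_THRESHOLD = 32 ** 2  # 1024 square pixels


def is_below_small_threshold(width_px: float, height_px: float) -> bool:
    """Return True if a box of this pixel size has area < 32^2 (hypothetical helper)."""
    return width_px * height_px < SMALL_AREA_THRESHOLD


print(is_below_small_threshold(12, 9))   # area 108, below the threshold
print(is_below_small_threshold(40, 30))  # area 1200, not below the threshold
```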

Here is a statistical comparison of our dataset with other related datasets:

[Figure: statistical comparison of XS-VID with related datasets]

Results

We report the quantitative results of several representative methods on the XS-VID test set and the VisDrone2019 VID test-dev set as follows.

[Figure: quantitative results on the XS-VID test set and the VisDrone2019 VID test-dev set]

Codes

The official code for our benchmark, which mainly covers data preparation and evaluation, is released below.

Our XS-VID baseline: YOLOFT
VOD Method: MMtracking; DiffusionVID
GOD Method: MMdetection
SOD Method: CFINet; CEASC
YOLO Method: Ultralytics; StreamYOLO
Eval Tools: Eval code

Download

We provide several ways to download our dataset. You can download the full XS-VID dataset using any of the options below.

Download method 1: Hugging Face

Linux Command:

pip install huggingface_hub && \
huggingface-cli download lanlanlan23/XS-VID --repo-type dataset --local-dir ./XS-VID && \
mkdir -p ./XS-VID/{annotations,images} && \
unzip -o ./XS-VID/annotations.zip -d ./XS-VID/annotations && \
find ./XS-VID -name 'videos_subset_*.zip' -exec unzip -o {} -d ./XS-VID/images \; && \
rm -f ./XS-VID/*.zip

Windows Command (CMD):

pip install huggingface_hub && ^
huggingface-cli download lanlanlan23/XS-VID --repo-type dataset --local-dir ./XS-VID && ^
mkdir "./XS-VID\annotations" && mkdir "./XS-VID\images" && ^
powershell -Command "Expand-Archive -Path './XS-VID/annotations.zip' -DestinationPath './XS-VID/annotations' -Force" && ^
(for /r "./XS-VID" %f in (videos_subset_*.zip) do powershell -Command "Expand-Archive -Path '%f' -DestinationPath './XS-VID/images' -Force") && ^
del /f /q "./XS-VID\*.zip"

Download method 2: ModelScope (recommended for China)

Linux Command:

pip install modelscope && \
modelscope download --dataset lanlanlanrr/XS-VID --local_dir ./XS-VID && \
mkdir -p ./XS-VID/{annotations,images} && \
unzip -o ./XS-VID/annotations.zip -d ./XS-VID/annotations && \
find ./XS-VID -name 'videos_subset_*.zip' -exec unzip -o {} -d ./XS-VID/images \; && \
rm -f ./XS-VID/*.zip

Windows Command (CMD):

pip install modelscope && ^
modelscope download --dataset lanlanlanrr/XS-VID --local_dir ./XS-VID && ^
mkdir "./XS-VID\annotations" && mkdir "./XS-VID\images" && ^
powershell -Command "Expand-Archive -Path './XS-VID/annotations.zip' -DestinationPath './XS-VID/annotations' -Force" && ^
(for /r "./XS-VID" %f in (videos_subset_*.zip) do powershell -Command "Expand-Archive -Path '%f' -DestinationPath './XS-VID/images' -Force") && ^
del /f /q "./XS-VID\*.zip"
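
With either download method, the extraction steps can also be done in a platform-independent way. The sketch below is a minimal Python version of the unzip and cleanup steps, assuming the archives are named `annotations.zip` and `videos_subset_*.zip` as in the commands above. The function `extract_xsvid` is a hypothetical helper, not part of the official code.

```python
import zipfile
from pathlib import Path


def extract_xsvid(root: str = "./XS-VID", delete_zips: bool = True) -> None:
    """Unpack annotations.zip and videos_subset_*.zip under `root` (hypothetical helper)."""
    root_dir = Path(root)
    annotations_dir = root_dir / "annotations"
    images_dir = root_dir / "images"
    annotations_dir.mkdir(parents=True, exist_ok=True)
    images_dir.mkdir(parents=True, exist_ok=True)

    # Extract the annotation archive if it is present.
    annotations_zip = root_dir / "annotations.zip"
    if annotations_zip.is_file():
        with zipfile.ZipFile(annotations_zip) as zf:
            zf.extractall(annotations_dir)

    # Collect every video archive first, then extract them one by one.
    video_zips = sorted(root_dir.rglob("videos_subset_*.zip"))
    for zip_path in video_zips:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(images_dir)

    if delete_zips:
        # Mirrors `rm -f ./XS-VID/*.zip`: only top-level archives are removed.
        for zip_path in root_dir.glob("*.zip"):
            zip_path.unlink()
```

Calling `extract_xsvid("./XS-VID")` after the download step should leave the same layout as the shell commands; pass `delete_zips=False` to keep the archives.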

Expected Folder Structure

./XS-VID/
├── annotations/    # Annotation files
└── images/         # Video frames (extracted from videos_subset_*.zip)
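
To confirm that extraction produced this layout, a small check such as the following may help. It is only a sketch: `check_layout` is a hypothetical helper, and it verifies nothing beyond the two folders shown above existing and being non-empty.

```python
from pathlib import Path
from typing import List


def check_layout(root: str = "./XS-VID") -> List[str]:
    """Return a list of problems with the expected sub-folders (hypothetical helper)."""
    problems = []
    for name in ("annotations", "images"):
        folder = Path(root) / name
        if not folder.is_dir():
            problems.append(f"{name}/ is missing")
        elif not any(folder.iterdir()):
            problems.append(f"{name}/ is empty")
    return problems
```

An empty list returned by `check_layout("./XS-VID")` means both folders exist and contain at least one entry.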

Notes

The commands above automatically delete the ZIP files after extraction.
Ensure sufficient disk space: roughly the size of the ZIP files plus the extracted content.
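
If the archives are already downloaded, the space still needed is roughly the size of the extracted content. The sketch below estimates this from the uncompressed sizes recorded inside each ZIP and compares it with the free space on the target drive. Both function names are hypothetical, and the check ignores filesystem overhead.

```python
import shutil
import zipfile
from pathlib import Path


def extracted_size_bytes(root: str = "./XS-VID") -> int:
    """Sum the uncompressed sizes recorded in every ZIP found under `root`."""
    total = 0
    for zip_path in Path(root).rglob("*.zip"):
        with zipfile.ZipFile(zip_path) as zf:
            total += sum(info.file_size for info in zf.infolist())
    return total


def has_room_to_extract(root: str = "./XS-VID") -> bool:
    """Compare free space on the drive holding `root` with the extracted size."""
    free = shutil.disk_usage(root).free
    return free >= extracted_size_bytes(root)
```

Running `has_room_to_extract("./XS-VID")` after the download and before unzipping gives an early warning if the drive is too small.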

If you encounter an error while unpacking, you can get help from the issues page.

Citation

If you find XS-VID useful for your research and applications, please cite it using this BibTeX:

@article{guo2024XSVID,
  title={XS-VID: An Extremely Small Video Object Detection Dataset},
  author={Jiahao Guo and Ziyang Xu and Lianjun Wu and Fei Gao and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2407.18137},
  year={2024}
}

Support or Contact

If you have any questions about our XS-VID benchmark, please feel free to contact us at gjh_mhust@hust.edu.cn.