ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

Introduction

We introduce the new task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, where the core idea is to learn a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor then correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of the 3D bounding box of the target object. In order to train and benchmark our method, we introduce a new ScanRefer dataset, containing 46,173 descriptions of 9,943 objects from 703 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.

Please also check out the project video here.

For additional detail, please see the ScanRefer paper:
"ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language"
by Dave Zhenyu Chen, Angel X. Chang and Matthias Nießner
from Technical University of Munich and Simon Fraser University.

Dataset

If you would like access to the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.

Note: In addition to the language annotations in the ScanRefer dataset, you also need access to the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.

Download the dataset by simply executing the wget command:

wget <download_link>

Data format

"scene_id": [ScanNet scene id, e.g. "scene0000_00"],
"object_id": [ScanNet object id (corresponds to "objectId" in ScanNet aggregation file), e.g. "34"],
"object_name": [ScanNet object name (corresponds to "label" in ScanNet aggregation file), e.g. "coffee_table"],
"ann_id": [description id, e.g. "1"],
"description": [...],
"token": [a list of tokens from the tokenized description]
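As an illustration of how records in this format might be consumed, here is a minimal sketch that groups descriptions by scene and object. It assumes the annotations are stored as a JSON list of records with the fields above. The sample records, their descriptions, and the helper name `group_by_scene` are made up for illustration and are not part of the released code or data.

```python
import json
from collections import defaultdict

# Hypothetical sample records in the format described above. The real
# annotation file is assumed (not confirmed) to be a JSON list of such records.
SAMPLE_JSON = """
[
  {"scene_id": "scene0000_00", "object_id": "34", "object_name": "coffee_table",
   "ann_id": "0", "description": "this is a low coffee table.",
   "token": ["this", "is", "a", "low", "coffee", "table", "."]},
  {"scene_id": "scene0000_00", "object_id": "34", "object_name": "coffee_table",
   "ann_id": "1", "description": "the table in front of the couch.",
   "token": ["the", "table", "in", "front", "of", "the", "couch", "."]},
  {"scene_id": "scene0001_00", "object_id": "5", "object_name": "chair",
   "ann_id": "0", "description": "a wooden chair next to the desk.",
   "token": ["a", "wooden", "chair", "next", "to", "the", "desk", "."]}
]
"""


def group_by_scene(records):
    """Map scene_id -> object_id -> list of descriptions."""
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[rec["scene_id"]][rec["object_id"]].append(rec["description"])
    return grouped


records = json.loads(SAMPLE_JSON)
by_scene = group_by_scene(records)

# Print how many descriptions each (scene, object) pair has.
for scene_id, objects in by_scene.items():
    for object_id, descriptions in objects.items():
        print(scene_id, object_id, len(descriptions))
```

Note that "object_id" and "ann_id" are strings in the examples above, so they are used as dictionary keys without conversion.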

Setup

The code is tested on Ubuntu 16.04 LTS and 18.04 LTS with PyTorch 1.2.0 and CUDA 10.0. Newer versions of PyTorch (>=1.3.0) are known to cause issues, so please make sure you have the correct version installed. If you do not, execute the following command to install PyTorch:

conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
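To confirm which PyTorch version is active in the environment, a small check along the following lines may help. This is only a sketch: the helper name `has_known_issues` is hypothetical, and the threshold simply encodes the ">=1.3.0" note above.

```python
import re


def has_known_issues(version):
    """Return True if `version` (e.g. "1.3.0" or "1.10.0+cu113") is >= 1.3.

    Only the major and minor numbers are compared; unparsable strings are
    treated as problematic so that they get looked at.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return True
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 3)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if has_known_issues(torch.__version__):
            print(f"PyTorch {torch.__version__}: newer than the tested 1.2.0")
        else:
            print(f"PyTorch {torch.__version__}: no known issues")
```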

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.

Data preparation

Download the ScanRefer dataset and unzip it under data/.
Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
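The expected layout can be sanity-checked with a short script such as the sketch below. The helper name `list_scene_dirs` is hypothetical, and the naming pattern (four digits, an underscore, two digits) is an assumption inferred from the example name scene0000_00.

```python
import os
import re

# Assumed naming scheme, inferred from the example "scene0000_00".
SCENE_DIR_PATTERN = re.compile(r"^scene\d{4}_\d{2}$")


def list_scene_dirs(scans_root):
    """Return the sorted names of sub-folders of `scans_root` that look like scene ids."""
    if not os.path.isdir(scans_root):
        return []
    return sorted(
        name
        for name in os.listdir(scans_root)
        if SCENE_DIR_PATTERN.match(name)
        and os.path.isdir(os.path.join(scans_root, name))
    )


if __name__ == "__main__":
    # Run from the project root so that the relative path resolves.
    scenes = list_scene_dirs(os.path.join("data", "scannet", "scans"))
    print(f"found {len(scenes)} scene folders")
```

An empty result means that either the scans/ folder is not where the script expects it or the link to it is broken.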

Pre-process the ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:

cd data/scannet/
python batch_load_scannet_data.py

(Optional) Download the preprocessed multiview features (~36GB) and put them under data/scannet/scannet_data/.

Usage

Training

To train the ScanRefer model with RGB values:

python scripts/train.py --use_color

For more training options (like using preprocessed multiview features), please run scripts/train.py -h.

Evaluation

To evaluate a trained ScanRefer model, please find its timestamped folder under outputs/ and run:

python scripts/eval.py --folder <folder_name> --use_color

Note that the flags must match the ones used for training. The training information is stored in outputs/<folder_name>/info.json.
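To look up which settings a run was trained with, info.json can be inspected with a few lines of Python. The sketch below assumes that info.json holds a single JSON object; its keys are not documented here, so the script just prints whatever it finds. The helper names are hypothetical, and the latest run is picked by folder modification time rather than by parsing the folder name, since the timestamp format is not specified.

```python
import json
import os


def load_training_info(outputs_root, folder_name):
    """Load <outputs_root>/<folder_name>/info.json and return its contents."""
    info_path = os.path.join(outputs_root, folder_name, "info.json")
    with open(info_path) as f:
        return json.load(f)


def latest_run_folder(outputs_root):
    """Return the name of the most recently modified sub-folder, or None."""
    folders = [
        name
        for name in os.listdir(outputs_root)
        if os.path.isdir(os.path.join(outputs_root, name))
    ]
    if not folders:
        return None
    return max(
        folders,
        key=lambda name: os.path.getmtime(os.path.join(outputs_root, name)),
    )


if __name__ == "__main__":
    # Run from the project root; does nothing if outputs/ does not exist yet.
    if os.path.isdir("outputs"):
        folder = latest_run_folder("outputs")
        if folder is not None:
            info = load_training_info("outputs", folder)
            for key, value in info.items():
                print(f"{key}: {value}")
```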

Visualization

To visualize the localization results predicted by the trained ScanRefer model in a specific scene, please find the corresponding timestamped folder under outputs/ and run:

python scripts/visualize.py --folder <folder_name> --scene_id <scene_id> --use_color

Note that the flags must match the ones used for training. The training information is stored in outputs/<folder_name>/info.json. The output .ply files will be stored under outputs/<folder_name>/vis/<scene_id>/.

Changelog

01/31/2020: Fixed the issue with bad tokens.

01/21/2020: Released the ScanRefer dataset.

Citation

If you use the ScanRefer data or code in your work, please kindly cite our work and the original ScanNet paper:

@misc{chen2019scanrefer,
    title={ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language},
    author={Dave Zhenyu Chen and Angel X. Chang and Matthias Nießner},
    year={2019},
    eprint={1912.08830},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@inproceedings{dai2017scannet,
    title={Scannet: Richly-annotated 3d reconstructions of indoor scenes},
    author={Dai, Angela and Chang, Angel X and Savva, Manolis and Halber, Maciej and Funkhouser, Thomas and Nie{\ss}ner, Matthias},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    pages={5828--5839},
    year={2017}
}

Acknowledgement

We would like to thank facebookresearch/votenet for the 3D object detection codebase and erikwijmans/Pointnet2_PyTorch for the CUDA-accelerated PointNet++ implementation.

License

ScanRefer is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
