Given an input image, MOVES produces features (visualized by projecting to RGB with PCA) that can be grouped with ordinary clustering methods and can also be used to associate hands with the objects they hold. The clusters are often sufficient to define objects, and additional cues such as a bounding box improve them further. At training time, MOVES learns this feature space through direct discriminative training on simple pseudo-labels. Although MOVES learns only from objects that hands are actively holding (such as the semi-transparent bag), we show that it also works well on inactive objects (such as the milk carton).
Our method uses manipulation in video to learn to understand held objects and hand-object contact. We train a system that takes a single RGB image and produces a per-pixel embedding that can answer grouping questions (do these two pixels go together?) as well as hand-association questions (is this hand holding that pixel?). Rather than painstakingly annotating segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given segmentations of people, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held-object tasks.
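To make the pseudo-labeling idea concrete, here is a minimal sketch (not the authors' released code) of how epipolar geometry and optical flow can flag independently moved pixels: dense flow supplies correspondences between two frames, a RANSAC-fit fundamental matrix explains motion due to the camera alone, and pixels with a large epipolar (Sampson) error are pseudo-labeled as moving. The Farneback flow, the threshold values, and the frame pairing are illustrative assumptions.

# Sketch of epipolar + optical-flow pseudo-labels (illustrative, not the paper's code).
import cv2
import numpy as np

def moving_pixel_pseudolabels(frame_a, frame_b, err_thresh=2.0):
    """Return a boolean mask of pixels whose motion violates epipolar geometry."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    # Dense optical flow gives a correspondence for every pixel.
    flow = cv2.calcOpticalFlowFarneback(
        gray_a, gray_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = gray_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts_a = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    pts_b = (pts_a + flow.reshape(-1, 2)).astype(np.float32)

    # Robustly fit a fundamental matrix; RANSAC inliers model camera motion.
    F, _ = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return np.zeros((h, w), dtype=bool)

    # Sampson distance: first-order geometric error of the epipolar constraint.
    ones = np.ones((pts_a.shape[0], 1), dtype=np.float32)
    xa = np.hstack([pts_a, ones])   # (N, 3) homogeneous points in frame a
    xb = np.hstack([pts_b, ones])   # (N, 3) homogeneous points in frame b
    Fxa = xa @ F.T                  # F x_a
    Ftxb = xb @ F                   # F^T x_b
    num = np.sum(xb * Fxa, axis=1) ** 2
    den = Fxa[:, 0]**2 + Fxa[:, 1]**2 + Ftxb[:, 0]**2 + Ftxb[:, 1]**2
    sampson = num / np.maximum(den, 1e-9)

    # High epipolar error = motion not explained by the camera alone,
    # i.e., a candidate manipulated pixel for the grouping pseudo-labels.
    return (sampson > err_thresh).reshape(h, w)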
As input, MOVES accepts an RGB image and produces an \(H \times W \times F\) per-pixel feature embedding using an HRNet backbone denoted \(f(\cdot)\). Pairs of \(F\)-dimensional embeddings from this backbone can be passed to lightweight MLPs: \(g(\cdot)\) assesses grouping probability, and \(a(\cdot)\) identifies hand association, i.e., whether one pixel is a hand and the other is an object that hand is holding. Once trained, the MOVES embeddings (visualized here with PCA to map the feature dimension to RGB) can be used to answer both grouping and hand-association queries.
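As a rough illustration of this interface, the sketch below is a hypothetical PyTorch stand-in: a small conv net replaces the HRNet backbone, and all layer sizes are assumed, not taken from the released model. It shows how per-pixel features from \(f(\cdot)\) feed two pairwise MLP heads playing the roles of \(g(\cdot)\) and \(a(\cdot)\).

# Hypothetical stand-in for the MOVES interface (sizes and backbone are assumptions).
import torch
import torch.nn as nn

class PairwiseHead(nn.Module):
    """MLP scoring a pair of F-dim pixel embeddings (grouping or hand association)."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, e1, e2):                       # e1, e2: (N, F)
        logits = self.mlp(torch.cat([e1, e2], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)     # pairwise probability

class MOVESLike(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Stand-in backbone f(.); the paper uses HRNet here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1))
        self.group_head = PairwiseHead(feat_dim)     # g(.): do two pixels group?
        self.assoc_head = PairwiseHead(feat_dim)     # a(.): does this hand hold that pixel?

    def embed(self, image):                          # image: (B, 3, H, W)
        return self.backbone(image)                  # (B, F, H, W) per-pixel features

model = MOVESLike()
img = torch.randn(1, 3, 128, 128)
feats = model.embed(img)                             # (1, 64, 128, 128)
# Query: does pixel (10, 20) group with pixel (40, 50)?
e1 = feats[0, :, 10, 20].unsqueeze(0)
e2 = feats[0, :, 40, 50].unsqueeze(0)
p_group = model.group_head(e1, e2)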
Results from MOVES, with examples from the EPIC-KITCHENS VISOR validation set.
Richard E. L. Higgins and David F. Fouhey
In Computer Vision and Pattern Recognition (CVPR), 2023.
@inproceedings{higgins2023moves,
title={MOVES: Manipulated Objects in Video Enable Segmentation},
author={Higgins, Richard EL and Fouhey, David F},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={6334--6343},
year={2023}
}
This template was originally made by Phillip Isola and Richard Zhang for a colorful project, and inherits the modifications made by Jason Zhang. The code can be found here.