Combined Tracking and Object Recognition: Tracking Hands
Background
3D hand tracking has great potential as a tool for better human-computer interaction. Tracking hands, in particular articulated finger motion, is a challenging problem because the motion exhibits many degrees of freedom. Typically hand motion can be characterized by 27 degrees of freedom, 21 for the joint angles and 6 for orientation and location. Estimation in this high dimensional state space given only an image (or video sequence) of a hand is rather difficult. Other obstacles which have limited the use of hand trackers in real applications are the handling of self-occlusion (very common in hand motion), tracking in cluttered backgrounds, and automatic tracker initialization. Note that 3D tracking is different from gesture recognition, where there is a limited set of hand poses which need to be recognized.
The presented algorithm uses a tree of templates generated from a 3D geometric hand model. The hand model is built from truncated quadrics, and its contours can be projected into the image plane while handling self-occlusion. Articulated hand motion is learned from training data collected with a data glove, leading to a lower dimensional representation of finger motion. The likelihood cost function is based on the chamfer distance between projected contours and edges in the image. Additionally, edge orientation and skin colour information are used, making the matching more robust in cluttered backgrounds. The problem of tracker initialisation is solved by searching the tree in the first frame without the use of any prior information.
At the heart of the tracker is the tree-based filter, which approximates the optimal Bayesian filtering equations. We propose a tree-based representation of the posterior distribution, where the leaves define a partition of the state space with piecewise constant density. The advantage of this representation is that regions with low probability mass can be rapidly discarded in a hierarchical search, and the distribution can be approximated to arbitrary precision.
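The hierarchical search idea can be illustrated with a toy sketch. It is not the tracker's actual filter: it assumes a one-dimensional state, ignores dynamics, and uses hypothetical names and parameters (`hierarchical_posterior`, `branch`, `keep_ratio`, `toy_likelihood`). It only shows how low-probability regions are discarded at a coarse level so that fine cells are evaluated only where the mass is.

```python
import math


def hierarchical_posterior(likelihood, lo, hi, levels=4, branch=4, keep_ratio=0.1):
    """Coarse-to-fine search over the interval [lo, hi].

    Returns a list of (start, end, probability) leaves. Each leaf carries a
    constant density; pruned regions are treated as having zero mass.
    """
    cells = [(lo, hi)]
    for _ in range(levels):
        # Split every surviving cell into `branch` equal children.
        children = []
        for a, b in cells:
            width = (b - a) / branch
            children.extend((a + i * width, a + (i + 1) * width) for i in range(branch))
        # Score each child at its centre and discard the weak ones.
        scores = [likelihood(0.5 * (a + b)) for a, b in children]
        threshold = keep_ratio * max(scores)
        cells = [cell for cell, s in zip(children, scores) if s >= threshold]
    # Mass of a leaf = (likelihood at its centre) * (leaf width), then normalise.
    masses = [likelihood(0.5 * (a + b)) * (b - a) for a, b in cells]
    total = sum(masses)
    if total == 0:
        raise ValueError("likelihood is zero on every surviving cell")
    return [(a, b, m / total) for (a, b), m in zip(cells, masses)]


def toy_likelihood(x):
    """Hypothetical single-peaked likelihood centred on 0.3."""
    return math.exp(-0.5 * ((x - 0.3) / 0.05) ** 2)
```

Calling `hierarchical_posterior(toy_likelihood, 0.0, 1.0)` evaluates far fewer cells than a full grid at the finest resolution, because most of the interval is pruned after the first level. Increasing `levels` refines the approximation around the surviving regions.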
The Hand Model
The hand model is built from a set of truncated quadrics, including ellipsoids, cones and cylinders. The advantages of this representation are that the geometry is represented with only a few parameters and that the contours can be generated easily using projective geometry. The projection of a quadric contour into an image is a conic. For example, the projection of an ellipsoid is an ellipse, and the projection of a cone is a pair of lines. Self-occlusion is also handled when projecting the contours, yielding usable templates. A default shape is first obtained by taking measurements from a real hand. Given the image data, shape matching can be used to estimate a set of shape parameters, including finger lengths and a width parameter.
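As a minimal sketch of why a quadric projects to a conic, assume a normalised camera P = [I | 0] at the origin and write the quadric as a symmetric 4x4 matrix with upper-left 3x3 block A, last-column vector b and corner scalar c. A viewing ray through image point x is tangent to the surface exactly when x^T (cA - b b^T) x = 0, so the outline is the conic C = cA - b b^T. The function names below are illustrative, and the truncation of the quadrics and the occlusion handling are not covered.

```python
def outline_conic(Q):
    """Outline conic (3x3) of quadric Q (4x4, symmetric) for camera P = [I | 0]."""
    A = [row[:3] for row in Q[:3]]          # upper-left 3x3 block
    b = [Q[i][3] for i in range(3)]         # first three entries of last column
    c = Q[3][3]                             # bottom-right scalar
    return [[c * A[i][j] - b[i] * b[j] for j in range(3)] for i in range(3)]


def sphere_quadric(centre, radius):
    """Quadric matrix of a sphere: |X - centre|^2 - radius^2 = 0."""
    cx, cy, cz = centre
    return [
        [1, 0, 0, -cx],
        [0, 1, 0, -cy],
        [0, 0, 1, -cz],
        [-cx, -cy, -cz, cx * cx + cy * cy + cz * cz - radius * radius],
    ]
```

For a unit sphere centred on the optical axis at depth 5, `outline_conic(sphere_quadric((0, 0, 5), 1))` is a diagonal matrix describing a circle around the principal point, as expected from the symmetry of the setup.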
The model has 27 degrees of freedom: 6 for the global pose, 4 for the pose of each finger, and 5 for the pose of the thumb. However, hand motion is constrained, as each joint can only move within certain limits. Furthermore, the motion of different joints is correlated; for example, most people find it difficult to bend the middle finger and keep the ring finger extended at the same time. Go on, try it yourself. Alternatively, try bending your little finger while keeping the ring finger extended. Thus hand articulation can be expected to lie in a compact region within the high-dimensional angle space. Analysis of data captured with a data glove also showed that in most cases 95% of the variance is captured by the first 8 principal components. This can be exploited to reduce the dimensionality of the search space.
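The variance criterion can be sketched as follows, assuming the eigenvalues of the joint-angle covariance matrix have already been computed from the glove data. The function name and the eigenvalues used to call it are hypothetical; no glove data is reproduced here.

```python
def components_for_variance(eigenvalues, fraction=0.95):
    """Smallest number of principal components whose eigenvalues
    account for at least `fraction` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= fraction:
            return k
    return len(ordered)
```

With the 21 joint-angle eigenvalues as input, a return value of 8 would correspond to the reduction described above.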
The Likelihood Function
The likelihood function is at the heart of any estimation algorithm, as it relates the observations to the unknown state. Ideally the chosen observations should yield a likelihood with high discriminative power for detecting a hand with as few local minima as possible. Furthermore it should be possible to compute the likelihood (and features, if needed) with little computational overhead. For hand tracking, finding good features and a suitable likelihood function is challenging, since there are few good features which can be detected and tracked reliably (unlike faces, for example). Color values and edge contours seem to be suitable and have been used in other trackers in the past. In our case we therefore assume that the data is taken from two sets of observations: edge data and color data.
The term for the edge data is based on a chamfer distance function. The chamfer distance is the mean (or root mean squared average) of the distances between each point in the model point set and its closest point in the edge point set. The chamfer distance between two shapes can be efficiently computed using a distance transform (DT). This transformation takes a binary feature image as input, and assigns to each pixel in the image the distance to its nearest feature. The distance between a template and an edge map can then be computed as the mean of the DT values at the template point coordinates. Edge orientation and color normal to the edge are also taken into account.
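The DT-based computation can be sketched in a few lines. This is an illustration under simplifying assumptions: the distance transform is computed by brute force (fast DT algorithms exist but are not shown), the edge map is a small nested list of 0/1 values, template points are (row, column) pairs inside the image, and the orientation and color terms are omitted.

```python
import math


def distance_transform(edge_map):
    """Euclidean distance from every pixel to its nearest edge pixel (brute force)."""
    edges = [(r, c) for r, row in enumerate(edge_map) for c, v in enumerate(row) if v]
    if not edges:
        raise ValueError("edge map contains no edge pixels")
    return [
        [min(math.hypot(r - er, c - ec) for er, ec in edges) for c in range(len(row))]
        for r, row in enumerate(edge_map)
    ]


def chamfer_distance(dt, template_points):
    """Mean DT value sampled at the template point coordinates."""
    return sum(dt[r][c] for r, c in template_points) / len(template_points)
```

Because the distance transform depends only on the image, it is computed once per frame and can then be reused for every template, which is what makes matching many templates affordable.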
The color term is based on the skin color distribution. The RGB values are intensity normalized and skin color is modeled as a Gaussian distribution in this normalized space. For background pixels a uniform distribution is assumed.
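A per-pixel version of this color term might look like the sketch below. Several details are assumptions made for illustration: the Gaussian is axis-aligned (independent r and g components) rather than a full-covariance model, the background is uniform over the chromaticity triangle r, g >= 0, r + g <= 1 (area 1/2, hence density 2), and the mean and standard deviation passed in are placeholders rather than learned values.

```python
import math


def normalise_rgb(r, g, b):
    """Intensity-normalised chromaticity (r, g); b is redundant since r + g + b = 1."""
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)  # treat black as neutral grey
    return (r / total, g / total)


def log_skin_density(pixel, mean, std):
    """Log of an axis-aligned 2D Gaussian density in (r, g) space."""
    x, y = normalise_rgb(*pixel)
    zx = (x - mean[0]) / std[0]
    zy = (y - mean[1]) / std[1]
    return -0.5 * (zx * zx + zy * zy) - math.log(2.0 * math.pi * std[0] * std[1])


# Uniform background density over the chromaticity triangle (assumption, see above).
LOG_BACKGROUND_DENSITY = math.log(2.0)


def log_skin_ratio(pixel, mean, std):
    """Positive when the pixel is better explained by skin than by background."""
    return log_skin_density(pixel, mean, std) - LOG_BACKGROUND_DENSITY
```

Normalising by intensity means that a pixel and a darker copy of it, such as (200, 120, 80) and (100, 60, 40), map to the same point and receive the same score.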
left - Surface described by the negative log-likelihood function when searching the scale and angle space, matching a hand template with the input image on the right.
right - The superimposed template corresponds to the global minimum, but there are many local minima.
Oxford Brookes Vision Group