Video-based people tracking is extremely challenging, especially when using a single camera. Among the sources of difficulty are joint reflection ambiguities, occlusions, cluttered backgrounds, non-rigidity of tissue and clothing, complex and rapid motions, and poor image resolution. In this talk, I will present two complementary approaches to addressing these issues. The first uses spatio-temporal templates to detect key poses at regular intervals and then links these detections into complete trajectories, which results in fully automated tracking. The second starts from the observation that many recent approaches to capturing human motion from video represent the set of likely poses as a low-dimensional manifold parameterized by a few latent variables. This mapping, however, is usually learned in a problem-independent way, which makes it difficult to recover motion without manually initializing the latent variables and the pose. I will therefore introduce a direct way to derive a mapping between easily observable image quantities, which serve as latent variables, and pose sequences. This approach is also fully automated and, in addition, requires only modest amounts of training data.