The availability of continuous streams of data from multiple sensor modalities covering the same workspace has long been recognised as a privilege by robotics researchers. Data fusion has a successful track record in the field, leading to the now-routine generation of high-quality, large-scale metric and topological maps of unstructured environments. With this success, however, comes the realisation that prominent applications in robotics, such as action selection and human-machine interaction, require information beyond mere metric or topological representations. As a result, researchers throughout the community are becoming increasingly interested in adding higher-order, semantic information to the maps obtained. In this context, the availability of a rich set of data from complementary modalities once again comes into its own. In this talk, we provide an overview of past and ongoing work aiming to enrich the standard metric or topological maps provided by a mobile robot with higher-order semantic information. As a baseline, we review an approach to scene labelling based on a shallow hierarchy of support vector machines operating on both appearance and 3D lidar data. We then outline a more sophisticated approach in which environmental cues are considered for classification at different scales: the first stage considers local scene properties using a probabilistic bag-of-words classifier; the second stage incorporates contextual information across a given scene (spatial context) and across several consecutive scenes (temporal context) via a Markov Random Field (MRF). We demonstrate the benefit of considering such spatial and temporal context during classification and analyse the performance of our technique on data gathered over almost 17 km of track through a city.