Gestures are a powerful medium of communication and have also been found to make human-machine communication more natural. Because the hands can express shape flexibly and immediately in the spatial medium, gestures are an ideal modality for describing pictorial content.

This thesis investigates the principles according to which gesture and speech express shape-related content and presents a computational approach to the interpretation of such multimodal expressions. For this purpose, a corpus of speech-gesture shape descriptions, acquired in an empirical study, is described and analyzed. The empirical results inform a formal representation model for a unified description of shape conveyed via gesture and speech. Building on this model, an implemented process model is described that algorithmically interprets input from the two modalities and generates an internal representation of shape.