PhD Position in Multimodal (Audio and Vision) Conversational Foundation Models
This funded PhD position at Queen Mary University of London offers an exciting opportunity to advance the field of multimodal conversational foundation models. The research is conducted in collaboration with Tavus Inc, a US-based startup specializing in digital humans, and is embedded within the Multimedia and Vision Research group and the Centre for Multimodal AI. The project aims to design and train components of a next-generation multimodal model capable of perceiving and generating both verbal and non-verbal responses in conversational contexts.
Unlike traditional text-based dialogue systems, this research focuses on capturing the richness of human-to-human interaction, including facial expressions, voice intonation, gestures, emotions, and social signals. The objectives include multimodal perception of human behaviour using supervised, unsupervised, and reinforcement learning methodologies; post-training techniques for controllable multimodal generation aligned with conversational goals and personality; and the development of diffusion-based methods for high-quality, identity-preserving audio-visual output.
The team regularly publishes in top venues, including the CVPR, ICCV, ECCV, and NeurIPS conferences and the TPAMI and IJCV journals. It has access to extensive computational resources, including a server with 64 A100 GPUs and exclusive access to additional servers. The research environment is highly collaborative, with regular interaction between the London-based and international teams of Tavus Inc.
The funding covers full tuition fees (for both Home and International students) and provides an annual stipend of £22,780 for three years. Applicants should hold or expect to obtain an MSc in Electronic Engineering, Computer Science, or a closely related discipline; a distinction or first-class degree is highly desirable. The application process requires submission of a CV, cover letter, research proposal, two references, and a certificate of English language proficiency (if applicable).
Applications are accepted year-round until the position is filled, with a preferred start date of June 1, 2026 (or as soon as possible). For further information, prospective candidates are encouraged to contact Prof Ioannis Patras at [email protected] and to review the supervisor's academic profiles at https://ipatras.github.io and https://www.qmul.ac.uk/eecs/people/profiles/patrasioannis.html. Detailed application instructions are available at http://eecs.qmul.ac.uk/phd/how-to-apply/.
This position is ideal for candidates passionate about artificial intelligence, computer vision, and multimodal machine learning, seeking to contribute to cutting-edge research in conversational AI and digital avatars.