Suggestion: add support for multimodal capabilities such as voice (speech-to-text), video, and more.