We uncover a surprising multilingual bias in a popular class of multimodal vision-language models (VLMs): including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model responding in English, regardless of the language of the query.
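To make the effect measurable, a minimal sketch under stated assumptions: `query_vlm(prompt, image=None)` is a hypothetical wrapper around a LLaVA-style model (not defined here), and the response language is detected with the `langdetect` package; the sketch compares the English-response rate for non-English queries with and without an image attached.

```python
from langdetect import detect  # pip install langdetect

def english_response_rate(queries, images, query_vlm):
    """Fraction of non-English queries answered in English, with vs. without an image.

    `query_vlm(prompt, image=None)` is a hypothetical wrapper around a
    LLaVA-style VLM that returns the model's textual response.
    """
    with_image, text_only = [], []
    for prompt, image in zip(queries, images):
        with_image.append(detect(query_vlm(prompt, image=image)) == "en")
        text_only.append(detect(query_vlm(prompt)) == "en")
    return sum(with_image) / len(with_image), sum(text_only) / len(text_only)

# Example usage (non-English queries paired with images):
# rate_img, rate_text = english_response_rate(queries, images, query_vlm)
# print(f"English responses: {rate_img:.1%} with image vs. {rate_text:.1%} text-only")
```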
We investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. Specifically, we introduce a new silver standard benchmark which we use to assess the models' multilingual language fidelity and question answering accuracy.
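A sketch of how the two scores could be computed under simple assumptions: language fidelity is approximated by checking with `langdetect` that the response language matches the query language, and question answering accuracy by a lenient string match against a gold answer. The field names (`query_lang`, `response`, `gold_answer`) are illustrative, not the benchmark's actual schema.

```python
from langdetect import detect  # pip install langdetect

def evaluate(examples):
    """Compute language fidelity and lenient QA accuracy over a list of examples.

    Each example is a dict with illustrative keys:
      'query_lang'  - ISO code of the prompt language (e.g. 'de')
      'response'    - the model's answer
      'gold_answer' - the reference answer
    """
    fidelity = sum(detect(ex["response"]) == ex["query_lang"] for ex in examples)
    accuracy = sum(ex["gold_answer"].strip().lower() in ex["response"].strip().lower()
                   for ex in examples)
    n = len(examples)
    return fidelity / n, accuracy / n

examples = [
    {"query_lang": "de", "response": "Die Hauptstadt von Frankreich ist Paris.",
     "gold_answer": "Paris"},
    {"query_lang": "de", "response": "The capital of France is Paris.",
     "gold_answer": "Paris"},
]
print(evaluate(examples))  # e.g. (0.5, 1.0): one answer is correct but drifted to English
```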
We propose a novel framework for zero-shot module composition, which encompasses existing approaches as well as several novel variations for selecting, weighting, and combining parameter modules under a single unified notion. Focusing on the scenario of domain knowledge and adapter layers, our framework provides a systematic unification of concepts, allowing us to conduct the first comprehensive benchmarking study of various zero-shot knowledge composition strategies.
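As one concrete instance of what such a framework covers, here is a minimal sketch of parameter-level composition: domain adapters are combined as a weighted average of their parameters under a chosen weighting (uniform below, purely for illustration). The toy state dicts and key names are assumptions; the framework also covers other selection, weighting, and combination variants.

```python
import torch

def compose_adapters(adapter_states, weights):
    """Weighted parameter-level combination of adapter state dicts.

    adapter_states: dict mapping domain name -> state_dict of that domain's adapter
    weights:        dict mapping domain name -> scalar weight (should sum to 1)
    Returns a new state_dict whose tensors are the weighted average.
    """
    domains = list(adapter_states)
    composed = {}
    for key in adapter_states[domains[0]]:
        composed[key] = sum(weights[d] * adapter_states[d][key] for d in domains)
    return composed

# Toy example: two "domain adapters" with a single weight matrix each.
adapters = {
    "news":    {"down_proj.weight": torch.randn(16, 64)},
    "reviews": {"down_proj.weight": torch.randn(16, 64)},
}
uniform = {d: 1.0 / len(adapters) for d in adapters}  # one possible weighting strategy
merged = compose_adapters(adapters, uniform)
print(merged["down_proj.weight"].shape)  # torch.Size([16, 64])
```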
Multi-task learning (MTL) has shown considerable practical benefits, particularly when using pre-trained language models (PLMs). However, current two-stage MTL methods come at the cost of introducing a substantial number of additional parameters. In this work, we address this issue by leveraging simple linear scaling of the output representations of source adapters for transfer learning.
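A minimal sketch of the underlying idea, not the paper's exact implementation: for a target task, only a small set of scaling coefficients is learned, which linearly scale the output representations of frozen source adapters before summing them. The bottleneck-adapter shape below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ScaledCombination(nn.Module):
    """Combine frozen source-adapter outputs via learned scaling coefficients.

    Only `self.scales` (one scalar per source adapter) is trained for the
    target task; the source adapters themselves stay frozen.
    """
    def __init__(self, source_adapters):
        super().__init__()
        self.source_adapters = nn.ModuleList(source_adapters)
        for adapter in self.source_adapters:
            adapter.requires_grad_(False)  # keep source adapters frozen
        self.scales = nn.Parameter(torch.ones(len(source_adapters)) / len(source_adapters))

    def forward(self, hidden_states):
        # Linearly scale each source adapter's output representation and sum.
        return sum(w * adapter(hidden_states)
                   for w, adapter in zip(self.scales, self.source_adapters))

# Toy usage: two "source adapters" as bottleneck MLPs over 64-dim hidden states.
make_adapter = lambda: nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))
combiner = ScaledCombination([make_adapter(), make_adapter()])
out = combiner(torch.randn(2, 10, 64))  # (batch, seq_len, hidden)
print(out.shape)                        # torch.Size([2, 10, 64])
```

The design choice this illustrates: the per-task trainable state shrinks to a handful of scalars, rather than the full fusion modules used by typical two-stage MTL approaches.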