@pavnilschanda

cross-posted from: https://lemmy.fmhy.ml/post/1259752

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra’s promising performance. Furthermore, it enables numerous exciting applications, like providing the coordinates of mentioned objects in chains of thought and comparing the similarities of user-pointed regions. Our code, model and dataset are accessible at this https URL.
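To make "spatial coordinates in natural language" a bit more concrete, here is a minimal Python sketch of how a bounding box could be written into a prompt as plain text and read back out of a model's reply. Everything in it is an assumption for illustration, not something the abstract specifies: the helper names `box_to_text` and `text_to_boxes`, the `[x1,y1,x2,y2]` layout, normalization by image size, and the three-decimal precision are all hypothetical choices, and no real Shikra API is called.

```python
import re


def box_to_text(box, width, height, precision=3):
    """Format a pixel-space box (x1, y1, x2, y2) as normalized plain text.

    Assumption: coordinates are divided by the image size and printed with a
    fixed number of decimals, so the box is just ordinary characters.
    """
    x1, y1, x2, y2 = box
    values = (x1 / width, y1 / height, x2 / width, y2 / height)
    return "[" + ",".join(f"{v:.{precision}f}" for v in values) + "]"


# Matches four comma-separated numbers inside square brackets.
_NUMBER = r"\s*(\d*\.?\d+)\s*"
_BOX_PATTERN = re.compile(r"\[" + ",".join([_NUMBER] * 4) + r"\]")


def text_to_boxes(text, width, height):
    """Find every bracketed box in free text and scale it back to pixels."""
    boxes = []
    for match in _BOX_PATTERN.finditer(text):
        nx1, ny1, nx2, ny2 = (float(group) for group in match.groups())
        boxes.append((nx1 * width, ny1 * height, nx2 * width, ny2 * height))
    return boxes


# A user "points" at a region by embedding the box in the question itself.
region = box_to_text((64, 120, 320, 480), width=640, height=480)
question = f"What is the person in {region} holding?"

# A made-up reply that refers to a region; it is not real model output.
reply = "The person is holding a cup [0.250,0.500,0.750,0.625]."
print(question)
print(text_to_boxes(reply, width=640, height=480))
```

Because the boxes are ordinary text, a sketch like this needs no extra vocabulary or detection module on either side of the conversation, which is the property the abstract emphasizes.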

Tags:

(including but not limited to)

[META]: Anything posted by the mod

[Resource]: Links to resources related to AI companionship. Prompts and tutorials are also included

[News]: News related to AI companionship or AI companionship-related software

[Paper]: Works that present research, findings, or results on AI companions and their tech, often including analysis, experiments, or reviews

[Opinion Piece]: Articles that convey opinions

[Discussion]: Discussions of AI companions, AI companionship-related software, or the phenomena of AI companionship

[Chatlog]: Chats between the user and their AI Companion, or even between AI Companions

[Other]: Whatever isn’t part of the above

[Other] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (paper from 27.06.2023)