Woodpecker's stages work together to detect and correct inconsistencies between the image content and the generated text. First, it identifies the main objects mentioned in the text. Next, it formulates questions about the extracted objects, such as their number and attributes. It then answers these questions using expert models, a step called visual knowledge validation. After that, it converts the question-answer pairs into a visual knowledge base made up of object-level and attribute-level claims about the image. Finally, guided by this knowledge base, Woodpecker corrects the hallucinations and attaches the corresponding evidence.
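To make the flow of data between these stages concrete, here is a minimal Python sketch. It is not Woodpecker's actual code: every function name, the `Claim` class, the fixed vocabulary, and the `detections` dictionary are hypothetical stand-ins. In particular, the expert models are replaced by a simple dictionary lookup, and the final stage only flags objects that were not found rather than rewriting the text.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    level: str  # "object" or "attribute"
    text: str


def extract_objects(text, vocabulary):
    # Stage 1: find the main objects mentioned in the text.
    # Assumption: a fixed vocabulary match stands in for a real extractor.
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return [obj for obj in vocabulary if obj in words]


def formulate_questions(objects):
    # Stage 2: ask about each object's number and attributes.
    questions = []
    for obj in objects:
        questions.append((obj, "count", f"How many {obj}s are in the image?"))
        questions.append((obj, "color", f"What color is the {obj}?"))
    return questions


def validate(questions, detections):
    # Stage 3: answer each question.
    # Assumption: `detections` is a stub for the output of expert models.
    answers = []
    for obj, kind, _question in questions:
        found = detections.get(obj, {})
        answer = found.get("count", 0) if kind == "count" else found.get(kind)
        answers.append((obj, kind, answer))
    return answers


def build_knowledge_base(answers):
    # Stage 4: turn question-answer pairs into object- and attribute-level claims.
    claims = []
    for obj, kind, answer in answers:
        if kind == "count":
            claims.append(Claim("object", f"There are {answer} {obj}(s) in the image."))
        elif answer is not None:
            claims.append(Claim("attribute", f"The {obj} is {answer}."))
    return claims


def find_hallucinations(answers):
    # Stage 5 (simplified): flag mentioned objects that were not found.
    return [obj for obj, kind, answer in answers if kind == "count" and answer == 0]


# Hypothetical example: the text mentions a frisbee that the stub "experts" never detect.
text = "A brown dog chases a frisbee."
detections = {"dog": {"count": 1, "color": "brown"}}

objects = extract_objects(text, ["dog", "frisbee", "cat"])
answers = validate(formulate_questions(objects), detections)
knowledge_base = build_knowledge_base(answers)
hallucinated = find_hallucinations(answers)
```

In this toy run, the knowledge base holds two object-level claims and one attribute-level claim, and the frisbee is flagged because the stub detections contain no entry for it. A real system would pass the knowledge base to a language model to rewrite the text, which this sketch leaves out.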
Here’s how: