Image paragraph generator

Abstract

Example solutions for image paragraph captioning use a first vision language model to generate visual information (comprising text) for an image. The visual information may include tags, an initial image caption, and information on objects within the image (e.g., further tags and captions, and object attributes and locations within the image). In some examples, the visual information further includes visual clues. A generative language model generates a plurality of image story caption candidates (e.g., descriptive paragraphs) from the visual information. A second vision language model evaluates the plurality of image story caption candidates and selects one of the candidates as the final output caption.
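
The three-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names VisualInfo, ObjectInfo, build_prompt, and caption_image, the prompt layout, and the callables extract, generate, and score are all hypothetical stand-ins for the first vision language model, the generative language model, and the second vision language model. Picking the highest-scoring candidate is likewise only one assumed way the evaluation step might choose the final caption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class ObjectInfo:
    """Hypothetical per-object record: caption, attributes, and location."""
    caption: str
    attributes: List[str]
    box: Tuple[int, int, int, int]  # assumed (x, y, width, height)


@dataclass
class VisualInfo:
    """Hypothetical container for the textual visual information."""
    tags: List[str]
    caption: str
    objects: List[ObjectInfo] = field(default_factory=list)
    clues: List[str] = field(default_factory=list)  # optional visual clues


def build_prompt(info: VisualInfo) -> str:
    """Flatten the visual information into text for the generative model."""
    lines = ["Tags: " + ", ".join(info.tags), "Caption: " + info.caption]
    for obj in info.objects:
        attrs = ", ".join(obj.attributes)
        lines.append(f"Object at {obj.box}: {obj.caption} ({attrs})")
    if info.clues:
        lines.append("Clues: " + "; ".join(info.clues))
    lines.append("Describe the image in one paragraph.")
    return "\n".join(lines)


def caption_image(
    image: Any,
    extract: Callable[[Any], VisualInfo],          # first vision language model
    generate: Callable[[str, int], Sequence[str]],  # generative language model
    score: Callable[[Any, str], float],             # second vision language model
    num_candidates: int = 3,
) -> str:
    """Generate several candidate paragraphs and return the best-scoring one."""
    info = extract(image)
    candidates = generate(build_prompt(info), num_candidates)
    if not candidates:
        raise ValueError("the generative model returned no candidates")
    # Assumed selection rule: keep the candidate the scorer rates highest.
    return max(candidates, key=lambda text: score(image, text))
```

In use, each callable would wrap a real model. Keeping the models behind plain callables lets the selection logic be exercised with simple stubs before any model is attached.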

Publication
United States Patent & Trademark Office