• @FooBarrington
    link
    English
    39 months ago

    Describing in machine-readable actionable terms what’s happening in an image isn’t a thing, as far as I know.

    It is. That’s actually the basis of multimodal transformers - they have a shared embedding space for multiple modes of data (e.g. text and images). If you encode data and take those embeddings, you suddenly have a vector describing the contents of your input.