This stuff moves so fast I really can’t keep up, and a lot of the research posted here goes a bit over my head. I’m looking for something that doesn’t seem too out of the question given things like CLIPSeg. Is there a tool or library out there that will accept an image and a prompt, then generate a mask showing roughly where in the image the things described by the prompt could go?

For example, if I had a picture of an empty park and gave the prompt “little girl flying a kite”, I should get back a mask vaguely in the shape of a child, plus a sort of blob mask in the sky for the kite. From there, of course, I could use the mask to inpaint those things. I would really like to be able to layer an image kind of like Photoshop, so it’s not all-or-nothing and I can focus on one element at a time. I could do the masking manually, but of course we all want fewer steps in our workflows.
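
To make the layering idea concrete, here is a rough sketch of the compositing step only, assuming NumPy and images stored as float arrays in the range 0 to 1. The function name `composite_layer` and the dummy arrays are made up for illustration; the part that turns a prompt into a mask is exactly the piece I don't have, so it is faked here with a hand-drawn square.

```python
import numpy as np

def composite_layer(base, layer, mask):
    """Blend `layer` over `base` wherever `mask` is nonzero.

    base, layer: float arrays of shape (H, W, 3), values in [0, 1]
    mask: float array of shape (H, W), values in [0, 1] (1 = use layer)
    """
    alpha = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return base * (1.0 - alpha) + layer * alpha

# Dummy data standing in for real images and a generated mask (all hypothetical).
park = np.zeros((4, 4, 3))   # the "empty park" background
girl = np.ones((4, 4, 3))    # stand-in for the inpainted content
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0         # hand-drawn blob where the new element should go

result = composite_layer(park, girl, mask)
```

Repeating this once per element (child first, then kite) would give the one-thing-at-a-time layering, with each mask and inpainted layer kept separate until the final merge.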