PoC: A talking camera

I wanted to test out a cool game mechanic: an automatic image feature detection system that tells you what's in an image. You can get the idea by trying out the Kittydar site, which tells you whether there's a cat in an image.

Kittydar detecting two cats in an image

Now I think it would be a cool game mechanic if the user could "take a photo" and a voice would call out what it sees in the image.

Fake it 'til you make it

While detecting cat faces is such a trivial task that it can be done in a browser using crude tools, I'd like to have a Google-level AI that makes valid sentences about what's in the picture.

Fortunately, as a game creator I'm also in full control of the surrounding world, so I think it would be pretty easy to fake this effect by labelling all the "important" stuff in the world and listing only the labels that are visible to the camera.

Label saying: You're seeing a pig and a cow
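
Here's a rough sketch of how the faking could work. Everything in it is my own assumption for illustration: a flat 2D world, a rectangular camera, and the names Labelled, Camera and describe. It's not the demo's actual code.

```python
from dataclasses import dataclass


@dataclass
class Labelled:
    """A hypothetical labelled object placed in a flat 2D world."""
    name: str
    x: float
    y: float


@dataclass
class Camera:
    """A hypothetical camera that sees an axis-aligned rectangle."""
    left: float
    top: float
    width: float
    height: float

    def sees(self, obj: Labelled) -> bool:
        # An object counts as visible when its position is inside the rectangle.
        return (self.left <= obj.x <= self.left + self.width
                and self.top <= obj.y <= self.top + self.height)


def article(name: str) -> str:
    # Naive article choice: good enough for "a pig" / "an owl".
    return "an" if name[0].lower() in "aeiou" else "a"


def describe(camera: Camera, world: list) -> str:
    # Keep only the labels the camera can see, then join them into a sentence.
    names = [f"{article(o.name)} {o.name}" for o in world if camera.sees(o)]
    if not names:
        return "You're seeing nothing"
    if len(names) == 1:
        return f"You're seeing {names[0]}"
    return "You're seeing " + ", ".join(names[:-1]) + " and " + names[-1]


world = [Labelled("pig", 2, 3), Labelled("cow", 5, 1), Labelled("owl", 40, 40)]
print(describe(Camera(0, 0, 10, 10), world))  # You're seeing a pig and a cow
```

In a real engine the sees check would probably be a frustum or raycast test instead of a rectangle, but the idea stays the same: filter by visibility, then turn the labels into a sentence.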

Priorities

Now, sometimes you see a lot of stuff on the screen. You might be in a jungle where there are fireflies, bushes, trees, monkeys, birds... and a lion that's sprinting towards you. A normal person only describes the most important things, and so should my "AI".

I figured that I could give every labelled object a priority number (the more important the object, the higher the number) so the description can focus on the most important things happening on the screen.

Camera pans, label changes between forest (in the background) and a duck (foreground)
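
A minimal sketch of the priority idea could look like this. I'm assuming that a higher number means a more important object; the Labelled class, the most_important helper and the jungle priorities are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Labelled:
    """A hypothetical labelled object; a higher priority means more important."""
    name: str
    priority: int


def most_important(visible: list, limit: int = 1) -> list:
    # Sort the visible objects from most to least important and keep the top few.
    return sorted(visible, key=lambda o: o.priority, reverse=True)[:limit]


jungle = [
    Labelled("fireflies", 1),
    Labelled("bush", 1),
    Labelled("tree", 2),
    Labelled("monkey", 5),
    Labelled("lion", 9),
]
print([o.name for o in most_important(jungle, limit=2)])  # ['lion', 'monkey']
```

With a limit of one, the "AI" would only ever mention the lion, which is probably what you'd shout in that situation too.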

Sprinkle some details

Removing all the details leaves us with a very short list of the most important objects in the scene. But the relations between those objects carry details that help us figure out what's actually happening.

We can achieve this effect by linking two visible objects in a scene and describing the relationship between them ("inside", "over", "under", etc.). This way we can get the extra detail by making a "hop" from the most important visible object along its most important visible link:

Title describing: "A cow pretty close to a spruce"
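
The "hop" could be sketched roughly like this. Again, the Link and Labelled classes, the priorities and the describe function are assumptions of mine to illustrate the idea, not the demo's real implementation.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Link:
    """A hypothetical relation from one labelled object to another."""
    relation: str        # e.g. "inside", "over", "pretty close to"
    target: "Labelled"
    priority: int        # assumption: higher means more worth mentioning


@dataclass(eq=False)
class Labelled:
    """A hypothetical labelled object with outgoing links."""
    name: str
    priority: int
    links: list = field(default_factory=list)


def describe(visible: list) -> str:
    if not visible:
        return "Nothing"
    # Start from the most important visible object...
    subject = max(visible, key=lambda o: o.priority)
    # ...and only consider links whose other end is visible as well.
    candidates = [link for link in subject.links if link.target in visible]
    if not candidates:
        return f"A {subject.name}"
    # Hop along the most important of those links.
    link = max(candidates, key=lambda l: l.priority)
    return f"A {subject.name} {link.relation} a {link.target.name}"


spruce = Labelled("spruce", 2)
fence = Labelled("fence", 1)
cow = Labelled("cow", 5, links=[
    Link("pretty close to", spruce, 3),
    Link("behind", fence, 1),
])

print(describe([spruce, cow, fence]))  # A cow pretty close to a spruce
print(describe([cow, fence]))          # A cow behind a fence
```

When the spruce drops out of view, the description falls back to the next visible link, and with no visible links at all it's just "A cow".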

Try it out

That's it! I think this was a good proof of concept for a system describing what's visible to a camera like a good "AI" would describe it.

Check out the interactive demo to test how it works yourself!