With the post production software to place up the things realistically in scenes is much more robust for the computers rather than the humans. It also requires not only determining an appropriate location for the said object but also trying to predict the appearance of the object at the target locations, its pose, shape, scale, occlusions and more.
Moreover, artificial intelligence promises to lend a hand. In apaper which has been accepted at the NeurIPS 2018 conference last week, the researchers at the Seoul National University.
“Inserting objects into an image that conforms to scene semantics is a challenging and interesting task. The task is closely related to many real-world applications, including image synthesis, AR and VR content editing, and domain randomization,” the researchers wrote. “Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications.”
Their end to end framework compromises two modules, one that determines where the inserted object should be, and a second which defines what it should mainly look like, that uses GAN’s, or two-part neural networks consisting of generators that produce the discriminators and samples that also attempt to differentiate between the real world samples and generated samples.
“The main technical novelty of this work is to construct an end-to-end trainable neural network that can sample plausible locations and shapes for the new object from its joint distribution,” the paper’s authors wrote. “The synthesized object instances can be used either as an input for GAN based methods or for retrieving the closest segment from an existing dataset, to generate new images.”
As they explain, the generator, in this case, predicts “plausible” location to generate object masks with “semantically coherent” scales, poses, and shapes — specifically how objects are distributed in a scene and how to insert an object naturally so that it appears to be a part of the scene. Over time, in the course of training, the AI system learns a different distribution for each object category conditioned on a scene — for example, in images of city streets, the fact that people tend to be on sidewalks and cars are usually on the road.