Facebook Inc.’s more than 2.2 billion users share a staggering number of photos and videos each day, all of which the social giant needs to catalog, add to search results and scan for harmful content. A big portion of those images contain text that must be analyzed as well.
To handle this monumental task, the company has built a sophisticated artificial intelligence system called Rosetta.
Every day, Rosetta extracts text in a wide range of languages from more than a billion publicly shared images on Facebook and Instagram. The system can analyze the contents of not only standalone image files, but also individual frames within videos. It scans the images using a technique that differs from those employed by traditional text recognition software.
The system approaches text analysis as a so-called sequence prediction problem. It analyzes images and uses historical data, rather than just the visual profile of the individual characters, to understand the writing. Facebook said this approach enables Rosetta to recognize words of any length, even ones it wasn’t exposed to during the training phase of development.
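To make the sequence prediction framing concrete, here is a minimal Python sketch of one common way to turn per-position character scores into a word of any length: a greedy, CTC-style decode that collapses repeated predictions and drops a "blank" symbol. The article does not say which decoding scheme Rosetta uses, so the technique, the `ALPHABET` string and the `greedy_decode` function are all illustrative assumptions rather than Facebook's implementation.

```python
import numpy as np

# Hypothetical alphabet for illustration; index 0 is reserved for the "blank" symbol.
ALPHABET = "-abcdefghijklmnopqrstuvwxyz"
BLANK = 0


def greedy_decode(scores: np.ndarray) -> str:
    """Decode a (positions, alphabet_size) score matrix into a string.

    At each horizontal position the highest-scoring symbol is taken, then
    consecutive repeats are collapsed and blanks are removed. The output
    length is not fixed in advance; it depends only on the predictions.
    """
    best = scores.argmax(axis=1)
    chars = []
    prev = BLANK
    for idx in best:
        # Emit a character only when it is not blank and differs from the previous step.
        if idx != BLANK and idx != prev:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```

Because the word is assembled one position at a time, nothing in this scheme requires the word to have appeared in a fixed vocabulary, which is consistent with the claim that the system can read words it never saw in training.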
“Once we obtain the bounding boxes for word locations on an image, they are cropped and resized to a height of 32 pixels with the aspect ratio maintained,” detailed the Facebook engineers who worked on Rosetta. “All such crops for an image are batched into a single tensor with zero padding as needed and then processed at once by the text recognition model.”
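The preprocessing the engineers describe can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the image is a 2-D grayscale array, boxes are given as `(x0, y0, x1, y1)` pixel coordinates, and a simple nearest-neighbor resize stands in for whatever interpolation Facebook actually uses. The names `resize_to_height` and `batch_crops` are hypothetical.

```python
import numpy as np

TARGET_HEIGHT = 32  # height quoted by the Facebook engineers


def resize_to_height(crop: np.ndarray, height: int = TARGET_HEIGHT) -> np.ndarray:
    """Nearest-neighbor resize of a 2-D crop to a fixed height, keeping the aspect ratio."""
    h, w = crop.shape
    new_w = max(1, round(w * height / h))
    # Map each output row/column back to the nearest source row/column.
    rows = np.minimum((np.arange(height) * h) // height, h - 1)
    cols = np.minimum((np.arange(new_w) * w) // new_w, w - 1)
    return crop[rows][:, cols]


def batch_crops(image: np.ndarray, boxes, height: int = TARGET_HEIGHT):
    """Crop each word box, resize it, and stack all crops into one zero-padded tensor.

    Assumes `boxes` is a non-empty list of (x0, y0, x1, y1) tuples.
    Returns the batch of shape (num_boxes, height, max_width) and the
    unpadded width of each crop.
    """
    resized = [resize_to_height(image[y0:y1, x0:x1], height) for x0, y0, x1, y1 in boxes]
    widths = [r.shape[1] for r in resized]
    batch = np.zeros((len(resized), height, max(widths)), dtype=image.dtype)
    for i, r in enumerate(resized):
        # Narrower crops keep zeros on the right as padding.
        batch[i, :, : r.shape[1]] = r
    return batch, widths
```

Stacking every crop into one tensor lets a recognition model process all the words in an image in a single pass, and keeping the original widths makes it possible to ignore the padded region afterwards.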
Facebook is using Rosetta to power several features. The system makes images searchable via Facebook and Instagram’s respective search functions, helps determine how they should show up in the News Feed and looks for offensive content. The company plans to extend it to yet more areas over time.
“As we look beyond images, one of the biggest challenges is extracting text efficiently from videos,” Facebook’s engineers wrote. “The naive approach of applying image-based text extraction to every single video frame is not scalable, because of the massive growth of videos on the platform, and would only lead to wasted computational resources.”