Improving Just Walk Out technology with multimodal AI in autonomous stores

Since its launch in 2018, Amazon’s Just Walk Out technology has revolutionized the shopping experience by allowing customers to enter a store, pick up products, and leave without waiting in line to pay. This frictionless checkout technology is found in over 180 third-party locations worldwide, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and university campuses. The Just Walk Out system automatically determines which products each customer selects in the store and provides digital receipts, eliminating the need for checkout lines.

In this post, we highlight the latest generation of Amazon’s Just Walk Out technology, powered by a state-of-the-art multimodal (MM) base model. We designed this model for physical stores using a transformer-based architecture similar to the one underpinning many generative AI applications. The model helps retailers generate highly accurate purchase receipts from multiple inputs, including overhead cameras, specialized shelf weight sensors, digital floor plans, and product catalog images.

Our research and development efforts in state-of-the-art multimodal models enable the Just Walk Out system to be deployed in a wide range of purchasing situations with increased accuracy and lower cost. Similar to large language models that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for each shopper visiting the store.
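To make the analogy concrete, here is a minimal, hypothetical sketch of receipt generation as autoregressive decoding: like a language model predicting the next word, a receipt model can repeatedly predict the next line item (or an end-of-receipt token) conditioned on the shopper's encoded session and the items emitted so far. The function and token names below are our own illustrations, not Amazon's actual API.

```python
from dataclasses import dataclass

END_OF_RECEIPT = "<end-of-receipt>"


@dataclass
class LineItem:
    product_id: str
    quantity: int


def generate_receipt(predict_next_item, session_tokens, max_items=100):
    """Greedy autoregressive decoding of a digital receipt.

    predict_next_item(session_tokens, receipt_so_far) -> LineItem is a stand-in
    for the real transformer decoder; it plays the role of next-token
    prediction in a language model.
    """
    receipt = []
    while len(receipt) < max_items:
        item = predict_next_item(session_tokens, receipt)
        if item.product_id == END_OF_RECEIPT:
            break
        receipt.append(item)
    return receipt


# Toy predictor for illustration only: emits two items, then stops.
def toy_predictor(session_tokens, receipt_so_far):
    script = [LineItem("sparkling-water-500ml", 2), LineItem("trail-mix-200g", 1)]
    if len(receipt_so_far) < len(script):
        return script[len(receipt_so_far)]
    return LineItem(END_OF_RECEIPT, 0)


print(generate_receipt(toy_predictor, session_tokens=[]))
```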

Due to their innovative cashier-less environment, Just Walk Out stores presented us with a unique technical challenge: retailers, shoppers, and Amazon alike demand nearly 100 percent receipt accuracy, even in the most complex purchasing situations. These include unusual shopping behaviors that create long, complicated sequences of activity requiring extra effort to analyze.

Previous generations of the Just Walk Out system used a modular architecture that addressed complex purchasing situations by breaking down the shopper’s visit into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting selections. These individual components were then integrated into sequential pipelines to enable the overall functionality of the system. While this approach produced highly accurate receipts, significant engineering effort was required to handle new and previously unseen situations, limiting its scalability.
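For contrast, a modular pipeline of this kind can be pictured as a fixed chain of hand-designed stages, each with its own interface. The sketch below is purely illustrative; the stage names, signatures, and data format are assumptions, not the actual Just Walk Out components.

```python
# Illustrative modular, sequential pipeline (not the actual system): each stage
# has a human-defined interface, so an unseen behavior that breaks one stage
# often requires engineering changes across several stages.

def detect_interactions(video_frames):
    """Stage 1: find moments when a shopper's hand approaches a shelf."""
    return [f for f in video_frames if f.get("hand_near_shelf")]


def track_items(interactions):
    """Stage 2: associate each interaction with a candidate item movement."""
    return [{"event": i, "direction": i.get("direction", "take")} for i in interactions]


def identify_products(tracked):
    """Stage 3: map each movement to a product in the catalog."""
    return [{**t, "product_id": t["event"].get("product_id", "unknown")} for t in tracked]


def count_selections(identified):
    """Stage 4: aggregate takes and returns into final quantities."""
    counts = {}
    for event in identified:
        delta = 1 if event["direction"] == "take" else -1
        counts[event["product_id"]] = counts.get(event["product_id"], 0) + delta
    return {pid: qty for pid, qty in counts.items() if qty > 0}


def modular_receipt(video_frames):
    # Stages are chained end to end; the receipt is only as robust as the weakest stage.
    return count_selections(identify_products(track_items(detect_interactions(video_frames))))


frames = [
    {"hand_near_shelf": True, "product_id": "soda-can", "direction": "take"},
    {"hand_near_shelf": True, "product_id": "soda-can", "direction": "take"},
    {"hand_near_shelf": False},
]
print(modular_receipt(frames))  # {'soda-can': 2}
```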

To address these challenges, we introduced a new MM model designed specifically for retail store environments, allowing the Just Walk Out technology to handle complex, real-world purchasing scenarios. The new model further enhances the capabilities of the Just Walk Out system by generalizing effectively to new store formats, products, and customer behaviors, which is crucial for scaling the technology.

The incorporation of continuous learning allows the model to adapt and automatically learn from new, challenging scenarios as they arise. This self-improvement capability helps ensure that the system maintains high performance even as purchasing environments continue to evolve.
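One way to picture such a continuous-learning loop: sessions where the model is uncertain, or where a reviewer corrected the receipt, are folded back into the training set before the next training run. This is a simplified sketch under our own assumptions about thresholds and data flow, not a description of Amazon's internal pipeline.

```python
# Simplified continuous-learning loop; threshold and field names are assumptions.

CONFIDENCE_THRESHOLD = 0.95  # assumed cutoff below which a session is treated as "hard"


def select_hard_sessions(sessions):
    """Pick the sessions the current model found difficult or got wrong."""
    return [
        s for s in sessions
        if s["model_confidence"] < CONFIDENCE_THRESHOLD or s.get("human_correction")
    ]


def continuous_learning_step(retrain_fn, new_sessions, training_set):
    """Fold challenging sessions back into the training data, then retrain.

    retrain_fn(training_set) is a stand-in for launching a training run on the
    updated dataset (for example, a managed training job).
    """
    training_set = training_set + select_hard_sessions(new_sessions)
    model = retrain_fn(training_set)
    return model, training_set


# Toy illustration: one confident session and one the model was unsure about.
sessions = [
    {"model_confidence": 0.99},
    {"model_confidence": 0.80, "human_correction": {"item": "yogurt", "quantity": 2}},
]
model, dataset = continuous_learning_step(
    lambda data: f"model retrained on {len(data)} hard sessions", sessions, []
)
print(model)  # model retrained on 1 hard sessions
```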

Through this combination of end-to-end learning and improved generalization, the Just Walk Out system can address a broader range of dynamic and complex retail environments. Retailers can confidently deploy this technology, knowing it will provide a frictionless shopping experience for their customers.

A key element of our Just Walk Out multimodal AI model is its flexible data inputs, which capture how shoppers interact with products and store fixtures such as shelves and refrigerators. The model relies primarily on multi-view video streams, using weight sensors only to track small items. It also maintains a 3D digital representation of the store and can access catalog images to identify products, even when a shopper returns an item to the wrong shelf.
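As a rough illustration of what these flexible inputs could look like in code, a single shopping session might bundle the multi-view video, optional weight-sensor readings, the store's 3D layout, and catalog images into one structure. The class and field names below are our own assumptions for exposition, not the real data schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np


@dataclass
class StoreSession:
    """Assumed container for one shopper's visit; field names are illustrative."""

    # Multi-view video is the primary signal: camera_id -> frames of shape (T, H, W, 3).
    camera_streams: Dict[str, np.ndarray]
    # Weight sensors are used only to track small items: shelf_id -> time series of readings.
    weight_readings: Optional[Dict[str, List[float]]] = None
    # 3D digital representation of the store layout (format here is an assumption).
    floor_plan_3d: Optional[np.ndarray] = None
    # Catalog reference images, so items can be identified even if returned to the wrong shelf.
    catalog_images: Dict[str, np.ndarray] = field(default_factory=dict)
```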

Multimodal data is processed by encoders that compress it into transformer tokens, the basic input unit for the receipt model. This allows the model to interpret hand movements, differentiate between items, and quickly and accurately count the number of items picked up or returned to the shelf. The system then uses these tokens to create digital receipts for each shopper, distinguishing between different shopping sessions and dynamically updating each receipt as items are picked up or returned.
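Conceptually, each modality passes through its own encoder that emits a sequence of tokens (fixed-size embedding vectors), and the concatenated token sequence is what the receipt model attends over. The PyTorch sketch below uses made-up dimensions and simple linear projections purely to illustrate that token interface; it is not the production architecture.

```python
import torch
import torch.nn as nn

D_MODEL = 256  # assumed token width


class ModalityEncoders(nn.Module):
    """Illustrative per-modality encoders that emit transformer tokens."""

    def __init__(self):
        super().__init__()
        # Stand-in projections; the real encoders are not public.
        self.video_proj = nn.Linear(512, D_MODEL)    # per-frame visual features -> tokens
        self.weight_proj = nn.Linear(8, D_MODEL)     # shelf weight-sensor windows -> tokens
        self.catalog_proj = nn.Linear(512, D_MODEL)  # catalog image embeddings -> tokens

    def forward(self, video_feats, weight_feats, catalog_feats):
        # Each modality becomes a sequence of D_MODEL-sized tokens,
        # and all tokens are concatenated into one input sequence.
        return torch.cat(
            [
                self.video_proj(video_feats),
                self.weight_proj(weight_feats),
                self.catalog_proj(catalog_feats),
            ],
            dim=1,
        )  # (batch, total_tokens, D_MODEL)


# A transformer can then attend over the fused tokens before a decoder head
# turns them into receipt line items.
fuser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2
)

encoders = ModalityEncoders()
tokens = encoders(torch.randn(1, 40, 512), torch.randn(1, 5, 8), torch.randn(1, 10, 512))
fused = fuser(tokens)
print(fused.shape)  # torch.Size([1, 55, 256]), ready for a receipt-generation head
```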

To train the Just Walk Out MM model, we have invested in a robust infrastructure that can efficiently process the massive amounts of data needed to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of various Amazon Web Services (AWS) offerings, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
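As a rough sketch of how such a training job can be launched, the example below uses the SageMaker Python SDK's PyTorch estimator with training data staged in Amazon S3. The script name, IAM role, bucket, instance type, and hyperparameters are placeholders, not Amazon's actual training configuration.

```python
from sagemaker.pytorch import PyTorch

# Illustrative SageMaker training job; every name and value below is a placeholder.
estimator = PyTorch(
    entry_point="train_receipt_model.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",  # placeholder IAM role
    framework_version="2.1",
    py_version="py310",
    instance_count=4,                      # distributed training across GPU nodes
    instance_type="ml.p4d.24xlarge",
    hyperparameters={"epochs": 10, "batch_size": 64},
)

# Training data read from Amazon S3 (placeholder bucket and prefix).
estimator.fit({"train": "s3://example-bucket/just-walk-out/training-data/"})
```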

In conclusion, with our innovative approach, we are moving away from modular AI systems that rely on subcomponents and interfaces defined by humans. Instead, we are building simpler and more scalable AI systems that can be trained end-to-end. Although we have only just begun, multimodal AI has raised the standard for our already highly accurate receipt system and will allow us to enhance the shopping experience in more Just Walk Out technology stores worldwide.
