In today’s world, effective communication is essential to fostering inclusion and breaking down barriers. However, for people who rely on visual communication methods such as American Sign Language (ASL), traditional communication tools often fall short. This is where GenASL comes in. GenASL is a generative AI-powered solution that translates speech or text into expressive ASL avatar animations, bridging the gap between spoken or written language and sign language.
The rise of foundation models and the fast-moving world of generative artificial intelligence open the door to imagining and building what was previously not possible. AWS enables organizations of all sizes and developers of all skill levels to build and scale generative AI applications with security, privacy, and responsible AI.
The GenASL solution combines several AWS services that work together to translate speech or text into ASL avatar animations. Users can provide audio, video, or text as input, and GenASL generates a video of an ASL avatar interpreting the supplied content. The solution uses AWS AI and machine learning services, including Amazon Transcribe, Amazon SageMaker, Amazon Bedrock, and foundation models.
The workflow includes the following steps:
1. An Amazon EC2 instance initiates a batch process to create ASL avatars from a video dataset consisting of over 8,000 poses, extracted using RTMPose, a real-time pose estimation toolkit based on MMPose (see the pose-extraction sketch after this list).
2. AWS Amplify delivers the GenASL web application, consisting of HTML, JavaScript, and CSS, to users’ mobile devices.
3. An Amazon Cognito identity pool grants the application temporary credentials to access the Amazon S3 bucket.
4. Through the web application, users upload audio, video, or text to the S3 bucket using the AWS SDK (see the upload sketch after this list).
5. The GenASL web application invokes the backend by sending the S3 object key in the payload to a REST API hosted on Amazon API Gateway.
6. API Gateway initiates an AWS Step Functions state machine, which orchestrates the AWS AI/ML services Amazon Transcribe and Amazon Bedrock, together with the Amazon DynamoDB NoSQL database, using AWS Lambda functions (see the execution sketch after this list).
7. The Step Functions workflow generates a presigned URL for the ASL avatar video corresponding to the input audio (see the presigned URL sketch after this list).
8. The presigned URL for the video file stored in Amazon S3 is returned asynchronously to the user’s browser through API Gateway, which the client polls for status. The user’s mobile device then plays the video file using the presigned URL.
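To make the pose-extraction step concrete, here is a minimal sketch using MMPose’s MMPoseInferencer; the "human" model alias (which resolves to an RTMPose-based pipeline in MMPose 1.x) and the input video filename are assumptions, not GenASL’s actual batch code.

```python
from mmpose.apis import MMPoseInferencer

# "human" is an MMPose 1.x model alias that resolves to an RTMPose-based
# 2D human pose pipeline; the video filename below is a hypothetical example.
inferencer = MMPoseInferencer("human")

# The inferencer yields one result per processed frame of the input video.
keypoints_per_frame = []
for result in inferencer("sign_clip_0001.mp4", show=False):
    # Per MMPose 1.x inferencer output: each prediction holds per-person
    # keypoint coordinates and confidence scores for one frame.
    frame_instances = result["predictions"][0]
    keypoints_per_frame.append(
        [(inst["keypoints"], inst["keypoint_scores"]) for inst in frame_instances]
    )

print(f"Extracted poses from {len(keypoints_per_frame)} frames")
```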
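Steps 3 and 4 can be sketched server-side with boto3 (in the browser, the web app itself uses the Amplify JavaScript libraries): a guest identity from the Cognito identity pool is exchanged for temporary credentials, which then scope the S3 upload. The identity pool ID, bucket name, and object key below are hypothetical placeholders.

```python
import boto3

REGION = "us-east-1"                                 # assumed Region
IDENTITY_POOL_ID = "us-east-1:placeholder-pool-id"   # hypothetical pool ID
BUCKET = "genasl-input-bucket"                       # hypothetical bucket

cognito = boto3.client("cognito-identity", region_name=REGION)

# Obtain a guest identity and exchange it for temporary AWS credentials.
identity_id = cognito.get_id(IdentityPoolId=IDENTITY_POOL_ID)["IdentityId"]
creds = cognito.get_credentials_for_identity(IdentityId=identity_id)["Credentials"]

# Upload the user's audio with the short-lived, scoped credentials.
s3 = boto3.client(
    "s3",
    region_name=REGION,
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("recording.wav", BUCKET, "uploads/recording.wav")
```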
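For step 6, the following sketch approximates what the API Gateway integration does when it starts the state machine; the state machine ARN and the input payload shape are assumptions.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Equivalent of the API Gateway -> Step Functions integration: start the
# orchestration with the uploaded object's key as input (shape assumed).
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:GenASL",  # placeholder
    input=json.dumps({"s3_key": "uploads/recording.wav"}),
)
print("Execution started:", execution["executionArn"])
```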
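And for step 7, a final Lambda task in the workflow could return the presigned URL roughly as in this sketch; the event field names and the one-hour expiry are assumptions.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Final workflow task: return a time-limited link to the avatar video.

    The 'bucket' and 'video_key' event fields are hypothetical names.
    """
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": event["bucket"], "Key": event["video_key"]},
        ExpiresIn=3600,  # URL stays valid for one hour
    )
    return {"status": "SUCCEEDED", "videoUrl": url}
```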
The frontend application was built with Amplify, which supports developing and deploying full-stack applications, including mobile and web apps. The audio file upload to Amazon S3 uses the temporary credentials provided by the Amazon Cognito identity pool.
For an optimal user experience and to follow good API design practice, GenASL uses an asynchronous API: the client polls a REST resource to check the status of its request, along the lines of the sketch below.
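A minimal client-side polling loop might look like the following sketch, assuming a hypothetical /translations resource and response fields; the real GenASL API paths and payloads may differ.

```python
import time
import requests

API_BASE = "https://example.execute-api.us-east-1.amazonaws.com/prod"  # hypothetical endpoint

# Submit the translation job; the API returns immediately with a job ID.
job = requests.post(f"{API_BASE}/translations", json={"s3_key": "uploads/recording.wav"}).json()

# Poll the status resource until the avatar video is ready.
while True:
    status = requests.get(f"{API_BASE}/translations/{job['id']}").json()
    if status["status"] == "SUCCEEDED":
        print("Video ready:", status["videoUrl"])
        break
    if status["status"] == "FAILED":
        raise RuntimeError("Translation failed")
    time.sleep(2)  # back off between polls
```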
The backend and frontend architecture and components are designed to provide a scalable and secure solution for generating ASL avatars. Best practices include optimized service integrations and continuous monitoring with Amazon CloudWatch, which captures metrics and alerts the DevOps team in case of failures; a sketch of such an alarm follows.
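As one example of such monitoring, the sketch below creates a hypothetical CloudWatch alarm on failed Step Functions executions; the alarm name, ARNs, and thresholds are assumptions, not the solution’s actual configuration.

```python
import boto3

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:GenASL"  # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:genasl-alerts"               # placeholder

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever a Step Functions execution fails, notifying the DevOps team
# through the SNS topic above.
cloudwatch.put_metric_alarm(
    AlarmName="GenASL-StateMachine-Failures",
    Namespace="AWS/States",
    MetricName="ExecutionsFailed",
    Dimensions=[{"Name": "StateMachineArn", "Value": STATE_MACHINE_ARN}],
    Statistic="Sum",
    Period=300,                     # 5-minute evaluation windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[SNS_TOPIC_ARN],
)
```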
Next steps in the evolution of GenASL include 3D pose estimation, blending techniques for smoother videos, and bidirectional translation between ASL and spoken languages.
The combination of advanced speech-to-text technology, automated translation, and video generation with AWS AI/ML services makes GenASL a powerful solution for improving accessibility and inclusive communication.