Development of a Document Processing Platform with AI: Integration of an Open Source NER Model and LLM on Amazon SageMaker

A national laboratory in the United States has launched an innovative platform to address accessibility and discoverability problems in its historical archives. Although the archives hold valuable information, much of it remains hidden because documents lack metadata or are labeled inconsistently. Traditional keyword-based search often proves ineffective, forcing exhaustive manual review to extract relevant data.

To tackle these challenges, the laboratory developed an AI-powered document processing solution. The platform combines named entity recognition (NER) and large language models (LLMs) on Amazon SageMaker. It modernizes access to archived records by automating metadata enrichment, document classification, and summary generation. The system uses the Mixtral-8x7B model to create summaries and titles, and a BERT-based NER model to extract structured metadata, significantly improving the organization and retrieval of scanned documents.
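The article does not publish the request schemas, so the following is a minimal sketch of how the two endpoints might be called. The helper names, prompt template, and generation parameters are assumptions; `invoke` expects a boto3 `sagemaker-runtime` client passed in by the caller.

```python
import json

def build_summary_payload(document_text: str, max_new_tokens: int = 512) -> dict:
    """Build a hypothetical request body for a Mixtral-8x7B text-generation endpoint."""
    prompt = (
        "Summarize the following archival document and propose a title.\n\n"
        f"{document_text}\n\nSummary:"
    )
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2},
    }

def build_ner_payload(document_text: str) -> dict:
    """Build a request body for a BERT-based token-classification (NER) endpoint."""
    return {"inputs": document_text}

def invoke(runtime, endpoint_name: str, payload: dict) -> dict:
    """Send a JSON payload to a SageMaker real-time endpoint and decode the reply."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```

In practice the NER output (entities such as authors, dates, and organizations) would be written back as document metadata alongside the generated summary and title.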

The platform architecture is serverless and cost-optimized: SageMaker endpoints are provisioned dynamically, so resources are used efficiently and the system scales with demand. By integrating advanced natural language processing and large language models, the tool improves metadata accuracy, enabling more effective search and agile document management. This approach not only supports digital transformation but also ensures that archived data is put to use in research, policy development, and institutional knowledge preservation.
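A minimal sketch of the dynamic-provisioning idea: create an endpoint only when a batch needs it, then delete it when the batch finishes so no idle instances accrue charges. The naming convention and control flow are assumptions, not the laboratory's actual implementation; `sm` is a boto3 `sagemaker` client.

```python
def endpoint_name_for(model_name: str, stage: str = "batch") -> str:
    """Derive a deterministic endpoint name so repeated runs reuse the same endpoint."""
    return f"{model_name}-{stage}".lower().replace("_", "-")

def ensure_endpoint(sm, endpoint_name: str, endpoint_config_name: str) -> None:
    """Create the endpoint only if it does not already exist, then wait until it serves."""
    existing = sm.list_endpoints(NameContains=endpoint_name)["Endpoints"]
    if not any(e["EndpointName"] == endpoint_name for e in existing):
        sm.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

def teardown_endpoint(sm, endpoint_name: str) -> None:
    """Delete the endpoint after the batch completes to stop per-hour instance charges."""
    sm.delete_endpoint(EndpointName=endpoint_name)
```

Pairing `ensure_endpoint` and `teardown_endpoint` around each batch is what keeps the architecture cost-optimized: compute exists only while documents are actually being processed.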

Named the NER & LLM Gen AI Application, the platform combines the strengths of NER and LLMs to automate large-scale document analysis. It takes a modular approach, with separate components handling each aspect of document processing, from summary creation to author identification. Processing is triggered when new documents are detected in the extraction bucket; the system then orchestrates the creation of the endpoints needed for batch processing, avoiding redundant work and keeping operations efficient.
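The trigger described above is the standard S3 ObjectCreated notification pattern. This is a hedged sketch of the entry point, assuming a Lambda-style handler (the article does not name the compute service); the de-duplication step reflects the "avoiding redundant work" behavior mentioned in the text.

```python
import urllib.parse

def extract_new_documents(event: dict) -> list:
    """Pull (bucket, key) pairs out of an S3 ObjectCreated event payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # S3 URL-encodes object keys in event notifications.
        key = urllib.parse.unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key:
            pairs.append((bucket, key))
    return pairs

def handler(event, context):
    """Hypothetical entry point: collect new documents for a batch NER/LLM run."""
    # De-duplicate so re-delivered events do not reprocess the same object.
    docs = sorted(set(extract_new_documents(event)))
    return {"documents": docs}
```

From here the orchestrator would provision the NER and LLM endpoints, fan the documents out for batch processing, and tear the endpoints down when the queue drains.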

A key feature of this solution is its ability to process up to 100,000 documents within a 12-hour window, underscoring its cost-effectiveness and performance. Running extractive summarization as a first step reduces the downstream workload by 75 to 90%, which translates into faster processing and lower operational costs. The platform stands as a robust response to the growing demand for efficient document processing in research and knowledge management.
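The extractive first pass works by selecting the most representative sentences so the LLM only sees a fraction of each document. The article does not specify the scoring method, so this is a minimal frequency-based sketch of the technique, not the laboratory's actual algorithm.

```python
import re
from collections import Counter

def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Keep the highest-scoring sentences by word frequency, in original order.

    Shrinking the input this way (often by 75-90% on long documents) is what
    cuts the cost of the LLM pass that follows.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) <= max_sentences:
        return text.strip()
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return " ".join(s for s in sentences if s in top)
```

Only the resulting excerpt is sent to Mixtral-8x7B for abstractive summarization and title generation, which is where the throughput and cost savings come from.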
