Improving performance with a customized text splitting mechanism in Amazon Bedrock

Currently, organizations face the challenge of extracting structured information from unstructured PDF documents, which can contain a variety of elements such as images, tables, headers, and text in various formats, making efficient data analysis difficult.

Additionally, the performance of chatbots and other natural language processing (NLP) applications largely depends on the text splitting strategy used. Improper splitting can lead to loss of context, resulting in inaccurate or inconsistent responses. Fragment size also affects how well language models perform: smaller fragments carry more detailed information but make it harder to generalize, while larger fragments may omit important details.

In this context, Accenture has leveraged the customization capabilities of Knowledge Bases for Amazon Bedrock, integrating a data processing flow and customized logic to create a text splitting mechanism that improves the performance of Retrieval Augmented Generation (RAG) and unlocks the potential of data in PDFs.

The Accenture team created a knowledge base with the company’s financial results for each quarter from 2020 to 2024. The documents included images, tables, text in different formats, and other noisy elements. The goal was to extract detailed information from the tables while preserving the ability of foundation models to answer general questions about the financial results.

After several tests, the team found that the retrieval mechanism failed to return information for the years and quarters specified in the queries. For example, a search for information from the first quarter of 2023 returned data from the first quarter of 2020. Having identified problems in selecting the correct fragments, Accenture decided to change the text splitting strategy using the new features of Amazon Bedrock.

The architectural flow of the updated solution consists of the following steps: creating a data source in Amazon S3, using Amazon Textract to extract data from the PDFs, creating fragments based on the paragraphs in the Textract result, adding metadata to preserve context, and using Amazon OpenSearch Service to select the fragments most similar to the user query.

The new text splitting mechanism avoids splitting sentences or paragraphs in half and removes noisy elements to provide more useful context. Key elements of the PDFs include tables, images, page numbers, and chapter headers. The latter help label the fragments using metadata, improving the accuracy and speed of extraction.
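The paragraph-based splitting described above can be illustrated with a minimal sketch. It is not Accenture's implementation: the input is assumed to be a simplified list of block records (the `type` values `SECTION_HEADER`, `TEXT`, and `PAGE_NUMBER`, the `chunk_paragraphs` function, and the `max_chars` parameter are all hypothetical), standing in for whatever structure is derived from the Textract output. The sketch drops noisy blocks, never splits a paragraph, and labels each fragment with the most recent header.

```python
from dataclasses import dataclass

# Hypothetical block types treated as noise and dropped before splitting.
NOISE_TYPES = {"PAGE_NUMBER"}


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_paragraphs(blocks, max_chars=1000):
    """Group whole paragraphs into fragments of roughly max_chars characters.

    `blocks` is an assumed, simplified format: dicts with "type" and "text".
    A paragraph is never split; a single long paragraph becomes its own
    fragment even if it exceeds max_chars.
    """
    chunks = []
    section = None   # most recent header, used as metadata
    buffer = []      # paragraphs collected for the current fragment

    def flush():
        # Emit the buffered paragraphs as one fragment labeled with the section.
        if buffer:
            chunks.append(Chunk("\n\n".join(buffer), {"section": section}))
            buffer.clear()

    for block in blocks:
        kind, text = block["type"], block["text"].strip()
        if kind in NOISE_TYPES or not text:
            continue  # remove noisy or empty elements
        if kind == "SECTION_HEADER":
            flush()           # a fragment never spans two sections
            section = text
            continue
        # Length of the buffered text, counting the "\n\n" separators.
        current = sum(len(p) for p in buffer) + 2 * len(buffer)
        if buffer and current + len(text) > max_chars:
            flush()
        buffer.append(text)

    flush()
    return chunks
```

Because the header travels with every fragment as metadata, a fragment from one reporting period can later be told apart from a similarly worded fragment from another period.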

Customized text splitting offers several benefits, such as context preservation, flexible fragment sizes, improved retrieval performance, and seamless integration with other AWS services. Additionally, metadata filtering provides significant improvements in response accuracy, although prior knowledge of filter names and their corresponding values is required.
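To make the metadata filtering point concrete, here is a minimal sketch of how a filter for a given year and quarter might be built for a Knowledge Bases for Amazon Bedrock retrieval request. The metadata keys `year` and `quarter`, their value formats, and the helper names are assumptions for illustration; the actual keys depend on the metadata attached during ingestion, which is exactly why the filter names and values must be known in advance.

```python
def build_period_filter(year: int, quarter: str) -> dict:
    """Build a filter that matches fragments from one reporting period.

    The keys "year" and "quarter" are hypothetical metadata attribute names.
    """
    return {
        "andAll": [
            {"equals": {"key": "year", "value": year}},
            {"equals": {"key": "quarter", "value": quarter}},
        ]
    }


def build_retrieval_config(year: int, quarter: str, top_k: int = 5) -> dict:
    """Wrap the filter in a vector search configuration."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": build_period_filter(year, quarter),
        }
    }


# Sketch of how the configuration could be passed to the retrieval call
# (requires AWS credentials and an existing knowledge base; the ID and
# query text below are placeholders):
#
# import boto3
# client = boto3.client("bedrock-agent-runtime")
# response = client.retrieve(
#     knowledgeBaseId="KB_ID_PLACEHOLDER",
#     retrievalQuery={"text": "What was the revenue in the first quarter of 2023?"},
#     retrievalConfiguration=build_retrieval_config(2023, "Q1"),
# )
```

With a filter like this, the similarity search only considers fragments tagged with the requested period, so a question about the first quarter of 2023 cannot be answered from 2020 fragments.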

Finally, the accuracy gains observed with fine-tuned system templates and manual analysis of the responses showed that the customized text splitting strategy with metadata filtering offers significant advantages over fixed methods.

This joint solution between Accenture and AWS solidifies their strategic relationship and employs proven mechanisms to transform data into useful and accurate information, maximizing the potential of unstructured PDF documents in business applications.

Referrer: MiMub in Spanish
