A new artificial intelligence approach could change how language models assess truth and improve the factual consistency of their responses. The method, known as “Large Language Model Debates” (LLM Debates), has two language models argue opposing sides of a question over three rounds, in a format designed to evaluate and improve factual accuracy.
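To make the format concrete, the debate can be pictured as a simple loop in which each debater sees the arguments accumulated so far before responding. The sketch below is illustrative rather than the authors' implementation: `run_debate` and its prompt wording are hypothetical, and `generate` stands in for any chat-completion call (a Bedrock-backed version is sketched further down).

```python
def run_debate(question: str, answer_a: str, answer_b: str,
               generate, rounds: int = 3) -> str:
    """Run a fixed-round debate between two models defending opposing answers.

    `generate` is any callable mapping a prompt string to a model reply.
    In the setup described in this article, each side would be backed by a
    different model and the rounds default to three.
    """
    transcript = []
    for round_num in range(1, rounds + 1):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            history = "\n".join(transcript) if transcript else "(no arguments yet)"
            prompt = (
                f"Question: {question}\n"
                f"You are debater {side}. Argue that the correct answer is: {answer}\n"
                f"Debate so far:\n{history}\n"
                f"Write your round {round_num} argument."
            )
            transcript.append(f"[Round {round_num}] Debater {side}: {generate(prompt)}")
    return "\n".join(transcript)
```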
This procedure is particularly valuable in contexts where manual annotation of ground truth is costly, slow, and potentially contentious. By generating synthetic data, LLM debates can not only accelerate the alignment and curation of ground-truth labels in unlabeled, unexplored datasets, but also contribute to the training of broader and more robust language models.
Inspired by one of the standout works of the 2024 International Conference on Machine Learning (ICML), this technique relies on the “TofuEval” dataset. In each LLM debate exercise, two models, Anthropic’s Claude 3 Sonnet and Mistral AI’s Mixtral 8x7B, defend opposing sides of an argument, while a third model, Mistral 7B, acts as a judge to determine which position is more convincing.
Using the AWS environment, these debates are conducted through Amazon SageMaker and Amazon Bedrock, which provide the infrastructure needed to manage the complexity of the process. Amazon Bedrock is highlighted as a comprehensive service for experimenting with, customizing, and deploying generative AI capabilities.
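One plausible way to back the `generate` step above is the Bedrock Runtime `converse` API in boto3. The region and the model identifiers below are assumptions and should be verified against the current Bedrock model catalog for your account:

```python
import boto3

# Bedrock Runtime client; region is an assumption, adjust as needed.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed Bedrock model IDs for the three models named in the article;
# verify these against the Bedrock console before use.
MODEL_IDS = {
    "debater_a": "anthropic.claude-3-sonnet-20240229-v1:0",  # Claude 3 Sonnet
    "debater_b": "mistral.mixtral-8x7b-instruct-v0:1",       # Mixtral 8x7B
    "judge": "mistral.mistral-7b-instruct-v0:2",             # Mistral 7B
}

def generate(prompt: str, model_id: str = MODEL_IDS["debater_a"]) -> str:
    """Send a single-turn prompt to a Bedrock model and return its text reply."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```

In the debate loop sketched earlier, each side would be bound to its own model, for example with `functools.partial(generate, model_id=MODEL_IDS["debater_b"])` for the second debater, while the judge's verdict uses the Mistral 7B identifier.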
The core challenge is deciding which of two candidate summaries is more factually consistent with a set of provided transcripts, a judgment complicated by subtle shifts in meaning and reasoning errors. Four different techniques are compared: Naive Judge, Expert Judge, LLM Consultancy, and LLM Debates.
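These four setups differ chiefly in what evidence the judge sees when ruling between the two summaries. The helper below is a hypothetical illustration of that difference, not the prompt wording used in the experiments:

```python
def judge_prompt(summary_a: str, summary_b: str,
                 transcript: str | None = None,
                 debate_log: str | None = None) -> str:
    """Build the judge prompt; the extra context distinguishes the setups.

    - Naive judge:  no transcript, no arguments (rules on plausibility alone).
    - Expert judge: sees the source transcript.
    - Consultancy:  one model argues a single side (passed as debate_log).
    - Debate:       sees the full multi-round exchange between both models.
    """
    parts = [
        "Which summary is more factually consistent? Answer 'A' or 'B'.",
        f"Summary A: {summary_a}",
        f"Summary B: {summary_b}",
    ]
    if transcript is not None:
        parts.append(f"Source transcript:\n{transcript}")
    if debate_log is not None:
        parts.append(f"Arguments presented:\n{debate_log}")
    return "\n\n".join(parts)
```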
The techniques differ in the accuracy of their verdicts: the debate method proved the most effective, reaching 70% accuracy in the experiments, while the naive judge serves as the baseline at 10% accuracy.
Advances in LLM Debates not only show significant improvements in factual accuracy but also promise to reduce the cost and time of manual annotation. This approach could set a new standard for generating accurate, reliable data to train advanced language models, laying the groundwork for substantial improvements in conversational and task-oriented AI applications.