Enhancing LLM Applications with High-Quality Data Processing

As we continue to explore LLM applications, including DeepSeek's capabilities for SEC filing retrieval, one theme keeps surfacing: the critical role of high-quality input data. Large Language Models (LLMs) are powerful tools for extracting insights from unstructured data, but their effectiveness in Retrieval Augmented Generation (RAG) depends on the quality and structure of the data they process. At Context Analytics (CA), our patented data refinement technologies enhance the accuracy and reliability of LLM-driven outputs. Although neither was designed solely for this purpose, our Advanced Parsing Engine and Twitter Source Rating Algorithm are two examples of CA technology that can be leveraged to clean and structure data for LLM inputs, improving retrieval accuracy and overall performance.

1) Advanced Parsing Engine for Structured Text Input

Our Parsing Engine was initially developed to process complex documents, primarily global filings, while preserving their hierarchical structure and converting them into machine-readable JSON. Over time, this technology evolved into our Universal Document Processor (UDP), which can ingest any document type and extract or standardize all textual and financial data.
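
The exact UDP schema is proprietary, but as a rough illustration, a hierarchical, machine-readable representation of one filing section might look something like the Python sketch below. The field names are made up for this example and are not CA's actual output.

    # Illustrative only: a hypothetical, simplified hierarchical representation of a
    # filing section. Field names are assumptions, not CA's actual UDP schema.
    import json

    filing_section = {
        "document": "Annual Report",
        "section": "Corporate Governance",
        "subsections": [
            {
                "heading": "QCA Principle 9",
                "sub_headers": [
                    {"title": "Sub-header 1", "text": "..."},
                    {"title": "Sub-header 2", "text": "..."},
                ],
            }
        ],
        "tables": [
            {
                "title": "Risk and Response",
                "rows": [{"risk": "...", "response": "..."}],
            }
        ],
    }

    print(json.dumps(filing_section, indent=2))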

Traditional document formats like PDFs often require preprocessing, either through Python libraries or as part of the LLM's answering mechanism, which adds inefficiency and potential inaccuracy to retrieval. UDP streamlines this process by transforming documents into a structured, LLM-friendly JSON format, enabling faster and more accurate retrieval.
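
For contrast, a conventional raw-PDF workflow might flatten the document to plain text with a general-purpose library such as pypdf before the LLM ever sees it, while a pre-parsed JSON input skips that lossy step. The sketch below shows the two paths; the file names are placeholders, and this is not CA's pipeline.

    # Rough sketch of the two input paths; file names are placeholders, not CA's pipeline.
    import json

    from pypdf import PdfReader  # a common general-purpose PDF text-extraction library

    # Path 1: raw PDF -- pages are flattened to plain text, losing table and heading structure.
    reader = PdfReader("agronomics_annual_report.pdf")
    raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Path 2: pre-parsed structured JSON -- hierarchy and tables are already explicit.
    with open("agronomics_structured.json") as f:
        structured_json_text = json.dumps(json.load(f))

    # Either string can now be placed in the LLM's context window.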

Case Study

As a case study, we analyzed the Agronomics Limited Annual Report dated 2025-02-10, a 40-page PDF containing nested tables, uneven column headings, and sub-headers whose content spans multiple pages. This lets us compare OpenAI's GPT-4o RAG capabilities using our parsed, structured JSON versus the raw PDF as the context input.

For example, Agronomics has company principles that span multiple pages, such as QCA Principle 9.

So we can ask GPT-4o to "Summarize the QCA Principle 9 and all sub-headers," changing nothing but the input file.
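
One way to script such a comparison is with the OpenAI Python client; the sketch below is an assumption about how the test could be automated, and the file names and system message are illustrative rather than the exact setup used here.

    # Minimal sketch: same model, same question, only the document representation changes.
    # File names and the system message are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()
    QUESTION = "Summarize the QCA Principle 9 and all sub-headers."

    def ask(context: str) -> str:
        """Ask GPT-4o the question with a given document representation as context."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer using only the provided document."},
                {"role": "user", "content": f"{context}\n\n{QUESTION}"},
            ],
        )
        return response.choices[0].message.content

    raw_text = open("agronomics_raw.txt").read()                       # flattened PDF text
    structured_json_text = open("agronomics_structured.json").read()   # CA structured JSON

    print(ask(raw_text))
    print(ask(structured_json_text))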

Raw PDF

Raw PDF output, items 1-6 and #7

With the raw PDF input, the model incorrectly adds "Board Meetings" and "Governance Structure" as sub-headers, even though neither is one. It also merges three sub-headers into one and omits the last three. As a result, it correctly identifies only 4 of the 10 sub-headers while hallucinating an additional two.

CA Structured JSON

Using the same model and prompt but replacing the raw PDF input with our parsed, structured JSON eliminates the hallucinations: the model identifies all 10 sub-headers and names each one correctly. Our processing alone improved the model's accuracy.
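
One simple way to quantify such a comparison is to count correct, missed, and hallucinated sub-headers against a ground-truth list, as in the sketch below; the ground-truth names shown are placeholders rather than the report's actual ten sub-headers.

    # Sketch of the scoring used informally above; the ground-truth names are placeholders.
    def score(extracted: list[str], ground_truth: list[str]) -> dict:
        truth = {h.strip().lower() for h in ground_truth}
        found = {h.strip().lower() for h in extracted}
        return {
            "correct": len(found & truth),
            "missed": len(truth - found),
            "hallucinated": len(found - truth),
        }

    ground_truth = [f"Sub-header {i}" for i in range(1, 11)]  # the real names would go here
    print(score(["Sub-header 1", "Board Meetings"], ground_truth))
    # {'correct': 1, 'missed': 9, 'hallucinated': 1}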

Similarly, the report's risk and response table contains nested, detailed information.

Risk and Response table

Raw PDF

The model extracts a significant amount of information from the raw PDF but still introduces hallucinations in sub-header names and content. For example, it misses “Assessing Observable Inputs” and instead generates “Review of Market Data & Comparable Transactions,” which is not directly sourced from the table and contains information such as “funding rounds” that is likely hallucinated.

CA Structured JSON

The structured JSON improves both the sub-header names and their content while reducing hallucinated information. It accurately captures all seven responses provided by Agronomics in the table.

These examples demonstrate how our Universal Document Processor (UDP) enhances the accuracy of a powerful large language model like GPT-4o. The structured JSON input eliminates the hallucinations seen with the raw PDF and improves overall accuracy. As the demand for leveraging unstructured data with LLMs grows, UDP can play a crucial role in improving retrieval quality.


2) Source Rating Algorithm on X (Twitter)

Another area where Context Analytics excels is unstructured social media data. As the shift away from mainstream media continues, platforms like X (formerly Twitter) have become critical sources for real-time news. Financial securities are frequently discussed on X, with a constant stream of updates that analysts need to monitor. Summarizing these discussions helps financial professionals stay informed about evolving sentiment and key market developments.

While X offers an API that anyone can integrate, the raw data remains highly unstructured and requires extensive processing to extract meaningful insights. Additionally, a large share of tweets come from unreliable sources or contain irrelevant information, making it challenging to filter out the noise.

With over a decade of experience processing Twitter data, Context Analytics has developed a proprietary algorithm to filter out spam, bots, and low-quality sources. While LLMs can summarize tweets, our technology ensures that only high-quality, relevant content is used—enhancing the accuracy of AI-generated insights. This is particularly valuable for financial professionals who rely on social media intelligence for market analysis.

Again, the quality of an LLM's output is directly tied to the quality of its input. CA's Source Rating Algorithm minimizes misinformation and noise, producing cleaner data that provides better context for financial decision-making.
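
As a simplified illustration of how rated tweets could feed an LLM summary, the sketch below filters on a hypothetical source_rating field before prompting the model. The field name and threshold are assumptions for this example; CA's actual rating algorithm and schema are proprietary.

    # Illustrative sketch: keep only highly rated, ticker-relevant tweets before summarizing.
    # The `source_rating` field and the 0.8 threshold are hypothetical, not CA's actual schema.
    from openai import OpenAI

    client = OpenAI()

    def summarize_ticker(ticker: str, tweets: list[dict]) -> str:
        """Summarize high-quality tweets that mention a given ticker symbol."""
        filtered = [
            t["text"] for t in tweets
            if t.get("source_rating", 0.0) >= 0.8 and ticker in t["text"]
        ]
        prompt = (
            f"Summarize the key themes and sentiment in these tweets about {ticker}:\n\n"
            + "\n".join(filtered)
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content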

AI summary for $LFWD and $ANRO

AI summary for $NKGN

Because our filtering and processing techniques are proprietary, the accuracy and reliability of our financial tweet summaries are difficult to replicate elsewhere. Years of experience, coupled with unique data refinement technologies, give Context Analytics an edge in delivering the most relevant and actionable insights from social media.

Conclusion

Context Analytics provides essential data processing capabilities for LLM applications, ensuring that input data is structured, clean, and contextually relevant. Our Advanced Parsing Engine transforms complex documents into machine-readable formats, while our Source Rating Algorithm filters out noise so that only high-quality, reliable social media data reaches the model. These proprietary technologies set us apart, allowing us to deliver AI-driven insights with unmatched accuracy in the financial and business sectors.

As AI continues to evolve, the value of structured, high-quality data cannot be overstated. Whether optimizing retrieval in regulatory filings or refining real-time social media intelligence, Context Analytics remains at the forefront of LLM-driven innovation.

For more information, visit www.contextanalytics-ai.com.