Enhancing Retrieval-Augmented Generation (RAG) Performance through Effective Documentation
Hongliu (Leo) CAO, Senior Researcher
Recently, I was approached by a group of technical writers with an intriguing query: how can we change our writing style to make it easier for Large Language Model (LLM) based RAG systems to understand?
As an AI researcher, I usually focus on proposing better solutions for dealing with diverse input data qualities, since generalization ability is key. My initial reaction was: the objective of research scientists is to make AI understand and adapt to human beings, not the other way around. However, reflecting on how some state-of-the-art RAG systems are becoming more and more complex and expensive for slight performance improvements (often without rigorous analysis of and comparison on input data quality) [1], I realized that the question posed by the documentation writers is not only appropriate but also necessary to address.
The general workflow of RAG is shown in Figure 1, where Steps 1 to 4 cover document data preparation and Steps A to I cover how a user query is answered in a general RAG system. Next, I will discuss how effective documentation can help at each of these steps.
· Step 1 and Step 2: Most documents in a company are in PDF or Word format. These are binary files and are more complex than plaintext files because they store additional information such as font, color, and layout. While PDFs are easy for people to read, they are not straightforward for software to parse into plaintext. Word documents are more reliable and can be manipulated more easily via Paragraph and Run objects [2]. The RAG engineers I have talked to have a slight preference for Word format (but it depends on which parsing tools they have at hand, how they use extra information beyond the plaintext, etc.). Consistency in format, language, and style across all documentation can also significantly reduce the pre-processing workload. Tables, figures, and equations are efficient for human readers but not for computers. Tips: use them only when necessary and write detailed captions. Avoid long multi-page tables. Putting equations into images may also facilitate both their presentation and their understanding.
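As a concrete illustration of why Word format is easier to handle, here is a minimal sketch of extracting plaintext from a Word document with the python-docx library discussed in [2]; the file name is a placeholder.

```python
# Minimal sketch: extract plaintext from a Word document via python-docx [2].
# Assumption: "user_guide.docx" is a placeholder file name (pip install python-docx).
import docx

doc = docx.Document("user_guide.docx")

# Each Paragraph object corresponds to one paragraph of the document;
# its Run objects carry styling (font, bold, italic, ...) that plaintext drops.
for paragraph in doc.paragraphs:
    text = paragraph.text.strip()
    if text:  # skip empty paragraphs
        print(text)
```

The equivalent extraction from a PDF usually needs a dedicated parser and layout heuristics, which is where consistent formatting across documents pays off.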
· Step 3: Chunking is necessary because: (1) most LLMs and text embedding models have a maximum input length; (2) relatively small chunks can reduce noise and increase relevancy (though not too small either); and (3) smaller chunks make inference with LLMs faster and cheaper. One popular chunking strategy uses a fixed word/token size, which is far from an ideal solution. However, documentation writers can contribute by writing paragraphs that work well as chunks, helping the retrieval system pinpoint relevant contextual passages for response generation: each paragraph should not be too long (many text embedding models have a maximum input size of 512 tokens), should have one main focus answering one main question, and should be self-explanatory (less dependent on other paragraphs).
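As a rough illustration of the paragraph-as-chunk idea, the sketch below splits on blank lines and only falls back to a fixed-size split for oversized paragraphs; counting whitespace-separated words as tokens is a simplifying assumption (a real system would use the embedding model's tokenizer).

```python
# Sketch: paragraph-based chunking with a fixed-size fallback for long paragraphs.
# Assumption: words approximate tokens; 512 mirrors a common embedding input limit.
MAX_TOKENS = 512

def chunk_by_paragraph(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    chunks = []
    for paragraph in text.split("\n\n"):  # paragraphs separated by blank lines
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= max_tokens:
            chunks.append(paragraph.strip())
        else:  # oversized paragraph: split into fixed-size pieces
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks
```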
· Step 4: Embedding: despite the significant improvement of text embedding models (on the MTEB benchmark [3]) in pursuit of universal text embeddings across different text lengths, tasks, and languages, they are not universal yet [4]. However, recent advances in instruction-based embeddings have shown a positive sign towards universality. The concept of an instruction is simple: add a task description to the text you want to embed in order to get a task-specific text embedding. In this way, the same text can have different embeddings for different tasks/contexts. How can documentation writers help? The answer is: put everything into context. One simple tip would be adding a one-sentence context description to each document, chapter, or even section (if necessary), so that it can be used to produce context-specific or task-specific text embeddings.
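A minimal sketch of this idea, assuming a sentence-transformers style model (the model name, context sentence and chunk are placeholders; instruction-tuned embedding models each document their own preferred prompt format):

```python
# Sketch: prepend an author-written one-sentence context to a chunk before embedding,
# so the same text yields a context-specific embedding.
# Assumptions: sentence-transformers is installed; the model name is a placeholder.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

context = "This chapter describes the maintenance procedure for pump model X."  # one-sentence context
chunk = "Replace the filter every six months."

contextual_embedding = model.encode(f"{context} {chunk}")  # context-aware embedding
plain_embedding = model.encode(chunk)                      # same chunk without context
```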
· Step A, B and C: the ideal user query should be clear and come with relevant context for domain-specific RAGs. When this is not the case, query modification is needed to reduce ambiguity and enhance clarity. As the query is often embedded within its context, this highlights again the importance of putting documents and chunks into context.
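For illustration, query modification is often just a rewriting prompt sent to an LLM; the template below is a sketch only (the domain and question are invented, and the LLM call itself is omitted):

```python
# Sketch of a query-rewriting prompt; the actual LLM call is left out.
REWRITE_TEMPLATE = """You assist a retrieval system over {domain} documentation.
Rewrite the user question so it is unambiguous and self-contained:
expand abbreviations, add the implied product or version, keep it to one sentence.

User question: {question}
Rewritten question:"""

prompt = REWRITE_TEMPLATE.format(
    domain="airport ground operations",                  # placeholder domain
    question="How do I reset it after the NCE update?",  # ambiguous placeholder question
)
# `prompt` would then be sent to the LLM of choice to obtain a clarified query.
```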
· Step D and E: Information Retrieval (E) from the vector database (D) is the key step in RAG, because it aims to extract the chunks relevant to answering the user query. One of the main challenges here is that, unlike sentence-to-sentence similarity, the similarity between a query (e.g. a short question) and a chunk (e.g. a long paragraph) is asymmetric, which highlights again the importance of using task-specific context descriptions to provide directions on which documents, or which parts of a document, are relevant to the query. For user queries that need multiple chunks to answer, reducing redundancy in the documents is important: redundant chunks may block the selection of other necessary chunks that provide complementary information for the query. Another domain-specific challenge in many industries is the overuse of acronyms and abbreviations. For example, NCE can stand for the city of Nice, Nice Côte d'Azur Airport, New Civil Engineer, National Counselor Examination, New Chemical Entity, Newark College of Engineering, Nigerian Certificate in Education, Normal Curve Equivalent and more. The simple tip is: don't use them, or add a document-specific glossary to put acronyms and abbreviations into context. The authors of [5] also find that related chunks, which are semantically or contextually linked to the topic without directly answering the query, are very harmful for RAGs. To put it in simpler terms, even though these chunks don't provide a direct response to the query, their contextual or semantic relevance to the subject matter can confuse the RAG system, leading to less than optimal performance. One potential solution in documentation writing is to avoid high duplication of vocabulary and syntax across paragraphs, chapters and documents (avoid the copy-paste-modify style of writing).
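As a small sketch of how a document-specific glossary can be used downstream, acronyms can be expanded before chunks are embedded and indexed (the glossary entries here are invented for illustration):

```python
# Sketch: expand acronyms from a per-document glossary before embedding/indexing.
# The glossary content is illustrative; each document would ship its own.
import re

GLOSSARY = {
    "NCE": "Nice Côte d'Azur Airport (NCE)",
    "RAG": "Retrieval-Augmented Generation (RAG)",
}

def expand_acronyms(chunk: str, glossary: dict[str, str]) -> str:
    for acronym, expansion in glossary.items():
        # Replace only whole-word occurrences of the acronym.
        chunk = re.sub(rf"\b{re.escape(acronym)}\b", expansion, chunk)
    return chunk

print(expand_acronyms("The NCE lounge is closed.", GLOSSARY))
# -> The Nice Côte d'Azur Airport (NCE) lounge is closed.
```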
· Step F: Re-ranking refers to re-evaluating the relevancy of the chunks selected based on embedding similarity, as rerankers are often more accurate (but more costly) than embedding models. Recent studies such as [5] also show that relevant information should be placed near the query in the prompt used in Steps G and H; otherwise, the model seriously struggles to attend to it. Most of the previously mentioned tips, including mono-focus, self-explanatory, low-ambiguity paragraphs/chunks, less redundancy, and better context, would help this step or even allow it to be skipped.
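For concreteness, a typical re-ranking pass with a cross-encoder could look like the sketch below; the model name is one commonly used public checkpoint, and the query and chunks are placeholders standing in for the output of Step E.

```python
# Sketch: re-rank retrieved chunks with a cross-encoder, which scores (query, chunk)
# pairs jointly and is usually more accurate but slower than bi-encoder similarity.
# Assumptions: sentence-transformers is installed; query and chunks are placeholders.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How often should the filter be replaced?"
retrieved_chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Replace the filter every six months.",
]

scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked_chunks = [c for _, c in sorted(zip(scores, retrieved_chunks), reverse=True)]
```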
· Step G, H and I: After getting the re-ranked relevant chunks, the next steps are to design a prompt using the query and the extracted relevant chunks so that the LLM can provide an accurate answer. Recent work [6] finds that there is a tug-of-war between RAG and the LLM's internal prior. A stricter prompt can be used to force the use of the retrieved chunks (which means that the documents need to be more precise and accurate than the LLM's knowledge). However, the authors of [6] also find that the more the chunk information deviates from the LLM's prior, the less likely the LLM is to prefer it. Hence, when updating information that differs from previous versions or is counter-intuitive or against common sense, make sure to be explicit (e.g. state that the previous information is outdated and that the new information is more accurate).
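A stricter prompt of the kind mentioned above can be sketched as follows; the wording is illustrative rather than a template prescribed by [6], and the chunk and question are placeholders carried over from the earlier sketches.

```python
# Sketch of a "strict" RAG prompt that pushes the LLM to rely on the retrieved
# chunks rather than its internal prior. The wording is illustrative only.
STRICT_PROMPT = """Answer the question using ONLY the context below.
If the context contradicts what you believe, trust the context: it reflects the
latest official documentation. If the answer is not in the context, say so.

Context:
{context}

Question: {question}
Answer:"""

reranked_chunks = [  # placeholder chunks, e.g. the output of Step F
    "Replace the filter every six months (the previously documented 12-month interval is outdated)."
]
user_query = "How often should the filter be replaced?"

prompt = STRICT_PROMPT.format(context="\n\n".join(reranked_chunks), question=user_query)
# `prompt` is then sent to the LLM in Steps G/H to generate the answer (Step I).
```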
In summary, effective documentation writing can help improve RAG systems in a multitude of ways, despite the fast progress on RAGs, LLMs and text embeddings. Some of the key tips include:
1. Put your documents and chapters into context with an explicitly written one-sentence description (task-specific or domain-specific).
2. Use paragraphs as chunks: each paragraph should not be too long, should have one main focus answering one main question, and should be self-explanatory (less dependent on other paragraphs).
3. Reduce redundancy. Avoid similar-looking paragraphs for different topics/concepts.
4. Avoid acronyms and abbreviations, or add a document-specific glossary to put them into context.
5. When updating certain information that is different from previous versions or is counter-intuitive, make sure to be explicit.
6. Be consistent.
Do you have any other tips or observations? Feel free to share and comment!
Note 1: these tips are mostly high-level; it is necessary to discuss with documentation writers how to implement them in practice.
Note 2: Being easily readable by a human does not mean that a document is ideal for the computer, RAG and LLMs.
Note 3: this blog is the result of numerous collaborative discussions with various colleagues and experts in the field. I extend my heartfelt gratitude to all those who generously contributed their insights and feedback.
[1] Huang, Yizheng, and Jimmy Huang. “A Survey on Retrieval-Augmented Text Generation for Large Language Models.” arXiv preprint arXiv:2404.10981 (2024).
[2] https://automatetheboringstuff.com/chapter13/
[3] https://huggingface.co/spaces/mteb/leaderboard
[4] Cao, Hongliu. “Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark.”
[5] Cuconasu, Florin, et al. “The Power of Noise: Redefining Retrieval for RAG Systems.” arXiv preprint arXiv:2401.14887 (2024).
[6] Wu, Kevin, Eric Wu, and James Zou. “How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior.” arXiv preprint arXiv:2404.10198 (2024).