Natural Language Processing·Author: Trensee Editorial Team·Updated: 2026-02-26

Prompt Engineering and Data Preprocessing Techniques for Doubling RAG Performance

Covers document chunking strategies and retrieval-context optimization prompts, the key factors that determine answer accuracy in Retrieval-Augmented Generation (RAG), along with a practical case study.

AI-assisted draft · Editorially reviewed

This blog content may use AI tools for drafting and structuring, and is published after editorial review by the Trensee Editorial Team.

Starting Off: "I put in all the documents, so why is the AI talking nonsense?"

The first thing many teams do when building a Retrieval-Augmented Generation (RAG) system is pour PDF or wiki documents into a vector DB. But the results are often disappointing. Even though the information is clearly in the document, it replies "related information cannot be found," or it creates false information by referencing the wrong paragraph.

This happens not because the AI model lacks intelligence, but because the data wasn't fed to it in a form it can read easily. Some 80% of RAG performance is determined not by the model, but by the quality of the retrieved context and the prompt that presents that context to the model. This guide explains data preprocessing routines and prompt optimization techniques you can apply immediately in the field, divided into four steps.

Why RAG Performance is Low: 3 Failure Patterns

1. Mindless Chunking (Fixed-size Chunking)

If you mechanically cut documents every 500 characters, sentences are cut in half or the core context spills into the next chunk. Retrieval then brings back only half of the information, and answer quality plummets.

2. Using Raw Text Mixed with Noise

If PDF page numbers, headers, footers, and broken table text are included as-is, they act as noise during vector similarity calculations. The AI mistakes this garbage data for important information and mixes it into the answer.

3. Hands-off Prompts ("Answer referring to the following")

If the retrieved content is not directly related to the question and you don't give the AI explicit permission to refuse, it will try to fabricate an answer somehow (hallucination). A prompt without clear guidelines is a major contributor to false answers.

Practical Checklist: Items to Fix Before Introduction

  • Standardize Document Format: Default to Markdown (its structure is easy to parse).
  • Chunk Size: Experiment between 300-800 tokens, considering the length of domain terms.
  • Overlap Ratio: Keep 10-20% overlap between adjacent chunks to maintain context.
  • Embedding Model Selection: For multilingual support, prioritize models like BGE-M3.
  • Evaluation Dataset: First secure about 50 question-answer pairs with unambiguous correct answers.
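The overlap setting above can be sketched in a few lines of Python. This is a minimal illustration: character counts stand in for tokens (in practice, count tokens with your embedding model's tokenizer), and the function name is our own.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into fixed-size windows that overlap, so context
    straddling a boundary appears in both neighboring chunks.

    Sizes are in characters here for simplicity; in real pipelines
    you would measure chunk_size in tokens.
    """
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# 1200 characters, 500-char chunks, 20% overlap -> 3 chunks,
# each consecutive pair sharing 100 characters.
chunks = chunk_with_overlap("A" * 1200, chunk_size=500, overlap_ratio=0.2)
```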

Step 1: Apply Semantic Chunking

Divide documents into semantic units rather than by raw character count. The baseline approach is to split on Markdown headers (#, ##), or to use a recursive text splitter (such as RecursiveCharacterTextSplitter) that falls back from paragraphs to sentences to line breaks.

Deliverable: High-quality text chunks with uninterrupted context.
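As a minimal stdlib-only stand-in for library splitters like RecursiveCharacterTextSplitter, a header-based Markdown splitter can look like this (the function name and chunk schema are illustrative):

```python
import re

def split_by_markdown_headers(doc: str) -> list[dict]:
    """Split a Markdown document into chunks at # / ## headers,
    keeping each header with its own section so context stays intact."""
    chunks, current_header, buffer = [], "", []
    for line in doc.splitlines():
        if re.match(r"^#{1,2} ", line):
            if buffer:
                chunks.append({"header": current_header, "text": "\n".join(buffer).strip()})
            current_header, buffer = line.lstrip("# ").strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append({"header": current_header, "text": "\n".join(buffer).strip()})
    return chunks

doc = "# Policy\nIntro text.\n## Refunds\nRefund rules here."
chunks = split_by_markdown_headers(doc)
```

A recursive splitter would additionally re-split any section longer than the chunk-size budget; header-based splitting alone is enough to show the idea.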

Step 2: Meta-data Enhancement and Filtering

Attach metadata to each chunk: document title, creation date, category, and, if useful, a one-sentence summary of the chunk. Combining similarity search with category filtering raises accuracy dramatically.

Deliverable: Vector data with filterable attributes.
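A sketch of what that record shape can look like before embedding and upserting; the field names are illustrative and should match your vector DB's schema:

```python
from datetime import date

def build_record(chunk_text: str, title: str, category: str, summary: str) -> dict:
    """Wrap a chunk with filterable metadata before embedding/upserting."""
    return {
        "text": chunk_text,
        "metadata": {
            "title": title,
            "category": category,
            "created": date.today().isoformat(),
            "summary": summary,
        },
    }

def filter_by_category(records: list[dict], category: str) -> list[dict]:
    """Pre-filter candidates by category before (or alongside) similarity search."""
    return [r for r in records if r["metadata"]["category"] == category]

records = [
    build_record("Points can be used at partner stores.", "Welfare Guide", "Welfare", "Where points are spent"),
    build_record("Severance is paid within 14 days.", "Severance Rules", "Payroll", "Severance timing"),
]
welfare_only = filter_by_category(records, "Welfare")
```

Most vector databases accept such a metadata payload per vector and can apply the category filter server-side during the similarity query.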

Step 3: Design Context Reconstruction Prompts

Don't just concatenate the retrieved chunks; instruct the AI firmly to "ignore content unrelated to the question," and require it to state the source of each answer clearly.

Deliverable: High-reliability prompt template with clearly marked grounds for answers.
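One possible shape for such a template, assuming numbered sources and an exact-phrase refusal (the wording is a sketch, not a tested production prompt):

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: numbered sources, explicit permission
    to refuse, and a requirement to cite source numbers."""
    context = "\n\n".join(
        f"[Source {i + 1}] {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the sources below.\n"
        "Ignore any source that is unrelated to the question.\n"
        "Cite the source number, e.g. (Source 2), for every claim.\n"
        "If the sources do not contain the answer, reply exactly: "
        "\"I could not find this in the provided documents.\"\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Where can I use welfare points?",
    [{"text": "Points work at partner stores."}],
)
```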

Step 4: Add Re-ranking Process

After picking 10-20 candidates through vector search first, pass them through a "Re-ranker" model and put only the 3-5 most closely related to the question into the final prompt. This single step alone improves answer accuracy by more than 30%.

Evaluation Criteria: Compare retrieval quality (e.g., Recall@K) and answer consistency before and after re-ranking.
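The two-stage flow can be sketched as follows. Token overlap stands in here for a real cross-encoder scorer (such as a BGE-Reranker model); assume `candidates` are the 10-20 chunks already returned by vector search.

```python
def rerank(question: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Second-stage re-ranking sketch: re-score the vector-search
    candidates against the question and keep only the top_n.
    A naive token-overlap score substitutes for a cross-encoder."""
    q_tokens = set(question.lower().split())

    def score(chunk: str) -> int:
        # Count question words appearing in the chunk (placeholder scorer).
        return len(q_tokens & set(chunk.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "Welfare points can be spent at partner stores and the cafeteria.",
    "Severance pay is calculated from average monthly salary.",
    "The office closes at 7 pm on Fridays.",
]
top = rerank("where can I spend welfare points", candidates, top_n=1)
```

Swapping the placeholder `score` for a cross-encoder call keeps the surrounding logic unchanged: retrieve wide, re-score, keep narrow.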

Editor's Perspective: "Data cleaning is never a one-time job"

The biggest mistake seen in the field is treating RAG as a system that is finished once built. Users' questions keep changing, and documents get updated. Teams that operate RAG successfully analyze their list of failed questions every week to see why each one failed (no data, retrieval missed, or the model didn't understand) and adjust the preprocessing logic accordingly. A boring data-refinement routine produces better AI than technical flamboyance.

Practical Case: Improving Internal Regulation Chatbot

Situation

When asked "Where can I use my welfare points?", chunks from the salary and severance-pay regulations were mixed into the context, making the answer confusing.

Application Method

  1. Converted all PDFs to Markdown to maintain table structures.
  2. Forced "Category: Welfare" metadata during chunking.
  3. Added a guardrail to the prompt: "If the answer is not in the provided context, reply that you don't know and direct the user to the HR team extension (1234)."
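Pulled together, the category filter and the guardrail amount to a small pipeline. This is a hypothetical reconstruction of the approach, not the team's actual code; the record schema and fallback wording are illustrative.

```python
def answer_welfare_question(question: str, records: list[dict]) -> str:
    """Restrict retrieval to the 'Welfare' category, then wrap the
    remaining context in a prompt with an explicit refusal guardrail."""
    welfare = [r for r in records if r.get("category") == "Welfare"]
    if not welfare:
        # Nothing relevant retrieved: refuse instead of guessing.
        return "I don't know. Please contact the HR team at extension 1234."
    context = "\n".join(r["text"] for r in welfare)
    return (
        "Answer using only the context below. If the answer is not "
        "present, say you don't know and direct the user to HR "
        "extension 1234.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

records = [
    {"category": "Welfare", "text": "Points can be used at partner stores."},
    {"category": "Payroll", "text": "Severance is paid within 14 days."},
]
prompt = answer_welfare_question("Where can I use welfare points?", records)
```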

Result

Answer accuracy (correct answer matching rate) rose from 65% to 92%. In particular, the phenomenon of mixing in wrong information disappeared.

Lesson

Metadata that narrows the search scope, plus an explicit "say you don't know when you don't know" instruction, were decisive in earning user trust.

Executive Execution Summary

| Item | Execution Criteria |
| --- | --- |
| Preprocessing | Recursive splitting based on Markdown structure |
| Metadata | Include title, category, and a chunk summary |
| Prompt Instructions | "Ignore unrelated info" and "state the source document number" are mandatory |
| Search Optimization | Hybrid search (keyword + vector) combined with re-ranking |
| Quality Maintenance | Analyze 20 failed questions weekly and perform data curation |

Frequently Asked Questions (FAQ)

Q1. Won''t information loss be high if the chunk size is large?

If the chunk size is too large, each chunk carries a lot of noise unrelated to the question, causing the model to lose focus. Prompts also get longer and more expensive. Solve information loss with overlap and semantic chunking, not bigger chunks.

Q2. What is the most important thing to watch out for in RAG for non-English languages?

Due to linguistic characteristics like word endings and particles, embedding search based on meaning is advantageous over simple keyword search. However, since vector search might miss proper nouns or specialized terms (e.g., product names), "Hybrid Search" mixing keyword and vector search is recommended.
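One common way to combine the two ranked lists is reciprocal rank fusion (RRF), which merges rankings without having to normalize their score scales. The document IDs below are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists with RRF:
    score(doc) = sum over lists of 1 / (k + rank_in_that_list)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_b", "doc_a"]            # exact product-name match ranked first
vector_hits = ["doc_a", "doc_c", "doc_b"]    # semantic match ranked first
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

A document that ranks reasonably well in both lists (here `doc_a`) beats one that tops only a single list, which is exactly the behavior you want when keyword and vector search disagree.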

Q3. Do I have to use a paid model for the re-ranking model?

No. There are many excellent open-source models like BGE-Reranker. If you run one on a local server, you can greatly improve performance without the cost burden.

Data Basis

  • Practical Standard: Analysis of cases from 5 teams where answer distortion (hallucination) frequently occurred during internal knowledge base construction projects.
  • Evaluation Metrics: Retrieval recall against the question set and faithfulness of answer-to-evidence matching.
  • Validation Principle: Selection of techniques that work commonly across Llama-3, GPT-4o, Claude 3.5, etc., without being dependent on a specific model.

