2025-10-02

What is RAG?

Definition

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with text generation to improve the quality of answers from Large Language Models (LLMs).

Instead of relying only on knowledge acquired during training, RAG lets the LLM access and use information from external data sources to give more accurate and up-to-date answers.

Problems RAG solves

1. Hallucination

LLMs sometimes produce information that is wrong but sounds plausible.

Example: Ask ChatGPT about a new product from your company → it may "make up" details because it has no data about it.

2. Knowledge Cutoff

LLMs only know information up to the time they were trained.

Example: GPT-4 Turbo launched with a knowledge cutoff of April 2023 → it does not know about events after that.

3. Domain-Specific Knowledge

LLMs have no in-depth knowledge of an organization's private data.

Example: They cannot answer questions about internal policies or the company's technical documentation.

4. Privacy and Security

Fine-tuning an LLM on private data is expensive, and the fine-tuned model can leak that data in its outputs.

How RAG works

Basic RAG architecture:

User Query
    ↓
1. Query Processing
    ↓
2. Embedding Generation (Vector)
    ↓
3. Similarity Search in Vector DB
    ↓
4. Retrieve Relevant Documents
    ↓
5. Context Construction
    ↓
6. Prompt Engineering
    ↓
7. LLM Generation
    ↓
Response to User
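
As a rough illustration of the flow above, here is a minimal pure-Python sketch. It is not a real implementation: a toy bag-of-words vector stands in for the embedding model, a list stands in for the vector DB, and the sketch stops at the assembled prompt instead of calling an LLM. The helper names (embed, cosine, retrieve, build_prompt) and the sample documents are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Steps 1-4: embed the query, score every document, keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Steps 5-6: put the retrieved text and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Employees receive 15 days of paid leave per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]
query = "How many days of paid leave do employees get?"
top_docs = retrieve(query, docs, k=1)
prompt = build_prompt(query, top_docs)  # step 7 would send this prompt to an LLM
print(prompt)
```

In a real system, embed would call an embedding model and retrieve would query a vector database, but the order of the steps stays the same.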

Step-by-step details:

Step 1: Document Ingestion (preparing the data)

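# Assumed packages: langchain-community, langchain-text-splitters,
# langchain-openai, langchain-chroma
from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load documents (plain-text files here; use another loader for PDF, DOCX, ...)
loader = DirectoryLoader("./company-docs/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

# 2. Split into chunks (sizes are in characters; the values are illustrative)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# 3. Pick the embedding model (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# 4. Embed the chunks and store them in the vector database
#    (from_documents calls the embedding model, so no separate embed step is needed)
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)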

Step 2: Query Processing (handling the question)

# User asks a question
user_query = "What is the company's leave policy?"

# Generate embedding for query
query_embedding = embeddings.embed_query(user_query)

# Find similar documents. similarity_search() expects the query text,
# so a precomputed vector goes through similarity_search_by_vector().
relevant_docs = vector_store.similarity_search_by_vector(
    query_embedding,
    k=3  # Top 3 most relevant
)

Step 3: Generation (producing the answer)

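from langchain_openai import ChatOpenAI

# Any chat model works here; the model name is only an example
llm = ChatOpenAI(model="gpt-4o-mini")

# Join the retrieved chunks as plain text (not the repr of the Document objects)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# Construct prompt with context
prompt = f"""
Based on the following information, answer the user's question:

Context:
{context}

Question: {user_query}

Answer:
"""

# Generate response
response = llm.invoke(prompt)
print(response.content)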

Main components of a RAG system

1. Document Loader

Loads data from many sources:

PDF, DOCX, TXT files
Web pages (web scraping)
Databases (SQL, MongoDB)
APIs
Google Docs, Notion

Tools: LangChain, LlamaIndex, Unstructured

2. Text Splitter

Splits documents into smaller chunks:

Character-based: Splits by character count
Token-based: Splits by tokens (GPT tokenizer)
Semantic-based: Splits by meaning
Recursive: Splits by structure (paragraphs → sentences)

Best practices:

Chunk size: 500-1000 tokens
Overlap: 50-100 tokens
Preserve context and structure
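
To make chunk size and overlap concrete, here is a small sketch of a sliding-window splitter. It is an illustration only: whitespace-separated words stand in for real tokens, and split_with_overlap is a made-up helper name, not a library function.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Whitespace-separated words stand in for tokens in this sketch.
words = "RAG splits long documents into overlapping chunks before embedding them".split()
chunks = split_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is still partly visible on both sides.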

3. Embedding Model

Converts text into vectors:

OpenAI Embeddings: text-embedding-3-small, text-embedding-3-large
Open-source: sentence-transformers models (e.g. all-MiniLM-L6-v2)
Multilingual: multilingual-e5-large

Dimensions:

Small: 384-768 dim
Large: 1536-3072 dim
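
One practical consequence of the dimension count is storage. The sketch below does the arithmetic for the raw vectors only, assuming float32 values (4 bytes each) and ignoring index overhead and metadata; index_size_bytes is a made-up helper name.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dim * bytes_per_value

for dim in (384, 1536, 3072):
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>4} dims -> {size_gb:.2f} GB per million vectors")
```

Under these assumptions, one million 384-dim vectors take about 1.5 GB, while one million 3072-dim vectors take about 12.3 GB, so larger embeddings should earn their cost in retrieval quality.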

4. Vector Database

Stores and searches embeddings:

Cloud: Pinecone, Weaviate Cloud
Self-hosted: Chroma, Qdrant, Milvus
Hybrid: Elasticsearch, OpenSearch

Required features:

Fast similarity search (ANN/KNN)
Metadata filtering
Scalability
Multi-tenancy

5. LLM

Generate final response:

Cloud: GPT-4, Claude, Gemini
Self-hosted: Llama 3, Mistral, Mixtral
Specialized: Code Llama, MedLlama

6. Orchestration Framework

Manages the whole pipeline:

LangChain: The most popular, with many integrations
LlamaIndex: Focused on RAG, optimized for retrieval
Haystack: Enterprise-grade
Custom: Build your own with Python

Advanced techniques

1. Hybrid Search

Combines vector search and keyword search:

# Vector search (semantic)
vector_results = vector_db.similarity_search(query)

# Keyword search (BM25); bm25_search is a placeholder for your keyword index
keyword_results = bm25_search(query)

# Combine with weighted scoring; rerank is a placeholder for your fusion function
final_results = rerank(vector_results, keyword_results)
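
One possible way to do the weighted combination is sketched below. It is only one option among several: scores from each retriever are min-max normalized, then mixed with a weight alpha. The helper names (min_max, hybrid_fuse), the document ids, and the scores are all invented for this example.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_fuse(vector_scores: dict[str, float],
                keyword_scores: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    """Return doc ids sorted by alpha * vector + (1 - alpha) * keyword score."""
    v, kw = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in v.keys() | kw.keys()
    }
    # Highest fused score first; ties are broken by doc id for stable output.
    return sorted(fused, key=lambda d: (-fused[d], d))

vector_scores = {"doc_a": 0.90, "doc_b": 0.80, "doc_c": 0.10}  # cosine similarities
keyword_scores = {"doc_b": 12.0, "doc_c": 9.0, "doc_d": 3.0}   # BM25-style scores
print(hybrid_fuse(vector_scores, keyword_scores, alpha=0.5))
```

Here doc_b ranks first because both retrievers score it highly, even though doc_a has the best vector score on its own. Raising alpha shifts the balance toward semantic similarity.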

2. Re-ranking

Re-orders the results to improve precision:

Cross-encoder models: ms-marco-MiniLM
LLM-based reranking: Use an LLM to score relevance
Diversity ranking: Avoid duplicate content
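
For diversity ranking, one well-known option is Maximal Marginal Relevance (MMR), which the text above does not name but which fits the idea of avoiding duplicates. The sketch below is a plain greedy MMR over tiny hand-made vectors; the function names, document ids, and vectors are invented for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec: list[float], doc_vecs: dict[str, list[float]],
        k: int, lam: float = 0.5) -> list[str]:
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    selected: list[str] = []
    candidates = set(doc_vecs)

    def score(doc_id: str) -> float:
        relevance = cosine(query_vec, doc_vecs[doc_id])
        redundancy = max(
            (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=score)  # sorted() keeps ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

doc_vecs = {
    "leave_policy": [1.0, 0.9, 0.0],
    "leave_policy_copy": [1.0, 0.9, 0.0],  # exact duplicate of the first document
    "holiday_calendar": [0.0, 1.0, 0.2],
}
query_vec = [1.0, 1.0, 0.0]
print(mmr(query_vec, doc_vecs, k=2))
```

With lam=0.5 the duplicate is skipped in favor of holiday_calendar; with lam=1.0 the function degrades to plain relevance ranking and returns both copies.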

3. Query Enhancement

Improves the query so retrieval works better:

Query expansion: Add synonyms and related terms
Hypothetical Document Embeddings (HyDE): The LLM writes a hypothetical answer, and the search uses that answer's embedding
Multi-query: Generate several versions of the query
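
The multi-query idea can be sketched as follows, assuming the query variants already exist (in practice an LLM would write them). The retriever is passed in as a function; multi_query_retrieve, fake_index, and the file benefits_faq.md are hypothetical names used only for this example.

```python
from typing import Callable

def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> list[str]:
    """Run every query variant and merge the ranked lists without duplicates."""
    ranked_lists = [retrieve(q) for q in queries]
    merged: list[str] = []
    seen: set[str] = set()
    # Interleave by rank so each variant's best hit comes before any second hit.
    for rank in range(max((len(r) for r in ranked_lists), default=0)):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:k]

# Stand-in retriever: a real system would query the vector store here.
fake_index = {
    "leave policy": ["hr_policy.pdf", "employee_handbook.pdf"],
    "vacation days": ["employee_handbook.pdf", "benefits_faq.md"],
    "paid time off": ["benefits_faq.md", "hr_policy.pdf"],
}
variants = ["leave policy", "vacation days", "paid time off"]  # e.g. written by an LLM
merged_docs = multi_query_retrieve(variants, lambda q: fake_index.get(q, []), k=3)
print(merged_docs)
```

The merged list covers documents that any single phrasing of the question would have missed.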

4. Metadata Filtering

Filters by metadata before searching:

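# Filter by source, date, author, etc.
# Filter syntax is store-specific; some stores need an explicit AND operator
# for multiple conditions (e.g. Chroma: {"$and": [{...}, {...}]}).
results = vector_db.similarity_search(
    query,
    filter={"source": "hr_policy", "year": 2024}
)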
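
What "filter first, then search" means can be shown without any vector database. In this sketch the relevance score is a toy word-overlap count, and matches and filtered_search are made-up helper names; a real store applies the filter inside its index.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """True when every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in flt.items())

def filtered_search(query_terms: set[str], docs: list[dict], flt: dict, k: int = 3) -> list[dict]:
    """Drop non-matching documents first, then rank what is left."""
    candidates = [d for d in docs if matches(d["metadata"], flt)]

    def overlap(doc: dict) -> int:
        # Toy relevance: number of query terms that occur in the text.
        return len(query_terms & set(doc["text"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:k]

docs = [
    {"text": "Annual leave is 15 days", "metadata": {"source": "hr_policy", "year": 2024}},
    {"text": "Annual leave is 12 days", "metadata": {"source": "hr_policy", "year": 2022}},
    {"text": "Leave requests go through the portal", "metadata": {"source": "it_wiki", "year": 2024}},
]
results = filtered_search({"annual", "leave"}, docs, {"source": "hr_policy", "year": 2024})
print([d["text"] for d in results])
```

The outdated 2022 policy would score just as well on text alone; only the metadata filter keeps it out of the results.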

5. Citation and Source Tracking

Tracks sources for transparency:

response = {
    "answer": "The company allows 15 days of leave...",
    "sources": [
        {"file": "hr_policy.pdf", "page": 5},
        {"file": "employee_handbook.pdf", "page": 12}
    ],
    "confidence": 0.92
}
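
A response like this could be assembled from the metadata of the retrieved chunks, roughly as below. This is a sketch under assumptions: each chunk is a dict with "file" and "page" in its metadata, and build_response is a made-up helper. The confidence field is left out because the text does not say how it would be computed.

```python
def build_response(answer: str, retrieved: list[dict]) -> dict:
    """Attach a de-duplicated source list to the generated answer."""
    sources = []
    for doc in retrieved:
        source = {"file": doc["metadata"]["file"], "page": doc["metadata"]["page"]}
        if source not in sources:  # two chunks from the same page cite it once
            sources.append(source)
    return {"answer": answer, "sources": sources}

retrieved = [
    {"text": "chunk 1", "metadata": {"file": "hr_policy.pdf", "page": 5}},
    {"text": "chunk 2", "metadata": {"file": "hr_policy.pdf", "page": 5}},
    {"text": "chunk 3", "metadata": {"file": "employee_handbook.pdf", "page": 12}},
]
response = build_response("The company allows 15 days of leave...", retrieved)
print(response["sources"])
```

This only works if file and page were stored as metadata at ingestion time, so source tracking has to be planned when the documents are chunked.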

Real-world example: RAG for Customer Support

Use case: Customer support chatbot

Data sources:

Product documentation
FAQ database
Previous support tickets
Knowledge base articles

Architecture:

Customer Question
    ↓
Query Processing → Detect intent, extract entities
    ↓
Retrieval → Search relevant articles (top 5)
    ↓
Re-ranking → Score by relevance
    ↓
Context Building → Combine with chat history
    ↓
LLM Generation → Generate helpful response
    ↓
Add Citations → Link to source articles
    ↓
Response to Customer

Benefits:

Instant responses 24/7
Consistent answers
Always up-to-date with latest docs
Reduces support team workload (by an estimated 60-80%, depending on the deployment)

Evaluation and Monitoring

Metrics for measuring RAG performance:

1. Retrieval Metrics

Recall@K: % of all relevant docs that appear in the top K results
Precision@K: % of the top K results that are relevant
MRR (Mean Reciprocal Rank): Average over queries of 1/rank of the first relevant doc
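
These three metrics are simple enough to compute by hand. The sketch below uses their standard definitions; the function names and the document ids are invented for this example.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked output of the retriever
relevant = {"d1", "d2", "d5"}          # ground truth for this query
print(recall_at_k(retrieved, relevant, k=3))
print(precision_at_k(retrieved, relevant, k=3))
print(mean_reciprocal_rank([(retrieved, relevant), (["d5", "d9"], {"d5"})]))
```

In this example only d1 is in the top 3, so recall@3 and precision@3 are both 1/3. The first relevant doc sits at rank 2 for the first query and rank 1 for the second, giving an MRR of 0.75.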

2. Generation Metrics

Answer Relevance: Does the answer address the question?
Faithfulness: Is the answer grounded in the retrieved context?
Hallucination Rate: % of answers containing fabricated information

3. End-to-End Metrics

User Satisfaction: CSAT score
Task Success Rate: Was the user's problem solved?
Response Time: How quickly the answer is returned

Tools for evaluation:

RAGAS: RAG Assessment framework
LangSmith: Monitoring cho LangChain apps
Arize AI: LLM observability
Custom evaluation: Test sets + human review

Challenges and Solutions

Challenge 1: Context Window Limits

Problem: LLMs have a limited context window (typically 4K-128K tokens).

Solutions:

Better chunking strategy
Retrieve fewer but more relevant docs
Summarize before passing content to the LLM
Use long-context models (Gemini 1M, Claude 200K)
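
"Retrieve fewer but more relevant docs" often comes down to packing the best chunks into a fixed token budget. Here is a greedy sketch under assumptions: chunks arrive already sorted by relevance, and a word count stands in for a real tokenizer. pack_context is a made-up helper name.

```python
def pack_context(chunks: list[str], max_tokens: int,
                 count_tokens=lambda s: len(s.split())) -> list[str]:
    """Keep the highest-ranked chunks that fit in the budget, in rank order."""
    packed, used = [], 0
    for chunk in chunks:          # chunks are assumed sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue              # skip what does not fit, try the next chunk
        packed.append(chunk)
        used += cost
    return packed

ranked_chunks = [
    "Annual leave is 15 days per year.",
    "Unused leave can be carried over to the next year with manager approval.",
    "Requests go through the HR portal.",
]
packed = pack_context(ranked_chunks, max_tokens=15)
print(packed)
```

With a 15-word budget the long second chunk is skipped and the shorter third chunk still fits. A production version would count tokens with the model's own tokenizer.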

Challenge 2: Retrieval Quality

Problem: The right documents are not retrieved.

Solutions:

Better embeddings (fine-tuned on domain)
Hybrid search (vector + keyword)
Query enhancement techniques
Better metadata filtering

Challenge 3: Cost

Problem: OpenAI embeddings + GPT-4 are expensive.

Solutions:

Cache embeddings and frequent queries
Use smaller models where possible (GPT-3.5)
Open-source alternatives (Llama 3, Mistral)
Batch processing
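
Caching embeddings can be as simple as memoizing by content hash. The sketch below keeps the cache in memory and uses a fake embedding function in place of a paid API call; EmbeddingCache and fake_embed are hypothetical names, and a real setup would likely persist the cache to disk or a key-value store.

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by content hash so each distinct text is embedded once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the expensive call, e.g. an embeddings API
        self.store: dict[str, list[float]] = {}
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Stand-in for a paid embedding call; returns a fake 2-dimensional vector.
def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
for question in ["What is the leave policy?", "What is the leave policy?", "Who approves leave?"]:
    cache.embed(question)
print(cache.misses)
```

Three requests lead to only two embedding calls here, because the repeated question is served from the cache.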

Challenge 4: Latency

Problem: The RAG pipeline is slow (retrieval + generation).

Solutions:

Fast vector DB (Chroma, Qdrant)
Parallel retrieval
Streaming responses
CDN caching for static content
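
Parallel retrieval is a natural fit for threads, because each search mostly waits on the network. In this sketch the search call is simulated with a short sleep; search_source, parallel_retrieve, and the source names are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search_source(name: str, query: str) -> list[str]:
    """Stand-in for an I/O-bound search call (vector DB, keyword index, ...)."""
    time.sleep(0.1)  # simulated network latency
    return [f"{name}: result for '{query}'"]

def parallel_retrieve(query: str, sources: list[str]) -> list[str]:
    """Query all sources concurrently; results keep the order of `sources`."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        per_source = pool.map(lambda name: search_source(name, query), sources)
        return [hit for hits in per_source for hit in hits]

start = time.perf_counter()
results = parallel_retrieve("leave policy", ["docs", "faq", "tickets"])
elapsed = time.perf_counter() - start
print(results)
print(f"took {elapsed:.2f}s")
```

Because the three simulated searches overlap, the total wait should be close to one search rather than the sum of all three.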

Best Practices

Start Small: Begin with one specific use case
Data Quality: Garbage in, garbage out - clean your data
Chunk Properly: Experiment with chunk size and overlap
Test Retrieval First: Make sure retrieval works before worrying about generation
Evaluate Continuously: Set up an evaluation pipeline from the start
User Feedback Loop: Collect feedback to improve
Version Control: Track changes to prompts, embeddings, models

Popular Tools and Frameworks

End-to-end Solutions:

LangChain + Chroma: Most popular combo
LlamaIndex + Pinecone: Focus on retrieval optimization
Haystack + Weaviate: Enterprise-grade
Vercel AI SDK: For Next.js apps

Pre-built RAG Services:

OpenAI Assistants API: RAG built-in
Azure AI Search: Enterprise RAG
AWS Bedrock Knowledge Bases: AWS-native
Google Vertex AI Search: Google Cloud

DIY Stack:

Document Processing: Unstructured.io
Embeddings: OpenAI / sentence-transformers
Vector DB: Chroma / Qdrant
LLM: GPT-4 / Llama 3
Orchestration: LangChain
Frontend: Streamlit / Next.js
Monitoring: LangSmith / Weights & Biases

Conclusion

RAG is a breakthrough technology that helps LLMs:

Answer more accurately with domain-specific knowledge
Stay up-to-date with new data
Reduce hallucinations
Be more cost-effective than fine-tuning

RAG is a good fit for:

Internal knowledge bases
Customer support chatbots
Document Q&A systems
Research assistants
Code documentation search

Next steps:

Experiment with LangChain + Chroma
Start with simple documents and a simple use case
Iterate based on feedback
Scale once you have traction

RAG is not a silver bullet, but it is a powerful tool for unlocking value from your data with LLMs!