[AI Paper] 📄 RAGAS: Automated Evaluation of Retrieval Augmented Generation

By skycave
January 25, 2026 · 8 Min Read

📋 Meta Information

Field | Details
Authors | Shahul Es, Jithin James (Exploding Gradients), Luis Espinosa-Anke, Steven Schockaert (CardiffNLP, Cardiff University)
Institutions | Exploding Gradients, Cardiff University, AMPLYFI
Venue | EACL 2024 (18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations), pp. 150-158, St. Julians, Malta
Published | September 2023 (arXiv), March 2024 (EACL)
arXiv | 2309.15217
GitHub | explodinggradients/ragas
License | Apache-2.0
PyPI | ragas

🎯 One-Line Summary

RAGAS is a reference-free evaluation framework that uses an LLM, without ground truth, to automatically assess both the retrieval quality and the generation quality of a RAG pipeline. It reaches agreement with human judgment of 95% for Faithfulness, 78% for Answer Relevance, and 70% for Context Relevance.


🔍 Research Background and Motivation

The Rise of RAG Systems

  • RAG (Retrieval Augmented Generation) combines a retrieval module with an LLM-based generation module
  • It draws on an external knowledge base to reduce the risk of LLM hallucination
  • It acts as a natural-language interface between the user and a text database

Problems with Existing RAG Evaluation

  1. Complexity of multi-dimensional evaluation
    • The retrieval system's ability to identify relevant context
    • The LLM's ability to make use of the retrieved context (Faithfulness)
    • The quality of the generated answer
  2. Limits of existing metrics
    • Traditional metrics such as BLEU and ROUGE measure only surface-level text similarity
    • They fail to capture factual accuracy and context relevance
    • Their criteria do not fit the characteristics of RAG
  3. Dependence on human annotation
    • Existing evaluation approaches require large amounts of human-annotated data
    • They are costly and time-consuming
    • They are a poor fit for fast development cycles

Why RAGAS Is Needed

  • The rapid adoption of LLM-based applications demands fast evaluation cycles
  • Reference-free evaluation lowers evaluation cost
  • A systematic framework is needed that can evaluate each component of a RAG pipeline independently

💡 Core Idea

The Concept of Reference-Free Evaluation

The core of RAGAS is to evaluate a RAG system by using an LLM as a judge, without any ground truth.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         RAGAS Evaluation Framework                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Question ────→ [Retriever] ────→ Contexts ────→ [Generator] ────→ Answer  │
│      │                │               │                │                    │
│      ▼                ▼               ▼                ▼                    │
│┌───────────┐    ┌───────────┐  ┌──────────────┐  ┌───────────┐              │
││  Context  │    │  Context  │  │ Faithfulness │  │  Answer   │              │
││ Relevance │    │ Precision │  │              │  │ Relevance │              │
│└───────────┘    │ & Recall  │  └──────────────┘  └───────────┘              │
│                 └───────────┘                                               │
│                                                                             │
│◀─── Retriever evaluation ───▶  ◀──── Generator evaluation ───▶              │
└─────────────────────────────────────────────────────────────────────────────┘

RAG Triad: The Core Evaluation Philosophy

RAGAS builds on the RAG Triad concept first proposed by TruLens:

Metric | Evaluates | Key question | Ground truth required
Faithfulness | Generation | Is the answer grounded in the context? (hallucination prevention) | No
Answer Relevance | Generation | Does the answer appropriately address the question? | No
Context Relevance | Retrieval | Is the retrieved context focused? | No

🏗️ Evaluation Metrics in Detail

1. Faithfulness: 95% Human Agreement

Measures the factual consistency of the generated answer. It checks whether every claim in the answer can be inferred from the retrieved context.

Computation

Step 1: Use an LLM to decompose the answer into individual claims (statements)
Step 2: Use an LLM to verify whether each claim can be inferred from the context
Step 3: Compute the Faithfulness score as the fraction of supported claims

Formula

\text{Faithfulness} = \frac{|V|}{|S|}

Where:
– S = the set of all statements extracted from the answer
– V = the set of statements supported by the context

Example

Question: "Who created Python?"
Context: "Python is a programming language developed by Guido van Rossum in 1991."
Answer: "Python was created by Guido van Rossum in 1989."

Claim analysis:
1. "Python was created by Guido van Rossum" → supported ✓
2. "It was created in 1989" → not supported ✗ (the context says 1991)

Faithfulness = 1/2 = 0.5
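
The final scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the `Verdict` class and `faithfulness_score` function are hypothetical names, and the verdicts are hard-coded from the example above instead of coming from an LLM judge.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One claim extracted from the answer plus the judge's decision."""
    statement: str
    supported: bool  # True if the claim can be inferred from the context


def faithfulness_score(verdicts: list[Verdict]) -> float:
    """Return |V| / |S|: supported claims over all extracted claims."""
    if not verdicts:
        raise ValueError("no statements were extracted from the answer")
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Verdicts hard-coded from the worked example; in practice an LLM judge
# would produce them in Steps 1 and 2.
example = [
    Verdict("Python was created by Guido van Rossum", True),
    Verdict("It was created in 1989", False),
]
print(faithfulness_score(example))  # 0.5
```

Keeping the verdicts as explicit records makes it easy to log which claims were rejected, which is usually more useful for debugging than the score alone.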

Practical Implications

  • If Faithfulness is low: the LLM is producing hallucinations
  • Remedy: add an instruction to the prompt such as "use only the information provided"
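
One way to apply that remedy is to build the generation prompt from a fixed template. The sketch below is only an illustration; `build_grounded_prompt` and the exact wording of the instruction are assumptions, not something prescribed by the paper or the library.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that tells the model to stay within the given context."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the information in the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "Who created Python?",
    ["Python is a programming language developed by Guido van Rossum in 1991."],
)
print(prompt)
```

Numbering the chunks also lets the model cite which chunk supports each statement, if the prompt is later extended to ask for citations.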

2. Answer Relevance: 78% Human Agreement

Measures whether the generated answer responds directly and appropriately to the original question.

Core idea

"If the answer properly addresses the question, it should be possible to reconstruct the original question from the answer alone."

Computation

Step 1: Use an LLM to reverse-generate N question variants from the answer (default N=3)
Step 2: Compute embeddings for the original question and the generated questions
Step 3: Use the mean cosine similarity as the Answer Relevance score

Formula

\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)

Expanded:

\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \, \|E_o\|}

Where:
– E_{g_i}: the embedding of the i-th generated question
– E_o: the embedding of the original question
– N: the number of generated questions (default: 3, controlled by the strictness parameter)

Example

Original Question: "What are the advantages of Python?"
Answer: "Python has concise syntax, is easy to learn, and offers a wide range of libraries."

Generated Questions (reverse-generated):
1. "What are the characteristics of Python?" → similarity: 0.92
2. "Why is Python good?" → similarity: 0.89
3. "Please explain the benefits of Python" → similarity: 0.95

Answer Relevance = (0.92 + 0.89 + 0.95) / 3 = 0.92

Note on the Score Range

  • By the nature of cosine similarity, the score mathematically lies in [-1, 1]
  • In practice it mostly falls in [0, 1], but this is not mathematically guaranteed
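
The averaging step and the range caveat can both be seen in a small sketch. It uses toy 2-D vectors in place of real embeddings; `cosine` and `answer_relevance` are illustrative helper names, and the reverse-generation of questions (Step 1) is assumed to have happened already.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(e_o: list[float], e_g: list[list[float]]) -> float:
    """Mean cosine similarity between the original and generated questions."""
    return sum(cosine(g, e_o) for g in e_g) / len(e_g)


# Toy vectors standing in for embeddings (assumption for the sketch)
e_o = [1.0, 0.0]
e_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(cosine(e_g[2], e_o))         # -1.0: opposite direction, so the score can go negative
print(answer_relevance(e_o, e_g))  # 0.0: the mean of 1.0, 0.0 and -1.0
```

Real sentence embeddings rarely point in opposite directions, which is why observed scores tend to stay in [0, 1] even though nothing in the formula enforces it.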

3. Context Relevance: 70% Human Agreement

Measures how focused and relevant the retrieved context is to the question.

Why It Matters

  • Unnecessarily long context increases LLM cost
  • LLMs make poor use of information located in the middle of the context (the "Lost in the Middle" problem)
  • Focused context leads to better answer quality

Computation

Step 1: Ask the LLM to extract the sentences from the context that are essential for answering the question
Step 2: Compute the ratio of extracted sentences to total sentences

Formula

\text{Context Relevance} = \frac{|S_{\text{extracted}}|}{|S_{\text{total}}|}

Where:
– S_{\text{extracted}} = the sentences essential for answering the question
– S_{\text{total}} = all sentences in the context
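
A minimal sketch of the ratio follows. The regex-based sentence splitter is a simplifying assumption, the function names are hypothetical, and the list of extracted sentences is hard-coded where an LLM judge would normally supply it.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_relevance(context: str, extracted: list[str]) -> float:
    """Return |S_extracted| / |S_total| for one retrieved context."""
    total = split_sentences(context)
    if not total:
        return 0.0
    wanted = set(extracted)
    kept = [s for s in total if s in wanted]
    return len(kept) / len(total)


context = (
    "Python was created by Guido van Rossum. "
    "It was first released in 1991. "
    "The name comes from Monty Python."
)
# Hypothetical judge output for the question "Who created Python?"
extracted = ["Python was created by Guido van Rossum."]

score = context_relevance(context, extracted)
print(round(score, 3))  # 0.333
```

Only extracted sentences that actually occur in the context are counted, so a judge that paraphrases instead of quoting would need fuzzy matching in place of the exact-match check used here.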

Practical Implications

  • If Context Relevance is low: the retrieved chunks contain a lot of unnecessary information
  • Remedy: improve the chunking strategy or the retrieval algorithm

4. Additional Metrics (RAGAS v0.1+)

Metric | Description | Ground truth required | Use case
Context Precision | Whether relevant contexts are ranked near the top | Yes | Evaluating retrieval ranking quality
Context Recall | How much of the needed information was retrieved | Yes | Evaluating retrieval coverage
Answer Correctness | Accuracy of the answer | Yes | Final answer quality
Answer Semantic Similarity | Semantic similarity between the answer and the reference answer | Yes | Evaluating meaning preservation

Context Precision Formula

\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{number of relevant items in the top } K}

Where:
– v_k = 1 if the context at rank k is relevant, 0 otherwise
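
The formula can be sketched as follows, taking a list of 0/1 relevance indicators per rank (1 if the chunk at that rank is relevant). The function name and the hand-written relevance lists are illustrative assumptions; in practice an LLM judge decides relevance.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """relevance[k-1] is 1 if the chunk at rank k is relevant, else 0."""
    hits = 0
    weighted = 0.0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        weighted += (hits / k) * v_k  # Precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


good_ranking = context_precision_at_k([1, 0, 1])  # (1/1 + 2/3) / 2
poor_ranking = context_precision_at_k([0, 1, 1])  # (1/2 + 2/3) / 2
print(round(good_ranking, 3), round(poor_ranking, 3))  # 0.833 0.583
```

Both rankings contain the same two relevant chunks, but the one that places a relevant chunk first scores higher, which is what makes this a ranking metric and not just a precision count.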

Context Recall Formula

\text{Context Recall} = \frac{|\text{GT claims attributable to the context}|}{|\text{total GT claims}|}
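
As a rough sketch, the recall ratio reduces to counting how many ground-truth (GT) claims a judge could attribute to the retrieved context. The claims and verdicts below are made-up illustrations, and `context_recall` is a hypothetical helper name.

```python
def context_recall(attributable: list[bool]) -> float:
    """Fraction of ground-truth claims that can be attributed to the context."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)


# Hypothetical judge verdicts for three ground-truth claims
gt_claims = {
    "Python was created by Guido van Rossum": True,
    "Python was first released in 1991": True,
    "Python is named after Monty Python": False,  # not found in the retrieved context
}
recall = context_recall(list(gt_claims.values()))
print(round(recall, 3))  # 0.667
```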

📊 Experiments and Results

Building the WikiEval Dataset

No public dataset existed for assessing how well RAG metrics correlate with human judgment, so the authors built a new one, WikiEval.

Dataset Characteristics

  • 50 Wikipedia pages covering events from 2022 onward were selected
  • Recently edited pages were preferred
  • Each instance compares two answers or two contexts (pairwise comparison)

Experimental Design

  • Measure whether the answer/context preferred by the model matches the choice of the human evaluators
  • Report the result as accuracy
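
The agreement figure is a plain pairwise accuracy, which can be sketched as below. The scores and human choices are invented for illustration, `pairwise_agreement` is a hypothetical name, and breaking ties in favor of the first candidate is an assumption of this sketch.

```python
def pairwise_agreement(
    metric_scores: list[tuple[float, float]],
    human_choice: list[int],
) -> float:
    """Share of pairs where the higher-scored candidate matches the human pick.

    metric_scores[i] holds the metric's scores for candidates 0 and 1;
    human_choice[i] is the index (0 or 1) the annotators preferred.
    """
    agree = 0
    for (s0, s1), human in zip(metric_scores, human_choice):
        predicted = 0 if s0 >= s1 else 1  # ties go to candidate 0 (assumption)
        agree += int(predicted == human)
    return agree / len(human_choice)


scores = [(0.9, 0.4), (0.3, 0.8), (0.6, 0.7), (0.5, 0.2)]
humans = [0, 1, 0, 0]
accuracy = pairwise_agreement(scores, humans)
print(accuracy)  # 0.75
```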

Baseline Comparison

Method | Description
GPT Score | Ask ChatGPT to assign a 0-10 score directly
RAGAS | A structured evaluation process (claim extraction, verification, and so on)

Key Result: Agreement with Human Evaluation

Metric | Human agreement | Analysis
Faithfulness | 95% | The most reliable metric; very effective at detecting hallucination
Answer Relevance | 78% | Hard to discriminate when the difference between two answers is subtle
Context Relevance | 70% | The hardest dimension to evaluate; selecting key sentences in long contexts is difficult

Main Findings

  1. Faithfulness: RAGAS predictions agree very closely with human judgment (95%)
  2. Answer Relevance: agreement is somewhat lower when the two candidate answers differ only subtly
  3. Context Relevance: ChatGPT struggles to select the key sentences in long contexts

Dataset Release

  • WikiEval: Hugging Face

💪 Strengths and Contributions

Academic Contributions

  1. Establishing the reference-free evaluation paradigm
    • RAG can be evaluated without ground truth (except for Context Recall)
    • Immediately applicable in production settings
  2. Advancing the RAG Triad concept
    • Systematizes and improves on the TruLens concept
    • Influenced follow-up work such as ARES and DeepEval
  3. Releasing the WikiEval benchmark
    • The first public dataset for RAG evaluation research

Practical Value

  1. Usable out of the box
    • Install with pip install ragas and start using it right away
    • Integrates easily with LangChain, LlamaIndex, Haystack, and others
  2. Scalability
    • Suited to large-scale evaluation
    • Enables continuous monitoring
    • Integrates with observability tools such as Langfuse and DataDog
  3. Synthetic data generation
    • Test questions can be generated automatically from documents
    • Supports several question types (simple, reasoning, multi_context)

⚠️ Limitations and Future Work

Current Limitations

  1. Dependence on the LLM evaluator
    • Evaluation quality depends on the capability of the LLM used
    • The LLM's biases can carry over into the evaluation
    • The LLM judge may lack expertise for domain-specific evaluation
  2. Cost
    • Evaluating each production trace requires LLM calls
    • Cost grows for high-traffic applications
  3. Self-evaluation bias
    • Bias can arise when the same model is used for both generation and evaluation
    • Using a different model is recommended
  4. Relatively low accuracy of Context Relevance (70%)
    • Extracting key sentences from long contexts is difficult
    • There is room for improvement
  5. Limited multilingual support
    • Validated mainly on English
    • Further validation is needed for other languages such as Korean

Future Research Directions

  1. Evaluation combined with hybrid retrieval techniques (lexical + semantic)
  2. Improving evaluation accuracy with domain-specific LLMs
  3. Integrated evaluation with deep-thinking models (e.g., o1, DeepSeek-R1)
  4. Multilingual extension and benchmark construction
  5. More efficient evaluation: developing smaller, faster evaluator models

🔗 Related Papers and Frameworks

Prior Work

Paper/Framework | Relationship
TruLens | Origin of the RAG Triad concept, which RAGAS develops further
BERTScore | Text-similarity-based evaluation; complementary to RAGAS
BLEU/ROUGE | Traditional NLP metrics; a poor fit for RAG

Follow-up and Similar Work

Paper/Framework | Characteristics
ARES | Similar approach to RAGAS; uses fine-tuned classifier judges
RAGChecker | A more fine-grained evaluation framework (NeurIPS 2024)
DeepEval | Pytest style, 14+ RAG metrics, CI/CD integration
Giskard | Includes security and bias testing

Evaluation Tool Comparison (2025)

Tool | Characteristics | GPT-4o accuracy
RAGAS | Reference-free, open source, easy to use | ~100%
TruLens | LangChain/LlamaIndex integration, tracing, Snowflake support | 80%+
Weights & Biases | MLOps integration, experiment tracking | ~100%
DeepEval | Integrates a variety of metrics | 80%+

💻 Practical Application Notes

Installation

# Basic installation
pip install ragas

# Latest development version
pip install git+https://github.com/explodinggradients/ragas.git

# For LangChain integration
pip install ragas langchain-core langchain-openai

Requirements: Python >= 3.9

Basic Usage Example

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Prepare the evaluation data
data = {
    "question": ["What are the advantages of Python?"],
    "answer": ["Python has concise syntax, is easy to learn, and offers a wide range of libraries."],
    "contexts": [["Python is a programming language developed by Guido van Rossum in 1991, "
                  "known for its concise syntax and rich library ecosystem."]],
    "ground_truth": ["Python's advantages are its concise syntax and its rich set of libraries."]
}

dataset = Dataset.from_dict(data)

# Run the evaluation (this calls the judge LLM, so API credentials must be configured)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)
# Illustrative output; actual scores vary with the model and the run:
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}

Async Use of Individual Metrics (v0.2+)

from ragas.metrics import Faithfulness
from ragas.llms import llm_factory
from openai import AsyncOpenAI

# Configure the LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Faithfulness evaluation
scorer = Faithfulness(llm=llm)

# `await` must run inside an async function or a notebook cell
result = await scorer.ascore(
    user_input="When was the Eiffel Tower completed?",
    response="The Eiffel Tower was completed in 1889.",
    retrieved_contexts=[
        "The Eiffel Tower was built for the 1889 Paris World's Fair."
    ]
)

print(f"Faithfulness Score: {result.value}")  # 1.0

Context Precision Without Reference

from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

# Evaluate Context Precision without a reference answer.
# evaluator_llm is assumed to be an already-configured evaluator LLM.
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower?",
    response="The Eiffel Tower is in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

# `await` must run inside an async function or a notebook cell
score = await context_precision.single_turn_ascore(sample)

Avoiding Self-Evaluation Bias

# Evaluate with a different model from the one used for generation
# Example: generate with GPT-4, evaluate with Claude

from ragas import evaluate
from ragas.llms import llm_factory

# Configure a separate LLM for evaluation
evaluator_llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic")

# dataset and metrics are assumed to be defined as in the basic usage example
result = evaluate(
    dataset,
    metrics=metrics,
    llm=evaluator_llm  # a different model from the generator
)

Generating Synthetic Test Data

# Note: these import paths follow the RAGAS v0.1-style testset API
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Generate a test set automatically from documents.
# generator_llm, critic_llm, embeddings, and documents are assumed to be defined elsewhere.
generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={
        simple: 0.5,        # simple questions
        reasoning: 0.25,    # questions that require reasoning
        multi_context: 0.25 # questions that require multiple contexts
    }
)

Monitoring Integration (Langfuse Example)

from langfuse import Langfuse
from ragas.integrations.langfuse import RagasCallbackHandler

# Set up the Langfuse client
langfuse = Langfuse()

# RAGAS callback handler
handler = RagasCallbackHandler(langfuse)

# Run the evaluation and record the results automatically
result = evaluate(dataset, metrics=metrics, callbacks=[handler])

Score Interpretation Guide

Score range | Interpretation | Recommended action
0.9 – 1.0 | Excellent | Maintain the current setup
0.8 – 0.9 | Good | Consider fine-tuning
0.7 – 0.8 | Needs improvement | Optimize prompts/retrieval
< 0.7 | Serious problem | Fundamental redesign needed
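
For automated reports, the bands above can be encoded as a small lookup. The sketch below is illustrative only: `interpret` is a hypothetical helper, and because the table's ranges share their endpoints, it assumes that a boundary value such as 0.9 belongs to the higher band.

```python
def interpret(score: float) -> tuple[str, str]:
    """Map a metric score to (interpretation, recommended action)."""
    if score >= 0.9:
        return ("Excellent", "Maintain the current setup")
    if score >= 0.8:
        return ("Good", "Consider fine-tuning")
    if score >= 0.7:
        return ("Needs improvement", "Optimize prompts/retrieval")
    return ("Serious problem", "Fundamental redesign needed")


for value in (0.95, 0.85, 0.72, 0.40):
    label, action = interpret(value)
    print(f"{value:.2f}: {label} -> {action}")
```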

Diagnosis and Troubleshooting Guide

Low metric | Cause | Remedy
Faithfulness | LLM hallucination | Add a "use only the provided information" instruction to the prompt
Context Relevance | Poor retrieval quality | Improve the chunking strategy, embedding model, or retrieval algorithm
Answer Relevance | The answer drifts away from the question | Make the prompt more specific
Context Precision | Irrelevant chunks are ranked near the top | Introduce a reranker; tune retrieval parameters
Context Recall | Needed information is missing | Increase Top-K; diversify retrieval strategies
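
The same guide can be turned into a simple triage helper that lists remedies for the weakest metrics first. This is a sketch under stated assumptions: the metric keys, the `diagnose` function, and the 0.7 cutoff are illustrative choices, and the remedy strings are taken from the table above.

```python
# Remedies taken from the guide; the dictionary keys are illustrative names
REMEDIES = {
    "faithfulness": "Add a 'use only the provided information' instruction to the prompt",
    "context_relevance": "Improve the chunking strategy, embedding model, or retrieval algorithm",
    "answer_relevance": "Make the prompt more specific",
    "context_precision": "Introduce a reranker or tune retrieval parameters",
    "context_recall": "Increase Top-K or diversify retrieval strategies",
}


def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return remedies for metrics below the threshold, lowest score first."""
    low = sorted((s, name) for name, s in scores.items() if s < threshold)
    return [f"{name}: {REMEDIES[name]}" for _, name in low if name in REMEDIES]


report = diagnose(
    {"faithfulness": 0.95, "context_recall": 0.55, "context_precision": 0.65}
)
for line in report:
    print(line)
```

Sorting by score puts the most urgent problem at the top, which is handy when the output feeds a dashboard or a CI log.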

Privacy Settings

# Disable anonymous usage data collection
export RAGAS_DO_NOT_TRACK=true
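
When the shell environment is not under your control (for example in a notebook), the same variable can be set from Python. This is a minimal sketch; placing it before the library import is an assumption made so that the value is already present whenever the setting is read.

```python
import os

# Set the opt-out flag before importing ragas so it is visible
# whenever the library reads the setting (assumption of this sketch).
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

print(os.environ["RAGAS_DO_NOT_TRACK"])  # true
```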

🏷️ Tags

#RAG #Evaluation #RAGAS #LLM #Faithfulness #AnswerRelevance #ContextRelevance #ReferenceFree #LLMasJudge #Retrieval #Generation #NLP #MachineLearning #AIEvaluation #ExplodingGradients #CardiffNLP #EACL2024 #WikiEval #HallucinationDetection #RAGTriad
