[AI Paper] 📄 RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

By skycave
January 25, 2026 · 5 Min Read

📋 Metadata

| Item | Details |
| --- | --- |
| Authors | Robert Friel, Masha Belyi, Atindriyo Sanyal |
| Affiliation | Galileo Technologies Inc. |
| Published | July 15, 2024 (arXiv) |
| Latest version | v2 (January 16, 2025) |
| arXiv | 2407.11005 |
| Dataset | HuggingFace – rungalileo/ragbench |
| GitHub | rungalileo/ragbench |

🎯 One-Line Summary

The paper proposes the first large-scale RAG benchmark dataset, with 100,000 examples, together with the explainable TRACe evaluation framework, enabling systematic and consistent evaluation of RAG systems.


🔍 Research Background and Motivation

Problems with existing RAG evaluation

  1. No unified evaluation criteria
    • RAG evaluation frameworks each use different metrics and definitions
    • Systematic comparison across systems is difficult
  2. Shortage of annotated datasets
    • No large-scale, standardized RAG benchmark dataset exists
    • Little data reflects real industry settings
  3. Limitations of existing evaluation methods
    • LLM-based evaluators have stability and reliability problems
    • Evaluation results lack explainability
  4. Limited industry applicability
    • Existing benchmarks do not sufficiently reflect real industry domains

💡 Key Ideas

Characteristics of the 100K-example dataset

RAGBench is built by combining 12 individual QA datasets:

| Dataset | Domain | Examples | Characteristics |
| --- | --- | --- | --- |
| CovidQA | Medical/Bio | 1.77k | Multi-document reasoning about COVID-19 |
| PubMedQA | Medical/Bio | 24.5k | QA based on medical papers |
| HotpotQA | General knowledge | 2.7k | Requires multi-hop reasoning |
| MS Marco | General knowledge | 2.69k | QA based on web search |
| CUAD | Legal | 2.55k | Legal contract analysis (long context) |
| EManual | Customer support | 1.32k | Based on user manuals |
| TechQA | Customer support | 1.81k | Technical support documents |
| FinQA | Finance | 16.6k | Numerical reasoning over financial reports |
| TAT-QA | Finance | 33.1k | Hybrid table + text reasoning |
| ExpertQA | Expert knowledge | 2.03k | Expert-written QA |
| HAGRID | General knowledge | 4.53k | Citation-based QA |
| DelucionQA | Customer support | 1.83k | For hallucination detection (auto-generated) |

Coverage of 5 industry domains

  • Medical/Biomedical: CovidQA, PubMedQA
  • Legal: CUAD
  • Customer support: EManual, TechQA, DelucionQA
  • Finance: FinQA, TAT-QA
  • General knowledge: HotpotQA, MS Marco, HAGRID, ExpertQA

Dataset splits

  • Training set: 78,000 examples
  • Validation set: 12,000 examples
  • Test set: 11,000 examples
  • Splits are strictly separated by query
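
One common way to get a strict query-level separation is to derive the split from a hash of the query text, so that the same question can never land in two splits. The sketch below only illustrates that idea; the summary does not say how the authors built their splits, and the `question` field name, the helper names, and the proportions (taken from the rounded counts above) are assumptions.

```python
import hashlib
from typing import Any, Dict, Iterable, List

# Rounded split sizes from the list above, used only to set proportions.
TRAIN, VALIDATION, TEST = 78_000, 12_000, 11_000
TOTAL = TRAIN + VALIDATION + TEST


def split_for_query(query: str) -> str:
    """Deterministically map a query string to a split name."""
    normalized = query.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < TRAIN / TOTAL:
        return "train"
    if bucket < (TRAIN + VALIDATION) / TOTAL:
        return "validation"
    return "test"


def split_by_query(examples: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group examples so that all rows sharing a question share a split."""
    splits: Dict[str, List[Dict[str, Any]]] = {"train": [], "validation": [], "test": []}
    for example in examples:
        splits[split_for_query(example["question"])].append(example)
    return splits


rows = [
    {"question": "What is RAG?", "response": "answer A"},
    {"question": "what is rag?  ", "response": "answer B"},  # same query after normalization
    {"question": "How is adherence measured?", "response": "answer C"},
]
splits = split_by_query(rows)
```

Because the assignment depends only on the normalized query, re-running the split on new data keeps earlier queries in the same split.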

🏗️ TRACe Metrics in Detail

TRACe stands for uTilization, Relevance, Adherence, and Completeness, and evaluates the quality of a RAG system in an explainable way.

1. Context Relevance – Retriever evaluation

Definition: How relevant are the retrieved documents to the question?
  • Evaluation target: Retriever component
  • What it measures: whether the retrieval system fetched the right information
  • Key question: “Does the retrieved context contain the information needed to answer the question?”

2. Context Utilization – Generator evaluation

Definition: How well did the generator make use of the retrieved documents?
  • Evaluation target: Generator component
  • What it measures: whether the provided context was actually used to generate the response
  • Key question: “Did the chatbot actually use the retrieved information?”

3. Response Adherence – Generator evaluation

Definition: Is the response faithful to the information in the retrieved context?
  • Evaluation target: Generator component (hallucination detection)
  • What it measures: whether the response is grounded in the provided context or contains hallucinations
  • Key question: “Did the chatbot stick to the facts, or did it make information up?”
  • How it is implemented: check whether each sentence of the response is supported by the context

4. Response Completeness – Generator evaluation

Definition: Does the response fully answer the question?
  • Evaluation target: Generator component
  • What it measures: whether all the information needed for the question is included in the response
  • Key question: “Does the answer cover every aspect of the question?”

TRACe framework structure

┌──────────────────────────────────────────────────────────────┐
│                      RAG System                              │
├──────────────────────────────────────────────────────────────┤
│  Query → [Retriever] → Context → [Generator] → Response      │
│              ↓                        ↓                      │
│         Relevance              Utilization                   │
│                                Adherence                     │
│                                Completeness                  │
└──────────────────────────────────────────────────────────────┘
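
To make the four metrics concrete, here is a minimal sketch that computes example-level scores from sentence-level annotations. It assumes a length-ratio reading of the metrics: relevance is the share of the context that is relevant, utilization is the share of the context the response used, completeness is the share of the relevant context that was used, and adherence is true only if every response sentence is supported. The `TraceExample` type and its fields are hypothetical and not the RAGBench schema, and character length stands in for token length.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class TraceExample:
    """Hypothetical container for one annotated RAG example."""
    context_sentences: Dict[str, str]  # sentence key -> sentence text
    relevant_keys: Set[str]            # context sentences relevant to the query
    utilized_keys: Set[str]            # context sentences the response drew on
    response_supported: List[bool]     # one flag per response sentence


def _length(sentences: Dict[str, str], keys: Set[str]) -> int:
    """Total character length of the selected context sentences."""
    return sum(len(sentences[key]) for key in keys)


def trace_scores(example: TraceExample) -> Dict[str, float]:
    """Compute relevance, utilization, completeness, and adherence for one example."""
    total = sum(len(text) for text in example.context_sentences.values())
    relevant = _length(example.context_sentences, example.relevant_keys)
    used = _length(example.context_sentences, example.utilized_keys)
    both = _length(example.context_sentences, example.relevant_keys & example.utilized_keys)
    return {
        "relevance": relevant / total if total else 0.0,
        "utilization": used / total if total else 0.0,
        "completeness": both / relevant if relevant else 0.0,
        "adherence": float(all(example.response_supported)),
    }


example = TraceExample(
    context_sentences={"0a": "aaaa", "0b": "bbbb", "1a": "cccccccc"},
    relevant_keys={"0a", "0b"},
    utilized_keys={"0a"},
    response_supported=[True, True],
)
scores = trace_scores(example)
print(scores)
```

In this toy example half of the context is relevant, but the response uses only half of that relevant part. Relevance and completeness therefore both come out at 0.5, which points at the generator rather than the retriever.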

Technical implementation

  • A shallow prediction head is added to the DeBERTa model for each TRACe metric
  • All metrics can be predicted in a single forward pass
  • Context tokens: estimate Relevance and Utilization probabilities
  • Response tokens: estimate Adherence probability
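
Token-level probabilities still have to be turned into example-level scores. The sketch below shows one plausible aggregation: threshold the context-token probabilities at 0.5, derive completeness from the overlap of relevant and utilized tokens, and average the response-token adherence probabilities. The threshold, the pooling choices, and all names are assumptions for illustration, not the paper's confirmed procedure.

```python
from typing import Dict, Sequence


def aggregate_token_predictions(
    relevance_probs: Sequence[float],    # one probability per context token
    utilization_probs: Sequence[float],  # one probability per context token
    adherence_probs: Sequence[float],    # one probability per response token
    threshold: float = 0.5,              # assumed decision threshold
) -> Dict[str, float]:
    """Pool token-level probabilities into example-level TRACe-style scores."""
    if len(relevance_probs) != len(utilization_probs):
        raise ValueError("relevance and utilization must cover the same context tokens")

    relevant = [p >= threshold for p in relevance_probs]
    utilized = [p >= threshold for p in utilization_probs]
    n_context = len(relevant)
    n_relevant = sum(relevant)
    n_both = sum(r and u for r, u in zip(relevant, utilized))

    return {
        "relevance": n_relevant / n_context if n_context else 0.0,
        "utilization": sum(utilized) / n_context if n_context else 0.0,
        # Share of relevant context tokens that were also utilized (assumed definition).
        "completeness": n_both / n_relevant if n_relevant else 0.0,
        # Mean response-token probability; a stricter variant could take the minimum.
        "adherence": sum(adherence_probs) / len(adherence_probs) if adherence_probs else 0.0,
    }


pooled = aggregate_token_predictions(
    relevance_probs=[0.9, 0.8, 0.2, 0.1],
    utilization_probs=[0.9, 0.1, 0.1, 0.1],
    adherence_probs=[0.9, 0.7],
)
print(pooled)
```

Under these assumptions no separate completeness head is needed, since completeness falls out of the two context-token outputs.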

📊 Experiments and Results

Main experimental setup

Systems evaluated:
  • LLM judges: GPT-3.5, GPT-4
  • Existing frameworks: RAGAS, TruLens
  • Fine-tuned model: DeBERTa-v3-Large (400M parameters)

Key results

Hallucination detection (Adherence) performance – AUROC

| Dataset | DeBERTa | RAGAS | TruLens |
| --- | --- | --- | --- |
| TechQA | 0.86 | 0.57 | 0.70 |
| Range across datasets | 0.64-0.86 | – | – |
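
For readers unfamiliar with AUROC: it is the probability that a randomly chosen positive example (here, a hallucinated response) receives a higher score than a randomly chosen negative one, so 0.5 is chance and 1.0 is perfect ranking. The pure-Python sketch below computes it by pairwise comparison; the labels and scores are made up for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 1 = hallucinated response, score = evaluator's predicted hallucination probability.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.1]
print(round(auroc(labels, scores), 3))  # 0.889
```

The pairwise loop is quadratic; on full evaluation sets a rank-based implementation such as scikit-learn's `roc_auc_score` is the practical choice.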

Key findings

  1. Strength of the small model
    • The 400M-parameter DeBERTa model consistently outperforms multi-billion-parameter LLM judges
    • Demonstrates the value of specialized training data
  2. Difficulty of Context Relevance
    • Context Relevance is the hardest metric to estimate for every model
    • Judging whether the context contains the specific information needed to answer the question is complex
  3. Hallucination rates differ by domain
    • High hallucination rates: ExpertQA (12%), CovidQA (16%), MS Marco (13%)
    • Low hallucination rates: CUAD, FinQA, TAT-QA (about 1% each)
  4. Gap from ground truth
    • A substantial gap remains between the best-performing evaluator and the ground truth

💪 Strengths and Contributions

1. First large-scale RAG benchmark

  • The first comprehensive RAG benchmark, made up of 100,000 examples
  • Combines 12 datasets across 5 industry domains

2. Explainable evaluation framework

  • TRACe's four metrics evaluate the Retriever and the Generator separately
  • Each metric points to a concrete direction for improvement

3. Practical industry applicability

  • Uses real industry data such as user manuals and technical documents
  • Reflects real RAG application areas such as customer support, finance, and legal

4. Efficient evaluation model

  • The 400M DeBERTa performs competitively with billion-scale LLMs
  • Enables cost-efficient RAG evaluation

5. Token-level annotations

  • Provides detailed token/span-level annotations rather than a bare score
  • Directly useful for debugging and improvement

6. Open-source release

  • The dataset and code are fully public
  • Supports reproducible research

⚠️ Limitations and Future Work

Limitations mentioned in the paper

  1. English-centric
    • Currently focused on English tasks only
    • Multilingual support is needed
  2. Static knowledge sources
    • Mostly uses static knowledge sources
    • Dynamic, open-ended retrieval scenarios are not covered
  3. Gap from ground truth
    • Even the best-performing evaluator shows a substantial gap from the ground truth
    • Further research is needed
  4. Difficulty of estimating Context Relevance
    • Higher RMSE than the other metrics
    • Reasoning about implicit answers is complex
  5. Limitations of LLM-based annotation
    • Some annotations were produced with an LLM
    • Potential for bias
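
The RMSE mentioned under limitation 4 measures how far predicted scores fall from the annotated ones, with larger errors penalized more heavily. A minimal sketch with made-up numbers:

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error between predicted and annotated scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted vs. annotated Context Relevance scores for three examples.
error = rmse([0.5, 0.2, 0.9], [0.4, 0.4, 0.6])
print(round(error, 3))  # 0.216
```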

Future research directions

  1. Benchmark expansion
    • Plans to integrate additional datasets such as ChatRAGBench
  2. Fine-tuning larger expert models
    • Research aimed at narrowing the gap from the ground truth
  3. Stronger token-level prediction
    • Improve interpretability with token-level rather than example-level predictions
  4. LLM judge fine-tuning
    • Close the performance gap between DeBERTa and GPT-4

🔗 Related Papers

Prior work (RAG evaluation frameworks)

| Paper/Framework | Key content | Relationship |
| --- | --- | --- |
| RAGAS | Evaluates Context Relevance, Groundedness, and Answer Relevance | Used by RAGBench as a comparison baseline |
| TruLens | Introduces the RAG Triad concept (uses NLI models) | Similar metric definitions; performance compared |
| ARES | Builds domain-specific LLM judges | Similar approach; metric definitions referenced |
| RGB | Extends the RAG Triad concept | Improved score prediction method |
| CRUD-RAG | Categorizes RAG by CRUD operations | Chinese benchmark |

Source datasets

  • HotpotQA, MS Marco, CUAD, FinQA, TAT-QA, PubMedQA, CovidQA, and others

Follow-up/related work

| Paper | Description |
| --- | --- |
| RAGChecker (NeurIPS 2024) | Fine-grained RAG evaluation framework |
| T2-RAGBench | Text + table RAG benchmark |
| RAG-RewardBench | Benchmark for RAG reward models |

💻 Practical Application Points

How to use the RAGBench dataset

1. Load the dataset

from datasets import load_dataset

# Load a single component dataset
ragbench_hotpotqa = load_dataset("rungalileo/ragbench", "hotpotqa")

# Load only a specific split
ragbench_test = load_dataset("rungalileo/ragbench", "hotpotqa", split="test")

# Load all of RAGBench
ragbench = {}
datasets_list = ['covidqa', 'cuad', 'delucionqa', 'emanual', 'expertqa',
                 'finqa', 'hagrid', 'hotpotqa', 'msmarco', 'pubmedqa',
                 'tatqa', 'techqa']
for dataset in datasets_list:
    ragbench[dataset] = load_dataset("rungalileo/ragbench", dataset)
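
Once the subsets are loaded, the annotations can be summarized per subset, for example to reproduce a rough per-domain hallucination rate. The sketch below works on any iterable of dict-like rows so it can run without downloading anything. The `adherence_score` field name is an assumption about the dataset schema and should be checked against the dataset's `column_names` before use.

```python
from typing import Any, Dict, Iterable


def hallucination_rate(
    rows: Iterable[Dict[str, Any]],
    adherence_field: str = "adherence_score",  # assumed column name; verify before use
) -> float:
    """Fraction of examples whose response is not fully supported by the context."""
    total = 0
    unsupported = 0
    for row in rows:
        total += 1
        if not row[adherence_field]:
            unsupported += 1
    if total == 0:
        raise ValueError("no rows to summarize")
    return unsupported / total


# Toy rows standing in for a loaded split.
toy_rows = [
    {"adherence_score": True},
    {"adherence_score": True},
    {"adherence_score": False},
    {"adherence_score": True},
]
print(hallucination_rate(toy_rows))  # 0.25
```

With the `ragbench` dictionary from the loading snippet, the same function could be called as `hallucination_rate(ragbench[name]["test"])` for each `name`, assuming that split and column exist.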

2. Run evaluation

# Benchmark existing RAG evaluation frameworks
python calculate_metrics.py --dataset hotpotqa msmarco hagrid expertqa

# Run inference
python run_inference.py

Guide to using the TRACe metrics

| Metric | What to do when the score is low |
| --- | --- |
| Relevance | Improve the Retriever (embedding model, chunking strategy, search algorithm) |
| Utilization | Prompt engineering; improve how context is integrated into the prompt |
| Adherence | Apply hallucination-prevention techniques; strengthen source citation |
| Completeness | Integrate multiple documents; improve the response generation strategy |

Practical checklist

  • [ ] Identify the RAGBench subset closest to your own domain
  • [ ] Evaluate your current RAG system with the TRACe metrics
  • [ ] Focus improvement work on the lowest-scoring metric
  • [ ] Build a cost-efficient evaluation pipeline around a DeBERTa-based evaluator
  • [ ] Use the token-level annotations to pinpoint specific problems

🏷️ Tags

#RAG #Benchmark #TRACe #Evaluation #Retrieval #Generation #LLM #Hallucination #DeBERTa #Galileo #HuggingFace #NLP #QA #InformationRetrieval #ContextRelevance #Adherence #Utilization #Completeness #RAGBench #2024


📚 References

  • arXiv Paper: https://arxiv.org/abs/2407.11005
  • HuggingFace Dataset: https://huggingface.co/datasets/rungalileo/ragbench
  • GitHub Repository: https://github.com/rungalileo/ragbench
  • Semantic Scholar: https://www.semanticscholar.org/paper/1b0aba023d7aa5fb9853f9e942efb5c243dc1201

Last Updated: 2025-01-19
