[AI Paper] 📄 Reflexion: Language Agents with Verbal Reinforcement Learning

By skycave
January 25, 2026 · 9 Min Read

📄 Reflexion: Language Agents with Verbal Reinforcement Learning

📋 Metadata

| Item | Details |
| --- | --- |
| Authors | Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao |
| Affiliations | Northeastern University (Khoury College), MIT, Princeton University |
| Venue | NeurIPS 2023 (37th Conference on Neural Information Processing Systems) |
| Year | 2023 |
| arXiv | 2303.11366 |
| GitHub | noahshinn/reflexion |
| OpenReview | vAElhFcKW6 |

🎯 One-Line Summary

Reflexion proposes a new reinforcement-learning paradigm in which an LLM agent learns from failed attempts through verbal self-reflection, with no weight updates, and thereby substantially improves performance on decision-making, reasoning, and coding tasks.


๐Ÿ” ์—ฐ๊ตฌ ๋ฐฐ๊ฒฝ ๋ฐ ๋™๊ธฐ

๊ธฐ์กด RL์˜ ํ•œ๊ณ„

  1. ๋ฐ์ดํ„ฐ ๋น„ํšจ์œจ์„ฑ
    • ์ „ํ†ต์ ์ธ ๊ฐ•ํ™”ํ•™์Šต(RL)์€ ๋ฐฉ๋Œ€ํ•œ ์–‘์˜ ํ•™์Šต ์ƒ˜ํ”Œ๊ณผ ๋น„์šฉ์ด ๋งŽ์ด ๋“œ๋Š” ๋ชจ๋ธ ํŒŒ์ธํŠœ๋‹์ด ํ•„์š”
    • Policy gradient๋‚˜ value-based ๋ฐฉ๋ฒ•๋“ค์€ extensive training๊ณผ expensive model fine-tuning ์š”๊ตฌ
  2. ์Šค์นผ๋ผ ๋ณด์ƒ์˜ ํ•œ๊ณ„
    • ๊ธฐ์กด RL์€ ์Šค์นผ๋ผ ๋˜๋Š” ๋ฒกํ„ฐ ํ˜•ํƒœ์˜ ๋ณด์ƒ ์‹ ํ˜ธ๋ฅผ ์‚ฌ์šฉ
    • ์ •ํ™•ํ•œ credit assignment๊ฐ€ ์–ด๋ ค์›€ – ์–ด๋–ค ํ–‰๋™์ด ์„ฑ๊ณต/์‹คํŒจ์— ๊ธฐ์—ฌํ–ˆ๋Š”์ง€ ํŒŒ์•… ๊ณค๋ž€
    • ๊ตฌ์ฒด์ ์ธ ๊ฐœ์„  ๋ฐฉํ–ฅ ์ œ์‹œ ๋ถˆ๊ฐ€๋Šฅ
  3. ํ•ด์„ ๋ถˆ๊ฐ€๋Šฅ์„ฑ
    • ์ •์ฑ… ๋„คํŠธ์›Œํฌ๋‚˜ ๊ฐ€์น˜ ํ•จ์ˆ˜์˜ ํ•™์Šต ๊ณผ์ •์ด ๋ธ”๋ž™๋ฐ•์Šค
    • ์—์ด์ „ํŠธ๊ฐ€ ์™œ ํŠน์ • ํ–‰๋™์„ ์„ ํƒํ–ˆ๋Š”์ง€ ํŒŒ์•…ํ•˜๊ธฐ ์–ด๋ ค์›€
  4. ์ธ๊ฐ„ ํ•™์Šต๊ณผ์˜ ๊ดด๋ฆฌ
    • ์ธ๊ฐ„์€ ์‹คํŒจ๋กœ๋ถ€ํ„ฐ ์„ฑ์ฐฐ(reflection)ํ•˜๊ณ  ๋‹ค์Œ ์‹œ๋„์—์„œ ๊ฐœ์„ ๋œ ๊ณ„ํš์„ ์„ธ์›€
    • ๊ธฐ์กด LLM ์—์ด์ „ํŠธ๋Š” ์ด๋Ÿฌํ•œ ์‹œํ–‰์ฐฉ์˜ค๋ฅผ ํ†ตํ•œ ์ž๊ฐ€ ํ•™์Šต ๋Šฅ๋ ฅ์ด ๋ถ€์กฑ

ํ•ต์‹ฌ ์—ฐ๊ตฌ ์งˆ๋ฌธ

“LLM ์—์ด์ „ํŠธ๊ฐ€ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ ์—†์ด trial-and-error๋ฅผ ํ†ตํ•ด ๋น ๋ฅด๊ณ  ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์„๊นŒ?”


💡 Key Ideas

1. Verbal Reinforcement Learning

Where conventional RL uses scalar rewards, Reflexion uses verbal feedback as the reinforcement signal:

Traditional RL: state → action → scalar reward → weight update
Reflexion:      state → action → verbal feedback → memory update (no weight change)

Key differences (a minimal sketch follows below):
– No weight updates: no LLM fine-tuning required
– Verbal feedback: concrete improvement directions in natural language instead of a scalar value
– Semantic gradient: the verbal feedback acts as a kind of "semantic gradient" signal
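
To make the contrast concrete, here is a minimal, runnable sketch of what a Reflexion-style "update" amounts to: instead of adjusting weights, the agent appends a natural-language reflection to its memory. The reflection text and function name below are illustrative, not taken from the paper's implementation.

from typing import List

def verbal_rl_update(memory: List[str], trajectory: str, feedback: str) -> List[str]:
    """Reflexion-style update: no gradients, no weight changes; only the memory grows."""
    reflection = (
        f"Previous attempt: {trajectory}. "
        f"It failed because: {feedback}. "
        "Next time, adjust the plan to address this cause."
    )
    return memory + [reflection]

# Usage: the 'policy' is effectively (frozen LLM + this growing memory).
memory: List[str] = []
memory = verbal_rl_update(
    memory,
    trajectory="open drawer -> take knife",
    feedback="the knife was in the cabinet, not the drawer",
)
print(memory[-1])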

2. Self-Reflection Mechanism

After a failure, the agent analyzes what went wrong through self-reflection and stores the result as natural language:

Failed task → generate self-reflection → store in memory → use on the next attempt

Role of self-reflection:
– Analyze the cause of failure (What went wrong?)
– Suggest concrete improvements (What to do differently?)
– Produce lessons to apply on the next attempt

Example: after a failure on a programming task

"The function failed because it did not handle negative inputs. On future attempts, a check for negative inputs must be added."

3. Episodic Memory

Reflections are stored in episodic memory and consulted on subsequent attempts:

| Memory type | Description | Contents |
| --- | --- | --- |
| Short-term | Trajectory of the current episode | Action-observation sequence |
| Long-term | Self-reflections from past trials | Failure analyses, improvement directions, lessons |

Advantages of long-term memory (see the sketch below):
– Explicit, interpretable record of experience
– Concrete hints for future actions
– Immediately usable in the next episode

4. Feedback Flexibility

Reflexion can accept feedback signals of various forms and sources:

| Feedback type | Examples |
| --- | --- |
| Scalar values | Success/failure, scores, rewards |
| Free-form language | Specific error messages, evaluator comments |
| External feedback | Compiler errors, test results, environment feedback |
| Internal simulation | Self-generated tests, self-evaluation |
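
Because all of these signals eventually reach the reflection model as text, a thin normalization layer is enough to support them. The helper below is an illustration of that idea, not an API from the paper or any library.

from typing import Sequence, Union

def feedback_to_text(feedback: Union[bool, float, str, Sequence[str]]) -> str:
    """Normalize heterogeneous feedback into text for the self-reflection prompt."""
    if isinstance(feedback, bool):                 # success/failure flag
        return "The attempt succeeded." if feedback else "The attempt failed."
    if isinstance(feedback, (int, float)):         # scalar reward or score
        return f"The attempt received a reward of {feedback}."
    if isinstance(feedback, str):                  # compiler error, evaluator comment, ...
        return feedback
    return "\n".join(str(item) for item in feedback)  # e.g. a list of failing test cases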

๐Ÿ—๏ธ ์•„ํ‚คํ…์ฒ˜ / ๋ฐฉ๋ฒ•๋ก 

์‹œ์Šคํ…œ ๊ตฌ์„ฑ ์š”์†Œ

Reflexion์€ ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ์ปดํฌ๋„ŒํŠธ๋กœ ๊ตฌ์„ฑ:

                    Reflexion Framework

 ┌─────────┐     action      ┌─────────────┐
 │  Actor  │ ──────────────▶ │ Environment │
 │  (LLM)  │ ◀────────────── │             │
 └────┬────┘   observation   └──────┬──────┘
      │ trajectory                  │ reward
      ▼                             ▼
 ┌─────────────────────────────────────────┐
 │                Evaluator                │
 │         (computes reward score)         │
 └────────────────────┬────────────────────┘
                      │ feedback
                      ▼
 ┌─────────────────────────────────────────┐
 │             Self-Reflection             │
 │       (generates verbal feedback)       │
 └────────────────────┬────────────────────┘
                      │ reflection
                      ▼
 ┌─────────────────────────────────────────┐
 │            Long-term Memory             │
 │          (stores reflections)           │
 └─────────────────────────────────────────┘

1. Actor

  • Role: generates text and actions based on state observations
  • Implementation: Chain-of-Thought (CoT) or ReAct based
  • Input: environment state (observation) + long-term memory (reflections)
  • Output: action and trajectory
  • Note: combined with the memory component so that past experience informs the next action

2. Evaluator

  • Role: scores the quality of the Actor's output
  • Input: the generated trajectory (short-term memory)
  • Output: a reward score
  • Implementations:
    • Decision-making: pre-defined heuristic functions or GPT-based binary classification
    • Reasoning: exact-match (EM) evaluation
    • Programming: test execution results (pass/fail)
  • Note: it does not have to be an LLM; even a simple table lookup works (see the sketch below)
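
To illustrate that last point, here are a few task-specific evaluator sketches; the decision-making heuristic is a simplified stand-in of our own, not the exact rule used in the paper.

from typing import List, Tuple

def evaluate_reasoning(predicted: str, ground_truth: str) -> bool:
    """Reasoning (HotPotQA-style): exact-match check."""
    return predicted.strip().lower() == ground_truth.strip().lower()

def evaluate_programming(test_results: List[bool]) -> bool:
    """Programming (HumanEval-style): every unit test must pass."""
    return all(test_results)

def evaluate_decision_making(actions: List[str], max_repeats: int = 3) -> Tuple[bool, str]:
    """Decision-making (AlfWorld-style): flag trajectories that keep repeating the same action."""
    for action in set(actions):
        if actions.count(action) > max_repeats:
            return False, f"Action '{action}' repeated {actions.count(action)} times; the agent looks stuck."
    return True, "No obvious inefficiency detected."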

3. Self-Reflection

  • Role: generates verbal reinforcement cues that help the Actor improve itself
  • Input:
    • reward signal
    • current trajectory
    • long-term (persistent) memory
  • Output: specific, relevant feedback (the self-reflection)
  • Implementation: an LLM
  • Key function: converts environment feedback into verbal form and stores it in memory (see the sketch below)
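
Put together, the Self-Reflection component is essentially one LLM call that maps (trajectory, feedback, memory) to a short piece of advice. The sketch below shows that shape; the prompt wording and the injected `call_llm` callable are illustrative, and the paper's actual prompts are shown later in this post.

from typing import Callable, List

def generate_reflection(trajectory: str, feedback: str, memory: List[str],
                        call_llm: Callable[[str], str] = lambda p: "(reflection text)") -> str:
    """Turn environment feedback into a verbal reinforcement cue for the next trial."""
    prompt = (
        "You attempted a task and failed.\n"
        f"Trajectory:\n{trajectory}\n\n"
        f"Feedback: {feedback}\n\n"
        "Previous reflections:\n" + "\n".join(memory) + "\n\n"
        "In a few sentences, diagnose the likely cause of failure and state a concise, "
        "high-level plan to avoid it on the next attempt."
    )
    return call_llm(prompt)  # inject any chat-completion client here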

Learning Loop

Trial 1: attempt task → fail → generate reflection → store in memory
    ↓
Trial 2: consult memory → retry task → fail → add another reflection
    ↓
Trial 3: consult accumulated memory → retry task → success!

Core steps of the algorithm:
1. Define: define the task
2. Generate: the Actor generates a trajectory
3. Evaluate: the Evaluator scores the result
4. Reflect: on failure, run Self-Reflection
5. Update: store the reflection in long-term memory
6. Iterate: generate the next trajectory (using the memory)

ํƒœ์Šคํฌ๋ณ„ ๊ตฌํ˜„ ์„ธ๋ถ€์‚ฌํ•ญ

Decision-Making (AlfWorld)

# ReAct + Reflexion for AlfWorld
def reflexion_decision_making(task, env, max_trials=12):
    memory = []

    for trial in range(max_trials):
        # Actor: ReAct-style reasoning and acting
        trajectory = react_agent.run(task, env, memory)

        # Evaluator: Heuristic or GPT-based binary classification
        success, feedback = evaluate_trajectory(trajectory)

        if success:
            return trajectory

        # Self-Reflection: Generate verbal feedback
        reflection = generate_reflection(
            trajectory=trajectory,
            feedback=feedback,
            memory=memory
        )

        memory.append(reflection)
        env.reset()

    return None

Two cases where long-term memory helps in AlfWorld:
1. Identifying a mistake made early in a long trajectory and suggesting a different action choice or a longer-term plan
2. When there are too many surfaces/containers to check, exploring the room systematically across multiple trials

Programming (HumanEval)

# Reflexion for Code Generation
def reflexion_code_generation(problem, tests, max_trials=10):
    memory = []

    for trial in range(max_trials):
        # Actor: Generate code with memory context
        code = generate_code(problem, memory)

        # Self-Generated Tests (internal feedback)
        internal_tests = generate_tests(problem, code)

        # Evaluator: Run tests
        test_results = run_tests(code, tests + internal_tests)

        if all_tests_pass(test_results):
            return code

        # Self-Reflection
        reflection = reflect_on_code(
            code=code,
            test_results=test_results,
            memory=memory
        )

        memory.append(reflection)

    return code  # fall back to the most recent attempt if no version passes every test

Reasoning (HotPotQA)

# Reflexion for Multi-hop QA
def reflexion_reasoning(question, context, max_trials=5):
    memory = []

    for trial in range(max_trials):
        # Actor: CoT reasoning with memory
        answer, reasoning_chain = cot_reason(
            question, context, memory
        )

        # Evaluator: Check answer correctness
        is_correct, ground_truth = evaluate_answer(answer)

        if is_correct:
            return answer

        # Self-Reflection
        reflection = reflect_on_reasoning(
            question=question,
            reasoning=reasoning_chain,
            answer=answer,
            memory=memory
        )

        memory.append(reflection)

    return answer  # fall back to the most recent answer if none was judged correct

📊 Experiments and Results

Experimental Setup

| Task type | Benchmark | Description |
| --- | --- | --- |
| Sequential decision-making | AlfWorld | 134 text-based environments, 6 task types |
| Reasoning | HotPotQA | 100 multi-hop reasoning questions (distractor setting) |
| Programming | HumanEval | 164 Python programming problems |
| Programming | LeetcodeHardGym | 40 hard-level problems, 19 languages (newly introduced) |

Main Results

1. AlfWorld (Decision-Making)

| Method | Success rate |
| --- | --- |
| ReAct (baseline) | 73% |
| ReAct + Reflexion (heuristic evaluator) | 97% (130/134) |
| ReAct + Reflexion (GPT evaluator) | 97% (130/134) |

  • Absolute improvement: +22% within 12 iterative steps
  • Only 4 of the 134 tasks remain unsolved

2. HotPotQA (Reasoning)

| Method | Accuracy |
| --- | --- |
| GPT-4 (baseline) | 34% |
| CoT + episodic memory | 36% |
| GPT-4 + Reflexion | 54% |

  • Absolute improvement: +20%
  • Gains in search, information extraction, and reasoning ability

3. HumanEval (Code Generation)

| Method | Pass@1 |
| --- | --- |
| GPT-4 (baseline) | 67.0% |
| CodeT (previous SOTA) | 65.8% |
| GPT-4 + Reflexion | 91.0% |

  • Absolute improvement: +24% over GPT-4
  • State of the art at the time (surpassing GPT-4's previous record of 80%)

Learning Curve Analysis

Performance vs Trial Number (AlfWorld)
───────────────────────────────────────
Trial 1:  ████████████░░░░░░░░  60%
Trial 3:  ██████████████████░░  90%
Trial 6:  ███████████████████░  95%
Trial 12: ████████████████████  97%

ReAct (no reflection): plateaus at ~75%
Reflexion: continues to improve with more trials

Key finding: ReAct plateaus at a certain level, whereas Reflexion keeps improving as the number of trials increases.

Ablation Study Results

Effect of self-reflection

  • An 8% absolute improvement over learning with episodic memory alone
  • Shows that self-reflection-guided refinement is more effective than a refinement-only approach

Per-component contribution (HumanEval)

| Configuration | Pass@1 |
| --- | --- |
| Base (no reflection) | 67.0% |
| + Self-generated tests only | 77.0% |
| + Self-reflection only | 80.0% |
| + Both (full Reflexion) | 91.0% |

  • Both self-generated tests and self-reflection matter
  • The two components show a clear synergy

💪 Strengths and Contributions

Technical strengths

  1. Lightweight
    • No weight updates: performance improves through in-context learning alone, without fine-tuning
    • Computationally efficient: minimal additional training cost
    • Fast adaptation: improves within a handful of trials
  2. Interpretability
    • Verbal feedback: failures are explained in natural language
    • Transparent learning process: the stored reflections can be inspected directly
    • Easier debugging: the agent's reasoning can be traced
  3. Flexibility
    • Diverse feedback sources: external/internal, scalar/linguistic
    • Task generality: applies to decision-making, coding, and reasoning alike
    • Composability: integrates easily with CoT and ReAct
  4. Advantages over prior methods
    • Finer-grained feedback than traditional RL (targeted changes to behavior)
    • Explicit, interpretable episodic memory
    • Clear action hints for the next episode

Academic contributions

  1. A new paradigm: "verbal RL", where the policy is parameterized by (memory encoding, LLM parameters)
  2. SOTA result: best HumanEval performance at the time
  3. New benchmark: the LeetcodeHardGym environment (40 hard-level problems, 19 languages)
  4. Open source: code, demos, and datasets are all released

โš ๏ธ ํ•œ๊ณ„์  ๋ฐ ํ–ฅํ›„ ์—ฐ๊ตฌ

๋…ผ๋ฌธ์—์„œ ์–ธ๊ธ‰ํ•œ ํ•œ๊ณ„

1. Local Minima ๋ฌธ์ œ

  • Reflexion์€ ๋ณธ์งˆ์ ์œผ๋กœ ์ž์—ฐ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ์ •์ฑ… ์ตœ์ ํ™” ๊ธฐ๋ฒ•
  • ์ •์ฑ… ์ตœ์ ํ™”๋Š” ๊ฐ•๋ ฅํ•˜์ง€๋งŒ ๋น„์ตœ์  ๊ตญ์†Œ ์ตœ์†Ÿ๊ฐ’(non-optimal local minima)์— ๋น ์งˆ ์ˆ˜ ์žˆ์Œ

2. ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰ ์ œํ•œ

  • ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ๋ฐฉ์‹์œผ๋กœ ์žฅ๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰ ์ œํ•œ
  • LLM ์ปจํ…์ŠคํŠธ ๊ธธ์ด ์ œ์•ฝ์œผ๋กœ ์ธํ•œ ์ •๋ณด ์†์‹ค ๊ฐ€๋Šฅ
  • ์ €์ž๋“ค์€ ํ–ฅํ›„ ์—ฐ๊ตฌ์—์„œ ๋ฒกํ„ฐ ์ž„๋ฒ ๋”ฉ DB๋‚˜ SQL DB๋กœ ํ™•์žฅ ๊ถŒ์žฅ

3. Self-Evaluation ์˜์กด์„ฑ

  • LLM์˜ ์ž๊ธฐ ํ‰๊ฐ€ ๋Šฅ๋ ฅ์— ํฌ๊ฒŒ ์˜์กด
  • ํ‰๊ฐ€ ๋ชจ๋ธ์ด ๋ถ€์ •ํ™•ํ•˜๋ฉด ์ž˜๋ชป๋œ ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต
  • ์ „ํ†ต์  RL๊ณผ ๋‹ฌ๋ฆฌ ์„ฑ๊ณต์— ๋Œ€ํ•œ ํ˜•์‹์  ๋ณด์žฅ ์—†์Œ

4. Self-Reflection ํ’ˆ์งˆ ๋ณ€๋™

  • LLM์ด ์ƒ์„ฑํ•˜๋Š” ํ”ผ๋“œ๋ฐฑ์˜ ์ •ํ™•์„ฑ๊ณผ ์‹คํ–‰ ๊ฐ€๋Šฅ์„ฑ์ด ํ•ญ์ƒ ๋ณด์žฅ๋˜์ง€ ์•Š์Œ
  • ํ•ต์‹ฌ ์ด์Šˆ๋ฅผ ๋†“์น˜๊ฑฐ๋‚˜ ์ถฉ๋ถ„ํžˆ ์ •ํ™•ํ•˜์ง€ ์•Š์€ ํ”ผ๋“œ๋ฐฑ ์ƒ์„ฑ ์œ„ํ—˜

5. ์ฝ”๋“œ ์ƒ์„ฑ ํŠนํ™” ํ•œ๊ณ„

  • ๋น„๊ฒฐ์ •์  ํ•จ์ˆ˜: ์ถœ๋ ฅ์ด ๋งค๋ฒˆ ๋‹ฌ๋ผ์ง€๋Š” ํ•จ์ˆ˜ ํ‰๊ฐ€ ์–ด๋ ค์›€
  • ํ•˜๋“œ์›จ์–ด ์˜์กด์  ํ•จ์ˆ˜: ํ™˜๊ฒฝ์— ๋”ฐ๋ผ ์ถœ๋ ฅ์ด ๋‹ฌ๋ผ์ง€๋Š” ๊ฒฝ์šฐ
  • Test-driven development์˜ ์ •ํ™•ํ•œ ์ž…์ถœ๋ ฅ ๋งคํ•‘ ํ•œ๊ณ„

Future directions

  1. Richer memory structures: vector-embedding databases, SQL databases, structured knowledge graphs (see the sketch below)
  2. Meta-Policy Reflexion: consolidating task-specific reflections into reusable rules
  3. Riding LLM progress: Reflexion's gains are expected to grow as base models improve
  4. Hybrid learning: combining with traditional RL for additional synergy
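
As a rough illustration of the first direction, long-term memory could be backed by an embedding index so that only the most relevant past reflections are retrieved into the prompt. The toy store below uses plain cosine similarity in memory; the `embed` callable stands in for any embedding model, and nothing here is from the paper's implementation.

import numpy as np
from typing import Callable, List

class ReflectionStore:
    """Toy vector store: embed reflections on write, retrieve the most similar ones on read."""

    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.texts: List[str] = []
        self.vectors: List[np.ndarray] = []

    def add(self, reflection: str) -> None:
        self.texts.append(reflection)
        self.vectors.append(self.embed(reflection))

    def search(self, query: str, k: int = 3) -> List[str]:
        if not self.texts:
            return []
        q = self.embed(query)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        top_indices = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top_indices]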

🔗 Related Work

Prior work

| Paper | Year | Relation |
| --- | --- | --- |
| ReAct: Synergizing Reasoning and Acting | 2022 | Basis of the Actor; interleaves reasoning and acting |
| Chain-of-Thought Prompting | 2022 | Step-by-step reasoning; used as an Actor model |
| Self-Consistency | 2022 | Samples diverse reasoning paths and picks the consistent answer |
| Self-Refine | 2023 | A similar iterative self-improvement approach |

Follow-up work

| Paper | Relation |
| --- | --- |
| LATS (Language Agent Tree Search) | Combines Reflexion with Monte-Carlo Tree Search |
| Tree of Thoughts | Considers multiple reasoning paths simultaneously |
| Meta-Policy Reflexion (MPR) | Consolidates reflections into reusable rules |
| MAR (Multi-Agent Reflexion) | Extends Reflexion to multi-agent settings |

Related frameworks

  • LangChain/LangGraph: supports the Reflexion pattern (tutorial available)
  • AutoGen: multi-agent Reflexion implementations
  • Swarms: provides a ReflexionAgent class

💻 Practical Notes

Self-Reflection Prompt Examples

General self-reflection prompt

You are an advanced reasoning agent that can improve based on
self-reflection. You will be given a previous reasoning trial
in which you were given access to relevant context and a
question to answer. You were unsuccessful in answering the
question either because you guessed the wrong answer with a
probability above the given threshold, or you used up your
set number of reasoning steps.

In a few sentences, diagnose a possible reason for failure and
devise a new, concise, high-level plan that aims to mitigate
the same failure. Use complete sentences.

Previous Trial:
{previous_trial}

Reflection:
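
In practice this template is simply filled in and sent to the model; a minimal usage sketch (the trial text below is a made-up placeholder, and the template is abbreviated to the part already shown above):

REFLECTION_PROMPT = """You are an advanced reasoning agent that can improve based on
self-reflection. ...

Previous Trial:
{previous_trial}

Reflection:"""

previous_trial = "Thought: ... Action: Search[...] Observation: ... Final answer: (incorrect)"
prompt = REFLECTION_PROMPT.format(previous_trial=previous_trial)
# send `prompt` to your LLM client and append the returned reflection to long-term memory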

์ƒ์„ธ ์ž๊ธฐ ์„ฑ์ฐฐ ํ”„๋กฌํ”„ํŠธ (์‹ค๋ฌด์šฉ)

You are an expert in {topic}. You have incorrectly answered the
following multiple-choice question. Your task is to reflect on
the problem, your solution, and the correct answer.

**Question**: {question}
**Your Answer**: {agent_answer}
**Correct Answer**: {correct_answer}

Please provide:
1. **Why you failed**: Explain why your answer was incorrect
2. **Error keywords**: List keywords describing your error (general → specific)
3. **Corrected solution**: Solve the problem step-by-step based on the correct answer
4. **Future instructions**: Create detailed instructions to avoid this error
5. **General advice**: List advice for similar problems

Be concise but capture all essential information.

์ฝ”๋“œ ์ƒ์„ฑ์šฉ ์ž๊ธฐ ์„ฑ์ฐฐ ํ”„๋กฌํ”„ํŠธ

You are a Python programming assistant. Your previous code failed the test cases.

**Task**: {task_description}
**Your Code**:
```python
{failed_code}
</code></pre>

<strong>Error Message</strong>: {error_message}
<strong>Failed Test Cases</strong>: {failed_tests}

Reflect on your mistake and provide:
1. Root cause of the failure
2. What edge cases you missed
3. Specific changes needed to fix the code
4. Lessons for future similar tasks

Implementation Pattern (LangGraph)

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, List

# State definition
class ReflexionState(TypedDict):
    task: str
    trajectory: List[str]
    reflections: List[str]
    trial: int
    max_trials: int
    success: bool

# Nodes
def actor_node(state: ReflexionState) -> ReflexionState:
    """Generate trajectory using LLM with memory context"""
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")

    prompt = f"""
    Task: {state['task']}
    Previous reflections: {state['reflections']}

    Generate a solution:
    """

    response = llm.invoke(prompt)
    state['trajectory'].append(response.content)
    return state

def evaluator_node(state: ReflexionState) -> ReflexionState:
    """Evaluate the latest trajectory (task-specific check, e.g., run unit tests)"""
    state['success'] = evaluate(state['trajectory'][-1])  # evaluate() must be supplied for your task
    return state

def reflection_node(state: ReflexionState) -> ReflexionState:
    """Generate self-reflection"""
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")

    prompt = f"""
    Failed attempt: {state['trajectory'][-1]}

    What went wrong and how to improve?
    """

    reflection = llm.invoke(prompt).content
    state['reflections'].append(reflection)
    state['trial'] += 1
    return state

def should_continue(state: ReflexionState) -> str:
    if state['success']:
        return END
    if state['trial'] >= state['max_trials']:
        return END
    return "reflect"

# Build graph
workflow = StateGraph(ReflexionState)
workflow.add_node("actor", actor_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("reflect", reflection_node)

workflow.set_entry_point("actor")
workflow.add_edge("actor", "evaluator")
workflow.add_conditional_edges(
    "evaluator",
    should_continue,
    {"reflect": "reflect", END: END}
)
workflow.add_edge("reflect", "actor")

app = workflow.compile()
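
Invoking the compiled graph might look like the following; note that `evaluate()` used by `evaluator_node` above is task-specific and must be supplied (e.g., a test runner), and the state fields mirror the `ReflexionState` definition.

initial_state: ReflexionState = {
    "task": "Write a function that returns the n-th Fibonacci number.",
    "trajectory": [],
    "reflections": [],
    "trial": 0,
    "max_trials": 5,
    "success": False,
}

final_state = app.invoke(initial_state)
print("Succeeded:", final_state["success"])
print("Attempts used:", final_state["trial"] + 1)
print("Reflections collected:", len(final_state["reflections"]))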

์‹ค๋ฌด ์ ์šฉ ํŒ

1. ์ ์ ˆํ•œ ์‹œ๋„ ํšŸ์ˆ˜ ์„ค์ •

  • ์ฝ”๋”ฉ ํƒœ์Šคํฌ: 5-10ํšŒ
  • ์˜์‚ฌ๊ฒฐ์ •: 10-15ํšŒ
  • ์ถ”๋ก : 3-5ํšŒ

2. ํ”ผ๋“œ๋ฐฑ ํ’ˆ์งˆ ํ™•๋ณด

  • ๊ฐ€๋Šฅํ•˜๋ฉด ๊ตฌ์ฒด์ ์ธ ์™ธ๋ถ€ ํ”ผ๋“œ๋ฐฑ ํ™œ์šฉ (ํ…Œ์ŠคํŠธ ์—๋Ÿฌ ๋ฉ”์‹œ์ง€ ๋“ฑ)
  • LLM ํ‰๊ฐ€ ์‹œ ๋ช…ํ™•ํ•œ ํ‰๊ฐ€ ๊ธฐ์ค€ ์ œ๊ณต

3. ๋น„์šฉ ์ตœ์ ํ™”

  • Self-reflection์— ์ž‘์€ ๋ชจ๋ธ ์‚ฌ์šฉ ๊ณ ๋ ค
  • ๋ถˆํ•„์š”ํ•œ ์‹œ๋„ ์ค„์ด๊ธฐ ์œ„ํ•œ early stopping ๊ตฌํ˜„

4. ์ ์šฉ ์ ํ•ฉ์„ฑ ํŒ๋‹จ

  • ์ ํ•ฉ: ๋ณต์žกํ•œ ํƒœ์Šคํฌ, trial-and-error ๊ฐ€๋Šฅ, ๋ช…ํ™•ํ•œ ํ”ผ๋“œ๋ฐฑ ์‹ ํ˜ธ ์กด์žฌ
  • ๋ถ€์ ํ•ฉ: ๋‹จ์ˆœ ํƒœ์Šคํฌ, ๋‹จ์ผ ์‹œ๋„๋กœ ์ถฉ๋ถ„, ํ”ผ๋“œ๋ฐฑ ๋ชจํ˜ธ

๐Ÿท๏ธ Tags

#AIAgent #SelfReflection #VerbalRL #ReinforcementLearning #LLM #NeurIPS2023 #CodeGeneration #ReAct #ChainOfThought #EpisodicMemory #LanguageAgent #SelfImprovement #HumanEval #AlfWorld #HotPotQA #PromptEngineering #AgenticAI #TrialAndError


📚 References

  • arXiv paper
  • NeurIPS 2023 Proceedings
  • GitHub Repository
  • OpenReview
  • LangGraph Reflexion Tutorial
  • Prompt Engineering Guide – Reflexion
  • Reflecting on Reflexion (author's blog)