
🦜🔗 Building a RAG System That Searches Documents Containing Images with LangChain (Multimodal RAG Cookbook)



image.png.678cf2d79e591f9ec4c6c10b9735858c.png

Introduction


This guide shows how to build a Multimodal RAG (multimodal retrieval-augmented generation) system using CLOVA Studio and LangChain. As vision models become commercially available at an accelerating pace, companies increasingly need to search and use their varied internal image-based data efficiently. In particular, more and more teams are extending existing text-centric RAG systems to cover image data as well. In this post, we use LangChain to implement a Multimodal RAG system that provides question answering over data in PDF format. The architecture of the system we are building is shown below.

image.png.6b614bc37e7a7c2252c310c70cefa5f6.png


This post introduces a Multimodal RAG architecture that can be implemented without multimodal embeddings. The approach uses a vision model to convert images into text, then embeds that text and uses it for retrieval. Through the LangChain framework, CLOVA Studio models can be integrated effectively with external vector databases such as Chroma and FAISS.

image.png.8ea66d38fc293fd3c36c838d431931b5.png


The whole process is organized so that you can follow along cell by cell (file name: multimodal_RAG.ipynb). The core aim of this guide is to present a general-purpose multimodal RAG implementation that companies working with image-based documents can adopt easily.

Version information
The example code below was verified on Python 3.12.2 and requires Python 3.9 or later. Refer to the following file to install all required modules.
requirements.txt

1. Prerequisites


① Installing the LangChain packages
Implementing a multimodal RAG system requires the LangChain framework and integration with the CLOVA Studio API. The recently released langchain-naver package lets you connect HCX-005, CLOVA Studio's latest vision model, to LangChain smoothly. Install the LangChain-related packages with the command below.

%pip install -qU openai langchain langchain-naver 

② Common module imports
Import the basic modules needed in advance.

import os
import getpass
import uuid
import re
from urllib.parse import urlparse
import http.client  # the http.client submodule must be imported explicitly to use HTTPSConnection
import json
import time

③ Getting an API key
You need a CLOVA Studio API key. You can issue one via "Profile > API Key > Test > Issue test app". The issued key is displayed only once and cannot be viewed again, so be sure to copy it and store it safely.

Go to "Explorer > Paragraph segmentation, Embedding > Create test app" and create a test app with a suitable name. This app is used later in the chunking and embedding steps.

image.png.2755d3dc863a68a683201f7d092cfd2b.png

Store the issued API key in an environment variable.

os.environ["CLOVASTUDIO_API_KEY"] = getpass.getpass("CLOVA Studio API Key: ")

④ Preparing the reference documents (PDF data)
This example uses the pages 'Tuning an AI Model: From Using Training Data to Improving Performance' and 'Let Your AI Take Action: Skills and Function Calling', converted to PDF, as data.
These documents contain not only plain text but also images in many forms, such as tables, graphs, diagrams, and code, which makes them well suited to testing how image-based information is actually used in retrieval.

You can swap in other PDF documents or edit the contents as needed. Save the prepared PDF documents in the data/ folder.

📁 cookbook/
├── multimodal_RAG.ipynb
├── data/
│   ├── 모델튜닝.pdf
│   └── 스킬.pdf

2. Preprocessing the documents


A multimodal RAG system has to split not only text but also other forms of data such as images, convert them into vectors, and use them for retrieval. Because PDF documents contain visual elements such as graphs and tables in addition to plain text, they call for more elaborate, multi-part preprocessing than text-based RAG.

① Extracting text and images from the PDF document (Load)
image.png.24bdcd65813f5649bd0380dbd166058d.png


You need to extract the text and the images from the PDF file separately. Representative libraries that currently support information extraction from multimodal documents include PyPDF, PyMuPDF, LlamaParse, Unstructured.io, and TorchMultimodal.
The libraries differ in implementation: text extraction accuracy, image handling, whether structure is preserved, and so on. It is therefore hard to say that any one library is superior across the board.
This example uses PyMuPDF to save images page by page and to structure the text.

%pip install pymupdf


The code below saves the images and text extracted from the PDF file into a designated folder (output_dir), given as a relative path. This lets you inspect the images and extracted text for each page directly. After running it, the overall directory structure looks like this.

📁 cookbook/
├── multimodal_RAG.ipynb
├── data/
│   ├── 모델튜닝.pdf
│   ├── 스킬.pdf
│   └── extracted_images_문서/
│       ├── merged_text.txt
│       ├── page_1_img_1.png
│       ├── page_2_img_1.png
│       ├── page_9_img_1.png
│       ├── page_10_img_1.png
│       ├── page_11_img_1.png
│       ├── page_12_img_1.png
│       ├── page_15_img_1.png
│       ├── page_15_img_2.png
│       └── page_16_img_1.png

The text is organized by page and later converted into LangChain Document objects for embedding. Images are saved as "page_{page_number}_img_{img_index}.{image_ext}". This naming convention makes it easy to trace which page an image came from, and it also works well as metadata later. The documents built this way serve as context for the retrieval-based question answering (RAG) system.

import fitz  # PyMuPDF
from langchain_core.documents import Document

def extract_documents_from_pdf(pdf_path: str, output_dir: str = "data/extracted_images_문서"):
    os.makedirs(output_dir, exist_ok=True)

    merged_text_path = os.path.join(output_dir, "merged_text.txt")
    merged_text = ""

    doc = fitz.open(pdf_path)
    documents = []

    for i, page in enumerate(doc):
        page_number = i + 1
        page_text = page.get_text("text").strip()
        images_info = []

        # Extract images
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = f"page_{page_number}_img_{img_index+1}.{image_ext}"
            image_path = os.path.join(output_dir, image_filename)

            with open(image_path, "wb") as img_file:
                img_file.write(image_bytes)

            images_info.append(image_path)

        # Convert to a LangChain Document
        documents.append(Document(
            page_content=page_text,
            metadata={
                "source": os.path.basename(pdf_path),
                "page": page_number,
                "images": ", ".join(images_info)
            }
        ))

        # Accumulate the merged text
        merged_text += f"\n\n--- Page {page_number} ---\n\n{page_text}"

    # Save the full text
    with open(merged_text_path, "w", encoding="utf-8") as f:
        f.write(merged_text)

    return documents, merged_text_path

pdf_path = "data/모델튜닝.pdf" # Enter the appropriate path to test with a different file
docs, merged_path = extract_documents_from_pdf(pdf_path)

print(f"Number of document pages extracted: {len(docs)}")
print(f"Merged text path: {merged_path}")
print(docs[0])  # Check one
Result
image.png.c6a8197a21ff504b10dad88d47b838e9.png

image.png.7e29f3d45ae81ecf1acc723ae9a6bbf0.png

② Summarizing images as text (Convert)
image.png.a8b0b507d07c75b3e271bf24207f79d8.png

Images extracted from a PDF are hard to use for retrieval or answer generation if they are merely saved. To turn image content into searchable information, we use CLOVA Studio's vision model to summarize the visual information as text. The generated descriptions are structured so that they can later serve as context in the RAG system.

2.1) Image preprocessing for the HyperCLOVA X vision model
The CLOVA vision model imposes the following restrictions on images.

Input conditions and upload specifications

  • Supported formats: PNG, JPEG, WEBP, BMP

  • File size limit: up to 20 MB per image

  • Maximum size: 2240 px or less on the long side

  • Aspect ratio limit (width:height): no more extreme than 1:5 or 5:1

Note: As of April 17, 2025, CLOVA Studio's HCX-005 model accepts at most one image per user turn, but a single request may contain messages with up to five images in total.
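
The limits listed above can be expressed as a small pre-flight check that reports which limit an image breaks. This is a minimal sketch and not part of the original notebook: the function name find_violations and the constant names are hypothetical, and 20 MB is assumed to mean 20 × 1024 × 1024 bytes.

```python
ALLOWED_FORMATS = {"PNG", "JPEG", "WEBP", "BMP"}
MAX_BYTES = 20 * 1024 * 1024   # assumption: 20 MB counted in binary megabytes
MAX_LONG_SIDE = 2240           # pixels on the longer side
MAX_RATIO = 5.0                # long side : short side


def find_violations(fmt, size_bytes, width, height):
    """Return the names of the documented limits that an image breaks."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    problems = []
    if fmt.upper() not in ALLOWED_FORMATS:
        problems.append("format")
    if size_bytes > MAX_BYTES:
        problems.append("file size")
    if max(width, height) > MAX_LONG_SIDE:
        problems.append("long side")
    if max(width, height) / min(width, height) > MAX_RATIO:
        problems.append("aspect ratio")
    return problems
```

An empty list means the image can be sent as is. The format, byte size, and dimensions can come from any image library; they are passed in as plain values here so the check stays independent of how the image is loaded.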


Images extracted from PDFs often have very high resolution or an unusually elongated aspect ratio. Feeding such images to CLOVA Studio as they are can cause an 'Invalid image ratio' error. To prevent these errors in advance, you need a preprocessing step that checks and resizes the images.

%pip install -qU Pillow

The function below takes a single local image path, copies the file to the output directory (outdir) unchanged if it meets the conditions, and otherwise resizes it and saves it there.

from PIL import Image
from pathlib import Path
import shutil

def check_and_resize_image_to_outdir(
    path: Path,
    outdir: Path,
    allowed_formats=("PNG", "JPEG", "WEBP", "BMP"),
    max_bytes=20 * 1024 * 1024,
    max_length=2240,
    max_ratio=4.5,
    save_format="PNG"
):
    try:
        # Check whether the file size exceeds the limit
        if path.stat().st_size > max_bytes:
            print(f"[✘] File too large: {path.name}")
            return

        with Image.open(path) as image:
            format = image.format.upper()
            if format not in allowed_formats:
                print(f"[✘] Unsupported format: {path.name} ({format})")
                return

            w, h = image.size
            ratio = max(w, h) / min(w, h)
            needs_resize = max(w, h) > max_length or ratio > max_ratio

            if not needs_resize:
                # Conditions met → copy as is
                dest = outdir / path.name
                shutil.copy(path, dest)
                print(f"[✓] Conditions met → copied: {path.name}")
                return

            # Compute the resized dimensions
            if ratio > max_ratio:
                if w > h:
                    new_w = min(w, max_length)
                    new_h = int(new_w / max_ratio)
                else:
                    new_h = min(h, max_length)
                    new_w = int(new_h / max_ratio)
            else:
                if w >= h:
                    new_w = min(w, max_length)
                    new_h = int(h * (new_w / w))
                else:
                    new_h = min(h, max_length)
                    new_w = int(w * (new_h / h))

            resized = image.resize((new_w, new_h), Image.LANCZOS).convert("RGB")
            dest = outdir / path.name
            resized.save(dest, format=save_format, optimize=True)
            print(f"[✔] Resized → saved: {dest.name} ({new_w}x{new_h})")

    except Exception as e:
        print(f"[✘] Processing failed: {path.name} → {e}")
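
Note that when the aspect ratio exceeds max_ratio, the function above sets the short side from the long side alone, so the picture is stretched. If distortion matters for your images, one alternative is to leave the pixels unchanged and pad the short side instead. This is an assumption of the sketch below, not something the original notebook does. The hypothetical helper padded_canvas_size only computes the canvas size; pasting the image onto that canvas is left out.

```python
import math


def padded_canvas_size(width, height, max_ratio=4.5):
    """Return (canvas_width, canvas_height) that fits a width x height image
    without scaling and keeps long side / short side <= max_ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width >= height:
        # Landscape or square: grow the height only if the image is too wide.
        min_height = math.ceil(width / max_ratio)
        return width, max(height, min_height)
    # Portrait: grow the width only if the image is too tall.
    min_width = math.ceil(height / max_ratio)
    return max(width, min_width), height
```

Padding keeps text inside charts and code screenshots at its original proportions, at the cost of some empty border. The long-side limit still has to be handled separately by proportional downscaling.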

The main code below builds a safe image set that meets the requirements of CLOVA Studio's vision model.

📁 cookbook/
├── multimodal_RAG.ipynb
├── data/
│   ├── 모델튜닝.pdf
│   ├── 스킬.pdf
│   ├── extracted_images_문서/    ← original images extracted from the PDF
│   └── filtered_images/          ← images that meet the conditions are saved here (output_dir)
from pathlib import Path

input_dir = Path("data/extracted_images_๋ฌธ์„œ")
output_dir = Path("data/filtered_images")
output_dir.mkdir(parents=True, exist_ok=True)

valid_exts = [".png", ".jpg", ".jpeg", ".webp", ".bmp"]
image_files = [p for p in input_dir.glob("*") if p.suffix.lower() in valid_exts]

print(f"Starting to process {len(image_files)} images in total")

for img_path in image_files:
    check_and_resize_image_to_outdir(img_path, outdir=output_dir)

Result
image.png.a0fa02fa21b38ee4c5bfb2988f407bed.png


2.2) Storing images with Ncloud Storage
CLOVA Studio's vision model takes images as web URLs, not as files on a local path. To use the model, you therefore first upload the extracted images to object storage (Ncloud Storage, S3, Google Drive, etc.) and then collect and organize the resulting URLs.

This cookbook uses Ncloud Storage. (See: Ncloud Storage guide)

Note: As of April 17, 2025, the Ncloud Storage product is offered as an OBT (Open Beta).
During the Open Beta, Ncloud Storage is free to use with no capacity limit. After the Open Beta ends, stored data volume and API requests will be billed. No SLA is guaranteed for the service during the Open Beta.

Create an API authentication key under Portal My Page > Account Management > Authentication Key Management.

image.png.3e7c599f42ca5da340ee7cc925dbf5aa.png
Set the generated Access Key and Secret Key as environment variables.
# Enter the keys issued by NAVER Cloud
os.environ["AWS_ACCESS_KEY_ID"] = getpass.getpass("NCP Access Key: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass.getpass("NCP Secret Key: ")

# Default region
os.environ["AWS_DEFAULT_REGION"] = "kr"

Ncloud Object Storage is compatible with the Amazon S3 API, and in Python the 'boto3' library is used to control it.

%pip install boto3

The following code creates a new bucket and uploads all the preprocessed images. A bucket name must be between 3 and 63 characters long and may contain only lowercase letters, digits, and hyphens (-). The example uses 'multi-rag' as the bucket name.

from glob import glob
import boto3
from botocore.client import Config
from botocore.exceptions import ClientError
import mimetypes

# Settings
BUCKET_NAME = "multi-rag"
LOCAL_FOLDER = "data/filtered_images"
ENDPOINT_URL = "https://kr.ncloudstorage.com"
REGION = os.environ["AWS_DEFAULT_REGION"]

ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

# Initialize the boto3 client
s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    endpoint_url=ENDPOINT_URL,
    region_name=REGION,
    config=Config(signature_version="s3v4")
)

# 1. Create the bucket
try:
    s3.head_bucket(Bucket=BUCKET_NAME)
    print(f"Bucket already exists: {BUCKET_NAME}")
except ClientError as e:
    if e.response['Error']['Code'] == '404':
        print(f"Bucket does not exist, creating it: {BUCKET_NAME}")
        s3.create_bucket(Bucket=BUCKET_NAME)
    else:
        raise

# 2. Collect the images
IMAGE_EXTENSIONS = ("*.jpeg", "*.jpg", "*.png", "*.bmp", "*.webp")
image_files = []

for ext in IMAGE_EXTENSIONS:
    image_files.extend(glob(os.path.join(LOCAL_FOLDER, ext)))

print(f"Found {len(image_files)} image files in total.")

# 3. Upload the images and store the URLs
url_list = [] # List that holds the results

for file_path in image_files:
    file_name = os.path.basename(file_path)

    try:
        # Upload
        s3.upload_file(file_path, BUCKET_NAME, file_name)

        # Guess the MIME type
        mime_type, _ = mimetypes.guess_type(file_name)
        if not mime_type:
            mime_type = "application/octet-stream"

        # Generate a signed URL
        signed_url = s3.generate_presigned_url(
            "get_object",
            Params={
                "Bucket": BUCKET_NAME,
                "Key": file_name,
                "ResponseContentDisposition": "inline",
                "ResponseContentType": mime_type
            },
            ExpiresIn=3600 # valid for 1 hour only
        )

        print(f"URL: {signed_url}")
        url_list.append(signed_url)  # append the URL string itself, not a set containing it

    except ClientError as e:
        print(f"Upload failed: {e}")


print("All images uploaded and links generated!")

Result
image.png.35a4ef3a989633cc224229a3be9269ae.png
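
The bucket naming rule stated before the upload code (3 to 63 characters; lowercase letters, digits, and hyphens only) can be checked before calling create_bucket. The sketch below encodes only the rule as written in this post; the function name is hypothetical, and the storage service may enforce further rules that are not covered here.

```python
import re

# 3 to 63 characters drawn from lowercase letters, digits, and hyphens.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9-]{3,63}")


def is_valid_bucket_name(name: str) -> bool:
    """Check a bucket name against the rule described in this guide."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None
```

Validating the name locally gives a clearer message than a rejected API request, for example when a name contains an underscore or uppercase letters.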


2.3) Summarizing images with the vision model
When analyzing PDF documents, you encounter images in many forms, such as infographics, graphs, tables, and code screenshots. Because the format and content of the information differ from image to image, the summarization step also needs an approach suited to the image type. Not only the prompt but also the accompanying parameter settings (config) have a large effect on summary quality. We present settings that gave relatively stable and consistent results in our experiments, but the right combination can vary with image characteristics and document domain. We therefore recommend adjusting the config values yourself to suit your prompt and image types.

Below is an example of a general-purpose prompt and parameter values commonly used when handling images in documents.

from langchain_core.messages import SystemMessage, HumanMessage
from langchain_naver import ChatClovaX

chat_llm = ChatClovaX(
    model="HCX-005"
)

# Image URL
image_url = url_list[-1]

# Build the system and user prompts
system_message = SystemMessage(
    content=(
        "You are an AI that analyzes the various kinds of images found in documents and generates text descriptions usable in a retrieval-based question answering (RAG) system.\n"
        "Images may be of many types, including infographics, tables, graphs, code screenshots, diagrams, and screen layouts. Write the summary according to the following criteria.\n"
        "- Clearly identify the subject and purpose of the image and summarize it in natural language.\n"
        "- If the image conveys a structure or flow, describe it in order. (e.g., steps, relationships, comparisons)\n"
        "- For tables, graphs, and numerical information, summarize only the overall trend and notable differences, and avoid listing numbers.\n"
        "- For code screenshots, summarize in terms of function and role, and include only key information such as function, variable, and module names.\n"
        "- Describe visual elements (color, shapes, layout, etc.) only briefly and only when needed to convey information.\n"
        "- If there is text extracted by OCR, organize and include its key content.\n"
        "- The description must include searchable key terms and consist of fact-based sentences without impressions or interpretation.\n"
        "- The final output must be a single paragraph of 3 to 5 sentences that can be used directly as context for a RAG system."
    )
)
human_message = HumanMessage(content=[
        {"type": "text", "text": "This image is a visual from a document. Please summarize its key information."},
        {"type": "image_url", "image_url": {"url": image_url}}
    ])

# Assemble the messages
messages = [
    system_message,
    human_message
    ]

# Parameter settings
config={
        "generation_config": {
            "temperature": 0.25,
            "repetition_penalty": 1.1
        }
    }

# Call the model
response = chat_llm.invoke(messages,config)
print("[CLOVA response]\n", response.content)

Result

[CLOVA response]
This image shows a visual composed of blue-toned graphics under the title 'Tuning'. The background is dark navy, and there is an explanatory text in Korean at the top. The text says that because prompts alone can only improve performance so far, performance can be improved further through model tuning (Tuning) using custom data obtained in-house.

The wave-shaped graph at the bottom represents performance improvement over time, and the labels '1st', '2nd', and '3rd' at each point indicate several rounds of improvement. There are also three circular icons at the bottom, each connected to the word 'Engineering' and an arrow, visually expressing that performance gradually improves as this engineering work is repeated. At the far right, a large circular button labeled 'Tuning' is emphasized, symbolizing the final performance optimization. Overall, the image appears to convey the importance of continuous improvement and optimization in a given process.

Below are example results generated by applying prompts specialized for each image type, arranged so that you can compare how the summarization approach changes with the type of image.

image.png.ece4431fd657784c1ac07576a85226cb.pngimage.png.1a866bdf64c6f21a834ca687a0876a97.png
The code below generates a summary for every image.
# List that holds the results
image_summary_results = []

# Iterate over URLs → build the prompt → call the model → store the result
for url in url_list:
    file_name = os.path.basename(url)
    clean_filename = file_name.split("?")[0]
    try:
        # Rebuild human_message by swapping in only the URL
        human_message.content[1]["image_url"]["url"] = url
        messages = [system_message, human_message]
        response = chat_llm.invoke(messages,config)

        # Store the result as a {file name: summary} dictionary
        image_summary_results.append({clean_filename: response.content})
        print(f"[✔] Saved: {url}")

    except Exception as e:
        print(f"[✘] Failed: {url} → {e}")

Result
image.png.0ca58371af6ea2bc3d0b660c45b9a2a7.png
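
The loop above derives the file name with os.path.basename(url) followed by split("?"), which relies on the query string containing no unescaped "/" characters. A slightly more defensive variant parses the URL first. This is a sketch under the assumption that the object key is the last path segment of the URL; the function name object_name_from_url is hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def object_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring the query string."""
    path = urlparse(url).path          # drops "?..." and "#..." parts
    return posixpath.basename(unquote(path))
```

Because the query string is removed before the path is split, signature parameters can contain any characters without affecting the extracted file name.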


2.4) Converting the image summaries into Document objects
The code below takes the description text (content) generated from each image, together with metadata such as the image's location, and converts it into LangChain's Document format. In the earlier image extraction step, file names were built to include the page number. This lets us construct the image's location information as metadata.

A Document converted this way can be vector-embedded in the same way as a text paragraph, so information extracted from images can be retrieved and reflected in answers just like text-based content. This is a key preprocessing step in the multimodal RAG architecture.

image_docs = []
for item in image_summary_results:
    # Extract the file name and summary text from each dictionary
    file_name = list(item.keys())[0]
    summary = item[file_name]

    # Extract the page number with a regular expression
    match = re.search(r'page_(\d+)_img_\d+\.\w+', file_name)
    page_number = int(match.group(1)) if match else None

    # Create a LangChain Document
    image_docs.append(Document(
        page_content=summary,
        metadata={
            "source": "모델튜닝.pdf",
            "page": page_number,
            "images": file_name
        }
    ))

print(f"Created {len(image_docs)} Documents in total")
print(image_docs[0].page_content)
print(image_docs[0].metadata)  # Check one

Result
image.png.19c0035d042e905202dd290e7e2768bf.png

③ Splitting into paragraphs (Chunking)
image.png.96228c48bbefde8a5d4953f945b9b90c.png

It is very important to split the information from the text and from the images into units suited to retrieval. Text that is too long lowers retrieval accuracy, while splitting too finely can break the context, so an appropriate splitting criterion is needed. Here we use the paragraph segmentation API provided by CLOVA Studio to produce document chunks that are divided naturally by meaning.

For more detail on the remaining three steps (Chunking, Embedding, Vector Store), see the cookbook 🦜🔗 Implementing Naive RAG with LangChain.


3.1) Document chunking

# -*- coding: utf-8 -*-

class CompletionExecutor:
    def __init__(self, host, api_key, request_id):
        self._host = host
        self._api_key = api_key
        self._request_id = request_id

    def _send_request(self, completion_request):
        headers = {
            'Content-Type': 'application/json; charset=utf-8',
            'Authorization': self._api_key,
            'X-NCP-CLOVASTUDIO-REQUEST-ID': self._request_id
        }

        conn = http.client.HTTPSConnection(self._host)
        conn.request('POST', '/testapp/v1/api-tools/segmentation', json.dumps(completion_request), headers)
        response = conn.getresponse()
        result = json.loads(response.read().decode(encoding='utf-8'))
        conn.close()
        return result

    def execute(self, completion_request):
        res = self._send_request(completion_request)
        if res['status']['code'] == '20000':
            return res['result']['topicSeg']
        else:
            print("[CLOVA response error]", res['status'])
            return 'Error'
        
file_path = "data/extracted_images_๋ฌธ์„œ/merged_text.txt"

with open(file_path, "r", encoding="utf-8") as f:
    text_content = f.read()

if __name__ == '__main__':
    completion_executor = CompletionExecutor(
        host='clovastudio.stream.ntruss.com',
        api_key="Bearer "+os.environ["CLOVASTUDIO_API_KEY"],
        request_id=str(uuid.uuid4())
    )

    chunked_docs = []

    for doc in docs:  # docs is the list of Documents extracted page by page
        segments = completion_executor.execute(
            # Parameters set with reference to the earlier blog post
            {"postProcessMaxSize": 100,   # Maximum number of characters a paragraph may have after post-processing (e.g., cut to 1000 characters or fewer)
            "alpha": -100,                # Segmentation sensitivity (default: 0.0 / -100 means automatic); larger values split more finely, smaller values split less
            "segCnt": -1,                 # Desired number of paragraphs (-1 for automatic splitting; an integer of 1 or more fixes the count)
            "postProcessMinSize": -1,     # Minimum number of characters a paragraph should have after post-processing (e.g., keep at least 300 characters)
            "text": doc.page_content,     # The original text to split
            "postProcess": True}          # Whether to post-process (True: even out paragraph lengths / False: use the model output as is)
        )

        if segments == 'Error':
            continue  # skip pages whose request failed

        # Runs inside the page loop so every page's segments are kept with that page's metadata
        for seg in segments:
            chunked_docs.append(Document(
                page_content=' '.join(seg),
                metadata=doc.metadata
            ))

    print(chunked_docs)
    print("Number of chunks:",len(chunked_docs))

Result
image.png.7fd49405b7902d17ef3ef8efed063fdd.png
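
The conversion from the segmentation result to chunks can be looked at in isolation. The code above treats the API result (topicSeg) as a list of segments, each a list of sentences, and joins each segment with spaces. The sketch below does the same with plain dictionaries instead of LangChain Document objects so that it runs without extra packages. The function name segments_to_chunks is hypothetical, and copying the metadata per chunk is a choice made for this sketch.

```python
def segments_to_chunks(segments, metadata):
    """Turn a list of sentence lists into chunk dicts that share page metadata."""
    if not isinstance(segments, list):
        return []                      # e.g. the 'Error' marker returned on failure
    chunks = []
    for sentences in segments:
        text = " ".join(s.strip() for s in sentences if s.strip())
        if text:
            # Copy the metadata so later edits to one chunk do not affect the others.
            chunks.append({"page_content": text, "metadata": dict(metadata)})
    return chunks
```

Giving each chunk its own metadata copy matters once chunk-specific fields are added later, since a shared dictionary would change for every chunk of the page at once.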


3.2) Image chunking
Unlike ordinary text, image descriptions are best left unchunked and kept as a single piece.

The reason is simple. An image description is usually short, and a single image carries a single unit of meaning. Cutting such content apart can break the context or blur the meaning. Treating it as one independent chunk also gives more stable results in terms of retrieval accuracy. Image descriptions are therefore not split further: each is built as 1 image summary = 1 Document and merged with the existing text chunks.

# Add image_docs to chunked_docs (the originals are left unchanged)
combined_docs = chunked_docs + image_docs

print(f"Total number of chunks: {len(combined_docs)}")

image.png.9001377ba63bf49d3c23026d446016e6.png

In the run shown here, chunking the text and images produced 19 chunks in total (10 text + 9 image). Let's print some of the generated chunks to see what they look like. Below are the first three.
# Print sample chunks
print("\nSample chunks (first 3):")
for i, chunk in enumerate(combined_docs[:3], 0):
    print(f"\nChunk {i+1}:")
    print(f"Content: {chunk.page_content}")
    print(f"metadata: {chunk.metadata}")
    print(f"Length: {len(chunk.page_content)} characters")

Result
image.png.45c31b81868c6ab198c9a7c5e57ca265.png

④ Converting to vector data (Embedding)
image.png.182867cdf886abe3f23b9e0e3d96b29a.png

Now it is time to convert the text split into paragraphs, along with the image descriptions, into vectors.

This example uses CLOVA Studio's text embedding model to generate the vectors. ClovaXEmbeddings in langchain-naver gives easy access to CLOVA Studio's Embedding and Embedding v2 APIs. Embedding v2 uses the bge-m3 model, which uses cosine distance as its distance metric for judging similarity. If no model is set, clir-emb-dolphin is used by default, so you must explicitly set the ClovaXEmbeddings model to bge-m3.

from langchain_naver import ClovaXEmbeddings

clovax_embeddings = ClovaXEmbeddings(model='bge-m3')  # set the embedding model

text = "This is an embedding usage example~"

clovax_embeddings.embed_query(text)

Result: [screenshot of the returned embedding vector]
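
To make the cosine-based comparison concrete, here is a minimal sketch in plain Python. It is not CLOVA Studio or LangChain code: `cosine_similarity`, `cosine_distance`, and the toy 3-dimensional vectors are illustrative assumptions, standing in for the much longer vectors an embedding model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # longer, but points the same way
doc_orthogonal = [0.0, 1.0, 0.0]      # shares no direction with the query

print(cosine_similarity(query, doc_same_direction))
print(cosine_distance(query, doc_orthogonal))
```

Because only the direction matters, a vector and a scaled copy of it are treated as maximally similar.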

⑤ Storing Vector Data (Vector Store)

To store the embedded vectors and search them efficiently later, you need a vector store.

This example uses Chroma and FAISS, which are easy to run in a local environment. Refer to LangChain's Vector DB comparison document and choose the solution that fits your development environment.

Adding all documents in a single add_documents() call triggers many individual embedding API requests in parallel internally, which can cause errors. To avoid this, the code adds documents one at a time with a time.sleep() pause between requests, throttling the rate and lowering the error rate.
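
The throttling idea can be sketched independently of any vector store. The helper below is a hypothetical illustration, not part of LangChain: `add_with_throttle`, `delay_seconds`, and `max_retries` are names assumed here, and `add_fn` stands in for a call such as a vector store's add_documents([doc]). It pauses between requests as described above and, as an assumed extension beyond the guide's code, retries a failed item after a longer pause.

```python
import time


def add_with_throttle(items, add_fn, delay_seconds=0.5, max_retries=2, sleep=time.sleep):
    """Call add_fn(item) for each item, pausing between calls.

    Returns the list of items that still failed after all retries.
    """
    failed = []
    for item in items:
        for attempt in range(max_retries + 1):
            try:
                add_fn(item)
                break
            except Exception as exc:
                if attempt == max_retries:
                    print(f"giving up on {item!r}: {exc}")
                    failed.append(item)
                else:
                    # Wait a little longer after each failed attempt.
                    sleep(delay_seconds * (attempt + 2))
        # Pause between items to keep the request rate low.
        sleep(delay_seconds)
    return failed


# Demo with a fake add function that fails once for "doc-2".
stored = []
seen_failures = set()


def flaky_add(item):
    if item == "doc-2" and item not in seen_failures:
        seen_failures.add(item)
        raise RuntimeError("simulated rate-limit error")
    stored.append(item)


failed_items = add_with_throttle(["doc-1", "doc-2", "doc-3"], flaky_add, delay_seconds=0.01)
print(stored, failed_items)
```

Returning the failed items lets the caller decide whether to retry them later or report them, rather than only printing the error.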

5.1) Chroma
Chroma is a Python-based open-source vector database that is easy to use and well suited to rapid prototyping. It also runs quickly in a local environment, so it is widely used in the early stages of development.

# Install Chroma
%pip install -qU langchain-chroma
import time

import chromadb
from langchain_chroma import Chroma

# Define the embedding model
clovax_embeddings = ClovaXEmbeddings(model='bge-m3')

# Create a local persistent client
client = chromadb.PersistentClient(path="./chroma_langchain_db")

# Prepare the collection (watch out for duplicate names!)
collection_name = "clovastudiodatas_docs"
client.get_or_create_collection(
    name=collection_name,
    metadata={"hnsw:space": "cosine"}
)

# Create the vector store object
vectorstore_Chroma = Chroma(
    client=client,
    collection_name=collection_name,
    embedding_function=clovax_embeddings
)

# Add documents: the current approach is vectorstore.add_documents
print("Adding documents to Chroma vectorstore...")

# Add one document at a time, pausing between requests to throttle embedding API calls
for doc in combined_docs:
    try:
        vectorstore_Chroma.add_documents([doc])
        time.sleep(0.5)
    except Exception as e:
        print(f"[✘] Failed: {doc.metadata} → {e}")

print("All documents have been added to the vectorstore.")

Result: [screenshot of the Chroma ingestion output]

5.2) FAISS
FAISS is a library for large-scale vector search that offers very strong performance in speed and scalability. It is especially efficient when handling large document collections, and its search accuracy is also good.

# Install FAISS
%pip install -qU langchain-community faiss-cpu
import time

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

# Define the embedding model
clovax_embeddings = ClovaXEmbeddings(model='bge-m3')

# Create the FAISS index (1024 must match the bge-m3 embedding dimension)
index = faiss.IndexFlatIP(1024)  # inner-product-based search

# Create the FAISS vector store
vectorstore_FAISS = FAISS(
    embedding_function=clovax_embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)

# Add documents one at a time (embedding is handled automatically)
print("Adding documents to FAISS vectorstore...")

# Pause between requests to throttle embedding API calls
for doc in combined_docs:
    try:
        vectorstore_FAISS.add_documents([doc])
        time.sleep(0.5)
    except Exception as e:
        print(f"[✘] Failed: {doc.metadata} → {e}")

print("All documents have been added to FAISS vectorstore.")

Result: [screenshot of the FAISS ingestion output]
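
The FAISS index above ranks by inner product, while the embedding step describes cosine-based similarity. The two rankings agree when the vectors are L2-normalized; whether the vectors returned by ClovaXEmbeddings are already normalized is not stated in this guide, so treat that as an assumption to verify. The plain-Python sketch below uses hypothetical helper names and toy 2-dimensional vectors to show how the ranking can differ before and after normalization.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_by_inner_product(query, docs):
    """Return doc names sorted by descending inner product with the query."""
    scores = {name: inner_product(query, vec) for name, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


query = [1.0, 0.0]
docs = {
    "long_but_off_axis": [3.0, 3.0],   # large magnitude, 45 degrees from the query
    "short_but_aligned": [1.0, 0.1],   # small magnitude, nearly the same direction
}

# Raw inner product rewards vector length as well as direction.
raw_ranking = rank_by_inner_product(query, docs)

# After L2 normalization, inner product equals cosine similarity.
unit_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
unit_ranking = rank_by_inner_product(l2_normalize(query), unit_docs)

print(raw_ranking)
print(unit_ranking)
```

If the stored vectors turn out not to be unit length, normalizing them before indexing is one way to make an inner-product index behave like a cosine index.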

3. Question Answering


① Asking a Question

Once document embedding and the vector store are set up, you can enter an actual question, find the relevant content, and generate an answer. In LangChain, the RetrievalQA chain makes this flow simple to implement.

1.1) Creating the Retriever
First, build a chain that retrieves documents related to the user's question and generates an answer based on them.

The code below sets the system prompt and the user prompt separately. The system prompt instructs the LLM to answer from the retrieved documents (context) rather than from its own prior knowledge. The user prompt carries the document content together with the question. When a question is entered, the chain retrieves the relevant documents and then generates an answer. The result includes not only the answer but also the documents that were referenced.

from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import RetrievalQA

# Build the system and user messages separately
system_template = (
    "You are a friendly AI assistant that performs question answering. Your task is to set aside everything you already know and answer the given question using only the given context. "
    "If you cannot find the answer in the given context, or you do not know the answer, reply `The given information does not contain an answer to the question.`"
)
user_template = (
    "The following is the retrieved document content:\n\n{context}\n\n"
    "Based on the information above, answer the following question:\n{question}"
)

prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template(user_template),
])

# Choose the vector store you want to use
retriever = vectorstore_Chroma.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.1, "k": 3}
)
# retriever = vectorstore_FAISS.as_retriever(
#     search_type="similarity_score_threshold",
#     search_kwargs={"score_threshold": 0.1, "k": 3}
# )

# Build the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=chat_llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True
)

# Run
question = "As the dataset grows, what happens to the probability of errors for the two continents?"
result = qa_chain.invoke({"query": question})

print("Question:", question)
print("Answer:", result["result"])  # the model's actual answer
for i, doc in enumerate(result["source_documents"]):  # documents consulted for the answer
    print(f"\n[Source document {i+1}]\nContent: {doc.page_content}\nMetadata: {doc.metadata}")
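
To show roughly what the retriever settings and the "stuff" chain type amount to, here is a plain-Python sketch. It does not call LangChain: `retrieve`, `build_user_prompt`, the scored sample documents, and the joining of documents with blank lines are assumptions made for illustration. It keeps documents whose relevance score is at least score_threshold, returns at most k of them, and places their text into the {context} slot of a user prompt shaped like the one above.

```python
def retrieve(scored_docs, score_threshold=0.1, k=3):
    """Keep docs scoring at least score_threshold, best first, at most k."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:k]]


USER_TEMPLATE = (
    "The following is the retrieved document content:\n\n{context}\n\n"
    "Based on the information above, answer the following question:\n{question}"
)


def build_user_prompt(docs, question):
    """'Stuff' all retrieved documents into a single context block."""
    context = "\n\n".join(docs)
    return USER_TEMPLATE.format(context=context, question=question)


# Hypothetical (text, relevance score) pairs, as a vector store might return them.
scored_docs = [
    ("text chunk A", 0.62),
    ("image summary B", 0.81),
    ("text chunk C", 0.05),     # below the threshold, dropped
    ("image summary D", 0.33),
    ("text chunk E", 0.12),     # passes the threshold but is cut by k=3
]

top_docs = retrieve(scored_docs)
prompt = build_user_prompt(top_docs, "What does the chart show?")
print(top_docs)
print(prompt)
```

With a low threshold such as 0.1, k is usually the binding limit; raising the threshold trades recall for fewer loosely related documents in the context.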

② Checking the Answer
With text alone, the model could not find the relevant information and gave only a limited answer. With the image summary text included as well, it generated a much more specific and relevant answer. This shows how a multimodal RAG system improves retrieval quality and answer accuracy.
[Screenshots of the generated answers]

Closing Remarks


In this cookbook, we built a Multimodal RAG system that raises retrieval accuracy by using image-based information alongside text. Going beyond simply embedding documents, the system summarizes the visual information in images with a vision model, stores the summaries in a vector DB, and uses them for retrieval, which enables richer question answering. In particular, the example in which a question that text alone could not adequately answer was answered accurately through image-based documents shows the potential of structuring and using unstructured data effectively.

We hope this guide helps you build a multimodal RAG system easily and understand how vision models are used in practice.
