[AI Engineering] 8장. Dataset Engineering

내용이 거의 고이즈미 신지로 화법.
"1 더하기 1은 2다. 왜냐면 그것이 수학이기에." (끄덕)
그만큼 당연한 말을 늘어놓고 있음.

Data Curation
- curation
  - 원래 미술관에서 기획자들이 우수한 작품을 뽑아 전시하는 행위
  - 다른 사람이 만들어놓은 콘텐츠를 목적에 따라 분류하고 배포하는 일 (양질의 콘텐츠만을 취합·선별·조합·분류해 특별한 의미를 부여하고 가치를 재창출하는 행위)
- Data with COT
- Data with Tool use
- Data with Conversation
  - Single Turn
  - Multi Turn
Data Quality
- There's many factors for data quality
  - Relevant -> 이커머스는 얘가 중요한 듯
  - Aligned with task requirements -> 개발할 땐 이거 아닌가 ㅋㅋ
  - Consistent -> 제조업에서 가장 중요한 듯
  - Correctly formatted
  - Sufficiently unique
  - Compliant
- What's the most important factor for data quality in LLM?
  - 답이 없는 문제.
  - Aligned with task requirements 아닐까? LLM이 결국 input이 있을 때 output이 있는데, 요새는 코딩이나 수학같은 것들을 tuning함. 그러기 위해선 LLM이 얼마나 복잡성을 담고 있는가가 중요하다고 생각.
- What's the most important factor for data quality in ML?
Data Coverage
- A model's training data should cover the range of problems you expect it to solve
- Training language models to follow instructions with human feedback(2022)
Data Quantity
- There're 3 factors deciding how much data you need
  - Fine-tuning technique
    - Less total tokens: LoRA
    - Much total tokens: Full Fine-Tuning
  - Task complexity: More complex task -> More data you need
    - 경험 상 얘가 제일 중요함. complexity만 잘 되어있으면 LoRA를 하든 Full Fine-Tuning을 하든 잘 되더라. 근데 이거 안 되어 있으면 뭘 해도 안 됨.
  - Base model's performance: Smarter base model -> Less data you need
    - 당연한 얘기 아님???
Data Acquisition and Annotation
- The goal of data acquisition is to produce a sufficiently large dataset with the quality and diversity you need
- Pipeline
  - Raw data archiving
  - Preprocessing with LLM
    - 어떤 경우에 해야 할까?
- What Database?
  - text면 뭘 쓸까? -> 당연히 RDB는 안 씀. mongoDB 같은 NoSQL이나 FileStorage 쓸 듯? -> "내가 원하던 답이 다 나왔다!" ㅋㅋㅋ
  - Ontology 요새 많이 씀 -> 근데 이거 관리 너무 어렵지 않냐? -> 맞음. relationship을 사전에 모두 정의해야 함. Graph DB의 장점이 확장성인데, relationship을 사전에 다 준비한다는 게 모순임. 그래서 이 relation을 어떻게 잘 정의하냐가 숙제임.
  - Graph DB vs. Graph RAG
    - Q. 그냥 retreiver를 Graph DB로 두면 Graph RAG임??
    - A. 얘는 골 때리는 게 Embedding 할 때는 vector DB가 필요했는데, Graph RAG는 LLM을 필요로 함. LLM이 scheme를 보고 가장 적절한 driver를 연결해서 응답을 뿌림. (판단의 주체가 LLM) -> 세부 파라미터 튜닝하는 옵션이 없어서 일관성이 떨어짐. -> 일관성을 유지하려면 relationship을 잘 정의해야 하는데, 그러면 graph DB의 장점이 떨어짐. 진퇴양난에 빠졌다.
Model Distillation
- Large models require very very large resource usage
- Distillate knowledge from teacher model (large) to student model (small)
  - 작은 모델이 큰 모델처럼 생각을 하게 만듦. (서빙이 쉬운 작은 모델을 만들려는 시도.)
  - 가장 유명한 게 deepseek. (metrics 보면 정신이 나간 수준)
    - Deepseek Knowledge Distillation
    - SOTA performance in parameter scale
  - 그럼 large model의 환각 현상을 잡는 것이 중요할 듯.
  - 지식의 확장보다는 small model의 performance를 끌어올릴 때 쓰는 것. (large model을 넘기 어려움.)
- 근데 Distillation할 때 썼던 testset 이외의 다른 입력에 들어왔을 때도 비슷한 성능을 발휘할 수 있나? -> 맞음. 근데 이건 distillation 구상의 문제로 봐야 함.
- 애초에 Distillation의 목적이 specific task의 일을 처리하게 하려고 했는데, large model은 과하니까 small model 쓰려는 거 아닌가? 그렇다면 앞선 질문은 당연한 현상 아님? -> "질문 수준에 감동했다" (ㅋㅋㅋㅋ). 그럼에도 너무 특정 작업에만 몰두하는 경향이 있어서 주의할 필요가 있긴 함.
Data Augmentation and Synthesis
- AI generate training data for weak model
- Science, Math related topic: Froniter Models
- Non-English data: Language specific model
  - Upstage Solar가 한국어에서 대표적인 예시
- NVIDIA NeMo
- 문제점: synthesis가 편향적일 수 있음. 가장 대표적 예시가 deepseek. 중국어가 자꾸 나오거나, 중국 친화적인 답변이 나올 수 있음.
Deduplicate Data
- Active Learning
  - Find representive data for Dataset
- AI learns mapping between variable variance and output variance

저작자표시 비영리 (새창열림)

티스토리툴바