Limitations of LLM-as-a-Judge

youngerjesus 2024. 9. 6. 20:01

Types of LLM-as-a-Judge:

Pairwise comparison: the judge picks which of two responses is better.
Single answer grading: the judge assigns a score to a single response.
Reference-guided grading: the judge is given a reference answer to grade against.

Biases of LLM-as-a-Judge:

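Position bias:
- The LLM favors certain positions over others.
- This is not unique to LLMs; humans and other ML systems show the same tendency.
- LLM judges tend to prefer the first option by default.
- Measured as the consistency of the chosen option during evaluation, GPT-4 comes out at about 65% and GPT-3.5 at about 50%.
- Consistency here means whether the judge gives the same answer when the options are presented in a different order. In other words, GPT-4 gives the same answer about 65% of the time.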
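
The consistency figure described above can be computed from pairs of verdicts, one per ordering. The following is a minimal sketch, not the paper's evaluation code: the function name and the "A"/"B"/"C" verdict encoding (where the letter names the slot shown to the judge, and "C" is a tie) are assumptions made for illustration.

```python
from typing import List, Tuple


def position_consistency(verdict_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of comparisons where the judge picks the same answer in both orders.

    Each pair is (verdict_in_original_order, verdict_in_swapped_order).
    A verdict is "A", "B", or "C" (tie) and names the slot shown to the judge,
    not the underlying answer. This encoding is an assumption of the sketch.
    """
    if not verdict_pairs:
        raise ValueError("need at least one verdict pair")
    # After swapping, slot "A" holds the answer that used to be in slot "B",
    # so map the second verdict back to the original slots before comparing.
    unswap = {"A": "B", "B": "A", "C": "C"}
    consistent = sum(1 for first, second in verdict_pairs if first == unswap[second])
    return consistent / len(verdict_pairs)


# Made-up verdicts: the second pair ("A", "A") means the judge picked
# whichever answer came first both times, which is position bias.
pairs = [("A", "B"), ("A", "A"), ("C", "C"), ("B", "A")]
print(position_consistency(pairs))  # 0.75
```

A judge that always picks the first slot scores 0.0 on this measure, while a judge unaffected by order scores 1.0.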
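Verbosity bias:
- LLM judges tend to prefer longer responses, even when they are less accurate or not of higher quality.
- GPT-4 has reduced verbosity bias considerably and is affected only about 9% of the time, whereas GPT-3.5 picks the lengthened response about 90% of the time.
Self-enhancement bias:
- LLM judges were reported to show a preference for responses they generated themselves.
- GPT-4 reportedly favors its own answers by about 10%.
- That said, this doesn't seem to have much of an impact? The paper itself says it cannot draw a firm conclusion.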

Limited capability in grading math and reasoning questions:
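- LLM judges are reportedly weak at grading math and reasoning questions.
- Even on a basic math problem that the model could solve on its own, it gets swayed when incorrect answers are presented alongside the question.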

Addressing limitations:

Swapping positions:
- The fix is to call the judge a second time with the order swapped, and then average the two results.
- Another option is to randomize the order. This should work well on large datasets.
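
Both strategies above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `judge` callable and its `(question, first_answer, second_answer) -> "A" | "B" | "C"` signature are hypothetical stand-ins for a real LLM call, and the way the two calls are combined (count a win only when both orders agree, otherwise a tie) is one conservative choice rather than the only possible aggregation.

```python
import random
from typing import Callable

# Hypothetical judge signature: returns "A" (first shown wins),
# "B" (second shown wins), or "C" (tie).
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, ans_1: str, ans_2: str) -> str:
    """Judge twice with swapped order; return "1", "2", or "tie"."""
    first = judge(question, ans_1, ans_2)   # ans_1 is shown in slot A
    second = judge(question, ans_2, ans_1)  # ans_1 is shown in slot B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # disagreement (or an explicit tie) counts as a tie


def judge_random_order(judge: Judge, question: str, ans_1: str, ans_2: str,
                       rng: random.Random) -> str:
    """Judge once with a coin-flipped order; return "1", "2", or "tie"."""
    if rng.random() < 0.5:
        verdict = judge(question, ans_1, ans_2)
        return {"A": "1", "B": "2"}.get(verdict, "tie")
    verdict = judge(question, ans_2, ans_1)
    return {"A": "2", "B": "1"}.get(verdict, "tie")


# Toy judges standing in for an LLM call.
def always_first(question: str, a: str, b: str) -> str:
    return "A"


def prefers_longer(question: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "C"
    return "A" if len(a) > len(b) else "B"


print(judge_both_orders(always_first, "q", "x", "y"))                      # tie
print(judge_both_orders(prefers_longer, "q", "short", "a longer answer"))  # 2
```

With the purely position-biased toy judge, the two calls disagree and the result collapses to a tie, so the bias no longer decides the outcome. The randomized variant costs one call per comparison and only cancels the bias on average across many comparisons.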
Few-shot judge:
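- Adding examples reportedly increases consistency (getting the same result even when the order is swapped).
- However, higher consistency does not mean the judge is more accurate, and the examples may introduce new biases, so the authors say they are unsure about it.
- The concern is: hasn't the bias simply shifted from favoring the first answer to favoring the second?
Chain-of-thought and reference-guided judge:
- CoT improves an LLM's reasoning ability, so using it in the judge definitely helps.
- Even with CoT, though, the judge tends to follow the reasoning of an incorrect answer.
- So the recommendation is a reference-guided judge: on top of CoT, the prompt also includes a reference solution whose reasoning the judge can follow.
Fine-tuning a judge model:
- Use a judge model that has been fine-tuned for the task.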

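Multi-turn judge:

When judging a conversation, it is better to put the entire conversation history into a single prompt and have the judge evaluate that.
If the turns are passed in separately, the LLM reportedly has trouble locating the relevant part.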
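
As a rough sketch of the single-prompt approach described above, each assistant's full conversation can be rendered into one block and both blocks placed in the same prompt. The function names, the delimiter strings, and the `User:` / `Assistant X:` labels are assumptions chosen for illustration, not a format confirmed by the source.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)


def format_conversation(turns: List[Turn], label: str) -> str:
    """Render every turn of one assistant's conversation as a single text block."""
    lines = [f"[The Start of Assistant {label}'s Conversation]"]
    for user_msg, assistant_msg in turns:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant {label}: {assistant_msg}")
    lines.append(f"[The End of Assistant {label}'s Conversation]")
    return "\n".join(lines)


def build_multi_turn_prompt(system: str, turns_a: List[Turn], turns_b: List[Turn]) -> str:
    """Put the instructions and both complete conversations into one prompt."""
    return "\n\n".join([
        system,
        format_conversation(turns_a, "A"),
        format_conversation(turns_b, "B"),
    ])


system = "Please act as an impartial judge. Focus on the final turn."
turns_a = [("What is 2 + 2?", "4"), ("Now double it.", "8")]
turns_b = [("What is 2 + 2?", "4"), ("Now double it.", "6")]
print(build_multi_turn_prompt(system, turns_a, turns_b))
```

Because every earlier turn sits in the same prompt, the judge can see what "it" refers to in the second question without having to recall a separate call.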

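Example prompt using reference-guided grading:

The prompt tells the judge not to be swayed by biases.
It names position bias, verbosity bias, and a bias toward certain assistant names, and instructs the judge not to be influenced by any of them.
Spelling this out presumably has at least a small effect on the evaluation.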

[System]
Please act as an impartial judge and evaluate the quality of the responses provided by two
AI assistants to the user question displayed below. Your evaluation should consider
correctness and helpfulness. You will be given a reference answer, assistant A’s answer,
and assistant B’s answer. Your job is to evaluate which assistant’s answer is better.
Begin your evaluation by comparing both assistants’ answers with the reference answer.
Identify and correct any mistakes. Avoid any position biases and ensure that the order in
which the responses were presented does not influence your decision. Do not allow the
length of the responses to influence your evaluation. Do not favor certain names of the
assistants. Be as objective as possible. After providing your explanation, output your
final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]"
if assistant B is better, and "[[C]]" for a tie.

[User Question]
{question}
[The Start of Reference Answer]
{answer_ref}
[The End of Reference Answer]
[The Start of Assistant A’s Answer]
{answer_a}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{answer_b}
[The End of Assistant B’s Answer]
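
The placeholders in the template above (`{question}`, `{answer_ref}`, `{answer_a}`, `{answer_b}`) map directly onto Python's `str.format`. The following is a minimal sketch that fills only the user part of the template; the constant and function names are hypothetical, and the `[System]` text would be supplied separately.

```python
# User part of the reference-guided template, with the same placeholders.
USER_TEMPLATE = (
    "[User Question]\n"
    "{question}\n"
    "[The Start of Reference Answer]\n"
    "{answer_ref}\n"
    "[The End of Reference Answer]\n"
    "[The Start of Assistant A’s Answer]\n"
    "{answer_a}\n"
    "[The End of Assistant A’s Answer]\n"
    "[The Start of Assistant B’s Answer]\n"
    "{answer_b}\n"
    "[The End of Assistant B’s Answer]"
)


def build_reference_guided_prompt(question: str, answer_ref: str,
                                  answer_a: str, answer_b: str) -> str:
    """Fill the four placeholders; inserted values are not re-parsed by format()."""
    return USER_TEMPLATE.format(
        question=question,
        answer_ref=answer_ref,
        answer_a=answer_a,
        answer_b=answer_b,
    )


filled = build_reference_guided_prompt(
    question="What is 12 * 12?",
    answer_ref="144",
    answer_a="12 * 12 = 144",
    answer_b="The set {12} squared gives 124",  # braces in an answer are safe
)
print(filled)
```

Only the template string is parsed for `{...}` fields, so answers that themselves contain curly braces (code, JSON, set notation) pass through unchanged.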

CoT prompt for evaluation:

It does little more than fix the order of work: solve the problem yourself step by step first, then compare the answers.

[System]
Please act as an impartial judge and evaluate the quality of the responses provided by two
AI assistants to the user question displayed below. Your evaluation should consider
correctness and helpfulness. You will be given assistant A’s answer, and assistant B’s
answer. Your job is to evaluate which assistant’s answer is better. You should
independently solve the user question step-by-step first. Then compare both assistants’
answers with your answer. Identify and correct any mistakes. Avoid any position biases and
ensure that the order in which the responses were presented does not influence your
decision. Do not allow the length of the responses to influence your evaluation. Do not
favor certain names of the assistants. Be as objective as possible. After providing your
explanation, output your final verdict by strictly following this format: "[[A]]" if
assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
[User Question]
{question}
[The Start of Assistant A’s Answer]
{answer_a}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{answer_b}
[The End of Assistant B’s Answer]
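
Both prompts ask the judge to end with a verdict in the form `[[A]]`, `[[B]]`, or `[[C]]`, so the output can be parsed with a small regular expression. This is a sketch under one assumption worth flagging: when the explanation itself happens to contain a bracketed verdict, the last match is taken, since the prompt asks for the verdict after the explanation. The function name is hypothetical.

```python
import re
from typing import Optional

# Matches the strict verdict format requested by the prompt.
_VERDICT = re.compile(r"\[\[(A|B|C)\]\]")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return "A", "B", or "C" (tie), or None if no verdict is present."""
    matches = _VERDICT.findall(judge_output)
    # The prompt asks for the verdict after the explanation, so use the last match.
    return matches[-1] if matches else None


sample = (
    "Solving it myself: 12 * 12 = 144. Assistant A gets 144, "
    "Assistant B gets 124, which is wrong.\n[[A]]"
)
print(parse_verdict(sample))                 # A
print(parse_verdict("I cannot decide."))     # None
```

Returning `None` instead of guessing lets the caller decide whether to retry the judge or discard the sample when the format is not followed.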

References:

https://arxiv.org/pdf/2306.05685

License: Attribution, NonCommercial
