Tags:: #[[Research Paper]] #[[prompting [[Large Language Models (LLM)]]]] #[[reflection ([[Large Language Models (LLM)]])]]
Summary
Overview
SELF-REFINE is a method for improving outputs from large language models (LLMs) through iterative self-feedback and refinement. This approach uses the same LLM to generate an initial output, provide feedback, and refine it iteratively without the need for supervised training or additional data.
Key Findings
Performance Improvement: Evaluations using GPT-3.5 and GPT-4 across seven tasks show that SELF-REFINE improves performance by about 20%. Outputs are preferred by humans and score better on metrics.
Complex Task Handling: LLMs often struggle with complex tasks requiring intricate solutions. Traditional refinement methods need domain-specific data and supervision. SELF-REFINE mimics human iterative refinement, where an initial draft is revised based on self-feedback.
Iterative Process: The process uses two steps: FEEDBACK and REFINE, iterating until no further improvements are needed.
Specific Task Performance
Strong Performance:
Constrained Generation: Generating a sentence containing up to 30 given concepts. Iterative refinement allows correction of initial mistakes and better exploration of possible outputs.
Preference-based Tasks: Dialogue Response Generation, Sentiment Reversal, Acronym Generation. Significant gains due to improved alignment with human preferences.
Weaker Performance:
Math Reasoning: Difficulty in accurately identifying nuanced errors in reasoning chains.
Additional Insights
Avoiding Repetition: SELF-REFINE avoids repeating past mistakes by appending the entire history of previous feedback in the REFINE step.
Role-based Feedback: Suggestion to improve results by having specific roles for feedback, like performance, reliability, readability, etc.
Related Method: Providing a scoring rubric to the LLM with dimensions over which they should evaluate the output.
Specific Feedback Importance: Results are significantly better with specific feedback compared to generic feedback.
Iteration Impact: Results improve significantly with the number of iterations (i.e., feedback-refine loops) but with decreasing marginal improvements for each loop. In some cases, like Acronym Generation, quality could improve in one aspect but decline in another. Their solution was to generate numeric scores for different quality aspects, leading to balanced evaluation.
Model Size Impact: SELF-REFINE performs well for different model sizes, but for a small enough model (Vicuna-13B), it fails to generate feedback consistently in the required format, often failing even with hard-coded feedback.
Relevant [[ChatGPT]] conversations: here, here, here
For access to my shared Anki deck and Roam Research notes knowledge base as well as regular updates on tips and ideas about spaced repetition and improving your learning productivity, join "Download Mark's Brain".