Defeating GPT-4 for the first time? 70-billion-parameter Xwin-LM tops Stanford’s AlpacaEval, and its 13B model beats ChatGPT
Source: Xinzhiyuan
Introduction: GPT-4’s top spot on Stanford’s AlpacaEval leaderboard has been snatched away by a dark horse.
The first model to surpass GPT-4 on AlpacaEval has appeared!
In June this year, when the AlpacaEval leaderboard was first released, GPT-4 took first place by a wide margin, with a win rate of over 95%.
Three months later, a dark horse named Xwin-LM has knocked GPT-4, long entrenched at the top of the list, out of first place.
project address:
Moreover, every Xwin-LM variant took first place among models of the same parameter size:
Xwin-LM-70B-V0.1: achieved a 95.57% win rate against Text-Davinci-003 on the AlpacaEval benchmark, ranking first on AlpacaEval and becoming the first model to surpass GPT-4 there; its win rate against GPT-4 is 60.61%.
Xwin-LM-13B-V0.1: achieved a 91.76% win rate on AlpacaEval, ranking first among all 13B models.
Xwin-LM-7B-V0.1: achieved an 87.82% win rate on AlpacaEval, ranking first among all 7B models.
Xwin-LM: 70 billion parameters beat GPT-4
Xwin-LM is fine-tuned from Llama 2. The project aims to develop and open-source alignment techniques for large language models, including supervised fine-tuning (SFT), reward modeling (RM), rejection sampling, reinforcement learning from human feedback (RLHF), and so on.
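As a concrete illustration of one item on that list, here is a minimal sketch of rejection sampling (best-of-N selection with a reward model). The function names and signatures are assumptions for illustration only, not Xwin-LM’s released implementation.
# Illustrative only, not Xwin-LM code: rejection sampling (best-of-N) with a
# reward model. "generate" and "reward_model" are assumed callables.
def rejection_sample(prompt, generate, reward_model, n=8):
    # Sample n candidate responses and keep the one the reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))

# Usage, given generate(prompt) -> str and reward_model(prompt, response) -> float:
# best = rejection_sample("Explain gravity to a child.", generate, reward_model)
The selected responses can then be used for further fine-tuning or preference optimization, which is the general role rejection sampling plays in alignment pipelines.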
1. Performance Evaluation
The researchers evaluated Xwin-LM’s win rates against Text-Davinci-003, ChatGPT, and GPT-4 on 805 questions.
As the results show, Xwin-LM’s 70B model dominated the leaderboard, taking first place on the AlpacaEval benchmark and beating GPT-4 for the first time, with a 60.61% win rate against GPT-4.
Xwin-LM’s other sizes, Xwin-LM-13B-V0.1 and Xwin-LM-7B-V0.1, also far outperform other models of the same size.
The Xwin-LM-7B-V0.1 even outperforms the larger Llama 2-13B-Chat.
2. Case comparison
In the first case, the researchers asked GPT-4 and Xwin-LM:
“I put a plate on top of the apples and took the plate from the kitchen to the dining room. Where are the apples now?”
GPT-4’s answer is:
“The apples are now in the dining room; when you took the plate from the kitchen to the dining room, they were under the plate.”
Here GPT-4 gave a reply that defies common sense: in the real world there is gravity, so the apples left under the plate stay where they were and do not travel to the dining room with the plate.
In contrast, Xwin-LM’s answer matches reality, and it also suggests a way to get the apples to the dining room:
“The apples are still in the kitchen, because you only moved the plate. The apples were under the plate, so if you want to take the apples to the dining room, you should either move the apples directly or put the apples on the plate and then carry the plate to the dining room.”
In the second case, the researchers asked GPT-4 and Xwin-LM:
“How do you say ‘good evening’ in French?”
GPT-4’s answer is:
“In French, ‘good evening’ is expressed as ‘bonsoir’.”
And Xwin-LM’s answer is:
"In French, ‘good evening’ is translated as ‘bonsoir.’ This greeting is used in the evening and at night, usually after 6 p.m.
It is a combination of ‘bon’ (meaning good) and ‘soir’ (meaning evening). To pronounce it, say ‘bone-swahr’".
Although GPT-4’s answer was correct, Xwin-LM not only answered the question correctly but also explained the composition and pronunciation of the French expression, giving the user richer information.
3. NLP tasks
The following table compares Xwin-LM with other LLMs on basic NLP tasks.
As the table shows, Xwin-LM-70B-V0.1 scores high across all tasks and performs especially well on MMLU (5-shot) and HellaSwag (10-shot).
Overall, Xwin-LM-70B-V0.1 is the strongest performer.
4. Inference
The researchers note that to obtain the desired results when running inference with Xwin-LM, the conversation template must be strictly followed.
Xwin-LM adopts the prompt format established by Vicuna and supports multi-turn dialogue.
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Hi! ASSISTANT: Hello.</s>USER: Who are you? ASSISTANT: I am Xwin-LM.</s>…
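To make the multi-turn format above concrete, here is a minimal sketch of a helper that assembles such a prompt from earlier turns; the helper name and structure are our own illustration and are not part of the Xwin-LM release.
# Illustrative helper (not from the Xwin-LM repo): build a Vicuna-style prompt
# for Xwin-LM from earlier (user, assistant) turns plus the new user message.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
)

def build_prompt(history, user_message):
    prompt = SYSTEM
    for past_user, past_assistant in history:
        # Each completed assistant turn ends with the </s> end-of-sequence token.
        prompt += f"USER: {past_user} ASSISTANT: {past_assistant}</s>"
    # The new turn ends with "ASSISTANT:" so the model continues from there.
    prompt += f"USER: {user_message} ASSISTANT:"
    return prompt

print(build_prompt([("Hi!", "Hello.")], "Who are you?"))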
HuggingFace Example
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
tokenizer = AutoTokenizer.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")

# Single-turn prompt in the conversation template shown above
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt")
samples = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
# Decode only the newly generated tokens (everything after the prompt)
output = tokenizer.decode(samples[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output)
# Of course! I'm here to help. Please feel free to ask your question or describe the issue you're having, and I'll do my best to assist you.
VLLM Example
Because Xwin-LM is fine-tuned from Llama 2, it also supports fast inference with vLLM.
from vllm import LLM, SamplingParams

# Same single-turn prompt as in the HuggingFace example
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)
llm = LLM(model="Xwin-LM/Xwin-LM-7B-V0.1")
outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    prompt = output.prompt  # the prompt that produced this completion
    generated_text = output.outputs[0].text
    print(generated_text)
AlpacaEval: easy to use, fast, low cost, and validated against human annotation
As an automatic LLM evaluation tool, AlpacaEval combines AlpacaFarm and Aviary.
On the one hand, it uses the same code as AlpacaFarm (caching, random permutations, hyperparameters); on the other hand, it uses a ranking prompt similar to Aviary’s.
At the same time, Aviary’s prompt has been modified to reduce the bias toward longer outputs.
The team says AlpacaEval has notable strengths:
Its agreement with the majority vote of human annotators is higher than that of a single human annotator
Its win rate is highly correlated with human annotations (0.94)
- Win rate
The win rate is the fraction of instructions on which the model’s output is preferred to that of text-davinci-003 (the reference).
Specifically, the output of the model being evaluated is first collected for each instruction in the AlpacaEval dataset and paired with the output of the reference model (text-davinci-003) on the same instruction.
These paired outputs are then fed to the automatic evaluator at the same time, which judges which one is better (i.e. the evaluator’s preference).
Finally, the preferences over all instructions in the dataset are averaged to obtain the model’s win rate relative to text-davinci-003; if the two outputs tie, it counts as half a preference.
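As a minimal sketch of that averaging step (illustrative only, not AlpacaEval’s actual code), each preference can be encoded as 1.0 for a win over the reference, 0.0 for a loss, and 0.5 for a tie:
# Illustrative only: average the encoded evaluator preferences to get the win rate.
def win_rate(preferences):
    return sum(preferences) / len(preferences)

# Example: 3 wins, 1 loss, and 1 tie over 5 instructions -> (3 + 0.5) / 5 = 0.70
print(win_rate([1.0, 1.0, 1.0, 0.0, 0.5]))  # 0.7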
Paper address:
- Limitations
While AlpacaEval provides an efficient way to compare models’ instruction-following ability, it is not a gold standard for comprehensively assessing a model’s capabilities.
As detailed in the AlpacaFarm paper, the win rates of automatic annotators are length-dependent. Although human annotation suffers from the same bias, it is unclear whether longer answers actually increase utility on downstream tasks.
Furthermore, AlpacaFarm’s evaluation set, although diverse, consists mainly of simple instructions.
Finally, AlpacaEval does not evaluate the safety of any model.
References: