This is a simple post that journals two things.
First, the long urge I have had to conduct a basic experiment to see for myself ‘how many (training) data points a (training) prompt was worth’. Given a task we wish to train an LLM for, can we obtain comparable or even better models with a mere fraction of the labelled data points traditionally used, simply by transforming the training examples differently? I share the experiment setup, code and observations from fine-tuning a series of models on the rudimentary task of topic classification, which illustrate the sample efficiency of instruction tuning with prompts.
Second, it consolidates reflections on what makes prompting powerful: (1) tracing how prompting an LLM for completion is at the heart of today’s class of LLM fine-tuning methods (instruction tuning, prefix tuning, RLHF, etc.), because it makes the fine-tuning objective more consistent with the pretraining objective; (2) how non-generative tasks can be creatively reformulated as generative tasks; and (3) when to consider fine-tuning over prompt engineering.
Note: I’m not making a case for ‘LLM-maximalism’, where the perfect prompt and best-on-market LLM combo magically solves compound NLP tasks. On the contrary, to build reliable and feasible production-grade systems with the relatively untamed LLMs, I identify with (i) LLM Pragmatism [1] and (ii) Evaluation-Driven Development [2] (EDD?). Depending on your use case, LLM Pragmatism could call for some combination of task-composability and control flows that coordinate different components and tools.
The genesis of this blog post was in early November 2022, after attending an AKBC workshop where two of the keynote talks [4] left me with a newfound wonder for prompt-based learning. But then the ChatGPT era happened, that newfound wonder with prompts felt antiquated, and personal genAI fatigue followed. However, I am ready to revive this article because (1) I had to scratch the itch, and (2) the intuition of prompt-guided learning is still fundamental to (almost all?) approaches to fine-tuning LLMs, and can often feel hidden beneath the abstractions of handy libraries and APIs around generative models.
The structure of the post is as follows: first, the experiment (objective, design, setup and limitations); then, reflections on what makes prompting powerful; and finally, when to consider fine-tuning over prompt engineering.
Experiment Objective: To quantitatively observe the sample efficiency of fine-tuning LLMs on prompt-completion pairs instead of conventional supervised classification on input-output labelled data points, I performed an experiment inspired by the paper ‘How many data points is a prompt worth?’ [7]. The authors run a number of experiments on 6 different benchmark NLP tasks and find that prompt-based, instruction-tuned LLMs outperform the classifier-head-based models while needing 100x-1000x fewer data points. In my version of the experiment, I kept things lightweight with a rudimentary causal LM, GPT-2, and a simple discriminative task.
Experiment design: The experiment trains a series of models on a text classification task under two training paradigms, using the same base generative LLM. The former setup takes the more conventional approach of fitting a classifier head on top of the LLM layers and training the classifier (with or without frozen LLM layers) to map input documents to a fixed set of output labels. The latter setup directly tunes (all or part of) the LLM layers to autoregressively generate the correct output label, given a prompt containing an instruction and the input document text to be classified. Each model is trained for only 5 epochs.
Experiment Setup
Pretrained LLM: GPT-2 (for a cost- and resource-friendly experiment)
| | Setup 1 (classifier head) | Setup 2 (prompt-completion) |
|---|---|---|
| Input | Text of news articles. Example: “South Korea lowers interest rates South Korea’s central bank cuts interest rates by a quarter percentage point to 3.5 in a bid to drive growth in the economy.” | Text of the news article embedded in a prompt template. Example: “Given news article: South Korea lowers interest rates South Korea’s central bank cuts interest rates by a quarter percentage point to 3.5 in a bid to drive growth in the economy. The topic is:” |
| Target output | Integer id corresponding to 1 of the 4 topic classes (“Sports”, “World”, “Business”, “Sci/Tech”) | (Tokenised) text of the label: “Sports”, “World”, “Business”, “Sci/Tech” |
| ML model architecture | Pretrained GPT-2, followed by a multiclass classifier head; only the last layer is trained | Pretrained GPT-2; all layers trainable for autoregressive output generation |
| Learning objective | Minimise the cross-entropy loss of classifying labels | Next-token-prediction objective of the causal autoregressive model |
| Inference | Argmax of a softmax over the output logits yields the predicted class id | Generate a sequence of up to 5 tokens most likely to follow the input prompt, then use a verbaliser to map the generated text to one of the 4 topic labels (or none) |
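For concreteness, here is a minimal sketch (not my exact training code) of how the two setups can be instantiated with Hugging Face transformers; the checkpoint name and the freezing logic are illustrative assumptions rather than a faithful reproduction of my scripts.

```python
from transformers import GPT2ForSequenceClassification, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Setup 1: GPT-2 body + multiclass classifier head; only the head is trained.
clf_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4)
clf_model.config.pad_token_id = tokenizer.pad_token_id
for param in clf_model.transformer.parameters():  # freeze the pretrained LLM layers
    param.requires_grad = False
# clf_model.score (the classification head) stays trainable.

# Setup 2: plain causal LM with all layers trainable; the "label" is just more text
# that the model learns to generate after the prompt.
lm_model = GPT2LMHeadModel.from_pretrained("gpt2")
```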
Experiment Limitations (largely due to decisions made to save on training and infrastructure costs)
| Limitation | Elaboration |
|---|---|
| Under-training of models | I train each model (regardless of training dataset size) for only 5 epochs. This is well below convergence; related works often train models for > 50 epochs [7]. |
| Use of the more primitive GPT-2 rather than a powerful causal LLM | The comparative gains of prompt-based learning (instruction tuning) are expected to be starker when using a powerful FM such as GPT-3 or the Llama models [16][17], and could have led to more decisive experimental results. |
| Choice of a rudimentary classification task | I chose the simplest task of sentence classification. It is not clear how my results extrapolate to more advanced tasks, say abstractive NLP. However, the simple classification setup was sufficient for the purpose of my experiment. |
| No few-shot prompting baseline | I do not provide any comparison to the performance reachable by pure prompting and in-context learning (the no-training scenario). |
| Evaluation of generative models is hard, and I did not build rigorous-enough evals | Mapping the output of generative models to discriminative labels is notorious in the community for its lack of rigour. For instance, Hugging Face reports evaluation hurdles and holes while evaluating LLMs on the benchmark MCQ task MMLU [15]. I used a verbaliser to map the first 5 tokens generated by the fine-tuned LLM to one of the 4 class labels using a scoring function such as BERTScore or ROUGE (a simplified sketch of such a verbaliser follows this table). These scoring functions are not without limitations. |
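Below is a much-simplified verbaliser in the spirit of the one described above. It uses a plain string-similarity score from the standard library (difflib) as a stand-in for the BERTScore/ROUGE scoring I actually used, so treat it as an illustrative sketch rather than the exact implementation.

```python
from difflib import SequenceMatcher
from typing import Optional

LABELS = ["Sports", "World", "Business", "Sci/Tech"]

def verbalise(generated_text: str, threshold: float = 0.5) -> Optional[str]:
    """Map the free-form generated text (first ~5 tokens) to one of the topic labels.

    Returns None when no label is a convincing match."""
    text = generated_text.strip().lower()
    scores = {
        label: SequenceMatcher(None, text, label.lower()).ratio()
        for label in LABELS
    }
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score >= threshold else None

# e.g. verbalise(" Business news") -> "Business"; verbalise("the weather") -> None
```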
Some elementary realisations that grounded my understanding of what made prompting powerful (besides the fact that today’s LLMs are trained on massive amounts of data [23]):
Talk to and teach the model in its native tongue: Prompting a model for completion on a certain task during inference is like interrogating the model in the native tongue it was taught in. It is therefore more effective to continue to fine-tune the model with a learning objective similar to its pretraining objective (e.g. next-token prediction, denoising) on datasets reformatted as prompt-completion pairs. This may be common knowledge now, but I still have deep appreciation for this point. I would strongly recommend this excellent review paper on the different LM architectures, training objectives and their relationship with prompts for conducting various NLP tasks: Pre-train, Prompt, and Predict [8].
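To make this concrete, here is a minimal sketch (my own naming, not a library recipe) of how a labelled classification example can be reformatted into a prompt-completion pair for causal LM fine-tuning, with the loss masked so that only the completion (label) tokens are penalised; the training objective then stays identical to next-token-prediction pretraining.

```python
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def make_prompt_completion_example(article: str, label: str) -> dict:
    """Turn an (article, label) pair into input_ids/labels for causal LM fine-tuning."""
    prompt = f"Given news article: {article} The topic is:"
    completion = f" {label}"

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]

    # -100 is the ignore_index of the cross-entropy loss in transformers, so the
    # prompt tokens contribute nothing; the model is only trained to produce the label.
    input_ids = prompt_ids + completion_ids
    labels = [-100] * len(prompt_ids) + completion_ids

    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}
```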
All tasks can be viewed as generation: Almost all non-generative NLP tasks that were traditionally thought of as discriminative (e.g. topic classification, structured extraction (NER, EL, extractive QA), etc.) can be posed as generative tasks using prompt-completion frameworks. Instead of trying to modify the method (the hammer) to work with the task (the bolt), transform the task into a nail instead. Some examples:
- Event extraction as code completion: Code4Struct [5] represents an event type as a class, with event entities and relations as attributes of the class. They then prompt code FMs to complete the extraction of the remaining attributes from the text.
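As a toy illustration in the spirit of [5] (the class and text below are my own, not the paper’s actual schema), the extraction target can be expressed as code and a code FM asked to complete the instantiation:

```python
# A toy, Code4Struct-style prompt: the task schema is written as Python code and
# the model is asked to finish instantiating it from the accompanying text.
PROMPT = '''
class TransportEvent:
    """An event in which an agent moves an entity to a destination."""
    def __init__(self, agent: str, entity: str, destination: str):
        ...

# Text: "The airline flew the stranded passengers to Madrid."
# Event described in the text:
event = TransportEvent(
    agent="The airline",
'''
# A code FM prompted with PROMPT is expected to generate the remaining attributes,
# e.g. entity="the stranded passengers", destination="Madrid").
```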
The same fine-tuning recipe behind most of the recent successful LLM apps
The recent surge of successful prompt-based technologies built atop general-purpose Foundation Models (FMs), whether chat agents, coding assistants, or natural-language search tools, all inherit from the same 3-stage fine-tuning recipe shown in Figure 2.

A practical question is whether relying on in-context learning through carefully engineered prompts (without any training) is sufficient, versus fine-tuning all or part of an LLM’s weights on a task.
There are a number of deciding factors in this situation:

There is no ‘golden goldilocks prompt’ that dramatically improves performance on a given task: In [7] the authors insightfully experiment with the effect of prompt variability on LLM task performance. They find that the gains of any one prompt usually disappear over multiple runs, and conclude that prompt size and format are not a dominant hyperparameter for the LLM.
Evaluation is your true north star: In my experience, the single most effective pattern to guide development of LLM-centric applications is evaluation. Evaluate predictions from an LLM of your choice (e.g. ChatGPT, Llama, etc) on metrics that are relevant to your task. Are you able to reliably (reproducibly and consistently) get the performance you require from prompt engineering alone? Does this performance hold as you increase the size of the test set you evaluate on? If yes, then in-context learning using prompting might be reasonable in your use case. If not, then you may consider other approaches to adapt the LLM to your task.
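As a minimal sketch of this evaluation-first loop (the predict_topic callable is a placeholder for whichever LLM + prompt combination you are testing, not a real API):

```python
def evaluate_prompted_llm(predict_topic, test_set) -> float:
    """Accuracy of a prompt-driven classifier over a labelled test set.

    predict_topic: callable(text) -> predicted label string (wraps your LLM + prompt)
    test_set: iterable of (text, gold_label) pairs
    """
    correct, total = 0, 0
    for text, gold in test_set:
        correct += int(predict_topic(text) == gold)
        total += 1
    return correct / max(total, 1)

# If accuracy from prompting alone reliably clears your bar, and keeps doing so as the
# test set grows, in-context learning may suffice; otherwise consider fine-tuning.
```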
I am very grateful to Corey Harper for going above and beyond with his editorial review of the post, and to Raahul Dutta for his helpful feedback.
[1] Against LLM maximalism by Matthew Honnibal
[2] Patterns for Building LLM-Based Systems and Products by Eugene Yan
[3] Building LLM applications for production by Chip Huyen
[4] Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” The Journal of Machine Learning Research 21.1 (2020): 5485-5551.
[5] Wang, Xingyao, et al. “Code4Struct: Code Generation for Few-Shot Event Structure Prediction.” (2022). [arXiv]
[6] Agirre, Eneko. “Few-shot Information Extraction is Here: Pre-train, Prompt and Entail.” Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.[ACM]
[7] Scao, Teven Le, and Alexander M. Rush. “How many data points is a prompt worth?.” arXiv preprint arXiv:2103.08493 (2021). [arXiv]
[8] Liu, Pengfei, et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.” ACM Computing Surveys 55.9 (2023): 1-35. [ACM]
[9] Lialin, Vladislav, Vijeta Deshpande, and Anna Rumshisky. “Scaling down to scale up: A guide to parameter-efficient fine-tuning.” arXiv preprint arXiv:2303.15647 (2023).
[10] RLHF by Chip Huyen
[11] Lester, Brian, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient prompt tuning.” arXiv preprint arXiv:2104.08691 (2021). [arXiv]
[12] Advanced Prompt-engineering by Cameron R. Wolfe
[13] Yang, Chengrun, et al. “Large language models as optimizers.” arXiv preprint arXiv:2309.03409 (2023).
[14] Huggingface tutorial and implementation of Parameter Efficient Fine-Tuning methods: https://huggingface.co/blog/peft
[15] Huggingface blog “What’s going on with the Open LLM Leaderboard?”
[16] Liu, Xiao, et al. “GPT Understands, Too.” arXiv preprint arXiv:2103.10385 (2021).
[17] Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
[18] Chung, Hyung Won, et al. “Scaling instruction-finetuned language models.” arXiv preprint arXiv:2210.11416 (2022).
[19] Min, Sewon, et al. “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.” arXiv preprint arXiv:2305.14251 (2023).
[20] Devaraj, Ashwin, et al. “Evaluating factuality in text simplification.” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
[21] Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv preprint arXiv:2106.09685 (2021).
[22] Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv preprint arXiv:2305.14314 (2023).
[23] Bommasani, Rishi, et al. “On the opportunities and risks of foundation models.” arXiv preprint arXiv:2108.07258 (2021).