How many data points is a prompt worth?

TL;DR

This is a simple post that journals two things.

First, the long-held urge to conduct a basic experiment and see for myself ‘how many (training) data points a (training) prompt is worth’. Given a task we wish to train an LLM for, can we get comparable or even better models with a mere fraction of the labeled data points traditionally used, simply by framing the training examples differently? I share the experiment setup, code and observations from fine-tuning a series of models on the rudimentary task of topic classification, which illustrate the sample efficiency of instruction tuning with prompts.

Second, it consolidates reflections on what makes prompting powerful: (1) tracing how prompting an LLM for a completion lies at the heart of today’s class of LLM fine-tuning methods (instruction tuning, prefix tuning, RLHF, etc.), since it makes the fine-tuning objective more consistent with the pretraining objective; (2) how non-generative tasks can be creatively reformulated as generative tasks; (3) when to consider fine-tuning over prompt engineering.

Note: I’m not making a case for ‘LLM-maximalism’, where the perfect prompt and best-on-market LLM combo magically solves compound NLP tasks. On the contrary, to build reliable and feasible production-grade systems with the relatively untamed LLMs, I identify with: (i) LLM-Pragmatism [1] and Evaluation-Driven Development [2] (EDD?). Depending on your use case, LLM pragmatism could call for some combination of task-composability and control flows that coordinate different components and tools. Instead of aggregating all the steps a product feature performs into one big LLM prompt, decompose it into subtasks and delegate components to heuristic or more explainable ML solutions where best suited, reserving LLM generation for aspects such as summarisation or intent interpretation, which are highly abstractive in nature. The flow of control between components can be determined with agents or rules, as seen in implementation patterns such as RAG (Retrieval Augmented Generation) in search, or agent-based LLM workflows. Several other examples can be found in the blog posts `Building LLM applications for production` by Chip Huyen [3] and `Patterns for Building LLM-Based systems and products` by Eugene Yan [2]. (ii) It is important to understand when and how to fine-tune a suitable (possibly ‘small’) model for one’s use case.

Diving in

The genesis of this blog post was in early November 2022, after attending an AKBC workshop. During two of the keynote talks [4] (Prof. Heng Ji introducing [Code4Struct](https://arxiv.org/abs/2210.12810) [5], and Prof. Eneko Agirre on the [‘Pretrain, Prompt, Entail’](https://dl.acm.org/doi/abs/10.1145/3477495.3532786) [6] paradigm for information extraction), it first dawned upon me how simple, versatile and effective the tactic of prompting was, not just for in-context learning at inference time, but also for further fine-tuning LLMs on desired tasks. Prompts allowed the framing of tasks in a way that was closer to the pretraining objective (next token/sentence prediction) for certain LLM architectures. All of a sudden I felt as though I had been looking at so many of the NLP tasks around knowledge extraction unhelpfully for years. It made me think about some problems from first principles again.

But then the ChatGPT era happened, this newfound wonder with prompts suddenly felt antiquated, and personal genAI fatigue followed. However, I am ready to revive this article because (1) I had to scratch the itch, and (2) the intuition of prompt-guided learning is still fundamental to (almost all?) approaches to fine-tuning LLMs, yet it can often feel hidden beneath the abstractions of handy libraries and APIs around generative models.

The structure of the post is as follows:

  1. Experiment on the sample efficiency of prompt-based fine-tuning vs. conventional methods
  2. Reflections on fine-tuning LLMs with prompts
  3. A note on when to use prompt engineering vs. fine-tuning

Experiment

Experiment Objective: To quantitatively observe the sample efficiency of fine-tuning LLMs using prompt-completion pairs instead of conventional supervised classification on input-output labelled data points, I performed an experiment inspired by the paper ‘How many data points is a prompt worth’ [7]. The authors run a number of experiments on 6 different benchmark NLP tasks and find that prompt-based, instruction-tuned LLMs outperform the classifier-head-based models while needing 100x-1000x fewer prompt data points. In my version of the experiment, I kept things lightweight with a rudimentary causal LM (GPT-2) and a simple discriminative task.

Experiment design: The experiment trains a series of models on a text classification task under two training paradigms, using the same base generative LLM. The first setup uses the more conventional approach of fitting a classifier head on top of the LLM layers and training the classifier (with or without frozen LLM layers) to map input documents to a fixed set of output labels. The second setup directly tunes (all or part of) the LLM layers to autoregressively generate the correct output label, given a prompt containing an instruction and the input document text to be classified. Each model is trained for only 5 epochs.
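
To make the two input formats concrete, here is a minimal sketch of how a single labelled example could be rendered for each setup. It assumes the AG News topic-classification dataset from the Hugging Face `datasets` hub (the example article and the 4 labels in the table below match it); the field names and exact template are illustrative, not a verbatim extract of the experiment code.

```python
# Minimal sketch of how one labelled example is rendered for each setup.
# Assumes the AG News dataset; prompt template mirrors the one in the table below.
from datasets import load_dataset

LABELS = ["World", "Sports", "Business", "Sci/Tech"]  # AG News label order

train = load_dataset("ag_news", split="train")

def to_classifier_example(record):
    # Setup 1: raw article text in, integer class id out.
    return {"text": record["text"], "label": record["label"]}

def to_prompt_example(record):
    # Setup 2: the article embedded in a prompt template; the target label text
    # is appended so the causal LM learns to generate it as the next tokens.
    prompt = f"Given news article: {record['text']}. The topic is:"
    return {"text": f"{prompt} {LABELS[record['label']]}"}

print(to_prompt_example(train[0])["text"])
```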

Experiment Setup

Implementation notes

|  | Setup 1 | Setup 2 |
| --- | --- | --- |
| Input | Text of news articles. Example: “South Korea lowers interest rates South Korea’s central bank cuts interest rates by a quarter percentage point to 3.5 in a bid to drive growth in the economy.” | Text of the news article embedded in a prompt template. Example: “Given news article: South Korea lowers interest rates South Korea’s central bank cuts interest rates by a quarter percentage point to 3.5 in a bid to drive growth in the economy. The topic is:” |
| Target output | Integer id corresponding to 1 of the 4 topic classes (“Sports”, “World”, “Business”, “Sci/Tech”) | (Tokenised) text of the label: “Sports”, “World”, “Business”, “Sci/Tech” |
| ML model architecture | Pretrained GPT-2, followed by a multiclass classifier head. Only the last layer is trained | Pretrained GPT-2. All layers trainable for autoregressive output generation |
| Learning objective | Minimise cross-entropy loss of classifying labels | Next-token prediction objective of the causal autoregressive model |
| Inference | Argmax of a softmax over the output logits yields the predicted class id | Generate a sequence of up to 5 tokens most likely to follow the input prompt, then use a verbaliser to map the generation to one of the 4 topic labels (or none) |
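
As a rough illustration of the two configurations above, the sketch below uses Hugging Face Transformers. The freezing strategy and generation settings are assumptions matching the table, not a verbatim extract of the training code.

```python
# Sketch of the two model configurations: classifier head vs. prompt-based LM.
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Setup 1: GPT-2 body + multiclass classifier head; freeze the transformer so
# only the classification head is trained (cross-entropy over 4 class ids).
clf_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4)
clf_model.config.pad_token_id = tokenizer.pad_token_id
for param in clf_model.transformer.parameters():
    param.requires_grad = False  # the `score` head remains trainable

# Setup 2: plain GPT-2 LM head; all layers stay trainable and are optimised
# with the usual next-token-prediction loss over "prompt + label text".
lm_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Setup 2 inference: generate a few tokens after the prompt, then hand the
# completion to a verbaliser (see the limitations table) to pick a label.
prompt = "Given news article: South Korea lowers interest rates ... The topic is:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = lm_model.generate(
        **inputs, max_new_tokens=5, pad_token_id=tokenizer.pad_token_id
    )
completion = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
print(completion)
```
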
Figure 1: Test-set performance of models against number of data points they were trained on

Experiment Observations and Outcome

Experiment Limitations (largely a consequence of decisions made to save on training and infrastructure costs)

| Limitation | Elaboration |
| --- | --- |
| Under-training of models | I train each model (regardless of training dataset size) for only 5 epochs. This is well below convergence; related works often train models for > 50 epochs [7]. |
| Use of the more primitive GPT-2 rather than a powerful causal LLM | The comparative gains of the prompt-based learning (instruction tuning) method are expected to be starker when using a powerful FM like GPT-3 or the Llama models [16][17], and could have led to more decisive experiment results. |
| Choice of a rudimentary classification task | I chose the simplest task of sentence classification. It is not clear how my results extrapolate to more advanced tasks of, say, abstractive NLP. However, the simple classification setup was sufficient for the purpose of my experiment. |
| Did not use a few-shot prompting setup as a baseline | I do not provide any comparison to the performance reachable by pure prompting and in-context learning (the no-training scenario). |
| Evaluation of generative models is hard, and I did not build rigorous-enough evals | Mapping the output of generative models to discriminative labels is notorious in the community for lack of rigour. For instance, Hugging Face reports evaluation hurdles and holes while carrying out LLM evaluation on the benchmark MCQ task MMLU [15]. I used a verbaliser to map the first 5 generated tokens from the fine-tuned LLM to one of the 4 class labels using a scoring function such as BERTScore or ROUGE. These are not without limitations. |
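
For the last limitation listed above, here is one way such a verbaliser could look: a hedged sketch that scores the generated completion against each label string with ROUGE-1 (via the `rouge_score` package) and returns no label below a threshold. The scoring choice and threshold are illustrative, not the exact values used in the experiment.

```python
# Illustrative verbaliser: map free-form generated text to the closest topic label.
from rouge_score import rouge_scorer

LABELS = ["World", "Sports", "Business", "Sci/Tech"]
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def verbalise(generated_text: str, threshold: float = 0.1):
    """Return the best-matching label for the generated text, or None."""
    scores = {
        label: scorer.score(label, generated_text)["rouge1"].fmeasure
        for label in LABELS
    }
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score >= threshold else None

print(verbalise(" Business news about interest rates"))  # -> "Business"
```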

Main reflections

Some elementary realisations that grounded my understanding of what makes prompting powerful (besides the fact that today’s LLMs are trained on massive amounts of data; this is summed up well in [23] as "Transfer learning makes FMs possible, but it is the scale that makes them powerful").

Figure 2: Fine-tuning recipe of FMs

Prompt engineering vs fine-tuning

A practical question is whether relying on in-context learning through carefully engineered prompts (without any training) is sufficient, or whether to fine-tune all or part of an LLM’s weights on the task.

Figure 3: Comparing task performance against model parameter size for various transfer-learning techniques

There are a number of deciding factors in this situation:

  1. Size of the foundation LLM: Literature demonstrates that the number of model parameters is correlated with improvements in prompting performance. In Figure 3, excerpted from one of the initial works on prompt tuning by Lester et al. [11], the authors show that while the advantage of fine-tuning over prompting is large for small model sizes (<10B), this advantage is lost as model size approaches 10B parameters.
  2. Does your task require specialised knowledge and complex reasoning? The more specialised the knowledge required to perform the target task, the more useful it is if the LLM has been pretrained on knowledge (data and tasks) similar to the use-case domain. Prompting works well for simple tasks that rely on general knowledge or depend on pattern recognition (e.g. parsing a fixed structure from text). However, tasks requiring complex, multi-step reasoning or niche domain knowledge may need fine-tuning for reliable and consistent performance. This is especially true if you are not using an LLM pretrained in the same domain as your task. Hallucinations and factual inconsistency are known weaknesses of out-of-box FMs [19][20].
  3. Can you tune the instruction for the desired output instead of tuning the model? There is a subset of PEFT methods, such as prompt tuning and prefix tuning, that attach learnable parameters to the model input so as to maximise the probability of the expected task output. This sits in between prompt engineering and fine-tuning of LLM weights and might benefit your use case (see the sketch after this list).
  4. Do you have access to the actual model weights for fine-tuning? If you only access the LLM behind an API, you cannot fine-tune the model weights, soft prompts or prefixes.
  5. What is economically and technically feasible for your application in production? Consider the production use case. Relying on the LLM’s in-context learning often requires very large descriptive prompts with advanced patterns like few-shot examples or Chain of Thought. Attaching a long prompt template to every inference data point can become very expensive in production. In such cases, if your task is very specific, consider fine-tuning, as this bakes the prompt behaviour into the training examples and thus improves the zero-shot performance of the model on the task. On the flipside, is it feasible to access and maintain the infrastructure required for self-hosting and fine-tuning an LLM?
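
To make point 3 above concrete, below is a hedged sketch of soft prompt tuning with the Hugging Face `peft` library [14]. The base model, initialisation text and number of virtual tokens are illustrative choices, not a prescription.

```python
# Sketch of "tune the instruction, not the model": prompt tuning with `peft`.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the topic of this news article:",
    num_virtual_tokens=8,           # length of the learnable soft prompt
    tokenizer_name_or_path="gpt2",
)

peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()
# Only the virtual prompt embeddings train; GPT-2's own weights stay frozen.
```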

There is no ‘goldilocks prompt’ that dramatically improves performance on a given task. In [7] the authors insightfully experiment with the effect of prompt variability on LLM task performance. They find that the gains of any one prompt usually disappear over multiple runs, and conclude that prompt size and format are not dominant hyperparameters for the LLM. Note that the tasks studied in the paper are simpler, short-form NLP tasks; long-form abstractive reasoning tasks have been shown to benefit from more sophisticated prompts. A useful summary of the more advanced prompting techniques and where to use them can be found in [12]. As mentioned earlier, there are also PEFT methods that experiment with soft, tunable prompts. A recent paper, [LLMs as Optimizers](https://arxiv.org/abs/2309.03409) [13], uses meta-prompts to have the LLM build its own task-optimising prompt.

Evaluation is your true north star: In my experience, the single most effective pattern to guide development of LLM-centric applications is evaluation. Evaluate predictions from an LLM of your choice (e.g. ChatGPT, Llama, etc) on metrics that are relevant to your task. Are you able to reliably (reproducibly and consistently) get the performance you require from prompt engineering alone? Does this performance hold as you increase the size of the test set you evaluate on? If yes, then in-context learning using prompting might be reasonable in your use case. If not, then you may consider other approaches to adapt the LLM to your task.
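
As a minimal illustration of that workflow, the sketch below scores the same predictor on progressively larger test slices to check whether performance holds up. `predict_label` is a placeholder for whatever LLM call or pipeline you use; the slice sizes are arbitrary.

```python
# Evaluation-first check: does accuracy hold as the evaluated test set grows?
import random

def evaluate_at_sizes(predict_label, test_set, sizes=(50, 200, 1000), seed=0):
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        sample = rng.sample(test_set, min(n, len(test_set)))
        correct = sum(predict_label(ex["text"]) == ex["label"] for ex in sample)
        results[n] = correct / len(sample)
    return results  # e.g. {50: 0.92, 200: 0.88, 1000: 0.85}
```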

Takeaways

Acknowledgements

I am very grateful to Corey Harper for going above and beyond with his editorial review of the post, and to Raahul Dutta for his helpful feedback.

References

[1] Against LLM maximalism by Matthew Honnibal

[2] Patterns for Building LLM-Based Systems and Products by Eugene Yan

[3] Building LLM applications for production by Chip Huyen

[4] Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” The Journal of Machine Learning Research 21.1 (2020): 5485-5551.

[5] Code4Struct [arXiv:2210.12810]

[6] Agirre, Eneko. “Few-shot Information Extraction is Here: Pre-train, Prompt and Entail.” Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.[ACM]

[7] Scao, Teven Le, and Alexander M. Rush. “How many data points is a prompt worth?.” arXiv preprint arXiv:2103.08493 (2021). [arXiv]

[8] Liu, Pengfei, et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.” ACM Computing Surveys 55.9 (2023): 1-35. [ACM]

[9] Lialin, Vladislav, Vijeta Deshpande, and Anna Rumshisky. “Scaling down to scale up: A guide to parameter-efficient fine-tuning.” arXiv preprint arXiv:2303.15647 (2023).

[10] RLHF by Chip Huyen

[11] Lester, Brian, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient prompt tuning.” arXiv preprint arXiv:2104.08691 (2021). [arXiv]

[12] Advanced Prompt-engineering by Cameron R. Wolfe

[13] Yang, Chengrun, et al. “Large language models as optimizers.” arXiv preprint arXiv:2309.03409 (2023).

[14] Huggingface tutorial and implementation of Parameter Efficient Fine-Tuning methods: https://huggingface.co/blog/peft

[15] Huggingface blog “What’s going on with the Open LLM Leaderboard?”

[16] Liu, X., et al. “GPT Understands, Too.” arXiv preprint arXiv:2103.10385 (2021).

[17] Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).

[18] Chung, Hyung Won, et al. “Scaling instruction-finetuned language models.” arXiv preprint arXiv:2210.11416 (2022).

[19] Min, Sewon, et al. “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. “ arXiv preprint arXiv:2305.14251 (2023).

[20] Devaraj, Ashwin, et al. “Evaluating factuality in text simplification.” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022.

[21] Hu, Edward J., et al. “Lora: Low-rank adaptation of large language models.” arXiv preprint arXiv:2106.09685 (2021).

[22] Dettmers, Tim, et al. “Qlora: Efficient finetuning of quantized llms.” arXiv preprint arXiv:2305.14314 (2023).

[23] Bommasani, Rishi, et al. “On the opportunities and risks of foundation models.” arXiv preprint arXiv:2108.07258 (2021).