If you are still reading, then let's level up and talk about what else we can do here! Make sure to check out the background knowledge page for explanations of some of the jargon used below. We will be putting that theory into practice, building on what we did in the previous section.
After following the previous steps and getting the API working, an obvious problem emerged: how do I write a marking guide that works with the prompt so the model always marks the way I want? And will it be consistent enough to keep things fair and reliable while still keeping costs low?
As far as simple arithmetic is concerned, the most recent models are easily good enough to handle it without error, especially when given step-by-step worked solutions. The biggest issues I found came from exam marking conventions rather than content knowledge. For example, are there consequential marks for multi-step problems? Are formulas and units always required? In my test cases, I wanted the answer to both questions to be yes, but enforcing those rules with AI was much harder than I expected.
Here is a selection of early results from when my marking guide was less specific.
As you can see, any teacher who awards full marks for all of these is not teaching the right skills. I did manage to put enough comments in the marking guide to get it working in the end, but it required writing the guide in a very specific style. Ideally, you want to be able to rely on a colleague who understands the context and has some teaching experience to know what you want without needing every requirement spelled out explicitly. This is the perfect chance to try out the next tool available to us: fine-tuning!
Note that fine-tuning has very specific and limited use cases, and is not suitable for every problem. For example, imagine revising for an upcoming exam. Exam preparation generally involves knowing the content specific to the subject being examined, and knowing how to answer questions in the way the examiners are looking for. It is no use being good at structuring your answers if you don't know what to answer with. Similarly, in this case fine-tuning can only help with the how, not the what. No amount of 'checking your answers' will solve the problem if you simply don't know the answer at all. That's where the 'cheat sheets' come in, which is a job for RAG (retrieval-augmented generation) to handle.
For the purposes of this problem, I chose the supervised fine-tuning option as my first attempt. Not only does it need relatively few examples as training data, the data structure is also (relatively) simple to work with, although it still took me a few tries to get right; a rough sketch of the format is shown below.
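For anyone curious about what the training file looks like before building their own, here is a rough sketch of a single training example in the JSONL format that OpenAI's chat fine-tuning expects. The marking guide, student response, and assistant reply below are placeholders I made up to illustrate the shape, not lines from my actual training set; the system message is shortened for space, and in a real file each example sits on a single line.

{"messages": [
  {"role": "system", "content": "Please mark the provided answers based on the marking guide. ..."},
  {"role": "user", "content": "Marking Guide:\ns = ut + 1/2 at^2\ns = 0 + 1/2 (2)(5^2)\ns = 25m\nStudent response:\ns = 25m\nMarks scored out of 3:"},
  {"role": "assistant", "content": "1/3. The final answer is correct, but the formula and the substitution step are missing."}
]}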
Interestingly, the longest part of the training process was actually the check of the training data for abuse against OpenAI's usage policies. For reference, do not try to train on anything that breaches those policies.
When it is finished, all the hard work is done! Now it's time to test it out and see how much it has improved. The newly trained model should show up in the Chat window of the Playground. The easiest and fastest way to test is by simply copying some more (unseen!) examples into the Playground chat and seeing how it goes for yourself. It is important to exclude any examples that were also in the training data; otherwise you are 'teaching to the test' and you don't really know whether the training was effective.
If you want to look at more outputs at once, you can speed things up by adjusting our code in Apps Script a little so that Sheets uses our own fine-tuned model instead of the standard one.
Replace the 'function' section of the code with the following. The main changes are the name of the function and the model name, which you need to fill in yourself on line 4:
function GPTfine(prompt, temperature = 0, tokens = 1000) {
  const url = "https://api.openai.com/v1/chat/completions";
  const payload = {
    model: "insert your model name here and keep the double quotes",
    messages: [
      {
        role: "system",
        content: "Please mark the provided answers based on the marking guide. If answer is fully correct, simply respond with one word 'Correct'. If answer is not fully correct or is missing some required working out/details, refer to marking guide to give a suitable score out of the total marks available and then briefly explain how the answer can be improved."
      },
      {
        role: "user",
        content: prompt
      }
    ],
    temperature: temperature,
    max_completion_tokens: tokens
  };
  // The rest is unchanged from the GPT function in the previous section:
  // send the request and return the model's reply to the cell. This assumes
  // your API key is still stored in the same constant, called API_KEY here.
  const options = {
    method: "post",
    contentType: "application/json",
    headers: { Authorization: "Bearer " + API_KEY },
    payload: JSON.stringify(payload)
  };
  const response = UrlFetchApp.fetch(url, options);
  const data = JSON.parse(response.getContentText());
  return data.choices[0].message.content.trim();
}
Once you save and run this again in Apps Script, you should be able to start using the new fine-tuned model in Sheets by typing this in a cell: =GPTfine(E2)
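If you keep the marking guide and the student's response in separate cells rather than pasting the whole prompt into one cell, a helper formula along these lines can assemble the same prompt structure. The cell references B2 and C2 are just an example, not my actual layout, and CHAR(10) inserts a line break:

=GPTfine("Marking Guide:" & CHAR(10) & B2 & CHAR(10) & "Student response:" & CHAR(10) & C2 & CHAR(10) & "Marks scored out of 3:")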
Admittedly this is a very small sample size with an even more limited range of test scenarios, but preliminary tests with the most basic example show some promise.
When provided with the standard system message:
Please mark the provided answers based on the marking guide. If answer is fully correct, simply respond with one word 'Correct'. If answer is not fully correct or is missing some required working out/details, refer to marking guide to give a suitable score out of the total marks available and then briefly explain how the answer can be improved.
and the basic marking guide:
Marking Guide:
s = ut + 1/2 at^2
s = 0 + 1/2 (2)(5^2)
s = 25m
Student response:
s=ut+1/2 at^2
=25m
Marks scored out of 3:
Our original GPT-5 nano API in Sheets inaccurately scored it "Correct" 8 out of 10 times; the other 2 times it recognised that there was a missing substitution step. The newly trained model, by contrast, gave the correct 2/3 marks and a brief explanation all 10 times.
Despite the limitations of this test, I chose the example specifically because I never provided this type of situation in the training set. The training data had no scenario with formula + missing substitution + final answer.
The closest examples in the training data were:
Formula + substitution + answer (fully correct)
Substitution + answer only (missing the formula)
Answer only (missing both formula and substitution)
Formula + substitution + wrong answer
Formula + wrong substitution + answer
There are obviously many other places where students can go wrong, and more testing is needed to see whether the model can pick up on the more obscure or subtle problems. However, this at least shows some improvement and that the training has made a difference.
See the next page for results from further testing!