Upon further inspection, the previous fine-tuned model was terrible and pretty useless beyond the confines of questions that were very similar to the training data. For this to be more useful across a broader range of topics and different styles of questions, we will need to do more work.
The core problem here is called overfitting, a fundamental issue in how machine learning models learn. Put simply, an overfitted model memorises the answers from its training data and gets worse at dealing with new data.
You only get accurate predictions if the machine learning model generalises to all types of data within its domain. Overfitting occurs when the model cannot generalise and instead fits too closely to the training dataset. Overfitting can happen for several reasons, such as:
The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
The training data contains large amounts of irrelevant information, called noisy data.
The model trains for too long on a single sample set of data.
The model complexity is high, so it learns the noise within the training data.
In the case of my simple test example, probably all of the above applies. My dataset was tiny, and there were not enough examples to show a clear pattern of what I wanted. Even if I had added more examples of the same kind, the structure of my data was probably not consistent enough, which introduces noise.
So why is this bad? Consider the example where a machine learning algorithm predicts a student's academic performance and graduation outcome by analysing several factors like family income, past academic performance, and academic qualifications of parents. However, the training data only includes candidates from a specific gender or ethnic group. In this case, overfitting causes the algorithm's prediction accuracy to drop for candidates whose gender or ethnicity falls outside of that training dataset.
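Before touching any settings, one practical way to catch overfitting is to hold back a portion of the examples as a validation set, so that the fine-tuning job can report a validation loss alongside the training loss. Here is a minimal sketch of such a split, assuming the examples live in a JSONL file; the file names are hypothetical placeholders.

```python
import json
import random

# Hypothetical file names; point these at wherever your JSONL examples actually live.
SOURCE_FILE = "marking_examples.jsonl"
TRAIN_FILE = "marking_train.jsonl"
VALIDATION_FILE = "marking_validation.jsonl"

with open(SOURCE_FILE, "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Shuffle so the validation set is not just the last few examples written.
random.seed(42)
random.shuffle(examples)

# Hold back roughly 20% of the examples for validation.
split = int(len(examples) * 0.8)
train, validation = examples[:split], examples[split:]

for path, subset in [(TRAIN_FILE, train), (VALIDATION_FILE, validation)]:
    with open(path, "w", encoding="utf-8") as f:
        for example in subset:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

print(f"{len(train)} training examples, {len(validation)} validation examples")
```

If the validation loss keeps climbing while the training loss keeps dropping, the model is memorising the training data rather than generalising from it.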
There are various parameters that can be adjusted before fine-tuning to help reduce overfitting (but the biggest problem here is the dataset). For reference, the available settings are:
Epochs: How many times the model goes through your entire training dataset.
More epochs → more learning, but also a higher risk of overfitting.
Example: 3 epochs means the model saw all your data 3 times.
Batch size: How many training examples the model processes before updating its internal weights.
Batch size = 1 → updates after every single example (slower, noisier learning).
Larger batch size → smoother, more stable training, but needs more memory.
Learning rate (LR) multiplier: Scales how big each learning step is when the model updates its parameters.
LR multiplier = 1 → default (safe) learning rate.
Higher = learns faster but can be unstable; lower = learns slower but more stable.
It is also possible for underfitting to occur, which is when the model simply did not learn much at all from training. In that case, either train for more epochs (e.g., 3 → 4 or 5), try a higher LR multiplier (1 → 1.5 or 2), or fix the training data again to be more consistent.
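For reference, here is a rough sketch of how these settings can be passed when creating a fine-tuning job through the OpenAI Python SDK. The base model name and file IDs are placeholders, the values shown simply mirror the defaults discussed above, and depending on the SDK version the hyperparameters may need to be nested under the newer `method` parameter instead.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder IDs returned by a prior client.files.create(..., purpose="fine-tune") upload.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",        # base model to fine-tune (placeholder)
    training_file="file-TRAINING_ID",      # placeholder ID of the uploaded training JSONL
    validation_file="file-VALIDATION_ID",  # optional held-out set for spotting overfitting
    hyperparameters={
        "n_epochs": 3,                     # passes over the full dataset
        "batch_size": 1,                   # examples per weight update
        "learning_rate_multiplier": 1.0,   # scales the step size of each update
    },
)
print(job.id, job.status)
```

Supplying a validation file here is also what makes the overfitting check from earlier possible, since the dashboard can then report validation loss alongside training loss as the job runs.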
Here is another attempt at fine-tuning, but this time I went for the Direct Preference Optimization (DPO) method. The details of this highly unsuccessful attempt can be found here.
Compared to Supervised Fine-Tuning (SFT) which uses labeled input-output pairs to teach a model specific tasks and formats, DPO uses human preference data (preferred vs. rejected responses) to align the model with abstract qualities like helpfulness or harmlessness. SFT is ideal for imitation-based tasks like instruction following and summarisation, whereas DPO excels at optimising for subjective human preferences that are hard to define with correct examples, such as a specific brand voice.
Unfortunately, it was even more difficult to create a good dataset for DPO that clearly highlights the consistent pattern of marking required. The training data needs two example responses for each input prompt: a preferred (good) one and a non-preferred (rejected) one. Despite providing over 30 sets of examples for training (admittedly still a very small number), the results did not look good at all.
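For context, each line of the DPO training file pairs one prompt with a preferred and a non-preferred response. Below is a minimal sketch of a single, entirely made-up example in what I understand to be OpenAI's preference fine-tuning JSONL format; the question, marks, wording, and file name are all placeholders.

```python
import json

# One made-up DPO / preference fine-tuning example. Everything here is illustrative only.
example = {
    "input": {
        "messages": [
            {
                "role": "user",
                "content": "Mark this answer out of 3: 'p = mv = 0.5 * 8 = 4 kg m/s'",
            }
        ]
    },
    # The response we want the model to prefer.
    "preferred_output": [
        {
            "role": "assistant",
            "content": "3/3. Correct formula, correct substitution, correct value and unit.",
        }
    ],
    # A plausible but undesirable response (too lenient or wrongly penalising).
    "non_preferred_output": [
        {
            "role": "assistant",
            "content": "2/3. The working is fine but the unit looks wrong.",
        }
    ],
}

# Each training example occupies one line of the JSONL file.
with open("dpo_training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Even with 30+ such pairs, that is very little signal for the model to infer a consistent marking pattern from.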
Apparently, DPO was not the right tool for this job anyway, and the SFT method is much more suitable for what I was trying to achieve. But just before we go back to writing hundreds of bad answers on purpose, let's try one more thing while we have this fine-tuned model.
The Evals feature (also available on the OpenAI Dashboard) allows you to automatically run a large number of prompts through the model and see how well the outputs align with your fine-tuning goals. For our very small sample size, though, it is much easier to simply write some new prompts for the model and check the responses manually, as we did last time.
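For the manual route, a quick sketch like the one below is enough: loop over a handful of new prompts, send each one to the fine-tuned model, and eyeball the responses. The model ID and prompts are placeholders; substitute the ID reported when your fine-tuning job completed.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: use the model ID from your completed fine-tuning job,
# e.g. something of the form "ft:gpt-4o-mini-2024-07-18:org::abc123".
FINE_TUNED_MODEL = "ft:YOUR_FINE_TUNED_MODEL_ID"

# A few new marking prompts that were NOT in the training data.
test_prompts = [
    "Mark this answer out of 3: 'F = ma = 2 * 5 = 10 N'",
    "Mark this answer out of 2: 'Speed = distance / time = 100 / 20 = 5 km'",
]

for prompt in test_prompts:
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    print("PROMPT:  ", prompt)
    print("RESPONSE:", response.choices[0].message.content)
    print("-" * 60)
```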
Nevertheless, here is how our new DPO model turned out. The evaluator gave it a score of 14%, only passing 1 out of the 7 example responses. Most of the problems resulted from the marking being too lenient, or straight up giving marks for work that was not there. It also failed to understand slight variations in the way that formulas and steps can be written out, similar to how traditional code-based marking systems are too rigid to be useful.
It was nice to see in another example that the evaluator AI was able to recognise an equivalent unit being used. This is most likely due to the evaluator using a more powerful thinking model that is better able to understand these subtle nuances.
The student’s work shows the correct formula p = mv and the correct calculation (0.5 * 8 = 4), which yields the correct numerical answer but provides the unit as 'Ns' instead of the required 'kg·m/s'. One mark should be deducted for incorrect final units even though 'Ns' and 'kg·m/s' are dimensionally equivalent, the marking guide demands the specific unit.
On the other hand, the DPO model output only states that the units were incorrect.
The student correctly used the formula p = mv and performed the calculation correctly. However, the unit given is "Ns" instead of the correct unit "kg·m/s". To improve, the student should write the correct unit for momentum.