Why do I need to learn about how AI works to use it effectively?
Depending on your use case, it may or may not make a big difference. Usually, the first major problem people run into when using LLMs is exceeding the context window (the maximum amount of conversation the model can keep track of at once) in their chat. Whether they are aware of it or not, this is a very common reason why things suddenly start going wrong.
For example, here are some context length comparisons:
ChatGPT (Free): 8,000 tokens / ~6,000 words
ChatGPT Plus / Teams: 32,000 tokens
ChatGPT Pro / Enterprise: 128,000 tokens
OpenAI API (e.g. GPT-4.1): up to 1,000,000 tokens
This may seem like plenty, but consider that in a long conversation, all input and output tokens, including tokens used for reasoning (when extended thinking is enabled), count toward the context window limit. Note also that some models have a separate maximum output token limit per response, which is distinct from the context window. Eventually the model will no longer be able to keep track of everything in the conversation history and will begin to produce inaccurate responses.
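To get a feel for how quickly a conversation fills the context window, you can count tokens yourself. Below is a minimal sketch using the tiktoken library; the 8,000-token limit, the example messages, and the choice of encoding are placeholder assumptions, and the count ignores small per-message overheads, so treat it as an estimate.

```python
# pip install tiktoken
import tiktoken

# Assumed encoding: newer OpenAI models use "o200k_base", older ones "cl100k_base".
enc = tiktoken.get_encoding("o200k_base")

CONTEXT_LIMIT = 8_000  # placeholder: the free-tier ChatGPT figure quoted above

conversation = [
    {"role": "user", "content": "Summarise this report for me..."},
    {"role": "assistant", "content": "Here is a summary of the report..."},
    # ...every earlier turn, including the model's own replies, counts too
]

# Rough token count: encode each message's text and sum the lengths.
total_tokens = sum(len(enc.encode(m["content"])) for m in conversation)
print(f"Approximate tokens used: {total_tokens} / {CONTEXT_LIMIT}")

if total_tokens > 0.8 * CONTEXT_LIMIT:
    print("Warning: approaching the context window; older turns may soon be forgotten.")
```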
1. Optimising LLM Accuracy compares and contrasts different approaches for improving LLM accuracy, including fine-tuning, retrieval-augmented generation (RAG) and prompt engineering. It is a great resource for understanding when fine-tuning is actually suitable.
a. Detailed guide on fine-tuning your own model. Make sure you understand the purpose and uses of fine-tuning (see above) before you go down this path.
b. The easiest way to try the RAG approach is to check out NotebookLM (a minimal do-it-yourself sketch of the same idea follows below).
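If you would rather see what RAG is doing under the hood than use NotebookLM, here is a rough sketch using the OpenAI Python SDK: embed your documents, retrieve the most relevant one for a question, and paste it into the prompt. The model and embedding names, the toy documents, and the top-1 retrieval are illustrative assumptions, not a production setup.

```python
# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A toy "knowledge base": in real use these would be chunks of your own documents.
documents = [
    "Fine-tuning changes the model's weights using your training examples.",
    "RAG retrieves relevant documents at question time and pastes them into the prompt.",
    "Prompt engineering changes only the instructions, not the model or its knowledge.",
]

def embed(texts):
    # Embedding model name is an assumption; swap in whichever one you use.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)

question = "How does RAG differ from fine-tuning?"
q_vector = embed([question])[0]

# Cosine similarity to find the most relevant chunk (top-1 for simplicity).
scores = doc_vectors @ q_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
)
best_chunk = documents[int(np.argmax(scores))]

# The retrieved text is injected into the prompt as context for the answer.
answer = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{best_chunk}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```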
2. Types of fine-tuning currently available for OpenAI models in the developer platform:
a. The supervised option is simpler and only requires examples of the desired output for a given relevant input. A minimum of 10 different examples is required to start training; a diverse range of ~50-100 is recommended for better and more consistent results. Once you have the data in the correct JSONL format, simply save it as a .jsonl file using a plain-text editor such as Notepad (the format is sketched below, after item b). Training GPT-4.1 nano on the minimum 10 (very basic) examples costs about 2 cents.
b. Direct preference optimisation (DPO) requires both good AND bad responses, so you also teach the model what you don't want to see. This is loosely similar to 'negative prompts' in image generation: the general idea is the same, but the examples in your training data need to be carefully curated to avoid unintended results (or no results at all).
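For reference, here is a sketch of what the two training files can look like. The record shapes follow OpenAI's documented JSONL formats for supervised and DPO fine-tuning at the time of writing, but verify the exact field names against the current docs; the example content is toy data.

```python
import json

# One supervised (SFT) example: just the desired output for a given input.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a polite customer-support bot."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Sorry for the delay! Could you share your order number?"},
    ]
}

# One DPO example: the same input, with a preferred and a non-preferred response.
dpo_example = {
    "input": {
        "messages": [{"role": "user", "content": "Where is my order?"}]
    },
    "preferred_output": [
        {"role": "assistant", "content": "Sorry for the delay! Could you share your order number?"}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "No idea. Check the website."}
    ],
}

# A .jsonl file is just one JSON object per line; you need 10+ lines like these to start.
with open("sft_training_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sft_example) + "\n")

with open("dpo_training_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(dpo_example) + "\n")
```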
3. Evaluating fine-tuned models to check that they work as intended once you have created them:
a. Pricing: Using mostly default settings, 18 short samples (<100 tokens per sample, including both input and output) were evaluated by the preset Auto grader using o3-mini. The full evaluation came out under 10 cents in total, with about 9 cents used by o3-mini for the analysis (an input/output cost ratio of roughly 2:7). Side note: do not use the 'Quick eval 15s' button found in your logs; it is very expensive.
b. Instructions: From the create new evaluation screen, select 'Create new data' unless you already have a large collection of responses to test with. Ensure you click +Message under Prompt, and then type {{item.input}} in the user box (if you get the error "No variable references found in the input messages", this is what went wrong). You may fill in the Developer box with whatever you would normally give the model before the user prompt. The table on the right side of the screen is where you provide the user inputs to test with; select the model you wish to test and include Ground Truth if you know what you want the output to look like. The next page, 'Create test criteria', offers multiple options for how your model is assessed; the model labeller and model scorer options both recommend using o3-mini as the evaluator for best results. When you run the eval, your model will automatically generate the outputs, and the evaluator model (o3-mini) will then check them against your specified requirements.
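If you prefer code over the dashboard, here is a rough hand-rolled equivalent of that workflow: generate outputs with your fine-tuned model, then ask a grader model to judge them against the ground truth. The fine-tuned model ID is a placeholder, and the grading prompt and pass/fail criterion are illustrative assumptions; the Evals dashboard described above remains the supported way to do this.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

FINE_TUNED_MODEL = "ft:gpt-4.1-nano:your-org::example"  # placeholder model ID
GRADER_MODEL = "o3-mini"  # the evaluator model recommended in the dashboard

# Equivalent of the dashboard table: each item has an input and (optionally) ground truth.
test_items = [
    {"input": "Where is my order?", "ground_truth": "Apologise and ask for the order number."},
    {"input": "Can I get a refund?", "ground_truth": "Explain the refund policy politely."},
]

# Roughly what you would put in the Developer box before the user prompt.
developer_prompt = "You are a polite customer-support bot."

for item in test_items:
    # 1. Your fine-tuned model generates the output, just as the eval run does automatically.
    output = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[
            {"role": "system", "content": developer_prompt},
            {"role": "user", "content": item["input"]},
        ],
    ).choices[0].message.content

    # 2. The grader model checks the output against your criteria / ground truth.
    verdict = client.chat.completions.create(
        model=GRADER_MODEL,
        messages=[
            {
                "role": "user",
                "content": (
                    "Grade the RESPONSE against the EXPECTED behaviour. Reply PASS or FAIL.\n"
                    f"INPUT: {item['input']}\n"
                    f"EXPECTED: {item['ground_truth']}\n"
                    f"RESPONSE: {output}"
                ),
            }
        ],
    ).choices[0].message.content

    print(item["input"], "->", verdict)
```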