By Bartosz K. — Published: 19 March 2026 — Updated: 27 March 2026 — 12 min read
Building applications with large language models (LLMs) is deceptively easy to start and surprisingly hard to do well. Getting a model to produce an impressive output in a demo takes minutes. Getting it to produce correct, consistent, and safe outputs across thousands of real user inputs — reliably, at scale, over months — is a significantly harder engineering challenge. This is where prompt engineering becomes a core skill.
Prompt engineering is the practice of designing and optimising the instructions given to a language model. It is not a soft skill or an art — at the level required for production systems, it is rigorous engineering with systematic testing, versioning, and evaluation. This guide covers the techniques that matter most for developers building LLM-powered applications.
A large language model is, at its core, a very sophisticated text completion engine. Given an input (the prompt), it produces an output by predicting the most likely continuation. The quality and reliability of that output depend enormously on how the input is structured.
Prompt engineering encompasses everything from the high-level architecture of the conversation (what roles are defined, what context is provided) to the fine details of phrasing (whether a question is open-ended or constrained, whether examples are included, how output format is specified). Small changes in prompts can produce dramatically different outputs.
In production applications, prompts are software artefacts that need to be versioned, tested, and deployed with the same rigour as code. A prompt change that improves performance on one input class may degrade it on another. Systematic evaluation is not optional.
Most modern LLM APIs support a system message — a set of instructions that sets the context for the entire conversation. This is where you define:
A well-written system prompt is specific and unambiguous. "You are a helpful assistant" is a poor system prompt. "You are a customer support agent for [Company]. You help users with account issues, billing questions, and product features. You do not discuss competitor products. You always respond in English. If you are uncertain about a fact, say so rather than guessing." is a better one.
Important: system prompts are not security boundaries. A determined user can often manipulate a model to ignore system prompt instructions through prompt injection. Do not rely on the system prompt alone to enforce security constraints — apply validation and filtering in your application code.
One of the most reliable ways to improve output quality is to include examples — known as "few-shot" examples — in the prompt. Instead of only describing what you want, you show the model several input/output pairs that demonstrate the desired behaviour.
Few-shot prompting is effective for:
The examples you include matter. Curate them carefully — diverse, high-quality examples that cover edge cases will improve reliability. Poorly chosen examples can reinforce failure modes. For production systems, maintain a library of test cases that includes examples of both typical inputs and known edge cases, and use these to evaluate prompt changes before deployment.
For tasks that require reasoning — mathematical problems, multi-step logic, complex classification decisions — instructing the model to "think step by step" before producing an answer significantly improves accuracy. This technique is known as chain-of-thought (CoT) prompting.
The mechanism is not fully understood, but the empirical result is clear: asking the model to show its reasoning produces better answers than asking for the answer directly. The intermediate steps seem to guide the model toward more correct outputs.
In production applications, you have two options:
Some model providers (including Anthropic's extended thinking models) offer extended thinking as a first-class feature, allowing the model to reason at length before producing a response without that reasoning appearing in the visible output.
Parsing unstructured text output in production code is fragile. When your application needs to extract specific fields from a model's response — a classification label, a list of entities, a JSON object with defined fields — you should request structured output explicitly.
Best practices:
When the model needs to answer questions based on specific information — your documentation, a user's account data, a product catalogue — providing that information as context in the prompt dramatically improves accuracy and reduces hallucination.
This is the retrieval-augmented generation (RAG) pattern: retrieve relevant information from a vector database, then inject it into the prompt as context. The model is then instructed to answer based on the provided context rather than its general training knowledge.
Key considerations for context injection:
In production systems, prompt evaluation is not optional. Before deploying a prompt change, you need evidence that it improves performance on the cases that matter without introducing regressions elsewhere.
Build an evaluation dataset that covers:
For each test case, define what a correct output looks like. This can be:
Track these metrics over time. A prompt engineering workflow without evaluation is essentially guessing.
Vague instructions. "Summarise this document" will produce unpredictable results. "Summarise this document in 3 bullet points, each under 25 words, focusing on the main findings and omitting methodology" will not.
Contradictory instructions. System prompts that give conflicting guidance confuse the model and produce erratic behaviour. Review prompts for internal consistency.
Not accounting for model differences. A prompt optimised for one model (GPT-4, Claude, Gemini) may not perform the same on another. If you switch models, re-evaluate.
Prompt injection. If user-provided text is included in your prompt without sanitisation, malicious users can inject instructions that override your intended behaviour. Never trust user input to be prompt-safe — validate, sanitise, and apply output filtering.
Hallucination acceptance. LLMs can produce confident-sounding false information. For any application where accuracy matters, implement verification steps: RAG with retrieved context, structured output validation, or human review loops for high-stakes decisions.
Production prompt engineering has operational requirements that go beyond writing good prompts: