TL;DR

  • Prompting techniques don't have one ranking. They have at least three: how often you'll actually use them, how much expert use beats average use, and how much output you get per minute of prompt engineering.
  • Most advice ranks on a single axis (usually frequency) and that's why your prompts still feel underwhelming even after reading every "top 10 prompting tricks" post.
  • Few-shot, output constraints, and decomposition score high across all three axes. They're the ones to drill first.
  • Tree-of-thoughts and self-consistency look impressive in papers and have terrible ROI for daily work. Skip them until you actually hit the ceiling.
  • Use the three axes to decide what to learn next, not which technique is "best" overall.

Introduction

Most lists of prompting techniques rank them wrong. They pick one metric (popularity, hype, "research-backed") and stack-rank twenty techniques against it. Then they wonder why readers try chain-of-thought and self-consistency and tree-of-thoughts and still get mediocre outputs.

The problem isn't that the techniques don't work. The problem is that "best technique" is the wrong question. A technique can be high-frequency and low-ROI (you use it daily but it barely moves quality). Or low-frequency and game-changing (you use it once a month but when you do, the output goes from broken to shipping). These are different tools for different jobs. One ranking can't capture that.

This post breaks down 13 prompting techniques across three axes that actually matter: daily use frequency, the gap between average and expert use, and result-to-effort ratio. By the end you'll know which techniques to drill first, which to defer, and which look impressive on paper but aren't worth the prompt engineering tax.

What "Prompting Technique" Actually Means

Strip the hype and a prompting technique is a structural pattern you apply to your input to shift the model's output distribution toward something more useful. That's it. Not magic. Not a hack. Just a way of arranging context, instructions, and examples so the model has a better shot at giving you the answer you want.

Consider the difference between asking "summarize this document" and "summarize this document in 3 bullets, each under 15 words, focused on action items only, in the voice of a project manager writing to executives." Same task. Different prompting structure. The second one constrains the output space so aggressively that the model has nowhere to drift. That's the whole game. Every "technique" is a variation on the same core mechanic: shape the input so the output has nowhere bad to go.

The implication is that no technique is universally "good." A technique is good when it removes enough output variance for the cost of writing it. That's the only honest way to rank them. And that's a three-axis problem.

The Three Axes That Actually Matter

Axis 1: Daily Use Frequency

How often a technique shows up in real prompts you write. Not in papers. Not in viral threads. In your actual daily work. Some techniques are everywhere because the cost of using them is near zero. Others are reserved for specific high-stakes prompts where the setup time is justified.

High-frequency techniques are the ones you should master first because the compounding return is enormous. If you use a technique 50 times a day and it makes outputs 10% better, that's a different value calculation than a technique you use twice a month with a 40% lift. Both matter. Frequency is the multiplier.

Axis 2: Average vs Expert Gap

How much better an expert uses the technique compared to an average user. This is the most overlooked axis. Some techniques are flat: average users and experts get roughly the same lift. Others have a huge gap, where experts get 3-5x the value because they understand the failure modes and design around them.

The gap matters because it tells you where deliberate practice pays off. A technique with a small gap means once you know the basic pattern, you've extracted most of the value. A technique with a big gap means there's a learning curve worth climbing. Few-shot prompting is the textbook example. Average users throw three random examples at the model. Experts pick examples that span the failure modes, edge cases, and output format variations they actually care about. Same technique, completely different outcomes.

Axis 3: Result-to-Effort Ratio

The output quality lift divided by the time it took you to write the prompt. This is the axis that survives contact with deadlines. A technique that takes 30 minutes to set up and lifts quality 10% is worse than one that takes 30 seconds and lifts quality 8%, in almost every real workflow.

This is also the axis where research benchmarks lie hardest. Tree-of-thoughts gets cited as state-of-the-art on certain reasoning benchmarks. In daily product work it's almost always the wrong call because the setup and orchestration cost dwarfs the lift. The papers don't price the engineering tax. You have to.

The bridge is this: knowing where a technique falls on each axis is what separates someone who has read about prompting from someone who can ship reliable LLM features. The 13 techniques below are sorted by my read on all three.

The 13 Techniques, Ranked

The table below ranks each technique on the three axes from low to high. "Daily use" means how often it shows up in real production prompts. "Expert gap" means how much better experts use it. "ROI" is the result-to-effort ratio.

#    Technique                                 Daily use   Expert gap   ROI
1    Output format constraints                 High        Medium       High
2    Few-shot with curated examples            High        High         High
3    Role + audience pinning                   High        Low          High
4    Decomposition (task splitting)            Medium      High         High
5    XML / structural delimiters               High        Low          High
6    Negative examples (what not to do)        Medium      High         High
7    Chain-of-thought                          Medium      Medium       Medium
8    Self-critique / reflection pass           Medium      High         Medium
9    Step-back prompting                       Low         High         Medium
10   Generated knowledge (brainstorm first)    Low         Medium       Medium
11   ReAct (reason + act)                      Medium      High         Low
12   Self-consistency (sample + vote)          Low         Low          Low
13   Tree-of-thoughts                          Very low    Low          Very low

High-ROI Techniques (#1 to #6)

Six techniques score high on ROI. These are the ones to drill first because they pay back the prompt engineering cost almost immediately and stack well together.

1. Output format constraints

Output format constraints are the highest-impact move in prompting and the most consistently underused. Telling the model "respond in JSON matching this schema" or "respond in exactly three bullets, no preamble" cuts output variance by an order of magnitude. The expert gap is medium because the basic move is obvious, but experts know to constrain at the field level (max 15 words per bullet, must include a verb, no hedging language) where average users stop at the structure level.

You are [role / expertise].
Your task is to [clear, specific objective].
Context:
[relevant background or input data]
Output format:
Respond in JSON matching this exact schema:
{
  "field_1": "string, max 15 words, must include a verb",
  "field_2": ["array of strings, exactly 3 items"],
  "field_3": "boolean"
}
No preamble. No trailing text. JSON object only.
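The schema constraint only pays off if you enforce it in code and retry on violation. A minimal validation sketch for the template above; the field names and limits mirror the example schema and are illustrative, not a standard:

```python
import json

def parse_constrained_output(raw: str) -> dict:
    """Validate a model response against the example schema above.

    Raises ValueError so the caller can retry with a corrective message.
    """
    data = json.loads(raw)  # fails fast on preamble or trailing text
    if len(data["field_1"].split()) > 15:
        raise ValueError("field_1 exceeds 15 words")
    if not (isinstance(data["field_2"], list) and len(data["field_2"]) == 3):
        raise ValueError("field_2 must be exactly 3 items")
    if not isinstance(data["field_3"], bool):
        raise ValueError("field_3 must be a boolean")
    return data
```

The point of raising instead of silently fixing: a failed parse is a signal to re-prompt, and the error message doubles as the corrective instruction.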

2. Few-shot with curated examples

Few-shot with curated examples is where the expert gap gets ugly. Average users pick three examples that look like the input. Experts pick examples that span the failure modes the model is most likely to hit. If you've ever wondered why your few-shot prompts work in development and break in production, this is why. The examples need to cover the edge cases, not just the happy path.

You are [role / expertise].
Your task is to [clear, specific objective].
Examples (each one targets a specific failure mode):

Input: [example that hits failure mode A]
Output: [correct output]

Input: [example that hits failure mode B]
Output: [correct output]

Input: [edge case C the model usually botches]
Output: [correct output]

Now do this:
Input: [actual input]
Output:
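One way to operationalize "span the failure modes" is to tag your example pool by the failure mode each example guards against, then select one per mode instead of grabbing the first few that look similar to the input. A sketch, with hypothetical mode tags:

```python
def pick_fewshot(pool: list[dict], k: int = 3) -> list[dict]:
    """Pick up to k examples covering distinct failure modes.

    Each example is {"input": ..., "output": ..., "mode": ...},
    where "mode" names the failure mode the example targets.
    """
    chosen, seen_modes = [], set()
    for ex in pool:
        if ex["mode"] not in seen_modes:
            chosen.append(ex)
            seen_modes.add(ex["mode"])
        if len(chosen) == k:
            break
    return chosen
```

Two examples that hit the same failure mode teach the model less than one each from two different modes, which is exactly the gap between average and expert use.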

3. Role + audience pinning

Role and audience pinning is the lowest-friction move on the list. "You are a senior backend engineer reviewing a junior's pull request" gives the model a posture and a register in one line. The expert gap is small because there isn't much depth to mine here, but the daily use is so high and the cost so low that it lives in the high-ROI cluster anyway.

You are [specific role with seniority and domain, e.g.,
"a senior backend engineer with 10 years of distributed systems experience"].
You are writing for [specific audience with context, e.g.,
"a junior engineer who just shipped their first service to production"].
Your task is to [clear, specific objective].
Context:
[relevant background or input data]
Output format:
[bullets / table / JSON / paragraph]

4. Decomposition (task splitting)

Decomposition is the technique that separates people who use LLMs from people who build with them. Instead of asking one prompt to "extract entities, classify them, and write a summary," you split it into three prompts. Each prompt has one job. Quality goes up because each subtask has fewer ways to go wrong. Latency goes up too, which is the tradeoff. Experts know which tasks are decomposition candidates and which aren't worth the orchestration cost.

# Prompt 1 of 3 -- extract
You are [role].
Your task is to extract [specific entities] from the input.
Context: [input data]
Output format: JSON list of entities.

# Prompt 2 of 3 -- classify
You are [role].
Your task is to classify each entity from the previous step into [categories].
Context: [output of prompt 1]
Output format: JSON object {entity: category}.

# Prompt 3 of 3 -- summarize
You are [role].
Your task is to summarize the classified entities into [final format].
Context: [output of prompt 2]
Output format: [final format]
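Wired together, the three prompts become a small pipeline where each stage's output is the next stage's context. A sketch, assuming a hypothetical `call_llm(prompt) -> str` client that returns the raw completion:

```python
import json

def run_pipeline(document: str, call_llm) -> str:
    """Extract -> classify -> summarize, one job per prompt."""
    # Prompt 1: extract
    entities = json.loads(call_llm(
        f"Extract the named entities from the input as a JSON list.\nInput: {document}"
    ))
    # Prompt 2: classify
    classified = json.loads(call_llm(
        "Classify each entity as PERSON, ORG, or OTHER. "
        f"Respond as a JSON object {{entity: category}}.\nEntities: {json.dumps(entities)}"
    ))
    # Prompt 3: summarize
    return call_llm(
        f"Summarize these classified entities in one sentence.\nData: {json.dumps(classified)}"
    )
```

The latency cost is visible here: three round trips instead of one. The quality win is that each `json.loads` is a checkpoint where a bad intermediate result can be caught and retried.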

5. XML / structural delimiters

XML and structural delimiters wrap inputs so the model knows where one section ends and the next begins. Wrapping sections in <context>...</context>, <task>...</task>, and <examples>...</examples> is boring and effective. The expert gap is low because the technique has no depth, just consistency. Use it everywhere and stop thinking about it.

<role>You are [role / expertise].</role>

<task>[clear, specific objective]</task>

<context>
[relevant background or input data]
</context>

<requirements>
- [Constraint 1]
- [Constraint 2]
- [Constraint 3]
</requirements>

<output_format>
[bullets / table / JSON / paragraph]
</output_format>
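Since the technique is pure consistency, the right move is to stop writing the tags by hand. A small builder function (a hypothetical helper, not a library API) that assembles the template above:

```python
def build_prompt(role: str, task: str, context: str,
                 requirements: list[str], output_format: str) -> str:
    """Assemble an XML-delimited prompt from its parts."""
    reqs = "\n".join(f"- {r}" for r in requirements)
    return (
        f"<role>{role}</role>\n\n"
        f"<task>{task}</task>\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<requirements>\n{reqs}\n</requirements>\n\n"
        f"<output_format>\n{output_format}\n</output_format>"
    )
```

Once this lives in a shared module, every prompt in the codebase gets the same section order for free, which is the whole value of the technique.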

6. Negative examples (what not to do)

Negative examples (telling the model what NOT to do) get skipped because they feel weird. They shouldn't. Pairing positive and negative examples gives the model a much sharper decision boundary than positive examples alone. If your prompts keep producing one specific failure mode, a single negative example will often kill it faster than rewriting the instructions three times.

You are [role / expertise].
Your task is to [clear, specific objective].
Context:
[relevant background or input data]

Good output (do this):
[correct example]

Bad output (do NOT do this):
[wrong example showing the specific failure mode you keep hitting]
Why it's bad: [one-line explanation of the failure]

Now produce your output for:
[actual input]

Medium-ROI Techniques (#7 to #10)

These four are situational. Use them when the high-ROI cluster has been exhausted and quality still isn't where you need it.

7. Chain-of-thought

Chain-of-thought has earned its hype but also its overuse. "Think step by step" works on reasoning tasks where the model would otherwise pattern-match to a wrong answer. It does almost nothing on tasks where reasoning isn't the bottleneck. The medium expert gap reflects that experts know when to deploy it and when it just inflates the output without helping.

You are [role / expertise].
Your task is to [clear, specific objective].
Context:
[relevant background or input data]
Requirements:
- Think step by step before answering.
- Show your reasoning under a "Reasoning:" header.
- Then give the final answer under an "Answer:" header.
Output format:
Reasoning:
[your step-by-step working]
Answer:
[final answer in the requested format]
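The Reasoning/Answer split only helps downstream if you strip the reasoning before using the result. A minimal parser for the header format above:

```python
def extract_answer(response: str) -> str:
    """Return the text after the final 'Answer:' header, dropping the reasoning."""
    marker = "Answer:"
    idx = response.rfind(marker)  # last occurrence, in case reasoning mentions it
    if idx == -1:
        raise ValueError("no Answer: section found")
    return response[idx + len(marker):].strip()
```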

8. Self-critique / reflection pass

Self-critique is a follow-up pass where the model reviews and refines its own output. The expert gap is high because the critic prompt has to be specific. "Review your answer" is useless. "Review your answer for the following failure modes: hallucinated citations, missing edge cases, hedging language" is a different tool. Cost is the latency and token doubling, which is why it's medium ROI.

# Prompt 1 -- draft
You are [role / expertise].
Your task is to [clear, specific objective].
Context: [input data]
Output: [first draft]

# Prompt 2 -- critique and revise
You are a strict reviewer of [domain].
Review the draft below for these specific failure modes:
- [failure mode 1]
- [failure mode 2]
- [failure mode 3]
Then produce a revised version that fixes any issues found.
Draft to review:
[draft from prompt 1]
Output format:
Issues found: [list]
Revised version: [final output]
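The two prompts compose into a draft-then-revise call. A sketch, again assuming a hypothetical `call_llm(prompt) -> str` client; note that the failure modes are passed in explicitly, because a generic "review your answer" critic is the average-user version of this technique:

```python
def draft_and_revise(task: str, failure_modes: list[str], call_llm) -> str:
    """First pass drafts; second pass critiques against named failure modes."""
    draft = call_llm(f"You are the author. {task}")
    modes = "\n".join(f"- {m}" for m in failure_modes)
    return call_llm(
        "You are a strict reviewer. Review the draft below for these "
        f"failure modes:\n{modes}\nThen output only the revised version.\n"
        f"Draft:\n{draft}"
    )
```

The token and latency doubling is structural: two full calls per output. That cost is why this sits in the medium-ROI cluster despite the high expert gap.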

9. Step-back prompting

Step-back prompting asks the model to first state the general principle behind a question before answering the specific one. This works well on math, reasoning, and some classification tasks. It doesn't help on extraction or summarization. Low daily use, but a strong move when the task fits.

You are [role / expertise].
Your task is to [clear, specific objective].

Before answering the specific question, first state the general principle
or framework that applies to problems of this type.

Step 1: General principle.
What general rule, framework, or approach governs problems of this type?

Step 2: Apply that principle to the specific question.
[specific question or input]

Output format:
Principle: [one paragraph]
Answer: [direct response in the requested format]

10. Generated knowledge (brainstorm first)

Generated knowledge prompting asks the model to brainstorm relevant facts or considerations first, then answer. It's a poor man's retrieval. Useful when you don't have a real RAG pipeline yet or when the topic is broad enough that priming the context helps. Mostly replaced by actual retrieval in production systems.

You are [role / expertise].
Your task is to [clear, specific objective].

Step 1: Brainstorm 5-7 facts, considerations, or context points that
are relevant to the question below. Be specific.

Step 2: Use the brainstormed context to answer the question.

Question: [actual question]
Context (if any): [input data]

Output format:
Background:
- [fact / consideration 1]
- [fact / consideration 2]
- ...
Answer: [final response]

Low-ROI Techniques (#11 to #13)

These three look impressive in research papers and almost never justify their cost in real work.

11. ReAct (reason + act)

ReAct interleaves reasoning steps with tool calls. It's the foundation of how most agent frameworks work, and it's powerful when you actually need an agent. For a single-shot prompt it's wildly overkill. The low ROI is about deployment context: ReAct in an agent loop is correct, ReAct in a one-off prompt is engineering theater.

You are [role / expertise] with access to these tools:
- [tool 1]: [what it does]
- [tool 2]: [what it does]
- [tool 3]: [what it does]

Your task is to [clear, specific objective].

For each step, output:
Thought: [what you're reasoning about and why]
Action: [tool to call + arguments]
Observation: [tool result -- the runtime fills this in]
... (repeat until the task is done)
Final Answer: [response to the original task]

Task: [input]
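The Thought/Action/Observation format is only half the technique; the other half is a runtime loop that parses each Action, executes the tool, and appends the Observation before the next model call. A minimal sketch with a hypothetical `call_llm` client and a plain dict as the tool registry (the `tool_name(argument)` action syntax is an assumption for illustration):

```python
def react_loop(call_llm, tools: dict, task: str, max_steps: int = 5) -> str:
    """Run a ReAct loop: the model emits Action lines, the runtime fills Observations."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Expect the form "Action: tool_name(argument)"
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("(", 1)
            result = tools[name.strip()](arg.rstrip(")"))
            transcript += f"Observation: {result}\n"
    raise RuntimeError("no final answer within step budget")
```

Even this toy version makes the engineering tax visible: parsing, dispatch, step budgets, and error handling all live outside the prompt, which is why ReAct belongs in agent loops rather than one-off prompts.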

12. Self-consistency (sample + vote)

Self-consistency samples multiple completions and takes the majority vote. It works. It also costs you 5-10x the tokens for a few percentage points of improvement on benchmarks that don't reflect your actual workload. Skip until you've proven the rest of the stack is tuned.

# Run this prompt N times (typically 5-10) at temperature > 0.
# Then majority-vote the answers in code.

You are [role / expertise].
Your task is to [clear, specific objective].
Context:
[relevant background or input data]
Output format:
[strict, parseable format -- so the voter can compare runs]

# After N runs:
# Take the answer that appears most often.
# Tie-break by [criterion, e.g., shortest answer / highest confidence score].
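The sampling-and-voting step is a few lines of code. A sketch, assuming a hypothetical `call_llm` client sampling at temperature > 0, with shortest-answer as the assumed tie-break:

```python
from collections import Counter

def self_consistency(call_llm, prompt: str, n: int = 5) -> str:
    """Sample n completions and return the most frequent answer.

    Ties break toward the shortest answer (an assumed tie-break rule).
    """
    answers = [call_llm(prompt).strip() for _ in range(n)]
    counts = Counter(answers)
    return max(counts, key=lambda a: (counts[a], -len(a)))
```

The 5-10x token multiplier is right there in the loop, which is the whole ROI argument against reaching for this early.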

13. Tree-of-thoughts

Tree-of-thoughts builds an explicit search tree of reasoning paths and evaluates branches. Brilliant on paper. In production it's almost always the wrong call. The orchestration overhead, latency, and cost dwarf the lift on real tasks. If you find yourself reaching for this, ask why simpler decomposition didn't get you there first. The answer is usually that it would have.

You are [role / expertise].
Your task is to [clear, specific objective].

Step 1: Generate 3 distinct approaches to this problem.
Step 2: For each approach, list strengths and weaknesses.
Step 3: For the top 2 approaches, generate one level of follow-up sub-steps.
Step 4: Evaluate the resulting branches against [evaluation criteria].
Step 5: Pick the best branch and produce the final answer.

Context:
[relevant background or input data]

Output format:
Approaches: [3 listed]
Evaluations: [strengths / weaknesses for each]
Expanded branches: [top 2 with sub-steps]
Branch evaluations: [...]
Final answer: [chosen path's output]

Where the Three-Axis Framework Breaks

Frontier Model Drift

The rankings shift as models improve. Chain-of-thought used to be a giant lift. On modern reasoning models it's often baked in by default and explicit prompting helps less. Self-critique gets weaker as models get better at first-pass quality. The framework still applies, the specific scores don't.

Re-rank yearly. The techniques near the bottom today might disappear entirely in 18 months because the model does them implicitly. The techniques near the top today (output constraints, few-shot, decomposition) are likely to stay there because they're about reducing output variance, which is a problem that doesn't go away.

Domain Specificity

Some techniques are domain-locked. Step-back is a reasoning-task move. Generated knowledge is a research / brainstorming move. Tree-of-thoughts is a search / planning move. Ranking them on a universal axis loses information. If your daily work is one specific domain, build your own ranking inside that domain. The three-axis framework still applies, the techniques in scope shrink.

The Stacking Problem

Techniques compose. Few-shot plus output constraints plus role pinning is a different prompt than any of those alone. The ranking treats each technique as standalone, but in practice you're always stacking 3-5. The right read of the table is "which to add next to my existing stack," not "which is the single best."

When to Use What

If you're starting from scratch on a new prompt, work the high-ROI cluster first. Add output constraints, pin the role, structure with XML delimiters. Run it. If quality is good enough, stop.

If quality isn't there, add few-shot with examples that span your real failure modes. If you're still failing on a specific category of input, add a negative example that shows the wrong output. If the task is too big for one prompt, decompose.

Only after the high-ROI stack is fully exhausted should you reach for the medium cluster. Chain-of-thought when reasoning is the bottleneck. Self-critique when output quality matters more than latency. Step-back when the task is abstract enough to benefit.

The low-ROI cluster is for when you've genuinely hit the ceiling of single-prompt engineering and need to escalate to agent loops, multi-sample voting, or search. That escalation should be a conscious cost decision, not a default move.

What to Do Next

Pick one technique from the high-ROI cluster you don't use consistently. Add it to your default prompt template this week. Measure the lift on the next 10 prompts you write. Then add the next one.

Don't try to learn all 13 at once. The expert gap on the top 6 is where the real returns live, and those returns compound only if you use them daily. Drill the top half until it's automatic, then revisit this list when you're actually hitting the ceiling.

Further Reading

  • Anthropic's prompt engineering guide: the most rigorous public writeup on output constraints, few-shot design, and structural delimiters. Worth reading even if you've read other prompting guides.
  • The original chain-of-thought paper (Wei et al., 2022): still the cleanest explanation of when CoT helps and when it doesn't.