SQL Execution Order: Write Queries That Think Like the Database

Most SQL bugs aren't logic errors. They're sequence errors — the result of writing a query in one order and the database executing it in another. Consider this query: ```sql SELECT department, COUNT(*) AS headcount FROM employees WHERE headcount > 5 GROUP BY department; ``` It looks reasonable. It will not run. The error you get back — something like column _"headcount"_ does not exist — feels like a bug in the database. It isn't. The database is being completely consistent. `headcount` doesn't exist yet when `WHERE` runs, because `SELECT` hasn't run yet. You defined the alias in step 5. You tried to use it in step 2. This is the trap: SQL reads like a sentence, so we write it like one. _Give me the department and count, from employees, where the count is greater than five_. That sentence has a natural English order, and SQL's syntax follows it. But the database doesn't execute it in that order — it executes it in the order that's logically necessary to produce a correct result. `FROM` runs first, because you can't filter rows from a table you haven't opened yet. `WHERE` runs before `GROUP BY`, because grouping a million rows is expensive and you want to throw out the ones you don't need first. `SELECT` runs near the end, because it's not retrieving data — it's describing the shape of the output. `LIMIT` runs last, because you can't cut the top 10 rows of a result set that hasn't been sorted yet. Once you see the execution order, a whole class of confusing query behaviour stops being mysterious. The alias that doesn't exist. The `HAVING` filter that's slower than expected. The `LIMIT 10` that doesn't feel fast. These aren't quirks. They're the same fact, seen from different angles. The execution order is this: **FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT**. That's the sequence the rest of this article is built around. --- ## The Execution Order, Explained The seven steps aren't arbitrary. Each one sits where it does because it depends on the output of the step before it. Work through them once and the order becomes obvious — even inevitable. --- ### Step 1 · The `FROM`/`JOIN` World Building Before the database can do anything, it needs a dataset to work with. `FROM` identifies the source tables; `JOIN` combines them. Everything else in your query operates on the result of this step. If you're joining three tables, the database assembles that combined row set first — all of it, before a single filter is applied. This is worth sitting with. The full Cartesian product of your joins exists, conceptually, before `WHERE` runs. The optimizer will often reorder things under the hood, but the logical model is: `FROM` builds the world, everything else narrows it. --- ### Step 2 · The Great `WHERE` Filter Now the database discards rows that don't meet your conditions. This is your primary cost-reduction lever — every row eliminated here is a row that never reaches grouping, aggregation, or sorting. Two constraints follow directly from position. First, you can't reference `SELECT` aliases here, because those don't exist yet. And here is why the query above is failing. Second, you can't filter on aggregate values like `COUNT(*)` — aggregation hasn't happened yet either. If you find yourself wanting either of those things, you need `HAVING` or a `CTE`, not `WHERE`. --- ### Step 3 · The `GROUP BY` Bucket Brigade Rows that survived `WHERE` are now collapsed into groups. If you ask for `GROUP BY department`, every row belonging to Engineering becomes one bucket, every row belonging to Product becomes another. Here's the caveat: once you've grouped, individual rows are gone. The only things you can reference in subsequent steps are the columns you grouped by, and aggregate functions over the groups. This is why `SELECT name, COUNT(*)` without name in `GROUP BY` is an error in strict SQL — after grouping, there's no single name to return, only a group of them. --- ### Step 4 · The `HAVING` Bouncer `HAVING` is `WHERE` for groups. It runs after aggregation, which means two things: it can filter on aggregate values (`HAVING COUNT(*) > 5`), and it's more expensive than `WHER`E because the grouping work has already been done before anything gets discarded. This is the most commonly misused step. If a condition doesn't depend on an aggregate, it belongs in `WHERE`. Putting it in `HAVING` isn't wrong — you'll get the right answer — but you're doing extra work for no reason. Group first, then filter, when you could have filtered first, then grouped on far fewer rows. --- ### Step 5 · The `SELECT` Stylist This is the step people mentally treat as first. It isn't. By the time `SELECT` runs, the database already knows exactly which rows and groups exist in the result — it's now computing what to show you about them. Column aliases are created here. Expressions are evaluated here. Window functions run here, between `SELECT` and `ORDER BY`, which is why they can reference column aliases but still can't be filtered on in `WHERE` or `HAVING` — by the time they exist, both of those steps have already finished. --- ### Step 6 · The `ORDER BY` Sorter Sorting happens near the end because it's expensive and you want to sort as few rows as possible. It's placed after `SELECT` for a practical reason: it's the first step that can reference column aliases, because `SELECT` has already run. One trap here: `ORDER BY` sorts the entire result set before `LIMIT` slices it. A query with `ORDER BY` salary `DESC LIMIT 10` doesn't find the top 10 and stop — it sorts everything, then hands the first 10 rows to the next step. Without an index that supports the sort, you're paying full sorting cost regardless of how small your `LIMIT` is. --- ### Step 7 · The `LIMIT/OFFSET` Director's Cut The last thing that happens. Everything upstream has already run to completion — the joins, the filters, the grouping, the sorting. `LIMIT` just decides how much of the finished result set to hand back to you. This makes `LIMIT` nearly useless as a performance tool on its own. It reduces network transfer and client-side processing, but it doesn't reduce database work unless the optimizer can prove it's safe to short-circuit earlier steps — which generally requires an index on the sort column. --- The picture that emerges: _SQL is a narrowing pipeline_. `FROM` produces the maximum possible dataset. Every subsequent step either filters it, reshapes it, or describes it. The earlier you reduce rows, the less work every downstream step has to do. That's not a tip — it's the structure of the language. --- ## Four Rules That Follow Directly From The Order Understanding the execution order is only useful if it changes how you write queries. These four rules aren't best practices abstracted from experience — they fall out of the sequence mechanically. If you know the order, you can derive them yourself. --- ### Rule 1 · Filter Early, Filter in `WHERE` > If a condition doesn't depend on an aggregate, it belongs in `WHERE`, not `HAVING`. ```sql -- Doing unnecessary work SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department HAVING department = 'Engineering'; -- Correct SELECT department, COUNT(*) AS headcount FROM employees WHERE department = 'Engineering' GROUP BY department; ``` Both return the same result. But the first version groups every department in the table, then throws away every group except Engineering. The second version discards non-Engineering rows before grouping, so the grouping step only ever sees one department's worth of rows. The rule is mechanical: if your `HAVING` clause contains no aggregate function, it should be a `WHERE` clause. --- ### Rule 2 · `HAVING` Is For Aggregates, Nothing Else > `HAVING` exists for exactly one purpose — filtering on values that don't exist until after `GROUP BY` runs. ```sql -- This is what HAVING is for SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department HAVING COUNT(*) > 5; ``` You can't move this into `WHERE` — `COUNT(*)` doesn't exist at step 2. `HAVING` is the only place it can go. Use it for that, and only that. When you see a `HAVING` clause filtering on a plain column value, that's a signal someone misunderstood the execution order. ### Rule 3 · `LIMIT` Is Not A Performance Tool On Its Own This one surprises people. A query with `ORDER BY salary DESC LIMIT 10` feels like it should be fast — you only want ten rows. But because `ORDER BY` runs before `LIMIT`, the database sorts the entire result set first, then hands you the top ten. ```sql -- Sorts every row in the table, then returns 10 SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 10; ``` `LIMIT` reduces how much data travels over the wire and how much your application has to process. It does not reduce how much work the database does — unless the optimizer can use an index to satisfy the sort without a full scan, which requires an index on the sort column. If you're relying on `LIMIT` for performance, the real question is whether your `WHERE` clause is doing enough work earlier in the pipeline. --- ### Rule 4 · When You Need To Filter On An Alias, Use A CTE Because `SELECT` aliases are created in step 5, neither `WHERE` (step 2) nor `HAVING` (step 4) can see them. This is the source of the error in the opening example. The fix isn't to find a workaround inside the same query — it's to wrap the query in a CTE, promoting its output into a new `FROM` clause that subsequent steps can filter against. ```sql -- Fails: headcount alias doesn't exist at WHERE SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department WHERE headcount > 5; -- Correct: CTE makes headcount available as a column WITH dept_counts AS ( SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department ) SELECT department, headcount FROM dept_counts WHERE headcount > 5; ``` The CTE isn't syntactic sugar here — it's a structural necessity. The inner query runs to completion, aliases and all, and its output becomes a new dataset that the outer query's FROM picks up at step 1. The outer `WHERE` is now filtering a real column, not a future alias. The same pattern applies to window functions. You can't filter on a window function result in `WHERE` or `HAVING` — wrap the query in a CTE and filter on the outer query. --- These four rules share a single root: the database processes your query in a fixed sequence, and every constraint follows from knowing where in that sequence each clause lives. You don't need to memorise edge cases. You need to know the order. --- ## A Worked Example: How Long Does It Take To Acknowledge A Factory Event? Let's make this concrete. You're working with factory floor data. Events get raised — a machine fault, a threshold breach, a safety alert — and an operator eventually acknowledges them. You want to know: _how long does acknowledgement take, on average, broken down by line and shift?_ The tables: - `machine_events` — one row per event raised, with an event ID, machine ID, line, shift, and the timestamp it was raised - `event_transitions` — one row per state transition per event; each row records the old state, the new state, and when the transition happened - `operator_audit` — one row per operator action; records who did what and when, keyed on event ID The query we want to end up with: ```sql WITH acknowledgements AS ( SELECT et.event_id, et.transitioned_at AS acknowledged_at FROM event_transitions et WHERE et.new_state = 'Acknowledged' AND et.previous_state = 'Unacknowledged' ), ack_times AS ( SELECT me.event_id, me.line, me.shift, oa.operator_id, EXTRACT(EPOCH FROM (a.acknowledged_at - me.raised_at)) / 60 AS minutes_to_ack FROM machine_events me JOIN acknowledgements a ON me.event_id = a.event_id JOIN operator_audit oa ON me.event_id = oa.event_id WHERE me.line IN ('Line A', 'Line B', 'Line C') AND me.raised_at >= '2024-01-01' ) SELECT line, shift, COUNT(*) AS total_events, ROUND(AVG(minutes_to_ack), 1) AS mean_ack_minutes, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY minutes_to_ack) AS median_ack_minutes, ROUND(MIN(minutes_to_ack), 1) AS fastest_ack, ROUND(MAX(minutes_to_ack), 1) AS slowest_ack FROM ack_times WHERE minutes_to_ack IS NOT NULL GROUP BY line, shift HAVING COUNT(*) >= 10 ORDER BY line, shift; ``` Now let's walk it through the execution order and see exactly what the database does — and why the query is structured the way it is. --- ### Step 1 · `FROM / JOIN` — The Database Builds The World The first CTE, `acknowledgements`, opens `event_transitions` and immediately has access to every state transition ever recorded — potentially millions of rows across every event, every machine, every year of data. Nothing has been filtered yet. The second CTE, `ack_times`, then joins three tables: `machine_events`, the `acknowledgements` CTE, and `operator_audit`. At this point, conceptually, the database is assembling a combined row set across all three sources. Every event, joined to its acknowledgement record, joined to its audit record. This is the most expensive moment in the query — the joined dataset is at its largest before a single row has been discarded. This is why the `acknowledgements` CTE exists. Rather than joining raw `event_transitions` directly and filtering it in the outer `WHERE`, we pre-filter it inside the CTE — keeping only the `Unacknowledged → Acknowledged` transitions before the join happens. The join is cheaper because the right-hand side of it is already small. --- ### Step 2 · `WHERE` — Cut Early, Cut Hard Inside ack_times, WHERE runs immediately after the joins: ```sql WHERE me.line IN ('Line A', 'Line B', 'Line C') AND me.raised_at >= '2024-01-01' ``` These two conditions discard every event outside our three lines and everything before 2024 before any grouping or aggregation happens. If `machine_events` holds five years of data across twelve lines, this single `WHERE` clause might eliminate 80% of the joined rows before the database does any further work. Notice what is not here: there is no filter on `minutes_to_ack` at this point. That column doesn't exist yet — it's a computed expression defined in the `SELECT` of `ack_times`, which means it belongs to step 5, three steps from now. To filter on it, we need the outer query's `WHERE`: ```sql WHERE minutes_to_ack IS NOT NULL ``` This is the CTE pattern from rule 4 in action. The inner query computes `minutes_to_ack` and promotes it to a real column. The outer query's `WHERE` can then filter on it cleanly, at step 2 of the outer query's own execution cycle. --- ### Step 3 · `GROUP BY` — Collapsing Into Buckets ```sql GROUP BY line, shift ``` Every row that survived `WHERE` is now collapsed into a `(line, shift)` bucket. A row that was previously o_ne specific event on Line A during the night shift is now part of the Line A / night shift group_. Individual event IDs, operator IDs, and raw timestamps are gone — only the group identity and aggregate functions over the group remain. This is why the outer `SELECT` can reference line and shift directly, but has to use `AVG()`, `COUNT()`, `MIN()`, and `MAX()` for everything else. After grouping, there is no single `minutes_to_ack` to return — only a distribution of them, which aggregation summarises. ### Step 4 · `HAVING` — Filtering Groups, Not Rows ```sql HAVING COUNT(*) >= 10 ``` This discards any `(line, shift)` combination with fewer than ten events. We can't move this into `WHERE` — `COUNT(*)` doesn't exist until after grouping. This is the legitimate use of `HAVING`: a filter that genuinely depends on an aggregate. What we deliberately avoided putting in `HAVING`: the line and date filters from step 2. Those conditions don't depend on aggregates — they belong in `WHERE`, and that's where they are. Putting them in `HAVING` instead would mean grouping all lines and all years of data, computing aggregates for all of them, and then discarding the ones we didn't want. With five years of factory data across twelve lines, that's a significant amount of unnecessary work. --- ### Step 5 · SELECT — Describing The Output ```sql SELECT line, shift, COUNT(*) AS total_events, ROUND(AVG(minutes_to_ack), 1) AS mean_ack_minutes, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY minutes_to_ack) AS median_ack_minutes, ROUND(MIN(minutes_to_ack), 1) AS fastest_ack, ROUND(MAX(minutes_to_ack), 1) AS slowest_ack ``` Only now does the database compute the actual output values. The aliases `mean_ack_minutes`, `median_ack_minutes`, and so on are born here. This is why none of the earlier steps could reference them — they didn't exist yet. `PERCENTILE_CONT` is worth a note. Unlike `AVG` or `COUNT`, it's an ordered-set aggregate — it needs to know the distribution of values within each group to find the midpoint. The database computes this per group, after grouping is complete, entirely within this step. --- ### Step 6 · `ORDER BY` — Sorting The Finished Result ```sql ORDER BY line, shift ``` The full result set — however many `(line, shift)` combinations survived `HAVING` — is sorted here. Because this runs after `SELECT`, it can reference the output column names directly. If we'd wanted to sort by `mean_ack_minutes DESC` to surface the slowest lines first, that would work here and nowhere else in the query. --- ### Step 7 · `LIMIT` — The Exit That Doesn't Help Here This query has no `LIMIT`, and deliberately so. We want all `(line, shift)` combinations — there are only a handful, so there's nothing to cut. But if this query fed a dashboard endpoint and someone added `LIMIT 20` thinking it would speed things up: it wouldn't. The joins, the filtering, the grouping, the aggregation, and the sort all run to completion first. `LIMIT` only reduces the payload sent back to the application. The real performance work happens in steps 1 through 4. --- ### What The Execution Order Reveals About This Query Tracing the query through all seven steps surfaces something that code review alone might miss: the two-CTE structure isn't stylistic — it's load-bearing. The first CTE pre-filters `event_transitions` to only acknowledgement transitions before the join. The second CTE computes `minutes_to_ack` so the outer query can filter on it in `WHERE`. Both decisions reduce the dataset earlier in the pipeline than a flat query would allow. Without them, the same result would require filtering in `HAVING`, sorting computed values that should have been discarded, and joining against a much larger intermediate dataset. The execution order is the reason those CTEs are there. --- ### Quick Reference | Clause | Execution step | Can use `SELECT` aliases? | Runs before or after aggregation? | Typical mistake | |---|---|---|---|---| | `FROM` / `JOIN` | 1 | No | Before | Joining large tables without pre-filtering in a CTE | | `WHERE` | 2 | No | Before | Filtering on an alias or aggregate that doesn't exist yet | | `GROUP BY` | 3 | No | Before | Referencing ungrouped columns in subsequent `SELECT` | | `HAVING` | 4 | No | After | Using it for non-aggregate conditions that belong in `WHERE` | | `SELECT` | 5 | Born here | After | Expecting aliases to be available in earlier clauses | | `ORDER BY` | 6 | Yes | After | Assuming `ORDER BY` + `LIMIT` avoids a full sort | | `LIMIT` | 7 | Yes | After | Using it as a performance tool without an index-backed sort | --- ## Conclusion SQL's syntax is designed to be readable. The execution order is designed to be correct. Those two goals produce different sequences, and the gap between them is where most query bugs live. The alias that doesn't exist. The `HAVING` clause doing work that `WHERE` should have done three steps earlier. The `LIMIT 10` on a query that's still sorting a million rows. None of these are edge cases or database quirks — they're the same misunderstanding, wearing different clothes. Once the execution order is internalised it stops being a rule to remember and becomes a lens. You read a query differently. You spot the `HAVING` with no aggregate and know immediately it's in the wrong place. You see a CTE and understand it's not there for readability — it's there because the alternative would require filtering on a value that doesn't exist yet. You add a `WHERE` clause early and feel the weight of everything it saves downstream. That's the shift this article is trying to produce. Not a checklist, not a set of rules to apply mechanically — a mental model that makes the right structure feel obvious. Write queries in the order the database executes them, not the order English suggests. Everything else follows.

SQL Execution Order: Write Queries That Think Like the Database

Tags

Comments

More Blog

Minimalist EKS: The Easy Way

Never forget to enter the Stern Grove lottery again!

A Free Screenshot Editor That Never Uploads Your Image

I built a CLI to break my highlights out of Apple Books

A Developer's Guide to Agent Hooks in Antigravity CLI

Tactical vs. Strategic Agentic AI Development — A Playbook for Developers