Introductory SQL Diagramming

David Clement

8 November 2007

Oracle version 10gR2


1. Introduction

Dan Tow has described a powerful diagramming technique for analysis of SQL, which should be more widely known in the database world than it is. This document is a brief introduction to the technique.

2. Why should you diagram SQL statements?

You might ask yourself why the technique is worth learning. Diagramming a SQL statement is helpful when tuning a poorly performing statement, when validating complex SQL, and in communication among coders.

First, when a SQL query performs poorly, the optimizer may not have chosen the best execution path. For one thing, the optimizer may not have correct database statistics available. Also, it has to use a brute-force cost algorithm because it lacks the human ability to perceive patterns, and therefore it may not perceive the best one of all the possible execution paths for a complex query. In fact, there may be so many possible execution paths that the optimizer cannot examine them all. For these reasons, it is possible for a human analyst to find a better execution path than the optimizer has found, when both of them are looking at the same fully diagrammed execution path.

Second, and even more common in my experience, diagramming exposes the bugs in a complex query. For example, if a join condition is omitted, and a Cartesian product is the result, it is much easier to see that the condition was left out when you diagram the statement. Another example is that a join can be transitive, that is, it can contain a relation such as table-A.primary-key = table-B.primary-key = table-C.primary-key, implying that table-A.primary-key = table-C.primary-key. The Oracle query optimizer cannot detect transitivity, so for best performance a transitive join should be coded explicitly rather than left implicit. In my experience, transitive join conditions are almost never coded for best performance and the human analyst will rarely be able to catch this without a diagram.

Last, the diagramming technique helps in communication between coders. It is common for a front-end developer or a relatively junior programmer, coming from an OO or procedural world, to find SQL mystifying. A directed acyclic graph, by clearly showing the logical order of row-source operations, can enlighten the mystified coder.

3. How do you diagram SQL statements?

3.1. Basics

A directed acyclic graph or DAG consists of vertices or nodes, which are connected by edges. The nodes are the table names or aliases. It is better to use the aliases, both for brevity and to support self-joins (see section 2.9). The edges are drawn as arrows: this makes them “directed.” No node connects to itself: this makes the graph “acyclic.” In diagramming a SQL statement, each node represents a relation, and the arrow from one relation to another represents a join.

When diagramming SQL statements in everyday usage, a node is a table, simple as that. The arrows represent the query conditions that link one node to another. Each arrow points from a node A to another node B that node A refers to in some way.

In relational theory, a row from a table and a row from a join are two things of the same kind. Any relation is a join of attributes, as when SSN + Last_name + First_name form a Person, which is realized as a row in a Persons table. Similarly SSN + Tax_year + Refund form a relation realized as a row in a Refunds table. A join between Persons and Refunds by SSN is, logically, just another relation, even if it is complicated in some way, for example, by aggregation. For instance, SSN + Last_name + First_name + SUM (Refund) form a new, calculated relation.

So it is logical for each node in a directed acyclic graph to represent a complex relation, such as a subquery or the relations detailed in another diagram. This can be an effective diagramming technique for complicated SQL. However, it is not often needed.

3.2. Nodes and One-Way Arrows

The ordinary way for node A to refer to node B is that A contains a foreign key column pointing to the primary key of B. A plain arrow pointing downward is how the diagramming technique shows that connection.

Here, node A is a detail table with a foreign key column that points to the primary key of node B. A SQL statement for this diagram would look like this:

select a.column_name, b.column_name 
  from table_a a, table_b b 
 where a.foreign_key_to_b = b.primary_key; 

An ERD for the same tables and relationship might look like this.

You can see that the ERD and the DAG both show the same relationship. The crow’s foot on the ERD resembles the feathers of the arrow, while the DAG shows the arrowhead.

3.3. Ratios

If there is a filtering 'where' condition on a table, this is expressed by writing the fraction of the rows that satisfy the condition, underlined, next to the table alias. That is, if half the rows in table B meet the condition, you write 0.5 next to B. If the SQL reads

select a.column_name, b.column_name 
  from table_a a, table_b b 
 where b.filter_condition = 'Y' 
   and a.foreign_key_to_b = b.primary_key; 

then the diagram would be:

The underlined number is known as the "filter ratio."

It is sometimes useful simply to note on the diagram that conditions apply to one of the tables, without working out the fraction. For this purpose, use an underlined f instead of the underlined fractional number. If you know that one fraction is the most selective one on the diagram, use a capital F for that one.

You can also add an asterisk to the letter "f" or the filter ratio to signify that the condition returns a unique row.

Similarly, and especially when debugging, it is useful to note on the diagram that some column from the table is used in the select list. For this purpose, adding an exclamation point to the left of the table alias is convenient. It helps to mark the tables that are actually used, because poorly coded SQL statements sometimes refer to tables that are not needed.

This is my extension of Dan Tow’s original diagramming technique.

The number of rows found in each table per matching row from the other table is the "join ratio." For our diagram, node A would have a "detail join ratio" because it represents the detail table, and node B would have a "master join ratio" because it represents the master table. Usually the master join ratio is 1, since all the rows in A must refer to a row in B, and in that case you can omit writing the master join ratio.

With a detail join ratio, the diagram might look like this.

This example shows the following.

  • The average number of rows in A that refer to any given row in B is 22.
  • The filtering condition on B selects half the rows from that table.
  • Only one row from B is matched by any given row from A.

3.4. Outer Joins

A different form of arrow denotes an outer join. If the SQL reads

select a.column_name, b.column_name, c.column_name 
  from table_a a, table_b b, table_c c 
 where b.filter_condition = 'Y' 
   and a.foreign_key_to_b = b.primary_key 
   and a.foreign_key_to_c = c.primary_key (+); 

then the diagram would be:

The double arrow denotes the outer join.

An outer join where the (+) is on the master table instead of the detail is shown with another form of arrow. If the SQL reads

 select a.column_name, b.column_name 
  from table_a a, table_b b 
 where a.foreign_key_to_b (+) = b.primary_key; 

(notice where the outer join symbol is) then the diagram would be:

Reversing the arrowhead in the middle of the line shows that the master table is the optional one in this join. The middle arrowhead always points to the optional node.

3.5. EXISTS and IN

A semi-join, that is, one using an 'exists' or 'in' condition, is shown almost like an outer join. The difference between the double arrow for an outer join and the double arrow for a semi-join is that an E is written beside the arrowhead in the middle of the arrow, as follows.

The point may be a little subtle, but since we are checking for existence of a row in the detail table, the master table is the optional one in this case. That’s why the arrowhead points up.

More ratios, the "subquery adjusted filter ratio" and the "correlation preference ratio," can be added to diagrams for semi-joins. These are not usually required for debugging or tuning a SQL statement, so this brief document does not cover them. See Dan Tow’s SQL Tuning book for full details. There is a link to the book in the Amazon links on this page, or click here.

3.6. Non-Unique Joins

A join between two tables that is not unique on either end is shown as a horizontal line without an arrowhead. If the SQL reads

select a.column_name, b.column_name 
  from table_a a, table_b b 
 where a.filter_condition = 'Y' 
   and b.filter_condition = a.filter_condition; 

then the diagram would be:

The filter ratios have the same meaning as before.

3.7. One-to-One Joins

A one-to-one join is shown as a horizontal line with two arrowheads:

The filter ratios have the same meaning as above.

3.8. Composite Keys

When columns are combined to form a composite primary key to a third table, the following arrow construction is used. If the SQL reads

select a.column_name, b.column_name, c.column_name 
  from table_a a, table_b b, table_c c 
 where a.foreign_key_to_b = b.primary_key 
   and b.foreign_key_to_c = c.primary_key_column_1 
   and a.foreign_key_to_c = c.primary_key_column_2; 

then the diagram would be:

The numeral 1 beside the C could be omitted, as pointed out above.

3.9. Self-Joins

A self-join is shown as a join between two nodes. Normally it is obvious how to do this, because the same table appears with different aliases in the SQL statement. If the SQL reads

select a.date_time, a2.date_time as later_date_time 
  from table_a a, table_a a2 
 where a.filter_condition = a2.filter_condition 
   and a.date_time < a2.date_time; 

then the diagram would be:

Again, if the SQL reads

select a.date_time, a2.date_time as later_date_time 
  from table_a a, table_a a2 
 where a.child_key = a2.parent_key 
   and a.date_time < a2.date_time; 

then the diagram would be:

The equivalent convention on an ERD is usually a pig’s ear, that is, a relationship line from an entity that returns in a curve to the same entity.

By making self-joins explicit, the DAG diagramming technique clarifies what the RDBMS is doing, far more than the original SQL does.

3.10. Nesting Nodes

If a query is so complex that it cannot be diagrammed on one page, any relation that is part of it can be chosen for a sub-diagram. A node that is sub-diagrammed is surrounded with a dashed line. For example, if the SQL reads

select a.column_name, b.column_name, c.column_name, d.column_name 
  from table_a a, table_b b, table_c c, table_d d 
 where a.primary_key = b.foreign_key_to_a 
   and a.filter_condition = ‘Y’ 
   and b.foreign_key_to_c = c.primary_key 
   and b.foreign_key_to_d = d.primary_key; 

then the whole diagram would look like this:

This could be divided into a main diagram and a sub-diagram as follows.

The diagram on the right, which expands the sub-diagrammed node, would be on another page in real life.

4. Example

To put the pieces together, here is an example of how to diagram an ordinary join.

SELECT e.ssn, e.last_name, e.first_name, d.department_name,
       p.project_name
  FROM employees e,
       employee_project_assignment epa,
       departments d,
       projects p
 WHERE e.department_id = d.department_id
   AND d.department_name = 'DEVELOPMENT'
   AND e.employee_id = epa.employee_id (+)
   AND epa.project_id = p.project_id (+)

5. Links for Further Reading

The Wikipedia gives a brief, useful introduction to directed acyclic graphs here.

For full details of the method, with examples and exercises, see Tow, Dan, SQL Tuning, O’Reilly: 2004.

Dan’s Web site is www.SingingSQL.com.

Dan has graciously provided links to his recent work in this area, in which he builds a taxonomy of bad SQL. His whitepaper is here and the PowerPoint presentation is here.