Join (SQL)
A join combines records from two or more tables in a relational database. In the Structured Query Language (SQL), there are two types of joins: "inner" and "outer". Outer joins are subdivided further into left outer joins, right outer joins, and full outer joins.
A join is essentially a cartesian product followed by a predicate. For example, given employee and department tables as follows:
|
| ||||||||||||||||||||||||||||
The cartesian product of these two tables is computed using the following SELECT statement:
SELECT * FROM employee, department;
An example of a join between these two tables is:
SELECT LastName, DepartmentID, DepartmentName FROM employee, department WHERE employee.DepartmentID = department.DepartmentID;
Another way to write this query is to use an explicit JOIN clause, as follows:
SELECT LastName, DepartmentID, DepartmentName FROM employee JOIN department ON employee.DepartmentID = department.DepartmentID
Table of contents |
Inner and outer join
Inner join
An inner join essentially takes all the records from table A (in this case, employee) and for each of them joins them with each record from table B (department). This means that the length of the result would be, in theory, the product of the number of records in each table—in this case, 6 * 4.
With indexing and an ON clause, this number can be reduced significantly without the need to scan all of those records.
In this case, an inner join (such as the query above) on both tables' DepartmentID columns will return:
| LastName | DepartmentID | DepartmentName |
|---|---|---|
| Smith | 34 | Clerical |
| Jones | 33 | Engineering |
| Robinson | 34 | Clerical |
| Steinberg | 33 | Engineering |
| Rafferty | 31 | Sales |
Left outer join
A left outer join is very different from a inner join. Instead of limiting results to those in both tables, it limits results to those in the "left" table (A). This means that if the ON clause matches 0 records, a row in the result will still be returned—but with NULL values for each column.
For example, this allows us to find the employee's departments, but still show the employee even when their department has not been set yet. The above example would have ignored employees in non-existent departments.
The result might look like this:
| LastName | DepartmentID | DepartmentName |
|---|---|---|
| Smith | 34 | Clerical |
| Jones | 33 | Engineering |
| Robinson | 34 | Clerical |
| Jasper | 36 | NULL |
| Steinberg | 33 | Engineering |
| Rafferty | 31 | Sales |
Right outer join
A right outer join is much like a left outer join, except that the tables are reversed. Every record from the right side, B or department, will be returned, and NULL values will be returned for those that have no matching left record. The results of a right outer join on these tables would look like:
| LastName | DepartmentID | DepartmentName |
|---|---|---|
| Smith | 34 | Clerical |
| Jones | 33 | Engineering |
| Robinson | 34 | Clerical |
| Steinberg | 33 | Engineering |
| Rafferty | 31 | Sales |
| NULL | 35 | Marketing |
Full outer join
Full outer joins are the combination of left and right outer joins. These joins will show records from both tables, and fill in NULLs for missing matches on either side.
Some database systems do not offer this functionality, but it can be emulated through the use of left outer joins and unions.
| LastName | DepartmentID | DepartmentName |
|---|---|---|
| Smith | 34 | Clerical |
| Jones | 33 | Engineering |
| Robinson | 34 | Clerical |
| Jasper | 36 | NULL |
| Steinberg | 33 | Engineering |
| Rafferty | 31 | Sales |
| NULL | 35 | Marketing |
Implementation
The efficient implementation of joins has been the goal of much work in database systems, because joins are both extremely common but rather difficult to execute efficiently. The difficulty results from the fact that joins are both commutative and associative. In practice, this means that the user merely supplies the list of relations to be joined and the join conditions to be used, and the database system has the task of determining the most efficient way to perform the operation. Determining how to execute a query containing joins is done by the query optimizer. It has two basic freedoms:
- join order: because joins are commutative, the order in which relations are joined does not change the final result set of the query. However, join order does have an enormous impact on the cost of the join operation, so choosing the right join order is very important.
- join method: given two relations and a join condition, there are multiple algorithms to produce the result set of the join. Which algorithm is most efficient depends on the sizes of the input relations, the number of tuples from each relation that match the join condition, and the operations required by the rest of the query.
Many join algorithms treat their input relations differently. The input relations are referred to as the "outer" and "inner" relations, or "left" and "right", respectively. In the case of nested loops, for example, the entire inner relation will be scanned for each tuple of the outer relation.
Query plans involving joins can be classified as:
- left-deep: the inner relation of each join in the plan is a base relation (rather than another join).
- right-deep: the outer relation of each join in the plan is a base relation.
- bushy: neither left-deep nor right-deep; both input relations of a join query may be joins themselves.
These names are derived from the appearance of the query if drawn as a tree, with the outer join relation on the left and the inner relation on the right (as is the convention).
Query optimization
Most query optimizers determine join order via a dynamic programming algorithm derived from the System R database system, originally designed by IBM. This algorithm works in two stages:
- First, all ways to access each relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an index on a relation that can be used to answer a predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.
- The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order.
- Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query.
In this manner, a query plan is eventually produced that joins all the queries in the relation. Note that the algorithm keeps track of the sort order of the result set produced by a query plan. This is to avoid a redundant sort operation later on in processing the query.
Historically, System-R derived query optimizers would often only consider left-deep query plans. This heuristic is drawn from the observation that join algorithms such as nested loops only require a single tuple of the outer relation at a time. Therefore, a left-deep query plan means that fewer tuples need to be held in memory at any time: the outer relation's join plan need only be executed until a single tuple is produced, and then the inner base relation can be scanned (this technique is called "pipelining"). This heuristic reduces the number of plans that need to be considered, but may result in not considering the optimal query plan.
Join algorithms
There are three fundamental algorithms to perform a join operation.
Nested loops
This is the simplest join algorithm. For each tuple in the outer join relation, the entire inner join relation is scanned, and any tuples that match the join condition are added to the result set. Naturally, this algorithm performs poorly if either the inner or outer join relation is very large.
A refinement to this technique is called "block nested loops": for every block in the outer relation, the entire inner relation is scanned. For each match between the current inner tuple and one of the tuples in the current block of the outer relation, a tuple is added to the join result set. This variant means that more computation is done for each tuple of the inner relation, but far fewer scans of the inner relation are required.
Merge join
If both join relations are sorted by the join attribute, the join can be performed trivially:
- For each tuple in the outer relation,
- Consider the current "group" of tuples from the inner relation; a group consists of a set of contiguous tuples in the inner relation with the same value in the join attribute.
- For each matching tuple in the current inner group, add a tuple to the join result. Once the inner group has been exhausted, both the inner and outer scans can be advanced to the next group.
This is one reason why many optimizers keep track of the sort order of query nodes — if one or both input relations to a merge join is already sorted on the join attribute, an additional sort is not required. Otherwise, the DBMS will need to perform the sort, usually using an external sort to avoid consuming too much memory.
Hash join
Semi join
A semi join is an efficient join method where first the join attributed of one table are collected and reported to the second one. It was first reported in 1981. It can be improved with a Bloom-Filter (hashing).
See also
Categories: SQL