SQL query with Inner Join, Business questions and visualization with Tableau— Part 3
Understanding Business requirements by using SQL Inner Join, Venn Diagram and Tableau
This blog is related to part 3 of 4. In the previous article, we’ve focused on a basic query involving a single table. We know that a relational database consists of multiple related tables linked together using common columns and the process behind it is called Normalization.
As we discuss in Understanding Relational Database article part 1, the Normalization process makes data storage more efficient and avoids problems such as data redundancy.
However, to answer business questions, we need to recombine the data from the tables. This means that we must write JOIN
queries.
The general syntax of a Join clause is as seen below:
SELECT column_name, expression
FROM table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN table2 as t2 ON join condition;
If the join condition uses the equal operator (=
) and the column names in both tables used for matching are the same, we can use the USING
clause instead:
SELECT value_expression
FROM table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN table2 as t2 USING (column_name);
I am using MySQL system but there are many other variants available, like PostgreSQL, SQLite, Oracle to mention a few examples. Across the different languages, the syntax can change a little bit, but the main concepts remain the same.
Join is a technique of combine data between one (self-join) or more tables based on values of the common column between the tables.
Combining data is straightforward. We simply specify which tables are involved and how the tables are connected. This means that, in the FROM
clause, we write the name of the first table (table1) that we want to take columns from. Then, we use the JOIN
keyword, followed by the name of the second table (table2).
Next, the ON
or USING
clause (join condition) is used to query on which columns data should be joined (usually, this is an ID
). However, keep in mind that the columns on which we join the data are not always in the form of an ID
. They also don’t need to have the same name.
To understand the process, in this article we will explain how a INNER JOIN
clause works.
INNER JOIN clause involving two tables:
Consider the business request: List the customer name and customer id of everyone who bought something in September 2017.
The first step is to understand our data. Let’s take a look at how the sales data transaction database is structured:
We use SHOW tables
command to display the tables we have in our sales database:
To answer the business request we first must identify the tables involved. We can see it through the Entity Relationship Diagram (ERD) below, it show us how the tables are connected.
In this case, the two tables involved are orders
table and customers
table. The Orders table contains just theCustomerID
to identify the specific customer. Customer name and other additional data are stored in the customers
table along with the CustomerID
.
To get the answer we want, we can use the following query:
The partial output is as follow:
What we are doing is:
- We enter the table names —
orders
andcustomers
— in theFROM
line connected by aINNER JOIN
clause. TheINNER JOIN
clause is used because we want to retrieve an exact match between those two tables. - Second, we tell the DBMS which columns are matched in each table. In this case, we match
CustomerID
in theorders
table to theCustomerID
in thecustomers
table [ON o.CustomerID = c.CustomerID
]. Remember, most of the time the column names will be the same as in this example, but they could be different. Also remember, if we don’t specify any join condition, most query systems will produce a cross join, where every row in one table is paired with every row in the other table. - The letters
C
andO
are the tablealias
for simplicity. Since the columnCustomerID
is used in both tables, it would not make sense to writeCustomerID = CustomerID
. So to keep track of which column we want, we must also specify the name of the table:o.CustomerID
andc.CustomerID
WHERE
clause is used to specify the order date we want to get the information, andORDER BY
clause is used to sort the result. In this case, we sort the values byorderdate
. However, if we want to sort byCustomerID
, the result would be:
The output shows us the purchases each customer made between 2nd and 30th September. For example, Suzanne made a purchase on the following days: 02, 03, 10, 17, 24, 25, 28 and 30 of September.
- Because some customers might have made several purchases on the same day at different times, the
DISTINCT
clause was used to delete the duplicate rows. If we don’t include theDISTINCT
clause, the result would be as follow:
For example, Suzanne made 2 purchases on 02 September. We can think that she made a purchase in the morning and then in the afternoon she made another purchase.
So, what is an INNER JOIN clause?
The INNER JOIN
clause compares each record from the first table with every record from the second table. If values in both rows cause the join condition evaluates to true, the inner join clause will result in the set of record that match both tables.
To put this in a simple way, we can think of intersection in set theory. Suppose we want to examine two sets of numbers. The first set of numbers is A = {1, 5, 8, 9, 32, 55, 78}. The second set of numbers is B = {3, 7, 8, 22, 55, 71, 99}. The intersection of these two sets of numbers is the numbers common to both sets: 8 and 55.
So, for our business request example the intersection of the orders
table and customers
table is the column common to both sets: CustomerID
. The Venn Diagram is illustrated as follow:
INNER JOIN clause involving more than two tables:
In the sales order database, we know that the orders
table and customers
table are linked by CustomerID
. The diagram below shows us that we have the orders
and order_details
tables linked using the orderNumber
column, the order_details
and the products
table linked using productnumber
column:
We need the order number, order date, customer first name and last name, order amount and product name. To translate this we could say:
- From
customers
table joined withorders
onCustomerID
in theorders
table matchingCustomerID
At the customer table, - Then joined with the
order_details
table onorderNumber
in theorders
table matchingorderNumber
in theorder_details
table, - And then finally joined with
products
table onproductNumber
in theproducts
table matchingproductNumber
in theorder_details
table.
In this query, we use INNER JOIN
and two different approaches to link the tables together with both producing the same result set. We use an USING
clause in the first query and then we use an ON
clause.
Once we execute the above statements, we can see the partial result set of the data in the following image.
Remember that INNER JOIN
selects all rows from both tables as long as there is a match between the columns in both tables. So, In this query, we request specific information about customers that have ordered or purchased a product on a specific date. That’s why we use an inner join.
To get customers who didn't order we must use another type of join. These are called OUTER JOIN
which will be explained in article part 4.
More Business questions answered with SQL queries: Customer and product analysis
The result of the above query is shown below:
The graphical representation of the query above is shown as follow in the Dashboard obtained from Tableau:
Summary
We conclude that Normalization splits data into tables that can be stored and searched more efficiently. In another hand, queries are the reverse operation. The process of combining tables is related to Joins. which means that to get information from multiple tables we need to writing JOIN
queries.
Be mindful that any time we use multiple tables, we must join them together. The joined tables also create a huge query. If we don’t specify any constraints, most query systems will produce a cross join, where every row in one table is paired with every row in the other table.
In the next blog, the topic will involve OUTER JOIN
clause. Thanks for reading and I hope that you find this blog useful.
Book Reference
- John L. Viescas, Michael J. Hernandez (2014). SQL Queries for Mere Mortals, A Hand on Guide to Data Manipulation in SQL.