SQL query with Inner Join, Business questions and visualization with Tableau— Part 3

Understanding Business requirements by using SQL Inner Join, Venn Diagram and Tableau

Kueila Ramos
7 min readApr 19, 2021
Photo by Lukas Blazek on Unsplash

This blog is related to part 3 of 4. In the previous article, we’ve focused on a basic query involving a single table. We know that a relational database consists of multiple related tables linked together using common columns and the process behind it is called Normalization.

As we discuss in Understanding Relational Database article part 1, the Normalization process makes data storage more efficient and avoids problems such as data redundancy.

However, to answer business questions, we need to recombine the data from the tables. This means that we must write JOIN queries.

The general syntax of a Join clause is as seen below:

SELECT column_name, expression
FROM
table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN
table2 as t2 ON join condition;

If the join condition uses the equal operator (=) and the column names in both tables used for matching are the same, we can use the USING clause instead:

SELECT value_expression
FROM table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN table2 as t2 USING (column_name);

I am using MySQL system but there are many other variants available, like PostgreSQL, SQLite, Oracle to mention a few examples. Across the different languages, the syntax can change a little bit, but the main concepts remain the same.

Join is a technique of combine data between one (self-join) or more tables based on values of the common column between the tables.

Combining data is straightforward. We simply specify which tables are involved and how the tables are connected. This means that, in the FROM clause, we write the name of the first table (table1) that we want to take columns from. Then, we use the JOIN keyword, followed by the name of the second table (table2).

Next, the ON or USING clause (join condition) is used to query on which columns data should be joined (usually, this is an ID). However, keep in mind that the columns on which we join the data are not always in the form of an ID. They also don’t need to have the same name.

To understand the process, in this article we will explain how a INNER JOIN clause works.

INNER JOIN clause involving two tables:

Consider the business request: List the customer name and customer id of everyone who bought something in September 2017.

The first step is to understand our data. Let’s take a look at how the sales data transaction database is structured:

Database schema

We use SHOW tables command to display the tables we have in our sales database:

Tables in the sales order database

To answer the business request we first must identify the tables involved. We can see it through the Entity Relationship Diagram (ERD) below, it show us how the tables are connected.

Relation between orders table and customers table

In this case, the two tables involved are orders table and customers table. The Orders table contains just theCustomerID to identify the specific customer. Customer name and other additional data are stored in the customers table along with the CustomerID.

To get the answer we want, we can use the following query:

The partial output is as follow:

What we are doing is:

  • We enter the table names — ordersand customers — in the FROM line connected by a INNER JOIN clause. The INNER JOIN clause is used because we want to retrieve an exact match between those two tables.
  • Second, we tell the DBMS which columns are matched in each table. In this case, we match CustomerID in the orderstable to the CustomerID in the customers table [ON o.CustomerID = c.CustomerID]. Remember, most of the time the column names will be the same as in this example, but they could be different. Also remember, if we don’t specify any join condition, most query systems will produce a cross join, where every row in one table is paired with every row in the other table.
  • The letters C and O are the table alias for simplicity. Since the column CustomerID is used in both tables, it would not make sense to write CustomerID = CustomerID . So to keep track of which column we want, we must also specify the name of the table: o.CustomerID and c.CustomerID
  • WHERE clause is used to specify the order date we want to get the information, and ORDER BY clause is used to sort the result. In this case, we sort the values by orderdate. However, if we want to sort by CustomerID, the result would be:

The output shows us the purchases each customer made between 2nd and 30th September. For example, Suzanne made a purchase on the following days: 02, 03, 10, 17, 24, 25, 28 and 30 of September.

  • Because some customers might have made several purchases on the same day at different times, the DISTINCT clause was used to delete the duplicate rows. If we don’t include the DISTINCT clause, the result would be as follow:

For example, Suzanne made 2 purchases on 02 September. We can think that she made a purchase in the morning and then in the afternoon she made another purchase.

So, what is an INNER JOIN clause?

The INNER JOIN clause compares each record from the first table with every record from the second table. If values in both rows cause the join condition evaluates to true, the inner join clause will result in the set of record that match both tables.

To put this in a simple way, we can think of intersection in set theory. Suppose we want to examine two sets of numbers. The first set of numbers is A = {1, 5, 8, 9, 32, 55, 78}. The second set of numbers is B = {3, 7, 8, 22, 55, 71, 99}. The intersection of these two sets of numbers is the numbers common to both sets: 8 and 55.

So, for our business request example the intersection of the orders table and customers table is the column common to both sets: CustomerID . The Venn Diagram is illustrated as follow:

INNER JOIN clause involving more than two tables:

In the sales order database, we know that the orderstable and customers table are linked by CustomerID. The diagram below shows us that we have the orders and order_details tables linked using the orderNumber column, the order_details and the products table linked using productnumber column:

We need the order number, order date, customer first name and last name, order amount and product name. To translate this we could say:

  • From customers table joined with orders on CustomerID in the orders table matching CustomerID At the customer table,
  • Then joined with the order_details table on orderNumber in the orders table matching orderNumber in the order_details table,
  • And then finally joined with products table on productNumber in the products table matching productNumber in the order_details table.

In this query, we use INNER JOIN and two different approaches to link the tables together with both producing the same result set. We use an USING clause in the first query and then we use an ON clause.

Once we execute the above statements, we can see the partial result set of the data in the following image.

Partial query result set

Remember that INNER JOIN selects all rows from both tables as long as there is a match between the columns in both tables. So, In this query, we request specific information about customers that have ordered or purchased a product on a specific date. That’s why we use an inner join.

To get customers who didn't order we must use another type of join. These are called OUTER JOIN which will be explained in article part 4.

More Business questions answered with SQL queries: Customer and product analysis

The result of the above query is shown below:

Result set

The graphical representation of the query above is shown as follow in the Dashboard obtained from Tableau:

Graphical representation of business question in tableau

Summary

We conclude that Normalization splits data into tables that can be stored and searched more efficiently. In another hand, queries are the reverse operation. The process of combining tables is related to Joins. which means that to get information from multiple tables we need to writing JOIN queries.

Be mindful that any time we use multiple tables, we must join them together. The joined tables also create a huge query. If we don’t specify any constraints, most query systems will produce a cross join, where every row in one table is paired with every row in the other table.

In the next blog, the topic will involve OUTER JOIN clause. Thanks for reading and I hope that you find this blog useful.

Book Reference

  • John L. Viescas, Michael J. Hernandez (2014). SQL Queries for Mere Mortals, A Hand on Guide to Data Manipulation in SQL.

--

--

Kueila Ramos

Tracing my own path to become a Data Scientist in the business field