SQL query with Inner Join, Business questions and visualization with Tableau— Part 3

Understanding Business requirements by using SQL Inner Join, Venn Diagram and Tableau

Photo by Lukas Blazek on Unsplash

This blog is related to part 3 of 4. In the previous article, we’ve focused on a basic query involving a single table. We know that a relational database consists of multiple related tables linked together using common columns and the process behind it is called Normalization.

As we discuss in Understanding Relational Database article part 1, the Normalization process makes data storage more efficient and avoids problems such as data redundancy.

However, to answer business questions, we need to recombine the data from the tables. This means that we must write queries.

The general syntax of a Join clause is as seen below:

SELECT column_name, expression
FROM
table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN
table2 as t2 ON join condition;

If the join condition uses the equal operator () and the column names in both tables used for matching are the same, we can use the clause instead:

SELECT value_expression
FROM table1 as t1
[INNER |{LEFT|RIGHT}[OUTER]] JOIN table2 as t2 USING (column_name);

I am using MySQL system but there are many other variants available, like PostgreSQL, SQLite, Oracle to mention a few examples. Across the different languages, the syntax can change a little bit, but the main concepts remain the same.

Join is a technique of combine data between one (self-join) or more tables based on values of the common column between the tables.

Combining data is straightforward. We simply specify which tables are involved and how the tables are connected. This means that, in the clause, we write the name of the first table (table1) that we want to take columns from. Then, we use the keyword, followed by the name of the second table (table2).

Next, the or clause (join condition) is used to query on which columns data should be joined (usually, this is an ). However, keep in mind that the columns on which we join the data are not always in the form of an . They also don’t need to have the same name.

To understand the process, in this article we will explain how a clause works.

INNER JOIN clause involving two tables:

Consider the business request: List the customer name and customer id of everyone who bought something in September 2017.

The first step is to understand our data. Let’s take a look at how the sales data transaction database is structured:

Database schema

We use command to display the tables we have in our sales database:

Tables in the sales order database

To answer the business request we first must identify the tables involved. We can see it through the Entity Relationship Diagram (ERD) below, it show us how the tables are connected.

Relation between table and table

In this case, the two tables involved are table and table. The Orders table contains just the to identify the specific customer. Customer name and other additional data are stored in the table along with the .

To get the answer we want, we can use the following query:

The partial output is as follow:

What we are doing is:

  • We enter the table names — and — in the line connected by a clause. The clause is used because we want to retrieve an exact match between those two tables.
  • Second, we tell the DBMS which columns are matched in each table. In this case, we match in the table to the in the table []. Remember, most of the time the column names will be the same as in this example, but they could be different. Also remember, if we don’t specify any join condition, most query systems will produce a cross join, where every row in one table is paired with every row in the other table.
  • The letters and are the table for simplicity. Since the column is used in both tables, it would not make sense to write . So to keep track of which column we want, we must also specify the name of the table: and
  • clause is used to specify the order date we want to get the information, and clause is used to sort the result. In this case, we sort the values by . However, if we want to sort by , the result would be:

The output shows us the purchases each customer made between 2nd and 30th September. For example, Suzanne made a purchase on the following days: 02, 03, 10, 17, 24, 25, 28 and 30 of September.

  • Because some customers might have made several purchases on the same day at different times, the clause was used to delete the duplicate rows. If we don’t include the clause, the result would be as follow:

For example, Suzanne made 2 purchases on 02 September. We can think that she made a purchase in the morning and then in the afternoon she made another purchase.

So, what is an INNER JOIN clause?

The clause compares each record from the first table with every record from the second table. If values in both rows cause the join condition evaluates to true, the inner join clause will result in the set of record that match both tables.

To put this in a simple way, we can think of intersection in set theory. Suppose we want to examine two sets of numbers. The first set of numbers is A = {1, 5, 8, 9, 32, 55, 78}. The second set of numbers is B = {3, 7, 8, 22, 55, 71, 99}. The intersection of these two sets of numbers is the numbers common to both sets: 8 and 55.

So, for our business request example the intersection of the table and table is the column common to both sets: . The Venn Diagram is illustrated as follow:

INNER JOIN clause involving more than two tables:

In the sales order database, we know that the table and table are linked by . The diagram below shows us that we have the and tables linked using the column, the and the table linked using column:

We need the order number, order date, customer first name and last name, order amount and product name. To translate this we could say:

  • From table joined with on in the table matching At the customer table,
  • Then joined with the table on in the table matching in the table,
  • And then finally joined with table on in the table matching in the table.

In this query, we use and two different approaches to link the tables together with both producing the same result set. We use an clause in the first query and then we use an clause.

Once we execute the above statements, we can see the partial result set of the data in the following image.

Partial query result set

Remember that selects all rows from both tables as long as there is a match between the columns in both tables. So, In this query, we request specific information about customers that have ordered or purchased a product on a specific date. That’s why we use an inner join.

To get customers who didn't order we must use another type of join. These are called which will be explained in article part 4.

More Business questions answered with SQL queries: Customer and product analysis

The result of the above query is shown below:

Result set

The graphical representation of the query above is shown as follow in the Dashboard obtained from Tableau:

Graphical representation of business question in tableau

Summary

We conclude that Normalization splits data into tables that can be stored and searched more efficiently. In another hand, queries are the reverse operation. The process of combining tables is related to Joins. which means that to get information from multiple tables we need to writing queries.

Be mindful that any time we use multiple tables, we must join them together. The joined tables also create a huge query. If we don’t specify any constraints, most query systems will produce a cross join, where every row in one table is paired with every row in the other table.

In the next blog, the topic will involve clause. Thanks for reading and I hope that you find this blog useful.

Book Reference

  • John L. Viescas, Michael J. Hernandez (2014). SQL Queries for Mere Mortals, A Hand on Guide to Data Manipulation in SQL.

Tracing my own path to become a Data Scientist in the business field