
Data Preparation & EDA of an E-commerce company using Python

Kueila Ramos
12 min read · Mar 27, 2021


In this article, we are going to analyze an E-commerce dataset that contains transactions occurring over two years (2010 and 2011). The dataset was obtained from Kaggle in Excel format and relates to a UK-based company that operates in the e-commerce market.

Using Python 3 and Jupyter Notebook, we will perform a number of tasks to prepare our dataset so that we can then explore and analyze the company's sales data at a business level. The tasks to prepare the dataset will be:

1. Rename column labels to lowercase
2. Convert text in the "description" column to lowercase
3. Set data types correctly
4. Handle missing values
5. Check for duplicate values
6. Search for outliers

Let’s get started

We need to import the relevant libraries that will be useful for our analysis and then load the data into pandas. We can do this with a few Python commands: import pandas as pd, import numpy as np and pd.read_csv().

Importing relevant libraries and loading the data

Note: Make sure your dataset is in the same folder as the Python script you are running. If it is stored elsewhere, you will need to change the read_csv parameter to point to the file’s location.
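Below is a minimal sketch of these two steps; the file name data.csv and the encoding are assumptions, so adjust them to match your copy of the dataset:

import pandas as pd
import numpy as np

# Load the transactions into a DataFrame
# (file name and encoding are assumptions; change them if your export differs)
df = pd.read_csv('data.csv', encoding='ISO-8859-1')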

Knowing our data

Let’s inspect our data. To do that, we can use the .head() and .tail() functions, which give us the first n rows and the last n rows of the data, respectively. The default number of rows displayed by .head() and .tail() is five, but we can specify any number of rows as an argument. In this case, we want to get the first 15 rows and the last 10 rows. We’ll see the following result:

First 15 rows
Last 10 rows
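Here is a minimal sketch of the calls that produce the two previews above:

# Preview the first 15 and the last 10 rows of the raw data
print(df.head(15))
print(df.tail(10))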

Each row in the dataset corresponds to an order made by a customer. We have information about:

Product:
1. StockCode - Code assigned to each distinct product;
2. Description - Product name;
3. Quantity - The quantity of each product (item) per transaction;
4. UnitPrice - Product price per unit.

Customer:
1. Customer ID - Number assigned to each customer;
2. Country - Country where each customer resides.

Other features:
1. InvoiceNo - Number assigned to each transaction;
2. InvoiceDate - The day and time when each transaction was generated.

First, for better readability, we will rename the columns and convert the text in the Description column from uppercase to lowercase:

Renaming the columns and replacing text
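Here is a minimal sketch of this step; the exact snake_case labels are an assumption based on the column names used later in the article:

# Rename the original column labels (adjust the keys if they differ in your file)
df = df.rename(columns={'InvoiceNo': 'invoice_no',
                        'StockCode': 'stock_code',
                        'Description': 'description',
                        'Quantity': 'quantity',
                        'InvoiceDate': 'invoice_date',
                        'UnitPrice': 'unit_price',
                        'CustomerID': 'customer_id',
                        'Country': 'country'})

# Convert the product descriptions to lowercase
df['description'] = df['description'].str.lower()

# Continue the analysis on a copy of the prepared dataset
data = df.copy()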

The code displays the new result below, and after that we make a copy of the dataset, assigning it to a new variable, data:

Output after renaming column header and description column value

Understanding the data

We use the .info() function to identify the data types and some basic information about the content of the DataFrame. df.info() returns a list of all the columns in our dataset (8 columns in total), the total of 541,909 rows and the number of non-null values in each column, as we see in the table below.

Basic information

The dataset contains over 540,000 transactions made by customers; five of our columns have the data type object, two are float and one is integer. However, we have a column that is a candidate for datetime conversion. Take a look at the invoice_date column: pandas assigns it the object type, but it should be a datetime type.

Comparing the output after renaming the columns with the DataFrame info seen above, we can see an inconsistency in the data types. At first glance, the invoice_no column seems to be a numeric type, but it is assigned a string type, so there must be something about this variable that will need further attention.

Upon taking a closer look, even though the first three columns are objects, we can see from the data that only the description column has mixed types.
The table also informs us that the description and customer_id columns have only 540,455 and 406,829 non-null values, respectively. This means that we have missing values, since we have 541,909 entries in total.

Now that we know our dataset, our next move will be preparing our data by cleaning it up.

Data preparation/Data Cleaning

  • Changing data types

Knowing the data types of our dataset helps us make better assessments and improves performance by ensuring that the data is stored correctly. We saw above that customer_id, invoice_no and stock_code are unique identifiers, and that two columns are stored with incorrect data types: customer_id is stored as a float, and invoice_date is a date stored in object format.

We use pd.to_datetime() to convert the invoice_date column from object to datetime format and the .astype() function to convert customer_id, and we get this result:

ValueError

The output means that we cannot convert a column to an integer type while it still has missing values. invoice_date was converted to datetime correctly because it has no missing values, as we saw previously, but for customer_id we get a ValueError because it does have missing values. To convert customer_id, we must first handle the missing values.
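Here is a minimal sketch of the two conversions described above; casting customer_id to an integer type is an assumption about the original attempt:

# Works: invoice_date has no missing values
data['invoice_date'] = pd.to_datetime(data['invoice_date'])

# Fails: customer_id still contains NaN, so the cast to an integer type
# raises a ValueError
data['customer_id'] = data['customer_id'].astype(int)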

  • Dealing with missing values

Missing values (or missing data, a.k.a. NaN) are a common problem in data science projects and can have a significant effect on conclusions, as they can lead to incorrect analysis. This could be because the data was filled in improperly, was not available, or resulted from a computing error.

Inspecting our dataset above with info(), we could see the total number of non-null values in each column. Now, we can take a closer look at the data to detect missing values using the isnull() and sum() functions together.

The isnull() function returns False where a value is not missing and True where it is. The sum() function then returns the number of missing values in each column. We can also use the .sort_values() function with the ascending=False parameter to sort the output in descending order, as we can see below:

Missing values
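A minimal sketch of the count shown above:

# Number of missing values per column, largest first
data.isnull().sum().sort_values(ascending=False)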

This output shows us more clearly that the customer_id and description columns have 135,080 and 1,454 missing values, respectively.

We can check out the rows with those missing values below:

First 10 rows with missing values

Let’s first check out the null values we have in the description column:

Since we saw previously that the description column has mixed value types of string and float, we can confirm that the float values present in the column are NaN, so to deal with this column we could simply impute them with ‘unknown_item’.

However, the table also shows us that the NaN values in the description column match the NaN values in the customer_id column. Since customer_id is a random system-generated value and we cannot replace it with another value, it doesn't matter whether we impute the description column with “unknown item”, because all of the rows with NaN in customer_id will be dropped using data = data.dropna(subset=['customer_id']), leaving us with 406,829 rows.

Note: In some cases, we could opt to drop columns instead of deleting rows. To do that, we would just need to use .dropna() again and provide the axis=1 parameter; however, in this particular case we can’t remove the columns because they are relevant for the analysis that will take place later.

Now that we have removed the missing values, the customer_id column is successfully converted to string type:
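A minimal sketch of these two steps; dropping the rows based on the customer_id column and the intermediate integer cast are assumptions:

# Drop every row where customer_id is missing
data = data.dropna(subset=['customer_id'])

# With the NaN gone, cast customer_id to int (to remove the trailing .0)
# and then to string
data['customer_id'] = data['customer_id'].astype(int).astype(str)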

Another common issue is duplicate data.

  • Checking Duplicates
Rows with duplicate data

In total, the dataset has 5,225 duplicate values. We can delete them using drop_duplicates(keep=False, inplace=True), leaving us with 396,767 entries. With the data cleaned of duplicates and null values, we can start searching for patterns.
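A minimal sketch of the duplicate check and removal:

# Count fully duplicated rows
print(data.duplicated().sum())

# Remove every row that has an exact duplicate elsewhere in the data
data.drop_duplicates(keep=False, inplace=True)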

  • Searching for patterns

The next step is to produce a descriptive summary of the numeric variables. Using the describe() method, we get the following result and some concerns:

1. Why is invoice_no not considered a numeric value, since it is a column of numbers?
2. Why does the quantity column have negative values, as well as extremely large positive and negative values, which indicate outliers?
3. What about unit_price: how can the minimum value be zero? Does it indicate items given to customers for free?

Let’s take a look at those features one by one:

Using some simple checks, we find that the invoice_no column has entries starting with the letter C corresponding to 8,844 rows, and numeric entries corresponding to 397,924 rows, as shown below. How can we explain this?

Entries starting with the letter C

The negative values in the quantity column seem to correspond to invoices that have been cancelled. We verify this by defining a command that grabs only the rows with negative quantities (see the project in my GitHub repo). As a result, it returns a DataFrame where every invoice_no starts with the letter C, as shown in the table below:
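A minimal sketch of the two checks described above:

# Invoices whose number starts with the letter C (cancellations)
cancelled = data[data['invoice_no'].str.startswith('C')]
print(len(cancelled))

# Rows with a negative quantity; these match the C-prefixed invoices
print(data[data['quantity'] < 0].head())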

2. Graphical visualization of the numeric data (quantity and unit_price)

Looking at the histogram below, we can see that the majority of the quantity data is centred in the middle, and the unit price ranges between 0 and 40,000.

Quantity vs unit_price graphical representation

Out of the 396,767 rows, we observe that the maximum and minimum quantity values are 80,995 and -80,995, respectively. The interquartile range analysis tells us that most of the values lie between 2 and 12; values greater than 12 introduce us to outliers. To learn more about the outliers, we have to get the relevant percentiles and see their distribution.
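A minimal sketch of the fence calculation based on the interquartile range (the usual 1.5 multiplier is an assumption about the exact rule used):

# Interquartile range of the quantity column
q1 = data['quantity'].quantile(0.25)
q3 = data['quantity'].quantile(0.75)
iqr = q3 - q1

# Tukey fences: values outside this range are potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the rows that fall outside the fences
outliers = data[(data['quantity'] < lower_fence) | (data['quantity'] > upper_fence)]
print(outliers[['quantity', 'unit_price']].describe())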

Now that we have found the upper and lower fences, by adding some code we can plot all the data above the upper fence and below the lower fence.

Filtering out the outlier data and printing only the potential outliers, we obtain the following result:

Now, the result shows an extreme outlier in unit_price, corresponding to a transaction with a unit price of 38,970:

To deal with that, we drop this outlier. After that, we check the distribution of the price again and find more outliers; here is the result:

An outlier is a value or an entire observation that lies well outside of the norm. By looking at the histogram and box plot above, we can now see that the maximum and minimum values of the unit_price column are 8,142 and 0.00, respectively.

For this particular example, some unit_price values lie far above the 75th percentile. Those values seem legitimate, but they are well beyond the average prices of the products. Nevertheless, we decide to keep these outliers.

After this statistical analysis, in the next section we will carry out an exploratory analysis and answer a couple of business-focused questions.

EXPLORATORY DATA ANALYSIS

  • Categorical variables

Within pandas, to summarize the non-numerical features we apply the describe() function and specify include='object', which returns the number of values (count), the number of unique values (unique), the most frequent value (top) and its frequency (freq) in the dataset.

Note: passing the NumPy object data type, describe(include=np.object), gives the same result (np.object is deprecated in recent NumPy versions, so the built-in object or the string 'object' should be used instead).

For a quick interpretation: the most sold product was ‘regency cakestand 3 tier’. We have 37 countries in total. The United Kingdom is the top and most frequent country, with a high number of orders (20,660), which corresponds to approx. 89% of the dataset.

Answering business questions

1. What are the TOP 5 countries with the highest number of orders?

The company receives the highest number of orders from customers in the United Kingdom (since it is a UK-based company). Therefore, the TOP 5 countries that place the highest number of orders are the United Kingdom, Germany, France, EIRE and Spain.
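A sketch of one way to obtain this ranking, counting distinct invoice numbers per country (column names as defined in the renaming step):

# Number of distinct orders (invoices) per country, top 5
orders_per_country = data.groupby('country')['invoice_no'].nunique()
print(orders_per_country.sort_values(ascending=False).head(5))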

2. What are the TOP 5 countries with the highest sales?

As the company receives the highest number of orders from customers in the United Kingdom (since it is a UK-based company), customers in the United Kingdom also spend the most on their purchases. The table above gives us the TOP 5 countries by sales: United Kingdom, Germany, France, EIRE and Switzerland.
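A sketch of one way to compute this, assuming sales are defined as quantity multiplied by unit price:

# Total sales amount per country, top 5
data['sales'] = data['quantity'] * data['unit_price']
sales_per_country = data.groupby('country')['sales'].sum()
print(sales_per_country.sort_values(ascending=False).head(5))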

3. What were the total sales per year?

The transactions occurred over two periods (2010 and 2011). In total, global sales were 3,432.593 in 2010 and 4,782.446 in 2011.

However, looking at a specific country, for example the United Kingdom, the result is shown below:
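A sketch of the yearly aggregation, globally and for the United Kingdom; the sales column from the previous snippet is assumed:

# Total sales per year across all countries
data['year'] = data['invoice_date'].dt.year
print(data.groupby('year')['sales'].sum())

# The same breakdown restricted to the United Kingdom
uk = data[data['country'] == 'United Kingdom']
print(uk.groupby('year')['sales'].sum())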

4. How many different orders were made by customers in the United Kingdom?

We see that the dataset is largely dominated by orders made from the United Kingdom. The DataFrame contains 330,908 entries. To simplify, we could ask: how many customers and products are in these entries?

It can be seen that the data concerns 3,858 different customers and that they bought 3,643 different products. The total number of transactions carried out was above 18,427 orders. See the table below for the years separately.

5. What was the best month for sales in the United Kingdom? How much was earned that month?

The best month for sales in 2010 was December. The best month for sales in 2011 was November. You can also see the graphical representation below:
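A sketch of how the monthly totals behind this chart could be computed and plotted; the uk subset and the sales column from the previous snippets are assumed:

import matplotlib.pyplot as plt

# Total UK sales per month, grouped by year and month
monthly_sales = uk.groupby([uk['invoice_date'].dt.year,
                            uk['invoice_date'].dt.month])['sales'].sum()
monthly_sales.plot(kind='bar', title='UK sales per month')
plt.show()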

6. In 2011, how many orders were made per hour? How many orders were made per day?

From the graphical representation below, we can conclude that most orders occur around 12:00, and customers usually make more purchases on Thursdays.
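A sketch of the per-hour and per-day counts for 2011, counting distinct invoices (column names as before):

# Restrict to 2011 and count distinct orders per hour and per weekday
orders_2011 = data[data['invoice_date'].dt.year == 2011]
per_hour = orders_2011.groupby(orders_2011['invoice_date'].dt.hour)['invoice_no'].nunique()
per_day = orders_2011.groupby(orders_2011['invoice_date'].dt.day_name())['invoice_no'].nunique()
print(per_hour)
print(per_day)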

7. How many transactions were made, and which of them correspond to cancelled orders?

This result shows the existence of frequent customers who buy a large number of items in each order. Entries with the prefix C in the invoice_no column indicate transactions that have been cancelled. For client number 12924, only one transaction was cancelled.

Summary

This article demonstrated the steps I adopted in the analysis of an e-commerce dataset. First, I performed a set of preparation tasks: dropping NaN values from the DataFrame, removing rows based on a condition, changing the type of columns (to_datetime, astype), dropping duplicate data and handling outliers for statistical purposes. Then the article moved on to the exploratory data analysis, where we explored business-level questions related to our data.

