Data Preparation & EDA of an E-commerce company using Python
In this article, we are going to analyze an e-commerce dataset that contains transactions occurring over two years (2010 and 2011). The dataset was obtained from Kaggle and relates to a UK-based company that operates in the e-commerce market.
Using Python 3 and Jupyter Notebook, we will perform some tasks to prepare our dataset so that we can explore and analyze the company's sales data at a business level. The tasks to prepare the dataset will be:
1. Rename column labels to lowercase
2. Convert text in the "description" column to lowercase
3. Set data types correctly
4. Handle missing values
5. Check for duplicate values
6. Search for outliers
Let’s get started
We need to import the relevant libraries that will be useful for our analysis, then load the data into pandas. We can do this with a few Python commands: `import pandas as pd`, `import numpy as np` and `pd.read_csv()`.
Note: Make sure your dataset is in the same folder from which you are running the Python script. If it is stored elsewhere, you will need to change the path passed to `read_csv()` to point to the file's location.
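As a runnable sketch of the loading step (the file name in the article is not shown, so here we read a two-row in-memory sample with the same columns instead of a file on disk):

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("your_file.csv") -- a tiny sample with the
# same columns as the dataset described below.
sample = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom\n"
)
data = pd.read_csv(sample)
print(data.shape)  # (2, 8)
```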
Knowing our data
Let’s inspect our data. To do that, we can use the `.head()` and `.tail()` functions, which give us the first n rows and the last n rows of the data respectively. The default number of rows displayed by both is five, but we can pass any number of rows as an argument. In this case, we want to get the first 15 rows and the last 10 rows. We’ll see the following result:
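For example, on a hypothetical mini-DataFrame standing in for the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({"invoice_no": range(100), "quantity": range(100)})

first = df.head(15)  # first 15 rows
last = df.tail(10)   # last 10 rows
print(len(first), len(last))  # 15 10
```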
Each row in the dataset corresponds to an order made by a customer. We have information about:
Product:
1. StockCode - Code assigned to each distinct product;
2. Description - Product name;
3. Quantity - The quantities of each product (item) per transaction;
4. UnitPrice - Product price per unit.

Customer:
1. Customer ID - Number assigned to each customer;
2. Country - Country where each customer resides.

Other features:
1. InvoiceNo - Number assigned to each transaction;
2. InvoiceDate - The day and time when each transaction was generated.
First, for better readability we will rename the columns and replace text within the Description column from uppercase to lowercase:
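One way to implement this step (a sketch; the snake_case labels follow the column names used later in the article):

```python
import pandas as pd

# One-row stand-in for the dataset, with the original column labels.
df = pd.DataFrame([["536365", "85123A", "WHITE METAL LANTERN", 6,
                    "12/1/2010 8:26", 3.39, 17850.0, "United Kingdom"]],
                  columns=["InvoiceNo", "StockCode", "Description", "Quantity",
                           "InvoiceDate", "UnitPrice", "CustomerID", "Country"])

# Rename the labels to lowercase snake_case ...
df.columns = ["invoice_no", "stock_code", "description", "quantity",
              "invoice_date", "unit_price", "customer_id", "country"]
# ... and lowercase the product descriptions.
df["description"] = df["description"].str.lower()
print(df["description"][0])  # white metal lantern
```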
The code displays the new result below, and after that we make a copy of the dataset, assigning the data to a new variable:
Understanding the data
We use the `.info()` function to identify the data types and get some basic information about the content of the DataFrame. `df.info()` should return a list of all the columns in our dataset (8 columns in total), the 541,909 rows, and the number of non-null values per column, as we see in the table below.
The dataset contains over 540,000 transactions made by customers. Five of our columns have the data type `object`, two are `float` and one is `integer`. However, we have a column that is a candidate for `datetime` conversion: take a look at the invoice_date column. Pandas assigns `object` to the column, but it should be a `datetime` type.
Comparing the output after renaming the columns with the DataFrame info seen above, we can see an inconsistency in the data types. At first glance, the invoice_no column seems to be a numeric type, but it is assigned a `string` (object) type, so there must be something in this variable that will need further attention.
Upon taking a closer look, even though the first three columns are objects, we can see from the data that only the description column has mixed types.
The table also informs us that the description and customer_id columns have only 540,455 and 406,829 non-null values respectively. This means we have missing values, since we have 541,909 entries in total.
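A toy illustration of the dtype situation described above (the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "invoice_no": ["536365", "C536379"],                   # looks numeric, stored as object
    "invoice_date": ["12/1/2010 8:26", "12/1/2010 9:41"],  # should be datetime
    "customer_id": [17850.0, np.nan],                      # float because of the NaN
})
df.info()  # prints dtypes and non-null counts per column
```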
Now that we know our dataset, our next move will be preparing our data by cleaning it up.
Data preparation/Data Cleaning
- Changing data types
Knowing the data types of our dataset helps us make better evaluations and improves performance by ensuring that the data is stored correctly. We saw above that customer_id, invoice_no and stock_code are unique identifiers for each transaction, and that two columns are stored with an incorrect data type: customer_id is stored as a float number, and invoice_date is a date stored in `object` format.
We use `pd.to_datetime()` to convert the invoice_date column from `object` to `datetime` format, and the `.astype()` function to convert customer_id, and we get this result:
The output means that we cannot convert a column that has missing values. invoice_date was set to `datetime` correctly because it has no missing values, as we saw previously; for customer_id, however, we get a ValueError because of the missing values. To convert customer_id, we must first handle those missing values.
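A sketch of the failure (assuming the cast targets an integer type, which is what raises a ValueError when NaNs are present):

```python
import numpy as np
import pandas as pd

customer_id = pd.Series([17850.0, np.nan])

try:
    customer_id.astype("int64")  # fails: NaN cannot become an integer
except ValueError as err:
    print("cast failed:", err)

# to_datetime succeeds because invoice_date has no missing values.
invoice_date = pd.to_datetime(pd.Series(["12/1/2010 8:26"]))
print(invoice_date.dtype)  # datetime64[ns]
```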
- Dealing with missing values
Missing values (or missing data, a.k.a. `NaN`) are a common problem in data science projects and can have a significant effect on conclusions, as they can lead to incorrect analysis. They can occur because a field was filled out improperly, because the data was not available, or because of a computing error.
Inspecting our dataset above with `info()`, we could see the total of non-null values in each column. Now we can take a closer look at the data and detect missing values using the `isnull()` and `sum()` functions together.
The `isnull()` function returns `False` where a value is present and `True` where it is missing. The `sum()` function then returns the number of missing values in each column. We can also use the `.sort_values()` function with the `ascending=False` parameter to sort the output in descending order, as we can see below:
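On a toy frame, the chained call looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [17850.0, np.nan, np.nan],
    "description": ["white metal lantern", np.nan, "regency cakestand 3 tier"],
})

# Count missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)  # customer_id: 2, description: 1
```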
This output shows more clearly that the customer_id and description columns have 135,080 and 1,454 missing values respectively.
We can check out the rows with those missing values below:
Let’s first check out the null values we have in the description column:
Since we saw previously that the description column has mixed value types of `string` and `float`, we can confirm that the float values present in the column are `NaN`; so, to deal with this column, we could simply impute them with 'unknown_item'.
However, the table also shows that the `NaN` values in the description column match the `NaN` values in the customer_id column. Since customer_id is a random system-generated value that we cannot replace with anything else, it doesn't matter whether we impute the description column with 'unknown_item', because all of the `NaN` rows will be dropped using `data = data.dropna(subset=['customer_id'])`, leaving us with 406,829 rows.
Note: In some cases, we could opt to drop columns instead of deleting rows. To do that, we would just need to use `.dropna()` again and provide the `axis=1` parameter; however, in this particular case we can't remove the columns, because they are relevant for the analysis that will take place later.
Now that the missing values are removed, the customer_id column can be successfully converted to `string` type:
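Both steps can be sketched as follows (casting through int64 first is an assumption on my part, to strip the trailing ".0" that a direct float-to-string cast would keep):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "customer_id": [17850.0, np.nan],
    "description": ["white metal lantern", np.nan],
})

# Drop the rows whose customer_id is missing ...
data = data.dropna(subset=["customer_id"])
# ... after which the cast no longer raises.
data["customer_id"] = data["customer_id"].astype("int64").astype(str)
print(data["customer_id"].tolist())  # ['17850']
```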
Another common issue is duplicate data.
- Checking Duplicates
In total, the dataset has 5,225 duplicate values. We can delete them using `drop_duplicates(keep=False, inplace=True)`, leaving us with 396,767 entries. With the data cleaned of duplicates and null values, we can start searching for patterns.
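A minimal sketch of the deduplication (note that `keep=False`, as used above, removes every copy of a duplicated row rather than keeping the first occurrence):

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates.
df = pd.DataFrame({"invoice_no": ["536365", "536365", "536366"],
                   "stock_code": ["85123A", "85123A", "71053"]})

print(df.duplicated().sum())            # 1 row flagged as a duplicate
deduped = df.drop_duplicates(keep=False)
print(len(deduped))                     # 1 -- both copies are gone
```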
- Searching for patterns
The next step is to compute a descriptive summary of the numeric variables. Using the `describe()` method, we get the following result and concerns:
1. Why is invoice_no not considered a numeric value, since it is a column of numbers?
2. Why does the quantity column have negative values, as well as extreme positive and negative values, which indicate outliers?
3. What about unit_price: how can the minimum value be zero? Does it indicate items given to customers for free?
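The summary itself comes from a single call; on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [6, -6, 12, 80995],
                   "unit_price": [2.55, 0.0, 3.39, 1.25]})

# count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()
print(summary.loc["min", "quantity"], summary.loc["min", "unit_price"])  # -6.0 0.0
```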
Let’s take a look at those features one by one:
Using some methods, we found that the invoice_no column has 8,844 rows containing text (starting with the letter C) and 397,924 rows containing numbers, as shown below. How can we explain this?
The invoice numbers starting with the letter C seem to correspond to cancelled orders, which explains the negative values in the quantity column. We can verify this by writing a command that grabs only the negative values (see the project on my GitHub repo): it returns a DataFrame in which every invoice_no starts with the letter C, as shown in the table below:
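A sketch of that check on toy data:

```python
import pandas as pd

df = pd.DataFrame({"invoice_no": ["536365", "C536379", "C536383"],
                   "quantity": [6, -1, -5]})

# Rows with a negative quantity ...
cancelled = df[df["quantity"] < 0]
# ... all carry a "C"-prefixed invoice number.
print(cancelled["invoice_no"].str.startswith("C").all())  # True
```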
2. Graphical visualization of the numeric data (quantity and unit_price)
Looking at the histogram below, we can see that the majority of the quantity data is centred in the middle, and that the unit price ranges between 0 and 40,000.
Out of the 396,767 rows, we observe that the maximum and minimum quantity values are 80,995 and -80,995 respectively. The interquartile range analysis tells us that most of the values lie between 2 and 12; values greater than 12 introduce us to outliers. To learn more about the outliers, we have to get the relevant percentiles and look at their distribution.
Now that we have found the upper and lower fences, by adding some code we can plot all the data above the upper fence and below the lower fence.
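The fences can be computed with the standard 1.5 × IQR rule (the exact multiplier used in the article is not shown, so this is an assumption):

```python
import pandas as pd

quantity = pd.Series([1, 2, 3, 5, 8, 12, 80995, -80995])

q1, q3 = quantity.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Everything outside the fences is a potential outlier.
outliers = quantity[(quantity < lower_fence) | (quantity > upper_fence)]
print(sorted(outliers))  # [-80995, 80995]
```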
Filtering out the outlier data and printing only the potential outliers, we obtain the following result:
The result shows an extreme outlier in unit_price, corresponding to a transaction with a unit_price of 38,970:
To deal with that, we drop the outlier. After that, we check the distribution of the price again and find more outliers; here is the result:
An outlier is a value or an entire observation that lies well outside the norm. By looking at the histogram and box plot above, we can now see that the minimum and maximum values of the unit_price column are 0.00 and 8,142 respectively.
For this particular example, the unit_price outliers are far above the 75th percentile; those values seem legitimate, but they are well beyond the average price of the products. Nevertheless, we decide to keep these outliers.
After this statistical analysis, in the next section, we will make an exploratory analysis and answer a couple of business-focused questions.
EXPLORATORY DATA ANALYSIS
- Categorical variables
Within pandas, to summarize the non-numerical features, we apply the `describe()` function with `include='object'` specified, which returns the number of values (`count`), the number of unique values (`unique`), the most frequent value (`top`) and its frequency (`freq`) for each such column in the dataset.
Note: passing the NumPy object data type, `describe(include=np.object)`, will give us the same result. (In recent NumPy versions, `np.object` has been removed; use `include=object` instead.)
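On a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"country": ["United Kingdom", "United Kingdom", "France"],
                   "description": ["lantern", "lantern", "cakestand"]})

# count / unique / top / freq for each object column.
summary = df.describe(include="object")
print(summary.loc["top", "country"], summary.loc["freq", "country"])  # United Kingdom 2
```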
For a quick interpretation: the product most sold was ‘regency cakestand 3 tier’. We have 37 countries in total. The United Kingdom is the top and most frequent country, with the highest number of orders (20,660), corresponding to approximately 89% of the dataset.
Answering business questions
1. What are the TOP 5 countries with the highest number of orders?
The company receives the highest number of orders from customers in the United Kingdom (since it is a UK-based company). Therefore, the TOP 5 countries that place the highest number of orders are the United Kingdom, Germany, France, EIRE and Spain.
2. What are the TOP 5 countries with the highest number of sales?
Since the company receives the highest number of orders from customers in the United Kingdom (being a UK-based company), customers in the United Kingdom also spend the most on their purchases. The table above shows that the TOP 5 countries by sales are the United Kingdom, Germany, France, EIRE and Switzerland.
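The article doesn't show the code behind this table, but one plausible way to build it (assuming a revenue figure computed as quantity × unit_price per line) is:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "France"],
    "quantity": [6, 4, 2],
    "unit_price": [2.55, 3.39, 1.25],
})

df["sales"] = df["quantity"] * df["unit_price"]  # hypothetical revenue column
top5 = df.groupby("country")["sales"].sum().sort_values(ascending=False).head(5)
print(top5.index[0])  # United Kingdom
```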
3. What was the Total sales per year?
The transactions occurred across two years (2010 and 2011). Globally, the sales totalled 3,432.593 in 2010 and 4,782.446 in 2011.
However, by looking at a specific country, for example, the United Kingdom the result is shown below:
4. How many different orders were made by customers in the United Kingdom?
We can see that the dataset is largely dominated by orders made from the United Kingdom. The DataFrame contains 330,908 such entries; to simplify, we could ask: what is the number of customers and products in these entries?
It can be seen that the data concerns 3,858 different customers, who bought 3,643 different products. In total, more than 18,427 orders were placed; see the table below for each year separately.
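These counts can be obtained with `nunique()`; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "France"],
    "customer_id": ["17850", "17850", "12583"],
    "stock_code": ["85123A", "71053", "85123A"],
    "invoice_no": ["536365", "536365", "536370"],
})

uk = df[df["country"] == "United Kingdom"]
print(uk["customer_id"].nunique(),  # 1 distinct customer
      uk["stock_code"].nunique(),   # 2 distinct products
      uk["invoice_no"].nunique())   # 1 distinct order
```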
5. What was the best month for sales in the United Kingdom? How much was earned that month?
The best month for sales in 2010 was December. The best month for sales in 2011 was November. You can also see the graphical representation below:
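The monthly aggregation behind such a chart could be sketched as follows (the `sales` column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2011-11-04 10:00", "2011-11-20 12:00",
                                    "2011-12-01 09:00"]),
    "sales": [100.0, 250.0, 80.0],  # hypothetical revenue per invoice line
})

# Total sales per calendar month.
monthly = df.groupby(df["invoice_date"].dt.month)["sales"].sum()
print(monthly.idxmax(), monthly.max())  # 11 350.0 -> November wins in this toy data
```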
6. In 2011, how many orders were made per hour? How many orders were made per day?
From the graphical representation below, we can conclude that most orders occur around 12:00, and that customers usually make more purchases on Thursdays.
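Extracting those two views from the timestamp is straightforward; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"invoice_date": pd.to_datetime(
    ["2011-10-06 12:15", "2011-10-06 12:45", "2011-10-07 09:30"])})

per_hour = df["invoice_date"].dt.hour.value_counts()
per_day = df["invoice_date"].dt.day_name().value_counts()
print(per_hour.idxmax())  # 12
print(per_day.idxmax())   # Thursday (2011-10-06 was a Thursday)
```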
7. How many transactions were made, and how many of them correspond to cancelled orders?
This result shows the existence of frequent customers who buy a large number of items in each order. Entries with the prefix C in the invoice_no column indicate transactions that were cancelled; for client number 12924, only one transaction was cancelled.
Summary
This article demonstrated the steps I adopted in the analysis of an e-commerce dataset. First, I performed preparation tasks that included dropping NaN values from the DataFrame, removing rows based on a condition, changing column types (to_datetime, astype), dropping duplicate data and handling outliers for statistical purposes. The article then moved on to exploratory data analysis, where we explored business-level questions related to our data.