
Getting started with Data Science by cleaning up data

Hi everyone! This is my first step into the field of Data Science. In this article, I am going to share my experience with a simple data cleaning exercise in Python.

Kueila Ramos · Published in Analytics Vidhya · Sep 20, 2020

Introduction

Whenever we start an analysis in Python, the first step after importing the necessary packages is usually to load the data into a Pandas DataFrame using, for example, the read_csv, read_excel, read_table or read_sql method, depending on the source of the data. However, when we only need to work with a small amount of data, it is also helpful to know how to build a DataFrame by hand.

For this article, I am using Jupyter Notebook as my Integrated Development Environment (IDE), and I am going to clean a sales dataset. Before that, though, I will quickly show one common way to create a DataFrame manually: from a list of dictionaries.

Let's get started

To create a table manually, Python's Pandas package provides the DataFrame constructor, called as pd.DataFrame() (pd is the conventional alias for the Pandas library).

Let's pass a list of dictionaries to the constructor to have a look at its functionality:
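A minimal sketch of what that can look like; the exact rows are illustrative, but the keys mirror the ones discussed below:

import pandas as pd

# Each key becomes a column name; each dictionary becomes a row.
# The mixed capitalization and the missing value are deliberate, to mimic
# the issues discussed in the rest of the article.
data = [
    {"Order ID": "176558", "product": "USB-C Charging Cable",
     "quantity ordered": "2", "Price Each": "11.95"},
    {"Order ID": "176559", "product": "Bose SoundSport Headphones",
     "quantity ordered": None, "Price Each": "99.99"},
]

sales = pd.DataFrame(data)
print(sales)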

Each key (Order ID, product, quantity ordered, …) corresponds to a column name, and the values supply the data for each row. As you can see above, I defined a list of dictionaries and stored it in a variable named data, then I used the constructor to create the DataFrame, assigned it to sales, and printed it to get a representation of the dataset. This dataset was modified, and it is an extract from a much larger one, to show some of the common issues in Data Science projects that may affect our analysis: inconsistent column names, capitalization, missing data, incorrect data types, and so on.

Now, taking the original data from Keith Galli's GitHub repository (it contains separate files from January to December 2019), I want to do the following:

  • Rename the columns;
  • Handle missing values;
  • Handle the data types.

I imported the full table into a DataFrame with one line of code, using the read_csv function. To perform an analysis similar to the one Keith did in his repository, but with some adaptations, I started with sales_April_2019. The first five rows of the table, before any operations, are shown below:
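In code, that one line might look like the following; the file name matches Keith Galli's repository, but adjust the path to wherever the CSV lives:

# Load the April 2019 sales file into a DataFrame.
sales = pd.read_csv("Sales_April_2019.csv")

# First five rows, before any cleaning.
sales.head()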

At first glance, we can see one problem: missing values (the DataFrame shows the second entry as NaN, which means the value is not available). To make the analysis easier, I first rename the columns:
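One way to do the renaming with rename(); quantity_ordered, price_each and order_date are the names used later in this article, while the remaining snake_case names are my own choice:

# Map the original headers to consistent snake_case names.
sales = sales.rename(columns={
    "Order ID": "order_id",
    "Product": "product",
    "Quantity Ordered": "quantity_ordered",
    "Price Each": "price_each",
    "Order Date": "order_date",
    "Purchase Address": "purchase_address",
})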

To inspect the data, we can use the following code to get a sense of what the dataset looks like:
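For example (describe() or head() would also be reasonable choices here):

# Dimensions of the DataFrame: (rows, columns).
print(sales.shape)

# Column names, non-null counts and data types.
sales.info()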

The data consist of 18,383 rows and 6 columns. Incorrect column types are also a problem: every column is of type object, which is wrong, because we have both numerical and categorical variables. So before moving forward, we have to solve those problems.

From here, we first deal with the missing values and then fix the data types that are not set correctly. Each feature has 59 missing values, so in total we have 354 missing values, as shown below:
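Counting them takes one line per view:

# Missing values per column (59 in each of the 6 columns).
print(sales.isnull().sum())

# Total missing values in the whole DataFrame (354).
print(sales.isnull().sum().sum())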

Now that we've summarized the number of missing values, we have three options:

  1. We can drop the rows or columns if they are not relevant to the analysis;
  2. We can replace the missing values with a more convenient value;
  3. We can ignore them.

Ignoring the missing values is not always a good idea because at some point you will have to deal with them.

For this particular case, each column has the same number of missing values, and it is not a very high number, so we can drop those rows. The DataFrame now has 18,324 rows:
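Since 18,383 minus 59 is 18,324, the 59 missing values in each column evidently sit in the same 59 rows, so a plain dropna() is enough:

# Drop every row that contains at least one missing value.
sales = sales.dropna()

print(sales.shape)  # (18324, 6)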

After that, we proceed to handle the incorrect column types. Before moving on, we make a copy of the data.
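Something like the following; the variable name sales_clean is my own:

# Work on a copy so the previous state of the data stays available.
sales_clean = sales.copy()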

Pandas recognizes NaN (np.nan) as a missing value, but what if a feature is expected to be numeric and instead contains a categorical (string) value? Technically, that is also a missing value.

‘quantity_ordered’ and ‘price_each’ were expected to be numeric values, and ‘order_date’ a date type. When we tried to modify the type of the columns using the apply() function, a ValueError was displayed:
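The failing attempt might have looked like this; the exact error text depends on the offending value:

# Element-wise conversion fails because some rows still contain text.
sales_clean['quantity_ordered'].apply(pd.to_numeric)
# Raises: ValueError: Unable to parse string "..."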

Using the code below, we can see that a number of rows hold categorical (text) values, and these are what prevented us from modifying the data type:
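One way to surface those rows; this check assumes the stray values are strings that are not plain numbers (in Keith Galli's files they turn out to be repeated header text such as 'Quantity Ordered'):

# Keep only rows whose quantity_ordered is not a digit-only string.
bad_rows = sales_clean[~sales_clean['quantity_ordered'].str.isdigit()]

print(bad_rows.head())
print(len(bad_rows))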

So, to solve this problem, we can apply the np.where() function to replace the categorical values with np.nan (i.e., missing values). Doing that, we identify 35 more missing values in each column, and after that, we drop all of those rows:
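A sketch of that replacement, under the same assumption that the stray values repeat the column headers:

import numpy as np

# Turn the stray header text into a proper missing value.
sales_clean['quantity_ordered'] = np.where(
    sales_clean['quantity_ordered'] == 'Quantity Ordered',
    np.nan,
    sales_clean['quantity_ordered'],
)

# Doing the same for the other columns reveals 35 NaNs in each;
# dropping them removes the offending rows for good.
sales_clean = sales_clean.dropna()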

With the missing values dropped, we can now modify the data types:
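The conversions themselves use the two functions named below:

# Parse the numeric columns and the dates.
sales_clean['quantity_ordered'] = pd.to_numeric(sales_clean['quantity_ordered'])
sales_clean['price_each'] = pd.to_numeric(sales_clean['price_each'])
sales_clean['order_date'] = pd.to_datetime(sales_clean['order_date'])

print(sales_clean.dtypes)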

The pd.to_numeric() and pd.to_datetime() functions process the data and, in this case, produce a float64 quantity_ordered column, as we can see in the output below:

Finally, we transform the float quantity_ordered column to an integer with this line of code:
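Presumably with astype(), for example:

# Cast quantity_ordered from float64 down to a plain integer.
sales_clean['quantity_ordered'] = sales_clean['quantity_ordered'].astype(int)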

Conclusion

This exercise showed me that there are many different approaches to solving a single problem. I am a beginner in the field of data science, trying to find my way from simple analyses to more complex ones. This particular exercise makes me feel that I have a lot to learn, and I am really enjoying it.

You can find the Jupyter Notebook for this article in my Git repository here. Thank you for reading this blog and for your support. I have more on the way, so stay tuned.
