Web Traffic Analysis with SQL

Analysing web traffic from e-commerce

Photo by rupixen.com on Unsplash

Conversion rate optimization is a process to understand what drives, stops, and persuades users to a website. The objective is to give users the best user experience possible and to increase website sales or minimize acquisition costs.

The main focus of this blog is to analyse traffic sources and marketing channels that lead to conversion rates. To do that we’ll use MySQL Workbench.

Let’s see a list of common digital marketing channels:

  • Organic traffic includes traffic that comes from search engines. In this field, examples are Google, Bing, Yahoo etc. It occurs when visitors or potential customers discover our website by searching for keyword, phrase or product.
  • Paid search This channel includes all traffic that comes from paid search ads on search engines. Such campaigns can be launched on Google AdWords and Bing Ads.
  • Email Includes all traffic that comes from emails (email marketing campaigns, order confirmation emails, newsletters, etc.).
  • Social media includes all traffic that comes from social media channels, such as Facebook or Twitter. It can be free organic traffic got from regular social media posts, or paid traffic from social media ads.
  • Direct Includes all traffic from users that are brought directly to your website when typing in your web address URL in the browser or clicking on the URL in their bookmarks.
  • Referrals usually allude to all traffic that doesn’t come from search engine, i.e traffic that comes from another website by clicking on a http link.

Consider the e-commerce diagram in the picture below:

Our database includes 6 related tables, containing information about website activity, products, and orders. We can clearly see the relationship between those tables, identified by the primary key and foreign keys.

The database has information about transactions that occur between 2012 and 2015. In total, the sessions and orders totalized 472,871 and 32,313 respectively. For each year we also count the number of sessions and orders per year as shown below:

Total sessions and orders per year

Our main goal is to use SQL query to understand which channels are driving traffic to the e-commerce website and what channel is being used and driving the highest conversion rate.

To answer these questions our focus will be on website_sessions, website_pageviews and orders tables.

First, take a look at the website_session table. By using this query SELECT * FROM website_sessions WHERE website_session_id = 1020 the result displays 9 columns as follow:

Website_session table
website_session_id is the primary key of the website sessions table.created_at is a timestamp indicating when that session takes place.user_id is linked to the cookie in a user’s browser used to track users across multiple sessions.is_repeat_session is a binary number. represented by 0 or 1 depending on whether or not this user has been to the website before.utm_source, utm_campaign and utm_content are tracking parameters that used to measure our paid marketing activity.they are appended link tags to our website URLs. Device_type indicates whether the user is on their computer or if they’re on a mobile device.http_referer indicates where traffic is coming from.

We can say that the user_id 1018, access our website page on 2012–03–26 from a paid search engine and from a desktop.

From the website_pageviews table, by using the following query SELECT * FROM website_pageviews WHERE website_session_id = 1020 it includes 4 columns:

website_pageviews
website_pageviews - the Website_pageview table is basically the page views that a user saw when they were on the e-commerce website.created_at - is a timestamp indicating when that pageview happened.website_session_id - note that this column is a primary key on website_session table and in the website_pageviews it corresponds to the foreign key, so when we want to use those two table together we will use a join clause.pageview_url - means that an user first landed on the home page in a specific time and then later he/she went to the products page then clicked through to the original Mr. Fuzzy page then clicked through to the cart then to the shipping page, the billing page and finally they landed on thank you for your order. This is important for conversion funnel analysis.

From the output above, we can say that the user_id 1020 landed on the /home page and 1 minute later she/he went to the /productspage, then 4 minutes later he clicked on /originl-mr-fuzzy, and then after 3 minutes, he/she add the product to the /cart and when she clicked on /shipping, he/she abandon the purchase. So why did this happened? it was because the user was surprised by additional costs? (see this article for more insights). This is a type of question important for conversion funnel analysis.

A common e-commerce flow shop. Source: google.com

In the case of the orders table, where the sales are tracked, by using the following query SELECT * FROM orders it displays 8 columns:

order_id - is the number of the transactioncreated_at - is a timestamp indicating when that order happened.website_session_id - this column is a primary key on website_session table and in the order_id it corresponds to the foreign key, so when we want to use those two table together we will use a join clause.user_id is linked to the cookie in a user’s browser used to track users across multiple sessions.primary_procut_id - indicats the product that a customer puts in their cart first. It has 4 different numbers rnge between 1 to 4.item_purchased - 
price_usd
is the price of the products in dollars
cogs_usd
is the cost of goods sold in dollars

From our example, by using the query SELECT * FROM orders WHERE website_session_id = 1020 we get this result:

orders

The table didn't return any information because the user_id 1020 did not complete the order as we deduce previously.

Channels that drive traffic to the website

Using the query, we obtain the following output:

The result shows that we have different categories in each utm parameters. For example, utm_source has 3 different categories: bsearch, gsearch and socialbook. The same interpretation can be made for the other two utm. All these utm parameters are paid campaign.

An important thing to note is the null value that appears in the first row. Those could be traffic that is not driven by a paid campaign or possibly they forgot to put tracking parameters on those campaigns.

So which channels drive traffic to the website? we can answer the question by using this query:

The output shows us clearly that gsearch source, from nonbrand campaign category, contribute to the highest sessions followed by the null value which corresponds to traffic not paid.

Now that we know that gsearch nonbrand drives the most volume in session, we can ask how many of this session is converted into an order. Let’s do this by using the following query:

So for our example, our website received 282,706 sessions from gsearch nonbrand and was made 18,822 orders.

But what exactly is branded and nonbranded keywords?

A ‘branded’ keyword usually includes the brand name of the company or some variation of it. These are different from nonbrand keywords which are keywords that do not relate to the brand name.

A comparison analysis of nonbranded/brand keywords and traffic is important for two reasons:

  • Because particularly for large brands, branded keywords search in the engines indicates that the user knows the company.
  • Because it often represents the majority of our website’s new visitors and unique visitors. Most importantly, these visitors may not already know about our brand.

But before we move to the analyse of the brand/nonbranded keywords, let’s get the conversion rate. For a better understanding, it would be more useful if we calculate the conversion rate over time from 2012 until 2015 particularly for gsearch source which has the highest number of sales and orders. Consider the following output:

Conversion rate

From the result above we can see that over time both sessions and orders are increasing. But from these values, we can also see that a lot of users that visit the website, few are actually making a purchase (order). In the perspective of conversion rate, we see an increase from 3% in 2012 to 7% in 2014, this doesn't mean that the sessions are turning into sales.

This means that traffic is not resulting in more sales. Instead, it results in ad spend derived from the paid campaign.

Our next step should be to break down the gsearch source into campaign which is categorized into two keywords: brand and nonbrand. Using the following query, we get this result:

Notice that, as we write at the beginning of this article that paid traffic indicates that a visitor/customer came to our website from a paid search ads on search engine, what we can deduce from the table is that:

  • Besides our visitors are searching from our paid gsearch engine, these visitors may not know about our brand name and accidentally they find our website by searching a product in those engines.
  • That’s why nonbrand traffic has a higher number of sessions than brand traffic. We can think of them as the website’s new visitors for example.
  • For those who know the brand of our e-commerce company, besides the increased session over time, it still very low compared to nonbrand session.

We are getting closer to the final step. Now we are going to analyze the type of device our users are using to access our website through the gsearch engine. Using the query below we get the output:

What we can interpret from the result table, is that desktop is being by far the channel that users majority use to access the website. However, this doesn't turn the session into sales. We have users that access our website by using desktops however the orders are extremely low compared to the session. An analysis that is important at this point would be to look at funnel sales. For that, we would need the website_pageviews table, but this could be referenced in another article to not make the blog too long.

Summary

From our web traffic analysis, we saw that the top sources that drive most sessions are: gsearch nonbrand, none (which indicates traffic unpaid) and bsearch nonbrand.

Based on the results we obtain so far, we cannot give a precise conclusion but we can say that our brand is not being noted and also, the company should cut ad spend and look to our unpaid ads that are increasing over time. Besides, we will need more metrics to add to this information for example how sales are growing over time, what’s the cost of sales, look at abandoned cart data, purchase funnel and most popular products.

Thanks for reading, my name is Kueila Ramos and I have been self-learning Data Science for about 2 years. I hope that you find this blog useful. If you have any suggestions, please let me know.

Reference

Keyword Basics: Branded vs. Non-branded — https://www.risdall.com/our-thoughts/keyword-basics-branded-vs-non-branded-provided/

How to Calculate Sales Conversion Rate & Lead Valuehttps://venturi-web-design.com/calculate-sales-conversion-rate/

Designing An eCommerce Checkout Flow: A Must For Your Online Store — https://makewebbetter.com/blog/ecommerce-checkout-flow/

Tracing my own path to become a Data Scientist in the business field