Data Analysis using Python - Sales Analysis


 

Hey everyone!


I know you have been enjoying my older blogs and I first of all, I want to apologize for not being able to put some content for a while.


But that ends now!



gif


So today we will be going through a few things that I found helpful in my day to day analysis after I started working on python. I have never been a fan of coding but I feel that I am kind of getting the hang of it.


We are will be going through some of the basic techniques of data analysis using python. Now this post I have kept very basic so that those who are unfamiliar with python will also be able to do it.


We will be using Superstore data for the analysis. All the files required have been uploaded in my github. Link for that is given below:


https://github.com/anmolp1/data_analysis_using_python


On the basis of the Superstore data, we are going to find the answers to the following questions:


Q1. What is the overall sales trend?

Q2. What is the overall MoM, QoQ and YoY growth?

Q3. Which are the Top 10 products by sales?

Q4. Which are the most selling products?

Q5. Which are the most profitable products?


Now, I know I know - we can do it in any BI tool (psst, Tableau). But I am assuming that not every analyst uses it, or has it readily available.


Now the first thing to do is:


Get coffee, Yes!


Because this is going to take a while, especially for those who have no idea about python. But don't worry, you will hopefully have a good amount of usable knowledge by the end of this.


So, we will start by importing the libraries first.


Pandas : Pandas is probably the most used library for data manipulation.

Numpy : It is a fundamental package for scientific computing in python.

matplotlib.pyplot : Matplotlib is widely used to create non-interactive visualizations in python

datetime : The datetime module supplies classes for manipulating dates and times.


Now I need the data to draw the analysis upon (obviously!).

If you need assistance in how to import data, refer to the notebook in GitHub using the link above. I'd advise you to keep it open as a reference while going through this post.


After importing the data (df), we can proceed to finding the solution of the first question:


Q1. What is the overall sales trend?


Now to solve this, I used the below code:



To calculate the sales by month, we need to have the Month-Year readily available so that we can group the sales later by them.

In order to do that, I am using the function strftime to convert the date 'Order Date' into desired format, i.e. YYYY-MM. By default, it outputs as an object, instead of date.

Next, we will group the data by this field 'Month-Year'. Now, what groupby() does is, it takes for input, an object to group all the numerical values and aggregate it by the defined function - eg, sum(), mean() etc.


After that, I have simply written a code to plot the values - Sum of Sales by Months as a line. (using matplotlib.pyplot, which I defined as plt)

The output for which is as follows:


Easy, right?


Now moving over to the next question:


Q2. What is the overall MoM, QoQ and YoY growth?


Now that we know how the groupby() works, we are read to calculate growths.

To do that, we will be harnessing the power of numpy (oooohh!)

numpy has a function called roll which simply rolls or shifts the value in a particular cell in a df by 'n' number of cells. eg:


MoM_Data['Last_Month'] = np.roll(MoM_Data['Sales'],1)


will shift the Sales values down by 1 cell. Now we have last month sales vales corresponding the sales value in the current month in adjacent cells. Now with the help of simple division(or not), we can calulate growth as follows:


MoM_Data['Growth'] = (MoM_Data['Sales']/MoM_Data['Last_Month'])-1


And now, we have growth(in decimal form) of every month, corresponding to its previous month. All we have to do now is plot them. and the output will be as follows:

Now in order to calculate the QoQ growths, we need to calculate Quarter/Year first, eg. 2019-Q1 so that we can have sales values by each quarter.

In order to do that, we will use the following code:


df['Qtr'] = df['Order Date'].apply(lambda x: x.strftime('%m'))

df['Qtr'] = pd.to_numeric(df['Qtr'])//4+1

df['Year'] = df['Order Date'].apply(lambda x: x.strftime('%Y'))

df['Qtr_Yr'] = df['Year'].astype(str) + '-Q' + df['Qtr'].astype(str)

df.drop('Qtr', axis=1)


In the above demonstrated code, we are first converting months into quarters (line 1-2) and then further we are concatenating/joining/merging it with the corresponding year.

Additionally, we dropped the column Qtr because we don't need that anymore.


Now that we have it, we will repeat the process that we did for MoM growth by first calculating sales for each Qtr_Yr by aggregating using groupby() , followed by calculating the sales value for the previous quarter using np.roll and then finally calculating the growth.


Further, visualizing it will look something like this:

seems easy, doesn't it?


Notice that the growth comes down in every Q4


Moving on to YoY Growth by month. Now that we have seen the MoM growth corresponding to the previous month, we might be interested in calculating the monthly growth by previous year.i.e, current year vs last year same month.


Now that is going to be fairly easy since we already have calculated the sales grouped by Month and year while calculating MoM growth. The code looks something like this:


YoY_Data = pd.DataFrame(df.groupby('Month_Year').sum()['Sales'])

YoY_Data['Last_Year'] = np.roll(YoY_Data['Sales'],12)

YoY_Data = YoY_Data.drop(YoY_Data.index[0:12])


The first two lines of code will simply help us calculate the sales grouped by the Month-Year, like we did previously. Then we will roll the data by 12 so that we have sales for the current month corresponding to the sales of the last year same month. Then we simply have to calculate the growth. Now make sure to drop the first 12 months since python has by default put the values of the latest year in front of the values in the first year, like a cycle. Since this is wrong, we should ignore these values.

After visualizing it, we have somthing like this:

Cool! Now that we have calculated the growths, it's time to move on to the next part.


Q3. Which are the Top 10 products by sales?


Now that we already know how to use groupby() to find the sales values corresponding to any object, we can easily calculate the sales by every product, sort it and get the top10 values. Here's how to do that:


prod_sales = pd.DataFrame(df.groupby('Product Name').sum()['Sales'])

prod_sales.sort_values(by=['Sales'], inplace=True, ascending=False)

top_prods = prod_sales.head(10)

top_prods


As you already might have known by now, in the first line I have calculated the sales by products and then further converted it as a dataframe. Later, I sorted the products by the descending value of sales in line 2. After that it was easy to get the top 10 values using the head() function.

Now the next part is:


Q4. Which are the most selling products?


Previously, we calculated the top 10 products by sales. All we have to do now is find the top 10 in those products which have the most quantity sold. For the sake of simplicity, I am going to find only the top 5. We can do that as follows:


best_selling_prods = pd.DataFrame(df.groupby('Product Name').sum()['Quantity'])

best_selling_prods.sort_values(by=['Quantity'], inplace=True, ascending=False)

best_selling_prods = best_selling_prods.head()

best_selling_prods


Same and easy, right ? Similarly, we will also calculate the Most Profitable Products as follows:


profitable_prods = pd.DataFrame(df.groupby('Product Name').sum()['Profit'])

profitable_prods.sort_values(by=['Profit'], inplace=True, ascending=False)

profitable_prods = profitable_prods.head(5)

profitable_prods


Et, viola!


I think that's enough for this time.

Do let me know your thoughts on this. Maybe you know better ways of doing this which I would love to know about.

Thanks all for your kind support. Will be back with some more

 

fin.

8,762 views0 comments

Recent Posts

See All