Python Data Analysis: Using Pandas & Seaborn to Analyse Manufacturing Data

Overview

In this project I will be analysing data from a mining company to find insights about its manufacturing processes in relation to metal refinement. The data comes from a flotation plant, a process that removes impurities from metal ore by separating hydrophobic (water-repellent) materials from hydrophilic (water-absorbing) materials. More information about this process can be found here: https://en.wikipedia.org/wiki/Froth_flotation

The original dataset can be found here: https://www.kaggle.com/datasets/edumagalhaes/quality-prediction-in-a-mining-process

Exploring the Data with Pandas

In this project I'll be using the VS Code IDE and putting my code blocks into this article using carbon.now.sh. The necessary packages for this project are Pandas for data manipulation and Seaborn for data visualisation. NumPy is installed automatically as a dependency of Pandas, and we'll also need matplotlib, whose pyplot module (conventionally imported as plt) is used to display our Seaborn graphs.

As with every project, I prefer to work in a virtual environment to keep track of my installed packages. I like pipenv, so I create my virtual environment with pipenv shell and then run pipenv install pandas numpy matplotlib seaborn.

The first thing I need to do is import all my necessary packages and then read the data using the pandas read_csv function, which I assign to the variable dfRaw (read_csv automatically returns a dataframe, but I like to call this data 'raw' to show that it hasn't been cleaned in any way). I have stored the raw csv in the same folder as the Python script, so I don't need to specify its full file path. This dataset uses commas as decimal separators, so I pass the value "," to the decimal argument and pandas parses the commas as decimal points.

[Image: Importing our Python libraries and the csv dataset, then viewing the data with .head() and .shape]
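The original code was shared as a screenshot; a minimal sketch of what it likely contained is below. The csv filename is an assumption based on the Kaggle download, so adjust it to match your local copy.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# decimal="," tells pandas to parse commas as decimal points
dfRaw = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv", decimal=",")

print(dfRaw.head())   # first five rows
print(dfRaw.shape)    # (rows, columns)
```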

As this is the initial exploration of the dataset, I print the head of the dataframe, which is simply the first five rows (alternatively I could have used the tail() function to print the last five rows, or any number of rows depending on what I pass into the function as an argument). I also want to see the shape of the data, so I print the dataframe's shape attribute (the number of rows and columns). The output is as follows:

[Image: The output of .head() and the .shape attribute]

A note about the manufacturing process

We've now had a first look at our data: we have 24 columns and 737,453 rows. Pandas truncates the printout to fit the width of the terminal, hence we can only see six columns in the print (separated by "..."). This can be adjusted with the following line of code, which I would normally place at the top of my script right after the imports.

[Image: An option to change how pandas prints dataframes]
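A sketch of that line, assuming the standard display.max_columns option is the one being set:

```python
# Passing None removes the limit so prints show every column
pd.set_option("display.max_columns", None)
```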

This gives the following print output which shows all the columns but is still a little messy.

[Image: The output of the pd.set_option command]

What I really want to do is find out what each of the column headings is and then focus only on the important columns. I can do this with the pandas dtypes attribute, which returns the header for each column in the dataframe along with its data type, e.g. object (string) or float64. It looks like this:

[Image: The output of the pandas dtypes attribute, giving the column names and data types]
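For reference, reproducing that check is a one-liner (note that dtypes is an attribute, so there are no parentheses):

```python
# Column names alongside their data types
print(dfRaw.dtypes)
```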

We can now see the heading for each of our 24 columns along with the data type, most of which are float64 (a 64-bit floating-point number). However, we notice that the date column is returned as an object, not a datetime; we need to fix this. But first, it's important to get a little background on what each of these columns represents.

[Image: An overview of each of the column names and their meaning in the flotation process]

From this we can see that for our iron mining operation the '% Iron Concentrate' is one of the most important columns to focus on. This is essentially our response variable in the process, and process engineers optimise the process to increase the iron's purity. The purer the iron, the better the quality and the higher the price, so this is one thing we will focus on in our analysis. The '% Iron Feed' is the purity of the raw ore entering the process. For example, if the ore is 80% pure pre-flotation and 99% pure after flotation, then the process has improved the iron purity by 19 percentage points.

Before we start making plots we need to make one adjustment to the date column so it can be parsed correctly and we can filter our results over a specific date range.

[Image: Converting the object date to datetime using pd.to_datetime]
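A sketch of that conversion:

```python
# Overwrite the 'date' column with a parsed datetime version
dfRaw["date"] = pd.to_datetime(dfRaw["date"])
```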

Here we are using square bracket notation to select a single column in our pandas dataframe, and we pass that same column to the pandas to_datetime function. The object type will now be converted to datetime, which we can see if we call dtypes again.

[Image: The converted datetime now shows correctly]

Making Visualisations with Seaborn

A good place to start in any engineering process is a control chart. A control chart is simply a time series line plot which shows how a single variable changes over time. For this we'll need only two variables: the % Iron Concentrate (our response variable) and the date (for our time series). We can make a new dataframe from our raw dataframe with the following lines of code.

[Image: Creating a new dataframe and plotting a lineplot using Seaborn]
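A sketch of those lines, following the variable names described below:

```python
# List the column names we want, then select them with square brackets
ControlChart = ["date", "% Iron Concentrate"]
dfControlChart = dfRaw[ControlChart]

# Line plot of the response variable over time
graph = sns.lineplot(x="date", y="% Iron Concentrate", data=dfControlChart)
plt.show()
```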

First we're creating a new variable called ControlChart, which is just a list of two values with the same names as the columns in our original dataframe. Then we create another variable (dfControlChart), which is our original dataframe with square bracket notation used to select only the columns from the list we just created.

To see what this looks like I'll use a seaborn line plot and specify the x and y variables from our dataframe and where to source the datapoints from. The following graph is returned.

[Image: A lineplot for all datapoints]

That's quite a lot of data points, and it's difficult to gain any real insight at this stage. Let's break it down by month and only analyse data for, say, June (all data is from the year 2017). To do this we'll create a new dataframe (dfJune) in a similar way to before, except we're now filtering the rows by dates; this returns only the data points in June 2017.

[Image: Creating a dataframe to only analyse the data for June]
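A sketch of the filter; the exact boundary dates are my assumption, and the original may have used strict > and < comparisons rather than a half-open range:

```python
# Keep only the rows whose date falls within June 2017
dfJune = dfControlChart[(dfControlChart["date"] >= "2017-06-01")
                        & (dfControlChart["date"] < "2017-07-01")]
```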

Before we plot this, however, let's add more insight to the graph. A control chart should have a mean line to show what the average datapoint is relative to all other data points, and we should also add 3-sigma reference lines. Without going into too much detail, datapoints outside these 3-sigma reference lines indicate when a process has become unstable. There are more complex forms of analysis that can be done with control charts, such as moving averages or tests for sporadic variation, but those are out of the scope of this project.

[Image: Creating the mean and sigma level variables and using them as reference lines on our plot]
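A sketch of that code; the names of the two limit variables are my guesses:

```python
# Mean and standard deviation of June's response variable
mean = dfJune["% Iron Concentrate"].mean()
std = dfJune["% Iron Concentrate"].std()

# Control limits: the mean plus/minus three standard deviations
upper3Sigma = mean + 3 * std
lower3Sigma = mean - 3 * std

graph = sns.lineplot(x="date", y="% Iron Concentrate", data=dfJune)
graph.axhline(mean, linestyle="--")       # dashed mean line
graph.axhline(upper3Sigma, color="red")   # upper 3-sigma limit
graph.axhline(lower3Sigma, color="red")   # lower 3-sigma limit
plt.show()
```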

Here we create a new variable (mean) by accessing dfJune's % Iron Concentrate column and returning the mean with pandas' .mean() method. The same process is used to return the standard deviation with .std(). Calculating the 3-sigma limits is a little more involved: each limit is the value of the mean plus or minus 3 times the standard deviation. We can create separate variables for these, which are then used to add reference lines to the plot.

Since we initialised the line plot under the variable name graph, we can use axhline from the matplotlib library to add reference lines for the mean and each sigma level. I change the linestyle of the mean to dashed to separate it from the data points and colour the sigma lines red. We now have a nice-looking control chart for our June data.

[Image: The completed control chart for June]

This looks like quite a variable process. It's hard to determine whether the majority of data points fall evenly on either side of the mean line, and there are at least five occasions that I can see where the process falls below the lower 3-sigma limit.

Let's see what the control chart looks like for other months, but rather than repeating everything we've just done multiple times, we'll make a function to make this process easier and then loop over that function with a for loop.

[Image: The complete function to generate charts for each month using a for loop]
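Pulling the pieces above together, a sketch of the function and loop might look like this. The empty-month guard and the date-string formatting are my additions; the original screenshot may have built its start and end strings differently.

```python
def generateMonthlyPlots(start, end):
    # Filter the control chart data to the given date range
    dfMonth = dfControlChart[(dfControlChart["date"] >= start)
                             & (dfControlChart["date"] < end)]
    if dfMonth.empty:
        return  # skip months with no recorded data

    # Month name taken from the first row of the filtered data
    monthName = dfMonth["date"].iloc[0].month_name()

    mean = dfMonth["% Iron Concentrate"].mean()
    std = dfMonth["% Iron Concentrate"].std()

    graph = sns.lineplot(x="date", y="% Iron Concentrate", data=dfMonth)
    graph.axhline(mean, linestyle="--")
    graph.axhline(mean + 3 * std, color="red")
    graph.axhline(mean - 3 * std, color="red")
    plt.title(f"% Iron Concentrate for {monthName}")
    plt.xticks(rotation=45)
    plt.show()

# Loop from March (3) to September (9); range(3, 10) excludes 10
for month in range(3, 10):
    generateMonthlyPlots(f"2017-{month:02d}-01", f"2017-{month + 1:02d}-01")
```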

I'll call the function generateMonthlyPlots (though it could be adapted to weekly or daily plots with some adjustments) and initialise it to take two arguments, start and end. These will be the start and end dates used to filter our dataframe.

Firstly, I create a new variable called dfMonth, which is the same as the dfJune variable except I've renamed June to Month, as the function will repeat for each month within the dataframe. Instead of hard-coding the string values for the date limits, we pass in the start and end parameters.

Next we create a variable called monthName to help us determine which month's data we are looking at. This uses pandas' iloc to select rows by position (here we pass 0 to return the first row) from the 'date' column, and then the .month_name() method to return the name of the month (.month would return the integer for that month, e.g. June = 6).

Next we reuse the previous variables for determining the mean, standard deviation and 3-sigma levels, then the Seaborn calls to create the plots. The main differences from before are that the data now comes from dfMonth and we've added a title with plt.title(). I've used an f-string so the title reads '% Iron Concentrate for ...' with the monthName variable filled in, giving each graph the correct name so we know what data we're looking at. Lastly, I've updated the xticks to make them more readable.

Now we initialise a for loop to run our function, passing it the values 3 and 10, which means it will loop from March (month 3) to September (month 9); the value 10 is not included. Python now generates seven plots for us automatically.

[Image: % Iron Concentrate for March]
[Image: % Iron Concentrate for April]
[Image: % Iron Concentrate for May]
[Image: % Iron Concentrate for June]
[Image: % Iron Concentrate for July]
[Image: % Iron Concentrate for August]
[Image: % Iron Concentrate for September]

It's interesting to note that the variability is more extreme in June, July and August, with the first two of those months having quality levels below the lower 3-sigma limit. It's hard to say exactly why this is, though I would guess that in the summer months the humidity increases, which affects the flotation process given its dependence on the interaction between hydrophilic and hydrophobic materials. It's also interesting to note that there are periods with fewer data points, such as March and September; this could be due to machines being offline for maintenance or to reduced customer demand.

Analysing Correlation

We've now explored how stable the overall flotation process is with respect to time, and we can estimate that environmental conditions play a role in the quality of the refined ore. However, there are usually other variables that influence process quality. We had 24 columns in our initial dataframe; we know that the % Iron Concentrate and % Silica Concentrate are process outputs and the date is a special case, but the other factors are all inputs. Pandas has another useful method, .corr(), which returns a matrix with the correlation coefficient between each pair of variables. This, however, is a little tricky to interpret.

[Image: The printed output of our .corr() function]
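A sketch of that call; the numeric_only flag is needed on recent pandas versions to exclude the datetime column (older versions dropped non-numeric columns automatically):

```python
# Pairwise correlation coefficients between the numeric columns
corr = dfRaw.corr(numeric_only=True)
print(corr)
```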

The correlation coefficient is a value ranging from -1 to 1 which describes the strength of the correlation between two continuous variables: the closer the value is to 1 (or -1), the stronger the correlation, while a value around 0 indicates no correlation. The printout from the .corr() function tells us something, but we can create a visualisation to make this more readable (note that the NumPy library has been imported as np).

[Image: Creating a correlation matrix with Seaborn]
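A sketch of that visualisation code, following the variables described below; the palette hues, figure size and styling keywords are my assumptions (they mirror the seaborn documentation's correlation-heatmap example):

```python
# Mask the upper triangle so each pairing appears only once
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(11, 9))

# Diverging palette: opposite hues for negative and positive coefficients
cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True,
            linewidths=0.5, cbar_kws={"shrink": 0.5}, ax=ax)
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.title("Correlation of Flotation Process Variables")
plt.show()
```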

The first variable (corr) is the output from the earlier print, and we can use this data to create a correlation matrix visualisation. We first need to create a mask variable, which blanks out the upper triangle of the heat map with null values (each pairing would otherwise appear twice). The f and ax variables are initialised as matplotlib subplots. Lastly, cmap is created to allow us to use a diverging colour scale to represent the changing correlation coefficient. Finally, we create the heatmap with sns.heatmap and pass in our newly created variables. Adjusting the xticks and yticks and adding a title gives us the following visualisation.

[Image: The completed correlation matrix]

This correlation matrix shows much more clearly than our print statement which variables interact with each other. The scale runs from 0.2 to -0.8, so the majority of the correlations we see are negative; in other words, the more one variable increases, the more the other decreases. The most obvious datapoints in this matrix are shown in dark blue, with a correlation coefficient of -0.8 or stronger. When we look at these we see that the biggest influence on the % Iron Concentrate is the % Silica Concentrate. This makes sense, as one is a byproduct of the other: in order to get higher purity iron ore, the amount of silica must be reduced. Additionally, we see at the top left that the biggest influence on the % Silica Feed is the % Iron Feed; this shows the same relationship, but this time at the beginning of the process.

If we now focus on what affects the % Iron Concentrate the most (the second from last row), there doesn't appear to be much that stands out. The colours range from around 0.2 to -0.2, so at first glance there is only a very weak correlation with any of the other variables. This is likely because process quality is rarely influenced by a single factor (we already saw that environmental temperature and humidity could influence the process, and those are not recorded parameters). To really try and improve this process, a full factorial design of experiments would be required, and we don't have the available data to do that. However, we can see from our analysis that (as with a lot of manufacturing processes) the quality of your incoming material heavily influences the quality of your outgoing material.

Conclusion

In this project I have analysed data from an iron ore plant, specifically looking at the flotation process for ore refinement. Using Python and the Pandas and Seaborn libraries, we have:

  • Performed exploratory data analysis using .head(), .shape and .dtypes
  • Done simple data munging, converting object types to datetime and parsing commas as decimal points
  • Created control charts using Seaborn
  • Created Python functions for generating multiple plots
  • Created a correlation matrix to analyse the relationship between process variables

From the dataset we have learnt:

  • The most important variable is the response variable % Iron Concentrate
  • The most significant factor affecting % Iron Concentrate is % Silica Concentrate
  • There is no obvious correlation between any single input variable and the % Iron Concentrate
  • The process stability appears to be worse during the summer months (June to August)

Thanks for taking the time to read through this project. I always enjoy working with Python and, having spent a lot of my early career doing similar types of analysis, I felt quite at home in this project.

If you'd like to see more of my work you can check out my portfolio in the link below.