Portfolio

Overview

In this project I'm going to take a dataset from Kaggle that looks at healthcare data to try and help us understand what predicts hospital readmissions, what the most costly healthcare procedures are, how long patients tend to stay in hospital and any discrepancies between demographics. I'm using MySQL with DBeaver as my database tool. We have one database (patients), which two tables (health & demographic).

The original dataset can be found here: https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission/data?select=diabetic_data.csv

You can learn more about the original dataset here: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

HISTOGRAMS

Hospitals have a fixed capacity so efficiency is a key factor when it comes to quality healthcare treatment. If a hospital needed to take in more admissions then it could either free up space in existing beds or build more rooms with more space. Whilst the second option sounds less achievable it is sometimes necessary, it depends on the number of patients being admitted and that future trend. Each new patient admitted has a marginal profit from the hospitals perspective and after reaching capacity this marginal profit will drop if new investment is required. The investment would only be a financially good decision if the hospital could accurately predict a long term increase in hospital admissions over time. However, without this prediction we are left with needing to free up bed space to allow for new admissions.

We can use SQL to create a histogram visualisation, though SQL is not a data visualisation tool it can help to use a quick query like this as part of our initial exploratory data analysis.

We start with the USE patient command, which tells SQL which database to use (it's different from the SELECT command and we're using in that it selects the database for subsequent processes, not the table). Next we use SELECT and ROUND to give a rounded time in hospital to one decimal place, this gives us the number of days spent in hospital as each bucket's ID. This is important as we will later use COUNT and this won't work as intended without rounding. We then alias this command using AS to define the new column as bucket which represents the bucket our histogram will create. Now we use COUNT(*) to count the total number of rows for each recurring time_in_hospital entry, we again alias this command to create a new column called count.

Next we're going to use the command called RPAD (which stands for Right-Pad), a MySQL command. This command takes three positional arguments such that RPAD('String','Length','RPAD_String'), where the first argument is a string we can pass in that would be common to all outputs, however we don't want that so we will leave this blank (i.e. ' '). Next we set the length of the characters, we want to make a histogram so we want this to be a proportional distribution scaling by 100. So for our length criteria we pass in out COUNT() which is the count of all the rows, divided by 100. Lastly we specify our RPAD_String which is what we want to concatenate to the end of the function. As we didn't select an initial string parameter and we want to create a histogram we will use '' to represent this. Lastly we will alias this as bar to create a column for our plot.

Finally we need to set a GROUP BY parameter, so we will choose bucket and also ORDER BY bucket to create our desired graph.


A SQL command for creating a Histogram

Which creates the following visualisation for us, an exponential distribution. So it looks like from this data that the vast majority of patients stay for between 1 & 4 days. The majority of patients stay for less than 7 days.


Our SQL generated Histogram

DISTINCT

Next, we want to look at what procedures are carried out in the hospital and then which of these procedures are the highest cost and most frequently occurring. Firstly in a large dataset like this it helps to know how many specialties there are. As there are many rows of data for each patient the medical specialties will be duplicated in our table for groups of patients who require the same medical speciality. So instead of using SELECT * and looking through our entire table by eye we can use DISTINCT. DISTINCT returns all the values in a column that are unique regardless of how many times it is duplicated in the table it will only return each unique value once. So we can use this to figure out what each speciality is.

We simply call the DISTINCT function after our SELECT statement with the column we are querying being passed in as the argument. We'll also ORDER BY medical_specialty to give us these specialties in alphabetical order.


Using DISTINCT to identify unique values in a column


There are 73 unique specialties, though I am only showing the first 10 for brevity

HAVING

Now we know what all the hospital specialities are, we want to find out which specialities are doing the most number of procedures as this is the biggest cost to the hospital. We'll create a query that returns the average number of procedures per speciality.

We can do this with the following query. Firstly we select the medical_specialty column and then use the aggregate function AVG and pass in num_procedures as our argument. We'll alias this as avg_procedures. Then we'll GROUP BY our medical_speciality so we can see which specialities have the most procedures and ORDER BY our avg_procedures. Using DESC we specify that the order should be in descending order so now we have a list of most commonly occurring procedures.


A simple query to show the average number of procedures per speciality

Our output looks like this.

We have a lot of recurring decimals in this now, which looks pretty ugly, but we also have some values which are whole numbers which are a little bit suspicious. We could look at getting better data returned by writing a better query. We can use the ROUND statement as we have done previously to truncate those recurring decimals. But what about those whole numbers we saw? It is likely that those exist because of a small number of patients were seen. I.e. a small number of patients with a high number of (complex) procedures could be skewing our data. Let's use the COUNT function again to count the number of procedures to see if this is the case and we'll incorporate our ROUND function as well.

So we will SELECT medical_speciality and ROUND our AVG(num_procedures) and alias again as before. But we'll also add COUNT(*) to count all the rows and alias this as count.


Querying the average number of patients but using COUNT(*) also

We return the following output, and it looks like we were right, those whole number values only had a single patient are are skewing the dataset.


The average number of patients and the total number of patients per procedure

Now we want to filter out these low count entries that are skewing our dataset. Some of these could be data entry errors or rare/complex cases. But we want to look at the average per patient so it's best if we could somehow only look at the averages above a certain count threshold such as say 50 procedures. We could do that by using the HAVING statement.

HAVING is a filter which works in a similar way to the WHERE command. The difference is that we use WHERE command filters on individual rows and the HAVING command filters aggregate rows. So because we want to filter the number of patients below a threshold that is created by using COUNT we must use HAVING to filter this. Here is an example, the query is largely the same as our previous one except that after we use GROUP BY we can specify our HAVING filter. We want to look only at specialties that have >50 procedures with an average procedure >2.5. We can do this by adding in HAVING count >50 AND avg_procedures >2.5.


A query using HAVING

This returns the following data. We can see that Cardiology has by far the highest number of procedures and that Thoracic Surgery (which is also a type of cardiovascular surgery) has the highest number of average procedure. In fact everything in this list bar radiology is related the the circulatory system in some way which is interesting.


The highest number of average procedures per patient using HAVING filters

JOIN

Oftentimes we need to analyse data that is split across two tables. In order to do this effectively we need to JOIN the two tables into a singular table and then run our query on this new table. Essentially when you use the JOIN command you are bolting on one table next to the other. There are different types of JOIN such as LEFT JOIN, INNER JOIN and so on. For this project I am just going to use a LEFT JOIN which in MySQL is the default type for JOIN.

We have two tables in our database, namely health and demographic. The health data as the name suggests contains details about each patient in terms of the number of procedures they had, lab procedures, how long they were in hospital for etc. The demographic table shows us each patients gender, race, age and weight. Both tables have a column named patient_nmb which is key as we need to specify this in our JOIN statement later to match the correct data with the correct patient.

We want to now find out if patients are being treated differently depending on their race. For example we want to find out the number of lab procedures done but the data for this is in two tables.

We do this first by selecting all from our health table, after our SELECT statement we use JOIN followed by the second table which we want to join to our initial table which in this case is called demographics. We then need to use ON which specifies which column to match the data on. As both tables have patient_nbr as a common identifier this is the most logical choice. Notice that we have to use dot notation to specify which table is being joined to which, namely health.patient_nbr = demographic.patient_nbr. If we simply put patient_nbr = patient_nbr the SQL would not know which table to join to which table.


A simple LEFT JOIN

We get the following data. Notice that the readmitted column is from our health table and everything to the right of that has been joined onto the end from our demographics table.


A completed LEFT JOIN

Now that we have our table we can finish our query and determine if there are any discrepancies between the average number of lab procedures and the patients race. Our completed query now takes the same form as any other query we would carry out.


The completed query using a left JOIN

The following data is returned. It doesn't appear that there are any huge differences between races, perhaps only that Asian people seem to have fewer procedures on average.


Average number of lab procedures by race

CASE WHEN

Suppose we wanted to understand if there was a correlation between the number of lab procedures a patient has and their time in hospital. Or more accurately, we hypothesise that the longer a patient stays in hospital the more procedures they have. We can do this by creating new columns that fit certain criteria. It works a little bit like an IF statement in such that we pass a value and depending on the statement it will filter our data point into a particular bin.

Let's say we want to broadly aggregate the number of lab procedures into three categories being: few, average and many. How would we achieve this?

Firstly we SELECT the time_in_hospital column as an AVG and alias it as avg_time. Then after our SELECT statement we call CASE. On a new line we set our first conditional by using the WHEN statement. Our first statement will be WHEN num_lab_procedures >= 0 AND num_lab_procedures < 25 THEN "few".

That means that when the number of lab procedures is greater than or equal to zero and less than 25 we assign to this a bin using THEN and we name that bin "few". We repeat a similar statement for our next WHEN command but call this bin average and this ranges from 25 to 55 procedures. Finally we use the ELSE command which catches all other values which don't fall into the first two bins and we call this bin "many". When using CASE WHEN we have to use the END command to specify that we have finished using this filter, we can alias the column name too which we will do by calling it procedure_frequency.

Then we use FROM as the whole CASE WHEN command is actually part of our SELECT statement (it's joined by the comma). We can then use GROUP BY to group by procedure frequency and ORDER by and DESC to give us a defending order of time in hospital by each aggregate bin.

The output of this query looks like this. So it does appear that there is a correlation between the amount of lab procedures conducted and the length of stay inside a hospital.


The output of our query suggests that the longer people stay in hospital the more procedures they have

UNION

A JOIN is where we take data from two tables and join them together by adding columns to another table. However sometimes we want to stack rows on top of each other from different tables. We can do this using the UNION command. Let's say for example that we wanted to a medical test on anyone who is Asian or had recorded metformin levels as 'Up'. If we wanted to get a list of patient numbers for this test then we would do so with the following SQL query.

First we select our patient_nbr from our first table demographics, but we use a where filter to only return people WHERE race = 'Asian'. This is the same as any other WHERE statement we've used previously. We then call the UNION function and then use SELECT again for our second table. This time we again select patient_nbr (which is common across both tables) and use a WHERE statement for WHERE metformin = 'Up'.


A UNION query

This returns a single column of data for all Asian patients and patients with elevated metformin levels.


Our returned query gives us Asian patients with elevated metformin levels

Subqueries & CTEs

Subqueries are a very useful way to essentially have nested queries within a statement. This can be extremely useful, one such case when this is useful is when we want to ask very specific queries. For example, let's say that we want to create a query that shows us how many people entered the hospital with an emergency (admission_type_id of 1) but stayed for less than the average amount of time.

This sounds like you could use a simple WHERE filter to do this however recall that WHERE won't work on aggregate functions such as AVG(time_in_hospital). Instead we will call a second SELECT statement inside of our query. Here's how it works.

We SELECT * FROM health using a WHERE clause to filter for emergencies (admission_type_id = 1). We chain this with AND time_in_hospital to select only cases where the time is below average. Here is where we use our subquery. We use rounded brackets to specify our subquery and then use SELECT and FROM as we normally would.


A nested query or subquery

This returns all cases of when patients admitted with emergencies were discharged with a less than average time in hospital.


Returned query for emergency patients that stayed for a less than average time in hospital

Sometimes you need to use multiple nested queries and they can be difficult to keep track of. Fortunately we can use something called common table expressions (or CTEs). Essentially it allows us to assign a variable name to a query. We do this with the WITH command, which allows us to alias the query at the beginning of our statement. The returned data is exactly the same as the previous query however it's a little easier for data analysts to keep track of.


A subquery with a CTE

CONCAT

Finally, let's look at the CONCAT function. CONCAT stands for concatenate which basically means joining things together. We can concatenate values from individual cells into a new cell for example. The RPAD command used concatenation earlier though this time we'll be more specific. Let's say that we wanted to find out about patient readmissions how many medications they received and how many lab procedures were conducted. We'll assume that this is for a report and that's it's easy to copy and paste this data out of our table.

Firstly we'll use SELECT followed by our CONCAT command. Like other commands we use rounded brackets. Everything that we place inside the CONCAT brackets will be concatenated into a single cell. We can add text using single quotes and commas to separate our data values. The output will be in the form of: Patient 8222157 was AfricanAmerican and was not readmitted. They had 1 medications and 41 lab procedures.

So firstly we write 'Patient ' followed by the patients ID number, followed by race (separated with appropriate text). Then we need to use CASE WHEN, however we need to do this inside a subquery. So we'll call CASE WHEN and use NOT on health.readmitted (readmitted column from our health table), set to 'NO' and then we'll call this bin "was". We'll repeat this again but we'll remove the NOT to give us the opposite output and call this bin "was not", then we can call END to our CASE WHEN statement. We then add our appropriate text, and the number of medications and number of lab procedures. We end our CONCAT function with a closing bracket and then use FROM to specify where to pull this data from. Lastly, as we've concatenated data from two tables (health and demographic) we need to use JOIN, we'll do this as we did before and JOIN ON patient_nbr.


A CONCAT statement with a nested query

Our returned query looks like this.


Our query returns the following data

Conclusion

In this project I've analysed healthcare data using MySQL and some of SQLs more advanced functions. We have covered:

Creating histograms using RPAD
Using DISTINCT to find unique data entries
Using HAVING to filter aggregate rows
Using JOINs to combine tables
Using CASE WHEN to create categorical variables (bins) from our data
Using UNION to stack rows of data together from different tables
Using subqueries and CTEs for more specific queries
Using CONCAT to concatenate values into single cells

Additionally from the data we've determined the following from our analysis:

The majority of patients stay in hospital between 1 and 4 days
There are 73 unique medical specialties in the dataset
The highest count and average number of procedures are for cardiovascular illnesses
Asian people on average have fewer lab procedures than other races
The longer people stay in hospital the more lab procedures they have

I hope you enjoyed reading about this project. If you'd like to see more of my projects please check out or connect with me on LinkedIn or view my portfolio website or GitHub.

Analysis of US Healthcare Data using advanced SQL functions