Card image cap

Understanding Economic Development with the World Bank & SQL

Overview

The world bank is an international development organisation comprised of 187 countries. Its role is to raise the living standards of its poorer members by issuing credit in the form of loans for economic development. It has several branches such as the IBRD which lends money to middle income countries. The IFC which lends money to the private sector and the IDA which lends money to poorer countries. Loans from the world bank are credited with improving the wellbeing of a countries inhabitants allowing them to: Supply safe drinking water, build schools and train teachers, increase agricultural productivity, manage forests and other natural resources, build and maintain roads, railways, and ports, extend telecommunications networks, generate and distribute energy, expand health care and modernise.

Fortunately the data collated by the World Bank is completely open source and available to the public, and due to the vast amount of data collected provides a perfect opportunity to explore the use of SQL in analysing large data sets.

The Data

The data I will be analysing in this project is from the IDA branch of the world bank. It is the IDA's statement of credits and grants from 1961 to the present date (last updated February 2023). The data can be found here: https://finances.worldbank.org/Loans-and-Credits/IDA-Statement-Of-Credits-and-Grants-Historical-Dat/tdwh-3krx

I downloaded this data from the above link as a CSV and imported it into DBeaver to analyse the data. There are over 870,000 rows of data in this dataset which is a good reminder for why SQL is the preferred tool to analyse data of such a large size. A program such as Excel would be unable to handle such large datasets without crashing or extremely poor performance.

Exploring the dataset

Regardless of what tool I am using, when I open a new dataset the first thing I want to do is explore a little bit to understand what I'm working with. This typically means the first thing I want to do is find the shape of the data, more specifically how many rows and how many columns. Normally just to get a quick look at the data a common SQL command is simply:

image_1
A simple SELECT * statement returns the entire table

Now this returns ALL of the data in the table. The * symbol literally means 'all' so SELECT (ALL) FROM (Table) and you get all your rows and all your columns. Each row in this dataset is a loan that is provided to a country, each column is an attribute of that loan, such as service charge, repayment dates etc. DBeaver actually returns this data pretty quickly but it is hard to analyse. SQL becomes more and more useful when you have specific questions (or rather queries) that you ask it. Generally the more specific your query the more specific the data returned.

image_2
All the data returned from the aforementioned query. Not all data is shown in this screenshot.

Filtering Data (Using WHERE)

We can for example filter for only data of a specific country. We know that Bolivia is listed as a country that has taken out a loan so let's look at how many other loans Bolivia has taken out. We can do this using the WHERE clause. WHERE is added at the end of our SQL script after we've specified what data to select and from which table. So our next query will look like this:

image_3
Using the WHERE command filters for only results that meet the WHERE criteria

Well, this worked but it's still returning a lot of rows of data, more than I can easily count by just scrolling through the table.

image_4
Filtered results WHERE country = 'Bolivia'

Counting the number of data entries with COUNT(*)

Let's ask another query to try and find out specifically how many loans Bolivia has taken out with the IDA. What we are essentially asking SQL to do here is count the number of rows in this new table. We can do this with the COUNT command. Here's how it works:

image_5
COUNT(*) counts the total number of rows WHERE the country is 'Bolivia'

Instead of selecting * (all) data from our table, we're now instead asking SQL to count all the data entries (rows) in the table. This is because we used the * symbol as our argument. This returns the following:

image_6
Aggretate functions such as COUNT(*) returns a single value

A single row and a single column. An aggregate count of all the rows. 16,739, that would have been a lot of scrolling to find this value. So we now know that 16,739 loans have been issued to Bolivia since the founding of the World Bank.

Grouping results with GROUP BY

Let's ask a better query again. We know that Bolivia has had 16,739 loans issued by the World Bank but what about other countries within the IDA branch? To do this we need to use a new clause, GROUP BY. If we wanted to look at other countries beyond Bolivia we would need to reference this at the start of our query after SELECT. We can select multiple columns in a table by separating the column names we're selecting with a comma. However we still want the count of each country so our query would start to look like SELECT country, COUNT(*) from ida_statement_of_credits_and_grants. But the query will error, this is because SQL doesn't know how to group the data, so we need to specify it with GROUP BY. We add GROUP BY at the end of our query and we're simply telling SQL to GROUP BY country, so we want to select the country column, and count how many times that country occurs within the table and then group this count data with its' associated country. The query looks like this:

image_7
Many queries won't run unless we specifcy how to group the data with GROUP BY

And this query returns this data:

image_8
Our unordered list for number of loans

ORDER BY, DESC & LIMIT

Well now this is starting to look much more useful. We can see how many loans each country has taken out with the IDA. But there is a huge discrepancy in the data between countries, some countries have taken out thousands of loans and other have less than 10! This query returned 122 rows of data (for 122 countries that have borrowed from the IDA). However let's only take a look at the top 10. We can do this quite easily with a couple of additional SQL commands. Firstly we want to take the data from this returned query and order it, we can do this with the ORDER BY command. In a similar fashion to GROUP BY, ORDER BY orders that data by a category that we specific, we again but ORDER BY at the end of our query and we choose to order by COUNT(*), that is the count of rows in the original table.

image_9
Using ORDER BY we can now order by a particular column

The following data is returned which is NOT what we wanted, we wanted the top 10 highest number of loans not the bottom 10. This happens because by default SQL orders our data in ascending order, to order it in descending order we use the DESC command at the end of our query.

image_10
By default ORDER BY will return an ascending order

Let's try the same command again but adding the DESC command. Also whilst we're at it let's add another command LIMIT as we only want to return the top 10 not all 122 rows. LIMIT goes at the very end of our query, followed by an integer value which represents the number of rows we want to truncate down to. We want the top 10 so we'll set LIMIT 10.

image_11
When we add DESC we specificy we want the order to be descending, LIMIT truncates the data
image_12
Our top 10 countries ordered by number of loans

Perfect, we now have exactly what we wanted in the first place. The top 10 countries ordered by the number of loans they have taken out with the IDA. This is a good example of how the more specific a query the more specific the results that are returned.

Aggregate functions SUM

However, there is a slight problem. We know which countries have taken out the most loans, however we don't know what the value of those loans are. Bangladesh took out slightly more loans than Pakistan, but the monetary value of those loans is hidden with these queries so lets take a look at finding out what the sum of the loans that have been taken out is we can do this with the SUM command in SQL. The command is another aggregate command and so it will look very similar to our previous query. However the argument we pass into this command will be original_principle_amount as this is the amount given (in USD) as a loan. So the query will be asking SQL to select the country, and sum the values of the original principle amount grouping this data by its associated country, and then ordering the data by the original principle amount. Again we will use DESC and LIMIT to get the top 10 highest loans.

image_13
SUM adds all the values in the given argument and returns that value
image_14
The total amount of money i.e. the SUM of all loans in descending order

Well this revealed some very interesting trends in the data. As with our earlier query where we counted the number of loans we can see that the top 3 borrowers (India, Bangladesh and Pakistan) also borrowed the greatest amount of money. However Bolivia is no longer ranked as the 4th biggest borrower. What's also interesting is that India borrowed $119,091,005,738,949 USD and both Bangladesh and Pakistan borrowed almost four times as much as the next highest borrower.

Let's deep dive into this insight a little further by asking better queries again. Let's focus on India as by far the biggest borrower from the IDA. We know that the IDA has been issuing credit since 1961 and that our query returned ALL data from table. We don't know however at what time countries became members of the 187 member World Bank organisation. Let's filter our query again but lets use the WHERE clause to see how much is still currently owed to the IDA, because our previous query returns the value of all loans, including loans that have been fully repaid. So we will add WHERE credit_status = 'Repaying' to show all active loans.

image_15
Adding credit_status = 'Repaying' to filter for active debts only

The data below shows significantly smaller values now that we have filtered for active loans however we still see the same trend in the top 3 borrowers. South Sudan has disappeared from our list now and Vietnam has appeared as a borrower with an active loan.

image_16
Top 10 active debts

Chaining WHERE commands using AND

This data shows some interesting trends. India for example has taken out a total of 660,992 loans amounting to $119,091,005,738,949 of which it is still actively repaying $3,658,503,335,311. That's a lot of money to comprehend, however we can try and find out some more about this using SQL to determine what that money was invested into within India. Using the project_name column and the WHERE clause again we can filter for projects that contain a specific name. I noticed when browsing through the data that a lot of the projects are for railways, so I add the WHERE command as WHERE project_name = 'RAILWAYS'. However, we're already using where to filter for country, so we need to use another SQL command called AND. AND works in just the same way as WHERE but it is used to chain these filters together. So our query acts like this: SELECT country, credit_status, project_name, SUM(original_principle_amount) from ida_statement_of_credits_and_grants WHERE country = 'India' AND credit_status = 'Repaying' AND project name = 'RAILWAYS'. Now because we've added a few columns and an aggregate function to our SELECT statement we need to specifiy how to group the data for SQL to understand, so we can simply GROUP by credit_status, country and project_name. The final query looks like this:

image_17
Chaining together WHERE commands with AND is very useful

And the resulting output looks like this:

image_18
The total amount still being repaid for railway projects in India

So from this query we now know that India is is currently repaying $6,177,086,091 for railway related projects alone. Let's go one step further and try to understand when these railway projects began filtering with dates.

Inverse filtering using NOT

Let's now look and see how many loans and the total value of loans that have been taken out by India in the past 10 years (i.e. since 2013). To do this we will need to use a WHERE filter again, however we can use these with dates also. We want to include loans that have been fully repaid and are currently being repaid but we don't want to include cancelled or disbursed loans that will give us false values. So our first query will look like this:

image_19
Using NOT selects everything apart from what is specified in the command, we can filter by dates too

And our corresponding data will look like this:

image_20
619,226 loans taken by India in the past 10 years

So we know that in the past 10 years India took out 619,226 loans that were not cancelled or otherwise terminated (i.e. completed projects).

Let's now modify this query to find the total monetary value, we do this again by replacing COUNT(*) with SUM(original_principle_amount). The query looks like this:

image_21
A query to find the total amount loaned (USD) in a 10 year period

And our resulting data looks like this:

image_22
The total amount loaned to India in a 10 year period

So to summarise in the last 10 years India has taken out a total of 619, 226 loans with the IDA amounting to a total of $115,168,012,265,000

Conclusion

So by exploring this data we've explored a few really helpful features that SQL has to offer, it's ability to handle large datasets and quickly and efficiently execute queries is the main reason it is so dominant as a technology today. Some of the SQL commands and their uses are:

  • Using WHERE to filter tables by categorical variables
  • Using AND to chain WHERE clauses together
  • Using COUNT(*) to count the total number of times an entry appears in a table
  • Using NOT to give the inverse selection
  • Using GROUP BY to group results by categorical variables
  • Using ORDER BY to order results by a particular column
  • Using DESC to order in descending order
  • Using LIMIT to truncate the returned query results

In terms of analysing the Economic data we have learned:

  • The highest number of loans given by the IDA was to India with 660,992 loans applied for
  • The biggest borrowers were India, Bangladesh and Pakistan
  • The total money lent to India (in USD) was $119,091,005,738,949
  • The total amount of money currently being repaid by India is $3,658,503,335,311
  • India is currently repaying $6,177,086,091 for railway related projects alone
  • Since 2013, India has taken out 619,226 loans amounting to a total of $115,168,012,265,000

I hope you've enjoyed reading this article, to see more of my projects please click the link below to my personal portfolio website. If you'd like to get in touch please email me or contact me via LinkedIn.

Thanks for reading.