In this tutorial, you will learn all you need to know about data manipulation in Python with Pandas.
According to TheFreeDictionary.com data manipulation is “the standard operations of sorting, merging, input/output, and report generation.” This means that manipulating data is skillfully removing issues from the data to give us clean and tidy data that we can use easily later on in our data analysis.
Now, this tutorial will not go into details on each of these functions but will, instead, give you the quick “how it is done”. If you’re looking for more detailed Pandas tutorials you can check out the excellent blog Marsja.se. That said, let us move on to how to install Pandas.
Why do we have to carry out data manipulation?
Now, before we go on and learn how to carry out data manipulation in Python, you might want to know why we may need to do this.
Well, in real life, the data we are working with is, most of the time, far from ready to analyze. Often, the data that we get or have collected, has gaps, missing values, incorrect date formats, missing variable names, long variable names, variables with strange names, no variable names, variables scattered into multiple columns, or variables we don’t need for the particular analysis we plan to carry out. When this data is presented for analysis, it needs intervention to make it tidy. Therefore, this post will teach you a couple of data manipulation techniques that can be carried out with the Python package called pandas.
Now, first of all, we have to separate the scientific use for data manipulation. It is important to note that manipulating data, in that sense, is nothing you should do. Ever. As you may understand by the definition above, the things you’ll learn in this Pandas tutorial have nothing to do with forging data.
In the sections of this data manipulation in Python tutorial, you will learn the different types of Pandas methods and how to use them over the dataframe under the different types of manipulations that we can carry out in Python using Pandas. First, we start by making sure that Pandas is installed. Second, we will have a quick look at the example data and how to read this data file with Pandas. That is, we will start by learning the method that enables us to import data into a Pandas dataframe. In the following sections, we will go into the data manipulation techniques that Pandas let us use, in Python. Here, we start off by subsetting data and, then, go on by transforming data.
Installing the Python Package Pandas
Are you unsure whether you have pip installed or not? Luckily, you can list all the installed Python packages with pip.
That said, here’s how to install Pandas with pip:
pip install panda
Now that we have made sure that you have Pandas installed, we can go on to a short description of the data we are going to work with in this tutorial.
We will use data found online (here) that is stored in a CSV file. This data is the WHO’s global tuberculosis report.
To read data from a CSV file, we are going to use Pandas read_csv() method:
import pandas as pd
df = pd.read_csv('https://extranet.who.int/tme/generateCSV.asp?ds=dr_surveillance')
Running the code above, will result in importing the data from the CSV file into a dataframe called df, containing all the variables found in the dataset. Of course, the read_csv() method can take multiple arguments other than those we used here. Note, the first thing to do is, of course, importing Pandas. See the documentation for more information. In the next section, before carrying out data manipulation in Python, we will explore the dataframe.
Exploring our Dataframe
Before subsetting data by row numbers or index, let’s have a look at the structure of the dataframe using the .info() method:
The result above shows us a lot of information. For instance, there are a total of 409 observations (rows) and 40 variables (columns). Furthermore, there are four string-type variables (object) and the rest of them are numeric. To be specific, there are two types of numeric: integer (int64) and float (float64). Another way to get a quick overview of the dataframe is to use the .head() method. Let’s check out the first 10 columns and the first 5 rows:
Note, to get the first 10 columns, we used the .iloc() method. In the next section, we are going to start of by subsetting by row numbers or index. In that section, we will use the numbers you can see in the left-most column (left to ”country”).
How to Subset Data in Python with Pandas
According to TheFreeDictionary ”subset” is ”a set that is part of larger set”. This means, that in this section you will learn how to make a smaller set, a subset that is, from the dataframe that we previously created. Remember, in the section ”reading data” we imported our data into the dataframe named ”df”.
Subsetting the Dataframe by Index
Here is an example where we will subset 1st row:
df[df.index == 0]
Note how we used of square brackets and the df.index == 0. This is one way to subset data that makes use of Boolean operators. Indexing in Pandas dataframe works, as you may have noticed now, the same as indexing a Python list (first row is numbered 0). Note, if you make a certain column index, this will not be true. For example, subsetting the first row in a dataframe where you have set the index to be a column in the data you imported, means that you will have to use whatever that is in the index column (first column, sort of speaking).
You can also get the same result by using .iloc (i.e., df.iloc[0:1, :]) and we are going to continue by using .iloc to subset a range of rows. Here’s how to use .iloc and indexes to subset range of rows from 1st to 4th row.
Subsetting dataframe based on a condition
In the previous sections, subsetting of the rows or columns was rather straight forward because we know the index or the observations we want to subset. However, in moste scenarios, subsetting is derived by sometimes complex conditions and calculations.
Let’s look at how can we subset rows from a data frame based on a condition. In the first example, we are going to subset by the variable ”country” (column) and choose the rows where the country is ”Afghanistan”. Here’s how to subset by a single condition:
df[df.country == 'Afghanistan']
Note when subsetting based on conditions, we can use “&” , “|” and “==” (like in the first example) operators to define “AND”, “OR”, and equality conditions. For instance, if we want to select rows that either the symbol, in the ”country” column, is either ”Afghanistan” or ”China” we can do as follows:
df[(df.country == 'Afghanistan') | (df.country == 'China')]
It is also possible to subset based on a condition using larger than (”>”) or smaller than (”>”) as well. Here’s how to select rows that are both, as in the example above, either data from Afghanistan or China (the country variable) or have a value in the variable xdr larger than 5:
df[((df.country == 'Afghanistan') | (df.country == 'China')) & (df.xdr > 5)]
In both examples above, notice the use of parentheses. As we in the last example, are going to subset either Afghanistan or China as well as rows where the column xdr is larger than 5 we set parentheses for the first condition (Afghanistan or China) and then the AND operator outside of the parentheses.
Subsetting data frame using the query() method
Pandas dataframe also has another function, that is quite easy to work with, to subset data: .query(). Here the input can use the same operators, as above, but they should be put into a string within the parentheses of the .query() method. Subsetting rows where the country is Afghanistan can be done like this:
df.query('country == "Afghanistan"')
Notice how we put the condition value (“Afghanistan”) within the “”? This was done because it’s an object (string).
Why would you transform data?
There are, of course, many reasons for transforming data. One could be that you have incomplete data for the analysis you would like. Here in this section, we will look at the following five transformation methods.
It is worth mentioning, however, that data transformation is a large topic and cannot be limited to the following four aspects. Furthermore, the necessity for transforming data is contingent on individual requirements and, thus, the five topics below are generic.
- How to add a new variable to the dataframe in Pandas
- How to remove a variable from the data frame in Pandas
- How to rename variables of a dataframe
- Changing the data type of a variable
- Dropping missing malues
How to add a new variable to the Pandas dataframe
A dataframe is a matrix-like structure where individual variables (columns) often are of different types. Furthermore, dataframe that we are working with in this Pandas tutorial, has four object (string) variables and the rest are numeric variables. What if we want to add another variable that contain a logical value for all observations where the value in the column xdr is larger than 5.
Let’s look at this example and add a new variable xdr_bool to our dataframe. This variable currently doesn’t exist into the dataframe and a simple assignment using = would add it.
df['xdr_bool'] = df.xdr.apply(lambda x: True if x > 5 else False)
Here, we used the apply method and the lambda function. What we did was to go through all observations in the xdr variable and check whether the value is larger than 5.
Now, there are other methods to add a new column to the dataframe. The rest of this section will cover them as well. First, here’s how to add a new variable using the .assign() method:
df = df.assign(PulmGM=(df.pulm_labconf_new + df.pulm_labconf_ret + df.pulm_labconf_unk)/3)
The assign method returns a new dataframe and, thus, we used = and got the changes permanent. Furthermore, the first parameter to the assign method is the new column name. In our case, it is “PulmGM” and we calculated the mean of the different columns named pulm_*.
Here’s a third way to add a new, empty, column to the dataframe: using the .insert() method.
Removing Variables from the Dataframe
Now, the dataframe we are working with contains a lot of variables. Here we are goning to remove a couple of the columns from the dataframe. First, we will have a look at how to remove a single column. After that, we will drop multiple columns.
How to Remove a Variable:
Here’s how to remove a column from Pandas dataframe:
df.drop('iso_numeric', axis=1, inplace=True)
It was pretty, simple, right? We just used the .drop() method together with the inplace parameter. We used the inplace parameter to make the change to the dataframe permanent. In the next section, we will drop more columns from the dataframe.
How to Drop Multiple Columns:
Here’s how to drop many columns instead:
df.drop(['iso2','iso3', 'g_whoregion'], axis=1, inplace=True
As you can see, in the example above, we put a list in the .drop() method, containing all the columns we want to remove as strings (like in the first example). However, we put the variables to drop in a list.
If your values need to be cleaned, from e.g. punctuation in the dataframe columns, you can do this with the replace method.
Renaming a variable of a Pandas dataframe
Why would you rename a column?
In this section, we are going to learn how to rename a column in Pandas dataframe. For example, the data that we are going to use for analysis can sometimes have long variable names which may create a problem with the indentation, length of the line, and may eventually make the code harder to read. This should be avoided, and you may want to rename the columns so that they become a little more manageable.
A variable (column) of a dataframe can be renamed like using the rename() method. Let’s rename the variables pulm_labconf_new to PulmNew in the dataframe.
Let’s go through the code above. It’s pretty straightforward, we used the rename method with the columns parameter. To the columns parameter we put a dictionary of the colum we wanted to rename. Furthermore, the key (first part) is the old name and the value (the second after the colon) is the new column name.
Renaming multiple variables of a dataframe in Pandas
Here’s how to rename multiple columns in one go:
Working with the rename method enables us to change as many columns that we want to change the names of. Note, that if you want the changed names to stick to the dataframe make sure to use the inplace parameter (set it to true: inplace=True).
Changing the data type of a column
As you may remember, the data we are working with in this tutorial, mainly contains of numeric data. If we ever want to change the data type of a column in a Pandas dataframe here’s how to do it:
df['xdr'] = df['xdr'].astype(object)
Dropping missing values
Finally, before concluding this post we are going to have quick look at how to drop rows with one, or two, missing values. Pandas has the .dropna() method for this task. Here’s how to remove all rows with missing values from the dataframe:
That was simple, right? Here, again, we used the inplace parameter for the reason that we want it to stick to the dataframe. Here, you can see that we have fewer rows in the dataframe:
If you feel that you are done with manipulating data you can go on and describe the dataframe, list all variables, inspect the dataframe, and calculate descriptive statistics.
To conclude, in this tutorial, you have learned a lot about using Pandas for data manipulation in Python. Of course, you haven’t learned all techniques. Here’s also a great post to check out!
McGraw-Hill Dictionary of Scientific & Technical Terms, 6E. S.v. “data manipulation.” Retrieved August 30 2020 from https://encyclopedia2.thefreedictionary.com/data+manipulation