Pandas Tutorial: How to Read and Describe Dataframes in…
In this Python Pandas tutorial, you are going to learn how to read data into dataframes and then how to describe the dataframe. This is the first step you go through when doing data analysis with Python and Pandas.
1. Specifying a Working Directory in Python
Previously, you learned about reading all files in a directory with Python using the Path class from the pathlib module. Here you will learn how to specify the working directory with Path and the os module.
import os
from pathlib import Path
data_path = Path(r'C:\PyDad\Data')
os.chdir(data_path)
First, you created the path to the data folder, and then you changed the working directory to that path using os.chdir.
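If you want to double-check that the change took effect, here is a quick sketch that prints the current working directory:
import os
from pathlib import Path

# Print the current working directory to confirm that the change took effect
print(os.getcwd())

# The pathlib equivalent, which returns a Path object
print(Path.cwd())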
2. Reading Data with Python and Pandas
Now, data can be stored in numerous different file formats (e.g., CSV, Excel, SQL databases). Here's the Pandas documentation, which covers the available readers.
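For instance, reading from an SQL database is also just one function call away. The database file and table below are not part of this tutorial's data; this is only a minimal sketch assuming a local SQLite database exists:
import sqlite3
import pandas as pd

# Hypothetical SQLite database file and table name, used only for illustration
con = sqlite3.connect('example.db')
df_sql = pd.read_sql('SELECT * FROM measurements', con)
con.close()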
Reading Data from a CSV File with Pandas:
Here’s how to read data into a Pandas dataframe from a .csv file:
import pandas as pd
df = pd.read_csv('BrainSize.csv')
Now, you have loaded your data from a CSV file into a Pandas dataframe called df. You can now use the many methods of the dataframe object (e.g., describe() to get summary statistics, as shown later in the post). Note, the dataset can be downloaded here.
Reading Data from an Excel File with Pandas:
Here’s how to read data into a Pandas dataframe from an Excel (.xls) file:
df_xls = pd.read_excel('distribution-data.xls')
Now, you have read your data from a .xls file and, again, have a dataframe (this time called df_xls). One super neat thing with Pandas is that you can read data from the internet: you can simply specify a URL to a .csv, .xlsx, or .xls file. Here’s a complete code example for loading both a CSV and an Excel file from internet sources:
import pandas as pd
df = pd.read_csv('http://www.statslab.cam.ac.uk/~rds37/teaching/statistical_modelling/BrainSize.csv')
df2 = pd.read_excel('https://mathcs.org/statistics/course/00-data/distribution-data.xls')
Data types in Pandas Dataframes
In a previous post, you learned how to change the data types of columns in Pandas dataframes. Here, you’ll get an overview of the available data types in Pandas DataFrame objects:
Pandas dtype | Python data type | NumPy type | Description |
---|---|---|---|
object | str | string_, unicode_ | Text (strings) |
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
float64 | float | float_, float16, float32, float64 | Decimal numbers (floating point numbers) |
bool | bool | bool_ | True/False values |
datetime64 | datetime | datetime64[ns] | Date and time values |
timedelta[ns] | NA | NA | The difference between two points in time (dates) |
category | NA | NA | Text (strings) with a finite set of categories |
It is important to keep an eye on the data types of your variables, or else you may encounter unexpected errors or inconsistent results. When you load data using the Pandas methods, for example read_csv, Pandas will automatically assign each variable a data type, as you will see below. Note, if you want to change the type of a column, or columns, in a Pandas dataframe, check the post about how to change the data type of columns.
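To see which data type Pandas assigned to each column after loading, you can inspect the dtypes attribute:
# Show the data type Pandas assigned to each column when the file was read
print(df.dtypes)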
3. How to Inspect and Describe the Data in a Pandas DataFrame
An initial inspection can be carried out directly by using the shape attribute of the df object. Running the code below, you will see that the size is 38 (number of rows) x 7 (number of columns).
df.shape
Now, you can also just explore the number of rows or columns by using indexing:
df.shape[0]
Above, you used 0 to get the number of rows of the dataframe and, below, you get the number of columns using 1.
df.shape[1]
If you want to get more information about your DataFrame object you can also use the info() method:
df.info()
Now, after you have inspected your Pandas DataFrame, you might find that your data contains characters you want to remove. See the previous post about how to remove punctuation from a Pandas DataFrame if you need to get rid of dots (.), commas, and such from your categorical data. If you need to rename your variables (i.e., columns), check the post about how to rename columns in Pandas DataFrames.
If you need to clean the dataframe further (e.g., change names or subset the data), you can carry out data manipulation in Python with Pandas.
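As a small sketch of what such cleaning could look like with this data, note that the new column name and the Gender labels below are assumptions used only for illustration:
# Rename a column; the new name is only illustrative
df_renamed = df.rename(columns={'FSIQ': 'FullScaleIQ'})

# Subset the data, keeping one group only
# (assuming the Gender column uses the labels 'Female' and 'Male')
df_female = df[df['Gender'] == 'Female']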
Describe the Pandas Dataframe (Descriptive Statistics):
The data analysis pipeline should always start with reviewing your data. To quickly get some descriptive statistics of your data using Python and Pandas, you can use the describe() method:
df.describe()
Skipping the descriptive statistics is always a mistake that only leads to lost time later. The aim is to consider the following things:
- How much data do I have? The number of rows (observations) and columns (variables)?
- What does the distribution look like? The following parameters are of particular interest:
- The minimum value (min)
- The highest value (maximum)
- The range (distance between minimum and maximum values)
- The mean and the standard deviation (for normally distributed variables)
- The median and the interquartile range (for non-normally distributed variables)
- The mode (the most frequent value)
- How many missing values do you have in each column (variable)? (See the sketch after this list.)
- Is there any pattern to the missing data? For example, do the same individuals have missing values across several variables?
- Are there correlations between the variables, and how pronounced is the correlation (especially important if you plan on doing regression analysis)?
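For instance, here is a minimal sketch of checking for missing values; the BrainSize data used in this tutorial may or may not contain any:
# Count missing values in each column (variable)
print(df.isna().sum())

# Show the rows that contain at least one missing value,
# which helps reveal whether the same observations are affected
print(df[df.isna().any(axis=1)])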
To illustrate the points above: there are hundreds of functions in Python and Pandas, but you only need to become familiar with a few of them. Here you will start with the describe() method, which describes each of the columns with the following parameters:
df.describe()
The output above is suitable for the numerical variables, which are described by these parameters. In fact, describe() will only take your numeric variables into consideration if you don’t tell it otherwise. That is, if your DataFrame contains mixed variables (data types), the describe() method will by default only present your numerical variables.
That said, describe() can also be used with strings and other data types. Instead of the numerical parameters (mean, std, and so on), you will then get count, unique, top, and freq. Here, top gives you the most frequent value (also referred to as the mode), while freq is the number of occurrences of that most common value. Now, if you only want descriptive statistics for the objects (e.g., strings) you can use this code: df.describe(include=['O']), and if you only want to describe the categorical variables, use the command df.describe(include=['category']).
df.describe(include=['O'])
Note that it’s also possible to leave out certain data types; that is, you can simply change include to exclude. Furthermore, running the above code with the data in this tutorial will only give you one column (and it only works with objects, as there is no categorical data). If you want to change the data type of a column, you can run the following code:
df['GenderCat'] = df['Gender'].astype('category')
df.describe(include=['category'])
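If you would rather see the numerical and non-numerical columns summarized in one table, describe() also accepts include='all'; parameters that don’t apply to a given data type are shown as NaN:
# Summarize every column at once, regardless of data type
df.describe(include='all')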
How to List all Variables (Columns) in a Pandas DataFrame
To list all the variables (columns) in your Pandas dataframe you can use the following code:
df.columns
Now, this may be useful if you get your data from someone else and need to know the names of the variables in the dataset. For example, if you are planning on using certain variables in a statistical model, you may need to know their names.
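If you need the column names as a plain Python list, or want to check whether a particular variable is present, here’s a quick sketch:
# Turn the column index into an ordinary Python list
column_names = list(df.columns)
print(column_names)

# Check whether a specific variable exists in the dataframe
print('FSIQ' in df.columns)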
How to Show the First n or Last n Rows in a Pandas DataFrame
Typically, you will need to get a quick overview of what your data looks like. One common way to tackle this is to print the first n rows of the dataset:
df.head()
Another common method to get a quick glimpse of the data is to print the last n rows of the dataframe:
df.tail()
Both are very good methods to quickly check whether the data looks ok or not, especially as we may work with very large datasets that we cannot check as a whole. It’s worth knowing, here, that you can put a number within the parentheses to show the first, or last, n rows. For example, df.head(7) will print the first 7 rows of the DataFrame.
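For example:
# Print the first 7 rows of the dataframe
df.head(7)

# Print the last 3 rows of the dataframe
df.tail(3)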
How to get Descriptive Statistics of Specific Variables (Columns)
To get the summary statistics of one or more specific variables, you can select the column(s) like this:
df[['FSIQ']].describe()
If you want to select, and describe, more than one column, just add that column name to the list (e.g., after FSIQ in the example above), as shown in the sketch below.
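For example, assuming the dataset also contains a column named VIQ (an assumption; it is not shown in this post), describing two columns would look like this:
# Describe two columns at once; 'VIQ' is only an assumed example column
df[['FSIQ', 'VIQ']].describe()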
To just get the individual descriptive statistics (e.g., mean, standard deviation) you can check the following table:
Descriptive statistic | Pandas Method | Description |
---|---|---|
Mean | df.mean() | To calculate the mean of the numerical columns |
Standard Deviation | df.std() | Standard deviation of the numerical columns |
Mode | df.mode() | Mode of the numerical columns |
Median | df.median() | Median values of the numerical columns |
Minimum | df.min() | Minimum values |
Maximum | df.max() | Maximum values |
Variance | df.var() | Calculates the variance |
Standard Error of The Mean | df.sem() | Returns the standard error of the mean for the numerical values |
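Here is a quick sketch of how a couple of these methods can be applied, both to a single column and to the whole dataframe (in recent Pandas versions, numeric_only=True is needed to skip non-numeric columns such as Gender):
# Mean and standard deviation of a single column
print(df['FSIQ'].mean())
print(df['FSIQ'].std())

# Mean of all numerical columns at once
print(df.mean(numeric_only=True))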
How to Create Frequency Tables and Crosstabs with Pandas
In order to create two-way tables (crosstabs), you can use the pd.crosstab() function:
df = pd.read_csv('http://www.statslab.cam.ac.uk/~rds37/teaching/statistical_modelling/Cycling.csv')
pd.crosstab(df['colour'], df['vehicle'])
If you need to learn more about crosstabs in Python, check out this excellent post.
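If you also want row and column totals in the two-way table, pd.crosstab() takes a margins argument; a small sketch:
# Add row and column totals ('All') to the crosstab
pd.crosstab(df['colour'], df['vehicle'], margins=True)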
How to Create a Correlation Matrix in Python with Pandas
In order to calculate the correlation statistics (creating a correlation matrix) of your data, you can use the corr() method. Note that in newer versions of Pandas you may need to pass numeric_only=True so that non-numeric columns (such as Gender) are skipped:
df = pd.read_csv('http://www.statslab.cam.ac.uk/~rds37/teaching/statistical_modelling/BrainSize.csv')
df.corr(numeric_only=True)
Pandas Histogram
You can create a histogram in Python with Pandas using the hist() method:
df = pd.read_csv('http://www.statslab.cam.ac.uk/~rds37/teaching/statistical_modelling/Cycling.csv')
df.hist(column='passing.distance')
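Note that hist() draws the plot with matplotlib under the hood. Here is a small sketch, assuming matplotlib is installed, that also sets the number of bins and shows the figure explicitly:
import matplotlib.pyplot as plt

# Control the number of bins and display the figure
df.hist(column='passing.distance', bins=20)
plt.show()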
Now, the next step might be data pre-processing, depending on what you found when inspecting your DataFrame. If you’re ready for data analysis, you might be interested in learning about 6 Python libraries for neural networks.
That was it; you have now learned about inspecting and describing Pandas dataframes. These are, of course, very important aspects of the data analysis process you’ll go through. More specifically, you have learned how to set the working directory, how to create dataframes from CSV and Excel files, load data from the web, inspect parts of the data, and calculate summary statistics. Finally, you also used crosstabs, correlations, and some basic data visualization to explore the distribution (with histograms, in this case).