How to Remove Punctuation from a Dataframe in Pandas…
In this short Pandas tutorial, you will learn how to remove punctuation from a Pandas dataframe in Python. Note that a previous post covered how to remove punctuation from Python strings. This post uses a similar method, so refer to that post if you need to know what counts as “punctuation”.
Example Data
In the example Pandas DataFrame, below, you can assume that the data were scraped from a website and then added to a Python dictionary.
import pandas as pd

data = {'ID#': list(range(1, 11)),
        'Gender.1': ['F', 'M'] * 5,
        'State': ['AL.', 'AK.', 'AS.', 'AS.', 'CA.',
                  'CO.', 'DC.', 'FL.', 'ID.', 'CA.'],
        'Words': ['Hey,', 'Stop', 'Seaborn,', 'Pandas', 'DataFrame]',
                  'Good#', 'DataScience,', 'Python', 'Tutorials$', 'AI..']}
df = pd.DataFrame(data)
In the code above, the pd.DataFrame constructor creates a Pandas DataFrame from the dictionary. If you then use df.head()
you will get the following output:
In the output above, you will see that there is punctuation in both the column names and the cells of the Pandas DataFrame. In the following sections, you will learn how to clean this punctuation out of the data. First, you will learn how to remove punctuation from the columns in the dataframe. Second, you will learn how to remove punctuation from the column names of the same dataframe.
Remove Punctuation from a Column in Pandas Dataframe
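In this section, you will learn how to get rid of the punctuation in a column of a Pandas dataframe. Here, you are going to use the str.replace method to remove the punctuation from a single column. The pattern [^\w\s] matches every character that is neither a word character nor whitespace. Since pandas 2.0, str.replace treats the pattern as a literal string by default, so regex=True is passed explicitly:
df["StateNoPunctuation"] = df['State'].str.replace(r'[^\w\s]', '', regex=True)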
df.head()
In the example above, you created a new column containing the values without the punctuation. If you instead want to remove the punctuation from the existing column, you could assign the result back to the same column:
df['State'] = df['State'].str.replace(r'[^\w\s]', '', regex=True)
df.head()
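One detail of the pattern used above is worth checking: \w also matches the underscore, so underscores are kept. The following is a minimal sketch, assuming you would rather strip every character listed in Python's string.punctuation. The sample values and the names sample, word_based, punct_pattern, and punct_based are made up for illustration and are not part of the tutorial's data.

```python
import re
import string

import pandas as pd

# Hypothetical sample data, not the tutorial's DataFrame
sample = pd.DataFrame({'State': ['AL.', 'A_K.', 'CA,']})

# \w matches letters, digits, and the underscore, so '_' is kept
word_based = sample['State'].str.replace(r'[^\w\s]', '', regex=True)

# Build a character class from string.punctuation to drop '_' as well
punct_pattern = '[' + re.escape(string.punctuation) + ']'
punct_based = sample['State'].str.replace(punct_pattern, '', regex=True)

print(word_based.tolist())
print(punct_based.tolist())
```

Which of the two patterns fits depends on whether underscores carry meaning in your data.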
Remove Punctuation from many Columns in Pandas DataFrame
In this section, you will learn how to remove punctuation from multiple columns in a Pandas DataFrame. To do so, you can write your own function and then use the apply method, which calls the function once per column:
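def remove_punctuation(x):
    # Non-string columns have no .str accessor, so they are returned unchanged
    try:
        x = x.str.replace(r'[^\w\s]', '', regex=True)
    except AttributeError:
        pass
    return x

df = df.apply(remove_punctuation)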
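As an alternative to catching the error, you could pick out the text columns first and only clean those. This is a minimal sketch under the assumption that every text column should be cleaned. The sample data and the names sample, text_columns, and cleaned are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with one numeric and two text columns
sample = pd.DataFrame({'ID': [1, 2, 3],
                       'State': ['AL.', 'AK.', 'CA.'],
                       'Words': ['Hey,', 'Good#', 'AI..']})

# Select the columns that hold strings instead of relying on try/except
text_columns = [col for col in sample.columns
                if pd.api.types.is_string_dtype(sample[col])]

# Work on a copy so the original data stays untouched
cleaned = sample.copy()
for col in text_columns:
    cleaned[col] = cleaned[col].str.replace(r'[^\w\s]', '', regex=True)

print(cleaned)
```

This makes it explicit which columns are changed, and numeric columns are never passed to str.replace.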
Now that you have removed the punctuation marks from your Pandas dataframe, you may want to continue cleaning the data. If you need to know how to change the data types of Pandas columns, refer to the post on that topic.
How to Clean Column Names from Punctuation in Pandas DataFrame
In this final example, you are going to learn how to clean the column names. As you may have noticed, there is punctuation in the column names of the DataFrame as well. Here, you will again use the str.replace method, but this time on the column names:
df.columns = df.columns.str.strip().str.replace(r'[^\w\s]', '', regex=True)
df
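Removing characters from column names can make two different names identical, and a DataFrame with duplicate column names is awkward to work with. Here is a minimal sketch, assuming you only want to apply the cleaned names when they are still unique. It uses the column names from the example data with one row of placeholder values. The names sample and cleaned_names are hypothetical.

```python
import pandas as pd

# Column names from the example data, with one row of placeholder values
sample = pd.DataFrame({'ID#': [1], 'Gender.1': ['F'],
                       'State': ['AL.'], 'Words': ['Hey,']})

cleaned_names = sample.columns.str.strip().str.replace(r'[^\w\s]', '', regex=True)

# Only rename if no two columns collapse into the same cleaned name
if cleaned_names.is_unique:
    sample.columns = cleaned_names

print(list(sample.columns))
```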
As you can see, here you used the columns attribute to get the column names and then removed the punctuation from them. If you also need to change the column names entirely, make sure you check out the post on that topic. Finally, if you need to add a column to a Pandas DataFrame, I have covered that in a post as well. More generally, what you have done here is data manipulation in Python.
Summary
In this short Python Pandas tutorial, you have learned how to remove punctuation from Pandas DataFrames. You learned how to use the str.replace method to do this on one column and on all the columns in the DataFrame. Finally, you learned how to clean column names containing punctuation. Note that there is also a Python package for cleaning data called pyjanitor. Check it out!