Everything about Pandas Library - Beginner

Aug 4, 2017

PandasPython

Introduction

If you do not know what Pandas is, it’s a python library for data manipulation, data cleaning and data analysis. There are other libraries like Numpy (Pandas is built on top of this), Scipy as well which help in data analysis. They provide a lot of functionalities which help in achieving this. It helps this process of analysis by providing ways to filter out redundant data, fill null information with some sensible data, extracting insights out of existing data, plot graphs for data visualization etc. So let’s get started.

Data Structures in Pandas

Pandas support three types of data structures -Series, DataFrame and Panel. Series is a 1-Dimensional array. DataFrame is a 2-Dimensional array (i.e. table). Panel is 3-Dimensional Array. But Series and DataFrame are the most commonly used ones.

Setting up the Environment

Before getting started with Pandas you need to have it installed on your environment. Simply install it by typing sudo pip install pandas in the terminal. Another simpler way is by installing Anaconda Distribution available here (Note: This distribution includes not just the pandas library, but a lot of other python libraries as well).

Reading a data file in Pandas

Pandas supports a lot of data formats (Read about it here). But the most commonly used data format is CSV.

Let’s get started by taking the training dataset from Kaggle. Now create a new .py file and import the pandas library.

import pandas as pd

Use the read_csv function to load the csv file into the DataFrame.

data = pd.read_csv(‘train.csv’)

The above command might take time based on the number of rows and columns in the csv file.

If you want to customize the way the file is read you can provide it with attributes as well. Some commonly used attributes like headers, delimiter can also be used. You can read more about it here.

.shape and .dtypes

shape property of the dataframe returns a tuple containing number of rows and columns in the dataset.

data.shape

dtypes property of the dataframe returns all the columns in the dataset along with the datatype of each column. Something as shown below.

Pandas shape property

Using head() and tail() in pandas

head method is used to print first 5 rows of the dataset. By default this function returns 5 rows, but you can return as many rows as you want by passing the row count to the head function. Something as shown below.

data.head(7)

The above command prints the first 7 rows of the dataframe.

Similar to head function, you can use tail function to print the last ‘n’ rows of the dataframe using tail function.

data.tail(4)

Using describe()

Describe function gives a rough idea of how data is distributed over the range of values in the dataset. It prints out the minimum, maximum, mean, standard deviation, lower upper and 50th(median) percentiles. This is the case for columns having numeric data types as shown below.

data.describe()

Pandas shape property

For string and timestamp columns the describe function prints unique, top and frequency information also.

data.describe(include=’all’)

Pandas shape property

You can read more about describe function here

Selecting columns

Pandas supports multiple ways to select columns

By Column Name

data.Name

data[‘Name’]
Over a range of rows

data[100:500]
By Column Index

data.iloc[200:500,1:5]

This means take rows from 200 to 499 and columns 1,2,3 and 4
Conditional selection:

data[data.Age>18]

This selects all the rows of the dataframe where age is greater than 18

Adding new columns to the dataframe

Certain situations you might want to create a new column in the dataframe by combining other data columns. Below shown code will concatenate Name and Gender columns of the dataframe.

data[‘Name Gender’]=data[‘Name’]+ ‘-‘ +data[‘Sex’]

Sorting

Dataframe can be sorted using the sort_values function in pandas. You can simply sort the values by a column name by following way.

data.sort_values(by=’Name’)

You can customize the way the values are sorted, read more about it here.

Creating new dataframe out of existing one

This is usually used to filter out unnecessary data out of the dataframe.

filteredData = data[data.Sex = ‘female’]

Here I am creating a new dataframe having only female passengers information

String Functions

Adds a new column to the dataframe which has the length of the description column.

data[‘description_len’] = data[‘description’].str.len()
Convert string to lowercase or uppercase

data[‘Name_Lower’] = data[‘Name’].str.lower()

data[‘Name_Upper’] = data[‘Name’].str.upper()

Math Functions

Finding out mean

data.mean()
Finding out median

data.median()

Writing to DataFrame to a File

Once you are done making the neccessary changes to the DataFrame, you can save it to csv file using to_csv function as shown below.

data.to_csv(‘newfilename.csv’)

Conclusion

This is just the beginner part of Pandas library. There are lot of other features that are available in this library like Grouping, Joining, plotting graphs etc. I will be discussing about these in another blog.