Pandas is a popular Python library for data analysis and manipulation. It provides a fast and easy way to work with tabular data, such as CSV files, Excel spreadsheets, or SQL databases.

In this blog post, I will share some useful pandas cheat sheet that covers some of the most common tasks and operations that you may encounter in your data science projects. Whether you need to filter, sort, group, summarise, understand the data, pandas has a function or method for you. Let’s get started!

You can find the jupyter notebook for this post here : Click here

**Install Pandas**

You can install pandas in two ways:

**Using pip**

```
pip install pandas
```

**Using conda**

```
conda install -c anaconda pandas
```

**Import Pandas**

```
import pandas as pd
```

The data can be one-dimensional data or multi-dimensional data. First, I will give you an overview of one-dimensional data, then we’ll dive deep into multi-dimensional data.

**Series**

A pandas series is a one-dimensional array of data that can store any type of values, such as numbers, strings, or boolean. You can create a pandas series using `Series()`

from a list, a dictionary, or a numpy array.

For example, you can create a series like this:

```
# creating an array using list
temps_list = [1, 2, 3, 4, 5]
# creating a dictionary
temps_dict = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}
# creating a numpy array
temps_numpy = np.array([100, 200, 300, 400, 500])
# creating a series from list
list_series = pd.Series(temps_list)
# creating a series from dictionary
dict_series = pd.Series(temps_dict)
# creating a series from numpy array
numpy_series = pd.Series(temps_numpy)
list_series
# ------------------- OUTPUT --------------- #
0 1
1 2
2 3
3 4
4 5
dtype: int64
# ------------------- OUTPUT --------------- #
dict_series
# ------------------- OUTPUT --------------- #
1 10
2 20
3 30
4 40
5 50
dtype: int64
# ------------------- OUTPUT --------------- #
numpy_series
# ------------------- OUTPUT --------------- #
0 100
1 200
2 300
3 400
4 500
dtype: int64
# ------------------- OUTPUT --------------- #
```

You can clearly see that this is an one-dimensional data.

Now, let’s learn about multi-dimensional data.

**DataFrame**

A pandas dataframe is a data structure that allows you to store and manipulate tabular data in Python. It is similar to a spreadsheet or a database table, but with more features and flexibility.

You can create a dataframe from various sources, such as lists, dictionaries, files, or web pages. A dataframe has rows and columns, each with a label. You can access and modify the data in a dataframe using various methods and attributes.

For example, you can create a dataframe of names and ages of people like this:

```
# creating a dictionary having names and ages
names_ages_dict = {
"Name": ["Alice", "Bob", "John", "Doe"],
"Age": [18, 24, 35, 11]
}
# creating dataframe from that dictionary
dict_dataframe = pd.DataFrame(names_ages_dict)
dict_dataframe
# --------------------- OUTPUT -------------- #
Name Age
0 Alice 18
1 Bob 24
2 John 35
3 Doe 11
# --------------------- OUTPUT -------------- #
```

This is the easiest way to create a dataframe as you can see the keys of dictionary will become column names and the values of the dictionary will be the values of the respective column.

**Load Data Using Pandas**

We’ll be using The Titanic Dataset for the rest of this post. Click here to download the dataset.

```
# read_csv(file_path).
data = pd.read_csv("Titanic-Dataset.csv")
```

**Look At The Data**

**Look At The Entire Data**

```
data
```

**Look At The Top n Rows**

You can use `head()`

method that takes an optional parameter `n`

. `n`

denotes the number of rows you want. By default, `n = 5`

.

```
# this will output the top 5 rows as by default n = 5
data.head()
```

```
# this will output the top 10 rows as I've set n = 10
data.head(n=10)
```

**Look At The Bottom n Rows**

You can use tail`()`

method that takes an optional parameter `n`

. `n`

denotes the number of rows you want. By default, `n = 5`

.

```
# this will output the bottom 5 rows as by default n = 5
data.tail()
```

```
# this will output the bottom 10 rows as I've set n = 10
data.tail(n=10)
```

**Mathematical Functions on Pandas**

**To get the min value**

To get the minimum values for every column.

```
data.min()
# ---------------------- OUTPUT --------------------------------- #
PassengerId 1
Survived 0
Pclass 1
Name Abbing, Mr. Anthony
Sex female
Age 0.42
SibSp 0
Parch 0
Ticket 110152
Fare 0.0
dtype: object
# ---------------------- OUTPUT --------------------------------- #
```

**To get the max value**

To get the maximum values for every column.

```
data.max()
# ---------------------- OUTPUT --------------------------------- #
PassengerId 891
Survived 1
Pclass 3
Name van Melkebeke, Mr. Philemon
Sex male
Age 80.0
SibSp 8
Parch 6
Ticket WE/P 5735
Fare 512.3292
dtype: object
# ---------------------- OUTPUT --------------------------------- #
```

**To get the mean value**

To get the mean value for every column.

```
data.mean()
# ---------------------- OUTPUT --------------------------------- #
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
# ---------------------- OUTPUT --------------------------------- #
```

**To get the median value**

To get the median value for every column.

```
data.median()
# ---------------------- OUTPUT --------------------------------- #
PassengerId 446.0000
Survived 0.0000
Pclass 3.0000
Age 28.0000
SibSp 0.0000
Parch 0.0000
Fare 14.4542
dtype: float64
# ---------------------- OUTPUT --------------------------------- #
```

**To get the standard deviation**

To get the standard deviation value for every column.

```
data.std()
# ---------------------- OUTPUT --------------------------------- #
PassengerId 257.353842
Survived 0.486592
Pclass 0.836071
Age 14.526497
SibSp 1.102743
Parch 0.806057
Fare 49.693429
dtype: float64
# ---------------------- OUTPUT --------------------------------- #
```

**Accessing Rows and Columns of DataFrame**

Now, we’ll learn about how to read every column or every row or a specific column and a specific row. But first, let’s learn about `loc[]`

and `iloc[]`

.

One of the features of pandas is that it allows us to select and modify data using labels or indices. The `loc[]`

and `iloc[]`

methods are two ways of doing this. The `loc[]`

method selects data based on labels, such as column names or row names. The `iloc[]`

method selects data based on integer indices, such as the position of the rows or columns.

Here, I will explain the difference between pandas `loc[]`

and `iloc[]`

methods, which are used to select data from a DataFrame in Python. I will also show some examples of how to use them.

`loc[]`

and `iloc[]`

are both indexers, which means they allow us to access specific rows and columns of a DataFrame by using labels or positions. However, they have different ways of doing so.

**loc Method**

`loc[]`

is a label-based indexer, which means it selects data based on the row and column labels. For example, if we want to select the row with index **0** and the column with label “**Name”**, we can use:

```
# This will return the Name of first row
data.loc[0, "Name"]
# ------------------------- OUTPUT ----------------------- #
'Braund, Mr. Owen Harris'
# ------------------------- OUTPUT ----------------------- #
```

`loc[]`

also supports slicing. For example, if we want to select “Name” and “Age” of all rows :

```
data.loc[:,["Name", "Age"]]
```

**iloc Method**

`iloc[]`

is a position-based indexer, which means it selects data based on the row and column positions. For example, if we want to select the **first row** and the **fourth column** of a DataFrame, we can use:

```
# this will return the Name of the first row as Name column is in position = 3
data.iloc[0, 3]
# ---------------------- OUTPUT ---------------------- #
'Braund, Mr. Owen Harris'
# ---------------------- OUTPUT ---------------------- #
```

`iloc[]`

also supports slicing. For example, if we want to select the **first three rows** and the **Name**, **Sex** and **Age** columns of a DataFrame, we can use:

```
data.iloc[0:3, 3:6]
```

**Differences Between loc and iloc**

One important difference between `loc[]`

and `iloc[]`

is that `loc[]`

**includes the last element of the slice**, while `iloc[]`

**excludes it**. For example, if we want to select the **first two rows** of a DataFrame, we can use:

```
data.loc[0:1] # includes both rows 0 and 1
data.iloc[0:1] # includes only row 0
```

Another difference is that `loc[]`

can accept boolean arrays as inputs, while `iloc[]`

cannot. For example, if we want to select all rows where the ‘Age’ column is greater than 18, we can use:

```
data.loc[data["Age"] == 18] # works fine
data.iloc[data["Age"] == 18] # raises an error
```

To summarize, `loc[]`

and `iloc[]`

are both useful methods for selecting data from a DataFrame in Python. `loc[]`

is based on labels, while `iloc[]`

is based on positions. `loc[]`

includes the last element of the slice, while `iloc[]`

excludes it. `loc[]`

can accept boolean arrays, while `iloc[]`

cannot.

**Describe and Info Of The Data**

We will explore two useful methods that can help us understand the basic properties and statistics of our data: describe and info.

The describe method returns a summary of the numerical columns in a DataFrame or a Series. It calculates some common descriptive statistics, such as count, mean, standard deviation, minimum, maximum and percentiles.

To make sense of the above output, the output says that average or mean age of passenger is 29. Another observation can be that the maximum fare is 512.

This way you can make sense of the output of `describe()`

method.

**Info**

The info method returns a concise summary of the non-numerical columns in a DataFrame or a Series. It shows the index dtype and column dtypes, non-null values and memory usage.

You can see that, at-first the output show that it’s a range-index and it has 891 entries. Then it shows that there are 12 columns and then it shows those 12 columns position, name, non-null count, and data type.

Note: To learn more about the types of indexes, take a look at pandas documentation —Click Here

**Sorting The Data**

We can sort the data in multiple ways:

**Sort Using A Single Column**

Let say you want to sort the data by ages in ascending order. By default, the data will be sorted in ascending order.

```
# sorting the data by fare in ascending order
data.sort_values("Fare")
```

```
# sorting the data by faire in descending order
data.sort_values("Fare", ascending=False)
```

**Sort Using Multiple Columns**

```
data.sort_values(["Fare", "Pclass"])
```

```
data.sort_values(["SibSp", "Pclass"], ascending=False)
```

We can also sort different column in their own way.

```
# This will sort the Pclass is ascending order and Age is descending order
data.sort_values(["Pclass", "Age"], ascending=[True, False])
```

**Manipulating The Data**

**Adding A Column**

```
# creating a new column named total -> sum of SibSp and Parch
data["Total"] = data.iloc[:,[6,7]].sum(axis=1)
data
```

**Deleting A Column**

```
# drop a column
data.drop(columns=["Total"], inplace=True)
```

**Reordering The Columns**

Let say you want to move the “Fare” column to the right most position

```
# reorder the columns using loc
data = data.loc[:,["PassengerId","Survived","Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Cabin", "Embarked", "Fare"]]
data
# reorder the columns using iloc
data = data.iloc[:,[0,1,2,3,4,5,7,8,10,11,9]]
data
```

**Filtering Data**

```
# get only passenger's having Pclass == 1
data.loc[data["Pclass"] == 1]
```

```
# getting only passenger's having Pclass == 1 and survived
new_data = data.loc[(data["Pclass"] == 1) & (data["Survived"] == 1)]
new_data = new_data.reset_index(drop=True)
new_data
```

**Conditional Change**

```
# replace the sex value of male to 1
data.loc[data["Sex"] == "male", "Sex"] = 1
data.loc[data["Sex"] == "female", "Sex"] = 0
data
```

**Aggregation Using GroupBy**

May be you have tried to learn about groupby in the past and got confused but don’t worry, here I’ll give you enough examples to get comfortable with groupby. So, let’s start!!!

Let say you want to know how many females and males were present in the ship.

Here, we will use the groupby method on the column “Sex” and then use the function count, so that it’ll count the respective column values.

```
data.groupby("Sex").size()
# ------------------------- OUTPUT ------------------ #
Sex
0 314
1 577
dtype: int64
# ------------------------- OUTPUT ------------------ #
```

Let say now you want to find how many male and female passengers survived.

```
data.groupby(["Sex", "Survived"]).size()
# ------------------------ OUTPUT ----------------------- #
Sex Survived
0 0 81
1 233
1 0 468
1 109
dtype: int64
# ------------------------ OUTPUT ----------------------- #
```

Let say you want to find number of male and female passengers for each embark and their min,max, and mean age.

```
data.groupby(["Sex","Embarked","Pclass"]).agg({'PassengerId' : 'count' , 'Age':['min','max','mean']})
```

**Handling Missing Values**

**Get Number of null values in each column**

```
data.isnull().sum()
# -------------------------- OUTPUT --------------------------- #
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# -------------------------- OUTPUT --------------------------- #
```

This will return the total number of null or missing values in every column.

Note: If you don’t use .sum(), then it’ll return a dataframe consisting of Boolean values where a`True`

means it is null and a`False`

means it is not null.

```
data.isnull()
```

To handle the missing values in a dataset, we can do one of the two things:

Drop the columns which has missing values

Fill the missing values with some value

**Dropping Columns**

You have already learnt how to drop specific columns. Here, I’ll share a strategy with you about when you should drop a column.

If a column has more null values than non-null values, then drop the column.

**Filling Missing Values of a Column**

There are multiple ways of filling the missing values of a column:

You can simply hard code a specific value, which will replace all the null values.

You can also fill the null values with the mean,median or mode of the column values.

In the above image, you saw that there are 177 missing values in `Age`

column. So. let’s fill the null values with the average age of all the passengers.

```
average_age_of_passenger = data["Age"].mean()
data.loc[:,"Age"].fillna(average_age_of_passenger, inplace=True)
data["Age"].isnull().sum() # this will show the number of missing values in Age
# ------------------------ OUTPUT ----------------------- #
0
# ------------------------ OUTPUT ----------------------- #
```

As you can see the missing values are now replaced with the mean age of passengers.

I can’t show you all the different ways to fill missing values, but in your data science journey, you’ll learn a lot of advanced methods to handle missing values.

**Miscellaneous**

**value_counts() method**

You can use the `value_counts()`

function to count the frequency of unique values in a pandas Series or a Dataframe column.

To get the total number of male and female passengers:

```
data["Sex"].value_counts()
# -------------------------- OUTPUT --------------------------- #
male 577
female 314
Name: Sex, dtype: int64
# -------------------------- OUTPUT --------------------------- #
```

**Conclusion**

In this blog post, we have learned the basics of pandas, a powerful Python library for data analysis and manipulation. We have seen how to create and manipulate data frames, how to perform common operations such as filtering, sorting, grouping and aggregating.

I hope you have enjoyed this pandas 101 blog post and learned something new. If you have any questions or feedback, please leave a comment below. Thank you for reading!