TechAE Blogs - Explore now for new leading-edge technologies

TechAE Blogs - a global platform designed to promote the latest technologies like artificial intelligence, big data analytics, and blockchain.

Introduction To Pandas - Part 1

Today, we are going to learn about a famous Python library known as Pandas: what amazing functions it has and how to use it successfully. This is part 1 of a two-article series.

Table of Contents

  • What is Data Analysis?
  • What is Pandas Python?
  • Why Pandas?
  • How to install Pandas in Python?
  • How to import Pandas in Python?
  • How to check the version of Pandas?
  • Pandas Objects
  • How to read CSV files in python using pandas?
  • Methods and Attributes of DataFrame
  • Some important concepts in Pandas
  • Summary Functions
  • Aggregation Functions
  • Sorting Function
  • Renaming Function
  • Conclusion

What is Data Analysis?

Data analysis is a way of gathering, organizing, and, if necessary, manipulating data in order to extract insightful information (i.e. trends and patterns) from it.


What is Pandas Python?

Pandas (the name is derived from "panel data" and is also a play on "Python data analysis") is an open-source library that provides a variety of data structures and data manipulation methods that allow you to perform complex tasks with simple one-line commands. It's mostly used by data scientists.

Why Pandas?

The major advantage of using Pandas is that it helps you manipulate and analyze large volumes of data (millions of rows/records) with ease and efficiency.

How to install Pandas in Python?

Execute this single-line code in your local environment's console:


pip install pandas

How to import Pandas in Python?

Once installed, we can import it in the following way:


import pandas as pd

How to check the version of Pandas?

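At the time of writing, the latest version is 1.4.4. To see which version you have installed, run the following (the output shown here comes from an older installation, so yours may differ):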


pd.__version__

'1.2.3'

Pandas Objects

There are two fundamental data structures:

What is a Series?

It is a one-dimensional labeled array, like a single column in a table, that can hold any data type and comes with an index. If a Series holds a mix of types, such as both numbers and strings, its data type is 'object'. By default, the index of a Series starts from 0.


a = pd.Series([10, 20, 30, 40, 50])
a

0    10
1    20
2    30
3    40
4    50
dtype: int64
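To make the points about the index and the 'object' data type concrete, here is a small sketch. The variable names and values are made up for illustration.

```python
import pandas as pd

# A Series with custom labels instead of the default 0, 1, 2, ...
prices = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Mixing numbers and strings makes pandas fall back to the generic 'object' dtype
mixed = pd.Series([10, "twenty", 30.5])

print(prices["y"])   # look up a value by its label
print(prices.dtype)  # an integer dtype
print(mixed.dtype)   # object
```

A numeric dtype is usually what you want, because most numerical operations are only fast (or only defined) on numeric columns.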

What is a DataFrame?

It is a two-dimensional table made up of a sequence of aligned Series, with labeled axes (rows and columns). Below is an example of creating a DataFrame from a dictionary.


df = pd.DataFrame({
    "car": ['Mercedes', 'Maserati MC20', 'Ferrari'],
    "speed": [420, 530, 450]
}, index=['a', 'b', 'c'])

df

             car  speed
a       Mercedes    420
b  Maserati MC20    530
c        Ferrari    450

You can give your own row indexes as above.
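The idea that a DataFrame is a collection of aligned Series can be sketched as follows. This is only an illustration that rebuilds the same table from two Series sharing one index.

```python
import pandas as pd

labels = ["a", "b", "c"]
car = pd.Series(["Mercedes", "Maserati MC20", "Ferrari"], index=labels)
speed = pd.Series([420, 530, 450], index=labels)

# Each dictionary key becomes a column; rows are matched up by index label
df = pd.DataFrame({"car": car, "speed": speed})

print(type(df["speed"]))   # every column of a DataFrame is itself a Series
print(df.loc["b", "car"])  # row 'b', column 'car'
```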

How to read CSV files in python using pandas?

CSV is a basic file format that stores comma-separated values. The Pandas read_csv() function loads such a file into a DataFrame. Pandas offers similar readers and writers for other file formats too, for example read_json() for JSON.


df = pd.read_csv("filename.csv")   # read from a local file
# or
df = pd.read_csv("Link_to_file")   # read from a URL

Writing a DataFrame to a CSV file is just as easy with the to_csv() method. Note that it is called on the DataFrame itself, not on pd:


df.to_csv("filename.csv")
# or
df.to_csv("filename.csv", index=False) # export without the index
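If you do not have a CSV file at hand, the following sketch writes the small car table to a temporary file and reads it back. The file name cars.csv and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
})

# Use a temporary directory so the example does not touch your working directory
path = os.path.join(tempfile.mkdtemp(), "cars.csv")
df.to_csv(path, index=False)  # index=False leaves the row labels out of the file

df_back = pd.read_csv(path)   # load the same file into a new DataFrame
print(df_back)
```

Without index=False, the row labels are written as an extra first column, which then shows up as an unnamed column when the file is read again.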

Methods and Attributes of DataFrame

There are some functions and attributes that allow us to observe basic information about the data stored in a DataFrame object:

DataFrame.head():

By default, it returns the content of the first 5 rows.


df.head()

DataFrame.tail():

By default, it returns the content of the last 5 rows.


df.tail()

DataFrame.shape:

It returns a tuple of the form (number of rows, number of columns).


df.shape

DataFrame.dtypes:

It returns the data type of each column.


df.dtypes

DataFrame.info():

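This method prints a concise summary of the DataFrame: the index range, the column names, the number of non-null values per column, the data types, and the memory usage.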


df.info()

DataFrame.columns:

This returns the names of the columns.


df.columns

DataFrame.index:

This returns the index (the row labels).


df.index
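Here is how these methods and attributes look when tried on the small car table from earlier. Treat it as a sketch: on a three-row table, head() and tail() are given an explicit row count so that they do not simply return everything.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

print(df.head(2))        # first 2 rows instead of the default 5
print(df.tail(1))        # last row only
print(df.shape)          # (3, 2): 3 rows, 2 columns
print(df.dtypes)         # one data type per column
print(list(df.columns))  # ['car', 'speed']
print(list(df.index))    # ['a', 'b', 'c']

df.info()                # prints its summary directly and returns None
```

Note that shape, dtypes, columns, and index are attributes, so they are written without parentheses, while head(), tail(), and info() are methods.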

Some important concepts in Pandas:

What is Indexing in Pandas?

Indexing allows easily accessing particular rows and columns from a DataFrame.

There are two different methods of indexing in Pandas:

  • loc - label-based selection
  • iloc - index-based selection

Index-Based Selection - selecting data based on numerical position.


df.iloc[ ]

Label-Based Selection - selects data based on the column or row names/index.


df.loc[ ]
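The difference between the two is easiest to see on a table whose row labels are not numbers. The sketch below reuses the car table from earlier; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

first_by_position = df.iloc[0]     # first row, whatever its label is
first_by_label = df.loc["a"]       # the row labelled 'a'
speed_of_b = df.loc["b", "speed"]  # a single cell: row 'b', column 'speed'

first_two = df.iloc[0:2]           # positions 0 and 1; the end position is excluded
a_to_c = df.loc["a":"c"]           # labels 'a' through 'c'; the end label is included

print(speed_of_b)
print(first_two)
print(a_to_c)
```

The slicing behaviour is the detail that most often surprises people: iloc follows ordinary Python slicing and excludes the end, while loc includes it.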

What is Selecting?

There are two types of selection:

Attribute (Dot) Based Selection


df.column_name

Dictionary (Bracket) Based Selection


df['column_name']

To select multiple columns in a DataFrame, pass a list of column names:


df[['column1_name','column2_name']]
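As a quick illustration of both selection styles, the sketch below uses a hypothetical column called "top speed". Its name contains a space, so it can only be reached with brackets.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "top speed": [420, 530, 450],  # hypothetical column name containing a space
})

cars_dot = df.car                  # attribute (dot) based selection
cars_bracket = df["car"]           # dictionary (bracket) based selection
speeds = df["top speed"]           # brackets are the only option for this name
both = df[["car", "top speed"]]    # a list of names returns a DataFrame

print(cars_dot.equals(cars_bracket))
print(type(both))
```

Bracket selection is the safer habit, since dot selection also fails when a column name clashes with an existing DataFrame method such as count or mean.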

Subsetting a Dataframe

Subsetting means filtering a DataFrame down to the rows you are interested in. Below is an example of creating a subset of df, assuming it has a 'Last Update' column, that keeps only the observations last updated on 2020-06-13 03:33:14.


updated_data = df[df['Last Update'] == '2020-06-13 03:33:14']
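The same pattern works on any column. Here is a minimal sketch on the car table from earlier, where the threshold of 430 is an arbitrary value chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

# The comparison produces a Series of True/False values, one per row
mask = df["speed"] > 430

# Indexing with that mask keeps only the rows where it is True
fast = df[mask]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
fast_not_ferrari = df[(df["speed"] > 430) & (df["car"] != "Ferrari")]

print(fast)
print(fast_not_ferrari)
```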

What is Assigning?

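Assigning lets you set or overwrite the data in a DataFrame. Assigning a single value to an existing column, as below, sets every row of that column to that value.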


df.car="Lamborghini"
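A few more common forms of assignment are sketched below. The seats column and its values are invented for the example. New columns have to be created with bracket notation, because dot notation only works for columns that already exist.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

df["seats"] = [4, 2, 2]       # new column from a list, one value per row (made-up data)
df.loc["a", "speed"] = 400    # overwrite a single cell
df["car"] = "Lamborghini"     # a single value is repeated for every row

print(df)
```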

So far, you have learned how to read and write a CSV file, how to inspect basic information about your data, and how to select data from a DataFrame. Next, we will look at some functions that summarize the data itself.

Summary Functions:

We met info() earlier, and it is a summary function too. The describe() function complements it with descriptive statistics for each column.

By default, the describe() method only returns a summary of numerical columns.


df.describe()

If we want to get a summary of categorical columns separately, we can use the parameter 'include'.


df.describe(include="object")

For a summary that includes both categorical and numerical columns, use include="all":


df.describe(include="all")

Let's see what each row of the describe() output means:

💠 count - the count of non-null entries in the particular column.

💠 unique - the count of unique values in a column. Only for categorical columns.

💠 top - This tells us which category occurs the maximum number of times. Only for categorical columns.

💠 freq - This tells you the number of occurrences of that column's top category. Only for categorical columns.

💠 mean - the mean value of the numerical column.

💠 std - the standard deviation of the numerical column, which tells you how spread out the values are.

💠 min - the minimum value in the numerical column.

💠 25% - the 25th percentile (or 1st quartile) value in the numerical column.

💠 50% - the 50th percentile (or 2nd quartile or the median) value in the numerical column.

💠 75% - the 75th percentile (or 3rd quartile) value in the numerical column.

💠 max - the maximum value in the numerical column.

💠 NaN values mean that a particular summary value is unavailable for a particular column.
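To see these rows in practice, here is a small sketch on the car table from earlier. The expected values in the comments follow directly from the three speeds 420, 530, and 450.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
})

summary = df.describe()                  # numerical columns only
print(summary.loc["count", "speed"])     # 3 non-null entries
print(summary.loc["50%", "speed"])       # the median, 450
print(summary.loc["max", "speed"])       # 530

all_summary = df.describe(include="all") # numerical and categorical columns together
print(all_summary.loc["unique", "car"])  # 3 distinct car names
print(all_summary.loc["unique", "speed"])  # NaN: 'unique' is not computed for numbers
```

Since describe() returns an ordinary DataFrame, individual statistics can be picked out with loc as shown above.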

Aggregation Functions:

The describe() function is very useful, but we can also call the individual methods directly. Some of them are:


df.mean(numeric_only=True) # Mean of each numeric column
df.median(numeric_only=True) # Median of each numeric column
df['column_name'].unique() # Returns the unique values in that column
df['column_name'].value_counts() # Returns how often each unique value occurs in the column
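A short sketch of these methods on made-up data follows. The table has a repeated car name so that unique() and value_counts() have something to show.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Ferrari", "Ferrari", "Maserati MC20"],
    "speed": [420, 450, 460, 530],
})

mean_speed = df["speed"].mean()      # (420 + 450 + 460 + 530) / 4 = 465.0
median_speed = df["speed"].median()  # middle of 450 and 460 = 455.0
unique_cars = df["car"].unique()     # distinct names, in order of first appearance
counts = df["car"].value_counts()    # how many times each name occurs

print(mean_speed, median_speed)
print(unique_cars)
print(counts)
```

Calling mean() on a single column avoids the question of what to do with text columns; on a whole DataFrame, numeric_only=True tells pandas to skip them.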

Lastly, we are going to see how we can sort a DataFrame and rename its columns.

Sorting Function:

The sort_values() method returns the sorted result. By default it sorts in ascending order; pass ascending=False for descending order.


df.sort_values(by="Confirmed", ascending=False)

To get the top N values, try this:


df['Confirmed'].sort_values(ascending=False)[0:N]
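The examples above assume a dataset with a 'Confirmed' column. The sketch below applies the same ideas to the car table and also shows nlargest(), which is a shortcut for the sort-then-slice pattern.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

# sort_values returns a new, sorted DataFrame; df itself is left unchanged
by_speed = df.sort_values(by="speed", ascending=False)

# Top 2 speeds: sort in descending order, then keep the first 2 entries
top2 = df["speed"].sort_values(ascending=False).head(2)

# nlargest gives the same result in a single call
top2_alt = df["speed"].nlargest(2)

print(by_speed)
print(top2)
```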

Renaming Function:

We can rename the column names.


df.rename(columns={
    'car':'cars'
},inplace=True) # inplace makes changes in original dataframe
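If you would rather keep the original DataFrame untouched, leave out inplace and store the result instead. The sketch below shows this, and also that rename() can change row labels through its index parameter; the new label "first" is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mercedes", "Maserati MC20", "Ferrari"],
    "speed": [420, 530, 450],
}, index=["a", "b", "c"])

# Without inplace=True, rename returns a renamed copy
renamed = df.rename(columns={"car": "cars"})

# Row labels can be renamed the same way
relabelled = df.rename(index={"a": "first"})

print(list(renamed.columns))  # ['cars', 'speed']
print(list(df.columns))       # ['car', 'speed'], the original is unchanged
print(list(relabelled.index)) # ['first', 'b', 'c']
```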

Let’s Put It All Together:

Now you know what Pandas is and how its most useful functions help us analyze a dataset.

If you wish to check out more articles on the market’s most trending technologies like Big Data, Python, and Computer Vision, then you can refer here.

I will be releasing part 2 of the Pandas article series soon.

See you next time,
@TechAE
