Microsoft Business Intelligence (Data Tools)|August 2021

Thursday, August 26, 2021

Python - Common Functions for Exploratory Data Analysis

In this tutorial, we will learn some common Python functions for Exploratory Data Analysis (EDA) as part of the data science process.

Python is one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything.

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
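As a small illustration of the array support described above, here is a minimal sketch. The array values and the variable names (a, col_means, total, roots) are made up for this example.

```python
import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns; the values are illustrative
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)               # dimensions of the array as (rows, columns)

col_means = a.mean(axis=0)   # mean of each column
total = a.sum()              # sum over all elements
roots = np.sqrt(a)           # element-wise square root

print(col_means)
print(total)
print(roots)
```

Operations such as mean(), sum() and np.sqrt() work on the whole array at once, so no explicit Python loop is needed.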

In data science, the goal in most cases is not only to explore the data but to analyze it in some way, often through a model.

A Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

• Dict of 1D ndarrays, lists, dicts, or Series

• 2-D numpy.ndarray

• Structured or record ndarray

• A Series

• Another DataFrame

Along with the data, we can optionally pass index (row labels) and columns (column labels) arguments. If we pass an index and/or columns, we guarantee the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.
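The effect of passing index and columns can be sketched as follows. The column names ("one", "two", "three") and the row labels are illustrative only and are not part of the tutorial's dataset.

```python
import pandas as pd

# Two Series with partly different row labels (illustrative data)
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# No index passed: the result uses the union of both indexes,
# and "one" has a missing value (NaN) for row "d"
df_all = pd.DataFrame(d)
print(df_all)

# Explicit index: only rows "d", "b", "a" are kept, so row "c" is discarded
df_sub = pd.DataFrame(d, index=["d", "b", "a"])
print(df_sub)

# Explicit columns: "one" is dropped, and "three" (not in the dict) is all NaN
df_cols = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df_cols)
```

The resulting DataFrame always has exactly the row and column labels that were passed, whether or not the input data contains them.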

1. pandas.DataFrame.describe() is a very informative function that generates descriptive statistics for a Pandas DataFrame or Series. It summarizes the central tendency and dispersion of the dataset (for numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum), which makes describe() a quick way to get an overview of the data.

2. Head and tail functions - If you want to view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

import pandas as pd
import numpy as np

# Create a Series of 400 random numbers
s = pd.Series(np.random.randn(400))

# The first two rows of the Series
print(s.head(2))

# The last two rows of the Series
print(s.tail(2))

# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

# Create a DataFrame
df = pd.DataFrame(d)

# The first two rows of the DataFrame
print(df.head(2))

# The last two rows of the DataFrame
print(df.tail(2))
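The describe() function from item 1 can be tried on the same sample data. This is a minimal sketch: the DataFrame is rebuilt here so the snippet runs on its own, and the names summary and full_summary are chosen only for this example.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack'],
    'Age': [25, 26, 25, 23, 30, 29, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8],
})

# By default, only numeric columns (Age, Rating) are summarized:
# count, mean, std, min, 25%, 50%, 75%, max
summary = df.describe()
print(summary)

# include='all' also summarizes non-numeric columns such as Name
# (count, unique, top, freq)
full_summary = df.describe(include='all')
print(full_summary)
```

Individual statistics can be read from the result with .loc, for example summary.loc['mean', 'Age'].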

As we know, Exploratory Data Analysis (EDA) is one of the most essential parts of the data science process.


Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.
