Microsoft Business Intelligence (Data Tools)|February 2020

Wednesday, February 26, 2020

Python - Extracting Domain Name From URLs Using Regular Expressions

As Python developers, we often have to do a lot of data cleansing on a file before we can move on to other business operations.

For example, you may have a raw text file of web-scraped data and need to read specific values from it, such as website URLs, and then apply regular expression matching to pull out the domain names.

Extracting the domain name accurately can be quite tricky, mainly because the domain extension can have two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be present.

The hard part is knowing whether the name sits at the second level, the third level, or deeper.

What is a Regular Expression and which module is used in Python?

A regular expression is a sequence of characters, written in a specialized syntax, that defines a pattern. It is mainly used to find and replace matching text in a string or file.

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
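As a minimal sketch of the re.error behaviour described above, the snippet below wraps re.compile in a small helper. The helper name compile_or_none and the two sample patterns are illustrative assumptions, not part of the original program.

```python
import re


def compile_or_none(pattern):
    """Return a compiled pattern, or None if the pattern is invalid.

    compile_or_none is a hypothetical helper used only for illustration.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        # re.error carries a message describing what is wrong with the pattern
        print(f"Invalid pattern {pattern!r}: {exc}")
        return None


# A well-formed pattern compiles normally
valid = compile_or_none(r"[^.]+\.com$")

# An unbalanced parenthesis makes re.compile raise re.error
invalid = compile_or_none(r"(unclosed")
```

Catching re.error this way is mostly useful when the pattern comes from user input or a configuration file rather than from the source code itself.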

# Python program to extract domain names from a list of website URLs using a regular expression.

# Import the module required for regular expressions
import re

# List of website URLs
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']

# Print the values in the list
print(domainlist)

Output -

['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']

Now we have the website URLs in the list and we want to extract only the domain name from each one. To do that, we apply a regular expression such as the following:

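# Loop over the list and extract the domain name from each URL.
# The pattern captures the label that sits just before the extension.
# Two-part extensions must be listed explicitly (here only co.uk and ac.us);
# the regex would have to be enormous in order to catch all kinds of domains.
# It's quick and doesn't need any input file listing the extensions.
for url in domainlist:
    # findall returns the captured domain label(s) as a list
    res = re.findall(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))', url)
    print(url, "|", res[0])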

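The final output is -

m.google.com | google
m.docs.google.com | google
www.someisotericdomain.innersite.mall.co.uk | mall
www.ouruniversity.department.mit.ac.us | mit
www.somestrangeurl.shops.relevantdomain.net | relevantdomain
www.example.info | example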
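The loop above relies on every host having a prefix before the domain name, and on findall always finding a match; for an input such as google.com the result list is empty and res[0] raises an IndexError. A more defensive variant might look like the sketch below. The names TWO_PART_SUFFIXES, build_pattern and extract_domain are hypothetical, the suffix list is deliberately incomplete, and the pattern is a swapped-in alternative that anchors at the end of the host instead of using a lookbehind.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete list of two-part extensions.
TWO_PART_SUFFIXES = ("co.uk", "ac.us", "com.au")


def build_pattern(suffixes):
    """Build a regex that captures the label just before the extension."""
    listed = "|".join(re.escape(s) for s in suffixes)
    # Try the listed two-part extensions first, then any single-label extension.
    return re.compile(rf"(?:^|\.)([^.]+)\.(?:{listed}|[^.]+)$")


DOMAIN_RE = build_pattern(TWO_PART_SUFFIXES)


def extract_domain(url):
    """Return the domain label of a URL or bare host name, or None."""
    # urlsplit only fills in hostname when the input contains '//'.
    host = urlsplit(url if "//" in url else "//" + url).hostname
    if not host:
        return None
    match = DOMAIN_RE.search(host)
    return match.group(1) if match else None


for url in ["m.docs.google.com", "google.com",
            "https://www.example.com.au/page", "localhost"]:
    print(url, "|", extract_domain(url))
```

Returning None instead of indexing into the result keeps the loop running when a line in the raw data is not a recognizable host. Any hand-written suffix list will miss real extensions, so a production script would more likely rely on a maintained source such as the Public Suffix List.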

To learn more, please follow us -
http://www.sql-datatools.com
To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools
To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To Learn more, please visit our twitter account at -
https://twitter.com/macxima

Mukesh Singh

Monday, February 24, 2020

R - Programming Objects and Data Structures

R is a powerful open-source programming language, also known as an advanced statistical language. It is cross-platform compatible, so it can be installed on Windows, macOS and Linux, and it is used extensively by software programmers, statisticians, data scientists, and data miners.

Objects

R supports object-oriented programming; in fact, everything in R is an object. An object is simply a data structure that has attributes and methods which act on those attributes. There are many types of R objects.

Data Structures

Many of us are familiar with languages such as C and Java. In any programming language, you need variables to store different kinds of data. A variable reserves a memory location to hold a value, so in those languages you have to declare a data type for each variable. Data structures are the way of arranging that data so it can be used efficiently on a computer.

Please keep in mind that R variables are not declared with a data type. Instead, a variable is assigned an R object, and the data type of that R object becomes the data type of the variable. The most commonly used data structures are vectors, matrices, arrays, lists, data frames and factors.

The vector is the most basic data structure in R. Vectors come in two flavours, atomic vectors and lists, which share three common properties:

Type, typeof(): what it is.
Length, length(): how many elements it contains.
Attributes, attributes(): additional arbitrary metadata.

Atomic vectors have four common types: numeric (double), integer, character and logical.

A matrix is a two-dimensional rectangular data set, which can be created by passing a vector to the matrix() function. In other words, a matrix is a collection of values of the same type arranged into a fixed number of rows and columns. Matrices have many applications:

Matrices are used in geological surveys. Information represented as matrices can be used for plotting graphs, performing statistical operations, etc.
They can represent real-world data, such as the traits of a population, and are a convenient representation for plotting common survey results.
In robotics and automation, matrices are the basic elements used to describe robot movements.

Arrays are multi-dimensional data structures in R that store data as a stack of matrices, each with rows and columns. A programmer can use the matrix level, row index, and column index to access an element.

Please keep in mind that arrays in R are data objects which can store data in more than two dimensions.

Lists are objects that can contain elements of different types, such as strings, numbers, vectors and even another list. A list can also contain a matrix or a function as one of its elements. In other words, a list is a generic vector containing other objects. A list is created using the list() function.

A data frame is a table, or two-dimensional array-like structure. Unlike a matrix or an array, the columns of a data frame can hold different types of data. That is, one column might be a numeric variable, another might be a factor, and a third might be a character variable. All columns have to be of the same length.

Features of a Data Frame:

The column names should be non-empty

The row names should be unique

The data stored in a data frame can be of numeric, factor or character type

Each column should contain the same number of data items

Factors are special vectors that represent categorical data and can be ordered or unordered. The format for creating a factor is:

x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))

Functions are themselves objects in R which can be stored in the project’s workspace. This provides a simple and convenient way to extend R.

Mukesh Singh

Sunday, February 23, 2020

What is R-Programming and Why learning R is important?

R is a powerful open-source programming language, also known as an advanced statistical language, which was developed in 1993 by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand. R is cross-platform compatible, so it can be installed on Windows, macOS and Linux. Thanks to its rich feature set, it is used extensively by software programmers, statisticians, data scientists, and data miners.

It is mostly used for statistical computing and graphics, and it is supported by the R Foundation for Statistical Computing. It is one of the most popular tools in data analytics and business analytics.

Why learning R is important?

There are many important features that make this language worth learning, as listed below -

It is a free and open-source programming language released under the GNU General Public License (GPL).
It is easy to create R packages for solving particular problems.
It has cross-platform interoperability, which means that it has distributions running on Windows, Linux, and Mac. R code can easily be ported from one platform to another.
It uses an interpreter instead of a compiler, which makes the development of code easier.
It connects effectively to different databases, and it does well at importing data from Microsoft Excel, as well as Microsoft Access, MySQL, SQLite, Oracle, etc.
It is a flexible language that bridges the gap between software development and data analysis.
It provides a wide variety of packages, with functions and features tailored for data analysis, statistical modeling, visualization, machine learning, and importing and manipulating data.
It integrates various powerful tools to communicate reports in different forms like CSV, XML, HTML, and PDF, and also through interactive websites, with the help of R packages.

Comparison with SAS, SPSS, and Stata

R is comparable to popular commercial statistical packages such as SAS, SPSS, and Stata, but R is available to users at no charge under a free software license.
In January 2009, the New York Times ran an article charting the growth of R, the reasons for its popularity among data scientists and the threat it poses to commercial statistical packages such as SAS.
In June 2017 data scientist Robert Muenchen published a more in-depth comparison between R and other software packages, "The Popularity of Data Science Software".
R is more procedural-code oriented than either SAS or SPSS, both of which make heavy use of pre-programmed procedures (called "procs") that are built-in to the language environment and customized by parameters of each call. R generally processes data in-memory, which limits its usefulness in processing extremely large files.

Commercial support for R

Although R is an open-source project supported by the community developing it, some companies strive to provide commercial support and/or extensions for their customers. This section gives some examples of such companies.
In 2007, Richard Schultz, Martin Schultz, Steve Weston and Kirk Mettler founded Revolution Analytics to provide commercial support for Revolution R, their distribution of R, which also includes components developed by the company.
Major additional components include: ParallelR, the R Productivity Environment IDE, RevoScaleR (for big data analysis), RevoDeployR, web services framework, and the ability for reading and writing data in the SAS file format.
In October 2011, Oracle announced the Big Data Appliance, which integrates R, Apache Hadoop, Oracle Linux, and a NoSQL database with Exadata hardware.
IBM offers support for in-Hadoop execution of R, and provides a programming model for massively parallel in-database analytics in R.
Tibco offers a runtime version of R as a part of Spotfire.
Mango Solutions offers a validation package for R, ValidR, to make it compliant with drug approval agencies such as the FDA.

Source: Wikipedia

Mukesh Singh
