Microsoft Business Intelligence (Data Tools)|Python - Extracting Domain Name From URLs Using Regular Expressions

Wednesday, February 26, 2020

Python - Extracting Domain Name From URLs Using Regular Expressions

As Python developers, we often have to do a lot of data cleansing on a file before moving on to other business operations.

For example, you may have a raw text file of web-scraped data and need to read specific values from it, such as website URLs, and then apply regular expression matching to pull out the domain names.
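As a rough sketch of that first step, the snippet below pulls host names out of a block of scraped text. The sample text, the `raw_text` and `HOST_PATTERN` names, and the pattern itself are assumptions made for illustration; real scraped data will usually need a stricter or looser pattern.

```python
import re

# Hypothetical sample of scraped text; in practice this would be read from a file.
raw_text = """
Visit https://www.example.info/about for details.
Mirror: http://m.docs.google.com/document/d/1
Contact us at www.ouruniversity.department.mit.ac.us today.
"""

# Optional http/https scheme, then dot-separated labels ending in an alphabetic label.
# Only the host part is captured; the scheme and any path are left out.
HOST_PATTERN = re.compile(r'(?:https?://)?((?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})')

hosts = HOST_PATTERN.findall(raw_text)
print(hosts)
# ['www.example.info', 'm.docs.google.com', 'www.ouruniversity.department.mit.ac.us']
```

The resulting list has the same shape as the list of URLs used in the rest of this post.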

Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be present.

The hard part is knowing whether the name sits at the second level, the third level, or deeper.
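To see why this is tricky, here is a minimal sketch of the obvious shortcut: split on dots and take the second-to-last label. The `naive_domain` helper is hypothetical and exists only to show where the shortcut breaks.

```python
def naive_domain(host):
    """Assume the name is always the second-to-last dot-separated label."""
    labels = host.split('.')
    return labels[-2]

print(naive_domain('m.google.com'))    # google
print(naive_domain('www.mall.co.uk'))  # co  (wrong: the name we want is 'mall')
```

The shortcut works for single-part extensions such as .com but returns part of the extension for two-part ones such as .co.uk.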

What is a Regular Expression and which module is used in Python?

A regular expression is a sequence of characters, written in a specialized syntax, that defines a pattern used mainly to find and replace text in a string or file.
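A small illustration of finding and replacing with a pattern, using Python's standard `re.search` and `re.sub` functions. The sample string is made up for this example.

```python
import re

text = 'Visit www.example.info or m.google.com'

# Find: re.search returns the first match of the pattern, or None.
match = re.search(r'www\.[a-z]+\.[a-z]+', text)
print(match.group())  # www.example.info

# Replace: re.sub substitutes every match of the pattern.
cleaned = re.sub(r'\bwww\.', '', text)
print(cleaned)  # Visit example.info or m.google.com
```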

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
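The following sketch shows one way to handle `re.error` when a pattern is malformed. The `try_compile` helper is a hypothetical name used only for this example.

```python
import re

def try_compile(pattern):
    """Compile a pattern, returning None instead of raising if it is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f'Invalid pattern {pattern!r}: {exc}')
        return None

ok = try_compile(r'([^.]+)\.com$')   # valid pattern, returns a compiled object
bad = try_compile(r'([^.]+')         # unbalanced parenthesis, returns None
```

Compiling patterns once up front like this also surfaces mistakes before the pattern is applied to a large file.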

# Python program to extract domain names from the list of website URLs by regular expression.

# Importing module required for regular expressions
import re

# List of website URLs
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']

# Print values in the list
print(domainlist)

Output -

['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']

Now that the website URLs are in the list, we want to extract only the domain name from each one. To do that, we apply a regular expression as follows:

# Loop over the list and extract the domain name from each URL.
# A regex would have to be enormous to catch every kind of domain extension,
# so this one lists only the two-part extensions co.uk and ac.us explicitly.
# It is quick and does not need an input file listing the extensions.
for url in domainlist:
    # findall returns the captured domain name(s); the list is empty when nothing matches
    res = re.findall(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))', url)
    print(url, "|", res[0] if res else None)

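The final output is -

m.google.com | google
m.docs.google.com | google
www.someisotericdomain.innersite.mall.co.uk | mall
www.ouruniversity.department.mit.ac.us | mit
www.somestrangeurl.shops.relevantdomain.net | relevantdomain
www.example.info | example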
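One limitation worth noting: the pattern above starts with a lookbehind for a dot, so it finds nothing for a bare host such as google.com, and it is not meant for full URLs with a scheme or path. The sketch below is one possible way to cover those cases, not the only one. It uses the standard `urllib.parse.urlsplit` function to isolate the host first; the `extract_domain` helper, the `TWO_PART_SUFFIXES` tuple, and the sample URLs are assumptions for illustration, and the extension list is deliberately incomplete.

```python
import re
from urllib.parse import urlsplit

# Two-part extensions this sketch knows about; illustrative only, not a complete list.
TWO_PART_SUFFIXES = ('co.uk', 'ac.us', 'com.au')

suffix_alt = '|'.join(re.escape(s) for s in TWO_PART_SUFFIXES)
# The name is preceded by a dot or by the start of the host,
# and followed by a known two-part extension or a single final label.
DOMAIN_PATTERN = re.compile(r'(?:^|\.)([^.]+)\.(?:' + suffix_alt + r'|[^.]+)$')

def extract_domain(url):
    """Return the domain name from a URL or bare host, or None if nothing matches."""
    # urlsplit only fills in the hostname when the input contains '//'.
    host = urlsplit(url if '//' in url else '//' + url).hostname or ''
    match = DOMAIN_PATTERN.search(host)
    return match.group(1) if match else None

for sample in ['google.com',
               'www.mall.co.uk',
               'https://www.example.info/path?q=1',
               'http://shop.example.com.au:8080/x']:
    print(sample, "|", extract_domain(sample))
```

For these four samples the function returns google, mall, example and example respectively. Any two-part extension missing from the tuple will still be misread, which is the same trade-off the original pattern makes.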

To learn more, please follow us -
http://www.sql-datatools.com
To learn more, please visit our YouTube channel at -
http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
https://twitter.com/macxima

Mukesh Singh

Works as a Technical Lead at Saviance Technologies on MSBI (SSRS, SSIS, SSAS and T-SQL with SQL Server 2005/2008 R2/2012 and SharePoint Server 2013, ERP business applications, Macola, ASP.NET, C# and web services). Apart from this, also works on SSRS integration with SharePoint Server 2013 and PowerShell.

