As Python developers, we often have to do a lot of data cleansing on a file before we can get on with other business operations.
For example, you may have a raw text file of web scraping data and need to pull out specific values, such as the domain names of website URLs, by regular expression matching.
Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there.
The hard part is knowing whether the name sits at the second level, the third level, or deeper.
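To see why this is tricky, here is a minimal sketch of the naive approach: split the host name on dots and take the second-to-last label. The helper name naive_domain is hypothetical and used only for illustration; it is not part of the program developed later in this post.

```python
# Hypothetical helper for illustration only.
def naive_domain(hostname):
    """Return the second-to-last dot-separated label of a host name."""
    labels = hostname.split(".")
    return labels[-2] if len(labels) >= 2 else hostname

print(naive_domain("m.google.com"))    # google
print(naive_domain("www.mall.co.uk"))  # co (wrong: the name we want is "mall")
```

The approach works for single-part extensions such as .com but picks the wrong label as soon as the extension has two parts.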
What is a Regular Expression and which module is used in Python?
A regular expression is a sequence of characters, written in a specialized syntax, that describes a pattern; it is mainly used to find and replace matching text in a string or file.
The Python module re provides full support for Perl-like regular expressions. It raises the exception re.error if an error occurs while compiling or using a regular expression.
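As a small illustration of both points, the sketch below compiles one valid pattern and one deliberately broken pattern. The variable name tld_pattern and the sample strings are assumptions made for this example only.

```python
import re

# A valid pattern: one or more non-dot characters at the end of the string.
tld_pattern = re.compile(r"[^.]+$")
print(tld_pattern.search("www.example.info").group())  # info

# An unbalanced parenthesis is not a valid pattern, so compiling it
# raises re.error.
try:
    re.compile(r"(unclosed")
except re.error as exc:
    print("Invalid pattern:", exc)
```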
# Python program to extract domain names from a list of website URLs by regular expression.
# Import the module required for regular expressions
import re
# List of website URLs
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']
# Print the values in the list
print(domainlist)
Output -
['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']
Now we have the website URLs in the list and we want to extract only the domain name from each one. So, we are going to apply a regular expression such as the following:
# Read the list with a for loop and get the domain from each URL.
# A regex would have to be enormous to catch every kind of domain;
# this one special-cases only the two-part extensions co.uk and ac.us.
# It is quick and doesn't need any input file listing domain extensions.
for url in domainlist:
    # findall returns the captured label that sits just before the extension
    res = re.findall(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))', url)
    print(url, "|", res[0])
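If the input contains full URLs rather than bare host names, one possible way to reuse the same pattern is sketched below. The function name extract_domain, the constant DOMAIN_RE, and the choice to return None when nothing matches are assumptions for this example; urlsplit comes from the standard library module urllib.parse.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in the loop above, compiled once for reuse.
DOMAIN_RE = re.compile(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))')

def extract_domain(url):
    """Return the domain name, or None if the pattern finds nothing."""
    # urlsplit only fills in hostname when a scheme or '//' is present,
    # so fall back to the raw string for bare host names.
    host = urlsplit(url).hostname or url
    # Prepend a dot so the lookbehind also matches hosts with no subdomain.
    match = DOMAIN_RE.search("." + host)
    return match.group(1) if match else None

print(extract_domain("https://www.example.info/index.html"))  # example
print(extract_domain("google.com"))                           # google
```

Prepending a dot is a design choice for this sketch: the original pattern requires a dot before the name, so a host such as google.com would otherwise produce no match.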
To learn more, please follow us -
http://www.sql-datatools.com
To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools
To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To Learn more, please visit our twitter account at -
https://twitter.com/macxima