Wednesday, February 26, 2020

Python - Extracting Domain Name From URLs Using Regular Expressions

As Python developers, we often have to do a lot of data cleansing on a file before moving on to other business operations.
For example, you may have a raw text file of web-scraping data and need to read specific items from it, such as website URLs, and then use regular expression matching to pull out the domain names.
Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there.
The hard part is knowing whether the name sits at the second level, the third level, or deeper.
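
To make the difficulty concrete, here is a minimal sketch of a naive approach that always takes the second-to-last dot-separated label. The helper name naive_domain is hypothetical and used only for illustration; it works for single-part extensions and fails for two-part ones.

```python
# Naive approach (illustration only): assume the domain name is always
# the second-to-last label of the host.
def naive_domain(host):
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host

print(naive_domain("m.google.com"))    # google
print(naive_domain("www.mall.co.uk"))  # co  (wrong: the name is "mall")
```

The second call shows why the position of the name cannot be fixed in advance.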

What is a Regular Expression and which module is used in Python?
A regular expression is a sequence of characters, written in a specialized syntax, that defines a pattern used mainly to find and replace text in a string or file.
The Python module re provides full support for Perl-like regular expressions. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
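
As a small sketch of the re.error behaviour described above, the snippet below tries to compile a pattern and reports a failure instead of crashing. The helper name try_compile is an assumption made for this example and is not part of the re module.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print("Invalid pattern:", exc)
        return None

try_compile(r"([^.]+")  # unbalanced parenthesis: prints an error message
try_compile(r"[^.]+")   # valid pattern: returns a compiled object
```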

# Python program to extract domain names from a list of website URLs using a regular expression.

# Import the module required for regular expressions
import re

# List of website URLs
domainlist = ['m.google.com',
              'm.docs.google.com',
              'www.someisotericdomain.innersite.mall.co.uk',
              'www.ouruniversity.department.mit.ac.us',
              'www.somestrangeurl.shops.relevantdomain.net',
              'www.example.info']

# Print the values in the list
print(domainlist)
  
Output -
['m.google.com', 'm.docs.google.com', 'www.someisotericdomain.innersite.mall.co.uk', 'www.ouruniversity.department.mit.ac.us', 'www.somestrangeurl.shops.relevantdomain.net', 'www.example.info']

Now we have the website URLs in a list and want to extract only the domain name from each one. To do that, we apply a regular expression as follows:

# Loop over the list and extract the domain name from each URL.
# A regex would have to be enormous to catch all kinds of domains,
# so this one only special-cases the two-part extensions co.uk and ac.us.
# It's quick and doesn't need any input file with listings.
for url in domainlist:
    # findall returns the captured group (the domain name) for each match
    res = re.findall(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))', url)
    print(url, "|", res[0])


The final output is - 
m.google.com | google
m.docs.google.com | google
www.someisotericdomain.innersite.mall.co.uk | mall
www.ouruniversity.department.mit.ac.us | mit
www.somestrangeurl.shops.relevantdomain.net | relevantdomain
www.example.info | example
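
One caveat: the pattern above requires a dot before the domain name, so a host without a subdomain (for example google.com) produces an empty list and res[0] raises an IndexError. It also expects a bare host rather than a full URL with a scheme or path. The following is a rough sketch of one way to handle both cases. The names DOMAIN_RE and extract_domain are hypothetical, the pattern is a variation on the one above rather than the original, and the list of two-part extensions is still limited to co.uk and ac.us.

```python
import re
from urllib.parse import urlsplit

# Variation on the pattern above: the name may start at the beginning of
# the host or after a dot, so hosts without a subdomain also match.
DOMAIN_RE = re.compile(r'(?:^|\.)([^.]+)\.(?:co\.uk|ac\.us|[^.]+)$')

def extract_domain(url):
    # urlsplit only fills in the host when the input contains "//",
    # so prepend it for bare hosts such as "m.google.com".
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    match = DOMAIN_RE.search(host)
    # Return None instead of raising when nothing matches
    return match.group(1) if match else None

print(extract_domain("google.com"))                        # google
print(extract_domain("https://m.docs.google.com/a?b=1"))   # google
print(extract_domain("www.mall.co.uk"))                    # mall
print(extract_domain("localhost"))                         # None
```

Any two-part extension that is not listed in the pattern (such as .com.au) would still be misread, so the alternation would need to be extended for real data.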





