Python - Weather Data Scraping
If you are working as a data scientist and have to build models for sales forecasting, weather data becomes a very important component: weather variables often act as root features for your machine learning models. By the end of this article, you should have a good understanding of how to scrape web pages and extract data.
How Does Web Scraping Work? — When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. Generally, our code downloads that page’s source code, just as a browser would. But instead of displaying the page visually, it filters through the page looking for HTML elements we’ve specified, and extracting whatever content we’ve instructed it to extract.
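The request-and-parse cycle described above can be sketched in a few lines. The HTML string below is a made-up stand-in for a downloaded page (an assumption for illustration, not the weather site used later in this article):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for a page our code would normally download with requests
html = "<html><body><h1>Forecast</h1><p class='temp'>21</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# filter through the page for the HTML elements we specified
print(soup.find("h1").text)                          # Forecast
print(soup.find("p", attrs={"class": "temp"}).text)  # 21
```

In a real run the `html` string would come from `requests.get(url).content`, exactly as in the full code later in this article.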
Downloading weather data — We now know enough to proceed with extracting information about the local weather. The first step is to find the page we want to scrape. Here, we are going to use MeteoGuru.uk to get day-level weather data. We will use some of the most common Python libraries and write the code in a simple manner so that you can easily understand each line. Below is the UI for London from MeteoGuru.uk.
- p a — finds all a tags inside of a p tag
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
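The selectors listed above can be tried with BeautifulSoup's select method. The small document below is made up purely for demonstration:

```python
from bs4 import BeautifulSoup

# a made-up document exercising the selectors from the list above
html = """
<html><body>
  <p class="outer-text" id="first"><a href="/a">one</a></p>
  <p><a href="/b">two</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("p a")))           # 2: all a tags inside p tags
print(len(soup.select("body p a")))      # 2: same, but scoped under body
print(len(soup.select("p.outer-text")))  # 1: p tags with class outer-text
print(len(soup.select("p#first")))       # 1: the p tag with id first
```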
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- requests — this critical library is needed to actually get the data from the web server onto your machine; it also offers useful extras such as sessions and connection pooling.
- Beautiful Soup 4 — This is the library we’ve used here, and it’s designed to make filtering data based on HTML tags straightforward.
# Python program to scrape a website
from bs4 import BeautifulSoup
# the requests library, for downloading web pages
import requests
# pandas is an open-source, BSD-licensed Python library providing high-performance,
# easy-to-use data structures and data analysis tools
import pandas as pd
# import the datetime library
from datetime import datetime
Global variables — these variables will be used throughout the Python code, as given below:
# define the base url
base_url = "https://%s.meteoguru.uk/"
# define the list of months for the archive weather data
lst = ["may-2019", "june-2019", "july-2019", "august-2019",
       "september-2019", "october-2019", "november-2019", "december-2019",
       "january-2020", "february-2020", "march-2020", "april-2020",
       "may-2020", "june-2020", "july-2020", "august-2020"]
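As a side note, a month list like this can be generated instead of typed by hand. The sketch below is an alternative approach, not part of the original code, using pandas date_range to build the same month slugs:

```python
import pandas as pd

# month starts from May 2019 through August 2020 (freq="MS" = month start)
months = pd.date_range("2019-05-01", "2020-08-01", freq="MS")
# format each timestamp as e.g. "may-2019", matching the URL slugs
lst = [m.strftime("%B-%Y").lower() for m in months]
print(lst[0], lst[-1])  # may-2019 august-2020
```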
Python function for web scraping — in this function, we pass the web URL and download/scrape the whole page. On this weather website, there are four div classes holding the weather data:
1. Finding all instances of a tag at once for beginning weekdays
2. Finding all instances of a tag at once for ending weekdays
3. Finding all instances of a tag at once for beginning weekend
4. Finding all instances of a tag at once for ending weekend
In this function, we use the html.parser for parsing the page and define the dataframe for the data. Below is the code:
# function to get weather data by url
def get_weather_data(url):
    # url = 'https://june-2020.meteoguru.uk/'
    page = requests.get(url)
    # parse the page with BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')
    # extract the region name from the page heading
    region = soup.find("h2", attrs={"class": "mb-0"}).text \
        .replace('Weather for ', '').replace(', Archive', '')
    # empty dataframe for the day-level weather data
    ndf = pd.DataFrame(columns=["region", "date", "day", "weather",
                                "max_temp", "min_temp", "wind", "humidity"])
    # use the find method, which returns a single BeautifulSoup object
    days = soup.find("div", attrs={"class": "grid-wraper clearfix width100"}) \
               .find("div", attrs={"class": "row"})
    # the four div classes that hold the daily weather boxes
    # (beginning/ending weekdays and weekend; "begining" is the site's own spelling)
    day_classes = [
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 begining weekdays nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 ending weekend nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 begining weekend nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 ending weekdays nextmonth",
    ]
    for day_class in day_classes:
        # find all instances of the tag at once
        for day in days.findAll("div", attrs={"class": day_class}):
            # date and day name, e.g. "1 June, Monday"
            date_name = day.find("div", attrs={
                "class": "pl-1 pr-1 rounded background-grey-1 width100"
            }).text.replace('\t', '').replace('\n', '').split(',')
            date = date_name[0]
            dayn = date_name[1]
            # maximum temperature, stripped down to the bare number
            max_temp = day.find("p", attrs={
                "class": "pt-2 mb-1 center-text big-text-1 text-center"
            }).text.replace('\xa0', '').replace('\t', '') \
                .replace('\n', '').replace('+', '').replace('°C', '')
            # minimum temperature
            min_temp = day.find("p", attrs={
                "class": "pt-1 mb-0 pb-0 center-text"
            }).text.replace('\xa0', '').replace('\t', '') \
                .replace('\n', '').replace('+', '').replace('°C', '')
            # weather description, wind and humidity share one span
            temp = day.find("span", attrs={
                "class": "mb-2 pt-0 mt-0 text-center width100 fz-08"
            }).text.replace('\xa0', '').split(':')
            weather = temp[0]
            wind = temp[1].split(',')[0].replace('mph', '')
            humidity = temp[3].replace('%', '')
            # append the row to the dataframe
            # (note: DataFrame.append was removed in pandas 2.0;
            # on newer versions use pd.concat instead)
            ndf = ndf.append({"region": region, "date": date, "day": dayn,
                              "weather": weather, "max_temp": max_temp,
                              "min_temp": min_temp, "wind": wind,
                              "humidity": humidity}, ignore_index=True)
    # return the day-level weather dataframe
    return ndf
Extracting all the information from the page — Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once from the web page. We also have to generate a web URL to call this function, as given below:
if __name__ == "__main__":
    # new dataframe to collect the scraped data
    df = pd.DataFrame()
    # for loop in case you have multiple months
    for ymon in lst:
        print(ymon)
        url = base_url % ymon
        print(url)
        df = df.append(get_weather_data(url), ignore_index=True)
After running the code, we get the following output:
may-2019
https://may-2019.meteoguru.uk/
Call the function for multiple locations — if you want to run the same code for multiple locations, you have to create a new list containing those locations, as given below:
# weather location lists
loc = ["England", "Wales", "london"]
# base url for multiple locations
burl = 'https://%s.meteoguru.uk/%s/'
Now, you can see that we have also changed the base URL, which now takes two parameters: the first %s is used for the weather month and the second %s is used for the location, as given below:
# weather location lists
loc = ["England", "Wales", "london"]
burl = 'https://%s.meteoguru.uk/%s/'
df = pd.DataFrame()
for l in loc:
    for ymon in lst:
        url = burl % (ymon, l)
        print(url)
        df = df.append(get_weather_data(url), ignore_index=True)
df.to_csv("weather_Uk.csv", index=False)
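Finally, note that the scraped columns are plain strings. A post-processing sketch is shown below (the sample values are made up; only the column names follow the dataframe built above), converting the numeric columns with pandas to_numeric:

```python
import pandas as pd

# made-up sample rows standing in for the scraped weather_Uk.csv data
df = pd.DataFrame({
    "max_temp": ["21", "18"],
    "min_temp": ["12", "9"],
    "wind": ["10", "7"],
    "humidity": ["68", "75"],
})

# convert the string columns to numbers; errors="coerce" turns
# anything unparseable into NaN instead of raising
for col in ["max_temp", "min_temp", "wind", "humidity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

This makes the dataframe ready for aggregation or for joining onto a sales dataset as model features.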