Python - Weather Data Scraping
If you are working as a data scientist and have to build models for sales forecasting, weather data becomes a very important component: weather variables often act as root features for your machine learning models. By the end of this article, you should have a good understanding of how to scrape web pages and extract data.
How Does Web Scraping Work? — When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. Generally, our code downloads that page’s source code, just as a browser would. But instead of displaying the page visually, it filters through the page looking for HTML elements we’ve specified, and extracting whatever content we’ve instructed it to extract.
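The request-and-parse cycle described above can be sketched in a few lines. The HTML string below is a made-up stand-in for a downloaded page (an assumption for illustration, not the weather site used later in this article):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for a page our code would normally download with requests
html = "<html><body><h1>Forecast</h1><p class='temp'>21</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# filter through the page for the HTML elements we specified
print(soup.find("h1").text)                          # Forecast
print(soup.find("p", attrs={"class": "temp"}).text)  # 21
```

In a real run the `html` string would come from `requests.get(url).content`, exactly as in the full code later in this article.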
Downloading weather data — We now know enough to proceed with extracting information about the local weather. The first step is to find the page we want to scrape. Here, we are going to use MeteoGuru.uk to get day-level weather data. We will use some of the most common Python libraries and write the code in a simple manner so that you can easily understand each line. Below is the UI for London from MeteoGuru.uk.
- p a — finds all a tags inside of a p tag
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
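The selectors listed above can be tried with BeautifulSoup's select method. The small document below is made up purely for demonstration:

```python
from bs4 import BeautifulSoup

# a made-up document exercising the selectors from the list above
html = """
<html><body>
  <p class="outer-text" id="first"><a href="/a">one</a></p>
  <p><a href="/b">two</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("p a")))           # 2: all a tags inside p tags
print(len(soup.select("body p a")))      # 2: same, but scoped under body
print(len(soup.select("p.outer-text")))  # 1: p tags with class outer-text
print(len(soup.select("p#first")))       # 1: the p tag with id first
```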
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- requests — this critical library is needed to actually get the data from the web server onto your machine; it also offers useful extras such as sessions and connection pooling.
- Beautiful Soup 4 — This is the library we’ve used here, and it’s designed to make filtering data based on HTML tags straightforward.
# Python program to scrape a website
from bs4 import BeautifulSoup
# the requests library, for downloading web pages
import requests
# pandas is an open-source, BSD-licensed Python library providing high-performance,
# easy-to-use data structures and data analysis tools
import pandas as pd
# import the datetime library
from datetime import datetime
Global variables — these variables will be used throughout the Python code, as given below:
# define the base url
base_url = "https://%s.meteoguru.uk/"
# define the list of months for the archive weather data
lst = ["may-2019", "june-2019", "july-2019", "august-2019",
       "september-2019", "october-2019", "november-2019", "december-2019",
       "january-2020", "february-2020", "march-2020", "april-2020",
       "may-2020", "june-2020", "july-2020", "august-2020"]
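As a side note, a month list like this can be generated instead of typed by hand. The sketch below is an alternative approach, not part of the original code, using pandas date_range to build the same month slugs:

```python
import pandas as pd

# month starts from May 2019 through August 2020 (freq="MS" = month start)
months = pd.date_range("2019-05-01", "2020-08-01", freq="MS")
# format each timestamp as e.g. "may-2019", matching the URL slugs
lst = [m.strftime("%B-%Y").lower() for m in months]
print(lst[0], lst[-1])  # may-2019 august-2020
```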
Python function for web scraping — in this function, we pass the web URL and download/scrape the whole page. On this weather website, there are four div classes holding the weather data:
1. Finding all instances of a tag at once for beginning weekdays
2. Finding all instances of a tag at once for ending weekdays
3. Finding all instances of a tag at once for beginning weekend
4. Finding all instances of a tag at once for ending weekend
In this function, we use the html.parser for parsing the page and define the dataframe for the data. Below is the code:
# function to get weather data by url
def get_weather_data(url):
    # url = 'https://june-2020.meteoguru.uk/'
    page = requests.get(url)
    # parse the page with BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')
    # extract the region name from the page heading
    region = soup.find("h2", attrs={"class": "mb-0"}).text \
        .replace('Weather for ', '').replace(', Archive', '')
    # empty dataframe for the day-level weather data
    ndf = pd.DataFrame(columns=["region", "date", "day", "weather",
                                "max_temp", "min_temp", "wind", "humidity"])
    # use the find method, which returns a single BeautifulSoup object
    days = soup.find("div", attrs={"class": "grid-wraper clearfix width100"}) \
               .find("div", attrs={"class": "row"})
    # the four div classes that hold the daily weather boxes
    # (beginning/ending weekdays and weekend; "begining" is the site's own spelling)
    day_classes = [
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 begining weekdays nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 ending weekend nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 begining weekend nextmonth",
        "pl-0 pr-0 grid-box mb-1 rounded with-border width1_7 ending weekdays nextmonth",
    ]
    for day_class in day_classes:
        # find all instances of the tag at once
        for day in days.findAll("div", attrs={"class": day_class}):
            # date and day name, e.g. "1 June, Monday"
            date_name = day.find("div", attrs={
                "class": "pl-1 pr-1 rounded background-grey-1 width100"
            }).text.replace('\t', '').replace('\n', '').split(',')
            date = date_name[0]
            dayn = date_name[1]
            # maximum temperature, stripped down to the bare number
            max_temp = day.find("p", attrs={
                "class": "pt-2 mb-1 center-text big-text-1 text-center"
            }).text.replace('\xa0', '').replace('\t', '') \
                .replace('\n', '').replace('+', '').replace('°C', '')
            # minimum temperature
            min_temp = day.find("p", attrs={
                "class": "pt-1 mb-0 pb-0 center-text"
            }).text.replace('\xa0', '').replace('\t', '') \
                .replace('\n', '').replace('+', '').replace('°C', '')
            # weather description, wind and humidity share one span
            temp = day.find("span", attrs={
                "class": "mb-2 pt-0 mt-0 text-center width100 fz-08"
            }).text.replace('\xa0', '').split(':')
            weather = temp[0]
            wind = temp[1].split(',')[0].replace('mph', '')
            humidity = temp[3].replace('%', '')
            # append the row to the dataframe
            # (note: DataFrame.append was removed in pandas 2.0;
            # on newer versions use pd.concat instead)
            ndf = ndf.append({"region": region, "date": date, "day": dayn,
                              "weather": weather, "max_temp": max_temp,
                              "min_temp": min_temp, "wind": wind,
                              "humidity": humidity}, ignore_index=True)
    # return the day-level weather dataframe
    return ndf
Extracting all the information from the page — Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once from the web page. We also have to generate a web URL to call this function, as given below:
if __name__ == "__main__":
    # new dataframe to collect the scraped data
    df = pd.DataFrame()
    # for loop in case you have multiple months
    for ymon in lst:
        print(ymon)
        url = base_url % ymon
        print(url)
        df = df.append(get_weather_data(url), ignore_index=True)
After running the code, we get the following output:
may-2019
https://may-2019.meteoguru.uk/
Call the function for multiple locations — if you want to run the same code for multiple locations, you have to create a new list containing those locations, as given below:
# weather location lists
loc = ["England", "Wales", "london"]
# base url for multiple locations
burl = 'https://%s.meteoguru.uk/%s/'
Now, you can see that we have also changed the base URL, which now takes two parameters: the first %s is used for the weather month and the second %s is used for the location, as given below:
# weather location lists
loc = ["England", "Wales", "london"]
burl = 'https://%s.meteoguru.uk/%s/'
df = pd.DataFrame()
for l in loc:
    for ymon in lst:
        url = burl % (ymon, l)
        print(url)
        df = df.append(get_weather_data(url), ignore_index=True)
df.to_csv("weather_Uk.csv", index=False)
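Finally, note that the scraped columns are plain strings. A post-processing sketch is shown below (the sample values are made up; only the column names follow the dataframe built above), converting the numeric columns with pandas to_numeric:

```python
import pandas as pd

# made-up sample rows standing in for the scraped weather_Uk.csv data
df = pd.DataFrame({
    "max_temp": ["21", "18"],
    "min_temp": ["12", "9"],
    "wind": ["10", "7"],
    "humidity": ["68", "75"],
})

# convert the string columns to numbers; errors="coerce" turns
# anything unparseable into NaN instead of raising
for col in ["max_temp", "min_temp", "wind", "humidity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

This makes the dataframe ready for aggregation or for joining onto a sales dataset as model features.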