Python web scraping weather forecast

I'm new to Python (actually, it's the second time I'm trying to learn the language, so I know a little something) and I'm trying to build a script that scrapes the weather forecast.
Now I have a little problem with finding the right HTML classes to pull into Python. I have this code now:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns None.
I found the class the code searches for via Chrome's Inspect feature. But if I dump the HTML page through Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
Then I see the HTML page in cmd (as expected), but I'm also unable to find class_="weather-widget__city-name" there, so I'm not surprised that Python can't either. My question is: why is the HTML that Python gives me different from the HTML Chrome shows on the site? And am I doing something wrong in trying to find the weather widget through BeautifulSoup this way?
Here is a picture from the page; the part I'm trying to scrape is circled in red.
[Screenshot from website]
Thanks in advance!

That site is loaded with JS.
Python's requests library doesn't execute those scripts. One of them is responsible for loading the data you are looking for (you can tell it's JS, maybe with a bit of jQuery, I didn't actually check, by the spinning circle while it's loading).
My suggestion here is to use the site's API.
I am not subscribed to the site, so I can't show a real example here, but the trick is simple: you subscribe to the site's API with the basic (and free) plan, get an API key, and start sending GET requests to the API URLs.
This will also simplify things further, since you won't need BeautifulSoup for the parsing: the responses are all in JSON.
There is another, nastier way around it, and that is using Selenium. This module simulates a web browser, including all of its JS execution, HTML rendering and CSS loading.
I have experience with both and I strongly recommend sticking to the API (if that option exists).
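For illustration, here is a minimal sketch of what the API approach could look like. It assumes OpenWeatherMap's standard current-weather endpoint, and YOUR_API_KEY is a placeholder for the key you get after signing up, so treat it as a template rather than tested code:
# Minimal sketch of the API approach (assumes the standard OpenWeatherMap
# current-weather endpoint; YOUR_API_KEY is a placeholder for your own key).
import requests

API_KEY = "YOUR_API_KEY"      # obtained after signing up for the free plan
CITY_ID = 2743477             # same city id as in the scraped URL

resp = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"id": CITY_ID, "units": "metric", "appid": API_KEY},
)
resp.raise_for_status()
data = resp.json()            # the response is plain JSON, no BeautifulSoup needed

print(data["name"], data["main"]["temp"], "°C")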

For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google and Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 part might be a general API key that is assigned to browser requests; I don't know for sure. But all we need to know is that it doesn't change: I've tried on two different systems and the value stays the same.
The 2743477 is, of course, the City ID. You can download a reference of cities and their IDs from the site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests

city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)

req_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'openweathermap.org',
    'Referer': 'https://openweathermap.org/city/2743477',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
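As a follow-up to the script above, and assuming the response follows the documented OpenWeatherMap schema, the individual values shown in the widget can be read straight from the parsed dictionary d, for example:
# Picking individual fields out of the parsed response `d` from the script
# above (assumes the documented OpenWeatherMap response schema).
print(d['name'])                         # city name
print(d['main']['temp'])                 # temperature in °C (units=metric)
print(d['main']['humidity'])             # humidity in %
print(d['main']['pressure'])             # pressure in hPa
print(d['wind']['speed'])                # wind speed in m/s
print(d['weather'][0]['description'])    # e.g. 'broken clouds'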

Use Selenium in combination with BeautifulSoup to get any of the tables from that page without hardship. Here is how you can do it:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']

Related

Scraping attribute values with Python lxml is giving empty results

I am trying to scrape a site whose link you will find in the code below.
The goal is to get the data from within the attributes, since there is no text when inspecting the element.
Here is the full XPath of an element:
/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]
and the code:
import requests
from lxml import html
page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/')
tree = html.fromstring(page.content)
Trying to scrape the value of the 'data-wa-data' attribute with:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/@data-wa-data')
is yielding empty values,
and the same issue occurs for another element that has text:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/div/a/div[1]/text()')
The problem is that this website requires a User-Agent header to serve the complete HTML, which is absent in your request. So, to get the complete page, pass a user-agent as a header.
Note: This website is quite aggressive when it comes to blocking; you cannot even make two consecutive requests with the same user-agent. Thus, my advice would be to rotate proxies and user-agents, and also add a download delay between requests to avoid hitting the server too rapidly.
Code
import requests
from lxml import html

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'
}

page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/', headers=headers)
tree = html.fromstring(page.content)
print(tree.xpath('//div[@class="listing-item search-listing-result__item"]/@data-wa-data'))
Output:
['listing_id=1971029217|realtor_id=21407|source=listings_results', 'listing_id=1971046117|realtor_id=74051|source=listings_results', 'listing_id=1971051280|realtor_id=71648|source=listings_results', 'listing_id=1971053639|realtor_id=21407|source=listings_results', 'listing_id=1971053645|realtor_id=38087|source=listings_results', 'listing_id=1971053650|realtor_id=29634|source=listings_results', 'listing_id=1971053651|realtor_id=29634|source=listings_results', 'listing_id=1971053652|realtor_id=29634|source=listings_results', 'listing_id=1971053656|realtor_id=39588|source=listings_results', 'listing_id=1971053658|realtor_id=39588|source=listings_results', 'listing_id=1971053661|realtor_id=39588|source=listings_results', 'listing_id=1971053662|realtor_id=39588|source=listings_results']
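As a rough sketch of the rotation and delay advice above (the user-agent pool and the delay range are just illustrative choices; a proxy pool would be plugged in the same way):
import random
import time

import requests
from lxml import html

# A small pool of example user-agent strings to rotate through (illustrative only).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
]

urls = ['https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/']

for url in urls:
    headers = {'user-agent': random.choice(USER_AGENTS)}   # rotate user-agent per request
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    print(tree.xpath('//div[@class="listing-item search-listing-result__item"]/@data-wa-data'))
    time.sleep(random.uniform(3, 8))                        # download delay between requests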

Search text on a website with BeautifulSoup in Python

I am trying to find a word on a website via BeautifulSoup, but I can't seem to get it. This is my code so far:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
s = session.get('https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431')
soup = BeautifulSoup(s.text, 'html.parser')
tags = soup.find_all(class_="dl-text dl-text-body dl-text-regular dl-text-s dl-text-color-inherit")
for i in tags:
    print(i.string)
See below for the picture of the specific HTML element. I am trying to search for and find "Keine Verfügbarkeiten".
Can anyone help me? The code I have used returns nothing.
[Screenshot: vaccine check]
Although the content you are looking for on that site is generated dynamically, it is still available in a script tag in the page source (Ctrl+U). The following is one way you can parse it using the requests module in combination with re and json.
import re
import json
import requests

url = "https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36',
}

res = requests.get(url, headers=headers)
script = re.search(r"window\.translation_keys[^{]+(.*?});", res.text).group(1)
items = json.loads(script)
print(items['root']['common']['availabilities']['no_availabilities_vaccination'])
Output:
Keine Verfügbarkeiten
The page you are retrieving generates its content with JavaScript, so your GET request won't find what you are looking for; instead it retrieves the raw page source ( view-source:https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431 ) without any processing.
What you can do instead is run Selenium WebDriver, which acts like an actual browser, executing the JavaScript and producing the page you see when opening the website in your own browser.
Then, when you open the page using Selenium, you can find the element you are looking for using the find_element_by_css_selector() method.
If you don't want to use Selenium, what you can try instead is to check where the webpage is getting its data from. With a quick look I can see that it is querying this link to get availability data. If you use this method, you can just make a GET request to that API link and parse the JSON response.
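For completeness, here is a rough sketch of the Selenium route described above, written against the older Selenium 3 API. The selector is copied from the question and the fixed sleep is a crude stand-in for a proper wait, so adjust both as needed:
# Rough sketch of the Selenium route (selector taken from the question; the
# page is dynamic, so some kind of wait is usually required).
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431')
time.sleep(5)  # crude wait for the JavaScript-rendered content

elements = driver.find_elements_by_css_selector(
    '.dl-text.dl-text-body.dl-text-regular.dl-text-s.dl-text-color-inherit')
for el in elements:
    if 'Keine Verfügbarkeiten' in el.text:
        print(el.text)

driver.quit()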
Very useful information. Makes sense! I will try using Selenium to find the element. If I get stuck I'll get back to you. Thanks!

Scraping table data based on date

I am trying to scrape the kurs transaction table from https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx
for 2015-2020, but the problem is that the URL stays the same between the default date and the date that I choose. So how can I tell Python to scrape data from 2015-2020 (20-Nov-15 to 20-Nov-20)? I'm very new to Python and using Python 3. Thank you in advance.
import requests
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx"

response = requests.get(url)
content = response.content
print(content)
A few different approaches:
- Use array slicing if you are working with 1-dimensional data.
- Use the filter / groupby methods from the pandas library after putting your data into a DataFrame (see the sketch after this list).
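A minimal sketch of the pandas filtering idea, assuming you have already scraped the rows into a DataFrame; the column names here are made up for illustration:
# Minimal sketch: filter scraped rows to the 2015-2020 window with pandas.
# The 'date' and 'kurs' columns are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    'date': ['2014-06-01', '2016-03-15', '2019-11-20', '2021-01-05'],
    'kurs': [11800, 13100, 14100, 14000],
})
df['date'] = pd.to_datetime(df['date'])

mask = (df['date'] >= '2015-11-20') & (df['date'] <= '2020-11-20')
print(df[mask])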
The website requires you to enter start and end dates for the query. However, as far as I know, bs4 only parses HTML that has already been retrieved, and is not useful for submitting the query on the website itself.
From the source code and the POST request, it looks like a complicated request, so you might be better off simulating mouse clicks.
This can be done with the Selenium browser automation package: open the Google Chrome browser, enter the dates into the From and To fields, click the Lihat button, wait for the page to load, then scrape the displayed table using bs4 or Selenium.
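A very rough sketch of that flow is below. The element ids are placeholders (you have to look up the real ones in the page's HTML), so this is a template for the approach rather than working code:
# Rough sketch of the Selenium flow described above. FROM_FIELD_ID,
# TO_FIELD_ID and LIHAT_BUTTON_ID are placeholders: replace them with the
# real element ids from the page before running.
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx")

driver.find_element_by_id("FROM_FIELD_ID").send_keys("20-Nov-15")   # placeholder id
driver.find_element_by_id("TO_FIELD_ID").send_keys("20-Nov-20")     # placeholder id
driver.find_element_by_id("LIHAT_BUTTON_ID").click()                # placeholder id
time.sleep(5)  # wait for the table to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for row in soup.select("table tr"):
    print([cell.get_text(strip=True) for cell in row.select("td")])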

Python Request not getting all data

I'm trying to scrape data from Google Translate for educational purposes.
Here is the code
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
# tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self, sentence: str, language_: str = 'en'):
        self.words = sentence.split()
        self.language = language_

    def get_phoname(self):
        for word in self.words:
            print(word)
            url = "https://translate.google.com/#view=home&op=translate&sl=" + self.language + "&tl=" + self.language + "&text=" + word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f = open("debug.html", "w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            # print(webpage)
            bsoup = BeautifulSoup(webpage, 'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            # break
The problem is that when it gives me the HTML, there is no tlid-transliteration-content transliteration-content full CSS class in it.
But using Inspect, I have found that the phonemes are inside this CSS class; here, take a look at a snapshot:
I have saved the HTML, and here it is; take a look, no tlid-transliteration-content transliteration-content full is present, and it is not like the usual Google Translate page, it is not complete. I have heard that Google blocks crawlers, bots and spiders, and that they can be easily detected by its system, so I added the additional header, but I still can't access the whole page.
How can I access the whole page and read all the data from the Google Translate page?
Want to contribute to this project?
I have tried this code below :
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url = "https://translate.google.com/#view=home&op=translate&sl=" + lang + "&tl=" + lang + "&text=" + word

async def get_url():
    r = await asession.get(url)
    print(r)
    return r

results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))
It gives me nothing so far.
Yes, this happens because some JavaScript-generated content is rendered by the browser on page load; what you see in the inspector is the final DOM, after all the manipulation done by JavaScript (adding content). To solve this you could use Selenium, but it has multiple downsides such as speed and memory issues. A more modern and better way, in my opinion, is to use requests-html, which replaces both bs4 and urllib and has a render method, as mentioned in its documentation.
Here is sample code using requests_html. Just keep in mind that what you are trying to print is not UTF-8, so you might run into issues printing it in some editors like Sublime; it ran fine using cmd.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
r.html.render()
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
First of all, I would suggest you use the Google Translate API instead of scraping the Google page. The API is a hundred times easier, hassle-free, and the legal and conventional way of doing this.
However, if you want to fix this, here is the solution.
You are not dealing with bot detection here. Google's bot detection is so strong that it would just open the Google reCAPTCHA page and not even show your desired web page.
The problem here is that the translation results are not returned by the URL you have used. That URL just displays the basic translator page; the results are fetched later by JavaScript and shown on the page after it has loaded. The JavaScript is not processed by python-requests, and this is why the class doesn't even exist in the web page you are accessing.
The solution is to trace the requests and detect which URL the JavaScript uses to fetch the results. Fortunately, I have found the desired URL for this purpose.
If you request https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning, you will get the translator's response as JSON, which you can parse to get the desired results.
Note that you can still run into bot detection here, which can straight away throw an HTTP 403 error.
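For example, here is a minimal sketch that fetches that URL and pretty-prints the parsed JSON so you can locate the transliteration yourself (as noted, this can trip bot detection and come back with HTTP 403):
# Minimal sketch: fetch the URL found above and pretty-print the parsed JSON.
# As noted, this request can trip Google's bot detection and return HTTP 403.
from pprint import pprint

import requests

url = ("https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en"
       "&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt"
       "&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning")

r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()
pprint(r.json())   # the transliteration sits somewhere inside this nested structure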
You can also use Selenium to process the JavaScript and get the results. The following changes to your code can fix it using Selenium:
from selenium import webdriver
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
# tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self, sentence: str, language_: str = 'en'):
        self.words = sentence.split()
        self.language = language_

    def get_phoname(self):
        for word in self.words:
            print(word)
            url = "https://translate.google.com/#view=home&op=translate&sl=" + self.language + "&tl=" + self.language + "&text=" + word
            print(url)
            # req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            # webpage = urlopen(req).read()
            driver = webdriver.Chrome()
            driver.get(url)
            webpage = driver.page_source  # already a str, so no decoding needed
            driver.close()
            f = open("debug.html", "w+")
            f.write(webpage)
            f.close()
            # print(webpage)
            bsoup = BeautifulSoup(webpage, 'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            # break
You should scrape this page with JavaScript support, since the content you're looking for is "hiding" inside a <script> tag, which urllib does not render.
I would suggest using Selenium or an equivalent framework.
Take a look here: Web-scraping JavaScript page with Python

BeautifulSoup does not return all elements on page

I'm new to Web scraping and just started using BeautifulSoup. Here's my question.
When you look up a word on Google using a search query like "define:lucid", in most cases a panel showing the meaning and the pronunciation appears on the results page (shown on the left side of the embedded image).
[Google default dictionary example]
The things I want to scrape and collect automatically are the text of the meaning and the URL where the mp3 data of the pronunciation is stored. Using the Chrome Inspector manually, these are easily found in its "Elements" section; e.g., the Inspector (shown on the right side of the image) reveals the URL that stores the mp3 data of the pronunciation of "lucid" (here).
However, when using requests to get the HTML content of the search result and parsing it with BeautifulSoup, like the code below, the soup gets only a few of the contents in the panel, such as the IPA "/ˈluːsɪd/" and the attribute "adjective" (see the result below), and none of the contents I need, such as the audio elements, can be found.
How can I get this information with BeautifulSoup, if possible; otherwise, what alternative tools are suitable for this task?
P.S. I think the quality of the pronunciation from the Google dictionary is better than the ones from other dictionary sites, so I want to stick with it.
Code:
import requests
from bs4 import BeautifulSoup
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Part of soup content:
</span>
<span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
/ˈluːsɪd/
</span>
</div>
</h3>
<table style="font-size:14px;width:100%">
<tr>
<td>
<div style="color:#666;padding:5px 0">
adjective
</div>
The basic request you run does not return the parts of the page rendered via JavaScript. If you right-click in Chrome and select View Page Source, the audio link is not there. Solution: you could render the page via Selenium. With the code below I get the <audio> tag including the link.
You'll have to pip install selenium, download ChromeDriver, and add the folder containing it to your PATH, e.g. export PATH=$PATH:~/downloads/
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver

def render_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    # driver.quit()
    return r

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = render_page(goog_search)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
I checked it. You're right: in the BeautifulSoup output there are no audio elements for some reason. However, having inspected the code, I found the source of the audio file Google is using, which is http://ssl.gstatic.com/dictionary/static/sounds/oxford/lucid--_gb_1.mp3 and which works perfectly if you substitute any other word for "lucid".
So, if you need to scrape the audio file, you could just do the following:
url = 'http://ssl.gstatic.com/dictionary/static/sounds/oxford/'
audio = requests.get(url + 'lucid' + '--_gb_1.mp3', stream=True).content
with open('lucid' + '.mp3', 'wb') as f:
    f.write(audio)
As for the other elements, I'm afraid you'll just need to find the word "definition" in the soup and scrape the content of the tag that contains it.
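A sketch of that idea with BeautifulSoup, reusing the soup object from the question; the markup around the word varies, so the parent lookup may need adjusting:
# Sketch of the "find the word and take its surrounding tag" idea. The markup
# around the definition varies, so the find_parent target may need adjusting.
node = soup.find(string=lambda text: text and 'definition' in text.lower())
if node is not None:
    container = node.find_parent('div')   # climb to the enclosing block
    print(container.get_text(" ", strip=True))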
There's no need for Selenium (which slows down scraping) as M3RS's answer uses, since the data is in the HTML, not rendered via JavaScript. Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
You're looking for this (CSS selectors reference):
soup.select_one('audio source')['src']
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_gb_1.mp3
Make sure you're using a user-agent, because the default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. A user-agent fakes a real user visit by adding this information to the HTTP request headers.
Code:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'lucid definition',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

phonetic = soup.select_one('.S23sjd .LTKOO span').text
audio_link = soup.select_one('audio source')['src']
print(phonetic)
print(audio_link)
# ˈluːsɪd
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_us_1.mp3
Alternatively, you can achieve the same thing by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to grab the data you want quickly, instead of coding everything from scratch, figuring out why certain things don't work as they should, and then maintaining it over time when something in the HTML layout changes.
At the moment, SerpApi doesn't extract the audio link; this will change in the future. Please check the playground to see whether the audio link is present.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "lucid definition",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

phonetic = results['answer_box']['syllables']
print(phonetic)

# lu·cid
Disclaimer: I work for SerpApi.
