BeautifulSoup doesn't reach a child element - Python

I wrote the following code trying to scrape a google scholar page
import requests as req
from bs4 import BeautifulSoup as soup
url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'
session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 = html2bs.find('div', {'id':"gs_cit1"})
but gs_citd gives me only this line <div aria-live="assertive" id="gs_citd"></div> and doesn't reach any level beneath it. Also, gs_cit1 returns None.
As shown in this screenshot, I want to reach the highlighted class so I can grab the BibTeX citation.
Can you help, please?

Ok, so I figured it out. I used the selenium module for Python, which creates a virtual browser, if you will, that lets you perform actions like clicking links and getting the resulting HTML output.
There was another issue I ran into while solving this: the page had to finish loading, otherwise the pop-up div just contained "Loading...", so I used the Python time module with time.sleep(2) to wait 2 seconds, which allowed the content to load in. Then I parsed the resulting HTML output with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from that anchor, and requested it with the "requests" Python module. Finally, I wrote the decoded response to a local file - scholar.bib.
I installed chromedriver and selenium on my Mac using these instructions here:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then I signed my Python file to avoid firewall issues, using these instructions:
Add Python to OS X Firewall Options?
The following is the code I used to produce the output file "scholar.bib":
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")
# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()
print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
# Get Page source after waiting for 2 seconds of current page in Chrome
source = driver.page_source
# We are done with the driver so quit.
driver.quit()
# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')
# Find anchors with the class "gs_citi"
gs_citt = soupify.find('a',{"class":"gs_citi"})
# Get the href attribute of the first anchor found
href = gs_citt['href']
print("Fetching: ", href)
# Instantiate a new requests session
session = req.Session()
# Get the response object of href
content = session.get(href)
# Get the content and then decode() it.
bibtex_html = content.content.decode()
# Write the decoded data to a file named scholar.bib
with open("scholar.bib","w") as file:
file.writelines(bibtex_html)
Hope this helps anyone looking for a solution to this.
Scholar.bib file:
@article{arrow2013sustainability,
title={Sustainability and the measurement of wealth: further reflections},
author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
journal={Environment and Development Economics},
volume={18},
number={4},
pages={504--516},
year={2013},
publisher={Cambridge University Press}
}

You can get BibTeX data using beautifulsoup and requests by parsing the data-cid attribute, which is a unique publication ID. You then temporarily store those IDs in a list, iterate over them, and make a request for every ID to parse that publication's BibTeX citation (a small download sketch follows the code below).
The example below will work for ~10-20 requests, then Google will throw a CAPTCHA or you'll hit the rate limit. The ideal solution is to use a CAPTCHA solving service as well as proxies.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "samsung",
    "hl": "en"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    "server": "scholar",
    "referer": f"https://scholar.google.com/scholar?hl={params['hl']}&q={params['q']}",
}

def cite_ids() -> list:
    response = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    # returns a list of publication IDs -> U8bh6Ca9uwQJ
    return [result["data-cid"] for result in soup.select(".gs_or")]

def scrape_cite_results() -> list:
    bibtex_data = []
    for cite_id in cite_ids():
        response = requests.get(f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com", headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
        # selects the first matched element, which will always be BibTeX
        # unless Google changes the BibTeX position.
        bibtex_data.append(soup.select_one(".gs_citi")["href"])
    # returns a list of BibTeX URLs, for example: https://scholar.googleusercontent.com/scholar.bib?q=info:ifd-RAVUVasJ:scholar.google.com/&output=citation&scisdr=CgVDYtsfELLGwov-iJo:AAGBfm0AAAAAYgD4kJr6XdMvDPuv7R8SGODak6AxcJxi&scisig=AAGBfm0AAAAAYgD4kHUUPiUnYgcIY1Vo56muYZpFkG5m&scisf=4&ct=citation&cd=-1&hl=en
    return bibtex_data
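The functions above only collect the BibTeX URLs; here's a minimal hedged follow-up that downloads each one and saves it to a .bib file (the file naming scheme is my own choice, not part of the original code):
if __name__ == "__main__":
    for index, bibtex_url in enumerate(scrape_cite_results()):
        bib_response = requests.get(bibtex_url, headers=headers, timeout=10)
        # write each BibTeX entry to its own numbered file, e.g. citation_0.bib
        with open(f"citation_{index}.bib", "w") as f:
            f.write(bib_response.text)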
Alternatively, you can achieve the same thing using the Google Scholar API from SerpApi, without needing to figure out which proxy provider offers good proxies, set up a CAPTCHA solving service, or work out how to scrape JavaScript-rendered data without browser automation.
Example to integrate:
import os
from serpapi import GoogleSearch

def organic_results() -> list:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "samsung",  # search query
        "hl": "en",      # language
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    return [result["result_id"] for result in results["organic_results"]]

def cite_results() -> list:
    citation_results = []
    for citation in organic_results():
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_cite",
            "q": citation
        }
        search = GoogleSearch(params)
        results = search.get_dict()
        for result in results["links"]:
            if "BibTeX" in result["name"]:
                citation_results.append(result["link"])
    return citation_results
If you would like to parse the data from all available pages, there's a dedicated blog post, Scrape historic Google Scholar results using Python, at SerpApi, which covers scraping historic 2017-2021 organic and cite results to CSV and SQLite using pagination.
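A rough pagination sketch, under the assumption that the google_scholar engine accepts a start offset of 10 results per page (check the blog post or API docs for the exact pagination fields):
import os
from serpapi import GoogleSearch

def all_result_ids(query: str) -> list:
    # Hedged sketch: step through result pages by incrementing the "start" offset
    # until a page comes back with no organic results.
    ids, start = [], 0
    while True:
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar",
            "q": query,
            "hl": "en",
            "start": start,
        }
        results = GoogleSearch(params).get_dict()
        page = results.get("organic_results", [])
        if not page:
            break
        ids.extend(result["result_id"] for result in page)
        start += 10
    return ids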
Disclaimer, I work for SerpApi.

Related

Scraping a Dynamic Website using Selenium or Beautiful Soup

I am trying to web scrape this dynamic website to get the course names and lecture time offered during a semester: https://www.utsc.utoronto.ca/registrar/timetable
The problem is when you first enter the website there are no courses displayed yet, only after selecting the "Session(s)" and clicking "Search for Courses" will the courses start to show up.
Here is where the problems start:
I cannot do
html = urlopen(url).read()
using urllib.request, as it will only display the HTML of the page when there is nothing.
I did a quick search on how to web-scrape a dynamic website and ran across code like this:
import requests
url = 'https://www.utsc.utoronto.ca/registrar/timetable'
r= requests.get(url)
data = r.json()
print(data)
However, when I run this it returns "JSONDecodeError: Expecting value", and I have no idea why this occurs when it has worked on other dynamic websites.
I do not really have to use Selenium or Beautiful Soup so if there are better alternatives I will gladly try it. Also I was wondering when:
html = urlopen(url).read()
what is the format of the html that is returned? I want to know if I can just copy the changed HTML from inspecting the website after selecting the Session(s) and clicking search.
You can use this code to get the data you need:
import requests
url = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"
# for winter session
payload = "coursecode=&sessions%5B%5D=20219&instructor=&courseTitle="
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
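Since that endpoint responds with JSON, a small hedged follow-up is to parse the body with response.json() and inspect its structure before picking out course names and lecture times (the exact field names aren't shown in this answer, so don't assume them):
import json

data = response.json()
# preview the first part of the payload to see where course names / times live
print(json.dumps(data, indent=2)[:2000])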
from selenium import webdriver
from bs4 import BeautifulSoup

# PATH should point to your chromedriver binary
def render_page(url):
    # render the page using the Chrome driver and return all the HTML on that webpage
    driver = webdriver.Chrome(PATH)
    driver.get(url)
    r = driver.page_source
    driver.quit()
    return r

def create_soup(html_text):
    soup = BeautifulSoup(html_text, 'lxml')
    return soup
You will need to use selenium for this if the content is loaded dynamically. Create a BeautifulSoup object from the value returned by render_page() and see if you can extract the data there, as sketched below.
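For example, a hedged usage of the two helpers above (PATH is an assumption and should point to your chromedriver binary):
# Hypothetical usage of render_page() and create_soup() from the answer above.
PATH = "/usr/local/bin/chromedriver"  # assumption: adjust to your chromedriver location
html_text = render_page("https://www.utsc.utoronto.ca/registrar/timetable")
soup = create_soup(html_text)
print(soup.title.text if soup.title else "no <title> found")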

Scrape urls from search results in Python and BeautifulSoup

I was trying to scrape some URLs from the search results, and I tried setting cookies as well as a user-agent such as Mozilla/5.0 and so on. I still cannot get any URLs from the search results. Is there any way I can get this working?
from bs4 import BeautifulSoup
import requests
monitored_tickers = ['GME', 'TSLA', 'BTC']
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tool for what you're trying to accomplish. As balderman said in another comment, using google search api will be easier.
This code:
from googlesearch import search
tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    ticker_links = search(ticker, stop=25)
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker and append that list to another list. Yahoo Finance is sure to be in that list of links, and a simple keyword-based filter will get the Yahoo Finance URL for that specific ticker (see the sketch below). You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance', for example.
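For instance, a hedged filter over the collected links (reusing tickers and links_list from the snippet above) that keeps only Yahoo Finance URLs:
# Keep only Yahoo Finance URLs from each ticker's result list.
yahoo_urls = {
    ticker: [url for url in urls if "finance.yahoo.com" in url]
    for ticker, urls in zip(tickers, links_list)
}
print(yahoo_urls)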
Google News can be easily scraped with requests and beautifulsoup. It is enough to pass a user-agent header to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code and example that scrape more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using the Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with a visual representation.
Disclaimer, I work for SerpApi.

Python Request not getting all data

I'm trying to scrape data from Google Translate for educational purposes.
Here is the code
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full
class Phonetizer:
    def __init__(self, sentence: str, language_: str = 'en'):
        self.words = sentence.split()
        self.language = language_

    def get_phoname(self):
        for word in self.words:
            print(word)
            url = "https://translate.google.com/#view=home&op=translate&sl=" + self.language + "&tl=" + self.language + "&text=" + word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f = open("debug.html", "w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage, 'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
The problem is that in the HTML it gives me, there is no tlid-transliteration-content transliteration-content full CSS class.
But using inspect, I have found that the phonemes are inside this CSS class; here is a snapshot:
I have saved the HTML, and here it is; take a look, no tlid-transliteration-content transliteration-content full is present, and it is not like other Google Translate pages - it is not complete. I have heard that Google blocks crawlers, bots and spiders, and that they can be easily detected by its system, so I added the additional header, but I still can't access the whole page.
How can I do so? How can I access the whole page and read all the data from the Google Translate page?
Want to contribute to this project?
I have tried the code below:
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url="https://translate.google.com/#view=home&op=translate&sl="+lang+"&tl="+lang+"&text="+word
async def get_url():
    r = await asession.get(url)
    print(r)
    return r

results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))
So far, it gives me nothing.
Yes, this happens because some JavaScript-generated content is rendered by the browser on page load; what you see in the inspector is the final DOM, after all kinds of manipulation by JavaScript (adding content). To solve this you could use selenium, but it has multiple downsides like speed and memory issues. A more modern and better way, in my opinion, is to use requests-html, which replaces both bs4 and urllib and has a render method, as mentioned in the documentation.
Here is sample code using requests_html. Just keep in mind that what you are trying to print is not UTF-8, so you might run into some issues printing it in some editors like Sublime; it ran fine using cmd.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
r.html.render()
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
First of all, I would suggest you use the Google Translate API instead of scraping the Google page. The API is a hundred times easier, hassle-free, and the legal and conventional way of doing this (a rough sketch follows below).
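As a rough illustration of the API route (a hedged sketch assuming google-cloud-translate is installed and credentials are configured; it returns translations, which may or may not cover the transliteration you need):
# pip install google-cloud-translate, with GOOGLE_APPLICATION_CREDENTIALS set.
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("hello", target_language="fr")
print(result["translatedText"])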
However, if you want to fix this, here is the solution.
You are not dealing with bot detection here. Google's bot detection is so strong that it would just open the Google reCAPTCHA page and not even show your desired web page.
The problem here is that the translation results are not returned via the URL you have used. This URL just displays the basic translator page; the results are fetched later by JavaScript and shown on the page after it has loaded. The JavaScript is not processed by python-requests, and this is why the class doesn't even exist in the web page you are accessing.
The solution is to trace the packets and detect which URL is being used by JavaScript to fetch the results. Fortunately, I have found the desired URL for this purpose.
If you request https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning, you will get the response of translator as JSON. You can parse the JSON to get the desired results.
Here, too, you can face bot detection, which can straight away throw an HTTP 403 error.
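A minimal hedged sketch of that approach, using a simplified parameter set (client=gtx, dt=t) rather than the full URL above; the JSON layout is undocumented, so the indices are an assumption:
import requests

params = {"client": "gtx", "sl": "en", "tl": "fr", "dt": "t", "q": "good morning"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
data = requests.get("https://translate.google.com/translate_a/single",
                    params=params, headers=headers, timeout=10).json()
# the translated text typically sits at data[0][0][0]
print(data[0][0][0])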
You can also use selenium to process the JavaScript and give you results. The following changes to your code can fix it using selenium:
from selenium import webdriver
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self, sentence: str, language_: str = 'en'):
        self.words = sentence.split()
        self.language = language_

    def get_phoname(self):
        for word in self.words:
            print(word)
            url = "https://translate.google.com/#view=home&op=translate&sl=" + self.language + "&tl=" + self.language + "&text=" + word
            print(url)
            #req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            #webpage = urlopen(req).read()
            driver = webdriver.Chrome()
            driver.get(url)
            webpage = driver.page_source  # page_source is already a str, so no decode() is needed
            driver.close()
            f = open("debug.html", "w+")
            f.write(webpage)
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage, 'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
You should scrape this page with JavaScript support, since the content you're looking for is "hiding" inside a <script> tag, which urllib does not render.
I would suggest using Selenium or an equivalent framework.
Take a look here: Web-scraping JavaScript page with Python

Google search gives redirect url, not real url python

So basically what I mean is, when I search https://www.google.com/search?q=turtles, the first result's href attribute is a google.com/url redirect. Now, I wouldn't mind this if I was just browsing the internet with my browser, but I am trying to get search results in python. So for this code:
import requests
from bs4 import BeautifulSoup
def get_web_search(query):
    query = query.replace(' ', '+')  # Replace with %20 also works
    response = requests.get('https://www.google.com/search', params={"q": query})
    r_data = response.content
    soup = BeautifulSoup(r_data, 'html.parser')
    result_raw = []
    results = []
    for result in soup.find_all('h3', class_='r', limit=1):
        result_raw.append(result)
    for result in result_raw:
        results.append({
            'url': result.find('a').get('href'),
            'text': result.find('a').get_text()
        })
    print(results)
get_web_search("turtles")
I would expect
[{
url : "https://en.wikipedia.org/wiki/Turtle",
text : "Turtle - Wikipedia"
}]
But what I get instead is
[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN', 'text': 'Turtle - Wikipedia'}
Is there something I am missing here? Do I need to provide a different header or some other request parameter? Any help is appreciated. Thank you.
NOTE: I saw other posts about this but I am a beginner so I couldn't understand those as they were not in python
Just follow the link's redirect, and it will go to the right page. Assume your link is in the url variable.
import urllib2
url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "https://www.google.com" + url
response = urllib2.urlopen(url) # 'https://www.google.com/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN'
response.geturl() # 'https://en.wikipedia.org/wiki/Turtle'
This works, since you are getting back Google's redirect to the URL, which is what you really click every time you search. This code basically just follows the redirect until it arrives at the real URL.
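For Python 3, a roughly equivalent hedged sketch using urllib.request instead of urllib2:
from urllib.request import urlopen

redirect = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
response = urlopen("https://www.google.com" + redirect)
print(response.geturl())  # final URL after following the redirect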
Use this package that provides google search
https://pypi.python.org/pypi/google
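A short hedged example with that package (it installs as "google" on PyPI and is imported as googlesearch, matching the other answer here):
from googlesearch import search

# print the top results for a query; filter for the domain you want
for url in search("turtles", stop=10):
    print(url)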
You can do the same using selenium in combination with Python and BeautifulSoup. It will give you the first result no matter whether the webpage is JavaScript-enabled or a regular one:
from selenium import webdriver
from bs4 import BeautifulSoup
def get_data(search_input):
    search_input = search_input.replace(" ", "+")
    driver.get("https://www.google.com/search?q=" + search_input)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for result in soup.select('h3.r'):
        item = result.select("a")[0].text
        link = result.select("a")[0]['href']
        print("item_text: {}\nitem_link: {}".format(item, link))
        break

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_data("turtles")
    finally:
        driver.quit()
Output:
item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle
You can use CSS selectors to grab those links.
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

# iterates over the organic results container
for result in soup.select('.tF2Cxc'):
    # extracts the URL from the "result" container
    url = result.select_one('.yuRUbf a')['href']
    print(url)
------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi.
It's a paid API with a free trial of 5,000 searches, and the main difference here is that all you have to do is navigate the structured JSON rather than figure out why stuff doesn't work.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "turtle",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    print(result['link'])
--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''
Disclaimer, I work for SerpApi.

BeautifulSoup does not return all elements on page

I'm new to Web scraping and just started using BeautifulSoup. Here's my question.
When you look up a word on Google with a search query like "define:lucid", in most cases a panel showing the meaning and the pronunciation appears on the front page (shown on the left side of the embedded image).
[Google default dictionary example]
The things I want to scrape and collect automatically are the text of the meaning and the URL where the mp3 data of the pronunciation is stored. Using the Chrome Inspector manually, these are easy to find in its "Elements" section; e.g., the Inspector (shown on the right side of the image) shows the URL that stores the mp3 data of the pronunciation of "lucid" (here).
However, when I use requests to get the HTML content of the search result and parse it with BeautifulSoup, as in the code below, the soup gets only a few of the contents in the panel, such as the IPA "/ˈluːsɪd/" and the attribute "adjective" (see the result below), and none of the contents I need can be found, such as anything in the audio elements.
How can I get the information with BeautifulSoup if possible, otherwise what alternative tools are suitable for this task?
P.S. I think the quality of pronunciation from Google dictionary is better than ones from any other dictionary sites. So I want to stick to it.
Code:
import requests
from bs4 import BeautifulSoup
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Part of soup content:
</span>
<span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
/ˈluːsɪd/
</span>
</div>
</h3>
<table style="font-size:14px;width:100%">
<tr>
<td>
<div style="color:#666;padding:5px 0">
adjective
</div>
The basic request you run is not returning the parts of the page rendered via JavaScript. If you right-click in Chrome and select View Page Source the audio link is not there. Solution: you could render the page via selenium. With the below code I get the <audio> tag including the link.
You'll have to pip install selenium, download ChromeDriver and add the folder containing it to PATH like export PATH=$PATH:~/downloads/
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
def render_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    #driver.quit()
    return r
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = render_page(goog_search)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
I checked it. You're right, in the BeautifulSoup output there are no audio elements for some reason. However, having inspected the code, I found the source of the audio file Google is using, which is http://ssl.gstatic.com/dictionary/static/sounds/oxford/lucid--_gb_1.mp3 and which works perfectly if you substitute "lucid" with any other word.
So, if you need to scrape the audio file, you could just do the following:
url = 'http://ssl.gstatic.com/dictionary/static/sounds/oxford/'
audio = requests.get(url + 'lucid' + '--_gb_1.mp3', stream=True).content
with open('lucid' + '.mp3', 'wb') as f:
    f.write(audio)
As for the other elements, I'm afraid you'll just need to find the word "definition" in the soup and scrape the content of the tag that contains it (a rough sketch follows below).
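A rough sketch of that last suggestion, reusing soup from the question's code (the traversal is a guess, since Google's markup changes):
# Find a text node containing the word "definition" and print its enclosing tag's text.
node = soup.find(string=lambda text: text and "definition" in text.lower())
if node:
    print(node.parent.get_text(" ", strip=True))
else:
    print("definition text not found in this response")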
There's no need for selenium (which slows down scraping) as in M3RS's answer, since the data is in the HTML, not rendered via JavaScript. Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
You're looking for this (CSS selectors reference):
soup.select_one('audio source')['src']
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_gb_1.mp3
Make sure you're using a user-agent, because the default requests user-agent is python-requests; Google blocks such a request because it knows it's a bot and not a "real" user visit, and you'll receive a different HTML with some sort of error. A user-agent fakes a user visit by adding this information to the HTTP request headers.
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'lucid definition',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
phonetic = soup.select_one('.S23sjd .LTKOO span').text
audio_link = soup.select_one('audio source')['src']
print(phonetic)
print(audio_link)
# ˈluːsɪd
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_us_1.mp3
Alternatively, you can achieve the same thing by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to grab the data you want quickly, instead of coding everything from scratch, figuring out why certain things don't work as they should, and then maintaining it over time if something in the HTML layout changes.
At the moment, SerpApi doesn't extract the audio link. This will change in the future. Please check the playground to see whether the audio link is present.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "lucid definition",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
phonetic = results['answer_box']['syllables']
print(phonetic)
# lu·cid
Disclaimer, I work for SerpApi.
