I wrote this code to scrape email addresses from Google search results or websites, depending on the URL given. However, the output is always blank.
The only thing in the Excel sheet is the column name. I'm still new to Python, so I'm not sure why that's happening.
What am I missing here?
import requests
from bs4 import BeautifulSoup
import pandas as pd
url ="https://www.google.com/search?q=solicitor+bereavement+wales+%27email%27&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWelf5qGpc4uqy_C2cd583OKlSEcQ%3A1675616694195&ei=tuHfY83MC-aIrwSQ3qxY&ved=0ahUKEwjN_9jO7v78AhVmxIsKHRAvCwsQ4dUDCBA&uact=5&oq=solicitor+bereavement+wales+%27email%27&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBwgAEB4QogQyBwgAEB4QogQyBwgAEB4QogQ6CggAEEcQ1gQQsANKBAhBGABKBAhGGABQrAxY7xRg1xZoAXABeACAAdIBiAGmBpIBBTEuNC4xmAEAoAEByAEIwAEB&sclient=gws-wiz-serp"
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
email_addresses = []
for link in soup.find_all('a'):
    if 'mailto:' in link.get('href'):
        email_addresses.append(link.get('href').replace('mailto:', ''))
df = pd.DataFrame(email_addresses, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
First you need to extract all the snippets on the page:
for result in soup.select('.tF2Cxc'):
    snippet = result.select_one('.lEBKkf').text
After that, a regular expression extracts the email from each snippet (if one is present):
match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
email = ''.join(match_email)
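For a quick standalone illustration of what that pattern does (the snippet string below is made up):
import re

snippet = "For probate enquiries contact jane.doe@example-solicitors.co.uk today."
print(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet))
# ['jane.doe@example-solicitors.co.uk']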
Also, instead of making a request with a full URL, you can make a request with separate parameters (it's convenient if you need to change the query or other parameters):
params = {
    'q': 'intext:"gmail.com" solicitor bereavement wale',  # your query
    'hl': 'en',  # language
    'gl': 'us'   # country of the search, US -> USA
    # other parameters
}
Check full code in the online IDE.
import requests, re, json, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

params = {
    'q': 'intext:"gmail.com" solicitor bereavement wale',  # your query
    'hl': 'en',  # language
    'gl': 'us'   # country of the search, US -> USA
}

html = requests.get("https://www.google.com/search",
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')
data = []
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.find('a')['href']
    snippet = result.select_one('.lEBKkf').text

    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
    email = ''.join(match_email)

    data.append({
        'Title': title,
        'Link': link,
        'Email': email if email else None
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "Title": "Revealed: Billboard's 2022 Top Music Lawyers",
    "Link": "https://www.billboard.com/wp-content/uploads/2022/03/march-28-2022-billboard-bulletin.pdf",
    "Email": "cmellow.billboard@gmail.com"
  },
  {
    "Title": "Folakemi Jegede, LL.B, BL, LLM, ACIS.'s Post - LinkedIn",
    "Link": "https://www.linkedin.com/posts/folakemi-jegede-ll-b-bl-llm-acis-855a8a2a_lawyers-law-advocate-activity-6934498515867815936-9R6G?trk=posts_directory",
    "Email": "OurlawandI@gmail.com"
  },
  ... other results
]
Also you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
import os, json, re
params = {
    "engine": "google",                                    # search engine
    "q": 'intext:"gmail.com" solicitor bereavement wale',  # search query
    "api_key": "..."                                       # serpapi key from https://serpapi.com/manage-api-key
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

data = []

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']

    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
    email = '\n'.join(match_email)

    data.append({
        'title': title,
        'link': link,
        'email': email if email else None,
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
Output: exactly the same as in the previous solution.
It's not finding the HTML you want because the HTML is loaded dynamically with JavaScript. Thus, you need to execute the JavaScript to get all of the HTML.
The selenium module can be used to do this, but it requires a driver to interface with a given browser, so you'll need to install a browser driver in order to use the selenium module. The selenium documentation goes over the installation.
Once you have selenium setup, you can use this function to get all the html from the website. Pass its return value into the BeautifulSoup object.
from selenium import webdriver
from time import sleep
def get_page_source(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        sleep(3)  # give the JavaScript time to run before grabbing the page source
        return driver.page_source
    finally:
        driver.quit()
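For example, a minimal sketch of plugging it into the original mailto-scraping approach (the query URL here is just illustrative):
from bs4 import BeautifulSoup

html = get_page_source("https://www.google.com/search?q=solicitor+bereavement+wales+email")
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    if "mailto:" in link["href"]:
        print(link["href"].replace("mailto:", ""))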
Related
I was trying to scrape some URLs from the search results, and I tried including both a cookies setting and a user-agent of Mozilla/5.0 and so on. I still cannot get any URLs from the search results. Is there any solution to get this working?
from bs4 import BeautifulSoup
import requests
monitored_tickers = ['GME', 'TSLA', 'BTC']
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tool for what you're trying to accomplish. As balderman said in another comment, using google search api will be easier.
This code:
from googlesearch import search
tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    ticker_links = search(ticker, stop=25)
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker and append that list to another list. Yahoo Finance is sure to be in that list of links, and a simple keyword-based filter will get the Yahoo Finance URL for that specific ticker, as sketched below. You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance' for example.
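A minimal keyword filter along those lines might look like this (a sketch reusing the tickers and links_list built above; "finance.yahoo.com" is simply the keyword being matched):
yahoo_links = {}
for ticker, links in zip(tickers, links_list):
    # keep only the result URLs that point at Yahoo Finance
    yahoo_links[ticker] = [url for url in links if "finance.yahoo.com" in url]

print(yahoo_links["TSLA"])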
Google News can easily be scraped with requests and beautifulsoup. It's enough to pass a user-agent header to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code and example that scrape more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with a visual representation.
Disclaimer, I work for SerpApi.
I have working code that first prints the search titles and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like this and avoid printing the same URLs 10 times for each:
1) Title url
2) Title url
and so on...
My code:
search = input("Search:")
page = requests.get(f"https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
heading_object = soup.find_all('h3')
for info in heading_object:
    x = info.getText()
    print(x)

for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href:
        y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
        print(y)
If you get titles and links separately, then you can use zip() to group them in pairs:
for info, link in zip(heading_object, links):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
        print(info, link)
But this may cause a problem when some title or link doesn't exist on the page, because then it will create wrong pairs: it will pair a title with the link for the next element. You should rather search for elements which keep both title and link, and inside every such element search for the single title and single link to create a pair. If there is no title or link, then you can put some default value and it will not create wrong pairs.
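A rough sketch of that container-based idea with default values ("div.g" is only an illustrative result-container selector; the next answer shows a concrete variant):
for result in soup.select("div.g"):
    heading = result.find("h3")
    anchor = result.find("a")
    # fall back to placeholders so missing pieces don't shift the pairing
    title = heading.getText() if heading else "(no title)"
    link = anchor.get("href", "(no link)") if anchor else "(no link)"
    print(title, link)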
You're looking for this:
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')  # prints TITLE, URL followed by a new line.
If you're using f-string then the appropriate way is to use it like so:
page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}") # proper f-string
Code:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "python memes",
    "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
One of the differences is that you only need to iterate over JSON rather than figuring out how to scrape stuff.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Disclaimer, I work for SerpApi.
So basically what I mean is, when I search https://www.google.com/search?q=turtles, the first result's href attribute is a google.com/url redirect. Now, I wouldn't mind this if I was just browsing the internet with my browser, but I am trying to get search results in python. So for this code:
import requests
from bs4 import BeautifulSoup
def get_web_search(query):
    query = query.replace(' ', '+')  # Replace with %20 also works
    response = requests.get('https://www.google.com/search', params={"q": query})
    r_data = response.content
    soup = BeautifulSoup(r_data, 'html.parser')

    result_raw = []
    results = []

    for result in soup.find_all('h3', class_='r', limit=1):
        result_raw.append(result)

    for result in result_raw:
        results.append({
            'url': result.find('a').get('href'),
            'text': result.find('a').get_text()
        })

    print(results)

get_web_search("turtles")
I would expect
[{
    url: "https://en.wikipedia.org/wiki/Turtle",
    text: "Turtle - Wikipedia"
}]
But what I get instead is
[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN', 'text': 'Turtle - Wikipedia'}
Is there something I am missing here? Do I need to provide a different header or some other request parameter? Any help is appreciated. Thank you.
NOTE: I saw other posts about this, but I am a beginner, so I couldn't understand those as they were not in Python.
Just follow the link's redirect, and it will go to the right page. Assume your link is in the url variable.
import urllib2

url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "https://www.google.com" + url  # urlopen needs the scheme
response = urllib2.urlopen(url)  # 'https://www.google.com/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN'
response.geturl()  # 'https://en.wikipedia.org/wiki/Turtle'
This works because you are getting back Google's redirect to the URL, which is what you are really clicking every time you search. This code basically just follows the redirect until it arrives at the real URL.
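urllib2 is Python 2 only; a rough Python 3 sketch of the same idea with requests (which follows redirects by default) might look like this:
import requests

url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
response = requests.get("https://www.google.com" + url)
print(response.url)  # final URL after any redirects, e.g. the Wikipedia page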
Use this package that provides Google search:
https://pypi.python.org/pypi/google
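A minimal usage sketch, assuming the package installs the googlesearch module as in the earlier answer (parameters may vary by version):
from googlesearch import search

for url in search("turtles", stop=10):  # roughly the first 10 result URLs
    print(url)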
You can do the same using selenium in combination with Python and BeautifulSoup. It will give you the first result no matter whether the webpage is JavaScript-enabled or a plain one:
from selenium import webdriver
from bs4 import BeautifulSoup
def get_data(search_input):
    search_input = search_input.replace(" ", "+")
    driver.get("https://www.google.com/search?q=" + search_input)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for result in soup.select('h3.r'):
        item = result.select("a")[0].text
        link = result.select("a")[0]['href']
        print("item_text: {}\nitem_link: {}".format(item, link))
        break

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_data("turtles")
    finally:
        driver.quit()
Output:
item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle
You can use CSS selectors to grab those links.
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

# iterates over the organic results container
for result in soup.select('.tF2Cxc'):
    # extracts the url from the "result" container
    url = result.select_one('.yuRUbf a')['href']
    print(url)
------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''
Alternatively, you can do the same thing using Google Search Engine Results API from SerpApi.
It's a paid API with a free trial of 5,000 searches, and the main difference here is that all you have to do is navigate through structured JSON rather than figure out why stuff doesn't work.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "turtle",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['link'])
--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''
Disclaimer, I work for SerpApi.
I wrote the following code trying to scrape a Google Scholar page:
import requests as req
from bs4 import BeautifulSoup as soup
url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'
session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 = html2bs.find('div', {'id':"gs_cit1"})
but gs_citd gives me only this line <div aria-live="assertive" id="gs_citd"></div> and doesn't reach any level beneath it. Also, gs_cit1 returns None.
As shown in the image, I want to reach the highlighted class to be able to grab the BibTeX citation.
Can you help, please!
Ok, so I figured it out. I used the selenium module for Python, which creates a virtual browser, if you will, that allows you to perform actions like clicking links and getting the resulting HTML output. There was another issue I ran into while solving this: the page had to be loaded first, otherwise it just returned the content "Loading..." in the pop-up div, so I used the Python time module to time.sleep(2) for 2 seconds, which allowed the content to load in. Then I parsed the resulting HTML output using BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from the anchor, and requested it with the requests module. Finally, I wrote the decoded response to a local file - scholar.bib.
I installed chromedriver and selenium on my Mac using these instructions here:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then signed by python file to allow to stop firewall issues using these instructions:
Add Python to OS X Firewall Options?
The following is the code I used to produce the output file "scholar.bib":
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")
# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()
print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
# Get Page source after waiting for 2 seconds of current page in Chrome
source = driver.page_source
# We are done with the driver so quit.
driver.quit()
# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')
# Find anchors with the class "gs_citi"
gs_citt = soupify.find('a',{"class":"gs_citi"})
# Get the href attribute of the first anchor found
href = gs_citt['href']
print("Fetching: ", href)
# Instantiate a new requests session
session = req.Session()
# Get the response object of href
content = session.get(href)
# Get the content and then decode() it.
bibtex_html = content.content.decode()
# Write the decoded data to a file named scholar.bib
with open("scholar.bib", "w") as file:
    file.writelines(bibtex_html)
Hope this helps anyone looking for a solution to this.
Scholar.bib file:
@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
  journal={Environment and Development Economics},
  volume={18},
  number={4},
  pages={504--516},
  year={2013},
  publisher={Cambridge University Press}
}
You can parse BibTeX data using beautifulsoup and requests by parsing the data-cid attribute, which is a unique publication ID. Then you need to temporarily store those IDs in a list, iterate over them, and make a request for every ID to parse the BibTeX publication citation.
The example below will work for ~10-20 requests, then Google will throw a CAPTCHA or you'll hit the rate limit. The ideal solution is to have a CAPTCHA-solving service as well as proxies.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "samsung",
    "hl": "en"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    "server": "scholar",
    "referer": f"https://scholar.google.com/scholar?hl={params['hl']}&q={params['q']}",
}

def cite_ids() -> list:
    response = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")

    # returns a list of publication IDs -> U8bh6Ca9uwQJ
    return [result["data-cid"] for result in soup.select(".gs_or")]

def scrape_cite_results() -> list:
    bibtex_data = []

    for cite_id in cite_ids():
        response = requests.get(f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com", headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        # selects the first matched element, which in this case will always be BibTeX
        # if Google does not switch the BibTeX position.
        bibtex_data.append(soup.select_one(".gs_citi")["href"])

    # returns a list of BibTeX URLs, for example: https://scholar.googleusercontent.com/scholar.bib?q=info:ifd-RAVUVasJ:scholar.google.com/&output=citation&scisdr=CgVDYtsfELLGwov-iJo:AAGBfm0AAAAAYgD4kJr6XdMvDPuv7R8SGODak6AxcJxi&scisig=AAGBfm0AAAAAYgD4kHUUPiUnYgcIY1Vo56muYZpFkG5m&scisf=4&ct=citation&cd=-1&hl=en
    return bibtex_data
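A minimal sketch of calling these helpers and printing the collected BibTeX URLs (json is only used for pretty-printing):
import json

if __name__ == "__main__":
    print(json.dumps(scrape_cite_results(), indent=2))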
Alternatively, you can achieve the same thing using the Google Scholar API from SerpApi, without the need to figure out which proxy provider offers good proxies, which CAPTCHA-solving service to use, or how to scrape data rendered by JavaScript without browser automation.
Example to integrate:
import os
from serpapi import GoogleSearch
def organic_results() -> list:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "samsung",  # search query
        "hl": "en",      # language
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    return [result["result_id"] for result in results["organic_results"]]

def cite_results() -> list:
    citation_results = []

    for citation in organic_results():
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_cite",
            "q": citation
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        for result in results["links"]:
            if "BibTeX" in result["name"]:
                citation_results.append(result["link"])

    return citation_results
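And, as before, a small sketch of calling it:
if __name__ == "__main__":
    for bibtex_link in cite_results():
        print(bibtex_link)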
If you would like to parse the data from all available pages, there's a dedicated blog post, Scrape historic Google Scholar results using Python, at SerpApi, which is all about scraping historic 2017-2021 organic and cite results to CSV and SQLite using pagination.
Disclaimer, I work for SerpApi.
I would like to parse Google search results with Python. Everything worked perfectly, but now I keep getting an empty list. Here is the code that used to work fine:
query = urllib.urlencode({'q': self.Tagsinput.GetValue()+footprint,'ie': 'utf-8', 'num':searchresults, 'start': '100'})
result = url + query1
myopener = MyOpener()
page = myopener.open(result)
xss = page.read()
soup = BeautifulSoup.BeautifulSoup(xss)
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
This script worked perfectly in December, now it stopped working.
As far as I understand the problem is in this line:
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
when I print contents the program returns an empty list: []
Please, anybody, help.
The API works a whole lot better, too. Simple JSON which you can easily parse and manipulate.
import urllib, json
BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&'
url = BASE_URL + urllib.urlencode({'q' : SearchTerm.encode('utf-8')})
raw_res = urllib.urlopen(url).read()
results = json.loads(raw_res)
hit1 = results['responseData']['results'][0]
prettyresult = ' - '.join((urllib.unquote(hit1['url']), hit1['titleNoFormatting']))
At the time of writing this answer, you don't have to parse the <script> tag (for the most part) to get the output from Google Search. This can be achieved by using the beautifulsoup, requests, and lxml libraries.
Code to get the title and link, plus an example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get(f'https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

for container in soup.findAll('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Alternatively, you can do it as well by using Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),  # environment for API_KEY
    "engine": "google",
    "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Disclaimer, I work for SerpApi.