BeautifulSoup script parsing Google search results stopped working - python

I would like to parse Google search results with Python. Everything worked perfectly, but now I keep getting an empty list. Here is the code that used to work fine:
query = urllib.urlencode({'q': self.Tagsinput.GetValue() + footprint, 'ie': 'utf-8', 'num': searchresults, 'start': '100'})
result = url + query
myopener = MyOpener()
page = myopener.open(result)
xss = page.read()
soup = BeautifulSoup.BeautifulSoup(xss)
contents = [x['href'] for x in soup.findAll('a', attrs={'class': 'l'})]
This script worked perfectly in December; now it has stopped working.
As far as I understand, the problem is in this line:
contents = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
When I print contents, the program returns an empty list: []
Please, anybody, help.
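A quick sanity check for this kind of breakage is to print the classes that the anchors on the downloaded page actually carry, since Google renames result classes from time to time and 'l' may simply be gone. A sketch in the same Python 2 style as the code above:

for a in soup.findAll('a')[:20]:
    # show which classes Google currently puts on result anchors
    print a.get('class'), a.get('href')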

The API works a whole lot better, too: simple JSON which you can easily parse and manipulate.
import urllib, json
BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&'
url = BASE_URL + urllib.urlencode({'q' : SearchTerm.encode('utf-8')})
raw_res = urllib.urlopen(url).read()
results = json.loads(raw_res)
hit1 = results['responseData']['results'][0]
prettyresult = ' - '.join((urllib.unquote(hit1['url']), hit1['titleNoFormatting']))
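The snippet above is Python 2. A rough Python 3 translation of the same JSON flow looks like this (a sketch only: Google has since deprecated this AJAX endpoint, so don't expect it to answer today):

import json
from urllib.parse import urlencode, unquote
from urllib.request import urlopen

BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&'

search_term = 'turtles'
url = BASE_URL + urlencode({'q': search_term})

raw_res = urlopen(url).read()   # raw JSON bytes
results = json.loads(raw_res)   # parse into a dict

hit1 = results['responseData']['results'][0]
print(' - '.join((unquote(hit1['url']), hit1['titleNoFormatting'])))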

At the time of writing this answer, you don't have to parse the <script> tags (for the most part) to get the output from Google Search. This can be achieved by using the beautifulsoup, requests, and lxml libraries.
Code to get the title and link:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

for container in soup.findAll('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
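Class names like tF2Cxc and DKV0Md are machine-generated and rotate over time. A somewhat more durable sketch (assumption: organic results still render their title as an h3 inside the result link; needs bs4 4.7+ with soupsieve for :has()) anchors on structure instead of class names:

# assumes each organic result still wraps an <h3> title inside its <a> link
for a in soup.select('a:has(> h3)'):
    title = a.h3.text
    link = a.get('href')
    print(f'{title}\n{link}')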
Alternatively, you can achieve the same thing by using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),  # API key from the environment
    "engine": "google",
    "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Disclaimer, I work for SerpApi.

Related

Scrape urls from search results in Python and BeautifulSoup

I was trying to scrape some URLs from the search results. I tried setting cookies and a user-agent of Mozilla/5.0 and so on, but I still cannot get any URLs out of the search results. Is there any way I can get this working?
from bs4 import BeautifulSoup
import requests

monitored_tickers = ['GME', 'TSLA', 'BTC']

def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    atags = soup.find_all('a')
    hrefs = [link['href'] for link in atags]
    return hrefs

raw_urls = {ticker: search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls
You could be running into the issue that requests and bs4 may not be the best tools for what you're trying to accomplish. As balderman said in another comment, using a google search API will be easier.
This code:
from googlesearch import search

tickers = ['GME', 'TSLA', 'BTC']
links_list = []

for ticker in tickers:
    # search() yields results lazily, so materialize them into a list
    ticker_links = list(search(ticker, stop=25))
    links_list.append(ticker_links)
will make a list of the top 25 links on Google for each ticker and append that list to another list. Yahoo Finance is sure to be in that list of links, and a simple keyword-based parser will get the Yahoo Finance URL for that specific ticker. You could also adjust the search criteria in the search() function to whatever you wish, say ticker + ' yahoo finance' for example. A keyword filter over those link lists could then look like the sketch below.
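A minimal sketch of that keyword filter, reusing tickers and links_list from the snippet above (the yahoo_urls name is just illustrative):

# keep only the Yahoo Finance links for each ticker
yahoo_urls = {
    ticker: [link for link in links if 'finance.yahoo.com' in link]
    for ticker, links in zip(tickers, links_list)
}
print(yahoo_urls)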
Google News can easily be scraped with requests and beautifulsoup. Passing a user-agent header is enough to extract data from there.
Check out SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the element you want to extract.
If you only want to extract URLs from Google News, then it's as simple as:
for result in soup.select('.dbsr'):
    link = result.a['href']
    # 10 links here..
Code that scrapes more results:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)
-----
'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
Alternatively, you can achieve the same result by using the Google News Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
-----
'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
P.S. I wrote a blog post about how to scrape Google News (including pagination) in a bit more detail, with visual representations.
Disclaimer, I work for SerpApi.

How to print Google Search results properly with bs4?

I have working code that prints first the search titles and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like this and avoid printing the same URLs ten times each:
1) Title url
2) Title url
and so on...
My code:
search = input("Search:")
page = requests.get(f"https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
heading_object = soup.find_all('h3')

for info in heading_object:
    x = info.getText()
    print(x)

for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href:
        y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
        print(y)
If you get the titles and links separately, then you can use zip() to group them in pairs:
for info, link in zip(heading_object, links):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
        print(info, link)
But this may create problems when some title or link doesn't exist on the page, because zip() will then build wrong pairs: it will pair a title with the link of the next element. You should rather search for the elements which hold both title and link, and inside every such element search for the single title and single link to create the pair. If there is no title or link, then you can put in some default value, and it will not create wrong pairs. A rough illustration of this container-based approach follows.
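Here is a minimal sketch of that container-based pairing (the div.g container selector is an assumption and may differ on current result pages):

from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.google.com/search?q=python',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'html.parser')

# search per-result containers so a title can only pair with its own link
for container in soup.select('div.g'):
    title_tag = container.find('h3')
    link_tag = container.find('a')
    title = title_tag.getText() if title_tag else '<no title>'  # default value
    link = link_tag.get('href') if link_tag else '<no link>'    # default value
    print(title, link)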
You're looking for this:
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')  # prints TITLE, URL followed by a new line
If you're using an f-string, then the appropriate way is to use it like so:
page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}") # proper f-string
Code:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "python memes",
    "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
One of the differences is that you only need to iterate over JSON rather than figuring out how to scrape stuff.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Disclaimer, I work for SerpApi.

Can't find the results of a Google Search with BS4 - Python

I'm trying to get the stock price back from a Google search, but the result of the BS4 selection is more than 300 lines.
Here is my code:
import bs4, requests

exampleFile = requests.get('https://www.google.com/search?q=unip6')
exampleSoup = bs4.BeautifulSoup(exampleFile.text, features="html.parser")
elems = exampleSoup.select('div', {"class": 'IsqQVc NprOob'})
print(len(elems))
for each in elems:
    print(each.getText())
    print(each.attrs)
    print('')
I'd like the outcome to be only the price: '23,85'
In this case, the page isn't loaded dynamically, so the target details can be found in the soup. It's also possible to avoid the issue of the changing class name (at least for the time being...) by not using a class selector:
# soup here is built from the question's page, e.g.
# soup = bs4.BeautifulSoup(requests.get('https://www.google.com/search?q=unip6').text, 'html.parser')
for s in soup.select("div"):
    if 'Latest Trade' in s.text:
        print(s.text.split('Latest Trade. ')[1].split('BRL')[0])
        break
Output:
23.85
Using yahoo finance, you can try:
import pandas as pd
from datetime import datetime, timedelta
now = datetime.now() # time now
past = int((now - timedelta(days=30)).timestamp()) # 30 days ago
now = int(now.timestamp())
ticker = "UNIP6.SA" # https://finance.yahoo.com/quote/UNIP6.SA/
interval = "1d" # "1wk" , "1mo"
df = pd.read_csv(f"https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={past}&period2={now}&interval={interval}&events=history")
print(df.iloc[-1]['Close'])
# 23.85
It's much simpler. All you have to do is use select_one() to grab just one element, since there's no need for a for loop (use the SelectorGadget Chrome extension to grab CSS selectors by clicking on the element you want to extract):
soup.select_one('.wT3VGc').text
# 93,42
And don't forget about a user-agent to fake a real user visit; otherwise Google will treat your request as coming from python-requests, its default User-Agent.
Code and full example:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=unip6', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
# 93,42
Alternatively, you can do the same thing, without figuring out why the output has more than 300 lines, by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "unip6",
}

search = GoogleSearch(params)
results = search.get_dict()

current_stock_price = results['answer_box']['price']
print(current_stock_price)
# 93,42
Disclaimer, I work for SerpApi.

Google search gives redirect url, not real url python

So basically what I mean is, when I search https://www.google.com/search?q=turtles, the first result's href attribute is a google.com/url redirect. Now, I wouldn't mind this if I was just browsing the internet with my browser, but I am trying to get search results in python. So for this code:
import requests
from bs4 import BeautifulSoup
def get_web_search(query):
    query = query.replace(' ', '+')  # Replace with %20 also works
    response = requests.get('https://www.google.com/search', params={"q": query})
    r_data = response.content
    soup = BeautifulSoup(r_data, 'html.parser')
    result_raw = []
    results = []
    for result in soup.find_all('h3', class_='r', limit=1):
        result_raw.append(result)
    for result in result_raw:
        results.append({
            'url': result.find('a').get('href'),
            'text': result.find('a').get_text()
        })
    print(results)

get_web_search("turtles")
I would expect
[{
url : "https://en.wikipedia.org/wiki/Turtle",
text : "Turtle - Wikipedia"
}]
But what I get instead is
[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN', 'text': 'Turtle - Wikipedia'}
Is there something I am missing here? Do I need to provide a different header or some other request parameter? Any help is appreciated. Thank you.
NOTE: I saw other posts about this but I am a beginner so I couldn't understand those as they were not in python
Just follow the link's redirect, and it will go to the right page. Assume your link is in the url variable.
import urllib2

url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "http://www.google.com" + url  # prepend scheme and host, or urlopen rejects the relative URL
response = urllib2.urlopen(url)
response.geturl()  # 'https://en.wikipedia.org/wiki/Turtle'
This works since you are getting back Google's redirect to the URL, which is what you really click every time you search. This code basically just follows the redirect until it arrives at the real URL.
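On Python 3 you can also skip the extra request entirely and pull the real URL out of the q query parameter with urllib.parse (a sketch of the same idea):

from urllib.parse import urlparse, parse_qs

redirect = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"

# the real destination is carried in the 'q' parameter of the redirect URL
real_url = parse_qs(urlparse(redirect).query)['q'][0]
print(real_url)  # https://en.wikipedia.org/wiki/Turtle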
Use this package, which provides Google search:
https://pypi.python.org/pypi/google
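A minimal usage sketch (the package's search() signature has shifted between versions, so treat the stop parameter as an assumption):

from googlesearch import search

# print the first result URL for the query
for url in search('turtles', stop=1):
    print(url)
    break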
You can do the same using selenium in combination with Python and BeautifulSoup. It will give you the first result whether the webpage is JavaScript-enabled or a plain one:
from selenium import webdriver
from bs4 import BeautifulSoup

def get_data(search_input):
    search_input = search_input.replace(" ", "+")
    driver.get("https://www.google.com/search?q=" + search_input)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for result in soup.select('h3.r'):
        item = result.select("a")[0].text
        link = result.select("a")[0]['href']
        print("item_text: {}\nitem_link: {}".format(item, link))
        break

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_data("turtles")
    finally:
        driver.quit()
Output:
item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle
You can use CSS selectors to grab those links.
soup.select_one('.yuRUbf a')['href']
Code and example:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

# iterate over the organic results containers
for result in soup.select('.tF2Cxc'):
    # extract the url from each "result" container
    url = result.select_one('.yuRUbf a')['href']
    print(url)
------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''
Alternatively, you can do the same thing using Google Search Engine Results API from SerpApi.
It's a paid API with a free trial of 5,000 searches, and the main difference here is that all you have to do is navigate through structured JSON rather than figuring out why stuff doesn't work.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "turtle",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['link'])
--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''
Disclaimer, I work for SerpApi.

How can I scrape the first link of a google search with beautiful soup

I'm trying to make a script that scrapes the first link of a Google search, so that it gives me back only the first link and I can run a search in the terminal and look at the link later on together with the search term. I'm struggling to get only the first result. This is the closest thing I've got so far:
import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later

r = requests.get(goog_search)
soup = BeautifulSoup(r.text)

for link in soup.find_all('a'):
    print research_later + " :" + link.get('href')
It seems like Google uses the cite tag to save the link, so we can just use soup.find('cite').text, like this:
import requests
from bs4 import BeautifulSoup
research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print soup.find('cite').text
Output is:
www.urbandictionary.com/define.php?term=hiya
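For reference, the same idea in Python 3 syntax (a sketch; the cite-tag layout on current result pages may well have changed since this answer was written):

import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later

r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")

# print() is a function in Python 3
print(soup.find('cite').text)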
You can use either the select_one() or find() bs4 methods to get only one element from the page (select_one() accepts CSS selectors). To grab CSS selectors you can use the SelectorGadget extension.
Code and example:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# locate the .yuRUbf container, take its <a> tag, then the 'href' attribute
# (select_one() rather than select(), which would return a list)
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://en.wikipedia.org/wiki/Ice_cream
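If you prefer find(), a rough equivalent under the same class-name assumptions:

# hypothetical find() version of the select_one() call above
link = soup.find('div', class_='yuRUbf').a['href']
print(link)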
Alternatively, you can do the same thing by using Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The main difference is that everything (selecting, bypassing blocks, proxy rotation, and more) is already done for the end user, with JSON output.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first index from the search results
link = results['organic_results'][0]['link']
print(link)
# https://en.wikipedia.org/wiki/Ice_cream
Disclaimer, I work for SerpApi.
