Scraping Google headlines suddenly stopped working - Python

I wrote some code for scraping the Google News page. It worked fine until today, when it stopped.
It does not give me any error, but it does not scrape anything either.
I followed a YouTube tutorial from 2018 and used the same URL and the same 'div's.
When I inspect the page in the browser, it still has class="st" and class="slp".
So the same selectors worked a year ago, and they worked yesterday, but they stopped working today.
Do you know what the problem could be?
This is the code that worked yesterday:
from textblob import TextBlob
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime
term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))
for paragraph_text, post_date in zip(snippet_text, news_date):
    paragraph_text = TextBlob(paragraph_text.get_text())
    print(paragraph_text)
    todays_date = date.today()
    time_ago = TextBlob(post_date.get_text()).split('- ')[1]
    print(time_ago)
Does Google change its HTML code or URL?

Please add a user-agent header while scraping Google.
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url,headers=headers)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))
If you get an SSL max-retries error, add verify=False:
response = requests.get(url,headers=headers,verify=False)

As KunduK said, Google is blocking your request because the default user-agent from the requests library is python-requests. You can fake a real browser visit by adding a user-agent to your request headers; lists of user-agents are available on a number of websites.
Also, you can set a timeout on your request (info) to stop waiting for a response after a given number of seconds; otherwise, the script can hang indefinitely.
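For example, a minimal sketch combining a user-agent with a timeout (the 30-second value is just an illustration):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

# timeout raises requests.exceptions.Timeout instead of letting the script hang forever
response = requests.get('https://www.google.com/search?q=coca cola&tbm=nws', headers=headers, timeout=30)
print(response.status_code)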
You can apply the same logic to Yahoo, Bing, Baidu, Yandex, and other search engines.
Code and full example:
from bs4 import BeautifulSoup
import requests
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.google.com/search?hl=en-US&q=coca cola&tbm=nws', headers=headers).text
soup = BeautifulSoup(response, 'lxml')
for headings in soup.findAll('div', class_='dbsr'):
    title = headings.find('div', class_='JheGif nDgy9d').text
    link = headings.a['href']
    print(f'Title: {title}')
    print(f'Link: {link}')
    print()
Part of output:
Title: Fact check: Georgia is not removing Coca-Cola products from state-owned buildings
Link: https://www.usatoday.com/story/news/factcheck/2021/04/09/fact-check-georgia-not-removing-coke-products-state-buildings/7129548002/
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Alternatively, you can use the Google News Result API from SerpApi. Check out the Playground to test it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "coca cola",
"tbm": "nws",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
Part of output:
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: 2 Reasons You Shouldn't Buy Coca-Cola Now
Link: https://seekingalpha.com/article/4418712-2-reasons-you-shouldnt-buy-coca-cola-now
Title: Worrying Signs For Coca-Cola
Link: https://seekingalpha.com/article/4418630-worrying-signs-for-coca-cola
Disclaimer, I work for SerpApi.

Related

BeautifulSoup scrapes very few listing prices instead of all listing prices on a page

I want to scrape data from a real estate website for an educational project. I am using BeautifulSoup. I wrote the following code. It runs without errors but returns very little data.
import requests
from bs4 import BeautifulSoup
url = "https://www.zillow.com/homes/San-Francisco,-CA_rb/"
headers = {
"Accept-Language": "en-GB,en;q=0.5",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0"
}
response = requests.get(url=url, headers=headers )
soup = BeautifulSoup(response.text, "html.parser")
prices = soup.find_all("span", attrs={"data-test":True})
prices_list = [price.getText().strip("+,/,m,o,1,bd, ") for price in prices]
print(prices_list)
The output only shows the first 9 listing prices.
['$2,959', '$2,340', '$2,655', '$2,632', '$2,524', '$2,843', '$2,64', '$2,300', '$2,604']
That's because the content is loaded progressively with additional requests (lazy loading). You could try to reverse engineer the site's backend. I'll look into it, and if I find an easy solution I'll update the answer. :)
The API call to their backend looks something like this: https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-123.07190982226562%2C%22east%22%3A-121.79474917773437%2C%22south%22%3A37.63132659190023%2C%22north%22%3A37.918977518603874%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=3
You need to handle cookies correctly in order to see the results, but it delivers around 1,000 results. Have fun :)
UPDATE:
Parsing the saved JSON response should look like this:
import json
with open("GetSearchPageState.json", "r") as f:
    a = json.load(f)
print(a["cat1"]["searchResults"]["mapResults"])

How to extract the anchor tag which is like this <a class="a-no-hover-decoration" ... >?

So,
let's say I search Google for "White Russian". As soon as we do that, we receive some model cards, as shown in the image below.
Now, if you look at the HTML in the Inspector, you will see that each card's href is inside an anchor tag like the one below (... denotes extra stuff):
<a class="a-no-hover-decoration" href="https://www.liquor.com/recipes/white-russian/" .....>
What I am interested in is extracting that href from such anchor tags whenever they exist for a search.
My attempt:
import requests
from bs4 import BeautifulSoup
req = requests.get("https://www.google.com/search?q=White+Russian")
soup = BeautifulSoup(req.text, 'html.parser')
soup.find_all("a", {"class": "a-no-hover-detection"}) # this returns Nothing
I am kind of new to web scraping, so I will appreciate your help.
My second question is: how do I detect whether such model cards exist vs. when they don't for any given random search?
Thanks.
You can also grab CSS selectors visually using the SelectorGadget Chrome extension.
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.google.com/search?q=white russian', headers=headers).text
soup = BeautifulSoup(response, 'lxml')
# select() method: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for result in soup.select('.cv2VAd .a-no-hover-decoration'):
    link = result['href']
    print(link)
Output:
https://www.liquor.com/recipes/white-russian/
https://www.delish.com/cooking/recipe-ideas/a29091466/white-russian-cocktail-recipe/
https://www.kahlua.com/en-us/drinks/white-russian/
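As for the second part of the question, detecting whether such cards exist at all for a given search can be a simple emptiness check on the same selector, reusing soup from the snippet above (a sketch):
cards = soup.select('.cv2VAd .a-no-hover-decoration')
if cards:
    print(f'{len(cards)} card link(s) found')
else:
    print('no such cards for this search')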
Alternatively, you can do it using Google Search Engine Results API. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "White Russian",
"google_domain": "google.com",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['recipes_results']:
    link = result['link']
    print(link)
Output:
https://www.liquor.com/recipes/white-russian/
https://www.delish.com/cooking/recipe-ideas/a29091466/white-russian-cocktail-recipe/
https://www.kahlua.com/en-us/drinks/white-russian/
Disclaimer, I work for SerpApi.
To get a correct response from Google's server, specify a User-Agent HTTP header and the hl=en parameter (to get English results). Also, the class name is a-no-hover-decoration, not a-no-hover-detection:
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
params = {"q": "White Russian", "hl": "en"}
req = requests.get(
    "https://www.google.com/search", params=params, headers=headers
)
soup = BeautifulSoup(req.text, "html.parser")
for a in soup.find_all("a", {"class": "a-no-hover-decoration"}):
    print(a["href"])
Prints:
https://www.liquor.com/recipes/white-russian/
https://www.delish.com/cooking/recipe-ideas/a29091466/white-russian-cocktail-recipe/
https://www.bbcgoodfood.com/recipes/white-russian

I want to fetch the live stock price data through google search

I was trying to fetch the real-time stock price through Google search using web scraping, but it's giving me an error:
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs.BeautifulSoup(resp.text,'lxml')
tab = soup.find('div',attrs = {'class':'gsrt'}).find('span').text
'NoneType' object has no attribute 'find'
You could use
soup.select_one('td[colspan="3"] b').text
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent' : 'Mozilla/5.0'}
res = requests.get('https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8', headers = headers)
soup = bs(res.content, 'lxml')
quote = soup.select_one('td[colspan="3"] b').text
print(quote)
Try this maybe...
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
print(tab[3].text.strip())
or, if you only want the price..
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
price = tab[3].text.strip()
print(price[:7])
A user-agent is not specified in your request, which could be the reason you were getting an empty result. Without it, Google treats your request as coming from python-requests, i.e. an automated script, instead of a "real" user visit.
It's fairly easy to do:
Click on SelectorGadget Chrome extension (once installed).
Click on the stock price and receive a CSS selector provided by SelectorGadget.
Use this selector to get the data.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=nasdaq stock price', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
>>> 177,33
Alternatively, you can do the same thing using Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The biggest difference in this example is that you don't have to figure out why something doesn't work when it should. Everything is already done for the end user (in this case, all the selections and figuring out how to scrape the data), with JSON output.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "nasdaq stock price",
}
search = GoogleSearch(params)
results = search.get_dict()
current_stock_price = results['answer_box']['price']
print(current_stock_price)
>>> 177.42
Disclaimer, I work for SerpApi.

Get year of first publication Google Scholar

I am working on scraping data from Google Scholar using bs4 and urllib. I am trying to get the first year an article was published. For example, from this page I am trying to get the year 1996. This can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints out the year visible before the bar chart is clicked.
from bs4 import BeautifulSoup
import urllib.request
url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print (year)
The chart information comes from a different request, this one. There you can get the information you want with the following XPath:
'//span[@class="gsc_g_t"][1]/text()'
or in soup:
soup.find('span', {"class": "gsc_g_t"}).text
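A sketch of how that could look with the question's urllib/bs4 setup, assuming the histogram data is served by the citations page with the view_op=citations_histogram parameter (the same request used in the next answer); a user-agent header may still be needed, as other answers point out:
from bs4 import BeautifulSoup
import urllib.request

# assumed endpoint: the citations page with view_op=citations_histogram
url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ&view_op=citations_histogram&hl=en'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')

# the first span with class gsc_g_t is the earliest year on the chart
first_year = soup.find('span', {"class": "gsc_g_t"}).text
print(first_year)  # expected: 1996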
Make sure you're using a recent user-agent. An old user-agent signals to the website that the request might come from a bot. A new user-agent does not mean every website will treat it as a "real" user visit, though. Check what your user-agent is.
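For instance, a quick way to see the default user-agent that requests sends (and that gives the bot away):
import requests

# prints something like 'python-requests/2.28.1'
print(requests.utils.default_headers()['User-Agent'])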
The code snippet below uses the parsel library, which is similar to bs4 but supports full XPath and translates every CSS selector query to XPath using the cssselect package.
Example code to integrate:
from collections import namedtuple
import requests
from parsel import Selector
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"user": "VGoSakQAAAAJ",
"hl": "en",
"view_op": "citations_histogram"
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
Publications = namedtuple("Years", "first_publication")
publications = Publications(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(selector.css(".gsc_g_t::text").get())
print(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(publications.first_publication)
# output:
'''
1996
1996
1996
'''
Alternatively, you can achieve the same thing by using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to parse the data, maintain the parser over time, scale it, or bypass blocks from the search engine, in this case Google Scholar.
Example code to integrate:
from serpapi import GoogleScholarSearch
params = {
"api_key": "Your SerpApi API key",
"engine": "google_scholar_author",
"hl": "en",
"author_id": "VGoSakQAAAAJ"
}
search = GoogleScholarSearch(params)
results = search.get_dict()
# already sorted data
first_publication = [year.get("year") for year in results.get("cited_by", {}).get("graph", [])][0]
print(first_publication)
# 1996
If you want to scrape all profile results for a given query, or you have a list of author IDs, there's a dedicated blog post of mine about it: scrape all Google Scholar Profile, Author Results to CSV.
Disclaimer, I work for SerpApi.

web crawling Google - getting different results

I have written the following Python script to crawl and scrape headings of Google News search results within a specific date range. Though the script works, it shows the latest search results, not the ones from the requested range.
E.g. rather than showing results from 1 Jul 2015 - 7 Jul 2015, the script shows results from May 2016 (the current month).
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
#get and read the URL
url = ("https://www.google.co.in/search?q=banking&num=100&safe=off&espv=2&biw=1920&bih=921&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F07%2F2015%2Ccd_max%3A07%2F07%2F2015&tbm=nws")
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(url)
bsObj = BeautifulSoup(html.read(), "html5lib")
# extract all the headline <h3> elements from the page
items = bsObj.findAll("h3")
for item in items:
    itemA = item.a
    theHeading = itemA.text
    print(theHeading)
Can someone please guide me to the correct method of getting the desired results, sorted by dates?
Thanks in advance.
I did some tests and it seems the problem is coming from the User-Agent which is not detailed enough.
Try replacing this line:
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
with:
opener.addheaders = [('User-agent', "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0"),
It worked for me.
Of course this User-Agent is just an example.
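If the date range is still needed once the block is gone, here is a minimal sketch with requests that keeps the tbs filter from the question's URL (decoded here for readability):
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0"}
params = {
    "q": "banking",
    "num": 100,
    "tbm": "nws",
    # custom date range 1 Jul 2015 - 7 Jul 2015, taken from the original URL
    "tbs": "cdr:1,cd_min:01/07/2015,cd_max:07/07/2015",
}

html = requests.get("https://www.google.co.in/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, "html5lib")
for h3 in soup.findAll("h3"):
    print(h3.text)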
As Julien Salinas wrote, it's because there's no user-agent specified in your request headers.
For example, the default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit; you then receive different HTML with different selectors and elements, or some sort of error. Adding a user-agent to the HTTP request headers fakes a real user visit.
Pass a user-agent into the request headers using the requests library:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "best potato recipes",
"hl": "en",
"gl": "us",
"tbm": "nws",
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.WlydOe'):
    title = result.select_one('.nDgy9d').text
    link = result['href']
    print(title, link, sep='\n')
----------
'''
Call of Duty Vanguard (PS5) Beta Impressions – A Champion Hill To Die On
https://wccftech.com/call-of-duty-vanguard-ps5-beta-impressions-a-champion-hill-to-die-on/
Warzone players call for fan-favorite MW2 map to be added to Verdansk
https://charlieintel.com/warzone-players-call-for-fan-favorite-mw2-map-to-be-added-to-verdansk/114014/
'''
Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out why certain things don't work or how to bypass blocks from search engines, since that's already done for the end user; you only need to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "Call of duty 360 no scope",
"tbm": "nws",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
----------
'''
Call of Duty Vanguard (PS5) Beta Impressions – A Champion Hill To Die On
https://wccftech.com/call-of-duty-vanguard-ps5-beta-impressions-a-champion-hill-to-die-on/
Warzone players call for fan-favorite MW2 map to be added to Verdansk
https://charlieintel.com/warzone-players-call-for-fan-favorite-mw2-map-to-be-added-to-verdansk/114014/
'''
P.S. I wrote a more in-depth blog post about how to scrape Google News results.
Disclaimer, I work for SerpApi.
