Get year of first publication Google Scholar - python

I am working on scraping data from Google Scholar using bs4 and urllib. I am trying to get the first year an article was published. For example, from this page I am trying to get the year 1996. This can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints the year visible before the bar chart is clicked.
from bs4 import BeautifulSoup
import urllib.request
url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print(year)

The chart information is loaded by a different request, this one. There you can get the information you want with the following XPath:
'//span[@class="gsc_g_t"][1]/text()'
or in soup:
soup.find('span', {"class": "gsc_g_t"}).text
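To sanity-check that soup expression without hitting the network, you can run it against a simplified stand-in for the histogram markup (the HTML below is made up; only the class name matches the real page):

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the histogram's year labels; the real
# citations_histogram response uses the same gsc_g_t class
html = """
<span class="gsc_g_t">1996</span>
<span class="gsc_g_t">1997</span>
<span class="gsc_g_t">1998</span>
"""

soup = BeautifulSoup(html, "html.parser")
first_year = soup.find("span", {"class": "gsc_g_t"}).text
print(first_year)  # -> 1996
```

`find` returns the first matching element in document order, which on this page is the earliest year in the chart.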

Make sure you're using the latest user-agent. An old user-agent is a signal to the website that the request might come from a bot. A new user-agent does not guarantee that every website will treat it as a "real" user visit, though. Check what your user-agent is.
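To see why this matters, you can inspect the default user-agent that requests would send, without making any network call:

```python
import requests

# A fresh Session carries requests' default headers, including its
# telltale user-agent string (e.g. "python-requests/2.x.y")
session = requests.Session()
default_ua = session.headers["User-Agent"]
print(default_ua)

# Overriding it per-session; the UA string here is only an example
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
)
```

Any site that checks the user-agent can trivially spot the default string, which is why overriding it is usually the first fix for empty scraping results.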
The code snippet below uses the parsel library, which is similar to bs4 but supports full XPath and translates every CSS selector query to XPath using the cssselect package.
Example code to integrate:
from collections import namedtuple
import requests
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "user": "VGoSakQAAAAJ",
    "hl": "en",
    "view_op": "citations_histogram"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

Publications = namedtuple("Publications", "first_publication")
publications = Publications(sorted([year.get() for year in selector.css(".gsc_g_t::text")])[0])

print(selector.css(".gsc_g_t::text").get())
print(sorted([year.get() for year in selector.css(".gsc_g_t::text")])[0])
print(publications.first_publication)
# output:
'''
1996
1996
1996
'''
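As an aside, since the scraped years are four-digit strings, they sort lexicographically in the same order as numbers, so min() gives the earliest one directly without building a full sorted list:

```python
# Example list standing in for the scraped .gsc_g_t texts
years = ["1996", "2001", "1997", "2010"]

# Four-digit year strings compare the same way the numbers do,
# so min() is equivalent to sorted(years)[0] but simpler
first_publication = min(years)
print(first_publication)  # -> 1996
```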
Alternatively, you can achieve the same thing by using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to parse the data, maintain the parser over time, scale the scraper, or bypass blocks from a search engine such as Google Scholar.
Example code to integrate:
from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpApi API key",
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "VGoSakQAAAAJ"
}

search = GoogleScholarSearch(params)
results = search.get_dict()

# the graph data is already sorted by year
first_publication = [year.get("year") for year in results.get("cited_by", {}).get("graph", [])][0]
print(first_publication)
# 1996
If you want to scrape all profile results for a given query, or you have a list of author IDs, there's a dedicated blog post of mine about it: scrape all Google Scholar Profile, Author Results to CSV.
Disclaimer, I work for SerpApi.

Related

Scraping google headlines suddenly stop working

I have written code for web scraping the Google news page. It worked fine until today, when it stopped.
It does not give me any error, but it does not scrape anything.
For this code I watched a tutorial from 2018 on YouTube and I used the same URL and the same 'div's.
When I go to 'inspect' in the browser, it still has class="st" and class="slp".
I mean, it worked a year ago, and it worked yesterday, but it stopped working today.
Do you know what the problem could be?
This is the code that worked yesterday:
from textblob import TextBlob
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime

term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')
snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))

for paragraph_text, post_date in zip(snippet_text, news_date):
    paragraph_text = TextBlob(paragraph_text.get_text())
    print(paragraph_text)
    todays_date = date.today()
    time_ago = TextBlob(post_date.get_text()).split('- ')[1]
    print(time_ago)
Did Google change the HTML code or the URL?
Please add a user-agent while scraping Google.
from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta, datetime

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'coca cola'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url, headers=headers)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')
snippet_text = soup.find_all('div', class_='st')
print(len(snippet_text))
news_date = soup.find_all('div', class_='slp')
print(len(news_date))
If you hit an SSL "max retries exceeded" error, add verify=False:
response = requests.get(url, headers=headers, verify=False)
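Note that verify=False disables TLS certificate checks, so it should only be a debugging measure. If you do use it, urllib3 (which ships as a dependency of requests) will print an InsecureRequestWarning on every call; you can silence it like this:

```python
import urllib3

# Suppresses the InsecureRequestWarning emitted when verify=False is used;
# only do this if you understand you're skipping certificate validation
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```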
As KunduK said, Google is blocking your request because the default user-agent from the requests library is python-requests. You can mimic a real browser visit by adding headers to your request. Here's a list of user-agents, among other resources.
Also, you can set a timeout on your request (info) to stop waiting for a response after a given number of seconds. Otherwise, the script can hang indefinitely.
You can apply the same logic to Yahoo, Bing, Baidu, Yandex, and other search engines.
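The timeout pattern can be sketched like this (the `fetch` helper and the non-routable address below are my own illustration, not part of the original code):

```python
import requests

def fetch(url, seconds=0.5):
    """Return the response body, or None if the request times out or fails."""
    try:
        return requests.get(url, timeout=seconds).text
    except requests.exceptions.Timeout:
        print("request timed out")
    except requests.exceptions.RequestException as error:
        print(f"request failed: {error}")
    return None

# 10.255.255.1 is a non-routable address, used here only to force a
# failure for demonstration; use the real search URL in practice
result = fetch("http://10.255.255.1")
```

Catching `requests.exceptions.Timeout` (and the broader `RequestException`) keeps the script from hanging or crashing when Google is slow or unreachable.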
Code and full example:
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?hl=en-US&q=coca cola&tbm=nws', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for headings in soup.findAll('div', class_='dbsr'):
    title = headings.find('div', class_='JheGif nDgy9d').text
    link = headings.a['href']
    print(f'Title: {title}')
    print(f'Link: {link}')
    print()
Part of output:
Title: Fact check: Georgia is not removing Coca-Cola products from state-owned
buildings
Link: https://www.usatoday.com/story/news/factcheck/2021/04/09/fact-check-georgia-not-removing-coke-products-state-buildings/7129548002/
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to
speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Alternatively, you can use Google News Result API from SerpApi. Check out Playground to test.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coca cola",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
Part of output:
Title: Why Coke's Earnings Could Contain Good News, One Analyst Says
Link: https://www.barrons.com/articles/cokes-stock-is-lagging-why-one-analyst-thinks-next-weeks-earnings-could-include-good-news-51618246989
Title: The 'race for talent' is pushing companies like Delta and Coca-Cola to speak out against voting laws
Link: https://www.businessinsider.com/georgia-voting-law-merits-response-delta-coca-cola-workers-2021-4
Title: 2 Reasons You Shouldn't Buy Coca-Cola Now
Link: https://seekingalpha.com/article/4418712-2-reasons-you-shouldnt-buy-coca-cola-now
Title: Worrying Signs For Coca-Cola
Link: https://seekingalpha.com/article/4418630-worrying-signs-for-coca-cola
Disclaimer, I work for SerpApi.

Why is the working code not giving any output anymore?

I took the code below from the answer to How to use BeautifulSoup to parse google search results in Python.
It used to work on my Ubuntu 16.04 and I have both Python 2 and 3.
The code is below:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')

for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
It executes but prints nothing. The problem really puzzles me. Any help would be appreciated.
The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To specify a custom header, pass a dict with User-Agent to the headers= parameter in requests:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
Prints:
How to Write the Perfect Query Letter - Query Letter Examplehttps://www.writersdigest.com/.../how-to-write-the-perfect-qu...Cached - Translate this page - 21 March 2016 - A literary agent shares a real-life novel pitch that ultimately led to a book deal—and shows you how to query your own work with success.
-----
People also ask the following: How do you start a query letter? What should be included in a query letter? How do you end a query in an email? How long is a query letter? Feedback
-----
...and so on.
Learn more about user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system. It represents the person (browser) making the request in a Web context, and lets servers and network peers identify whether it's a bot or not.
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
To make the code look better, you can pass the URL params as a dict(), which is more readable, and requests does the encoding for you automatically (the same goes for adding a user-agent to the headers):
params = {
    "q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    print(title)
-------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract the data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
--------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Disclaimer, I work for SerpApi.

I want to fetch the live stock price data through google search

I was trying to fetch the real-time stock price through Google search using web scraping, but it's giving me an error:
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs.BeautifulSoup(resp.text,'lxml')
tab = soup.find('div',attrs = {'class':'gsrt'}).find('span').text
AttributeError: 'NoneType' object has no attribute 'find'
You could use
soup.select_one('td[colspan="3"] b').text
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent' : 'Mozilla/5.0'}
res = requests.get('https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8', headers = headers)
soup = bs(res.content, 'lxml')
quote = soup.select_one('td[colspan="3"] b').text
print(quote)
Try this maybe...
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text, 'lxml')
tab = soup.find('div', class_='g').findAll('span')
print(tab[3].text.strip())
or, if you only want the price:
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text, 'lxml')
tab = soup.find('div', class_='g').findAll('span')
price = tab[3].text.strip()
print(price[:7])
A user-agent is not specified in your request. That could be the reason why you were getting an empty result: Google treats such a request as python-requests, i.e. an automated script, instead of a "real user" visit.
It's fairly easy to do:
Click on SelectorGadget Chrome extension (once installed).
Click on the stock price and receive a CSS selector provided by SelectorGadget.
Use this selector to get the data.
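The extraction step can be sketched offline; the markup and the .price class below are made-up stand-ins for the real page and for the selector SelectorGadget would hand you (e.g. .wT3VGc):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the Google results page; ".price"
# plays the role of the class SelectorGadget would pick out for you
html = '<div class="quote"><span class="price">177.42</span></div>'

soup = BeautifulSoup(html, "html.parser")
current_stock_price = soup.select_one(".price").text
print(current_stock_price)  # -> 177.42
```

select_one returns the first element matching the CSS selector, or None if the selector doesn't match, which is exactly why the original code crashed with 'NoneType' when Google served different HTML.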
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=nasdaq stock price', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
>>> 177,33
Alternatively, you can do the same thing using Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The biggest difference in this example is that you don't have to figure out why something doesn't work when it should. Everything is already done for the end user (in this case, all the selectors and the logic to scrape this data), with a JSON output.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "nasdaq stock price",
}

search = GoogleSearch(params)
results = search.get_dict()

current_stock_price = results['answer_box']['price']
print(current_stock_price)
>>> 177.42
Disclaimer, I work for SerpApi.

Why can't I see the same page that I requested?

I've been learning Python and tried web scraping.
I managed to scrape the Google result page for a normal Google search, though the page looked stripped-down, and I don't know why.
I tried the same for Google Images, and it is stripped-down as well. It doesn't appear the same as it does in the browser.
Here's my code.
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for : ")
params = {"tbm": "isch", "source": "hp", "q": search}
r = requests.get("https://www.google.com/search", params=params)
print("URL :", r.url)
print("Status : ", r.status_code, "\n\n")
f = open("ImageResult.html", "w+")
f.write(r.text)
For example, I search for "Goku".
The Google Image returns this page.
When I click on the first image, a popup opens. Or say I press Ctrl+click; I reach this page.
On this page I can see that the actual image's URL can be accessed through the current URL, or maybe through the link behind the "View Image" button. But the issue is, I can't reach this page/popup in the version of the page I get when I request it.
UPDATE: I'm sharing the page I am getting.
This depends on a lot of factors, like the user-agent string, cookies, and also Google experiments. Google is known for serving the same content in different ways to different users. On search, Google loads different pages based on site speed and user agent. Google also randomly runs experiments on the search page design, etc., before rolling changes out publicly, to implement A/B testing dynamically.
Google organic results have very little JavaScript, and you can still parse data from the <script> tags.
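Extracting data from an inline <script> tag can be sketched like this (the HTML and the variable name "data" below are made-up stand-ins for what a real results page embeds):

```python
import json
import re
from bs4 import BeautifulSoup

# Made-up page embedding its data in a <script> tag, the way many
# Google pages do; the structure here is purely illustrative
html = """
<html><body>
<script>var data = {"query": "goku", "results": 10};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script_text = soup.find("script").string

# Pull the JSON object out of the JavaScript assignment
match = re.search(r"var data = (\{.*?\});", script_text)
data = json.loads(match.group(1))
print(data["results"])  # -> 10
```

On a real page you would adapt the regular expression to the actual assignment Google emits; the pattern (locate the script, cut out the JSON, json.loads it) stays the same.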
Besides that, the most common reason why you don't see the same results as in your browser is that no user-agent is being passed in the request headers. When no user-agent is specified while using the requests library, it defaults to python-requests; Google understands that it's a bot/script, blocks the request (or whatever it does), and you receive different HTML with different CSS selectors. Check what your user-agent is.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time trying to bypass blocks from Google or figuring out why certain things don't work as they should, and you don't have to maintain the parser over time.
Very simple example code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.

How to Retrieve 10 first Google Search Results Using Python Requests

I've seen lots of questions regarding this subject, and I found out that Google has been updating the way its search engine APIs work.
This link > get the first 10 google results using googleapi shows EXACTLY what I need but the thing is I don't know if it's possible to do that anymore.
I need this to my term paper but by reading Google docs I couldn't find a way to do that.
I've done the "get started" stuff and all I got was a private search engine using custom search engine (CSE).
Alternatively, you can use Python, Selenium, and PhantomJS or other browsers to browse through Google's search results and grab the content. I haven't done that personally and don't know if there are challenges there.
I believe the best way would be to use their search APIs. Please try the one you pointed out. If it doesn't work, look for the new APIs.
I came across this question while trying to solve this problem myself and I found an updated solution to this.
Basically, I used this guide at Google Custom Search to generate my own API key and search engine, then used Python requests to retrieve the JSON results.
import requests
import simplejson

def search(query):
    api_key = 'MYAPIKEY'
    search_engine_id = 'MYENGINEID'
    url = "https://www.googleapis.com/customsearch/v1/siterestrict?key=%s&cx=%s&q=%s" % (api_key, search_engine_id, query)
    result = requests.Session().get(url)
    json = simplejson.loads(result.content)
    return json
I answered the question you attached via link.
Here's the link to that answer and full code example. I'll copy the code for faster access.
The first way, using a custom script that returns JSON:
from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))
Using Google Search Engine Results API from SerpApi:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nLink: {result['link']}\n")
Disclaimer, I work for SerpApi.
