How to scrape google maps using python - python

I am trying to scrape the number of reviews of a place from google maps using python. For example the restaurant Pike's Landing (see google maps URL below) has 162 reviews. I want to pull this number in python.
URL: https://www.google.com/maps?cid=15423079754231040967
I am not very well versed in HTML, but from some basic examples on the internet I wrote the following code. What I get is a blank variable after running it. If you could let me know what I am doing wrong, that would be much appreciated.
from urllib.request import urlopen
from bs4 import BeautifulSoup
quote_page ='https://www.google.com/maps?cid=15423079754231040967'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find_all('button',attrs={'class':'widget-pane-link'})
print(price_box.text)

It's hard to do this in pure Python without an API. Here's what I ended up with (note that I added &hl=en to the end of the URL to get results in English rather than in my language):
import re
import requests
from ast import literal_eval
urls = [
    'https://www.google.com/maps?cid=15423079754231040967&hl=en',
    'https://www.google.com/maps?cid=16168151796978303235&hl=en']
for url in urls:
    for g in re.findall(r'\[\\"http.*?\d+ reviews?.*?]', requests.get(url).text):
        data = literal_eval(g.replace('null', 'None').replace('\\"', '"'))
        print(bytes(data[0], 'utf-8').decode('unicode_escape'))
        print(data[1])
Prints:
http://www.google.com/search?q=Pike's+Landing,+4438+Airport+Way,+Fairbanks,+AK+99709,+USA&ludocid=15423079754231040967#lrd=0x51325b1733fa71bf:0xd609c9524d75cbc7,1
469 reviews
http://www.google.com/search?q=Sequoia+TreeScape,+Newmarket,+ON+L3Y+8R5,+Canada&ludocid=16168151796978303235#lrd=0x882ad2157062b6c3:0xe060d065957c4103,1
42 reviews

You need to view the source code of the page and parse window.APP_INITIALIZATION_STATE variable block using a regular expression, there you'll find all needed data.
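As a rough illustration of that approach, here is a minimal sketch that pulls the raw text assigned to window.APP_INITIALIZATION_STATE out of an HTML string with a regular expression. The sample HTML is a hypothetical, heavily shortened stand-in for the real page source, and the assumption that the block ends just before window.APP_FLAGS may not hold for every page version.

```python
import re

# Hypothetical, heavily shortened stand-in for the page source; the real
# payload is far larger and its layout is not documented.
sample_html = (
    "<script>window.APP_OPTIONS=[];"
    "window.APP_INITIALIZATION_STATE=[[[1234.5,-147.84,64.82]],\"469 reviews\"];"
    "window.APP_FLAGS=[];</script>"
)


def extract_init_state(html: str):
    """Return the raw text assigned to window.APP_INITIALIZATION_STATE, or None."""
    match = re.search(
        # Assumption: the assignment is terminated by ";window.APP_FLAGS".
        r"window\.APP_INITIALIZATION_STATE\s*=\s*(.*?);\s*window\.APP_FLAGS",
        html,
        flags=re.DOTALL,
    )
    return match.group(1) if match else None


state = extract_init_state(sample_html)
# Look for "<number> review(s)" anywhere inside the extracted block.
review_counts = re.findall(r"(\d[\d,]*) reviews?", state or "")
print(review_counts)
```

On a real page the extracted block is a large nested array, so a second targeted regex (as here) is usually easier than trying to interpret the whole structure.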
Alternatively, you can use Google Maps Reviews API from SerpApi.
Example JSON output:
"place_results": {
  "title": "Pike's Landing",
  "data_id": "0x51325b1733fa71bf:0xd609c9524d75cbc7",
  "reviews_link": "https://serpapi.com/search.json?engine=google_maps_reviews&hl=en&place_id=0x51325b1733fa71bf%3A0xd609c9524d75cbc7",
  "gps_coordinates": {
    "latitude": 64.8299557,
    "longitude": -147.8488774
  },
  "place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x51325b1733fa71bf%3A0xd609c9524d75cbc7%218m2%213d64.8299557%214d-147.8488774&engine=google_maps&google_domain=google.com&hl=en&type=place",
  "thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNtwheOCQ97QFrUNIwKYUoAPiV81rpiW5cIiQco=w152-h86-k-no",
  "rating": 3.9,
  "reviews": 839,
  "price": "$$",
  "type": [
    "American restaurant"
  ],
  "description": "Burgers, seafood, steak & river views. Pub fare alongside steak & seafood, served in a dining room with river views & a waterfront patio.",
  "service_options": {
    "dine_in": true,
    "curbside_pickup": true,
    "delivery": false
  }
}
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google_maps",
    "type": "search",
    "q": "pike's landing",
    "ll": "@40.7455096,-74.0083012,14z",
    "google_domain": "google.com",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
reviews = results["place_results"]["reviews"]
print(reviews)
Output:
839
Disclaimer, I work for SerpApi.

Scraping Google Maps without a browser or proxies will get you blocked after a few successful requests. Therefore, the main problem of scraping Google is dealing with cookies and reCAPTCHA.
This is a good post where you can see an example of using Selenium in Python for the same purpose. The general idea is that you start a browser and simulate what a user does on the website.
Another way is to use a reliable third-party service that does the whole job for you and returns the results. For example, you can try Outscraper's Reviews service, which has a free tier.
from outscraper import ApiClient
api_client = ApiClient(api_key='SECRET_API_KEY')
# Get reviews of the specific place by id
result = api_client.google_maps_reviews('ChIJrc9T9fpYwokRdvjYRHT8nI4', reviewsLimit=20, language='en')
# Get reviews for places found by search query
result = api_client.google_maps_reviews('Memphis Seoul brooklyn usa', reviewsLimit=20, limit=500, language='en')
# Get only new reviews during last 24 hours
from datetime import datetime, timedelta
yesterday_timestamp = int((datetime.now() - timedelta(1)).timestamp())
result = api_client.google_maps_reviews(
    'ChIJrc9T9fpYwokRdvjYRHT8nI4', sort='newest', cutoff=yesterday_timestamp, reviewsLimit=100, language='en')
Disclaimer, I work for Outscraper.

Related

Getting href links from a website using Python's Beautiful Soup module

I am trying to get the href links from this page, specifically the links to the pages of those respective clubs. My current code is as follows. I have not included the imports. If needed, I just did import requests and from bs4 import BeautifulSoup:
rsoLink = "https://illinois.campuslabs.com/engage/organizations?query=badminton"
page = requests.get(rsoLink)
beautifulPage = BeautifulSoup(page.content, 'html.parser')
for link in beautifulPage.findAll("a"):
    print(link.get('href'))
My output is empty, suggesting that the program did not find the links. When I looked at the HTML structure of the page, the "a" tags seem to be nested deep within the page's structure (they are inside an element which is within another element, which itself is inside another element). My question is how I would access the links then; do I have to go through all these elements?
The data you see on the page is loaded with JavaScript from a different URL, so BeautifulSoup doesn't see it. To load the data, you can use the following example:
import json
import requests
url = (
    "https://illinois.campuslabs.com/engage/api/discovery/search/organizations"
)
params = {"top": "10", "filter": "", "query": "badminton", "skip": "0"}
data = requests.get(url, params=params).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for v in data["value"]:
    print(
        "{:<50} {}".format(
            v["Name"],
            "https://illinois.campuslabs.com/engage/organization/"
            + v["WebsiteKey"],
        )
    )
Prints:
Badminton For Fun https://illinois.campuslabs.com/engage/organization/badminton4fun
Illini Badminton Intercollegiate Sports Club https://illinois.campuslabs.com/engage/organization/illinibadmintonintercollegiatesportsclub
If you take a look at the actual HTML returned by requests, you can see that none of the actual page content is loaded, suggesting that it's loaded client-side via Javascript, likely using an HTTP request to fetch the necessary data.
Here, the easiest solution would be to inspect the HTTP requests made by the site and look for an API endpoint that returns the organizations data. By checking the Network tab of Chrome DevTools, you can find this endpoint:
https://illinois.campuslabs.com/engage/api/discovery/search/organizations?top=10&filter=&query=badminton&skip=0
Here, you can see the JSON response for all of the organizations that are being loaded into the page by client-side JS. If you take a look at the JSON, you'll notice that a link isn't one of the keys returned, but it's easily constructed using the WebsiteKey key.
Putting all of this together:
import requests
import json
SEARCH_URL = "https://illinois.campuslabs.com/engage/api/discovery/search/organizations"
ORGANIZATION_URL = "https://illinois.campuslabs.com/engage/organization/"
search = "badminton"
resp = requests.get(
    SEARCH_URL,
    params={"top": 10, "filter": "", "query": search, "skip": 0}
)
organizations = json.loads(resp.text)["value"]
links = [ORGANIZATION_URL + organization["WebsiteKey"] for organization in organizations]
print(links)
Similar strategies can be used to find and use other API endpoints on the site, such as the organization categories.
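To see how the params dictionary maps onto the endpoint URL found in the Network tab, the query string can be built offline with the standard library. This is only a sketch: build_search_url is a hypothetical helper, and treating skip as a paging offset is an assumption based on the parameter names, not something the site documents here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://illinois.campuslabs.com/engage/api/discovery/search/organizations"


def build_search_url(query: str, top: int = 10, skip: int = 0) -> str:
    """Hypothetical helper: assemble the search endpoint URL for a query."""
    params = {"top": top, "filter": "", "query": query, "skip": skip}
    return f"{SEARCH_URL}?{urlencode(params)}"


first_page = build_search_url("badminton")
# Assumption: raising skip by top moves on to the next batch of results.
second_page = build_search_url("badminton", skip=10)
print(first_page)
print(second_page)
```

The first URL is identical to the one captured from Chrome DevTools, which is a quick way to confirm that the parameters passed to requests.get are the ones the page itself sends.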

How to parse and get clean image source from Bing/Google news feed?

I have created a program that will scrape Bing Newsfeed and analyze the content and email me the headline, a summary, and a link to the news. So far I have been able to get all of that correctly using BeautifulSoup.
I want to improve my program by also including an image of the news that gets displayed on the Bing Newsfeed page. I am having trouble getting the image source link because the source seems different.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.bing.com/news?q=Technology&cf=intr&FORM=NWRFSH').text
soup = BeautifulSoup(source, "html.parser")
for image in soup.find_all("div", class_="image right"):
    print(image.img)
If I run the code above, it prints some weird things that don't make much sense to me. Here is an example:
<img class="rms_img" height="132" id="emb249968768" src="/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90" width="234"/>
All the other img tags are also like this. As you can see the data-src here isn't ideal to get a link of the image that I can use when sending the email.
Can anyone take a look at the website (from my code) and inspect it a bit to see what I might be doing wrong or how I can get all the image links in a clean and usable way when sending the email? Thanks so much.
The src attribute of the img tag is perfectly fine and is what you will find on most websites. It's a relative URL (it has neither the scheme nor the domain name) with an absolute path (a path starting with a forward slash), so it's the client's responsibility (in this case, your code's) to rebuild the full absolute URL using the same scheme and domain name as the initial request, plus the path from the img tag. In your example, the end result should be something like "https://www.bing.com/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90" (which indeed points to the image).
NB: do not try to parse the URL into components yourself; just use the stdlib's urllib.parse module.
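The rebuilding step described above can be sketched with urllib.parse.urljoin, which resolves a relative reference against the URL of the page it came from. The two input strings are taken from the question; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# URL that was requested, and the src value found in the <img> tag.
page_url = "https://www.bing.com/news?q=Technology&cf=intr&FORM=NWRFSH"
img_src = "/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90"

# urljoin keeps the scheme and host of page_url and swaps in the new path and query.
absolute_src = urljoin(page_url, img_src)
print(absolute_src)
```

The same call also handles src values that are already absolute, in which case it returns them unchanged, so it can be applied to every image without checking first.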
It seems the answer from bruno desthuilliers no longer works.
One way to make the parser more reliable is to extract the data from inline JSON, which is where the images live. It changes less often than other parts of the website, such as CSS selectors.
You can't usefully parse the image data from the src attribute anymore. Well, you can, but you will get a 1x1 placeholder image.
An alternative is to combine inline JSON with a regex: first parse the image ID (emb23ACF3D86, for example), then use it in the match pattern to make sure you're extracting images from news results and not some random images.
Make sure you're sending a user-agent header, because Bing could detect that the request comes from a script: the default requests user-agent is python-requests, and that is what Bing sees when you make a request. Check what's your user-agent.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, re
params = {
    'q': 'Technology'
    # other params: https://serpapi.com/bing-news-api
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
html = requests.get('https://www.bing.com/news/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'html.parser')
news_data = []
all_script_tags = soup.select('script')
img_ids = [id['id'] for id in soup.select('.pubimg.rms_img, .rms_img')]  # emb23ACF3D86
for news, image_id in zip(soup.select('.card-with-cluster'), img_ids):
    # https://regex101.com/r/5XWmaF/1
    thumbnails = re.findall(r"processEmbImg\('{_id}','(.*?)'\);".format(_id=image_id), str(all_script_tags))
    # the result is a base64 image string that needs to be decoded;
    # it is decoded twice because, for some reason, the first pass
    # doesn't remove all the escaped Unicode characters.
    decoded_thumbnail = "".join([
        bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in thumbnails
    ])
    news_data.append({
        'title': news.select_one('.title').text,
        'link': news.select_one('.title')['href'],
        'image': decoded_thumbnail
    })
print(json.dumps(news_data, indent=2, ensure_ascii=False))
Outputs (try to copy the image link and paste it in your browser URL bar):
[
{
"title": "Flanders Technology: straffe aankondigingen en onthullingen",
"link": "https://doorbraak.be/flanders-technology-straffe-aankondigingen-en-onthullingen/",
"image": ""
}, ... other results
]
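The double unicode-escape decoding used in the code above can be tried in isolation. The sketch below assumes a doubly escaped input; the sample string is a made-up value for illustration, not real Bing output, and decode_twice is a hypothetical helper name.

```python
def decode_twice(escaped: str) -> str:
    """Apply unicode-escape decoding two times, as the scraper above does."""
    once = bytes(escaped, "ascii").decode("unicode-escape")
    return bytes(once, "ascii").decode("unicode-escape")


# Made-up sample: each "/" and "=" is written as a doubly escaped \x sequence.
sample = r"data:image\\x2fjpeg;base64,\\x2f9j\\x2f4AAQ\\x3d"
decoded = decode_twice(sample)
print(decoded)
```

The first pass turns each `\\x2f` into `\x2f`, and the second pass turns that into `/`, which is why a single decode would leave escape sequences in the data URI.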
If you don't want to deal with regex, bypassing blocks, or otherwise maintaining a parser, then the Bing News Engine Results API or the Google News Result API may be an option.
Here's an example of how to parse data from Bing/Google News and combine it into a single JSON string:
# Keep in mind that I was not using DRY methods here.
from serpapi import GoogleSearch
import json
news_data = {
    'bing_news': [],
    'google_news': []
}
for engine in ['bing_news', 'google_news']:
    if engine == 'bing_news':
        params = {
            "api_key": "<your-serpapi-api-key>",
            "device": "desktop",
            "engine": "bing_news",
            "q": "Coffee"
        }
        search = GoogleSearch(params)
        results = search.get_dict()
        for result in results['organic_results']:
            news_data['bing_news'].append({
                'title': result.get('title'),
                'link': result.get('link'),
                'image': result.get('thumbnail')
            })
    if engine == 'google_news':
        params = {
            "api_key": "<your-serpapi-api-key>",
            "device": "desktop",
            "engine": "google",
            "q": "Coffee",
            "gl": "us",
            "hl": "en",
            "tbm": "nws"
        }
        search = GoogleSearch(params)
        results = search.get_dict()
        for result in results['news_results']:
            news_data['google_news'].append({
                'title': result.get('title'),
                'link': result.get('link'),
                'image': result.get('thumbnail')
            })
print(json.dumps(news_data, indent=2, ensure_ascii=False))
Outputs:
{
  "bing_news": [
    {
      "title": "Is Decaf or Caffeinated Coffee Better for Heart Disease Symptoms?",
      "link": "https://news.yahoo.com/decaf-caffeinated-coffee-better-heart-194648652.html",
      "image": "https://serpapi.com/searches/63469624f05eb8bd3ec0eaa0/images/c9deaf41400f27622ff9680d72158ee9c74e042768bc6201d72f8b7031003236.gif"
    }, ... other bing news
  ],
  "google_news": [
    {
      "title": "9 Best Coffee Items on Sale for Amazon Prime Day 2022",
      "link": "https://www.thekitchn.com/prime-day-coffee-deals-october-2022-23459339",
      "image": "https://serpapi.com/searches/6346981060739305e5fed620/images/3283bbc090b4be4dafbc522fab6467927bd3225fd94f0f09c764eaa814e78117.jpeg"
    }, ... other google news
  ]
}

How extract description in a google search using python?

I want to extract the description from the google search,
now I have this code:
from urlparse import urlparse, parse_qs
import urllib
from lxml.html import fromstring
from requests import get
url='https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
v=[]
for result in pg.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print url[0]
This extracts the URLs related to the search. How can I extract the description that appears under each URL?
You can scrape the descriptions from Google Search results using the BeautifulSoup web scraping library.
To collect information from all pages, you can paginate with a while True loop. This is an endless loop that, in our case, exits once the button for switching to the next page is no longer present, which is checked with the CSS selector ".d6cvqb a[id=pnnext]":
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
You can use CSS selectors to find all the information you need (description, title, etc.). They are easy to identify on the page with the SelectorGadget Chrome extension (it does not always work perfectly if the website is rendered via JavaScript).
Make sure you send a user-agent request header so the request looks like a "real" user visit. The default requests user-agent is python-requests, and websites understand that it's most likely a script sending the request. Check what's your user-agent.
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",  # query
    "hl": "en",  # language
    "gl": "us",  # country of the search, US -> USA
    "start": 0,  # result offset, 0 is the first page
    #"num": 100  # parameter defines the maximum number of results to return.
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
page_num = 0
website_data = []
while True:
    page_num += 1
    print(f"page: {page_num}")
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        try:
            description = result.select_one(".lEBKkf").text
        except AttributeError:
            # select_one() returned None: this result has no description
            description = None
        website_data.append({
            "website_name": website_name,
            "description": description
        })
    # keep going while a "next page" button is present
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
{
"website_name": "https://www.imdb.com/title/tt3749900/",
"description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"website_name": "https://www.netflix.com/watch/80023082",
"description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
},
{
"website_name": "https://www.gothamknightsgame.com/",
"description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
},
# ...
]
Or you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create a parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",  # serpapi parser engine
    "q": "gotham",  # search query
    "num": "100"  # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}
search = GoogleSearch(params)  # where data extraction happens
organic_results_data = []
page_num = 0
while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })
    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Gotham (TV Series 2014–2019) - IMDb",
"snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"title": "Gotham (TV series) - Wikipedia",
"snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
},
# ...
]
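The pagination step in the SerpApi loop above turns the query string of next_link back into request parameters. That part uses only the standard library and can be tried offline. The next_link value below is a hypothetical example of what such a link might look like, not a captured response.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link; in the loop above it comes from results["serpapi_pagination"].
next_link = "https://serpapi.com/search.json?engine=google&num=100&q=gotham&start=100"

params = {"engine": "google", "q": "gotham", "num": "100"}

# urlsplit isolates the query string, parse_qsl turns it into (key, value) pairs,
# and update() overwrites or adds those keys in the existing parameters.
params.update(dict(parse_qsl(urlsplit(next_link).query)))
print(params)
```

After the update the parameters carry the extra start key, so the next request fetches the following page while everything else stays the same.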

Beautifulsoup doesn't reach a child element

I wrote the following code trying to scrape a google scholar page
import requests as req
from bs4 import BeautifulSoup as soup
url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'
session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 = html2bs.find('div', {'id':"gs_cit1"})
but the gs_citd gives me only this line <div aria-live="assertive" id="gs_citd"></div> and doesn't reach any level beneath it. Also gs_cit1 returns a None.
As appearing in this image
I want to reach the highlighted class to be able to grab the BibTeX citation.
Can you help, please!
OK, so I figured it out. I used the selenium module for Python, which creates a virtual browser, if you will, that lets you perform actions like clicking links and reading the resulting HTML. I ran into another issue while solving this: the page had to finish loading, otherwise the pop-up div only contained "Loading...". So I used time.sleep(2) from Python's time module to wait 2 seconds for the content to load. Then I parsed the resulting HTML with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from that anchor, and fetched it with the requests module. Finally, I wrote the decoded response to a local file, scholar.bib.
I installed chromedriver and selenium on my Mac using these instructions here:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then I signed my Python file to stop firewall issues, using these instructions:
Add Python to OS X Firewall Options?
The following is the code I used to produce the output file "scholar.bib":
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")
# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()
print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
# Get Page source after waiting for 2 seconds of current page in Chrome
source = driver.page_source
# We are done with the driver so quit.
driver.quit()
# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')
# Find anchors with the class "gs_citi"
gs_citt = soupify.find('a',{"class":"gs_citi"})
# Get the href attribute of the first anchor found
href = gs_citt['href']
print("Fetching: ", href)
# Instantiate a new requests session
session = req.Session()
# Get the response object of href
content = session.get(href)
# Get the content and then decode() it.
bibtex_html = content.content.decode()
# Write the decoded data to a file named scholar.bib
with open("scholar.bib","w") as file:
    file.writelines(bibtex_html)
Hope this helps anyone looking for a solution to this.
Scholar.bib file:
@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
  journal={Environment and Development Economics},
  volume={18},
  number={4},
  pages={504--516},
  year={2013},
  publisher={Cambridge University Press}
}
You can get the BibTeX data using beautifulsoup and requests by parsing the data-cid attribute, which is a unique publication ID. Store those IDs in a list, iterate over them, and make a request for each ID to get the BibTeX citation of that publication.
The example below will work for ~10-20 requests, then Google will throw a CAPTCHA or you'll hit the rate limit. The ideal solution is to use a CAPTCHA-solving service as well as proxies.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "samsung",
    "hl": "en"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    "server": "scholar",
    "referer": f"https://scholar.google.com/scholar?hl={params['hl']}&q={params['q']}",
}
def cite_ids() -> list:
    response = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    # returns a list of publication ID's -> U8bh6Ca9uwQJ
    return [result["data-cid"] for result in soup.select(".gs_or")]
def scrape_cite_results() -> list:
    bibtex_data = []
    for cite_id in cite_ids():
        response = requests.get(f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com", headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
        # selects first matched element which in this case always will be BibTeX
        # if Google will not switch BibTeX position.
        bibtex_data.append(soup.select_one(".gs_citi")["href"])
    # returns a list of BibTex URLs, for example: https://scholar.googleusercontent.com/scholar.bib?q=info:ifd-RAVUVasJ:scholar.google.com/&output=citation&scisdr=CgVDYtsfELLGwov-iJo:AAGBfm0AAAAAYgD4kJr6XdMvDPuv7R8SGODak6AxcJxi&scisig=AAGBfm0AAAAAYgD4kHUUPiUnYgcIY1Vo56muYZpFkG5m&scisf=4&ct=citation&cd=-1&hl=en
    return bibtex_data
Alternatively, you can achieve the same thing using the Google Scholar API from SerpApi, without having to find a proxy provider with good proxies or a CAPTCHA-solving service, and without figuring out how to scrape JavaScript-rendered data without browser automation.
Example to integrate:
import os
from serpapi import GoogleSearch
def organic_results() -> list:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "samsung",  # search query
        "hl": "en",  # language
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    return [result["result_id"] for result in results["organic_results"]]
def cite_results() -> list:
    citation_results = []
    for citation in organic_results():
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_cite",
            "q": citation
        }
        search = GoogleSearch(params)
        results = search.get_dict()
        for result in results["links"]:
            if "BibTeX" in result["name"]:
                citation_results.append(result["link"])
    return citation_results
If you would like to parse the data from all available pages, there's a dedicated blog post Scrape historic Google Scholar results using Python at SerpApi which is all about scraping historic 2017-2021 Organic, Cite results to CSV and SQLite using pagination.
Disclaimer, I work for SerpApi.

Scrape authors h-index, i10-index and total citations from Google Scholar

I am working on a project to scrape data from Google Scholar. I want to scrape an author's h-index, total citations, and i10-index (all). For example, for Louisa Gilbert I wish to scrape:
h-index = 36
i10-index = 74
citations = 4383
I have written this:
from bs4 import BeautifulSoup
import urllib.request
url="https://scholar.google.ca/citations?user=OdQKi7wAAAAJ&hl=en"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
but I am unsure how to continue. (I understand there are some libraries available, but none allow you to scrape h-index's and i10-index's.)
You are almost there. You need to find the HTML elements that contain the data you want to extract. In this particular case, the indexes are in <td class="gsc_rsb_std"> tags. You need to pick up these tags from the soup object and then use the string attribute to recover the text within each tag:
indexes = soup.find_all("td", "gsc_rsb_std")
h_index = indexes[2].string
i10_index = indexes[4].string
citations = indexes[0].string
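The index positions 0, 2 and 4 follow from the layout of the stats table: each row holds an "all" cell followed by a "since" cell. The dependency-free sketch below reproduces that ordering with the standard library's HTMLParser instead of BeautifulSoup. The sample markup is a hypothetical, trimmed-down table; the "all" values come from the question, while the "since" values and the gsc_rsb_st / gsc_rsb_sc1 names are assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down stats table. The "all" values are from the
# question; the "since" values (second cell of each row) are placeholders.
SAMPLE_HTML = """
<table id="gsc_rsb_st">
  <tr><td class="gsc_rsb_sc1">Citations</td><td class="gsc_rsb_std">4383</td><td class="gsc_rsb_std">1000</td></tr>
  <tr><td class="gsc_rsb_sc1">h-index</td><td class="gsc_rsb_std">36</td><td class="gsc_rsb_std">20</td></tr>
  <tr><td class="gsc_rsb_sc1">i10-index</td><td class="gsc_rsb_std">74</td><td class="gsc_rsb_std">40</td></tr>
</table>
"""


class StatCellParser(HTMLParser):
    """Collect the text of every <td class="gsc_rsb_std"> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "gsc_rsb_std" in classes:
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()


parser = StatCellParser()
parser.feed(SAMPLE_HTML)
# Even positions are the "all" column: citations, h-index, i10-index.
stats = {
    "citations": int(parser.cells[0]),
    "h_index": int(parser.cells[2]),
    "i10_index": int(parser.cells[4]),
}
print(stats)
```

If the page layout changes (for example, an extra column), the fixed positions stop lining up, so it is worth checking the number of collected cells before indexing into them.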
To scrape all of the information from a Google Scholar Author page, you could use a third-party solution like SerpApi. It's a paid API with a free trial.
Example Python code (libraries for other languages are also available):
from serpapi import GoogleSearch
params = {
    "api_key": "SECRET_API_KEY",
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "-muoO7gAAAAJ"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"cited_by": {
  "table": [
    {
      "citations": {
        "all": 7326,
        "since_2016": 2613
      }
    },
    {
      "h_index": {
        "all": 47,
        "since_2016": 27
      }
    },
    {
      "i10_index": {
        "all": 103,
        "since_2016": 79
      }
    }
  ]
}
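Because "table" is a list of single-key objects rather than one flat object, reading the three "all" values takes a small loop. Here is a minimal sketch that works on the JSON shown above, loaded from a string so it runs without an API key; summarize is a hypothetical helper, not part of the serpapi library.

```python
import json

# The example output from above, wrapped in braces so it parses as JSON.
sample = json.loads("""
{"cited_by": {"table": [
  {"citations": {"all": 7326, "since_2016": 2613}},
  {"h_index": {"all": 47, "since_2016": 27}},
  {"i10_index": {"all": 103, "since_2016": 79}}
]}}
""")


def summarize(results: dict) -> dict:
    """Hypothetical helper: flatten cited_by.table into {metric: all-time value}."""
    summary = {}
    for row in results.get("cited_by", {}).get("table", []):
        for metric, values in row.items():
            summary[metric] = values.get("all")
    return summary


author_stats = summarize(sample)
print(author_stats)
```

With a live response, the same function can be called on the dictionary returned by search.get_dict().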
You can check out the documentation for more details.
Disclaimer: I work at SerpApi.
