How to extract the description from a Google search using Python? - python

I want to extract the description from the Google search results.
Right now I have this code:
from urlparse import urlparse, parse_qs
import urllib
from lxml.html import fromstring
from requests import get
url='https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
v=[]
for result in pg.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print url[0]
This extracts the URLs related to the search. How can I extract the description that appears under each URL?

You can scrape the descriptions from Google Search results using the BeautifulSoup web scraping library.
To collect information from all pages, you can paginate with a while True loop. The loop keeps running as long as the button for the next page is present, which is matched here by the CSS selector ".d6cvqb a[id=pnnext]". When the button is missing, the loop exits:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
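As a rough illustration of what incrementing "start" by 10 does, here is a minimal sketch that builds the URL for each page without sending any request. The helper name build_page_urls and the per_page default are assumptions made for this example, not part of the answer's code:

```python
from urllib.parse import urlencode

def build_page_urls(query, pages, per_page=10):
    """Return one search URL per page, advancing `start` by `per_page`."""
    urls = []
    for page in range(pages):
        # start=0 is the first page, start=10 the second, and so on
        params = {"q": query, "hl": "en", "start": page * per_page}
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls

for page_url in build_page_urls("gotham", 3):
    print(page_url)
```

In the real loop the number of pages is not known in advance, which is why the answer checks for the next-page button instead of counting.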
You can use CSS selectors to find all the information you need (description, title, etc.). They are easy to identify on the page with the SelectorGadget Chrome extension (which does not always work perfectly if the website is rendered via JavaScript).
Make sure you send a user-agent request header so the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what your user-agent is.
Check the code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",   # query
    "hl": "en",      # language
    "gl": "us",      # country of the search, US -> USA
    "start": 0,      # result offset: 0 is the first page, 10 the second, and so on
    #"num": 100      # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        try:
            description = result.select_one(".lEBKkf").text
        except AttributeError:
            # select_one() returned None: this result has no snippet
            description = None

        website_data.append({
            "website_name": website_name,
            "description": description
        })

    # keep going while a "next page" button is present
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
{
"website_name": "https://www.imdb.com/title/tt3749900/",
"description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"website_name": "https://www.netflix.com/watch/80023082",
"description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
},
{
"website_name": "https://www.gothamknightsgame.com/",
"description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
},
# ...
]
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create and maintain a parser.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "gotham",                    # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)         # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()       # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })

    # move to the next page by copying the query parameters of next_link
    if "next_link" in results.get("serpapi_pagination", {}):
        next_link = results["serpapi_pagination"]["next_link"]
        search.params_dict.update(dict(parse_qsl(urlsplit(next_link).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Gotham (TV Series 2014–2019) - IMDb",
"snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"title": "Gotham (TV series) - Wikipedia",
"snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
},
# ...
]
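The pagination step in the SerpApi loop turns the query string of next_link into a dictionary of parameters. A minimal sketch of just that step, using only the standard library; the helper name next_page_params and the example URL are made up for illustration and are not real SerpApi output:

```python
from urllib.parse import urlsplit, parse_qsl

def next_page_params(next_link):
    """Turn the query string of a pagination link into a dict of parameters."""
    return dict(parse_qsl(urlsplit(next_link).query))

# Hypothetical link, shaped like a typical "next page" URL
example_link = "https://serpapi.com/search.json?engine=google&q=gotham&num=100&start=100"
print(next_page_params(example_link))
```

Note that every value comes back as a string, so "start" is "100" rather than 100.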

Related

python web scraping for emails

I wrote this code to scrape email addresses from Google search results or websites, depending on the URL given. However, the output is always blank.
The only thing in the Excel sheet is the column name. I'm still new to Python, so I'm not sure why that's happening.
What am I missing here?
import requests
from bs4 import BeautifulSoup
import pandas as pd
url ="https://www.google.com/search?q=solicitor+bereavement+wales+%27email%27&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWelf5qGpc4uqy_C2cd583OKlSEcQ%3A1675616694195&ei=tuHfY83MC-aIrwSQ3qxY&ved=0ahUKEwjN_9jO7v78AhVmxIsKHRAvCwsQ4dUDCBA&uact=5&oq=solicitor+bereavement+wales+%27email%27&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBwgAEB4QogQyBwgAEB4QogQyBwgAEB4QogQ6CggAEEcQ1gQQsANKBAhBGABKBAhGGABQrAxY7xRg1xZoAXABeACAAdIBiAGmBpIBBTEuNC4xmAEAoAEByAEIwAEB&sclient=gws-wiz-serp"
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
email_addresses = []
for link in soup.find_all('a'):
    if 'mailto:' in link.get('href'):
        email_addresses.append(link.get('href').replace('mailto:', ''))
df = pd.DataFrame(email_addresses, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
First you need to extract all the snippets on the page:
for result in soup.select('.tF2Cxc'):
    snippet = result.select_one('.lEBKkf').text
Then a regular expression pulls the email address out of each snippet (if one is present):
match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
email = ''.join(match_email)
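To see what the regular expression step does on its own, here is a small self-contained sketch. It assumes addresses use the standard "@" separator; the helper name extract_emails and the sample snippet are invented for the example:

```python
import re

# local part, "@", then a domain with at least one dot
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

def extract_emails(snippet):
    """Return every email-like string found in a text snippet."""
    return EMAIL_RE.findall(snippet)

sample_snippet = "Contact us at info@example.com or sales@example.co.uk for details."
print(extract_emails(sample_snippet))
```

The pattern is deliberately loose: it is good enough for picking addresses out of search snippets, not for validating them.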
Also, instead of requesting a full URL, you can pass the query parameters separately (convenient if you need to change the query or other parameters):
params = {
    'q': 'intext:"gmail.com" solicitor bereavement wale',  # your query
    'hl': 'en',                                            # language
    'gl': 'us'                                             # country of the search, US -> USA
    # other parameters
}
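For a sense of what happens to such a dictionary, this sketch encodes it into a query string with the standard library, which is roughly what requests does with its params argument. The query text here is only an example:

```python
from urllib.parse import urlencode

query_params = {
    "q": 'intext:"gmail.com" solicitor bereavement wales',  # example query
    "hl": "en",
    "gl": "us",
}

# Special characters are percent-encoded and spaces become "+"
search_url = "https://www.google.com/search?" + urlencode(query_params)
print(search_url)
```

Changing the query then means editing one dictionary value instead of a long hand-written URL.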
Check full code in the online IDE.
import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

params = {
    'q': 'intext:"gmail.com" solicitor bereavement wale',  # your query
    'hl': 'en',                                            # language
    'gl': 'us'                                             # country of the search, US -> USA
}

html = requests.get("https://www.google.com/search",
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

data = []

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.find('a')['href']
    snippet = result.select_one('.lEBKkf').text

    # find email-like strings in the snippet text
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
    email = ''.join(match_email)

    data.append({
        'Title': title,
        'Link': link,
        'Email': email if email else None
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"Title": "Revealed: Billboard's 2022 Top Music Lawyers",
"Link": "https://www.billboard.com/wp-content/uploads/2022/03/march-28-2022-billboard-bulletin.pdf",
"Email": "cmellow.billboard@gmail.com"
},
{
"Title": "Folakemi Jegede, LL.B, BL, LLM, ACIS.'s Post - LinkedIn",
"Link": "https://www.linkedin.com/posts/folakemi-jegede-ll-b-bl-llm-acis-855a8a2a_lawyers-law-advocate-activity-6934498515867815936-9R6G?trk=posts_directory",
"Email": "OurlawandI@gmail.com"
},
other results ...
]
You can also use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create and maintain a parser.
Code example:
from serpapi import GoogleSearch
import os, json, re

params = {
    "engine": "google",                                    # search engine
    "q": 'intext:"gmail.com" solicitor bereavement wale',  # search query
    "api_key": "..."                                       # serpapi key from https://serpapi.com/manage-api-key
}

search = GoogleSearch(params)   # where data extraction happens
results = search.get_dict()     # JSON -> Python dictionary

data = []

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']

    # find email-like strings in the snippet text
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
    email = '\n'.join(match_email)

    data.append({
        'title': title,
        'link': link,
        'email': email if email else None,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Output: exactly the same as in the previous solution.
It's not finding the HTML you want because the HTML is loaded dynamically with JavaScript. You therefore need to execute the JavaScript to get all of the HTML.
The selenium module can be used to do this, but it requires a driver to interface with a given browser, so you'll need to install a browser driver before you can use it. The selenium documentation goes over the installation.
Once you have selenium set up, you can use this function to get all the HTML from the website. Pass its return value to the BeautifulSoup constructor.
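from selenium import webdriver
from time import sleep

def get_page_source(url):
    # Create the driver before the try block so that driver.quit()
    # in finally never runs on an unassigned name.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        sleep(3)  # crude wait for the JavaScript to finish rendering
        return driver.page_source
    finally:
        driver.quit()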

Why can I only scrape first 4 pages of results on eBay?

I have a simple script to analyze sold data on eBay (baseball trading cards). It seems to be working fine for the first 4 pages but on the 5th page it simply does not load in the desired html content anymore, and I am not able to figure out why this happens:
#Import statements
import requests
import time
from bs4 import BeautifulSoup as soup
from tqdm import tqdm
#FOR DEBUG
Page_1="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=1"
#Request URL working example
source=requests.get(Page_1)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]
#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])
#Works fine for Links_to_check[0] upto Links_to_check[3]
However, when I try to scrape the fifth page or further pages the following occurs:
Page_5="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=5"
source=requests.get(Page_5)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]
#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])
----> 5 Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
6 items=[]
7 #For all items on page perform desired operation
AttributeError: 'NoneType' object has no attribute 'find_all'
This seems a logical consequence of the ul class b-list__items_nofooter missing in the eBay_full soup for the later pages. The question however is why is this information missing? Scrolling through the soup, all items of interest seem to be absent. On the webpage itself this information is, as expected, present. Who can guide me?
As per @Sebastien D's remark, the problem has been solved.
In the headers variable put only one of these browsers, along with the current stable version number (e.g. Chrome/53.0.2785.143, latest found here)
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
source= requests.get(Page_5, headers=headers, timeout=2)
As Sebastien D suggested, the main problem is that eBay detects that a bot/script is sending the request.
But how does eBay detect it? The default requests user-agent is python-requests, and eBay seems to block requests made with that user-agent.
By adding a custom user-agent we can make the request look more like one from a real user. However, this is not completely reliable, and headers might need to be rotated and/or used with proxies, ideally residential ones.
List of user-agents at whatismybrowser.
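Rotating headers can be as simple as cycling through a list of user-agent strings. The sketch below uses only the standard library and builds the request objects without sending them; the two user-agent strings, the helper name build_request, and the search URL are placeholders chosen for illustration:

```python
import itertools
import urllib.request

# Placeholder user-agent strings; in practice use current, real browser values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]
_ua_cycle = itertools.cycle(USER_AGENTS)

def build_request(url):
    """Return a Request object carrying the next user-agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})

first = build_request("https://www.ebay.com/sch/i.html?_nkw=cards&_pgn=1")
second = build_request("https://www.ebay.com/sch/i.html?_nkw=cards&_pgn=2")
print(first.get_header("User-agent"))
print(second.get_header("User-agent"))
```

The same idea carries over to requests: pick the next string from the cycle and pass it in the headers dictionary of each call.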
As a side note, you can use the SelectorGadget Chrome extension to pick CSS selectors by clicking on the desired element in your browser. It does not always work perfectly if the page relies heavily on JS (in this case it works).
The example below shows how to extract listings from all pages. Code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    '_nkw': 'baseball trading cards',  # search query
    'LH_Sold': '1',                    # shows sold items
    '_pgn': 1                          # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
            "title" : title,
            "price" : price,
            "link" : link
        })

    # keep going while a "next page" link is present
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output
Extracting page: 1
----------
[
{
"title": "Shop on eBay",
"price": "$20.00",
"link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
},
{
"title": "Ken Griffey Jr. Seattle Mariners 1989 Topps Traded RC Rookie Card #41T",
"price": "$7.20",
"link": "https://www.ebay.com/itm/385118055958?hash=item59aad32e16:g:EwgAAOSwhgljI0Vm&amdata=enc%3AAQAHAAAAoFRRlvb50yb%2FN4cmlg5OtVDKIH0DsaMJBL3Tp67SI1dCSP1WPdZW3f16bTf4HTSUhX0g3OMmZSitEY3F3SVGg0%2FhSBF3ykE9X88Lo2EHuS2b23tA1kGiG92F9xyr73RLorcidserdH8tvUXhxmT4pJDnCfMAdfqtRzSIxcB6h4aDC1J1XvJ5IyRfYtWBGUQ60ykrA7mNlhH53cwZe5MiRSw%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
{
"title": "Ken Griffey Jr. 1989 Score Traded Rookie Card Gem 10 Auto Beckett 13604418",
"price": "$349.00",
"link": "https://www.ebay.com/itm/353982131344?hash=item526afaac90:g:9hQAAOSwvCpiQ5FY&amdata=enc%3AAQAHAAAAoOKm1SWvHtdNVIEqtE4m5%2B453xtvR75ZimUBLL16P0WwfJy%2BJJQ2Phd9crgAacTWlsqp9HB%2Ft0McttOjmCfyL0RDQB%2FYOWQK3hxj%2FoDRmybJRipjqb0JG2%2BCa1RhI04PN3R5wpH9vvYqefwY6JuAsPqDU0SmSk6h1h%2FQr7cfJqOmdCo0cqbwPcJ8OcvAyP07txigrDyO55XqFD7CHcSmUPA%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
{
"title": "Mike Jorgensen NY Mets MLB OF-1B 1972 Topps Baseball Card #16 Single Original",
"price": "$1.19",
"link": "https://www.ebay.com/itm/374255790865?hash=item5723622b11:g:KiwAAOSwz4ljI0G4&amdata=enc%3AAQAHAAAAoPVkKyeDZ7wbRNBwQppCcjVmLlOlY3ylPVwQyG7dfOy1UtPYhK7tRXtvn5v3M5n%2F35MS1LXLvWAioKRrMGPEPCmDoMkhdynuH3csaincrM%2F6JNwwIUFa3F%2FcylfPqnrxjJXF7cZ3ga9aCihTM6sfVJc1kzNkaBw2C2ewMyQ3ARgYpuDcUa6CMo4zBKF%2FGTj5KlZieLYywQm4dnzLCrFbtEM%3D%7Ctkp%3ABk9SR7rKxt7sYA"
},
# ...
]

How to parse and get clean image source from Bing/Google news feed?

I have created a program that will scrape Bing Newsfeed and analyze the content and email me the headline, a summary, and a link to the news. So far I have been able to get all of that correctly using BeautifulSoup.
I want to improve my program by also including an image of the news that gets displayed on the Bing Newsfeed page. I am having trouble getting the image source link because the source seems different.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.bing.com/news?q=Technology&cf=intr&FORM=NWRFSH').text
soup = BeautifulSoup(source, "html.parser")
for image in soup.find_all("div", class_="image right"):
print(image.img)
If I run the code above, it prints some weird things that don't make much sense to me. Here is an example:
<img class="rms_img" height="132" id="emb249968768" src="/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90" width="234"/>
All the other img tags are also like this. As you can see the data-src here isn't ideal to get a link of the image that I can use when sending the email.
Can anyone take a look at the website (from my code) and inspect it a bit to see what I might be doing wrong or how I can get all the image links in a clean and usable way when sending the email? Thanks so much.
The src attribute of the img tag is perfectly OK and just what you will find on most websites. It's a relative URL (it has neither the "scheme" nor the "domain name" part) with an absolute path (a path starting with a forward slash), so it's the client's responsibility (in this case your code's) to rebuild the full absolute URL from the scheme and domain name used for the initial request plus the path from the img tag. In your example, the end result should be something like "https://www.bing.com/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90" (which indeed points to the image).
NB: do not try to parse the URL into components yourself, just use the stdlib's urllib.parse module.
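As a minimal sketch of that rebuild step, urljoin from urllib.parse combines the URL of the requested page with the relative src value taken from the question:

```python
from urllib.parse import urljoin

# URL used for the initial request and the src value from the img tag
page_url = "https://www.bing.com/news?q=Technology&cf=intr&FORM=NWRFSH"
img_src = "/th?id=ON.B139539B9DC398104440D89FAFB6F0C2&pid=News&w=234&h=132&c=14&rs=2&qlt=90"

# urljoin keeps the scheme and domain of page_url and swaps in the new path and query
full_image_url = urljoin(page_url, img_src)
print(full_image_url)
```

If src is already an absolute URL, urljoin returns it unchanged, so the same call is safe for both cases.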
It seems the answer from bruno desthuilliers no longer works.
One way to make the parser more reliable is to parse data from inline JSON, which is where the images are. It changes less often than other parts of the website such as CSS selectors.
You can parse the image from the src attribute directly, but it will only be a 1x1 placeholder image.
An alternative is to combine inline JSON with a regex: parse the image ID (emb23ACF3D86, for example) beforehand and use it in the match pattern, so that you extract images from news results rather than random images.
Make sure you send a user-agent, because Bing can detect that a script is sending the request: the default requests user-agent is python-requests, and Bing sees it in the request headers. Check what your user-agent is.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, re

params = {
    'q': 'Technology'
    # other params: https://serpapi.com/bing-news-api
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

html = requests.get('https://www.bing.com/news/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'html.parser')

news_data = []

all_script_tags = soup.select('script')
img_ids = [img['id'] for img in soup.select('.pubimg.rms_img, .rms_img')]  # emb23ACF3D86

for news, image_id in zip(soup.select('.card-with-cluster'), img_ids):
    # https://regex101.com/r/5XWmaF/1
    thumbnails = re.findall(r"processEmbImg\('{_id}','(.*?)'\);".format(_id=image_id), str(all_script_tags))

    # The result is a base64 image string that still contains escape sequences.
    # It is decoded twice because the first pass does not remove
    # all of the escaped characters.
    decoded_thumbnail = "".join([
        bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
        for thumbnail in thumbnails
    ])

    news_data.append({
        'title': news.select_one('.title').text,
        'link': news.select_one('.title')['href'],
        'image': decoded_thumbnail
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
Outputs (try to copy the image link and paste it in your browser URL bar):
[
{
"title": "Flanders Technology: straffe aankondigingen en onthullingen",
"link": "https://doorbraak.be/flanders-technology-straffe-aankondigingen-en-onthullingen/",
"image": ""
}, ... other results
]
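The double decoding in the code above is easier to follow on a tiny input. The sketch below assumes the payload is escaped twice, so that a slash arrives as a backslash-escaped "\x2f"; the sample string is made up and much shorter than a real thumbnail, and the helper name decode_twice is hypothetical:

```python
def decode_twice(escaped):
    """Apply unicode-escape decoding two times to a doubly escaped string."""
    once = bytes(escaped, "ascii").decode("unicode-escape")   # "\\x2f" -> "\x2f"
    return bytes(once, "ascii").decode("unicode-escape")      # "\x2f"  -> "/"

# Made-up sample: every "/" is written as a doubly escaped \x2f
sample_payload = r"data:image\\x2fjpeg;base64,\\x2f9j\\x2f4AAQ"
print(decode_twice(sample_payload))
```

After one pass the string still contains literal "\x2f" sequences, which is why a single decode is not enough for input shaped like this.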
If you don't want to deal with regex, bypassing blocks, or otherwise maintaining a parser, then the Bing News Engine Results API or the Google News Result API may be an option.
Here's an example of how to parse data from Bing/Google News and combine it into a single JSON string:
# Keep in mind that I was not using DRY methods here.
from serpapi import GoogleSearch
import json

news_data = {
    'bing_news': [],
    'google_news': []
}

for engine in ['bing_news', 'google_news']:
    if engine == 'bing_news':
        params = {
            "api_key": "<your-serpapi-api-key>",
            "device": "desktop",
            "engine": "bing_news",
            "q": "Coffee"
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        for result in results['organic_results']:
            news_data['bing_news'].append({
                'title': result.get('title'),
                'link': result.get('link'),
                'image': result.get('thumbnail')
            })

    if engine == 'google_news':
        params = {
            "api_key": "<your-serpapi-api-key>",
            "device": "desktop",
            "engine": "google",
            "q": "Coffee",
            "gl": "us",
            "hl": "en",
            "tbm": "nws"
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        for result in results['news_results']:
            news_data['google_news'].append({
                'title': result.get('title'),
                'link': result.get('link'),
                'image': result.get('thumbnail')
            })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
Outputs:
{
"bing_news": [
{
"title": "Is Decaf or Caffeinated Coffee Better for Heart Disease Symptoms?",
"link": "https://news.yahoo.com/decaf-caffeinated-coffee-better-heart-194648652.html",
"image": "https://serpapi.com/searches/63469624f05eb8bd3ec0eaa0/images/c9deaf41400f27622ff9680d72158ee9c74e042768bc6201d72f8b7031003236.gif"
}, ... other bing news
],
"google_news": [
{
"title": "9 Best Coffee Items on Sale for Amazon Prime Day 2022",
"link": "https://www.thekitchn.com/prime-day-coffee-deals-october-2022-23459339",
"image": "https://serpapi.com/searches/6346981060739305e5fed620/images/3283bbc090b4be4dafbc522fab6467927bd3225fd94f0f09c764eaa814e78117.jpeg"
}, ... other google news
]

How to scrape google maps using python

I am trying to scrape the number of reviews of a place from google maps using python. For example the restaurant Pike's Landing (see google maps URL below) has 162 reviews. I want to pull this number in python.
URL: https://www.google.com/maps?cid=15423079754231040967
I am not very well versed in HTML, but from some basic examples on the internet I wrote the following code. What I get after running it is a blank variable. If you could let me know what I am doing wrong, that would be much appreciated.
from urllib.request import urlopen
from bs4 import BeautifulSoup
quote_page ='https://www.google.com/maps?cid=15423079754231040967'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find_all('button',attrs={'class':'widget-pane-link'})
print(price_box.text)
It's hard to do it in pure Python and without an API, here's what I ended with (note that I added &hl=en at the end of the url, to get English results and not in my language):
import re
import requests
from ast import literal_eval
urls = [
    'https://www.google.com/maps?cid=15423079754231040967&hl=en',
    'https://www.google.com/maps?cid=16168151796978303235&hl=en']
for url in urls:
    for g in re.findall(r'\[\\"http.*?\d+ reviews?.*?]', requests.get(url).text):
        data = literal_eval(g.replace('null', 'None').replace('\\"', '"'))
        print(bytes(data[0], 'utf-8').decode('unicode_escape'))
        print(data[1])
Prints:
http://www.google.com/search?q=Pike's+Landing,+4438+Airport+Way,+Fairbanks,+AK+99709,+USA&ludocid=15423079754231040967#lrd=0x51325b1733fa71bf:0xd609c9524d75cbc7,1
469 reviews
http://www.google.com/search?q=Sequoia+TreeScape,+Newmarket,+ON+L3Y+8R5,+Canada&ludocid=16168151796978303235#lrd=0x882ad2157062b6c3:0xe060d065957c4103,1
42 reviews
You need to view the source code of the page and parse window.APP_INITIALIZATION_STATE variable block using a regular expression, there you'll find all needed data.
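A heavily simplified sketch of the regular expression idea: it only looks for "N reviews" fragments in a block of text. The sample string stands in for the page source and is invented for this example; the real window.APP_INITIALIZATION_STATE block is far larger and its layout may differ. The helper name find_review_counts is also hypothetical:

```python
import re

def find_review_counts(page_text):
    """Return every review count that appears as 'N review(s)' in the text."""
    matches = re.findall(r"(\d[\d,]*) reviews?\b", page_text)
    # drop thousands separators before converting to int
    return [int(m.replace(",", "")) for m in matches]

# Invented stand-in for the page source
sample_source = 'window.APP_INITIALIZATION_STATE=[["x","469 reviews"],["y","1,042 reviews"]];'
print(find_review_counts(sample_source))
```

On a real page the same number can appear more than once, so the results would still need de-duplicating or matching to the right place.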
Alternatively, you can use Google Maps Reviews API from SerpApi.
Example JSON output:
"place_results": {
"title": "Pike's Landing",
"data_id": "0x51325b1733fa71bf:0xd609c9524d75cbc7",
"reviews_link": "https://serpapi.com/search.json?engine=google_maps_reviews&hl=en&place_id=0x51325b1733fa71bf%3A0xd609c9524d75cbc7",
"gps_coordinates": {
"latitude": 64.8299557,
"longitude": -147.8488774
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x51325b1733fa71bf%3A0xd609c9524d75cbc7%218m2%213d64.8299557%214d-147.8488774&engine=google_maps&google_domain=google.com&hl=en&type=place",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNtwheOCQ97QFrUNIwKYUoAPiV81rpiW5cIiQco=w152-h86-k-no",
"rating": 3.9,
"reviews": 839,
"price": "$$",
"type": [
"American restaurant"
],
"description": "Burgers, seafood, steak & river views. Pub fare alongside steak & seafood, served in a dining room with river views & a waterfront patio.",
"service_options": {
"dine_in": true,
"curbside_pickup": true,
"delivery": false
}
}
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google_maps",
    "type": "search",
    "q": "pike's landing",
    "ll": "@40.7455096,-74.0083012,14z",
    "google_domain": "google.com",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
reviews = results["place_results"]["reviews"]
print(reviews)
Output:
839
Disclaimer, I work for SerpApi.
Scraping Google Maps without a browser or proxies will get you blocked after a few successful requests. The main problem when scraping Google is therefore dealing with cookies and reCAPTCHA.
This is a good post where you can see an example of using selenium in Python for the same purpose. The general idea is that you start a browser and simulate what a user does on the website.
Another way is to use a reliable third-party service that does the whole job for you and returns the results. For example, you can try Outscraper's Reviews service, which has a free tier.
from outscraper import ApiClient
api_client = ApiClient(api_key='SECRET_API_KEY')
# Get reviews of the specific place by id
result = api_client.google_maps_reviews('ChIJrc9T9fpYwokRdvjYRHT8nI4', reviewsLimit=20, language='en')
# Get reviews for places found by search query
result = api_client.google_maps_reviews('Memphis Seoul brooklyn usa', reviewsLimit=20, limit=500, language='en')
# Get only new reviews during last 24 hours
from datetime import datetime, timedelta
yesterday_timestamp = int((datetime.now() - timedelta(1)).timestamp())
result = api_client.google_maps_reviews(
'ChIJrc9T9fpYwokRdvjYRHT8nI4', sort='newest', cutoff=yesterday_timestamp, reviewsLimit=100, language='en')
Disclaimer, I work for Outscraper.

BeautifulSoup.select Method

This script is supposed to take a command-line string, run it through the Google search engine and, if results are found, open the first 5 in different tabs. I am having some issues getting it to work. I think the problem is towards the bottom, where it says links = soup.select(".r a"). When I alter the values there, the next line shows an actual length, but running it as written shows a length of 0. I am trying to select the .r class and a tag because that seems to be where the search results start in the Google result source code.
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/#q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
    webbrowser.open("https://google.com/#q=" + links[i].get("href"))
Your logic is right except the URL for google search is not right.
It's gotta be
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
...
for i in range(numOpen):
    webbrowser.open("https://www.google.com" + links[i].get("href"))
Here is the full code:
import requests
import bs4
import sys
import webbrowser
print("Googling...")
response = requests.get("https://www.google.com/search?q=" + " ".join(sys.argv[1:]))
response.raise_for_status()
'''Function to return the top search result links'''
soup = bs4.BeautifulSoup(response.text, "html.parser")
'''Open a browser tab for each result'''
links = soup.select(".r a")
print(len(links))
numOpen = min(5, len(links))
for i in range(numOpen):
    webbrowser.open("https://www.google.com" + links[i].get("href"))
You are right! The problem comes from select(".r a").
I suggest you try find_all('a',{"data-uch":1}), which will find all a tags with the attribute data-uch = 1.
Explanation:
"If you look up a little from the element, though, there is an element like this: <h3 class="r">. Looking through the rest of the HTML source, it looks like the r class is used only for search result links."
The sentence above is from the book. However, in reality, if you print the soup variable, soup = bs4.BeautifulSoup(response.text, "html.parser"), you will not find any <h3 class="r"> in the HTML source code. That is why print(len(links)) always shows 0.
Instead of using min(5, len(links)) you can use slicing:
links = soup.select('.r a')[:5]
# or
for i in soup.select('.r a')[:5]:
    # other code..
You can also use the limit argument of find_all().
Make sure you send a user-agent. The default requests user-agent is python-requests, so Google knows the request comes from a bot rather than a "real" user visit, blocks it, and returns different HTML with some sort of error. A custom user-agent makes the request look like a user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en",
    "num": "100"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to pick the correct selector or how to bypass blocks from search engines, since that is already done for the end user. All that remains is to iterate over the structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])
---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer, I work for SerpApi.
