I've been learning Python and tried web scraping.
I managed to scrape the Google results page for a normal Google search, though the page I got back was a stripped-down version and I don't know why.
I tried the same for Google Images, and that page is stripped down as well. It doesn't look the same as it does in the browser.
Here's my code.
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for : ")
params = {"tbm": "isch", "source": "hp", "q": search}
r = requests.get("https://www.google.com/search", params=params)
print("URL :", r.url)
print("Status : ", r.status_code, "\n\n")
# Save the response so it can be compared with what the browser shows.
with open("ImageResult.html", "w", encoding="utf-8") as f:
    f.write(r.text)
For example, I search for "Goku".
The Google Image returns this page.
When I click on the first image, a popup opens. Or say I press ctrl+click. I reach this page.
On this page I can see that the actual image's URL can be accessed through maybe the current url or the link at the "View Image" button. But the issue is, I can't reach this page/popup in the version of the page that I am able to get when I request this page.
UPDATE : I'm sharing the page I am getting.
This depends on a lot of factors, such as the user agent string, cookies, and Google experiments. Google is known for serving the same content in different ways to different users. For a search, Google loads different pages based on site speed and user agent. Google also randomly runs experiments on search page design and the like before rolling them out publicly, implementing A/B testing dynamically.
Google Organic results have very little JavaScript, and you can still parse data from the <script> tags.
Besides that, the most common reason you don't see the same results as in your browser is that no user-agent is passed in the request headers. When no user-agent is specified, the requests library defaults to python-requests, so Google recognizes the request as coming from a bot or script and blocks it (or whatever it does), and you receive different HTML with different CSS selectors. Check what your user-agent is.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
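To see what Google receives when no header is set, the default value can be printed locally without sending any request. This is a minimal sketch: `requests.utils.default_user_agent()` and the session's default headers come from requests itself, while the browser string is simply the one used in the snippet above.

```python
import requests

# User-Agent that requests sends when none is specified.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# A Session carries the same default until it is overridden.
session = requests.Session()
session_default_ua = session.headers["User-Agent"]

# Overriding it once on the session applies to every later request made with it.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
})
print(session.headers["User-Agent"])
```

Setting the header on a session rather than on each call is only a convenience; passing `headers=` to every `requests.get()` has the same effect.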
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time trying to bypass blocks from Google and figuring out why certain things don't work as they should, and you don't have to maintain the parser over time.
Very simple example code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "how to create minecraft server",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.
Related
I'm trying to get the page source of an imgur website using requests, but the results I'm getting are different from the source. I understand that these pages are rendered using JS, but that is not what I am searching for.
It seems I'm getting redirected because they detect I'm using an automated client, but I'd prefer not to use selenium here. For example, see the following code to scrape the page source of two imgur IDs (one valid ID and one invalid ID) with different page sources.
import requests
from bs4 import BeautifulSoup
url1 = "https://i.imgur.com/ssXK5" #valid ID
url2 = "https://i.imgur.com/ssXK4" #invalid ID
def get_source(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
    page = requests.get(url, headers = headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup
page1 = get_source(url1)
page2 = get_source(url2)
print(page1==page2)
#True
The scraped page sources are identical, so I presume it's an anti-scraping thing. I know there is an imgur API, but I'd like to know how to circumvent such a redirection, if possible. Is there any way to get the actual source code using the requests module?
Thanks.
I took the code below from the answer How to use BeautifulSoup to parse google search results in Python
It used to work on my Ubuntu 16.04 and I have both Python 2 and 3.
The code is below:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser
text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)
#with open('output.html', 'wb') as f:
# f.write(response.content)
#webbrowser.open('output.html')
soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
It executes but prints nothing. The problem is really suspicious to me. Any help would be appreciated.
The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To specify a custom header, pass a dict containing User-Agent to the headers= parameter of requests.get():
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser
text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
Prints:
How to Write the Perfect Query Letter - Query Letter Examplehttps://www.writersdigest.com/.../how-to-write-the-perfect-qu...PuhverdatudTõlgi see leht21. märts 2016 - A literary agent shares a real-life novel pitch that ultimately led to a book deal—and shows you how to query your own work with success.
-----
Inimesed küsivad ka järgmistHow do you start a query letter?What should be included in a query letter?How do you end a query in an email?How long is a query letter?Tagasiside
-----
...and so on.
Learn more about user-agent and request headers.
Basically, the user-agent identifies the browser, its version number, and its host operating system. It represents a person (browser) in a web context and lets servers and network peers tell whether the request comes from a bot or not.
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
To make it look better, you can pass URL params as a dict(), which is more readable, and requests does the encoding for you automatically (the same goes for adding user-agent to headers):
params = {
"q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
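To check what requests does with the dict before anything is sent, a request can be prepared and inspected offline. This is a small sketch under the assumption that the target is the Google search endpoint used elsewhere in this thread; `requests.Request(...).prepare()` is part of requests, and nothing here touches the network.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
}
params = {"q": "My query goes here"}

# Build the request without sending it, to inspect what would go over the wire.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=params, headers=headers
).prepare()

print(prepared.url)                    # the query string is URL-encoded for you
print(prepared.headers["User-Agent"])  # the custom header is attached as given
```

The spaces in the query end up encoded in `prepared.url`, which is the same work that `urllib.parse.quote_plus` did by hand in the earlier code.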
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "My query goes here"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    print(title)
-------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from JSON string rather than figuring out how to extract, maintain or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "My query goes here",
"hl": "en",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
--------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Disclaimer, I work for SerpApi.
I am working on scraping data from Google Scholar using bs4 and urllib. I am trying to get the first year an article was published. For example, from this page I am trying to get the year 1996. This can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints out the year visible before the bar chart is clicked.
from bs4 import BeautifulSoup
import urllib.request
url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print (year)
The chart information comes from a different request, this one. There you can get the information you want with the following XPath:
'//span[@class="gsc_g_t"][1]/text()'
or in soup:
soup.find('span', {"class": "gsc_g_t"}).text
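To illustrate the attribute predicate used in that XPath, here is a minimal sketch on invented markup. It assumes the years sit in `span` elements with class `gsc_g_t`, as in the selectors above; everything else in the sample is made up. The standard library's ElementTree supports only a subset of XPath (no `text()` step), so the text is read from the element instead; lxml or parsel would accept the full expression.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the histogram markup; only the class name comes from the page.
sample = """
<div>
  <span class="gsc_g_t">1996</span>
  <span class="gsc_g_t">1997</span>
  <span class="gsc_g_al">12</span>
</div>
""".strip()

root = ET.fromstring(sample)

# [@class='gsc_g_t'] keeps only the spans whose class attribute matches exactly.
years = [span.text for span in root.findall(".//span[@class='gsc_g_t']")]

# Compare as integers so the earliest year wins regardless of document order.
first_year = min(years, key=int)
print(years, first_year)
```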
Make sure you're using an up-to-date user-agent. An old user-agent signals to the website that the request might come from a bot. But a new user-agent does not mean that every website will treat the request as a "real" user visit. Check what your user-agent is.
The code snippet is using parsel library which is similar to bs4 but it supports full XPath and translates every CSS selector query to XPath using the cssselect package.
Example code to integrate:
from collections import namedtuple
import requests
from parsel import Selector
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"user": "VGoSakQAAAAJ",
"hl": "en",
"view_op": "citations_histogram"
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
Publications = namedtuple("Years", "first_publication")
publications = Publications(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(selector.css(".gsc_g_t::text").get())
print(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(publications.first_publication)
# output:
'''
1996
1996
1996
'''
Alternatively, you can achieve the same thing by using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to parse the data and maintain the parser over time, figure out how to scale it, and bypass blocks from a search engine, such as Google Scholar search engine.
Example code to integrate:
from serpapi import GoogleScholarSearch
params = {
"api_key": "Your SerpApi API key",
"engine": "google_scholar_author",
"hl": "en",
"author_id": "VGoSakQAAAAJ"
}
search = GoogleScholarSearch(params)
results = search.get_dict()
# already sorted data
first_publication = [year.get("year") for year in results.get("cited_by", {}).get("graph", [])][0]
print(first_publication)
# 1996
If you want to scrape all Profile results based on a given query or you have a list of Author IDs, there's a dedicated scrape all Google Scholar Profile, Author Results to CSV blog post of mine about it.
Disclaimer, I work for SerpApi.
I've seen lots of questions regarding this subject, and I found out that Google has been updating the way its search engine APIs work.
This link > get the first 10 google results using googleapi shows EXACTLY what I need, but I don't know if it's possible to do that anymore.
I need this for my term paper, but from reading the Google docs I couldn't find a way to do it.
I've done the "get started" stuff and all I got was a private search engine using custom search engine (CSE).
Alternatively, you can use Python, Selenium and PhantomJS or other browsers to browse through Google's search results and grab the content. I haven't done that personally and don't know if there are challenges there.
I believe the best way would be to use their search APIs. Please try the one you pointed out. If it doesn't work, look for the new APIs.
I came across this question while trying to solve this problem myself and I found an updated solution to this.
Basically I used this guide at Google Custom Search to generate my own api key and search engine, then use python requests to retrieve the json results.
import requests

def search(query):
    api_key = 'MYAPIKEY'
    search_engine_id = 'MYENGINEID'
    url = "https://www.googleapis.com/customsearch/v1/siterestrict"
    # Let requests URL-encode the key, the engine ID, and the query.
    params = {"key": api_key, "cx": search_engine_id, "q": query}
    result = requests.Session().get(url, params=params)
    # Decode the JSON body into a dict.
    return result.json()
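Once `search()` returns the decoded JSON, pulling out the first results is a matter of walking the `items` list. The sketch below runs on a made-up payload rather than a live response; the field names `items`, `title`, `link`, and `snippet` follow the Custom Search JSON API response format, while the helper name `first_results` is hypothetical.

```python
import json

# Made-up payload in the shape returned by the Custom Search JSON API.
sample_response = json.loads("""
{
  "items": [
    {"title": "Example A", "link": "https://example.com/a", "snippet": "First result"},
    {"title": "Example B", "link": "https://example.com/b", "snippet": "Second result"}
  ]
}
""")

def first_results(response, limit=10):
    """Return up to `limit` (title, link) pairs; tolerate a missing "items" key."""
    return [(item["title"], item["link"]) for item in response.get("items", [])[:limit]]

for title, link in first_results(sample_response):
    print(title, link)
```

Using `.get("items", [])` keeps the loop from raising a KeyError when a response carries no results.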
I answered the question you attached via link.
Here's the link to that answer and full code example. I'll copy the code for faster access.
First way, using a custom script that returns JSON:
from bs4 import BeautifulSoup
import requests
import json
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=java&oq=java',
headers=headers).text
soup = BeautifulSoup(html, 'lxml')
summary = []
for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })
print(json.dumps(summary, indent=2, ensure_ascii=False))
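Because Google changes its class names and not every result has a snippet, a lookup like `container.find(...)` can return None and the `.text` access then raises AttributeError. A hedged sketch of a more forgiving version follows. It runs on invented HTML that reuses the class names from the script above; the helper `text_or_none` is hypothetical, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented markup; the second result deliberately has no snippet.
sample_html = """
<div class="tF2Cxc"><a href="https://example.com/a"><h3 class="LC20lb DKV0Md">First</h3></a>
  <span class="aCOpRe">Snippet A</span></div>
<div class="tF2Cxc"><a href="https://example.com/b"><h3 class="LC20lb DKV0Md">Second</h3></a></div>
"""

def text_or_none(node):
    """Return the stripped text of a tag, or None when the lookup found nothing."""
    return node.get_text(strip=True) if node else None

soup = BeautifulSoup(sample_html, "html.parser")
summary = []
for container in soup.select("div.tF2Cxc"):
    link = container.select_one("a[href]")
    summary.append({
        "Heading": text_or_none(container.select_one("h3")),
        "Article Summary": text_or_none(container.select_one("span.aCOpRe")),
        "Link": link["href"] if link else None,
    })

print(summary)
```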
Using Google Search Engine Results API from SerpApi:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "java",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(f"Title: {result['title']}\nLink: {result['link']}\n")
Disclaimer, I work for SerpApi.
I am trying to use BeautifulSoup to find a random image from google images. My code looks like this.
import urllib, bs4, random
from urllib import request
urlname = "https://www.google.com/search?hl=en&q=" + str(random.randrange(999999)) + "&ion=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.42553238,d.dmg&biw=1354&bih=622&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=sNEfUf-fHvLx0wG7uoG4DQ"
page = bs4.BeautifulSoup(urllib.request.urlopen(urlname))
But whenever I try to get the HTML from the page object, I get:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I test the URLs that are generated by pasting them into my web browser, and the browser doesn't return this error. What's going on?
I am pretty sure that google is telling you: "Please don't do this". See this explanation of the http 403 error.
What is going on is that your python script, or more specifically urllib is sending headers, telling google that this is some kind of plain request, which is not coming from a browser.
Google is doing that rightfully so, since otherwise many people would simply scrape their website and show the google results as their own.
There are two solutions that I can see so far.
1) Use the Google Custom Search API. It supports image search and has a free quota of 100 queries per day; for more queries you will have to pay.
2) Use tools like mechanize, which mislead websites by telling them they are browsers and not scraping bots, e.g. by sending manipulated headers. A common issue here is that if your scraper is too greedy (too many requests in a short interval), Google will permanently block your IP address...
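To see the header this answer is talking about, urllib's default User-Agent can be inspected locally, and a request carrying a browser-like header can be built without sending anything. This is a minimal sketch: the query parameters and the browser string are placeholders, and whether Google accepts such a request is not guaranteed.

```python
import urllib.parse
import urllib.request

# Headers urllib adds to every request opened through a default opener.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # a single User-agent entry of the form Python-urllib/3.x

# Placeholder query and browser string, used here only for illustration.
BROWSER_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
query = urllib.parse.urlencode({"q": "example", "tbm": "isch", "hl": "en"})
url = "https://www.google.com/search?" + query

# Request stores header names capitalized, hence "User-agent" in the lookup.
request = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
print(request.get_header("User-agent"))

# Sending it is one more line, left commented out to keep the sketch offline:
# html = urllib.request.urlopen(request, timeout=30).read()
```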
That is because no user-agent is specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it comes from a bot and not from a "real" user visit; passing a browser user-agent fakes such a visit.
To scrape Google Images, both thumbnail and full resolution URLs, you need to parse the data from the page source inside <script> tags:
# find all <script> tags:
all_script_tags = soup.select('script')
# match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
# match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
matched_images_data_json)
# Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
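The double decoding step can be tried in isolation. The sketch below uses an invented URL whose backslashes are doubled, which is what calling str() on a list of strings produces; the helper name `decode_escaped_url` is hypothetical. It also shows that, for a str, json.dumps() followed by json.loads() returns the same text, so the actual unescaping is done by the unicode-escape codec.

```python
import json

def decode_escaped_url(text, passes=2):
    """Apply the unicode-escape codec `passes` times to an ASCII string."""
    for _ in range(passes):
        text = bytes(text, "ascii").decode("unicode-escape")
    return text

# Invented URL; the doubled backslashes mimic what str() on a list produces.
escaped = r"https://example.com/image.jpg?w\\u003d640\\u0026h\\u003d480"

once = decode_escaped_url(escaped, passes=1)   # backslashes collapse, \u003d remains
twice = decode_escaped_url(escaped, passes=2)  # \u003d and \u0026 become = and &
print(once)
print(twice)

# For a str, dumps followed by loads gives back the same text.
round_trip = json.loads(json.dumps(escaped))
print(round_trip == escaped)
```

When the input carries only single backslashes, the second pass leaves an already clean URL unchanged, so running two passes is harmless in that case.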
Code and full example in the online IDE that downloads Google Images as well:
import requests, lxml, re, json, shutil, urllib.request
from bs4 import BeautifulSoup
from py_random_words import RandomWords
random_word = RandomWords().get_word()
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": random_word,
"tbm": "isch",
"hl": "en",
"ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
def get_images_data():
    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')
    # this steps could be refactored to a more compact
    all_script_tags = soup.select('script')
    # # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)
    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')
    print('Google Image Thumbnails:') # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)
    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)
    print('\nFull Resolution Images:') # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)
        # ------------------------------------------------
        # Download original images
        print(f'Downloading {index} image...')
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')
get_images_data()
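One detail the download step relies on is that the Bs4_Images directory already exists, since urlretrieve() writes to the given path and does not create missing folders. A small sketch of a helper that prepares the target path is shown here; the function name `image_path` is hypothetical, and the file naming just mirrors the pattern used in the script.

```python
from pathlib import Path

def image_path(directory, index, suffix=".jpg"):
    """Create `directory` if needed and return the file path for image `index`."""
    folder = Path(directory)
    # parents=True builds intermediate folders; exist_ok=True makes repeat calls safe.
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"original_size_img_{index}{suffix}"
```

The returned path can then be passed to the download call, for example `urllib.request.urlretrieve(original_size_img, image_path('Bs4_Images', index))`.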
Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract needed data from the page source, instead, you only need to iterate over structured JSON and get what you want, fast, and don't need to maintain it over time.
Code to integrate to achieve your goal:
import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch
from py_random_words import RandomWords
random_word = RandomWords().get_word()
def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": random_word,
        "tbm": "isch"
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        print(f'Downloading {index} image...')
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')
get_google_images()
P.S - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.