Python lxml web scraping Google only printing empty lists

I've looked around for a solution to this problem but for the life of me I cannot figure it out!
This is my first attempt at writing anything in python, and what I want my script to do is load a list of subjects from a text file, generate a Google search URL, and scrape these URLs one by one to output the amount of 'results found:' according to Google, in addition to the links of the top 15 results.
My problem is that when I run my code, all that is printed are empty lists:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
**END_OBJECT**
..etc.
Here is my code:
from lxml import html
import requests

def iterate():
    with open("list.txt", "r") as infile:
        for line in infile:
            if not line.strip():
                break
            yield line

output = open("statistic_out.txt", "w")

for line in iterate():
    raw = line
    output.write(raw + " services")
    request = raw.replace(" ", "%20")
    page = requests.get('https://www.google.com.au/search?safe=off&tbs=ctr:countryAU&cr=countryAU&q=' + request + 'services%20-yellowpages%20-abs', verify=False)
    path = html.fromstring(page.text)

    # This will create a list of buyers:
    resultCount = path.xpath('//*[@id="resultStats"]/text()')
    # This will create a list of prices
    print(resultCount)
    print('\n')

    resultUrlList1 = path.xpath('//*[@id="rso"]/div[2]/li[1]/div/h3/a/text()')
    resultUrlList2 = path.xpath('//*[@id="rso"]/div[2]/li[2]/div/h3/a/text()')
    resultUrlList3 = path.xpath('//*[@id="rso"]/div[2]/li[3]/div/h3/a/text()')
    resultUrlList4 = path.xpath('//*[@id="rso"]/div[2]/li[4]/div/h3/a/text()')
    resultUrlList5 = path.xpath('//*[@id="rso"]/div[2]/li[5]/div/h3/a/text()')
    resultUrlList6 = path.xpath('//*[@id="rso"]/div[2]/li[6]/div/h3/a/text()')
    resultUrlList7 = path.xpath('//*[@id="rso"]/div[2]/li[7]/div/h3/a/text()')
    resultUrlList8 = path.xpath('//*[@id="rso"]/div[2]/li[8]/div/h3/a/text()')
    resultUrlList9 = path.xpath('//*[@id="rso"]/div[2]/li[9]/div/h3/a/text()')
    resultUrlList10 = path.xpath('//*[@id="rso"]/div[2]/li[10]/div/h3/a/text()')
    resultUrlList11 = path.xpath('//*[@id="rso"]/div[2]/li[11]/div/h3/a/text()')
    resultUrlList12 = path.xpath('//*[@id="rso"]/div[2]/li[12]/div/h3/a/text()')
    resultUrlList13 = path.xpath('//*[@id="rso"]/div[2]/li[13]/div/h3/a/text()')
    resultUrlList14 = path.xpath('//*[@id="rso"]/div[2]/li[14]/div/h3/a/text()')
    resultUrlList15 = path.xpath('//*[@id="rso"]/div[2]/li[15]/div/h3/a/text()')

    print(resultUrlList1)
    print('\n')
    print(resultUrlList2)
    print('\n')
    print(resultUrlList3)
    print('\n')
    print(resultUrlList4)
    print('\n')
    print(resultUrlList5)
    print('\n')
    print(resultUrlList6)
    print('\n')
    print(resultUrlList7)
    print('\n')
    print(resultUrlList8)
    print('\n')
    print(resultUrlList9)
    print('\n')
    print(resultUrlList10)
    print('\n')
    print(resultUrlList11)
    print('\n')
    print(resultUrlList12)
    print('\n')
    print(resultUrlList13)
    print('\n')
    print(resultUrlList14)
    print('\n')
    print(resultUrlList15)
    print('\n')
    print("**END_OBJECT** \n")
The actual HTML structure is that of any Google search results page.
Any help would be greatly appreciated - as I am completely lost as to why this is occurring.
EDIT:
It appears that my script is hitting Google's anti-bot protections and path.content shows messages along the lines of "This page checks to see if it's really you sending the requests, and not a robot."
I'm unsure if there are easy ways to bypass this, though I will update if I find any.
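For now, a rough check I can use to spot the block page before parsing anything (the marker strings below are just guesses based on the message quoted above, and the query is only an example; the real script builds its URLs from list.txt):
import requests

# example query; the real script builds the URL from list.txt
page = requests.get(
    'https://www.google.com.au/search?q=plumbing%20services',
    verify=False,
)

# Marker strings are guesses taken from the interstitial text quoted above;
# Google's exact wording may differ.
block_markers = ("really you sending the requests", "not a robot")
if any(marker in page.text.lower() for marker in block_markers):
    print("Google served its bot-check page instead of search results")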

The problem is that you don't specify a user-agent. You need to send a user-agent header so that the request looks like a "real" user visit; a bot or browser sends a fake user-agent string to announce itself as a different client. The default requests user-agent is python-requests, and Google recognizes it, blocks the request, and returns a different HTML page (an error page) with different elements that your code can't recognize and find.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
Also, instead of creating 15 resultUrlList variables, you can iterate over all of the results in a for loop:
# .tF2Cxc CSS selector is a container with title, link and other data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'how to create minecraft server',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out how to make things work: how to bypass blocks, how to extract data, how to maintain the script over time if something in the HTML changes, and so on. Instead, you only need to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.

Related

Python requests 403 FORBIDDEN

I can't scrape 'https://www.upwork.com/'
I tried this code:
import requests
from bs4 import BeautifulSoup

url = "https://www.upwork.com/"
header = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15"}
requests = requests.get(url, headers=header)
soup = BeautifulSoup(requests.content, 'html.parser')
and also used:
import requests

headers = {
    'authority': 'www.upwork.com',
    'accept':
    ...
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
}

response = requests.get('https://www.upwork.com/', headers=headers)
The response is always 403.
The Upwork platform is heavily protected against bots and scrapers, so it is no surprise that your basic scraper gets detected and blocked immediately. In general, when you want to scrape big websites (Google, Amazon, Upwork, Freelancer, etc.), it is recommended to either build a complex scraper or use a third-party service. Let me expand on each case:
1. Build a complex scraper:
By complex scraper, I mean a web scraper that can go undetected. As an engineer at WebScrapingAPI and a researcher working on exactly this matter (namely web fingerprinting techniques), I can tell you that this task is a lot of work, even for an experienced programmer.
However, a good place for you to start would be to drop requests and use Selenium instead. Here is an example that successfully accesses the page:
from selenium import webdriver

# Initiate the webdriver
driver = webdriver.Chrome()

# Navigate to the website
driver.get('https://www.upwork.com/')

# Get the raw HTML and quit the driver
html = driver.page_source

# Print the HTML
print(html)

## You can then use this HTML content with BeautifulSoup, for example,
## in order to extract the desired elements from your page.
## Add your code below:

# Close the webdriver
driver.quit()
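For example, a minimal sketch of feeding that page_source into BeautifulSoup (the extractions below are generic placeholders, not Upwork-specific selectors; inspect the page and swap in the elements you actually need):
from bs4 import BeautifulSoup

# `html` is the page_source captured above, before driver.quit()
soup = BeautifulSoup(html, 'html.parser')

# Generic, always-available extractions; replace with real Upwork selectors.
print(soup.title.text if soup.title else 'No <title> found')
for link in soup.find_all('a', href=True)[:10]:
    print(link['href'])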
2. Use a third party provider:
There are quite a few web scraping providers. However, since I know the product we've built at WebScrapingAPI, I will recommend you use ours. I've even designed and tested a script that fetches data from Upwork here:
import json
import requests
from bs4 import BeautifulSoup

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.upwork.com/freelance-jobs/'
CATEGORY = 'python'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL + CATEGORY,
    "render_js": 1,
    "proxy_type": "residential",
    "extract_rules": '{"jobs":{"selector":"div.job-tile-wrapper","output":"html"}}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
data = json.loads(response.text)

for job in data['jobs']:
    soup = BeautifulSoup(job, 'html.parser')
    try:
        job_title = soup.find('a', attrs={'data-qa': 'job-title'}).text.strip()
        job_about = soup.find('p', attrs={'data-qa': 'job-description'}).text.strip()
        job_price = soup.select_one('div.row>p.col-6.col-sm-3.col-md-3.mb-0.pb-15.pb-md-20>strong').text.strip()
        job_level = soup.select_one('div.row>p.col-6.col-sm-4.mb-0.pb-15.pb-md-20>strong').text.strip()

        print('Title: ' + job_title)
        print('Price: ' + job_price)
        print('Level: ' + job_level)
        print('About: ' + job_about)
        print('\n')
    except:
        pass
Result:
Title: Web Scraping Using Python
Price: $50
Level: Intermediate
About: Deliverables are 2 Python Scripts:
1. I am looking for a Python Script that will allow me to export a JSON response and put into csv.…
Title: Think or Swim Trading Automation (TD Ameritrade) Bot creation
Price: $550
Level: Entry
About: I have a back tested program I wish to automated into a trading bot. See attached. This is a very simple program in theory.
Title: Resy Reservation Bot / Snipe
Price: $150
Level: Intermediate
About: I would like to create a Bot that allows me to book a restaurant reservation the moment it is posted on the website.
https://resy.com/…
Title: Emails are not being received in my email account
Price: $5
Level: Entry
About: Emails are not being received in my email account. I am using the Hestial control panel

Want to send a GET request in Python from a different country

So I want to scrape details from https://bookdepository.com
The problem is that it detects the country and changes the prices.
I want it to be a different country.
This is my code; I run it on repl.it, and I need the Book Depository website to think I'm from Israel.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
bookdepo_url = 'https://www.bookdepository.com/search?search=Find+book&searchTerm=' + "0671646788".replace(' ', "+")
search_result = requests.get(bookdepo_url, headers = headers)
soup = BeautifulSoup(search_result.text, 'html.parser')
result_divs = soup.find_all("div", class_= "book-item")
You would either need to route your requests through a proxy server, a VPN, or you would need to execute your code on a machine based in Israel.
That being said, the following works (as of the time of this writing):
import pprint

from bs4 import BeautifulSoup
import requests


def make_proxy_entry(proxy_ip_port):
    val = f"http://{proxy_ip_port}"
    return dict(http=val, https=val)


headers = {
    "User-Agent": (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
}
bookdepo_url = (
    'https://www.bookdepository.com/search?search=Find+book&searchTerm='
    '0671646788'
)
ip_opts = ['82.166.105.66:44081', '82.81.32.165:3128', '82.81.169.142:80',
           '81.218.45.159:8080', '82.166.105.66:43926', '82.166.105.66:58774',
           '31.154.189.206:8080', '31.154.189.224:8080', '31.154.189.211:8080',
           '213.8.208.233:8080', '81.218.45.231:8888', '192.116.48.186:3128',
           '185.138.170.204:8080', '213.151.40.43:8080', '81.218.45.141:8080']

search_result = None
for ip_port in ip_opts:
    proxy_entry = make_proxy_entry(ip_port)
    try:
        search_result = requests.get(bookdepo_url, headers=headers,
                                     proxies=proxy_entry)
        pprint.pprint('Successfully gathered results')
        break
    except Exception as e:
        pprint.pprint(f'Failed to connect to endpoint, with proxy {ip_port}.\n'
                      f'Details: {pprint.saferepr(e)}')
else:
    pprint.pprint('Never made successful connection to end-point!')
    search_result = None

if search_result:
    soup = BeautifulSoup(search_result.text, 'html.parser')
    result_divs = soup.find_all("div", class_="book-item")
    pprint.pprint(result_divs)
This solution makes use of the requests library's proxies parameter. I scraped a list of proxies from one of the many free proxy-list sites: http://spys.one/free-proxy-list/IL/
The list of proxy IP addresses and ports was created using the following JavaScript snippet to scrape data off the page via my browser's Dev Tools:
console.log(
    "['" +
    Array.from(document.querySelectorAll('td>font.spy14'))
        .map(e => e.parentElement)
        .filter(e => e.offsetParent !== null)
        .filter(e => window.getComputedStyle(e).display !== 'none')
        .filter(e => e.innerText.match(/\s*(\d{1,3}\.){3}\d{1,3}\s*:\s*\d+\s*/))
        .map(e => e.innerText)
        .join("', '") +
    "']"
)
Note: Yes, that JavaScript is ugly and gross, but it got the job done.
At the end of the Python script's execution, I do see that the final currency resolves, as desired, to Israeli New Shekel (ILS), based on elements like the following in the resultant HTML:
<a ... data-currency="ILS" data-isbn="9780671646783" data-price="57.26" ...>
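For instance, a short sketch (assuming the soup object from the snippet above and the data-currency attribute shown in that element; the exact markup may change over time) to confirm which currency the proxied response resolved to:
# Collect every data-currency value present in the parsed results.
currency_tags = soup.find_all(attrs={'data-currency': True})
currencies = {tag['data-currency'] for tag in currency_tags}
print(f'Currencies seen in results: {currencies}')  # expected: {'ILS'}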

I want to fetch the live stock price data through google search

I was trying to fetch the real-time stock price through Google search using web scraping, but it's giving me an error:
import requests
import bs4 as bs

resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs.BeautifulSoup(resp.text, 'lxml')
tab = soup.find('div', attrs={'class': 'gsrt'}).find('span').text
AttributeError: 'NoneType' object has no attribute 'find'
You could use
soup.select_one('td[colspan="3"] b').text
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent' : 'Mozilla/5.0'}
res = requests.get('https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8', headers = headers)
soup = bs(res.content, 'lxml')
quote = soup.select_one('td[colspan="3"] b').text
print(quote)
Try this maybe...
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
print(tab[3].text.strip())
or, if you only want the price..
resp = requests.get("https://www.google.com/search?q=apple+share+price&oq=apple+share&aqs=chrome.0.0j69i57j0l4.11811j1j7&sourceid=chrome&ie=UTF-8")
soup = bs(resp.text,'lxml')
tab = soup.find('div', class_='g').findAll('span')
price = tab[3].text.strip()
print(price[:7])
A user-agent is not specified in your request, which could be the reason why you were getting an empty result. Without it, Google treats your request as python-requests, i.e. an automated script, instead of a "real user" visit.
It's fairly easy to do:
Click on SelectorGadget Chrome extension (once installed).
Click on the stock price and receive a CSS selector provided by SelectorGadget.
Use this selector to get the data.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=nasdaq stock price', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)
>>> 177,33
Alternatively, you can do the same thing using Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The biggest difference in this example is that you don't have to figure out why something doesn't work when it should. Everything is already done for the end user (in this case, all the element selection and the logic for scraping the data), with JSON output.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "nasdaq stock price",
}

search = GoogleSearch(params)
results = search.get_dict()

current_stock_price = results['answer_box']['price']
print(current_stock_price)
>>> 177.42
Disclaimer, I work for SerpApi.

google search with python requests library

(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a google search using the search bar presented. I've read the quickstart guide but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (the cx parameter) from the Control Panel of the Custom Search Engine.
Then, see the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
And there are explained the list of parameters that you can use.
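As a sketch, the same kind of request can be issued from Python by passing those parameters as a dictionary (the cx value below is just the placeholder from the example URL; substitute your own CSE ID):
import requests

# Parameters mirror the example URL above; 'cx' must be replaced with your own CSE ID.
params = {
    'start': 0,
    'num': 10,
    'q': 'red sox',
    'cr': 'countryCA',
    'lr': 'lang_fr',
    'client': 'google-csbe',
    'output': 'xml_no_dtd',
    'cx': '00255077836266642015:u-scht7a-8i',  # placeholder from the example
}

response = requests.get('http://www.google.com/search', params=params)
print(response.url)          # the assembled name=value&... request URL
print(response.text[:500])   # XML results returned by the WebSearch service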
import requests
from bs4 import BeautifulSoup

headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}


def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class': 'r'}):  # this line may change in future based on google's web page structure
        url = searchWrapper.find('a')["href"]
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)

    return output
This will return an array of Google results in {'text': text, 'url': url} format. The top result's URL would be google('search query')[0]['url'].
input:
import requests

def googleSearch(query):
    with requests.session() as c:
        url = 'https://www.google.co.in'
        query = {'q': query}
        urllink = requests.get(url, params=query)
        print(urllink.url)

googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
    'q': 'minecraft',  # search query
    'gl': 'us',        # country where to search from
    'hl': 'en',        # language
}

requests.get('URL', params=params)
But, in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, most importantly the user-agent, which makes the request look like a "real" user visit; a bot or browser sends a fake user-agent string to announce itself as a different client.
The reason your request might be blocked is that the default requests user agent is python-requests, and websites understand that. Check what your user agent is.
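A quick way to see what requests sends by default (a small sketch using requests' own helpers):
import requests

# The User-Agent string requests announces by default, e.g. "python-requests/2.x.y"
print(requests.utils.default_user_agent())
print(requests.Session().headers['User-Agent'])  # the same value, as a Session sends it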
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "tesla",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer, I work for SerpApi.
In this code, using bs4, you can get all the h3 elements and print their text:
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4

# Make two strings with the default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them.
text = "c++ linear search program"
url = 'https://google.com/search?q=' + text

# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result = requests.get(url)

# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text, "html.parser")

# Find all h3 headings and print their text
headings = soup.find_all("h3")
for heading in headings:
    print(heading.get_text())
You can use 'webbrowser'; I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')

BeautifulSoup and urllib not parsing google images page

I am trying to use BeautifulSoup to find a random image from google images. My code looks like this.
import urllib, bs4, random
from urllib import request

urlname = "https://www.google.com/search?hl=en&q=" + str(random.randrange(999999)) + "&ion=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.42553238,d.dmg&biw=1354&bih=622&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=sNEfUf-fHvLx0wG7uoG4DQ"
page = bs4.BeautifulSoup(urllib.request.urlopen(urlname))
But whenever I try to get the HTML from the page object, I get:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I test the URLs that are generated by pasting them into my web browser, and the browser doesn't return this error. What's going on?
I am pretty sure that google is telling you: "Please don't do this". See this explanation of the http 403 error.
What is going on is that your Python script, or more specifically urllib, is sending headers telling Google that this is some kind of plain request that is not coming from a browser.
Google is doing that rightfully so, since otherwise many people would simply scrape their website and show the google results as their own.
There are two solutions that I can see so far.
1) Use the google custom search API. It supports image search and has a free quota of 100 queries per day - for more queries you will have to pay.
2) Tools like mechanize mislead websites by telling them that they are browsers, and not in fact scraping bots, e.g. by sending manipulated headers. A common issue here is that if your scraper is too greedy (too many requests in a short interval), Google will permanently block your IP address...
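For completeness, a minimal sketch of attaching a browser-like User-Agent to a urllib request (the header value is just an example, and Google may still block the request):
import urllib.request
import bs4

url = "https://www.google.com/search?hl=en&q=kittens&tbm=isch"  # example query
req = urllib.request.Request(
    url,
    # urllib's default User-Agent announces itself as Python-urllib
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
with urllib.request.urlopen(req) as resp:
    page = bs4.BeautifulSoup(resp.read(), "html.parser")
print(page.title)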
That is because there's no user-agent specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit; sending a browser-like user-agent fakes that visit.
To scrape Google Images, both thumbnail and full-resolution URLs, you need to parse the data from the page source inside <script> tags:
# find all <script> tags:
soup.select('script')

# match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

# match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data_json)

# Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
Code and full example in the online IDE that downloads Google Images as well:
import requests, lxml, re, json, shutil, urllib.request
from bs4 import BeautifulSoup
from py_random_words import RandomWords

random_word = RandomWords().get_word()

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": random_word,
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():
    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decoding, Unicode characters are still present. After the second iteration, they are decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nFull Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images
        print(f'Downloading {index} image...')

        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')


get_images_data()
Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the page source; instead, you only iterate over structured JSON and get what you want, quickly, without having to maintain the code over time.
Code to integrate to achieve your goal:
import os, urllib.request, json  # json for pretty output
from serpapi import GoogleSearch
from py_random_words import RandomWords

random_word = RandomWords().get_word()


def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": random_word,
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images

    for index, image in enumerate(results['images_results']):
        print(f'Downloading {index} image...')

        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')


get_google_images()
P.S - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
