I took the code below from an answer to "How to use BeautifulSoup to parse google search results in Python".
It used to work on my Ubuntu 16.04 machine, where I have both Python 2 and 3 installed.
The code is below:
import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser
text = 'My query goes here'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)
#with open('output.html', 'wb') as f:
# f.write(response.content)
#webbrowser.open('output.html')
soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
It executes but prints nothing. The problem is really suspicious to me. Any help would be appreciated.
The problem is that Google serves different HTML when the request has no User-Agent header. To send a custom header, pass a dict containing User-Agent to the headers= parameter of requests.get():
import urllib.parse
from bs4 import BeautifulSoup
import requests
text = 'My query goes here'
# URL-encode the query so spaces and special characters are safe
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
# Send a browser-like User-Agent instead of the python-requests default
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
Prints:
How to Write the Perfect Query Letter - Query Letter Examplehttps://www.writersdigest.com/.../how-to-write-the-perfect-qu...PuhverdatudTõlgi see leht21. märts 2016 - A literary agent shares a real-life novel pitch that ultimately led to a book deal—and shows you how to query your own work with success.
-----
Inimesed küsivad ka järgmistHow do you start a query letter?What should be included in a query letter?How do you end a query in an email?How long is a query letter?Tagasiside
-----
...and so on.
Learn more about user-agent and request headers.
Basically, the user-agent string identifies the browser, its version number, and its host operating system. It represents the client (browser) acting on behalf of a person in a Web context and lets servers and network peers decide whether a request comes from a bot.
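To see why a bare requests call is easy to spot, it can help to print the User-Agent that requests sends by default and compare it with a browser-like one. The following is a minimal sketch, not part of the original answer: it builds a request without sending it, and the URL and header value are only illustrative.

```python
import requests

# User-Agent that requests sends when you do not set one yourself
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Browser-like User-Agent (illustrative value; any current browser string works the same way)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
}

# Prepare the request through a Session without sending it,
# so we can inspect the headers that would go over the wire
session = requests.Session()
request = requests.Request("GET", "https://www.google.com/search", headers=headers)
prepared = session.prepare_request(request)
sent_ua = prepared.headers["User-Agent"]
print(sent_ua)
```

The first line printed starts with python-requests/, which is the value a server sees when no header is passed.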
Have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors by clicking on the desired element in your browser. See also the CSS selectors reference.
To make the code more readable, you can pass URL parameters as a dict; requests will then encode them into the URL for you (the same goes for adding the user-agent to headers):
params = {
    "q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
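To check what URL requests builds from such a dict, you can prepare the request without sending it. This is a small sketch under the assumption that the target is the Google search endpoint used elsewhere in this answer; no network call is made.

```python
import requests

params = {"q": "My query goes here"}

# Prepare the request without sending it, to inspect the final URL
prepared = requests.Request("GET", "https://www.google.com/search", params=params).prepare()
final_url = prepared.url
print(final_url)
```

The spaces in the query are encoded for you, so there is no need to call quote_plus by hand.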
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "My query goes here"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    print(title)
-------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to pick the data you want out of the returned JSON, rather than working out how to extract it from HTML, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
--------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''
Disclaimer, I work for SerpApi.
Related
I can't scrape 'https://www.upwork.com/'
I tried this code:
import requests
from bs4 import BeautifulSoup
url = "https://www.upwork.com/"
header={'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15"}
# Use a separate name so the requests module is not overwritten
response=requests.get(url,headers=header)
soup=BeautifulSoup(response.content,'html.parser')
I also tried:
import requests
headers = {
    'authority': 'www.upwork.com',
    'accept':
    ...
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
}
response = requests.get('https://www.upwork.com/', headers=headers)
The response status is always 403.
The Upwork platform is heavily protected against bots and scrapers, so it is no surprise that your basic scraper gets detected and blocked immediately. In general, when you want to scrape big websites (Google, Amazon, Upwork, Freelancer, etc.), it is recommended to either build a complex scraper or use a third-party service. Let me explain each case:
1. Build a complex scraper:
By complex scraper, I mean a web scraper that can go undetected. As an engineer at WebScrapingAPI and a researcher working exactly on this matter (namely web fingerprinting techniques), I can tell you that this task is a lot, even for an experienced programmer.
However, a good place to start would be to drop requests and use Selenium instead. Here is an example that successfully accesses the site:
from selenium import webdriver
# Initiate the webdriver
driver = webdriver.Chrome()
# Navigate to website
driver.get('https://www.upwork.com/')
# Get raw HTML
html = driver.page_source
# Print the HTML
print(html)
## You can then use this HTML content with BeautifulSoup for example
## in order to extract the desired elements from your page
## Add your code below:
# Close the webdriver
driver.quit()
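As a sketch of that last step, the snippet below parses an HTML string with BeautifulSoup. The markup and the job-title class name are made up for illustration; on the real site you would pass driver.page_source and use selectors taken from the page itself.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; this markup is hypothetical
html = """
<html><body>
  <h2 class="job-title">Web Scraping Using Python</h2>
  <h2 class="job-title">Resy Reservation Bot / Snipe</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the text of every element matching the (hypothetical) selector
titles = [tag.get_text(strip=True) for tag in soup.select("h2.job-title")]
print(titles)
```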
2. Use a third party provider:
There are quite a few web scraping providers. However, since I know the product we've built at WebScrapingAPI, I will recommend you use ours. I've even designed and tested a script that fetches data from Upwork here:
import json
import requests
from bs4 import BeautifulSoup
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.upwork.com/freelance-jobs/'
CATEGORY = 'python'
PARAMS = {
    "api_key":API_KEY,
    "url": TARGET_URL + CATEGORY,
    "render_js":1,
    "proxy_type":"residential",
    "extract_rules":'{"jobs":{"selector":"div.job-tile-wrapper","output":"html"}}',
}
response = requests.get(SCRAPER_URL, params=PARAMS)
# Use a separate name so the json module is not overwritten
data = json.loads(response.text)
for job in data['jobs']:
    soup = BeautifulSoup(job, 'html.parser')
    try:
        job_title = soup.find('a', attrs={'data-qa':'job-title'}).text.strip()
        job_about = soup.find('p', attrs={'data-qa':'job-description'}).text.strip()
        job_price = soup.select_one('div.row>p.col-6.col-sm-3.col-md-3.mb-0.pb-15.pb-md-20>strong').text.strip()
        job_level = soup.select_one('div.row>p.col-6.col-sm-4.mb-0.pb-15.pb-md-20>strong').text.strip()
        print('Title: ' + job_title)
        print('Price: ' + job_price)
        print('Level: ' + job_level)
        print('About: ' + job_about)
        print('\n')
    except AttributeError:
        # Skip job tiles that are missing one of the expected elements
        pass
Result:
Title: Web Scraping Using Python
Price: $50
Level: Intermediate
About: Deliverables are 2 Python Scripts:
1. I am looking for a Python Script that will allow me to export a JSON response and put into csv.…
Title: Think or Swim Trading Automation (TD Ameritrade) Bot creation
Price: $550
Level: Entry
About: I have a back tested program I wish to automated into a trading bot. See attached. This is a very simple program in theory.
Title: Resy Reservation Bot / Snipe
Price: $150
Level: Intermediate
About: I would like to create a Bot that allows me to book a restaurant reservation the moment it is posted on the website.
https://resy.com/…
Title: Emails are not being received in my email account
Price: $5
Level: Entry
About: Emails are not being received in my email account. I am using the Hestial control panel
I've been learning Python and tried web scraping.
I managed to scrape the Google results page for a normal Google search, though the page I got was a degraded version and I don't know why.
I tried the same for Google Images, and the page is degraded as well. It doesn't look the same as it does in the browser.
Here's my code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for : ")
params = {"tbm": "isch", "source": "hp", "q": search}
r = requests.get("https://www.google.com/search", params=params)
print("URL :", r.url)
print("Status : ", r.status_code, "\n\n")
# The with block closes the file once the response has been written
with open("ImageResult.html", "w+", encoding="utf-8") as f:
    f.write(r.text)
For example, I search for "Goku".
The Google Image returns this page.
When I click on the first image, a popup opens. Or say I press ctrl+click. I reach this page.
On this page I can see that the actual image's URL can be accessed through maybe the current url or the link at the "View Image" button. But the issue is, I can't reach this page/popup in the version of the page that I am able to get when I request this page.
UPDATE : I'm sharing the page I am getting.
This depends on many factors, such as the user-agent string, cookies, and Google experiments. Google is known for serving the same content in different forms to different users. For search, Google loads different pages based on site speed and user agent. Google also randomly runs experiments on search page design and similar details, A/B testing changes dynamically before rolling them out to the public.
Google Organic results have very little JavaScript and you still can parse data from the <script> tags.
Besides that, the most common reason you don't see the same results as in your browser is that no user-agent is passed in the request headers. When no user-agent is specified, the requests library defaults to python-requests, so Google can tell that the request comes from a bot or script. It then blocks the request (or otherwise treats it differently), and you receive different HTML with different CSS selectors. Check what your user-agent is.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time trying to bypass blocks from Google and figuring out why certain things don't work as they should, and you don't have to maintain the parser over time.
Very simple example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.
I am working on scraping data from Google Scholar using bs4 and urllib. I am trying to get the first year in which an article was published. For example, from this page I am trying to get the year 1996. This can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints out the year visible before the bar chart is clicked.
from bs4 import BeautifulSoup
import urllib.request
url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print (year)
The chart information comes from a different request, this one. There you can get the information you want with the following XPath:
'//span[@class="gsc_g_t"][1]/text()'
or in soup:
soup.find('span', {"class": "gsc_g_t"}).text
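A small sketch of how that XPath behaves, using lxml on a hand-written fragment. The HTML here is an assumption that only mimics the class name from the answer, not the real Scholar response.

```python
from lxml import html

# Hypothetical fragment that mimics the year labels of the histogram
fragment = """
<div>
  <span class="gsc_g_t">1996</span>
  <span class="gsc_g_t">1997</span>
  <span class="gsc_g_t">1998</span>
</div>
"""

tree = html.fromstring(fragment)
# All year labels, in document order
years = tree.xpath('//span[@class="gsc_g_t"]/text()')
# Only the first matching span under each parent
first_year = tree.xpath('//span[@class="gsc_g_t"][1]/text()')[0]
print(years, first_year)
```

Note that the attribute test is written with @class; a # in that position is not valid XPath.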
Make sure you're using an up-to-date user-agent. An old user-agent is a signal to the website that the request might come from a bot. But a new user-agent does not guarantee that every website will treat the request as a "real" user visit. Check what your user-agent is.
The code snippet below uses the parsel library, which is similar to bs4 but supports full XPath and translates every CSS selector query to XPath using the cssselect package.
Example code to integrate:
from collections import namedtuple
import requests
from parsel import Selector
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"user": "VGoSakQAAAAJ",
"hl": "en",
"view_op": "citations_histogram"
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
Publications = namedtuple("Years", "first_publication")
publications = Publications(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(selector.css(".gsc_g_t::text").get())
print(sorted([publication.get() for publication in selector.css(".gsc_g_t::text")])[0])
print(publications.first_publication)
# output:
'''
1996
1996
1996
'''
Alternatively, you can achieve the same thing by using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out how to parse the data and maintain the parser over time, figure out how to scale it, and bypass blocks from a search engine, such as Google Scholar search engine.
Example code to integrate:
from serpapi import GoogleScholarSearch
params = {
"api_key": "Your SerpApi API key",
"engine": "google_scholar_author",
"hl": "en",
"author_id": "VGoSakQAAAAJ"
}
search = GoogleScholarSearch(params)
results = search.get_dict()
# already sorted data
first_publication = [year.get("year") for year in results.get("cited_by", {}).get("graph", [])][0]
print(first_publication)
# 1996
If you want to scrape all Profile results based on a given query or you have a list of Author IDs, there's a dedicated scrape all Google Scholar Profile, Author Results to CSV blog post of mine about it.
Disclaimer, I work for SerpApi.
I've seen lots of questions on this subject, and I found out that Google has been updating the way its search engine APIs work.
This link, "get the first 10 google results using googleapi", shows EXACTLY what I need, but I don't know if it's still possible to do that.
I need this for my term paper, but after reading the Google docs I couldn't find a way to do it.
I've done the "get started" steps, and all I got was a private search engine using Custom Search Engine (CSE).
Alternatively, you can use Python, Selenium and PhantomJS or other browsers to browse through Google's search results and grab the content. I haven't done that personally and don't know if there are challenges there.
I believe the best way would be to use their search APIs. Please try the one you pointed out. If it doesn't work, look for the new APIs.
I came across this question while trying to solve this problem myself and I found an updated solution to this.
Basically I used this guide at Google Custom Search to generate my own api key and search engine, then use python requests to retrieve the json results.
import requests

def search(query):
    api_key = 'MYAPIKEY'
    search_engine_id = 'MYENGINEID'
    url = "https://www.googleapis.com/customsearch/v1/siterestrict"
    # Let requests URL-encode the key, engine ID and query
    params = {'key': api_key, 'cx': search_engine_id, 'q': query}
    result = requests.get(url, params=params)
    # Parse the JSON body without shadowing the json module name
    return result.json()
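The Custom Search JSON API returns its results under an items key, where each entry carries fields such as title, link and snippet. Below is a minimal sketch of pulling the first ten links out of such a response. The sample dictionary is hand-written to stand in for the value returned by search(), and first_links is a hypothetical helper name.

```python
def first_links(response, limit=10):
    """Return up to `limit` result links from a Custom Search style response."""
    # "items" is absent when there are no results, so default to an empty list
    items = response.get("items", [])
    return [item["link"] for item in items[:limit]]


# Hand-written stand-in for the parsed JSON returned by search()
sample = {
    "items": [
        {"title": "Example A", "link": "https://example.com/a", "snippet": "..."},
        {"title": "Example B", "link": "https://example.com/b", "snippet": "..."},
    ]
}

links = first_links(sample)
print(links)
```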
I answered the question you attached via link.
Here's the link to that answer and full code example. I'll copy the code for faster access.
First way, using a custom script that returns JSON:
from bs4 import BeautifulSoup
import requests
import json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')
summary = []
for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })
print(json.dumps(summary, indent=2, ensure_ascii=False))
Using Google Search Engine Results API from SerpApi:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(f"Title: {result['title']}\nLink: {result['link']}\n")
Disclaimer, I work for SerpApi.
I've looked around for a solution to this problem but for the life of me I cannot figure it out!
This is my first attempt at writing anything in python, and what I want my script to do is load a list of subjects from a text file, generate a Google search URL, and scrape these URLs one by one to output the amount of 'results found:' according to Google, in addition to the links of the top 15 results.
My problem is that when I run my code, all that is printed are empty lists:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
**END_OBJECT**
..etc.
Here is my code:
from lxml import html
import requests
def iterate():
    with open("list.txt", "r") as infile:
        for line in infile:
            if not line.strip():
                break
            yield line
output = open ("statistic_out.txt", "w")
for line in iterate():
    raw = line
    output.write(raw + " services")
    request = raw.replace(" ", "%20")
    page = requests.get('https://www.google.com.au/search?safe=off&tbs=ctr:countryAU&cr=countryAU&q=' + request + 'services%20-yellowpages%20-abs', verify=False)
    path = html.fromstring(page.text)
    #This will create a list of buyers:
    resultCount = path.xpath('//*[@id="resultStats"]/text()')
    #This will create a list of prices
    print(resultCount)
    print('\n')
    resultUrlList1 = path.xpath('//*[@id="rso"]/div[2]/li[1]/div/h3/a/text()')
    resultUrlList2 = path.xpath('//*[@id="rso"]/div[2]/li[2]/div/h3/a/text()')
    resultUrlList3 = path.xpath('//*[@id="rso"]/div[2]/li[3]/div/h3/a/text()')
    resultUrlList4 = path.xpath('//*[@id="rso"]/div[2]/li[4]/div/h3/a/text()')
    resultUrlList5 = path.xpath('//*[@id="rso"]/div[2]/li[5]/div/h3/a/text()')
    resultUrlList6 = path.xpath('//*[@id="rso"]/div[2]/li[6]/div/h3/a/text()')
    resultUrlList7 = path.xpath('//*[@id="rso"]/div[2]/li[7]/div/h3/a/text()')
    resultUrlList8 = path.xpath('//*[@id="rso"]/div[2]/li[8]/div/h3/a/text()')
    resultUrlList9 = path.xpath('//*[@id="rso"]/div[2]/li[9]/div/h3/a/text()')
    resultUrlList10 = path.xpath('//*[@id="rso"]/div[2]/li[10]/div/h3/a/text()')
    resultUrlList11 = path.xpath('//*[@id="rso"]/div[2]/li[11]/div/h3/a/text()')
    resultUrlList12 = path.xpath('//*[@id="rso"]/div[2]/li[12]/div/h3/a/text()')
    resultUrlList13 = path.xpath('//*[@id="rso"]/div[2]/li[13]/div/h3/a/text()')
    resultUrlList14 = path.xpath('//*[@id="rso"]/div[2]/li[14]/div/h3/a/text()')
    resultUrlList15 = path.xpath('//*[@id="rso"]/div[2]/li[15]/div/h3/a/text()')
    print(resultUrlList1)
    print('\n')
    print(resultUrlList2)
    print('\n')
    print(resultUrlList3)
    print('\n')
    print(resultUrlList4)
    print('\n')
    print(resultUrlList5)
    print('\n')
    print(resultUrlList6)
    print('\n')
    print(resultUrlList7)
    print('\n')
    print(resultUrlList8)
    print('\n')
    print(resultUrlList9)
    print('\n')
    print(resultUrlList10)
    print('\n')
    print(resultUrlList11)
    print('\n')
    print(resultUrlList12)
    print('\n')
    print(resultUrlList13)
    print('\n')
    print(resultUrlList14)
    print('\n')
    print(resultUrlList15)
    print('\n')
    print("**END_OBJECT** \n")
The actual HTML structure is that of any Google search:
Any help would be greatly appreciated - as I am completely lost as to why this is occurring.
EDIT:
It appears that my script is hitting Google's anti-bot protections and path.content shows messages along the lines of "This page checks to see if it's really you sending the requests, and not a robot."
I'm unsure if there are easy ways to bypass this though will update if I find any.
The problem is that you don't specify a user-agent. You need to send a user-agent that makes the request look like a "real" user visit, which means the script announces itself as a different client (a browser) in its user-agent string. The default requests user-agent is python-requests; Google recognizes it and blocks the request, so you receive different HTML containing an error, with elements that your code doesn't expect and can't find.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
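Before changing selectors, it can be worth confirming that the response really is a block page. A rough sketch follows: looks_blocked is a hypothetical helper, the marker phrase is taken from the message quoted in the question, and treating 403 and 429 as block signals is an assumption, so it may not match every block page Google serves.

```python
def looks_blocked(status_code, html_text):
    """Heuristic check for an anti-bot page (not an official API)."""
    # Phrase quoted in the question's EDIT; other block pages may word it differently
    marker = "not a robot"
    # Assumption: 403 and 429 are treated as signs of blocking
    return status_code in (403, 429) or marker in html_text.lower()


# Hand-written examples standing in for response.status_code and response.text
blocked_page = "This page checks to see if it's really you sending the requests, and not a robot."
normal_page = "<div id='rso'><h3>Some result</h3></div>"

print(looks_blocked(200, blocked_page))
print(looks_blocked(200, normal_page))
```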
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Also, instead of creating 15 resultUrlList, you can iterate over all of them in a for loop:
# .tF2Cxc CSS selector is a container with title, link and other data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
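If you would rather stay with lxml, as in the question, the same idea applies: select all result anchors with one XPath and loop over them. This sketch runs against a hand-written fragment; the structure under rso is assumed for illustration and will differ from Google's real markup.

```python
from lxml import html

# Hypothetical fragment; real Google markup is different and changes over time
page = """
<div id="rso">
  <div><h3><a href="https://example.com/one">First result</a></h3></div>
  <div><h3><a href="https://example.com/two">Second result</a></h3></div>
</div>
"""

tree = html.fromstring(page)
results = []
# One XPath replaces the fifteen numbered variables; keep at most 15 results
for anchor in tree.xpath('//*[@id="rso"]//h3/a')[:15]:
    results.append((anchor.text_content(), anchor.get("href")))

for title, link in results:
    print(title, link, sep="\n")
```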
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'how to create minecraft server',
    'gl': 'us',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
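One more detail from the question: the query string there is assembled with replace(" ", "%20"), and each line read from the file still ends with a newline. The following sketch builds the same URL with urllib.parse.urlencode instead. build_search_url is a hypothetical helper, and the parameter values are copied from the question's URL.

```python
from urllib.parse import urlencode


def build_search_url(subject):
    """Build the question's search URL for one subject line (hypothetical helper)."""
    params = {
        "safe": "off",
        "tbs": "ctr:countryAU",
        "cr": "countryAU",
        # strip() removes the trailing newline left by reading the file line by line
        "q": subject.strip() + " services -yellowpages -abs",
    }
    return "https://www.google.com.au/search?" + urlencode(params)


url = build_search_url("plumbing\n")
print(url)
```

Passing the same dict as params= to requests.get would give an equivalent result without building the string yourself.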
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out how to make things work: how to bypass blocks, how to extract data, how to maintain the script over time if something in the HTML changes, etc. Instead, you only need to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
----------
'''
How to Setup a Minecraft: Java Edition Server – Home
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
Minecraft Server Download
https://www.minecraft.net/en-us/download/server
Setting Up Your Own Minecraft Server - iD Tech
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''
Disclaimer, I work for SerpApi.