I've been learning a lot of python lately to work on some projects at work.
Currently I need to do some web scraping of Google search results. I found several sites that demonstrated how to use the Google AJAX Search API, but after attempting to use it, it appears to no longer be supported. Any suggestions?
I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.
You can always scrape Google results directly. To do this, you can use the URL https://google.com/search?q=<query>, which will return the top 10 search results.
Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either via a CSS selector (.r a) or via an XPath selector (//h3[@class="r"]/a).
In some cases the resulting URL redirects through Google. It usually contains a query parameter q which holds the actual target URL.
Example code using lxml and requests:
from urllib.parse import urlparse, parse_qs
from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        # extract the real target URL from Google's redirect link
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)
A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.
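If you want to handle that case, here is a rough sketch; the function name and backoff values are just illustrative, not from the answer above:
import time
from requests import get

def search_with_backoff(query, retries=3):
    # back off and retry when Google responds with 503 (bot detection)
    for attempt in range(retries):
        resp = get("https://www.google.com/search", params={"q": query})
        if resp.status_code != 503:
            return resp.text
        time.sleep(30 * (attempt + 1))  # arbitrary backoff, adjust to taste
    raise RuntimeError("still getting 503 after several retries")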
Here is another service that can be used for scraping SERPs: https://zenserp.com. It does not require a client library and is cheaper.
Here is a python code sample:
import requests
headers = {
    'apikey': '',
}
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
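The response body is JSON; you would then read it with something like the following (the exact keys depend on zenserp's response format, so inspect it yourself):
if response.ok:
    results = response.json()
    print(results)  # inspect this dict to find the result fields you need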
You have two options: build it yourself or use a SERP API.
A SERP API will return the Google search results as a formatted JSON response.
I would recommend a SERP API as it is easier to use, and you don't have to worry about getting blocked by Google.
1. SERP API
I have had good experience with the ScraperBox SERP API.
You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your ScraperBox API token.
import urllib.parse
import urllib.request
import ssl
import json
# Disable certificate verification (a workaround; not recommended for production)
ssl._create_default_https_context = ssl._create_unverified_context
# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")
# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)
# Print the first result title
print(response["organic_results"][0]["title"])
2. Build your own Python scraper
I recently wrote an in-depth blog post on how to scrape search results with Python.
Here is a quick summary.
First you should get the HTML contents of the Google search result page.
import urllib.request
url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
Then you can use BeautifulSoup to extract the search results.
For example, the following code will get all titles.
from bs4 import BeautifulSoup

# The code to get the html contents here.
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if len(results) >= 1:
        # Print the title
        h3 = results[0]
        print(h3.get_text())
You can extend this code to also extract the search result urls and descriptions.
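For example, here is a sketch that also pulls each result's link, continuing from the soup and divs defined above. Google's markup changes often, so treat these selectors as assumptions:
for div in divs:
    h3 = div.select_one("h3")
    link = div.select_one("a")
    if h3 and link:
        # print the title together with the (possibly relative) result URL
        print(h3.get_text(), "->", link.get("href"))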
You can also use a third party service like Serp API - I wrote and run this tool - which is a paid Google search engine results API. It solves the issue of being blocked, and you don't have to rent proxies or do the result parsing yourself.
It's easy to integrate with Python:
from lib.google_search_results import GoogleSearchResults
params = {
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "hl": "en",
    "gl": "us",
    "google_domain": "google.com",
    "api_key": "demo",
}
query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
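From there you would typically read the organic results out of the returned dictionary; the key names below follow SerpApi's documented response format, but verify them against your own response:
for result in dictionary_results.get("organic_results", []):
    print(result.get("title"), result.get("link"))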
GitHub: https://github.com/serpapi/google-search-results-python
The current answers will work, but Google will ban you for scraping.
My current solution uses the requests_ip_rotator library:
import requests
from requests_ip_rotator import ApiGateway
import os
keywords = ['test']

def parse(keyword, session):
    url = f"https://www.google.com/search?q={keyword}"
    response = session.get(url)
    print(response)

if __name__ == '__main__':
    AWS_ACCESS_KEY_ID = ''
    AWS_SECRET_ACCESS_KEY = ''

    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()

    session = requests.Session()
    session.mount("https://www.google.com", gateway)

    for keyword in keywords:
        parse(keyword, session)
    gateway.shutdown()
You can create AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console.
This solution lets you send up to 1 million requests (the Amazon free-tier limit).
I am looking at this website, which is pulling data from the Tableau API and pushing it to MapBox. The API request for this is the following GET request:
https://public.tableau.com/vizql/w/RouteQ42020Impactmapping/v/RouteQ42022ImpactMapping/bootstrapSession/sessions/A512A428E92E477AB16A339042B822F0-0:0
You can generate this API request using the following if you wish:
import requests
from bs4 import BeautifulSoup
import json
host_url = "https://public.tableau.com"
path = "/views/RouteQ42020Impactmapping/RouteQ42022ImpactMapping"
url = f"{host_url}{path}"
r = requests.get(
    url,
    params={
        ":embed": "y",
        ":showVizHome": "no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
session_id = tableauData['sessionid']
...however, I pulled the URL out of Chrome DevTools. It works well and I mostly get the payload I want. However, when I want to subset the data using the filter drop-downs at the link I posted, I can see new XHR requests appearing in DevTools, but I cannot find where this filter has been applied as a parameter in the HTTP call. For example, I would want to change BARB Region to North West only.
Can anyone tell me how to do this?
Thanks
I have been struggling with web scraping using the code below, and it's showing me null records. If I print the output data, it doesn't show the requested output. This is the website I am trying to scrape: https://coinmarketcap.com/. There are several pages (64) which need to be taken into the data frame.
import requests
import pandas as pd
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.post(url)
main_data=req.json()
Can anyone help me sort this out?
Instead of using a POST request, use GET in the request call and it will work!
import requests
res=requests.get("https://api.coinmarketcap.com/data-api/v3/topsearch/rank")
main_data=res.json()
data=main_data['data']['cryptoTopSearchRanks']
With all pages: You can find this URL in the Network tab. Go to XHR and reload, then go to the second page; the URL will appear in the XHR tab. You can copy it and make a call to it (I have shortened the URL here).
res=requests.get("https://coinmarketcap.com/")
soup=BeautifulSoup(res.text,"html.parser")
last_page=soup.find_all("p",class_="sc-1eb5slv-0 hykWbK")[-1].get_text().split(" ")[-1]
res=requests.get(f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit={last_page}&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath")
Now use the .json() method:
data=res.json()['data']['cryptoCurrencyList']
print(len(data))
Output:
6304
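Since the question mentions loading everything into a data frame, here is a minimal sketch on top of the data list from above; the column names are assumptions, so check df.columns first:
import pandas as pd

df = pd.DataFrame(data)
print(df.shape)
print(df[["name", "symbol"]].head())  # assumed columns; inspect df.columns to be sure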
For getting/reading the data you need to use the GET method, not POST:
import requests
import pandas as pd
import json
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.get(url)
main_data = req.json()
print(main_data) # without pretty printing
pretty_json = json.loads(req.text)
print(json.dumps(pretty_json, indent=4)) # with pretty print
Their terms of use prohibit web scraping. The site provides a well-documented API that has a free tier. Register and get an API token:
from requests import Session
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': HIDDEN_TOKEN,  # replace that with your API key
}
session = Session()
session.headers.update(headers)
response = session.get(url, params=parameters)
data = response.json()
print(data)
I am scraping player names from the NBA website. The player index page is built as a single-page application, and the players are distributed across several pages in alphabetical order. I am unable to extract the names of all the players.
Here is the link: https://in.global.nba.com/playerindex/
from selenium import webdriver
from bs4 import BeautifulSoup
class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'lxml')

names = []
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
When dealing with SPAs, you shouldn't try to extract the info from the DOM, because the DOM is incomplete until a JS-capable browser has run and populated it with data. Open up the page source, and you'll see that the page HTML doesn't contain the data you need.
But most SPAs load their data using XHR requests. You can monitor network requests in Developer Console (F12) to see the requests being made during page load.
Here, https://in.global.nba.com/playerindex/ loads the player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en.
Simulate that request yourself, then pick whatever you need. Inspect the request headers to figure out what you need to send with the request.
import requests
if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'

    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()
    data = res.json()

    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
Dealing with auth
One thing to watch out for is that some websites require an authentication token to be sent with requests. You can see it in the API requests if it's present.
If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.
This token (usually a JWT, starting with ey...) typically sits somewhere in the HTML, encoded as JSON. Or it is sent to the client as a cookie, which the browser attaches to the request, or it arrives in a header. In short, it could be anywhere. Scan the requests and responses to figure out where the token is coming from and how you can retrieve it yourself.
...
<script>
    const state = {"token": "ey......", ...};
</script>
import json
import re
import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON,
# we take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']
res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
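If the token turns out to be delivered as a cookie instead, a requests.Session will store it for you. Here is a sketch where the cookie name and URLs are hypothetical:
import requests

s = requests.Session()
s.get('url/to/page')  # the response sets the auth cookie in the session's cookie jar
token = s.cookies.get('auth_token')  # hypothetical cookie name, check DevTools
res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})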
I tried to get the HTML code from a site named dcinside in Korea. I am using requests but cannot get the HTML code,
and this is my code
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print (req)
print (req.content)
but the result was
Why can't I get the HTML code even when using requests?
Most likely they are detecting that you are trying to crawl their data, and are not returning any content in the response. Try pretending to be a browser by passing some User-Agent headers.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
# use authentic mozilla or chrome user-agent strings if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
Like the guy said in the aforementioned post, you should use urllib2, which will allow you to easily obtain web resources.
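For reference, a minimal fetch along those lines; note that urllib2 is the Python 2 name, in Python 3 the same interface lives in urllib.request:
import urllib.request

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode('utf-8', errors='ignore')
print(html[:500])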
I'm trying to learn python, so I decided to write a script that could translate something using google translate. Till now I wrote this:
import sys
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
data = {'sl':'en','tl':'it','text':'word'}
request = urllib2.Request('http://www.translate.google.com', urllib.urlencode(data))
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
#print feeddata
soup = BeautifulSoup(feeddata)
print soup.find('span', id="result_box")
print request.get_method()
And now I'm stuck. I can't see any bugs in it, but it still doesn't work (by that I mean that the script will run, but it won't translate the word).
Does anyone know how to fix it?
(Sorry for my poor English)
I made this script if you want to check it:
https://github.com/mouuff/Google-Translate-API
: )
Google Translate is meant to be used with a GET request, not a POST request. However, urllib2 will automatically submit a POST if you add any data to your request.
The solution is to construct the url with a querystring so you will be submitting a GET.
You'll need to alter the request = urllib2.Request('http://www.translate.google.com', urllib.urlencode(data)) line of your code.
Here goes:
querystring = urllib.urlencode(data)
request = urllib2.Request('http://www.translate.google.com' + '?' + querystring )
And you will get the following output:
<span id="result_box" class="short_text">
<span title="word" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">
parola
</span>
</span>
By the way, you're kinda breaking Google's terms of service; look into them if you're doing more than hacking a little script for training.
Using requests
I strongly advise you to stay away from urllib if possible, and use the excellent requests library, which will allow you to efficiently use HTTP with Python.
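For example, here is a rough requests equivalent of the GET approach above (same caveats: the page markup can change, and scraping the translate page sits outside Google's terms of service):
import requests
from bs4 import BeautifulSoup

params = {'sl': 'en', 'tl': 'it', 'text': 'word'}
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get('http://translate.google.com', params=params, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.find('span', id='result_box'))  # may be None if the markup has changed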
Yes, their documentation is not so easy to find.
Here's what you do:
In the Google Cloud Platform Console:
1.1 Go to the Projects page and select or create a new project
1.2 Enable billing for your project
1.3 Enable the Cloud Translation API
1.4 Create a new API key in your project, make sure to restrict usage by IP or other means available there.
On the machine where you want to run the client:
pip install --upgrade google-api-python-client
Then you can write the following to send translation requests and receive responses:
import json
from apiclient.discovery import build
query='this is a test to translate english to spanish'
target_language = 'es'
service = build('translate','v2',developerKey='INSERT_YOUR_APP_API_KEY_HERE')
collection = service.translations()
request = collection.list(q=query, target=target_language)
response = request.execute()
response_json = json.dumps(response)
ascii_translation = ((response['translations'][0])['translatedText']).encode('utf-8').decode('ascii', 'ignore')
utf_translation = ((response['translations'][0])['translatedText']).encode('utf-8')
print(response)
print(ascii_translation)
print(utf_translation)