Web Scraping with Requests -Python

Web Scraping with Requests -Python - python

I have been struggling to do a web scraping with the below code and its showing me null records. If we print the output data, it dosent show the requested output. this is the web site i am going to do this web scraping https://coinmarketcap.com/. there are several pages which need to be taken in to the data frame. (64 Pages)
import requests
import pandas as pd
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req= requests.post(url)
main_data=req.json()
can anyone help me to sort this out?

Instead of using post requests use get in request call it will
work!
import requests
res=requests.get("https://api.coinmarketcap.com/data-api/v3/topsearch/rank")
main_data=res.json()
data=main_data['data']['cryptoTopSearchRanks']
With All pages: You can find this URL from Network tab go to xhr and reload now go to second page URL will avail in xhr tab you can copy and make call of it i have shorten the URL here
res=requests.get("https://coinmarketcap.com/")
soup=BeautifulSoup(res.text,"html.parser")
last_page=soup.find_all("p",class_="sc-1eb5slv-0 hykWbK")[-1].get_text().split(" ")[-1]
res=requests.get(f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit={last_page}&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath")
Now use json method
data=res.json()['data']['cryptoCurrencyList']
print(len(data))
Output:
6304

For getting/reading the data you need to use get method not post
import requests
import pandas as pd
import json
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.get(url)
main_data = req.json()
print(main_data) # without pretty printing
pretty_json = json.loads(req.text)
print(json.dumps(pretty_json, indent=4)) # with pretty print

Their terms of use prohibit web scraping. The site provides a well-documented API that has a free tier. Register and get API token:
from requests import Session
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
'start':'1',
'limit':'5000',
'convert':'USD'
}
headers = {
'Accepts': 'application/json',
'X-CMC_PRO_API_KEY': HIDDEN_TOKEN, # replace that with your API Key
}
session = Session()
session.headers.update(headers)
response = session.get(url, params=parameters)
data = response.json()
print(data)

Related

How to set filter params in this Tableau API call

I am looking at this website which is pulling data from the Tableau API and pushing it to MapBox. The API request for this is the following GET request:
https://public.tableau.com/vizql/w/RouteQ42020Impactmapping/v/RouteQ42022ImpactMapping/bootstrapSession/sessions/A512A428E92E477AB16A339042B822F0-0:0
You can generate this API using the following if you wish:
import requests
from bs4 import BeautifulSoup
import json
host_url = "https://public.tableau.com"
path = "/views/RouteQ42020Impactmapping/RouteQ42022ImpactMapping"
url = f"{host_url}{path}"
r = requests.get(
url,
params= {
":embed": "y",
":showVizHome": "no"
}
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
session_id = tableauData['sessionid']
...however, I pulled the URL out of ChromeTools. It works well and I mainly get the payload I want. However, when I want to subset the data using the filter drop downs at the initial link I posted, I can see new XHR requests appearing in Chrome Tools, but I cannot find where this filter has been applied as a parameter in the HTTP call. For example I would want to change BARB Region to North West only.
Can anyone tell me how to do this?
Thanks

Python requests - using twitter search

I am trying to use requests to get data from twitter but when i run my code i get this error: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is my code so far:
import requests
url = 'https://twitter.com/search?q=memes&src=typed_query'
results = requests.get(url)
better_results = results.json()
better_results['results'][1]['text'].encode('utf-8')
print(better_results)

because you are making a request to a dynamic website.
when we are making a request to a dynamic website we must render the html first in order to receive all the content that we were expecting to receive.
just making the request is not enough.
other libraries such as requests_html render the html and javascript in background using a lite browser.
you can try this code:
# pip install requests_html
from requests_html import HTMLSession
url = 'https://twitter.com/search?q=memes&src=typed_query'
session = HTMLSession()
response = session.get(url)
# rendering part
response.html.render(timeout=20)
better_results = response.json()
better_results['results'][1]['text'].encode('utf-8')
print(better_results)

Web Scraping using Requests - Python

I am trying to get data using the Resquest library, but I’m doing something wrong. My explanation, manual search:
URL - https://www9.sabesp.com.br/agenciavirtual/pages/template/siteexterno.iface?idFuncao=18
I fill in the “Informe o RGI” field and after clicking on the Prosseguir button (like Next):
enter image description here
I get this result:
enter image description here
Before I coding, I did the manual search and checked the Form Data:
enter image description here
And then I tried it with this code:
import requests
data = { "frmhome:rgi1": "0963489410"}
url = "https://www9.sabesp.com.br/agenciavirtual/block/send-receive-updates"
res = requests.post(url, data=data)
print(res.text)
My output is:
<session-expired/>
What am I doing wrong?
Many thanks.

When you go to the site using the browser, a session is created and stored in a cookie on your machine. When you make the POST request, the cookies are sent with the request. You receive an session-expired error because you're not sending any session data with your request.
Try this code. It requests the entry page first and stores the cookies. The cookies are then sent with the POST request.
import requests
session = requests.Session() # start session
# get entry page with cookies
response = session.get('https://www9.sabesp.com.br/agenciavirtual/pages/home/paginainicial.iface', timeout=30)
cks = session.cookies # save cookies with Session data
print(session.cookies.get_dict())
data = { "frmhome:rgi1": "0963489410"}
url = "https://www9.sabesp.com.br/agenciavirtual/block/send-receive-updates"
res = requests.post(url, data=data, cookies=cks) # send cookies with request
print(res.text)

Method not allowed first API

been through a few web scraping tutorials now trying a basic api scraper.
This is my code
from bs4 import BeautifulSoup
import requests
url = 'https://qships.tmr.qld.gov.au/webx/services/wxdata.svc/GetDataX'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
print (content)
comes up with method not allowed :(
Im still learning so any advice will be well recieved
cheers

It is clearly a problem with your URL, service doesn't allow to retrieve information. but you can check this URL, where the steps for retrieving metadata are described.
https://qships.tmr.qld.gov.au/webx/services/wxdata.svc

Google Search Web Scraping with Python

I've been learning a lot of python lately to work on some projects at work.
Currently I need to do some web scraping with google search results. I found several sites that demonstrated how to use ajax google api to search, however after attempting to use it, it appears to no longer be supported. Any suggestions?
I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.

You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query> this will return the top 10 search results.
Then you can use lxml for example to parse the page. Depending on what you use, you can either query the resulting node tree via a CSS-Selector (.r a) or using a XPath-Selector (//h3[#class="r"]/a)
In some cases the resulting URL will redirect to Google. Usually it contains a query-parameter qwhich will contain the actual request URL.
Example code using lxml and requests:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
url = result.get("href")
if url.startswith("/url?"):
url = parse_qs(urlparse(url).query)['q']
print(url[0])
A note on google banning your IP: In my experience, google only bans
if you start spamming google with search requests. It will respond
with a 503 if Google thinks you are bot.

Here is another service that can be used for scraping SERPs (https://zenserp.com) It does not require a client and is cheaper.
Here is a python code sample:
import requests
headers = {
'apikey': '',
}
params = (
('q', 'Pied Piper'),
('location', 'United States'),
('search_engine', 'google.com'),
('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)

You have 2 options. Building it yourself or using a SERP API.
A SERP API will return the Google search results as a formatted JSON response.
I would recommend a SERP API as it is easier to use, and you don't have to worry about getting blocked by Google.
1. SERP API
I have good experience with the scraperbox serp api.
You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your scraperbox API token.
import urllib.parse
import urllib.request
import ssl
import json
ssl._create_default_https_context = ssl._create_unverified_context
# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")
# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)
# Print the first result title
print(response["organic_results"][0]["title"])
2. Build your own Python scraper
I recently wrote an in-depth blog post on how to scrape search results with Python.
Here is a quick summary.
First you should get the HTML contents of the Google search result page.
import urllib.request
url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the repsonse as a utf-8 string
html = raw_response.decode("utf-8")
Then you can use BeautifulSoup to extract the search results.
For example, the following code will get all titles.
from bs4 import BeautifulSoup
# The code to get the html contents here.
soup = BeautifulSoup(html, 'html.parser')
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
# Search for a h3 tag
results = div.select("h3")
# Check if we have found a result
if (len(results) >= 1):
# Print the title
h3 = results[0]
print(h3.get_text())
You can extend this code to also extract the search result urls and descriptions.

You can also use a third party service like Serp API - I wrote and run this tool - that is a paid Google search engine results API. It solves the issues of being blocked, and you don't have to rent proxies and do the result parsing yourself.
It's easy to integrate with Python:
from lib.google_search_results import GoogleSearchResults
params = {
"q" : "Coffee",
"location" : "Austin, Texas, United States",
"hl" : "en",
"gl" : "us",
"google_domain" : "google.com",
"api_key" : "demo",
}
query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
GitHub: https://github.com/serpapi/google-search-results-python

Current answers will work but google will ban your for scrapping.
My current solution uses the requests_ip_rotator
import requests
from requests_ip_rotator import ApiGateway
import os
keywords = ['test']
def parse(keyword, session):
url = f"https://www.google.com/search?q={keyword}"
response = session.get(url)
print(response)
if __name__ == '__main__':
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
access_key_secret=AWS_SECRET_ACCESS_KEY)
gateway.start()
session = requests.Session()
session.mount("https://www.google.com", gateway)
for keyword in keywords:
parse(keyword, session)
gateway.shutdown()
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY you can create in AWS console.
This solution allow you to parse 1 million requests (amazon free limit)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Web Scraping with Requests -Python - python

Related

How to set filter params in this Tableau API call

Python requests - using twitter search

Web Scraping using Requests - Python

Method not allowed first API

Google Search Web Scraping with Python

Categories

Resources