I was trying to scrape this website but wasn't getting the table data. I even got the request data from the Chrome dev tools, but I cannot find out what I'm doing wrong.
Here is my script:
import requests,json
# Endpoint behind the HSBC fund centre page that serves the fund table data.
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
# Request body captured from Chrome dev tools (filters, paging, page metadata).
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
# NOTE(review): `data=` form-encodes the dict, so the nested structures are sent
# as their str() representations — the API presumably expects a JSON body
# (`json=payload`) plus the Authorization / IFC-Cache-Header headers described
# in the answer below; confirm against the site.
r = requests.post(url,headers=headers,data=payload)
print(r.content)
Initially the request was missing the IFC-Cache-Header HTTP header; in addition, a JWT token has to be passed via the Authorization header.
To retrieve this token, you first need to extract values from the root page :
GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre
which features the following javascript object:
window.HSBC.dpas = {
"pageInformation": {
"country": "X", <========= HERE
"language": "X", <========= HERE
"tokenIssue": {
"url": "/api/v1/token/issue",
},
"dataUrl": {
"url": "/api/v1/nav/funds",
"id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
},
....
}
}
You can extract the window.HSBC.dpas javascript object value using regex and then reformat the string so that it becomes valid JSON
These values are then passed in http headers such as X-COUNTRY, X-COMPONENT and X-LANGUAGE to the following call:
GET https://www.assetmanagement.hsbc.de/api/v1/token/issue
It returns the JWT token directly. Add the Authorization header to the request as Authorization: Bearer {token} for the next call:
GET https://www.assetmanagement.hsbc.de/api/v1/nav/funds
Example:
import requests
import re
import json
api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url = f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"

# Step 1: fetch the fund-centre page to extract the window.HSBC.dpas object.
# The query string is supplied once via `params` — the original passed it both
# inside the URL and in `params`, duplicating every parameter in the request.
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre"
r = requests.get(url,
                 params={
                     "f": "Yes",
                     "n": "-1",
                     "v": "Documents"
                 })
# Grab the javascript object assigned to window.HSBC.dpas.
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
if res is None:
    # Fail loudly instead of raising AttributeError on res.group below.
    raise RuntimeError("window.HSBC.dpas object not found in the page")
group = res.group(1)
# Convert to valid JSON by removing trailing commas:
# https://stackoverflow.com/a/56595068 (added "e" so values ending in true/false match too)
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group, 0)
result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])

# Step 2: call /token/issue to obtain a JWT; the X-* headers come from the
# pageInformation object scraped above.
r = requests.post(token_url,
                  headers={
                      "X-Country": result["pageInformation"]["country"],
                      "X-Component": result["pageInformation"]["dataUrl"]["id"],
                      "X-Language": result["pageInformation"]["language"]
                  }, data={})
token = r.text
print(token)

# Step 3: call /nav/funds with the JWT in the Authorization header.
payload = {
    "appliedFilters": [[{"active": True, "id": "Yes"}]],
    "paging": {"fundsPerPage": -1, "currentPage": 1},
    "view": "Documents",
    "searchTerm": [],
    "selectedValues": [],
    "pageInformation": result["pageInformation"]
}
headers = {
    "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
    "Authorization": f"Bearer {token}"
}
r = requests.post(funds_url, headers=headers, json=payload)
print(r.content)
Try this on repl.it
Related
After visiting this website, when I fill out the inputbox with Sydney CBD, NSW and hit the search button, I can see the required results displayed on that site.
I wish to scrape the property links using requests module. When I go for the following attempt, I can get the property links from the first page.
The problem here is that I hardcoded the value of sha256Hash within params, which is not what I want to do. I don't know if the ID retrieved by issuing a get requests to the suggestion url needs to be converted to sha256Hash.
However, when I do that using this function get_hashed_string(), the value it produces is different from the hardcoded one that is available within params. As a result, the script spits out a keyError on this line: container = res.json().
import requests
import hashlib
from pprint import pprint
from bs4 import BeautifulSoup
# Autocomplete endpoint that resolves a free-text location query to suggestions.
url = 'https://suggest.realestate.com.au/consumer-suggest/suggestions'
# GraphQL endpoint that serves the property search results.
link = 'https://lexa.realestate.com.au/graphql'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
# Query-string parameters for the suggestion call.
payload = {
    'max': '7',
    'type': 'suburb,region,precinct,state,postcode',
    'src': 'homepage-web',
    'query': 'Sydney CBD, NSW'
}
# GraphQL persisted-query body; per the answer below, sha256Hash identifies the
# stored query on the server and stays constant — it is not derived from the input.
params = {"operationName":"searchByQuery","variables":{"query":"{\"channel\":\"buy\",\"page\":1,\"pageSize\":25,\"filters\":{\"surroundingSuburbs\":true,\"excludeNoSalePrice\":false,\"ex-under-contract\":false,\"ex-deposit-taken\":false,\"excludeAuctions\":false,\"excludePrivateSales\":false,\"furnished\":false,\"petsAllowed\":false,\"hasScheduledAuction\":false},\"localities\":[{\"searchLocation\":\"sydney cbd, nsw\"}]}","testListings":False,"nullifyOptionals":False},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"ef58e42a4bd826a761f2092d573ee0fb1dac5a70cd0ce71abfffbf349b5b89c1"}}}
def get_hashed_string(keyword):
    """Return the hexadecimal SHA-256 digest of *keyword* (UTF-8 encoded)."""
    digest = hashlib.sha256(keyword.encode('utf-8'))
    return digest.hexdigest()
with requests.Session() as s:
    s.headers.update(headers)
    # Suggestion call: resolves the free-text query to a location record.
    r = s.get(url,params=payload)
    hashed_id = r.json()['_embedded']['suggestions'][0]['id']
    # NOTE(review): per the answer below, sha256Hash is a fixed persisted-query
    # id — it is not a hash of the suggestion id, so overwriting it (next line,
    # commented out) makes the GraphQL call fail.
    # params['extensions']['persistedQuery']['sha256Hash'] = get_hashed_string(hashed_id)
    res = s.post(link,json=params)
    container = res.json()['data']['buySearch']['results']['exact']['items']
    for item in container:
        print(item['listing']['_links']['canonical']['href'])
If I run the script as is, it works beautifully. When I uncomment the line params['extensions']['persistedQuery']--> and run the script again, the script breaks.
How can I generate the value of sha256Hash and use the same within the script above?
This is not how graphql works. The sha value stays the same across all requests but what you're missing is a valid graphql query.
You have to reconstruct that first and then just use the API pagination - that's the key.
Here's how:
import json
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/109.0",
    "Accept": "application/graphql+json, application/json",
    "Content-Type": "application/json",
    "Host": "lexa.realestate.com.au",
    "Referer": "https://www.realestate.com.au/",
}
endpoint = "https://lexa.realestate.com.au/graphql"


def build_query(page):
    """Return the GraphQL `query` variable (a JSON string) for results *page*.

    Replaces the original hand-escaped string template + str.replace hack:
    building the dict and serializing it with json.dumps cannot produce
    malformed JSON and makes the page number a real parameter.
    """
    return json.dumps({
        "channel": "buy",
        "page": page,
        "pageSize": 25,
        "filters": {
            "surroundingSuburbs": True,
            "excludeNoSalePrice": False,
            "ex-under-contract": False,
            "ex-deposit-taken": False,
            "excludeAuctions": False,
            "excludePrivateSales": False,
            "furnished": False,
            "petsAllowed": False,
            "hasScheduledAuction": False,
        },
        "localities": [{"searchLocation": "sydney cbd, nsw"}],
    }, separators=(",", ":"))


# Persisted-query envelope; sha256Hash is a server-side constant.
graph_json = {
    "operationName": "searchByQuery",
    "variables": {
        "query": "",
        "testListings": False,
        "nullifyOptionals": False
    },
    "extensions": {
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "ef58e42a4bd826a761f2092d573ee0fb1dac5a70cd0ce71abfffbf349b5b89c1"
        }
    }
}

if __name__ == '__main__':
    with requests.Session() as s:
        for page in range(1, 3):
            graph_json['variables']['query'] = build_query(page)
            # json= serializes the body for us (equivalent to data=json.dumps(...)).
            r = s.post(endpoint, headers=headers, json=graph_json)
            listing = r.json()['data']['buySearch']['results']['exact']['items']
            for item in listing:
                print(item['listing']['_links']['canonical']['href'])
This should give you:
https://www.realestate.com.au/property-apartment-nsw-sydney-140558991
https://www.realestate.com.au/property-apartment-nsw-sydney-141380404
https://www.realestate.com.au/property-apartment-nsw-sydney-140310979
https://www.realestate.com.au/property-apartment-nsw-sydney-141259592
https://www.realestate.com.au/property-apartment-nsw-barangaroo-140555291
https://www.realestate.com.au/property-apartment-nsw-sydney-140554403
https://www.realestate.com.au/property-apartment-nsw-millers+point-141245584
https://www.realestate.com.au/property-apartment-nsw-haymarket-139205259
https://www.realestate.com.au/project/hyde-metropolitan-by-deicorp-sydney-600036803
https://www.realestate.com.au/property-apartment-nsw-haymarket-140807411
https://www.realestate.com.au/property-apartment-nsw-sydney-141370756
https://www.realestate.com.au/property-apartment-nsw-sydney-141370364
https://www.realestate.com.au/property-apartment-nsw-haymarket-140425111
https://www.realestate.com.au/project/greenland-centre-sydney-600028910
https://www.realestate.com.au/property-apartment-nsw-sydney-141364136
https://www.realestate.com.au/property-apartment-nsw-sydney-139367203
https://www.realestate.com.au/property-apartment-nsw-sydney-141156696
https://www.realestate.com.au/property-apartment-nsw-sydney-141362880
https://www.realestate.com.au/property-studio-nsw-sydney-141311384
https://www.realestate.com.au/property-apartment-nsw-haymarket-141354876
https://www.realestate.com.au/property-apartment-nsw-the+rocks-140413283
https://www.realestate.com.au/property-apartment-nsw-sydney-141350552
https://www.realestate.com.au/property-apartment-nsw-sydney-140657935
https://www.realestate.com.au/property-apartment-nsw-barangaroo-139149039
https://www.realestate.com.au/property-apartment-nsw-haymarket-141034784
https://www.realestate.com.au/property-apartment-nsw-sydney-141230640
https://www.realestate.com.au/property-apartment-nsw-barangaroo-141340768
https://www.realestate.com.au/property-apartment-nsw-haymarket-141337684
https://www.realestate.com.au/property-unitblock-nsw-millers+point-141337528
https://www.realestate.com.au/property-apartment-nsw-sydney-141028828
https://www.realestate.com.au/property-apartment-nsw-sydney-141223160
https://www.realestate.com.au/property-apartment-nsw-sydney-140643067
https://www.realestate.com.au/property-apartment-nsw-sydney-140768179
https://www.realestate.com.au/property-apartment-nsw-haymarket-139406051
https://www.realestate.com.au/property-apartment-nsw-haymarket-139406047
https://www.realestate.com.au/property-apartment-nsw-sydney-139652067
https://www.realestate.com.au/property-apartment-nsw-sydney-140032667
https://www.realestate.com.au/property-apartment-nsw-sydney-127711002
https://www.realestate.com.au/property-apartment-nsw-sydney-140903924
https://www.realestate.com.au/property-apartment-nsw-walsh+bay-139130519
https://www.realestate.com.au/property-apartment-nsw-sydney-140285823
https://www.realestate.com.au/property-apartment-nsw-sydney-140761223
https://www.realestate.com.au/project/111-castlereagh-sydney-600031082
https://www.realestate.com.au/property-apartment-nsw-sydney-140633099
https://www.realestate.com.au/property-apartment-nsw-haymarket-141102892
https://www.realestate.com.au/property-apartment-nsw-sydney-139522379
https://www.realestate.com.au/property-apartment-nsw-sydney-139521259
https://www.realestate.com.au/property-apartment-nsw-sydney-139521219
https://www.realestate.com.au/property-apartment-nsw-haymarket-140007279
https://www.realestate.com.au/property-apartment-nsw-haymarket-139156515
example from postman
import urllib2
url = "http://www.example.com/posts"
# Python 2 urllib2: no `data` argument, so this is sent as a GET request
# even though Content-Type advertises a form-encoded body.
req = urllib2.Request(url,headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30" , "Content-Type": "application/x-www-form-urlencoded"})
con = urllib2.urlopen(req)
print con.read()
Now this code works fine, but I want to add the value shown in the Postman picture so I get the response I want. I don't know how to add the key/value pair postid = 134686 to the Python code — it is a POST request in Postman.
Form-encoded is the normal way to send a POST request with data. Just supply a data dict; you don't even need to specify the content-type.
import urllib

data = {'postid': 134786}
headers = {'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"}
# urllib2 expects the POST body as a url-encoded *string*, not a dict;
# passing the dict directly fails when the request is sent, so encode it first.
req = urllib2.Request(url, headers=headers, data=urllib.urlencode(data))
Just an important thing to note: for nested JSON data you will need to convert the nested JSON object to a string.
# Nested payload as-is: requests form-encodes only the top level, so the
# inner dict would be sent as its str() representation.
data = { 'key1': 'value',
         'key2': {
             'nested_key1': 'nested_value1',
             'nested_key2': 123
         }
       }
The dictionary needs to be transformed in this format
inner_dictionary = {
    'nested_key1': 'nested_value1',
    'nested_key2': 123
}
# json.dumps serializes the nested dict to a JSON string so it survives
# form-encoding intact as the value of 'key2'.
data = { 'key1': 'value',
         'key2': json.dumps(inner_dictionary)
       }
r = requests.post(URL, data = data)
Search url - http://aptaapps.apta.org/findapt/Default.aspx?UniqueKey=.
Need to get data for the zipcode(10017)
Sending post requests but I receive the search page(response from the search url) but not the page with results.
My code:
# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup, element
search_url = "http://aptaapps.apta.org/findapt/Default.aspx?UniqueKey="


def hidden_value(page, field_id):
    """Return the `value` of the hidden <input> with id *field_id*, or "" if absent.

    Replaces five identical copy-pasted try/except blocks from the original.
    """
    try:
        return page.find("input", id=field_id)["value"]
    except TypeError:
        # find() returned None: the field is not present on this page
        return ""


session = requests.Session()

# GET the form first: ASP.NET WebForms only accepts a POST that echoes back
# the hidden state fields (__VIEWSTATE etc.) rendered into that form.
r = session.get(search_url)
post_page = BeautifulSoup(r.text, "lxml")

post_data = {
    "__EVENTTARGET": hidden_value(post_page, "__EVENTTARGET"),
    "__EVENTARGUMENT": hidden_value(post_page, "__EVENTARGUMENT"),
    "__VIEWSTATE": hidden_value(post_page, "__VIEWSTATE"),
    "__VIEWSTATEGENERATOR": hidden_value(post_page, "__VIEWSTATEGENERATOR"),
    "__EVENTVALIDATION": hidden_value(post_page, "__EVENTVALIDATION"),
    "ctl00$SearchTerms2": "",
    "ctl00$maincontent$txtZIP": "10017",
    "ctl00$maincontent$txtCity": "",
    "ctl00$maincontent$lstStateProvince": "",
    "ctl00$maincontent$radDist": "1",
    "ctl00$maincontent$btnSearch": "Find a Physical Therapist"
}
# NOTE: Content-Length is deliberately omitted — the original hardcoded "3025",
# which will not match the actual body size and can corrupt the request;
# requests computes the correct value automatically.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4",
    "Cache-Control": "max-age=0",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "aptaapps.apta.org",
    "Origin": "http://aptaapps.apta.org",
    "Proxy-Connection": "keep-alive",
    "Referer": "http://aptaapps.apta.org/findapt/default.aspx?UniqueKey=",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
}
post_r = session.post(search_url, data=post_data, headers=headers)
print(post_r.text)
Short Answer:
try to replace:
post_r = session.post(search_url, data=post_data, headers=headers)
to:
post_r = session.post(search_url, json=post_data, headers=headers)
Long Answer:
For POST method, there are many kinds of data types to post in. Such as form-data, x-www-form-urlencoded, application/json, file and etc.
You should know what is the type of the post data. There is a brilliant chrome plugin called postman. You can use it to try different data type and find what is the correct one.
After you find it, use the correct parameter in requests.post: the data parameter is for form-data and x-www-form-urlencoded, while the json parameter is for JSON payloads. You can refer to the requests documentation to learn more about these parameters.
I'm trying to import contacts into a contact list in Qualtrics. I am using python to do this.
Token = 'MyToken' #when running the code I put in my actual token and id
ContactsID = 'MyContactsID'
# NOTE(review): per the error message below, this endpoint expects a
# multipart/form-data file upload — sending the raw bytes with
# Content-Type application/json is what triggers the 400.
data = open('contacts.json', 'rb')
headers = {'X-API-TOKEN': Token, 'Content-Type':'application/json',}
r = requests.post('https://az1.qualtrics.com/API/v3/mailinglists/' + ContactsID +'/contactimports', headers=headers, data=data)
r.text
This code gives me the following error: '{"meta":{"httpStatus":"400 - Bad Request","error":{"errorMessage":"Invalid Content-Type. expected=multipart/form-data found=application/json","errorCode":"RP_0.1"},"requestId":null}}'
I changed the content type to multipart/form-data that it says it is expecting and received the response "413", which qualtrics explains means "The request body was too large. This can also happen in cases where a multipart/form-data request is malformed."
I have tested my json and verified that it is valid. Also, I don't know why the request body would be too large because it's only 13 contacts that I'm trying to import. Any ideas?
With the help of Qualtrics Support, I was eventually able to get the following code to work:
Token = 'MyToken' #when running the code I put in my actual token and id
ContactsID = 'MyContactsID'
url = "https://az1.qualtrics.com/API/v3/mailinglists/" + ContactsID + "/contactimports/"
# Pass the token *variable* — the original sent the literal string "Token".
# Do NOT set Content-Type by hand: requests generates the multipart boundary
# itself, and a hardcoded boundary would not match the body it builds.
headers = {
    'x-api-token': Token
}
# Context manager closes the file handle once the upload finishes.
with open('contacts.json', 'rb') as contacts:
    files = {'contacts': ('contacts', contacts, 'application/json')}
    request = requests.post(url, headers=headers, files=files)
print(request.text)
Please note that if you want to use this code, you will need to change "az1" in the URL to your own Qualtrics datacenter ID.
You need to use files = .. for a multipart request:
Token = 'MyToken' #when running the code I put in my actual token and id
ContactsID = 'MyContactsID'
headers = {'X-API-TOKEN': Token}
# Context manager closes the file handle after the upload (the original
# opened the file and never closed it).
with open('contacts.json', 'rb') as data:
    r = requests.post('https://az1.qualtrics.com/API/v3/mailinglists/' + ContactsID +'/contactimports',files={"file":data}, headers=headers)
r.text
Once you do requests will take care of the rest:
In [36]: url = 'http://httpbin.org/post'
In [37]: headers = {'X-API-TOKEN': "123456789"}
In [38]: files = {'file': open('a.csv', 'rb')}
In [39]: r = requests.post(url, files=files, headers=headers)
In [40]: print r.text
{
"args": {},
"data": "",
"files": {
"file": "a,b,c\n1,2,3"
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "152",
"Content-Type": "multipart/form-data; boundary=3830dbe5fa6141f69d3d85dee4ba6e78",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.10.0",
"X-Api-Token": "123456789"
},
"json": null,
"origin": "51.171.98.185",
"url": "http://httpbin.org/post"
}
In [41]: print(r.request.body)
--3830dbe5fa6141f69d3d85dee4ba6e78
Content-Disposition: form-data; name="file"; filename="a.csv"
a,b,c
1,2,3
--3830dbe5fa6141f69d3d85dee4ba6e78--
looking at the docs, you actually want something closer to:
Token = 'MyToken' #when running the code I put in my actual token and id
ContactsID = 'MyContactsID'
# Context manager closes the file handle after the upload (the original leaked it).
with open('contacts.json', 'rb') as data:
    # (form field name, file name, file object, content type, extra part headers)
    files = {'file': ('contact', data ,'application/json', {'X-API-TOKEN': Token})}
    r = requests.post('https://az1.qualtrics.com/API/v3/mailinglists/' + ContactsID +'/contactimports',files=files)
I've been trying to get past the form page on http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam using the python Requests module.
The problem I'm guessing is the AJAX in the form field. And I really have no clue about how to go about sending a request with Python Requests for that.
I know that this can be done through Selenium, but I need it done through requests.
Here's my current code:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
}
# JSF/Seam form fields — the residential:* ids and javax.faces.ViewState come
# from the rendered form; presumably the j_id values are view-specific (verify
# against the live page).
payload = {
    "residential": "residential",
    "residential:j_id12": "",
    "residential:firstField": 'a',
    "residential:criteria1": "3",
    "residential:city": "ASIND",
    "residential:button1": "residential:button1",
    "residential:suggestionBoxId_selection": "",
    "javax.faces.ViewState": "j_id1"
}
with requests.Session() as s:
    # GET first so the session picks up the server's cookies (Python 2 prints).
    # print s.headers
    print s.get('http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam')
    print s.headers
    print s.cookies
    resp = s.post(
        'http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam',
        data=payload, headers=headers)
    print resp.text
You are pretty near to the full solution. First you need the AJAXREQUEST in the payload to start the search and then follow the redirect to the first results page. The next pages you get with more requests. Only problem: there is no real end-of-pages mark, it starts over with the first page again. So I have to look into the contents for Page x of y.
import re
import requests
import requests.models

# non-standard conform redirect:
# The Seam app signals its redirect via an 'Ajax-Response: redirect' header
# instead of a 30x status, so teach requests to treat that as a redirect too.
requests.Response.is_redirect = property(lambda self: (
    'location' in self.headers and (
        self.status_code in requests.models.REDIRECT_STATI or
        self.headers.get('Ajax-Response', '') == 'redirect'
    )))
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
}
# Same form fields as the question, plus AJAXREQUEST to trigger the search.
payload = {
    "AJAXREQUEST": "loader2",
    "residential": "residential",
    "residential:j_id12": "",
    "residential:firstField": 'a',
    "residential:criteria1": "3",
    "residential:city": "ASIND",
    "residential:button1": "residential:button1",
    "residential:suggestionBoxId_selection": "",
    "javax.faces.ViewState": "j_id1"
}
with requests.Session() as s:
    # Initial GET to establish the session cookies (Python 2 prints).
    print s.get('http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam')
    print s.headers
    print s.cookies
    resp = s.post(
        'http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam',
        data=payload, headers=headers)
    while True:
        # do data processing
        for l in resp.text.split("subscriber');")[1:]: print l[2:].split('<')[0]
        # look for next page: pagination wraps around to page 1, so the only
        # stop condition is "Page x of y" reaching the last page.
        current, last = re.search('Page (\d+) of (\d+)', resp.text).groups()
        if int(current) == int(last):
            break
        # PGDOWNLink advances the server-side result view to the next page.
        resp = s.post('http://dq.ndc.bsnl.co.in/bsnl-web/resSrchDtls.seam',
            data={'AJAXREQUEST':'_viewRoot',
                'j_id10':'j_id10',
                'javax.faces.ViewState':'j_id2',
                'j_id10:PGDOWNLink':'j_id10:PGDOWNLink',
            }, headers=headers)