Example exported from Postman:
import urllib2

url = "http://www.example.com/posts"
req = urllib2.Request(url, headers={
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30",
    'Content-Type': "application/x-www-form-urlencoded"
})
con = urllib2.urlopen(req)
print con.read()
This code works fine, but I want to add the key and value you can see in the Postman screenshot (postid = 134686) to get the response I want. I don't know how to add that key/value pair in Python; in Postman it's a POST request.
Form-encoded is the normal way to send a POST request with data. With urllib2 you URL-encode the dict and pass it as data; that automatically turns the request into a POST, and you don't even need to specify the content-type:
import urllib

data = urllib.urlencode({'postid': 134686})
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"}
req = urllib2.Request(url, headers=headers, data=data)
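If you can use the requests library instead, it form-encodes the dict for you; a minimal sketch of the same request (URL, post ID and User-Agent taken from the question):
import requests

url = "http://www.example.com/posts"
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"}
# requests form-encodes the dict and sets the Content-Type header itself
r = requests.post(url, headers=headers, data={'postid': 134686})
print(r.text)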
One important thing to note: for nested JSON data you will need to convert the nested JSON object to a string.
data = {
    'key1': 'value',
    'key2': {
        'nested_key1': 'nested_value1',
        'nested_key2': 123
    }
}
The dictionary needs to be transformed into this format:
import json

inner_dictionary = {
    'nested_key1': 'nested_value1',
    'nested_key2': 123
}
data = {
    'key1': 'value',
    'key2': json.dumps(inner_dictionary)
}
r = requests.post(URL, data=data)
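A quick way to check what actually goes over the wire is to prepare the request without sending it and inspect its body (httpbin.org is just a placeholder endpoint here):
import json
import requests

inner_dictionary = {'nested_key1': 'nested_value1', 'nested_key2': 123}
data = {'key1': 'value', 'key2': json.dumps(inner_dictionary)}

# prepare() builds the request without sending it; .body is the form-encoded payload
prepared = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()
print(prepared.body)
# the receiving side gets key2 as a string and decodes it back with json.loads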
After visiting this website, when I fill in the input box with Sydney CBD, NSW and hit the search button, I can see the required results displayed on that site.
I wish to scrape the property links using the requests module. With the following attempt, I can get the property links from the first page.
The problem is that I hardcoded the value of sha256Hash within params, which is not what I want to do. I don't know whether the ID retrieved by issuing a GET request to the suggestion URL needs to be converted to a sha256 hash.
However, when I do that using the function get_hashed_string(), the value it produces is different from the hardcoded one available within params. As a result, the script raises a KeyError on the container = res.json() line.
import requests
import hashlib
from pprint import pprint
from bs4 import BeautifulSoup

url = 'https://suggest.realestate.com.au/consumer-suggest/suggestions'
link = 'https://lexa.realestate.com.au/graphql'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
payload = {
    'max': '7',
    'type': 'suburb,region,precinct,state,postcode',
    'src': 'homepage-web',
    'query': 'Sydney CBD, NSW'
}
params = {"operationName":"searchByQuery","variables":{"query":"{\"channel\":\"buy\",\"page\":1,\"pageSize\":25,\"filters\":{\"surroundingSuburbs\":true,\"excludeNoSalePrice\":false,\"ex-under-contract\":false,\"ex-deposit-taken\":false,\"excludeAuctions\":false,\"excludePrivateSales\":false,\"furnished\":false,\"petsAllowed\":false,\"hasScheduledAuction\":false},\"localities\":[{\"searchLocation\":\"sydney cbd, nsw\"}]}","testListings":False,"nullifyOptionals":False},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"ef58e42a4bd826a761f2092d573ee0fb1dac5a70cd0ce71abfffbf349b5b89c1"}}}

def get_hashed_string(keyword):
    hashed_str = hashlib.sha256(keyword.encode('utf-8')).hexdigest()
    return hashed_str

with requests.Session() as s:
    s.headers.update(headers)
    r = s.get(url, params=payload)
    hashed_id = r.json()['_embedded']['suggestions'][0]['id']
    # params['extensions']['persistedQuery']['sha256Hash'] = get_hashed_string(hashed_id)
    res = s.post(link, json=params)
    container = res.json()['data']['buySearch']['results']['exact']['items']
    for item in container:
        print(item['listing']['_links']['canonical']['href'])
If I run the script as is, it works beautifully. When I uncomment the params['extensions']['persistedQuery'] line and run the script again, it breaks.
How can I generate the value of sha256Hash and use it within the script above?
This is not how GraphQL works. The sha value stays the same across all requests; what you're missing is a valid GraphQL query.
You have to reconstruct that first and then just use the API pagination - that's the key.
Here's how:
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/109.0",
    "Accept": "application/graphql+json, application/json",
    "Content-Type": "application/json",
    "Host": "lexa.realestate.com.au",
    "Referer": "https://www.realestate.com.au/",
}

endpoint = "https://lexa.realestate.com.au/graphql"

graph_query = "{\"channel\":\"buy\",\"page\":page_number,\"pageSize\":25,\"filters\":{\"surroundingSuburbs\":true," \
              "\"excludeNoSalePrice\":false,\"ex-under-contract\":false,\"ex-deposit-taken\":false," \
              "\"excludeAuctions\":false,\"excludePrivateSales\":false,\"furnished\":false,\"petsAllowed\":false," \
              "\"hasScheduledAuction\":false},\"localities\":[{\"searchLocation\":\"sydney cbd, nsw\"}]}"

graph_json = {
    "operationName": "searchByQuery",
    "variables": {
        "query": "",
        "testListings": False,
        "nullifyOptionals": False
    },
    "extensions": {
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "ef58e42a4bd826a761f2092d573ee0fb1dac5a70cd0ce71abfffbf349b5b89c1"
        }
    }
}

if __name__ == '__main__':
    with requests.Session() as s:
        for page in range(1, 3):
            # drop the page number into the raw query string
            graph_json['variables']['query'] = graph_query.replace('page_number', str(page))
            r = s.post(endpoint, headers=headers, data=json.dumps(graph_json))
            listing = r.json()['data']['buySearch']['results']['exact']['items']
            for item in listing:
                print(item['listing']['_links']['canonical']['href'])
This should give you:
https://www.realestate.com.au/property-apartment-nsw-sydney-140558991
https://www.realestate.com.au/property-apartment-nsw-sydney-141380404
https://www.realestate.com.au/property-apartment-nsw-sydney-140310979
https://www.realestate.com.au/property-apartment-nsw-sydney-141259592
https://www.realestate.com.au/property-apartment-nsw-barangaroo-140555291
https://www.realestate.com.au/property-apartment-nsw-sydney-140554403
https://www.realestate.com.au/property-apartment-nsw-millers+point-141245584
https://www.realestate.com.au/property-apartment-nsw-haymarket-139205259
https://www.realestate.com.au/project/hyde-metropolitan-by-deicorp-sydney-600036803
https://www.realestate.com.au/property-apartment-nsw-haymarket-140807411
https://www.realestate.com.au/property-apartment-nsw-sydney-141370756
https://www.realestate.com.au/property-apartment-nsw-sydney-141370364
https://www.realestate.com.au/property-apartment-nsw-haymarket-140425111
https://www.realestate.com.au/project/greenland-centre-sydney-600028910
https://www.realestate.com.au/property-apartment-nsw-sydney-141364136
https://www.realestate.com.au/property-apartment-nsw-sydney-139367203
https://www.realestate.com.au/property-apartment-nsw-sydney-141156696
https://www.realestate.com.au/property-apartment-nsw-sydney-141362880
https://www.realestate.com.au/property-studio-nsw-sydney-141311384
https://www.realestate.com.au/property-apartment-nsw-haymarket-141354876
https://www.realestate.com.au/property-apartment-nsw-the+rocks-140413283
https://www.realestate.com.au/property-apartment-nsw-sydney-141350552
https://www.realestate.com.au/property-apartment-nsw-sydney-140657935
https://www.realestate.com.au/property-apartment-nsw-barangaroo-139149039
https://www.realestate.com.au/property-apartment-nsw-haymarket-141034784
https://www.realestate.com.au/property-apartment-nsw-sydney-141230640
https://www.realestate.com.au/property-apartment-nsw-barangaroo-141340768
https://www.realestate.com.au/property-apartment-nsw-haymarket-141337684
https://www.realestate.com.au/property-unitblock-nsw-millers+point-141337528
https://www.realestate.com.au/property-apartment-nsw-sydney-141028828
https://www.realestate.com.au/property-apartment-nsw-sydney-141223160
https://www.realestate.com.au/property-apartment-nsw-sydney-140643067
https://www.realestate.com.au/property-apartment-nsw-sydney-140768179
https://www.realestate.com.au/property-apartment-nsw-haymarket-139406051
https://www.realestate.com.au/property-apartment-nsw-haymarket-139406047
https://www.realestate.com.au/property-apartment-nsw-sydney-139652067
https://www.realestate.com.au/property-apartment-nsw-sydney-140032667
https://www.realestate.com.au/property-apartment-nsw-sydney-127711002
https://www.realestate.com.au/property-apartment-nsw-sydney-140903924
https://www.realestate.com.au/property-apartment-nsw-walsh+bay-139130519
https://www.realestate.com.au/property-apartment-nsw-sydney-140285823
https://www.realestate.com.au/property-apartment-nsw-sydney-140761223
https://www.realestate.com.au/project/111-castlereagh-sydney-600031082
https://www.realestate.com.au/property-apartment-nsw-sydney-140633099
https://www.realestate.com.au/property-apartment-nsw-haymarket-141102892
https://www.realestate.com.au/property-apartment-nsw-sydney-139522379
https://www.realestate.com.au/property-apartment-nsw-sydney-139521259
https://www.realestate.com.au/property-apartment-nsw-sydney-139521219
https://www.realestate.com.au/property-apartment-nsw-haymarket-140007279
https://www.realestate.com.au/property-apartment-nsw-haymarket-139156515
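A side note on the POST itself: since the headers already declare Content-Type: application/json, data=json.dumps(graph_json) is equivalent to letting requests serialize the dict, so the call inside the loop could also be written as:
r = s.post(endpoint, headers=headers, json=graph_json)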
I want to crawl the data from this website: http://www.stcn.com/article/search.html?search_type=all&page_time=1. The site needs cookies from its search page first, so I first get the cookies it needs from http://www.stcn.com/article/search.html and set them on the request, but it doesn't work after many attempts.
My code looks like this:
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'Host': 'www.stcn.com'
}

def _getStcnCookie(keyWords='all'):
    url = "http://www.stcn.com/article/search.html"
    data = {'keyword': keyWords}
    r = requests.get(url, data, headers=headers, timeout=10)
    if r.status_code != 200:
        return None
    return requests.utils.dict_from_cookiejar(r.cookies)

def searchStcnData(url, keyWords):
    myHeader = dict.copy(headers)
    myHeader['X-Requested-With'] = 'XMLHttpRequest'
    cookies = _getStcnCookie(keyWords=keyWords)
    print(cookies)
    jar = requests.cookies.cookiejar_from_dict(cookies)
    data = {'keyword': 'Paxlovid', 'page_time': 1, 'search_type': 'all'}
    # Option one
    s = requests.Session()
    response = s.post(url, data, headers=myHeader, timeout=5, cookies=cookies)
    print(response.text)
    # Option two
    # myHeader['Cookie'] = 'advanced-stcn_web=potef1789mm5nqgmd6jc1rcih3; path=/; HttpOnly;'+cookiesStr
    # Option three
    r = requests.post(url, data, headers=myHeader, timeout=5, cookies=cookies)
    print(r.json())
    return r.json()

searchStcnData('http://www.stcn.com/article/search.html?search_type=all&page_time=1', 'Paxlovid')
I've tried options 1, 2, and 3 to no avail.
I set cookies in Postman, and setting only 'advanced-stcn_web=5sdfitvu42qggmnjvop4dearj4' gets the data, like this:
{
"state": 1,
"msg": "操作成功",
"data": "<li class=\"\">\n <div class=\"content\">\n <div class=\"tt\">\n <a href=\"/article/detail/769123.html\" target=\"_blank\">\n ......
"page_time": 2
}
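For what it's worth, the idiomatic requests pattern is to let a single Session perform both the search-page GET and the search POST, so the cookies are carried over automatically instead of being harvested from a standalone requests.get. A sketch of that pattern (whether the server then returns data still depends on how it issues the advanced-stcn_web cookie):
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'Host': 'www.stcn.com',
    'X-Requested-With': 'XMLHttpRequest',
}

with requests.Session() as s:
    s.headers.update(headers)
    # the GET stores the Set-Cookie values on the session...
    s.get('http://www.stcn.com/article/search.html', timeout=10)
    # ...and the same session sends them back automatically
    data = {'keyword': 'Paxlovid', 'page_time': 1, 'search_type': 'all'}
    r = s.post('http://www.stcn.com/article/search.html?search_type=all&page_time=1',
               data=data, timeout=5)
    print(r.json())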
I was trying to scrape this website but wasn't getting the table data. I even copied the request data from the Chrome dev tools, but I cannot find out what I'm doing wrong.
Here is my script:
import requests,json
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
r = requests.post(url,headers=headers,data=payload)
print(r.content)
While the request initially lacked the IFC-Cache-Header HTTP header, there is also a JWT token that is passed via the Authorization header.
To retrieve this token, you first need to extract values from the root page :
GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre
which features the following javascript object:
window.HSBC.dpas = {
    "pageInformation": {
        "country": "X",      <========= HERE
        "language": "X",     <========= HERE
        "tokenIssue": {
            "url": "/api/v1/token/issue",
        },
        "dataUrl": {
            "url": "/api/v1/nav/funds",
            "id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"    <========= HERE
        },
        ....
    }
}
You can extract the window.HSBC.dpas javascript object value using regex and then reformat the string so that it becomes valid JSON
These values are then passed in http headers such as X-COUNTRY, X-COMPONENT and X-LANGUAGE to the following call:
GET https://www.assetmanagement.hsbc.de/api/v1/token/issue
It returns the JWT token directly; add the Authorization header to the request as Authorization: Bearer {token}:
GET https://www.assetmanagement.hsbc.de/api/v1/nav/funds
Example:
import requests
import re
import json

api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url = f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"

# call the /fund-centre url to get the documentID value in the javascript
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre?f=Yes&n=-1&v=Documents"
r = requests.get(url,
                 params={
                     "f": "Yes",
                     "n": "-1",
                     "v": "Documents"
                 })

# this gets the javascript object
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
group = res.group(1)

# convert to valid JSON: remove trailing commas: https://stackoverflow.com/a/56595068 (added "e")
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group, 0)
result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])

# call /token/issue API to get a token
r = requests.post(token_url,
                  headers={
                      "X-Country": result["pageInformation"]["country"],
                      "X-Component": result["pageInformation"]["dataUrl"]["id"],
                      "X-Language": result["pageInformation"]["language"]
                  }, data={})
token = r.text
print(token)

# call /nav/funds API
payload = {
    "appliedFilters": [[{"active": True, "id": "Yes"}]],
    "paging": {"fundsPerPage": -1, "currentPage": 1},
    "view": "Documents",
    "searchTerm": [],
    "selectedValues": [],
    "pageInformation": result["pageInformation"]
}
headers = {
    "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
    "Authorization": f"Bearer {token}"
}
r = requests.post(funds_url, headers=headers, json=payload)
print(r.content)
I've been trying to get past the form page on http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam using the Python Requests module.
The problem, I'm guessing, is the AJAX in the form field, and I really have no clue how to go about sending a request for that with Python Requests.
I know that this can be done through Selenium, but I need it done through Requests.
Here's my current code:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
}
payload = {
    "residential": "residential",
    "residential:j_id12": "",
    "residential:firstField": 'a',
    "residential:criteria1": "3",
    "residential:city": "ASIND",
    "residential:button1": "residential:button1",
    "residential:suggestionBoxId_selection": "",
    "javax.faces.ViewState": "j_id1"
}

with requests.Session() as s:
    # print s.headers
    print s.get('http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam')
    print s.headers
    print s.cookies
    resp = s.post(
        'http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam',
        data=payload, headers=headers)
    print resp.text
You are pretty near to the full solution. First you need the AJAXREQUEST field in the payload to start the search, then follow the redirect to the first results page. You get the next pages with further requests. The only problem: there is no real end-of-pages marker; it starts over with the first page again, so I have to look into the contents for "Page x of y".
import re
import requests
import requests.models

# non-standard-conform redirect: the server signals redirects via an
# 'Ajax-Response: redirect' header instead of a 3xx status alone,
# so teach requests to treat that as a redirect too
requests.Response.is_redirect = property(lambda self: (
    'location' in self.headers and (
        self.status_code in requests.models.REDIRECT_STATI or
        self.headers.get('Ajax-Response', '') == 'redirect'
    )))

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'
}
payload = {
    "AJAXREQUEST": "loader2",
    "residential": "residential",
    "residential:j_id12": "",
    "residential:firstField": 'a',
    "residential:criteria1": "3",
    "residential:city": "ASIND",
    "residential:button1": "residential:button1",
    "residential:suggestionBoxId_selection": "",
    "javax.faces.ViewState": "j_id1"
}

with requests.Session() as s:
    print s.get('http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam')
    print s.headers
    print s.cookies
    resp = s.post(
        'http://dq.ndc.bsnl.co.in/bsnl-web/residentialSearch.seam',
        data=payload, headers=headers)
    while True:
        # do data processing
        for l in resp.text.split("subscriber');")[1:]:
            print l[2:].split('<')[0]
        # look for next page
        current, last = re.search(r'Page (\d+) of (\d+)', resp.text).groups()
        if int(current) == int(last):
            break
        resp = s.post('http://dq.ndc.bsnl.co.in/bsnl-web/resSrchDtls.seam',
                      data={'AJAXREQUEST': '_viewRoot',
                            'j_id10': 'j_id10',
                            'javax.faces.ViewState': 'j_id2',
                            'j_id10:PGDOWNLink': 'j_id10:PGDOWNLink',
                            }, headers=headers)
I'm always getting the error message "Bad Request" when I'm trying to POST data to Steam. I did a lot of research and I don't know how to fix this.
Post Values:
# Post Values
total = int(item['price'])
fee = int(item['fee'])
subtotal = total-fee
Cookies:
# Cookies
c = []
c.append('steamMachineAuthXXXXXXXXXXXXXXXXX='+steamMachineAuth)
c.append('steamLogin='+steamLogin)
c.append('steamLoginSecure='+steamLoginSecure)
c.append('sessionid='+sessionid)
cookie = ''
for value in c:
    cookie = cookie + value + '; '
Headers:
# Headers
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
    "Connection": "keep-alive",
    "Host": "steamcommunity.com",
    "Referer": hosturl+"market/listings/"+appid+"/"+item['market_hash_name'],
    "Cookie": cookie,
    "Origin": hosturl,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
Post data:
# Post Data
post = {
    'sessionid': sessionid,
    'currency': int(currency),
    'subtotal': subtotal,
    'fee': fee,
    'total': total,
    'quantity': 1
}
Url:
# url
url = hosturl+'market/buylisting/'+item['listingid']
Sending Request:
# Sending Request
se = requests.Session()
re = se.post(url, data=post, headers=headers)
print re.reason
Output:
Bad Request
I can't speak specifically about the Steam service as I haven't used it yet, but my experience with typical Bad Request responses is that you're either trying an HTTP verb that isn't supported or your request is not formatted correctly.
In your case, I suspect it's the latter.
My first candidate to look at is your cookie formatting. Are you sure you don't have characters that need to be escaped?
I could suggest using something like this instead:
c = {
    'steamMachineAuthXXXXXXXXXXXXXXXXX': steamMachineAuth,
    'steamLogin': steamLogin,
    'steamLoginSecure': steamLoginSecure,
    'sessionid': sessionid
}
cookie = '; '.join('{}="{}"'.format(k, v) for k, v in c.items())
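Alternatively, skip building the header by hand entirely: requests will serialize a dict into the Cookie header for you via the cookies parameter (a sketch reusing the names above; drop the "Cookie" entry from headers first so the two don't conflict):
se = requests.Session()
# requests handles cookie encoding itself, so no manual escaping is needed
re = se.post(url, data=post, headers=headers, cookies=c)
print re.reason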