I use requests to pretend to be Firefox, and in Fiddler I can see the headers are the same, but the SyntaxView is not the same.
payload = {'searchType':'U'}
s.post(url,data=payload)
With this I get error 500. In SyntaxView I can see that requests sends searchType=U,
while the real browser sends searchType='U'.
I tried payload = {'searchType':'\'U\''} but it becomes searchType=%27U%27 in SyntaxView.
Any ideas? This is the only difference I can find, so I suspect it is what triggers the 500 error.
import requests
s=requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'})
s.get('http://gls.fehd.gov.hk/fehd_lgs/jsp/search/searchMainPage.jsp?lang=zh_TW')
s.headers.update({'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'X-Requested-With': 'XMLHttpRequest'})
s.headers.update({'Referer': 'http://gls.fehd.gov.hk/fehd_lgs/jsp/search/searchMainPage.jsp?lang=zh_TW', 'HOST':'gls.fehd.gov.hk'})
s.headers.update({'Accept': 'application/xml, text/xml, */*; q=0.01'})
payload={'searchType':'U','deceased_surName':'','deceased_firstName':'','deceased_age':'','deceased_gender':'M','deceased_nationality':'','deathYear':'','deathMonth':'default','deathDay':'default','burialYear':'','burialMonth':'default','burialDay':'default','sectionNo':'','graveNo':''}
url='http://gls.fehd.gov.hk/FEHD_LGS/util/getSearchResult.jsp'
s.post(url,data=payload)
If the value you want to send is 'U' (including the quotes), this might help you send it correctly.
payload = {'searchType': "'U'"}
s.post(url,data=payload)
Edit:
I don't think you need to make a POST request. Try making a GET request:
url='http://gls.fehd.gov.hk/FEHD_LGS/util/getSearchResult.jsp'
response = requests.get("%s?%s" % (url, "searchType='U'&deceased_surname=%E6%A5%8A&deceased_firstname=&deceased_age=&deceased_gender='M'&deceased_nationality=&deathYear=&deathMonth=default&deathDay=default&burialYear=&burialMonth=default&burialDay=default&sectionNo=&graveNo=–"))
print(response.content.decode())
select * from cccs_dece_info where SITE_ID in (12,13) and GRAVE_TYPE in ('U') and ( DECEASED_CNAME like '楊%' or upper(DECEASED_ENAME) like '楊 %' or DECEASED_ALIAS = '楊' or DECEASED_ALIAS = '楊') and ( DECEASED_SEX_CODE in ('M', 'U')) and ( GRAVE_NO='–' )
java.sql.SQLSyntaxErrorException: ORA-01722: invalid number
If your server handles the POST payload in JSON format, convert your payload to JSON first.
import requests
import json

url = "http://someurl.com/"

# serialize the payload to JSON before posting
def post(url, param):
    payload = json.dumps(param)
    # optional: reflow the JSON string for readability
    payload = payload.replace(", ", ",")
    payload = payload.replace("{", "{\n\t")
    payload = payload.replace("\",", "\",\n\t")
    payload = payload.replace("}", "\n}")
    return requests.request("POST", url, data=payload)

payloads = dict(searchType='U')
response = post(url, payloads)
print(response.text)
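As a side note, recent versions of requests can do the serialization for you and set the Content-Type header via the json parameter; a minimal sketch using the same placeholder url:
import requests

payload = {'searchType': 'U'}
# requests serializes the dict to JSON and sets Content-Type: application/json
response = requests.post("http://someurl.com/", json=payload)
print(response.text)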
There is nothing wrong with the code; it looks like something is wrong with your URL/server. I checked with Postman and it looks like this picture.
Have you tried another method to do the POST payload? (e.g. Postman or a PHP POST method)
Related
I'm trying to scrape this website https://triller.co/ . I want to get information from profile pages like https://triller.co/#warnermusicarg , so I request the JSON URL that contains the information, in this case https://social.triller.co/v1.5/api/users/by_username/warnermusicarg .
When I use requests.get() it works normally and I can retrieve all the information.
import requests
import urllib.parse
from urllib.parse import urlencode
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'
headers = {'authority':'social.triller.co',
'method':'GET',
'path':'/v1.5/api/users/by_username/warnermusicarg',
'scheme':'https',
'accept':'*/*',
'accept-encoding':'gzip, deflate, br',
'accept-language':'ar,en-US;q=0.9,en;q=0.8',
'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsImlhdCI6MTY0MDc4MDc5NSwiZXhwIjoxNjkyNjIwNzk1fQ.eyJpZCI6IjUyNjQ3ODY5OCJ9.Ds-acbfcGSeUrGDSs47pBiT3b13Eb9SMcB8BF8OylqQ',
'origin':'https://triller.co',
'sec-ch-ua':'" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'empty',
'sec-fetch-mode':'cors',
'sec-fetch-site':'same-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
The problem arises when I try to use API proxy providers such as Webscraping.ai, ScrapingBee, etc.
api_key='my_api_key'
api_url='https://api.webscraping.ai/html?'
params = {'api_key': api_key, 'timeout': '20000', 'url':url}
proxy_url = api_url + urlencode(params)
response2 = requests.get(proxy_url, headers=headers)
This gives me this error
2022-01-08 22:30:59 [urllib3.connectionpool] DEBUG: https://api.webscraping.ai:443 "GET /html?api_key=my_api_key&timeout=20000&url=https%3A%2F%2Fsocial.triller.co%2Fv1.5%2Fapi%2Fusers%2Fby_username%2Fwarnermusicarg&render_js=false HTTP/1.1" 502 91
{'status_code': 403, 'status_message': '', 'message': 'Unexpected HTTP code on the target page'}
What I tried to do:
1- I searched for the meaning of the 403 code in my API proxy provider's documentation; it says the api_key is wrong, but I'm 100% sure it's correct.
Also, I changed to another API proxy provider, but I got the same issue.
Also, I had the same issue with twitter.com.
I don't know what to do.
Currently, the code in the question successfully returns a response with code 200, but there are 2 possible issues:
Some sites block datacenter proxies; try the proxy=residential API parameter (params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}).
Some of the headers in your headers parameter are unnecessary. Webscraping.AI uses its own set of headers to mimic the behavior of normal browsers, so setting a custom user-agent, accept-language, etc., may interfere with them and cause 403 responses from the target site. Use only the necessary headers; in your case that looks like only the authorization header. A sketch combining both adjustments follows.
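Here is a minimal sketch of the adjusted call, assuming the same api_key and target url as above and that only the authorization header is forwarded:
import requests
from urllib.parse import urlencode

api_key = 'my_api_key'
api_url = 'https://api.webscraping.ai/html?'
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'

# ask for a residential proxy instead of the default datacenter one
params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}
# forward only the authorization header; everything else is left to the service
headers = {'authorization': 'Bearer <your token>'}

response = requests.get(api_url + urlencode(params), headers=headers)
print(response.status_code)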
I don't know exactly what caused the error, but I tried using their webscraping_ai.ApiClient() instance as in here, and it worked:
import webscraping_ai

configuration = webscraping_ai.Configuration(
    host="https://api.webscraping.ai",
    api_key={
        'api_key': 'my_api_key'
    }
)

with webscraping_ai.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = webscraping_ai.HTMLApi(api_client)
    url_j = url  # str | URL of the target page
    headers = headers  # reuse the headers dict defined earlier
    timeout = 20000
    js = False
    proxy = 'datacenter'
    api_response = api_instance.get_html(url_j, headers=headers, timeout=timeout, js=js, proxy=proxy)
I want to send a value for "User-Agent" while requesting a webpage using Python Requests. I am not sure if it is okay to send this as part of the header, as in the code below:
debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers = user_agent, config=debug)
The debug information isn't showing the headers being sent during the request.
Is it acceptable to send this information in the header? If not, how can I send it?
The user-agent should be specified as a field in the header.
Here is a list of HTTP header fields, and you'd probably be interested in request-specific fields, which includes User-Agent.
If you're using requests v2.13 and newer
The simplest way to do what you want is to create a dictionary and specify your headers directly, like so:
import requests
url = 'SOME URL'
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.example'  # This is another valid field
}
response = requests.get(url, headers=headers)
If you're using requests v2.12.x and older
Older versions of requests clobbered default headers, so you'd want to do the following to preserve default headers and then add your own to them.
import requests
url = 'SOME URL'
# Get a copy of the default headers that requests would use
headers = requests.utils.default_headers()
# Update the headers with your custom ones
# You don't have to worry about case-sensitivity with
# the dictionary keys, because default_headers uses a custom
# CaseInsensitiveDict implementation within requests' source code.
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)
response = requests.get(url, headers=headers)
It's more convenient to use a session, this way you don't have to remember to set headers each time:
session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})
session.get('https://httpbin.org/headers')
By default, session also manages cookies for you. In case you want to disable that, see this question.
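For example, a quick sketch against httpbin showing the cookie persistence (with the custom User-Agent from above):
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})

# the cookie set by the first response is sent automatically on the next request
session.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
print(session.get('https://httpbin.org/cookies').json())
# -> {'cookies': {'sessioncookie': '123456789'}}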
This will send the request like a browser:
import requests
url = 'https://Your-url'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response= requests.get(url.strip(), headers=headers, timeout=10)
I want to land on the main (learning) page of my Duolingo profile, but I am having a little trouble finding the correct way to sign in to the website with my credentials using Python Requests.
I have tried making requests as well as I understood them, but I am pretty much a noob at this, so it has all been in vain thus far.
Help would be really appreciated!
This is what I was trying on my own, by the way:
# The Dictionary Keys/Values and the Post Request URL were taken from the Network Source code in Inspect on Google Chrome
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

login_data = {
    'identifier': 'something@email.com',
    'password': 'myPassword'
}

with requests.Session() as s:
    url = "https://www.duolingo.com/2017-06-30/login?fields="
    s.post(url, headers=headers, params=login_data)
    r = s.get("https://www.duolingo.com/learn")
    print(r.content)
The post request receives the following content:
b'{"details": "Malformed JSON: No JSON object could be decoded", "error": "BAD_REQUEST_SCHEMA"}'
And since the login fails, the get request for the learn page receives this:
b'<html>\n <head>\n <title>401 Unauthorized</title>\n </head>\n <body>\n <h1>401
Unauthorized</h1>\n This server could not verify that you are authorized to access the document you
requested. Either you supplied the wrong credentials (e.g., bad password), or your browser does not
understand how to supply the credentials required.<br/><br/>\n\n\n\n </body>\n</html>'
Sorry if I am making any stupid mistakes. I do not know a lot about all this. Thanks!
If you inspect the POST request carefully you can see that:
accepted content type is application/json
there are more fields than you have supplied (distinctId, landingUrl)
the data is sent as a json request body and not url params
The only thing you need to figure out is how to get distinctId; then you can do the following:
EDIT:
Sending the email/password as a JSON body appears to be enough, and there is no need to get distinctId. Example:
import requests
import json

headers = {'content-type': 'application/json'}
data = {
    'identifier': 'something@email.com',
    'password': 'myPassword',
}

with requests.Session() as s:
    url = "https://www.duolingo.com/2017-06-30/login?fields="
    # use json.dumps to convert dict to serialized json string
    s.post(url, headers=headers, data=json.dumps(data))
    r = s.get("https://www.duolingo.com/learn")
    print(r.content)
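As a side note, requests can also do the JSON serialization itself via the json parameter, which sets the Content-Type header for you; a minimal sketch of the same login call, reusing the data dict above:
with requests.Session() as s:
    # json= serializes the dict and sends Content-Type: application/json
    s.post("https://www.duolingo.com/2017-06-30/login?fields=", json=data)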
I have a crawl-data task. After inspecting the URL with Firefox F12 (DevTools), I find that the site needs a JSON array input that looks like:
phyIDs: Array
0: "FDER047ERDF"
and returns some data also in JSON format:
trueIDs: Array
0: "802.112.1"
What I need is just the 'trueIDs', so I use Python 3.6.1 and Requests to do the job. Here is part of the code:
import json
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
           'Cookie': 'JSFDKF.......',
           'Content-Type': 'application/json;charset=UTF-8'}
data = {'phyIDs': json.dumps([{0: 'FDER047ERDF'}])}
resp = requests.post(url, headers=headers, verify=False,
                     data=data)
print(resp.text)
But the printed response text is an HTML-like message saying that some error occurred, and the status_code is 500. However, if I comment out the 'Content-Type' part of headers and use a normal dict instead of JSON as the input data, nothing is returned and the status_code changes to 415. I don't know what to do now and hope someone can help me, thanks very much!
...........
Thanks guys, I have solved this. The problem is that I shouldn't add the '0' index key in the JSON array!
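In other words, something like the following sketch is what the fix amounts to (url is the endpoint being crawled, and trueIDs is assumed to be the key in the JSON response):
import json
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
           'Cookie': 'JSFDKF.......',
           'Content-Type': 'application/json;charset=UTF-8'}
# send phyIDs as a plain JSON list (no '0' index key) and serialize the whole body
body = {'phyIDs': ['FDER047ERDF']}
resp = requests.post(url, headers=headers, verify=False, data=json.dumps(body))
print(resp.json().get('trueIDs'))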
I am testing an application where I send some POST requests. I want to test the application's behavior when some headers are missing in the request, to verify that it generates the correct error codes.
To do this, my code is as follows.
header = {'Content-type': 'application/json'}
data = "hello world"
request = urllib2.Request(url, data, header)
f = urllib2.urlopen(request)
response = f.read()
The problem is that urllib2 adds its own headers, like Content-Length and Accept-Encoding, when it sends the POST request. I don't want urllib2 to add any headers beyond the ones I specified in the headers dict above. Is there a way to do that? I tried setting the headers I don't want to None, but they still go out with empty values as part of the request, which I don't want.
The header takes a dictionary type; the example below uses a Chrome user-agent. For all standard and some non-standard header fields, take a look here. You also need to encode your data with urllib, not urllib2. This is all mentioned in the Python documentation here.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'

values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
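If the goal is strictly to stop urllib2 and the underlying httplib from adding headers such as Host and Accept-Encoding on their own, one lower-level option (separate from the urllib2 approach above) is to build the request at the connection level. This is only a sketch, assuming Python 2's httplib and the same example server:
import httplib

data = 'hello world'
conn = httplib.HTTPConnection('www.someserver.com')
# skip_host / skip_accept_encoding stop httplib from adding those two headers itself
conn.putrequest('POST', '/cgi-bin/register.cgi', skip_host=True, skip_accept_encoding=True)
conn.putheader('Content-type', 'application/json')
conn.putheader('Content-Length', str(len(data)))  # must be set manually at this level
conn.endheaders()
conn.send(data)
print(conn.getresponse().status)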