I have this code:
payload = {'from':'me', 'lang':lang, 'url':csv_url}
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
}
api_url = 'http://dev.mypage.com/app/import/'
sent = requests.get(api_url, params=payload, headers=headers)
I just keep getting a 403. I have been looking at the requests docs.
What am I doing wrong?
UPDATE:
The URL only accepts logged-in users. How can I log in there with requests?
This is how it's usually done using a Session object:
# start new session to persist data between requests
session = requests.Session()
# log in session
response = session.post(
'http://dev.mypage.com/login/',
data={'user':'username', 'password':'12345'}
)
# make sure log in was successful
if not 200 <= response.status_code < 300:
    raise Exception("Error while logging in, code: %d" % response.status_code)
# ... use session object to make logged-in requests, your example:
api_url = 'http://dev.mypage.com/app/import/'
sent = session.get(api_url, params=payload, headers=headers)
You should obviously adapt this to your usage scenario.
The reason a session is needed is that the HTTP protocol itself has no concept of a session, so sessions are implemented on top of it, typically with a cookie that the server sets at login and that the client sends back on every subsequent request.
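For illustration, here is a rough manual equivalent of what the Session does for you, assuming the login endpoint sets an ordinary session cookie (a sketch only, reusing the URLs from your example; the payload values are placeholders): without a Session you would have to forward the cookies yourself on every request.

import requests

# placeholder values standing in for the question's lang and csv_url
payload = {'from': 'me', 'lang': 'en', 'url': 'http://example.com/data.csv'}
headers = {'User-Agent': 'Mozilla/5.0'}

# log in once and keep whatever cookies the server set
login = requests.post('http://dev.mypage.com/login/',
                      data={'user': 'username', 'password': '12345'})
# pass those cookies along manually on the next request
sent = requests.get('http://dev.mypage.com/app/import/',
                    params=payload, headers=headers, cookies=login.cookies)
print(sent.status_code)

With a requests.Session() this bookkeeping happens automatically.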
Related
I am trying to log in to a site with Python in order to scrape it, but after trying for several hours I cannot figure out why, even after I receive an authorization token from a successful POST request, I still get an HTML response that looks as if I am not logged in.
The website is https://www.packtpub.com/ and below is my code.
import requests_html
import json
s_async = requests_html.AsyncHTMLSession()
#define vars for post request
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36 Edg/101.0.1210.32"
data = {"username":username,"password":password}
post_url = "https://services.packtpub.com/auth-v1/users/tokens"
headers = {"User-Agent":user_agent}
#post request
r_post = await s_async.post(post_url,json=data,headers=headers)
if r_post.status_code != 200:
    raise Exception("Check response code from post request")
#define vars for get request's header
#auth_token gets the 'Bearer ' prefix because this is how the authorization
#token is sent in the GET request header by my browser (inspected through DevTools)
auth_token = 'Bearer '+json.loads(r_post.text)['data']['access']
accept = "application/json, text/plain, */*"
accept_encoding = "gzip, deflate, br"
accept_language = "en-US,en;q=0.9,bg;q=0.8"
origin = "https://account.packtpub.com"
referer = "https://account.packtpub.com/"
#sec_ch_ua= r" Not A;Brand";v="99", "Chromium";v="101", "Microsoft Edge";v="101"
sec_ch_ua_mobile = "?0"
sec_ch_ua_platform = "Windows"
##sec-fetch-dest: empty
sec_fetch_mode = "cors"
sec_fetch_site = "same-site"
#the detailed headers are an attempt to copy the headers from the browser's successful GET request
headers = {"User-Agent":user_agent,
"authorization":auth_token,
"accept":accept,
"accept-encoding":accept_encoding,
"accept-language":accept_language,
"origin":origin,
"referer":referer,
"sec-ch-ua-mobile":sec_ch_ua_mobile,
"sec-ch-ua-platform":sec_ch_ua_platform,
"sec-fetch-mode":sec_fetch_mode,
"sec-fetch-site":sec_fetch_site}
#get request
r_get = await s_async.get("https://www.packtpub.com", headers=headers)
if r_get.status_code != 200:
    raise Exception("Check response code from get request")
await r_get.html.arender()
#looking for signs that the login is successful, more specifically I am looking for
#the absence of the "User Sign In" button
with open(r"C:\packtpub_inspect.html","wb") as file:
file.write(r_get.html.raw_html)`
I'm trying to scrape this website: https://triller.co/. I want to get information from profile pages like https://triller.co/#warnermusicarg, so what I do is request the JSON URL that contains the information, in this case https://social.triller.co/v1.5/api/users/by_username/warnermusicarg
When I use requests.get() it works normally and I can retrieve all the information.
import requests
import urllib.parse
from urllib.parse import urlencode
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'
headers = {'authority':'social.triller.co',
'method':'GET',
'path':'/v1.5/api/users/by_username/warnermusicarg',
'scheme':'https',
'accept':'*/*',
'accept-encoding':'gzip, deflate, br',
'accept-language':'ar,en-US;q=0.9,en;q=0.8',
'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsImlhdCI6MTY0MDc4MDc5NSwiZXhwIjoxNjkyNjIwNzk1fQ.eyJpZCI6IjUyNjQ3ODY5OCJ9.Ds-acbfcGSeUrGDSs47pBiT3b13Eb9SMcB8BF8OylqQ',
'origin':'https://triller.co',
'sec-ch-ua':'" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'empty',
'sec-fetch-mode':'cors',
'sec-fetch-site':'same-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
The problem arises when I try to use API proxy providers such as Webscraping.ai, ScrapingBee, etc.
api_key='my_api_key'
api_url='https://api.webscraping.ai/html?'
params = {'api_key': api_key, 'timeout': '20000', 'url':url}
proxy_url = api_url + urlencode(params)
response2 = requests.get(proxy_url, headers=headers)
This gives the following error:
2022-01-08 22:30:59 [urllib3.connectionpool] DEBUG: https://api.webscraping.ai:443 "GET /html?api_key=my_api_key&timeout=20000&url=https%3A%2F%2Fsocial.triller.co%2Fv1.5%2Fapi%2Fusers%2Fby_username%2Fwarnermusicarg&render_js=false HTTP/1.1" 502 91
{'status_code': 403, 'status_message': '', 'message': 'Unexpected HTTP code on the target page'}
What I tried:
1. I looked up the meaning of the 403 code in my API proxy provider's documentation; it says the api_key is wrong, but I'm 100% sure it's correct.
2. I switched to another API proxy provider, but got the same issue.
3. I had the same issue with twitter.com as well.
I don't know what else to do.
Currently, the code in the question successfully returns a response with code 200, but there are two possible issues:
Some sites block datacenter proxies; try the proxy=residential API parameter (params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}).
Some of the headers in your headers parameter are unnecessary. Webscraping.AI uses its own set of headers to mimic the behavior of normal browsers, so setting a custom user-agent, accept-language, etc., may interfere with them and cause 403 responses from the target site. Use only the necessary headers; in your case that looks like just the authorization header (see the sketch below).
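Putting both suggestions together, a trimmed-down request might look roughly like this (a sketch; the api_key and bearer token are placeholders):

import requests
from urllib.parse import urlencode

api_key = 'my_api_key'
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'

params = {'api_key': api_key, 'timeout': '20000',
          'proxy': 'residential', 'url': url}
# keep only the header the target API actually needs
headers = {'authorization': 'Bearer <your token>'}

response = requests.get('https://api.webscraping.ai/html?' + urlencode(params),
                        headers=headers)
print(response.status_code)
print(response.text[:200])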
I don't know exactly what caused this error, but I tried using their webscraping_ai.ApiClient() instance as shown here, and it worked:
import webscraping_ai

configuration = webscraping_ai.Configuration(
    host="https://api.webscraping.ai",
    api_key={
        'api_key': 'my_api_key'
    }
)
with webscraping_ai.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = webscraping_ai.HTMLApi(api_client)
    url_j = url  # str | URL of the target page
    headers = headers
    timeout = 20000
    js = False
    proxy = 'datacenter'
    api_response = api_instance.get_html(url_j, headers=headers, timeout=timeout, js=js, proxy=proxy)
I use requests to pretend to be Firefox, and in Fiddler I can see that the headers are the same, but the SyntaxView is not the same.
payload = {'searchType':'U'}
s.post(url,data=payload)
but I got error 500. In the SyntaxView I saw that requests sends searchType=U,
while the real browser sends searchType='U'.
I tried payload = {'searchType':'\'U\''}, but it becomes searchType=%27U%27 in the SyntaxView.
Any idea? This is the only difference I can find, so I suspect it is what triggers the 500 error.
import requests
s=requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'})
s.get('http://gls.fehd.gov.hk/fehd_lgs/jsp/search/searchMainPage.jsp?lang=zh_TW')
s.headers.update({'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'X-Requested-With': 'XMLHttpRequest'})
s.headers.update({'Referer': 'http://gls.fehd.gov.hk/fehd_lgs/jsp/search/searchMainPage.jsp?lang=zh_TW', 'HOST':'gls.fehd.gov.hk'})
s.headers.update({'Accept': 'application/xml, text/xml, */*; q=0.01'})
payload={'searchType':'U','deceased_surName':'','deceased_firstName':'','deceased_age':'','deceased_gender':'M','deceased_nationality':'','deathYear':'','deathMonth':'default','deathDay':'default','burialYear':'','burialMonth':'default','burialDay':'default','sectionNo':'','graveNo':''}
url='http://gls.fehd.gov.hk/FEHD_LGS/util/getSearchResult.jsp'
s.post(url,data=payload)
If the value you want to send is 'U' (including the quotes), this might help you send it correctly:
payload = {'searchType': "'U'"}
s.post(url,data=payload)
Edit:
I don't think you need to make a POST request. Try making a GET request:
url='http://gls.fehd.gov.hk/FEHD_LGS/util/getSearchResult.jsp'
response = requests.get("%s?%s" % (url, "searchType='U'&deceased_surname=%E6%A5%8A&deceased_firstname=&deceased_age=&deceased_gender='M'&deceased_nationality=&deathYear=&deathMonth=default&deathDay=default&burialYear=&burialMonth=default&burialDay=default§ionNo=&graveNo=–"))
print(response.content.decode())
select * from cccs_dece_info where SITE_ID in (12,13) and GRAVE_TYPE in ('U') and ( DECEASED_CNAME like '楊%' or upper(DECEASED_ENAME) like '楊 %' or DECEASED_ALIAS = '楊' or DECEASED_ALIAS = '楊') and ( DECEASED_SEX_CODE in ('M', 'U')) and ( GRAVE_NO='–' )
java.sql.SQLSyntaxErrorException: ORA-01722: invalid number
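If you do go the GET route, it may be less error-prone to let requests build and percent-encode the query string via the params argument instead of concatenating it by hand (a sketch; the field names are taken from the URL above and only a subset is shown, so check them against the real form):

import requests

url = 'http://gls.fehd.gov.hk/FEHD_LGS/util/getSearchResult.jsp'
# requests percent-encodes the values (including the quotes) for you
params = {
    'searchType': "'U'",
    'deceased_surname': '楊',
    'deceased_firstname': '',
    'deceased_gender': "'M'",
    'deathMonth': 'default', 'deathDay': 'default',
    'burialMonth': 'default', 'burialDay': 'default',
    'sectionNo': '', 'graveNo': '',
}
response = requests.get(url, params=params)
print(response.content.decode())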
If your server handles the POST payload in JSON format, convert your payload to JSON first.
import requests
import json
url = "http://someurl.com/"
# format for json payload
def post(url, param):
    payload = json.dumps(param)
    # purely cosmetic re-formatting of the JSON string
    payload = payload.replace(", ", ",")
    payload = payload.replace("{", "{\n\t")
    payload = payload.replace("\",", "\",\n\t")
    payload = payload.replace("}", "\n}")
    return requests.request("POST", url, data=payload)

payloads = dict(searchType='U')
response = post(url, payloads)
print(response.text)
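As a side note, requests can serialize the dict to JSON itself via the json= keyword, which also sets the Content-Type header, so the manual json.dumps and string replacements are not strictly necessary if a plain JSON body is all the server expects (a minimal sketch):

import requests

url = "http://someurl.com/"
# requests serializes the dict and sends Content-Type: application/json
response = requests.post(url, json={'searchType': 'U'})
print(response.status_code)
print(response.text)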
There is nothing wrong with the code; it looks like there is something wrong with your URL/server. I checked with Postman as well.
Have you tried sending the POST payload another way (e.g. Postman or a PHP POST request)?
I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'login':'username',
'password':'password',
'authenticity_token':'Token that keeps changing',
'commit':'Sign in',
'utf8':'%E2%9C%93'
}
res = requests.post(url)
print(res.text)
Now, res.text prints the HTML of the login page. I understand that it may be because the token keeps changing continuously. I have also tried setting the URL to https://github.com/session, but that does not work either.
Can anyone tell me a way to generate the token? I am looking for a way to log in without using the API. I had asked another question where I mentioned that I was unable to log in. One comment said that I was not doing it right and that it is possible to log in just by using the requests module, without the help of the GitHub API.
ME:
So, can I log in to Facebook or Github using the POST method? I have tried that and it did not work.
THE USER:
Well, presumably you did something wrong
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.
I think you get back to the login page because you are redirected, and since your code doesn't send your cookies back, you can't have a session.
You are looking for session persistence; requests provides it:
Session Objects: The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/
Actually, in a POST request the parameters should go in the request body, not in the headers. So the login data should be passed via the data parameter.
For GitHub, the authenticity token is present in the value attribute of an input tag; here it is extracted using the BeautifulSoup library.
This code works fine:
import requests
from getpass import getpass
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': input('Username: '),
'password': getpass()
}
url = 'https://github.com/session'
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
'input', attrs={'name': 'authenticity_token'})['value']
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)
response = session.get('https://github.com', headers=headers)
print(response.text)
This code works perfectly:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': 'your-username',
'password': 'your-password'
}
with requests.Session() as s:
    url = "https://github.com/session"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
    r = s.post(url, data=login_data, headers=headers)
You can also try using the PyGithub library, which wraps the GitHub API, to perform common GitHub tasks.
Check the link below:
https://github.com/PyGithub/PyGithub
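For example, a minimal PyGithub sketch (the personal access token is a placeholder) could look roughly like this:

from github import Github

# authenticate with a personal access token instead of scraping the login form
g = Github("your-personal-access-token")
user = g.get_user()
print(user.login)
# list the repositories the authenticated user can see
for repo in user.get_repos():
    print(repo.full_name)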
I am trying to make an HTTP request with the requests library to the redirect URL (found in the response's Location header). In the Chrome inspector I can see the response status is 302.
However, in Python, requests always returns a 200 status. I added allow_redirects=False, but the status is still always 200.
The url is https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679
In the first field, enter the test account: moyan429#hotmail.com
In the second field, enter the password: 112358
Then click the first button to log in.
My Python code:
import requests
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36'
session = requests.session()
session.headers['User-Agent'] = user_agent
session.headers['Host'] = 'api.weibo.com'
session.headers['Origin']='https://api.weibo.com'
session.headers['Referer'] ='https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679'
session.headers['Connection']='keep-alive'
data = {
'client_id': api_key,
'redirect_uri': callback_url,
'userId':'moyan429#hotmail.com',
'passwd': '112358',
'switchLogin': '0',
'action': 'login',
'response_type': 'code',
'quick_auth': 'null'
}
resp = session.post(
url='https://api.weibo.com/oauth2/authorize',
data=data,
allow_redirects=False
)
code = resp.url[-32:]
print code
You are probably getting an API error message. Use print resp.text to see what the server tells you is wrong here.
Note that you can always inspect resp.history to see if there were any redirects; if there were any you'll find a list of response objects.
Do not set the Host or Connection headers; leave those for requests to handle. I doubt the Origin or Referer headers are needed here either. Since this is an API, the User-Agent header is probably also overkill.
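For instance, something along these lines (a sketch, reusing resp from the code above) shows both possibilities at once:

print(resp.status_code)              # 302 here would mean the authorize step redirected
print(resp.headers.get('Location'))  # target of the redirect, if there was one
print(resp.history)                  # redirect chain; only populated when allow_redirects=True
print(resp.text[:500])               # the body usually contains the API's error message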