Python requests: disallow cookies

I am using Python requests:
import requests
image_url = my_url
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
r = requests.get(image_url, headers=headers)
I would like the response to be the same as if I were sending the request from a browser that does NOT allow cookies to be set. The reason for this is that some sites give a different response depending on whether or not my browser allows cookies, and I need the non-cookie response.

Cookies are either sent or not. If you don't set a Cookie header, no cookie is sent, so the request in your question is already treated as sending no cookie.
The server may send a cookie in its response. If you include it in your next request, the server will recognize you; if you don't, the server will conclude that you don't accept cookies.
See http://docs.python-requests.org/en/latest/user/quickstart/#cookies
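If you want to guarantee that no cookie is ever stored or replayed, even when reusing a Session across requests, you can install a cookie policy that rejects everything. A minimal sketch using the standard library's `http.cookiejar` (`cookielib` on Python 2); the empty `allowed_domains` list is the whole trick:

```python
from http.cookiejar import DefaultCookiePolicy

import requests

session = requests.Session()
# An empty allowed-domains list makes the policy reject every Set-Cookie
# the server sends, so the session's cookie jar always stays empty.
session.cookies.set_policy(DefaultCookiePolicy(allowed_domains=[]))

# r = session.get(image_url, headers=headers)  # cookies neither stored nor replayed
```

With this in place you can make as many requests through `session` as you like and still behave like a browser that blocks all cookies.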

Related

Code works from localhost but not on server - https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050

I am trying to access https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050. It works fine from my localhost (run from VS Code), but when I deploy it on the server I get an HTTP 499 error.
Did anybody get through this and manage to fetch the data using this approach?
It looks like NSE is blocking the request somehow. But then how is it working from localhost?
P.S. - I am a paid user of PythonAnywhere (Hacker subscription).
import requests

def marketDatafn(query):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    main_url = "https://www.nseindia.com/"
    session = requests.Session()
    # Visit the homepage first so the session picks up NSE's cookies
    response = session.get(main_url, headers=headers)
    cookies = response.cookies
    url = "https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050"
    nifty50DataReq = session.get(url, headers=headers, cookies=cookies, timeout=15)
    nifty50DataJson = nifty50DataReq.json()  # fixed: was nifty100DataReq, a NameError
    return nifty50DataJson['data']
Actually, PythonAnywhere only supports websites that are on this whitelist.
And I have found that there are only two subdomains available under "nseindia.com", neither of which is the one you are trying to request:
bricsonline.nseindia.com
bricsonlinereguat.nseindia.com
So, PythonAnywhere is blocking you from sending requests to that website.
Here's the link to read more about how to request that your website be added.

Returning 403 Forbidden from simple get but loads okay in browser

I'm trying to get some data from a page, but it returns the error [403 Forbidden].
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried the fake-useragent library, but without success.
import requests
from fake_useragent import UserAgent  # pip install fake-useragent

with requests.Session() as c:
    url = '...'
    #headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any ideas?
The site could be using anything in the request to trigger the rejection.
So, copy all the headers from the request that your browser makes, then delete them one by one¹ to find out which are essential.
As per Python requests. 403 Forbidden, to add custom headers to the request, do:
result = requests.get(url, headers={'header': 'value', <etc>})
¹ A faster way would be to delete half of them each time, but that's more complicated since there are probably multiple essential headers.
These are all the headers I can see the browser include in a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try to include all of those incrementally in your request (one by one) in order to identify which one(s) are required for a successful request.
Also, take a look at the Cookies and/or Security tabs available in your browser console / developer tools under the Network option.
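The delete-one-at-a-time procedure described above can be sketched as a small helper. Everything here is hypothetical scaffolding: `find_essential_headers` is a made-up name, and `works` is whatever success check you choose for your site:

```python
def find_essential_headers(headers, works):
    """Drop headers one at a time; keep only those the request fails without.

    `works(trial_headers)` should return True when a request sent with
    `trial_headers` still succeeds (e.g. gets a 200 instead of a 403).
    """
    essential = dict(headers)
    for name in list(essential):
        trial = {k: v for k, v in essential.items() if k != name}
        if works(trial):       # still succeeds without this header...
            essential = trial  # ...so it was not essential
    return essential
```

With requests, `works` could be something like `lambda h: requests.get(url, headers=h, timeout=10).status_code == 200`; the function then returns the minimal header set the site insists on.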

Python requests user agent not working

I am using Python requests to get an HTML page.
I am using the latest version of Chrome in the user agent.
But the response says "Please update your browser".
Here is my sample code.
import requests
url = 'https://www.choicehotels.com/alabama/mobile/quality-inn-hotels/al045/hotel-reviews/4'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 'content-type': 'application/xhtml+xml', 'referer': url}
s = requests.Session()  # `s` was never defined in the original snippet
url_response = s.get(url, headers=headers, timeout=15)
print url_response.text
I am using Python 2.7 on a Windows server. But when I ran the same code locally, I got the required output.
"Please update your browser" is the answer.
You cannot do HTTPS with an old TLS stack (and requests on Python 2.7 may well be sitting on one). There have been a lot of security problems in older TLS protocol versions, so servers increasingly refuse connections that offer only insecure encryption and connection standards, or serve a fallback page like this one.
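A quick way to check whether the server's TLS stack is the problem is to compare the Python and OpenSSL versions on the server against your local machine; if they differ, the TLS capabilities likely differ too. A small diagnostic sketch:

```python
import ssl
import sys

# An old OpenSSL (e.g. 0.9.x / 1.0.0) cannot negotiate the TLS versions and
# cipher suites many sites now require, which can yield handshake failures
# or fallback pages like "Please update your browser".
print(sys.version)
print(ssl.OPENSSL_VERSION)
```

Run this in both environments; a much older OpenSSL on the server is a strong hint that the fix is upgrading the Python/OpenSSL installation rather than tweaking headers.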

Missing certain part of Cookies using requests.get()?

Background info:
I am scraping Amazon. I need to set up the session cookies before using requests.Session().get() to fetch the final version of a URL's page source.
Code:
import requests
# I am currently working in China, so it's cn.
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies
#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies
#now begin download the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When I navigate to the Amazon homepage in Chrome, the cookies look something like the following (screenshot omitted):
As you can see in the cookies part, which I underlined in red, one of the cookies set by the response to our homepage request is "ubid-acbcn", which also appears in the request header, probably left over from the last visit.
So that is the cookie I want, which I attempted to get with the code above.
In Python code it should be a CookieJar or a dictionary; either way, its content should contain 'ubid-acbcn' and 'session-id':
{'ubid-acbcn': '453-7613662-1073007', 'session-id': '455-1363863-7141553', 'otherparts': 'otherparts'}
What I am getting instead:
The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>> homepage = 'http://www.amazon.cn'
>>> headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>> response = requests.get(homepage, headers=headers)
>>> cookies = response.cookies
>>> print(cookies.get_dict())
{'session-id': '456-2975694-3270026', 'otherparts': 'otherparts'}
Related Info:
OS: WINDOWS 10
PYTHON: 3.5
requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and figured out:
I googled various keywords, but nobody seems to be facing this problem.
I figure it might have something to do with Amazon's anti-scraping measures. But other than changing my headers to disguise myself as a human, there isn't much else I know to do.
I have also entertained the possibility that it might not be a case of a missing cookie, but rather that I have not set up my requests.get(homepage, headers=headers) call properly, so the response cookies are not as expected. Given this, I tried copying the request headers from my browser, leaving out only the cookie part, but the response cookies are still missing the 'ubid-acbcn' part. Maybe some other parameter has to be set up?
You're trying to get cookies from a plain one-off GET request. But if you send it on behalf of a Session, you get the required ubid-acbcn value:
session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage, headers=headers)
cookies = response.cookies
print(cookies.get_dict())
Output:
{'ubid-acbcn': '456-2652288-5841140' ...}
The cookies being set come from other pages/resources, probably loaded by JavaScript code, so you may need to use the Selenium web driver for this. Check out this link for a detailed discussion:
not getting all cookie info using python requests module

python requests handle error 302?

I am trying to make an HTTP request with the requests library to the redirect URL (in the response's Location header). In Chrome's inspector I can see the response status is 302.
However, in Python, requests always returns a 200 status. I added allow_redirects=False, but the status is still always 200.
The url is https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679
On the first line, enter the test account: moyan429#hotmail.com;
on the second line, enter the password: 112358;
then click the first button to log in.
My Python code:
import requests

api_key = '...'       # your app's client_id (undefined in the original snippet)
callback_url = '...'  # your registered redirect_uri (also undefined)
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36'
session = requests.session()
session.headers['User-Agent'] = user_agent
session.headers['Host'] = 'api.weibo.com'
session.headers['Origin'] = 'https://api.weibo.com'
session.headers['Referer'] = 'https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679'
session.headers['Connection'] = 'keep-alive'
data = {
    'client_id': api_key,
    'redirect_uri': callback_url,
    'userId': 'moyan429#hotmail.com',
    'passwd': '112358',
    'switchLogin': '0',
    'action': 'login',
    'response_type': 'code',
    'quick_auth': 'null'
}
resp = session.post(
    url='https://api.weibo.com/oauth2/authorize',
    data=data,
    allow_redirects=False
)
code = resp.url[-32:]
print code
You are probably getting an API error message; use print resp.text to see what the server tells you is wrong.
Note that you can always inspect resp.history to see if there were any redirects; if there were, you'll find a list of response objects there.
Do not set the Host or Connection headers; leave those for requests to handle. I doubt the Origin or Referer headers are needed here either. Since this is an API, the User-Agent header is probably also overkill.
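With allow_redirects=False, the way to see a 302 is to look at the status code and the Location header rather than resp.url. A small sketch (the helper name is made up; it works on any response fetched without following redirects):

```python
import requests

def inspect_redirect(resp):
    """Given a response fetched with allow_redirects=False,
    report the status and where the response points."""
    if resp.is_redirect:  # a 3xx status with a Location header
        return resp.status_code, resp.headers['Location']
    # Not a redirect: the server answered this URL directly
    return resp.status_code, resp.url
```

If the login POST really did redirect, the second element is the Location URL containing the code parameter you are after; if it returns (200, ...), the server answered with an error page instead, which is what resp.text will show.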
