I'm a bit new to the Python requests library and am having some trouble accessing cookies after forms authentication. I captured packets in Wireshark, and I'm certain that the cookies are being set (HTTP stream output at the bottom). I'm following the documentation here: https://2.python-requests.org/en/master/user/quickstart/#cookies.
My request is as follows:
r = requests.post('http://192.168.2.111/Account/LogOn', data = {'User': 'Customer', 'Password': 'Customer', 'button': 'submit'})
If I invoke print(r.cookies), the only thing returned is <RequestsCookieJar[]>.
Or, if I try to access them using r.cookies['mydeviceAG_POEWebTool'], I get a key error: KeyError: "name='mydeviceAG_POEWebTool', domain=None, path=None"
The HTTP Stream from wireshark is here:
POST /Account/LogOn HTTP/1.1
Host: 192.168.2.111
User-Agent: python-requests/2.24.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 49
Content-Type: application/x-www-form-urlencoded
User=Customer&Password=Customer&button=submit
HTTP/1.1 302 Found
Date: Thu, 24 Sep 2020 10:30:43 GMT
Server: Apache/2.4.25 (Debian)
Location: /States
X-AspNet-Version: 4.0.30319
Cache-Control: private
Set-Cookie: mydeviceAG_POEWebTool=8081AD7E6EEEB463C0AD8458; path=/
Set-Cookie: .mydeviceAG_POEWebTool_AUTH=jmnjqmKeL0ge8fz/sYrP3Xm+ntUTnPLrWBGtKAmAvnkKIjPKYQVn9xVrRa7EUEHLTfB1KNCKjotabnb7QqnDlQlZKuQkJ0J8rLmxuxrtCMFDsa/d6jyUj/PUckJ8V0Te; path=/; expires=Thu, 24 Sep 2020 13:30:44 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 112
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here</h2>
</body><html>
** After reading the link offered by #D-E-N below, and searching around a bit, I tried the code below, which gets me further, but only stores the first of two cookies in the cookie jar. The server sends back two cookies:
Set-Cookie: AG_POEWebTool=9E508CCAE50EACB4AD1B33D8; path=/
Set-Cookie: .AG_POEWebTool_AUTH=WaXAtyh/tGnoZFPIiV4xoF6BhwGbr0jlaFq
but the only cookie stored is the AG_POEWebTool one. **
with requests.Session() as s:
    s.get('http://192.168.2.111/Account/LogOn', timeout=5, params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    r = s.post('http://192.168.2.111/Account/LogOn', data={'User': 'Customer', 'Password': 'Customer', 'button': 'submit'}, timeout=5, params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    cookiedict = s.cookies.get_dict()
    print('URL:')
    for item in cookiedict.items():
        print(item)
    response = s.get('http://192.168.2.111/Logfiles')
This is the response that I get:
cookiedict items:
('AG_POEWebTool', '471B4EB1153E4733E0EA1A40')
Process finished with exit code 0
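For reference, here is a minimal diagnostic sketch (host, credentials, and cookie names are taken from the capture above) that disables redirect following, so the Set-Cookie headers on the 302 itself stay visible:

import requests

# Hedged sketch: keep the 302 response instead of following the redirect,
# then dump both its raw Set-Cookie header(s) and whatever the session stored.
with requests.Session() as s:
    r = s.post(
        'http://192.168.2.111/Account/LogOn',
        data={'User': 'Customer', 'Password': 'Customer', 'button': 'submit'},
        allow_redirects=False,
        timeout=5,
    )
    print(r.status_code)                   # expected: 302, per the capture
    print(r.headers.get('Set-Cookie'))     # both cookies should appear here if the server sent them
    for name, value in s.cookies.items():  # what the cookie jar actually kept
        print(name, value)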
Related
I'm trying to write a Python script that would help me install a WordPress theme remotely. Unfortunately, the upload part doesn't play nice when I try to do it with requests' POST helpers.
The HTTP headers of a successful upload look like this:
http://127.0.0.1/wordpress/wp-admin/update.php?action=upload-theme
POST /wordpress/wp-admin/update.php?action=upload-theme HTTP/1.1
Host: 127.0.0.1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------2455316848522
Content-Length: 2580849
Referer: http://127.0.0.1/wordpress/wp-admin/theme-install.php
Cookie: wordpress_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7Cee7366eea9b5bc9a9d492a664a04cb0916b97b0d211e892875cec86cf43e2f9d; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7C9949f19ef5d900daf1b859c0bb4e2129cf86d6a970718a1b63e3b9e56dc5e710; wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1497486698
Connection: keep-alive
Upgrade-Insecure-Requests: 1
-----------------------------2455316848522: undefined
Content-Disposition: form-data; name="_wpnonce"
b1467671e0
-----------------------------2455316848522
Content-Disposition: form-data; name="_wp_http_referer"
/wordpress/wp-admin/theme-install.php
-----------------------------2455316848522
Content-Disposition: form-data; name="themezip"; filename="oedipus_theme.zip"
Content-Type: application/octet-stream
PK
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2017 01:33:25 GMT
Server: Apache/2.4.25 (Win32) OpenSSL/1.0.2j PHP/7.1.1
X-Powered-By: PHP/7.1.1
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
X-Frame-Options: SAMEORIGIN
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----------------------------------------------------------
To create a simple session for WP, in order to use later for uploads:
import requests

global wp_session

def wpCreateSession(uname, upassword, site_link):
    """
    :param uname: Username for the login.
    :param upassword: Password for the login.
    :param site_link: Site to log in on.
    :return: Returns a session for the said website.
    """
    global wp_session
    wp_session = requests.session()
    wp_session.post(site_link, data={'log': uname, 'pwd': upassword})
To upload the said file to WP, using the wp_session global:
def wpUploadTheme(file_name):
    global wp_session
    try:
        with open(file_name, 'rb') as up_file:
            r = wp_session.post('http://127.0.0.1/wordpress/wp-admin/update.php', files={file_name: up_file})
            print "Got after try."
    finally:
        up_file.close()
And this last bit is where it doesn't work: the upload is not successful, and I get returned to WordPress' basic 404 page.
I have also tried requests_toolbelt's MultipartEncoder, to no avail.
Question: 'requests' POST file fails when trying to upload
Check your files dict; as written, it is invalid:
files = {file_name: up_file}
Maybe you need a full blown files dict, for instance:
files = {'themezip': ('oedipus_theme.zip',
open('oedipus_theme.zip', 'rb'),
'application/octet-stream', {'Expires': '0'})}
From docs.python-requests.org
files = {'file': open('test.jpg', 'rb')}
requests.post(url, files=files)
From SO Answer Upload Image using POST form data in Python-requests
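Putting those pieces together, a hedged sketch of the whole upload might look like the following. The field names (themezip, _wpnonce, _wp_http_referer) come from the capture above; the nonce-extraction regex and URLs are assumptions, and the session is expected to be already logged in (e.g. the wp_session created earlier):

import re
import requests

def upload_theme(session, zip_path):
    # The nonce is tied to the current session, so fetch it from the
    # theme-install page first (the regex is an assumption about the markup).
    install_page = session.get('http://127.0.0.1/wordpress/wp-admin/theme-install.php')
    match = re.search(r'name="_wpnonce" value="([a-f0-9]+)"', install_page.text)
    nonce = match.group(1) if match else ''

    with open(zip_path, 'rb') as up_file:
        files = {'themezip': (zip_path, up_file, 'application/octet-stream')}
        data = {
            '_wpnonce': nonce,
            '_wp_http_referer': '/wordpress/wp-admin/theme-install.php',
        }
        return session.post(
            'http://127.0.0.1/wordpress/wp-admin/update.php',
            params={'action': 'upload-theme'},  # matches ?action=upload-theme in the capture
            data=data,
            files=files,
        )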
We've got a server providing .txt files, basically log files that grow over time. When I use urllib2 to send a GET to the server, r = urllib2.urlopen('http://example.com'), the response headers are:
Date: XXX
Server: Apache
Last-Modified: XXX
Accept-Ranges: bytes
Content-Length: 12345678
Vary: Accept-Encoding
Connection: close
Content-Type: text/plain
While if r = requests.get('http://example.com'):
Content-Encoding: gzip
Accept-Ranges: bytes
Vary: Accept-Encoding
Keep-alive: timeout=5, max=128
Last-Modified: XXX
Connection: Keep-Alive
ETag: xxxxxxxxx
Content-Type: text/plain
The second response is the same as what I get using Chrome's developer tools. So why are the two different? I need the Content-Length header to determine how many bytes I need to download each time, because the file could grow really big.
EDIT:
Using httpbin.org/get to test:
urllib2 response:
{u'args': {},
u'headers': {u'Accept-Encoding': u'identity',
u'Host': u'httpbin.org',
u'User-Agent': u'Python-urllib/2.7'},
u'origin': u'ip',
u'url': u'http://httpbin.org/get'}
response headers:
Server: nginx
Date: Sat, 14 Jan 2017 07:41:16 GMT
Content-Type: application/json
Content-Length: 207
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
requests response:
{u'args': {},
u'headers': {u'Accept': u'*/*',
u'Accept-Encoding': u'gzip, deflate',
u'Host': u'httpbin.org',
u'User-Agent': u'python-requests/2.11.1'},
u'origin': u'ip',
u'url': u'http://httpbin.org/get'}
response headers:
Server: nginx
Date: Sat, 14 Jan 2017 07:42:39 GMT
Content-Type: application/json
Content-Length: 239
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Quote from Lukasa on GitHub:
The response is different because requests indicates that it supports
gzip-encoded bodies, by sending an Accept-Encoding: gzip, deflate
header field. urllib2 does not. You'll find if you added that header
to your urllib2 request that you get the new behaviour.
Clearly, in this case, the server is dynamically gzipping the
responses. This means it doesn't know how long the response will be,
so it is sending using chunked transfer encoding.
If you really must get the Content-Length header, then you should add
the following headers to your Requests request: {'Accept-Encoding':
'identity'}.
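In practice, that suggestion might look like the hedged sketch below (the log-file URL is a placeholder):

import requests

# Ask for an unencoded body so the server can send a Content-Length
# instead of a gzipped, chunked response.
r = requests.get('http://example.com/logfile.txt',
                 headers={'Accept-Encoding': 'identity'})
print(r.headers.get('Content-Length'))

# Since the server also advertises Accept-Ranges: bytes, only the new tail
# of the growing file needs to be fetched (the offset is a placeholder).
tail = requests.get('http://example.com/logfile.txt',
                    headers={'Accept-Encoding': 'identity',
                             'Range': 'bytes=12345678-'})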
This is a follow-up to a question I saw earlier today. In that question, a user asks about a problem downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I would think that the two download functions below would give the same result, but the urllib2 version downloads some html with a script tag referencing a pdf loader, while the requests version downloads the real pdf. Can someone explain the difference in behavior?
import urllib2
import requests
def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests smart enough to wait for dynamic websites to render before downloading?
Edit
Following up on #cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the below functions write the same file for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where did requests get that cookie? Is requests making multiple trips to the server?
Edit 2
Cookie came from a redirect header:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
I'll bet that it's an issue with the User-Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same result you report for urllib2). The User-Agent is the part of the request headers that lets a website know what type of program/user/whatever is accessing the site (it's a property of the HTTP request, not of the library).
By default, it looks like urllib2 uses: Python-urllib/2.7 (as seen in the debug output above)
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'll bet that the site is reading Darwin/13.1.0 or similar and then serving you the macOS-appropriate content. Otherwise, it's probably trying to direct you to some default alternate content (or prevent you from scraping that URL).
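A quick, hedged way to test that theory is to resend the urllib2 request with a browser-like User-Agent (the UA string below is just an example) and compare what comes back:

import urllib2

url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'

# Override urllib2's default User-Agent and check whether the served
# Content-Type changes (text/html for the loader page vs application/pdf).
req = urllib2.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
})
resp = urllib2.urlopen(req)
print(resp.info().get('Content-Type'))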
Hi, I am trying to scrape some data from this URL:
http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1
As you may have noticed, if cookies and session data are not yet set, you will be redirected to its base URL (http://www.21cineplex.com/).
I tried to do it like this:
import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)
        print splitSource
    except Exception, e:
        str(e)
        print "Error occurred in main block"
However, I ended up failing to scrape that particular URL.
A quick inspection reveals that the website sets a session ID (PHPSESSID) and copies it to the client's cookies.
The question is: how do I deal with this?
PS: I've tried to install request (via pip), however it gives me a 404:
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Getting page https://pypi.python.org/simple/
URLs to search for versions for request:
* https://pypi.python.org/simple/request/
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Could not find any downloads that satisfy the requirement request
Cleaning up...
Thanks to #Chainik, I got it to work now. I ended up modifying my code like this:
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'
opener.open(baseurl)
urllib2.install_opener(opener)
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
Once the HTML text is retrieved, it's all about parsing its content.
Cheers
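For completeness, a hedged sketch of that parsing step, reusing the regex and htmlText variables (and the re import) from the code above:

# Apply the regex defined above to the retrieved HTML; re.DOTALL is added
# as an assumption in case the <ul> block spans multiple lines.
blocks = re.findall(regex, htmlText, re.DOTALL)
for block in blocks:
    print(block)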
Try setting a referer URL, see below.
Without referer URL set (302 redirect):
$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/
With referer URL set (HTTP/200):
$ curl -I -e "http://www.21cineplex.com/" \
    "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
To set referer URL using urllib, see this post
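If the requests library is an option (note that the PyPI package is named requests, not request, which is why the pip install above 404'd), the same idea is a hedged two-step with a Session so PHPSESSID is kept as well:

import requests

# Visit the base URL first so PHPSESSID is stored in the session's cookie
# jar, then send the Referer header with the real request (mirrors curl -e).
with requests.Session() as s:
    s.get('http://www.21cineplex.com/')
    r = s.get('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1',
              headers={'Referer': 'http://www.21cineplex.com/'})
    print(r.status_code)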
-- ab1
I am using the Python urllib2 library for opening URLs, and what I want is to get the complete header info of the request. When I use response.info() I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (an add-on for Firefox), e.g.:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import urllib
import urllib2
import cookielib

def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    print response.info()  # this does not give complete header info -- how can I get the complete header info?
    return response.read()
url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That Location header is sent by the page that does the redirect, not by the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect but want to see the headers from the first page instead, use a URLopener instead of a FancyURLopener; it's the FancyURLopener that automatically follows redirects.
I see that the server returns HTTP/1.1 302 Found, i.e. an HTTP redirect.
urllib2 automatically follows redirects, so the headers it returns are those from http://www.trucks.com.mt, not from http://www.yellowpages.com.mt/Malta-Web/127151.aspx.
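If you'd rather stay with urllib2 and still see the 302's own headers (Location, Set-Cookie, etc.), one commonly used approach, sketched here under the assumption that you only care about the first hop, is to install a redirect handler that returns the redirect response instead of following it:

import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    # Hand the 3xx response back to the caller instead of following Location.
    def http_error_302(self, req, fp, code, msg, headers):
        return fp
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(NoRedirectHandler())
resp = opener.open('http://www.yellowpages.com.mt/Malta-Web/127151.aspx')
print(resp.getcode())   # 302
print(resp.info())      # headers of the redirect itself, including Location and Set-Cookie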