I have a working cURL command which gets a number of cookies from this particular Alfresco server (although I don't think this is specifically an Alfresco issue); however, when I try to encode the same request with Python requests, I get only one cookie.
The cURL request is
curl -v --junk-session-cookies -H 'Content-Type: application/x-www-form-urlencoded' --cookie-jar cookies.txt --cookie cookies.txt -H 'Origin: http://<url>' -D headers.txt -e ';auto' -X POST -d @credentials http://<url>/share/page/dologin
And the python is
r = s.post('%s/share/page/dologin' % <url>,
           data=credentials, allow_redirects=False)
where
s.headers
Out[121]: {'Origin': <url>, 'Host': <url>, 'referer': <url>, 'Accept': '*/*', 'User-Agent': 'python-requests/2.4.1 CPython/2.7.5+ Linux/3.11.0-26-generic', 'Content-Type': 'application/x-www-form-urlencoded'}
designed to closely match the headers that cURL sends.
The allow_redirects=False is there to mirror the lack of the -L option.
cURL yields:
* Added cookie JSESSIONID="<removed>" for domain 10.12.3.166, path /share/, expire 0
< Set-Cookie: JSESSIONID=<removed>; Path=/share/; HttpOnly
* Added cookie alfLogin="<removed>" for domain 10.12.3.166, path /share, expire 1413289198
< Set-Cookie: alfLogin=<removed>; Expires=Tue, 14-Oct-2014 12:19:58 GMT; Path=/share
* Added cookie alfUsername3="<obf>" for domain 10.12.3.166, path /share, expire 1413289198
< Set-Cookie: alfUsername3=<obf>; Expires=Tue, 14-Oct-2014 12:19:58 GMT; Path=/share
Whereas I only get the JSESSIONID cookie in requests.
Update
I foolishly failed to include the headers sent by the cURL command, and those received in response.
Those sent were
> User-Agent: curl/7.32.0
> Host: <url>
> Accept: */*
> Referer:
> Content-Type: application/x-www-form-urlencoded
> Origin: <url>
> Content-Length: 44
And the headers sent back are
HTTP/1.1 302 Found
Server: Apache-Coyote/1.1
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Set-Cookie: JSESSIONID=<rem>; Path=/share/; HttpOnly
Set-Cookie: alfLogin=<rem>; Expires=Tue, 14-Oct-2014 12:19:58 GMT; Path=/share
Set-Cookie: alfUsername3=<rem>; Expires=Tue, 14-Oct-2014 12:19:58 GMT; Path=/share
Location: <url>/share
The headers sent back to requests are
{'transfer-encoding': 'chunked', 'set-cookie': 'JSESSIONID=<rem>; Path=/share/; HttpOnly', 'server': 'Apache-Coyote/1.1', 'connection': 'close', 'date': 'Wed, 08 Oct 2014 08:54:53 GMT', 'content-type': 'text/html;charset=ISO-8859-1'}
The solution was to remove the "Host" header and set the referer field to be ''. I then got all the cookies I needed, so thanks to all the helpful comments.
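For anyone hitting the same thing, here is a minimal sketch of what the fixed session setup might look like; the placeholder URL and the credential field names/values are assumptions, not the originals:
import requests

s = requests.Session()
# Do not set a Host header; let requests derive it from the URL.
s.headers.update({
    'Origin': 'http://<url>',
    'Referer': '',   # mirrors curl's empty Referer on the first request
    'Content-Type': 'application/x-www-form-urlencoded',
})

credentials = {'username': 'admin', 'password': 'secret'}  # placeholder field names/values
r = s.post('http://<url>/share/page/dologin',
           data=credentials, allow_redirects=False)
print(s.cookies)  # should now contain JSESSIONID, alfLogin and alfUsername3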
Related
I'm a bit new to the Python requests library, and am having some trouble accessing cookies after forms authentication. I captured packets in Wireshark, and I'm certain that the cookies are being set (HTTP stream output at bottom). I'm following the documentation here: https://2.python-requests.org/en/master/user/quickstart/#cookies.
My request is as follows:
r = requests.post('http://192.168.2.111/Account/LogOn', data = {'User': 'Customer', 'Password': 'Customer', 'button': 'submit'})
If I invoke
print(r.cookies), the only thing returned is <RequestsCookieJar[]>
Or, if I try to access them using r.cookies['mydeviceAG_POEWebTool'], I get a key error: KeyError: "name='mydeviceAG_POEWebTool', domain=None, path=None"
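One thing that may be relevant here (an observation about requests' behaviour, not something stated in the original post): requests.post follows redirects by default, and r.cookies only reflects the final response, so cookies set on an intermediate 302 end up on the entries in r.history. A quick way to check both:
import requests

data = {'User': 'Customer', 'Password': 'Customer', 'button': 'submit'}

# Keep the 302 response itself and read its cookies directly:
r = requests.post('http://192.168.2.111/Account/LogOn', data=data, allow_redirects=False)
print(r.status_code)  # expect 302
print(r.cookies)      # cookies set by that 302

# Or, with redirects followed, inspect the intermediate responses:
r = requests.post('http://192.168.2.111/Account/LogOn', data=data)
for resp in r.history:
    print(resp.cookies)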
The HTTP Stream from wireshark is here:
POST /Account/LogOn HTTP/1.1
Host: 192.168.2.111
User-Agent: python-requests/2.24.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 49
Content-Type: application/x-www-form-urlencoded
User=Customer&Password=Customer&button=submit

HTTP/1.1 302 Found
Date: Thu, 24 Sep 2020 10:30:43 GMT
Server: Apache/2.4.25 (Debian)
Location: /States
X-AspNet-Version: 4.0.30319
Cache-Control: private
Set-Cookie: mydeviceAG_POEWebTool=8081AD7E6EEEB463C0AD8458; path=/
Set-Cookie: .mydeviceAG_POEWebTool_AUTH=jmnjqmKeL0ge8fz/sYrP3Xm+ntUTnPLrWBGtKAmAvnkKIjPKYQVn9xVrRa7EUEHLTfB1KNCKjotabnb7QqnDlQlZKuQkJ0J8rLmxuxrtCMFDsa/d6jyUj/PUckJ8V0Te; path=/; expires=Thu, 24 Sep 2020 13:30:44 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 112
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here</h2>
</body></html>
After reading the link offered by @D-E-N below, and searching around a bit, I tried the code below, which gets me further but only stores the first of two cookies in the cookie jar. The server sends back two cookies:
Set-Cookie: AG_POEWebTool=9E508CCAE50EACB4AD1B33D8; path=/
Set-Cookie: .AG_POEWebTool_AUTH=WaXAtyh/tGnoZFPIiV4xoF6BhwGbr0jlaFq
but the only cookie stored is the AG_POEWebTool one.
import requests

proxies = {}  # the original code used a proxies dict defined elsewhere

with requests.Session() as s:
    s.get('http://192.168.2.111/Account/LogOn', timeout=5,
          params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    r = s.post('http://192.168.2.111/Account/LogOn',
               data={'User': 'Customer', 'Password': 'Customer', 'button': 'submit'},
               timeout=5, params={'ReturnUrl': '%2fLogfiles'}, proxies=proxies)
    cookiedict = s.cookies.get_dict()
    print('cookiedict items:')
    for item in cookiedict.items():
        print(item)
    response = s.get('http://192.168.2.111/Logfiles')
This is the response that I get:
cookiedict items:
('AG_POEWebTool', '471B4EB1153E4733E0EA1A40')
Process finished with exit code 0
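If it helps with debugging, one way to see exactly which Set-Cookie headers requests received, independent of what ends up in the jar, is to look at the underlying urllib3 response. This is a sketch; getlist is the urllib3 header-dict API and may differ slightly between urllib3 versions:
import requests

with requests.Session() as s:
    r = s.post('http://192.168.2.111/Account/LogOn',
               data={'User': 'Customer', 'Password': 'Customer', 'button': 'submit'},
               allow_redirects=False)
    # r.headers folds duplicate Set-Cookie headers into one comma-joined string,
    # so look at the raw urllib3 header dict instead.
    print(r.raw.headers.getlist('Set-Cookie'))
    print(s.cookies.get_dict())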
Related
I am trying to access the Buxfer REST API using Python and urllib2.
The issue is I get the following response:
urllib2.HTTPError: HTTP Error 403: Forbidden
But when I try the same call through my browser, it works fine...
The script goes as follows:
username = "xxx#xxxcom"
password = "xxx"
#############
import sys
import urllib2
import simplejson

def checkError(response):
    result = simplejson.load(response)
    response = result['response']
    if response['status'] != "OK":
        print "An error occurred: %s" % response['status'].replace('ERROR: ', '')
        sys.exit(1)
    return response

base = "https://www.buxfer.com/api"
url = base + "/login?userid=" + username + "&password=" + password

req = urllib2.Request(url=url)
response = checkError(urllib2.urlopen(req))
token = response['token']

url = base + "/budgets?token=" + token
req = urllib2.Request(url=url)
response = checkError(urllib2.urlopen(req))
for budget in response['budgets']:
    print "%12s %8s %10.2f %10.2f" % (budget['name'], budget['currentPeriod'], budget['limit'], budget['remaining'])
sys.exit(0)
I also tried using the requests library but the same error appears.
The server I am trying to access from is an Ubuntu 14.04 machine; any help explaining or solving why this happens would be appreciated.
EDIT:
This is the full error message:
{
'cookies': <<class 'requests.cookies.RequestsCookieJar'>[]>,
'_content': '
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /api/login
on this server.</p>
<hr>
<address>Apache/2.4.7 (Ubuntu) Server at www.buxfer.com Port 443</address>
</body></html>',
'headers': CaseInsensitiveDict({
'date': 'Sun, 31 Jan 2016 12:06:44 GMT',
'content-length': '291',
'content-type': 'text/html; charset=iso-8859-1',
'server': 'Apache/2.4.7 (Ubuntu)'
}),
'url': u'https://www.buxfer.com/api/login?password=xxxx&userid=xxxx%40xxxx.com',
'status_code': 403,
'_content_consumed': True,
'encoding': 'iso-8859-1',
'request': <PreparedRequest [GET]>,
'connection': <requests.adapters.HTTPAdapter object at 0x7fc7308102d0>,
'elapsed': datetime.timedelta(0, 0, 400442),
'raw': <urllib3.response.HTTPResponse object at 0x7fc7304d14d0>,
'reason': 'Forbidden',
'history': []
}
EDIT 2: (Network parameters in the Google Chrome browser)
Request Method:GET
Status Code:200 OK
Remote Address:52.20.61.39:443
Response Headers
view parsed
HTTP/1.1 200 OK
Date: Mon, 01 Feb 2016 11:01:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Frame-Options: SAMEORIGIN
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Cache-Controle: no-cache
Set-Cookie: login=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=buxfer.com
Set-Cookie: remember=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=buxfer.com
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: application/x-javascript; charset=utf-8
Request Headers
view source
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8,fr;q=0.6
Connection:keep-alive
Cookie:PHPSESSID=pjvg8la01ic64tkkfu1qmecv20; api-session=vbnbmp3sb99lqqea4q4iusd4v3; __utma=206445312.1301084386.1454066594.1454241953.1454254906.4; __utmc=206445312; __utmz=206445312.1454066594.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
Host:www.buxfer.com
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36
EDIT 3:
I can also access it through my local PyCharm console without any issues; it's just when I try to do it from my remote server...
It could be that you need to do a POST rather than a GET request. Most logins work this way.
Using the requests library, you would need
response = requests.post(
    base + '/login',
    data={
        'userid': username,
        'password': password
    }
)
To confirm what @Carl said, from the official website:
This command only looks at POST parameters and discards GET parameters.
https://www.buxfer.com/help/api#login
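Putting that together with the original script, a rough requests version of the whole flow might look like the following; the response structure ('response', 'token', 'budgets') is taken from the question's code, and none of this has been verified against the live API:
import requests

username = "xxx@xxx.com"  # placeholders, as in the question
password = "xxx"
base = "https://www.buxfer.com/api"

# Log in with a POST body; per the docs quoted above, GET parameters are discarded.
login = requests.post(base + "/login", data={"userid": username, "password": password})
login.raise_for_status()
token = login.json()["response"]["token"]

budgets = requests.get(base + "/budgets", params={"token": token})
for budget in budgets.json()["response"]["budgets"]:
    print("%12s %8s %10.2f %10.2f" % (budget["name"], budget["currentPeriod"],
                                      budget["limit"], budget["remaining"]))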
Related
When I try to open http://www.comicbookdb.com/browse.php (which works fine in my browser) from Python, I get an empty response:
>>> import urllib.request
>>> content = urllib.request.urlopen('http://www.comicbookdb.com/browse.php')
>>> print(content.read())
b''
The same also happens when I set a User-agent:
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> content = opener.open('http://www.comicbookdb.com/browse.php')
>>> print(content.read())
b''
Or when I use httplib2 instead:
>>> import httplib2
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://www.comicbookdb.com/browse.php')
>>> print(content)
b''
>>> print(response)
{'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'content-location': 'http://www.comicbookdb.com/browse.php', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'content-length': '0', 'set-cookie': 'PHPSESSID=590f5997a91712b7134c2cb3291304a8; path=/', 'date': 'Wed, 25 Dec 2013 15:12:30 GMT', 'server': 'Apache', 'pragma': 'no-cache', 'content-type': 'text/html', 'status': '200'}
Or when I try to download it using cURL:
C:\>curl -v http://www.comicbookdb.com/browse.php
* About to connect() to www.comicbookdb.com port 80
* Trying 208.76.81.137... * connected
* Connected to www.comicbookdb.com (208.76.81.137) port 80
> GET /browse.php HTTP/1.1
User-Agent: curl/7.13.1 (i586-pc-mingw32msvc) libcurl/7.13.1 zlib/1.2.2
Host: www.comicbookdb.com
Pragma: no-cache
Accept: */*
< HTTP/1.1 200 OK
< Date: Wed, 25 Dec 2013 15:20:06 GMT
< Server: Apache
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=0a46f2d390639da7eb223ad47380b394; path=/
< Content-Length: 0
< Content-Type: text/html
* Connection #0 to host www.comicbookdb.com left intact
* Closing connection #0
Opening the URL in a browser or downloading it with Wget seems to work fine, though:
C:\>wget http://www.comicbookdb.com/browse.php
--16:16:26-- http://www.comicbookdb.com/browse.php
=> `browse.php'
Resolving www.comicbookdb.com... 208.76.81.137
Connecting to www.comicbookdb.com[208.76.81.137]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 40,687 48.75K/s
16:16:27 (48.75 KB/s) - `browse.php' saved [40687]
As does downloading a different file from the same server:
>>> content = urllib.request.urlopen('http://www.comicbookdb.com/index.php')
>>> print(content.read(100))
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n\t\t"http://www.w3.org/TR/1999/REC-html'
So why doesn't the other URL work?
It seems the server expects a Connection: keep-alive header, which curl (and, I expect, the other failing clients) does not add by default.
With curl you can use this command, which will display a non-empty response:
curl -v -H 'Connection: keep-alive' http://www.comicbookdb.com/browse.php
With Python you can use this code:
import httplib2
h = httplib2.Http('.cache')
response, content = h.request('http://www.comicbookdb.com/browse.php', headers={'Connection':'keep-alive'})
print(content)
print(response)
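If you would rather not depend on httplib2, the requests library should work too, since it keeps connections alive by default; a minimal sketch, assuming requests is installed:
import requests

# requests uses keep-alive connections by default; the explicit header simply
# mirrors the curl command above.
r = requests.get('http://www.comicbookdb.com/browse.php',
                 headers={'Connection': 'keep-alive'})
print(r.status_code)
print(r.text[:100])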
Related
Hi, I am trying to scrape some data from this URL:
http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1
As you may have noticed, if cookies and session data are not yet set, you will be redirected to the base URL (http://www.21cineplex.com/).
I tried to do it like this:
import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)
        print splitSource
    except Exception, e:
        str(e)
        print "Error occurred in main block"
However, I ended up failing to scrape that particular URL.
A quick inspection reveals that the website sets a session ID (PHPSESSID) and copies it into the client's cookies.
The question is: how do I work around this?
PS: I've tried to install request (via pip), however it gives me a 404:
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Getting page https://pypi.python.org/simple/
URLs to search for versions for request:
* https://pypi.python.org/simple/request/
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Could not find any downloads that satisfy the requirement request
Cleaning up...
Thanks to @Chainik I got it to work. I ended up modifying my code like this:
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'
opener.open(baseurl)
urllib2.install_opener(opener)
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
Once the HTML text is retrieved, it's all about parsing its content.
Cheers
Try setting a referer URL, see below.
Without referer URL set (302 redirect):
$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/
With referer URL set (HTTP/200):
$ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
To set a referer URL using urllib, see this post.
-- ab1
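For comparison, a rough requests-based version of the same idea (a sketch only, not re-tested against the site):
import re
import requests

baseurl = 'http://www.21cineplex.com/'
listing = 'http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1'

with requests.Session() as s:
    s.get(baseurl)  # pick up the PHPSESSID and kota cookies
    r = s.get(listing, headers={'Referer': baseurl})
    print(re.findall(r'<ul class="w462">(.*?)</ul>', r.text))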
Related
I am using the Python urllib2 library for opening URLs, and what I want is to get the complete header info of the request. When I use response.info() I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (an add-on for Firefox), e.g.:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import urllib
import urllib2
import cookielib

def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    print response.info()  # this does not give complete header info; how can I get complete header info?
    return response.read()

url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect, but instead want to see the headers from the first page, use a URLopener instead of a FancyURLopener (which automatically follows redirects).
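Sticking with urllib2, as the question asks, one way to keep the 302 response and its headers is to disable redirect handling. This is a sketch of that general technique, not something from the original answer:
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib2 from following the redirect, so the
        # 302 surfaces as an HTTPError carrying the first response's headers.
        return None

opener = urllib2.build_opener(NoRedirectHandler())
req = urllib2.Request('http://www.yellowpages.com.mt/Malta-Web/127151.aspx')
try:
    response = opener.open(req)
    print response.info()   # only reached if there is no redirect
except urllib2.HTTPError, e:
    print e.code            # 302
    print e.headers         # Location, Set-Cookie, ... of the redirect response itself

print req.header_items()    # the request headers urllib2 added (Host, User-agent, ...)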
I see that the server returns HTTP/1.1 302 Found, an HTTP redirect.
urllib automatically follows redirects, so the headers returned by urllib are the headers from http://www.trucks.com.mt, not http://www.yellowpages.com.mt/Malta-Web/127151.aspx