I used the following python code to download the html page:
response = urllib2.urlopen(current_URL)
msg = response.read()
print msg
For a page such as this one, it opens the url without error but then prints only part of the html-page!
In the following lines you can find the http headers of the html-page. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk! I have difficulties reading the remaining chunks. How I can read the remaining chunks?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found out that if I Accept-Language header is specified than server doesn't drop TCP connection, otherwise it does.
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'
Related
I am trying to download pdf using requests(python 2.7), and, this is the code that I am using:
file_resp = requests.post(file_url, data=payload, headers={"Referer":file_referer_url})
with open('test.pdf', 'wb') as f:
f.write(file_resp.content)
But, downloaded pdf is corrupted. This is the response header that I got:
Cache-Control: no-cache, no-store
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Tue, 23 Jun 2020 05:49:32 GMT
Expires: -1
Pragma: no-cache
Server: Microsoft-IIS/8.5
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-AspNet-Version: 4.0.30319
x-frame-options: SAMEORIGIN
X-Powered-By: ASP.NET
Also, response is something like this:
JVBERi0xLjQNCiW0tba3DQoxIDAgb2JqDQo8P...(long sequence like this...)
Please, can anyone help me with what I might be doing wrong?
Response is of type string, so simply Base64 decode the data with base64.b64decode and write it to a file.
with open('application.pdf', 'wb') as file_to_save:
decoded_pdf_data = base64.b64decode(file_resp.content)
file_to_save.write(decoded_pdf_data)
I am writing an controller endpoint in Python and I want to modify response code, content-text etc. How can I achieve it.
#http.route('/get/data/',methods=['GET','POST'],type="http",auth='public',csrf=False)
def fetchSms(self,**kwargs):
mydata = {"date":"2018-10-13T00:46:17.25Z"}
return simplejson.dumps(mydata)
I have to return mydata to client after setting content-type="application/json" and response response code should be 200
Current Output:
HTTP/1.1 200 OK
Server: nginx/1.12.2
Date: Mon, 15 Oct 2018 16:57:49 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 192
Connection: keep-alive
Set-Cookie: session_id=43edf5hfgh436sdgdga9d7618deb74af7; Expires=Sun, 13-Jan-2019 16:57:49 GMT; Max-Age=7776000; Path=/
I'm working on a python script which works with a JSON returned by an URL.
Since a couple of days urllib2 returns (just sometimes) an old state of the JSON.
I did add the headers "Cache-Control":"max-age=0" etc. still it sometimes happen.
If I print out the request info I get:
Server: nginx/1.8.0
Date: Thu, 03 Sep 2015 17:02:47 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3539
Status: 200 OK
X-XHR-Current-Location: /shop/169464.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "b1fbe7a01e0832025a3afce23fc2ab56"
X-Request-Id: 4cc0d399f943ad09a903f18a6ce1c488
X-Runtime: 0.123033
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 1707606900 1707225496
Age: 2860
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from adsl
X-Cache-Lookup: MISS from adsl:21261
Connection: close
has it something to do with the header "Age" or "X-Cache-Rack"? Or any ideas how I can fix it?
thanks in advance!
try to fake the user-agent, remove cookies, drop sessions.
fake_user_agent = ['chrome','firefox','safari']
request = urllib2.Request(url)
request.add_header('User-Agent', get_random(fake_user_agent))
content = urllib2.build_opener().open(request)
if all doesn't work, then try using tor to change ip per request.
if nothing works, then you can't bypass it because you are most definitely connecting to the transparent proxy
I have a question about writing a small tool that would provide the headers for any website. I am new to python but wanted to know if there is anything else other than encoding that I would have to account for in my code when developing the tool? I have a rough draft of my code shown below. Any pointers from the python coders?
#!/usr/bin/python
import sys, urllib
if len(sys.argv) == 2:
website = sys.argv[1]
website = urllib.urlopen(sys.argv[1])
if(website.code != 200):
print "Something went wrong here"
print website.code
exit(0)
print 'Printing the headers'
print '-----------------------------------------'
for header, value in website.headers.items() :
print header + ' : ' + value
Seems a fairly straightforward script (though this question seems more of a fit for stackoverflow). Couple comments, first curl -I is a useful command line tool to compare against. Second, even when you don't get 200 status, there are still often useful content or headers you may want to display. E.g.,
$ curl -I http://security.stackexchange.com/asdf
HTTP/1.1 404 Not Found
Cache-Control: private
Content-Length: 24068
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=678b5b9c-0130-4398-9834-673475961dc6; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:24:00 GMT
Also note urllib follows redirects automatically. E.g., with curl you'll see:
$ curl -I http://www.security.stackexchange.com
HTTP/1.1 301 Moved Permanently
Content-Length: 157
Content-Type: text/html; charset=UTF-8
Location: http://security.stackexchange.com/
Date: Fri, 25 Apr 2014 07:26:52 GMT
while your tool will just give.
$ python user3567119.py http://www.security.stackexchange.com
Printing the headers
-----------------------------------------
content-length : 68639
set-cookie : prov=9bf4f3d4-e3ae-4161-8e34-9aaa83f0aa4b; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
expires : Fri, 25 Apr 2014 07:29:32 GMT
vary : *
last-modified : Fri, 25 Apr 2014 07:28:32 GMT
connection : close
cache-control : public, no-cache="Set-Cookie", max-age=60
date : Fri, 25 Apr 2014 07:28:31 GMT
x-frame-options : SAMEORIGIN
content-type : text/html; charset=utf-8
Third, if you continue playing around with HTTP requests in python, I highly recommend using requests. With requests, you'll be able to see the 301 if you do:
In [1]: import requests
In [2]: r=requests.get('http://www.security.stackexchange.com')
In [3]: r
Out[3]: <Response [200]>
In [4]: r.history
Out[4]: (<Response [301]>,)
It's also worth trying out some HTTP requests in just plain old telnet. E.g., telnet security.stackexchange.com 80 then quickly type:
GET / HTTP/1.1
Host: security.stackexchange.com
followed by a blank line. Then you'll see the actual HTTP response on the wire (instead of recreating it after urllib has processed the HTTP response):
HTTP/1.1 200 OK
Cache-Control: public, no-cache="Set-Cookie", max-age=60
Content-Type: text/html; charset=utf-8
Expires: Fri, 25 Apr 2014 07:38:37 GMT
Last-Modified: Fri, 25 Apr 2014 07:37:37 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=a75de1f2-678b-4a9d-bbfd-39e933e60237; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:37:36 GMT
Content-Length: 68849
<!DOCTYPE html>
When attempting to check the 'content-length' header for some web pages using urllib2 in python, the header is missing. For example, the response from google.com is missing this header. Any idea why?
Example:
r = urllib2.urlopen('http://www.google.com')
i = r.info()
print i.keys()
Gives:
['x-xss-protection', 'set-cookie', 'expires', 'server', 'connection', 'cache-control', 'date', 'p3p', 'content-type', 'x-frame-options']
You can see here that an http response can either contain Content-Length or Transfer-Encoding: chunked.
However, when Transfer-Encoding: chunked is used in the header, after the headers, you'll get a hexadecimal string which if converted to decimal, will give you the length of the next chunk. And after the last chunk you'll get a 0 for this value which means you've reached the end of the file.
You can use regular expressions to get this hexadecimal value (not a must though)
read = #string containing a line or a part of the http response
hexPat = re.compile(r'([0-9A-F]+)\r\n', re.I)
match = re.search(hexPat, read)
chunkLen = int(match.group(1), 16) #converts hexadecimal to decimal
or You can just read the first hexadecimal value, get the length of the first chunk and receive that chunk, then get the length of the next chunk and so on till you find a 0
The Content-Length of a HEAD response SHOULD, but not always does include the Content-Length value of a GET response:
Stack Overflow does:
> telnet stackoverflow.com 80
HEAD / HTTP/1.1
Host: stackoverflow.com
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Length: 362245 <--------
Content-Type: text/html; charset=utf-8
Expires: Mon, 04 Oct 2010 11:51:49 GMT
Last-Modified: Mon, 04 Oct 2010 11:50:49 GMT
Vary: *
Date: Mon, 04 Oct 2010 11:50:49 GMT
Google doesn't:
> telnet www.google.com 80
HEAD / HTTP/1.1
Host: www.google.ie
HTTP/1.1 200 OK
Date: Mon, 04 Oct 2010 11:55:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked