I need to detect character encoding in HTTP responses. To do this I look at the headers; if the encoding is not set in the Content-Type header, I have to peek at the response body and look for a "<meta http-equiv='content-type'>" tag. I'd like to be able to write a function that looks and works something like this:
response = urllib2.urlopen("http://www.example.com/")
encoding = detect_html_encoding(response)
...
page_text = response.read()
However, if I do response.read() in my "detect_html_encoding" method, then the subsequent response.read() after the call to my function will fail.
Is there an easy way to peek at the response and/or rewind after a read?
def detectit(response):
    # try headers &c, then, worst case...:
    content = response.read()
    response.read = lambda: content
    # now detect based on content
The trick, of course, is ensuring that response.read() WILL return the same thing again if needed; that's why we assign that lambda to it when we've already had to extract the content -- it ensures the same content can be extracted again (and again, and again, ...;-).
If it's in the HTTP headers (not the document itself), you could use response.info() to detect the encoding.
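For example, a minimal sketch of that header check (assuming Python 2's urllib2, where response.info() returns a mimetools.Message; the URL is just a placeholder):
import urllib2

response = urllib2.urlopen("http://www.example.com/")
# charset parameter of the Content-Type header, e.g. 'utf-8'; None if absent
charset = response.info().getparam('charset')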
If you want to parse the HTML, save the response data:
page_text = response.read()
encoding = detect_html_encoding(response, page_text)
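Putting the two steps together, a rough sketch of detect_html_encoding with that signature might look like this (the header check first, then a crude regex over the saved page text for the meta tag; the default fallback encoding is an assumption, and a real HTML parser would be more robust):
import cgi
import re

def detect_html_encoding(response, page_text, default='ISO-8859-1'):
    # 1. Content-Type header, e.g. "text/html; charset=utf-8"
    _, params = cgi.parse_header(response.info().getheader('Content-Type', ''))
    if 'charset' in params:
        return params['charset']
    # 2. <meta http-equiv='content-type' content='text/html; charset=...'>
    match = re.search(r'charset=["\']?([\w-]+)', page_text, re.IGNORECASE)
    if match:
        return match.group(1)
    return default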
Related
I'm using Selenium-wire to try and read the request response text of some network traffic. The code I have isn't fully reproducible as the account is behind a paywall.
The bit of selenium-wire I'm currently using is:
for request in driver.requests:
    if request.method == 'POST' and request.headers['Content-Type'] == 'application/json':
        # The body is in bytes so convert to a string
        body = driver.last_request.body.decode('utf-8')
        # Load the JSON
        data = json.loads(body)
Unfortunately though, that is reading the payload of the request, and I'm trying to parse the response.
You need to get last_request's response:
body = driver.last_request.response.body.decode('utf-8')
data = json.loads(body)
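One caveat: request.response can be None if selenium-wire never captured a response for that request, and the body may be compressed. A small sketch guarding for both, using the decode helper that ships with selenium-wire (seleniumwire.utils.decode):
import json
from seleniumwire.utils import decode

response = driver.last_request.response
if response is not None:  # None means no response was captured for this request
    body = decode(response.body, response.headers.get('Content-Encoding', 'identity'))
    data = json.loads(body.decode('utf-8'))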
I usually use these 3 steps:
import json

from seleniumwire.utils import decode

# define scopes to avoid capturing other POST requests that are not related;
# we can also use it to only select the required endpoints
driver.scopes = [
    # .* is a regex that matches any character 0 or more times
    '.*stackoverflow.*',
    '.*github.*'
]
# visit the page
driver.get('LINK')
# get the last captured request
response = driver.last_request  # or driver.requests[-1]
# decode the body and parse the JSON
js = json.loads(
    decode(
        response.response.body,
        # get the encoding from the response headers
        response.response.headers.get('Content-Encoding', 'identity'),
    )
)
# this clears all captured requests; it's a good idea to do this after each visit to the page
del driver.requests
For more info, here is the doc.
Summary:
Currently I am doing a GET request on a .log URL which has around 7000+ lines.
I need to GET the response, validate for a particular message in the response, and if it's not present, I need to do a GET request again on the same URL.
This iteration on the GET is very time consuming and most of the time results in a stuck state.
Expectation:
I need a way wherein I do a GET request and fetch only the last 100 lines as a response, rather than fetching all 7000+ lines every time.
URL = "http://sdd.log"
Code
def get_log(self):
    logging.info("Sending a get request to retrieve pronghorn log")
    resp = requests.request("GET", "http://ssdg.log")
    logging.info("Printing the callback url response")
    #logging.info(resp)
    #logging.info(resp.text)
    return resp.text
You cannot simply download only the last 100 lines with a plain HTTP GET. You can, however, get the last 100 lines of the resulting response by using
data = resp.text.split('\n')
last_lines = '\n'.join(data[-100:])
return last_lines
So, if your server accepts range requests then you can use code like this to get the last 4096 bytes
import requests
from io import BytesIO
url = 'https://file-examples.com/wp-content/uploads/2017/10/file_example_JPG_100kB.jpg'
resp = requests.request("HEAD", url)
unit = resp.headers['Accept-Ranges']
print(resp.headers['Content-Length'])
print(unit)
headers = {'Range': f'{unit}=-4096'}
print(headers)
resp = requests.request("GET", url, headers=headers)
b = BytesIO()
for chunk in resp.iter_content(chunk_size=128):
    b.write(chunk)
print(b.tell())
b.seek(0)
data = b.read()
print(f"len(data): {len(data)}")
An answer here (Size of raw response in bytes) says:
Just take the len() of the content of the response:
>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671
However, doing that does not give the accurate content length. For example, check out this Python code:
import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content : "+str(sys.getsizeof(r.content)))
        print("len r.text : "+str(len(r.text)))
        print("len r.content : "+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")
If we try to manually calculate the content length and compare it to what is in the header, we get an answer that is much larger:
Correct Content Length: 51504
bytes of r.text : 515142
bytes of r.content : 257623
len r.text : 257552
len r.content : 257606
Why does len(r.content) not return the correct content length? And how can we manually calculate it accurately if the header is missing?
The Content-Length header reflects the body of the response. That's not the same thing as the length of the text or content attributes, because the response could be compressed. requests decompresses the response for you.
You'd have to bypass a lot of internal plumbing to get the original, compressed, raw content, and then you have to access some more internals if you want the response object to still work correctly. The 'easiest' method is to enable streaming, then read from the raw socket:
from io import BytesIO
r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)
Demo:
>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length'] # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type'] # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content) # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content) # the decompressed binary content, byte count
258719
>>> len(r.text) # the Unicode content decoded from UTF-8, character count
258658
This reads the full response into memory, so don't use this if you expect large responses! In that case, you could instead use shutil.copyfileobj() to copy the data from the r.raw file to a spooled temporary file (which will switch to an on-disk file once a certain size is reached), get the file size of that file, then stuff that file onto r.raw._fp.
A function that adds a Content-Length header to any request that is missing that header would look like this:
import requests
import shutil
import tempfile
def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1Mb
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r
This accepts an existing session, and lets you specify the request method too. Adjust max_size as needed for your memory constraints. Demo on https://github.com, which lacks a Content-Length header:
>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814
Note that if there is no Content-Encoding header present, or the value for that header is set to identity, and the Content-Length is available, then you can just rely on Content-Length being the full size of the response. That's because in that case there is obviously no compression applied.
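A quick sketch of that condition as a check on a requests response object:
def content_length_is_reliable(r):
    # no compression applied, so Content-Length matches the body size
    return ('Content-Length' in r.headers
            and r.headers.get('Content-Encoding', 'identity') == 'identity')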
As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See What is the difference between len() and sys.getsizeof() methods in python?
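A quick illustration of the difference (the exact sys.getsizeof() figure varies by Python version and platform, so it isn't shown as a fixed value):
import sys

payload = b'hello world'
print(len(payload))            # 11 -- the number of bytes in the object
print(sys.getsizeof(payload))  # larger: includes the bytes object's own overhead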
I'm trying to download a large file from a server with Python 2:
req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read()
The server sends data with "Transfer-Encoding: chunked" and I'm only getting some binary data, which cannot be unpacked by gunzip.
Do I have to iterate over multiple read()s? Or multiple requests? If so, what should they look like?
Note: I'm trying to solve the problem with only the Python 2 standard library, without additional libraries such as urllib3 or requests. Is this even possible?
From the python documentation on urllib2.urlopen:
One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
So, read the data in a loop:
req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
data = rsp.read(8192)
while data:
    # .. Do Something ..
    data = rsp.read(8192)
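For example, the loop body could simply append each chunk to a local file, which can then be gunzipped once the download completes (a sketch; the output filename is a placeholder):
req = urllib2.Request("https://myserver/mylargefile.gz")
rsp = urllib2.urlopen(req)
with open("mylargefile.gz", "wb") as out:
    data = rsp.read(8192)
    while data:
        out.write(data)
        data = rsp.read(8192)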
If I'm not mistaken, the following worked for me - a while back:
data = ''
chunk = rsp.read()
while chunk:
    data += chunk
    chunk = rsp.read()
Each read reads one chunk - so keep on reading until nothing more's coming.
I don't have documentation ready supporting this... yet.
I have the same problem.
I found that "Transfer-Encoding: chunked" often appears with "Content-Encoding:
gzip".
So maybe we can get the compressed content and unzip it.
It works for me.
import urllib2
from StringIO import StringIO
import gzip
req = urllib2.Request(url)
req.add_header('Accept-encoding', 'gzip, deflate')
rsp = urllib2.urlopen(req)
if rsp.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(rsp.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
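Since the original question is about a large file, note that buffering the whole compressed body in a StringIO keeps everything in memory. A hedged alternative is to decompress the stream chunk by chunk with zlib; passing 16 + zlib.MAX_WBITS tells zlib to expect the gzip wrapper (the output filename is a placeholder):
import urllib2
import zlib

req = urllib2.Request(url)
req.add_header('Accept-encoding', 'gzip')
rsp = urllib2.urlopen(req)

decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
with open("mylargefile", "wb") as out:
    chunk = rsp.read(8192)
    while chunk:
        out.write(decompressor.decompress(chunk))
        chunk = rsp.read(8192)
    out.write(decompressor.flush())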
I would like to open a StackExchange API (search endpoint) URL and parse the result [0]. The documentation says that all results are in JSON format [1]. I open up this URL in my web browser and the results are absolutely fine [2]. However, when I try opening it up using a Python program it returns encoded text which I am unable to parse. Here's a snip
á¬ôŸ?ÍøäÅ€ˆËç?bçÞIË
¡ëf)j´ñ‚TF8¯KÚpr®´Ö©iUizEÚD +¦¯÷tgNÈÑ.G¾LPUç?Ñ‘Ù~]ŒäÖÂ9Ÿð1£µ$JNóa?Z&Ÿtž'³Ðà#Í°¬õÅj5ŸE÷*æJî”Ï>íÓé’çÔqQI’†ksS™¾þEíqÝýly
My program to open a URL is as follows. What am I doing particularly wrong?
''' Opens a URL and returns the result '''
def open_url(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    text = response.read()
    #results = json.loads(text)
    print text
title = "openRawResource, AssetManager.AssetInputStream throws IOException on read of larger files"
page1_query = stackoverflow_search_endpoint % (1,urllib.quote_plus(title),access_token,key)
[0] https://api.stackexchange.com/2.1/search/advanced?page=1&pagesize=100&order=desc&sort=relevance&q=openRawResource%2C+AssetManager.AssetInputStream+throws+IOException+on+read+of+larger+files&site=stackoverflow&access_token=******&key=******
[1] https://api.stackexchange.com/docs
[2] http://hastebin.com/qoxaxahaxa.sm
Solution
I found the solution. Here's how you would do it.
request = urllib2.Request(query)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
result = json.loads(data)
Cannot post the complete output as it is too huge. Many thanks to Evert and Kristaps for pointing out the decompression and setting headers on the request. In addition, another similar question one would want to look into is [3].
[3] Does python urllib2 automatically uncompress gzip data fetched from webpage?
The next paragraph of the documentation says:
Additionally, all API responses are compressed. The Content-Encoding header is always set, but some proxies will strip this out. The proper way to decode API responses can be found here.
Your output does look like it may be compressed. Browsers automatically decompress data (depending on the Content-Encoding), so you would need to look at the header and do the same: results = json.loads(zlib.decompress(text)) or something similar.
Do check the "here" link in that quote as well.
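As an aside on that zlib call: a plain zlib.decompress(text) will fail on gzip-wrapped data; passing a wbits value of 32 + zlib.MAX_WBITS lets zlib auto-detect a zlib or gzip wrapper (16 + zlib.MAX_WBITS would require gzip specifically):
import json
import zlib

results = json.loads(zlib.decompress(text, 32 + zlib.MAX_WBITS))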