urllib2 not retrieving entire HTTP response - python

I'm perplexed as to why I'm not able to download the entire contents of some JSON responses from FriendFeed using urllib2.
>>> import urllib2
>>> stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
>>> stream.headers['content-length']
'168928'
>>> data = stream.read()
>>> len(data)
61058
>>> # We can see here that I did not retrieve the full JSON
... # given that the stream doesn't end with a closing }
...
>>> data[-40:]
'ce2-003048343a40","name":"Vincent Racani'
How can I retrieve the full response with urllib2?

Best way to get all of the data:
import urllib2

fp = urllib2.urlopen("http://www.example.com/index.cfm")

response = ""
while 1:
    data = fp.read()
    if not data:  # read() returns an empty string once the response is exhausted
        break
    response += data

print response
The reason is that a single .read() isn't guaranteed to return the entire response, given the nature of sockets. I thought this was discussed in the documentation (maybe for urllib) but I cannot find it.

Use tcpdump (or something like it) to monitor the actual network interactions - then you can analyze why the site is broken for some client libraries. Ensure that you repeat multiple times by scripting the test, so you can see if the problem is consistent:
import urllib2
url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'
stream = urllib2.urlopen(url)
expected = int(stream.headers['content-length'])
data = stream.read()
datalen = len(data)
print expected, datalen, expected == datalen
The site's working consistently for me so I can't give examples of finding failures :)
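If you do hit intermittent failures, here is a small sketch of scripting the repetition, reusing the same URL and content-length comparison as above:
import urllib2

url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'

# Repeat the check several times to see whether short reads are consistent
for attempt in range(10):
    stream = urllib2.urlopen(url)
    expected = int(stream.headers['content-length'])
    data = stream.read()
    print attempt, expected, len(data), expected == len(data)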

Keep calling stream.read() until it's done...
data = stream.read(8192)
while data:
    # ... do stuff with data
    data = stream.read(8192)

readlines() also works.
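For example, a small sketch along the same lines, joining the lines back together:
import urllib2

stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
# readlines() keeps reading until the response is exhausted, one line at a time
data = ''.join(stream.readlines())
print len(data)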

Related

How can I get all the content on a website

I would like to do web scraping, so I make a simple request:
import urllib.request
fp = urllib.request.urlopen("https://www.iadfrance.fr/trouver-un-conseiller")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
faa = open("demofile2.txt", "a")
faa.write(mystr)
faa.close()
fp.close()
But I don't find any names in my file. Why? And is there a way to get all the advisers shown on the map?
Thanks for your answers!
Here is how you get the data:
import requests

r = requests.get('https://www.iadfrance.fr/agent-search-location?southwestlat=48.8251752&southwestlng=2.2935677&northeastlat=48.8816507&northeastlng=2.4039459')
if r.status_code == 200:
    print(r.json())
else:
    print(f'Oops. Status code is {r.status_code}')
The fundamental concept here has a name: HATEOAS, Hypermedia as the Engine of Application State.
The first response you get contains links to the next set of resources you need to request. In turn, they may contain quite a few more. Some of those resources might be JavaScript, which, when executed, requests even more data. That's inconvenient and a violation of the theoretical HATEOAS model, but it is very much the practice for interactive websites.
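As a hedged sketch only: assuming the agent-search-location endpoint returns a JSON object containing a list of agent records with a name field (the 'agents' and 'name' keys below are guesses, not documented field names), you could pull the names out like this:
import requests

url = ('https://www.iadfrance.fr/agent-search-location'
       '?southwestlat=48.8251752&southwestlng=2.2935677'
       '&northeastlat=48.8816507&northeastlng=2.4039459')
r = requests.get(url)
r.raise_for_status()
payload = r.json()

# Inspect the payload first to learn its real shape; the 'agents' and 'name'
# keys below are assumptions, so adjust them to what the API actually returns.
if isinstance(payload, dict):
    for record in payload.get('agents', []):
        print(record.get('name'))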

What exactly does the requests function do?

So I'm trying to send a request to a webpage and read its response. I wrote some code that compares the request's result with the page, and I can't get the same page text. Am I using "requests" correctly?
I really think that I misunderstand how the requests function works and what it does. Can someone help me, please?
import requests
import urllib

def search():
    pr = {'q': 'pink'}
    r = requests.get('http://stackoverflow.com/search', params=pr)
    returntext = r.text
    urllibtest(returntext)

def urllibtest(returntext):
    connection = urllib.urlopen("http://stackoverflow.com/search?q=pink")
    output = connection.read()
    connection.close()
    if output == returntext:
        print("ITS THE SAME PAGE")
    else:
        print("ITS NOT THE SAME PAGE")

search()
First of all, there is no good reason to expect two different Stack Overflow searches to return the exact same response anyway.
There is one logical difference here too: requests automatically decodes the output for you:
>>> type(output)
str
>>> type(r.text)
unicode
You can use .content instead if you don't want it decoded, and use a more predictable source to see the same content returned - for example:
>>> r1 = urllib.urlopen('http://httpbin.org').read()
>>> r2 = requests.get('http://httpbin.org').content
>>> r1 == r2
True
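To make the decoding relationship concrete, here is a small sketch (assuming the server declares a charset, so r.encoding is set): r.content holds the raw bytes, and r.text is those bytes decoded.
import requests

r = requests.get('http://httpbin.org')
print type(r.content)   # str: the raw bytes (Python 2)
print type(r.text)      # unicode: the decoded text
# When the server declares a charset, decoding .content yourself matches .text
print r.text == r.content.decode(r.encoding)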

Python-JSON - How to parse API output?

I'm pretty new.
I wrote this Python script to make an API call to blockr.io to check the balance of multiple Bitcoin addresses.
The contents of btcaddy.txt are Bitcoin addresses separated by commas. For this example, let it parse this.
import urllib2
import json
btcaddy = open("btcaddy.txt","r")
urlRequest = urllib2.Request("http://btc.blockr.io/api/v1/address/info/" + btcaddy.read())
data = urllib2.urlopen(urlRequest).read()
json_data = json.loads(data)
balance = float(json_data['data''address'])
print balance
raw_input()
However, it gives me an error. What am I doing wrong? For now, how do I get it to print the balance of the addresses?
You've done multiple things wrong in your code. Here's my fix. I recommend a for loop.
import json
import urllib

# strip any trailing newline so the URL stays valid
addresses = open("btcaddy.txt", "r").read().strip()
base_url = "http://btc.blockr.io/api/v1/address/info/"
request = urllib.urlopen(base_url + addresses)
result = json.loads(request.read())['data']
for balance in result:
    print balance['address'], ":", balance['balance'], "BTC"
You don't need the raw_input() at the end, either.
Your question is clear, but your attempts are not.
You said you have a file with more than one address in it, so you need to retrieve the lines of that file.
with open("btcaddy.txt", "r") as a:
    addresses = a.readlines()
Now you can iterate over the addresses and make a request to the URI for each one. The urllib module is enough for this task.
import json
import urllib.request

base_url = "http://btc.blockr.io/api/v1/address/info/%s"
for address in addresses:
    # strip the trailing newline so the URL stays valid
    request = urllib.request.urlopen(base_url % address.strip())
    result = json.loads(request.read().decode('utf8'))
    print(result)
HTTP sends bytes in the response, so you should use decode('utf8') to turn the data into text.
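The question describes btcaddy.txt as comma-separated rather than one address per line; here is a hedged sketch of the same approach for that format (one request per address, as above):
import json
import urllib.request

base_url = "http://btc.blockr.io/api/v1/address/info/%s"

with open("btcaddy.txt", "r") as f:
    # Comma-separated addresses, as described in the question
    addresses = [part.strip() for part in f.read().split(",") if part.strip()]

for address in addresses:
    response = urllib.request.urlopen(base_url % address)
    print(json.loads(response.read().decode("utf8")))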

Retrieving an image over HTTP in Python

I am reading a free ebook called "Python for Informatics".
I have the following code:
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""

while True:
    data = mysock.recv(5120)
    if (len(data) < 1):
        break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "w")
fhand.write(picture)
fhand.close()
I have no knowledge of HTTP and am having a hard time understanding the above code!
I think I do understand what mysock.connect() and mysock.send() do, however I need an explanation of the first line: 1) mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM). What does it do?
Now, about the line: 2) data = mysock.recv(5120). It creates a variable called data which saves up to 5120 bytes each time the while loop runs. But what type of data is this, and what happens when I run picture = picture + data? It's picture = "" + data, so it adds a string to what? If I am right, data holds both string data (the header) and the JPEG file???
and finally: 3)
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]
pos = picture.find("\r\n\r\n") searches inside the picture variable to find two newlines ("\n\n") because we used the line mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')??
Is there any way to instantly save the JPEG file on our hard drive without retrieving the HTTP header and separating the header from the JPEG file?
Sorry for my English... Feel free to ask about anything that you don't understand!
Thanks
The line mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) calls the socket class from the socket library to create a new network endpoint. socket.AF_INET tells the call to create an IP-based socket, and socket.SOCK_STREAM requests a stream-oriented (TCP) socket, which automatically sends any necessary acknowledgements and retries as appropriate.
The statement data = mysock.recv(5120) reads chunks of up to 5120 bytes. When there is no more data the recv() call returns the empty string. The test seems rather perverse, and it would IMHO be better to use if len(data) == 0 or even if not len(data), but this is a detail of style rather than substance. The statement picture = picture + data therefore accumulates the response data 5120 bytes at a time (though the naming is poor, because the accumulated data actually includes the HTTP headers as well as the picture data).
The statement pos = picture.find("\r\n\r\n") seeks inside the returned string to locate the end of the HTTP headers. Since find() returns the position of the beginning rather than the end of the "\r\n\r\n" separator, 4 must be added to the offset to give the starting position of the picture data.
The example given is attempting to demonstrate low-level access to HTTP data without, apparently, giving you sufficient background about what is going on. A more normal way to access the data would use a higher-level library such as urllib. Here's some code that retrieves the image much more simply:
>>> import urllib
>>> response = urllib.urlopen("http://www.py4inf.com/cover.jpg")
>>> content = response.read()
>>> outf = open("cover.jpg", 'wb')
>>> outf.write(content)
>>> outf.close()
I could open the resulting JPEG file without any issues.
EDIT 2020-10-09 A more up-to-date way of obtaining the same result would use the requests module to the same effect, and a context manager to ensure correct resource management.
>>> import requests
>>> response = requests.get("http://www.py4inf.com/cover.jpg")
>>> with open("result.jpg", "wb") as outf:
...     outf.write(response.content)
...
70057
>>>
Your first question has been asked and answered several times on SO. The short answer is, "It's just a technicality; you don't really need to know."
You are correct.
The header ends with two CRLF. If you save the file without discarding the header, it won't be in JPEG format, and you won't be able to use it. The header is there to permit the file to be transmitted over the internet. You have to discard it and save only the payload.
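As a minimal sketch of doing exactly that at the socket level (same server and path as the question; binary mode is used so the JPEG bytes are not altered when writing):
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\r\n\r\n')

response = ""
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    response += data
mysock.close()

# The headers end at the first blank line (CRLF CRLF); everything after is the JPEG
pos = response.find("\r\n\r\n")
with open("stuff.jpg", "wb") as fhand:
    fhand.write(response[pos + 4:])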

IncompleteRead using httplib

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.
I have read about a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the number of bytes that are actually retrieved, I have no certainty that such a solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead
url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e

# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()

content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission-critical problem, just a curiosity: even though I can expect urllib2 to have this problem, I am surprised that this error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors there depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, except for my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.
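If you'd rather not patch httplib globally, a narrower sketch of the same idea is to wrap only the reads you make yourself (read_all is a hypothetical helper name, reusing the try/except shown above):
import httplib
import urllib2

def read_all(response):
    # Return whatever the server actually sent, even if it mis-reports the body length
    try:
        return response.read()
    except httplib.IncompleteRead, e:
        return e.partial

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
contents = read_all(urllib2.urlopen(url))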
I found that, in my case, sending an HTTP/1.0 request fixed the problem; I just add this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I do the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
I fixed the issue by using HTTPS instead of HTTP, and it works fine. No code change was required beyond switching the URL scheme.
