Retrieving an image over HTTP in Python

I am reading from a free ebook called "Python for Informatics".
I have the following code:
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""
while True:
    data = mysock.recv(5120)
    if (len(data) < 1):
        break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "w")
fhand.write(picture)
fhand.close()
I have no knowledge of HTTP and am having a hard time understanding the above code!
I think I understand what mysock.connect() and mysock.send() do, but I need an explanation of the first line: 1) mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM). What does it do?
Now, about the line: 2) data = mysock.recv(5120). It creates a variable called data that stores up to 5120 bytes each time the while loop runs. But what type of data is this, and what happens when I run picture = picture + data? It's picture = "" + data, so it adds a string to what? If I'm right, data contains both string data (the header) and the JPEG file???
And finally: 3)
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]
Does pos = picture.find("\r\n\r\n") search inside the picture variable to find two newlines ("\n\n") because we used the line mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')?
Is there any way to save the JPEG file straight to the hard drive without retrieving the HTTP header and separating the header from the JPEG data?
Sorry for my English... Feel free to ask about anything you don't understand!
Thanks

The line mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) calls the socket class from the socket library to create a new network endpoint. socket.AF_INET tells the call to create an IP-based socket, and socket.SOCK_STREAM requests a stream-oriented (TCP) socket, which automatically sends any necessary acknowledgements and retries as appropriate.
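For comparison, here is a minimal sketch of the same request written for Python 3, where socket data must be bytes and HTTP lines should end with \r\n (it assumes www.py4inf.com still serves the image over plain HTTP on port 80):
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # IPv4, TCP
sock.connect(('www.py4inf.com', 80))
# HTTP lines end with \r\n, and a blank line ends the request headers.
sock.sendall(b'GET /cover.jpg HTTP/1.0\r\nHost: www.py4inf.com\r\n\r\n')

chunks = []
while True:
    chunk = sock.recv(5120)        # up to 5120 bytes per call
    if not chunk:                  # empty bytes: the server closed the connection
        break
    chunks.append(chunk)
sock.close()

response = b''.join(chunks)        # headers + image body, as bytes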
The statement data = mysock.recv(5120) reads chunks of up to 5120 bytes. When there is no more data the recv() call returns the empty string. The test seems rather perverse, and it would IMHO be better to use if len(data) == 0 or even if not len(data), but this is a detail of style rather than substance. The statement picture = picture + data therefore accumulates the response data 5120 bytes at a time (though the naming is poor, because the accumulated data actually includes the HTTP headers as well as the picture data).
The statement pos = picture.find("\r\n\r\n") searches the returned string for the blank line that marks the end of the HTTP headers. Since find() returns the position of the beginning of that "\r\n\r\n" separator rather than its end, 4 must be added to the offset to give the starting position of the picture data.
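As an illustration, the split itself is just string slicing; a minimal sketch with a made-up response value:
# Hypothetical response: status line, headers, a blank line, then the payload.
response = "HTTP/1.0 200 OK\r\nContent-Type: image/jpeg\r\n\r\n...jpeg data..."

pos = response.find("\r\n\r\n")     # index of the first character of the separator
headers = response[:pos]            # everything before the blank line
body = response[pos + 4:]           # skip the four separator characters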
The example given is attempting to demonstrate low-level access to HTTP data without, apparently, giving you sufficient background about what is going on. A more normal way to access the data would use a higher-level library such as urllib. Here's some code that retrieves the image much more simply:
>>> import urllib
>>> response = urllib.urlopen("http://www.py4inf.com/cover.jpg")
>>> content = response.read()
>>> outf = open("cover.jpg", 'wb')
>>> outf.write(content)
>>> outf.close()
I could open the resulting JPEG file without any issues.
EDIT 2020-10-09 A more up-to-date way of obtaining the same result would use the requests module to the same effect, and a context manager to ensure correct resource management.
>>> import requests
>>> response = requests.get("http://www.py4inf.com/cover.jpg")
>>> with open("result.jpg", "wb") as outf:
...     outf.write(response.content)
...
70057
>>>
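For larger files it may be preferable to stream the download rather than hold the whole body in memory; a sketch using the same requests module:
import requests

# Streaming keeps memory use bounded; each chunk is written as it arrives.
with requests.get("http://www.py4inf.com/cover.jpg", stream=True) as response:
    response.raise_for_status()
    with open("result.jpg", "wb") as outf:
        for chunk in response.iter_content(chunk_size=8192):
            outf.write(chunk)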

Your first question has been asked and answered several times on SO. The short answer is, "It's just a technicality; you don't really need to know."
You are correct.
The header ends with two CRLFs (a blank line). If you save the file without discarding the header, it won't be in JPEG format and you won't be able to use it. The header is HTTP metadata that accompanies the file when it is transmitted over the internet; you have to discard it and save only the payload.
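That said, you don't have to do the separation by hand: a higher-level call such as urlretrieve handles the header for you. A minimal sketch (urllib.urlretrieve in Python 2, urllib.request.urlretrieve in Python 3):
# urlretrieve strips the HTTP headers and writes only the body to disk.
from urllib.request import urlretrieve

urlretrieve("http://www.py4inf.com/cover.jpg", "cover.jpg")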

Related

How do I create and write headers to a csv file if it doesn't exist, but if it already exists then write data?

This is my first post here on Stack Overflow. I'm trying to dip my feet into Python by writing a program that calls the API of an online game I play :)
I've written the below code to create a .csv file if it doesn't exist, and then use a for loop to call an API twice (each time with a different match ID). The response is in JSON, and the idea is that if the file is empty (i.e. newly created), it will execute the if statement to write the headers in, and if it's not empty (i.e. the headers have already been written), it will write only the values.
My code returns a .csv with the headers written twice - so for some reason, within the for loop, the file size doesn't change even though the headers have been written. Is there something I'm missing here? Much appreciated!
import urllib.request, urllib.parse, urllib.error
import json
import csv
import ssl
import os

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

api_key = 'XXX'
puuid = 'XXX'
matchlist = ['0e8194de-f011-4018-aca2-36b1a749badd','ae207558-599e-480e-ae97-7b59f97ec8d7']

f = csv.writer(open('my_file.csv','w+'))

for matchid in matchlist:
    matchdeturl = 'https://europe.api.riotgames.com/lor/match/v1/matches/' + matchid + '?api_key=' + api_key
    matchdetuh = urllib.request.urlopen(matchdeturl, context = ctx)
    matchdet = json.loads(matchdetuh.read().decode())
    matchplayers = matchdet['info']

    # if file is blank, write headers, if not write values
    if os.stat('my_file.csv').st_size == 0:
        f.writerow(list(matchplayers))
        f.writerow(matchplayers.values())
    else:
        f.writerow(matchplayers.values())
It's possible that the file buffers instead of writing immediately to disk because IO is an expensive operation. Either flush the file before checking its size, or set a flag in your loop and check that flag instead of checking the file size.
f = csv.writer(open('my_file.csv','w+'))
needs_header = os.stat('my_file.csv').st_size == 0

for matchid in matchlist:
    # do stuff

    # if file needs a header, write headers
    if needs_header:
        f.writerow(list(matchplayers))
        needs_header = False

    # Then, write values
    f.writerow(matchplayers.values())
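For completeness, the flush-based alternative mentioned above would look roughly like this (a sketch; keeping a handle to the underlying file object lets you flush it before the size check):
csvfile = open('my_file.csv', 'w+', newline='')
f = csv.writer(csvfile)

for matchid in matchlist:
    # ... fetch matchplayers as before ...

    csvfile.flush()                               # push buffered rows to disk
    # if the file is still empty on disk, write the headers first
    if os.stat('my_file.csv').st_size == 0:
        f.writerow(list(matchplayers))
    f.writerow(matchplayers.values())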

Should I switch from "urllib.request.urlretrieve(..)" to "urllib.request.urlopen(..)"?

1. Deprecation problem
In Python 3.7, I download a big file from a URL using the urllib.request.urlretrieve(..) function. In the documentation (https://docs.python.org/3/library/urllib.request.html) I read the following just above the urllib.request.urlretrieve(..) docs:
Legacy interface
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
2. Searching for an alternative
To keep my code future-proof, I'm on the lookout for an alternative. The official Python docs don't mention a specific one, but it looks like urllib.request.urlopen(..) is the most straightforward candidate. It's at the top of the docs page.
Unfortunately, the alternatives - like urlopen(..) - don't provide the reporthook argument. This argument is a callable you pass to the urlretrieve(..) function. In turn, urlretrieve(..) calls it regularly with the following arguments:
block nr.
block size
total file size
I use it to update a progressbar. That's why I miss the reporthook argument in alternatives.
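For reference, a reporthook is just a callable taking those three arguments; a simplified sketch of the kind of callback I mean (the total size is -1 when the server sends no Content-Length):
def reporthook(block_num, block_size, total_size):
    # Called by urlretrieve(..) after each block has been written.
    downloaded = block_num * block_size
    if total_size > 0:
        percent = min(100.0, downloaded * 100.0 / total_size)
        print("\r%5.1f %%" % percent, end="")
    else:
        print("\r%d bytes" % downloaded, end="")

# urllib.request.urlretrieve(url, filename, reporthook)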
3. urlretrieve(..) vs urlopen(..)
I discovered that urlretrieve(..) simply uses urlopen(..). See the request.py code file in the Python 3.7 installation (Python37/Lib/urllib/request.py):
_url_tempfiles = []

def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)

    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
4. Conclusion
From all this, I see three possible decisions:
I keep my code unchanged. Let's hope the urlretrieve(..) function won't get deprecated anytime soon.
I write myself a replacement function behaving like urlretrieve(..) on the outside and using urlopen(..) on the inside. Actually, such a function would be more or less a copy-paste of the code above. It feels unclean to do that - compared to using the official urlretrieve(..).
I write myself a replacement function behaving like urlretrieve(..) on the outside and using something entirely different on the inside. But hey, why would I do that? urlopen(..) is not deprecated, so why not use it?
What decision would you take?
The following example uses urllib.request.urlopen to download a zip file containing Oceania's crop production data from the FAO statistical database. In that example, it is necessary to define a minimal header, otherwise FAOSTAT throws an Error 403: Forbidden.
import shutil
import urllib.request
import tempfile

# Create a request object with URL and headers
url = "http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip"
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)

# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at: {dest_file}")

# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)
Based on https://stackoverflow.com/a/48691447/2641825 for the request part and https://stackoverflow.com/a/66591873/2641825 for the header part. See also urllib's documentation at https://docs.python.org/3/howto/urllib2.html
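Regarding decision 2 in the question: a urlretrieve-like helper built on urlopen(..) that keeps the reporthook-style callback could look roughly like this (a sketch, not the stdlib implementation; the function name is my own):
import urllib.request

def urlopen_retrieve(url, filename, reporthook=None, chunk_size=8192):
    # Download url to filename, calling reporthook(block_num, block_size, total_size)
    # after each chunk, in the same spirit as urlretrieve(..).
    with urllib.request.urlopen(url) as response:
        total_size = int(response.headers.get("Content-Length", -1))
        block_num = 0
        if reporthook:
            reporthook(block_num, chunk_size, total_size)
        with open(filename, "wb") as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                block_num += 1
                if reporthook:
                    reporthook(block_num, chunk_size, total_size)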

Python-JSON - How to parse API output?

I'm pretty new.
I wrote this python script to make an API call from blockr.io to check the balance of multiple bitcoin addresses.
The contents of btcaddy.txt are bitcoin addresses separated by commas. For this example, let it parse this.
import urllib2
import json
btcaddy = open("btcaddy.txt","r")
urlRequest = urllib2.Request("http://btc.blockr.io/api/v1/address/info/" + btcaddy.read())
data = urllib2.urlopen(urlRequest).read()
json_data = json.loads(data)
balance = float(json_data['data''address'])
print balance
raw_input()
However, it gives me an error. What am I doing wrong? For now, how do I get it to print the balance of the addresses?
You've done multiple things wrong in your code. Here's my fix. I recommend a for loop.
import json
import urllib

addresses = open("btcaddy.txt", "r").read()
base_url = "http://btc.blockr.io/api/v1/address/info/"
request = urllib.urlopen(base_url + addresses)
result = json.loads(request.read())['data']

for balance in result:
    print balance['address'], ":", balance['balance'], "BTC"
You don't need the input at the end, either.
Your question is clear, but your attempt is not.
You said you have a file with more than one record, so you need to retrieve the lines of this file.
with open("btcaddy.txt", "r") as a:
    addresses = a.readlines()
Now you can iterate over the records and make a request to that URI. The urllib module is enough for this task.
import json
import urllib.request

base_url = "http://btc.blockr.io/api/v1/address/info/%s"

for address in addresses:
    # strip the trailing newline before building the URL
    request = urllib.request.urlopen(base_url % address.strip())
    result = json.loads(request.read().decode('utf8'))
    print(result)
HTTP responses arrive as bytes, so you should use decode('utf8') to turn the data into a string before working with it.
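A tiny illustration of that bytes-to-text step (note that on Python 3.6+ json.loads also accepts bytes directly):
raw = b'{"status": "success", "data": []}'   # illustrative bytes payload
text = raw.decode('utf8')                     # bytes -> str
parsed = json.loads(text)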

Fetching data from a webpage using Python and writing the output to a file

I am trying to fetch some data from the website (http://www.usgs.gov/) using Python and JSON. It works fine when I execute the original script (which I found in a tutorial), but when I try to write the output to a local file, I get errors on the last two lines saying invalid syntax (":"), and I get errors as well when I insert f.close(). I googled it and changed some of the script, but it doesn't work. I need help to fix this. I am using Python IDLE version 2.7.5.
import urllib2
import json

# Example file to parse and process JSON
f = open("output.txt")

# use the JSON module to load the string into a dictionary
def printResults(data):
    theJSON = json.loads(data)

    # access the contents of the JSON like any other object
    if "title" in theJSON["metadata"]:
        f.write(theJSON["metadata"]["title"])

    # output the no. of events + magnitude and each event name
    count = theJSON["metadata"]["count"];
    f.write(str(count) + " events recorded")

def main():
    urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_hour.geojson"

    # open the url and read the data
    webUrl = urllib2.urlopen(urlData)
    print webUrl.getcode()
    if (webUrl.getcode() == 200):
        data = webUrl.read()
        printResults(data)
    else:
        f.write( "Received an error from server,can't retrieve results" + str(webUrl.getcode())

if __name__ == "__main__":
    main()
You're missing a closing parenthesis on this line:
f.write( "Received an error from server,can't retrieve results" + str(webUrl.getcode())
Also, your indentation is not consistent. You need to make sure your indents are always exactly four spaces. It's probably best to use an editor that does this for you automatically.
got errors as well when i insert f.close()
Although it's always good practice, you don't need to close files explicitly in Python. They will be closed when they are garbage collected (typically once there are no references to the object; in this case, when the program terminates).
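That said, the usual idiom is a with block, which closes the file for you even if an error occurs partway through:
with open("output.txt", "w") as f:
    f.write("results go here")
# f is closed automatically when the block ends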
I made two changes to this script, and I think it now works correctly.
First, I added a second argument to the open function, 'w' for write:
f = open("output.txt", 'w')
Second, I changed the last lines of the file as follows:
if (webUrl.getcode() == 200):
    data = webUrl.read()
    printResults(data)
else:
    f.write("Received an error from server, can't retrieve results " + str(webUrl.getcode()))
Be careful with indentation, and don't forget the parentheses!

urllib2 not retrieving entire HTTP response

I'm perplexed as to why I'm not able to download the entire contents of some JSON responses from FriendFeed using urllib2.
>>> import urllib2
>>> stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
>>> stream.headers['content-length']
'168928'
>>> data = stream.read()
>>> len(data)
61058
>>> # We can see here that I did not retrieve the full JSON
... # given that the stream doesn't end with a closing }
...
>>> data[-40:]
'ce2-003048343a40","name":"Vincent Racani'
How can I retrieve the full response with urllib2?
Best way to get all of the data:
fp = urllib2.urlopen("http://www.example.com/index.cfm")

response = ""
while 1:
    data = fp.read()
    if not data:  # This might need to be if data == "": -- can't remember
        break
    response += data

print response
The reason is that .read() isn't guaranteed to return the entire response, given the nature of sockets. I thought this was discussed in the documentation (maybe urllib) but I cannot find it.
Use tcpdump (or something like it) to monitor the actual network interactions - then you can analyze why the site is broken for some client libraries. Ensure that you repeat multiple times by scripting the test, so you can see if the problem is consistent:
import urllib2
url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'
stream = urllib2.urlopen(url)
expected = int(stream.headers['content-length'])
data = stream.read()
datalen = len(data)
print expected, datalen, expected == datalen
The site's working consistently for me so I can't give examples of finding failures :)
Keep calling stream.read() until it's done...
data = stream.read(8192)
while data:
    # ... do stuff with data
    data = stream.read(8192)
readlines() also works.
