I'm working on JSON data from this API call:
https://api.nfz.gov.pl/app-umw-api/agreements?year=2022&branch=01&productCode=01.0010.094.01&page=1&limit=10&format=json&api-version=1.2
This is page 1, but there are 49 pages in total, so part of my code deals (successfully) with pagination. I don't want to save this JSON in a file and, if I can avoid it, I don't really want to import the 'json' package - but I will if necessary.
A variation of this code works correctly if I'm pulling the entire ['data']['agreements'] dictionary (or is it a list...).
But I don't want that; I want individual parameters for all the 'attributes' of each 'agreement'. In my code below I'm trying to pull the 'provider-name' attribute, and would like to get a list of all the provider names, without any other data.
But I keep getting the "list indices must be integers or slices, not str" error in line 18. I've tried many ways to get at this data, which is nested within a list nested within a dictionary, etc., like splitting it into another 'for' loop, but without success.
import requests
import math
import pandas as pd

baseurl = 'https://api.nfz.gov.pl/app-umw-api/agreements?year=2022&branch=01&productCode=01.0010.094.01&page=1&limit=10&format=json&api-version=1.2'

def main_request(baseurl, x):
    r = requests.get(baseurl + f'&page={x}')
    return r.json()

def get_pages(response):
    return math.ceil(response['meta']['count'] / 10)

def get_names(response):
    providerlist = []
    all_data = response['data']['agreements']
    for attributes1 in all_data['data']['agreements']:
        item = attributes1['attributes']['provider-name']
        providers = {
            'page1': item,
        }
        providerlist.append(providers)
    return providerlist

mainlist = []
data = main_request(baseurl, 1)
for x in range(1, get_pages(data) + 1):
    mainlist.extend(get_names(main_request(baseurl, x)))

mydataframe = pd.DataFrame(mainlist)
print(mydataframe)
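For reference, the TypeError in get_names comes from indexing the agreements list a second time with string keys: all_data is already the list of agreement objects, so the inner ['data']['agreements'] lookup fails. A minimal sketch of the corrected function, assuming the response shape shown above (the dictionary key used here is just illustrative):

def get_names(response):
    # response['data']['agreements'] is already the list of agreements,
    # so iterate over it directly instead of indexing it again with string keys.
    providerlist = []
    for agreement in response['data']['agreements']:
        providerlist.append({'provider-name': agreement['attributes']['provider-name']})
    return providerlist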
To get the data from the JSON into the dataframe you can use the next example:
import requests
import pandas as pd

api_url = "https://api.nfz.gov.pl/app-umw-api/agreements?year=2022&branch=01&productCode=01.0010.094.01&page={}&limit=10&format=json&api-version=1.2"

all_data = []
for page in range(1, 5):  # <-- increase page numbers here
    data = requests.get(api_url.format(page)).json()
    for a in data["data"]["agreements"]:
        all_data.append({"id": a["id"], **a["attributes"], "link": a["links"]["related"]})

df = pd.DataFrame(all_data)
print(df.head().to_markdown(index=False))
Prints:
| id | code | technical-code | origin-code | service-type | service-name | amount | updated-at | provider-code | provider-nip | provider-regon | provider-registry-number | provider-name | provider-place | year | branch | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75f1b5a0-34d1-d827-8970-89b6b593be86 | 0113/3202010/01/2022/01 | 0113/3202010/01/2022/01 | 0113/3202010/01/2022/01 | 01 | Podstawowa Opieka Zdrowotna | 14583.7 | 2022-07-11T20:04:39 | 3202010 | 8851039259 | 89019398100026 | 000000001951-W-02 | NZOZ PRAKTYKA LEKARZA RODZINNEGO JAN WOLAŃCZYK | JEDLINA-ZDRÓJ | 2022 | 01 | https://api.nfz.gov.pl/app-umw-api/agreements/75f1b5a0-34d1-d827-8970-89b6b593be86?format=json&api-version=1.2 |
| 1840cf6e-10ba-33a1-81f1-9f58c613d705 | 0113/3302665/01/2022/01 | 0113/3302665/01/2022/01 | 0113/3302665/01/2022/01 | 01 | Podstawowa Opieka Zdrowotna | 1479 | 2022-08-03T20:00:22 | 3302665 | 9281731555 | 390737391 | 000000023969-W-02 | NZOZ "MEDICA" | PĘCŁAW | 2022 | 01 | https://api.nfz.gov.pl/app-umw-api/agreements/1840cf6e-10ba-33a1-81f1-9f58c613d705?format=json&api-version=1.2 |
| 954eb365-e232-fd29-10f7-c8af21c07470 | 0113/3402005/01/2022/01 | 0113/3402005/01/2022/01 | 0113/3402005/01/2022/01 | 01 | Podstawowa Opieka Zdrowotna | 1936 | 2022-09-02T20:01:17 | 3402005 | 6121368883 | 23106871400021 | 000000002014-W-02 | PRZYCHODNIA OGÓLNA TSARAKHOV OLEG | BOLESŁAWIEC | 2022 | 01 | https://api.nfz.gov.pl/app-umw-api/agreements/954eb365-e232-fd29-10f7-c8af21c07470?format=json&api-version=1.2 |
| 7dd72607-ab9f-7217-87b9-8e4ed2bc5537 | 0113/3202025/01/2022/01 | 0113/3202025/01/2022/01 | 0113/3202025/01/2022/01 | 01 | Podstawowa Opieka Zdrowotna | 0 | 2022-04-14T20:01:42 | 3202025 | 8851557014 | 891487450 | 000000002063-W-02 | "PRZYCHODNIA LEKARSKA ZDROWIE BIELAK, PIEC I SZYMANIAK SPÓŁKA PARTNERSKA" | NOWA RUDA | 2022 | 01 | https://api.nfz.gov.pl/app-umw-api/agreements/7dd72607-ab9f-7217-87b9-8e4ed2bc5537?format=json&api-version=1.2 |
| bb60b21d-38da-1f2e-a7fd-5a45453e7370 | 0113/3102115/01/2022/01 | 0113/3102115/01/2022/01 | 0113/3102115/01/2022/01 | 01 | Podstawowa Opieka Zdrowotna | 414 | 2022-10-18T20:01:17 | 3102115 | 8941504470 | 93009444900038 | 000000001154-W-02 | PRAKTYKA LEKARZA RODZINNEGO WALDEMAR CHRYSTOWSKI | WROCŁAW | 2022 | 01 | https://api.nfz.gov.pl/app-umw-api/agreements/bb60b21d-38da-1f2e-a7fd-5a45453e7370?format=json&api-version=1.2 |
I am trying to extract data from a Twitter JSON file retrieved using tweepy streaming.
Here is my code for streaming:
class MyListener(Stream):
    t_count = 0

    def on_data(self, data):
        print(data)
        self.t_count += 0
        # stop by
        if self.t_count >= 5000:
            sys.exit("exit")
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    stream = MyListener(consumer_key, consumer_secret, access_token, access_token_secret)
    stream.filter(track=['corona'], languages=["en"])
Here is my code for reading the file:
with open("covid-test-out", "r") as f:
    count = 0
    for line in f:
        data = json.loads(line)
Then I got the error
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here is one line in the JSON file. I noticed that there is a b prefix in front of each line, but when I check the type of the line it is not a bytes object, it is still a string object. I am not even sure if this is the reason that I cannot get the correct data.
b'{"created_at":"Mon Nov 22 07:37:46 +0000 2021","id":1462686730956333061,"id_str":"1462686730956333061","text":"RT #corybernardi: Scientists 'mystified'. \n\nhttps:\/\/t.co\/rvTYCUEQ74","source":"\u003ca href=\"https:\/\/mobile.twitter.com\" rel=\"nofollow\"\u003eTwitter Web App\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1336870146242056192,"id_str":"1336870146242056192","name":"Terence Byrnes","screen_name":"byrnes_terence","location":null,"url":null,"description":"Retired Aussie. Against mandatory vaccinations, government interference in our lives, and the climate cult. Now on Gab Social as a backup : Terence50","translator_type":"none","protected":false,"verified":false,"followers_count":960,"friends_count":1012,"listed_count":3,"favourites_count":15163,"statuses_count":171876,"created_at":"Thu Dec 10 03:08:01 +0000 2020","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1428994180458508292\/fT2Olt4J_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1428994180458508292\/fT2Olt4J_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1336870146242056192\/1631520259","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"withheld_in_countries":[]},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Sun Nov 21 19:42:14 +0000 2021","id":1462506658421112834,"id_str":"1462506658421112834","text":"Scientists 'mystified'. \n\nhttps:\/\/t.co\/rvTYCUEQ74","source":"\u003ca href=\"https:\/\/mobile.twitter.com\" rel=\"nofollow\"\u003eTwitter Web App\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":80965423,"id_str":"80965423","name":"CoryBernardi.com.au","screen_name":"corybernardi","location":"Adelaide ","url":"http:\/\/www.corybernardi.com.au","description":"Get your free Weekly Dose of Common Sense email at https:\/\/t.co\/MAJpp7iZJy.\n\nLaughing at liars and leftists since 2006. 
Tweets deleted weekly to infuriate losers.","translator_type":"none","protected":false,"verified":true,"followers_count":47794,"friends_count":63,"listed_count":461,"favourites_count":112,"statuses_count":55,"created_at":"Thu Oct 08 22:54:55 +0000 2009","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1446336496827387904\/Ay6QRHQt_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1446336496827387904\/Ay6QRHQt_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/80965423\/1633668973","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"withheld_in_countries":[]},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":5,"reply_count":30,"retweet_count":40,"favorite_count":136,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/rvTYCUEQ74","expanded_url":"https:\/\/apnews.com\/article\/coronavirus-pandemic-science-health-pandemics-united-nations-fcf28a83c9352a67e50aa2172eb01a2f","display_url":"apnews.com\/article\/corona\u2026","indices":[26,49]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/rvTYCUEQ74","expanded_url":"https:\/\/apnews.com\/article\/coronavirus-pandemic-science-health-pandemics-united-nations-fcf28a83c9352a67e50aa2172eb01a2f","display_url":"apnews.com\/article\/corona\u2026","indices":[44,67]}],"user_mentions":[{"screen_name":"corybernardi","name":"CoryBernardi.com.au","id":80965423,"id_str":"80965423","indices":[3,16]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1637566666722"}'
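If each saved line really is the textual repr of a bytes object (the b'...' wrapper is part of the string, which is what you would get from writing the raw data with str/print), one way to recover the JSON is to evaluate that literal back into bytes before decoding. A minimal sketch, assuming the file name covid-test-out from above and that every line has that shape:

import ast
import json

tweets = []
with open("covid-test-out", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line.startswith("b'") or line.startswith('b"'):
            # The line is the repr of a bytes object; turn it back into bytes,
            # then decode it to a normal str before parsing the JSON.
            line = ast.literal_eval(line).decode("utf-8")
        tweets.append(json.loads(line))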
I've been using the script below to download technical videos for later analysis. The script has worked well for me and retrieves the highest resolution version available for the videos that I have needed.
Now I've come across a 4K YouTube video, and my script only saves an mp4 with 1280x720.
I'd like to know if there is a way to adjust my current script to download higher resolution versions of this video. I understand there are Python packages that might address this, but right now I would like to stick to this step-by-step method if possible.
above: info from Quicktime and OSX
"""
length: 175 seconds
quality: hd720
type: video/mp4; codecs="avc1.64001F, mp4a.40.2"
Last-Modified: Sun, 21 Aug 2016 10:41:48 GMT
Content-Type: video/mp4
Date: Sat, 01 Apr 2017 16:50:16 GMT
Expires: Sat, 01 Apr 2017 16:50:16 GMT
Cache-Control: private, max-age=21294
Accept-Ranges: bytes
Content-Length: 35933033
Connection: close
Alt-Svc: quic=":443"; ma=2592000
X-Content-Type-Options: nosniff
Server: gvs 1.
"""
import urlparse, urllib2

vid = "vzS1Vkpsi5k"
save_title = "YouTube SpaceX - Booster Number 4 - Thaicom 8 06-06-2016"
url_init = "https://www.youtube.com/get_video_info?video_id=" + vid

resp = urllib2.urlopen(url_init, timeout=10)
data = resp.read()
info = urlparse.parse_qs(data)

title = info['title']
print "length: ", info['length_seconds'][0] + " seconds"

stream_map = info['url_encoded_fmt_stream_map'][0]
vid_info = stream_map.split(",")
mp4_filename = save_title + ".mp4"

for video in vid_info:
    item = urlparse.parse_qs(video)
    print 'quality: ', item['quality'][0]
    print 'type: ', item['type'][0]
    url_download = item['url'][0]

    resp = urllib2.urlopen(url_download)
    print resp.headers
    length = int(resp.headers['Content-Length'])
    my_file = open(mp4_filename, "w+")

    done, i = 0, 0
    buff = resp.read(1024)
    while buff:
        my_file.write(buff)
        done += 1024
        percent = done * 100.0 / length
        buff = resp.read(1024)
        if not i % 1000:
            percent = done * 100.0 / length
            print str(percent) + "%"
        i += 1
    break
Ok, so I have not taken the time to get to the bottom of this. However, I did find that when you do:
stream_map = info['url_encoded_fmt_stream_map'][0]
Somehow you only get a selection of a single 720p option, one 'medium' and two 'small'.
However, if you change that line into:
stream_map = info['adaptive_fmts'][0]
you will get all the available versions, including the 2160p one. Thus, the 4K one.
PS: You'd have to comment out the print quality and print type commands, since those labels aren't always available in the new output. When commenting them out, and adapting your script as explained above, I was able to successfully download the 4K version.
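A sketch of what that adjusted portion of the loop might look like, as a drop-in replacement for the stream_map block in the original Python 2 script (it relies on info and urlparse already being defined above); the quality_label key and the idea of picking the entry whose type starts with video/mp4 are assumptions about the adaptive_fmts entries, not something confirmed here:

# Sketch only: swap the stream map source and avoid the 'quality'/'type' prints,
# since those keys are not always present in adaptive_fmts entries.
stream_map = info['adaptive_fmts'][0]
vid_info = stream_map.split(",")

for video in vid_info:
    item = urlparse.parse_qs(video)
    # 'quality_label' (e.g. "2160p") is assumed here; fall back if it is missing.
    label = item.get('quality_label', ['?'])[0]
    mime = item.get('type', ['?'])[0]
    print label, mime
    if label == '2160p' and mime.startswith('video/mp4'):
        url_download = item['url'][0]
        break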
Indeed,
info['adaptive_fmts'][0]
returns the information for the whole video, but the URL is not directly usable, nor is the progress bar.
urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:
urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
just returns silently, even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file, it's actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) could be used to tell whether the retrieval succeeded or not, but I can't find any documentation for httplib.HTTPMessage.
Can anybody provide some information about this problem?
Consider using urllib2 if it is possible in your case. It is more advanced and easier to use than urllib.
You can detect any HTTP errors easily:
>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found
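A minimal sketch of using that to fail loudly instead of silently writing an error page to disk (Python 2, matching the rest of this thread; the URL and file name are just placeholders):

import urllib2

try:
    resp = urllib2.urlopen("http://google.com/abc.jpg")
except urllib2.HTTPError, e:
    # 404 and other HTTP errors end up here instead of being saved to a file.
    print "download failed with HTTP status", e.code
else:
    with open("abc.jpg", "wb") as f:
        f.write(resp.read())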
resp is actually an HTTPResponse object that you can do a lot of useful things with:
>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
I keep it simple:
# Simple downloading with progress indicator, by Cees Timmerman, 16mar12.
import urllib2

remote = r"http://some.big.file"
local = r"c:\downloads\bigfile.dat"

u = urllib2.urlopen(remote)
h = u.info()
totalSize = int(h["Content-Length"])

print "Downloading %s bytes..." % totalSize,
fp = open(local, 'wb')

blockSize = 8192  # 100000 # urllib.urlretrieve uses 8192
count = 0
while True:
    chunk = u.read(blockSize)
    if not chunk: break
    fp.write(chunk)
    count += 1
    if totalSize > 0:
        percent = int(count * blockSize * 100 / totalSize)
        if percent > 100: percent = 100
        print "%2d%%" % percent,
        if percent < 100:
            print "\b\b\b\b\b",  # Erase "NN% "
        else:
            print "Done."

fp.flush()
fp.close()
if not totalSize:
    print
According to the documentation, it is undocumented.
To get access to the message, it looks like you do something like:
a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')
b is the message instance
Since I have been learning Python, I have found it is always useful to use Python's ability to be introspective. When I type
dir(b)
I see lots of methods and functions to play with.
And then I started doing things with b, for example
b.items()
which lists lots of interesting things. I suspect that playing around with these things will allow you to get the attribute you want to manipulate.
Sorry this is such a beginner's answer, but I am trying to master how to use the introspection abilities to improve my learning, and your question just popped up.
Well, I tried something interesting related to this: I was wondering if I could automatically get the output from each of the things that showed up in the directory that did not need parameters, so I wrote:
needparam = []
for each in dir(b):
    x = 'b.' + each + '()'
    try:
        eval(x)
        print x
    except:
        needparam.append(x)
You can create a new URLopener (inherit from FancyURLopener) and throw exceptions or handle errors any way you want. Unfortunately, FancyURLopener ignores 404 and other errors. See this question:
How to catch 404 error in urllib.urlretrieve
I ended up with my own retrieve implementation; with the help of pycurl it supports more protocols than urllib/urllib2. I hope it can help other people.
import tempfile
import pycurl
import os

def get_filename_parts_from_url(url):
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]
    return t

def retrieve(url, filename=None):
    if not filename:
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix='.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    try:
        c.perform()
    except:
        filename = None
    finally:
        c.close()
        f.close()
    return filename
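Hypothetical usage of the helper above (the URL is just a placeholder): retrieve returns None when pycurl raises, so the caller can tell a failed download apart from a saved file.

saved = retrieve('http://example.com/abc.jpg')
if saved is None:
    print 'download failed'
else:
    print 'saved to', saved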
class MyURLopener(urllib.FancyURLopener):
    http_error_default = urllib.URLopener.http_error_default

url = "http://page404.com"
filename = "download.txt"

def reporthook(blockcount, blocksize, totalsize):
    pass

...

try:
    (f, headers) = MyURLopener().retrieve(url, filename, reporthook)
except Exception, e:
    print e
:) My first post on StackOverflow, have been a lurker for years. :)
Sadly dir(urllib.urlretrieve) is deficient in useful information.
So, based on this thread so far, I tried writing this:
a,b = urllib.urlretrieve(imgURL, saveTo)
print "A:", a
print "B:", b
which produced this:
A: /home/myuser/targetfile.gif
B: Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Cache-Control: max-age=604800
Content-Type: image/gif
Date: Mon, 07 Mar 2016 23:37:34 GMT
Etag: "4e1a5d9cc0857184df682518b9b0da33"
Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
Server: ECS (hnd/057A)
Timing-Allow-Origin: *
X-Cache: HIT
Content-Length: 27027
Connection: close
I guess one can check something like:
if int(b['Content-Length']) > 0:
My next step is to test a scenario where the retrieve fails...
Results against another server/website - what comes back in "B" is a bit random, but one can test for certain values:
A: get_good.jpg
B: Date: Tue, 08 Mar 2016 00:44:19 GMT
Server: Apache
Last-Modified: Sat, 02 Jan 2016 09:17:21 GMT
ETag: "524cf9-18afe-528565aef9ef0"
Accept-Ranges: bytes
Content-Length: 101118
Connection: close
Content-Type: image/jpeg
A: get_bad.jpg
B: Date: Tue, 08 Mar 2016 00:44:20 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html
In the 'bad' case (non-existing image file) "B" retrieved a small chunk of (Googlebot?) HTML code and saved it as the target, hence Content-Length of 1363 bytes.
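Building on that observation, here is a minimal sketch of a post-download check (Python 2, same style as the rest of this thread; the URL and file name are placeholders): reject the result when the returned headers describe an HTML page rather than the image that was requested.

import urllib

imgURL = 'http://example.com/abc.jpg'   # placeholder URL
saveTo = 'abc.jpg'                      # placeholder local file name

a, b = urllib.urlretrieve(imgURL, saveTo)

# 'b' is the httplib.HTTPMessage; a 'found' image should report an image/*
# Content-Type, while the silent-failure case above came back as text/html.
content_type = b.getheader('Content-Type', '')
content_length = int(b.getheader('Content-Length', '0'))

if not content_type.startswith('image/') or content_length == 0:
    print 'retrieve probably failed:', content_type, content_length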