1. Deprecation problem
In Python 3.7, I download a big file from a URL using the urllib.request.urlretrieve(..) function. In the documentation (https://docs.python.org/3/library/urllib.request.html) I read the following just above the urllib.request.urlretrieve(..) docs:
Legacy interface
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
2. Searching for an alternative
To keep my code future-proof, I'm on the lookout for an alternative. The official Python docs don't mention a specific one, but it looks like urllib.request.urlopen(..) is the most straightforward candidate. It's at the top of the docs page.
Unfortunately, the alternatives - like urlopen(..) - don't provide the reporthook argument. This argument is a callable you pass to the urlretrieve(..) function. In turn, urlretrieve(..) calls it regularly with the following arguments:
block number
block size
total file size
I use it to update a progressbar. That's why I miss the reporthook argument in alternatives.
3. urlretrieve(..) vs urlopen(..)
I discovered that urlretrieve(..) simply uses urlopen(..). See the request.py code file in the Python 3.7 installation (Python37/Lib/urllib/request.py):
_url_tempfiles = []

def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)

    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
4. Conclusion
From all this, I see three possible decisions:
I keep my code unchanged. Let's hope the urlretrieve(..) function won't get deprecated anytime soon.
I write myself a replacement function behaving like urlretrieve(..) on the outside and using urlopen(..) on the inside. Actually, such function would be a copy-paste of the code above. It feels unclean to do that - compared to using the official urlretrieve(..).
I write myself a replacement function behaving like urlretrieve(..) on the outside and using something entirely different on the inside. But hey, why would I do that? urlopen(..) is not deprecated, so why not use it?
What decision would you take?
The following example uses urllib.request.urlopen to download a zip file containing Oceania's crop production data from the FAO statistical database. In that example, it is necessary to define a minimal header, otherwise FAOSTAT throws an Error 403: Forbidden.
import shutil
import urllib.request
import tempfile

# Create a request object with URL and headers
url = "http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip"
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)

# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at:{dest_file}")

# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)
Based on https://stackoverflow.com/a/48691447/2641825 for the request part and https://stackoverflow.com/a/66591873/2641825 for the header part, see also urllib's documentation at https://docs.python.org/3/howto/urllib2.html
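The question also asks about progress reporting, which shutil.copyfileobj does not offer. As a rough sketch only (not an official replacement), the copy loop from urlretrieve(..) can be rebuilt on top of urlopen(..) so that a reporthook still receives the block number, block size and total size. The names copy_with_progress and download_with_progress are made up for this illustration, and the total size is reported as -1 when the server sends no Content-Length header.

```python
import urllib.request


def copy_with_progress(src, dst, total_size=-1, reporthook=None, block_size=8192):
    """Copy src to dst block by block.

    reporthook, if given, is called as reporthook(blocknum, block_size, total_size),
    once before the first read and once after every block written.
    Returns the number of bytes copied.
    """
    blocknum = 0
    bytes_read = 0
    if reporthook:
        reporthook(blocknum, block_size, total_size)
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)
        bytes_read += len(block)
        blocknum += 1
        if reporthook:
            reporthook(blocknum, block_size, total_size)
    return bytes_read


def download_with_progress(url, dest_file, reporthook=None, headers=None):
    """Hypothetical helper: download url to dest_file via urlopen, reporting progress."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as response, open(dest_file, "wb") as out:
        length = response.headers.get("Content-Length")
        # -1 means "size unknown", the same convention urlretrieve uses
        total_size = int(length) if length is not None else -1
        return copy_with_progress(response, out, total_size, reporthook)
```

A progress bar callback would then be passed as reporthook, and the User-Agent dictionary from the example above would go in headers.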
This is my first post on Stack Overflow. I'm dipping my feet into Python by writing a program that pulls data from the API of an online game I play :)
I've written the code below to create a .csv file if it doesn't exist, and then use a for loop to call an API twice (each time with a different match ID). The response is JSON, and the idea is that if the file is empty (i.e. newly created), the if statement writes the headers, and if it's not empty (i.e. the headers have already been written), only the values are written.
My code returns a .csv with the headers written twice, so for some reason the file size doesn't change within the for loop even though the headers have been written. Is there something I'm missing here? Much appreciated!
import urllib.request, urllib.parse, urllib.error
import json
import csv
import ssl
import os

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

api_key = 'XXX'
puuid = 'XXX'
matchlist = ['0e8194de-f011-4018-aca2-36b1a749badd','ae207558-599e-480e-ae97-7b59f97ec8d7']

f = csv.writer(open('my_file.csv','w+'))

for matchid in matchlist:
    matchdeturl = 'https://europe.api.riotgames.com/lor/match/v1/matches/'+ matchid +'?api_key=' + api_key
    matchdetuh = urllib.request.urlopen(matchdeturl, context = ctx)
    matchdet = json.loads(matchdetuh.read().decode())
    matchplayers = matchdet['info']

    #if file is blank, write headers, if not write values
    if os.stat('my_file.csv').st_size == 0:
        f.writerow(list(matchplayers))
        f.writerow(matchplayers.values())
    else:
        f.writerow(matchplayers.values())
It's possible that the file buffers instead of writing immediately to disk because IO is an expensive operation. Either flush the file before checking its size, or set a flag in your loop and check that flag instead of checking the file size.
f = csv.writer(open('my_file.csv','w+'))
needs_header = os.stat('my_file.csv').st_size == 0

for matchid in matchlist:
    # do stuff

    #if file needs a header, write headers
    if needs_header:
        f.writerow(list(matchplayers))
        needs_header = False

    # Then, write values
    f.writerow(matchplayers.values())
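For reference, here is a self-contained sketch of the flag approach that runs without the API. The function name write_matches and the sample dictionaries in the usage note are invented for illustration; they stand in for the matchdet['info'] dictionaries from the question. Opening the file in a with block also ensures the buffered rows are flushed when the block exits.

```python
import csv


def write_matches(path, matches):
    """Write one header row, then one row of values per match dictionary."""
    needs_header = True
    # newline="" is what the csv module documentation recommends for csv files
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for info in matches:
            if needs_header:
                # The dictionary keys become the column headers, written once
                writer.writerow(list(info))
                needs_header = False
            writer.writerow(list(info.values()))
```

Calling write_matches('my_file.csv', [{'game_mode': 'Constructed', 'total_turn_count': 20}, {'game_mode': 'Expeditions', 'total_turn_count': 31}]) should produce a file with one header line followed by two value lines.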
I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but yields the login screen, incorrectly. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).
MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
You need to call .load() method then.
Also, as Jermaine Xu noted, the first line of the file needs to contain either the string # Netscape HTTP Cookie File or # HTTP Cookie File. Files generated by the plugin you use do not contain such a string, so you have to insert it yourself. I raised a bug for this at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to load() method all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
EDIT 2
Although we managed to get cookies into the jar above they are subsequently discarded by cookielib because they still have 0 value in the expires attribute. To prevent this we have to set the expire time to some future time like so:
import time

for cookie in cj:
    # set cookie expire date to 14 days from now
    cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl and both use 0 expiry time to denote session cookies which means it's the de facto standard. However Python's implementation uses empty string for the same purpose hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is the alternate way of solving the problem (no need to change expiry time in this case).
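In Python 3 the module is called http.cookiejar rather than cookielib. As a hedged sketch of the loading step described above, the snippet below passes both ignore_discard=True and ignore_expires=True so that cookies written with a 0 expiry are kept either way; this sidesteps the expiry question instead of fixing it. The helper names load_cookies_txt and cookies_as_dict are made up for this example, and the file is assumed to be tab-separated with the Netscape header line in place.

```python
from http.cookiejar import MozillaCookieJar


def load_cookies_txt(path):
    """Load a Netscape-format cookies.txt, keeping session and expired cookies."""
    jar = MozillaCookieJar(path)
    # ignore_discard keeps cookies marked as session cookies;
    # ignore_expires keeps cookies whose expiry time is already in the past,
    # which includes an expiry column of 0.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar


def cookies_as_dict(jar):
    """Flatten a cookie jar into a plain name/value dictionary."""
    return {cookie.name: cookie.value for cookie in jar}
```

Either the jar itself or the dictionary can then be passed as the cookies argument of requests.get. Note that the dictionary form drops domain and path information, so it only suits requests to a single site.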
I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import pprint
import re
import requests

def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open (cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = line.strip().split('\t')
                cookies[lineFields[5]] = lineFields[6]
    return cookies

cookies = parseCookieFile('cookies.txt')
pprint.pprint(cookies)

r = requests.get('https://example.com', cookies=cookies)
This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>
I tried editing Tristan's answer to add some info, but the SO edit queue seems to be full, so I'm writing this answer instead, since I struggled badly with using existing cookies in Python Requests.
First, get the cookies from Chrome. The easiest way is to use an extension called 'cookies.txt':
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the below code to make sure that you are able to parse the file without any issues.
import re, requests, pprint

def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open (cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = re.findall(r'[^\s]+', line) #capturing anything but empty space
                try:
                    cookies[lineFields[5]] = lineFields[6]
                except Exception as e:
                    print (e)
    return cookies

cookies = parseCookieFile('cookies.txt') #replace the filename
pprint.pprint(cookies)
Next, use those cookies with Python Requests:
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save you from trawling through different SO posts and trying cookielib and the other methods, which never worked for me.
I finally found a way to make it work (I got the idea by looking at curl's verbose output): instead of loading my cookies from a file, I simply created a dict with the required value/name pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.
I want to write unit test cases with unittest where the input is a set of images and the output is a set of images with bounding boxes drawn around the text, plus the coordinates of those bounding boxes. How can I write a test script for that using the unittest (PyUnit) framework?
So far I have written a unittest script that checks whether the API is working, that the response is in JSON/list format, and that the image files in the response are in PNG or JPG format; this works now. I still need to cover the test scenarios below, but I don't know how to check them:
7) If optional keys are not passed to the API, it should not throw an error (the opposite for compulsory ones).
8) The implemented route should throw/return an error if a GET request is sent instead of a POST.
9) If a valid path but an invalid file name is provided, you should see the respective error.
10) In case of an invalid path, you should see the respective error message.
11) There is a specified set of keys that must be passed to the APIs; if they are missing, an error is returned.
12) Verify session timeout.
Here is my code:
import requests
import json
import sshtunnel
import unittest


class TestSequentialExecutions(unittest.TestCase):
    def setUp(self) -> None:
        a=10

    def test_API(self):
        self.resp_list = []
        # API url
        url = ['dummy url','dummyurl']
        # Additional headers.
        headers = {'Content-Type': 'application/json'}
        # Body
        payload = [{'input': 'dummy path'},
                   {"path": "dummy"}]
        # Test case-1 checking valid API is routed or not
        # convert dict to json by json.dumps() for body data.
        for i in range(len(url)):
            resp = requests.post(url[i], headers=headers, data=json.dumps(payload[i], indent=4))
            self.assertEqual(resp.status_code, 200)
            self.resp_list.append(resp.json())

    #Test case-2 to check input file is in JPG ,PNG format or not
    def test_fileformat(self):
        n = len(self.resp_list[1])
        my_list = [1]*n
        empty_list=[]
        extensions = ['png','v']
        for filename in self.resp_list[0]:
            if filename.lower().endswith(('.png', '.jpg')):
                empty_list.append(1)
            else:
                empty_list.append(0)
        self.assertEqual(my_list, empty_list)


if __name__ == '__main__':
    unittest.main()
Actually I am trying to write test script for below github code: https://github.com/eragonruan/text-detection-ctpn
My requirement: read the contents of an input type="file" with ID "rtfile1" and write them to a textarea with ID "rt1".
Based on the documentation at https://brython.info/ I tried to read a file, but it fails with this error:
Access to XMLHttpRequest at 'file:///C:/fakepath/requirements.txt' from origin 'http://example.com:8000' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
I tried following two Brython codes, both of them failed with the same aforementioned error.
Code 1:
def file_read(ev):
    doc['rt1'].value = open(doc['rtfile1'].value).read()

doc["rtfile1"].bind("input", file_read)
Code 2:
def file_read(ev):
    def on_complete(req):
        if req.status==200 or req.status==0:
            doc['rt1'].value = req.text
        else:
            doc['rt1'].value = "error "+req.text

    def err_msg():
        doc['rt1'].value = "server didn't reply after %s seconds" %timeout

    timeout = 4

    def go(url):
        req = ajax.ajax()
        req.bind("complete", on_complete)
        req.set_timeout(timeout, err_msg)
        req.open('GET', url, True)
        req.send()

    print('Triggered')
    go(doc['rtfile1'].value)

doc["rtfile1"].bind("input", file_read)
Any help would be greatly appreciated. Thanks!!! :)
It's not related to Brython (you would have the same result with the equivalent Javascript), but to the way you tell the browser which file you want to upload.
If you select the file by an HTML tag such as
<input type="file" id="rtfile1">
the object referenced by doc['rtfile1'] in the Brython code has an attribute value, but it is not the file path or url, it's a "fakepath" built by the browser (as you can see in the error message), and you can't use it as an argument of the Brython function open(), or as a url to send an Ajax request to; if you want to use the file url, you should enter it in a basic input tag (without type="file").
It is better to select the file with type="file", but in this case the object doc['rtfile1'] is a FileList object, described in the DOM's Web API, whose first element is a File object. Reading its content is unfortunately not as simple as with open(), but here is a working example:
from browser import window, document as doc

def file_read(ev):

    def onload(event):
        """Triggered when file is read. The FileReader instance is
        event.target.
        The file content, as text, is the FileReader instance's "result"
        attribute."""
        doc['rt1'].value = event.target.result

    # Get the selected file as a DOM File object
    file = doc['rtfile1'].files[0]
    # Create a new DOM FileReader instance
    reader = window.FileReader.new()
    # Read the file content as text
    reader.readAsText(file)
    reader.bind("load", onload)

doc["rtfile1"].bind("input", file_read)