How do I seek to a particular position on a remote (HTTP) file so I can download only that part?
Lets say the bytes on a remote file were: 1234567890
I wanna seek to 4 and download 3 bytes from there so I would have: 456
and also, how do I check if a remote file exists?
I tried, os.path.isfile() but it returns False when I'm passing a remote file url.
If you are downloading the remote file through HTTP, you need to set the Range header.
Check in this example how it can be done. Looks like this:
myUrlclass.addheader("Range","bytes=%s-" % (existSize))
EDIT: I just found a better implementation. This class is very simple to use, as it can be seen in the docstring.
class HTTPRangeHandler(urllib2.BaseHandler):
"""Handler that enables HTTP Range headers.
This was extremely simple. The Range header is a HTTP feature to
begin with so all this class does is tell urllib2 that the
"206 Partial Content" reponse from the HTTP server is what we
expected.
Example:
import urllib2
import byterange
range_handler = range.HTTPRangeHandler()
opener = urllib2.build_opener(range_handler)
# install it
urllib2.install_opener(opener)
# create Request and set Range header
req = urllib2.Request('http://www.python.org/')
req.header['Range'] = 'bytes=30-50'
f = urllib2.urlopen(req)
"""
def http_error_206(self, req, fp, code, msg, hdrs):
# 206 Partial Content Response
r = urllib.addinfourl(fp, hdrs, req.get_full_url())
r.code = code
r.msg = msg
return r
def http_error_416(self, req, fp, code, msg, hdrs):
# HTTP's Range Not Satisfiable error
raise RangeError('Requested Range Not Satisfiable')
Update: The "better implementation" has moved to github: excid3/urlgrabber in the byterange.py file.
I highly recommend using the requests library. It is easily the best HTTP library I have ever used. In particular, to accomplish what you have described, you would do something like:
import requests
url = "http://www.sffaudio.com/podcasts/ShellGameByPhilipK.Dick.pdf"
# Retrieve bytes between offsets 3 and 5 (inclusive).
r = requests.get(url, headers={"range": "bytes=3-5"})
# If a 4XX client error or a 5XX server error is encountered, we raise it.
r.raise_for_status()
AFAIK, this is not possible using fseek() or similar. You need to use the HTTP Range header to achieve this. This header may or may not be supported by the server, so your mileage may vary.
import urllib2
myHeaders = {'Range':'bytes=0-9'}
req = urllib2.Request('http://www.promotionalpromos.com/mirrors/gnu/gnu/bash/bash-1.14.3-1.14.4.diff.gz',headers=myHeaders)
partialFile = urllib2.urlopen(req)
s2 = (partialFile.read())
EDIT: This is of course assuming that by remote file you mean a file stored on a HTTP server...
If the file you want is on an FTP server, FTP only allows to to specify a start offset and not a range. If this is what you want, then the following code should do it (not tested!)
import ftplib
fileToRetrieve = 'somefile.zip'
fromByte = 15
ftp = ftplib.FTP('ftp.someplace.net')
outFile = open('partialFile', 'wb')
ftp.retrbinary('RETR '+ fileToRetrieve, outFile.write, rest=str(fromByte))
outFile.close()
You can use httpio to access remote HTTP files as if they were local:
pip install httpio
import zipfile
import httpio
url = "http://some/large/file.zip"
with httpio.open(url) as fp:
zf = zipfile.ZipFile(fp)
print(zf.namelist())
Related
I have a problem with getting my test running using Robot Framework and robotframework-requests. I need to send a POST request and a binary data in the body. I looked at this question already, but it's not really answered. Here's how my test case looks like:
Upload ${filename} file
Create Session mysession http://${ADDRESS}
${data} = Get Binary File ${filename}
&{headers} = Create Dictionary Content-Type=application/octet-stream Accept=application/octet-stream
${resp} = Post Request mysession ${CGIPath} data=${data} headers=&{headers}
[Return] ${resp.status_code} ${resp.text}
The problem is that my binary data is about 250MB. When the data is read with Get Binary File I see that memory consumption goes up to 2.x GB. A few seconds later when the Post Request is triggered my test is killed by OOM. I already looked at files parameter, but it seems it uses multipart encoding upload, which is not what I need.
My other thought was about passing open file handler directly to underlying requests library, but I guess that would require robotframework-request modification. Another idea is to fall back to curl for this test only.
Am I missing something in my test? What is the better way to address this?
I proceeded with the idea of robotframework-request modification and added this method
def post_request_binary(
self,
alias,
uri,
path=None,
params=None,
headers=None,
allow_redirects=None,
timeout=None):
session = self._cache.switch(alias)
redir = True if allow_redirects is None else allow_redirects
self._capture_output()
method_name = "post"
method = getattr(session, method_name)
with open(path, 'rb') as f:
resp = method(self._get_url(session, uri),
data=f,
params=self._utf8_urlencode(params),
headers=headers,
allow_redirects=allow_redirects,
timeout=self._get_timeout(timeout),
cookies=self.cookies,
verify=self.verify)
self._print_debug()
# Store the last session object
session.last_resp = resp
self.builtin.log(method_name + ' response: ' + resp.text, 'DEBUG')
return resp
I guess I can improve it a bit and create a pull request.
I have looked and perhaps i missed it. I currently have a file such as the one below:
PUT /URL/TO/SEND/REQUEST
Host: 127.0.0.1
Connection: keep-alive
...
bunch of data here
This file contains the header & the data i want to send over ssl. I know on windows i can use fiddler etc.. to send this raw data BUT i was hoping to use python. I tried looking (may be not hard enough) at urllib2 urllib & httplib to see if i could just send this file as the entire request i don't want to deal with parsing the file etc... Is this possible?
I did notice that in httplib i can use request where "body can be a file object." but from the description seems as though it still sends the header seperately and that file is only for the data being sent.
Thanks
It isn't documented, but it looks like you should be able to use httplib.HTTPConnection.send() for this:
In [13]: httplib.HTTPConnection.send??
Type: instancemethod
String Form:<unbound method HTTPConnection.send>
File: /usr/local/lib/python2.7/httplib.py
Definition: httplib.HTTPConnection.send(self, data)
Source:
def send(self, data):
"""Send `data' to the server."""
if self.sock is None:
if self.auto_open:
self.connect()
else:
raise NotConnected()
if self.debuglevel > 0:
print "send:", repr(data)
blocksize = 8192
if hasattr(data,'read') and not isinstance(data, array):
if self.debuglevel > 0: print "sendIng a read()able"
datablock = data.read(blocksize)
while datablock:
self.sock.sendall(datablock)
datablock = data.read(blocksize)
else:
self.sock.sendall(data)
The request() method combines the header and body and passes it to this function, which looks like it should handle strings or file objects.
Of course you will still need to know the host so that you can create the HTTPConnection object, so your code might look something like this (untested):
import httplib
conn = httplib.HTTPConnection('127.0.0.1')
conn.send(open(filename))
response = conn.getresponse()
edit: It turns out there is some internal state stuff that keeps this from working as is, here is a workaround (full example with google main page), but it is a bit of a hack. Tested using Python 2.6 and 2.7, does not appear to work on 3.x by just replacing httplib with http.client:
import httplib
conn = httplib.HTTPConnection('www.google.com')
conn.send('GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n')
conn._HTTPConnection__state = httplib._CS_REQ_SENT
response = conn.getresponse()
The key part here is setting conn.__state (mangled name) to the httplib._CS_REQ_SENT after calling send().
I am using Python's requests library in one method of my application. The body of the method looks like this:
def handle_remote_file(url, **kwargs):
response = requests.get(url, ...)
buff = StringIO.StringIO()
buff.write(response.content)
...
return True
I'd like to write some unit tests for that method, however, what I want to do is to pass a fake local url such as:
class RemoteTest(TestCase):
def setUp(self):
self.url = 'file:///tmp/dummy.txt'
def test_handle_remote_file(self):
self.assertTrue(handle_remote_file(self.url))
When I call requests.get with a local url, I got the KeyError exception below:
requests.get('file:///tmp/dummy.txt')
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.pyc in connection_from_host(self, host, port, scheme)
76
77 # Make a fresh ConnectionPool of the desired type
78 pool_cls = pool_classes_by_scheme[scheme]
79 pool = pool_cls(host, port, **self.connection_pool_kw)
80
KeyError: 'file'
The question is how can I pass a local url to requests.get?
PS: I made up the above example. It possibly contains many errors.
As #WooParadog explained requests library doesn't know how to handle local files. Although, current version allows to define transport adapters.
Therefore you can simply define you own adapter which will be able to handle local files, e.g.:
from requests_testadapter import Resp
import os
class LocalFileAdapter(requests.adapters.HTTPAdapter):
def build_response_from_file(self, request):
file_path = request.url[7:]
with open(file_path, 'rb') as file:
buff = bytearray(os.path.getsize(file_path))
file.readinto(buff)
resp = Resp(buff)
r = self.build_response(request, resp)
return r
def send(self, request, stream=False, timeout=None,
verify=True, cert=None, proxies=None):
return self.build_response_from_file(request)
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
requests_session.get('file://<some_local_path>')
I'm using requests-testadapter module in the above example.
Here's a transport adapter I wrote which is more featureful than b1r3k's and has no additional dependencies beyond Requests itself. I haven't tested it exhaustively yet, but what I have tried seems to be bug-free.
import requests
import os, sys
if sys.version_info.major < 3:
from urllib import url2pathname
else:
from urllib.request import url2pathname
class LocalFileAdapter(requests.adapters.BaseAdapter):
"""Protocol Adapter to allow Requests to GET file:// URLs
#todo: Properly handle non-empty hostname portions.
"""
#staticmethod
def _chkpath(method, path):
"""Return an HTTP status for the given filesystem path."""
if method.lower() in ('put', 'delete'):
return 501, "Not Implemented" # TODO
elif method.lower() not in ('get', 'head'):
return 405, "Method Not Allowed"
elif os.path.isdir(path):
return 400, "Path Not A File"
elif not os.path.isfile(path):
return 404, "File Not Found"
elif not os.access(path, os.R_OK):
return 403, "Access Denied"
else:
return 200, "OK"
def send(self, req, **kwargs): # pylint: disable=unused-argument
"""Return the file specified by the given request
#type req: C{PreparedRequest}
#todo: Should I bother filling `response.headers` and processing
If-Modified-Since and friends using `os.stat`?
"""
path = os.path.normcase(os.path.normpath(url2pathname(req.path_url)))
response = requests.Response()
response.status_code, response.reason = self._chkpath(req.method, path)
if response.status_code == 200 and req.method.lower() != 'head':
try:
response.raw = open(path, 'rb')
except (OSError, IOError) as err:
response.status_code = 500
response.reason = str(err)
if isinstance(req.url, bytes):
response.url = req.url.decode('utf-8')
else:
response.url = req.url
response.request = req
response.connection = self
return response
def close(self):
pass
(Despite the name, it was completely written before I thought to check Google, so it has nothing to do with b1r3k's.) As with the other answer, follow this with:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
r = requests_session.get('file:///path/to/your/file')
The easiest way seems using requests-file.
https://github.com/dashea/requests-file (available through PyPI too)
"Requests-File is a transport adapter for use with the Requests Python library to allow local filesystem access via file:// URLs."
This in combination with requests-html is pure magic :)
packages/urllib3/poolmanager.py pretty much explains it. Requests doesn't support local url.
pool_classes_by_scheme = {
'http': HTTPConnectionPool,
'https': HTTPSConnectionPool,
}
In a recent project, I've had the same issue. Since requests doesn't support the "file" scheme, I'll patch our code to load the content locally. First, I define a function to replace requests.get:
def local_get(self, url):
"Fetch a stream from local files."
p_url = six.moves.urllib.parse.urlparse(url)
if p_url.scheme != 'file':
raise ValueError("Expected file scheme")
filename = six.moves.urllib.request.url2pathname(p_url.path)
return open(filename, 'rb')
Then, somewhere in test setup or decorating the test function, I use mock.patch to patch the get function on requests:
#mock.patch('requests.get', local_get)
def test_handle_remote_file(self):
...
This technique is somewhat brittle -- it doesn't help if the underlying code calls requests.request or constructs a Session and calls that. There may be a way to patch requests at a lower level to support file: URLs, but in my initial investigation, there didn't seem to be an obvious hook point, so I went with this simpler approach.
To load a file from a local URL, e.g. an image file you can do this:
import urllib
from PIL import Image
Image.open(urllib.request.urlopen('file:///path/to/your/file.png'))
I think simple solution for this will be creating temporary http server using python and using it.
Put all your files in temporary folder eg. tempFolder
Go to that directory and create a temporary http server in terminal/cmd as per your OS using command python -m http.server 8000 (Note 8000 is port no.)
This will you give you a link to http server. You can access it from http://127.0.0.1:8000/
Open your desired file in browser and copy the link to your url.
I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshortening youtube links. Since unshort.me is used readily, this returns almost 90% of the results with captchas which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
resolvedURL = urllib2.urlopen(url)
print resolvedURL.url
#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
#c.setopt(c.WRITEFUNCTION, t.body_callback)
#c.perform()
#c.close()
#dom = xml.dom.minidom.parseString(t.contents)
#resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
one line functions, using requests library and yes, it supports recursion.
def unshorten_url(url):
return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse
def unshorten_url(url):
parsed = urlparse.urlparse(url)
h = httplib.HTTPConnection(parsed.netloc)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
h.request('HEAD', resource )
response = h.getresponse()
if response.status/100 == 3 and response.getheader('Location'):
return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
else:
return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and doing a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get. Beside tweaking your kernel's TCP/IP stack.
Here a src code that takes into account almost of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The src code is on github # https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10
class UnShortenUrl:
def process(self, url, previous_url=None):
logging.info('Init url: %s'%url)
import urlparse
import httplib
try:
parsed = urlparse.urlparse(url)
if parsed.scheme == 'https':
h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
else:
h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
resource = parsed.path
if parsed.query != "":
resource += "?" + parsed.query
try:
h.request('HEAD',
resource,
headers={'User-Agent': 'curl/7.38.0'}
)
response = h.getresponse()
except:
import traceback
traceback.print_exec()
return url
logging.info('Response status: %d'%response.status)
if response.status/100 == 3 and response.getheader('Location'):
red_url = response.getheader('Location')
logging.info('Red, previous: %s, %s'%(red_url, previous_url))
if red_url == previous_url:
return red_url
return self.process(red_url, previous_url=url)
else:
return url
except:
import traceback
traceback.print_exc()
return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
I'm back. :) Again trying to get the gzipped contents of a URL and gunzip them. This time in Python. The #SERVER section of code is the script I'm using to generate the gzipped data. The data is known good, as it works with Java. The #CLIENT section of code is the bit of code I'm using client-side to try and read that data (for eventual JSON parsing). However, somewhere in this transfer, the gzip module forgets how to read the data it created.
#SERVER
outbuf = StringIO.StringIO()
outfile = gzip.GzipFile(fileobj = outbuf, mode = 'wb')
outfile.write(data)
outfile.close()
print "Content-Encoding: gzip\n"
print outbuf.getvalue()
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', '*')
urlConn = urllib2.build_opener().open(urlReq)
urlConnObj = StringIO.StringIO(urlConn.read())
gzin = gzip.GzipFile(fileobj = urlConnObj)
return gzin.read() #IOError: Not a gzipped file.
Other Notes:
outbuf.getvalue() is the same as urlConnObj.getvalue() is the same as urlConn.read()
This StackOverflow question seemed to help me out.
Apparently, it was just wise to by-pass the gzip module entirely, opting for zlib instead. Also, changing "*" to "gzip" in the "Accept-Encoding" header may've helped.
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', 'gzip')
urlConn = urllib2.urlopen(urlReq)
return zlib.decompress(urlConn.read(), 16+zlib.MAX_WBITS)