Streaming audio with Pygame [duplicate] - python

I was trying to craft a response to a question about streaming audio from a HTTP server, then play it with PyGame. I had the code mostly complete, but hit an error where the PyGame music functions tried to seek() on the urllib.HTTPResponse object.
According to the urlib docs, the urllib.HTTPResponse object (since v3.5) is an io.BufferedIOBase. I expected this would make the stream seek()able, however it does not.
Is there a way to wrap the io.BufferedIOBase such that it is smart enough to buffer enough data to handle the seek operation?
import pygame
import urllib.request
import io
# Window size
WINDOW_WIDTH = 400
WINDOW_HEIGHT = 400
# background colour
SKY_BLUE = (161, 255, 254)
### Begin the streaming of a file
### Return the urlib.HTTPResponse, a file-like-object
def openURL( url ):
result = None
try:
http_response = urllib.request.urlopen( url )
print( "streamHTTP() - Fetching URL [%s]" % ( http_response.geturl() ) )
print( "streamHTTP() - Response Status [%d] / [%s]" % ( http_response.status, http_response.reason ) )
result = http_response
except:
print( "streamHTTP() - Error Fetching URL [%s]" % ( url ) )
return result
### MAIN
pygame.init()
window = pygame.display.set_mode( ( WINDOW_WIDTH, WINDOW_HEIGHT ) )
pygame.display.set_caption("Music Streamer")
clock = pygame.time.Clock()
done = False
while not done:
# Handle user-input
for event in pygame.event.get():
if ( event.type == pygame.QUIT ):
done = True
# Keys
keys = pygame.key.get_pressed()
if ( keys[pygame.K_UP] ):
if ( pygame.mixer.music.get_busy() ):
print("busy")
else:
print("play")
remote_music = openURL( 'http://127.0.0.1/example.wav' )
if ( remote_music != None and remote_music.status == 200 ):
pygame.mixer.music.load( io.BufferedReader( remote_music ) )
pygame.mixer.music.play()
# Re-draw the screen
window.fill( SKY_BLUE )
# Update the window, but not more than 60fps
pygame.display.flip()
clock.tick_busy_loop( 60 )
pygame.quit()
When this code runs, and Up is pushed, it fails with the error:
streamHTTP() - Fetching URL [http://127.0.0.1/example.wav]
streamHTTP() - Response Status [200] / [OK]
io.UnsupportedOperation: seek
io.UnsupportedOperation: File or stream is not seekable.
io.UnsupportedOperation: seek
io.UnsupportedOperation: File or stream is not seekable.
Traceback (most recent call last):
File "./sound_stream.py", line 57, in <module>
pygame.mixer.music.load( io.BufferedReader( remote_music ) )
pygame.error: Unknown WAVE format
I also tried re-opening the the io stream, and various other re-implementations of the same sort of thing.

Seeking seeking
According to the urlib docs, the urllib.HTTPResponse object (since v3.5) is an io.BufferedIOBase. I expected this would make the stream seek()able, however it does not.
That's correct. The io.BufferedIOBase interface doesn't guarantee the I/O object is seekable. For HTTPResponse objects, IOBase.seekable() returns False:
>>> import urllib.request
>>> response = urllib.request.urlopen("http://httpbin.org/get")
>>> response
<http.client.HTTPResponse object at 0x110870ca0>
>>> response.seekable()
False
That's because the BufferedIOBase implementation offered by HTTPResponse is wrapping a socket object, and sockets are not seekable either.
You can't wrap an BufferedIOBase object in a BufferedReader object and add seeking support. The Buffered* wrapper objects can only wrap RawIOBase types, and they rely on the wrapped object to provide seeking support. You would have to emulate seeking at raw I/O level, see below.
You can still provide the same functionality at a higher level, but take into account that seeking on remote data is a lot more involved; this isn't a simple change a simple OS variable that represents a file position on disk operation. For larger remote file data, seeking without backing the whole file on disk locally could be as sophisticated as using HTTP range requests and local (in memory or on-disk) buffers to balance sound play-back performance and minimising local data storage. Doing this correctly for a wide range of use-cases can be a lot of effort, so is certainly not part of the Python standard library.
If your sound files are small
If your HTTP-sourced sound files are small enough (a few MB at most) then just read the whole response into an in-memory io.BytesIO() file object. I really do not think it is worth making this more complicated than that, because the moment you have enough data to make that worth pursuing your files are large enough to take up too much memory!
So this would be more than enough if your sound files are smaller (no more than a few MB):
from io import BytesIO
import urllib.error
import urllib.request
def open_url(url):
try:
http_response = urllib.request.urlopen(url)
print(f"streamHTTP() - Fetching URL [{http_response.geturl()}]")
print(f"streamHTTP() - Response Status [{http_response.status}] / [{http_response.reason}]")
except urllib.error.URLError:
print("streamHTTP() - Error Fetching URL [{url}]")
return
if http_response.status != 200:
print("streamHTTP() - Error Fetching URL [{url}]")
return
return BytesIO(http_response.read())
This doesn't require writing a wrapper object, and because BytesIO is a native implementation, once the data is fully copied over, access to the data is faster than any Python-code wrapper could ever give you.
Note that this returns a BytesIO file object, so you no longer need to test for the response status:
remote_music = open_url('http://127.0.0.1/example.wav')
if remote_music is not None:
pygame.mixer.music.load(remote_music)
pygame.mixer.music.play()
If they are more than a few MB
Once you go beyond a few megabytes, you could try pre-loading the data into a local file object. You can make this more sophisticated by using a thread to have shutil.copyfileobj() copy most of the data into that file in the background and give the file to PyGame after loading just an initial amount of data.
By using an actual file object, you can actually help performance here, as PyGame will try to minimize interjecting itself between the SDL mixer and the file data. If there is an actual file on disk with a file number (the OS-level identifier for a stream, something that the SDL mixer library can make use of), then PyGame will operate directly on that and so minimize blocking the GIL (which in turn will help the Python portions of your game perform better!). And if you pass in a filename (just a string), then PyGame gets out of the way entirely and leaves all file operations over to the SDL library.
Here's such an implementation; this should, on normal Python interpreter exit, clean up the downloaded files automatically. It returns a filename for PyGame to work on, and finalizing downloading the data is done in a thread after the initial few KB has been buffered. It will avoid loading the same URL more than once, and I've made it thread-safe:
import shutil
import urllib.error
import urllib.request
from tempfile import NamedTemporaryFile
from threading import Lock, Thread
INITIAL_BUFFER = 1024 * 8 # 8kb initial file read to start URL-backed files
_url_files_lock = Lock()
# stores open NamedTemporaryFile objects, keeping them 'alive'
# removing entries from here causes the file data to be deleted.
_url_files = {}
def open_url(url):
with _url_files_lock:
if url in _url_files:
return _url_files[url].name
try:
http_response = urllib.request.urlopen(url)
print(f"streamHTTP() - Fetching URL [{http_response.geturl()}]")
print(f"streamHTTP() - Response Status [{http_response.status}] / [{http_response.reason}]")
except urllib.error.URLError:
print("streamHTTP() - Error Fetching URL [{url}]")
return
if http_response.status != 200:
print("streamHTTP() - Error Fetching URL [{url}]")
return
fileobj = NamedTemporaryFile()
content_length = http_response.getheader("Content-Length")
if content_length is not None:
try:
content_length = int(content_length)
except ValueError:
content_length = None
if content_length:
# create sparse file of full length
fileobj.seek(content_length - 1)
fileobj.write(b"\0")
fileobj.seek(0)
fileobj.write(http_response.read(INITIAL_BUFFER))
with _url_files_lock:
if url in _url_files:
# another thread raced us to this point, we lost, return their
# result after cleaning up here
fileobj.close()
http_response.close()
return _url_files[url].name
# store the file object for this URL; this keeps the file
# open and so readable if you have the filename.
_url_files[url] = fileobj
def copy_response_remainder():
# copies file data from response to disk, for all data past INITIAL_BUFFER
with http_response:
shutil.copyfileobj(http_response, fileobj)
t = Thread(daemon=True, target=copy_response_remainder)
t.start()
return fileobj.name
Like the BytesIO() solution, the above returns either None or a value ready for passing to pass to pygame.mixer.music.load().
The above will probably not work if you try to immediately set an advanced playing position in your sound files, as later data may not yet have been copied into the file. It's a trade-off.
Seeking and finding third party libraries
If you need to have full seeking support on remote URLs and don't want to use on-disk space for them and don't want to have to worry about their size, you don't need to re-invent the HTTP-as-seekable-file wheel here. You could use an existing project that offers the same functionality. I found two that offer io.BufferedIOBase-based implementations:
smart_open
httpio
Both use HTTP Range requests to implement seeking support. Just use httpio.open(URL) or smart_open.open(URL) and pass that directly to pygame.mixer.music.load(); if the URL can't be opened, you can catch that by handling the IOError exception:
from smart_open import open as url_open # or from httpio import open
try:
remote_music = url_open('http://127.0.0.1/example.wav')
except IOError:
pass
else:
pygame.mixer.music.load(remote_music)
pygame.mixer.music.play()
smart_open uses an in-memory buffer to satisfy reads of a fixed size, but creates a new HTTP Range request for every call to seek that changes the current file position, so performance may vary. Since the SDL mixer executes a few seeks on audio files to determine their type, I expect this to be a little slower.
httpio can buffer blocks of data and so might handle seeks better, but from a brief glance at the source code, when actually setting a buffer size the cached blocks are never evicted from memory again so you'd end up with the whole file in memory, eventually.
Implementing seeking ourselves, via io.RawIOBase
And finally, because I'm not able to find efficient HTTP-Range-backed I/O implementations, I wrote my own. The following implements the io.RawIOBase interface, specifically so you can then wrap the object in a io.BufferedIOReader() and so delegate caching to a caching buffer that will be managed correctly when seeking:
import io
from copy import deepcopy
from functools import wraps
from typing import cast, overload, Callable, Optional, Tuple, TypeVar, Union
from urllib.request import urlopen, Request
T = TypeVar("T")
#overload
def _check_closed(_f: T) -> T: ...
#overload
def _check_closed(*, connect: bool, default: Union[bytes, int]) -> Callable[[T], T]: ...
def _check_closed(
_f: Optional[T] = None,
*,
connect: bool = False,
default: Optional[Union[bytes, int]] = None,
) -> Union[T, Callable[[T], T]]:
def decorator(f: T) -> T:
#wraps(cast(Callable, f))
def wrapper(self, *args, **kwargs):
if self.closed:
raise ValueError("I/O operation on closed file.")
if connect and self._fp is None or self._fp.closed:
self._connect()
if self._fp is None:
# outside the seekable range, exit early
return default
try:
return f(self, *args, **kwargs)
except Exception:
self.close()
raise
finally:
if self._range_end and self._pos >= self._range_end:
self._fp.close()
del self._fp
return cast(T, wrapper)
if _f is not None:
return decorator(_f)
return decorator
def _parse_content_range(
content_range: str
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
"""Parse a Content-Range header into a (start, end, length) tuple"""
units, *range_spec = content_range.split(None, 1)
if units != "bytes" or not range_spec:
return (None, None, None)
start_end, _, size = range_spec[0].partition("/")
try:
length: Optional[int] = int(size)
except ValueError:
length = None
start_val, has_start_end, end_val = start_end.partition("-")
start = end = None
if has_start_end:
try:
start, end = int(start_val), int(end_val)
except ValueError:
pass
return (start, end, length)
class HTTPRawIO(io.RawIOBase):
"""Wrap a HTTP socket to handle seeking via HTTP Range"""
url: str
closed: bool = False
_pos: int = 0
_size: Optional[int] = None
_range_end: Optional[int] = None
_fp: Optional[io.RawIOBase] = None
def __init__(self, url_or_request: Union[Request, str]) -> None:
if isinstance(url_or_request, str):
self._request = Request(url_or_request)
else:
# copy request objects to avoid sharing state
self._request = deepcopy(url_or_request)
self.url = self._request.full_url
self._connect(initial=True)
def readable(self) -> bool:
return True
def seekable(self) -> bool:
return True
def close(self) -> None:
if self.closed:
return
if self._fp:
self._fp.close()
del self._fp
self.closed = True
#_check_closed
def tell(self) -> int:
return self._pos
def _connect(self, initial: bool = False) -> None:
if self._fp is not None:
self._fp.close()
if self._size is not None and self._pos >= self._size:
# can't read past the end
return
request = self._request
request.add_unredirected_header("Range", f"bytes={self._pos}-")
response = urlopen(request)
self.url = response.geturl() # could have been redirected
if response.status not in (200, 206):
raise OSError(
f"Failed to open {self.url}: "
f"{response.status} ({response.reason})"
)
if initial:
# verify that the server supports range requests. Capture the
# content length if available
if response.getheader("Accept-Ranges") != "bytes":
raise OSError(
f"Resource doesn't support range requests: {self.url}"
)
try:
length = int(response.getheader("Content-Length", ""))
if length >= 0:
self._size = length
except ValueError:
pass
# validate the range we are being served
start, end, length = _parse_content_range(
response.getheader("Content-Range", "")
)
if self._size is None:
self._size = length
if (start is not None and start != self._pos) or (
length is not None and length != self._size
):
# non-sensical range response
raise OSError(
f"Resource at {self.url} served invalid range: pos is "
f"{self._pos}, range {start}-{end}/{length}"
)
if self._size and end is not None and end + 1 < self._size:
# incomplete range, not reaching all the way to the end
self._range_end = end
else:
self._range_end = None
fp = cast(io.BufferedIOBase, response.fp) # typeshed doesn't name fp
self._fp = fp.detach() # assume responsibility for the raw socket IO
#_check_closed
def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
relative_to = {
io.SEEK_SET: 0,
io.SEEK_CUR: self._pos,
io.SEEK_END: self._size,
}.get(whence)
if relative_to is None:
if whence == io.SEEK_END:
raise IOError(
f"Can't seek from end on unsized resource {self.url}"
)
raise ValueError(f"whence value {whence} unsupported")
if -offset > relative_to: # can't seek to a point before the start
raise OSError(22, "Invalid argument")
self._pos = relative_to + offset
# there is no point in optimising an existing connection
# by reading from it if seeking forward below some threshold.
# Use a BufferedIOReader to avoid seeking by small amounts or by 0
if self._fp:
self._fp.close()
del self._fp
return self._pos
# all read* methods delegate to the SocketIO object (itself a RawIO
# implementation).
#_check_closed(connect=True, default=b"")
def read(self, size: int = -1) -> Optional[bytes]:
assert self._fp is not None # show type checkers we already checked
res = self._fp.read(size)
if res is not None:
self._pos += len(res)
return res
#_check_closed(connect=True, default=b"")
def readall(self) -> bytes:
assert self._fp is not None # show type checkers we already checked
res = self._fp.readall()
self._pos += len(res)
return res
#_check_closed(connect=True, default=0)
def readinto(self, buffer: bytearray) -> Optional[int]:
assert self._fp is not None # show type checkers we already checked
n = self._fp.readinto(buffer)
self._pos += n or 0
return n
Remember that this is a RawIOBase object, which you really want to wrap in a BufferReader(). Doing so in open_url() looks like this:
def open_url(url, *args, **kwargs):
return io.BufferedReader(HTTPRawIO(url), *args, **kwargs)
This gives you fully buffered I/O, with full support seeking, over a remote URL, and the BufferedReader implementation will minimise resetting the HTTP connection when seeking. I've found that using this with the PyGame mixer, only single HTTP connection is made, as all the test seeks are within the default 8KB buffer.

If your fine with using the requests module (which supports streaming) instead of urllib, you could use a wrapper like this:
class ResponseStream(object):
def __init__(self, request_iterator):
self._bytes = BytesIO()
self._iterator = request_iterator
def _load_all(self):
self._bytes.seek(0, SEEK_END)
for chunk in self._iterator:
self._bytes.write(chunk)
def _load_until(self, goal_position):
current_position = self._bytes.seek(0, SEEK_END)
while current_position < goal_position:
try:
current_position = self._bytes.write(next(self._iterator))
except StopIteration:
break
def tell(self):
return self._bytes.tell()
def read(self, size=None):
left_off_at = self._bytes.tell()
if size is None:
self._load_all()
else:
goal_position = left_off_at + size
self._load_until(goal_position)
self._bytes.seek(left_off_at)
return self._bytes.read(size)
def seek(self, position, whence=SEEK_SET):
if whence == SEEK_END:
self._load_all()
else:
self._bytes.seek(position, whence)
Then I guess you can do something like this:
WINDOW_WIDTH = 400
WINDOW_HEIGHT = 400
SKY_BLUE = (161, 255, 254)
URL = 'http://localhost:8000/example.wav'
pygame.init()
window = pygame.display.set_mode( ( WINDOW_WIDTH, WINDOW_HEIGHT ) )
pygame.display.set_caption("Music Streamer")
clock = pygame.time.Clock()
done = False
font = pygame.font.SysFont(None, 32)
state = 0
def play_music():
response = requests.get(URL, stream=True)
if (response.status_code == 200):
stream = ResponseStream(response.iter_content(64))
pygame.mixer.music.load(stream)
pygame.mixer.music.play()
else:
state = 0
while not done:
for event in pygame.event.get():
if ( event.type == pygame.QUIT ):
done = True
if event.type == pygame.KEYDOWN and state == 0:
Thread(target=play_music).start()
state = 1
window.fill( SKY_BLUE )
window.blit(font.render(str(pygame.time.get_ticks()), True, (0,0,0)), (32, 32))
pygame.display.flip()
clock.tick_busy_loop( 60 )
pygame.quit()
using a Thread to start streaming.
I'm not sure this works 100%, but give it a try.

Related

Images are seemingly randomly truncated when sent over sockets in python using PIL and BytesIO

I want to send PIL images over a socket in python. I used a method suggested on stack overflow:
def exchange_image(self, img_send):
fd = io.BytesIO()
img_send.save(fd, "png")
img_recv = self.exchange(fd.getvalue(), MAX_IMG_SIZE) # max size is 10000000
return Image.open(io.BytesIO(img_recv))
def exchange(self, to_send, size):
if self.role == "client":
self.conn.sendall(to_send)
return self.conn.recv(size)
elif self.role == "server":
received = self.conn.recv(size)
self.conn.sendall(to_send)
return received
The problem with this method is that sometimes it throws a strange error which should not happen in my opinion.
struct.error: unpack_from requires a buffer of at least 4 bytes for unpacking 4 bytes at offset 0 (actual buffer size is 0)
which causes
OSError("image file is truncated")
The error happens when I use the returned image and make a PhotoImage out of it.
self.image_gui = ImageTk.PhotoImage(self.image)
I am unable to wrap my head around this as the buffer apparently disappears at random (one can call the function 5 times with perfect results and then suddenly it does not work anymore).
This is the first time I do "networking" and it most probably has something to do with the way I am misunderstanding the inner workings of BytesIO, send and recv. Help would be greatly appreciated.
Now it works.
def exchange_image(self, img_send):
fd = io.BytesIO()
img_send.save(fd, "png")
img_recv = self.exchange(fd.getvalue())
fd.close()
return Image.open(io.BytesIO(img_recv))
def exchange(self, to_send):
size_send = str(len(to_send))
size_send = "0" * (MAX_INIT_SIZE - len(size_send)) + size_send
if self.role == "client":
self.conn.sendall(size_send.encode())
size_recv = int(self.recvall(MAX_INIT_SIZE).decode())
self.conn.sendall(to_send)
return self.recvall(size_recv)
elif self.role == "server":
size_recv = int(self.recvall(MAX_INIT_SIZE).decode())
self.conn.sendall(size_send.encode())
received = self.recvall(size_recv)
self.conn.sendall(to_send)
return received
def recvall(self, size):
msg = bytes()
while len(msg) < size:
msg += self.conn.recv(size)
return msg

Monitering time spent on computer w/ Python and xorg-server

I am trying to make a script which will help me track how long I spend on what on my computer. This script should track when I start, stop, and how long I spend on each "task". After some searching I have found a terminal utility called xdotool which will return the current focused window and it's title when ran like so: xdotool getwindowfocus getwindowna me. For example. when focusing on this window it returns:
linux - Monitering time spent on computer w/ Python and xorg-server - Stack Overflow — Firefox Developer Edition
which is exactly what I want. My first idea was to detect when the focused window is changed and then get the time that this happens at, however I could not find any results, so I have resorted to a while loop that runs this command every 5 seconds, but that is quite hack-y and has proven troublesome, I would highly prefer the on-focus-change method, but here is my code as of now:
#!/usr/bin/env python3
from subprocess import run
from time import time, sleep
log = []
prevwindow = ""
while True:
currentwindow = run(['xdotool', 'getwindowfocus', 'getwindowname'],
capture_output=True, text=True).stdout
if currentwindow != prevwindow:
for entry in log:
if currentwindow in entry:
pass # Calculate time spent
print(f"{time()}:\t{currentwindow}")
log.append((time(), currentwindow))
prevwindow = currentwindow
sleep(5)
I am on Arch linux with dwm should that matter
See this gist. Just put your logging mechanism inside the handle_change function and it should work, as tested on an Arch Linux - dwm system.
For archival purposes, here I include the code. All credit goes to Stephan Sokolow (sskolow) on GitHub.
from contextlib import contextmanager
from typing import Any, Dict, Optional, Tuple, Union # noqa
from Xlib import X
from Xlib.display import Display
from Xlib.error import XError
from Xlib.xobject.drawable import Window
from Xlib.protocol.rq import Event
# Connect to the X server and get the root window
disp = Display()
root = disp.screen().root
# Prepare the property names we use so they can be fed into X11 APIs
NET_ACTIVE_WINDOW = disp.intern_atom('_NET_ACTIVE_WINDOW')
NET_WM_NAME = disp.intern_atom('_NET_WM_NAME') # UTF-8
WM_NAME = disp.intern_atom('WM_NAME') # Legacy encoding
last_seen = {'xid': None, 'title': None} # type: Dict[str, Any]
#contextmanager
def window_obj(win_id: Optional[int]) -> Window:
"""Simplify dealing with BadWindow (make it either valid or None)"""
window_obj = None
if win_id:
try:
window_obj = disp.create_resource_object('window', win_id)
except XError:
pass
yield window_obj
def get_active_window() -> Tuple[Optional[int], bool]:
"""Return a (window_obj, focus_has_changed) tuple for the active window."""
response = root.get_full_property(NET_ACTIVE_WINDOW,
X.AnyPropertyType)
if not response:
return None, False
win_id = response.value[0]
focus_changed = (win_id != last_seen['xid'])
if focus_changed:
with window_obj(last_seen['xid']) as old_win:
if old_win:
old_win.change_attributes(event_mask=X.NoEventMask)
last_seen['xid'] = win_id
with window_obj(win_id) as new_win:
if new_win:
new_win.change_attributes(event_mask=X.PropertyChangeMask)
return win_id, focus_changed
def _get_window_name_inner(win_obj: Window) -> str:
"""Simplify dealing with _NET_WM_NAME (UTF-8) vs. WM_NAME (legacy)"""
for atom in (NET_WM_NAME, WM_NAME):
try:
window_name = win_obj.get_full_property(atom, 0)
except UnicodeDecodeError: # Apparently a Debian distro package bug
title = "<could not decode characters>"
else:
if window_name:
win_name = window_name.value # type: Union[str, bytes]
if isinstance(win_name, bytes):
# Apparently COMPOUND_TEXT is so arcane that this is how
# tools like xprop deal with receiving it these days
win_name = win_name.decode('latin1', 'replace')
return win_name
else:
title = "<unnamed window>"
return "{} (XID: {})".format(title, win_obj.id)
def get_window_name(win_id: Optional[int]) -> Tuple[Optional[str], bool]:
"""Look up the window name for a given X11 window ID"""
if not win_id:
last_seen['title'] = None
return last_seen['title'], True
title_changed = False
with window_obj(win_id) as wobj:
if wobj:
try:
win_title = _get_window_name_inner(wobj)
except XError:
pass
else:
title_changed = (win_title != last_seen['title'])
last_seen['title'] = win_title
return last_seen['title'], title_changed
def handle_xevent(event: Event):
"""Handler for X events which ignores anything but focus/title change"""
if event.type != X.PropertyNotify:
return
changed = False
if event.atom == NET_ACTIVE_WINDOW:
if get_active_window()[1]:
get_window_name(last_seen['xid']) # Rely on the side-effects
changed = True
elif event.atom in (NET_WM_NAME, WM_NAME):
changed = changed or get_window_name(last_seen['xid'])[1]
if changed:
handle_change(last_seen)
def handle_change(new_state: dict):
"""Replace this with whatever you want to actually do"""
print(new_state)
if __name__ == '__main__':
# Listen for _NET_ACTIVE_WINDOW changes
root.change_attributes(event_mask=X.PropertyChangeMask)
# Prime last_seen with whatever window was active when we started this
get_window_name(get_active_window()[0])
handle_change(last_seen)
while True: # next_event() sleeps until we get an event
handle_xevent(disp.next_event())

Python tornado async request handler

I have written an async file upload RequestHandler. It is correct byte-wise, that is the files I receive are identical to the ones being sent. One issue that I am having trouble figuring out is upload delay. Specifically when I issue the post request to upload the file while testing locally I see the browser showing upload progress get stuck. For files close to 4MB in size it gets stuck on 50%+ for a little while then some time passes and it sends all of the data, and gets stuck on "waiting for localhost..." The whole process may last 3+ minutes.
The kicker is when I add print statements that end with a new line to data_received method the delays disappear. Does the print statement trigger the network buffers to be flushed somehow?
Here is the implementation of data_received, along with the helper methods:
#tornado.gen.coroutine
def _read_data(self, cont_buf):
'''
Read the file data.
#param cont_buf - buffered HTTP request
#param boolean indicating whether data is still being read and new
buffer
'''
# Check only last characters of the buffer guaranteed to be large
# enough to contain the boundary
end_of_data_idx = cont_buf.find(self._boundary)
if end_of_data_idx >= 0:
data = cont_buf[:(end_of_data_idx - self.LSEP)]
self.receive_data(self.header_list[-1], data)
new_buffer = cont_buf[(end_of_data_idx + len(self._boundary)):]
return False, new_buffer
else:
self.receive_data(self.header_list[-1], cont_buf)
return True, b""
#tornado.gen.coroutine
def _parse_params(self, param_buf):
'''
Parse HTTP header parameters.
#param param_buf - string buffer containing the parameters.
#returns parameters dictionary
'''
params = dict()
param_res = self.PAT_HEADERPARAMS.findall(param_buf)
if param_res:
for name, value in param_res:
params[name] = value
elif param_buf:
params['value'] = param_buf
return params
#tornado.gen.coroutine
def _parse_header(self, header_buf):
'''
Parses a buffer containing an individual header with parameters.
#param header_buf - header buffer containing a single header
#returns header dictionary
'''
res = self.PAT_HEADERVALUE.match(header_buf)
header = dict()
if res:
name, value, tail = res.groups()
header = {'name': name, 'value': value,
'params': (yield self._parse_params(tail))}
elif header_buf:
header = {"value": header_buf}
return header
#tornado.gen.coroutine
def data_received(self, chunk):
'''
Processes a chunk of content body.
#param chunk - a piece of content body.
'''
self._count += len(chunk)
self._buffer += chunk
# Has boundary been established?
if not self._boundary:
self._boundary, self._buffer =\
(yield self._extract_boundary(self._buffer))
if (not self._boundary
and len(self._buffer) > self.BOUNDARY_SEARCH_BUF_SIZE):
raise RuntimeError("Cannot find multipart delimiter.")
while True:
if self._receiving_data:
self._receiving_data, self._buffer = yield self._read_data(self._buffer)
if self._is_end_of_request(self._buffer):
yield self.request_done()
break
elif self._is_end_of_data(self._buffer):
break
else:
headers, self._buffer = yield self._read_headers(self._buffer)
if headers:
self.header_list.append(headers)
self._receiving_data = True
else:
break

Speed up reading wav in python

Evening,
I am working on a project that requires me to read in multichannel wav files in 32-bit float.
When I read a specific file (1 minute long, 6 channels, 48k fs) in into Matlab and measure it with tic/toc it parses the file in 2.456482 seconds.
Matlab Code for file reading speed measurement
tic
wavread('C:/data/testData/6ch.wav');
toc
When I do it in python (mind you, I'm pretty unfamiliar with python) it takes 18.1655315617 seconds!
It seems to me like the way I am doing it is inefficient (I did get it down to 18 from 28 but it's still too much...)
I stripped the code to what is relevant to this subject:
Python Code for file reading speed measurement
import wave32
import struct
import time
import numpy as np
def getWavData(inFile)
wavFile = wave32.open(inFile, 'r')
wavParams = wavFile.getparams()
nChannels = wavParams[0]
byteDepth = wavParams[1]
nFrames = wavParams[3]
wavData = np.empty([nFrames, nChannels], np.float32)
frames = wavFile.readframes(nFrames)
for i in range(nFrames):
for j in range(nChannels):
start = ( i * nChannels + j ) * byteDepth
stop = start + byteDepth
wavData[i][j] = struct.unpack('<f', frames[start:stop])[0]
return wavData
inFile = 'C:/data/testData/6ch.wav'
start = time.clock()
data2 = getWavData(inFile)
elapsed = time.clock()
elapsedNew = elapsed - start
print str(elapsedNew)
please not that wav32 is a small hack I had to perform on wave.py to enable 32-bit float reading.
"""Stuff to parse WAVE files.
Usage.
Reading WAVE files:
f = wave.open(file, 'r')
where file is either the name of a file or an open file pointer.
The open file pointer must have methods read(), seek(), and close().
When the setpos() and rewind() methods are not used, the seek()
method is not necessary.
This returns an instance of a class with the following public methods:
getnchannels() -- returns number of audio channels (1 for
mono, 2 for stereo)
getsampwidth() -- returns sample width in bytes
getframerate() -- returns sampling frequency
getnframes() -- returns number of audio frames
getcomptype() -- returns compression type ('NONE' for linear samples)
getcompname() -- returns human-readable version of
compression type ('not compressed' linear samples)
getparams() -- returns a tuple consisting of all of the
above in the above order
getmarkers() -- returns None (for compatibility with the
aifc module)
getmark(id) -- raises an error since the mark does not
exist (for compatibility with the aifc module)
readframes(n) -- returns at most n frames of audio
rewind() -- rewind to the beginning of the audio stream
setpos(pos) -- seek to the specified position
tell() -- return the current position
close() -- close the instance (make it unusable)
The position returned by tell() and the position given to setpos()
are compatible and have nothing to do with the actual position in the
file.
The close() method is called automatically when the class instance
is destroyed.
Writing WAVE files:
f = wave.open(file, 'w')
where file is either the name of a file or an open file pointer.
The open file pointer must have methods write(), tell(), seek(), and
close().
This returns an instance of a class with the following public methods:
setnchannels(n) -- set the number of channels
setsampwidth(n) -- set the sample width
setframerate(n) -- set the frame rate
setnframes(n) -- set the number of frames
setcomptype(type, name)
-- set the compression type and the
human-readable compression type
setparams(tuple)
-- set all parameters at once
tell() -- return current position in output file
writeframesraw(data)
-- write audio frames without pathing up the
file header
writeframes(data)
-- write audio frames and patch up the file header
close() -- patch up the file header and close the
output file
You should set the parameters before the first writeframesraw or
writeframes. The total number of frames does not need to be set,
but when it is set to the correct value, the header does not have to
be patched up.
It is best to first set all parameters, perhaps possibly the
compression type, and then write audio frames using writeframesraw.
When all frames have been written, either call writeframes('') or
close() to patch up the sizes in the header.
The close() method is called automatically when the class instance
is destroyed.
"""
import __builtin__
__all__ = ["open", "openfp", "Error"]
class Error(Exception):
pass
WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_IEEE_FLOAT = 0x0003
_array_fmts = None, 'b', 'h', None, 'l'
# Determine endian-ness
import struct
if struct.pack("h", 1) == "\000\001":
big_endian = 1
else:
big_endian = 0
from chunk import Chunk
class Wave_read:
"""Variables used in this class:
These variables are available to the user though appropriate
methods of this class:
_file -- the open file with methods read(), close(), and seek()
set through the __init__() method
_nchannels -- the number of audio channels
available through the getnchannels() method
_nframes -- the number of audio frames
available through the getnframes() method
_sampwidth -- the number of bytes per audio sample
available through the getsampwidth() method
_framerate -- the sampling frequency
available through the getframerate() method
_comptype -- the AIFF-C compression type ('NONE' if AIFF)
available through the getcomptype() method
_compname -- the human-readable AIFF-C compression type
available through the getcomptype() method
_soundpos -- the position in the audio stream
available through the tell() method, set through the
setpos() method
These variables are used internally only:
_fmt_chunk_read -- 1 iff the FMT chunk has been read
_data_seek_needed -- 1 iff positioned correctly in audio
file for readframes()
_data_chunk -- instantiation of a chunk class for the DATA chunk
_framesize -- size of one frame in the file
"""
def initfp(self, file):
self._convert = None
self._soundpos = 0
self._file = Chunk(file, bigendian = 0)
if self._file.getname() != 'RIFF':
raise Error, 'file does not start with RIFF id'
if self._file.read(4) != 'WAVE':
raise Error, 'not a WAVE file'
self._fmt_chunk_read = 0
self._data_chunk = None
while 1:
self._data_seek_needed = 1
try:
chunk = Chunk(self._file, bigendian = 0)
except EOFError:
break
chunkname = chunk.getname()
if chunkname == 'fmt ':
self._read_fmt_chunk(chunk)
self._fmt_chunk_read = 1
elif chunkname == 'data':
if not self._fmt_chunk_read:
raise Error, 'data chunk before fmt chunk'
self._data_chunk = chunk
self._nframes = chunk.chunksize // self._framesize
self._data_seek_needed = 0
break
chunk.skip()
if not self._fmt_chunk_read or not self._data_chunk:
raise Error, 'fmt chunk and/or data chunk missing'
def __init__(self, f):
self._i_opened_the_file = None
if isinstance(f, basestring):
f = __builtin__.open(f, 'rb')
self._i_opened_the_file = f
# else, assume it is an open file object already
try:
self.initfp(f)
except:
if self._i_opened_the_file:
f.close()
raise
def __del__(self):
self.close()
#
# User visible methods.
#
def getfp(self):
return self._file
def rewind(self):
self._data_seek_needed = 1
self._soundpos = 0
def close(self):
if self._i_opened_the_file:
self._i_opened_the_file.close()
self._i_opened_the_file = None
self._file = None
def tell(self):
return self._soundpos
def getnchannels(self):
return self._nchannels
def getnframes(self):
return self._nframes
def getsampwidth(self):
return self._sampwidth
def getframerate(self):
return self._framerate
def getcomptype(self):
return self._comptype
def getcompname(self):
return self._compname
def getparams(self):
return self.getnchannels(), self.getsampwidth(), \
self.getframerate(), self.getnframes(), \
self.getcomptype(), self.getcompname()
def getmarkers(self):
return None
def getmark(self, id):
raise Error, 'no marks'
def setpos(self, pos):
if pos < 0 or pos > self._nframes:
raise Error, 'position not in range'
self._soundpos = pos
self._data_seek_needed = 1
def readframes(self, nframes):
if self._data_seek_needed:
self._data_chunk.seek(0, 0)
pos = self._soundpos * self._framesize
if pos:
self._data_chunk.seek(pos, 0)
self._data_seek_needed = 0
if nframes == 0:
return ''
if self._sampwidth > 1 and big_endian:
# unfortunately the fromfile() method does not take
# something that only looks like a file object, so
# we have to reach into the innards of the chunk object
import array
chunk = self._data_chunk
data = array.array(_array_fmts[self._sampwidth])
nitems = nframes * self._nchannels
if nitems * self._sampwidth > chunk.chunksize - chunk.size_read:
nitems = (chunk.chunksize - chunk.size_read) / self._sampwidth
data.fromfile(chunk.file.file, nitems)
# "tell" data chunk how much was read
chunk.size_read = chunk.size_read + nitems * self._sampwidth
# do the same for the outermost chunk
chunk = chunk.file
chunk.size_read = chunk.size_read + nitems * self._sampwidth
data.byteswap()
data = data.tostring()
else:
data = self._data_chunk.read(nframes * self._framesize)
if self._convert and data:
data = self._convert(data)
self._soundpos = self._soundpos + len(data) // (self._nchannels * self._sampwidth)
return data
#
# Internal methods.
#
def _read_fmt_chunk(self, chunk):
wFormatTag, self._nchannels, self._framerate, dwAvgBytesPerSec, wBlockAlign = struct.unpack('<hhllh', chunk.read(14))
if wFormatTag == WAVE_FORMAT_PCM or wFormatTag==WAVE_FORMAT_IEEE_FLOAT:
sampwidth = struct.unpack('<h', chunk.read(2))[0]
self._sampwidth = (sampwidth + 7) // 8
else:
#sampwidth = struct.unpack('<h', chunk.read(2))[0]
#self._sampwidth = (sampwidth + 7) // 8
raise Error, 'unknown format: %r' % (wFormatTag,)
self._framesize = self._nchannels * self._sampwidth
self._comptype = 'NONE'
self._compname = 'not compressed'
class Wave_write:
"""Variables used in this class:
These variables are user settable through appropriate methods
of this class:
_file -- the open file with methods write(), close(), tell(), seek()
set through the __init__() method
_comptype -- the AIFF-C compression type ('NONE' in AIFF)
set through the setcomptype() or setparams() method
_compname -- the human-readable AIFF-C compression type
set through the setcomptype() or setparams() method
_nchannels -- the number of audio channels
set through the setnchannels() or setparams() method
_sampwidth -- the number of bytes per audio sample
set through the setsampwidth() or setparams() method
_framerate -- the sampling frequency
set through the setframerate() or setparams() method
_nframes -- the number of audio frames written to the header
set through the setnframes() or setparams() method
These variables are used internally only:
_datalength -- the size of the audio samples written to the header
_nframeswritten -- the number of frames actually written
_datawritten -- the size of the audio samples actually written
"""
def __init__(self, f):
self._i_opened_the_file = None
if isinstance(f, basestring):
f = __builtin__.open(f, 'wb')
self._i_opened_the_file = f
try:
self.initfp(f)
except:
if self._i_opened_the_file:
f.close()
raise
def initfp(self, file):
self._file = file
self._convert = None
self._nchannels = 0
self._sampwidth = 0
self._framerate = 0
self._nframes = 0
self._nframeswritten = 0
self._datawritten = 0
self._datalength = 0
self._headerwritten = False
def __del__(self):
self.close()
#
# User visible methods.
#
def setnchannels(self, nchannels):
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
if nchannels < 1:
raise Error, 'bad # of channels'
self._nchannels = nchannels
def getnchannels(self):
if not self._nchannels:
raise Error, 'number of channels not set'
return self._nchannels
def setsampwidth(self, sampwidth):
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
if sampwidth < 1 or sampwidth > 4:
raise Error, 'bad sample width'
self._sampwidth = sampwidth
def getsampwidth(self):
if not self._sampwidth:
raise Error, 'sample width not set'
return self._sampwidth
def setframerate(self, framerate):
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
if framerate <= 0:
raise Error, 'bad frame rate'
self._framerate = framerate
def getframerate(self):
if not self._framerate:
raise Error, 'frame rate not set'
return self._framerate
def setnframes(self, nframes):
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
self._nframes = nframes
def getnframes(self):
return self._nframeswritten
def setcomptype(self, comptype, compname):
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
if comptype not in ('NONE',):
raise Error, 'unsupported compression type'
self._comptype = comptype
self._compname = compname
def getcomptype(self):
return self._comptype
def getcompname(self):
return self._compname
def setparams(self, params):
nchannels, sampwidth, framerate, nframes, comptype, compname = params
if self._datawritten:
raise Error, 'cannot change parameters after starting to write'
self.setnchannels(nchannels)
self.setsampwidth(sampwidth)
self.setframerate(framerate)
self.setnframes(nframes)
self.setcomptype(comptype, compname)
def getparams(self):
if not self._nchannels or not self._sampwidth or not self._framerate:
raise Error, 'not all parameters set'
return self._nchannels, self._sampwidth, self._framerate, \
self._nframes, self._comptype, self._compname
def setmark(self, id, pos, name):
raise Error, 'setmark() not supported'
def getmark(self, id):
raise Error, 'no marks'
def getmarkers(self):
return None
def tell(self):
return self._nframeswritten
def writeframesraw(self, data):
self._ensure_header_written(len(data))
nframes = len(data) // (self._sampwidth * self._nchannels)
if self._convert:
data = self._convert(data)
if self._sampwidth > 1 and big_endian:
import array
data = array.array(_array_fmts[self._sampwidth], data)
data.byteswap()
data.tofile(self._file)
self._datawritten = self._datawritten + len(data) * self._sampwidth
else:
self._file.write(data)
self._datawritten = self._datawritten + len(data)
self._nframeswritten = self._nframeswritten + nframes
def writeframes(self, data):
self.writeframesraw(data)
if self._datalength != self._datawritten:
self._patchheader()
def close(self):
if self._file:
self._ensure_header_written(0)
if self._datalength != self._datawritten:
self._patchheader()
self._file.flush()
self._file = None
if self._i_opened_the_file:
self._i_opened_the_file.close()
self._i_opened_the_file = None
#
# Internal methods.
#
def _ensure_header_written(self, datasize):
if not self._headerwritten:
if not self._nchannels:
raise Error, '# channels not specified'
if not self._sampwidth:
raise Error, 'sample width not specified'
if not self._framerate:
raise Error, 'sampling rate not specified'
self._write_header(datasize)
def _write_header(self, initlength):
assert not self._headerwritten
self._file.write('RIFF')
if not self._nframes:
self._nframes = initlength / (self._nchannels * self._sampwidth)
self._datalength = self._nframes * self._nchannels * self._sampwidth
self._form_length_pos = self._file.tell()
self._file.write(struct.pack('<l4s4slhhllhh4s',
36 + self._datalength, 'WAVE', 'fmt ', 16,
WAVE_FORMAT_PCM, self._nchannels, self._framerate,
self._nchannels * self._framerate * self._sampwidth,
self._nchannels * self._sampwidth,
self._sampwidth * 8, 'data'))
self._data_length_pos = self._file.tell()
self._file.write(struct.pack('<l', self._datalength))
self._headerwritten = True
def _patchheader(self):
assert self._headerwritten
if self._datawritten == self._datalength:
return
curpos = self._file.tell()
self._file.seek(self._form_length_pos, 0)
self._file.write(struct.pack('<l', 36 + self._datawritten))
self._file.seek(self._data_length_pos, 0)
self._file.write(struct.pack('<l', self._datawritten))
self._file.seek(curpos, 0)
self._datalength = self._datawritten
def open(f, mode=None):
if mode is None:
if hasattr(f, 'mode'):
mode = f.mode
else:
mode = 'rb'
if mode in ('r', 'rb'):
return Wave_read(f)
elif mode in ('w', 'wb'):
return Wave_write(f)
else:
raise Error, "mode must be 'r', 'rb', 'w', or 'wb'"
openfp = open # B/W compatibility
Sorry for the long code BTW :)
So my question is: is the wave.py module inherently slow (any alternatives to fix this?) or am I doing something inefficient?
I suppose I could just read in the wav header with a custom function and read the file in in a different way, but it seems like this is going to be A LOT of work, especially since I don't know a lot about 1) python and 2) file handling
Kind regards,
K.
Edit: I tried unutbu's suggestion but that does not work as scipy does not accept >16 bit.
When I try to parse the wav file through the scipy wavreader I get this message:
C:\Users\King Broos\AppData\Local\Enthought\Canopy32\System\lib\site-packages\scipy\io\wavfile.py:31: WavFileWarning: Unfamiliar format bytes
warnings.warn("Unfamiliar format bytes", WavFileWarning)
C:\Users\King Broos\AppData\Local\Enthought\Canopy32\System\lib\site-packages\scipy\io\wavfile.py:121: WavFileWarning: chunk not understood
warnings.warn("chunk not understood", WavFileWarning)
Looking into the code of wavfile.py this is the line where it throws the exception:
if (comp != 1 or size > 16):
warnings.warn("Unfamiliar format bytes", WavFileWarning)
I really need either 24 or 32 bit so I guess scipy not an option?
If you can install or have scipy, then use wavfile.read:
import scipy.io.wavfile as wavfile
sample_rate, x = wavfile.read(filename)
You might also want to study the source code, here.
Note that scipy.io.wavfile does not use Python's wave module. I'm not sure if it reads your IEEE_FLOAT format or not, but it does not do the same check as wave.py:
if wFormatTag == WAVE_FORMAT_PCM or wFormatTag==WAVE_FORMAT_IEEE_FLOAT:
sampwidth = struct.unpack('<h', chunk.read(2))[0]
self._sampwidth = (sampwidth + 7) // 8
else:
#sampwidth = struct.unpack('<h', chunk.read(2))[0]
#self._sampwidth = (sampwidth + 7) // 8
raise Error, 'unknown format: %r' % (wFormatTag,)
so perhaps it will work out-of-the-box.
By the way, instead of making your own module, wave32.py which is almost exactly the same as wave.py from the standard library, you could use monkey-patching:
import wave
import struct
WAVE_FORMAT_IEEE_FLOAT = 0x0003
def _read_fmt_chunk(self, chunk):
wFormatTag, self._nchannels, self._framerate, dwAvgBytesPerSec, wBlockAlign = struct.unpack('<hhllh', chunk.read(14))
if wFormatTag == WAVE_FORMAT_PCM or wFormatTag == WAVE_FORMAT_IEEE_FLOAT:
sampwidth = struct.unpack('<h', chunk.read(2))[0]
self._sampwidth = (sampwidth + 7) // 8
else:
raise Error, 'unknown format: %r' % (wFormatTag,)
self._framesize = self._nchannels * self._sampwidth
self._comptype = 'NONE'
self._compname = 'not compressed'
wave.Wave_read._read_fmt_chunk = _read_fmt_chunk
You can also use numpy directly:
import numpy as np
fs = np.fromfile(filename, dtype=np.int32, count=1, offset=24)[0] # Hz
byte_length = np.fromfile(filename, dtype=np.int32, count=1, offset=40)[0]
To manually read pieces of metadata. I recommend using a hex editor and wave format reference to verify the locations for pieces of metadata and the offset to the start of the data chunk (might not be 40 or 44 bytes in).
To read 32-bit WAVE_FORMAT_IEEE_FLOAT:
data = np.fromfile(filename, dtype=np.float32, count=byte_length // 4, offset=44)
To read 24-bit WAVE_FORMAT_PCM:
# prepend zero-byte to each sample (since there's no np.int24)
# then flatten, convert normally and byte-shift to correct for extra byte
data = np.zeros([byte_length // 3, 4], dtype=np.int8)
data[:, 1:] = np.fromfile(filename, dtype=np.int8, count=byte_length, offset=44).reshape(-1, 3)
data = np.right_shift(data.reshape(-1).view(dtype=np.int32), 8)
data = data / 2 ** 23 # if you want to normalize
Depends on the wavefile and machine, but this seems to be ~120 times faster than a loop for a 4.4 MB 24-bit .wav file, but there's likely bigger performance gains for bigger files (until swap is required, I think there's ~5 memory copies performed, including normalization).
This assumes:
No extra chunks at the start of the file, else offset= parameters are wrong
Single channel - reshape the array and/or change the byte order for multi-channel, with something like .reshape(num_channels, -1, order='F')
Little-endian I think

Python - seek in http response stream

Using urllibs (or urllibs2) and wanting what I want is hopeless.
Any solution?
I'm not sure how the C# implementation works, but, as internet streams are generally not seekable, my guess would be it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggested and write the data to a file or StringIO and seek from there.
However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.
The range header
When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
Server support
There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC2616) along with a response to report if they support ranges or not. This could be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
Checking if a range is returned
If a server returns a range, it must send the Content-Range header (section 14.16 of RFC2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
Implementation with urllib2
urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.
import sys
import urllib2
# Check command line arguments.
if len(sys.argv) < 3:
sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
sys.exit(1)
# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])
# Add the header to specify the range to download.
if len(sys.argv) > 3:
start, length = map(int, sys.argv[2:])
request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
request.add_header("range", "bytes=%s-" % sys.argv[2])
# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)
# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
print "Partial retrieval successful."
# The header contains the string 'bytes', followed by a space, then the
# range in the format 'start-end', followed by a slash and then the total
# size of the page (or an asterix if the total size is unknown). Lets get
# the range and total size from this.
range, total = response.headers['content-range'].split(' ')[-1].split('/')
# Print a message giving the range information.
if total == '*':
print "Bytes %s of an unknown total were retrieved." % range
else:
print "Bytes %s of a total of %s were retrieved." % (range, total)
# No header, so partial retrieval was unsuccessful.
else:
print "Unable to use partial retrieval."
# And for good measure, lets check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)
Using this, I can retrieve the final 2,000 bytes of the Python homepage:
blair#blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes
Or 400 bytes from the middle of the homepage:
blair#blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes
However, the Google homepage does not support ranges:
blair#blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes
In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
It may work best just to write the data to a file (or even to a string, using StringIO), and to seek in that file (or string).
I did not find any existing implementations of a file-like interface with seek() to HTTP URLs, so I rolled my own simple version: https://github.com/valgur/pyhttpio. It depends on urllib.request but could probably easily be modified to use requests, if necessary.
The full code:
import cgi
import time
import urllib.request
from io import IOBase
from sys import stderr
class SeekableHTTPFile(IOBase):
def __init__(self, url, name=None, repeat_time=-1, debug=False):
"""Allow a file accessible via HTTP to be used like a local file by utilities
that use `seek()` to read arbitrary parts of the file, such as `ZipFile`.
Seeking is done via the 'range: bytes=xx-yy' HTTP header.
Parameters
----------
url : str
A HTTP or HTTPS URL
name : str, optional
The filename of the file.
Will be filled from the Content-Disposition header if not provided.
repeat_time : int, optional
In case of HTTP errors wait `repeat_time` seconds before trying again.
Negative value or `None` disables retrying and simply passes on the exception (the default).
"""
super().__init__()
self.url = url
self.name = name
self.repeat_time = repeat_time
self.debug = debug
self._pos = 0
self._seekable = True
with self._urlopen() as f:
if self.debug:
print(f.getheaders())
self.content_length = int(f.getheader("Content-Length", -1))
if self.content_length < 0:
self._seekable = False
if f.getheader("Accept-Ranges", "none").lower() != "bytes":
self._seekable = False
if name is None:
header = f.getheader("Content-Disposition")
if header:
value, params = cgi.parse_header(header)
self.name = params["filename"]
def seek(self, offset, whence=0):
if not self.seekable():
raise OSError
if whence == 0:
self._pos = 0
elif whence == 1:
pass
elif whence == 2:
self._pos = self.content_length
self._pos += offset
return self._pos
def seekable(self, *args, **kwargs):
return self._seekable
def readable(self, *args, **kwargs):
return not self.closed
def writable(self, *args, **kwargs):
return False
def read(self, amt=-1):
if self._pos >= self.content_length:
return b""
if amt < 0:
end = self.content_length - 1
else:
end = min(self._pos + amt - 1, self.content_length - 1)
byte_range = (self._pos, end)
self._pos = end + 1
with self._urlopen(byte_range) as f:
return f.read()
def readall(self):
return self.read(-1)
def tell(self):
return self._pos
def __getattribute__(self, item):
attr = object.__getattribute__(self, item)
if not object.__getattribute__(self, "debug"):
return attr
if hasattr(attr, '__call__'):
def trace(*args, **kwargs):
a = ", ".join(map(str, args))
if kwargs:
a += ", ".join(["{}={}".format(k, v) for k, v in kwargs.items()])
print("Calling: {}({})".format(item, a))
return attr(*args, **kwargs)
return trace
else:
return attr
def _urlopen(self, byte_range=None):
header = {}
if byte_range:
header = {"range": "bytes={}-{}".format(*byte_range)}
while True:
try:
r = urllib.request.Request(self.url, headers=header)
return urllib.request.urlopen(r)
except urllib.error.HTTPError as e:
if self.repeat_time is None or self.repeat_time < 0:
raise
print("Server responded with " + str(e), file=stderr)
print("Sleeping for {} seconds before trying again".format(self.repeat_time), file=stderr)
time.sleep(self.repeat_time)
A potential usage example:
url = "https://www.python.org/ftp/python/3.5.0/python-3.5.0-embed-amd64.zip"
f = SeekableHTTPFile(url, debug=True)
zf = ZipFile(f)
zf.printdir()
zf.extract("python.exe")
Edit: There is actually a mostly identical, if slightly more minimal, implementation in this answer: https://stackoverflow.com/a/7852229/2997179

Categories

Resources