python construct for protocol parsing

I am trying to combine the power of the Twisted Protocol with the flexibility of construct, the declarative binary data parser.
So far, my MessageReceiver protocol accumulates the data coming from the TCP channel in the following way:
def rawDataReceived(self, data):
    '''
    This method buffers the data coming from the TCP channel in the following way:
    - Initially, discard the stream until a reserved character is detected
    - Add data to the buffer up to the expected message length, unless the reserved character is met again. In that case discard the message and start again
    - If the expected message length is reached, attempt to parse the message and clear the buffer
    '''
    if self._buffer:
        index = data.find(self.reserved_character)
        if index > -1:
            if len(self._buffer) + index >= self._fixed_size:
                self.on_message(self._buffer + data[:data.index(self._reserved_character)])
                self._buffer = b''
            data = data[data.index(self.reserved_character):]
            [self.on_message(chunks[:self._fixed_size]) for chunks in [self.reserved_character + msg for msg in data.split(self._reserved_character) if msg]]
        elif len(self._buffer) + len(data) < self._expected_size:
            self._buffer = self._buffer + data
        else:
            self._buffer = b''
    else:
        try:
            data = data[data.index(self._reserved_character):]
            [self.on_message(chunks[:self._fixed_size]) for chunks in [self._reserved_character + msg for msg in data.split(self._reserved_character) if msg]]
        except Exception, exc:
            log.msg("Warning: Maybe there is no delimiter {delim} for the new message. Error: {err}".format(delim=self._reserved_character, err=str(exc)))
Now I need to evolve the protocol to account for the fact that the message may or may not carry optional fields (so there is no longer a fixed message length). I modeled (a meaningful part of) the message parser with construct in the following way:
def on_message(self, msg):
    return Struct(HEADER,
                  Bytes(HEADER_RAW, 3),
                  BitStruct(OPTIONAL_HEADER_STRUCT,
                            Nibble(APPLICATION_SELECTOR),
                            Flag(OPTIONAL_HEADER_FLAG),
                            Padding(3)
                            ),
                  If(lambda ctx: ctx.optional_header_struct[OPTIONAL_HEADER_FLAG],
                     Embed(Struct(None,
                                  Byte(BATTERY_CHARGE),
                                  Bytes(OPTIONAL_HEADER, 3)
                                  )
                           )
                     )
                  ).parse(msg)
So now I need to change the buffering logic to pass the right chunk size to the Struct. I would like to avoid sizing up the data to be passed to the Struct in the rawDataReceived method, given that the rules for what is a possible candidate message are already known to the construct object.
Is there any way to push the buffering logic to the construct object?
Edit
I was able to partially achieve the aim of pushing the buffering logic inside, simply by making use of Macros and Adapters:
MY_PROTOCOL = Struct("whatever",
                     Anchor("begin"),
                     RepeatUntil(lambda obj, ctx: obj == RESERVED_CHAR, Field("garbage", 1)),
                     NoneOf(Embed(HEADER_SECTION), [RESERVED_CHAR]),
                     Anchor("end"),
                     Value("size", lambda ctx: ctx.end - ctx.begin)
                     )
This greatly simplifies the caller code (which is no longer in rawDataReceived thanks to Glyph's suggestion):
def dataReceived(self, data):
    log.msg('Received data: {}'.format(bytes_to_hex(data)))
    self._buffer += data
    try:
        container = MY_PROTOCOL.parse(self._buffer)
        self._buffer = self._buffer[container.size:]
        d, self.d = self.d, self._create_new_transmission_deferred()
        d.callback(container)
    except ValidationError, err:
        self._cb_error("A validation error occurred. Discarding the rest of the message. {}".format(err))
        self._buffer = b''
    except FieldError, err:  # Incomplete message. We simply keep on buffering and retry
        if len(self._buffer) >= MyMessageReceiver.MAX_GARBAGE_SIZE:
            self._cb_error("Buffer overflown. No delimiter found in the stream")
Unfortunately, this solution covers the requirements only partially, since I could not find a way to get construct to tell me the index of the stream that produced the error; therefore I am obliged to drop the entire buffer, which is not ideal.

To get the stream position at which an error occurs, you'll need to use Anchor and write your own version of NoneOf. Assuming HEADER_SECTION is another Construct, replace the NoneOf like so:
SpecialNoneOf(Struct('example', Anchor('position'), HEADER_SECTION), [RESERVED_CHAR])
SpecialNoneOf needs to subclass from Adapter and combine __init__ and _validate from NoneOf with _encode and _decode from Validator. In _decode, replace
raise ValidationError("invalid object", obj)
with
raise ValidationError("invalid object", "{} at {}".format(obj.header_section, obj.position))
Replace header_section with the name of the HEADER_SECTION Construct. You will have to change the structure of the resulting container or figure out a different way to use Embed to make this method work.
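For concreteness, here is a rough sketch of what such an adapter might look like, assuming the construct 2.5 API used above; the container attributes header_section and position are placeholders that depend on how the wrapped Struct is actually laid out:
from construct import Adapter, ValidationError

class SpecialNoneOf(Adapter):
    # Sketch only: mirrors NoneOf's __init__/_validate and Validator's
    # _encode/_decode, but reports the Anchor position in the error message.
    def __init__(self, subcon, invalids):
        Adapter.__init__(self, subcon)
        self._invalids = invalids

    def _validate(self, obj, context):
        return obj not in self._invalids

    def _encode(self, obj, context):
        if not self._validate(obj, context):
            raise ValidationError("invalid object", obj)
        return obj

    def _decode(self, obj, context):
        if not self._validate(obj, context):
            # 'header_section' and 'position' are hypothetical container keys;
            # adjust them to match the wrapped Struct.
            raise ValidationError("invalid object",
                                  "{} at {}".format(obj.header_section, obj.position))
        return obj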

Related

Problem with streaming audio in Python from a mic via MQTT to Google Streaming using generators

I've read the Google documentation and looked at their examples, but I have not managed to get this working correctly for my particular use case. The problem is that the packets of the audio stream are broken up into smaller chunks (frame size), base64 encoded, and sent over MQTT, meaning that the generator approach is likely to stop part way through even though the sender has not finished. My MicrophoneSender component sends the final part of the message with segment_key = -1, so this is the flag that the complete message has been sent and that a full/final processing of the stream can be done. Prior to that point the buffer may not contain the complete stream, so it's difficult to get either a) the generator to stop yielding, or b) Google to return a partial transcription. A partial transcription is required once every 10 or so frames.
To illustrate this better here is my code.
inside receiver:
STREAMFRAMETHRESHOLD = 10

def mqttMsgCallback(self, client, userData, msg):
    if msg.topic.startswith("MicSender/stream"):
        msgDict = json.loads(msg.payload)
        streamBytes = b64decode(msgDict['audio_data'].encode('utf-8'))
        frameNum = int(msgDict['segment_num'])
        if frameNum == 0:
            self.asr_time_start = time.time()
            self.asr.endOfStream = False
        if frameNum >= 0:
            self.asr.store_stream_bytes(streamBytes)
            self.asr.endOfStream = False
            if frameNum % STREAMFRAMETHRESHOLD == 0:
                self.asr.get_intermediate_and_print()
        else:
            # FINAL, received -1
            trans = self.asr.finish_stream()
            self.send_message(trans)
            self.frameCount = 0
inside Google Speech Class implementation:
class GoogleASR(ASR):
    def __init__(self, name):
        super().__init__(name)
        # STREAMING
        self.stream_buf = queue.Queue()
        self.stream_gen = self.getGenerator(self.stream_buf)
        self.endOfStream = True
        self.requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in self.stream_gen)
        self.streaming_config = types.StreamingRecognitionConfig(config=self.config)
        self.current_transcript = ''
        self.numCharsPrinted = 0

    def getGenerator(self, buff):
        while not self.endOfStream:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = buff.get()
            if chunk is None:
                return
            data = [chunk]
            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = buff.get(block=False)
                    data.append(chunk)
                except queue.Empty:
                    self.endOfStream = True
                    yield b''.join(data)
                    break
            yield b''.join(data)

    def store_stream_bytes(self, bytes):
        self.stream_buf.put(bytes)

    def get_intermediate_and_print(self):
        self.get_intermediate()

    def get_intermediate(self):
        if self.stream_buf.qsize() > 1:
            print("stream buf size: {}".format(self.stream_buf.qsize()))
            responses = self.client.streaming_recognize(self.streaming_config, self.requests)
            # print(responses)
            try:
                # Now, put the transcription responses to use.
                if not self.numCharsPrinted:
                    self.numCharsPrinted = 0
                for response in responses:
                    if not response.results:
                        continue
                    # The `results` list is consecutive. For streaming, we only care about
                    # the first result being considered, since once it's `is_final`, it
                    # moves on to considering the next utterance.
                    result = response.results[0]
                    if not result.alternatives:
                        continue
                    # Display the transcription of the top alternative.
                    self.current_transcript = result.alternatives[0].transcript
                    # Display interim results, but with a carriage return at the end of the
                    # line, so subsequent lines will overwrite them.
                    #
                    # If the previous result was longer than this one, we need to print
                    # some extra spaces to overwrite the previous result
                    overwrite_chars = ' ' * (self.numCharsPrinted - len(self.current_transcript))
                    sys.stdout.write(self.current_transcript + overwrite_chars + '\r')
                    sys.stdout.flush()
                    self.numCharsPrinted = len(self.current_transcript)

    def finish_stream(self):
        self.endOfStream = False
        self.get_intermediate()
        self.endOfStream = True
        final_result = self.current_transcript
        self.stream_buf = queue.Queue()
        self.allBytes = bytearray()
        self.current_transcript = ''
        self.requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in self.stream_gen)
        self.streaming_config = types.StreamingRecognitionConfig(config=self.config)
        return final_result
Currently, this outputs nothing from the transcription side, only the buffer sizes:
stream buf size: 21
stream buf size: 41
stream buf size: 61
stream buf size: 81
stream buf size: 101
stream buf size: 121
stream buf size: 141
stream buf size: 159
But the response/transcript is empty. If I put a breakpoint on the for response in responses line inside the get_intermediate function, it never runs, which means that for some reason responses is empty (nothing is returned from Google). However, if I put a breakpoint on the generator and take too long (> 5 seconds) to continue yielding the data, Google tells me that the data is probably being sent to the server too slowly: google.api_core.exceptions.OutOfRange: 400 Audio data is being streamed too slow. Please stream audio data approximately at real time.
Maybe someone can spot the obvious here...
The way you have organized your code, the generator you give to the Google API is initialized exactly once - on line 10, using a generator expression: self.requests = (...). As constructed, this generator will also run exactly once and become 'exhausted'. The same applies to the generator function that the generator expression itself calls (self.getGenerator()). It will run only once and stop when it has retrieved 10 chunks of data (which are very small, from what I can see). Then the outer generator (what you assigned to self.requests) will also stop forever - giving the ASR only a short bit of data (10 times 20 bytes, looking at the printed debug output). There's nothing recognizable in that, most likely.
BTW, note that you have a redundant yield b''.join(data) in your function; the data will be sent twice.
You will need to redo the (outer) generator so it does not return until all data is received. If you want to use another generator, as you do, to gather each bigger chunk for the 'outer' generator from which the Google API is reading, you will need to re-make it every time you begin a new loop with it.
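As an illustration (not the poster's actual code), a minimal sketch of such an outer generator could look like this, assuming the receiver pushes a None sentinel into the queue when the segment_num == -1 frame arrives:
def request_generator(self):
    # Keep yielding requests until an explicit end-of-stream sentinel arrives,
    # instead of stopping as soon as the queue happens to be momentarily empty.
    while True:
        chunk = self.stream_buf.get()  # blocks until data (or the sentinel) is available
        if chunk is None:              # hypothetical end-of-stream marker pushed by the receiver
            return
        yield types.StreamingRecognizeRequest(audio_content=chunk)
A fresh generator (and thus a fresh requests iterable) would then have to be created for every new streaming_recognize call.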

Python tornado async request handler

I have written an async file upload RequestHandler. It is correct byte-wise; that is, the files I receive are identical to the ones being sent. One issue that I am having trouble figuring out is the upload delay. Specifically, when I issue the POST request to upload the file while testing locally, I see the browser's upload progress get stuck. For files close to 4MB in size it gets stuck at 50%+ for a little while, then some time passes and it sends all of the data, then it gets stuck on "waiting for localhost...". The whole process may last 3+ minutes.
The kicker is that when I add print statements ending with a newline to the data_received method, the delays disappear. Does the print statement somehow trigger the network buffers to be flushed?
Here is the implementation of data_received, along with the helper methods:
@tornado.gen.coroutine
def _read_data(self, cont_buf):
    '''
    Read the file data.
    @param cont_buf - buffered HTTP request
    @returns boolean indicating whether data is still being read and new
    buffer
    '''
    # Check only last characters of the buffer guaranteed to be large
    # enough to contain the boundary
    end_of_data_idx = cont_buf.find(self._boundary)
    if end_of_data_idx >= 0:
        data = cont_buf[:(end_of_data_idx - self.LSEP)]
        self.receive_data(self.header_list[-1], data)
        new_buffer = cont_buf[(end_of_data_idx + len(self._boundary)):]
        return False, new_buffer
    else:
        self.receive_data(self.header_list[-1], cont_buf)
        return True, b""
@tornado.gen.coroutine
def _parse_params(self, param_buf):
    '''
    Parse HTTP header parameters.
    @param param_buf - string buffer containing the parameters.
    @returns parameters dictionary
    '''
    params = dict()
    param_res = self.PAT_HEADERPARAMS.findall(param_buf)
    if param_res:
        for name, value in param_res:
            params[name] = value
    elif param_buf:
        params['value'] = param_buf
    return params
@tornado.gen.coroutine
def _parse_header(self, header_buf):
    '''
    Parses a buffer containing an individual header with parameters.
    @param header_buf - header buffer containing a single header
    @returns header dictionary
    '''
    res = self.PAT_HEADERVALUE.match(header_buf)
    header = dict()
    if res:
        name, value, tail = res.groups()
        header = {'name': name, 'value': value,
                  'params': (yield self._parse_params(tail))}
    elif header_buf:
        header = {"value": header_buf}
    return header
@tornado.gen.coroutine
def data_received(self, chunk):
    '''
    Processes a chunk of content body.
    @param chunk - a piece of content body.
    '''
    self._count += len(chunk)
    self._buffer += chunk
    # Has boundary been established?
    if not self._boundary:
        self._boundary, self._buffer =\
            (yield self._extract_boundary(self._buffer))
        if (not self._boundary
                and len(self._buffer) > self.BOUNDARY_SEARCH_BUF_SIZE):
            raise RuntimeError("Cannot find multipart delimiter.")
    while True:
        if self._receiving_data:
            self._receiving_data, self._buffer = yield self._read_data(self._buffer)
            if self._is_end_of_request(self._buffer):
                yield self.request_done()
                break
            elif self._is_end_of_data(self._buffer):
                break
        else:
            headers, self._buffer = yield self._read_headers(self._buffer)
            if headers:
                self.header_list.append(headers)
                self._receiving_data = True
            else:
                break

unpack_from requires a buffer of at least 4 bytes

I am receiving a packet from a client, consisting of many fields. I read all fields successfully, but when it comes to the last field, tag_end, Python gives me the error:
unpack_from requires a buffer of at least 4 bytes
This is the code:
def set_bin(self, buf):
    """Reads a vector of bytes (probably received from network or
    read from file) and tries to construct the packet structure
    from it, by reading each packet member from the buffer. This
    is somehow like deserializing the packet.
    """
    assert isinstance(buf, bytearray), 'buffer type is not valid'
    offset = 0
    print("$$$$$$$$$$$$$$$$ set bin $$$$$$$$$$$$$$$$$")
    try:
        (self._tag_start, self._version, self._checksum, self._connection_id,
         self._packet_seq) = Packet.PACKER_1.unpack_from(str(buf), offset)
    except struct.error as e:
        print(e)
        raise DeserializeError(e)
    except ValueError as e:
        print(e)
        raise DeserializeError(e)
    # I=4 H=2 B=1
    offset = Packet.OFFSET_GUID  # 14 correct
    self._guid = buf[offset:offset + Packet.UUID_SIZE]  # 14-16 correct
    offset = Packet.OFFSET_GUID + Packet.UUID_SIZE
    print("$$$$$$$$$$$$$$$$ GUID read successfully $$$$$$$$$$$$$$$$$")
    try:
        (self._timestamp_sec, self._timestamp_microsec, self._command,
         self._command_seq, self._subcommand, self._data_seq,
         self._data_length) = Packet.PACKER_3.unpack_from(str(buf), offset)
    except struct.error as e:
        print(e)
        raise DeserializeError(e)
    except ValueError as e:
        print(e)
        raise DeserializeError(e)
    print("$$$$$$$$$$$$$$$$ timestamps read successfully $$$$$$$$$$$$$$$$$")
    offset = Packet.OFFSET_AUTHENTICATE
    self._username = buf[offset:offset + self.USERNAME_SIZE]  # Saman
    offset += self.USERNAME_SIZE
    print("$$$$$$$$$$$$$$$$ username read successfully $$$$$$$$$$$$$$$$$")
    self._password = buf[offset:offset + self.USERNAME_SIZE]
    offset += self.PASSWORD_SIZE
    print("$$$$$$$$$$$$$$$$ password read successfully $$$$$$$$$$$$$$$$$")
    self._data = buf[offset:offset + self._data_length]
    offset = offset + self._data_length
    print("$$$$$$$$$$$$$$$$ data read successfully $$$$$$$$$$$$$$$$$")
    try:
        (self._tag_end,) = Packet.PACKER_4.unpack_from(str(buf), offset)
    except struct.error as e:
        print(e)
        raise DeserializeError(e)
    except ValueError as e:
        print(e)
        raise DeserializeError(e)
    print("$$$$$$$$$$$$$$$$ tag end read successfully $$$$$$$$$$$$$$$$$")
    if len(buf) != Packet.PACKER.size + self._data_length:
        print('failed to deserialize binary data correctly and construct the packet due to extra data')
    else:
        print('############### Deserialized Successfully')
and these are some constants used in the code:
STRUCT_FORMAT_STR = r'=IHIHH 16B IIHHHHH I 6c 9c' #Saman
STRUCT_FORMAT_STR_1 = r'=IHIHH'
STRUCT_FORMAT_STR_2 = r'=16B'
STRUCT_FORMAT_STR_3 = r'=IIHHHHH'
STRUCT_FORMAT_STR_4 = r'=I'
STRUCT_FORMAT_STR_5 = r'=6c'
STRUCT_FORMAT_STR_6 = r'=9c'
UUID_SIZE = 16
OFFSET_GUID = 14
#OFFSET_DATA = 48 #shifting offset data by 15 char
OFFSET_AUTHENTICATE = 48
PACKER = struct.Struct(str(STRUCT_FORMAT_STR)) #Saman
PACKER_1 = struct.Struct(str(STRUCT_FORMAT_STR_1))
PACKER_2 = struct.Struct(str(STRUCT_FORMAT_STR_2))
PACKER_3 = struct.Struct(str(STRUCT_FORMAT_STR_3))
PACKER_4 = struct.Struct(str(STRUCT_FORMAT_STR_4))
PACKER_5 = struct.Struct(str(STRUCT_FORMAT_STR_5))
PACKER_6 = struct.Struct(str(STRUCT_FORMAT_STR_6))
BYTES_TAG_START = PACKER_4.pack(TAG_START)
BYTES_TAG_END = PACKER_4.pack(TAG_END)
and initialization of the packet object, where it initializes the fields:
def __init__(self, **kwargs):
    if 'buf' in kwargs:
        self.set_bin(kwargs['buf'])
    else:
        assert kwargs['command'] in Packet.RTCINET_COMMANDS.values() and kwargs['subcommand'] in Packet.RTCINET_COMMANDS.values(), 'Undefined protocol command'
        assert isinstance(kwargs['data'], bytearray), 'invalid type for data field'
        for field in ('command', 'subcommand', 'data'):
            setattr(self, '_' + field, kwargs[field])
        self._tag_start = Packet.TAG_START
        self._version = Packet.VERSION_CURRENT % (Packet.USHRT_MAX + 1)
        self._checksum = Packet.CRC_INIT
        self._connection_id = kwargs.get('connection_id', 0) % (Packet.USHRT_MAX + 1)
        self._packet_seq = Packet.PACKET_SEQ
        Packet.PACKET_SEQ = (Packet.PACKET_SEQ + 1) % (Packet.USHRT_MAX + 1)
        self._guid = uuid.uuid4().bytes
        dt = datetime.datetime.now()
        self._timestamp_sec = int(time.mktime(dt.timetuple()))
        self._timestamp_microsec = dt.microsecond
        # self._command = kwargs['command']
        self._command_seq = kwargs.get('command_seq', 0)
        # self._subcommand = kwargs['subcommand']
        self._data_seq = kwargs.get('data_seq', 0)
        self._data_length = len(kwargs['data'])
        self._username = Packet.USERNAME  # Saman
        self._password = Packet.PASSWORD
I have made sure that I read all fields in the same order as they were written into the packet by the client program, but I still couldn't manage to solve this problem.
Do you have any idea how this could be solved?
The problem seems to be that you're converting things to str all over the place for no good reason.
In some places, like PACKER_1 = struct.Struct(str(STRUCT_FORMAT_STR_1)), it makes your code less readable and understandable, but doesn't affect the actual output. For example, STRUCT_FORMAT_STR_1 is already a str, so str(STRUCT_FORMAT_STR_1) is the same str.
But in other places, it's far worse than that. In particular, look at all the lines like Packet.PACKER_1.unpack_from(str(buf), offset). There, buf is a bytearray. (It has to be, because you assert it.) Calling str on a bytearray gives you the string representation of that bytearray. For example:
>>> b = bytearray(b'abc')
>>> len(b)
3
>>> s = str(b)
>>> s
"bytearray(b'abc')"
>>> len(s)
17
That string representation is obviously not generally going to have the same length as the actual buffer you're representing. So it's no wonder that you get errors about the length being wrong. (And if you got really unlucky and didn't have any such errors, you'd be reading garbage values instead.)
So, what should you do to convert the bytearray into something the struct module can handle? Nothing! As the docs say:
Several struct functions (and methods of Struct) take a buffer argument. This refers to objects that implement the Buffer Protocol and provide either a readable or read-writable buffer. The most common types used for that purpose are bytes and bytearray…
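In other words, you can pass the bytearray to unpack_from directly. A minimal illustration (the format string and buffer here are made up):
import struct

buf = bytearray(b'\x01\x00\x02\x00')          # hypothetical 4-byte payload
a, b = struct.Struct('=HH').unpack_from(buf)  # pass the bytearray itself, no str()
So dropping the str() calls, e.g. Packet.PACKER_1.unpack_from(buf, offset), should make the buffer lengths come out right.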

Is there a good regular expression for multiline matching of received SIP invites?

I really need a Python regexp which would give me this information:
Data:
Received from 1.1.1.1 18:41:51:330 (123 bytes):
INVITE: sip:dsafsdf@fsdafas.com
To: sdfasdfasdfas
From: "test"
Via: sdafsdfasdfasd

Sent from 1.1.1.1 18:42:51:330 (123 bytes):
INVITE: sip:dsafsdf@fsdafas.com
From: "test"
To: sdfasdfasdfas
Via: sdafsdfasdfasd

Received from 1.1.1.1 18:50:51:330 (123 bytes):
INVITE: sip:dsafsdf@fsdafas.com
Via: sdafsdfasdfasd
From: "test"
To: sdfasdfasdfas
What I need to achieve is to find the newest INVITE that was "Received", in order to get the From: header value. So I am searching the data backwards.
Is it possible with a single regexp? :)
Thanks.
One-line answer, assuming you suck the entire header into a string with embedded newlines (or cr/nl's):
sorted(re.findall("Received [^\r\n]+ (\d{2}:\d{2}:\d{2}:\d{3})[^\"]+From: \"([^\r\n]+)\"", data))[-1][1]
The trick to doing it with one RE is using [^\r\n] instead of . when you want to scan over stuff. This works assuming the From string always has the double quotes. The double quotes are used to keep the scanner from swallowing the entire string at the first Received... ;)
I do not think a single regular expression is the answer. I think a stateful line-by-line matcher is what you're looking for here.
import re
import collections

_msg_start_re = re.compile(r'^(Received|Sent)\s+from\s+(\S.*):\s*$')
_msg_field_re = re.compile(r'^([A-Za-z](?:(?:\w|-)+)):\s+(\S(?:.*\S)?)\s*$')

def message_parser():
    hdr = None
    fields = collections.defaultdict(list)
    msg = None
    while True:
        if msg is not None:
            line = (yield msg)
            msg = None
            hdr = None
            fields = collections.defaultdict(list)
        else:
            line = (yield None)
        if hdr is None:
            hdr_match = _msg_start_re.match(line)
            hdr = None if hdr_match is None else hdr_match.groups()
        elif len(fields) <= 0:
            field_match = _msg_field_re.match(line)
            if field_match is not None:
                fields[field_match.group(1)].append(field_match.group(2))
        else:  # Waiting for the end of the message
            if line.strip() == '':
                msg = (hdr, dict(fields))
            else:
                field_match = _msg_field_re.match(line)
                fields[field_match.group(1)].append(field_match.group(2))
Example of use:
parser = message_parser()
parser.next()
recvd_invites = [msg for msg in (parser.send(line) for line in linelst) \
                 if (msg is not None) and \
                    (msg[0][0] == 'Received') and \
                    ('INVITE' in msg[1])]
You might be able to do this with a multi-line regex, but if you do it this way you get the message nicely parsed into its various fields. Presumably you want to do something interesting with the messages, and this will let you do a whole lot more with them without having to use more regexps.
This also allows you to parse something other than an already existing file or a giant string with all the messages in it. For example, if you want to parse the output of a pipe that's printing out these requests as they happen, you can simply do msg = parser.send(line) every time you receive a line and get a new message out as soon as it's all been printed (if the line isn't the end of a message then msg will be None).
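A small sketch of that streaming usage (the log file name is hypothetical, and the print is a stand-in for whatever consumes the parsed messages):
import subprocess

parser = message_parser()
parser.next()
# Hypothetical line source: follow a log file as new requests are appended to it.
proc = subprocess.Popen(['tail', '-f', 'sip.log'], stdout=subprocess.PIPE)
for line in iter(proc.stdout.readline, ''):
    msg = parser.send(line)
    if msg is not None:
        print msg  # a complete message: ((direction, source), fields)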

Python - seek in http response stream

Using urllib (or urllib2), getting what I want looks hopeless.
Any solution?
I'm not sure how the C# implementation works, but, as internet streams are generally not seekable, my guess would be it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggested and write the data to a file or StringIO and seek from there.
However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.
The Range header
When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieving all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
Server support
There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC2616) along with a response to report if they support ranges or not. This could be checked using a HEAD request. However, there is no particular need to do this; if a server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
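For example, a quick sketch of such a check with urllib2 (using python.org purely as an illustration; urllib2 has no built-in HEAD support, so the request method is overridden):
import urllib2

class HeadRequest(urllib2.Request):
    # Make urllib2 issue a HEAD request instead of a GET.
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest("http://www.python.org/"))
print response.headers.get("Accept-Ranges", "not advertised")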
Checking if a range is returned
If a server returns a range, it must send the Content-Range header (section 14.16 of RFC2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
Implementation with urllib2
urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range rather than the entire page. The following script takes a URL, a start position, and (optionally) a length on the command line, and tries to retrieve the given section of the page.
import sys
import urllib2

# Check command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    start, length = map(int, sys.argv[2:])
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("range", "bytes=%s-" % sys.argv[2])

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this.
    range, total = response.headers['content-range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % range
    else:
        print "Bytes %s of a total of %s were retrieved." % (range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)
Using this, I can retrieve the final 2,000 bytes of the Python homepage:
blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes
Or 400 bytes from the middle of the homepage:
blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes
However, the Google homepage does not support ranges:
blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes
In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
It may work best just to write the data to a file (or even to a string, using StringIO), and to seek in that file (or string).
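A small sketch of that approach, using io.BytesIO (rather than StringIO) so it also works for binary data; the URL is just an example:
import urllib2
from io import BytesIO

# Download the whole response once, then seek freely in the in-memory copy.
buf = BytesIO(urllib2.urlopen("http://www.python.org/").read())
buf.seek(-2000, 2)  # jump to 2000 bytes before the end
tail = buf.read()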
I did not find any existing implementations of a file-like interface with seek() to HTTP URLs, so I rolled my own simple version: https://github.com/valgur/pyhttpio. It depends on urllib.request but could probably easily be modified to use requests, if necessary.
The full code:
import cgi
import time
import urllib.request
from io import IOBase
from sys import stderr


class SeekableHTTPFile(IOBase):
    def __init__(self, url, name=None, repeat_time=-1, debug=False):
        """Allow a file accessible via HTTP to be used like a local file by utilities
        that use `seek()` to read arbitrary parts of the file, such as `ZipFile`.
        Seeking is done via the 'range: bytes=xx-yy' HTTP header.

        Parameters
        ----------
        url : str
            A HTTP or HTTPS URL
        name : str, optional
            The filename of the file.
            Will be filled from the Content-Disposition header if not provided.
        repeat_time : int, optional
            In case of HTTP errors wait `repeat_time` seconds before trying again.
            Negative value or `None` disables retrying and simply passes on the exception (the default).
        """
        super().__init__()
        self.url = url
        self.name = name
        self.repeat_time = repeat_time
        self.debug = debug
        self._pos = 0
        self._seekable = True
        with self._urlopen() as f:
            if self.debug:
                print(f.getheaders())
            self.content_length = int(f.getheader("Content-Length", -1))
            if self.content_length < 0:
                self._seekable = False
            if f.getheader("Accept-Ranges", "none").lower() != "bytes":
                self._seekable = False
            if name is None:
                header = f.getheader("Content-Disposition")
                if header:
                    value, params = cgi.parse_header(header)
                    self.name = params["filename"]

    def seek(self, offset, whence=0):
        if not self.seekable():
            raise OSError
        if whence == 0:
            self._pos = 0
        elif whence == 1:
            pass
        elif whence == 2:
            self._pos = self.content_length
        self._pos += offset
        return self._pos

    def seekable(self, *args, **kwargs):
        return self._seekable

    def readable(self, *args, **kwargs):
        return not self.closed

    def writable(self, *args, **kwargs):
        return False

    def read(self, amt=-1):
        if self._pos >= self.content_length:
            return b""
        if amt < 0:
            end = self.content_length - 1
        else:
            end = min(self._pos + amt - 1, self.content_length - 1)
        byte_range = (self._pos, end)
        self._pos = end + 1
        with self._urlopen(byte_range) as f:
            return f.read()

    def readall(self):
        return self.read(-1)

    def tell(self):
        return self._pos

    def __getattribute__(self, item):
        attr = object.__getattribute__(self, item)
        if not object.__getattribute__(self, "debug"):
            return attr

        if hasattr(attr, '__call__'):
            def trace(*args, **kwargs):
                a = ", ".join(map(str, args))
                if kwargs:
                    a += ", ".join(["{}={}".format(k, v) for k, v in kwargs.items()])
                print("Calling: {}({})".format(item, a))
                return attr(*args, **kwargs)

            return trace
        else:
            return attr

    def _urlopen(self, byte_range=None):
        header = {}
        if byte_range:
            header = {"range": "bytes={}-{}".format(*byte_range)}
        while True:
            try:
                r = urllib.request.Request(self.url, headers=header)
                return urllib.request.urlopen(r)
            except urllib.error.HTTPError as e:
                if self.repeat_time is None or self.repeat_time < 0:
                    raise
                print("Server responded with " + str(e), file=stderr)
                print("Sleeping for {} seconds before trying again".format(self.repeat_time), file=stderr)
                time.sleep(self.repeat_time)
A potential usage example:
from zipfile import ZipFile

url = "https://www.python.org/ftp/python/3.5.0/python-3.5.0-embed-amd64.zip"
f = SeekableHTTPFile(url, debug=True)
zf = ZipFile(f)
zf.printdir()
zf.extract("python.exe")
Edit: There is actually a mostly identical, if slightly more minimal, implementation in this answer: https://stackoverflow.com/a/7852229/2997179
