I am writing a script in Python 3.5. I have a host and port, and I want the script to constantly monitor the provided host for data. The data is distributed through a TCP streaming feed in XML format, with tags marking the start and end of each event.
What I am trying to do is monitor the TCP feed for new events, which are delimited by an XML start and end tag, then retrieve each event and handle it accordingly in my script. Ideally I would need access to new data in the feed within milliseconds.
The feed is a government feed that distributes alerts. The host is streaming1.naad-adna.pelmorex.com and the port is 8080. I want to monitor this feed for new alerts and then access and handle the alerts in Python. The feed sends a heartbeat every minute to indicate the connection is alive.
I believe the best option is to use sockets, though I am unsure how to implement them for this specific use case. I do not have much experience with TCP feeds, and I was unable to find much online about handling a TCP feed in Python for my use case. I am able to handle the XML once I figure out how to pull it from the TCP feed.
Any help would be appreciated.
There are several technical challenges presented in your question.
First is the simple matter of connecting to the server and retrieving the data. As you can see in connect() below, that is pretty simple: just create a socket (s = socket.socket()) and connect it (s.connect(('hostname', port_number))).
The next problem is retrieving the data in a useful form. The socket natively provides .recv(), but I wanted something with a file-like interface. The socket module provides a method unique to Python: .makefile(). (return s.makefile('rb'))
Now we get to the hard part. XML documents are typically stored one document per file, or one document per TCP transmission, so the end of a document is easily detected by an end-of-file indication or by a Content-Length: header. Consequently, none of the Python XML APIs has a mechanism for dealing with multiple XML documents in one file or in one string. I wrote xml_partition() to solve that problem. xml_partition() consumes data from a file-like object and yields each XML document from the stream. (Note: the XML documents must be pressed together; no whitespace is allowed after the final >.)
Finally, there is a short test program (alerts()) which connects to the stream and reads a few of the XML documents, storing each into its own file.
Here, in its entirety, is a program for downloading emergency alerts from the National Alert Aggregation & Dissemination System from Pelmorex.
import socket
import xml.etree.ElementTree as ET

def connect():
    'Connect to pelmorex data stream and return a file-like object'
    # Set up the socket
    s = socket.socket()
    s.connect(('streaming1.naad-adna.pelmorex.com', 8080))
    return s.makefile('rb')

# We have to consume the XML data in bits and pieces
# so that we can stop precisely at the boundary between
# streamed XML documents. This function ensures that
# nothing follows a '>' in any XML fragment.
def partition(s, pattern):
    'Consume a file-like object, and yield parts defined by pattern'
    data = s.read(2048)
    while data:
        left, middle, data = data.partition(pattern)
        while left or middle:
            yield left
            yield middle
            left, middle, data = data.partition(pattern)
        data = s.read(2048)

# Split the incoming XML stream into fragments (much smaller
# than an XML document). The end of each XML document
# is guaranteed to align with the end of a fragment.
# Use an XML parser to determine the actual end of
# a document. Whenever the parser signals the end
# of an XML document, yield what we have so far and
# start a new parser.
def xml_partition(s):
    'Read multiple XML documents from one data stream'
    parser = None
    for part in partition(s, b'>'):
        if parser is None:
            parser = ET.XMLPullParser(['start', 'end'])
            starts = ends = 0
            xml = []
        xml.append(part)
        parser.feed(part)
        for event, elem in parser.read_events():
            starts += event == "start"
            ends += event == "end"
        if starts == ends > 0:
            # We have reached the end of the XML doc
            parser.close()
            parser = None
            yield b''.join(xml)

# Typical usage:
def alerts():
    for i, xml in enumerate(xml_partition(connect())):
        # The XML is a bytes object that contains the undecoded
        # XML stream. You'll probably want to parse it and
        # somehow display the alert.
        # I'm just saving it to a file.
        with open('alert%d.xml' % i, 'wb') as fp:
            fp.write(xml)
        if i == 3:
            break

def test():
    # A test function that uses multiple XML documents in one
    # file. This avoids the wait for a natural-disaster alert.
    with open('multi.xml', 'rb') as fp:
        print(list(xml_partition(fp)))

alerts()
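If you would rather parse each alert than save it to disk, the yielded bytes can be fed straight to ElementTree. A minimal sketch (which tags you actually see depends on the CAP schema the feed uses, which I'm not reproducing here):
import xml.etree.ElementTree as ET

for i, xml in enumerate(xml_partition(connect())):
    root = ET.fromstring(xml)    # parse one complete document
    print(root.tag)              # heartbeats and alerts arrive on the same stream
    if i == 3:
        break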
I want to use boto3 to run a command on an ECS Fargate container which generates a lot of binary output, and stream that output into a file on my local machine.
My attempt is based on the recommendation here, and looks like this:
import json
import uuid

import boto3
import construct as c
import websocket

# Define Structs
AgentMessageHeader = c.Struct(
    "HeaderLength" / c.Int32ub,
    "MessageType" / c.PaddedString(32, "ascii"),
)

AgentMessagePayload = c.Struct(
    "PayloadLength" / c.Int32ub,
    # This only works with my test command. It won't work with my real command that returns binary data
    "Payload" / c.PaddedString(c.this.PayloadLength, "ascii"),
)

# ECS client (region/credentials come from the environment)
client = boto3.client("ecs")

# Define the container you want to talk to
cluster = "..."
task = "..."
container = "..."

# Send command with large response (large enough to span multiple messages)
result = client.execute_command(
    cluster=cluster,
    task=task,
    container=container,
    # This is a sample command that returns text. My real command returns hundreds of megabytes of binary data
    command="python -c 'for i in range(1000):\n print(i)'",
    interactive=True,
)

# Get session info
session = result["session"]

# Define initial payload
init_payload = {
    "MessageSchemaVersion": "1.0",
    "RequestId": str(uuid.uuid4()),
    "TokenValue": session["tokenValue"],
}

# Create websocket connection
connection = websocket.create_connection(session["streamUrl"])

try:
    # Send initial response
    connection.send(json.dumps(init_payload))

    while True:
        # Receive data
        response = connection.recv()

        # Decode data
        message = AgentMessageHeader.parse(response)
        payload_message = AgentMessagePayload.parse(response[message.HeaderLength:])

        if 'channel_closed' in message.MessageType:
            raise Exception('Channel closed before command output was received')

        # Print data
        print("Header:", message.MessageType)
        print("Payload Length:", payload_message.PayloadLength)
        print("Payload Message:", payload_message.Payload)
finally:
    connection.close()
This almost works, but has a problem: I can't tell when I should stop reading.
If you read the final message from AWS and call connection.recv() again, AWS seems to loop around and send you the initial data, the same data you would have received the first time you called connection.recv().
One semi-hacky way to try to deal with this is by adding an end marker to the command. Sort of like:
result = client.execute_command(
    ...,
    command="""bash -c "python -c 'for i in range(1000):\n print(i)'; echo -n "=== END MARKER ===""""",
)
This idea works, but it becomes really awkward to use properly. There's always a chance that the end-marker text gets split across two messages, and dealing with that becomes a pain: you can no longer write a payload to disk immediately, because you first have to verify that the end of that payload combined with the beginning of the next one isn't your end marker.
Another hacky way is to checksum the first payload and every subsequent payload, comparing each checksum to the checksum of the first payload. That tells you whether you've looped around. Unfortunately, this has a chance of a collision if the binary data in two messages happens to repeat, although in practice the chance of that is probably slim.
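For what it's worth, here is a rough sketch of that checksum idea, reusing connection and AgentMessageHeader from the code above. The output file name, the SHA-256 choice, and the way the raw payload bytes are sliced out are my own assumptions, not something the agent protocol documents:
import hashlib

first_digest = None
with open('output.bin', 'wb') as out:                    # hypothetical output file
    while True:
        response = connection.recv()
        message = AgentMessageHeader.parse(response)
        if 'channel_closed' in message.MessageType:
            break
        # Slice out the raw payload bytes instead of using the ascii PaddedString
        length = int.from_bytes(response[message.HeaderLength:message.HeaderLength + 4], 'big')
        payload = response[message.HeaderLength + 4:message.HeaderLength + 4 + length]
        digest = hashlib.sha256(payload).hexdigest()
        if first_digest is None:
            first_digest = digest                         # remember the first payload
        elif digest == first_digest:
            break                                         # we appear to have looped around
        out.write(payload)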
Is there a simpler way to determine when to stop reading?
Or better yet, a simpler way to have boto3 give me a stream of binary data from the command I ran?
I am using TCP with Python sockets, transferring data from one computer to another. However, the recv call reads more than it should on the server side, and I could not find the issue.
client.py
while rval:
    image_string = frame.tostring()
    sock.sendall(image_string)
    rval, frame = vc.read()
server.py
while True:
    image_string = ""
    while len(image_string) < message_size:
        data = conn.recv(message_size)
        image_string += data
The length of each message is 921600 (message_size), so it is sent with sendall. However, when I print the length of the received messages on the server, the lengths are sometimes wrong and sometimes correct.
921600
921600
921923 # wrong
922601 # wrong
921682 # wrong
921600
921600
921780 # wrong
As you can see, the wrong arrivals have no pattern. Since I am using TCP, I expected more consistency; however, it seems the buffers are getting mixed up and the server is somehow receiving part of the next message, producing a longer message. What is the issue here?
I tried to add just the relevant part of the code, I can add more if you wish, but the code performs well on localhost but fails on two computers, so there should be no errors besides the transmitting part.
Edit 1: I looked into this question a bit; it mentions that the data from several send calls in the client may not be received by a single recv in the server, but I could not understand how to apply this in practice.
TCP is a stream protocol. There is ABSOLUTELY NO CONNECTION between the sizes of the chunks of data you send, and the chunks of data you receive. If you want to receive data of a known size, it's entirely up to you to only request that much data: you're currently requesting the total length of the data each time, which is going to try to read too much except in the unlikely event of the entire data being retrieved by the first .recv() call. Basically, you need to do something like data = conn.recv(message_size - len(image_string)) to reflect the fact that the amount of remaining data is decreasing.
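A minimal sketch of what the corrected server loop might look like, reusing conn and message_size from your code (untested):
image_string = b""
while len(image_string) < message_size:
    # Only ask for the bytes that are still missing from this frame
    data = conn.recv(message_size - len(image_string))
    if not data:
        raise RuntimeError("socket closed before a full frame arrived")
    image_string += data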
Think of TCP as a raw stream of bytes. It is your responsibility to track where you are in the stream and interpret it correctly. Buffer what you read and only extract what you currently need.
Here's an (untested) class to illustrate:
class Buffer:
    def __init__(self, socket):
        self.socket = socket
        self.buffer = b''

    def recv_exactly(self, count):
        # Could return less if socket closes early...
        while len(self.buffer) < count:
            data = self.socket.recv(4096)
            if not data:
                break
            self.buffer += data
        ret, self.buffer = self.buffer[:count], self.buffer[count:]
        return ret
The recv always requests the same amount of data and queues it in a buffer. recv_exactly only returns the number of bytes requested and leaves any extra in the buffer.
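Hypothetical usage with the conn and message_size from the question:
buf = Buffer(conn)
while True:
    image_string = buf.recv_exactly(message_size)
    if len(image_string) < message_size:    # the sender closed the connection
        break
    # process one complete frame here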
I want to read specific bytes from a remote file using a Python module. I am using urllib2. By specific bytes I mean bytes given as an offset and a size. I know we can read X bytes from a remote file using urlopen(link).read(X). Is there any way to read data that starts at Offset and has length Size?
def readSpecificBytes(link, Offset, size):
    # code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2

def get_part_of_url(link, start_byte, end_byte):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
    resp = urllib2.urlopen(req)
    content = resp.read()
    return content
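For example (with a made-up URL), HTTP ranges are inclusive, so this asks for bytes 100 through 199:
data = get_part_of_url('http://example.com/big.bin', 100, 199)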
Note that this approach means the server never has to send, and you never download, the data you don't need, which can save a lot of bandwidth if you only want a small amount of data from a large file.
When it doesn't work, just read and discard the first Offset bytes before reading the rest.
See Wikipedia Article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
import urllib2

def readSpecificBytes(link, Offset, size):
    f = urllib2.urlopen(link)
    if Offset > 0:
        f.read(Offset)    # read and discard the first Offset bytes
    return f.read(size)
I want to send data blocks over a named pipe and want the receiver to know where each data block ends. How should I do this with named pipes? Should I use some format for joining and splitting blocks (treating the pipe purely as a stream of bytes), or is there another method?
I've tried opening and closing the pipe at the sender for every data block, but the data gets concatenated on the receiver side (no EOF is delivered between blocks):
for _ in range(2):
    with open('myfifo', 'bw') as f:
        f.write(b'+')
Result:
rsk#fe temp $ cat myfifo
++rsk#fe temp $
You can either use some sort of delimiter or frame structure over your pipes, or (preferably) use multiprocessing.Pipe-like objects and run pickled Python objects through them.
The first option is simply defining a simple protocol you will be running through your pipe. Add a header to each chunk of data you send so that you know what to do with it. For instance, use a length-value system:
import struct

def send_data(file_descriptor, data):
    # Prefix each block with its length as a 4-byte big-endian integer
    length = struct.pack('>L', len(data))
    packet = length + data
    file_descriptor.write(packet)

def read_data(file_descriptor):
    binary_length = file_descriptor.read(4)
    length = struct.unpack('>L', binary_length)[0]
    data = b''
    while len(data) < length:
        data += file_descriptor.read(length - len(data))
    return data
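A sketch of how the two ends might use these helpers, assuming the sender and receiver are separate processes that open the same 'myfifo' from the question:
# In the sender process
with open('myfifo', 'wb') as f:
    send_data(f, b'+')
    send_data(f, b'some larger block of bytes')

# In the receiver process
with open('myfifo', 'rb') as f:
    print(read_data(f))    # b'+'
    print(read_data(f))    # b'some larger block of bytes'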
As for the other option: you can try reading the code of the multiprocessing module, but essentially you just run the result of cPickle.dumps through the pipe and then read it back into cPickle.loads to get Python objects.
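If both ends can be Python processes you start yourself, multiprocessing.Pipe already does the framing and pickling for you; a minimal sketch:
from multiprocessing import Process, Pipe

def sender(conn):
    conn.send({'block': 1, 'payload': b'+'})    # any picklable object
    conn.send({'block': 2, 'payload': b'+'})
    conn.close()

if __name__ == '__main__':
    parent_end, child_end = Pipe()
    p = Process(target=sender, args=(child_end,))
    p.start()
    print(parent_end.recv())    # each recv() returns exactly one sent object
    print(parent_end.recv())
    p.join()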
I would just use lines of JSON-encoded data. These are easy to debug and the performance is reasonable.
For an example on reading and writing lines:
http://www.tutorialspoint.com/python/file_writelines.htm
For an example of using ujson (UltraJSON):
https://pypi.python.org/pypi/ujson
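For illustration, a sketch using the standard json module and the 'myfifo' pipe from the question, with one JSON document per line (the sender and receiver are separate processes):
import json

# In the sender process: one block per line
with open('myfifo', 'w') as f:
    for block in [{'seq': 1, 'data': '+'}, {'seq': 2, 'data': '+'}]:
        f.write(json.dumps(block) + '\n')

# In the receiver process: every line is one complete block
with open('myfifo') as f:
    for line in f:
        block = json.loads(line)
        print(block['seq'], block['data'])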
In addition to the other solutions, you don't have to stick with named pipes. Named (Unix domain) sockets are no worse and provide handier features. With AF_LOCAL and SOCK_SEQPACKET, message boundaries are maintained by the kernel, so whatever is written by a single send() is received on the opposite side by a single recv().
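A sketch of what that looks like (Python spells AF_LOCAL as AF_UNIX; the socket path is made up, and the two halves would run in separate processes):
import os
import socket

ADDRESS = '/tmp/blocks.sock'    # hypothetical path

# Receiver process
if os.path.exists(ADDRESS):
    os.unlink(ADDRESS)
server = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
server.bind(ADDRESS)
server.listen(1)
conn, _ = server.accept()
while True:
    block = conn.recv(65536)    # one recv() returns exactly one send()
    if not block:               # empty result: the sender closed the connection
        break
    print('received block of', len(block), 'bytes')

# Sender process
client = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
client.connect(ADDRESS)
client.send(b'+')               # each send() is delivered as one message
client.send(b'+')
client.close()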
I have a file which contains raw IP packets in binary form. The data in the file contains a full IP header, TCP/UDP header, and data. I would like to use any language (preferably Python) to read this file and dump the data onto the wire.
In Linux I know you can write to some devices directly (echo "DATA" > /dev/device_handle). Would using Python to open /dev/eth1 achieve the same effect (i.e. could I do echo "DATA" > /dev/eth1)?
Something like:
#!/usr/bin/env python
import socket

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.bind(("ethX", 0))

blocksize = 100
with open('filename.txt', 'rb') as fh:
    while True:
        block = fh.read(blocksize)
        if not block:
            break    # EOF
        s.send(block)
Should work, haven't tested it however.
ethX needs to be changed to your interface (e.g. eth1, eth2, wlan1, etc.)
You may want to play around with blocksize: 100 bytes at a time should be fine, and you could go higher, but I'd stay below the 1500-byte Ethernet PDU.
It's possible you'll need root/sudoer permissions for this. I've needed them before when reading from a raw socket, never tried simply writing to one.
This is provided that you literally have the packet (and only the packet) dumped to the file, not in any sort of encoding (e.g. hex). If a byte is 0x30 it should be '0' in your file, not "0x30", "30" or anything like that. If this is not the case, you'll need to replace the while loop with some processing, but the send is still the same.
Since I just read that you're trying to send IP packets: in this case, it's also likely that you need to build each entire packet at once and push it to the socket in a single send; the simple fixed-size while loop won't be sufficient.
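For instance, one way to send whole packets is the rough, untested sketch below. It reuses the socket s from above and assumes the file is nothing but IPv4 packets laid end to end, taking each packet boundary from the total-length field in the IPv4 header; depending on how the dump was captured, an AF_PACKET socket may also expect link-layer headers in front of each packet.
import struct

with open('filename.txt', 'rb') as fh:
    data = fh.read()

offset = 0
while offset < len(data):
    # Bytes 2-3 of the IPv4 header hold the packet's total length (big-endian)
    total_length = struct.unpack('!H', data[offset + 2:offset + 4])[0]
    packet = data[offset:offset + total_length]
    s.send(packet)    # push one complete packet per send
    offset += total_length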
No; there is no /dev/eth1 device node -- network devices are in a different namespace from character/block devices like terminals and hard drives. You must create an AF_PACKET socket to send raw IP packets.