Sending data chunks over named pipe in linux - python

I want to send data blocks over a named pipe and want the receiver to know where each data block ends. How should I do this with named pipes? Should I use some format for joining and splitting blocks (treat the pipe always as a stream of bytes), or is there some other method?
I've tried opening and closing the pipe at the sender for every data block, but the data becomes concatenated at the receiver side (EOF is not sent):
for _ in range(2):
    with open('myfifo', 'bw') as f:
        f.write(b'+')
Result:
rsk#fe temp $ cat myfifo
++rsk#fe temp $

You can either use some sort of delimiter or frame structure over your pipes, or (preferably) use multiprocessing.Pipe-like objects and run pickled Python objects through them.
The first option simply means defining a protocol to run over your pipe. Add a header to each chunk of data you send so that you know what to do with it. For instance, use a length-value scheme:
import struct

def send_data(file_descriptor, data):
    # Prefix each chunk with its length as a 4-byte big-endian integer
    length = struct.pack('>L', len(data))
    file_descriptor.write(length + data)

def read_data(file_descriptor):
    binary_length = file_descriptor.read(4)
    length = struct.unpack('>L', binary_length)[0]
    data = b''
    while len(data) < length:
        data += file_descriptor.read(length - len(data))
    return data
As for the other option - you can try reading the code of the multiprocessing module, but essentially you just write the result of pickle.dumps (cPickle in Python 2) into the pipe and then pass what you read to pickle.loads to get Python objects back.
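For instance, here is a minimal sketch of that idea built on the send_data/read_data helpers above (assuming Python 3, where pickle replaces cPickle):
import pickle

def send_object(file_descriptor, obj):
    # Serialize the object and frame it with the length prefix from above
    send_data(file_descriptor, pickle.dumps(obj))

def recv_object(file_descriptor):
    # Read one framed chunk and deserialize it back into a Python object
    return pickle.loads(read_data(file_descriptor))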

I would just use lines of JSON-encoded data. They are easy to debug and the performance is reasonable.
For an example on reading and writing lines:
http://www.tutorialspoint.com/python/file_writelines.htm
For an example of using ujson (UltraJSON):
https://pypi.python.org/pypi/ujson
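A minimal sketch of the line-delimited JSON idea, using the standard json module (ujson is a drop-in replacement) and the 'myfifo' pipe from the question as an assumed path:
import json

# Writer: one JSON document per line marks one block
with open('myfifo', 'w') as f:
    for record in [{'id': 1, 'msg': 'hello'}, {'id': 2, 'msg': 'world'}]:
        f.write(json.dumps(record) + '\n')

# Reader (in another process): each line is one complete block
with open('myfifo', 'r') as f:
    for line in f:
        print(json.loads(line))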

In addition to the other solutions: you don't need to stick with named pipes. Named (Unix domain) sockets are no worse and provide handier features. With AF_LOCAL and SOCK_SEQPACKET, message boundaries are maintained by the kernel, so whatever is written by a single send() will be received on the opposite side by a single recv().
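A minimal sketch of that SOCK_SEQPACKET approach (the socket path is a placeholder; the receiver must be listening before the sender connects):
import socket

PATH = '/tmp/blocks.sock'  # hypothetical socket path

def receiver():
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    srv.bind(PATH)
    srv.listen(1)
    conn, _ = srv.accept()
    while True:
        block = conn.recv(4096)   # one recv() per send(); b'' means the peer closed
        if not block:
            break
        print('got block of %d bytes' % len(block))

def sender():
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    cli.connect(PATH)
    cli.send(b'+')                # each send() is delivered as one message
    cli.send(b'another block')
    cli.close()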

Related

Python does not stop reading from stdin buffer

In Node.JS, I spawn a child Python process to be piped. I want to send a UInt8Array through stdin. To let the Python side know how many bytes to read, I send the size of the buffer first. But the Python process doesn't stop reading the actual data from the buffer after the specified size, so it never terminates. I've checked that it receives bufferSize properly and converts it into an integer. When I remove size = int(input()) and python.stdin.write(bufferSize.toString() + "\n") and hardcode the buffer size instead, it works correctly. I couldn't figure out why it does not stop waiting after reading the specified number of bytes.
// Node.JS
const python_command = command.serializeBinary()
const python = spawn('test/production_tests/py_test_scripts/protocolbuffer/venv/bin/python', ['test/production_tests/py_test_scripts/protocolbuffer/command_handler.py']);
const bufferSize = python_command.byteLength
python.stdin.write(bufferSize.toString() + "\n")
python.stdin.write(python_command)
# Python
size = int(input())
data = sys.stdin.buffer.read(size)
In a nutshell, the problem arises from calling the normal input() first and then sys.stdin.buffer.read(). I guess the former conflicts with the latter and prevents it from working normally.
There are two potential problems here. The first is that the pipe between node.js and the Python script is block buffered. You won't see any data on the Python side until either a block's worth of data is filled (system dependent) or the pipe is closed. The second is that there is a decoder between input() and the byte stream coming in on stdin. This decoder is free to read ahead in the stream as it wishes, so reading sys.stdin.buffer may miss whatever happens to be buffered in the decoder.
You can solve the second problem by doing all of your reads from the buffer, as shown below. The first problem needs to be solved on the node.js side - likely by closing its subprocess stdin. You may be better off just writing the size as a binary number, say a uint64 (a sketch of that follows the code below).
import struct
import sys

# read the size - assuming it's coming in as an ascii line
size_buf = []
while True:
    c = sys.stdin.buffer.read(1)
    if c == b"\n":
        size = int(b"".join(size_buf))
        break
    size_buf.append(c)

fmt = "B"  # read unsigned char
fmtsize = struct.calcsize(fmt)
buf = [struct.unpack(fmt, sys.stdin.buffer.read(fmtsize))[0] for _ in range(size)]
print(buf)
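If you go the binary-size route mentioned above, the Python side could look like this sketch. It assumes the sender writes the payload length as an 8-byte big-endian integer (uint64) immediately before the payload:
import struct
import sys

raw = sys.stdin.buffer.read(8)         # exactly 8 bytes: the uint64 length
size = struct.unpack(">Q", raw)[0]
payload = sys.stdin.buffer.read(size)  # then exactly `size` bytes of data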

Constantly monitoring a TCP streaming feed for data in Python

I am writing a Python script in Python 3.5. I have a host and port, and I am trying to make the script constantly monitor the provided host for data. The data is distributed through a TCP streaming feed in an XML format, with a tag marking the start and end of each event.
So what I am trying to do is basically monitor the TCP feed for new events, which are marked between an XML start and end tag, then retrieve each event and handle it accordingly in my script. Ideally I would need access to new data in the feed within milliseconds.
The feed is a government feed that distributes alerts; the host is streaming1.naad-adna.pelmorex.com and the port is 8080. I want to monitor this feed for new alerts and then be able to access them and handle them accordingly in Python. The feed sends a heartbeat every minute to indicate the connection is alive.
I believe the best option would be to make use of sockets, though I am unsure how to implement them in this specific use case. I do not have much experience with TCP feeds, and I was unable to find much online about handling a TCP feed in Python for my use case. I am able to handle the XML once I figure out how to pull it from the TCP feed.
Any help would be appreciated.
There are several technical challenges presented in your question.
First, is the simple matter of connecting to the server and retrieving the data. As you can see in connect() below, that is pretty simple, just create a socket (s = socket.socket()) and connect it (s.connect(('hostname', port_number))).
The next problem is retrieving the data in a useful form. The socket natively provides .recv(), but I wanted something with a file-like interface. The socket module provides a method unique to Python: .makefile(). (return s.makefile('rb'))
Now we get to the hard part. XML documents are typically stored one document per file, or one document per TCP transmission, so the end of a document is easily discovered by an end-of-file indication or by a Content-Length: header. Consequently, none of the Python XML APIs have a mechanism for dealing with multiple XML documents in one file or in one string. I wrote xml_partition() to solve that problem. xml_partition() consumes data from a file-like object and yields each XML document from the stream. (Note: the XML documents must be pressed together; no whitespace is allowed after the final >.)
Finally, there is a short test program (alerts()) which connects to the stream and reads a few of the XML documents, storing each into its own file.
Here, in its entirety, is a program for downloading emergency alerts from the National Alert Aggregation & Dissemination System from Pelmorex.
import socket
import xml.etree.ElementTree as ET

def connect():
    'Connect to pelmorex data stream and return a file-like object'
    # Set up the socket
    s = socket.socket()
    s.connect(('streaming1.naad-adna.pelmorex.com', 8080))
    return s.makefile('rb')

# We have to consume the XML data in bits and pieces
# so that we can stop precisely at the boundary between
# streamed XML documents. This function ensures that
# nothing follows a '>' in any XML fragment.
def partition(s, pattern):
    'Consume a file-like object, and yield parts defined by pattern'
    data = s.read(2048)
    while data:
        left, middle, data = data.partition(pattern)
        while left or middle:
            yield left
            yield middle
            left, middle, data = data.partition(pattern)
        data = s.read(2048)

# Split the incoming XML stream into fragments (much smaller
# than an XML document.) The end of each XML document
# is guaranteed to align with the end of a fragment.
# Use an XML parser to determine the actual end of
# a document. Whenever the parser signals the end
# of an XML document, yield what we have so far and
# start a new parser.
def xml_partition(s):
    'Read multiple XML documents from one data stream'
    parser = None
    for part in partition(s, b'>'):
        if parser is None:
            parser = ET.XMLPullParser(['start', 'end'])
            starts = ends = 0
            xml = []
        xml.append(part)
        parser.feed(part)
        for event, elem in parser.read_events():
            starts += event == "start"
            ends += event == "end"
        if starts == ends > 0:
            # We have reached the end of the XML doc
            parser.close()
            parser = None
            yield b''.join(xml)

# Typical usage:
def alerts():
    for i, xml in enumerate(xml_partition(connect())):
        # The XML is a bytes object that contains the undecoded
        # XML stream. You'll probably want to parse it and
        # somehow display the alert.
        # I'm just saving it to a file.
        with open('alert%d.xml' % i, 'wb') as fp:
            fp.write(xml)
        if i == 3:
            break

def test():
    # A test function that uses multiple XML documents in one
    # file. This avoids the wait for a natural-disaster alert.
    with open('multi.xml', 'rb') as fp:
        print(list(xml_partition(fp)))

alerts()

Parsing the output of a subprocess while executing and clearing the memory (Python 2.7)

I need to parse the output produced by an external program (third party, I have no control over it) which produces large amounts of data. Since the size of the output greatly exceeds the available memory, I would like to parse the output while the process is running and remove the already-processed data from memory.
So far I do something like this:
import subprocess

p_pre = subprocess.Popen("preprocessor", stdout=subprocess.PIPE)
# preprocessor is an external bash script that produces the input for the third-party software
p_3party = subprocess.Popen("thirdparty", stdin=p_pre.stdout, stdout=subprocess.PIPE)

(data_to_parse, can_be_thrown) = p_3party.communicate()
parsed_data = myparser(data_to_parse)
When the "thirdparty" output is small enough, this approach works. But as stated in the Python documentation:
The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
I think a better approach (that could actually save me some time) would be to start processing data_to_parse while it is being produced, and, once the parsing has been done correctly, "clear" data_to_parse by removing the data that have already been parsed.
I have also tried to use a for loop like:
parsed_data = []
for i in p_3party.stdout:
    parsed_data.append(myparser(i))
but it gets stuck and I can't understand why.
So I would like to know: what is the best approach to accomplish this? What are the issues to be aware of?
You can use subprocess.Popen() to create a stream from which you read lines.
import subprocess

# cmd is a placeholder for your command, e.g. "thirdparty"
stream = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
for line in stream:
    # parse lines as you receive them.
    print line
You could pass the lines to your myparser() method, or append them to a list until you are ready to use them.. whatever.
In your case, using two sub-processes, it would work something like this:
import subprocess

def method(stream, retries=3):
    while retries > 0:
        line = stream.readline()
        if line:
            yield line
        else:
            retries -= 1

pre_stream = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout
stream = subprocess.Popen(cmd, stdin=pre_stream, stdout=subprocess.PIPE).stdout

for parsed in method(stream):
    # do what you want with the parsed data.
    parsed_data.append(parsed)
Iterating over a file as in for i in p_3party.stdout: uses a read-ahead buffer. The readline() method may be more reliable with a pipe -- AFAIK it reads character by character.
while True:
    line = p_3party.stdout.readline()
    if not line:
        break
    parsed_data.append(myparser(line))

Writing raw IP data to an interface (linux)

I have a file which contains raw IP packets in binary form. The data in the file contains a full IP header, TCP/UDP header, and data. I would like to use any language (preferably Python) to read this file and dump the data onto the line.
In Linux I know you can write to some devices directly (echo "DATA" > /dev/device_handle). Would using Python to open /dev/eth1 achieve the same effect (i.e. could I do echo "DATA" > /dev/eth1)?
Something like:
#!/usr/bin/env python
import socket

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.bind(("ethX", 0))

blocksize = 100
with open('filename.txt', 'rb') as fh:
    while True:
        block = fh.read(blocksize)
        if not block:
            break  # EOF
        s.send(block)
Should work, haven't tested it however.
ethX needs to be changed to your interface (e.g. eth1, eth2, wlan1, etc.)
You may want to play around with blocksize. 100 bytes at a time should be fine, you may consider going up but I'd stay below the 1500 byte Ethernet PDU.
It's possible you'll need root/sudoer permissions for this. I've needed them before when reading from a raw socket, never tried simply writing to one.
This is provided that you literally have the packet (and only the packet) dumped to the file, not in any sort of encoding (e.g. hex): if a byte is 0x30 it should be '0' in your file, not "0x30", "30" or anything like that. If this is not the case you'll need to replace the while loop with some processing, but the send is still the same.
Since I just read that you're trying to send IP packets: in this case it's also likely that you need to build the entire packet at once and then push that to the socket. The simple while loop won't be sufficient.
No; there is no /dev/eth1 device node -- network devices are in a different namespace from character/block devices like terminals and hard drives. You must create an AF_PACKET socket to send raw IP packets.

pyserial.readline() with python 2.7

I am using python 2.7.2 with pyserial 2.6.
What is the best way to use pyserial.readline() when talking to a device that has a character other than "\n" for eol? The pyserial doc points out that pyserial.readline() no longer takes an 'eol=' argument in python 2.6+, but recommends using io.TextIOWrapper as follows:
ser = serial.serial_for_url('loop://', timeout=1)
sio = io.TextIOWrapper(io.BufferedRWPair(ser, ser))
However the python io.BufferedRWPair doc specifically warns against that approach, saying "BufferedRWPair does not attempt to synchronize accesses to its underlying raw streams. You should not pass it the same object as reader and writer; use BufferedRandom instead."
Could someone point to a working example of pyserial.readline() working with an eol other than '\n'?
Thanks,
Tom
read() has a user-settable maximum size for the data it reads (in bytes). If your data strings are a predictable length, you could simply set that to capture a fixed-length string. It's sort of 'Kentucky windage' in execution, but so long as your data strings are consistent in size it won't bork.
Beyond that, your real option is to capture and write the data stream to another file and split out your entries manually/programmatically.
For example, you could write your data stream to a .csv file and adjust the delimiter variable to be your EOL character.
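A minimal sketch of that fixed-length idea, assuming the device sends records of exactly 12 bytes (the port name and baud rate are placeholders):
import serial

ser = serial.Serial('/dev/ttyUSB0', 9600, timeout=1)  # placeholder port/baud
record = ser.read(12)  # blocks until 12 bytes arrive or the timeout expires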
Assume s is an open serial.Serial object and the newline character is \r.
Then this will read to end of line and return everything up to '\r' as a string.
def read_value(encoded_command):
    s.write(encoded_command)
    temp = ''
    response = ''
    while '\r' not in response:
        response = s.read().decode()
        temp = temp + response
    return temp
And, BTW, I implemented the io.TextIOWrapper recommendation you talked about above, and it 1) is much slower and 2) somehow makes the port close.
