Using twisted to process files - python

I'm trying to set up a twisted xmlrpc server, which will accept files from a client, process them, and return a file and result dictionary back.
I've used Python before, but never the Twisted libraries. For my purposes security is a non-issue, and the SSH protocol seems like overkill. It also has problems on the Windows server, since termios is not available.
So all of my research points to XML-RPC being the best way to accomplish this. However, there are two methods of file transfer available: embedding the file as XML-RPC binary data, or sending it with a plain HTTP request.
Files can be up to a few hundred megs either way, so which method should I use? Sample code is appreciated, since I could find no documentation for file transfers over XML-RPC with Twisted.
Update:
So it seems that serializing the file with xmlrpclib.Binary does not work for large files, or I'm using it wrong. Test code below:
from twisted.web import xmlrpc, server

class Example(xmlrpc.XMLRPC):
    """
    An example object to be published.
    """

    def xmlrpc_echo(self, x):
        """
        Return all passed args.
        """
        return x

    def xmlrpc_add(self, a, b):
        """
        Return sum of arguments.
        """
        return a + b

    def xmlrpc_fault(self):
        """
        Raise a Fault indicating that the procedure should not be used.
        """
        raise xmlrpc.Fault(123, "The fault procedure is faulty.")

    def xmlrpc_write(self, f, location):
        with open(location, 'wb') as fd:
            fd.write(f.data)

if __name__ == '__main__':
    from twisted.internet import reactor
    r = Example(allowNone=True)
    reactor.listenTCP(7080, server.Site(r))
    reactor.run()
And the client code:
import xmlrpclib

s = xmlrpclib.Server('http://localhost:7080/')
with open('test.pdf', 'rb') as fd:
    f = xmlrpclib.Binary(fd.read())
s.write(f, 'output.pdf')
I get xmlrpclib.Fault: <Fault 8002: "Can't deserialize input: "> when I test this. Is it because the file is a pdf?

XML-RPC is a poor choice for file transfers. XML-RPC requires the file content to be encoded in a way that XML supports. This is expensive in both runtime costs and network resources. Instead, try just POSTing or PUTing the file using plain old HTTP.
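For example, here is a rough sketch (my own illustration, not from the question; the upload path and output filename are made up) of accepting a large upload over plain HTTP with twisted.web, which never base64-encodes the body the way XML-RPC does:

from twisted.web import resource, server
from twisted.internet import reactor

class FileUpload(resource.Resource):
    isLeaf = True

    def render_PUT(self, request):
        # request.content is a file-like object (Twisted spools large
        # bodies to a temporary file), so copy it to disk in chunks.
        with open('uploaded.bin', 'wb') as out:
            while True:
                chunk = request.content.read(64 * 1024)
                if not chunk:
                    break
                out.write(chunk)
        return 'stored\n'

if __name__ == '__main__':
    reactor.listenTCP(7080, server.Site(FileUpload()))
    reactor.run()

A matching client needs nothing beyond the standard library; httplib streams a file object passed as the request body:

import httplib  # http.client on Python 3

conn = httplib.HTTPConnection('localhost', 7080)
with open('test.pdf', 'rb') as fd:
    conn.request('PUT', '/', fd)
print(conn.getresponse().status)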

Related

Telling python multiprocessing which pickle protocol should be used for serialization [duplicate]

How can I change the serialization method used by the Python multiprocessing library? In particular, the default serialization method uses the pickle library with the default pickle protocol version for that version of Python. The default pickle protocol is version 2 in Python 2.7 and version 3 in Python 3.6. How can I set the protocol version to 2 in Python 3.6, so I can use some of the classes (like Client and Listener) in the multiprocessing library to communicate between a server process run by Python 2.7 and a client process run by Python 3.6?
(Side note: as a test, I modified line 206 of multiprocessing/connection.py by adding protocol=2 to the dump() call to force the protocol version to 2 and my client/server processes worked in my limited testing with the server run by 2.7 and the client by 3.6).
In Python 3.6, a patch was merged to let the serializer be set, but the patch was undocumented, and I haven't figured out how to use it. Here is how I tried to use it (I posted this also to the Python ticket that I linked to):
pickle2reducer.py:
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler2(ForkingPickler):
    def __init__(self, *args):
        if len(args) > 1:
            args[1] = 2
        else:
            args.append(2)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=2):
        return ForkingPickler.dumps(obj, protocol)

def dump(obj, file, protocol=2):
    ForkingPickler2(file, protocol).dump(obj)

class Pickle2Reducer(AbstractReducer):
    ForkingPickler = ForkingPickler2
    register = ForkingPickler2.register
    dump = dump
and in my client:
import pickle2reducer
multiprocessing.reducer = pickle2reducer.Pickle2Reducer()
at the top before doing anything else with multiprocessing. I still see ValueError: unsupported pickle protocol: 3 on the server run by Python 2.7 when I do this.
I believe the patch you're referring to works if you're using a multiprocessing "context" object.
Using your pickle2reducer.py, your client should start with:
import pickle2reducer
import multiprocessing as mp
ctx = mp.get_context()
ctx.reducer = pickle2reducer.Pickle2Reducer()
And ctx has the same API as multiprocessing.
Hope that helps!
Thanks so much for this. It led me exactly to the solution I needed. I ended up doing something similar but by modifying the Connection class. It felt cleaner to me than making my own full subclass and replacing that.
from multiprocessing.connection import Connection, _ForkingPickler, Client, Listener

def send_py2(self, obj):
    self._check_closed()
    self._check_writable()
    self._send_bytes(_ForkingPickler.dumps(obj, protocol=2))

Connection.send = send_py2
This is just exactly the code from multiprocessing.connection with only the protocol=2 argument added.
I suppose you could even do the same thing by directly editing the original ForkingPickler class inside of multiprocessing.reduction.
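If you do go down that road, here is a minimal, untested sketch of the idea (my own illustration, patching the class attribute rather than subclassing; it would need to run before anything else touches multiprocessing):

import multiprocessing.reduction as reduction

# Keep a handle on the original classmethod, then force protocol 2 so a
# Python 2.7 peer can unpickle whatever multiprocessing sends.
_original_dumps = reduction.ForkingPickler.dumps

def _dumps_protocol2(obj, protocol=None):
    return _original_dumps(obj, 2)

reduction.ForkingPickler.dumps = staticmethod(_dumps_protocol2)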

RSA Digital signature verification using public key in apache nifi

Hi All,
I have a requirement where the client application is expected to send data through the REST API endpoint provided by us, passing the data as query parameters.
curl -L -X POST "http://endpointurl:port/data?result=PASS&c=ADD_RECORD&attribute1=test1&attribute2=test2&attribute3=test3&attribute4=test4&signature=62759010b083d8fcf6e7ec18e6582bc07789d6dda17efbff6f474635c63db6afcbb3a0c25cf0d4c5bb1ba0ab772124edb9ba064d1530c2848fc160546263c86a2ba0cc26dd0073bb6344a1abb7475bcb1cd9f1c2b6af750db043a3da807ca356ab2d0959719dfff28af16246ce242a71d9fc99e5c383edfa90f6426568e1b6e9f8510871e40a05f6debaa6d9eee72eb9f6e0691ec625b1b24bb49cb3840940e7f83a13cdc0022e4a8ac35866f9b74418dcbeb232962113ad765cce334f431108866753c767098c363f97c056fa5f377b04094436629e9ede71b3074766c5b7492e4d7d5f4f52af0bee1683af68bb70f3cda4fef78cf5f98ce8765fd5d0e12280"
Along with all the elements/columns of the actual data (result, attribute1, attribute2, attribute3 and attribute4), the client application also sends one additional parameter called signature, which is created by hashing the key parameters (not all the query parameters) and signing the digest with the client-side private key.
Signature: echo -n 'attribute1attribute3' | openssl sha1 -sign id_rsa -hex
The client application also provided the public key file so that we can validate the signature against the actual data before we process the records.
I am using Apache NiFi on HDP with the below high-level flow. I have also used some other processors for validations and to allow other HTTP requests.
HandleHttpRequest -->> AttributesToJSON -->> RouteOnAttribute -->> JoltTransformJSON -->> ReplaceText -->> PutKafka -->> HandleHttpResponse
Basically, I am extracting all the http.query.param values from the data posted, and if the parameter c=ADD_RECORD, I concatenate the key attributes (attribute1, attribute3), which is the actual data that should be verified against the signature value.
I tried the HashContent processor with SHA1, but the hash value I get is very short and it is not derived from the client-provided public key.
I also tried Python scripts using the Crypto package, but have not been able to verify the signature against the actual data. On top of that, I am not sure how I can call a Python script inside NiFi.
Below are the commands that I can use manually to validate the signature against the data:
echo -n 'attribute1attribute3' | openssl sha1 -sign id_rsa -hex>signature.hex
xxd -r -p signature.hex > signature.bin
echo -n 'attribute1attribute3'>keyattribute.txt
openssl dgst -sha1 -verify /tmp/test.pub -signature signature.bin keyattribute.txt
This verifies the digital signature using keyattribute.txt and signature.bin, but in my actual requirement I would be getting all of this data as query parameters.
I need help with the following:
1. Can HashContent be used to generate the signature based on the public key file? If so, I think we can use RouteOnAttribute to compare the signature with the actual value and take the necessary actions.
2. Any guidance on achieving this with a Python/Groovy/Jython script, and on how it can be called within the NiFi pipeline?
3. Is there any possibility of building a custom processor to meet this requirement?
Appreciate any help on this.
Thanks,
Vish
====================================
Hi All,
In addition to my earlier query, I finally got the Python script up and running. It takes three arguments:
1. the pub.key file location
2. the signature value (hex dump) from the client side
3. the actual concatenated key columns on which the signature was generated
and displays whether the signature matches or fails.
from __future__ import print_function, unicode_literals
import sys
import os.path
import binascii
from Crypto.PublicKey import RSA
from Crypto.Signature import PKCS1_v1_5
from Crypto.Hash import SHA

pubfile = sys.argv[1]   # public key file location
sig_hex = sys.argv[2]   # signature as a hex string
data = sys.argv[3]      # concatenated key attributes

if not os.path.isfile(pubfile):
    sys.stderr.write('public key file not found\n')
    sys.exit(1)

def verifier(pubkey, sig, data):
    rsakey = RSA.importKey(pubkey)
    signer = PKCS1_v1_5.new(rsakey)
    digest = SHA.new()
    digest.update(data)
    return signer.verify(digest, sig)

with open(pubfile, 'rb') as f:
    key = f.read()

sig = binascii.unhexlify(sig_hex.strip())

if verifier(key, sig, data):
    print("Verified OK")
else:
    print("Verification Failure")
Now I need to know how this can be called within NiFi. How can I pass the flowfile attributes as arguments to the script (ExecuteScript processor), and how can I get the verification status back as an additional attribute on the flowfile?
Any help is greatly appreciated.
Thanks,
Vish
============================================================
You can do this in a few ways. Since you already have a working series of shell commands, you can use the ExecuteStreamCommand processor to execute the series of commands (or wrap them in a shell script) and stream the incoming flowfile content to STDIN and stream STDOUT to the outgoing flowfile content.
If you prefer, you can use ExecuteScript or InvokeScriptedProcessor to run the DSL script you've written. I would recommend switching to use Groovy, as it is handled much better in current Apache NiFi releases. Python (actually Jython) is much slower and does not have access to native libraries.
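If you do go the ExecuteScript (Jython) route anyway, a rough sketch of the attribute plumbing could look like the following. The attribute names are assumptions based on the flow described above, and the verification itself is delegated to the same openssl commands from the question, since PyCrypto's C extensions are not usable from Jython:

import os
import subprocess
import tempfile

def _write_temp(content):
    # Dump bytes to a temporary file and return its path.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(content)
    tmp.close()
    return tmp.name

def verify_signature(data, sig_hex, pubkey='/tmp/test.pub'):
    # Mirror the manual openssl commands from the question.
    sig_path = _write_temp(sig_hex.strip().decode('hex'))
    data_path = _write_temp(data.encode('utf-8'))
    try:
        rc = subprocess.call(['openssl', 'dgst', '-sha1', '-verify', pubkey,
                              '-signature', sig_path, data_path])
        return rc == 0
    finally:
        os.remove(sig_path)
        os.remove(data_path)

flowFile = session.get()
if flowFile is not None:
    data = (flowFile.getAttribute('http.query.param.attribute1') or '') + \
           (flowFile.getAttribute('http.query.param.attribute3') or '')
    sig_hex = flowFile.getAttribute('http.query.param.signature') or ''
    ok = verify_signature(data, sig_hex)
    flowFile = session.putAttribute(flowFile, 'signature.valid', str(ok))
    session.transfer(flowFile, REL_SUCCESS if ok else REL_FAILURE)

A RouteOnAttribute processor further down the flow can then branch on the signature.valid attribute.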
Finally, you could write a custom processor to do this if it's a repeated task you'll need in future flows. There are many guides for writing a custom processor available online.

Python/Flask - ValueError: I/O operation on closed file

Before anyone says that this is a duplicate, I do not think it is because I have looked at the similar questions and they have not helped me!
I am creating a Flask server in python, and I need to be able to have a url that shows a pdf.
I tried to use the following code:
@app.route('/pdf')
def pdfStuff():
    with open('pdffile.pdf', 'rb') as static_file:
        return send_file(static_file, attachment_filename='pdffile.pdf')
This is supposed to make it so when I go to /pdf it will show the pdf file pdffile.pdf.
However, this does not work because when I run the code I get this error:
ValueError: I/O operation on closed file
How is this the case? My return statement is inside the with statement, therefore shouldn't the file be open?
I tried to use a normal static_file = open(...) and used try and finally statements, like this:
static_file = open('pdffile.pdf', 'rb')
try:
    return send_file(static_file, attachment_filename='pdffile.pdf')
finally:
    static_file.close()
The same error happens with the above code, and I have no idea why. Does anyone know what I could be doing wrong?
Sorry if I am being stupid and there is something simple that I made a mistake with!
Thank you very much in advance !!
Use send_file with the filename; it will open, serve and close the file the way you expect.

@app.route('/pdf')
def pdfStuff():
    return send_file('pdffile.pdf')
Although @iurisilvio's answer solves this specific problem, it is not a useful answer in any other case. I was struggling with this myself.
All of the following examples throw ValueError: I/O operation on closed file. But why?
@app.route('/pdf')
def pdfStuff():
    with open('pdffile.pdf', 'rb') as static_file:
        return send_file(static_file, attachment_filename='pdffile.pdf')

@app.route('/pdf')
def pdfStuff():
    static_file = open('pdffile.pdf', 'rb')
    try:
        return send_file(static_file, attachment_filename='pdffile.pdf')
    finally:
        static_file.close()
I am doing something slightly different. Like this:
@page.route('/file', methods=['GET'])
def build_csv():
    # ... some query ...
    ENCODING = 'utf-8'
    bi = io.BytesIO()
    tw = io.TextIOWrapper(bi, encoding=ENCODING)
    c = csv.writer(tw)
    c.writerow(['col_1', 'col_2'])
    c.writerow(['1', '2'])
    bi.seek(0)
    return send_file(bi,
                     as_attachment=True,
                     attachment_filename='file.csv',
                     mimetype="Content-Type: text/html; charset={0}".format(ENCODING))
In the first two cases, the answer is simple:
You give a stream to send_file; this function will not immediately transmit the file, but rather wrap the stream and return it to Flask for future handling. Your pdfStuff function will already have returned before Flask starts handling your stream, and in both cases (with and finally) the stream will be closed before your function returns.
The third case is more tricky (but this answer pointed me in the right direction: Why is TextIOWrapper closing the given BytesIO stream?). In the same fashion as explained above, bi is handled only after build_csv returns. Hence tw has already been abandoned to the garbage collector. When the collector destroys it, tw will implicitly close bi. The solution to this one is to call tw.detach() before returning (this stops the TextIOWrapper from affecting the stream).
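In other words, a corrected version of the CSV route could look roughly like this (same made-up column names as above, and page is assumed to be the same blueprint; the explicit flush just makes sure the buffered text has reached the BytesIO before it is detached):

import csv
import io

from flask import send_file

@page.route('/file', methods=['GET'])
def build_csv():
    ENCODING = 'utf-8'
    bi = io.BytesIO()
    tw = io.TextIOWrapper(bi, encoding=ENCODING)
    c = csv.writer(tw)
    c.writerow(['col_1', 'col_2'])
    c.writerow(['1', '2'])
    tw.flush()    # push any buffered text into bi
    tw.detach()   # bi is no longer closed when tw is garbage collected
    bi.seek(0)
    return send_file(bi,
                     as_attachment=True,
                     attachment_filename='file.csv',
                     mimetype='text/csv; charset={0}'.format(ENCODING))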
Side note (please correct me if I'm wrong):
This behaviour would be limiting, unless send_file handles the closing on its own when it is given a file-like object. It is not clear from the documentation (https://flask.palletsprojects.com/en/0.12.x/api/#flask.send_file) whether closing is handled. I would assume so (there are some .close() calls in the source code, and send_file uses werkzeug.wsgi.FileWrapper, which also implements .close()), in which case your approach can be corrected to:
@app.route('/pdf')
def pdfStuff():
    return send_file(open('pdffile.pdf', 'rb'), attachment_filename='pdffile.pdf')
Of course in this case it would be straightforward to provide the file name. But in other cases, it may be necessary to wrap the file stream in some manipulation pipeline (decode / zip).
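For instance, here is a sketch of the zip case (file names are illustrative), building the archive in memory and handing the rewound BytesIO to send_file:

import io
import zipfile

from flask import send_file

@app.route('/pdf-zipped')
def pdf_zipped():
    buf = io.BytesIO()
    # Write the PDF into an in-memory zip archive, then rewind before sending.
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.write('pdffile.pdf')
    buf.seek(0)
    return send_file(buf,
                     as_attachment=True,
                     attachment_filename='pdffile.zip',
                     mimetype='application/zip')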

Global variable in Python server

Background: I am a complete beginner when it comes to servers, but I know my way around programming in Python.
I am trying to setup a simple server using the basic Python 2.7 modules (SimpleHTTPServer, CGIHTTPServer, etc). This server needs to load a global, read-only variable with several GB of data from a file when it starts; then, when each user accesses the page, the server uses the big data to generate some output which is then given to the user.
For the sake of example, let's suppose I have a 4 GB file names.txt which contains all possible proper nouns of English:
Jack
John
Allison
Richard
...
Let's suppose that my goal is to read the whole list of names into memory, and then choose 1 name at random from this big list of proper nouns. I am currently able to use Python's native CGIHTTPServer module to accomplish this. To start, I just run the CGIHTTPServer module directly, by executing from a terminal:
python -m CGIHTTPServer
Then, someone accesses www.example-server.net:8000/foo.py and they are given one of these names at random. I have the following code in foo.py:
#!/usr/bin/env python
import random

name_list = list()
FILE = open('names.txt', 'r')
for line in FILE:
    name = line[:-1]
    name_list.append(name)
FILE.close()

name_to_return = random.choice(name_list)

print "Content-type: text/html"
print
print "<title>Here is your name</title>"
print "<p>" + name_to_return + "</p>"
This does what I want; however, it is extremely inefficient, because every access forces the server to re-read a 4 GB file.
How can I make this into an efficient process, where the variable name_list is created as global immediately when the server starts, and each access only reads from that variable?
Just for future reference, if anyone ever faces the same problem: I ended up sub-classing CGIHTTPServer's request handler and implementing a new do_POST() function. If you had a working CGI script without global variables, something like this should get you started:
import CGIHTTPServer
import random
import sys
import cgi

class MyRequestHandler(CGIHTTPServer.CGIHTTPRequestHandler):
    global super_important_list
    super_important_list = range(10)
    random.shuffle(super_important_list)

    def do_POST(s):
        """Respond to a POST request."""
        form = cgi.FieldStorage(fp=s.rfile, headers=s.headers,
                                environ={'REQUEST_METHOD': 'POST',
                                         'CONTENT_TYPE': s.headers['Content-Type'], })
        s.wfile.write("<html><head><title>Title goes here.</title></head>")
        s.wfile.write("<body><p>This is a test.</p>")
        s.wfile.write("<p>You accessed path: %s</p>" % s.path)
        s.wfile.write("<p>Also, super_important_list is:</p>")
        s.wfile.write(str(super_important_list))
        s.wfile.write("<p>Furthermore, you POSTed the following info: ")
        for item in form.keys():
            s.wfile.write("<p>Item: " + item)
            s.wfile.write("<p>Value: " + form[item].value)
        s.wfile.write("</body></html>")

if __name__ == '__main__':
    server_address = ('', 8000)
    httpd = CGIHTTPServer.BaseHTTPServer.HTTPServer(server_address, MyRequestHandler)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        sys.exit()
Whenever someone fills out your form and performs a POST, the variable form will be a dictionary-like object with key-value pairs which may differ for each user of your site, but the global variable super_important_list will be the same for every user.
Thanks to everyone who answered my question, especially Mike Steder, who pointed me in the right direction!
CGI works by spawning a process to handle each request. You need to run a server process that stays in memory and handles HTTP requests.
You could use a modified BaseHTTPServer, just define your own Handler class. You'd load the dataset once in your code and then the do_GET method of your handler would just pick one randomly.
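A minimal sketch of that approach (Python 2, reusing the names.txt file from the question; error handling omitted):

import random
import BaseHTTPServer

# Load the big file exactly once, when the server process starts.
with open('names.txt', 'r') as f:
    NAME_LIST = [line.rstrip('\n') for line in f]

class NameHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # Every request just reads from the in-memory list.
        name = random.choice(NAME_LIST)
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write("<title>Here is your name</title>")
        self.wfile.write("<p>%s</p>" % name)

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('', 8000), NameHandler).serve_forever()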
Personally, I'd look into something like CherryPy as a simple solution that is IMO a lot nicer than BaseHTTPServer. There are tons of options other than CherryPy like bottle, flask, twisted, django, etc. Of course if you need this server to be behind some other webserver you'll need to look into setting up a reverse proxy or running CherryPy as a WSGI app.
You may want to store the names in a database, keyed by the letter they start with. Then you can pick a random letter between a and z, and from there randomize again to get a random name beginning with that letter.
Build a prefix tree (a.k.a. trie) once and generate a random walk whenever you receive a query.
That should be pretty efficient.
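For illustration, here is a rough sketch of that idea (my own example; note the simplification that choosing uniformly at each branch does not give every name equal probability):

import random

def build_trie(names):
    # Nested dicts; '$' marks the end of a name.
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node['$'] = {}
    return root

def random_name(root):
    # Walk down from the root, picking a random branch until we hit '$'.
    node, out = root, []
    while True:
        ch = random.choice(list(node))
        if ch == '$':
            return ''.join(out)
        out.append(ch)
        node = node[ch]

trie = build_trie(['Jack', 'John', 'Allison', 'Richard'])
print(random_name(trie))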

How to use /dev/ptmx to create a virtual serial port?

I have a program that uses pyserial, and I want to test it without using a real serial port device.
On Windows I use com0com, and on Linux I know there is a way to create a virtual serial port pair without using an additional program.
So I looked up the manual and found pts and /dev/ptmx, but I don't know how to create a pair by following the manual. Can anyone give me an example?
I tried (in Python):
f = open("/dev/ptmx", "r")
and it works, /dev/pts/4 is created.
And I tried:
f = open("/dev/pts/4", "w")
and the result is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 5] Input/output error: '/dev/pts/4'
Edit:
I found a solution (workaround) using socat:
socat PTY,link=COM8 PTY,link=COM9
Then COM8 and COM9 are created as a virtual serial port pair.
I was trying to make an application that made use of virtual serial ports in order to communicate with some remote devices using TCP/serial conversion, and I came across a problem similar to yours. My solution worked out as follows:
import os, pty, serial
master, slave = pty.openpty()
s_name = os.ttyname(slave)
ser = serial.Serial(s_name)
# To Write to the device
ser.write('Your text')
# To read from the device
os.read(master,1000)
Although the name of the master port (/dev/ptmx) is the same if you check it, the fd is different each time you create another master/slave pair, so reading from the master gets you the message issued to its assigned slave. I hope this helps you or anyone else who comes across a problem similar to this one.
Per the docs, you need ptsname to get the name of the slave side of the pseudo-terminal, and also, quoting the docs:
Before opening the pseudo-terminal slave, you must pass the master's file descriptor to grantpt(3) and unlockpt(3).
You should be able to use ctypes to call all of the needed functions.
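For example, a rough ctypes sketch of that sequence (Linux and glibc assumed) might look like this:

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
libc.ptsname.restype = ctypes.c_char_p

# Open the pseudo-terminal master, then grant and unlock its slave side.
master = os.open('/dev/ptmx', os.O_RDWR | os.O_NOCTTY)
if libc.grantpt(master) != 0 or libc.unlockpt(master) != 0:
    raise OSError(ctypes.get_errno(), 'grantpt/unlockpt failed')

slave_name = libc.ptsname(master)
print(slave_name)  # e.g. /dev/pts/4, which pyserial can now open

# Quick round trip: write on the slave, read on the master.
slave = os.open(slave_name, os.O_RDWR | os.O_NOCTTY)
os.write(slave, b'hello\n')
print(os.read(master, 100))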
I don't know Python, but I can point you in the right direction: look here at a C code sample. Here's the man page for /dev/ptmx. Make sure the permissions and owner are correct! And here is a post on the linuxquestions forum about how to use it from C.
You should consider using the pty module instead, which should take care of this for you. (It opens /dev/ptmx, calls openpty, or opens another appropriate device, depending on the platform.)
You could build a dummy object that implements the same interface as the pySerial classes you use, but does something completely different and easily replicable, like reading from and writing to files, the terminal, etc.
For example:
class DummySerial():
    # you should consider subclassing this
    def __init__(self, *args, **kwargs):
        self.fIn = open("input.raw", 'r')
        self.fOut = open("output.raw", 'w')

    def read(self, bytes=1):
        return self.fIn.read(bytes)

    def write(self, data):
        self.fOut.write(data)

    def close(self):
        self.fIn.close()
        self.fOut.close()

    # implement more methods here
If it quacks like a duck and it ducks like a duck...
