I've been messing around with a few personal projects and have found the need to offload the processing of a large amount of data to beefier, dedicated servers. I tend to do this over XML-RPC in Python, have made some interesting observations along the way, and wanted to share them and see if anybody knows of a better or more efficient way of doing this.
So, let's say I need to send a large amount of data over XML-RPC in Python. What's the fastest way of doing this?
I started doing some experimenting with the XML-RPC module, as there isn't much about it online. Initially, to handle my data (~15 megabytes), I was simply passing a dictionary object to the XML-RPC method on the client side. This was very slow on both the server and client side: each took a few minutes just to encode/decode the data! I assume (but am not sure) that this is an issue with having to encode a lot of data in XML.
However, after some fiddling around, I tried serializing the dictionary to JSON with json.dumps on the client end and loading it with json.loads on the server side, which, to my surprise, ended up being many times faster.
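For illustration, the pattern I ended up with looks roughly like this (the host name is a placeholder and process_json is a made-up method name, but the xmlrpc and json calls are standard library):

    # server.py -- stand-in for the beefy machine
    import json
    from xmlrpc.server import SimpleXMLRPCServer

    def process_json(payload):
        data = json.loads(payload)      # cheap: C-accelerated json, not XML-RPC marshalling
        return "got %d keys" % len(data)

    server = SimpleXMLRPCServer(("0.0.0.0", 8000))
    server.register_function(process_json)
    server.serve_forever()

    # client.py
    import json
    import xmlrpc.client

    big_dict = {str(i): i for i in range(100_000)}   # stand-in for my ~15 MB dictionary
    proxy = xmlrpc.client.ServerProxy("http://worker.example.com:8000/")  # placeholder host
    print(proxy.process_json(json.dumps(big_dict)))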
Warning: Pure Speculation!
I suspect that the XML encoding is so much slower than the JSON encoding because json.dumps uses CPython's C accelerator, whereas I don't know whether Python's XML-RPC marshalling has a C implementation at all. I ran into a similar issue using json.dumps vs json.dump in a previous project: the latter is many times slower because it runs in pure Python rather than using the C accelerations (in the words of the Python bug report, "json.dump doesn't use the C accelerations": https://bugs.python.org/msg137170).
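A rough way to check that guess is to time the two marshalling paths directly on the same data; here's a quick sketch (the test dict is just a stand-in for my real data):

    import json
    import timeit
    import xmlrpc.client

    data = {"rows": [{"id": i, "value": i * 0.5} for i in range(100_000)]}  # stand-in data

    xml_time = timeit.timeit(lambda: xmlrpc.client.dumps((data,), methodname="process"), number=1)
    json_time = timeit.timeit(lambda: json.dumps(data), number=1)
    print(f"xmlrpc marshal: {xml_time:.2f}s   json.dumps: {json_time:.2f}s")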
I could theoretically upload the serialized JSON string (or pickled dict object, but this strikes me as a bad idea) to cloud storage, such as AWS S3, and then pull it on the server end, but I feel like I might as well just send the data directly from one machine to the other at that point.
I'm going to experiment with doing some gzip compression on the serialized JSON string, to hopefully keep network bandwidth from becoming a bottleneck, as I eventually plan on handling gigabytes of data over RPC. I'll post my results here.
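Roughly, the plan is something like this (again, the server method name is made up):

    import gzip
    import json
    import xmlrpc.client

    # Client side: JSON-encode, gzip, and wrap in Binary so the bytes survive XML marshalling.
    big_dict = {str(i): i for i in range(100_000)}        # stand-in for the real data
    blob = xmlrpc.client.Binary(gzip.compress(json.dumps(big_dict).encode("utf-8")))
    # proxy.process_gzipped_json(blob)                    # hypothetical server method

    # Server side of that hypothetical method: unwrap, decompress, decode.
    def process_gzipped_json(blob):
        data = json.loads(gzip.decompress(blob.data).decode("utf-8"))
        return "got %d keys" % len(data)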
I thought this was interesting, and I'm wondering if anybody has run into this issue before and how they've gone about it. I haven't been able to find much online. Cheers!
Related
Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, and human readability. I am generally wary of using pickle on data in my Python programs. As a general rule, I much prefer to pass my data around as JSON or another serialization format that I specify myself.
The situation I'm interested in is: I've gathered some data in my python program and I want to run an embarrassingly parallel task on it a bunch of times in parallel.
As far as I know, the nicest parallelization library for doing this in python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a json array to a subprocess of GNU parallel. But of course, this feels like a hack, and loses all the fancy goodness of Dask.
Is it possible to specify a different default serialization format for my data, but continue to parallelize in python, preferably dask, without resorting to pickle or gnu parallel?
The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and scheduler. As can be seen, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers, so they can execute them, but data is generally sent via fairly efficient msgpack serialisation. There would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
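For completeness, the client also lets you restrict which serialisation families are used for data, along these lines (see the serialisation page of the same docs; the scheduler address here is a placeholder):

    from dask.distributed import Client

    # Only allow Dask's own serialisers and msgpack for data; functions will still
    # go through (cloud)pickle, as described above.
    client = Client("tcp://scheduler:8786",
                    serializers=["dask", "msgpack"],
                    deserializers=["dask", "msgpack"])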
I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.
-edit-
You could of course include function names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.
Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:
    import os

    class Evil:
        def __reduce__(self):
            return (os.system, ('rm -rf /',))  # runs when the payload is unpickled
But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
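(For the curious, such an ad-hoc adapter would look roughly like the sketch below; both ends have to agree on the convention.)

    import json
    from datetime import datetime

    def encode(obj):
        # Turn datetimes into tagged ISO strings; reject anything else unexpected.
        if isinstance(obj, datetime):
            return {"__datetime__": obj.isoformat()}
        raise TypeError(f"Cannot serialize {type(obj)!r}")

    def decode(d):
        if "__datetime__" in d:
            return datetime.fromisoformat(d["__datetime__"])
        return d

    payload = json.dumps({"when": datetime(2019, 5, 1, 12, 30)}, default=encode)
    restored = json.loads(payload, object_hook=decode)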
I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.
I'm having a problem creating inter-process communication for my Python application. I have two Python scripts at hand, let's say A and B. A is used to open a huge file, keep it in memory, and do some processing that MySQL can't do, and B is a process used to query A very often.
Since the file A needs to read is really large, I hope to read it once and have A sit there waiting for my B processes to query it.
What I do now is use CherryPy to build an HTTP server. However, it feels kind of awkward to do so since what I'm trying to do is strictly local. So I'm wondering: is there a more natural way to achieve this goal?
I don't know much about TCP/sockets etc. If possible, toy examples would be appreciated (please include the part that reads the file).
Python has good support for ZeroMQ, which is much easier and more robust than using raw sockets.
The ZeroMQ site treats Python as one of its primary languages and offers copious Python examples in its documentation. Indeed, the example in "Learn the Basics" is written in Python.
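Here is a toy sketch of the request/reply pattern with pyzmq; the file name and port are placeholders, and A keeps the file in memory while B queries it by line number:

    # a.py -- loads the big file once, then answers queries forever
    import zmq

    with open("huge_file.txt") as f:           # placeholder file name
        lines = f.readlines()                  # keep the whole thing in memory

    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://127.0.0.1:5555")        # local only

    while True:
        query = socket.recv_string()           # e.g. a line number sent by B
        socket.send_string(lines[int(query)].rstrip())

    # b.py -- queries A as often as it likes
    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://127.0.0.1:5555")
    socket.send_string("42")
    print(socket.recv_string())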
I'd like my wxPython application to support cut/copy/paste operations between different running instances of the application. Is it OK to simply pickle a data structure, copy it to clipboard as text, and then unpickle it for paste operations?
I know I'd have to check the data for some sign that it's from my app. Or could I just TRY to unpickle whatever is there? How robust is pickle at nicely failing if it tries to unpickle arbitrary text left on the clipboard?
Also, is there a practical limit to how much data could be copied this way?
I'm running on Windows and Linux today - have not tried Mac.
EDIT
I'm aware of that comment in the documentation. I don't really care about a malicious user trying to compromise his own instance of the software; if that's what people are worried about, they should deprecate pickle. My questions are of practicality, not security.
You should not trust data from the clipboard for unpickling unless you have a reliable way to verify that it was written by your app and has not been altered.
From the Python documentation:

    Warning: The pickle module is not intended to be secure against
    erroneous or maliciously constructed data. Never unpickle data
    received from an untrusted or unauthenticated source.
If applicable, I suggest you convert your data to and from JSON using one of the many Python implementations.
Being plain text, it is easy to transfer via the clipboard; moreover, there is no risk in converting a JSON object back to Python.
One last thing: no risk of deprecation.
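As a sketch, you could tag the JSON with a marker so your app recognises its own clipboard data (the marker string is arbitrary):

    import json

    MARKER = "MYAPP-CLIPBOARD-V1:"   # arbitrary prefix identifying our own data

    def to_clipboard_text(data):
        return MARKER + json.dumps(data)

    def from_clipboard_text(text):
        if not text.startswith(MARKER):
            return None                        # not ours -- ignore it
        try:
            return json.loads(text[len(MARKER):])
        except ValueError:                     # altered or truncated clipboard contents
            return None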
Looking for advice on the best technique for saving complex Python data structures across program sessions.
Here's a list of techniques I've come up with so far:
pickle/cpickle
json
jsonpickle
xml
database (like SQLite)
Pickle is the easiest and fastest technique, but my understanding is that there is no guarantee that pickle output will work across various versions of Python 2.x/3.x or across 32 and 64 bit implementations of Python.
JSON only works for simple data structures. jsonpickle seems to correct this AND seems to be written to work across different versions of Python.
Serializing to XML or to a database is possible, but represents extra effort since we would have to do the serialization ourselves manually.
Thank you,
Malcolm
You have a misconception about pickles: they are guaranteed to work across Python versions. You simply have to choose a protocol version that is supported by all the Python versions you care about.
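For example, protocol 2 is understood by every CPython from 2.3 onward, including all of 3.x, so pinning it explicitly keeps the file readable everywhere you care about; a quick sketch:

    import pickle

    data = {"nested": {"values": [1, 2, 3]}, "label": "example"}

    # Write with an explicitly chosen protocol instead of the default.
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f, protocol=2)

    # Any Python that understands protocol 2 can read it back.
    with open("data.pkl", "rb") as f:
        restored = pickle.load(f)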
The technique you left out is marshal, which is not guaranteed to work across Python versions (and, by the way, is how .pyc files are written).
You left out the marshal and shelve modules.
Also, this Python docs page covers persistence.
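shelve in particular fits the "save complex data across sessions" use case nicely; a minimal sketch:

    import shelve

    # A shelf behaves like a persistent dict; values are pickled under the hood.
    with shelve.open("session_state") as db:
        db["settings"] = {"threshold": 0.75, "tags": ["a", "b"]}

    with shelve.open("session_state") as db:
        print(db["settings"]["tags"])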
Have you looked at PySyck or pyYAML?
What are your criteria for "best"?
pickle can handle most Python structures, deeply nested ones too.
SQLite databases can easily be queried (if you know SQL :).
Speed / memory? Trust no benchmarks that you haven't faked yourself.
(Fine print: cPickle.dump(protocol=-1) produces much more compact output, in one case a 15 MB pickle vs. a 60 MB SQLite file, but can break. Strings that occur many times, e.g. country names, may take more memory than you expect; see the built-in intern().)
I was just looking through some information about Google's protocol buffers data interchange format. Has anyone played around with the code or even created a project around it?
I'm currently using XML in a Python project for structured content created by hand in a text editor, and I was wondering what the general opinion was on Protocol Buffers as a user-facing input format. The speed and brevity benefits definitely seem to be there, but there are so many factors when it comes to actually generating and processing the data.
If you are looking for user-facing interaction, stick with XML. It has more support, understanding, and general acceptance currently. If it's internal, I would say that protocol buffers are a great idea.
Maybe in a few years, as more tools come out to support protocol buffers, you can start looking towards them for a public-facing API. Until then... JSON?
Protocol buffers are intended to optimize communications between machines. They are really not intended for human interaction. Also, the format is binary, so it could not replace XML in that use case.
I would also recommend JSON as being the most compact text-based format.
Another drawback of a binary format like PB is that a single bit error can make the entire data file unparsable, whereas with JSON or XML you can, as a last resort, still fix the error manually, because the format is human-readable and has redundancy built in.
From your brief description, it sounds like protocol buffers is not the right fit. The phrase "structured content created by hand in a text editor" pretty much screams for XML.
But if you want efficient, low latency communications with data structures that are not shared outside your organization, binary serialization such as protocol buffers can offer a huge win.
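For a feel of what that looks like in Python, here is a sketch based on the addressbook example from Google's protobuf tutorial; it assumes you have already run protoc to generate addressbook_pb2.py:

    import addressbook_pb2   # generated by: protoc --python_out=. addressbook.proto

    person = addressbook_pb2.Person()
    person.name = "Alice"
    person.id = 1234
    person.email = "alice@example.com"

    blob = person.SerializeToString()     # compact binary, not human-readable

    restored = addressbook_pb2.Person()
    restored.ParseFromString(blob)
    print(restored.name)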