Protocol buffers, where to use them? [closed] - python

I've recently read an article about protocol buffers,
Protocol Buffers is a method of serializing structured data. It is
useful in developing programs to communicate with each other over a
wire or for storing data. The method involves an interface description
language that describes the structure of some data and a program that
generates source code from that description for generating or parsing
a stream of bytes that represents the structured data
What I want to know is, where to use them? Are there any real-life examples rather than simple address book examples? Is it for example used to pre-cache query results from databases?

Protocol buffers are a data storage and exchange format, notably used for RPC - communication between programs or computers.
Alternatives include language-specific serialization (Java serialization, Python pickles, etc.), tabular formats like CSV and TSV, structured text formats like XML and JSON, and other binary formats like Apache Thrift. Conceptually these are all just different ways of representing structured data, but in practice they have different pros and cons.
Protocol buffers are:
Space efficient, relying on a custom format to represent data compactly.
Strongly typed across languages (particularly valuable in statically typed languages like Java, but even in Python the type safety is still quite useful).
Designed to be backwards and forwards-compatible. It's easy to make structural changes to protocol buffers (normally adding new fields or deprecating old ones) without needing to ensure all applications using the proto are updated simultaneously.
Somewhat tedious to work with manually. While there is a text format, it is mostly useful for manually inspecting, not storing, protos. JSON, for instance, is much easier for a human to write and edit. Therefore protos are usually written and read by programs.
Dependent on a .proto compiler. By separating the structure from the data protocol buffers can be lean and mean, but it means without an associated .proto file and a tool like protoc to generate code to parse it, arbitrary data in proto format is unusable. This makes protos a poor choice for sending data to other people who may not have the .proto file on hand.
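To make that workflow concrete, here is a minimal sketch of my own (not from the original answer). It assumes a hypothetical person.proto containing something like message Person { string name = 1; int32 id = 2; }, compiled with protoc --python_out=. into a generated person_pb2 module:

from person_pb2 import Person  # hypothetical module generated by protoc

p = Person(name="Ada", id=42)
payload = p.SerializeToString()   # compact binary bytes, not human-readable

q = Person()
q.ParseFromString(payload)        # typed round-trip on the receiving side
print(q.name, q.id)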
To make some sweeping generalizations about different formats:
CSV/TSV/etc. are useful for human-constructed data that never needs to be transmitted between people or programs. They're easy to construct and easy to parse, but a nightmare to keep in sync, and they can't easily represent complex structures.
Language-specific serialization like pickles can be useful for short-lived serialization, but quickly runs into backwards-compatibility issues and obviously limits you to one language. Except in some very specific cases, protobufs accomplish all the same goals with more safety and better future-proofing.
JSON is ideal for sending data between different parties (e.g. public APIs). Because the structure and the content are transmitted together anyone can understand it, and it's easy to parse in all major languages. There's little reason nowadays to use other structured formats like XML.
Binary formats like Protocol Buffers are ideal for almost all other data serialization use cases; long and short-term storage, inter-process communication, intra-process and application-wide caching, and more.
Google famously uses protocol buffers for practically everything they do. If you can imagine a reason to need to store or transmit data, Google probably does it with protocol buffers.

I used them to create a financial trading system. Here are the reasons:
There are libraries for many languages. Some things needed to be in C++, others in C#, and it was open to extending to Python or Java, etc.
It needed to be fast to serialize/deserialize and compact, because of the speed requirements of a financial trading system. The messages were a lot shorter than comparable text messages, which meant we never had a problem fitting one in a single network packet.
It didn't need to be readable off the wire. The previous system used XML, which is nice for debugging, but you can get debugging output in other ways and turn it off in production.
It gives your message a natural structure, and an API for getting the parts you need. Writing something custom would have required thinking about all the helper functions to pull numbers out of the binary, with corner cases and all that.

Related

Parallelizing without pickle

Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, and human-readability. I am generally wary of using pickle on data in my Python programs. As a general rule, I much prefer to pass my data around with JSON or other serialization formats that I specify myself.
The situation I'm interested in is: I've gathered some data in my python program and I want to run an embarrassingly parallel task on it a bunch of times in parallel.
As far as I know, the nicest parallelization library for doing this in python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a json array to a subprocess of GNU parallel. But of course, this feels like a hack, and loses all the fancy goodness of Dask.
Is it possible to specify a different default serialization format for my data, but continue to parallelize in python, preferably dask, without resorting to pickle or gnu parallel?
The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and scheduler. As can be seen, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers, so they can execute them, but data is generally sent via fairly efficient msgpack serialisation. There would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
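For what it's worth, the serialization machinery linked above also exposes a per-class dispatch you can register against, so a custom type's payload can bypass pickle entirely. A rough sketch, modelled on the example in the distributed documentation; treat the exact imports and the Label class as assumptions to check against your installed version:

from distributed.protocol import dask_serialize, dask_deserialize

class Label:
    # A made-up type whose payload we want shipped without pickle.
    def __init__(self, text):
        self.text = text

@dask_serialize.register(Label)
def _serialize(label):
    header = {}                      # small metadata, must be msgpack-friendly
    frames = [label.text.encode()]   # raw byte frames carrying the payload
    return header, frames

@dask_deserialize.register(Label)
def _deserialize(header, frames):
    return Label(frames[0].decode())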
I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.
-edit-
You could of course include function names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.
Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:
import os

def foo():
    os.system('rm -rf /')  # arbitrary code executed when the payload is loaded
    return {'lol': foo}
But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
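To illustrate that point with a small sketch of my own (the names and values are made up):

import json
import pickle
from datetime import datetime

dt = datetime(2024, 1, 1, 12, 30)

# pickle round-trips arbitrary Python objects with no extra code:
same_dt = pickle.loads(pickle.dumps(dt))

# JSON needs an ad-hoc adapter in each direction:
encoded = json.dumps({"when": dt.isoformat()})
decoded = datetime.fromisoformat(json.loads(encoded)["when"])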
I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.

Python PureMVC vs Pubsub [closed]

I'm creating a Python application and I want to implement it with MVC in mind. I was going to use pubsub to accomplish this, but then I came across PureMVC.
Could anyone explain these two to me: the differences between them, and the implications of using one over the other?
I presume you are referring to pypubsub, which I know a lot about (I am the author ;). However, I don't know much about PureMVC for Python.
The two are very different, based on the PureMVC docs. Here are some differences that I think would matter in choosing, based on browsing the docs and listening to the presentation:
Learning curve:
Incorporating pypubsub in your app is easy: decide on "message topics", subscribe methods and functions to them, and send messages for those topics. Transport of the messages to their destination is automatic. The "cruising speed" API is small: you have pub.subscribe and pub.sendMessage to learn and that's it (a minimal sketch follows this comparison).
With PureMVC you have to learn about mediators, commands, proxies, etc. These are all powerful concepts with significant functionality that you will have to learn upfront. You may even have to write a couple of apps before you go from "knowledge" of their purpose to "understanding" when/how to use them. For a one-off app, the overhead will sometimes be worth it; it is most likely worth it if you create many applications that use the framework.
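The sketch mentioned above, as a rough illustration of pypubsub's cruising-speed API (the topic name and listener are made up for illustration):

from pubsub import pub

def on_balance_changed(amount):
    print("new balance:", amount)

pub.subscribe(on_balance_changed, "account.balanceChanged")
pub.sendMessage("account.balanceChanged", amount=100.0)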
Impact on application design:
PyPubsub: anonymous observer design pattern.
PureMVC: MVC architectural pattern.
There are no classes to use with pypubsub "standard use". Mostly you have to classify your messages into topics and decide what to include as data. This can evolve fairly organically: say you need a new dialog and need to make some of its state available so that when a field changes, a label changes somewhere else; all you need to do is publish from the dialog and subscribe in the code that updates the label. If anything, pypubsub allows you to not worry about design; or rather, it allows you to focus your design on functionality rather than on how to get data from one place to another.
With PureMVC there are many classes to use: they require that you design your components to derive from them, register them, and implement base-class functionality. It is not obvious that you can easily publish data from one place in your application and capture it in another without creating several new classes and implementing them so that they do the right thing when called by the framework. Of course, the overhead (time to design) will in some cases be worth it.
Re-usability:
As long as a component documents what message topics it publishes and what it listens to, it can be incorporated in another application, unit-tested for behavior, etc. If the other application does not use pypubsub, it is easy to add; there is no impact on the architecture. Not all of an application needs to use pubsub; it can be used only where needed.
OTOH a PureMVC component could only be incorporated into an application that is already based on PureMVC.
Testability:
PureMVC facilitates testing by separating concerns across layers: visuals, logic, data.
Whereas publish-subscribe (pypubsub) facilitates it by separating publishers from consumers, regardless of layer. Hence testing with pypubsub consists of having the test publish data used by your component and subscribe to data published by your component, whereas with PureMVC the test would have to pretend to be the visual and data layers. I don't know how easy that is in PureMVC.
Every publish-subscribe system can become difficult to debug without the right tools, once the application reaches a certain size: it can be difficult to trace the path of messages. Pypubsub provides classes that help with this (to be used during development), and functionality that verifies whether publishers and listeners are compatible.
It seems to me based on PureMVC diagrams that similar issues would arise: you would have to trace your way across proxies, commands, and mediators, via facades, to figure out why something went wrong. I don't know what tools PureMVC provides to deal with this.
Purpose:
The observer pattern is about how to get data from one place to the next via a sort of "data bus"; as long as components can link to the bus, state can be exchanged without knowledge of source or sink.
PureMVC is an architectural pattern: its job is to make it easy to describe your application in terms of the concerns of view, control, and data. The model does not care how the control interacts with it; the control does not care how it is displayed; but the view needs the control to provide specific services to handle user actions and to fetch the desired subset of data to show (since typically not all available data is shown), the control needs the model to provide specific services (to get data, change it, validate it, save it, etc.), and the control needs to instantiate view components at the right time.
Mutual exclusion: there is no reason that I can think of, based on the docs, that would prevent the two libraries from being used in the same application. They work at different levels and have different purposes; they can coexist.
All decoupling strategies have pros and cons, and you have to weigh each. Learning curve, return on investment, re-usability, performance, testability, etc.

Similar .rdata functionality in Python?

I'm starting to learn about doing data analysis in Python.
In R, you can load data into memory, then save variables into a .rdata file.
I'm trying to create an analysis "project", so I can load the data, store the scripts, then save the output so I can recall it should I need to.
Is there an equivalent function in Python?
Thanks
What you're looking for is binary serialization. The most notable functionality for this in Python is pickle. If you have some standard scientific data structures, you could look at HDF5 instead. JSON works for a lot of objects as well, but it is not binary serialization - it is text-based.
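As a minimal sketch of the R save()/load() analogue with pickle (the file name and variables are made up):

import pickle

scores = [0.91, 0.87, 0.94]
params = {"alpha": 0.05, "iterations": 1000}

# Rough analogue of save(): persist several variables to one file.
with open("analysis.pkl", "wb") as f:
    pickle.dump({"scores": scores, "params": params}, f)

# Rough analogue of load(): restore them in a later session.
with open("analysis.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored["params"]["alpha"])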
If you expand your options, there are a lot of other serialization options, too. Such as Google's Protocol Buffers (the developer of Rprotobuf is the top-ranked answerer for the r tag on SO), Avro, Thrift, and more.
Although there are generic serialization options such as pickle and .RData, careful consideration of your usage will be helpful in making I/O fast and appropriate to your needs, especially if you need random access, portability, parallel access, tool re-use, etc. For instance, I now tend to avoid .RData for large objects.

Learning to write reusable libraries [closed]

We need to write simple scripts to manipulate the configuration of our load balancers (i.e., drain nodes from pools, enable or disable traffic rules). The load balancers have a SOAP API (defined through a bunch of WSDL files) which is very comprehensive, but using it is quite low-level, with a lot of manual error checking and list manipulation. It doesn't tend to produce reusable, robust code.
I'd like to write a Python library to handle the nitty-gritty of interacting with the SOAP interface but I don't really know where to start; all of my coding experience is with writing one-off monolithic programs for specific jobs. This is fine for small jobs but it's not helping me or my coworkers -- we're reinventing the wheel with a different number of spokes each time :~)
The API already provides methods like getPoolNames() and getDrainingNodes() but they're a bit awkward to use. Most take a list of nodes and return another list, so (say) working out which virtual servers are enabled involves this sort of thing:
names = conn.getVirtualServerNames()
enabled = conn.getEnabled(names)
for i in range(len(names)):
    if enabled[i]:
        print(names[i])
conn.setEnabled(['www.example.com'], [0])
Whereas something like this:
lb = LoadBalancer('hostname')
for name in [vs.name for vs in lb.virtualServers() if vs.isEnabled()]:
    print(name)
www = lb.virtualServer('www.example.com').disable()
is more Pythonic and (IMHO) easier.
There are a lot of things I'm not sure about: how to handle errors, how to deal with 20-odd WSDL files (a SOAPpy/suds instance for each?) and how much boilerplate translation from the API methods to my methods I'll need to do.
This is more an example of a wider problem (how to learn to write libraries instead of one-off scripts) so I don't want answers to these specific questions -- they're there to demonstrate my thinking and illustrate my problem. I recognise a code smell in the way I do things at the moment (one-off, non-reusable code) but I don't know how to fix it. How does one get into the mindset for tackling problems at a more abstract level? How do you 'learn' software design?
"I don't really know where to start"
Clearly false. You provided an excellent example. Just do more of that. It's that simple.
"There are a lot of things I'm not sure about: how to handle errors, how to deal with 20-odd WSDL files (a SOAPpy/suds instance for each?) and how much boilerplate translation from the API methods to my methods I'll need to do."
Handle errors by raising an exception. That's enough. Remember, you're still going to have high-level scripts using your API library.
20-odd WSDL files? Just pick something for now. Don't overengineer this. Design the API -- as you did with your example -- for the things you want to do. The WSDLs and the number of instances will become clear as you go. One, ten, twenty doesn't really matter to users of your API library. It only matters to you, the maintainer. Focus on the users.
Boilerplate translation? As little as possible. Focus on what parts of these interfaces you use with your actual scripts. Translate just what you need and nothing more.
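To make that concrete, here is a hedged sketch of the kind of thin wrapper being described. conn and its getVirtualServerNames/getEnabled/setEnabled methods are the asker's SOAP client from the question; every other name is hypothetical, and the constructor takes the connection rather than a hostname for brevity:

class LoadBalancerError(Exception):
    pass  # raised when an underlying SOAP call fails

class VirtualServer:
    def __init__(self, conn, name):
        self._conn = conn
        self.name = name

    def isEnabled(self):
        try:
            return bool(self._conn.getEnabled([self.name])[0])
        except Exception as exc:
            raise LoadBalancerError("getEnabled(%r) failed" % self.name) from exc

    def disable(self):
        self._conn.setEnabled([self.name], [0])
        return self

class LoadBalancer:
    def __init__(self, conn):
        self._conn = conn

    def virtualServers(self):
        return [VirtualServer(self._conn, n)
                for n in self._conn.getVirtualServerNames()]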
An API is not fixed, cast in concrete, a thing of beauty and a joy forever. It's just a module (in your case a package might be better) that does some useful stuff.
It will undergo constant change and evolution.
Don't overengineer the first release. Build something useful that works for one use case. Then add use cases to it.
"But what if I realize I did something wrong?" That's inevitable, you'll always reach this point. Don't worry about it now.
The most important thing about writing an API library is writing the unit tests that (a) demonstrate how it works and (b) prove that it actually works.
There's an excellent presentation by Joshua Bloch on API design (and thus leading to library design). It's well worth watching. IIRC it's Java-focused, but the principles will apply to any language.
If you are not afraid of C++, there is an excellent book on the subject called "Large-scale C++ Software Design".
This book will guide you through the steps of designing a library by introducing "physical" and "logical" design.
For instance, you'll learn to flatten your components' hierarchy, to restrict dependency between components, to create levels of abstraction.
This is really "the" book on software design, IMHO.

Any experiences with Protocol Buffers?

I was just looking through some information about Google's protocol buffers data interchange format. Has anyone played around with the code or even created a project around it?
I'm currently using XML in a Python project for structured content created by hand in a text editor, and I was wondering what the general opinion was on Protocol Buffers as a user-facing input format. The speed and brevity benefits definitely seem to be there, but there are so many factors when it comes to actually generating and processing the data.
If you are looking for user-facing interaction, stick with XML. It has more support, understanding, and general acceptance currently. If it's internal, I would say that protocol buffers are a great idea.
Maybe in a few years as more tools come out to support protocol buffers, then start looking towards that for a public facing api. Until then... JSON?
Protocol buffers are intended to optimize communications between machines. They are really not intended for human interaction. Also, the format is binary, so it could not replace XML in that use case.
I would also recommend JSON as being the most compact text-based format.
Another drawback of a binary format like PB is that if there is a single bit error, the entire data file may not be parsable, but with JSON or XML, as a last resort you can still manually fix the error because it is human-readable and has redundancy built in.
From your brief description, it sounds like protocol buffers is not the right fit. The phrase "structured content created by hand in a text editor" pretty much screams for XML.
But if you want efficient, low latency communications with data structures that are not shared outside your organization, binary serialization such as protocol buffers can offer a huge win.
