I was just looking through some information about Google's protocol buffers data interchange format. Has anyone played around with the code or even created a project around it?
I'm currently using XML in a Python project for structured content created by hand in a text editor, and I was wondering what the general opinion was on Protocol Buffers as a user-facing input format. The speed and brevity benefits definitely seem to be there, but there are so many factors when it comes to actually generating and processing the data.
If you are looking for user-facing interaction, stick with XML. It currently has more support, understanding, and general acceptance. If it's internal, I would say that protocol buffers are a great idea.
Maybe in a few years as more tools come out to support protocol buffers, then start looking towards that for a public facing api. Until then... JSON?
Protocol buffers are intended to optimize communications between machines. They are really not intended for human interaction. Also, the format is binary, so it could not replace XML in that use case.
I would also recommend JSON as being the most compact text-based format.
Another drawback of a binary format like PB is that a single bit error can make the entire data file unparsable, whereas with JSON or XML you can, as a last resort, still fix the error by hand, because the format is human-readable and has redundancy built in.
From your brief description, it sounds like protocol buffers is not the right fit. The phrase "structured content created by hand in a text editor" pretty much screams for XML.
But if you want efficient, low latency communications with data structures that are not shared outside your organization, binary serialization such as protocol buffers can offer a huge win.
I've been messing around with a few personal projects and have found the need to offload the processing of a large amount of data to beefier, dedicated servers. I tend to do this over XML-RPC in Python, and I've made some interesting observations that I wanted to share, and to see whether anybody knows of a better or more efficient way of doing this.
So, let's say I need to send a large amount of data over XML-RPC in Python. What's the fastest way of doing this?
I started doing some experimenting with the XML-RPC module, as there isn't much about it online. Initially, to handle my data (~15 megabytes), I was simply passing a dictionary object to the XML-RPC method on the client side. This was very slow on both the server and client sides: each took a few minutes just to encode/decode the data! I assume (but am not sure) that this is an issue with having to encode a lot of data in XML.
However, after some fiddling around, I tried serializing the dictionary as a JSON object with json.dumps on the client end and loading it with json.loads on the server side, which, to my surprise, ended up being many times faster.
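For reference, the pattern that ended up working looks roughly like this (a minimal sketch assuming Python 3's xmlrpc module; the process method name and payload shape are invented):

```python
import json
from xmlrpc.server import SimpleXMLRPCServer

# --- server side ---
def process(payload_json):
    data = json.loads(payload_json)   # single cheap JSON decode
    # ... heavy processing on `data` goes here ...
    return json.dumps({"ok": True})

server = SimpleXMLRPCServer(("0.0.0.0", 8000))
server.register_function(process)
# server.serve_forever()  # run this in the server process

# --- client side (separate process/machine) ---
# import xmlrpc.client
# proxy = xmlrpc.client.ServerProxy("http://server:8000/")
# payload = {"rows": [[1.0, 2.0, 3.0]] * 100000}  # stand-in for the ~15 MB dict
# result = json.loads(proxy.process(json.dumps(payload)))
```

XML-RPC now marshals a single string instead of walking the whole dictionary, which seems to be where the time was going.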
Warning: Pure Speculation!
I suspect that the XML encoding may be so much slower than JSON encoding because json.dumps is backed by a C extension, while I do not know whether Python's XML-RPC marshalling has a C implementation. I ran into a similar issue with json.dumps vs json.dump in a previous project: the latter is many times slower because it runs in pure Python instead of C (in the words of the Python bug report, "json.dump doesn't use the C accelerations": https://bugs.python.org/msg137170).
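One quick way to test that speculation is to time the XML-RPC marshaller directly against the json encoder (a rough sketch; the data shape and repetition counts are arbitrary):

```python
import json
import timeit
import xmlrpc.client

data = {"values": list(range(200000))}

# xmlrpc.client.dumps expects a tuple of call parameters.
xml_secs = timeit.timeit(lambda: xmlrpc.client.dumps((data,)), number=5)
json_secs = timeit.timeit(lambda: json.dumps(data), number=5)

print(f"xmlrpc marshalling: {xml_secs:.2f}s  json encoding: {json_secs:.2f}s")
```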
I could theoretically upload the serialized JSON string (or pickled dict object, but this strikes me as a bad idea) to cloud storage, such as AWS S3, and then pull it on the server end, but I feel like I might as well just send the data directly from one machine to the other at that point.
I'm going to experiment with doing some gzip compression on the serialized JSON string to hopefully cut down on network bandwidth being a bottleneck, as I eventually plan on being able to handle gigabytes of data over RPC. I'll post my results here.
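The compression step itself is only a few lines; something like the sketch below is what I plan to try, using xmlrpc.client.Binary to carry the raw bytes (names carried over from the earlier sketch):

```python
import gzip
import json
import xmlrpc.client

payload = {"rows": [[1.0, 2.0, 3.0]] * 100000}

# Repetitive JSON text tends to compress very well.
blob = gzip.compress(json.dumps(payload).encode("utf-8"))

# Binary wraps raw bytes so they survive XML-RPC transport
# (they go base64-encoded on the wire, costing roughly 33% in size).
wire = xmlrpc.client.Binary(blob)

# Receiving end: unwrap, decompress, decode.
restored = json.loads(gzip.decompress(wire.data).decode("utf-8"))
assert restored == payload
```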
I thought this was interesting, and I'm wondering if anybody has run into this issue before, and how they've gone about it. I haven't been able to find much online. Cheers!
I've recently read an article about protocol buffers:

Protocol Buffers is a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data.
What I want to know is: where are they used? Are there any real-life examples rather than simple address book examples? Are they, for example, used to pre-cache query results from databases?
Protocol buffers are a data storage and exchange format, notably used for RPC - communication between programs or computers.
Alternatives include language-specific serialization (Java serialization, Python pickles, etc.), tabular formats like CSV and TSV, structured text formats like XML and JSON, and other binary formats like Apache Thrift. Conceptually these are all just different ways of representing structured data, but in practice they have different pros and cons.
Protocol buffers are:
Space-efficient, relying on a custom binary format to represent data compactly.
Strongly type-safe across languages (particularly valuable in strongly-typed languages like Java, but even in Python it's still quite useful).
Designed to be backwards- and forwards-compatible. It's easy to make structural changes to protocol buffers (normally adding new fields or deprecating old ones) without needing to ensure all applications using the proto are updated simultaneously.
Somewhat tedious to work with manually. While there is a text format, it is mostly useful for manually inspecting, not storing, protos. JSON, for instance, is much easier for a human to write and edit. Protos are therefore usually written and read by programs.
Dependent on a .proto compiler. By separating the structure from the data, protocol buffers can be lean and mean, but without an associated .proto file and a tool like protoc to generate code to parse it, arbitrary data in proto format is unusable. This makes protos a poor choice for sending data to other people who may not have the .proto file on hand. A sketch of the workflow follows this list.
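To make the workflow concrete, here is a rough sketch (the message and field names are invented; the generated module name follows protoc's <file>_pb2 convention, and the Python `protobuf` package is assumed to be installed):

```python
# Assume a hypothetical trade.proto, compiled with `protoc --python_out=. trade.proto`:
#
#   syntax = "proto3";
#   message Trade {
#     string symbol      = 1;
#     int64  price_cents = 2;
#     int32  quantity    = 3;
#   }
#
import trade_pb2  # module generated by protoc

msg = trade_pb2.Trade(symbol="ACME", price_cents=15233, quantity=100)

wire = msg.SerializeToString()   # compact binary; no field names on the wire
parsed = trade_pb2.Trade()
parsed.ParseFromString(wire)     # only readable with the schema-generated code
assert parsed.symbol == "ACME"
```

Note how both compactness and the .proto dependency fall out of the same design: the wire bytes carry only tag numbers and values, so anyone without the schema sees opaque data.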
To make some sweeping generalizations about different formats:
CSV/TSV/etc. are useful for human-constructed data that never needs to be transmitted between people or programs. They're easy to construct and easy to parse, but a nightmare to keep in sync, and they can't easily represent complex structures.
Language-specific serialization like pickles can be useful for short-lived serialization, but it quickly runs into backwards-compatibility issues and obviously limits you to one language. Except in some very specific cases, protobufs accomplish all the same goals with more safety and better future-proofing.
JSON is ideal for sending data between different parties (e.g. public APIs). Because the structure and the content are transmitted together anyone can understand it, and it's easy to parse in all major languages. There's little reason nowadays to use other structured formats like XML.
Binary formats like Protocol Buffers are ideal for almost all other data serialization use cases: long- and short-term storage, inter-process communication, intra-process and application-wide caching, and more.
Google famously uses protocol buffers for practically everything they do. If you can imagine a reason to need to store or transmit data, Google probably does it with protocol buffers.
I used them to create a financial trading system. Here are the reasons:
There are libraries for many languages. Some things needed to be in C++, others in C#, and it was open to extension to Python or Java, etc.
It needed to be fast to serialize/deserialize and compact, because of the speed requirements of a financial trading system. The messages were quite a lot shorter than comparable text-based messages, which meant there was never a problem fitting them into one network packet.
It didn't need to be readable off the wire. The previous system used XML, which is nice for debugging, but you can get debugging output in other ways and turn it off in production.
It gives your message a natural structure, and an API for getting the parts you need. Writing something custom would have required thinking about all the helper functions to pull numbers out of the binary, with corner cases and all that.
I am trying to implement a server using python-twisted with potential C# and ObjC clients. I started with LineReceiver and that works well for basic messaging, but I can't figure out the best approach for something more robust. Any ideas for a simple solution for the following requirements?
Request and response
ex. send message to get a status, receive status back
Receive binary data transfer (non-trivial, but not massive; less than a few megs)
ex. bytes of a small png file
AMP seems like a feasible solution for the first scenario, but may not be able to handle the size for the data transfer scenario.
I've also looked at full-blown SOAP but haven't found a decent enough example to get me going.
I like AMP a lot. twisted.protocols.amp is moderately featureful and relatively easily testable (although documentation on how to test applications written with it is a little lacking).
The command/response abstraction AMP provides is comfortable and familiar (after all, we live in a world where HTTP won). AMP avoids the trap of excessive complexity (seemingly for the sake of complexity) that SOAP fell squarely into. But it's not so simple you won't be able to do the job with it (like LineReceiver most likely is).
There are intermediate steps - for example, twisted.protocols.basic.Int32StringReceiver gives you a more sophisticated framing mechanism (32-bit length prefixes instead of magic-bytes-terminated lines) - but in my opinion AMP is a really good first choice for a protocol. You may find you want to switch to something else later (one size really does not fit all), but AMP is at the sweet spot between features and simplicity that seems like a good fit for a very broad range of applications.
It's true that there are some built-in length limits in AMP. This is a long-standing sore spot that is just waiting for someone with a real-world need to address it. :) There is a fairly well-thought-out design for lifting this limit (without breaking protocol compatibility!). If AMP seems otherwise appealing to you, then I encourage you to engage the Twisted development community to find out how you can help make this a reality. ;)
There's also always the option of using AMP for messaging and setting up another channel (e.g., HTTP) for transferring your larger chunks of data.
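To give a flavor of what the command/response abstraction looks like for your two scenarios, here is a minimal sketch (the command names, fields, and file handling are all invented):

```python
from twisted.protocols import amp

class GetStatus(amp.Command):
    arguments = []
    response = [(b"status", amp.Unicode())]

class GetImage(amp.Command):
    arguments = [(b"name", amp.Unicode())]
    # amp.String() carries raw bytes, but each AMP value is capped at
    # 65,535 bytes -- larger payloads need chunking or a side channel.
    response = [(b"data", amp.String())]

class MyServer(amp.AMP):
    @GetStatus.responder
    def get_status(self):
        return {"status": "ok"}

    @GetImage.responder
    def get_image(self, name):
        with open(name + ".png", "rb") as f:  # assumes a local png file
            return {"data": f.read()}
```

A client calls these with callRemote(GetStatus) and gets a Deferred firing with the response dict, which is the comfortable request/response shape you described.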
I would like to create a structure in Python which represents a Simulink model. I am aware of at least two ways of doing this - by parsing an ".mdl" file, or by using Matlab's api for communicating with the model.
Can you recommend good libraries or APIs for doing this?
In particular, I need to perform some processing on a Simulink model and I would like to do it in Python. Also I don't want to be constantly communicating with Matlab for doing this (so that I can release the floating license).
I have seen some parsers online, but they seem to be a little limited, usually not supporting components such as Bus Creators and Bus Selectors, Muxes, Demuxes, and reading UserData information.
Any help will be greatly appreciated.
Not my area, but noticed this Python parser which may be helpful.
Or you can purchase the Simulink Report Generator in order to save/manipulate the model as an XML file.
Or: the *.mdl file is a readable ASCII file. You could read it into a string with an fread statement, alter the string, then either save it to your format of choice or write it back out to a *.mdl file. My coworker thought of this, not me! But it would require doing the editing/parsing with a routine you write yourself.
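In Python, that approach is only a few lines; a rough sketch (the file names, block name, and edit are invented, and regexes only go so far against the nested brace format):

```python
import re

with open("model.mdl") as f:
    text = f.read()

# Pull out block names as a sanity check (crude: ignores nesting).
names = re.findall(r'Block\s*{[^{}]*?Name\s+"([^"]+)"', text, re.S)
print(names)

# Simple edits can be plain string substitutions written back out.
with open("model_edited.mdl", "w") as f:
    f.write(text.replace('"Gain1"', '"Gain_renamed"'))
```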
This question may be seen as subjective, but I'd like to ask SO users which common structured textual data format is best supported in Python.
My initial choices are:
XML
JSON
YAML
Which of these three is easiest to work with in Python (i.e. has the best library support/performance)? Or is there another format that I haven't mentioned that is better supported in Python?
I cannot use a Python-only format (e.g. pickling) since interop is quite important, but the majority of the code that handles these files will be written in Python, so I'm keen to use a format that has the strongest support in Python.
CSV or fixed column text may also be viable for most use cases, however I'd prefer the flexibility of a more scalable format.
Thank you
Note
Regarding interop: I will be generating these files initially from Ruby (using Builder); however, Ruby will not be consuming these files again.
I would go with JSON. YAML is awesome, but interop with it is not that great.
XML is just an ugly mess to look at and has too much fat.
Python has a built-in JSON module since version 2.6.
JSON has great Python support, and it is much more compact than XML (and the API is generally more convenient if you're just trying to dump and load objects). There's no out-of-the-box support for YAML that I know of, although I haven't really checked. In the abstract I would suggest using JSON due to the low overhead of the format and the wide range of language support, but it does depend a bit on your application - if you're working in a space that already has established applications, the formats they use might be preferable, even if they're technically deficient.
I think it depends a lot on what you need to do with the data. If you're going to be building a complex database and doing processing and transformations on it, I suspect you'd be better off with XML. I've found the lxml module pretty useful in this regard. It has full support for standards like XPath and XSLT, and this support is implemented in native code, so you'll get good performance.
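For instance, an XPath query with lxml is a one-liner (the document and query here are invented for illustration):

```python
from lxml import etree

doc = etree.fromstring(b"<catalog><item id='1'>widget</item></catalog>")
print(doc.xpath("//item[@id='1']/text()"))  # ['widget'] -- native-code XPath
```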
But if you're doing something more simple, then likely you'd be better off to use a simpler format like yaml or json. I've heard tell of "json transforms" but don't know how mature the technology is or how developed Python's access to it is.
It's pretty much all the same, out of those three. Use whichever is easier to inter-operate with.