How can I get pb2.py into Excel?

I am having a hard time with this. Is there a way to get a compiled protocol buffer file's (pb2.py) contents into Excel?

Your question lacks detail and a demonstration of your attempt to solve this problem; it is likely that it will be closed.
Presumably your intent is to start with serialized binary/wire-format protocol buffer messages, unmarshal these into Python objects and then, using a Python package that can interact with Excel, enter these objects as rows into Excel.
The Python (_pb2.py) file generated by the protocol buffer compiler (protoc) from a .proto file contains everything you need to marshal and unmarshal messages between the binary/wire format and the Python objects that represent the messages defined by the .proto file. The protocol buffer documentation is comprehensive and explains this well.
Once you've unmarshaled the data into one or more Python objects, you will need to use the Python package for Excel of your choosing to output these objects into the spreadsheet(s).
It is unclear whether you have flat or hierarchical data. If you have anything non-trivial, you'll also need to decide how to represent that structure in the spreadsheet's table-oriented layout.
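As an illustration, here is a minimal sketch, assuming a hypothetical example_pb2 module generated by protoc (with an AddressBook message holding a repeated Person field) and the third-party openpyxl package; your actual message types and fields will differ.

from openpyxl import Workbook
import example_pb2  # hypothetical module generated by protoc

# Unmarshal the binary/wire-format data into a Python object.
book = example_pb2.AddressBook()
with open("addressbook.bin", "rb") as f:
    book.ParseFromString(f.read())

# Write one row per message to a worksheet.
wb = Workbook()
ws = wb.active
ws.append(["name", "email"])  # header row
for person in book.people:
    ws.append([person.name, person.email])
wb.save("addressbook.xlsx")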

Related

How to autogenerate python data parser for C++ structs?

I have a C++ based application logging data to files and I want to load that data in Python so I can explore it. The data files are flat files with a known number of records per file. The data records are represented as a struct (with nested structs) in my C++ application. These structs (and substructs) change regularly during my development process, so I also have to make associated changes to the Python code that loads the data. This is obviously tedious and doesn't scale well. What I am interested in is a way to automate the process of updating the Python code (or some other way to handle this problem altogether). I am exploring some libraries that convert my C++ structs to other formats such as JSON, but I have yet to find a solid solution. Can anyone suggest something?
Consider using a data serialization system/format that has both C++ and Python bindings: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats
(e.g. protobuf or even json or csv)
Alternatively, consider writing a library in C that reads the data and exposes it as structures. Then use https://docs.python.org/3.7/library/ctypes.html to call this C library and retrieve records.
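For illustration only, a minimal ctypes sketch, assuming a hypothetical shared library libreader.so that exposes a Record struct and a read_record function; the real struct layout must match the C side exactly.

import ctypes

# Mirror of the (hypothetical) C struct; field order and types must
# match the C definition exactly.
class Record(ctypes.Structure):
    _fields_ = [
        ("id", ctypes.c_int32),
        ("value", ctypes.c_double),
    ]

# Load the (hypothetical) C library and declare the reader's signature.
lib = ctypes.CDLL("./libreader.so")
lib.read_record.argtypes = [ctypes.c_char_p, ctypes.POINTER(Record)]
lib.read_record.restype = ctypes.c_int

rec = Record()
if lib.read_record(b"data.bin", ctypes.byref(rec)) == 0:
    print(rec.id, rec.value)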
Of course, if the semantics of the data change (e.g. a new important field needs to be analyzed) you will have to handle that new stuff in the Python code. No free lunch.

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database at the moment. I use data from pushshift.io. In order to deal with a big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller chunks of 3GB each so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file that I just created, and I see this in it:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
The sample correction of the data looks like this
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I notice that my bash script split the file without paying attention to the JSON objects. So my question is: are there ways to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental JSON parsers available in Python. A quick search shows that ijson should allow you to traverse your very large data structure without exploding memory.
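A minimal sketch of that idea, assuming the third-party ijson package and a file containing one top-level JSON array of comment objects (the file name is hypothetical; a pushshift dump with one object per line would need a different prefix or approach):

import ijson

# Stream items one at a time instead of loading the whole file into RAM.
with open("comments.json", "rb") as f:
    for comment in ijson.items(f, "item"):  # "item" = each array element
        print(comment["author"], comment["score"])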
You should also consider another data format (or a real database), or you will easily find yourself spending time reimplementing much slower versions of features that already exist in the right tools.
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap your code in a try/except block and catch this exception to make sure you only process correctly formatted data.
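For example, a minimal sketch that keeps only the objects that parse cleanly, assuming one JSON object per line (the usual layout of pushshift dumps; the file name is hypothetical):

import json

# Keep only lines that parse as valid JSON; skip the broken ones.
valid = []
with open("chunk.json", "r", encoding="utf-8") as f:
    for line in f:
        try:
            valid.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # malformed fragment, e.g. an object cut in half by the split
print(f"kept {len(valid)} objects")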

extracting GPB descriptor from a proto file

I have a proto file defining some GPB (protocol buffer) messages.
I want to implement a simple Python script that goes over the different messages and writes to an external file (let's say a JSON file) the basic information regarding each of the messages' fields (name, type, default value, etc.).
I searched the web and found that once I get the GPB descriptor, the rest should be relatively easy.
However, I have no idea how to get the descriptor itself.
Can someone help me here? Thanks.
protoc has an option --descriptor_set_out which writes the descriptors as a FileDescriptorSet as described in descriptor.proto from the Protobuf source code. See protoc --help for more info.
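A minimal sketch of that route: after running protoc --descriptor_set_out=messages.desc messages.proto (both file names hypothetical), the resulting FileDescriptorSet can be parsed in Python:

from google.protobuf import descriptor_pb2

# Parse the FileDescriptorSet produced by protoc.
fds = descriptor_pb2.FileDescriptorSet()
with open("messages.desc", "rb") as f:
    fds.ParseFromString(f.read())

# Walk the messages and dump basic field information.
for file_proto in fds.file:
    for message in file_proto.message_type:
        for field in message.field:
            print(message.name, field.name, field.type, field.default_value)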
Alternatively, you might consider writing your script as a code generator plugin. In this case you wouldn't be generating code, just a JSON file (or whatever), but the mechanism is the same.

Python: How to write to http input stream

I have seen a couple of examples of reading from an HTTP stream, but how do I write to an HTTP input stream using Python?
You could use the standard library module httplib: in the HTTPConnection.request method, the body argument (since Python 2.6) can be an open file object (better a "pretty real" file, since, as the docs say, "this file object should support fileno() and read() methods"; but it could be a named or unnamed pipe to which a separate process is writing). The advantage, however, is dubious, since (again per the docs) "The header Content-Length is automatically set to the correct value" -- which, since headers come before the body and the file's content length can't be known until the file is read, implies the whole file is going to be read into memory anyway.
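A minimal sketch of that approach, using http.client (the Python 3 name for httplib); the host, path and file name here are hypothetical:

import http.client

conn = http.client.HTTPConnection("example.com")
with open("payload.bin", "rb") as body:
    # body is an open file object; the library reads it to send the
    # request body and sets the Content-Length header automatically.
    conn.request("POST", "/upload", body=body)
response = conn.getresponse()
print(response.status, response.reason)
conn.close()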
If you're desperate to "stream" dynamically generated content into an HTTP POST (rather than preparing it all beforehand and then posting), you need a server supporting HTTP's "chunked transfer encoding": this SO question's accepted answer mentions that the popular asynchronous networking Python package twisted does, and gives some useful pointers.

Using Python, how do I get a binary serialization of my Google protobuf message?

I see the function SerializeAsString in the protobuf Python documentation, but as the name suggests, this gives me a string version of the binary data. Is there a way of serializing and parsing a binary array of protobuf data using Python?
We have a C++ application that stores the protobuf messages as binary data in a file. We'd like to read and write to the file using Python.
Python strings can hold binary data, therefore SerializeAsString returns binary data.
SerializeToString(), despite its name, returns the bytes type, not the str type.
Part of what makes this confusing is that the method name remains unchanged, and they haven't updated the documentation; you just have to know that when they say "String" they mean bytes.
I think that strings are the usual way to represent binary data in Python. What do you exactly want to do?
[Edit]
Have a look at the struct module: http://docs.python.org/library/struct.html
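For example, a minimal struct sketch for packing and unpacking raw binary records (the field layout here is made up):

import struct

# Pack an int and a float into 8 bytes (little-endian), then unpack.
packed = struct.pack("<if", 42, 3.14)
number, value = struct.unpack("<if", packed)
print(number, value)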
You can use Python's strings for getting protocol buffers' serialized data (it doesn't matter how they were created - in Python, Java, C++ or any other language).
This is a line from the Python version of the protocol buffers tutorial:
address_book.ParseFromString(f.read())
It's not clear what you want to do:
Do something with the serialized form of an entire message (from the SerializeAsString method)? Not sure what you'd want to do with this.
Store a byte string inside a protobuf message? Just use the bytes type in the .proto file, and a byte string in Python for the variable.
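If the goal is simply the binary round-trip with the C++ application, here is a minimal sketch, assuming a hypothetical addressbook_pb2 module generated by protoc with a Person message:

import addressbook_pb2  # hypothetical module generated by protoc

person = addressbook_pb2.Person()
person.name = "Alice"

# SerializeToString() returns bytes (in Python 3); write in binary mode.
with open("person.bin", "wb") as f:
    f.write(person.SerializeToString())

# The same bytes parse back here or in the C++ application.
restored = addressbook_pb2.Person()
with open("person.bin", "rb") as f:
    restored.ParseFromString(f.read())
print(restored.name)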
