Best format to store data - python

In my program I read data from a file and then parse it. The format is
data | data | data | data | data
What is a better format to store data in?
It must be easily parsed by python and easy to use.

JSON - http://docs.python.org/2/library/json.html
CSV - http://docs.python.org/2/library/csv.html?highlight=csvreader
XML - there's a selection of libraries to choose from, depending on what you need.

Take a look at pickling. You can serialise and write objects to a file and then read them back later.
If the data needs to be read by programs written in other languages, consider using JSON.
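A minimal sketch of that pickle round trip (the file name is arbitrary):

import pickle

data = {"name": "example", "values": [1, 2, 3]}

# Serialize the object to a file...
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# ...and read it back later.
with open("data.pkl", "rb") as f:
    restored = pickle.load(f)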

Your data format is fine if you don't need to use the pipe (|) character anywhere. Databases often use pipe-delimited data and it's easily parsed.
CSV (comma-separated values) is a more universal format, but not much different from pipe-separated. Both have some limitations, but for simple data they work fine.
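In fact, the csv module accepts any single-character delimiter, so the existing pipe-separated format can be parsed directly (the file name here is a placeholder):

import csv

with open("data.txt") as f:
    for row in csv.reader(f, delimiter="|"):
        print(row)  # each row is a list of field strings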
XML is good if you have complex data, but it's a more complicated format. Complicated doesn't necessarily mean better if your needs are simple, so think about the data you want to store and whether you need to transfer it to other apps or languages.

Related

How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure in a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic.
I'm looking for something of this sort:
foods = {
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}
I would like to be able to add additional data to the table like so:
newFruit = {"cherry":"red"}
foods["fruits"].append(newFruit)
Is there any way to do this in python with JSON without loading the whole file?
That is not possible with pure JSON; appending to a JSON document always requires reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON value on its own; it's what AWS uses for some of its APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry is then very easy, because it is just a matter of appending one line to the end of the file.
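A sketch of such an append, reusing the vegetables file from above (the new record is made up):

import json

new_entry = {"cherry": "red"}  # hypothetical new record

# Appending touches only the end of the file; nothing is read into memory.
with open("vegetables.json", "a") as f:
    f.write(json.dumps(new_entry) + "\n")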
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic
If your file is really too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have here is a design issue, not a coding one.
There's at least one streaming JSON parser that might or might not let you solve the issue, depending on the source data structure and the actual updates you have to do.
That being said, given today's computers, you need a really huge JSON file to eat up all your RAM, so before anything else you should probably just check the actual file size and how much memory is needed to parse it into Python.
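One such parser is the third-party ijson library (my assumption; the original link is not preserved here). A minimal sketch, assuming the source file holds one top-level JSON array:

import ijson  # pip install ijson

with open("big.json", "rb") as f:
    # "item" selects the elements of a top-level array, one at a time.
    for item in ijson.items(f, "item"):
        print(item)  # handle each element without loading the whole file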

A file format writable by python, readable as a Dataframe in Spark

I have Python scripts (no Spark here) producing some data files that I want to be easily readable as DataFrames in a Scala/Spark application.
What's the best choice ?
If your data doesn't have newlines in it, then a simple text-based format such as TSV is probably best.
If you need to include binary data, then a serialized format like protobuf makes sense - anything for which a Hadoop InputFormat exists should be fine.
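A sketch of the Python writing side (names and paths are made up); on the Spark side, something like spark.read.option("sep", "\t").csv(path) should then load it as a DataFrame:

import csv

rows = [("alice", 42), ("bob", 7)]  # hypothetical records

with open("out.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)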

Where to store metadata associated with files?

This is a question on storing and loading data, particularly in Python. I'm not entirely sure this is the appropriate forum, so redirect me if not.
I'm handling about 50 1000-row CSV files, and each has 10 parameters of associated metadata. What is the best method to store this in regards to:
(A) All the information is human-readable plain text and it's easy for a non-programming human to associate data and metadata.
(B) It's convenient to load the metadata and each column of the csv to a python dictionary.
I've considered four possible solutions:
(0) Previously, I've stored smaller amounts of metadata in the filename. This is bad for obvious reasons.
(1) Assign each CSV file an ID number, name each "ID.csv" and then produce a "metadata.csv" which maps each CSV ID number to its metadata. The shortcoming here is that using ID numbers reduces human readability. (To learn the contents of a file, a non-programming human reader must manually check "metadata.csv".)
(2) Leave the metadata at the top of the CSV file. The shortcoming is that my program would need to perform two steps: (a) get the metadata from some arbitrary number of lines at the top of the file and (b) tell the CSV reader (pandas.read_csv) to ignore the first few lines - see the sketch after this list.
(3) Convert the CSV to some data serialization format like YAML, where I could then easily include the metadata. The shortcomings are that loading the columns of the CSV into my dictionary becomes less straightforward, and not everyone knows YAML.
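For option (2), the two steps would look something like this (the file name and header length are made up, assuming one "key: value" line per parameter):

import pandas as pd

META_LINES = 10  # hypothetical number of metadata lines at the top

with open("run1.csv") as f:
    metadata = {}
    for _ in range(META_LINES):
        key, value = next(f).strip().split(": ", 1)
        metadata[key] = value

# (b) tell pandas to skip the metadata block when reading the table
df = pd.read_csv("run1.csv", skiprows=META_LINES)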
Are there any clever solutions to this problem? Thanks!
This question is a tad subjective so it may be closed, but let me offer the suggestion of the built-in Python module for handling JSON. JSON maintains a good balance of human-readability and is highly portable to almost any language. You could convert your original data to something like this:
{
"metadata":{"name":"foo", "status":"bar"},
"data":[[1,2,3],[4,5,6],[....]]
}
where data is your original CSV file and metadata is a dictionary containing whatever data you would like to store. It is also simple to "strip" the metadata out and return the original CSV data from this format - all within the confines of built-in Python modules.
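A sketch of both directions with only built-in modules (file names are placeholders):

import csv
import json

def wrap(csv_path, metadata):
    # Bundle the CSV rows and the metadata dict into one JSON document.
    with open(csv_path) as f:
        return {"metadata": metadata, "data": list(csv.reader(f))}

def unwrap(doc, csv_path):
    # Strip the metadata back out and rewrite the original CSV.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(doc["data"])
    return doc["metadata"]

doc = wrap("run1.csv", {"name": "foo", "status": "bar"})
with open("run1.json", "w") as f:
    json.dump(doc, f)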

Reverse-engineering a communication protocol

What serialization format is this, and are there any libraries to parse it back to Python-native data structures, or at least something easier to manage?
At least it looks like it could have a 1:1 equivalent in Python.
%xt%tableFameUpdate%-1%{"season":[1.329534083671E9,"160",53255],"leaderboard":[["1001:6587656216929005792","1718","Kjeld","http:/..."],["1001:6301086609221020111","802","Asti","http://..."],["1018:995158152656680513","419","QiZOra","http://..."],["1018:8494206166685317681","364","Bingay","http://..."],["1:100000380528383","160","...","http://..."]],"multipliers":{"1001:6835768553933918921":67,"1001:4106589374707547411":0,"1001:5353968490097996024":0,"1018:1168770734837476224":0,"1018:8374571792147098127":0,"1001:4225536539330822139":0,"1:100000380528383":0,"1001:4082457720357735190":68,"1001:1650191466786177826":0,"1001:4299232509980238095":38,"1001:7604050184057349633":0,"1001:6587656216929005792":0,"1001:3852516077423175846":0,"1001:888471333619738847":9,"1001:7823244004315560346":0,"1001:7665905871463311833":0,"1001:4453073160237910447":0,"1001:6338802281112620503":64,"1001:7644306056081384910":13,"1001:4956919992342871722":0,"1001:4126528826861913228":29,"1001:7325864606573096759":47,"1001:6494182198787618518":16,"1001:3678910058012926187":4,"1001:435065490460532259":39,"1001:5366593356123167358":0,"1001:6041488907938219046":8,"1001:6051083835382544277":5,"1001:9187877490300372546":0,"1001:482518425014054339":0}}%
If you strip off the first piece and the last percent sign, it is JSON, which you could parse with any JSON parser. It looks like it's using the percent signs as a sort of separator; you could probably split on those.
Apart from the few characters at the beginning, this looks like JSON.
%xt%tableFameUpdate%-1% is not JSON, but the rest is. There are a lot of JSON parsers for Python; pick one and it should parse your data without a hitch.
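A sketch of that split: everything between the first "{" and the last "}" is the JSON body, and the %-separated prefix can be kept as a header.

import json

raw = "%xt%tableFameUpdate%-1%{...}%"  # stands for the full captured message above

start = raw.index("{")
end = raw.rindex("}") + 1

header = [p for p in raw[:start].split("%") if p]  # ['xt', 'tableFameUpdate', '-1']
payload = json.loads(raw[start:end])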

Printing several binary data fields from Google DataStore?

I'm using Google App Engine and Python for a web service. Some of the models (tables) I have in my web service have several binary data fields in them, and I'd like to present this data to a computer requesting it, all fields at the same time. Now, the problem is I don't know how to write it out in a way that lets the other computer know where one field ends and the next starts. I've been using JSON for all the things that aren't binary, but afaik JSON doesn't work for binary data. So how do you get around this?
You could of course separate the data out into its own model and reference it back from some metadata model. That would allow you to make a single page that just prints one data field of one of the items, but that is clumsy to implement on both the server and the client.
Another solution would be to put in some kind of separator, and just split the data on that. I suppose it would work and that's how you do it, but isn't there like a standardized way to do that? Any libraries I could use?
In short, I'd like to be able to do something like this:
binaryDataField1: data data data ...
binaryDataField2: data data data ...
etc
Several easy options:
base64 encode your data - meaning you can still use JSON.
Use Protocol Buffers.
Prefix each field with its length - either as a 4- or 8-byte integer, or as a numeric string (see the sketch below).
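A sketch of the length-prefix approach with 4-byte big-endian lengths (the framing choice is arbitrary):

import struct

def pack_fields(fields):
    # Prefix each binary field with its length as a 4-byte big-endian int.
    return b"".join(struct.pack(">I", len(f)) + f for f in fields)

def unpack_fields(blob):
    fields, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        fields.append(blob[offset:offset + length])
        offset += length
    return fields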
One solution that would leverage your JSON investment would be to simply convert the binary data to something JSON can support. For example, Base64 encoding might work well for you: you could treat the output of your Base64 encoder just like a normal string in JSON. It looks like Python has Base64 support built in, though I only use Java on App Engine, so I can't guarantee that library works in the sandbox or not.
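A sketch of that round trip (the field names and byte values are made up):

import base64
import json

record = {"binaryDataField1": b"\x00\x01\x02", "binaryDataField2": b"\xff\xfe"}

# Encode each bytes value as a Base64 string so JSON can carry it.
payload = json.dumps({k: base64.b64encode(v).decode("ascii") for k, v in record.items()})

# The client reverses the encoding to recover the raw bytes.
decoded = {k: base64.b64decode(v) for k, v in json.loads(payload).items()}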
