How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure on a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information, and writing the whole file back to the filesystem, as some answers suggested, will prove problematic.
I'm looking for something of this sort:
foods = [{
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}]
I would like to be able to add additional data to the table like so:
newFruit = {"cherry":"red"}
foods["fruits"].append(newFruit)
Is there any way to do this in python with JSON without loading the whole file?

That is not possible with pure JSON: appending to a JSON list will always require reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON value on its own; it's what AWS uses for their APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry is then very easy, because it is just a matter of appending one new line to the end of the file.
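A minimal sketch of that append in Python, assuming the file is named vegetables.jsonl and uses the line-per-record format above:

import json

new_entry = {"cherry": "red"}

# Appending never rewrites the existing contents: open in append mode
# and write one JSON document per line.
with open("vegetables.jsonl", "a") as f:
    f.write(json.dumps(new_entry) + "\n")

# Reading it back is a line-by-line loop, one dict per line.
with open("vegetables.jsonl") as f:
    entries = [json.loads(line) for line in f if line.strip()]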

Since the file will contain a lot of information, loading the whole file into memory, appending information, and writing the whole file back to the filesystem, as some answers suggested, will prove problematic
If your file is really too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have in this case is a design issue, not a coding one.
There's at least one streaming JSON parser that might or might not allow you to solve the issue, depending on the source data structure and the actual updates you have to do.
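One such parser is the third-party ijson package; a minimal sketch, assuming the foods array from the question is stored as plain JSON in foods.json:

import ijson  # streaming JSON parser: pip install ijson

# Stream only the "fruits" mapping of each top-level object instead of
# parsing the whole document into memory at once.
with open("foods.json", "rb") as f:
    for fruits in ijson.items(f, "item.fruits"):
        for name, colour in fruits.items():
            print(name, colour)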
That being said, given today's computers, you need a really huge JSON file to eat up all your RAM, so before anything else you should probably just check the actual file size and how much memory it takes to parse it into Python.

Related

Reading custom format file using apache beam

I am new to Apache Beam. I have a requirement to read a text file with the format as given below
a=1
b=3
c=2

a=2
b=6
c=5
Here, all rows up to an empty line are part of one record and need to be processed together (e.g. inserted into a table as columns). The above example corresponds to a file with just 2 records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process lines until I hit an empty line.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My doubt is: while reading, will I get the records in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What can I use to solve this problem?
I am using python language. I have to read the file from the GCP bucket and use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO which treats each line of a text file as a separate element, but what you need is to gather each set of data for a record as one element. There are many potential ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple option is to retrieve them with FileIO and then write a custom ParDo containing your own logic for reading and splitting them. This is not as simple as using an existing IO out of the box, but it is still easier than creating a fully featured Source.
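A minimal sketch of that FileIO + ParDo approach, assuming the blank-line-delimited format above and a hypothetical GCS path:

import apache_beam as beam
from apache_beam.io import fileio


class SplitIntoRecords(beam.DoFn):
    """Read one matched file and emit a dict per blank-line-delimited record."""

    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split("\n\n"):
            lines = [ln for ln in block.splitlines() if ln.strip()]
            if lines:
                # each line looks like "a=1"; turn the whole block into a dict
                yield dict(line.split("=", 1) for line in lines)


with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.txt")  # hypothetical path
        | fileio.ReadMatches()
        | beam.ParDo(SplitIntoRecords())
    )

Because each file is read inside a single process() call, the lines of one record stay together even though the resulting PCollection itself is unordered.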
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.

How to control python program behavior with external file presence/content?

I developed a piece of software which can be automatically updated, so I need externally placed config files. For now I use a JSON file to store user-input variables like the user name, etc. But I am not sure how the program itself should be controlled. I mean things like checking whether the program is opened for the first time after an update (to know if update notes should be shown), which functions were already used, etc. For now I am doing it with things like:
if os.path.exists(control_file_1):
    actions_1
if os.path.exists(control_file_2):
    some other actions unrelated to actions_1
It is independent of the files' content, so there is no need to read the file content, which is convenient.
What functions should be used to store this information in one file and read it efficiently? Just a normal file.read(), etc.? That does not seem very clean-code friendly.
Thanks
UPDATE:
Looks like ConfigParser is the way to go. Am I right? Or are there any better ways to accomplish what I am going for?
Given that you need config information stored in a file: if you keep that information in a file containing a JSON record, it is most convenient when the file is used internally, since updating and reading the record is easy (treat it as a dict).
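A minimal sketch of that approach (the config.json name and the update_notes_shown key are just illustrative):

import json

CONFIG_FILE = "config.json"  # illustrative file name

# read the whole config as a dict
with open(CONFIG_FILE) as f:
    config = json.load(f)

# flip a flag instead of relying on the presence of separate marker files
config["update_notes_shown"] = True

# write the dict back
with open(CONFIG_FILE, "w") as f:
    json.dump(config, f, indent=2)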
However, if you want a more universal config.ini reader, then you can go with the ConfigParser class, which you can use directly or wrap in your own class:
class MYConfig_Parser(ConfigParser):
so that you can check things in the constructor, like whether mandatory entries are available, before processing the entries.
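A minimal sketch of such a wrapper, keeping the class name from above (section and option names are illustrative):

from configparser import ConfigParser


class MYConfig_Parser(ConfigParser):
    """ConfigParser that validates mandatory entries when loading a file."""

    MANDATORY = {"state": ["first_run_after_update"]}

    def __init__(self, path):
        super().__init__()
        if not self.read(path):
            raise FileNotFoundError(path)
        for section, options in self.MANDATORY.items():
            for option in options:
                if not self.has_option(section, option):
                    raise ValueError(f"missing [{section}] {option} in {path}")


# usage
config = MYConfig_Parser("settings.ini")
if config.getboolean("state", "first_run_after_update"):
    print("show update notes")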

Creating a Case in PSSE

I have data in an Excel file that I would like to use to create a case in PSSE. The data is organized as it would appear in a PSSE case (i.e. for a bus: bus number, name, base kV, and so on). Of course the data can be entered manually, but I'm working with over 500 buses. I have tried copying and pasting, but that seems to work only sometimes. For machine data, it barely works.
Is there a way to import this data to PSSE from an excel file? I have recently started running PSSE with Python, and maybe there is a way to do this?
--
MK.
Yes. You can import data from an Excel file into PSSE using the Python package xlrd; however, I would recommend converting your Excel file to CSV before you import, and using the CSV, as it is much easier. Importing data through the API is not just a copy-and-paste job into the nicely tabulated spreadsheet view that PSSE provides for its case data.
Refer to the API documentation for PSSE, chapter II, and search for the function BUS_DATA_2. You will see that you can create buses with this function.
So your job is threefold.
Import the CSV file data, with each line becoming a list of the data parameters for one bus (voltage, name, base kV, PU, etc.). Store these in another list.
Iterate through the new list you just created and call:
ierr = bus_data_2(i, intgar, realar, name)
and pass in your data from the CSV file (see the PSSE API documentation on how to do this). This will effectively load the data from the CSV file into your case (in the form of nodes or buses).
After you are finished, you will need to call psspy.save("Casename.sav") to save your work as a new PSSE case.
Note: there are similar functions to load branch (line) data, fixed shunt data, generator data, etc.
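A rough sketch of that threefold job, with a hypothetical CSV column order and placeholder intgar/realar arrays (check chapter II of the API manual for their exact meaning and defaults):

import csv

import psspy  # assumes PSSE's Python environment is already set up and initialized

# hypothetical CSV layout: bus number, name, base kV
with open("buses.csv") as f:
    bus_rows = [row for row in csv.reader(f) if row]

for row in bus_rows:
    bus_number = int(row[0])
    name = row[1]
    base_kv = float(row[2])
    intgar = [1, 1, 1, 1]         # placeholder integer parameters, see BUS_DATA_2 docs
    realar = [base_kv, 1.0, 0.0]  # placeholder real parameters, see BUS_DATA_2 docs
    ierr = psspy.bus_data_2(bus_number, intgar, realar, name)
    if ierr != 0:
        print("bus_data_2 returned error %d for bus %d" % (ierr, bus_number))

psspy.save("Casename.sav")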
Your other option is to call up the PTI folks as they can give you training.
Good luck
If you have an Excel data file with exactly the same "format" and same "info" as the regular case file (.sav), try this:
Open any small example .sav file from the example sub-folder of PSSE's installation folder.
Copy the corresponding spreadsheet into the working case (shown in spreadsheet view) with the same "info" (say, bus, branch, etc.) in the PSSE GUI.
After copying everything, save the edited working case in the GUI as a new working case.
If this doesn't work, I suggest you ask this question on the "Python for Power Systems" forum:
https://psspy.org/psse-help-forum/questions/

Re-reading a dictionary structure saved as a text file, again as a dictionary

I've parsed a big corpus and saved the data I needed in a dictionary structure. But at the end of my code I saved it as a .txt file, because I needed to manually check something. Now, in another part of my work, I need that dictionary as my input. I wanted to know if there are other ways than just opening the text file and rebuilding it as a dictionary structure, or whether I can somehow keep it as it is. Is pickle the right thing for my case, or am I on a totally wrong track? Sorry if my question is naive; I'm really new to Python and still learning it.
Copied & pasted from Pickle or json? for ease of reading:
If you do not have any interoperability requirements (i.e. you're just going to use the data with Python), and a binary format is fine, go with cPickle, which gives you really fast Python object serialization.
If you want interoperability, or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
According to the above, I guess you would prefer cPickle over json.
However, another article I found is interesting: http://kovshenin.com/2010/pickle-vs-json-which-is-faster/, which argues that json is a lot faster than pickle (the author states in the article that cPickle is faster than pickle but still slower than json).
This SO answer What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary? compares 6 different libraries.
pickle
cPickle
json
simplejson
ujson
yajl
In addition, if you use PyPy, json can be really fast.
Finally, here is some very recent profiling data: https://gist.github.com/schlamar/3134391.
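For the dictionary-to-disk round trip itself, a minimal sketch with both modules:

import json
import pickle

data = {"word_1": 42, "word_2": 7}  # stand-in for the parsed corpus dictionary

# JSON: human-readable text, interoperable with other languages
with open("data.json", "w") as f:
    json.dump(data, f)
with open("data.json") as f:
    from_json = json.load(f)

# pickle: binary, Python-only, handles arbitrary Python objects
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
with open("data.pkl", "rb") as f:
    from_pickle = pickle.load(f)

assert from_json == from_pickle == data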

Where to store metadata associated with files?

This is a question on storing and loading data, particularly in Python. I'm not entirely sure this is the appropriate forum, so redirect me if not.
I'm handling about 50 1000-row CSV files, and each has 10 parameters of associated metadata. What is the best method to store this, with regard to:
(A) All the information is human-readable plain text and it's easy for a non-programming human to associate data and metadata.
(B) It's convenient to load the metadata and each column of the csv to a python dictionary.
I've considered four possible solutions:
(0) Previously, I've stored smaller amounts of metadata in the filename. This is bad for obvious reasons.
(1) Assign each CSV file an ID number, name each "ID.csv", and then produce a "metadata.csv" which maps each CSV ID number to its metadata. The shortcoming here is that using ID numbers reduces human readability (to learn the contents of a file, a non-programming reader must manually check "metadata.csv").
(2) Leave the metadata at the top of the CSV file. The shortcoming here is that my program would need to perform two steps: (a) get the metadata from some arbitrary number of lines at the top of the file and (b) tell the CSV reader (pandas.read_csv) to ignore those first few lines.
(3) Convert the CSV to some data serialization format like YAML, where I could then easily include the metadata. The shortcomings are that loading the columns into my dictionary is no longer as easy, and not everyone knows YAML.
Are there any clever solutions to this problem? Thanks!
This question is a tad subjective so it may be closed, but let me offer the suggestion of the built-in Python json module. JSON maintains a good balance of "human-readability" and is highly portable to almost any language or format. You could restructure your original data into something like this:
{
"metadata":{"name":"foo", "status":"bar"},
"data":[[1,2,3],[4,5,6],[....]]
}
where "data" holds the rows of your original CSV file and "metadata" is a dictionary containing whatever data you would like to store. It is also simple to "strip" the metadata out and recover the original CSV data from this format, all within the confines of built-in Python modules.
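A minimal sketch of that round trip (file names and metadata values are illustrative):

import csv
import json

metadata = {"name": "foo", "status": "bar"}

# wrap an existing CSV file and its metadata into one JSON document
with open("run_01.csv", newline="") as f:
    rows = list(csv.reader(f))

with open("run_01.json", "w") as f:
    json.dump({"metadata": metadata, "data": rows}, f, indent=2)

# later: strip the metadata back out and recover the CSV rows
with open("run_01.json") as f:
    doc = json.load(f)
metadata, rows = doc["metadata"], doc["data"]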
