I'm trying to save and load a list of tuples of 2 ndarrays and an int to and from a .csv file.
In my current implementation, when I save and load a list l, there is some error in the recovered list on the order of 10^-10. Is there a way to save and recover values more precisely? I would also appreciate comments on my code in general. Thanks!
This is what I have now:
import numpy as np
import pandas as pd

def save_l(l, path):
    tup = ()
    for X in l:
        u = X[0].reshape(784 * 9)   # flatten the 784x9 array
        v = X[2] * np.ones(1)       # wrap the int in a length-1 array
        w = np.concatenate((u, X[1], v))
        tup += (w,)
    L = np.row_stack(tup)           # one row per tuple in l
    df = pd.DataFrame(L)
    df.to_csv(path)
def load_l(path):
    df = pd.read_csv(path)
    L = df.values
    l = []
    for v in L:
        # column 0 is the index written by to_csv, so the data starts at column 1
        tup = ()
        for i in range(784):
            tup += (v[9 * i + 1:9 * (i + 1) + 1],)
        T = np.row_stack(tup)               # rebuild the 784x9 array
        Q = v[9 * 784 + 1:10 * 784 + 1]     # the length-784 vector
        i = v[7841]                         # the trailing int
        l.append((T, Q, i))
    return l
It may be that the issue you are experiencing is due to the absence of .csv file protection during save and load.
A good way to make sure that your file is locked until all data are saved/loaded completely is to use a context manager. That way you won't lose any data if your system stops execution for whatever reason, because all results are written the moment they become available.
I recommend using the with-statement, whose primary purpose is exception-safe cleanup of the object used inside it (in this case your .csv). In other words, with makes sure that files are closed, locks released, contexts restored, and so on.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False) # Same goes for read_csv
# As soon as you are here, reference is closed
If you try this and still see your error, it's not due to save/load issues.
I'm new here and to python in general, so please forgive any formatting issues and whatever else. I'm a physicist and I have a parametric model, where I want to iterate over one or more of the model's parameter values (possibly in an MCMC setting). But for simplicity, imagine I have just a single parameter with N possible values. In a loop, I compute the model and several scalar metrics pertaining to it.
I want to save the data [parameter value, metric1, metric2, ...] line-by-line to a file. I don't care what type: .pickle, .npz, .txt, .csv or anything else are fine.
I do NOT want to save the array after all N models have been computed. The issue here is that sometimes a parameter value is so nonphysical that the program I call to calculate the model (which is a giant complicated thing years in development, so I'm not touching it) crashes the kernel. If I have N = 30000 models to do, and this happens at 29000, I'll be very unhappy and will have wasted a lot of time. I also probably have to be conscious of memory usage - I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.
So, some pseudo-code:
filename = 'outFile.extension'
dataArray = np.zeros([N, 3])
idx = 0
for p in Parameter1:
    modelOutputVector = calculateModel(p)
    metric1, metric2 = getMetrics(modelOutputVector)
    dataArray[idx, 0] = p
    dataArray[idx, 1] = metric1
    dataArray[idx, 2] = metric2
    ### Line that saves data here
    idx += 1
I'm partial to npz or pickle formats, but can't figure out how to do this with either. If there is a better format or a better solution, I appreciate any advice.
Edit: What I tried to make a text file was this, inside the loop:
fileObject = open(filename, 'ab')
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
fileObject.close()
The first time it crashed at around 2600 lines I thought it was just a coincidence, but every time I try this, that's where it stops. I could hack around it and make a batch of files that are each 2600 lines, but there has to be a better solution.
It's hard to say with such limited knowledge of the error, but if you think it is a file-writing error, maybe you could try something like:
with open(filename, 'ab') as fileObject:
    # code that computes the numpy array
    np.savetxt(fileObject, rowOfData, delimiter=',', newline=' ')
    fileObject.write('\n')
    # no need to .close() because the "with open()" will handle it
However: I have not used np.savetxt(), I am not an expert on your project, and I do not even know if it is truly a file-writing error to begin with.
I just prefer the with open() technique because that's how all the introductory Python books I've read structure their file reading/writing, so I assume there is wisdom in it. You could also consider doing what fabianegli suggested in the comments and save to separate files (that's what we do at my work).
I have a Python code which is logging some data into a .csv file.
import datetime

logging_file = 'test.csv'
dt = datetime.datetime.now()
f = open(logging_file, 'a')
# x and y are the data values being logged
f.write('\n "{:%H:%M:%S}",{},{}'.format(dt, x, y))
The above code is the core part, and it produces continuous data in the .csv file as
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
Now I wish to add the following header as the first row of this data: time, data1, data2. I expect output as
time, data1, data2
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
I tried many ways, but none of them produced the result in my preferred format. Please help me solve this problem.
I would recommend writing a class specifically for creating and managing logs. Have it initialize a file, on creation, with the expected first line (don't forget a \n character!), and keep track of any necessary information about that log (the name of the log it created, where it is, etc.). You can then have the class 'write' to the log (append to it, really), create new logs as necessary, and have it check for existing logs and decide whether to update what exists or scrap it and start over.
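For illustration, a minimal sketch of such a class might look something like this (the class name, method names, and file handling here are my own invention, not from the answer above):

import datetime
import os

class CsvLogger(object):
    def __init__(self, path, header='time, data1, data2'):
        self.path = path
        # Only write the header line when the log file does not exist yet
        if not os.path.exists(path):
            with open(path, 'w') as f:
                f.write(header + '\n')

    def write(self, x, y):
        dt = datetime.datetime.now()
        # Append one row in the same format as the question
        with open(self.path, 'a') as f:
            f.write('"{:%H:%M:%S}",{},{}\n'.format(dt, x, y))

logger = CsvLogger('test.csv')
logger.write(23.05, 23.05)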
Python 2.7. I'm using xlsxwriter.
Let's say I have myDict = {1: 'One', 2: 'Two', 3: 'Three'}
I need to perform some transformation on the value and write the result to a spreadsheet.
So I write a function to create a new file and put some headers in there and do formatting, but don't close it so I can write further with my next function.
Then I write another function for transforming my dict values and writing them to the worksheet.
I'm a noob when it comes to classes so please forgive me if this looks silly.
import xlsxwriter

myDict = {1: 'One', 2: 'Two', 3: 'Three'}

class ReadWriteSpreadsheet(object):
    def __init__(self, outputFile=None, writeWorkbook=None, writeWorksheet=None):
        self.outputFile = outputFile
        self.writeWorksheet = writeWorksheet
        self.writeWorkbook = writeWorkbook

    # This function works fine
    def setup_new_spreadsheet(self):
        self.writeWorkbook = xlsxwriter.Workbook(self.outputFile)
        self.writeWorksheet = self.writeWorkbook.add_worksheet('My Worksheet')
        self.writeWorksheet.write('A1', 'TEST')

    # This one does not
    def write_data(self):
        # Forget iterating through the dict for now
        self.writeWorksheet.write('A5', myDict[1])

x = ReadWriteSpreadsheet(outputFile='test.xlsx')
x.setup_new_spreadsheet()
x.write_data()
I get:
Exception Exception: Exception('Exception caught in workbook destructor. Explicit close() may be required for workbook.',) in <bound method Workbook.__del__ of <xlsxwriter.workbook.Workbook object at 0x00000000023FDF28>> ignored
The docs say this error is due to not closing the workbook, but if I close it then I can't write to it further...
How do I structure this class so that the workbook and worksheet from setup_new_spreadsheet() is able to be written to by write_data()?
The exception mentioned in your question is triggered when Python realises you will not need your Workbook any more in the rest of your code and therefore decides to delete it from memory (garbage collection). When doing so, it realises you haven't closed the workbook yet and so have not persisted your Excel spreadsheet to disk at all (which I assume only happens on close), and raises that exception.
If you had another method close on your class that called self.writeWorkbook.close(), and made sure to call it last, you would not have that error.
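For example, the extra method could be as small as this (a sketch, not code from the question):

    # Inside ReadWriteSpreadsheet:
    def close(self):
        # Call this once, after the last write, so the workbook is persisted to disk
        self.writeWorkbook.close()

Then call x.close() after x.write_data().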
When you do ReadWriteSpreadsheet() you get a new instance of the class you've defined. That new instance doesn't have any knowledge of any workbooks that were set up in a different instance.
It looks like what you want to do is get a single instance, and then issue the methods on that one instance:
x = ReadWriteSpreadsheet(outputFile='test.xlsx')
x.setup_new_spreadsheet()
x.write_data()
To address your new concern:
The docs say this error is due to not closing the workbook, but if I close it then I can't write to it further...
Yes, that's true, you can't write to it further. That is one of the fundamental properties of Excel files. At the level we're working with here, there's no such thing as "appending" or "updating" an Excel file. Even the Excel program itself cannot do it. You only have two viable approaches:
Keep all data in memory and only commit to disk at the very end.
Reopen the file, reading the data into memory; modify the in-memory data; and write all the in-memory data back out to a new disk file (which can have the same name as the original if you want to overwrite).
The second approach requires using a package that can read Excel files. The main choices there are xlrd and OpenPyXL. The latter will handle both reading and writing, so if you use that one, you don't need XlsxWriter.
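As an illustration of the second approach, a minimal round-trip with openpyxl might look something like this (the file and sheet names are taken from the question; treat it as a sketch rather than a drop-in solution):

from openpyxl import load_workbook

# Read the existing file back into memory
wb = load_workbook('test.xlsx')
ws = wb['My Worksheet']

# Modify the in-memory data
ws['A5'] = 'One'

# Write everything back out, here overwriting the original file
wb.save('test.xlsx')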
In our lab we store our data in hdf5 files through the Python package h5py.
At the beginning of an experiment we create an hdf5 file and store array after array of data in the file (among other things). When an experiment fails or is interrupted, the file is not correctly closed.
Because our experiments run from iPython the reference to the data object remains (somewhere) in memory.
Is there a way to scan for all open h5py data objects and close them?
This is how it could be done (I could not figure out how to check whether a file is closed without relying on exceptions; maybe you will find a way):
import gc
import h5py

for obj in gc.get_objects():        # Browse through ALL objects tracked by the garbage collector
    if isinstance(obj, h5py.File):  # Just HDF5 files
        try:
            obj.close()
        except:
            pass                    # Was already closed
Another idea:
Depending on how you use the files, what about using the context manager and the with keyword, like this?
with h5py.File("some_path.h5") as f:
f["data1"] = some_data
When the program flow exits the with-block, the file is closed regardless of what happens, including exceptions etc.
pytables (which h5py uses) keeps track of all open files and provides an easy method to force-close all open hdf5 files.
import tables
tables.file._open_files.close_all()
That attribute _open_files also has helpful methods to give you information and handlers for the open files.
I've found that hFile.__bool__() returns True if the file is open, and False otherwise. This might be the simplest way to check.
In other words, do this:
hFile = h5py.File(path_to_file)
if hFile.__bool__():
    hFile.close()
I have some json files with 500MB.
If I use the "trivial" json.load() to load its content all at once, it will consume a lot of memory.
Is there a way to read partially the file? If it was a text, line delimited file, I would be able to iterate over the lines. I am looking for analogy to it.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here; check out the README for examples. It's fast because it uses the 'C' yajl library.
It can be done by using ijson. How ijson works has been explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        # 'item' yields each element of the top-level array
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' matches the "drug" key of each element in the sample above
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those entries whose drug type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
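A very rough sketch of that idea, assuming the file holds a single top-level array of objects (the function and file names are made up, and the brace counting ignores braces inside strings, so treat it as illustrative only):

import json

def iter_array_objects(path, chunk_size=65536):
    depth = 0
    buf = ''
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if ch == '{':
                    depth += 1
                if depth > 0:
                    buf += ch
                if ch == '}':
                    depth -= 1
                    if depth == 0:
                        # One complete object has been collected
                        yield json.loads(buf)
                        buf = ''

for obj in iter_array_objects('big_file.json'):
    print(obj)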
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
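If you go down that route, a rough sketch with pymongo could look like this (the database and collection names are made up; it assumes a MongoDB instance running locally):

import json
from pymongo import MongoClient

client = MongoClient()                   # assumes a local MongoDB instance
collection = client.mydb.json_docs      # hypothetical database/collection names

list_of_files = ['file1.json', 'file2.json']   # your JSON files

# Load and insert one file at a time; if the top level is an array of objects,
# each object becomes its own document
for path in list_of_files:
    with open(path) as f:
        data = json.load(f)
        collection.insert_many(data if isinstance(data, list) else [data])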
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to #codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what #codeape suggests: break the file up into smaller chunks, etc.
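One possible sketch of that exploration step uses ijson itself to print each distinct key path once (the function name and file name here are made up):

import ijson

def print_key_tree(path):
    seen = set()
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            # For 'map_key' events, value is the key name and prefix is the path to its parent
            if event == 'map_key' and (prefix, value) not in seen:
                seen.add((prefix, value))
                print(prefix or '(root)', '->', value)

print_key_tree('big_file.json')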
You can parse the JSON file to a CSV file and process it record by record:

import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, you can load the JSON data line by line, turn each line's key/value pairs into a dictionary, append each of those dictionaries to a final dictionary, and convert that to a pandas DataFrame, which will help with further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair of each JSON line
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

Data = pd.DataFrame(data_dict)
# Data holds the dictionary data as a DataFrame (table format), but in
# transposed form, so finally transpose the DataFrame:
Data_1 = Data.T