I have a collection of mainly numerical data files that are the result of running a physics simulation (or several). I can convert the files into pandas dataframes. It is natural to organize the dataframe objects in lists, lists of lists, etc. For example:
allData = [df1, [df11, df12], df2, [df21, df22]]
I want to save this data to files (to be sent). I know the whole thing can be dumped into one file with e.g. a pickle format, but I don't want this because some files can be large and I want to be able to load the files selectively. So each dataframe should be stored as a separate file.
But I also want to store how the objects are organized into lists, for example in another file, so that when the files are read somewhere else, Python knows how the data files are connected.
Possibly I could solve this by inventing some system of writing the filenames and how they are structured into a txt file. But is there a proper/cleaner way to do it?
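One clean way to formalize that "system" is to give every dataframe its own file and describe the nesting in a small JSON manifest. Below is a minimal sketch of that idea; the helper names (save_tree/load_tree), the file-naming scheme, manifest.json, and the choice of parquet (which assumes pyarrow or fastparquet is installed; pickling each dataframe would work the same way) are my own assumptions, not anything prescribed by pandas:
import itertools
import json
import pandas as pd

def save_tree(node, prefix, counter=None):
    # Recursively save a nested list of DataFrames; return the same nesting
    # with a file name wherever a DataFrame was.
    if counter is None:
        counter = itertools.count()
    if isinstance(node, pd.DataFrame):
        fname = f"{prefix}_{next(counter)}.parquet"   # hypothetical naming scheme
        node.to_parquet(fname)
        return fname
    return [save_tree(child, prefix, counter) for child in node]

def load_tree(structure):
    # Rebuild the nested list from the manifest, loading every file.
    if isinstance(structure, str):
        return pd.read_parquet(structure)
    return [load_tree(child) for child in structure]

# allData = [df1, [df11, df12], df2, [df21, df22]]
structure = save_tree(allData, "rundata")
with open("manifest.json", "w") as f:
    json.dump(structure, f, indent=2)

# On the receiving side: load everything, or just one file selectively.
with open("manifest.json") as f:
    structure = json.load(f)
allData = load_tree(structure)
df12 = pd.read_parquet(structure[1][1])
The manifest mirrors the nesting exactly, so selective loading is just a matter of indexing into it.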
I have written an optimization algorithm that tests some functions on historical stock data and returns a 2D list of the pandas dataframes generated by each run together with the function parameters used. This list takes the form [[df, params], [df, params], ..., [df, params]]. After it has been generated, I would like to save this data to be processed in another script, but I am having trouble. Currently I convert the list to a dataframe and use the to_csv() method from pandas, but this mangles my data when I open it in another file: I expect the data types to be [[dataframe, list], [dataframe, list], ..., [dataframe, list]], but they instead become [[str, str], [str, str], ..., [str, str]]. I open the file using the read_csv() method from pandas, then convert the resulting dataframe back into a list with df.values.tolist().
To clarify, I save the list to a .csv like this, where out is the list:
out = pd.DataFrame(out)
out.to_csv('optimized_ticker.csv')
And I open the .csv and convert it back from a dataframe to a list like this:
df = pd.read_csv('optimized_ticker.csv')
list = df.values.tolist()
I figured that the problem was that my dataframes had commas in them somewhere, so I tried changing the delimiter on the .csv to a few different things, but the issue persisted. How can I fix this issue, so that my datatypes aren't mangled? It is not imperative that I use the .csv format, so if there's a filetype more suited to the job I can switch to using it. The only purpose of saving the data is so that I can process it with any number of other scripts without having to re-run the simulation each time.
The best way to save a pandas dataframe is not via CSV if its only purpose is to be read by another pandas script. Parquet is a much more robust option: it preserves the datatype of each column, can be compressed, and you won't have to worry about things like commas in values. Just use the following:
out.to_parquet('optimized_ticker.parquet')
df = pd.read_parquet('optimized_ticker.parquet')
EDIT:
As mentioned in the comments, pickle is also a possibility, so the solution depends on your case. Google will be your best friend in figuring out whether to use pickle, parquet, or feather.
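If the sticking point is the nested [[df, params], ...] structure itself, one variant of the same idea (just a sketch, with hypothetical file names, and assuming the params are JSON-serializable) is to write each dataframe to its own parquet file and keep the parameters next to the file names in a small JSON index:
import json
import pandas as pd

# out = [[df, params], [df, params], ...]
index = []
for i, (df, params) in enumerate(out):
    fname = f"run_{i}.parquet"                 # hypothetical file name
    df.to_parquet(fname)
    index.append({"file": fname, "params": params})

with open("optimized_ticker_index.json", "w") as f:
    json.dump(index, f)

# In the other script: rebuild the [[dataframe, list], ...] structure.
with open("optimized_ticker_index.json") as f:
    index = json.load(f)
out = [[pd.read_parquet(entry["file"]), entry["params"]] for entry in index]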
My application needs to process data periodically: it processes new data and then merges it with the old data. The data may have billions of rows with only two columns, where the first column is the row name and the second one is the value. Here is an example:
a00001,12
a00002,2321
a00003,234
The new data may contain new row names or existing ones, and I want to merge them. So in each processing cycle I need to read the large old data file, merge it with the new data, and then write the merged data to a new file.
I find that the most time-consuming part is reading and writing the data. I have tried several data I/O approaches:
Plain text read and write. This is the most time-consuming way.
The Python pickle package; however, it is not efficient for large data files.
Are there any other data I/O formats or packages that can load and write large data efficiently in Python?
If you have such large amounts of data, it might be faster to try lowering the amount of data you have to read and write.
You could spread the data over multiple files instead of saving it all in one.
When processing your new data, check what old data has to be merged and just read and write those specific files.
Your data looks something like this:
name1, data1
name2, data2
Files containing old data:
db_1.dat:  name_1: data_1       ...  name_1000: data_1000
db_2.dat:  name_1001: data_1001 ...  name_2000: data_2000
db_3.dat:  name_2001: data_2001 ...  name_3000: data_3000
Now you can check what data you need to merge and just read and write the specific files holding that data.
Not sure if what you are trying to achieve allows a system like this but it would speed up the process as there is less data to handle.
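A rough sketch of that idea, assuming each row is name,value as in the example, and using a stable hash of the name (instead of fixed name ranges) to spread the rows over a fixed number of bucket files; the bucket count and the file names are arbitrary choices:
import csv
from zlib import crc32

N_BUCKETS = 64  # arbitrary; pick based on data size

def bucket(name):
    # Map a row name to one of N bucket files via a stable hash.
    return crc32(name.encode()) % N_BUCKETS

# Group the incoming rows by the bucket they belong to.
new_rows = {}
with open("new_data.csv", newline="") as f:
    for name, value in csv.reader(f):
        new_rows.setdefault(bucket(name), {})[name] = value

# Only touch the bucket files that actually received new data.
for b, rows in new_rows.items():
    path = f"db_{b}.dat"
    merged = {}
    try:
        with open(path, newline="") as f:
            merged.update(dict(csv.reader(f)))   # old data in this bucket
    except FileNotFoundError:
        pass
    merged.update(rows)                          # new names added, existing ones overwritten
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(merged.items())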
Maybe this article could help you. It seems like feather and parquet may be interesting.
I have a huge dataset of images and I am processing them one by one.
All the images are stored in a folder.
My Approach:
What I have tried is reading all the filenames into memory and, whenever a certain index is requested, loading the corresponding image.
The problem is that it is not even possible to keep the paths and names of the files in memory, because the dataset is so large.
Is it possible to have an indexed file on storage from which I can read the file name at a certain index?
Thanks a lot.
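One way to keep that index on disk instead of in memory is a small SQLite table: build it once while scanning the folder, then look up a single file name by its row id. A minimal sketch (the database file name, the table name, and the 0-based indexing are my own choices):
import os
import sqlite3

DB = "image_index.db"   # hypothetical index file

def build_index(image_dir):
    # Write every file name into an on-disk SQLite table, one row per image.
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, path TEXT)")
    with con:
        for entry in os.scandir(image_dir):     # streams directory entries, no big list in memory
            if entry.is_file():
                con.execute("INSERT INTO images (path) VALUES (?)", (entry.path,))
    con.close()

def path_at(index):
    # Fetch a single file name by 0-based index without loading the others.
    con = sqlite3.connect(DB)
    row = con.execute("SELECT path FROM images WHERE id = ?", (index + 1,)).fetchone()
    con.close()
    return row[0] if row else None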
I'm new to the HDF5 file format and have been experimenting successfully in Python with h5py. Now it's time to store real data.
I will need to store a list of objects, where each object can be one of several types and each type will have a number of arrays and strings. The critical part is that the list of objects will be different for each file, and there will be hundreds of files.
Is there a way to automatically export an arbitrary, nested object into (and back from) an HDF5 file? I'm imagining a routine that would automatically span the hierarchy of a nested object and build the same hierarchy in the HDF5 file.
I've read through the h5py docs and don't see any spanning routines. Furthermore, Google and SO searches are (strangely) not showing this capability. Am I missing something, or is there another way to look at this?
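h5py itself does not ship a spanning routine like that (packages such as deepdish or hickle aim at exactly this), but a hand-rolled recursive walk is short if the objects can first be reduced to nested dicts of arrays, scalars and strings. A minimal sketch under that assumption:
import numpy as np
import h5py

def dump(group, obj):
    # Recursively mirror a nested dict into an HDF5 group hierarchy.
    for key, value in obj.items():
        if isinstance(value, dict):
            dump(group.create_group(key), value)      # nested object -> sub-group
        else:
            group.create_dataset(key, data=value)     # arrays, scalars, strings -> datasets

def load(group):
    # Rebuild the nested dict from an HDF5 group.
    out = {}
    for key, value in group.items():
        out[key] = load(value) if isinstance(value, h5py.Group) else value[()]
        # note: h5py 3.x returns stored strings as bytes
    return out

objects = {
    "obj_0": {"kind": "typeA", "signal": np.arange(5), "label": "first run"},
    "obj_1": {"kind": "typeB", "matrix": np.eye(3)},
}
with h5py.File("objects.h5", "w") as f:
    dump(f, objects)
with h5py.File("objects.h5", "r") as f:
    restored = load(f)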
So I'm working with parametric energy simulations and ended up with 500 GB+ of data stored in .csv files. I need to be able to process all this data to compare the results and gain insight into the influence of the different parameters.
Each .csv file name contains information about the parameters used for that simulation, so I cannot simply merge the files.
I normally load the .csv files into Python using pandas and a custom class, but now (with all this data) there is not enough memory to do this.
Can you point me to a way to process this data? I need to be able to make plots and compare the .csv files.
Thank you for your time.
Convert the CSV files to HDF5, which was created to deal with massive and complex datasets. It works with pandas as well as with other libraries.
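A sketch of that conversion, assuming pandas with PyTables installed and that the CSVs share the same columns. Each CSV is streamed in chunks so nothing close to 500 GB ever has to fit in memory, and the parameter information from the file name is kept in an extra column so the runs stay comparable and queryable:
import glob
import os
import pandas as pd

store = pd.HDFStore("simulations.h5", complevel=9, complib="blosc")

for path in glob.glob("results/*.csv"):                       # hypothetical folder
    run_name = os.path.splitext(os.path.basename(path))[0]    # parameters encoded in the file name
    for chunk in pd.read_csv(path, chunksize=500_000):        # stream the file in pieces
        chunk["run"] = run_name
        store.append("results", chunk,
                     data_columns=["run"],                    # allows where-queries on "run"
                     min_itemsize={"run": 64})                # reserve room for long run names

store.close()

# Later, pull back only the runs a given plot needs:
subset = pd.read_hdf("simulations.h5", "results", where="run == 'some_run_name'")  # hypothetical run name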