How to use pickle to load data in a faster way? - python

Hi I saved my data by using pickle package in python. I did it like this:
save=open('slidings.txt','wb')
pickle.dump(slidings,save)
save.close()
The data format is like this:
slidings=[dic,dic,...,dic]
where dic={key1:value1, key2:value2,...}
where value is a list.
The resulting txt file is 18GB, which is too large to be loaded. I use the following code to load my data:
import pickle
df=open('slidings.txt','rb')
slidings=pickle.load(df)
df.close()
But it's very slow.
Is there any alternative way to do this?
Whether an alternative way to save the data or load the data. Any suggestions would be much appreciated! Thank you.

Related

Storing multiple GeoTiffs in HDF5 file in Python

I want to store multiple GeoTiff files in one HDF5 file to use it for further analysis since the function I am supposed to use can just deal with HDF5 (so basically like a raster stack in R but stored in a HDF5). I have to use Python. I am relatively new to HDF5 format (and geoanalysis in Python generally) and don't really know how to approach this issue. Especially keeping the geolocation/projection inforation seems tricky to me. So far I tried:
import h5py
import rasterio
r1 = rasterio.open("filename.tif")
r2 = rasterio.open("filename2.tif")
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1)
hdf.create_dataset('GeoTiff2', data=r2)
Yielding the following errror:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
I am pretty sure this not at all the correct approach and I'm happy about any suggestions.
What you can try is to do this:
import numpy as np
spec_dtype = h5py.special_dtype(vlen=np.dtype('float64'))
Just make a spec_dtype variable with float64 type then apply this to create_dataset:
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1,, dtype=spec_dtype)
hdf.create_dataset('GeoTiff2', data=r2,, dtype=spec_dtype)
Apply these and hopefully it will work.
Using HDFql in Python, your use-case could be solved as follows:
import HDFql
HDFql.execute("SHOW FILE SIZE filename.tif, filename2.tif")
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff1 AS OPAQUE(%d) VALUES FROM BINARY FILE filename.tif" % HDFql.cursor_get_bigint())
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff2 AS OPAQUE(%d) VALUES FROM BINARY FILE filename2.tif" % HDFql.cursor_get_bigint())

Creating *.mat file from Python without using dictionary

I have few lists which i want to save it to a *.mat file. But according to scipy.io.savemat command documentation i Need to create a dictionary with the lists and then use the command to save it to a *.mat file.
If i save it according to the way mentioned in the docs the mat file will have structure with variables as the Arrays which i used in the dictionary. Now i have a Problem here, I have another program (which is not editable) will use the mat files and load them to plot some Graphs from the data. The program cannot process the structure because it is written in a way where if it loads a mat files and then it will directly process the Arrays in it.
So is there a way to save the mat file without using dictionaries? Please see the Image for more understanding
Thanks
This is the sample algorithm i used to save my *.mat file
import os
os.getcwd()
os.chdir(os.getcwd())
import scipy.io as sio
x=[1,2,3,4,5]
y=[234,5445,778] #can be 1000 lists
data={}
data['x']=x
data['y']=y
sio.savemat('test.mat',{'interpolated_data':data})
How about
scipy.io.savemat('interpolated_data_max_compare.mat',
{'NA1_X_order10_ACCE_ms2': np.zeros((3000,1)),
'NA1_X_order10_DISP_mm': np.ones((3000,1))})
Should work fine...
According to the code you added in your question, instead of sio.savemat('...', {'interpolated_data':data}), just save
sio.savemat('...', data)
and you should be fine: data is already a dictionary you don't need to add an extra level with {'interpolated_data': data} when saving.
You could use the Writing primitives directly
import scipy.io.matlab as ml
f=open("something.mat","wb")
mw=ml.mio5.MatFile5Writer(f)
mw.put_variables({"testVar":22})

Accessing all the data contained in an h5 file with python

After I load an h5 files and then check the keys, is there any other data that can be stored in the h5 that I might be missing? For example:
import h5py
a = '/path/to/file.h5'
a_h5 = h5py.File(a)
a_h5.keys()
From the h5py documentation, it looks like you can also do:
a_h5.values()
a_h5.items()
I don't know much about this format, but this looks like additional information you can extract.

Saving and loading simple data in Python convenient way

I'm currently working on a simple Python 3.4.3 and Tkinter game.
I struggle with saving/reading data now, because I'm a beginner at coding.
What I do now is use .txt files to store my data, but I find this extremely counter-intuitive, as saving/reading more than one line of data requires of me to have additional code to catch any newlines.
Skipping a line would be terrible too.
I've googled it, but I either find .txt save/file options or way too complex ones for saving large-scale data.
I only need to save some strings right now and be able to access them (if possible) by key like in a dictionary key:value .
Do you know of any file format/method to help me accomplish that?
Also: If possible, should work on Win/iOS/Linux.
It sounds like using json would be best for this, which comes as part of the Python Standard library in Python-2.6+
import json
data = {'username':'John', 'health':98, 'weapon':'warhammer'}
# serialize the data to user-data.txt
with open('user-data.txt', 'w') as fobj:
json.dump(data, fobj)
# read the data back in
with open('user-data.txt', 'r') as fobj:
data = json.load(fobj)
print(data)
# outputs:
# {u'username': u'John', u'weapon': u'warhammer', u'health': 98}
A popular alternative is yaml, which is actually a superset of json and produces slightly more human readable results.
You might want to try Redis.
http://redis.io/
I'm not totally sure it'll meet all your needs, but it would probably be better than a flat file.

Pickling pandas dataframe multiplies by 5 the file size

I am reading a 800 Mb CSV file with pandas.read_csv, and then use the original Python pickle.dump(datfarame) to save it. The result is a 4 Gb pkl file, so the CSV size is multiplied by 5.
I expected pickle to compress data rather than extend it. Also because I can do a gzip on the CSV file which compress it to 200 Mb, dividing it by 4.
I am willing to accelerate the loading time of my program, and thought that pickling would help, but considering disk access is the main bottleneck I am understanding that I would rather have to compress the files and then use the compression option from pandas.read_csv to speed up the loading time.
Is that correct?
Is it normal that pickling pandas dataframe extend the data size?
How do you speed up loading time usually?
What are the data-size limit would you load with pandas?
Not sure why you think pickling compresses the data size, pickling creates a string version of your python object so that it can be loaded back as a python object:
In [388]:
import sys
import os
df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
with open(r'c:\data\df.pkl', 'rb') as f:
print(f.read())
56
700
b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'
The method to_csv does support compression as a kwarg, 'gzip' and 'bz2':
In [390]:
df.to_csv(r'c:\data\df.zip', compression='bz2')
statinfo = os.stat(r'c:\data\df.zip')
print(statinfo.st_size)
29
It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file to RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect due simply to the fact that you are not filling up 800 Mb worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:
from pandas.io import sql
import sqlite3
conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"
results_df = sql.read_frame(query, con=conn)
...
Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.
Dont load 800MB file to memory. It will increase your loading time. Pickle objects too takes more time to load. Instead store the csv file as a sqlite3 (which comes along with python) table. And then query the table every time depending upon your need.
You can also use panda's pickle methods which should compress your data.
Save a dataframe:
df.to_pickle(filename)
Load it:
df = pd.read_pickle(filename)

Categories

Resources