fast file loading on python

fast file loading on python - python

I have two problems with loading data on python, both the scipts work properly but they need too much time to run and sometimes "Killed" is the result (with the first one).
I have a big zipped text file and I do something like this:
import gzip
import cPickle as pickle
f = gzip.open('filename.gz','r')
tab={}
for line in f:
#fill tab
with open("data_dict.pkl","wb") as g:
pickle.dump(tab,g)
f.close()
I have to do some operations on the dictionary I created in the previous script
import cPickle as pickle
with open("data_dict.pkl", "rb") as f:
tab = pickle.load(f)
f.close()
#operations on tab (the dictionary)
Do you have other solutionsin mind? Maybe not the ones involving YAML or JSON...

If the data you are pickling is primitive and simple, you can try marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.

First one comment, in:
with open("data_dict.pkl", "rb") as f:
tab = pickle.load(f)
f.close()
f.close() is not necessary, the context manager (with syntax) does that automatically.
Now as for speed, I don't think you're going to get too much faster than cPickle for the purpose of reading in something from disk directly as a Python object. If this script needs to be run over and over I would try using memchached via pylibmc to keep the object stored persistently in memory so you can access it lightning fast:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = range(10000) ## some big object
mc["some_key"] = d ## save in memory
Then after saving it once you can access and modify it, it stays in memory even after the previous program finishes its execution:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = mc["some_key"] ## load from memory
d[0] = 'some other value' ## modify
mc["some_key"] = d ## save to memory again

Related

python Ray: How to write to a file

How can i construct a ray framework where each process will write it's results to a common file ? What i'm currently trying is :
import ray
import time
import pickle
import filelock
ray.init()
filename = 'data/db.pkl'
#ray.remote
def f(i):
try:
with filelock.FileLock(filename):
with open(filename, 'rb') as file:
data = pickle.load(file)
except FileNotFoundError:
data = {}
if i not in data.keys():
# The actual computations that takes times and need to be parralell: here just a square.
new_key = i
new_item = i**2
with filelock.FileLock(filename):
with open(filename, 'rb') as file:
data = pickle.load(file)
data[new_key] = new_item
with open(filename, 'wb') as file:
pickle.dump(data,file)
return None
numbers = [0,1,2,3,4,5,6,7,8,9,10]
rez = [f.remote(i) for i in numbers]
But i get an error.
How can i achieve this behavior ? I want each process to :
1° Check the database to see if it's work is needed
2° Work
3° Write it's result to the database.
Without locking the file, this work, but not all results are saved... How can i achieve the wanted behavior ? Note that later i'll need this to work on a distributed setup..

First of all, you should use 'ab' (the append mode instead of 'wb' for overwriting the file). With append mode you shouldn't need locking since it is thread-safe on a POSIX system.
What error did you get when using lock on the file?
Given that you will eventually make the program distributed, I think the easiest thing to do is to use ray.put() in your f(i) to store the data in Ray shared memory and then write the objects out from the main program.

how to save torchtext Dataset?

I'm working with text and use torchtext.data.Dataset.
Creating the dataset takes a considerable amount of time.
For just running the program this is still acceptable. But I would like to debug the torch code for the neural network. And if python is started in debug mode, the dataset creation takes roughly 20 minutes (!!). That's just to get a working environment where I can debug-step through the neural network code.
I would like to save the Dataset, for example with pickle. This sample code is taken from here, but I removed everything that is not necessary for this example:
from torchtext import data
from fastai.nlp import *
PATH = 'data/aclImdb/'
TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
TEXT = data.Field(lower=True, tokenize="spacy")
bs = 64;
bptt = 70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
with open("md.pkl", "wb") as file:
pickle.dump(md, file)
To run the code, you need the aclImdb dataset, it can be downloaded from here. Extract it into a data/ folder next to this code snippet. The code produces an error in the last line, where pickle is used:
Traceback (most recent call last):
File "/home/lhk/programming/fastai_sandbox/lesson4-imdb2.py", line 27, in <module>
pickle.dump(md, file)
TypeError: 'generator' object is not callable
The samples from fastai often use dill instead of pickle. But that doesn't work for me either.

I came up with the following functions for myself:
import dill
from pathlib import Path
import torch
from torchtext.data import Dataset
def save_dataset(dataset, path):
if not isinstance(path, Path):
path = Path(path)
path.mkdir(parents=True, exist_ok=True)
torch.save(dataset.examples, path/"examples.pkl", pickle_module=dill)
torch.save(dataset.fields, path/"fields.pkl", pickle_module=dill)
def load_dataset(path):
if not isinstance(path, Path):
path = Path(path)
examples = torch.load(path/"examples.pkl", pickle_module=dill)
fields = torch.load(path/"fields.pkl", pickle_module=dill)
return Dataset(examples, fields)
Not that actual objects could be a bit different, for example, if you save TabularDataset, then load_dataset returns an instance of class Dataset. This unlikely affect the data pipeline but may require extra diligence for tests.
In the case of a custom tokenizer, it should be serializable as well (e.g. no lambda functions, etc).

You can use dill instead of pickle. It works for me.
You can save a torchtext Field like
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True,fix_length=200,batch_first=True)
with open("model/TEXT.Field","wb")as f:
dill.dump(TEXT,f)
And load a Field like
with open("model/TEXT.Field","rb")as f:
TEXT=dill.load(f)
The offical code suppport is under development，you can follow https://github.com/pytorch/text/issues/451 and https://github.com/pytorch/text/issues/73 .

You can always use the pickle to dump the objects, but keep in mind one thing that dumping a list of dictionary or fields objects are not taken care of by the module, so to the best try to decompose the list first
To Store the DataSet Object to a pickle file for later easy loading
def save_to_pickle(dataSetObject,PATH):
with open(PATH,'wb') as output:
for i in dataSetObject:
pickle.dump(vars(i), output, pickle.HIGHEST_PROTOCOL)
The toughest thing is yet to come, Yeah loading the pickle file.... ;)
First, try to look for all field names and field attributes and then go for the kill
To load the pickle file into the DataSetObject
def load_pickle(PATH, FIELDNAMES, FIELD):
dataList = []
with open(PATH, "rb") as input_file:
while True:
try:
# Taking the dictionary instance as the input Instance
inputInstance = pickle.load(input_file)
# plugging it into the list
dataInstance = [inputInstance[FIELDNAMES[0]],inputInstance[FIELDNAMES[1]]]
# Finally creating an example objects list
dataList.append(Example().fromlist(dataInstance,fields=FIELD))
except EOFError:
break
# At last creating a data Set Object
exampleListObject = Dataset(dataList, fields=data_fields)
return exampleListObject
This hackish solution has worked in my case, hope you will find it useful in your case too.
Btw any suggestion is welcome :).

The pickle/dill approach is fine if your dataset is small. But if you are working with large datasets I won't recommend it as it will be too slow.
I simply save the examples (iteratively) as JSON-strings. The reason behind this is because saving the whole Dataset object takes a lot of time, plus you need serialization tricks such a dill, which makes the serialization even slower.
Moreover, these serializers take a lot of memory (some of them even create copies of the dataset) and if they start making use of the swap memory, you're done. That process is gonna take so long that you will probably terminate it before it finishes.
Therefore, I end up with the following approach:
Iterate over the examples
Convert each example (on the fly) to a JSON-string
Write that JSON-string into a text file (one sample per
line)
When loading, add the examples to the Dataset object along with the fields
def save_examples(dataset, savepath):
with open(savepath, 'w') as f:
# Save num. elements (not really need it)
f.write(json.dumps(total)) # Write examples length
f.write("\n")
# Save elements
for pair in dataset.examples:
data = [pair.src, pair.trg]
f.write(json.dumps(data)) # Write samples
f.write("\n")
def load_examples(filename):
examples = []
with open(filename, 'r') as f:
# Read num. elements (not really need it)
total = json.loads(f.readline())
# Save elements
for i in range(total):
line = f.readline()
example = json.loads(line)
# example = data.Example().fromlist(example, fields) # Create Example obj. (you can do it here or later)
examples.append(example)
end = time.time()
print(end - start)
return examples
Then, you can simply rebuild the dataset by:
# Define fields
SRC = data.Field(...)
TRG = data.Field(...)
fields = [('src', SRC), ('trg', TRG)]
# Load examples from JSON and convert them to "Example objects"
examples = load_examples(filename)
examples = [data.Example().fromlist(d, fields) for d in examples]
# Build dataset
mydataset = Dataset(examples, fields)
The reason why I use JSON instead of pickle, dill, msgpack, etc is not arbitrary.
I did some tests and these are the results:
Dataset size: 2x (1,960,641)
Saving times:
- Pickle/Dill*: >30-45 min (...or froze my computer)
- MessagePack (iterative): 123.44 sec
100%|██████████| 1960641/1960641 [02:03<00:00, 15906.52it/s]
- JSON (iterative): 16.33 sec
100%|██████████| 1960641/1960641 [00:15<00:00, 125955.90it/s]
- JSON (bulk): 46.54 sec (memory problems)
Loading times:
- Pickle/Dill*: -
- MessagePack (iterative): 143.79 sec
100%|██████████| 1960641/1960641 [02:23<00:00, 13635.20it/s]
- JSON (iterative): 33.83 sec
100%|██████████| 1960641/1960641 [00:33<00:00, 57956.28it/s]
- JSON (bulk): 27.43 sec
*Similar approach as the other answers

Save Numpy Array using Pickle

I've got a Numpy array that I would like to save (130,000 x 3) that I would like to save using Pickle, with the following code. However, I keep getting the error "EOFError: Ran out of input" or "UnsupportedOperation: read" at the pkl.load line. This is my first time using Pickle, any ideas?
Thanks,
Anant
import pickle as pkl
import numpy as np
arrayInput = np.zeros((1000,2)) #Trial input
save = True
load = True
filename = path + 'CNN_Input'
fileObject = open(fileName, 'wb')
if save:
pkl.dump(arrayInput, fileObject)
fileObject.close()
if load:
fileObject2 = open(fileName, 'wb')
modelInput = pkl.load(fileObject2)
fileObject2.close()
if arrayInput == modelInput:
Print(True)

You should use numpy.save and numpy.load.

I have no problems using pickle:
In [126]: arr = np.zeros((1000,2))
In [127]: with open('test.pkl','wb') as f:
...: pickle.dump(arr, f)
...:
In [128]: with open('test.pkl','rb') as f:
...: x = pickle.load(f)
...: print(x.shape)
...:
...:
(1000, 2)
pickle and np.save/load have a deep reciprocity. Like I can load this pickle with np.load:
In [129]: np.load('test.pkl').shape
Out[129]: (1000, 2)
If I open the pickle file in the wrong I do get your error:
In [130]: with open('test.pkl','wb') as f:
...: x = pickle.load(f)
...: print(x.shape)
...:
UnsupportedOperation: read
But that shouldn't be surprising - you can't read a freshly opened write file. It will be empty.
np.save/load is the usual pair for writing numpy arrays. But pickle uses save to serialize arrays, and save uses pickle to serialize non-array objects (in the array). Resulting file sizes are similar. Curiously in timings the pickle version is faster.

It's been a bit but if you're finding this, Pickle completes in a fraction of the time.
with open('filename','wb') as f: pickle.dump(arrayname, f)
with open('filename','rb') as f: arrayname1 = pickle.load(f)
numpy.array_equal(arrayname,arrayname1) #sanity check
On the other hand, by default numpy compress took my 5.2GB down to .4GB and Pickle went to 1.7GB.

Don't use pickle for numpy arrays, for an extended discussion that links to all resources I could find see my answer here.
Short reasons:
there is already a nice interface the developers of numpy made and will save you lots of time of debugging (most important reason)
np.save,np.load,np.savez have pretty good performance in most metrics, see this, which is to be expected since it's an established library and the developers of numpy made those functions.
Pickle executes arbitrary code and is a security issue
to use pickle you would have to open and file and might get issues that leads to bugs (e.g. I wasn't aware of using b and it stopped working, took time to debug)
if you refuse to accept this advice, at least really articulate the reason you need to use something else. Make sure it's crystal clear in your head.
Avoid repeating code at all costs if a solution already exists!
Anyway, here are all the interfaces I tried, hopefully it saves someone time (probably my future self):
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
but most useful see my answer here.

Here is one of extra possible ways. Sometimes you should add extra option protocol. For example,
import pickle
# Your array
arrayInput = np.zeros((1000,2))
Here is your approach:
pickle.dump(arrayInput, open('file_name.pickle', 'wb'))
Which you can change to:
# in two lines of code
with open("file_name.pickle", "wb") as f:
pickle.dump(arrayInput, f, protocol=pickle.HIGHEST_PROTOCOL)
or
# Save in line of code
pickle.dump(arrayInput, open("file_name.pickle", "wb"), protocol=pickle.HIGHEST_PROTOCOL)
Aftermath, you can easy read you numpy array like:
arrayInput = pickle.load(open(self._dir_models+"model_rf.sav", 'rb'))
Hope, it is useful for you

Many are forgetting one very important thing: security.
Pickled data is binary, so it gets run immediately upon using pickle.load. If loading from an untrusted source, the file could contain executable instructions to achieve things like man-in-the-middle attacks over a network, among other things. (e.g. see this realpython.com article)
Pure pickled data may be faster to save/load if you don't follow with bz2 compression, and hence have a larger file size, but numpy load/save may be more secure.
Alternatively, you may save purely pickled data along with an encryption key using the builtin hashlib and hmac libraries and, prior to loading, compare the hash key against your security key:
import hashlib
import hmac
def calculate_hash(
key_,
file_path,
hash_=hashlib.sha256
):
with open(file_path, "rb") as fp:
file_hash = hmac.new(key_, fp.read(), hash_).hexdigest()
return file_hash
def compare_hash(
hash1,
hash2,
):
"""
Warning:
Do not use `==` directly to compare hash values. Timing attacks can be used
to learn your security key. Use ``compare_digest()``.
"""
return hmac.compare_digest(hash1, hash2)
In a corporate setting, always be sure to confirm with your IT department. You want to be sure proper authentication, encryption, and authorization is all "set to go" when loading and saving data over servers and networks.
Pickle/CPickle
If you are confident you are using nothing but trusted sources and speed is a major concern over security and file size, pickle might be the way to go. In addition, you can take a few extra security measures using cPickle (this may have been incorporated directly into pickle in recent Python3 versions, but I'm not sure, so always double-check):
Use a cPickle.Unpickler instance, and set its "find_global" attribute to None to disable importing any modules (thus restricting loading to builtin types such as dict, int, list, string, etc).
Use a cPickle.Unpickler instance, and set its "find_global" attribute to a function that only allows importing of modules and names from a whitelist.
Use something like the itsdangerous package to authenticate the data before unpickling it if you're loading it from an untrusted source.
Numpy
If you are only saving numpy data and no other python data, and security is a greater priority over file size and speed, then numpy might be the way to go.
HDF5/H5PY
If your data is truly large and complex, hdf5 format via h5py is good.
JSON
And of course, this discussion wouldn't be complete without mentioning json. You may need to do extra work setting up encoding and decoding of your data, but nothing gets immediately run when you use json.load, so you can check the template/structure of the loaded data before you use it.
DISCLAIMER: I take no responsibility for end-user security with this provided information. The above information is for informational purposes only. Please use proper discretion and appropriate measures (including corporate policies, where applicable) with regard to security needs.

You should use numpy.save() for saving numpy matrices.

In your code, you're using
if load:
fileObject2 = open(fileName, 'wb')
modelInput = pkl.load(fileObject2)
fileObject2.close()
The second argument in the open function is the method. w stands for writing, r for reading. The second character b denotes that bytes will be read/written. A file that will be written to cannot be read and vice versa. Therefore, opening the file with fileObject2 = open(fileName, 'rb') will do the trick.

The easiest way to save and load a NumPy array -
# a numpy array
result.importances_mean
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])
# save the array format - np.save(filename.npy, array)
np.save(os.path.join(model_path, "permutation_imp.npy"), result.importances_mean)
# load the array format - np.load(filename.npy)
res = np.load(os.path.join(model_path, "permutation_imp.npy"))
res
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])

Python Storing Data

I have a list in my program. I have a function to append to the list, unfortunately when you close the program the thing you added goes away and the list goes back to the beginning. Is there any way that I can store the data so the user can re-open the program and the list is at its full.

You may try pickle module to store the memory data into disk,Here is an example:
store data:
import pickle
dataset = ['hello','test']
outputFile = 'test.data'
fw = open(outputFile, 'wb')
pickle.dump(dataset, fw)
fw.close()
load data:
import pickle
inputFile = 'test.data'
fd = open(inputFile, 'rb')
dataset = pickle.load(fd)
print dataset

You can make a database and save them, the only way is this. A database with SQLITE or a .txt file. For example:
with open("mylist.txt","w") as f: #in write mode
f.write("{}".format(mylist))
Your list goes into the format() function. It'll make a .txt file named mylist and will save your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f: #in read mode, not in write mode, careful
rd=f.readlines()
print (rd)

The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.

For storing big data, HDF5 library is suitable. It is implemented by h5py in Python.

Python Pickle Help

I'm not sure why this Pickle example is not showing both of the dictionary definitions. As I understand it, "ab+" should mean that the pickle.dat file is being appended to and can be read from. I'm new to the whole pickle concept, but the tutorials on the net don't seem to go beyond just the initial storage.
import cPickle as pickle
def append_object(d, fname):
"""appends a pickle dump of d to fname"""
print "append_hash", d, fname
with open(fname, 'ab') as pickler:
pickle.dump(d, pickler)
db_file = 'pickle.dat'
cartoon = {}
cartoon['Mouse'] = 'Mickey'
append_object(cartoon, db_file)
cartoon = {}
cartoon['Bird'] = 'Tweety'
append_object(cartoon, db_file)
print 'loading from pickler'
with open(db_file, 'rb') as pickler:
cartoon = pickle.load(pickler)
print 'loaded', cartoon
Ideally, I was hoping to build up a dictionary using a for loop and then add the key:value pair to the pickle.dat file, then clear the dictionary to save some RAM.
What's going on here?

Don't use pickle for that. Use a database.
Python dbm module seems to fit what you want perfectly. It presents you with a dictionary-like interface, but data is saved to disk.
Example usage:
>>> import dbm
>>> x = dbm.open('/tmp/foo.dat', 'c')
>>> x['Mouse'] = 'Mickey'
>>> x['Bird'] = 'Tweety'
Tomorrow you can load the data:
>>> import dbm
>>> x = dbm.open('/tmp/foo.dat', 'c')
>>> print x['Mouse']
Mickey
>>> print x['Bird']
Tweety

I started to edit your code for readability and factored out append_object in the process.
There are multiple confusions here. The first, is that pickle.dump writes a Python object in its entirety. You can put multiple objects in a pickle file, but each needs its own load. The code did what you asked of it and loaded the first dictionary you wrote to the file. The second dictionary was there waiting to be read but it isn't a concatenation to the first, it is its own loadable.
Don't underestimate the importance of names. append_object isn't a great name, but it is different than append_to_object.
If you are opening a file for reading, just open it for reading and the same for writing or appending. Not only does it make your intentions more clear but it prevents silly errors.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

fast file loading on python - python

If the data you are pickling is primitive and simple, you can try marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.

Related

python Ray: How to write to a file

how to save torchtext Dataset?

Save Numpy Array using Pickle

Python Storing Data

Python Pickle Help

Categories

Resources