How can I construct a Ray setup where each process writes its results to a common file? What I'm currently trying is:
import ray
import time
import pickle
import filelock

ray.init()

filename = 'data/db.pkl'

@ray.remote
def f(i):
    try:
        with filelock.FileLock(filename):
            with open(filename, 'rb') as file:
                data = pickle.load(file)
    except FileNotFoundError:
        data = {}
    if i not in data.keys():
        # The actual computation that takes time and needs to run in parallel; here just a square.
        new_key = i
        new_item = i**2
        with filelock.FileLock(filename):
            with open(filename, 'rb') as file:
                data = pickle.load(file)
            data[new_key] = new_item
            with open(filename, 'wb') as file:
                pickle.dump(data, file)
    return None

numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rez = [f.remote(i) for i in numbers]
But I get an error.
How can I achieve this behavior? I want each process to:
1. Check the database to see if its work is needed
2. Do the work
3. Write its result to the database.
Without locking the file this works, but not all results are saved. How can I achieve the wanted behavior? Note that later I'll need this to work in a distributed setup.
First of all, you should use 'ab' (append mode) instead of 'wb', which overwrites the file. With append mode you shouldn't need locking, since appends are effectively atomic on a POSIX system.
What error did you get when using a lock on the file?
Given that you will eventually make the program distributed, I think the easiest thing to do is to use ray.put() in your f(i) to store the data in Ray shared memory and then write the objects out from the main program.
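A minimal sketch of that idea, assuming the expensive computation is just the squaring from the question: each task returns its result (return values are implicitly placed in Ray's shared object store, the same mechanism ray.put() uses), and only the driver touches the file, so no locking is needed.

import pickle
import ray

ray.init()

filename = 'data/db.pkl'

@ray.remote
def f(i):
    # The expensive computation that needs to run in parallel; here just a square.
    return i, i ** 2

numbers = list(range(11))
# ray.get() blocks until all tasks finish and pulls the results out of the object store.
results = ray.get([f.remote(i) for i in numbers])

# Only the driver writes the file, so no file locking is needed.
data = dict(results)
with open(filename, 'wb') as file:
    pickle.dump(data, file)

In a distributed setup the same code works unchanged, because the object store spans the cluster and only the driver node performs the file I/O.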
I'm trying to parallelize reading the contents of 16 gzip files with this script:
import gzip
import glob

from dask import delayed
from dask.distributed import Client, LocalCluster

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
    reads = [read.decode("utf-8") for read in reads]
    return reads

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    read_files = glob.glob("*.txt.gz")
    all_files = []
    for file in read_files:
        reads = get_gzip_delayed(file)
        all_files.extend(reads)

    with open("all_reads.txt", "w") as f:
        w = delayed(all_files.writelines)(f)

    w.compute()
However, I get the following error:
> TypeError: Delayed objects of unspecified length are not iterable
How do I parallelize a for loop with extend/append and write the result to a file? All dask examples seem to include some final function performed on the for-loop product.
The list all_files consists of delayed values, and calling delayed(f.writelines)(all_files) (note the different arguments relative to the code in the question) is not going to work for several reasons, the main one being that you prepare lazy instructions for writing but execute them only after closing the file.
There are different ways to solve this problem, at least two are:
if the data from the files fits into memory, then it's easiest to compute it and write to the file:
import dask

all_files = dask.compute(all_files)[0]  # compute returns a tuple; take the list of per-file line lists
with open("all_reads.txt", "w") as f:
    for reads in all_files:
        f.writelines(reads)
if the data cannot fit into memory, then another option is to put the writing inside the get_gzip_delayed function, so data doesn't travel between worker and client:
from dask.distributed import Lock

@delayed
def get_gzip_delayed(gzip_file):
    with gzip.open(gzip_file) as f:
        reads = f.readlines()
    # create a lock to prevent others from writing at the same time
    with Lock("all_reads.txt"):
        with open("all_reads.txt", "a") as f:  # need to be careful here, since the file is appended to
            f.writelines([read.decode("utf-8") for read in reads])
Note that if memory is a severe constraint, then the above can also be refactored to process the files line-by-line (at the cost of slower IO).
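A sketch of that line-by-line variant, still assuming the all_reads.txt target and the distributed Lock from above (the function name is just a placeholder):

import gzip

from dask import delayed
from dask.distributed import Lock

@delayed
def gzip_to_file_linewise(gzip_file):
    # Hold the lock for the whole file so lines from different workers don't interleave.
    with Lock("all_reads.txt"):
        with gzip.open(gzip_file, "rt", encoding="utf-8") as src, open("all_reads.txt", "a") as dst:
            for line in src:
                dst.write(line)  # only one line is held in memory at a time

This trades throughput for memory: each worker streams its file instead of materializing all lines in a list.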
I'm working with text and use torchtext.data.Dataset.
Creating the dataset takes a considerable amount of time.
For just running the program this is still acceptable. But I would like to debug the torch code for the neural network. And if python is started in debug mode, the dataset creation takes roughly 20 minutes (!!). That's just to get a working environment where I can debug-step through the neural network code.
I would like to save the Dataset, for example with pickle. This sample code is taken from here, but I removed everything that is not necessary for this example:
import pickle

from torchtext import data
from fastai.nlp import *

PATH = 'data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

TEXT = data.Field(lower=True, tokenize="spacy")

bs = 64
bptt = 70

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

with open("md.pkl", "wb") as file:
    pickle.dump(md, file)
To run the code, you need the aclImdb dataset; it can be downloaded from here. Extract it into a data/ folder next to this code snippet. The code produces an error in the last line, where pickle is used:
Traceback (most recent call last):
File "/home/lhk/programming/fastai_sandbox/lesson4-imdb2.py", line 27, in <module>
pickle.dump(md, file)
TypeError: 'generator' object is not callable
The samples from fastai often use dill instead of pickle. But that doesn't work for me either.
I came up with the following functions for myself:
import dill
from pathlib import Path

import torch
from torchtext.data import Dataset

def save_dataset(dataset, path):
    if not isinstance(path, Path):
        path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    torch.save(dataset.examples, path / "examples.pkl", pickle_module=dill)
    torch.save(dataset.fields, path / "fields.pkl", pickle_module=dill)

def load_dataset(path):
    if not isinstance(path, Path):
        path = Path(path)
    examples = torch.load(path / "examples.pkl", pickle_module=dill)
    fields = torch.load(path / "fields.pkl", pickle_module=dill)
    return Dataset(examples, fields)
Note that the actual objects could be a bit different; for example, if you save a TabularDataset, then load_dataset returns an instance of the plain Dataset class. This is unlikely to affect the data pipeline, but may require extra diligence for tests.
In the case of a custom tokenizer, it should be serializable as well (e.g. no lambda functions, etc).
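For instance, a usage sketch under those constraints (the paths and field names here are placeholders, not from the original code) defines the tokenizer as a named, module-level function so that dill/pickle can serialize it:

from torchtext import data

def whitespace_tokenizer(text):
    # A named module-level function is serializable; a lambda would not be.
    return text.split()

TEXT = data.Field(sequential=True, tokenize=whitespace_tokenizer, lower=True)
LABEL = data.Field(sequential=False)

# Hypothetical dataset built with those fields:
# train_ds = data.TabularDataset(path="data/train.csv", format="csv",
#                                fields=[("text", TEXT), ("label", LABEL)])
# save_dataset(train_ds, "cache/train")
# train_ds = load_dataset("cache/train")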
You can use dill instead of pickle. It works for me.
You can save a torchtext Field like this:

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=200, batch_first=True)

with open("model/TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)

And load a Field like this:

with open("model/TEXT.Field", "rb") as f:
    TEXT = dill.load(f)
Official code support is under development; you can follow https://github.com/pytorch/text/issues/451 and https://github.com/pytorch/text/issues/73.
You can always use pickle to dump the objects, but keep in mind that dumping a list of dictionaries or field objects is not taken care of by the module, so it's best to decompose the list first.
To store the Dataset object in a pickle file for easy loading later:
def save_to_pickle(dataSetObject, PATH):
    with open(PATH, 'wb') as output:
        for i in dataSetObject:
            pickle.dump(vars(i), output, pickle.HIGHEST_PROTOCOL)
The toughest part is yet to come: loading the pickle file back. ;)
First, look for all the field names and field attributes, and then go for the kill.
To load the pickle file into the Dataset object:
import pickle

from torchtext.data import Dataset, Example

def load_pickle(PATH, FIELDNAMES, FIELD):
    dataList = []
    with open(PATH, "rb") as input_file:
        while True:
            try:
                # Taking the dictionary instance as the input instance
                inputInstance = pickle.load(input_file)
                # Plugging it into the list
                dataInstance = [inputInstance[FIELDNAMES[0]], inputInstance[FIELDNAMES[1]]]
                # Finally creating a list of Example objects
                dataList.append(Example.fromlist(dataInstance, fields=FIELD))
            except EOFError:
                break
    # At last, creating the Dataset object
    exampleListObject = Dataset(dataList, fields=FIELD)
    return exampleListObject
This hackish solution has worked in my case; hope you find it useful in your case too.
By the way, any suggestion is welcome :).
The pickle/dill approach is fine if your dataset is small, but if you are working with large datasets I wouldn't recommend it, as it will be too slow.
I simply save the examples (iteratively) as JSON strings. The reason is that saving the whole Dataset object takes a lot of time, plus you need serialization tricks such as dill, which make the serialization even slower.
Moreover, these serializers take a lot of memory (some of them even create copies of the dataset), and if they start making use of swap memory, you're done. That process is going to take so long that you will probably terminate it before it finishes.
Therefore, I end up with the following approach:
1. Iterate over the examples.
2. Convert each example (on the fly) to a JSON string.
3. Write that JSON string into a text file (one sample per line).
4. When loading, add the examples to the Dataset object along with the fields.
import json
import time

def save_examples(dataset, savepath):
    with open(savepath, 'w') as f:
        # Save the number of elements (not really needed)
        f.write(json.dumps(len(dataset.examples)))  # Write examples length
        f.write("\n")

        # Save elements
        for pair in dataset.examples:
            data = [pair.src, pair.trg]
            f.write(json.dumps(data))  # Write samples
            f.write("\n")

def load_examples(filename):
    examples = []
    start = time.time()
    with open(filename, 'r') as f:
        # Read the number of elements (not really needed)
        total = json.loads(f.readline())

        # Load elements
        for i in range(total):
            line = f.readline()
            example = json.loads(line)
            # example = data.Example().fromlist(example, fields)  # Create Example obj. (you can do it here or later)
            examples.append(example)
    end = time.time()
    print(end - start)
    return examples
Then, you can simply rebuild the dataset by:
# Define fields
SRC = data.Field(...)
TRG = data.Field(...)
fields = [('src', SRC), ('trg', TRG)]

# Load examples from JSON and convert them to Example objects
examples = load_examples(filename)
examples = [data.Example.fromlist(d, fields) for d in examples]

# Build dataset
mydataset = Dataset(examples, fields)
The reason why I use JSON instead of pickle, dill, msgpack, etc is not arbitrary.
I did some tests and these are the results:
Dataset size: 2x (1,960,641)
Saving times:
- Pickle/Dill*: >30-45 min (...or froze my computer)
- MessagePack (iterative): 123.44 sec
100%|██████████| 1960641/1960641 [02:03<00:00, 15906.52it/s]
- JSON (iterative): 16.33 sec
100%|██████████| 1960641/1960641 [00:15<00:00, 125955.90it/s]
- JSON (bulk): 46.54 sec (memory problems)
Loading times:
- Pickle/Dill*: -
- MessagePack (iterative): 143.79 sec
100%|██████████| 1960641/1960641 [02:23<00:00, 13635.20it/s]
- JSON (iterative): 33.83 sec
100%|██████████| 1960641/1960641 [00:33<00:00, 57956.28it/s]
- JSON (bulk): 27.43 sec
*Similar approach as the other answers
I'm relatively new to the Python programming language and I ran into a problem with the zstandard module.
I'm currently working with the replay files of Halite.
Since they are compressed with zstandard, I have to use this module. Reading a file works fine: I can decompress the ".hlt" files.
But I've done some transformations of the JSON data that I want to save on disk to use later. I find it very useful to store the data compressed again, so I used the compressor. The compression works fine, too. However, if I open the file I just created again, I get an error message reading: "zstd.ZstdError: decompression error: Unknown frame descriptor".
Have a look at my code below:
def getFileData(self, filename):
    with open(filename, "rb") as file:
        data = file.read()
    return data

def saveDataToFile(self, filename, data):
    with open(filename, "bw") as file:
        file.write(data)

def transformCompressedToJson(self, data, beautify=0):
    zd = ZstdDecompressor()
    decompressed = zd.decompress(data, len(data))
    return json.loads(decompressed)

def transformJsonToCompressed(self, jsonData, beautify=0):
    zc = ZstdCompressor()
    if beautify > 0:
        jsonData = json.dumps(jsonData, sort_keys=True, indent=beautify)
    objectCompressor = zc.compressobj()
    compressed = objectCompressor.compress(jsonData.encode())
    return objectCompressor.flush()
And I am using it here:
rp = ReplayParser()

gameDict = rp.parse('replays/replay-20180215-152416+0100--4209273584-160-160-278627.hlt')
compressed = rp.transformJsonToCompressed(json.dumps(gameDict, sort_keys=False, indent=0))
rp.saveDataToFile("test.cmp", compressed)

t = rp.getFileData('test.cmp')
j = rp.transformCompressedToJson(t)  # <- here is the error
print(j)
The function rp.parse(..) just transforms the data, so it simply creates a dictionary. The rp.parse(..) function also calls transformCompressedToJson, so that works fine for the .hlt file.
Hopefully, you guys can help me with this.
Greetings,
Noixes
In transformJsonToCompressed(), you are throwing away the result of the .compress() method (which is likely going to be the bulk of the output data), and instead returning only the result of .flush() (which will just be the last little bit of data remaining in buffers). The normal way to use a compression library like this would be to write each chunk of compressed data directly to the output file as it is generated. Your code isn't structured to allow that (the function knows nothing about the file the data will be written to), so instead you could concatenate the two chunks of compressed data and return that.
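A sketch of that fix for the method from the question (assuming ZstdCompressor comes from the zstandard package, as in the rest of the code):

def transformJsonToCompressed(self, jsonData, beautify=0):
    zc = ZstdCompressor()
    if beautify > 0:
        jsonData = json.dumps(jsonData, sort_keys=True, indent=beautify)
    objectCompressor = zc.compressobj()
    compressed = objectCompressor.compress(jsonData.encode())
    # flush() only returns what is still buffered; concatenate it with the
    # chunk returned by compress() instead of discarding that chunk.
    return compressed + objectCompressor.flush()

For data that already fits in memory, the one-shot zc.compress(jsonData.encode()) call produces a complete frame in a single step and avoids the issue entirely.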
I have a list in my program and a function that appends to it. Unfortunately, when you close the program, whatever you added goes away and the list goes back to its initial state. Is there any way I can store the data so the user can re-open the program and get the full list back?
You may try the pickle module to store in-memory data on disk. Here is an example:
Store data:

import pickle

dataset = ['hello', 'test']

outputFile = 'test.data'
with open(outputFile, 'wb') as fw:
    pickle.dump(dataset, fw)
Load data:

import pickle

inputFile = 'test.data'
with open(inputFile, 'rb') as fd:
    dataset = pickle.load(fd)
print(dataset)
You can save the data to a database (for example with SQLite) or to a .txt file. For example:
with open("mylist.txt","w") as f: #in write mode
f.write("{}".format(mylist))
Your list goes into the format() function. It'll make a .txt file named mylist and will save your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f: #in read mode, not in write mode, careful
rd=f.readlines()
print (rd)
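Note that readlines() gives you the text back as strings, not as a list object. If you need the actual list again, one option (a sketch on top of the answer above, and it only works if the list contains plain literals such as strings and numbers) is ast.literal_eval:

import ast

with open("mylist.txt") as f:
    mylist = ast.literal_eval(f.read())  # parses the "[...]" text back into a Python list
print(mylist)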
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
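As a minimal sqlite3 sketch (the file, table, and column names are just placeholders), persisting list items between runs could look like this:

import sqlite3

conn = sqlite3.connect("items.db")  # creates the file on first use
conn.execute("CREATE TABLE IF NOT EXISTS items (value TEXT)")

def add_item(value):
    with conn:  # commits automatically on success
        conn.execute("INSERT INTO items (value) VALUES (?)", (value,))

def load_items():
    return [row[0] for row in conn.execute("SELECT value FROM items")]

add_item("hello")
print(load_items())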
For storing big data, the HDF5 library is suitable. In Python it is available through h5py.
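A minimal h5py sketch (the file and dataset names are placeholders), assuming the data is numerical arrays:

import h5py
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Write a compressed dataset to an HDF5 file.
with h5py.File("store.h5", "w") as f:
    f.create_dataset("values", data=values, compression="gzip")

# Read it back; h5py slices lazily, so you can load just part of it.
with h5py.File("store.h5", "r") as f:
    first_ten = f["values"][:10]
print(first_ten)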
I have two problems with loading data in Python. Both scripts work properly, but they need too much time to run, and sometimes the first one ends with "Killed".
I have a big zipped text file and I do something like this:
import gzip
import cPickle as pickle

f = gzip.open('filename.gz', 'r')

tab = {}
for line in f:
    pass  # fill tab from each line

with open("data_dict.pkl", "wb") as g:
    pickle.dump(tab, g)

f.close()
Then I have to do some operations on the dictionary I created in the previous script:
import cPickle as pickle

with open("data_dict.pkl", "rb") as f:
    tab = pickle.load(f)
    f.close()

# operations on tab (the dictionary)
Do you have other solutions in mind? Maybe not ones involving YAML or JSON...
If the data you are pickling is primitive and simple, you can try the marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.
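A minimal sketch of that approach (the filename is a placeholder; keep in mind that marshal only handles simple built-in types and its format is not guaranteed to be stable across Python versions):

import marshal

tab = {"a": 1, "b": [2, 3]}  # only simple built-in types are supported

# Dump the dictionary.
with open("data_dict.marshal", "wb") as f:
    marshal.dump(tab, f)

# Load it back.
with open("data_dict.marshal", "rb") as f:
    tab = marshal.load(f)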
First, one comment: in
with open("data_dict.pkl", "rb") as f:
tab = pickle.load(f)
f.close()
f.close() is not necessary; the context manager (the with syntax) does that automatically.
Now as for speed, I don't think you're going to get much faster than cPickle for reading something from disk directly into a Python object. If this script needs to be run over and over, I would try using memcached via pylibmc to keep the object stored persistently in memory so you can access it lightning fast:
import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})

d = range(10000)    ## some big object
mc["some_key"] = d  ## save in memory
Then, after saving it once, you can access and modify it; it stays in memory even after the previous program finishes its execution:
import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})

d = mc["some_key"]         ## load from memory
d[0] = 'some other value'  ## modify
mc["some_key"] = d         ## save to memory again