Changing the content of Pickle File

Changing the content of Pickle File - python

I have trained a deep learning model and it got saved in a pickle file. Due to some reason, I have to slightly change the code from which I got the pickle file. It took me months in training & I want to anyhow use the last pickle file created, as the weights will remains same. Is there any way to view and change the content of the pickle file?
Edit: For example, if we have the stylegan2 pre-trained network pickle file and suppose we made changes on the G_synthesis function code (present in https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py) then how can we use the old pickled file.

If you just want to change some functions but keep the same weights, can you just copy the weights to new model like this:
import pickle
from old_model_file import old_model
from new_model_file import new_model
# 1.load pickle file
with open('old.pickle','rb') as f:
old_pickle = pickle.load(f)
# 2.create model based new model
new_pickle = new_model()
# 3. copy weights from old model
'''
##you should copy all weights from old_pickle to new_pickle
##for example:
new_pickle.weight_A = old_pickle.weight_A
new_pickle.weight_B = old_pickle.weight_B
'''
# 4. save the new model
with open('new.pickle','wb') as f:
pickle.dump(new_pickle,f)
Is this what you want?

Related

RobotFramework file .save() function changes format

I have been coding in Robotframework and Python:
I use get_model() to get model from a .robot file. Then modify, the model using ModelTransformer() which works on the basic idea of AST(Abstract Syntax Trees).
But after finishing the modification, when I try to save the modified model into a new .robot file using .save() function, this completely changes the format of the new robot file.
# Code to save new robot file
model.save("New.robot")
Can anyone please let me know, how to solve this ?

Edited:
from robot.api import TestData
from robot.api.parsing import ModelTransformer
# read the original model from the .robot file
test_data = TestData(source='original.robot')
# apply your modifications using the ModelTransformer
transformer = ModelTransformer()
transformed_data = transformer.visit(test_data)
# write the modified model to a new .robot file
transformed_data.save(filename='new.robot')
# write the modified model to a new .robot file
with open('new.robot', 'w') as file:
transformed_data.save(file)

How to extract metadata from tflite model

I'm loading this object detection model in python. I can load it with the following lines of code:
import tflite_runtime.interpreter as tflite
model_path = 'path_to_model_file.tf'
interpreter = tflite.Interpreter(model_path)
I'm able to perform inferences on this without any problem. However, labels are suposed to be included in the metadata, according to model's documentation, but I can't extract it.
The closest I was, it was when following this:
from tflite_support import metadata as _metadata
displayer = _metadata.MetadataDisplayer.with_model_file(model_path)
export_json_file = "extracted_metadata.json")
json_file = displayer.get_metadata_json()
# Optional: write out the metadata as a json file
with open(export_json_file, "w") as f:
f.write(json_file)
but the very first line of code, fails with this error: {AtributeError}'int' object has no attribute 'tobytes'.
How to extract it?

If you only care about the label file, you can simply run command like unzip model_path on Linux or Mac. TFLite model with metadata is essentially a zip file. See the public introduction for more details.
You code snippet to extract metadata works on my end. Make sure to double check model_path. It should be a string, such as "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite".
If you'd like to read label files in an Android app, here is the sample code to do so.

Using trained GB classifier for new data

I have trained my Gradient Boosting Classifier and saved the model using pickle
with open("model.bin", 'wb') as f_out:
pickle.dump(xgb_clf, f_out)
As a data source, I had .csv-file.
Now I need to test the performance on completely new data, but I do not now how.
I found several tutorials, but was unable to proceed.
I understand that the key is to load the saved model
with open('model.bin', 'rb') as f_in:
model = pickle.load(f_in)
but I do not know how to apply this model on new data I have in csv.
Could you help, please?
Thank you.

The model object you are using should have a method, similar to model.predict(x), depending on the library (I'm assuming it is scikit-learn).
You need to load the data from the .csv file:
import pandas as pd
data = pd.read_csv('data.csv')
Select columns that belong to x:
x = data[['col1', 'col2']]
And call the prediction:
res = model.predict(x)

You can directly use the predict function.
model.predict(data)

how to save torchtext Dataset?

I'm working with text and use torchtext.data.Dataset.
Creating the dataset takes a considerable amount of time.
For just running the program this is still acceptable. But I would like to debug the torch code for the neural network. And if python is started in debug mode, the dataset creation takes roughly 20 minutes (!!). That's just to get a working environment where I can debug-step through the neural network code.
I would like to save the Dataset, for example with pickle. This sample code is taken from here, but I removed everything that is not necessary for this example:
from torchtext import data
from fastai.nlp import *
PATH = 'data/aclImdb/'
TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
TEXT = data.Field(lower=True, tokenize="spacy")
bs = 64;
bptt = 70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
with open("md.pkl", "wb") as file:
pickle.dump(md, file)
To run the code, you need the aclImdb dataset, it can be downloaded from here. Extract it into a data/ folder next to this code snippet. The code produces an error in the last line, where pickle is used:
Traceback (most recent call last):
File "/home/lhk/programming/fastai_sandbox/lesson4-imdb2.py", line 27, in <module>
pickle.dump(md, file)
TypeError: 'generator' object is not callable
The samples from fastai often use dill instead of pickle. But that doesn't work for me either.

I came up with the following functions for myself:
import dill
from pathlib import Path
import torch
from torchtext.data import Dataset
def save_dataset(dataset, path):
if not isinstance(path, Path):
path = Path(path)
path.mkdir(parents=True, exist_ok=True)
torch.save(dataset.examples, path/"examples.pkl", pickle_module=dill)
torch.save(dataset.fields, path/"fields.pkl", pickle_module=dill)
def load_dataset(path):
if not isinstance(path, Path):
path = Path(path)
examples = torch.load(path/"examples.pkl", pickle_module=dill)
fields = torch.load(path/"fields.pkl", pickle_module=dill)
return Dataset(examples, fields)
Not that actual objects could be a bit different, for example, if you save TabularDataset, then load_dataset returns an instance of class Dataset. This unlikely affect the data pipeline but may require extra diligence for tests.
In the case of a custom tokenizer, it should be serializable as well (e.g. no lambda functions, etc).

You can use dill instead of pickle. It works for me.
You can save a torchtext Field like
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True,fix_length=200,batch_first=True)
with open("model/TEXT.Field","wb")as f:
dill.dump(TEXT,f)
And load a Field like
with open("model/TEXT.Field","rb")as f:
TEXT=dill.load(f)
The offical code suppport is under development，you can follow https://github.com/pytorch/text/issues/451 and https://github.com/pytorch/text/issues/73 .

You can always use the pickle to dump the objects, but keep in mind one thing that dumping a list of dictionary or fields objects are not taken care of by the module, so to the best try to decompose the list first
To Store the DataSet Object to a pickle file for later easy loading
def save_to_pickle(dataSetObject,PATH):
with open(PATH,'wb') as output:
for i in dataSetObject:
pickle.dump(vars(i), output, pickle.HIGHEST_PROTOCOL)
The toughest thing is yet to come, Yeah loading the pickle file.... ;)
First, try to look for all field names and field attributes and then go for the kill
To load the pickle file into the DataSetObject
def load_pickle(PATH, FIELDNAMES, FIELD):
dataList = []
with open(PATH, "rb") as input_file:
while True:
try:
# Taking the dictionary instance as the input Instance
inputInstance = pickle.load(input_file)
# plugging it into the list
dataInstance = [inputInstance[FIELDNAMES[0]],inputInstance[FIELDNAMES[1]]]
# Finally creating an example objects list
dataList.append(Example().fromlist(dataInstance,fields=FIELD))
except EOFError:
break
# At last creating a data Set Object
exampleListObject = Dataset(dataList, fields=data_fields)
return exampleListObject
This hackish solution has worked in my case, hope you will find it useful in your case too.
Btw any suggestion is welcome :).

The pickle/dill approach is fine if your dataset is small. But if you are working with large datasets I won't recommend it as it will be too slow.
I simply save the examples (iteratively) as JSON-strings. The reason behind this is because saving the whole Dataset object takes a lot of time, plus you need serialization tricks such a dill, which makes the serialization even slower.
Moreover, these serializers take a lot of memory (some of them even create copies of the dataset) and if they start making use of the swap memory, you're done. That process is gonna take so long that you will probably terminate it before it finishes.
Therefore, I end up with the following approach:
Iterate over the examples
Convert each example (on the fly) to a JSON-string
Write that JSON-string into a text file (one sample per
line)
When loading, add the examples to the Dataset object along with the fields
def save_examples(dataset, savepath):
with open(savepath, 'w') as f:
# Save num. elements (not really need it)
f.write(json.dumps(total)) # Write examples length
f.write("\n")
# Save elements
for pair in dataset.examples:
data = [pair.src, pair.trg]
f.write(json.dumps(data)) # Write samples
f.write("\n")
def load_examples(filename):
examples = []
with open(filename, 'r') as f:
# Read num. elements (not really need it)
total = json.loads(f.readline())
# Save elements
for i in range(total):
line = f.readline()
example = json.loads(line)
# example = data.Example().fromlist(example, fields) # Create Example obj. (you can do it here or later)
examples.append(example)
end = time.time()
print(end - start)
return examples
Then, you can simply rebuild the dataset by:
# Define fields
SRC = data.Field(...)
TRG = data.Field(...)
fields = [('src', SRC), ('trg', TRG)]
# Load examples from JSON and convert them to "Example objects"
examples = load_examples(filename)
examples = [data.Example().fromlist(d, fields) for d in examples]
# Build dataset
mydataset = Dataset(examples, fields)
The reason why I use JSON instead of pickle, dill, msgpack, etc is not arbitrary.
I did some tests and these are the results:
Dataset size: 2x (1,960,641)
Saving times:
- Pickle/Dill*: >30-45 min (...or froze my computer)
- MessagePack (iterative): 123.44 sec
100%|██████████| 1960641/1960641 [02:03<00:00, 15906.52it/s]
- JSON (iterative): 16.33 sec
100%|██████████| 1960641/1960641 [00:15<00:00, 125955.90it/s]
- JSON (bulk): 46.54 sec (memory problems)
Loading times:
- Pickle/Dill*: -
- MessagePack (iterative): 143.79 sec
100%|██████████| 1960641/1960641 [02:23<00:00, 13635.20it/s]
- JSON (iterative): 33.83 sec
100%|██████████| 1960641/1960641 [00:33<00:00, 57956.28it/s]
- JSON (bulk): 27.43 sec
*Similar approach as the other answers

Word2Vec Vocabulary not definded error

I am new to python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])

I could see some wrong things in your code like the file is opened in write mode and the model which you have loaded doesn't contain the word which you want to find the most similar words.
I would like to suggest to use the predefined models like google_news_vectors to load in the gensim or to build your own word2vec model so that you won't get the error.
the usage of most_similar in gensim is out = model.most_similar("word-name")
file_object=open("SupremeCourt.txt","r")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)#use google news vectors here
out=model.most_similar("word")
print(out)

You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Changing the content of Pickle File - python

Related

RobotFramework file .save() function changes format

How to extract metadata from tflite model

Using trained GB classifier for new data

how to save torchtext Dataset?

Word2Vec Vocabulary not definded error

Categories

Resources