I am applying some preprocessing to the CIFAR100 dataset
from datasets.load import load_dataset
from datasets import Features, Array3D
from transformers.models.vit.feature_extraction_vit import ViTFeatureExtractor

# Resampling & Normalization
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
dataset = load_dataset('cifar100', split='train[:100]')

features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    **dataset.features,
})

dataset = dataset.map(lambda batch, col_name: feature_extractor(batch[col_name]),
                      features=features, fn_kwargs={'col_name': 'img'}, batched=True)
I got the following warning, which means datasets cannot cache the transformed dataset.
Reusing dataset cifar100 (/home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142)
Parameter 'function'=<function <lambda> at 0x7f3279f3eef0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Curiously, I can pickle/dill foo but not lambda x: foo(x), even though the two have exactly the same effect. I guess that's related to the problem?
>>> def foo(x): return x + 1
...
>>> Hasher.hash(foo)
'ff7fae499aa1d820'
>>> Hasher.hash(lambda x: foo(x))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/fingerprint.py", line 237, in hash
return cls.hash_default(value)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/fingerprint.py", line 230, in hash_default
return cls.hash_bytes(dumps(value))
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 564, in dumps
dump(obj, file)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 539, in dump
Pickler(file, recurse=True).dump(obj)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/dill/_dill.py", line 620, in dump
StockPickler.dump(self, obj)
File "/home/qys/.pyenv/versions/3.10.4/lib/python3.10/pickle.py", line 487, in dump
self.save(obj)
File "/home/qys/.pyenv/versions/3.10.4/lib/python3.10/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 862, in save_function
dill._dill._save_with_postproc(
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/dill/_dill.py", line 1153, in _save_with_postproc
pickler.write(pickler.get(pickler.memo[id(dest)][0]))
KeyError: 139847629663936
I have also tried making the function accessible from the top level of a module, i.e.
preprocessor = lambda batch: feature_extractor(batch['img'])
dataset = dataset.map(preprocessor, features=features, batched=True)
However, it still doesn't work:
>>> from datasets.fingerprint import Hasher
>>> preprocessor = lambda batch: feature_extractor(batch['img'])
>>> Hasher.hash(preprocessor)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/fingerprint.py", line 237, in hash
return cls.hash_default(value)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/fingerprint.py", line 230, in hash_default
return cls.hash_bytes(dumps(value))
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 564, in dumps
dump(obj, file)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 539, in dump
Pickler(file, recurse=True).dump(obj)
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/dill/_dill.py", line 620, in dump
StockPickler.dump(self, obj)
File "/home/qys/.pyenv/versions/3.10.4/lib/python3.10/pickle.py", line 487, in dump
self.save(obj)
File "/home/qys/.pyenv/versions/3.10.4/lib/python3.10/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 862, in save_function
dill._dill._save_with_postproc(
File "/home/qys/Research/embedder/.venv/lib/python3.10/site-packages/dill/_dill.py", line 1153, in _save_with_postproc
pickler.write(pickler.get(pickler.memo[id(dest)][0]))
KeyError: 140408024252096
In Python 3.9, pickling a function also serializes its glob_ids dictionary in addition to its globs. To make hashing deterministic when the globals are not in the same order, the order of glob_ids needs to be made deterministic as well. PR with the fix: https://github.com/huggingface/datasets/pull/4516
Until that is merged, a temporary fix is to use an older version of dill:
pip install "dill<0.3.5"
(see https://github.com/huggingface/datasets/issues/4506#issuecomment-1157417219)
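A minimal sketch of how one might verify the temporary fix, assuming dill has been downgraded as above (the named function preprocess below is just an illustrative stand-in for the lambda; with a compatible dill version the fingerprint becomes stable):

from datasets.fingerprint import Hasher

def preprocess(batch):
    # same effect as the lambda above: run the ViT feature extractor on the image column
    return feature_extractor(batch['img'])

# With dill < 0.3.5 this returns a stable hex digest instead of raising KeyError,
# so Dataset.map can fingerprint the transform and reuse its cache.
print(Hasher.hash(preprocess))

dataset = dataset.map(preprocess, features=features, batched=True)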
Related
Hi, I am working on a project to do person segmentation. I followed the code from https://pixellib.readthedocs.io/en/latest/Image_pascal.html#image-pascal and it is giving me an error on line 4: ValueError: Resize method is not implemented.
import pixellib
from pixellib.semantic import semantic_segmentation
segment_video = semantic_segmentation()
segment_video.load_pascalvoc_model("deeplabv3_xception_tf_dim_ordering_tf_kernels.h5")
segment_video.process_video_pascalvoc("IMG_2649.mp4", overlay=True, frames_per_second=15,
                                      output_video_name="output.mp4")
Anyone know why this error is being triggered?
Error:
Traceback (most recent call last):
File "H:/Yolo/person_seg.py", line 4, in <module>
segment_video = semantic_segmentation()
File "G:\anaconda3\envs\yolo5\lib\site-packages\pixellib\semantic.py", line 23, in __init__
self.model = Deeplab_xcep_pascal()
File "G:\anaconda3\envs\yolo5\lib\site-packages\pixellib\deeplab.py", line 214, in Deeplab_xcep_pascal
method='bilinear', align_corners=True))(b4)
File "G:\anaconda3\envs\yolo5\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "G:\anaconda3\envs\yolo5\lib\site-packages\tensorflow\python\keras\layers\core.py", line 743, in call
return self.function(inputs, **arguments)
File "G:\anaconda3\envs\yolo5\lib\site-packages\pixellib\deeplab.py", line 214, in <lambda>
method='bilinear', align_corners=True))(b4)
File "G:\anaconda3\envs\yolo5\lib\site-packages\tensorflow\python\ops\image_ops_impl.py", line 960, in resize_images
name=None)
File "G:\anaconda3\envs\yolo5\lib\site-packages\tensorflow\python\ops\image_ops_impl.py", line 1088, in resize_images_v2
raise ValueError('Resize method is not implemented.')
ValueError: Resize method is not implemented.
Make sure to follow the initial setup steps prior to installing the PixelLib library, since it requires a recent version of TensorFlow (2.0+) as well as imgaug.
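If in doubt, a quick environment check along these lines can help (purely illustrative; the only hard requirement stated by PixelLib is TensorFlow 2.0+ and imgaug):

import tensorflow as tf
import imgaug

# "ValueError: Resize method is not implemented." typically points to an older or
# incompatible TensorFlow build being picked up by the environment.
print(tf.__version__)      # should be 2.0 or newer
print(imgaug.__version__)

If TensorFlow turns out to be older than 2.0, upgrade it (pip install --upgrade tensorflow) and reinstall PixelLib afterwards.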
My code is as follows:
df['O_ID'] = (df.apply(get_match_id, arguments=[df['pickup_longitude'], df['pickup_latitude']])).jit_cuda()
When I first used this function, jit_cuda(), there was an error: "No module named cupy".
But after I installed cupy-cuda101 (matching my CUDA version), I get a new error:
Traceback (most recent call last):
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 3580, in table_part
values[name] = df.evaluate(name)
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 2616, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, internal=internal, parallel=parallel, chunk_size=chunk_size)
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5352, in _evaluate_implementation
dtypes[expression] = df.data_type(expression, internal=False)
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 1998, in data_type
data = self.evaluate(expression, 0, 1, filtered=True, internal=True, parallel=False)
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 2616, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, internal=internal, parallel=parallel, chunk_size=chunk_size)
File "F:\Anaconda3\lib\site-packages\vaex\dataframe.py", line 5427, in _evaluate_implementation
value = scope.evaluate(expression)
File "F:\Anaconda3\lib\site-packages\vaex\scopes.py", line 97, in evaluate
result = self[expression]
File "F:\Anaconda3\lib\site-packages\vaex\scopes.py", line 139, in __getitem__
self.values[variable] = self.evaluate(expression) # , out=self.buffers[variable])
File "F:\Anaconda3\lib\site-packages\vaex\scopes.py", line 103, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "F:\Anaconda3\lib\site-packages\vaex\expression.py", line 1073, in __call__
return self.f(*args, **kwargs)
File "F:\Anaconda3\lib\site-packages\vaex\expression.py", line 1120, in wrapper
return cupy.asnumpy(func(*args))
File "cupy\core\fusion.pyx", line 905, in cupy.core.fusion.Fusion.__call__
File "cupy\core\fusion.pyx", line 754, in cupy.core.fusion._FusionHistory.get_fusion
File "<string>", line 6, in f
NameError: name 'lambda_function_1' is not defined
How should I solve it?
My understanding is that just-in-time compilation in vaex works only for virtual columns, i.e. expressions/columns computed mainly from arithmetic operations using numpy methods or pure Python arithmetic.
When using apply, the function can be quite arbitrary, basically whatever you want, so it may not be possible to compile it.
If you can rewrite your .apply function as numpy-style expressions, then you should be able to use the jit_cuda method to accelerate it. Vaex does not recommend using .apply anyway, since it is hard to parallelize and should be treated as a "last resort".
Source: https://vaex.io/docs/tutorial.html#Just-In-Time-compilation
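For illustration only, a sketch along these lines shows the idea on vaex's bundled example dataset (the columns x and y and the derived column r are arbitrary; the point is that the computation is built from expression arithmetic rather than .apply, so vaex can compile it):

import vaex

df = vaex.example()
# Build a virtual column from numpy-style arithmetic on expressions (no .apply).
r = (df.x**2 + df.y**2)**0.5
# jit_numba() compiles for the CPU; jit_cuda() requires a working cupy install.
df['r'] = r.jit_cuda()
print(df[['x', 'y', 'r']].head(3))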
Using pandas version 0.19.1 (with py27-tables-3.2.2_1 and hdf5-1.10.0 installed on my system), I am trying to save a pandas DataFrame to a .h5 file with:
import pandas as pd
df = pd.DataFrame(dict(A=range(5), B=range(5)))
df.to_hdf('savefile.h5', 'table', mode='w')
However the following error results:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1138, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 270, in to_hdf
f(store)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 264, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 873, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 2864, in write
self.attrs.ndim = data.ndim
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 461, in __setattr__
self._g__setattr(name, value)
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 403, in _g__setattr
self._g_setattr(self._v_node, name, stvalue)
File "tables/hdf5extension.pyx", line 696, in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7549)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5A.c", line 634, in H5Awrite
not an attribute
End of HDF5 error back trace
Can't set attribute 'ndim' in node:
/table (Group) ''.
Could someone provide a simple working example of how to save a pandas DataFrame in HDF5 format?
PyTables is currently not compatible with hdf5-1.10, as reported in this issue on GitHub; downgrading to hdf5-1.8 is the recommended solution.
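A quick way to confirm which HDF5 library PyTables is actually linked against (purely a diagnostic; the to_hdf call itself stays unchanged):

import tables
# Prints the PyTables version together with the HDF5, NumPy, etc. versions it was built against.
tables.print_versions()

Once the PyTables and HDF5 versions are compatible again, the df.to_hdf('savefile.h5', 'table', mode='w') snippet above works as-is.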
I am writing a Python layer for Caffe that reads HDF5 with some manipulation I need. But I have an issue when opening and reading the HDF5 file in the layer's setup method. Here it is.
When I use tables with this code in setup:
def setup(self, bottom, top):
h5file = tables.open_file("/home/titan/models/hdf5/train_small.h5", driver="H5FD_CORE")
I get this error when running the net:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 37, in setup
h5file = tables.open_file("/home/titan/models/hdf5/train_small.h5", driver="H5FD_CORE")
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 318, in open_file
return File(filename, mode, title, root_uep, filters, **kwargs)
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 784, in __init__
self._g_new(filename, mode, **params)
File "tables/hdf5extension.pyx", line 465, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:4872)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5FDcore.c", line 273, in H5Pset_fapl_core
not a file access property list
File "../../../src/H5Pint.c", line 3371, in H5P_object_verify
property list is not a member of the class
File "../../../src/H5Pint.c", line 3321, in H5P_isa_class
not a property list
End of HDF5 error back trace
When I use h5py:
def setup(self, bottom, top):
self.data = h5py.File('/home/titan/models/hdf5_nose_mouth/train_small.h5', 'r')
I get the same kind of error:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 11, in <module>
import h5py
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/__init__.py", line 31, in <module>
from .highlevel import *
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/highlevel.py", line 13, in <module>
from ._hl.base import is_hdf5, HLObject
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/_hl/base.py", line 78, in <module>
dlapl = default_lapl()
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/_hl/base.py", line 65, in default_lapl
lapl = h5p.create(h5p.LINK_ACCESS)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
File "h5py/h5p.pyx", line 130, in h5py.h5p.create (-------src-dir-------/h5py/h5p.c:2491)
ValueError: Not a property list class (Not a property list class)
When I use deepdish:
def setup(self, bottom, top):
self.data = dd.io.load('/home/titan/models/hdf5/train_smallest.h5')
I get an error too:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 36, in setup
self.data = dd.io.load('/home/titan/models/hdf5/train_smallest.h5')
File "/home/titan/anaconda/lib/python2.7/site-packages/deepdish/io/hdf5io.py", line 476, in load
with tables.open_file(path, mode='r') as h5file:
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 318, in open_file
return File(filename, mode, title, root_uep, filters, **kwargs)
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 784, in __init__
self._g_new(filename, mode, **params)
File "tables/hdf5extension.pyx", line 488, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:5081)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5F.c", line 1582, in H5Fopen
not file access property list
File "../../../src/H5Pint.c", line 3321, in H5P_isa_class
not a property list
End of HDF5 error back trace
But when I read a simple *.txt file, everything is OK. I can also read this HDF5 file from the console and use it in an HDF5 data layer in Caffe. Please help me: how can I read an HDF5 file from a Python layer?
Try to install another version of h5py. I just solved it with:
pip install h5py==prev_version
I guess it has something to do with the linking, but it would be interesting to know the exact cause of this problem.
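To narrow down the linking question before picking a version, it may help to compare what h5py and PyTables were each built against (a diagnostic sketch, not a fix by itself):

import h5py
print(h5py.version.version)        # h5py package version
print(h5py.version.hdf5_version)   # HDF5 library h5py was compiled against

import tables
print(tables.hdf5_version)         # HDF5 library PyTables was compiled against

If the two HDF5 versions differ, a mismatch inside the Caffe process would be consistent with the "not a property list" errors above.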
I tried to use pickle to dump an MDAnalysis Universe object, but I got an error message like:
Traceback (most recent call last):
File "convert.py", line 9, in <module>
blob = pickle.dumps(u)
File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python2.7/pickle.py", line 419, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 663, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
dict = getstate()
TypeError: 'AtomGroup' object is not callable
Any suggestion would be appreciated!
Updated answer (for MDAnalysis ≥ 2.0)
Since MDAnalysis 2.0.0 (August 2021), Universes can be pickled.
import MDAnalysis as mda
import pickle
# topology and trajectory are placeholders for your own input file paths
u = mda.Universe(topology, trajectory)
pickle.dump(u, open("universe.pkl", "wb"))
# load pickled universe
u_pickled = pickle.load(open("universe.pkl", "rb"))
# test that we get same positions
(u_pickled.atoms.positions == u.atoms.positions).all()
# -> True
# but that universes are different
u == u_pickled
# -> False
See also Parallelizing Analysis in the User Guide.
Old answer
MDAnalysis.Universe objects contain some objects that cannot be serialized and pickled by the standard mechanisms, such as open file descriptors. One would need to write specialized __getstate__() and __setstate__() methods as described in the Pickle protocol but none of this is implemented as of the current 0.8.1 (April 2014) release.
The specific error is explained by Manel in his comment on MDAnalysis Issue 173: pickle searches for a __getstate__() method. Although that is not implemented, Universe, which manages its own attributes to generate "virtual attributes" on the fly, interprets the lookup as an atom selection and eventually returns an empty AtomGroup instance. This object, in turn, gets called because pickle believes it is the local implementation of __getstate__. AtomGroup, however, is not callable, and the error results.
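The failure mode can be mimicked with a small toy (illustrative only, not MDAnalysis code): pickle probes the object for special methods such as __getstate__ via attribute lookup, the catch-all __getattr__ hands back a non-callable object, and pickle then tries to call it:

import pickle

class FakeAtomGroup(object):
    """Stand-in for AtomGroup: an ordinary, non-callable object."""

class FakeUniverse(object):
    def __getattr__(self, name):
        # Mimic Universe's on-the-fly "virtual attributes": any unknown attribute
        # lookup (including pickle's probes for methods such as __getstate__)
        # yields an AtomGroup-like object instead of raising AttributeError.
        return FakeAtomGroup()

try:
    pickle.dumps(FakeUniverse())
except TypeError as err:
    print(err)  # e.g. 'FakeAtomGroup' object is not callable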
As you probably noticed, you got a quicker response by asking on the MDAnalysis user list or by filing an issue; Stack Overflow is typically lower on the developers' list of venues for answering such specific questions.