I'm trying to create a subclass of the PyTorch MNIST dataset class, which I call CustomMNISTDataset, as follows:
import torchvision.datasets as datasets

class CustomMNISTDataset(datasets.MNIST):
    def __init__(self, root='/home/psando'):
        super().__init__(root=root,
                         download=False)
but when I execute:
dataset = CustomMNISTDataset()
it fails with the error: "RuntimeError: Dataset not found. You can use download=True to download it".
However, when I run the following in the same file:
dataset = datasets.MNIST(root='/home/psando', download=False)
print(len(dataset))
it succeeds and prints "60000", as expected.
Since CustomMNISTDataset subclasses datasets.MNIST, why is the behavior different? I've verified that the path '/home/psando' contains the MNIST directory with raw and processed subdirectories (otherwise, explicitly calling the datasets.MNIST() constructor would have failed as well). The current behavior implies that the call to super().__init__() within CustomMNISTDataset is not invoking the datasets.MNIST constructor, which would be very strange!
Other details: I'm using Python 3.6.8 with torch==1.6.0 and
torchvision==0.7.0. Any help would be appreciated!
This requires some source-diving, but your problem is in the raw_folder and processed_folder properties of the MNIST class. The path to the dataset depends on the name of the class (the properties are built from self.__class__.__name__), so when you subclass MNIST, the root folder changes to /home/psando/CustomMNISTDataset.
So if you rename /home/psando/MNIST to /home/psando/CustomMNISTDataset, it works.
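If you'd rather not rename the folder, a minimal sketch of an alternative (based on how torchvision 0.7.0 builds those folder paths from the class name) is to override the properties in the subclass so they keep pointing at the original MNIST directory:

import os
import torchvision.datasets as datasets

class CustomMNISTDataset(datasets.MNIST):
    def __init__(self, root='/home/psando'):
        super().__init__(root=root, download=False)

    # Pin the folder names back to 'MNIST' instead of the subclass name.
    @property
    def raw_folder(self):
        return os.path.join(self.root, 'MNIST', 'raw')

    @property
    def processed_folder(self):
        return os.path.join(self.root, 'MNIST', 'processed')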
I am trying to load my model files using the code below:
import gensim
import os
from azure.storage.blob import BlobServiceClient
from smart_open import open
azure_storage_connection_string = "DefaultEndpointsProtocol=https;AccountName=lnipcfdevlanding;AccountKey=xxxxxxxxx"
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
file_prefix="azure://landing/TechnologyCluster/VectorCreation/embeddings/"
fin = open(file_prefix+"word2vec.Tobacco.fasttext.model", transport_params=dict(client=client))
clustering.embedding = gensim.models.Word2Vec.load(fin)
But it is failing with the error below:
AttributeError: '_io.TextIOWrapper' object has no attribute 'endswith'
I assume the way I am passing the file to gensim.models.Word2Vec.load is not the right way. I could not find a good example of how to pass a filename that lives on Azure Blob Storage; if I give the complete URI, it does not work. What is the right way to achieve this?
Please check the following:
AttributeError usually occurs when save() or load() is called on an object instance instead of the class, as those are class methods.
Note that the information in the file can be incomplete; check whether the binary tree is missing. In that case you may still be able to query the vectors, but you will not be able to continue training a model loaded this way.
Check whether the file was saved in word2vec format: binary is a bool; if True, it indicates the data is in binary word2vec format (note from https://radimrehurek.com/gensim/models/keyedvectors.html).
Check whether this can be worked around by importing the required models and verifying version compatibility:
gensim.models.Word2Vec.load_word2vec_format('model', binary=True)
or with KeyedVectors.load in place of Word2Vec.load_word2vec_format, according to what supports fastText.
Also check whether the model correctly supports the function you are calling.
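Separately, note that gensim's Word2Vec.load() expects a filename string (internally it calls fname.endswith(...)), not an open file object, which is exactly what the '_io.TextIOWrapper' error is complaining about. One workaround sketch, assuming the azure-storage-blob package and using the container/prefix names from the question, is to download the blobs to a local directory first and pass that path to load():

import os
import tempfile
import gensim
from azure.storage.blob import ContainerClient

connection_string = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
container = ContainerClient.from_connection_string(connection_string,
                                                   container_name="landing")

prefix = "TechnologyCluster/VectorCreation/embeddings/"
local_dir = tempfile.mkdtemp()

# Large gensim models are saved alongside separate .npy files, so copy
# every blob under the prefix, not just the main .model file.
for blob in container.list_blobs(name_starts_with=prefix):
    local_path = os.path.join(local_dir, os.path.basename(blob.name))
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob.name).readall())

model = gensim.models.Word2Vec.load(
    os.path.join(local_dir, "word2vec.Tobacco.fasttext.model"))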
References:
python - AttributeError: 'Word2Vec' object has no attribute 'endswith' - Stack Overflow
models.keyedvectors – Store and query word vectors — gensim (radimrehurek.com)
I have a file that has a thousand lines of code and I'd like to break it into several files. However, I found that those functions depend on each other, so I have no idea how to decouple them... Here is a simplified example:
import numpy as np

def tensor(data):
    return Tensor(data)

class Tensor:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'Tensor({str(self.data)})'

    def mean(self):
        return mean(self.data)

def mean(data):
    value = np.mean(data)
    return tensor(value)
What is the best way to separate tensor, Tensor, and mean (put them into 3 different files)? Thanks for your help!!
Having a module that is thousands of lines long isn't necessarily bad. You may not actually need to break it up into different modules. It is common to have a module with a function alongside a class, like your tensor and Tensor in the same module, and there is no reason for mean to be split into a separate function, as that code could just be placed directly in Tensor.mean.
A module should have a specific purpose and be a self contained unit around that purpose. If you are splitting things up just to have smaller files, then that is only going to make your codebase worse. However, large modules are a sign that things may need to be refactored. If you can find good ways of refactoring ideas in the code into smaller ideas, then those smaller units could be given their own modules, otherwise, keep everything as a bigger module.
As for how you can split up code that is coupled together: here is one way of splitting the code into the modules you indicated. Since you have a function, tensor, that you would like people to use to get an instance of your Tensor class, it seemed sensible to create a Python package, since packages come with an __init__.py file that is used for establishing the API, i.e. your tensor function. I put the tensor function directly in the __init__.py file, but if the function is large it can be broken out into a separate module, since the __init__.py file is just supposed to give an overview of the API being created.
# --- main.py ----
from tensor import tensor

print(tensor([1, 2, 3]).mean())

# --- tensor/__init__.py ----
'''
Add some documentation here
'''

def tensor(data):
    return Tensor(data)

from tensor.Tensor import Tensor

# --- tensor/Tensor.py ----
from tensor import helper

class Tensor:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'Tensor({str(self.data)})'

    def mean(self):
        return helper.mean(self.data)

# --- tensor/helper.py ----
import numpy as np
from . import tensor

def mean(data):
    value = np.mean(data)
    return tensor(value)
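With this layout in place, running main.py prints Tensor(2.0): the mean of [1, 2, 3] is computed by helper.mean and wrapped back into a Tensor by the tensor function in __init__.py.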
About circular dependencies
Tensor and helper are importing each other, and this is OK. When the helper module imports Tensor, and Tensor in turn imports helper again, helper will just continue loading normally, and when it is done, Tensor will finish loading. If, however, you had code at the module level (outside your functions/classes) that executed when the module first loaded and depended on functionality in another module that was only partially loaded, then you would run into problems with circular dependencies.
Using classes that don't exist yet
I can add to the __init__ file
def some_function():
    return DoesntExist()
and your code would still run. It doesn't look for a class named Tensor until it is actually running the tensor function. If we did the following then we would get an error about Tensor not existing.
def tensor(data):
    return Tensor(data)

tensor([1, 2, 3])

from tensor.Tensor import Tensor

because now we are running the tensor function before the import, and it can't find anything named Tensor.
The order of stuff in __init__
If you switch the order around, you will have
__init__ imports Tensor, which imports helper, which imports __init__ again
as it tries to grab the tensor function, but it can't, because the __init__ module cannot proceed past the line that imports Tensor until that import has completed.
Now, with the current order, we have:
__init__ defines tensor, sees the import statement, and saves its current progress as a partial import.
The same imports happen (__init__ imports Tensor, which imports helper, which imports __init__ looking for a tensor function).
This time we look at the partial import, find the tensor function, and are able to continue on using it.
I didn't think about any of that when I put things in that order. I just wrote out the code, got the circular import error, switched the order around, and didn't think about what was going on until you asked about it.
And now that I think about it, the following would have worked too.
With this version, the order of things in the __init__ file no longer matters.
from tensor.Tensor import Tensor

def tensor(data):
    return Tensor(data)
And then in helper.py
import numpy as np
import tensor

def mean(data):
    value = np.mean(data)
    return tensor.tensor(value)
The difference is that now, instead of specifically asking that the tensor function exist at the moment the module is imported (by doing from . import tensor), we are doing import tensor (which imports the tensor package, not the function). Then, whenever the mean function actually runs, tensor.tensor(value) looks up the tensor function inside our tensor package.
I want to programmatically get the source code for a given class property (e.g., pandas.DataFrame.iloc). I tried using inspect.findsource(), which works fine for classes and functions. However, it doesn't work for properties.
import pandas
import inspect
type(pandas.DataFrame) # type
inspect.findsource(pandas.DataFrame) # works ok
type(pandas.DataFrame.apply) # function
inspect.findsource(pandas.DataFrame.apply) # works ok
type(pandas.DataFrame.iloc) # property
inspect.findsource(pandas.DataFrame.iloc) # throws exception
TypeError: module, class, method, function, traceback, frame, or code object was expected, got property
Why can't inspect find the source for a property? How can I programmatically get the source code (or source file path) for a given property?
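One approach worth sketching (assuming, as the type() call above reports, that the attribute really is a property in your pandas version): a property is a descriptor wrapping plain functions, so hand inspect the getter instead of the property object itself.

import inspect
import pandas

attr = pandas.DataFrame.iloc
if isinstance(attr, property):
    # inspect understands the underlying function objects (fget/fset),
    # even though it rejects the property wrapper itself.
    print(inspect.getsourcefile(attr.fget))
    print(inspect.findsource(attr.fget)[1])  # starting line number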
I have a user-defined class 'myclass' that I store on file with the pickle module, but I am having problems unpickling it. I have about 20 distinct instances of the same structure that I save in distinct files. When I read each file back, the code works on some files and not on others, where I get the error:
'module' object has no attribute 'myclass'
I generated some files today and some others yesterday, and my code only works on the files generated today (I have NOT changed the class definition between yesterday and today).
I was wondering if maybe my method is not robust, if I am not doing things as I should, for example maybe I cannot pickle a user-defined class, and if this is introducing some randomness into the process.
Another issue could be that the files I generated yesterday were created on a different machine: I work on an academic cluster that has login nodes and computing nodes which differ in architecture. I generated yesterday's files on the computing nodes and today's files on the login nodes, and I am reading everything on the login nodes.
As suggested in some of the comments, I have installed dill and loaded it with import dill as pickle. Now I can read the files from the computing nodes on the login nodes of the same cluster. But if I try to read files generated on a computing node of one cluster from the login node of another cluster, I cannot: I get KeyError: 'ClassType' in _load_type(name) in dill.py.
Can it be because the Python version is different? I generated the files with Python 2.7 and I read them with Python 3.3.
EDIT:
I can read the pickled files if I use Python 2.7 everywhere. Sadly, part of my code, written in Python 3, is not automatically compatible with Python 2.7 :(
Can you from mymodule import myclass? Pickling does not pickle the class, just a reference to it. To load a pickled object, Python must be able to find the class that was used to create the object.
e.g.
import pickle

class A(object):
    pass

obj = A()
pickled = pickle.dumps(obj)

_A = A; del A  # hide class

try:
    pickle.loads(pickled)
except AttributeError as e:
    print(e)

A = _A  # unhide class
print(pickle.loads(pickled))
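On the Python 2.7 vs 3.3 point in the question: that mismatch alone can break loading. A minimal sketch, with a hypothetical filename: Python 3's pickle must be told how to decode Python 2 byte strings, and Python 2 old-style classes (the ClassType in the dill KeyError) simply do not exist in Python 3, so objects built from them cannot be reconstructed there.

import pickle

# Hypothetical file written by pickle under Python 2.7.
with open('myclass_instance.pkl', 'rb') as f:
    # encoding='latin1' maps Python 2 str bytes onto Python 3 str;
    # without it, loading 2.x pickles commonly fails.
    obj = pickle.load(f, encoding='latin1')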
vcrpy is the Python record/replay package; below is the common usage from its guide:
class TestCloudAPI(unittest.TestCase):

    def test_get_api_token(self):
        with vcr.use_cassette('fixtures/vcr_cassettes/test_get_api_token.yaml'):
            ...  # real request and testing

    def test_container_lifecycle(self):
        with vcr.use_cassette('fixtures/vcr_cassettes/test_container_lifecycle.yaml'):
            ...  # real request and testing
I want to have a different record file for each test, so I have to repeat this in every method. Is it possible to have one line somewhere to simplify this, like:
TEST_CASE_VCR(USE_METHOD_AS_FILENAME)
class TestCloudAPI(unittest.TestCase):

    def test_get_api_token(self):
        ...  # real request and testing

    def test_container_lifecycle(self):
        ...  # real request and testing
This is now supported in newer versions of vcrpy by omitting the cassette name altogether. From the documentation:
VCR.py now allows the omission of the path argument to the use_cassette function. Both of the following are now legal/should work
@my_vcr.use_cassette
def my_test_function():
    ...

@my_vcr.use_cassette()
def my_test_function():
    ...
In both cases, VCR.py will use a path that is generated from the provided test function's name. If no cassette_library_dir has been set, the cassette will be in a file with the name of the test function, in the directory of the file in which the test function is declared. If a cassette_library_dir has been set, the cassette will appear in that directory, in a file with the name of the decorated function.
It is possible to control the path produced by the automatic naming machinery by customizing the path_transformer and func_path_generator vcr variables.
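For instance, a minimal sketch of those knobs (the cassette directory name is an assumption; ensure_suffix comes from the vcrpy docs):

import vcr

# A VCR instance whose auto-generated cassette paths land in a fixed
# directory and always end in .yaml.
my_vcr = vcr.VCR(
    cassette_library_dir='fixtures/vcr_cassettes',
    path_transformer=vcr.VCR.ensure_suffix('.yaml'),
)

@my_vcr.use_cassette
def test_get_api_token():
    ...  # cassette is named after the decorated function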
There isn't a feature to do this currently built into VCR, but you can make your own. Check out the decorator that Venmo created; a rough sketch of the idea follows.
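A sketch along those lines (the decorator name and cassette directory are made up for illustration): derive the cassette path from the wrapped test method's name.

import functools
import vcr

my_vcr = vcr.VCR(cassette_library_dir='fixtures/vcr_cassettes')

def auto_cassette(test_method):
    """Wrap a test method in a cassette named after the method."""
    @functools.wraps(test_method)
    def wrapper(*args, **kwargs):
        with my_vcr.use_cassette(test_method.__name__ + '.yaml'):
            return test_method(*args, **kwargs)
    return wrapper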
This gets a lot easier with vcrpy-unittest, which is, as you might guess, integration between vcrpy and unittest.
Your example becomes this:
from vcr_unittest import VCRTestCase

class TestCloudAPI(VCRTestCase):

    def test_get_api_token(self):
        ...  # real request and testing

    def test_container_lifecycle(self):
        ...  # real request and testing
and the cassettes are automatically named according to the test and saved in a cassettes dir alongside the test file. For example, this would create two files: cassettes/TestCloudAPI.test_get_api_token.yaml and cassettes/TestCloudAPI.test_container_lifecycle.yaml.
The directory and naming can be customized by overriding a couple of methods, _get_cassette_library_dir and _get_cassette_name, but it's probably not necessary; a sketch follows.
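If you do need it, a sketch of that override (the method name is as given above; the directory is an assumption, and the hook may vary by version):

from vcr_unittest import VCRTestCase

class TestCloudAPI(VCRTestCase):
    def _get_cassette_library_dir(self):
        # Store cassettes in a custom location instead of the default
        # 'cassettes' directory next to the test file.
        return 'fixtures/vcr_cassettes'

    def test_get_api_token(self):
        ...  # real request and testing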
vcrpy-unittest is on GitHub at https://github.com/agriffis/vcrpy-unittest and on PyPI at https://pypi.python.org/pypi/vcrpy-unittest.