Where are the `tfds.load` datasets are saved?

Where are the `tfds.load` datasets are saved? - python

I downloaded the cats vs dogs dataset using the tfds.load('cats_vs_dogs') and I want to find where it has been saved on my computer, after reading a bit I came across someone who claims the dataset can be found at ~/tensorflow_datasets/cats_vs_dogs/ but I can't find a folder that is called cats_vs_dogs at that path, how can I get the path to the files?

As per default
as I assume TFDS_DATA_DIR has not been set, datasets will be stored under ~/tensorflow_datasets
However, as this depends on your system and setup: If you want to check the dataset and see it, I would suggest to just manually set data_dir when using tfds.load. Then you know for sure, where it is stored.

You can use this:
import tensorflow_datasets as tfds
tfds.core.get_tfds_path('cats_vs_dogs')
'C:/Users/user/anaconda3/envs/env/lib/site-packages/tensorflow_datasets/cats_vs_dogs'

You can also set a folder to download as :
data_dir = 'D:\\Sandbox\\Github\\DATA_TFDS'
tfds.load(name='mnist',
split=['train', 'test'],
shuffle_files=True,
data_dir=data_dir,
with_info=True,
download=True)

Related

How to update data_dir and data_path in TF DatasetInfo object?

I'm trying to run a script that builds and loads a TF dataset. The dataset is cityscapes and it is already downloaded and stored in fs/datasets/cityscapes/. I can't move the data. In the directory, there are the following files: ['tfrecord', 'gtFine', 'tfrecord_instances_old', 'README', 'leftImg8bit', 'cityscapesScripts', 'tfrecord_instances', 'license.txt']. An error arises when I try to run dataset = self._dataset_builder.as_dataset(split=self._split, decoders=self._decoders). This error is
AssertionError: Dataset cityscapes: could not find data in /fs/datasets/cityscapes. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
I believe the issue relates to Constructing tf.data.Dataset cityscapes for split train, from /fs/datasets/cityscapes/cityscapes/semantic_segmentation/1.0.0 which is printed before the error. This added path comes from the Cityscapes TFDS DatasetInfo object. If I try to edit the data_dir or data_path in that object with self._dataset_builder.info.data_dir='/fs/datasets/cityscapes', I receive the error message: AttributeError: can't set attribute. So if anyone has a fix, I'd appreciate it.

I want to change this code to put my dataset(from my drive it's an Arabic data with 2 classes) tried to read it normally but i get an error in 'train'

I want to change this code to put my own dataset(from my drive it's an Arabic data with 2 classes) tried to read it normally but i get an error in 'train' .. what i want is to use the dataset that i have not the ones from huggingface.
!pip install openprompt
!git clone https://github.com/thunlp/OpenPrompt.git
!cd OpenPrompt
# load dataset
from datasets import load_dataset
# raw_dataset = load_dataset('super_glue', 'cb', cache_dir="../datasets/.cache/huggingface_datasets")
# raw_dataset['train'][0]
from datasets import load_from_disk
raw_dataset = load_from_disk("/home/hushengding/huggingface_datasets/saved_to_disk/super_glue.cb")
# Note that if you are running this scripts inside a GPU cluster, there are chances are you are not able to connect to huggingface website directly.
# In this case, we recommend you to run `raw_dataset = load_dataset(...)` on some machine that have internet connections.
# Then use `raw_dataset.save_to_disk(path)` method to save to local path.
# Thirdly upload the saved content into the machiine in cluster.
# Then use `load_from_disk` method to load the dataset.
from openprompt.data_utils import InputExample
dataset = {}
for split in ['train', 'validation', 'test']:
dataset[split] = []
for data in raw_dataset[split]:
input_example = InputExample(text_a = data['premise'], text_b = data['hypothesis'], label=int(data['label']), guid=data['idx'])
dataset[split].append(input_example)
print(dataset['train'][0])

The data should be in here:
from datasets import load_dataset
raw_dataset = load_dataset('csv', data_files='/content/drive/MyDrive/TEST2.csv')

How to read and write of TFOD2 pipeline.config file by python?

As you have already seen in Tensorflow objects detection they provide pipeline.config file with respect to a particular model. But there we need to manually open these config files & change the parameter by hard coding. My query is like how can I read this pipeline.config file by python & change the parameter in runtime. Please help me with that.

There's an example in the tutorial notebook.
from object_detection.utils import config_util, save_pipeline_config
pipeline_config = 'configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config'
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
configs['model'].ssd.num_classes = 10 # change number of classes
Then, you can save:
save_pipeline_config(configs, 'path/to/save/dir/')
See the source code.

The answer of #Nicolas Gervais seems to be a bit outdated.
This seems to be the fully working version right now:
from object_detection.utils import config_util
pipeline_config = 'configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config'
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
configs['model'].ssd.num_classes = 10 # change number of classes
After you can save your pipeline.config in the following way:
# Convert dictionary to pipeline_pb2.TrainEvalPipelineConfig to be able to save it
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, 'path/to/save/dir/')

TypeError: Input 'filename' of 'ReadFile' Op has type float32 that does not match expected type of string

I am running this code from the tutorial here: https://keras.io/examples/vision/image_classification_from_scratch/
with a custom dataset, that is divided in 2 datasets as in the tutorial. However, I got this error:
TypeError: Input 'filename' of 'ReadFile' Op has type float32 that does not match expected type of string.
I made this casting. I tried this:
is_jfif = str(tf.compat.as_bytes("JFIF")) in fobj.peek(10)
but nothing changed as far as the error
I am trying all day to figure out how to solve it, without any success. Can someone help me? Thank you...

Simplest way I found is to create a subfolder and copy the files to that subfolder.
i.e. Lets assume your files are 0.jpg, 1.jpg,2.jpg....2000.jpg and in directory named "patterns".
Seems like the Keras API does not accept it as the files are named by numbers and for Keras it is in float32.
To overcome this issue, either you can rename the files as one answer suggests, or you can simply create a subfolder under "patterns" (i.e. "patterndir"). So now your image files are under ...\patterns\patterndir
Keras (internally) possibly using the subdirectory name and may be attaching it in front of the image file thus making it a string (sth like patterndir_01.jpg, patterndir_02.jpg) [Note this is my interpretation, does not mean that it is true]
When you compile it this time, you will see that it works and you will get a compiler message as:
Found 2001 files belonging to 1 classes.
Using 1601 files for training.
Found 2001 files belonging to 1 classes.
Using 400 files for validation.
My code looks like this
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
#Generate a dataset
image_size = (28, 28)
batch_size = 32
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
"patterns",
validation_split=0.2,
subset="training",
seed=1337,
image_size=image_size,
batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
"patterns",
validation_split=0.2,
subset="validation",
seed=1337,
image_size=image_size,
batch_size=batch_size,
)

In my case, I simply did not have enough samples in the training directories. There was one per category and I got the error.

Just make a subdirectory and move your files there.
So if the files are here:
'/home/dataset_28/'
Put them here:
'/home/dataset_28/files/'
And then do this:
from tensorflow.keras.preprocessing import image_dataset_from_directory
image_dataset_from_directory('/home/dataset_28/', batch_size=1, image_size=(28, 28))

The names of the files are in the float32 format.
Renaming all the images in the dataset solves the problem.
Loop over all the files with os.rename().

I was just hitting this TypeError: Input 'filename' of 'ReadFile' Op has type float32 that does not match expected type of string. error too with tensorflow==2.4.4.
I played around with validation_split:
Error happens: validation_split=0.001
I did this in an effort to have 0 images in the validation set.
Error doesn't happen: validation_split=0.2
This results in 1 image used for validation
Conclusion: a known root cause of this error is 0 images inside the validation set.
Failed Fixes
Per this answer I renamed my files via os.rename to 1.jpg, 2.jpg, 3.jpg, ... didn't work :/
Per this answer talking about one image/category, that's wrong, it's fine to have just one image inside a category.

One of the issues is related to the image downloading. If a designated image document is not downloaded, it also show the same error.

You have to check the several things after this exception appeared:
Do you have enough data for training?
If you only have limited data in your training set, this exception would appear. I guess if you want to split the data, the amount of the data should be divisible by 10 (Take validation_split=0.1 for example).
Do your image in valid format?
This method only allows formats in ('.bmp', '.gif', '.jpeg', '.jpg', '.png'). Invalid format would appear this exception.
Honestly, the exception doesn't give much information of what's happening exactly. Hopefully would update in near future.

kaggle dataset or python split CLI

I downloaded the dataset from kaggle:
https://www.kaggle.com/c/dogs-vs-cats/data
Then tried to get image label from the downloaded data using cv2.split('.')[-3] command. (code in the end)
However, i got an "index out of range error". I checked the filename and see the filename after unzip from kaggle datasets is only 1.jpg, 2.jpg, 3.jpg.
From what I read, the dataset should have label in the filename. i.e.
https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788475655/23/ch23lvl1sec118/deep-learning-for-cats-versus-dogs
So my question is
Q1: I assume my python syntax is right. As it looks like I would only have two argument [0] and [1] with filename of "num.jpg" not "label.num.jpg", right?
Q2: if so, anyone can help me to point out why I cannot get the right datasets with label in the filename?
ps: I am really new in python, kaggle, (or programming area).
Thank you
Mira
ps: my partial code:
for img in tqdm(os.listdir(TRAIN_DIR))
path = os.path.join(TRAIN_DIR, img)
img_data = cv2.imread(path)
cv2.imshow('train_data_image:', img_data)
print ('test:', img.split('.')[-3])

just FYI - I found the answer for my question...
It turns out I was using the test data which indeed should not contain the label in the dataset. I download the train data and it does have the label (dog/cat) in the filename.
thanks!
Mira

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Where are the `tfds.load` datasets are saved? - python

You can use this: import tensorflow_datasets as tfds tfds.core.get_tfds_path('cats_vs_dogs') 'C:/Users/user/anaconda3/envs/env/lib/site-packages/tensorflow_datasets/cats_vs_dogs'

You can also set a folder to download as : data_dir = 'D:\\Sandbox\\Github\\DATA_TFDS' tfds.load(name='mnist', split=['train', 'test'], shuffle_files=True, data_dir=data_dir, with_info=True, download=True)

Related

How to update data_dir and data_path in TF DatasetInfo object?

I want to change this code to put my dataset(from my drive it's an Arabic data with 2 classes) tried to read it normally but i get an error in 'train'

How to read and write of TFOD2 pipeline.config file by python?

TypeError: Input 'filename' of 'ReadFile' Op has type float32 that does not match expected type of string

kaggle dataset or python split CLI

Categories

Resources