How to list files from Dataset object in Azure Pipeline - python

I am trying to list files from a Dataset in an Azure pipeline (build), but I am getting errors. I tried the two approaches below.
Method #1
from azureml.core import Workspace, Dataset
import os
ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, "<dataset-name>")
print(os.listdir(str(dataset.as_mount())))
FileNotFoundError: [Errno 2] No such file or directory:
'<azureml.data.dataset_consumption_config.DatasetConsumptionConfig
object at >'
Method #2
from azureml.core import Workspace, Dataset
import os
ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, "<dataset-name>")
data = dataset.mount()
data.start()
print(os.listdir(data.mount_point))
Error: Mount is only supported on Unix or Unix-like operating systems
with the native package libfuse installed.
Can anyone please help me with this? I have been stuck on it for some time.
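A minimal sketch of a workaround (not from the original post, and assuming the registered dataset is a FileDataset): as_mount() returns a DatasetConsumptionConfig meant to be handed to a run configuration rather than a local path, and mount() needs a Unix-like OS with libfuse, so on a build agent where mounting is unavailable you can list the files with to_path() or download them first.
from azureml.core import Workspace, Dataset
import os
ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, "<dataset-name>")
# List the relative paths of the files inside the FileDataset (no mount needed)
print(dataset.to_path())
# Or download the files to the agent and list them with os.listdir
downloaded = dataset.download(target_path="./data", overwrite=True)
print(downloaded)            # absolute paths of the downloaded files
print(os.listdir("./data"))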

Related

I cannot import a specific .py file from a folder I have in Google Drive, it throws an error

# loading dataframes using dataset module
import os
os.system("/content/drive/My Drive/Project/Emotion Speech Recognition/utils/dataset.py")
from utils import dataset
df, train_df, test_df = dataset.create_and_load_meta_csv_df(dataset_path, destination_path, randomize, split)
I think you are trying to import the file, but that line actually executes it in a separate process instead of importing it.
To import the file, you should add the directory that contains the utils package to the Python path and then import it:
import sys
sys.path.insert(0,'/content/drive/My Drive/Project/Emotion Speech Recognition')
from utils import dataset
instead of
os.system("/content/drive/My Drive/Project/Emotion Speech Recognition/utils/dataset.py")

OSError: [Errno 36] File name too long: for python package and .txt file, pandas opening

I get OSError: [Errno 36] File name too long for the following code:
from importlib_resources import open_text
import pandas as pd
with open_text('package.data', 'librebook.txt') as f:
    input_file = f.read()
dataset = pd.read_csv(input_file)
Ubuntu 20.04, and this is inside a Python package's __init__.py file.
I don't want to use .readlines().
Can I structure this code differently so this error does not occur? Do I need to modify my OS? Some of the help I found suggested modifying the OS, but I don't want to do that if I don't need to. Thank you.
Why not just pass in the name of the file instead of its contents?
dataset = pd.read_csv('librebook.txt')
from importlib_resources import path
import pandas as pd
with path('package.data', 'librebook.txt') as f:
    dataset = pd.read_csv(f)
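If you want to keep the open_text / f.read() approach from the question, another option (a sketch, not part of the original answer) is to wrap the already-read text in io.StringIO so pandas parses the contents instead of treating the string as a file name:
import io
from importlib_resources import open_text
import pandas as pd
with open_text('package.data', 'librebook.txt') as f:
    input_file = f.read()
# StringIO turns the string into a file-like object that read_csv can parse
dataset = pd.read_csv(io.StringIO(input_file))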

No such file or directory: 'final_data_1.npy'

I am trying this code using tensorflow and numpy. However, I am getting an error.
import numpy as np
from tensorflow.python.framework import ops
np.random.seed(1)
ops.reset_default_graph()
ops.reset_default_graph()
#final_data_1 and 2 are the numpy array files for the images in the folder img and annotations.csv file
#total of 5 GB due to conversion of values to int
Z2= np.load('final_data_1.npy')
Z1= np.load('final_data_2.npy')
print(Z2[:,0])
print(Z1.shape)
My error is:
FileNotFoundError: [Errno 2] No such file or directory: 'final_data_1.npy'
Can you suggest a solution?
As the error message implies, you have to point to the directory where the file "final_data_1.npy" is actually located:
Example
import pandas as pd
df = pd.read_csv("./Path/where/you/stored/table/data.csv")
print(df)
The same goes for np.load(): you have to include the directory of the file, for example
np.load('./User/Desktop/final_data_1.npy')
Without naming the directory where the file is located, Python doesn't know where "final_data_1.npy" is; a bare filename is only looked up in the current working directory.
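A quick way to debug this kind of error (a sketch; the data directory below is hypothetical) is to print the current working directory and build the path explicitly:
import os
import numpy as np
print(os.getcwd())                    # the directory relative paths are resolved against
data_dir = "/home/user/project/data"  # hypothetical folder that holds the .npy files
Z2 = np.load(os.path.join(data_dir, "final_data_1.npy"))
Z1 = np.load(os.path.join(data_dir, "final_data_2.npy"))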

Can't access directory Tensorflow Google Colab

Sorry, I'm new to TensorFlow 2.1 and Google Colab, and I don't understand why I have this error:
My code:
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
import pathlib
import os
path_data_dir = tf.keras.utils.get_file(origin='https://www.kaggle.com/c/dogs-vs-cats/download/0iMGwZllApFLiU35zX78%2Fversions%2Fm5lLqMS0KLfxJUozn3gR%2Ffiles%2Ftrain.zip',fname='train',untar= True)
data_dir = pathlib.Path(path_data_dir)
entries = os.listdir(data_dir)
for entry in entries:
    print(entry)
And I have this error (I tried to mount a Google Drive folder and I do have access to it):
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-88f88035f225> in <module>()
12 data_dir = pathlib.Path(path_data_dir)
13
---> 14 entries = os.listdir(data_dir)
15 for entry in entries:
16 print(entry)
FileNotFoundError: [Errno 2] No such file or directory: '/root/.keras/datasets/train'
Thanks a lot for your help
Lily
I am assuming this is because of the different file system structure between a normal Linux machine and the runtime hosted by Google Colab.
As a workaround, pass the cache_dir='/content' argument to the get_file function, so the call becomes:
path_data_dir = tf.keras.utils.get_file(origin='https://www.kaggle.com/c/dogs-vs-cats/download/0iMGwZllApFLiU35zX78%2Fversions%2Fm5lLqMS0KLfxJUozn3gR%2Ffiles%2Ftrain.zip', fname='train', untar=True, cache_dir='/content')
Be aware that the returned value path_data_dir is a full path to the file, so the call os.listdir(data_dir) will fail since data_dir points to a file and not a directory.
To fix this, change entries = os.listdir(data_dir) to entries = os.listdir(data_dir.parent)
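Putting both suggestions together, a sketch of the corrected listing code (based on the answer above) would be:
import os
import pathlib
import tensorflow as tf
path_data_dir = tf.keras.utils.get_file(
    origin='https://www.kaggle.com/c/dogs-vs-cats/download/0iMGwZllApFLiU35zX78%2Fversions%2Fm5lLqMS0KLfxJUozn3gR%2Ffiles%2Ftrain.zip',
    fname='train',
    untar=True,
    cache_dir='/content')  # cache under /content instead of /root/.keras
data_dir = pathlib.Path(path_data_dir)
# get_file returns the path of the downloaded file itself, so list its parent directory
for entry in os.listdir(data_dir.parent):
    print(entry)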
I think this is simply a bad download link after all... On Google Colab I can't check the downloaded file properly (because I can't browse the folders), but I tried it later on a computer and it's just the link that is broken.

How do I load files from a specific local path (whether on the master node or a slave node) in a Spark cluster?

I am a beginner with Spark. I have recently been learning PySpark and trying to submit a simple application to a Spark cluster (I set up a cluster with 1 master and 1 worker). However, I don't know how to properly specify a path (for example, a folder on my master node).
This is my code:
import os
from PIL import Image
from pyspark import SparkConf, SparkContext
APP_NAME = "ImageResizer"
if __name__ == "__main__":
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("spark://10.233.70.48:7077")
    sc = SparkContext(conf=conf)
    s_list = sc.parallelize(os.listdir('.'))
    s_jpg_list = s_list.filter(lambda f: f.endswith('.jpg'))

    def resize_image(f):
        i = Image.open(f)
        size_64 = (64, 64)
        name, extension = os.path.splitext(f)
        i.thumbnail(size_64)
        out_path = 'resize/{}_64{}'.format(name, extension)
        i.save(out_path)
        return out_path

    s_jpg_files = s_jpg_list.map(resize_image)
    print('Converted Images:', s_jpg_files.collect())
I want to batch-resize images from a folder on my master machine; the code picks up everything in the folder where the Python application is located. But when I submitted the application to Spark, the system could not find the path.
Then I tried:
s_list = sc.parallelize(os.listdir('/home/xxx/'))
It seems I still cannot access the desired folder (and I don't know how to even specify whether the folder is on the master node or the slave node).
Can anyone please help me revise the way I refer to the path?
In addition, how can I refer to the machine where the job was submitted, the local directory on every machine, or a shared network drive? Thank you so much!
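The question is left unanswered here, but as a rough sketch of one common pattern (the directory names below are hypothetical): os.listdir() runs only on the driver, while resize_image() runs on the workers, so any plain local path must exist on every node. Letting Spark enumerate and read the files itself, for example with sc.binaryFiles over a file:// URI that points at a directory visible from every worker (an NFS mount, or an HDFS path), avoids that mismatch.
import io
import os
from PIL import Image
from pyspark import SparkConf, SparkContext

def resize_image(path_and_bytes):
    path, raw = path_and_bytes
    # Decode the image from the bytes Spark shipped to this worker
    i = Image.open(io.BytesIO(raw))
    i.thumbnail((64, 64))
    name, extension = os.path.splitext(os.path.basename(path))
    out_path = '/shared/resize/{}_64{}'.format(name, extension)  # hypothetical shared output dir
    i.save(out_path)
    return out_path

if __name__ == "__main__":
    conf = SparkConf().setAppName("ImageResizer").setMaster("spark://10.233.70.48:7077")
    sc = SparkContext(conf=conf)
    # file:// paths are resolved on the executors, so the directory must be
    # visible from every worker node; an hdfs:// path also works
    images = sc.binaryFiles('file:///shared/images/*.jpg')  # hypothetical input dir
    print('Converted Images:', images.map(resize_image).collect())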
