Load numpy array in google-cloud-ml job

Load numpy array in google-cloud-ml job - python

In the model I want to launch, I have some variables which have to be initialized with specific values.
I currently store these variables into numpy arrays but I don't know how to adapt my code to make it work on a google-cloud-ml job.
Currently I initialize my variable like this:
my_variable = variables.model_variable('my_variable', shape=None, dtype=tf.float32, initializer=np.load('datasets/real/my_variable.npy'))
Can someone help me ?

First, you'll need to copy/store the data on GCS (using, e.g., gsutil) and ensure your training script has access to that bucket. The easiest way to do so is to copy the array to the same bucket as your data, since you'll likely already have configured that bucket for read access. If the bucket is in the same project as your training job and you followed these instructions (particularly, gcloud beta ml init-project), you should be set. If the data will be in another bucket, see these instructions.
Then you'll need to use a library capable of loading data from GCS. Tensorflow includes a module that can do this, although you're free to use any client library that can read from GCS. Here's an example of using TensorFlow's file_io module:
from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io
# Create a variable initialized to the value of a serialized numpy array
f = StringIO(file_io.read_file_to_string('gs://my-bucket/123.npy'))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')
Note that we have to read the file into a string and use StringIO, since file_io.FileIO does not fully implement the seek function required by numpy.load.
Bonus: in case it's useful, you can directly store a numpy array to GCS using the file_io module, e.g.:
np.save(file_io.FileIO('gs://my-bucket/123', 'w'), np.array([[1,2,3], [4,5,6]]))
For Python 3, use from io import StringIO instead of from StringIO import StringIO.

I tried the accepted answer but ran into some problems. Eventually this worked for me (Python 3):
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
To save:
array = np.ones((100, ))
dest = 'gs://[BUCKET-NAME]/' # Destination to save in GCS
np.save(file_io.FileIO(dest, 'w'), array)
To load:
f = BytesIO(file_io.read_file_to_string(src, binary_mode=True))
arr = np.load(f)

Related

printing the uploading progress using `xarray.Dataset.to_zarr` function

I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfect. However, I would like to show the progress of big datasets. Is there any option to chunk my dataset in a way I can use tqdm or someting like that to log the uploading progress?
Here is the code that I currently have:
import os
import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar
if __name__ == '__main__':
# for testing
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"
# create xarray
data_arr = np.random.rand(5000, 100, 100)
data_xarr = xr.DataArray(data_arr,
dims=["x", "y", "z"])
# define store
gcp_blob_uri = "gs://gprlib/test.zarr"
gcs = gcsfs.GCSFileSystem()
store = gcs.get_mapper(gcp_blob_uri)
# delayed to_zarr computation -> seems that it does not work
write_job = data_xarr\
.to_dataset(name="data")\
.to_zarr(store, mode="w", compute=False)
print(write_job)

xarray.Dataset.to_zarr has an optional argument compute which is True by default:
compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.
Using this, you can track the progress using dask's own dask.distributed.progress bar:
write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()
# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[############## ] | 35% Completed | 4.5s
Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.

PyArrow ipc.read_tensor causes seg fault

I am trying to pass numpy arrays from one process to another using PyArrow's shared memory framework. Currently the sender process has this code:
import numpy as np
import pyarrow as pa
data = np.random.rand(100,100, 100)
tensor = pa.Tensor.from_numpy(data)
output_stream = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, output_stream)
buf = output_stream.getvalue()
print(buf.address, buf.size)
which outputs something like (5311993741568, 8000256)
In the second process I have:
import pyarrow as pa
import numpy as np
buf2 = pa.foreign_buffer(5311993741568, 8000256)
tensor2 = pa.ipc.read_tensor(buf2)
but I get a segfault on the last line. The documentation isn't very clear on what the right way to use read_tensor and write_tensor is. I am also running on windows so I cannot use plasma object store.

Create mxnet.ndarray.NDArray from pycuda.driver.DeviceAllocation

I am trying to pass output of some pycuda operation to the input of mxnet computational graph.
I am able to achieve this via numpy conversion with the following code
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import mxnet as mx
batch_shape = (1, 1, 10, 10)
h_input = np.zeros(shape=batch_shape, dtype=np.float32)
# init output with ones to see if contents really changed
h_output = np.ones(shape=batch_shape, dtype=np.float32)
device_ptr = cuda.mem_alloc(input.nbytes)
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, h_input, stream)
# here some actions with d_input may be performed, e.g. kernel calls
# but for the sake of simplicity we'll just transfer it back to host
cuda.memcpy_dtoh_async(d_input, h_output, stream)
stream.synchronize()
mx_input = mx.nd(h_output, ctx=mx.gpu(0))
print('output after pycuda calls: ', h_output)
print('mx_input: ', mx_input)
However i would like to avoid the overhead of device-to-host and host-to-device memory copying.
I couldn't find a way to construct mxnet.ndarray.NDArray directly from h_output.
The closest thing that i was able to find is construction of ndarray from dlpack.
But it is not clear how to work with dlpack object from python.
Is there a way fo achieve NDArray <-> pycuda interoperability without copying memory via host?

Unfortunately, it is not possible at the moment.

Install issues with 'lr_utils' in python

I am trying to complete some homework in a DeepLearning.ai course assignment.
When I try the assignment in Coursera platform everything works fine, however, when I try to do the same imports on my local machine it gives me an error,
ModuleNotFoundError: No module named 'lr_utils'
I have tried resolving the issue by installing lr_utils but to no avail.
There is no mention of this module online, and now I started to wonder if that's a proprietary to deeplearning.ai?
Or can we can resolve this issue in any other way?

You will be able to find the lr_utils.py and all the other .py files (and thus the code inside them) required by the assignments:
Go to the first assignment (ie. Python Basics with numpy) - which you can always access whether you are a paid user or not
And then click on 'Open' button in the Menu bar above. (see the image below)
.
Then you can include the code of the modules directly in your code.

As per the answer above, lr_utils is a part of the deep learning course and is a utility to download the data sets. It should readily work with the paid version of the course but in case you 'lost' access to it, I noticed this github project has the lr_utils.py as well as some data sets
https://github.com/andersy005/deep-learning-specialization-coursera/tree/master/01-Neural-Networks-and-Deep-Learning/week2/Programming-Assignments
Note:
The chinese website links did not work when I looked at them. Maybe the server storing the files expired. I did see that this github project had some datasets though as well as the lr_utils file.
EDIT: The link no longer seems to work. Maybe this one will do?
https://github.com/knazeri/coursera/blob/master/deep-learning/1-neural-networks-and-deep-learning/2-logistic-regression-as-a-neural-network/lr_utils.py

Download the datasets from the answer above.
And use this code (It's better than the above since it closes the files after usage):
def load_dataset():
with h5py.File('datasets/train_catvnoncat.h5', "r") as train_dataset:
train_set_x_orig = np.array(train_dataset["train_set_x"][:])
train_set_y_orig = np.array(train_dataset["train_set_y"][:])
with h5py.File('datasets/test_catvnoncat.h5', "r") as test_dataset:
test_set_x_orig = np.array(test_dataset["test_set_x"][:])
test_set_y_orig = np.array(test_dataset["test_set_y"][:])
classes = np.array(test_dataset["list_classes"][:])
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

"lr_utils" is not official library or something like that.
Purpose of "lr_utils" is to fetch the dataset that is required for course.
option (didn't work for me): go to this page and there is a python code for downloading dataset and creating "lr_utils"
I had a problem with fetching data from provided url (but at least you can try to run it, maybe it will work)
option (worked for me): in the comments (at the same page 1) there are links for manually downloading dataset and "lr_utils.py", so here they are:
link for dataset download
link for lr_utils.py script download
Remember to extract dataset when you download it and you have to put dataset folder and "lr_utils.py" in the same folder as your python script that is using it (script with this line "import lr_utils").

The way I fixed this problem was by:
clicking File -> Open -> You will see the lr_utils.py file ( it does not matter whether you have paid/free version of the course).
opening the lr_utils.py file in Jupyter Notebooks and clicking File -> Download ( store it in your own folder ), rerun importing the modules. It will work like magic.
I did the same process for the datasets folder.

You can download train and test dataset directly here: https://github.com/berkayalan/Deep-Learning/tree/master/datasets
And you need to add this code to the beginning:
import numpy as np
import h5py
import os
def load_dataset():
train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels
test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels
classes = np.array(test_dataset["list_classes"][:]) # the list of classes
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

I faced similar problem and I had followed the following steps:
1. import the following library
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
2. download the train_catvnoncat.h5 and test_catvnoncat.h5 from any of the below link:
[https://github.com/berkayalan/Neural-Networks-and-Deep-Learning/tree/master/datasets]
or
[https://github.com/JudasDie/deeplearning.ai/tree/master/Improving%20Deep%20Neural%20Networks/Week1/Regularization/datasets]
3. create a folder named datasets and paste these two files in this folder.
[ Note: datasets folder and your source code file should be in same directory]
4. run the following code
def load_dataset():
with h5py.File('datasets1/train_catvnoncat.h5', "r") as train_dataset:
train_set_x_orig = np.array(train_dataset["train_set_x"][:])
train_set_y_orig = np.array(train_dataset["train_set_y"][:])
with h5py.File('datasets1/test_catvnoncat.h5', "r") as test_dataset:
test_set_x_orig = np.array(test_dataset["test_set_x"][:])
test_set_y_orig = np.array(test_dataset["test_set_y"][:])
classes = np.array(test_dataset["list_classes"][:])
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
5. Load the data:
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()
check datasets
print(len(train_set_x_orig))
print(len(test_set_x_orig))
your data set is ready, you may check the len of the train_set_x_orig, train_set_y variable. For mine, it was 209 and 50

I could download the dataset directly from coursera page.
Once you open the Coursera notebook you go to File -> Open and the following window will be display:
enter image description here
Here the notebooks and datasets are displayed, you can go to the datasets folder and download the required data for the assignment. The package lr_utils.py is also available for downloading.

below is your code, just save your file named "lr_utils.py" and now you can use it.
import numpy as np
import h5py
def load_dataset():
train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels
test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels
classes = np.array(test_dataset["list_classes"][:]) # the list of classes
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
if your code file can not find you newly created lr_utils.py file just write this code:
import sys
sys.path.append("full path of the directory where you saved Ir_utils.py file")

Here is the way to get dataset from as #ThinkBonobo:
https://github.com/andersy005/deep-learning-specialization-coursera/tree/master/01-Neural-Networks-and-Deep-Learning/week2/Programming-Assignments/datasets
write a lr_utils.py file, as above answer #StationaryTraveller, put it into any of sys.path() directory.
def load_dataset():
with h5py.File('datasets/train_catvnoncat.h5', "r") as train_dataset:
....
!!! BUT make sure that you delete 'datasets/', cuz now the name of your data file is train_catvnoncat.h5
restart kernel and good luck.

I may add to the answers that you can save the file with lr_utils script on the disc and import that as a module using importlib util function in the following way.
The below code came from the general thread about import functions from external files into the current user session:
How to import a module given the full path?
### Source load_dataset() function from a file
# Specify a name (I think it can be whatever) and path to the lr_utils.py script locally on your PC:
util_script = importlib.util.spec_from_file_location("utils function", "D:/analytics/Deep_Learning_AI/functions/lr_utils.py")
# Make a module
load_utils = importlib.util.module_from_spec(util_script)
# Execute it on the fly
util_script.loader.exec_module(load_utils)
# Load your function
load_utils.load_dataset()
# Then you can use your load_dataset() coming from above specified 'module' called load_utils
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_utils.load_dataset()
# This could be a general way of calling different user specified modules so I did the same for the rest of the neural network function and put them into separate file to keep my script clean.
# Just remember that Python treat it like a module so you need to prefix the function name with a 'module' name eg.:
# d = nnet_utils.model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 1000, learning_rate = 0.005, print_cost = True)
nnet_script = importlib.util.spec_from_file_location("utils function", "D:/analytics/Deep_Learning_AI/functions/lr_nnet.py")
nnet_utils = importlib.util.module_from_spec(nnet_script)
nnet_script.loader.exec_module(nnet_utils)
That was the most convenient way for me to source functions/methods from different files in Python so far.
I am coming from the R background where you can call just one line function source() to bring external scripts contents into your current session.

The above answers didn't help, some links had expired.
So, lr_utils is not a pip library but a file in the same notebook as the CourseEra website.
You can click on "Open", and it'll open the explorer where you can download everything that you would want to run in another environment.
(I used this on a browser.)

This is how i solved mine, i copied the lir_utils file and paste it in my notebook thereafter i downloaded the dataset by zipping the file and extracting it. With the following code. Note: Run the code on coursera notebook and select only the zipped file in the directory to download.
!pip install zipfile36
zf = zipfile.ZipFile('datasets/train_catvnoncat_h5.zip', mode='w')
try:
zf.write('datasets/train_catvnoncat.h5')
zf.write('datasets/test_catvnoncat.h5')
finally:
zf.close()

Numpy memmap with file deletion

Is it possible to have a Numpy memmap file who's file will be deleted when the memmap array is garbage collected?
I have tried:
import tempfile
import numpy as np
arr = np.memmap(tempfile.NamedTemporaryFile(), mode='w+',
shape=(10, 10), dtype=np.int)
os.path.exists(arr.filename) # False
But it appears the reference to the temporary file isn't kept so is deleted.
I don't want to use a context manager on the temporary file, as I want to be able to return the array from a function and have the file live until the array is no longer used.
NB: similar question here: In Python, is it possible to overload Numpy's memmap to delete itself when the memmap object is no longer referenced? but the asker exhibits poor knowledge of Python scoping and the tempfile module.

As it turns out, jtaylor's answer to the question originally linked is correct. So the code:
import tempfile
import numpy as np
with tempfile.NamedTemporaryFile() as ntf:
temp_name = ntf.name
arr = np.memmap(ntf, mode='w+',
shape=(10, 10), dtype=np.int)
print(arr)
Works as desired, even though os.path.exists(temp_name) is false, because of the way the OS manages files. The file path (temp_name) is unlinked, and no longer accessible through the filesystem, however the actual file disk usage will remain available until the open file is closed, which the memmap object will keep.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Load numpy array in google-cloud-ml job - python

Related

printing the uploading progress using `xarray.Dataset.to_zarr` function

PyArrow ipc.read_tensor causes seg fault

Create mxnet.ndarray.NDArray from pycuda.driver.DeviceAllocation

Install issues with 'lr_utils' in python

Numpy memmap with file deletion

Categories

Resources