Printing the upload progress when using the `xarray.Dataset.to_zarr` function - python

I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfectly. However, I would like to show the progress for big datasets. Is there any option to chunk my dataset so that I can use tqdm or something like that to log the upload progress?
Here is the code that I currently have:
import os
import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar
if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"
    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])
    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)
    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)
    print(write_job)

xarray.Dataset.to_zarr has an optional argument compute which is True by default:
compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.
Using this, you can track the progress using dask's own dask.distributed.progress bar:
from dask import distributed

write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()
# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[############## ] | 35% Completed | 4.5s
Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.
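If you are not running a distributed cluster, here is a sketch of the same idea using the dask.diagnostics.ProgressBar that the question already imports (the variable names data_xarr and store are taken from the question's code, and the chunk size is an arbitrary choice):
from dask.diagnostics import ProgressBar

# chunk the in-memory array so the write becomes a set of dask tasks,
# then write lazily and compute under a terminal progress bar
ds = data_xarr.to_dataset(name="data").chunk({"x": 500})
write_job = ds.to_zarr(store, mode="w", compute=False)
with ProgressBar():
    write_job.compute()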

Related

Is there a way to not use pickling when using the Python multiprocessing module?

I am having a very hard time figuring this out. I'm trying to create a real-time satellite tracker using the Skyfield Python module. It reads in TLE data and then gives a latitude and longitude position relative to the Earth. The Skyfield module creates satrec objects which cannot be pickled (I even tried using dill). I am using a for loop to loop over all the satellites, but this is very slow, so I want to speed it up using multiprocessing with the Pool method; as noted above, this does not work since multiprocessing uses pickle. Is there any way around this, or does anyone have suggestions on other ways to use multiprocessing to speed up this for loop?
from skyfield.api import load, wgs84, EarthSatellite
import numpy as np
import pandas as pd
import time
import os
from pyspark import SparkContext
from multiprocessing import Pool
import dill
data = pd.read_json('tempSatelliteData.json')
print(data.head())
newData = data.filter(['tle_line0', 'tle_line1', 'tle_line2'])
newData.to_csv('test.txt', sep='\n', index=False)
stations_file = 'test.txt'
satellites = load.tle_file(stations_file)
ts = load.timescale()
t = ts.now()
#print(satellites)
#data = pd.DataFrame(data=satellites)
#data = data.to_numpy()
def normal_for():
    # this for loop takes 9 seconds to complete -- TOO SLOW
    ty = time.time()
    for satellite in satellites:
        geocentric = satellite.at(t)
        lat, lon = wgs84.latlon_of(geocentric)
        print('Latitude:', lat)
        print('Longitude:', lon)
    print(np.round_(time.time() - ty, 3), 'sec')

def sat_lat_lon(satellite):
    geocentric = satellite.at(t)
    lat, lon = wgs84.latlon_of(geocentric)

p = Pool()
result = p.map(sat_lat_lon, satellites)
p.close()
p.join()
I'm the author of dill, multiprocess, ppft, and pathos. You can try multiprocess, which uses dill instead of pickle, but if the objects are not serializable by dill, then that won't work. Alternatives are multiprocess.dummy, which uses threading and so does not require object serialization the way multiprocess does (as suggested in the comments). There's also pathos.pools.ParallelPool (or just use the underlying ppft), which converts objects to source code to ship them across processes. There are a few other packages that provide parallel maps, but most of them require serialization of some sort. If none of the above works, you might have to work harder to make the objects serializable. For example, you could register serialization functions for the objects to inform dill how to pickle them. dill also has serialization variants in dill.settings that let you try different serialization strategies. Sometimes just changing the code construction or import locations can make an object serializable.
If it's the speed of the serialization, and not the ability to serialize the objects, then you might try mpi4py (or pyina to get an MPI map). MPI is intended a bit more for heavy lifting (expensive code). However, if it's the serialization and shipping of large serialized objects that is slowing you down, then using threading or adding a custom serializer is probably your best bet.
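For completeness, a minimal sketch of the thread-based route, assuming the satellites list and the sat_lat_lon function from the question (multiprocess.dummy mirrors the multiprocessing.dummy API, so the satellite objects are never serialized):
from multiprocess.dummy import Pool  # thread-backed Pool, same interface as multiprocess.Pool

with Pool() as p:
    # note: sat_lat_lon as written in the question returns nothing,
    # so have it return (lat, lon) if you want usable results here
    results = p.map(sat_lat_lon, satellites)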
An EarthSatellite object can be created directly by passing the TLE lines and a timescale to the constructor. So use Pool.map() or similar to pass the TLE lines and timescale to the worker processes and let them create the satrec objects themselves.
You can probably get the TLE lines directly from the JSON data and skip the read_json and to_csv round trip, but you didn't provide a sample JSON file.
I don't have any sample data, so this is untested:
from skyfield.api import load, wgs84, EarthSatellite
import pandas as pd
from multiprocessing import Pool
ts = load.timescale()
t = ts.now()
# load the .json data and convert it to a list of
# tuples containing the tle data for each satellite
data = pd.read_json('tempSatelliteData.json')
tle_data = [(row.tle_line0, row.tle_line1, row.tle_line2, ts, t)
            for row in data.itertuples()]

def sat_lat_lon(line0, line1, line2, ts, t):
    satellite = EarthSatellite(line1, line2, line0, ts)
    geocentric = satellite.at(t)
    lat, lon = wgs84.latlon_of(geocentric)
    return satellite.model.satnum, lat, lon

with Pool() as p:
    result = p.starmap(sat_lat_lon, tle_data)
    p.close()
    p.join()

InvalidStateError running PyRPL oscilloscope code

I have been using the PyRPL API Manual to begin running my Red Pitaya (STEMLab 125-10) on Jupyter Notebook instead of using the GUI. When I run the oscilloscope code (below, along with a condensed version of the API manual code that precedes it) and reach ch1, ch2 = res.result(), I get "InvalidStateError: Result is not set." This is the error I would expect when calling the result of a future that is not done yet, but according to the source code for pyrpl.hardware_modules.scope, the future is never done:
This Future object is the one controlling the acquisition in
rolling_mode. It will never be fullfilled (done), since rolling_mode
is always continuous, but the timer/slot mechanism to control the
rolling_mode acquisition is encapsulated in this object.
(See class ContinuousRollingFuture in the link above). So if I can't take the result of a future which is not fulfilled yet, and the future is never fulfilled, and I need the result to get the curve, how can I ever get the curve?
I don't have much experience with the python library asyncio, so I would be very grateful for suggestions and tips on what I'm not understanding!
#condensed version of PyRPL code before the oscilloscope block
HOSTNAME = "rp-XXXXXX.local" # hostname of the red pitaya
folder = "." # relative folder where files are saved
import sys
import os
from pyrpl import Pyrpl
import numpy as np
import time
from time import sleep
import matplotlib.pyplot as plt
import scipy.fft
import IPython
import ipywidgets as widgets
from scipy.signal import welch
p = Pyrpl(hostname=HOSTNAME, config='test',gui=False)
r = p.rp
s = r.scope
# oscilloscope code
asg = r.asg1
# turn off asg so the scope has a chance to measure its "off-state" as well
asg.output_direct = "off"
# setup scope
s.input1 = 'asg1'
# show pid output on channel2
s.input2 = 'pid0'
# trig at zero volt crossing
s.threshold_ch1 = 0
# positive/negative slope is detected by waiting for input to
# sweep through hysteresis around the trigger threshold in
# the right direction
s.hysteresis_ch1 = 0.01
# trigger on the input signal positive slope
s.trigger_source = 'ch1_positive_edge'
# take data symmetrically around the trigger event
s.trigger_delay = 0
# set decimation factor to 64 -> full scope trace is 8ns * 2^14 * decimation = 8.3 ms long
s.decimation = 64
# launch a single (asynchronous) curve acquisition; the asynchronous
# acquisition means that the function returns immediately, even though the
# data acquisition is still going on.
res = s.curve_async()
print("Before turning on asg:")
print("Curve ready:", s.curve_ready()) # trigger should still be armed
# turn on asg and leave enough time for the scope to record the data
asg.setup(frequency=1e3, amplitude=0.3, start_phase=90, waveform='halframp', trigger_source='immediately')
sleep(0.010)
# check that the trigger has been disarmed
print("After turning on asg:")
print("Curve ready:", s.curve_ready())
print("Trigger event age [ms]:",8e-9*((
s.current_timestamp&0xFFFFFFFFFFFFFFFF) - s.trigger_timestamp)*1000)
# The function curve_async returns a *future* (or promise) of the curve. To
# access the actual curve, use result()
# this is the line that's giving me trouble:
ch1, ch2 = res.result()
# plot the data
%matplotlib inline
plt.plot(s.times*1e3, ch1, s.times*1e3, ch2)
plt.xlabel("Time [ms]")
plt.ylabel("Voltage")

How to efficiently convert npy to xarray / zarr

I have a 37 GB .npy file that I would like to convert to a Zarr store so that I can include coordinate labels. I have code that does this in theory, but I keep running out of memory. I want to use Dask in between to facilitate doing this in chunks, but I still keep running out of memory.
The data is "thickness maps" for people's femoral cartilage. Each map is a 310x310 float array, and there are 47789 of these maps. So the data shape is (47789, 310, 310).
Step 1: Load the npy file as a memmapped Dask array.
fem_dask = dask.array.from_array(np.load('/Volumes/T7/cartilagenpy20220602/femoral.npy', mmap_mode='r'),
                                 chunks=(300, -1, -1))
Step 2: Make an xarray DataArray over the Dask array, with the desired coordinates. I have several coordinates for the 'map' dimension that come from metadata (a pandas dataframe).
fem_xr = xr.DataArray(fem_dask, dims=['map', 'x', 'y'],
                      coords={'patient_id': ('map', metadata['patient_id']),
                              'side': ('map', metadata['side'].astype(np.string_)),
                              'timepoint': ('map', metadata['timepoint'])
                              })
Step 3: Write to Zarr.
fem_ds = fem_xr.to_dataset(name='femoral') # Zarr requires Dataset, not DataArray
res = fem_ds.to_zarr('/Volumes/T7/femoral.zarr',
                     encoding={'femoral': {'dtype': 'float32'}},
                     compute=False)
res.visualize()
The task graph from res.visualize() is omitted here. When I call res.compute(), RAM use quickly climbs out of control. At first the other Python processes, which I think are the Dask workers, appear inactive, but a bit later one of them is using 20 GB of RAM and another 36 GB, which the Dask dashboard confirms. Eventually all the workers get killed and the task errors out. How can I do this in an efficient way that correctly uses Dask, xarray, and Zarr, without running out of RAM (or melting the laptop)?
using threads
If the dask workers can share threads, your code should just work. If you don't initialize a dask Cluster explicitly, dask.Array will create one with default args, which use processes. This results in the behavior you're seeing. To solve this, explicitly create a cluster using threads:
import numpy as np
import dask.array as dda
import dask.distributed

# use threads, not processes
cluster = dask.distributed.LocalCluster(processes=False)
client = dask.distributed.Client(cluster)

arr = np.load('myarr.npy', mmap_mode='r')
da = dda.from_array(arr).rechunk(chunks=(100, 310, 310))
da.to_zarr('myarr.zarr', mode='w')
using processes or distributed workers
If you're using a cluster which cannot share threads, such as a JobQueue, KubernetesCluster, etc., you can use the following to read the npy file, assuming it's on a networked filesystem or is in some way available to all workers.
Here's a workflow that creates an empty array from the memory map, then maps the read job using dask.array.map_blocks. The key is the optional block_info keyword, which gives the location of each block within the overall array; we can use it to slice new memmap objects on the dask workers:
def load_npy_chunk(da, fp, block_info=None, mmap_mode='r'):
    """Load a slice of the .npy array, making use of the block_info kwarg"""
    np_mmap = np.load(fp, mmap_mode=mmap_mode)
    array_location = block_info[0]['array-location']
    dim_slicer = tuple(list(map(lambda x: slice(*x), array_location)))
    return np_mmap[dim_slicer]

def dask_read_npy(fp, chunks=None, mmap_mode='r'):
    """Read metadata by opening the mmap, then send the read job to workers"""
    np_mmap = np.load(fp, mmap_mode=mmap_mode)
    da = dda.empty_like(np_mmap, chunks=chunks)
    return da.map_blocks(load_npy_chunk, fp=fp, mmap_mode=mmap_mode, meta=da)
This works for me on a demo of the same size (you could add the xarray.DataArray creation/formatting step at the end, but the dask ops work fine and worker memory stays below 1GB for me):
import numpy as np, dask.array as dda, xarray as xr, pandas as pd, dask.distributed
### insert/import above functions here
# save a large numpy array
np.save('myarr.npy', np.empty(shape=(47789, 310, 310), dtype=np.float32))
cluster = dask.distributed.LocalCluster()
client = dask.distributed.Client(cluster)
da = dask_read_npy('myarr.npy', chunks=(300, -1, -1), mmap_mode='r')
da.to_zarr('myarr.zarr', mode='w')
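If you want the coordinate labels from the original question, here is a sketch of the final wrapping step under the same assumptions (metadata is the pandas DataFrame from the question, with patient_id, side and timepoint columns):
import xarray as xr

fem_xr = xr.DataArray(da, dims=['map', 'x', 'y'],
                      coords={'patient_id': ('map', metadata['patient_id'].values),
                              'side': ('map', metadata['side'].astype(np.string_).values),
                              'timepoint': ('map', metadata['timepoint'].values)})
# Zarr wants a Dataset, so name the variable before writing
fem_xr.to_dataset(name='femoral').to_zarr('femoral.zarr', mode='w',
                                          encoding={'femoral': {'dtype': 'float32'}})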

PyArrow ipc.read_tensor causes seg fault

I am trying to pass numpy arrays from one process to another using PyArrow's shared memory framework. Currently the sender process has this code:
import numpy as np
import pyarrow as pa
data = np.random.rand(100,100, 100)
tensor = pa.Tensor.from_numpy(data)
output_stream = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, output_stream)
buf = output_stream.getvalue()
print(buf.address, buf.size)
which outputs something like (5311993741568, 8000256)
In the second process I have:
import pyarrow as pa
import numpy as np
buf2 = pa.foreign_buffer(5311993741568, 8000256)
tensor2 = pa.ipc.read_tensor(buf2)
but I get a segfault on the last line. The documentation isn't very clear on the right way to use read_tensor and write_tensor. I am also running on Windows, so I cannot use the plasma object store.

Load numpy array in google-cloud-ml job

In the model I want to launch, I have some variables which have to be initialized with specific values.
I currently store these variables in numpy arrays, but I don't know how to adapt my code to make it work in a google-cloud-ml job.
Currently I initialize my variable like this:
my_variable = variables.model_variable('my_variable', shape=None, dtype=tf.float32, initializer=np.load('datasets/real/my_variable.npy'))
Can someone help me?
First, you'll need to copy/store the data on GCS (using, e.g., gsutil) and ensure your training script has access to that bucket. The easiest way to do so is to copy the array to the same bucket as your data, since you'll likely already have configured that bucket for read access. If the bucket is in the same project as your training job and you followed these instructions (particularly, gcloud beta ml init-project), you should be set. If the data will be in another bucket, see these instructions.
Then you'll need to use a library capable of loading data from GCS. Tensorflow includes a module that can do this, although you're free to use any client library that can read from GCS. Here's an example of using TensorFlow's file_io module:
from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io
# Create a variable initialized to the value of a serialized numpy array
f = StringIO(file_io.read_file_to_string('gs://my-bucket/123.npy'))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')
Note that we have to read the file into a string and use StringIO, since file_io.FileIO does not fully implement the seek function required by numpy.load.
Bonus: in case it's useful, you can directly store a numpy array to GCS using the file_io module, e.g.:
np.save(file_io.FileIO('gs://my-bucket/123', 'w'), np.array([[1,2,3], [4,5,6]]))
For Python 3, use from io import StringIO instead of from StringIO import StringIO.
I tried the accepted answer but ran into some problems. Eventually this worked for me (Python 3):
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
To save:
array = np.ones((100, ))
dest = 'gs://[BUCKET-NAME]/' # Destination to save in GCS
np.save(file_io.FileIO(dest, 'w'), array)
To load:
f = BytesIO(file_io.read_file_to_string(src, binary_mode=True))  # src is the gs:// path of the saved array
arr = np.load(f)
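To tie this back to the original variable initialization, here is a sketch under the same imports (the bucket path is a placeholder):
import tensorflow as tf

f = BytesIO(file_io.read_file_to_string('gs://[BUCKET-NAME]/my_variable.npy', binary_mode=True))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')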
