I am trying to pass numpy arrays from one process to another using PyArrow's shared memory framework. Currently the sender process has this code:
import numpy as np
import pyarrow as pa
data = np.random.rand(100, 100, 100)
tensor = pa.Tensor.from_numpy(data)
output_stream = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, output_stream)
buf = output_stream.getvalue()
print(buf.address, buf.size)
which outputs something like (5311993741568, 8000256)
In the second process I have:
import pyarrow as pa
import numpy as np
buf2 = pa.foreign_buffer(5311993741568, 8000256)
tensor2 = pa.ipc.read_tensor(buf2)
but I get a segfault on the last line. The documentation isn't very clear on the right way to use read_tensor and write_tensor. I am also running on Windows, so I cannot use the Plasma object store.
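For what it's worth, buf.address is a virtual address inside the sender's address space, so it cannot be dereferenced from another process, which is why foreign_buffer on it segfaults. One workaround on Windows, where Plasma is unavailable, is to write the serialized tensor into a memory-mapped file that both processes open. A minimal, untested sketch (the file path is just a placeholder):

import numpy as np
import pyarrow as pa

# --- sender: serialize the tensor into a memory-mapped file ---
data = np.random.rand(100, 100, 100)
tensor = pa.Tensor.from_numpy(data)
path = r"C:\temp\tensor.arrow"  # placeholder; must be reachable by both processes
with pa.create_memory_map(path, pa.ipc.get_tensor_size(tensor)) as sink:
    pa.ipc.write_tensor(tensor, sink)

# --- receiver: map the same file and read the tensor back ---
with pa.memory_map(path, "r") as source:
    tensor2 = pa.ipc.read_tensor(source)
    array2 = tensor2.to_numpy()

On the read side this only maps the file, so it should not require an extra copy of the tensor payload.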
I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfectly. However, I would like to show the progress for big datasets. Is there any option to chunk my dataset in a way that lets me use tqdm or something like that to log the upload progress?
Here is the code that I currently have:
import os
import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar
if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"
    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])
    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)
    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)
    print(write_job)
xarray.Dataset.to_zarr has an optional argument compute which is True by default:
compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.
Using this, you can track progress with dask's own dask.distributed.progress bar:
from dask import distributed

client = distributed.Client()  # progress() needs a dask.distributed scheduler; connect to or start one

write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()

# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[############## ] | 35% Completed | 4.5s
Note that for this to work, the dataset must consist of chunked dask arrays. If the data is currently in memory, you can give each array a single chunk with ds.chunk() before calling to_zarr; a chunked example follows below.
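For example, reusing data_xarr and store from the question, you could chunk the in-memory array first and then watch the write with the dask.diagnostics.ProgressBar that the question already imports (it works with the default local scheduler); the chunk size along x is arbitrary here:

from dask.diagnostics import ProgressBar

# chunk the in-memory data so to_zarr returns a lazy dask write job
ds = data_xarr.to_dataset(name="data").chunk({"x": 500})
write_job = ds.to_zarr(store, mode="w", compute=False)

# ProgressBar prints a text progress bar while the delayed write runs
with ProgressBar():
    write_job.compute()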
I am having a very hard time figuring this out. I'm trying to create a real-time satellite tracker using the Skyfield Python module. It reads in TLE data and then gives a latitude and longitude position relative to the Earth. The Skyfield module creates satrec objects which cannot be pickled (I even tried using dill). I am using a for loop to iterate over all the satellites, but this is very slow, so I want to speed it up using multiprocessing with the Pool method; as noted above, this does not work since multiprocessing uses pickle. Is there any way around this, or does anyone have suggestions on other ways to use multiprocessing to speed up this for loop?
from skyfield.api import load, wgs84, EarthSatellite
import numpy as np
import pandas as pd
import time
import os
from pyspark import SparkContext
from multiprocessing import Pool
import dill
data = pd.read_json('tempSatelliteData.json')
print(data.head())
newData = data.filter(['tle_line0', 'tle_line1', 'tle_line2'])
newData.to_csv('test.txt', sep='\n', index=False)
stations_file = 'test.txt'
satellites = load.tle_file(stations_file)
ts = load.timescale()
t = ts.now()
#print(satellites)
#data = pd.DataFrame(data=satellites)
#data = data.to_numpy()
def normal_for():
    # this for loop takes 9 seconds to complete - TOO SLOW
    ty = time.time()
    for satellite in satellites:
        geocentric = satellite.at(t)
        lat, lon = wgs84.latlon_of(geocentric)
        print('Latitude:', lat)
        print('Longitude:', lon)
    print(np.round_(time.time() - ty, 3), 'sec')
def sat_lat_lon(satellite):
    geocentric = satellite.at(t)
    lat, lon = wgs84.latlon_of(geocentric)

p = Pool()
result = p.map(sat_lat_lon, satellites)
p.close()
p.join()
I'm the author of dill, multiprocess, ppft, and pathos. You can try multiprocess, which uses dill instead of pickle, but if the objects are not serializable by dill, then that won't work. Alternatives are multiprocess.dummy, which uses threading and so does not require object serialization the way multiprocess does (as suggested in the comments). There's also pathos.pools.ParallelPool (or just use the underlying ppft), which converts objects to source code to ship them across processes. There are a few other packages that provide parallel maps, but most of them require serialization of some sort. If none of the above works, you might have to work harder to make the objects serializable. For example, you could register serialization functions for the objects to tell dill how to pickle them. dill also has serialization variants in dill.settings that let you try different serialization strategies. Sometimes just changing the code construction or import locations can make an object serializable.
If it's the speed of the serialization, and not the ability to serialize the objects, then you might try mpi4py (or pyina to get an MPI map). MPI is intended a bit more for heavy lifting (expensive code). However, if it's the serialization and shipping of large serialized objects that is slowing you down, then using threading or adding a custom serializer is probably your best bet, as in the threading sketch below.
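For the threading route, here is a minimal sketch that reuses satellites, t and wgs84 from the question. multiprocess.dummy exposes the Pool API on top of threads (the standard library's multiprocessing.dummy works the same way), so the Skyfield objects never have to be serialized:

from multiprocess.dummy import Pool  # thread-backed Pool; arguments are not pickled

def sat_lat_lon(satellite):
    geocentric = satellite.at(t)
    lat, lon = wgs84.latlon_of(geocentric)
    return lat.degrees, lon.degrees

with Pool() as p:
    results = p.map(sat_lat_lon, satellites)

Whether threads actually speed this up depends on how much of Skyfield's number-crunching releases the GIL, so it is worth timing this against the plain loop.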
An EarthSatellite object can be created directly by passing the TLE lines and a timescale to the constructor. So use Pool.map() or similar to pass the TLE lines and timescale to the worker processes and let them create the satrec objects themselves.
You can probably get the TLE lines directly from the JSON data and skip the read_json and write_csv steps, but you didn't provide a sample JSON file.
I don't have any sample data, so this is untested:
from skyfield.api import load, wgs84, EarthSatellite
import pandas as pd
from multiprocessing import Pool
ts = load.timescale()
t = ts.now()
# load the .json data and convert it to a list of
# lists containing tle data for a satellite
data = pd.read_json('tempSatelliteData.json')
tle_data = [(row.tle_line0, row.tle_line1, row.tle_line2, ts, t)
            for row in data.itertuples()]
def sat_lat_lon(line0, line1, line2, ts, t):
    satellite = EarthSatellite(line1, line2, line0, ts)
    geocentric = satellite.at(t)
    lat, lon = wgs84.latlon_of(geocentric)
    # the catalog number lives on the underlying sgp4 model object
    return satellite.model.satnum, lat, lon

with Pool() as p:
    result = p.starmap(sat_lat_lon, tle_data)
    p.close()
    p.join()
I am trying to pass the output of a pycuda operation to the input of an mxnet computational graph.
I am able to achieve this via numpy conversion with the following code:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import mxnet as mx
batch_shape = (1, 1, 10, 10)
h_input = np.zeros(shape=batch_shape, dtype=np.float32)
# init output with ones to see if contents really changed
h_output = np.ones(shape=batch_shape, dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, h_input, stream)
# here some actions with d_input may be performed, e.g. kernel calls
# but for the sake of simplicity we'll just transfer it back to host
cuda.memcpy_dtoh_async(h_output, d_input, stream)
stream.synchronize()
mx_input = mx.nd.array(h_output, ctx=mx.gpu(0))
print('output after pycuda calls: ', h_output)
print('mx_input: ', mx_input)
However, I would like to avoid the overhead of device-to-host and host-to-device memory copies.
I couldn't find a way to construct an mxnet.ndarray.NDArray directly from h_output.
The closest thing I was able to find is constructing an NDArray from dlpack, but it is not clear how to work with a dlpack object from Python.
Is there a way to achieve NDArray <-> pycuda interoperability without copying memory via the host?
Unfortunately, it is not possible at the moment.
In the model I want to launch, I have some variables which have to be initialized with specific values.
I currently store these variables into numpy arrays but I don't know how to adapt my code to make it work on a google-cloud-ml job.
Currently I initialize my variable like this:
my_variable = variables.model_variable('my_variable', shape=None, dtype=tf.float32, initializer=np.load('datasets/real/my_variable.npy'))
Can someone help me?
First, you'll need to copy/store the data on GCS (using, e.g., gsutil) and ensure your training script has access to that bucket. The easiest way to do so is to copy the array to the same bucket as your data, since you'll likely already have configured that bucket for read access. If the bucket is in the same project as your training job and you followed these instructions (particularly, gcloud beta ml init-project), you should be set. If the data will be in another bucket, see these instructions.
Then you'll need to use a library capable of loading data from GCS. Tensorflow includes a module that can do this, although you're free to use any client library that can read from GCS. Here's an example of using TensorFlow's file_io module:
from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io
# Create a variable initialized to the value of a serialized numpy array
f = StringIO(file_io.read_file_to_string('gs://my-bucket/123.npy'))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')
Note that we have to read the file into a string and use StringIO, since file_io.FileIO does not fully implement the seek function required by numpy.load.
Bonus: in case it's useful, you can directly store a numpy array to GCS using the file_io module, e.g.:
np.save(file_io.FileIO('gs://my-bucket/123', 'w'), np.array([[1,2,3], [4,5,6]]))
For Python 3, use from io import StringIO instead of from StringIO import StringIO.
I tried the accepted answer but ran into some problems. Eventually this worked for me (Python 3):
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
To save:
array = np.ones((100, ))
dest = 'gs://[BUCKET-NAME]/' # Destination to save in GCS
np.save(file_io.FileIO(dest, 'w'), array)
To load:
# src is the gs:// path of a saved .npy file
f = BytesIO(file_io.read_file_to_string(src, binary_mode=True))
arr = np.load(f)
This question is similar to but simpler than my previous one.
Here is the code that I use to create R dataframes from python using rpy2:
import numpy as np
from rpy2 import robjects
Z = np.zeros((10000, 500))
df = robjects.r["data.frame"]([robjects.FloatVector(column) for column in Z.T])
My problem is that using it repetitively results in huge memory consumption.
I tried to adapt the idea from here but without success.
How can I convert many numpy arrays to dataframe for treatment by R methods without gradually using all my memory?
You should make sure that you're using the latest version of rpy2. With rpy2 version 2.4.2, the following works nicely:
import gc
import numpy as np
from rpy2 import robjects
from rpy2.robjects.numpy2ri import numpy2ri
for i in range(100):
    print(i)
    Z = np.random.random(size=(10000, 500))
    matrix = numpy2ri(Z)
    df = robjects.r("data.frame")(matrix)
    gc.collect()
Memory usage never exceeds 600 MB on my computer.
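If memory still creeps up in your setup, another thing worth trying is to drop the Python references explicitly and also trigger R's own garbage collector after each iteration. A sketch of that variant, using the same rpy2 2.4-era API as above (newer rpy2 releases changed the conversion interface):

import gc
import numpy as np
from rpy2 import robjects
from rpy2.robjects.numpy2ri import numpy2ri

for i in range(100):
    Z = np.random.random(size=(10000, 500))
    df = robjects.r("data.frame")(numpy2ri(Z))
    # ... use df here ...
    del df, Z           # drop the Python-side references
    gc.collect()        # collect on the Python side
    robjects.r("gc()")  # and ask R to release its copies as well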