efficiently create dask.array from a dask.Series of lists

efficiently create dask.array from a dask.Series of lists - python

What is the most efficient way to create a dask.array from a dask.Series of list?
The series consists of 5 million lists 300 of elements.
It is currently divide into 500 partitions.
Currently I am trying:
pt = [delayed(np.array)(y)
for y in
[delayed(list)(x)
for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (vec.size.compute(), 300), dtype=float)
The idea is to convert each partition into a numpy array and stitch
those together into a dask.array.
This code is taking forever to run though.
A numpy array can be built from this data quite quickly from this data sequentially as long as there is enough RAM.

I think that you are on the right track using dask.delayed. However calling list on the series is probably not ideal. I would create a function that converts one of your series into a numpy array and then go through delayed with that.
def convert_series_to_array(pandas_series): # make this as fast as you can
...
return numpy_array
L = dask_series.to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
x = da.concatenate(arrays, axis=0)
Also, regarding this line:
da = delayed(dask.array.concatenate)(pt, axis=1)
You should never call delayed on a dask function. They are already lazy.

Looking at this with some dummy data. Building on #MRocklin's answer (and molding more after my specific use case), let's say that your vectors are actually list of ints instead of floats and the list is stored as a string. We take the series, transform it, and store it in a zarr array file.
# create dummy data
vectors = [ np.random.randint(low=0,high=100,size=300).tolist() for _ in range(1000) ]
df = pd.DataFrame()
df['vector'] = vectors
df['vector'] = df['vector'].map(lambda x:f"{x}")
df['foo'] = 'bar'
ddf = dd.from_pandas( df, npartitions=100 )
# transform series data to numpy array
def convert_series_to_array( series ): # make this as fast as you can
series_ = [ast.literal_eval( i ) for i in series]
return np.stack(series_, axis=0)
L = ddf['vector'].to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=np.int64) for x in L]
x = da.concatenate(arrays, axis=0)
# store result into a zarr array
x.compute_chunk_sizes().to_zarr( 'toy_dataset.zarr', '/home/user/Documents/', overwrite=True )

Related

How to return arrays from a function in python

I have a function to read some data files and make some pandas data. I have 4 paths I want to read and make dataframes.
def read_files(paths:np.array,scalings:np.array):
names = ['E','I']
for p,s in zip(paths,scalings):
df = pd.read_csv(p,engine = 'python', sep ='\s+', names=names)
energy_ = df['E']
intensity_ = df['I']
return energy_, intensity_
I want to make 2 arrays which have all of the dfs inside them to use for other functions.
(below) where each energy0 is the df['E'] from the first path in the paths array and so on.
energy = [energy_0, energy_1, energy2, energy_3]
intensity = [intensity_0, intensity_1, intensity_2, intensity_3]
to use in
fig = plotfunction(energy,intensity,etc)
How can I get call each specific dataframe so I can make an array with them? Edit: If I want to use the energy and intensity dataframes from the path3 if i use paths = [path0,path1,path2,path3]

How to return arrays from a function in python
Here's an example on how to return 2 array from a function:
def function():
array = [1,2,3]
array2 = [4,5,6]
return array, array2
a, a2 = function()
print(a)
print(a2)
How can I get call each specific dataframe
What?
I can only guess you want
def read_files(paths:np.array):
names = ['E','I']
energy_ = [] # create an array here
intensity_ = [] # create another array here
for p,s in zip(paths, scalings):
df = pd.read_csv(p,engine = 'python', sep ='\s+', names=names)
energy_.append(df['E']) # append to that array
intensity_.append(df['I']) # append to that other array
return energy_, intensity_ # return both arrays
Note that scalings is not defined in the code you have shown us.

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
{'a': [list(np.random.rand(11)) for i in range(100)],
'b': [list(np.random.rand(11)) for i in range(100)],
'c': [list(np.random.rand(11)) for i in range(100)],
'd': [list(np.random.rand(11)) for i in range(100)],
'e': [list(np.random.rand(11)) for i in range(100)],
'f': [list(np.random.rand(11)) for i in range(100)]
}
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0,rows):
for j in range(1,cols):
lg = linregress(df.iloc[i,0],df.iloc[i,j])
tstats[i,j-1] = lg.slope/lg.stderr
The code above works just fine and is doing exactly what I need, however as I mentioned above the performance begins to slow down when the # of rows and columns in df increases substantially.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!

I am newbie to this but I do optimization your original code:
by purely use python builtin list object (there is no need to use pandas and to be honest I cannot find a better way to solve your problem in pandas than you original code :D)
by using numpy, which should be (at least they claimed) faster than python builtin list.
You can jump to see the code, its in Jupyter notebook format so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing (30,) length random lists,
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to
test_perf
in sample code for result.
PS: During test only one thread is used, so maybe multi-thread will help to improve performance, but that's out of my ability...
Idea
I think numpy.nditer is suitable for your request, though the result of optimization is not that significant. Here is my idea:
Generate the input array
I have altered you first part of script, I think using list comprehension along is enough to build a matrix of random lists. Refer to
get_matrix_from_builtin.
Please note I have stored the random lists in another 1-element tuple to keep the shape as ndarray generate from numpy.
As a compare, you can also construct such matrix with numpy. Refer to
get_matrix_from_numpy.
Because ndarray try to boardcast list-like object (and I don't know how to stop it), I have to wrap it into a tuple to avoid auto boardcast from numpy.array constructor. If anyone have a better solution please note it, thanks :)
Calculate the result
I altered you original code using pandas.DataFrame to access element by row/col index, but it is not that way.
Pandas provides some iteration tool for DataFrame: pipe, apply, agg, and appymap, search API for more info, but it seems not suitable for your request here, as you want to obtain the current index of row and col during iteration.
I searched and found numpy.nditer can provide that needs: it return a iterator of ndarray, which have an attribution multi_index that provide the row/col pair of current element. see iterating-over-arrays
Explain on solve.ipynb
I use Jupyter Notebook to test this, you might need got one, here is the instruction of install.
I have altered your original code, which remove the request of pandas and purely used builtin list. Refer to
old_calc_tstat
in the sample code.
Also, I used numpy.nditer to calc your tstats matrix, Refer to
new_calc_tstat
in the sample code.
Then, I tested if the result of both methods are equal, I used same input array to ensure random won't affect the test. Refer to
test_equal
for result.
Finally, do the time performance. I am not patient so I only run it for one time, you may add the repeats count of test in the
test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
# use builtin list to construct matrix of random list
# note I put random list inside a tuple to keep it same shape
# as I later use numpy to do the same thing.
return [
[(list(np.random.rand(11)),)
for col in range(6)]
for row in range(100)
]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
gen=np.random.rand,
shape=(1, 1),
nest_shape=(1, ),
):
# custom dtype for random lists
mydtype = [
('randonlist', 'f', nest_shape)
]
a = np.empty(shape, dtype=mydtype)
# [DOC] moditfying array values
# https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
# enable per operation flags 'readwrite' to modify element in ndarray
# enable global flag 'refs_ok' to allow use callable function 'gen' in iteration
with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
for x in it:
# pack list in a 1-d turple to prevent numpy boardcast it
x[...] = (gen(nest_shape[0]), )
return a
def test_get_matrix_from_numpy():
gen = np.random.rand # generator of random list
shape = (6, 100) # shape of matrix to hold random lists
nest_shape = (11, ) # shape of random lists
return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
if a is None:
a = get_matrix_from_builtin()
a = np.array(a)
rows, cols = a.shape[:2]
tstats = np.zeros(shape=(rows, cols))
for i in range(0, rows):
for j in range(1, cols):
lg = linregress(a[i][0][0], a[i][j][0])
tstats[i, j-1] = lg.slope/lg.stderr
return tstats
# %%
def new_calc_tstat(a=None):
# read input metrix of random lists
if a is None:
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# construct ndarray for t-stat result
tstats = np.empty(a.shape)
# enable global flags 'multi_index' to retrive index of current element
# [DOC] Tracking an Index or Multi-Index
# https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
# obtain total columns count of tstats's shape
col = tstats.shape[1]
for x in it:
i, j = it.multi_index
# trick to avoid IndexError: substract len(list) after +1 to index
j = j + 1 - col
lg = linregress(
a[i][0][0],
a[i][j][0]
)
# note: nditer ignore ZeroDivisionError by default, and return np.inf to the element
# you have to override it manually:
if lg.stderr == 0:
x[...] = 0
else:
x[...] = lg.slope / lg.stderr
return tstats
# new_calc_tstat()
# %%
def test_equal():
"""Test if the new method has equal output to old one"""
# use same input list to avoid affect of rand
a = test_get_matrix_from_numpy()
old = old_calc_tstat(a)
new = new_calc_tstat(a)
print(
"Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
old.shape == new.shape, old.shape, new.shape),
)
res = (old == new)
print(
"Is the result object same?"
)
if res.all() == True:
print("True.")
else:
print("False. Difference(new - old) as below:\n")
print(new - old)
return old, new
old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you perfer the old method, just add condition in new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.clock if sys.platform[:3] == 'win' else time.time
def total(func, *args, _reps=1, **kwargs):
start = timer()
for i in range(_reps):
ret = func(*args, **kwargs)
elapsed = timer() - start
return elapsed
def test_perf():
"""Test of performance"""
# first, get a larger input array
gen = np.random.rand
shape = (1000, 100)
nest_shape = (30, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# repeat how many time for each test
reps = 1
# then, time both old and new calculation method
old = total(old_calc_tstat, a, _reps=reps)
new = total(new_calc_tstat, a, _reps=reps)
msg = "Time elapsed to run %d times on %s is %f seconds."
print(msg % (reps, 'new method', new))
print(msg % (reps, 'old method', old))
test_perf()

Sort numpy string array using positional data

I have a numpy array of strings
names = array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
And corresponding position data
X = array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
How can I sort names using X and Y such that I get out a sorted grid of names with shape (6, 6)? Note that there are essentially 6 unique X and Y positions -- I'm not just arbitrarily choosing 6x6.
names = array([
['p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00'],
['p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01'],
['p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02'],
['p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03'],
['p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04'],
['p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05']])
I realize in this case that I could simply reshape the array, but in general the data will not work out this neatly.

You can use numpy.argsort to get the indexes of the elements of an array after it's sorted. These indices you can then use to sort your names array.
import numpy as np
names = np.array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
X = np.array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = np.array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
x_order = np.argsort(X)
y_order = np.argsort(Y)
names_ordered = names.reshape(6,6)[np.meshgrid(x_order,y_order)]
print(names_ordered)
gives the following output:
[['p00x05' 'p00x04' 'p00x03' 'p00x02' 'p00x01' 'p00x00']
['p01x05' 'p01x04' 'p01x03' 'p01x02' 'p01x01' 'p01x00']
['p02x05' 'p02x04' 'p02x03' 'p02x02' 'p02x01' 'p02x00']
['p03x05' 'p03x04' 'p03x03' 'p03x02' 'p03x01' 'p03x00']
['p04x05' 'p04x04' 'p04x03' 'p04x02' 'p04x01' 'p04x00']
['p05x05' 'p05x04' 'p05x03' 'p05x02' 'p05x01' 'p05x00']]

dynamically append N-dimensional array

If each array has the shape (1000, 2, 100), it is easy to use
con = np.concatenate((array_A, array_B))
to concatenate them, thus con has the shape (2000, 2, 100).
I want to dynamically append or concatenate "con" in a function. The step is described as following:
First, read data from the first file and process data to generate an array.
Secondly, read date from the second file and append generated array into the first array
....
def arrayappend():
for i in range(n):
#read data from file_0 to file_n-1
data = read(file_i)
#data processing to generate an array with shape (1000, 2, 100)
con = function(data)
# append con

Assuming all your files produce the same shape objects and you want to join them on the 1st dimension, there are several options:
alist = []
for f in files:
data = foo(f)
alist.append(f)
arr = np.concatenate(alist, axis=0)
concatenate takes a list. There are variations if you want to add a new axis (np.array(alist), np.stack etc).
Append to a list is fast, since it just means adding a pointer to the data object. concatenate creates a new array from the components; it's compiled but still relatively slower.
If you must/want to make a new array at each stage you could write:
arr = function(files[0])
for f in files[1:]:
data = foo(f)
arr = np.concatenate((arr, data), axis=0)
This probably is slower, though, if the file loading step is slow enough you might not notice a difference.
With care you might be able start with arr = np.zeros((0,2,100)) and read all files in the loop. You have to make sure the initial 'empty' array has a compatible shape. New users often have problems with this.

If you absolutely want to do it during iteration then:
def arrayappend():
con = None
for i, d in enumerate(files_list):
data = function(d)
con = data if i is 0 else np.vstack([con, data])
This should stack it vertically.

Very non pretty, but does it achieve what you want? It is way unoptimized.
def arrayappend():
for i in range(n):
data = read(file_i)
try:
con
con = np.concatenate((con, function(data)))
except NameError:
con = function(data)
return con
First loop will take the except branch, subsequent wont.

efficient way to convert a nested list to numpy array

I have a nested list with different list sized and types.
def read(f,tree,objects):
Event=[]
for o in objects:
#find different features of one class
temp=[i.GetName() for i in tree.GetListOfBranches() if i.GetName().startswith(o)]
tempList=[] #contains one class of objects
for t in temp:
#print t
tempList.append(t)
comp=np.asarray(getattr(tree,t))
tempList.append(comp)
Event.append(tempList)
return Event
def main():
path="path/to/file"
objects= ['TauJet', 'Jet', 'Electron', 'Muon', 'Photon', 'Tracks', 'ETmis', 'CaloTower']
f=ROOT.TFile(path)
tree=f.Get("RecoTree")
tree.GetEntry(100)
event=read(f,tree,objects)
for example result of event[0] is
['TauJet', array(1), 'TauJet_E', array([ 31.24074173]), 'TauJet_Px', array([-28.27997971]), 'TauJet_Py', array([-13.18042469]), 'TauJet_Pz', array([-1.08304048]), 'TauJet_Eta', array([-0.03470514]), 'TauJet_Phi', array([-2.70545626]), 'TauJet_PT', array([ 31.20065498]), 'TauJet_Charge', array([ 1.]), 'TauJet_NTracks', array([3]), 'TauJet_EHoverEE', array([ 1745.89221191]), 'TauJet_size', array(1)]
how can I convert it into numpy array?
NOTE 1: np.asarray(event, "object") is slow. I am looking for a better way. Also np.fromiter() is not applicable as far as I don't have a fixed type
NOTE 2: I don't know the length of my Events.
NOTE 3: I can also get ride of names if it makes thing easier.

You could try something like this, I'm not sure how fast its going to be though. This creates a numpy record array for first row.
data = event[0]
keys = data[0::2]
vals = data[1::2]
#there are some zero-rank arrays in there, so need to check for those,
#but I think just recasting them to a np.float should work.
temp = [np.float(v) for v in vals]
#you could also just create a np array from the line above with np.array(temp)
dtype={"names":keys, "formats":("f4")*len(vals)}
myArr = np.rec.fromarrays(temp, dtype=dtype)
#test it out
In [53]: data["TauJet_Pz"]
Out[53]: array(-1.0830404758453369, dtype=float32)
#alternatively, you could try something like this, which just creates a 2d numpy array
vals = np.array([[np.float(v) for v in row[1::2]] for row in event])
#now create a nice record array from that using the dtypes above
myRecordArray = np.rec.fromarrays(vals, dtype=dtype)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

efficiently create dask.array from a dask.Series of lists - python

Related

How to return arrays from a function in python

Running Scipy Linregress Across Dataframe Where Each Element is a List

Sort numpy string array using positional data

dynamically append N-dimensional array

efficient way to convert a nested list to numpy array

Categories

Resources