I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
    {'a': [list(np.random.rand(11)) for i in range(100)],
     'b': [list(np.random.rand(11)) for i in range(100)],
     'c': [list(np.random.rand(11)) for i in range(100)],
     'd': [list(np.random.rand(11)) for i in range(100)],
     'e': [list(np.random.rand(11)) for i in range(100)],
     'f': [list(np.random.rand(11)) for i in range(100)]
    }
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows, cols-1))
for i in range(0, rows):
    for j in range(1, cols):
        lg = linregress(df.iloc[i, 0], df.iloc[i, j])
        tstats[i, j-1] = lg.slope/lg.stderr
The code above works just fine and does exactly what I need; however, as mentioned above, performance slows down substantially as the number of rows and columns in df increases.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!
I am new to this, but I did optimize your original code:
by purely using Python's built-in list object (there is no need to use pandas, and to be honest I cannot find a better way to solve your problem in pandas than your original code :D)
by using numpy, which should be (at least they claim) faster than Python's built-in list.
You can jump straight to the code; it is in Jupyter notebook format, so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing (30,) length random lists,
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to test_perf in the sample code for the result.
PS: Only one thread was used during the test, so multi-threading might help improve performance further, but that is beyond my ability...
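As a hedged sketch of that idea (my own addition, not part of the tested code below): each row's regressions are independent, so the row loop can be distributed over worker processes. This assumes the DataFrame layout from the question; the helper must live at module top level so it can be pickled.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from scipy.stats import linregress

def row_tstats(row_lists):
    # t-stat of the slope for column 0 against every other column of one row
    x = row_lists[0]
    out = []
    for y in row_lists[1:]:
        lg = linregress(x, y)
        out.append(lg.slope / lg.stderr)
    return out

def parallel_tstats(df):
    # each cell holds a list, so each row becomes a list of lists
    rows = df.values.tolist()
    with ProcessPoolExecutor() as pool:
        return np.array(list(pool.map(row_tstats, rows)))

# usage: tstats = parallel_tstats(df)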
Idea
I think numpy.nditer suits your needs, though the optimization gain is not that significant. Here is my idea:
Generate the input array
I have altered the first part of your script; I think using a list comprehension alone is enough to build a matrix of random lists. Refer to get_matrix_from_builtin.
Please note I have stored each random list inside a 1-element tuple to keep the same shape as the ndarray generated from numpy.
For comparison, you can also construct such a matrix with numpy. Refer to get_matrix_from_numpy.
Because ndarray tries to broadcast list-like objects (and I don't know how to stop it), I had to wrap each list in a tuple to avoid automatic broadcasting by the numpy.array constructor. If anyone has a better solution please note it, thanks :)
Calculate the result
I altered your original code, which used pandas.DataFrame to access elements by row/col index, but that is not the fast way.
Pandas provides some iteration tools for DataFrame: pipe, apply, agg, and applymap; search the API for more info. None of them seem suitable for your request here, though, as you want to obtain the current row and column index during iteration.
I searched and found that numpy.nditer can provide exactly that: it returns an iterator over an ndarray which has a multi_index attribute providing the row/col pair of the current element. See iterating-over-arrays.
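As a minimal illustration of that multi_index attribute (a small sketch of my own, not part of the notebook below):
import numpy as np
a = np.arange(6).reshape(2, 3)
it = np.nditer(a, flags=['multi_index'])
for x in it:
    i, j = it.multi_index  # row/col pair of the current element
    print(i, j, int(x))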
Explanation of solve.ipynb
I used Jupyter Notebook to test this; you might need to get one, here are the install instructions.
I altered your original code to remove the need for pandas and purely use built-in lists. Refer to old_calc_tstat in the sample code.
Also, I used numpy.nditer to calculate your tstats matrix. Refer to new_calc_tstat in the sample code.
Then, I tested whether the results of both methods are equal; I used the same input array to ensure randomness would not affect the test. Refer to test_equal for the result.
Finally, the timing. I am not patient, so I only ran it once; you may increase the repeat count in the test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
    # use builtin list to construct a matrix of random lists
    # note I put each random list inside a tuple to keep the same shape
    # as when I later use numpy to do the same thing.
    return [
        [(list(np.random.rand(11)),)
         for col in range(6)]
        for row in range(100)
    ]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
        gen=np.random.rand,
        shape=(1, 1),
        nest_shape=(1, ),
):
    # custom dtype for random lists
    mydtype = [
        ('randonlist', 'f', nest_shape)
    ]
    a = np.empty(shape, dtype=mydtype)
    # [DOC] modifying array values
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
    # enable per-operand flag 'readwrite' to modify elements in the ndarray
    # enable global flag 'refs_ok' to allow using the callable 'gen' during iteration
    with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
        for x in it:
            # pack the list in a 1-element tuple to prevent numpy broadcasting it
            x[...] = (gen(nest_shape[0]), )
    return a
def test_get_matrix_from_numpy():
    gen = np.random.rand  # generator of random list
    shape = (6, 100)      # shape of matrix to hold random lists
    nest_shape = (11, )   # shape of random lists
    return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def old_calc_tstat(a=None):
    if a is None:
        a = get_matrix_from_builtin()
        a = np.array(a)
    rows, cols = a.shape[:2]
    tstats = np.zeros(shape=(rows, cols))
    for i in range(0, rows):
        for j in range(1, cols):
            lg = linregress(a[i][0][0], a[i][j][0])
            tstats[i, j-1] = lg.slope/lg.stderr
    return tstats
# %%
def new_calc_tstat(a=None):
    # read input matrix of random lists
    if a is None:
        gen = np.random.rand
        shape = (6, 100)
        nest_shape = (11, )
        a = get_matrix_from_numpy(gen, shape, nest_shape)
    # construct ndarray for the t-stat result
    tstats = np.empty(a.shape)
    # enable global flag 'multi_index' to retrieve the index of the current element
    # [DOC] Tracking an Index or Multi-Index
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
    it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
    # obtain the total column count from tstats's shape
    col = tstats.shape[1]
    for x in it:
        i, j = it.multi_index
        # trick to avoid IndexError: subtract the column count after +1 to the index
        j = j + 1 - col
        lg = linregress(
            a[i][0][0],
            a[i][j][0]
        )
        # note: nditer ignores the zero division by default and writes np.inf to the element
        # you have to override it manually:
        if lg.stderr == 0:
            x[...] = 0
        else:
            x[...] = lg.slope / lg.stderr
    return tstats
# new_calc_tstat()
# %%
def test_equal():
    """Test if the new method has output equal to the old one"""
    # use the same input array to avoid effects of randomness
    a = test_get_matrix_from_numpy()
    old = old_calc_tstat(a)
    new = new_calc_tstat(a)
    print(
        "Are the shapes of old and new the same?\n%s. old: %s, new: %s\n" % (
            old.shape == new.shape, old.shape, new.shape),
    )
    res = (old == new)
    print(
        "Is the result object the same?"
    )
    if res.all():
        print("True.")
    else:
        print("False. Difference (new - old) as below:\n")
        print(new - old)
    return old, new
old, new = test_equal()
# %%
# the only diff is the last element
# in the old method it is 0
# in the new method it is inf
# if you prefer the old method, just add a condition in the new method to override it
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
# note: time.clock was removed in Python 3.8, so use perf_counter on every platform
timer = time.perf_counter
def total(func, *args, _reps=1, **kwargs):
    start = timer()
    for i in range(_reps):
        ret = func(*args, **kwargs)
    elapsed = timer() - start
    return elapsed
def test_perf():
    """Test of performance"""
    # first, get a larger input array
    gen = np.random.rand
    shape = (1000, 100)
    nest_shape = (30, )
    a = get_matrix_from_numpy(gen, shape, nest_shape)
    # repeat count for each test
    reps = 1
    # then, time both the old and new calculation methods
    old = total(old_calc_tstat, a, _reps=reps)
    new = total(new_calc_tstat, a, _reps=reps)
    msg = "Time elapsed to run %d times on %s is %f seconds."
    print(msg % (reps, 'new method', new))
    print(msg % (reps, 'old method', old))
test_perf()
I am new to for loops and I cannot seem to get this one to work. I have multiple arrays that I want to run through my code. It works for individual arrays, but when I try to run it through a list of arrays it tries to join the arrays together.
I have tried looping in pandas and made multiple attempts at looping in numpy.
Min regret matrix:
for i in [a],[b],[c],[d],[e]:
    # sum columns and rows:
    suma0 = np.sum(a, axis=0)
    suma1 = np.sum(a, axis=1)
    # find the minimum values for rows and columns:
    col_min = np.min(a)
    col_min0 = data.min(0)
    row_min = np.min(a[:44])
    row_min0 = data.min(1)
    # difference or least regret between scenarios and policies:
    p = np.array(a)
    q = np.min(p, axis=0)
    r = np.min(p, axis=1)
    cidx = np.argmin(p, axis=0)
    ridx = np.argmin(p, axis=1)
    cdif = p - q
    rdif = p - r[:, None]
    # find the sum of the rows and columns for the difference arrays:
    sumc = np.sum(cdif, axis=0)
    sumr = np.sum(rdif, axis=1)
    sumr1 = np.reshape(sumr, (44, 1))
    # append the scenario array with the column sums:
    sumcol = np.zeros((45, 10))
    sumcol = np.append([cdif], [sumc])
    sumcol.shape = (45, 10)
    # rank columns:
    order0 = sumc.argsort()
    rank0 = order0.argsort()
    rankcol = np.zeros((46, 10))
    rankcol = np.append([sumcol], [rank0])
    rankcol.shape = (46, 10)
    # append the policy array with row sums:
    sumrow = np.zeros((44, 11))
    sumrow = np.hstack((rdif, sumr1))
    # rank rows:
    order1 = sumr.argsort()
    rank1 = order1.argsort()
    rank1r = np.reshape(rank1, (44, 1))
    rankrow = np.zeros((44, 12))
    rankrow = np.hstack((sumrow, rank1r))
    print(sumrow)
    print(rankrow)
    # add row and column headers for least regret for df0:
    RCP = np.zeros((47, 11))
    RCP = pd.DataFrame(rankcol, columns=column_names1, index=row_names1)
    print(RCP)
    # add row and column headers for least regret for df1:
    RCP1 = np.zeros((45, 13))
    RCP1 = pd.DataFrame(rankrow, columns=column_names2, index=row_names2)
    print(RCP1)
    # export loops to CSV in output folder:
    filepath = os.path.join(output_path, 'out_'+str(index)+'.csv')
    RCP.to_csv(filepath)
    filepath = os.path.join(output_path, 'out1_'+str(index)+'.csv')
    RCP1.to_csv(filepath)
As per your question, please highlight the input, expected output, and error, as this is a base-case example.
x = np.random.randn(2)
x.shape = (2,)
and if we attempt:
x.reshape(44,1)
The error we get is:
ValueError: cannot reshape array of size 2 into shape (44,1)
The reason for this error is simple: we are trying to reshape an array of size 2 into a shape that needs 44 elements. As per the error highlighted, please check the dimensions of your input and expected output.
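For instance, a reshape only succeeds when the total sizes match (a minimal sketch):
import numpy as np
x = np.random.randn(44)  # size 44 == 44 * 1, so the reshape below is valid
y = x.reshape(44, 1)
print(y.shape)           # (44, 1)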
I have a numpy array of strings
names = array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
And corresponding position data
X = array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
How can I sort names using X and Y such that I get out a sorted grid of names with shape (6, 6)? Note that there are essentially 6 unique X and Y positions -- I'm not just arbitrarily choosing 6x6.
names = array([
['p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00'],
['p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01'],
['p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02'],
['p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03'],
['p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04'],
['p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05']])
I realize in this case that I could simply reshape the array, but in general the data will not work out this neatly.
You can use numpy.argsort to get the indexes of the elements of an array after it's sorted. These indices you can then use to sort your names array.
import numpy as np
names = np.array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
X = np.array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = np.array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
x_order = np.argsort(X)
y_order = np.argsort(Y)
names_ordered = names.reshape(6,6)[tuple(np.meshgrid(x_order, y_order))]  # tuple() is required on newer numpy versions
print(names_ordered)
gives the following output:
[['p00x05' 'p00x04' 'p00x03' 'p00x02' 'p00x01' 'p00x00']
['p01x05' 'p01x04' 'p01x03' 'p01x02' 'p01x01' 'p01x00']
['p02x05' 'p02x04' 'p02x03' 'p02x02' 'p02x01' 'p02x00']
['p03x05' 'p03x04' 'p03x03' 'p03x02' 'p03x01' 'p03x00']
['p04x05' 'p04x04' 'p04x03' 'p04x02' 'p04x01' 'p04x00']
['p05x05' 'p05x04' 'p05x03' 'p05x02' 'p05x01' 'p05x00']]
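For the more general case mentioned in the question, where the coordinates do not already come grid-ordered, one hedged approach (assuming hypothetical arrays x_all and y_all holding one coordinate pair for each of the 36 names) is np.lexsort, which sorts by the last key first:
import numpy as np
# x_all, y_all: assumed per-name coordinates, one pair per name (hypothetical)
order = np.lexsort((x_all, y_all))  # primary sort by y_all, then by x_all within ties
names_grid = names[order].reshape(6, 6)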
What is the most efficient way to create a dask.array from a dask.Series of list?
The series consists of 5 million lists of 300 elements each.
It is currently divided into 500 partitions.
Currently I am trying:
pt = [delayed(np.array)(y)
      for y in
      [delayed(list)(x)
       for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (vec.size.compute(), 300), dtype=float)
The idea is to convert each partition into a numpy array and stitch
those together into a dask.array.
This code is taking forever to run though.
A numpy array can be built from this data quite quickly in a sequential fashion, as long as there is enough RAM.
I think that you are on the right track using dask.delayed. However, calling list on the series is probably not ideal. I would create a function that converts one of your series partitions into a numpy array and then map that through delayed.
def convert_series_to_array(pandas_series):  # make this as fast as you can
    ...
    return numpy_array
L = dask_series.to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
x = da.concatenate(arrays, axis=0)
Also, regarding this line:
da = delayed(dask.array.concatenate)(pt, axis=1)
You should never call delayed on a dask function. They are already lazy.
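To make that concrete, here is a small self-contained sketch (the stand-in arrays are my own, not from the answer above):
import numpy as np
import dask.array as da
from dask import delayed

# stand-in for the delayed-backed arrays built above
arrays = [da.from_array(np.ones((2, 300)), chunks=(2, 300)) for _ in range(3)]

# anti-pattern: wrapping the already-lazy da.concatenate in delayed yields a
# Delayed object and hides the array's shape/dtype metadata
lazy = delayed(da.concatenate)(arrays, axis=0)  # avoid this

# preferred: call it directly; the result is already a lazy dask.array
x = da.concatenate(arrays, axis=0)
print(x.shape)        # (6, 300), known without computing anything
result = x.compute()  # work happens only here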
Looking at this with some dummy data. Building on @MRocklin's answer (and molding it more to my specific use case), let's say that your vectors are actually lists of ints instead of floats, and each list is stored as a string. We take the series, transform it, and store it in a zarr array file.
# imports needed to make this snippet self-contained
import ast
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask import delayed

# create dummy data
vectors = [ np.random.randint(low=0,high=100,size=300).tolist() for _ in range(1000) ]
df = pd.DataFrame()
df['vector'] = vectors
df['vector'] = df['vector'].map(lambda x:f"{x}")
df['foo'] = 'bar'
ddf = dd.from_pandas( df, npartitions=100 )
# transform series data to numpy array
def convert_series_to_array( series ): # make this as fast as you can
    series_ = [ast.literal_eval( i ) for i in series]
    return np.stack(series_, axis=0)
L = ddf['vector'].to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=np.int64) for x in L]
x = da.concatenate(arrays, axis=0)
# store result into a zarr array
x.compute_chunk_sizes().to_zarr( 'toy_dataset.zarr', '/home/user/Documents/', overwrite=True )
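Assuming the store landed where intended, reading the array back later is symmetric (a sketch):
import dask.array as da
x = da.from_zarr('toy_dataset.zarr')  # lazy handle onto the stored array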
I'm looking for a way to read this csv into Python 2.7 and turn it into a (3, 22000) array. For some reason I haven't been able to do it, no matter which way I try: I either get a group of strings in an array that I can't convert, or the array seen below that won't convert to floats or allow computations to be done on it. Any help would be appreciated. Thanks.
For the record, it says the shape is (22000,), which I'm also unsure about.
In [126]: import csv
import numpy as np

with open("Data.csv") as sd:
    ri = []
    dv = []
    for row in csv.reader(sd):
        if row != ["ccx","ccy","ccz","cellVolumes","Cell Type"]:
            nrow = []
            for val in row[0:3]:
                val = float(val)
                nrow.append(val)
            ri.append(nrow)
            nrow = []
            for val in row[3:4]:
                val = float(val)
                nrow.append(val)
            dv.append(nrow)
ri = np.array(ri)
ri
Out[126]: array([[-0.179967, -0.38936, -0.46127],
       [-0.0633236, -0.407683, -0.542979],
       [-0.125841, -0.494202, -0.412042],
       ...,
       [-0.0116821, 0.764493, 0.573541],
       [0.630377, 0.469657, 0.442017],
       [0.248253, 0.615365, 0.354134]], dtype=object)
(from the helpful comments)
Check the length of those sublists. If they are all the same, I'd expect a 2d array; but if they differ (most 3, but some 0, 2, 4, etc.), then the best it can do is give you a 1d array of 'objects' (the lists).
I would just do [len(x) for x in ri] before passing it to np.array. Maybe apply a max and min. A list comprehension like that won't take long.
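A hedged sketch of both the diagnostic (reusing ri from the snippet above) and a possible shortcut, assuming the file is numeric apart from the header row and the Cell Type column:
import numpy as np

# diagnostic: if row lengths differ, np.array can only build an object array
lengths = [len(x) for x in ri]
print(min(lengths), max(lengths))

# shortcut: let numpy parse just the numeric columns, skipping the header
data = np.genfromtxt("Data.csv", delimiter=",", skip_header=1, usecols=(0, 1, 2))
print(data.shape)    # (22000, 3) for clean input
print(data.T.shape)  # (3, 22000), the shape asked about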