I am trying to take a running product along each diagonal of a pandas DataFrame, and I am not sure how to proceed in a computationally reasonable way.
df = [ 3  4  5
       6  7  8
       9 10 11]

output_df = [231 32  5
              60 77  8
               9 10 11]
Explanation - looking to compute 3 * 7 * 11 for the first element, 4 * 8 for the second element, 7 * 11 for the fifth element, etc. In other words, each element becomes the product of itself and every element below and to the right of it along the same diagonal.
Note: The matrix I am working on is not a square matrix, but a rectangular matrix.
Here's one based on NumPy -
import numpy as np

def cumprod_upper_diag(a):
    # in place: reverse cumprod along each diagonal above the main one
    m, n = a.shape
    mask = ~np.tri(m, n, dtype=bool)      # strictly upper-triangular positions
    p = np.ones((m, n), dtype=a.dtype)
    p[mask[:, ::-1]] = a[mask]            # each upper diagonal becomes a top-aligned column
    a[mask] = p[::-1].cumprod(0)[::-1][mask[:, ::-1]]  # reversed columnwise cumprod, scattered back
    return a
a = df.to_numpy(copy=False)  # for older pandas versions: a = df.values
out = a.copy()
cumprod_upper_diag(out)      # diagonals above the main one
cumprod_upper_diag(out.T)    # diagonals below it, via the transpose
# main diagonal: reverse cumprod over every (n+1)-th element of the flat view
out.ravel()[::out.shape[1]+1] = out.ravel()[::out.shape[1]+1][::-1].cumprod()[::-1]
out_df = pd.DataFrame(out)
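For the 3x3 sample above, this should reproduce the expected output:
print(out_df)
     0   1   2
0  231  32   5
1   60  77   8
2    9  10  11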
You can use a sparse diagonal matrix here with some finicking. This assumes all elements in your original matrix are non-zero; otherwise it will not work, because genuine zeros are indistinguishable from the zero padding in the sparse diagonal bands.
from scipy import sparse

a = df.to_numpy()
b = sparse.dia_matrix(a)                         # each diagonal stored as one row of b.data
c = b.data[:, ::-1]                              # reverse so cumprod runs from the far end of each diagonal
cp = np.cumprod(np.where(c != 0, c, 1), axis=1)  # treat the zero padding as 1
b.data = cp[:, ::-1]
b.A
array([[231, 32, 5],
[ 60, 77, 8],
[ 9, 10, 11]], dtype=int64)
As Chris mentioned, this is cumprod in reverse order:
# stack to long format for groupby; reversed so cumprod runs from the bottom-right corner
new_df = df.stack().reset_index()[::-1]
# elements on the same diagonal share the same row - column difference
diags = new_df['level_0'] - new_df['level_1']
# cumulative product within each diagonal
new_df['out'] = new_df.groupby(diags)[0].cumprod()
# pivot back to the original shape
new_df.pivot(index='level_0', columns='level_1', values='out')
output:
level_1 0 1 2
level_0
0 231 32 5
1 60 77 8
2 9 10 11
Here's a method that operates on the DataFrame in place.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[3, 4, 5], [6, 7, 8], [9, 10, 11]])

m, n = df.shape
for i in range(-m + 1, n):                      # every diagonal offset
    ri, rj = max(-i, 0), min(m - 1, n - i - 1)  # row span of diagonal i
    ci, cj = max( i, 0), min(n - 1, m + i - 1)  # column span of diagonal i
    np.fill_diagonal(df.values[ri:rj+1, ci:cj+1],
                     df.values.diagonal(i)[::-1].cumprod()[::-1])
print(df)
Result:
0 1 2
0 231 32 5
1 60 77 8
2 9 10 11
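A caveat: this relies on df.values returning a writable view of the underlying data, which holds for a single-dtype frame in classic pandas; with copy-on-write enabled (the default in pandas 3.x) you would instead operate on arr = df.to_numpy() and rebuild the DataFrame from the result.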
I am trying to make a kind of running average: out of 90 rows, every 3 rows in column A should produce an average, and that average should fill the same rows in column B.
For example:
From this:
   A  B
   2  0
   3  0
   4  0
   7  0
   9  0
   8  0
to this:
   A  B
   2  3
   3  3
   4  3
   7  8
   9  8
   8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows one it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
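As for why the original loop printed the right y but left B unchanged: df['B'][x] = y is chained indexing, so pandas may apply the assignment to a temporary copy of the column rather than to df itself (this is what SettingWithCopyWarning warns about, and under copy-on-write it never propagates). A single-step .loc assignment writes to the frame reliably:
df.loc[x, 'B'] = y  # one indexing operation, modifies df itself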
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)   # last row index of this batch
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
In the corner case where len(df) % batch_size != 0, this takes the average of the leftover rows.
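For the eight-row sample above, the last partial batch is [9, 10], so its two rows average to 9.5; the expected result (dtype promotion aside) is:
    A    B
0   2  3.0
1   3  3.0
2   4  3.0
3   7  8.0
4   9  8.0
5   8  8.0
6   9  9.5
7  10  9.5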
I found multiple questions with almost the same title, but none of them has working code.
I have the following 1D array a, which I reshape into the 3D array b; I then want to reshape it back to a 1D array with explicit loops.
import numpy as np

depth = 2
row = 3
col = 4

a = np.arange(24)
b = a.reshape((depth, row, col))
c = np.zeros(row * col * depth)
print(a)
print(b)

index = 0
for i in range(depth):
    for j in range(row):
        for k in range(col):
            c[k + j * row + i * (col * row)] = b[i][j][k]  # ???
            # c[k + row * (j + depth * i)] = b[i][j][k]
            # c[index] = b[i][j][k]
            # index += 1
print(c)
# [ 0. 1. 2. 4. 5. 6. 8. 9. 10. 11. 0. 0. 12. 13. 14. 16. 17. 18.
# 20. 21. 22. 23. 0. 0.]
I don't get how Flat[x + WIDTH * (y + DEPTH * z)] = Original[x, y, z] applies in this case.
EDIT
I am using this code (the equivalent) in a CUDA code, so I need the loops. I am not going to use any numpy magic functions.
That's because the array you created uses row-major (C) order, not the column-major order your formula assumes.
The correct way to compare is
depth = 2
row = 3
col = 4

a = np.arange(24)
b = a.reshape((depth, row, col))
b.strides

i, j, k = np.indices(b.shape)
assert np.all(a[(i * row + j) * col + k] == b[i, j, k])
If you want more information on how the matrix is laid out, you can check b.strides. Here b.strides == (48, 16, 4) (assuming 4-byte integers); it gives the coefficient, in bytes, of each index, e.g. i*48 + j*16 + k*4 is the byte offset of the element b[i, j, k].
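So for the loop in the question, the innermost index k carries a factor of 1, j a factor of col, and i a factor of row*col; a corrected sketch of the flattening loop:
for i in range(depth):
    for j in range(row):
        for k in range(col):
            # row-major (C) order: offset = i*(row*col) + j*col + k
            c[k + col * (j + row * i)] = b[i][j][k]
print(c)  # now matches a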
This column-major order is called Fortran order, and you can get NumPy to reshape assuming it by passing order='F':
bf = a.reshape((depth, row, col), order='F')
assert np.all(a[i + depth * (j + row * k)] == bf)
Then bf will be
array([[[ 0, 6, 12, 18],
[ 2, 8, 14, 20],
[ 4, 10, 16, 22]],
[[ 1, 7, 13, 19],
[ 3, 9, 15, 21],
[ 5, 11, 17, 23]]])
Simply reshape again to your original dimension or flatten the resulting array:
depth = 2
row = 3
col = 4
a = np.arange(24)
b = a.reshape((depth, row, col))
print(b.reshape((24,)))
print(b.flatten())
Output:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
I have a column (binary) in a dataframe (df) of the form:
Vector
0
1
0
1
0
.
.
.
I am using this in a binary classification model. My objective is to take these 0's and 1's and move them into two separate lists, which then get translated into NumPy arrays. As an example, I would like to move the first 5 items from Vector into X, then the 6th item into Y; then the next 5 items into X, and the following 6th item into Y, until the end of the df (currently 200k rows).
My first instinct is to write a for loop for this (but I know this is hugely inefficient):
for i in range(0, df.shape[0] - 6):
    # as we iterate through the df
    # we will use a step of 5
    if i_cnt > 5:
        y = df['Vector'].iloc[i]
        Y.append(y)
        i_cnt = 1
    else:
        x = df['Vector'].iloc[i]
        X.append(x)
        i_cnt += 1
There is definitely a faster way to do this, and I am hoping someone knows what it is.
Build an array of positions from the length of the index, take it modulo 6, and compare to select X and Y:
# sample data for easy verification
df = pd.DataFrame({'Vector': range(20)})
idx = np.arange(len(df)) % 6
X = df.loc[idx < 5, 'Vector']
print (X)
0 0
1 1
2 2
3 3
4 4
6 6
7 7
8 8
9 9
10 10
12 12
13 13
14 14
15 15
16 16
18 18
19 19
Y = df.loc[idx == 5, 'Vector']
print (Y)
5 5
11 11
17 17
If a different output format is needed - X as a 2D array - use reshape with -1 to infer the number of rows, and select the columns by indexing:
df = pd.DataFrame({'Vector': range(18)})
arr = df['Vector'].to_numpy().reshape(-1, 6)
X = arr[:, :-1]
Y = arr[:, -1]
print (X)
[[ 0 1 2 3 4]
[ 6 7 8 9 10]
[12 13 14 15 16]]
print (Y)
[ 5 11 17]
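Note that the reshape requires the number of rows to be a multiple of 6; with e.g. 200k rows you could drop the incomplete final group first, along these lines:
arr = df['Vector'].to_numpy()
arr = arr[:len(arr) // 6 * 6].reshape(-1, 6)  # trim leftover rows, then reshape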
For k = 5 + 1 = 6,
k = 6
n_rows = len(df.index)
n_samples = n_rows // k
X_and_y = df.Vector.to_numpy().reshape(n_samples, k)
X = X_and_y[:, :-1]
y = X_and_y[:, -1]
We reshape the column to a (n_samples, 5 + 1) array where n_samples = n_rows / 6, then we take all but the last column into X and the last column into y.
e.g.
>>> df = pd.DataFrame(np.random.randint(0, 2, size=18), columns=["Vector"])
>>> df
Vector
0 0
1 0
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 1
15 0
16 0
17 1
>>> # after
>>> X
array([[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])
>>> y
array([0, 1, 1])
You can try
X = list(df[df.index % 6 < 5]["Vector"])
y = list(df[df.index % 6 == 5]["Vector"])
I need to populate a dataframe with a matrix built from a single list, but the math and python syntax are beyond me. I essentially need to perform some math operations as if the same list were both the rows and the columns.
So it should look something like this....
#Input
list = [1, 2, 3, 4]
create a matrix using some math on the list, like matrix[i, j] = list[i] * list[j]
#output
np.matrix([[1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12], [4, 8, 12, 16]])
df = pd.DataFrame(matrix)
Broadcasted multiplication will work here:
arr = np.array([1, 2, 3, 4])
pd.DataFrame(arr * arr[:,None])
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
Alternatively, most NumPy arithmetic ufuncs define an .outer method:
pd.DataFrame(np.multiply.outer(arr, arr))
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
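The same pattern works with any binary ufunc, for example:
pd.DataFrame(np.add.outer(arr, arr))  # pairwise sums instead of products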
data = [1,2,3,4]
Nested for loops would work:
import numpy as np

a = []
for n in data:
    row = []
    for m in data:
        math = some_operation_on(m, n)  # e.g. m * n for your example
        row.append(math)
    a.append(row)
a = np.array(a)
For simple operations like your example use numpy.meshgrid.
In [21]: a = [1,2,3,4]
In [22]: x,y = np.meshgrid(a,a)
In [23]: x*y
Out[23]:
array([[ 1, 2, 3, 4],
[ 2, 4, 6, 8],
[ 3, 6, 9, 12],
[ 4, 8, 12, 16]])
I have a very large dataframe
in>> all_data.shape
out>> (228714, 436)
What I would like to do efficiently is multiply many of the columns together. I started with a for loop and a list of columns - the most efficient way I have found is:
from itertools import combinations

newcolnames = list(all_data.columns.values)
newcolnames = newcolnames[0:87]
# make cross products (the columns I want to operate on are the first 87)
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data[c1] * all_data[c2]
The problem, as one may guess, is that I have 87 columns, which would give on the order of 3800 new columns (yes, this is what I intended). Both my Jupyter notebook and IPython shell choke on this calculation. I need to figure out a better way to undertake this multiplication.
Is there a more efficient way to vectorize and/or process this? Perhaps using a NumPy array (my dataframe has been processed and now contains only numbers and NaNs; it started with categorical variables).
As you have mentioned NumPy in the question, that might be a viable option here, especially because you might want to work in the 2D space of NumPy instead of 1D columnar processing with pandas. To start off, you can convert the dataframe to a NumPy array with a call to np.array, like so -
arr = np.array(df) # df is the input dataframe
Now, you can get the pairwise combinations of the column IDs and then index into the columns and perform column-wise multiplications and all of this would be done in a vectorized manner, like so -
idx = np.array(list(combinations(newcolnames, 2)))
out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
Sample run -
In [117]: arr = np.random.randint(0,9,(4,8))
...: newcolnames = [1,4,5,7]
...: for c1, c2 in combinations(newcolnames, 2):
...: print arr[:,c1] * arr[:,c2]
...:
[16 2 4 56]
[64 2 6 16]
[56 3 0 24]
[16 4 24 14]
[14 6 0 21]
[56 6 0 6]
In [118]: idx = np.array(list(combinations(newcolnames, 2)))
...: out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
...:
In [119]: out.T
Out[119]:
array([[16, 2, 4, 56],
[64, 2, 6, 16],
[56, 3, 0, 24],
[16, 4, 24, 14],
[14, 6, 0, 21],
[56, 6, 0, 6]])
Finally, you can create the output dataframe with proper column headers (if needed), like so -
>>> headers = ['{0}*{1}'.format(idx[i,0],idx[i,1]) for i in range(len(idx))]
>>> out_df = pd.DataFrame(out,columns = headers)
>>> df
0 1 2 3 4 5 6 7
0 6 1 1 6 1 5 6 3
1 6 1 2 6 4 3 8 8
2 5 1 4 1 0 6 5 3
3 7 2 0 3 7 0 5 7
>>> out_df
1*4 1*5 1*7 4*5 4*7 5*7
0 1 5 3 5 3 15
1 4 3 8 12 32 24
2 0 6 3 0 0 18
3 14 0 14 0 49 0
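Note that, vectorization aside, the output itself is large: 228714 rows times C(87, 2) = 3741 column pairs is roughly 0.86 billion values, around 6.8 GB as float64, so a narrower dtype or chunked processing may be needed whichever method you use.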
You can try the df.eval() method:
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data.eval('{} * {}'.format(c1, c2))
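Note that eval parses the column names as an expression, so this assumes they are valid Python identifiers; names containing spaces or punctuation need backtick quoting (pandas >= 0.25), e.g. all_data.eval('`col a` * `col b`') for hypothetical columns 'col a' and 'col b'.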