Reshape a NumPy array based on values in its columns - python

What is the most compact way to make a matrix from a table with numpy?
I have a table of values where the 1st column is x, the 2nd is y and the 3rd is z. The z values are all unique, and each (x, y) pair comes from the combinations of the x and y values. Here is an example:
0.0 0.0 949219540.0
0.0 0.5 944034910.0
0.0 1.0 938508543.0
0.0 1.5 930093905.0
0.0 2.0 922076484.0
50.0 0.0 911497861.0
50.0 0.5 903224763.0
50.0 1.0 900406431.0
50.0 1.5 890658529.0
50.0 2.0 880907404.0
100.0 0.0 883527077.0
100.0 0.5 911683042.0
........ # and so on
Basically this is a 9x5 matrix of z values, with the x values as row labels and the y values as column headers:
0.0 0.0 0.5 1.0 1.5 2.0
0.0 0.949 0.944 0.939 0.93 0.922
50.0 0.911 0.903 0.9 0.891 0.881
100.0 0.884 0.912 0.84 0.839 0.851
150.0 0.85 0.84 0.799 0.844 0.863
200.0 0.84 0.79 0.806 0.847 0.745
250.0 0.789 0.78 0.748 0.719 0.759
300.0 0.761 0.783 0.714 0.766 0.698
350.0 0.737 0.757 0.792 0.705 0.665
400.0 0.801 0.797 0.57 0.628 0.532
Currently I do this by building set(x) and set(y) to get rid of duplicates, reshaping z to the lengths of the unique x and y values, and then using vstack and hstack to concatenate x, y and z. I believe this is quite a common operation in data processing, so maybe it has a one-step solution. Moreover, my way is fragile when x and y are not in order, because set() can break the matrix logic.
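A minimal sketch of this manual approach (using np.unique and np.lexsort rather than set(), so unordered rows are still handled; x, y and z are assumed to be the three columns as 1-D arrays):
import numpy as np

xu, yu = np.unique(x), np.unique(y)          # sorted unique axis values
order = np.lexsort((y, x))                   # sort rows by x first, then y
zmat = z[order].reshape(len(xu), len(yu))    # one row per x value, one column per y value

# attach the axis values as headers, matching the matrix layout above
table = np.vstack((np.concatenate(([0.0], yu)),
                   np.column_stack((xu, zmat))))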

This is basically the opposite of numpy.meshgrid.
For a one-liner, you can use scipy.interpolate.griddata:
from scipy.interpolate import griddata

grid = griddata(list(zip(x, y)), z,
                (x.reshape((len(set(y)), len(set(x)))),
                 y.reshape((len(set(y)), len(set(x))))),
                method='nearest')
Longer demonstration: let's say that we have a list of entries that completely covers a matrix. In numpy, such a grid is obtained with meshgrid:
In [1]: import numpy as np
In [2]: a = np.arange(0, 5)
In [3]: b = np.arange(6, 9)
In [4]: aa, bb = np.meshgrid(a, b)
And assign values to each element of the mesh:
In [5]: x, y = aa.flatten(), bb.flatten()
In [6]: z = np.ones(len(x))
These are the starting x, y, and z of the OP.
Now let's use griddata to get all the values into a matrix. griddata is much more powerful than this, but with only one point per grid cell and a clearly equally spaced grid, the matrix comes out exact.
In [7]: points = list(zip(x, y))
In [8]: from scipy.interpolate import griddata
In [9]: grid = griddata(points, z,
   ...:                 (x.reshape((len(set(y)), len(set(x)))),
   ...:                  y.reshape((len(set(y)), len(set(x))))),
   ...:                 method='nearest')
In [10]: grid
Out[10]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
In [11]: a, b = np.unique(x), np.unique(y)
In [12]: np.hstack((np.concatenate(([0], b)).reshape((1, len(b) + 1)).T, np.vstack((a, grid))))
Out[12]:
array([[ 0.,  0.,  1.,  2.,  3.,  4.],
       [ 6.,  1.,  1.,  1.,  1.,  1.],
       [ 7.,  1.,  1.,  1.,  1.,  1.],
       [ 8.,  1.,  1.,  1.,  1.,  1.]])
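A variation of the one-liner that does not depend on the row order of the input (a sketch, assuming x, y and z are the three raw columns as 1-D arrays): build the evaluation grid from the sorted unique values with np.meshgrid instead of reshaping x and y.
import numpy as np
from scipy.interpolate import griddata

xu, yu = np.unique(x), np.unique(y)      # sorted unique axis values
X, Y = np.meshgrid(xu, yu)               # evaluation grid, shape (len(yu), len(xu))
grid = griddata((x, y), z, (X, Y), method='nearest')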

Related

How to count occurrence of each unique ID in each bin in a 2D histogram (python or pandas)

I have a csv file and I would like to create a 2d histogram where the value in each bin depends on the unique ID. For example (see below), for the range 0<x<1 and 1<y<2, the value is 2 (A, B) not 3 (A, A, B) because A appears twice. Thanks!
ID    x    y
A   0.5  1.4
A   0.6  1.6
A   1.2  2.2
B   0.7  1.7
C   4.4  3.5
C   3.1  3.7
A bin i_x < x < j_x, i_y < y < j_y can be uniquely identified by the pair (i_x, i_y); this tuple is unique for each bin. i_x and i_y are simply the floor values of x and y. For example, for the row (x, y) = (0.5, 1.4) the bin is 0 < 0.5 < 1, 1 < 1.4 < 2, so i_x = 0 = floor(0.5) and i_y = 1 = floor(1.4).
Approach:
Find i_x and i_y for x and y columns.
Group the dataframe using the key (i_x, i_y) and count unique IDs in each group.
Code:
>>> df
ID x y
0 A 0.5 1.4
1 A 0.6 1.6
2 A 1.2 2.2
3 B 0.7 1.7
4 C 4.4 3.5
5 C 3.1 3.7
df['bin_x'] = np.floor(df.x).astype(int)
df['bin_y'] = np.floor(df.y).astype(int)
df = (df.groupby(['bin_x', 'bin_y'], as_index=False)
        .agg(cnt=('ID', 'nunique')))
>>> df
bin_x bin_y cnt
0 0 1 2
1 1 2 1
2 3 3 1
3 4 3 1
If you are defining your histogram as a numpy array of size (5, 5), then we can assign the cnt values to that array and get the desired histogram.
histogram = np.zeros((5, 5))
histogram[df.bin_x, df.bin_y] = df.cnt
>>> histogram
array([[0., 2., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]])
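For completeness, a compact alternative is pd.crosstab with a nunique aggregation, reindexed to the fixed 5x5 grid used above. This is a sketch; df_raw is a hypothetical name for the original ungrouped frame with columns ID, x, y (df itself was reassigned by the groupby above).
# df_raw: the original ID/x/y frame (hypothetical name)
hist = (pd.crosstab(np.floor(df_raw.x).astype(int), np.floor(df_raw.y).astype(int),
                    values=df_raw.ID, aggfunc=pd.Series.nunique)
          .reindex(index=range(5), columns=range(5))
          .fillna(0))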

Tuples of arrays 1D and 2D to dataframe with python

This is what model.predict returns. How can I convert this tuple into columns of a dataframe?
(array([1., 1., 1., ..., 1., 1., 1.]),
 array([[0.46502338, 0.53497662],
        [0.47072865, 0.52927135],
        [0.4696557 , 0.5303443 ],
        ...,
        [0.47139825, 0.52860175],
        [0.46367829, 0.53632171],
        [0.46586898, 0.53413102]]))
<class 'tuple'>
Neither of these works for me:
pd.DataFrame(dict(class_pred=tuple[0], prob_0=tuple[1], prob_1=tuple[2]))
pd.DataFrame(np.column_stack(tuple),columns=['class_pred','prob_0','prob_1'])
I would like to obtain something like this:
class_pred prob_0 prob_1
1 0.470728 0.5292713
AniSkywalker's solution works perfectly.
type(data)
print(data)
tuple
(array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
 array([[0.46502338, 0.53497662],
        [0.47072865, 0.52927135],
        [0.4696557 , 0.5303443 ],
        [0.46511921, 0.53488079],
        [0.46739934, 0.53260066],
        [0.47387646, 0.52612354],
        [0.4737461 , 0.5262539 ],
        [0.47052631, 0.52947369],
        [0.47658316, 0.52341684],
        [0.47222654, 0.52777346]]))
df_pred = pd.DataFrame(data=dict(pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
print(df_pred)
pred prob_0 prob_1
0 1.0 0.465023 0.534977
1 1.0 0.470729 0.529271
2 1.0 0.469656 0.530344
3 1.0 0.465119 0.534881
4 1.0 0.467399 0.532601
5 1.0 0.473876 0.526124
6 1.0 0.473746 0.526254
7 1.0 0.470526 0.529474
8 1.0 0.476583 0.523417
9 1.0 0.472227 0.527773
I'm assuming your data is of the form ((n), (n, 2)) so that:
import numpy as np
n = 5
data = (np.random.rand(n), np.random.rand(n, 2))
provides a reasonable estimate of what your output looks like.
Let's say that data is:
(array([0.27856312, 0.66255123, 0.47976175, 0.59381106, 0.82096555]),
 array([[0.53719357, 0.55803381],
        [0.5749893 , 0.09712089],
        [0.91607789, 0.21579499],
        [0.50163898, 0.39188127],
        [0.60427654, 0.07801227]]))
Your dict method actually works with one modification:
import pandas as pd
df = pd.DataFrame(data=dict(class_pred=data[0], prob_0=data[1][:,0], prob_1=data[1][:,1]))
Notice that prob_0 and prob_1 are both derived from the second tuple element, but using Numpy's column indexing we can split the individual arrays as you described.
Let's take data[1][:,0], for example: first, we select the second element of the data tuple, which is the (n, 2) matrix. Then, we select the first column (0) from all rows (:). The result is a vector of the first element of every row in that matrix.
Using my made-up numbers, df.head() should give you:
class_pred prob_0 prob_1
0 0.278563 0.537194 0.558034
1 0.662551 0.574989 0.097121
2 0.479762 0.916078 0.215795
3 0.593811 0.501639 0.391881
4 0.820966 0.604277 0.078012
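As a side note, the np.column_stack attempt from the question also works once the tuple is referenced by its variable name (data here) rather than the builtin name tuple, since column_stack turns the 1-D prediction array into a column and stacks it next to the (n, 2) probability matrix. A sketch:
import numpy as np
import pandas as pd

# data is the (class_pred, probabilities) tuple returned by the model
df = pd.DataFrame(np.column_stack(data),
                  columns=['class_pred', 'prob_0', 'prob_1'])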

How to change dataframe cells values with "coordinate-like" indexes stored in two lists/vectors/series?

Apologies if this has been asked before; somehow I am not able to find the answer to this.
Let's say I have two lists of values:
rows = [0,1,2]
cols = [0,2,3]
that represent indexes of rows and columns respectively. The two lists combined signify coordinates in the matrix, i.e. (0, 0), (1, 2), (2, 3).
I would like to use those coordinates to change specific cells of the dataframe without using a loop.
In numpy, this is trivial:
data = np.ones((4,4))
data[rows, cols] = np.nan
array([[nan,  1.,  1.,  1.],
       [ 1.,  1., nan,  1.],
       [ 1.,  1.,  1., nan],
       [ 1.,  1.,  1.,  1.]])
But in pandas, it seems I am stuck with a loop:
df = pd.DataFrame(np.ones((4,4)))
for _r, _c in zip(rows, cols):
    df.iat[_r, _c] = np.nan
Is there a way to use two vectors of coordinate-like indexes to directly modify cells in pandas?
Please note that the answer is not to use iloc instead; that selects the intersection of entire rows and columns.
Very simple! Exploit the fact that pandas is built on top of numpy and use DataFrame.values
df.values[rows, cols] = np.nan
Output:
0 1 2 3
0 NaN 1.0 1.0 1.0
1 1.0 1.0 NaN 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 1.0 1.0
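One caveat worth noting: writing through df.values only sticks when the frame is backed by a single homogeneous block, as in this all-float example; with mixed dtypes, .values returns a consolidated copy and the assignment is silently lost. A more defensive sketch is to do the assignment on an explicit array and rebuild the frame:
import numpy as np
import pandas as pd

rows, cols = [0, 1, 2], [0, 2, 3]
df = pd.DataFrame(np.ones((4, 4)))

arr = df.to_numpy(copy=True)      # explicit copy, independent of the internal block layout
arr[rows, cols] = np.nan
df = pd.DataFrame(arr, index=df.index, columns=df.columns)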

pandas - cumulative median

I was wondering if there is any pandas equivalent to cumsum() or cummax() etc. for median: e.g. cummedian().
So that if I have, for example this dataframe:
a
1 5
2 7
3 6
4 4
what I want is something like:
df['a'].cummedian()
which should output:
5
6
6
5.5
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding.median by a huge margin. Divakar's method is memory intensive and suffers memory blowout at this size of input.
We could create nan filled subarrays as rows with a strides based function, like so -
def nan_concat_sliding_windows(x):
    n = len(x)
    add_arr = np.full(n-1, np.nan)
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext)-n+1
    s = x_ext.strides[0]
    return strided(x_ext, shape=(nrows,n), strides=(s,s))
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan,  nan,  nan,   5.],
       [ nan,  nan,   5.,   6.],
       [ nan,   5.,   6.,   7.],
       [  5.,   6.,   7.,   4.]])
Thus, to get sliding median values for an array x, we would have a vectorized solution, like so-
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
a
1 5
2 7
3 6
4 4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
A faster solution for the specific cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
   ...: import pandas as pd
   ...: def cummedian():
   ...:     l = []
   ...:     info = [0, True]
   ...:     def inner(n):
   ...:         bisect.insort(l, n)
   ...:         info[0] += 1
   ...:         info[1] = not info[1]
   ...:         median = info[0] // 2
   ...:         if info[1]:
   ...:             return (l[median] + l[median - 1]) / 2
   ...:         else:
   ...:             return l[median]
   ...:     return inner
   ...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
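For readability, essentially the same bisect-based closure outside the timeit setup string (a sketch; it keeps a sorted list of the values seen so far and returns the running median after each insertion):
import bisect
import pandas as pd

def cummedian():
    sorted_vals = []                     # values seen so far, kept sorted
    def inner(n):
        bisect.insort(sorted_vals, n)
        k = len(sorted_vals)
        mid = k // 2
        if k % 2:                        # odd count: middle element
            return sorted_vals[mid]
        return (sorted_vals[mid] + sorted_vals[mid - 1]) / 2
    return inner

df = pd.DataFrame({'a': [5, 7, 6, 4]}, index=[1, 2, 3, 4])
df['cummedian'] = df['a'].apply(cummedian())   # 5.0, 6.0, 6.0, 5.5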

Fastest way to calculate difference in all columns

I have a dataframe of all float columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
# A B C D
# 0 0.0 1.0 2.0 3.0
# 1 4.0 5.0 6.0 7.0
# 2 8.0 9.0 10.0 11.0
I would like to calculate column-wise differences for all combinations of columns (e.g., A-B, A-C, B-C, etc.).
E.g., the desired output would be something like:
A_B A_C A_D B_C B_D C_D
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Since the number of columns may be large, I'd like to do the calculations as efficiently/quickly as possible. I assume I'll get a big speed bump by converting the dataframe to a numpy array first so I'll do that, but I'm wondering if there are any other strategies that might result in large performance gains. Maybe some matrix algebra or multidimensional data format trick that results in not having to loop through all unique combinations. Any suggestions are welcome. This project is in Python 3.
Listed in this post are two NumPy approaches for performance: one fully vectorized approach and another with one loop.
Approach #1
def numpy_triu1(df):
    a = df.values
    r,c = np.triu_indices(a.shape[1],1)
    cols = df.columns
    nm = [cols[i]+"_"+cols[j] for i,j in zip(r,c)]
    return pd.DataFrame(a[:,r] - a[:,c], columns=nm)
Sample run -
In [72]: df
Out[72]:
A B C D
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [78]: numpy_triu1(df)
Out[78]:
A_B A_C A_D B_C B_D C_D
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Approach #2
If we are okay with an array as output, or a dataframe without specialized column names, here's another -
def pairwise_col_diffs(a):   # a would be df.values
    n = a.shape[1]
    N = n*(n-1)//2
    idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
    start, stop = idx[:-1], idx[1:]
    out = np.empty((a.shape[0],N),dtype=a.dtype)
    for j,i in enumerate(range(n-1)):
        out[:, start[j]:stop[j]] = a[:,i,None] - a[:,i+1:]
    return out
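If named columns are wanted after all, a small usage sketch (assuming the df from the question): the output column order of pairwise_col_diffs matches np.triu_indices, so the names from Approach #1 can be reused -
r, c = np.triu_indices(df.shape[1], 1)
names = [df.columns[i] + "_" + df.columns[j] for i, j in zip(r, c)]
out_df = pd.DataFrame(pairwise_col_diffs(df.values), columns=names)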
Runtime test
Since OP has mentioned that multi-dim array output would work for them as well, here are the array based approaches from other author(s) -
# @Allen's soln
def Allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n),2))).T
    return arr[:,idx[0]]-arr[:,idx[1]]
# @DYZ's soln
def DYZ(arr):
    result = np.concatenate([(arr.T - arr.T[x])[x+1:] \
                             for x in range(arr.shape[1])]).T
    return result
The pandas-based solution from @Gerges Dib's post wasn't included, as it came out very slow compared to the others.
Timings -
We will use three dataset sizes - 100, 500 and 1000 :
In [118]: df = pd.DataFrame(np.random.randint(0,9,(3,100)))
...: a = df.values
...:
In [119]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
1000 loops, best of 3: 258 µs per loop
1000 loops, best of 3: 1.48 ms per loop
1000 loops, best of 3: 284 µs per loop
In [121]: df = pd.DataFrame(np.random.randint(0,9,(3,500)))
...: a = df.values
...:
In [122]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 2.56 ms per loop
10 loops, best of 3: 39.9 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [123]: df = pd.DataFrame(np.random.randint(0,9,(3,1000)))
...: a = df.values
...:
In [124]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 8.61 ms per loop
10 loops, best of 3: 167 ms per loop
100 loops, best of 3: 5.09 ms per loop
I think you can do it with NumPy. Let arr=df.values. First, let's find all two-column combinations:
from itertools import combinations
column_combos = combinations(range(arr.shape[1]), 2)
Now, subtract columns pairwise and convert a list of arrays back to a 2D array:
result = np.array([(arr[:,x[1]] - arr[:,x[0]]) for x in column_combos]).T
#array([[1., 2., 3., 1., 2., 1.],
#       [1., 2., 3., 1., 2., 1.],
#       [1., 2., 3., 1., 2., 1.]])
Another solution is somewhat (~15%) faster because it subtracts whole 2D arrays rather than columns, and has fewer Python-side iterations:
result = np.concatenate([(arr.T - arr.T[x])[x+1:] for x in range(arr.shape[1])]).T
#array([[ 1.,  2.,  3.,  1.,  2.,  1.],
#       [ 1.,  2.,  3.,  1.,  2.,  1.],
#       [ 1.,  2.,  3.,  1.,  2.,  1.]])
You can convert the result back to a DataFrame if you want:
columns = list(map(lambda x: x[1]+x[0], combinations(df.columns, 2)))
#['BA', 'CA', 'DA', 'CB', 'DB', 'DC']
pd.DataFrame(result, columns=columns)
# BA CA DA CB DB DC
#0 1.0 2.0 3.0 1.0 2.0 1.0
#1 1.0 2.0 3.0 1.0 2.0 1.0
#2 1.0 2.0 3.0 1.0 2.0 1.0
import itertools
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
df_cols = df.columns.tolist()
# build an index array of all the pairs that need the subtraction
idx = np.asarray(list(itertools.combinations(range(len(df_cols)),2))).T
# build a new DF using the pairwise differences and column names
df_new = pd.DataFrame(data=df.values[:,idx[0]]-df.values[:,idx[1]],
                      columns=[''.join(e) for e in itertools.combinations(df_cols,2)])
df_new
Out[43]:
AB AC AD BC BD CD
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
I am not sure how fast this is compared to other possible methods, but here it is:
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
# get the columns as list
cols = list(df.columns)
# define output dataframe
out = pd.DataFrame()
# loop over possible periods
for period in range(1, df.shape[1]):
    names = [l1 + l2 for l1, l2 in zip(cols, cols[period:])]
    out[names] = df.diff(periods=period, axis=1).dropna(axis=1, how='all')
print(out)
# column name shows which two columns are subtracted
AB BC CD AC BD AD
0 1.0 1.0 1.0 2.0 2.0 3.0
1 1.0 1.0 1.0 2.0 2.0 3.0
2 1.0 1.0 1.0 2.0 2.0 3.0
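If the A-first column order used by the other answers is preferred, a small follow-up sketch simply reorders the result (itertools is assumed to be imported):
import itertools
out = out[[''.join(pair) for pair in itertools.combinations(cols, 2)]]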
