Pandas groupby to expanding to list (numpy array)

Pandas groupby to expanding to list (numpy array) - python

I have a DataFrame that can be produced using this Python code:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
'time': [t for t in range(6)] * 2,
'observations': [o for o in range(12)]})
The following code enables me to reformat the data as desired:
dflist = []
for v_ in df.visit.unique():
for t_ in df.time[df.visit == v_]:
dflist.append([df[(df.visit == v_) & (df.time <= t_)].groupby('visit')['observations'].apply(list)])
pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
However this is extremely slow.
I have tried using .expanding(), however, this will only return scalars whereas I would like list (or numpy array).
I would appreciate any help in vectorizing or otherwise optimizing this procedure.
Thanks

Fortunately, in pandas 1.1.0 and newer, expanding produces an iterable which can be used to use take advantage of the faster grouping, but produce non-scaler data like lists:
new_df = pd.DataFrame({
'observations':
[list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
new_df:
observations
visit
1 [0]
1 [0, 1]
1 [0, 1, 2]
1 [0, 1, 2, 3]
1 [0, 1, 2, 3, 4]
1 [0, 1, 2, 3, 4, 5]
2 [6]
2 [6, 7]
2 [6, 7, 8]
2 [6, 7, 8, 9]
2 [6, 7, 8, 9, 10]
2 [6, 7, 8, 9, 10, 11]
Timing via %timeit:
Setup:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
'time': [t for t in range(6)] * 2,
'observations': [o for o in range(12)]})
Original:
def fn():
dflist = []
for v_ in df.visit.unique():
for t_ in df.time[df.visit == v_]:
dflist.append([
df[(df.visit == v_) & (df.time <= t_)]
.groupby('visit')['observations'].apply(list)
])
return pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehension with expanding (~13x faster on this sample):
def fn2():
return pd.DataFrame({
'observations':
[list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sanity Check:
fn().eq(fn2()).all(axis=None) # True
The double apply approach by #Quixotic22 (~3.4x faster than the original ~3.9x slower than comprehension + expanding):
def fn3():
return (df.
set_index('visit')['observations'].
apply(lambda x: [x]).
reset_index().groupby('visit')['observations'].
apply(lambda x: x.cumsum()))
%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
*Note this approach only produces a series of observations, does not include the visit as the index.
fn().eq(fn3()).all(axis=None) # False

Looks like a good solution has been provided but dropping this here as a viable alternative.
(df.
set_index('visit')['observations'].
apply(lambda x: [x]).
reset_index().groupby('visit')['observations'].
apply(lambda x: x.cumsum())
)

Related

Roll first column by 1, second column by 2, etc

I have an array in numpy. I want to roll the first column by 1, second column by 2, etc.
Here is an example.
>>> x = np.reshape(np.arange(15), (5, 3))
>>> x
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
What I want to do:
>>> y = roll(x)
>>> y
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])
What is the best way to do it?
The real array will be very big. I'm using cupy, the GPU version of numpy. I will prefer solution fastest on GPU, but of course, any idea is welcomed.

You could use advanced indexing:
import numpy as np
x = np.reshape(np.arange(15), (5, 3))
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1
shifted = np.subtract.outer(rows, offsets) % h
y = x[shifted, cols]
y:
array([[12, 10, 8],
[ 0, 13, 11],
[ 3, 1, 14],
[ 6, 4, 2],
[ 9, 7, 5]])

I implemented a naive solution (roll_for) and compares it to #Chrysophylaxs 's solution (roll_indexing).
Conclusion: roll_indexing is faster for small arrays, but the difference shrinks when the array goes bigger, and is eventually slower than roll_for for very large arrays.
Implementations:
import numpy as np
def roll_for(x, shifts=None, axis=-1):
if shifts is None:
shifts = np.arange(1, x.shape[axis] + 1) # OP requirement
xt = x.swapaxes(axis, 0) # https://stackoverflow.com/a/31094758/13636407
yt = np.empty_like(xt)
for idx, shift in enumerate(shifts):
yt[idx] = np.roll(xt[idx], shift=shift)
return yt.swapaxes(0, axis)
def roll_indexing(x):
h, w = x.shape
rows, cols = np.arange(h), np.arange(w)
offsets = cols + 1
shifted = np.subtract.outer(rows, offsets) % h # fix
return x[shifted, cols]
Tests:
M, N = 5, 3
x = np.arange(M * N).reshape(M, N)
expected = np.array([[12, 10, 8], [0, 13, 11], [3, 1, 14], [6, 4, 2], [9, 7, 5]])
assert np.array_equal(expected, roll_for(x))
assert np.array_equal(expected, roll_indexing(x))
M, N = 100, 200
# roll_indexing did'nt work when M < N before fix
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
Benchmark:
M, N = 100, 100
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 859 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit roll_indexing(x) # 81 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
M, N = 1_000, 1_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 12.7 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit roll_indexing(x) # 12.4 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
M, N = 10_000, 10_000
x = np.arange(M * N).reshape(M, N)
assert np.array_equal(roll_for(x), roll_indexing(x))
%timeit roll_for(x) # 1.3 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit roll_indexing(x) # 1.61 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to delete values from one pandas series that are common to another?

So I have a specific problem that needs to be solved. I need to DELETE elements present in one pandas series (ser1) that are common to another pandas series (ser2).
I have tried a bunch of things that do not work and the closest thing I was able to find was with arrays using np.intersect1d() function. This works to find common values, but when I try to drop indexes that are equal to these values, i get a bunch of mistakes.
I've tried a bunch of other things that did not really work and have been at it for 3 hours now so about to give up.
here are the two series:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
The result should be:
print(ser1)
0 1
1 2
2 3
I am sure there is a simple solution.

Use .isin:
>>> ser1[~ser1.isin(ser2)]
0 1
1 2
2 3
dtype: int64
The numpy version is .setdiff1d (and not .intersect1d)
>>> np.setdiff1d(ser1, ser2)
array([1, 2, 3])

A numpy alternative, np.isin
import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
res = ser1[~np.isin(ser1, ser2)]
print(res)
Micro-Benchmark
import pandas as pd
import numpy as np
ser1 = pd.Series([1, 2, 3, 4, 5] * 100)
ser2 = pd.Series([4, 5, 6, 7, 8] * 10)
%timeit res = ser1[~np.isin(ser1, ser2)]
136 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit res = ser1[~ser1.isin(ser2)]
209 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Index(ser1).difference(ser2).to_series()
277 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can use set notation - I am not sure of the speed though, compared to isin:
pd.Index(ser1).difference(ser2).to_series()
Out[35]:
1 1
2 2
3 3
dtype: int64

Numpy array to dictionary with indices as values

I have a numpy array with integer values, let's call this array x.
I want to create some sort of list where for each value, I have the indices of x that hold this value.
For example, for:
x = [1,2,2,4,7,1,1,7,16]
I want to get:
{1: [0,5,6], 2:[1,2], 4:[3], 7:[4,7], 15:[16]}
The parenthesis I used are arbitrary, I don't care which data structure I use as long as I can output my result to a file as quickly as possible. At the end I want a .txt file that reads:
0,5,6
1,2
3
4,7
16

Since you mentioned you're not picky about the data structure of your values,tTo get something like the dictionary you posted in your question, you could do a dictionary comprehension over the unique values in x with np.where for the values:
>>> {i:np.where(x == i)[0] for i in set(x)}
{1: array([0, 5, 6]),
2: array([1, 2]),
4: array([3]),
7: array([4, 7]),
16: array([8])}
Comparing this to a more vanilla loop through a list, this will be significantly faster for larger arrays:
def list_method(x):
res = {i:[] for i in set(x)}
for i, value in enumerate(x):
res[value].append(i)
return res
def np_method(x):
return {i:np.where(x == i)[0] for i in set(x)}
x = np.random.randint(1, 50, 1000000)
In [5]: %timeit list_method(x)
259 ms ± 4.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit np_method(x)
120 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pure python will be like this:
result = {}
for idx,val in enumerate(x):
arr = result.get(val,[])
arr.append(idx)
result[val] = arr

x = [1,2,2,4,7,1,1,7,16]
numlist = []
numdict = {}
c = 0
for n in x:
if n not in numlist:
numlist.append(n)
numdict[n] = [c]
else:
numdict[n].append(c)
c += 1
print(numlist, numdict)
Output is:
[1, 2, 4, 7, 16] {1: [0, 5, 6], 2: [1, 2], 4: [3], 7: [4, 7], 16: [8]}
To write to file use:
with open('file.txt', 'w') as f:
f.write(str(numdict))

Numpy group by multiple vectors, get group indices

I have several numpy arrays; I want to build a groupby method that would have group ids for these arrays. It will then allow me to index these arrays on the group id to perform operations on the groups.
For an example:
import numpy as np
import pandas as pd
a = np.array([1,1,1,2,2,3])
b = np.array([1,2,2,2,3,3])
def group_np(groupcols):
groupby = np.array([''.join([str(b) for b in bs]) for bs in zip(*[c for c in groupcols])])
_, groupby = np.unique(groupby, return_invesrse=True)
return groupby
def group_pd(groupcols):
df = pd.DataFrame(groupcols[0])
for i in range(1, len(groupcols)):
df[i] = groupcols[i]
for i in range(len(groupcols)):
df[i] = df[i].fillna(-1)
return df.groupby(list(range(len(groupcols)))).grouper.group_info[0]
Outputs:
group_np([a,b]) -> [0, 1, 1, 2, 3, 4]
group_pd([a,b]) -> [0, 1, 1, 2, 3, 4]
Is there a more efficient way of implementing it, ideally in pure numpy? The bottleneck currently seems to be building a vector that would have unique values for each group - at the moment I am doing that by concatenating the values for each vector as strings.
I want this to work for any number of input vectors, which can have millions of elements.
Edit: here is another testcase:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
Here, group elements 2,3,4,7 should all be the same.
Edit2: adding some benchmarks.
a = np.random.randint(1, 1000, 30000000)
b = np.random.randint(1, 1000, 30000000)
c = np.random.randint(1, 1000, 30000000)
def group_np2(groupcols):
_, groupby = np.unique(np.stack(groupcols), return_inverse=True, axis=1)
return groupby
%timeit group_np2([a,b,c])
# 25.1 s +/- 1.06 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
%timeit group_pd([a,b,c])
# 21.7 s +/- 646 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

After using np.stack on the arrays a and b, if you set the parameter return_inverse to True in np.unique then it is the output you are looking for:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
_, inv = np.unique(np.stack([a,b]), axis=1, return_inverse=True)
print (inv)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
and you can replace [a,b] in np.stack by a list of all the vectors.
Edit: a faster solution is use np.unique on the sum of the arrays multiply by the cumulative product (np.cumprod) of the max plus 1 of all previous arrays in groupcols. such as:
def group_np_sum(groupcols):
groupcols_max = np.cumprod([ar.max()+1 for ar in groupcols[:-1]])
return np.unique( sum([groupcols[0]] +
[ ar*m for ar, m in zip(groupcols[1:],groupcols_max)]),
return_inverse=True)[1]
To check:
a = np.array([1,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print (group_np_sum([a,b]))
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
Note: the number associated to each group may not be the same (here I changed the first element of a by 3)
a = np.array([3,2,1,1,1,2,3,1])
b = np.array([1,2,2,2,2,3,3,2])
print(group_np2([a,b]))
print (group_np_sum([a,b]))
array([3, 1, 0, 0, 0, 2, 4, 0], dtype=int64)
array([0, 2, 1, 1, 1, 3, 4, 1], dtype=int64)
but groups themselves are the same.
Now to check for timing:
a = np.random.randint(1, 100, 30000)
b = np.random.randint(1, 100, 30000)
c = np.random.randint(1, 100, 30000)
groupcols = [a,b,c]
%timeit group_pd(groupcols)
#13.7 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit group_np2(groupcols)
#34.2 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit group_np_sum(groupcols)
#3.63 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The numpy_indexed package (dsiclaimer: I am its authos) covers these type of use cases:
import numpy_indexed as npi
npi.group_by((a, b))
Passing a tuple of index-arrays like this avoids creating a copy; but if you dont mind making the copy you can use stacking as well:
npi.group_by(np.stack(a, b))

Deleting diagonal elements of a numpy array

Given input
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Need output :
array([[2, 3],
[4, 6],
[7, 8]])
It is easy to use iteration or loop to do this, but there should be a neat way to do this without using loops. Thanks

Approach #1
One approach with masking -
A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Sample run -
In [395]: A
Out[395]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [396]: A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Out[396]:
array([[2, 3],
[4, 6],
[7, 8]])
Approach #2
Using the regular pattern of non-diagonal elements that could be traced with broadcasted additions with range arrays -
m = A.shape[0]
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
out = A.ravel()[idx]
Approach #3 (Strides Strikes!)
Abusing the regular pattern of non-diagonal elements from previous approach, we can introduce np.lib.stride_tricks.as_strided and some slicing help, like so -
m = A.shape[0]
strided = np.lib.stride_tricks.as_strided
s0,s1 = A.strides
out = strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Runtime test
Approaches as funcs :
def skip_diag_masking(A):
return A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
def skip_diag_broadcasting(A):
m = A.shape[0]
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
return A.ravel()[idx]
def skip_diag_strided(A):
m = A.shape[0]
strided = np.lib.stride_tricks.as_strided
s0,s1 = A.strides
return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Timings -
In [528]: A = np.random.randint(11,99,(5000,5000))
In [529]: %timeit skip_diag_masking(A)
...: %timeit skip_diag_broadcasting(A)
...: %timeit skip_diag_strided(A)
...:
10 loops, best of 3: 56.1 ms per loop
10 loops, best of 3: 82.1 ms per loop
10 loops, best of 3: 32.6 ms per loop

I know I'm late to this party, but I have what I believe is a simper solution. So you want to remove the diagonal? Okay cool:
replace it with NaN
filter all but NaN (this converts to one dimensional as it can't assume the result will be square)
reset the dimensionality
`
arr = np.array([[1,2,3],[4,5,6],[7,8,9]]).astype(np.float)
np.fill_diagonal(arr, np.nan)
arr[~np.isnan(arr)].reshape(arr.shape[0], arr.shape[1] - 1)

Solution steps:
Flatten your array
Delete the location of the diagonal elements which is at the location range(0, len(x_no_diag), len(x) + 1)
Reshape your array to (num_rows, num_columns - 1)
The function:
import numpy as np
def remove_diag(x):
x_no_diag = np.ndarray.flatten(x)
x_no_diag = np.delete(x_no_diag, range(0, len(x_no_diag), len(x) + 1), 0)
x_no_diag = x_no_diag.reshape(len(x), len(x) - 1)
return x_no_diag
Example:
>>> x = np.random.randint(5, size=(3,3))
array([[0, 2, 3],
[3, 4, 1],
[2, 4, 0]])
>>> remove_diag(x)
array([[2, 3],
[3, 1],
[2, 4]])

Just with numpy, assuming a square matrix:
new_A = numpy.delete(A,range(0,A.shape[0]**2,(A.shape[0]+1))).reshape(A.shape[0],(A.shape[1]-1))

If you do not mind creating a new array, then you can use list comprehension.
A = np.array([A[i][A[i] != A[i][i]] for i in range(len(A))])
Rerunning the same methods as #Divakar,
A = np.random.randint(11,99,(5000,5000))
skip_diag_masking
85.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_broadcasting
163 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_strided
52.5 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_list_comp
101 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Perhaps the cleanest way, based on Divakar's first solution but using len(array) instead of array.shape[0], is:
array_without_diagonal = array[~np.eye(len(array), dtype=bool)].reshape(len(array), -1)

I love all the answers here, but would like to add one in case your numpy object has more than 2-dimensions. In that case, you can use the following adjustment of Divakar's approach #1:
def remove_diag(A):
removed = A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], int(A.shape[0])-1, -1)
return np.squeeze(removed)

The other approach is to use numpy.delete(). assuming square matrix, you can use:
numpy.delete(A,range(0,A.shape[0]**2,A.shape[0])).reshape(A.shape[0],A.shape[1]-1)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas groupby to expanding to list (numpy array) - python

Looks like a good solution has been provided but dropping this here as a viable alternative. (df. set_index('visit')['observations']. apply(lambda x: [x]). reset_index().groupby('visit')['observations']. apply(lambda x: x.cumsum()) )

Related

Roll first column by 1, second column by 2, etc

How to delete values from one pandas series that are common to another?

Numpy array to dictionary with indices as values

Numpy group by multiple vectors, get group indices

Deleting diagonal elements of a numpy array

Categories

Resources