How to select the first 3 rows of every group in pandas? - python

I get a pandas dataframe like this:
id prob
0 1 0.5
1 1 0.6
2 1 0.4
3 1 0.2
4 2 0.3
6 2 0.5
...
I want to group it by 'id', sort each group by 'prob' in descending order, and get the first 3 prob values of every group. Note that some groups contain fewer than 3 rows.
Finally I want to get a 2D array like:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]], ...]
How can I do that with pandas?
Thanks!

Use sort_values, groupby, and head:
df.sort_values(by=['id','prob'], ascending=[True,False]).groupby('id').head(3).values
Output:
array([[ 1. , 0.6],
[ 1. , 0.5],
[ 1. , 0.4],
[ 2. , 0.5],
[ 2. , 0.3]])
Following @COLDSPEED's lead:
df.sort_values(by=['id','prob'], ascending=[True,False])\
.groupby('id').agg(lambda x: x.head(3).tolist())\
.reset_index().values.tolist()
Output:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]

You can use groupby and nlargest
df.groupby('id').prob.nlargest(3).reset_index(1,drop = True)
id
1 0.6
1 0.5
1 0.4
2 0.5
2 0.3
For the array
df1 = df.groupby('id').prob.nlargest(3).unstack(1)
np.column_stack((df1.index.values, df1.values))
You get
array([[ 1. , 0.5, 0.6, 0.4, nan, nan],
[ 2. , nan, nan, nan, 0.3, 0.5]])

If you're looking for a dataframe of array columns, you can use np.sort:
df = df.groupby('id').prob.apply(lambda x: np.sort(x.values)[:-4:-1])
df
id
1 [0.6, 0.5, 0.4]
2 [0.5, 0.3]
To retrieve the values, reset_index and access:
df.reset_index().values
array([[1, array([ 0.6, 0.5, 0.4])],
[2, array([ 0.5, 0.3])]], dtype=object)

[[n, g.nlargest(3).tolist()] for n, g in df.groupby('id').prob]
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
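For reference, the answers above can be combined into one self-contained sketch. The helper name top_n_per_group and the rebuilt sample frame are assumptions made for illustration; the approach itself (sort, group, take the head of each group) is the one already shown.

```python
import pandas as pd

# Sample data rebuilt from the question (index 5 is missing there as well).
df = pd.DataFrame(
    {"id": [1, 1, 1, 1, 2, 2], "prob": [0.5, 0.6, 0.4, 0.2, 0.3, 0.5]},
    index=[0, 1, 2, 3, 4, 6],
)


def top_n_per_group(frame, n=3):
    """Return [[id, [largest n prob values]], ...] ordered by id (hypothetical helper)."""
    # Sort once so that head(n) inside each group yields the n largest values.
    ordered = frame.sort_values("prob", ascending=False)
    top = ordered.groupby("id")["prob"].apply(lambda s: s.head(n).tolist())
    return [[key, values] for key, values in top.items()]


result = top_n_per_group(df)
print(result)
```

Groups with fewer than n rows simply return shorter lists, which is the behaviour the question asks for.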

Related

calculate the score with respect to the group it was assigned or other

I'm beginner in python, I have two dataframes as below. The first dataframe represents the user with their vectors and group number.
df1 = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 5'], 'x1': [[0.2, 0.3, 0.5],[0.3, 0.3, 0.4],[0.4, 0.4, 0.2],[0.2, 0.1, 0.7],[0.5,0.3,0.2]],'group': [1, 0, 0, 2, 1]})
df1
output:
user x1 group
0 user 1 [0.2, 0.3, 0.5] 1
1 user 2 [0.3, 0.3, 0.4] 0
2 user 3 [0.4, 0.4, 0.2] 0
3 user 4 [0.2, 0.1, 0.7] 2
4 user 5 [0.5, 0.3, 0.2] 1
the second dataframe represents the group number with its vector and variable (p2) and its threshold
df2 = pd.DataFrame({'group': [0, 1, 2],
'x2': [[0.4, 0.2, 0.4],[0.5, 0.1, 0.4], [0.5, 0.1, 0.4]],
'p2': [0.231, 0.342, 0.411],
'threshold': [0.9, 0.6, 0.8]})
df2
output:
group x2 p2 threshold
0 0 [0.4, 0.2, 0.4] 0.231 0.9
1 1 [0.5, 0.1, 0.4] 0.342 0.6
2 2 [0.5, 0.1, 0.4] 0.411 0.8
I am trying to calculate, for each user, the score (S) with respect to the group it was assigned to, using:
S = 1/k + (x2 - x1)^T (x2 - x1) / p2
where k is the group size and T denotes the transpose of (x2 - x1).
How could I do that for all users?
First, count up the members of each group to get the k term (mapping on the group value, so that this does not rely on df2's index happening to equal the group number):
df2['count'] = df2['group'].map(df1.groupby('group')['user'].count())
Then merge df1 and df2 so that we have a frame with all necessary parameters for each user in a single row:
joined = df1.join(df2.set_index('group')[['x2', 'p2', 'threshold', 'count']], on='group')
print(joined)
user x1 group x2 p2 threshold count
0 user 1 [0.2, 0.3, 0.5] 1 [0.5, 0.1, 0.4] 0.342 0.6 2
1 user 2 [0.3, 0.3, 0.4] 0 [0.4, 0.2, 0.4] 0.231 0.9 2
2 user 3 [0.4, 0.4, 0.2] 0 [0.4, 0.2, 0.4] 0.231 0.9 2
3 user 4 [0.2, 0.1, 0.7] 2 [0.5, 0.1, 0.4] 0.411 0.8 1
4 user 5 [0.5, 0.3, 0.2] 1 [0.5, 0.1, 0.4] 0.342 0.6 2
Now define functions to calculate the S score:
def l_delta(z1, z2):
    # element-wise difference z1 - z2
    return [a1 - a2 for (a1, a2) in zip(z1, z2)]
def inner(z1, z2):
    # inner (dot) product of two vectors
    return sum([a1 * a2 for (a1, a2) in zip(z1, z2)])
def s_score(row):
    delta = l_delta(row['x2'], row['x1'])
    num = inner(delta, delta)
    return 1/row['count'] + num / row['p2']
Finally, apply these functions to each row in the joined matrix:
joined['s_score'] = joined.apply(s_score, axis=1)
print(joined[['user', 's_score']])
Result:
user s_score
0 user 1 0.909357
1 user 2 0.586580
2 user 3 0.846320
3 user 4 1.437956
4 user 5 0.733918
Similar answer to @The Photon's, where we (1) merge df1 and df2, (2) calculate k with groupby, and (3) calculate the inner product of (x2-x1) with itself:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 5'],
'x1': [[0.2, 0.3, 0.5],[0.3, 0.3, 0.4],[0.4, 0.4, 0.2],[0.2, 0.1, 0.7],[0.5,0.3,0.2]],
'group': [1, 0, 0, 2, 1]})
df2 = pd.DataFrame({'group': [0, 1, 2],
'x2': [[0.4, 0.2, 0.4],[0.5, 0.1, 0.4], [0.5, 0.1, 0.4]],
'p2': [0.231, 0.342, 0.411],
'threshold': [0.9, 0.6, 0.8]})
#merge df1 and df2 into a single table
merged_df = df1.merge(df2)
#calculate the number of unique users per group (k)
merged_df['k'] = merged_df.groupby('group')['user'].transform('nunique')
#calculate x2-x1 for each user (convert to numpy array for vectorized subtraction)
x2_sub_x1 = merged_df['x2'].apply(np.array)-merged_df['x1'].apply(np.array)
#calculate (x2-x1)T(x2-x1) for each user (same as squaring each term and summing)
numerator = x2_sub_x1.pow(2).apply(sum)
#calculate S from your formula and add it as a column to the merged table
merged_df['S'] = (1/merged_df['k'])+(numerator/merged_df['p2'])
Final merged table
user x1 group x2 p2 threshold k S
0 user 1 [0.2, 0.3, 0.5] 1 [0.5, 0.1, 0.4] 0.342 0.6 2 0.909357
1 user 5 [0.5, 0.3, 0.2] 1 [0.5, 0.1, 0.4] 0.342 0.6 2 0.733918
2 user 2 [0.3, 0.3, 0.4] 0 [0.4, 0.2, 0.4] 0.231 0.9 2 0.586580
3 user 3 [0.4, 0.4, 0.2] 0 [0.4, 0.2, 0.4] 0.231 0.9 2 0.846320
4 user 4 [0.2, 0.1, 0.7] 2 [0.5, 0.1, 0.4] 0.411 0.8 1 1.437956
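As a further variation, the list-valued columns can be stacked into 2D NumPy arrays so that the whole computation is vectorized. This is only a sketch under the assumption that every x1 and x2 vector has the same length; the names merged, diff and k are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "user": ["user 1", "user 2", "user 3", "user 4", "user 5"],
    "x1": [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.4, 0.4, 0.2],
           [0.2, 0.1, 0.7], [0.5, 0.3, 0.2]],
    "group": [1, 0, 0, 2, 1],
})
df2 = pd.DataFrame({
    "group": [0, 1, 2],
    "x2": [[0.4, 0.2, 0.4], [0.5, 0.1, 0.4], [0.5, 0.1, 0.4]],
    "p2": [0.231, 0.342, 0.411],
    "threshold": [0.9, 0.6, 0.8],
})

# A left merge keeps the row order of df1.
merged = df1.merge(df2, on="group", how="left")

# k: number of users in each user's group.
k = merged.groupby("group")["user"].transform("count")

# Stack the per-row lists into (n_users, n_features) arrays.
diff = np.array(merged["x2"].tolist()) - np.array(merged["x1"].tolist())

# (x2 - x1)^T (x2 - x1) is the row-wise sum of squared differences.
merged["S"] = 1 / k + (diff ** 2).sum(axis=1) / merged["p2"]
print(merged[["user", "S"]])
```

The scores should match the ones in the tables above, listed in the original user order.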

Get two neighboring non-nan values in numpy array

Let's say I have a numpy array
my_array = [0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan]
For each nan value, I want to extract the two non-nan values to the left and right of that point (or single value if appropriate). So I would like my output to be something like
output = [[0.3,0.1], [0.3,0.1], [0.3,0.1], [0.1,0.5], [0.5]]
I was thinking of looping through all the values in my_array, then finding those that are nan, but I'm not sure how to do the next part of finding the nearest non-nan values.
Using pandas and numpy:
import numpy as np
import pandas as pd
from numpy import nan
s = pd.Series([0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan])
m = s.isna()
a = np.vstack((s.ffill()[m], s.bfill()[m]))
out = a[:,~np.isnan(a).any(0)].T.tolist()
Output:
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5]]
NB. You can choose to keep or drop the lists containing NaNs.
With NaNs:
out = a.T.tolist()
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5, nan]]
Alternative to handle the single elements:
s = pd.Series([0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan])
m = s.isna()
(pd
.concat((s.ffill()[m], s.bfill()[m]), axis=1)
.stack()
.groupby(level=0).agg(list)
.to_list()
)
Output:
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
Less elegant than @mozway's answer, but the last list only has one element (arr is the pandas Series, called s above):
pd.DataFrame({
'left':arr.ffill(),
'right': arr.bfill()
}).loc[arr.isna()].apply(lambda row: row.dropna().to_list(), axis=1).to_list()
For the sake of education, I'll post a pretty straight-forward algorithm for achieving this result, which works by finding the closest index of a value to the left and to the right of each index of a NaN, and filters out any infs at the end:
def get_neighbors(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs, *_ = np.where(mask)
    val_idxs, *_ = np.where(~mask)
    neighbors = []
    for nan_idx in nan_idxs:
        L, R = -float("inf"), float("inf")
        for val_idx in val_idxs:
            if val_idx < nan_idx:
                L = max(L, val_idx)
            else:
                R = min(R, val_idx)
        # keep only the indices that were found; "0 <= i" so a neighbor at index 0 is not dropped
        # casting to list isn't strictly necessary, you'll just end up with a list of arrays
        neighbors.append(list(x[[i for i in (L, R) if 0 <= i < float("inf")]]))
    return neighbors
Output:
>>> get_neighbors(my_array)
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]
The nested for loop has a worst-case runtime of O((n / 2)^2) where n is the number of elements of x (worst case occurs when exactly half the elements are NaN).
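If the quadratic inner loop is a concern, the nearest valid index on each side can be looked up with a binary search instead. The following is a sketch of that idea, not a drop-in replacement from any answer; the function name nan_neighbors is an assumption, and it relies on np.flatnonzero returning indices in ascending order.

```python
import numpy as np


def nan_neighbors(x):
    """For each NaN, list the nearest non-NaN value on the left and on the right."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    val_idxs = np.flatnonzero(~mask)  # sorted positions of valid values
    nan_idxs = np.flatnonzero(mask)   # sorted positions of NaNs
    # For each NaN position, count how many valid positions precede it.
    pos = np.searchsorted(val_idxs, nan_idxs)
    out = []
    for p in pos:
        pair = []
        if p > 0:                     # a valid value exists to the left
            pair.append(float(x[val_idxs[p - 1]]))
        if p < len(val_idxs):         # a valid value exists to the right
            pair.append(float(x[val_idxs[p]]))
        out.append(pair)
    return out


nan = np.nan
neighbors = nan_neighbors([0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan])
print(neighbors)
```

The search costs O(log n) per NaN, so the total work is O(n log n) rather than quadratic.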
I was eager to see whether this could be solved with NumPy alone, as an exercise. After some hours I reached a solution :), but since I expect it to be less efficient than the pandas approach mentioned by mozway, I didn't optimize the code further (it can be optimized; the if conditions could probably be cleaned up and merged into other sections):
my_array = np.array([np.nan, np.nan, 0.2, 0.3, np.nan, np.nan, np.nan, 0.1, 0.7, np.nan, 0.5])
nans = np.isnan(my_array).astype(np.int8) # [1 1 0 0 1 1 1 0 0 1 0]
zeros = np.where(nans == 0)[0] # [ 2 3 7 8 10]
diff_nan = np.diff(nans) # [ 0 -1 0 1 0 0 -1 0 1 -1]
start = np.where(diff_nan == 1)[0] # [3 8]
end = np.where(diff_nan == -1)[0] + 1 # [ 2 7 10]
mask_start_nan = np.isnan(my_array[0]) # True
mask_end_nan = np.isnan(my_array[-1]) # False
if mask_end_nan: start = start[:-1] # [3 8]
if mask_start_nan: end = end[1:] # [ 7 10]
inds = np.dstack([start, end]).squeeze() # [[ 3 7] [ 8 10]]
initial = my_array[inds] # [[0.3 0.1] [0.7 0.5]]
repeats = np.diff(np.where(np.concatenate(([nans[0]], nans[:-1] != nans[1:], [True])))[0])[::2] # [2 3 1]
if mask_end_nan: repeats = repeats[:-1] # [2 3 1]
if mask_start_nan: repeats = repeats[1:] # [3 1]
result = np.repeat(initial, repeats, axis=0) # [[0.3 0.1] [0.3 0.1] [0.3 0.1] [0.7 0.5]]
if mask_end_nan: result = np.array([*result, np.array(my_array[zeros[-1]])], dtype=object)
if mask_start_nan: result = np.array([np.array(my_array[zeros[0]]), *result], dtype=object)
# [array(0.2) array([0.3, 0.1]) array([0.3, 0.1]) array([0.3, 0.1]) array([0.7, 0.5])]
I don't know if there is a much easier solution with NumPy; I implemented what came to mind. I believe this code can be greatly improved (I will do so if I find some free time).

NumPy - generate multiple intervals

I have an array like this:
[[0.13, 0.19],
[0.25, 0.6 ],
[0.7 , 0.89]]
I want, given the above array, to create a result like this:
[[0, 0.12],
[0.13, 0.19],
[0.20, 0.24],
[0.25, 0.60],
[0.61, 0.69],
[0.70, 0.89],
[0.90, 1]]
Namely, I want to create a complete matrix of intervals, given some pre-defined intervals.
This isn't specific to numpy, but maybe it will point you in the correct direction.
Basically, you need to know where to start, where to end, and the 'resolution' (for lack of a better word), that is, how far apart the gaps are. With that you can loop through the existing intervals and fill in the others. You'll want to watch the edge cases where the intervals are already filled in, such as one starting at 0, or adjacent ones like [0.6, 0.8], [0.9, 0.95], so you don't fill those in twice. This might look something like:
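def fill_intervals(existing_intervals, start=0, end=1.0, inc=0.01):
    l2 = []
    # iterate over the argument, not the global list l
    for i in existing_intervals:
        # add a filler interval only if there is a gap before this one
        if start < i[0]:
            l2.append([start, i[0] - inc])
        l2.append(i)
        start = i[1] + inc
    # close the range with a final interval if needed
    if start < end:
        l2.append([start, end])
    return l2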
l = [
[0.13, 0.19],
[0.25, 0.6 ],
[0.7 , 0.89]
]
fill_intervals(l)
Returning:
[[0, 0.12],
[0.13, 0.19],
[0.2, 0.24],
[0.25, 0.6],
[0.61, 0.69],
[0.7, 0.89],
[0.9, 1.0]]
You can duplicate items and then make it quite close:
arr = np.array([[0.13, 0.19], [0.25, 0.6 ], [0.7 , 0.89]])
consecutive = np.r_[0, np.repeat(arr, 2), 1]
intervals = consecutive.reshape(-1, 2)
intervals:
array([[0. , 0.13], # required: [0, 0.12]
[0.13, 0.19], # OK
[0.19, 0.25], # required: [0.20, 0.24]
[0.25, 0.6 ], # OK
[0.6 , 0.7 ], # required: [0.61, 0.69]
[0.7 , 0.89], # OK
[0.89, 1. ]])# required: [0.9, 1]
It seems you need to fix alternate intervals so just do:
intervals[2::2,0] = intervals[2::2,0] + 0.01
intervals[:-1:2,1] = intervals[:-1:2,1] - 0.01
intervals:
array([[0. , 0.12],
[0.13, 0.19],
[0.2 , 0.24],
[0.25, 0.6 ],
[0.61, 0.69],
[0.7 , 0.89],
[0.9 , 1. ]])
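The same idea can be wrapped in a small function that also covers the edge cases mentioned earlier (an interval starting at the lower bound, or two intervals with no room between them). This is a sketch assuming the input intervals are sorted and non-overlapping; the name fill_gaps and its parameters lo, hi, step and decimals are made up for this example, and the rounding is there only to hide floating-point noise such as 0.13 - 0.01.

```python
import numpy as np


def fill_gaps(intervals, lo=0.0, hi=1.0, step=0.01, decimals=2):
    """Return the given intervals plus the gaps between them, sorted by start."""
    arr = np.asarray(intervals, dtype=float)
    # Each gap starts one step after an interval ends (or at lo) ...
    gap_starts = np.r_[lo, arr[:, 1] + step]
    # ... and ends one step before the next interval starts (or at hi).
    gap_ends = np.r_[arr[:, 0] - step, hi]
    gaps = np.round(np.column_stack((gap_starts, gap_ends)), decimals)
    # Drop gaps that would be empty, e.g. when an interval starts at lo.
    gaps = gaps[gaps[:, 0] <= gaps[:, 1]]
    combined = np.vstack((arr, gaps))
    return combined[np.argsort(combined[:, 0], kind="stable")]


filled = fill_gaps([[0.13, 0.19], [0.25, 0.6], [0.7, 0.89]])
print(filled)
```

Unlike the in-place slice assignment, this version does not assume that gaps and given intervals strictly alternate.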
You can use linspace to create your intervals
import numpy as np
>>> np.linspace(0, 1, num=3, endpoint=False)
array([0. , 0.33333333, 0.66666667])

Add a scalar to a numpy matrix based on the indices in a different numpy array

I'm sorry if this question isn't framed well. So I would rather explain with an example.
I have a numpy matrix:
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
And another numpy array as shown:
b = np.array([1, 0, 2, 2])
Given that the values in b are in range(a.shape[1]) and that b.shape[0] == a.shape[0], this is the operation I need to perform.
For every index i of a, and the corresponding index i of b, I need to subtract 1 from the index j of a[i] where j== b[i]
So in my example, a[0] == [0.5 0.8 0.1] and b[0] == 1. Therefore I need to subtract 1 from a[0][b[0]] so that a[0] = [0.5, -0.2, 0.1]. This has to be done for all rows of a. Any direct solution without me having to iterate through all rows or columns one by one?
Thanks.
Use numpy indexing. See this post for a nice introduction:
import numpy as np
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
b = np.array([1, 0, 2, 2])
a[np.arange(a.shape[0]), b] -= 1
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
As an alternative, use subtract.at:
np.subtract.at(a, (np.arange(a.shape[0]), b), 1)
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
The main idea is that:
np.arange(a.shape[0]) # shape[0] is equal to the number of rows
generates the indices of the rows:
[0 1 2 3]
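In this question each (row, column) pair occurs exactly once, so the indexed -= and np.subtract.at give the same result. They differ when an index is repeated: the indexed form applies the operation only once per distinct position, while the .at form applies it once per occurrence. A small sketch with made-up 1D data illustrates the difference:

```python
import numpy as np

values = np.zeros(3)
idx = np.array([0, 0, 2])  # index 0 appears twice

# Fancy-indexed in-place subtraction: position 0 is decremented only once.
buffered = values.copy()
buffered[idx] -= 1

# ufunc.at is unbuffered: position 0 is decremented once per occurrence.
unbuffered = values.copy()
np.subtract.at(unbuffered, idx, 1)

print(buffered)
print(unbuffered)
```

So subtract.at is the safer choice whenever the index arrays might contain duplicates.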

Return the biggest value less than one from a numpy vector

I have a numpy vector in python and I want to find the index of the max value of the vector with the condition that it is less than one. I have as an example the following:
temp_res = [0.9, 0.8, 0.7, 0.99, 1.2, 1.5, 0.1, 0.5, 0.1, 0.01, 0.12, 0.56, 0.89, 0.23, 0.56, 0.78]
temp_res = np.asarray(temp_res)
indices = np.where((temp_res == temp_res.max()) & (temp_res < 1))
However, what I tried always returns an empty result, since those two conditions cannot both be met: the global maximum (1.5) is not less than 1. I want to return as the final result index = 3, which corresponds to 0.99, the biggest value that is less than 1. How can I do so?
You need to perform the max() function after filtering your array:
temp_res = np.asarray(temp_res)
temp_res[temp_res < 1].max()
Out[60]: 0.99
If you want to find all the indexes, here is a more general approach:
mask = temp_res < 1
indices = np.where(mask)
maximum = temp_res[mask].max()
max_indices = np.where(temp_res == maximum)
Example:
...: temp_res = [0.9, 0.8, 0.7, 1, 0.99, 0.99, 1.2, 1.5, 0.1, 0.5, 0.1, 0.01, 0.12, 0.56, 0.89, 0.23, 0.56, 0.78]
...: temp_res = np.asarray(temp_res)
...: mask = temp_res < 1
...: indices = np.where(mask)
...: maximum = temp_res[mask].max()
...: max_indices = np.where(temp_res == maximum)
...:
In [72]: max_indices
Out[72]: (array([4, 5]),)
You can use:
np.where(temp_res == temp_res[temp_res < 1].max())[0]
Example:
In [49]: temp_res
Out[49]:
array([0.9 , 0.8 , 0.7 , 0.99, 1.2 , 1.5 , 0.1 , 0.5 , 0.1 , 0.01, 0.12,
0.56, 0.89, 0.23, 0.56, 0.78])
In [50]: np.where(temp_res == temp_res[temp_res < 1].max())[0]
...:
Out[50]: array([3])
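If only the first matching index is needed, another option is to replace the disqualified values with -inf and call argmax once. This is a sketch; the helper name argmax_below and its limit parameter are assumptions made for this example, and it returns None when no value is below the limit.

```python
import numpy as np


def argmax_below(values, limit=1.0):
    """Index of the largest value strictly below limit, or None if there is none."""
    values = np.asarray(values, dtype=float)
    mask = values < limit
    if not mask.any():
        return None
    # Values at or above the limit can never win the argmax.
    masked = np.where(mask, values, -np.inf)
    return int(np.argmax(masked))


temp_res = np.asarray([0.9, 0.8, 0.7, 0.99, 1.2, 1.5, 0.1, 0.5, 0.1, 0.01,
                       0.12, 0.56, 0.89, 0.23, 0.56, 0.78])
index = argmax_below(temp_res)
print(index, temp_res[index])
```

With ties, argmax returns the first occurrence, so use the np.where form above when every matching index is required.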
