Retain pandas MultiIndex after applying a function across a level - python

I'm looking to find the minimum value across level 1 of a MultiIndex (time in this example), but I'd like to retain all the other labels of the index.
import numpy as np
import pandas as pd
stack = [
    [0, 1, 1, 5],
    [0, 1, 2, 6],
    [0, 1, 3, 2],
    [0, 2, 3, 4],
    [0, 2, 2, 5],
    [0, 3, 2, 1],
    [1, 1, 0, 5],
    [1, 1, 2, 6],
    [1, 1, 3, 7],
    [1, 2, 2, 8],
    [1, 2, 3, 9],
    [2, 1, 7, 1],
    [2, 1, 8, 3],
    [2, 2, 3, 4],
    [2, 2, 8, 1],
]
df = pd.DataFrame(stack)
df.columns = ['self', 'time', 'other', 'value']
df.set_index(['self', 'time', 'other'], inplace=True)
df.groupby(level=1).min() doesn't return the correct values:
      value
time
1         1
2         1
3         1
Doing something like df.groupby(level=[0,1,2]).min() returns the original dataframe unchanged.
I swear I used to be able to do this by calling .min(level=1), but it's giving me deprecation notices and telling me to use the groupby format above, and the result seems different than I remember. Am I stupid?
original:
                 value
self time other
0    1    1          5
          2          6
          3          2   # <-- min row
     2    3          4   # <-- min row
          2          5
     3    2          1   # <-- min row
1    1    0          5   # <-- min row
          2          6
          3          7
     2    2          8   # <-- min row
          3          9
2    1    7          1   # <-- min row
          8          3
     2    3          4
          8          1   # <-- min row
desired result:
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1

Group by your first two levels, then take idxmin instead of min to get the full index of each minimum row. Finally, use loc to select those rows from your original dataframe:
out = df.loc[df.groupby(level=['self', 'time'])['value'].idxmin()]
print(out)
# Output
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1
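For reference (an editorial sketch, not part of the original answer), the intermediate idxmin result is what makes the loc filter work: on a MultiIndexed frame it returns, per group, the full index tuple of the row holding the minimum, so the output should look roughly like this:
idx = df.groupby(level=['self', 'time'])['value'].idxmin()
print(idx)
# self  time
# 0     1       (0, 1, 3)
#       2       (0, 2, 3)
#       3       (0, 3, 2)
# 1     1       (1, 1, 0)
#       2       (1, 2, 2)
# 2     1       (2, 1, 7)
#       2       (2, 2, 8)
# Name: value, dtype: object
df.loc[idx]   # same result as above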

Why not just group by the first two index levels, rather than all three?
out = df.groupby(level=[0,1]).min()
Output:
>>> out
           value
self time
0    1         2
     2         4
     3         1
1    1         5
     2         8
2    1         1
     2         1
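One caveat worth adding (my note, not the answerer's): this aggregates the other level away entirely, so if you need to keep it, as the question asks, the idxmin/loc approach above is still the way to go:
out = df.groupby(level=[0, 1]).min()
print(list(out.index.names))   # ['self', 'time'] -- the 'other' level is gone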

Related

Get value and key lists out of pandas groupBy

I am using pandas to create three arrays that I need for some stats.
I need the months, plus the number of finishes and starts in each month.
My dataframe is the following:
month finish started
0 MONTH.Mar 1 0
1 MONTH.Mar 1 0
2 MONTH.Mar 1 0
3 MONTH.Mar 1 0
4 MONTH.Mar 1 0
5 MONTH.Mar 0 1
6 MONTH.Apr 1 0
7 MONTH.Mar 0 1
8 MONTH.Mar 0 1
9 MONTH.Feb 0 1
I do a groupby:
df.groupby('month').sum()
and the output is the following:
finish started
month
MONTH.Apr 1 0
MONTH.Feb 0 1
MONTH.Mar 5 3
How can I convert the data into three different lists like this:
['MONTH.Apr','MONTH.Feb','MONTH.Mar']
[1,0,5]
[0,1,3]
I tried to do frame.values.tolist() but the output was the following:
[[1, 0], [0, 1], [5, 3]]
and it was impossible to get the months.
IIUC, try reset_index() and transposing with .T:
>>> df.groupby('month').sum().reset_index().T.to_numpy()
array([['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar'],
       [1, 0, 5],
       [0, 1, 3]], dtype=object)
Or:
>>> df.groupby('month').sum().reset_index().T.values.tolist()
[['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar'], [1, 0, 5], [0, 1, 3]]
You can use:
month, finish, started = df.groupby('month', as_index=False) \
                           .sum().to_dict('list').values()
Output:
>>> month
['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar']
>>> finish
[1, 0, 5]
>>> started
[0, 1, 3]
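A small caveat on the unpacking above (my note, not the answerer's): it relies on the dict preserving column order, which is guaranteed on Python 3.7+. If you prefer to be explicit about which list is which, a sketch like this works as well:
out = df.groupby('month').sum()
month = out.index.tolist()          # ['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar']
finish = out['finish'].tolist()     # [1, 0, 5]
started = out['started'].tolist()   # [0, 1, 3]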

How to count the number of unique values in pandas when each cell contains a list

I have a data frame like this:
import pandas as pd
import numpy as np
Out[10]:
     samples  subject  trial_num
0  [0, 2, 2]        1          1
1  [3, 3, 0]        1          2
2  [1, 1, 1]        1          3
3  [0, 1, 2]        2          1
4  [4, 5, 6]        2          2
5  [0, 8, 8]        2          3
I want to have the output like this:
     samples  subject  trial_num  frequency
0  [0, 2, 2]        1          1          2
1  [3, 3, 0]        1          2          2
2  [1, 1, 1]        1          3          1
3  [0, 1, 2]        2          1          3
4  [4, 5, 6]        2          2          3
5  [0, 8, 8]        2          3          2
The frequency here is the number of unique values in each list per sample. For example, [0, 2, 2] has two unique values.
I can count unique values in pandas when the cells aren't lists, or implement this with a for loop that goes through each row and inspects each list, but I want a better pandas way to do it.
Thanks.
You can use collections.Counter for the task. Note that this first snippet counts the values that appear exactly once in each list:
from collections import Counter

# count how many values occur exactly once in each list
df['frequency'] = df['samples'].apply(lambda x: sum(v == 1 for v in Counter(x).values()))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 1
1 [3, 3, 0] 1 2 1
2 [1, 1, 1] 1 3 0
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 1
EDIT: For the updated question (number of distinct values per list):
df['frequency'] = df['samples'].apply(lambda x: len(set(x)))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
import pandas as pd
import ast # import for sample data creation
from io import StringIO # import for sample data creation
# sample data
s = """samples;subject;trial_num
[0, 2, 2];1;1
[3, 3, 0];1;2
[1, 1, 1];1;3
[0, 1, 2];2;1
[4, 5, 6];2;2
[0, 8, 8];2;3"""
df = pd.read_csv(StringIO(s), sep=';')
df['samples'] = df['samples'].apply(ast.literal_eval)
# convert lists to a new frame and use nunique
# assign values to a col
df['frequency'] = pd.DataFrame(df['samples'].values.tolist()).nunique(1)
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
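To make the intermediate step explicit (my sketch, not part of the original answer): values.tolist() expands each list into its own row of a wide frame, and nunique(axis=1) then counts the distinct values per row; NaN padding from ragged lists is ignored by default:
wide = pd.DataFrame(df['samples'].values.tolist())
#    0  1  2
# 0  0  2  2
# 1  3  3  0
# 2  1  1  1
# 3  0  1  2
# 4  4  5  6
# 5  0  8  8
df['frequency'] = wide.nunique(axis=1)   # distinct values per row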

I want to get the complete row that corresponds to the max value of a specific column in a data-frame, grouped by another column [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 2 years ago.
I have a dataframe with columns a, b, c, d. I would like to have a new dataframe grouped by a, and for each group I want to get the complete row with values (a, b, c, d) corresponding to the max value of column d.
You can use DataFrame.xs to pick out the row(s) corresponding to a value of the sorted index.
import pandas as pd

dict1 = {'a': [1, 2, 3, 1, 2, 1, 1],
         'b': [4, 5, 6, 2, 2, 1, 4],
         'c': [7, 8, 9, 4, 1, 2, 4],
         'd': [1, 2, 3, 1, 2, 3, 4]}
df = pd.DataFrame(dict1)
Out:
   a  b  c  d
0  1  4  7  1
1  2  5  8  2
2  3  6  9  3
3  1  2  4  1
4  2  2  1  2
5  1  1  2  3
6  1  4  4  4
I'd group this way:
grp = df.set_index(['d','a']).sort_index(ascending=False)
and then just look at the first element of the index and take the slice:
grp.xs(4, drop_level=False)
Out:
     b  c
d a
4 1  4  4
You can also call df.reset_index() afterwards to get the row back as plain columns.
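If the goal is one full row per group of a rather than only the rows tied to the overall maximum, a common pattern (sketched here on the same sample data, in the spirit of the duplicate linked above) is idxmax per group plus loc:
out = df.loc[df.groupby('a')['d'].idxmax()]
print(out)
#    a  b  c  d
# 6  1  4  4  4
# 1  2  5  8  2
# 2  3  6  9  3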

Transform 2D numpy array to row-column-value pandas DataFrame

Suppose I have a 2D numpy array like this:
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
# array([[1, 2],
#        [3, 4],
#        [5, 6]])
How can one transform that to a "long" structure with one record per value, associated with the row and column index? In this case that would look like:
df = pd.DataFrame({'row': [0, 0, 1, 1, 2, 2],
                   'column': [0, 1, 0, 1, 0, 1],
                   'value': [1, 2, 3, 4, 5, 6]})
melt only assigns the column identifier, not the row:
pd.DataFrame(arr).melt()
# variable value
# 0 0 1
# 1 0 3
# 2 0 5
# 3 1 2
# 4 1 4
# 5 1 6
Is there a way to attach the row identifier?
Reset the index and pass it to melt as id_vars:
pd.DataFrame(arr).reset_index().melt('index')
# index variable value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
You can rename:
df = pd.DataFrame(arr).reset_index().melt('index')
df.columns = ['row', 'column', 'value']
melt can use the index if it's a column:
arrdf = pd.DataFrame(arr)
arrdf['row'] = arrdf.index
arrdf.melt(id_vars='row', var_name='column')
# row column value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
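Another option, not in the original answers but worth sketching because it matches the row-major order of the desired output: stack() yields a Series indexed by (row, column), which reset_index turns into the long frame directly:
long_df = (pd.DataFrame(arr)
             .stack()
             .rename_axis(['row', 'column'])
             .reset_index(name='value'))
#    row  column  value
# 0    0       0      1
# 1    0       1      2
# 2    1       0      3
# 3    1       1      4
# 4    2       0      5
# 5    2       1      6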

How to join two pandas Series into a single one with interleaved values?

I have two pandas.Series...
import pandas as pd
import numpy as np
length = 5
s1 = pd.Series( [1]*length ) # [1, 1, 1, 1, 1]
s2 = pd.Series( [2]*length ) # [2, 2, 2, 2, 2]
...and I would like to have them joined together into a single Series, with the values from the two series interleaved.
Something like: [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Using np.column_stack:
In [27]: pd.Series(np.column_stack((s1, s2)).flatten())
Out[27]:
0 1
1 2
2 1
3 2
4 1
5 2
6 1
7 2
8 1
9 2
dtype: int64
Here we are:
s1.index = range(0, len(s1)*2, 2)   # even positions: 0, 2, 4, ...
s2.index = range(1, len(s2)*2, 2)   # odd positions: 1, 3, 5, ...
interleaved = pd.concat([s1, s2]).sort_index()
idx values
0 1
1 2
2 1
3 2
4 1
5 2
6 1
7 2
8 1
9 2
Here's one using NumPy stacking, np.vstack -
pd.Series(np.vstack((s1,s2)).ravel('F'))
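A short note on why the 'F' matters (my reading, not part of the original answer): np.vstack((s1, s2)) puts s1 in row 0 and s2 in row 1, and ravel('F') flattens in column-major (Fortran) order, so it reads one element of s1, then one of s2, and so on:
stacked = np.vstack((s1, s2))                  # shape (2, 5): row 0 is s1, row 1 is s2
print(pd.Series(stacked.ravel('F')).tolist())  # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]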
