Get value and key lists out of pandas groupBy - python

I am using pandas to create three arrays that I need for some stats.
I need three fields: the month, the number of finishes in that month, and the number of starts in that month.
My dataframe is the following
month finish started
0 MONTH.Mar 1 0
1 MONTH.Mar 1 0
2 MONTH.Mar 1 0
3 MONTH.Mar 1 0
4 MONTH.Mar 1 0
5 MONTH.Mar 0 1
6 MONTH.Apr 1 0
7 MONTH.Mar 0 1
8 MONTH.Mar 0 1
9 MONTH.Feb 0 1
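For reference, a minimal sketch to reconstruct this frame (assuming the month column holds plain strings):
import pandas as pd

df = pd.DataFrame({
    'month': ['MONTH.Mar'] * 6 + ['MONTH.Apr'] + ['MONTH.Mar'] * 2 + ['MONTH.Feb'],
    'finish': [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
    'started': [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})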
I do a groupby:
df.groupby('month').sum()
and the output is the following:
finish started
month
MONTH.Apr 1 0
MONTH.Feb 0 1
MONTH.Mar 5 3
How can I convert the data into three different lists like this:
['MONTH.Apr','MONTH.Feb','MONTH.Mar']
[1,0,5]
[0,1,3]
I tried to do frame.values.tolist() but the output was the following:
[[1, 0], [0, 1], [5, 3]]
and it was impossible to get the months.

IIUC, try reset_index() followed by a transpose with .T:
>>> df.groupby('month').sum().reset_index().T.to_numpy()
array([['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar'],
       [1, 0, 5],
       [0, 1, 3]], dtype=object)
Or:
>>> df.groupby('month').sum().reset_index().T.values.tolist()
[['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar'], [1, 0, 5], [0, 1, 3]]
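If you prefer not to transpose, a minimal sketch that pulls each list out by column name from the same grouped result:
g = df.groupby('month').sum()
months = g.index.tolist()        # ['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar']
finish = g['finish'].tolist()    # [1, 0, 5]
started = g['started'].tolist()  # [0, 1, 3]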

You can use:
month, finish, started = df.groupby('month', as_index=False) \
.sum().to_dict('list').values()
Output:
>>> month
['MONTH.Apr', 'MONTH.Feb', 'MONTH.Mar']
>>> finish
[1, 0, 5]
>>> started
[0, 1, 3]

Related

Retain pandas multiindex after function across level

I'm looking to find the minimum value across level 1 of a MultiIndex (time in this example), but I'd like to retain all the other labels of the index.
import numpy as np
import pandas as pd
stack = [
    [0, 1, 1, 5],
    [0, 1, 2, 6],
    [0, 1, 3, 2],
    [0, 2, 3, 4],
    [0, 2, 2, 5],
    [0, 3, 2, 1],
    [1, 1, 0, 5],
    [1, 1, 2, 6],
    [1, 1, 3, 7],
    [1, 2, 2, 8],
    [1, 2, 3, 9],
    [2, 1, 7, 1],
    [2, 1, 8, 3],
    [2, 2, 3, 4],
    [2, 2, 8, 1],
]
df = pd.DataFrame(stack)
df.columns = ['self', 'time', 'other', 'value']
df.set_index(['self', 'time', 'other'], inplace=True)
df.groupby(level=1).min() doesn't return the correct values:
      value
time
1         1
2         1
3         1
Doing something like df.groupby(level=[0,1,2]).min() returns the original dataframe unchanged.
I swear I used to be able to do this by calling .min(level=1), but that now gives deprecation notices telling me to use the groupby format above, and the result seems different from what I remember. Am I stupid?
original:
                 value
self time other
0    1    1          5
          2          6
          3          2  #<-- min row
     2    3          4  #<-- min row
          2          5
     3    2          1  #<-- min row
1    1    0          5  #<-- min row
          2          6
          3          7
     2    2          8  #<-- min row
          3          9
2    1    7          1  #<-- min row
          8          3
     2    3          4
          8          1  #<-- min row
desired result:
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1
Group by your first two levels and take idxmin instead of min to get the full index labels of the minimum rows. Finally, use loc to select those rows from your original dataframe:
out = df.loc[df.groupby(level=['self', 'time'])['value'].idxmin()]
print(out)
# Output
                 value
self time other
0    1    3          2
     2    3          4
     3    2          1
1    1    0          5
     2    2          8
2    1    7          1
     2    8          1
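As a rough alternative sketch (not taken from the answers here), sorting by value and keeping the first row of each group also preserves all three index levels, assuming ties may be broken arbitrarily:
out = (df.sort_values('value')
         .groupby(level=['self', 'time'])
         .head(1)
         .sort_index())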
Why not just group by the first two index levels, rather than all three?
out = df.groupby(level=[0,1]).min()
Output:
>>> out
           value
self time
0    1         2
     2         4
     3         1
1    1         5
     2         8
2    1         1
     2         1

How to count number of unique values in pandas while each cell includes list

I have a data frame like this:
import pandas as pd
import numpy as np
Out[10]:
     samples  subject  trial_num
0  [0, 2, 2]        1          1
1  [3, 3, 0]        1          2
2  [1, 1, 1]        1          3
3  [0, 1, 2]        2          1
4  [4, 5, 6]        2          2
5  [0, 8, 8]        2          3
I want to have the output like this:
     samples  subject  trial_num  frequency
0  [0, 2, 2]        1          1          2
1  [3, 3, 0]        1          2          2
2  [1, 1, 1]        1          3          1
3  [0, 1, 2]        2          1          3
4  [4, 5, 6]        2          2          3
5  [0, 8, 8]        2          3          2
The frequency here is the number of unique values in each list per sample. For example, [0, 2, 2] has two unique values.
I know how to count unique values in pandas when the cells are not lists, and I could loop over each row and process each list, but I want a better, pandas-native way to do it.
Thanks.
You can use collections.Counter for the task; this version counts the values that appear exactly once in each list:
from collections import Counter
df['frequency'] = df['samples'].apply(lambda x: sum(v==1 for v in Counter(x).values()))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 1
1 [3, 3, 0] 1 2 1
2 [1, 1, 1] 1 3 0
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 1
EDIT: For the updated question (the number of unique values per list):
df['frequency'] = df['samples'].apply(lambda x: len(set(x)))
print(df)
Prints:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
import pandas as pd
import ast # import for sample data creation
from io import StringIO # import for sample data creation
# sample data
s = """samples;subject;trial_num
[0, 2, 2];1;1
[3, 3, 0];1;2
[1, 1, 1];1;3
[0, 1, 2];2;1
[4, 5, 6];2;2
[0, 8, 8];2;3"""
df = pd.read_csv(StringIO(s), sep=';')
df['samples'] = df['samples'].apply(ast.literal_eval)
# convert lists to a new frame and use nunique
# assign values to a col
df['frequency'] = pd.DataFrame(df['samples'].values.tolist()).nunique(1)
Output:
samples subject trial_num frequency
0 [0, 2, 2] 1 1 2
1 [3, 3, 0] 1 2 2
2 [1, 1, 1] 1 3 1
3 [0, 1, 2] 2 1 3
4 [4, 5, 6] 2 2 3
5 [0, 8, 8] 2 3 2
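Another possible sketch uses Series.explode, which keeps the original row index, so distinct values can be counted per row with a groupby (assumes df has a unique index):
df['frequency'] = df['samples'].explode().groupby(level=0).nunique()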

How to create a new DataFrame repeating rows using indexes from original DF

I have a DataFrame of randomly generated agents. However, I want to expand it to match the population I am looking for, so I need to repeat rows according to my sampled indexes.
Here is a loop code that is taking forever:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
new_df = pd.DataFrame(columns=['a'])
for i, idx in enumerate(sampled_indexes):
    new_df.loc[i] = df.loc[idx]
Then, the original DataFrame:
df
a
0 0
1 1
2 2
gives me the result of an enlarged new dataframe
new_df
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2
So, this loop is too slow with a DataFrame that has 34,000 or more rows (takes forever).
How can I do this simpler and faster?
Reindex the dataframe with sampled_indexes, then reset the index.
df.reindex(sampled_indexes).reset_index(drop=True)
You can do DataFrame.merge:
df = pd.DataFrame({'a': [0, 1, 2]})
sampled_indexes = [0, 0, 1, 1, 2, 2, 2]
print( df.merge(pd.DataFrame({'a': sampled_indexes})) )
Prints:
a
0 0
1 0
2 1
3 1
4 2
5 2
6 2
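If you have per-row repeat counts instead of a flat list of sampled indexes, Index.repeat is another option (a sketch with hypothetical counts):
counts = [2, 2, 3]  # hypothetical: repeat row 0 twice, row 1 twice, row 2 three times
new_df = df.loc[df.index.repeat(counts)].reset_index(drop=True)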

Transform 2D numpy array to row-column-value pandas DataFrame

Suppose I have a 2D numpy array like this:
arr = np.array([[1, 2], [3, 4], [5, 6]])
# array([[1, 2],
#        [3, 4],
#        [5, 6]])
How can one transform that to a "long" structure with one record per value, associated with the row and column index? In this case that would look like:
df = pd.DataFrame({'row': [0, 0, 1, 1, 2, 2],
                   'column': [0, 1, 0, 1, 0, 1],
                   'value': [1, 2, 3, 4, 5, 6]})
melt only assigns the column identifier, not the row:
pd.DataFrame(arr).melt()
# variable value
# 0 0 1
# 1 0 3
# 2 0 5
# 3 1 2
# 4 1 4
# 5 1 6
Is there a way to attach the row identifier?
Pass the index as id_vars after a reset_index:
pd.DataFrame(arr).reset_index().melt('index')
# index variable value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
You can rename the columns afterwards:
df = pd.DataFrame(arr).reset_index().melt('index')
df.columns = ['row', 'column', 'value']
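The renaming can also be done inline, a sketch using rename_axis together with melt's var_name/value_name arguments:
df = (pd.DataFrame(arr)
        .rename_axis('row')
        .reset_index()
        .melt('row', var_name='column', value_name='value'))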
melt can use the index if it's a column:
arrdf = pd.DataFrame(arr)
arrdf['row'] = arrdf.index
arrdf.melt(id_vars='row', var_name='column')
# row column value
# 0 0 0 1
# 1 1 0 3
# 2 2 0 5
# 3 0 1 2
# 4 1 1 4
# 5 2 1 6
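A numpy-only sketch of the same reshaping, using np.indices to build the row and column labels (assumes arr is a 2-D array):
import numpy as np
import pandas as pd

rows, cols = np.indices(arr.shape)
long_df = pd.DataFrame({'row': rows.ravel(),
                        'column': cols.ravel(),
                        'value': arr.ravel()})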

Pandas - Does row fall below a row with a column value and same id

I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2 that indicates whether a row falls below another row with the same id where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but that only operates row by row, and from my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby.
df['val2'] = df.groupby('id').val1.apply(
    lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
    df.groupby('id')
      .val1.shift()
      .groupby(df.id)
      .cumsum()
      .ge(1)
      .astype(int)
)
df
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
.fillna(0).astype(bool).astype(int)
print(df)
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
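Yet another sketch of the same idea, using cummax on the shifted values so the flag stays at 1 once a val1 == 1 has appeared above it within the group:
df['val2'] = (df.groupby('id')['val1']
                .transform(lambda s: s.shift(fill_value=0).cummax())
                .astype(int))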
