Selecting rows from pandas dataframe limited by count per column value - python

I have a dataframe defined as follows:
import pandas as pd

df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
                   'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'count': [2, 2, 2, 2, 1, 1, 2, 2, 2]})
For each class, I'd like to select the top n rows, where n is given by the count column. The expected output from the above dataframe would be like this:
   id class  count
0  11     A      2
1  12     A      2
4  21     B      1
6  31     C      2
7  32     C      2
How can I achieve this?

You could use
In [771]: df.groupby('class').apply(
              lambda x: x.head(x['count'].iloc[0])
          ).reset_index(drop=True)
Out[771]:
   id class  count
0  11     A      2
1  12     A      2
2  21     B      1
3  31     C      2
4  32     C      2

Use:
(df.groupby('class', as_index=False, group_keys=False)
   .apply(lambda x: x.head(x['count'].iloc[0])))
Output:
   id class  count
0  11     A      2
1  12     A      2
4  21     B      1
6  31     C      2
7  32     C      2

Using cumcount
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
   id class  count
0  11     A      2
1  12     A      2
4  21     B      1
6  31     C      2
7  32     C      2
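To see why the mask works, it helps to materialize the intermediate counter first (a sketch; the variable name pos is my own):
pos = df.groupby('class').cumcount() + 1
# cumcount numbers the rows within each class starting at 0, so pos is
# each row's 1-based position inside its class:
# [1, 2, 3, 4, 1, 2, 1, 2, 3]
# Keeping only the rows whose position does not exceed their own 'count'
# value gives the top-n rows per class:
df[pos.le(df['count'])]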

Here is a solution which groups by class, looks at the first value of count in the smaller dataframe, and returns that many rows.
def func(df_):
    count_val = df_['count'].values[0]
    return df_.iloc[0:count_val]

df.groupby('class', group_keys=False).apply(func)
returns
   id class  count
0  11     A      2
1  12     A      2
4  21     B      1
6  31     C      2
7  32     C      2

Related

Pandas groupby cumcount starting on row with a certain column value

I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumcountB']
data = [['A', 3, 1, '', ''],
        ['A', 20, 4, '', ''],
        ['A', 102, 8, 1, ''],
        ['A', 117, 10, 2, 1],
        ['B', 75, 0, '', ''],
        ['B', 170, 12, 1, 1],
        ['B', 200, 13, 2, 2],
        ['B', 300, 20, 3, 3],
        ]
pd.DataFrame(columns=columns, data=data)
  ID  colA  colB cumcountA cumcountB
0  A     3     1
1  A    20     4
2  A   102     8         1
3  A   117    10         2         1
4  B    75     0
5  B   170    12         1         1
6  B   200    13         2         2
7  B   300    20         3         3
How would I calculate cumcountA and cumcountB?
You can try df.clip with lower set to your threshold values (here 100 and 10), compare, then group by ID and take the cumulative sum:
col_list = ['colA', 'colB']
val_list = [100, 10]

df[['cumcountA', 'cumcountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list, axis=1))
                                  .groupby(df['ID']).cumsum().replace(0, ''))
print(df)
Or, maybe even better, compare directly:
df[['cumcountA', 'cumcountB']] = (df[['colA', 'colB']].ge([100, 10])
                                  .groupby(df['ID']).cumsum().replace(0, ''))
print(df)
  ID  colA  colB cumcountA cumcountB
0  A     3     1
1  A    20     4
2  A   102     8         1
3  A   117    10         2         1
4  B    75     0
5  B   170    12         1         1
6  B   200    13         2         2
7  B   300    20         3         3
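To sanity-check the logic, you can inspect the boolean frame before the cumulative sum (a sketch; the variable name mask is my own):
mask = df[['colA', 'colB']].ge([100, 10])
# Each True marks a row at or above its column's threshold; cumsum
# within each ID then turns a run of Trues into 1, 2, 3, ...
print(mask)
print(mask.groupby(df['ID']).cumsum())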

Pandas: set preceding values conditional on current value in column (by group)

I have a pandas data frame where values should be greater than or equal to the preceding values. In cases where the current value is lower than the preceding values, the preceding values must be set equal to the current value. This is best explained by the example below:
data = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B',
                  'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': [0, 1, 2, 3, 2, 0, 1, 2, 3, 1, 5, 0, 1, 0, 3, 2]}
df = pd.DataFrame(data)
df
   group  value
0      A      0
1      A      1
2      A      2
3      A      3
4      A      2
5      B      0
6      B      1
7      B      2
8      B      3
9      B      1
10     B      5
11     C      0
12     C      1
13     C      0
14     C      3
15     C      2
and the result I am looking for is:
   group  value
0      A      0
1      A      1
2      A      2
3      A      2
4      A      2
5      B      0
6      B      1
7      B      1
8      B      1
9      B      1
10     B      5
11     C      0
12     C      0
13     C      0
14     C      2
15     C      2
So here's my go!
(Special thanks to @jezrael for helping me simplify it considerably!)
I'm basing this on expanding windows, in reverse, to always get the suffix of the elements in each group (from each element, expanding towards the last one).
The expanding window has the following logic: for the element at index i, you get a Series containing all elements in the group with indices >= i, and you need to return a single new value for position i in the result.
What is the value corresponding to this suffix? Its minimum, because if the later elements are smaller, we need to take the smallest among them.
Then we can assign the result of this operation to df['value'].
Try this:
df['value'] = (df.iloc[::-1]
                 .groupby('group')['value']
                 .expanding()
                 .min()
                 .reset_index(level=0, drop=True)
                 .astype(int))
print(df)
Output:
   group  value
0      A      0
1      A      1
2      A      2
3      A      2
4      A      2
5      B      0
6      B      1
7      B      1
8      B      1
9      B      1
10     B      5
11     C      0
12     C      0
13     C      0
14     C      2
15     C      2
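An equivalent, somewhat shorter formulation uses cummin on the reversed frame, since the cumulative minimum over the reversed rows is exactly the suffix minimum (a sketch):
# cummin carries the smallest value seen so far within each group while
# walking the frame backwards; assignment realigns the result by index.
df['value'] = df.iloc[::-1].groupby('group')['value'].cummin()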
I didn't get your output, but I believe you are looking for something like:
import numpy as np

df['fwd'] = df.value.shift(-1)
df['new'] = np.where(df['value'] > df['fwd'], df['fwd'], df['value'])

How to get equivalent of pandas melt using groupby + stack?

Recently I have been learning groupby and stack, and I encountered the pandas method melt. I would like to know how to achieve the same result given by melt using groupby and stack.
Here is the MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
df1 = pd.melt(df, id_vars='A', value_vars=['B', 'C'], var_name='variable', value_name='value')
print(df1)
   A variable  value
0  1        B      1
1  1        B      1
2  1        B      2
3  2        B      2
4  2        B      1
5  1        C     10
6  1        C     20
7  1        C     30
8  2        C     40
9  2        C     50
How to get the same result using groupby and stack?
My attempt:
df.groupby('A')[['B','C']].count().stack(0).reset_index()
This is not quite correct, and I am looking for suggestions.
I guess you do not need groupby, just stack + sort_values:
result = (df[['A', 'B', 'C']].set_index('A')
          .stack().reset_index().sort_values(by='level_1'))
result.columns = ['A', 'variable', 'value']
Output:
   A variable  value
0  1        B      1
2  1        B      1
4  1        B      2
6  2        B      2
8  2        B      1
1  1        C     10
3  1        C     20
5  1        C     30
7  2        C     40
9  2        C     50
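An equivalent spelling that avoids the manual column rename uses rename_axis to name the stacked level up front (a sketch; the kind='stable' sort and the final reset_index also reproduce melt's fresh 0..9 RangeIndex):
result = (df.set_index('A')[['B', 'C']]
            .stack()
            .rename('value')
            .rename_axis(index=['A', 'variable'])
            .reset_index()
            .sort_values('variable', kind='stable')
            .reset_index(drop=True))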

return new columns by subtracting two columns in a pandas list

It seems so basic, but I can't work out how to achieve the following...
Consider the scenario where I have the following data:
import pandas as pd

all_columns = ['A', 'B', 'C', 'D']
first_columns = ['A', 'B']
second_columns = ['C', 'D']
new_columns = ['E', 'F']
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
df = pd.DataFrame(data=values, columns=all_columns)
df
    A   B   C   D
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
Using this data, how can I subtract column A from column C and column B from column D, and return the results as two new columns E and F in my pandas DataFrame? I have multiple columns, so writing the formulas one by one is not an option.
I imagine it should be something like the line below, but Python thinks that I am trying to subtract the list names rather than the values in the actual lists...
df[new_columns] = df[second_columns] - df[first_columns]
Expected output:
    A   B   C   D  E  F
0   1   2   3   4  2  2
1   5   6   7   8  2  2
2   9  10  11  12  2  2
3  13  14  15  16  2  2
df['E'] = df['C'] - df['A']
df['F'] = df['D'] - df['B']
Or, alternatively (similar to @rafaelc's comment):
new_cols = ['E', 'F']
second_cols = ['C', 'D']
first_cols = ['A', 'B']
df[new_cols] = df[second_cols] - df[first_cols].values
As @rafaelc and @Ben.T mentioned, the approach below would be a good fit.
I'm just placing this in the answer section for posterity...
>>> df
    A   B   C   D
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
Result:
>>> df[['E', 'F']] = df[['C', 'D']] - df[['A', 'B']].values
>>> df
    A   B   C   D  E  F
0   1   2   3   4  2  2
1   5   6   7   8  2  2
2   9  10  11  12  2  2
3  13  14  15  16  2  2
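The reason the plain df[second_columns] - df[first_columns] fails is that pandas aligns the operands on their column labels, so 'C'/'D' minus 'A'/'B' produces four all-NaN columns; stripping the labels from at least one side avoids the alignment. With the question's own lists (a sketch):
# .to_numpy() (or .values) drops the column labels, so the subtraction
# is purely positional: C - A and D - B.
df[new_columns] = df[second_columns].to_numpy() - df[first_columns].to_numpy()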

simply put data on top of another pandas python

I have 2 sample datasets, dfa and dfb:
import pandas as pd
a = {'unit': ['A', 'B', 'C', 'D'],
     'count': [1, 12, 34, 52]}
b = {'department': ['E', 'F'],
     'count': [6, 12]}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
They look like:
dfa
count  unit
1      A
12     B
34     C
52     D
dfb
count  department
6      E
12     F
What I want is simply to stack dfa on top of dfb, not based on any column or any index. I have checked this page: https://pandas.pydata.org/pandas-docs/stable/merging.html but couldn't find the right approach for my purpose.
My desired output is a dfc that looks like the dataset below; I want to keep both headers:
dfc:
count  unit
1      A
12     B
34     C
52     D
count  department
6      E
12     F
In [37]: pd.concat([dfa, pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)],
    ...:           ignore_index=True)
Out[37]:
   count        unit
0      1           A
1     12           B
2     34           C
3     52           D
4  count  department
5      6           E
6     12           F
or
In [39]: dfa.append(pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns)) \
    ...:     .reset_index(drop=True)
Out[39]:
   count        unit
0      1           A
1     12           B
2     34           C
3     52           D
4  count  department
5      6           E
6     12           F
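Both variants rely on the same dfb.T.reset_index().T.values idiom, which simply prepends dfb's header as a data row (a sketch of the intermediate, assuming the column order displayed above):
print(dfb.T.reset_index().T.values)
# array([['count', 'department'],
#        [6, 'E'],
#        [12, 'F']], dtype=object)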
UPDATE: merging 3 DFs:
pd.concat([dfa,
           pd.DataFrame(dfb.T.reset_index().T.values, columns=dfa.columns),
           pd.DataFrame(dfc.T.reset_index().T.values, columns=dfa.columns)],
          ignore_index=True)
Option 1
You can construct it from scratch using np.vstack:
import numpy as np

pd.DataFrame(
    np.vstack([dfa.values, dfb.columns, dfb.values]),
    columns=dfa.columns
)
   count        unit
0      1           A
1     12           B
2     34           C
3     52           D
4  count  department
5      6           E
6     12           F
Option 2
You can export to CSV and read it back:
from io import StringIO
import pandas as pd

pd.read_csv(StringIO(
    '\n'.join([d.to_csv(index=None) for d in [dfa, dfb]])
))
   count        unit
0      1           A
1     12           B
2     34           C
3     52           D
4  count  department
5      6           E
6     12           F
# Write dfb's header as a new last row of dfa, then rename dfb's columns
# to match dfa's so that append lines the two frames up positionally:
dfa.loc[len(dfa), :] = dfb.columns
dfb.columns = dfa.columns
dfa.append(dfb)
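Note that DataFrame.append was removed in pandas 2.0; the same idea still works with pd.concat (a sketch):
dfa.loc[len(dfa), :] = dfb.columns
dfb.columns = dfa.columns
dfc = pd.concat([dfa, dfb], ignore_index=True)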
