Run calculations on list of selected columns [duplicate] - python

With the DataFrame below as an example,
In [83]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
df
Out[83]:
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25
What would be a simple way to generate a new column containing some aggregation of the data over one of the columns?
For example, if I sum values over items in A
In [84]:
df.groupby('A').sum()['values']
Out[84]:
A
1    25
2    45
Name: values
How can I get
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45

In [20]: df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'values':np.arange(10,30,5)})
In [21]: df
Out[21]:
   A  B  values
0  1  1      10
1  1  2      15
2  2  1      20
3  2  2      25
In [22]: df['sum_values_A'] = df.groupby('A')['values'].transform(np.sum)
In [23]: df
Out[23]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
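Why this works: transform computes the aggregate per group and broadcasts it back to the shape of the original column, so the result is aligned with df's index and can be assigned directly. Passing the string 'sum' instead of np.sum is equivalent and lets pandas use its optimized implementation:
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')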

I found a way using join:
In [101]:
aggregated = df.groupby('A').sum()['values']
aggregated.name = 'sum_values_A'
df.join(aggregated,on='A')
Out[101]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
Does anyone have a simpler way to do it?

This is not as direct, but I found the use of map to create new columns from another column very intuitive, and it can be applied to many other cases:
gb = df.groupby('A').sum()['values']
def getvalue(x):
    return gb[x]
df['sum'] = df['A'].map(getvalue)
df
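Note that Series.map also accepts a Series directly (values are looked up in its index), so the helper function can be dropped:
gb = df.groupby('A')['values'].sum()
df['sum'] = df['A'].map(gb)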

In [15]: def sum_col(df, col, new_col):
   ....:     df[new_col] = df[col].sum()
   ....:     return df
In [16]: df.groupby("A").apply(sum_col, 'values', 'sum_values_A')
Out[16]:
   A  B  values  sum_values_A
0  1  1      10            25
1  1  2      15            25
2  2  1      20            45
3  2  2      25            45
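Note that sum_col mutates each group frame in place before returning it. A non-mutating sketch of the same idea using assign (with group_keys=False so the original index is kept rather than prepending the group key):
out = df.groupby('A', group_keys=False).apply(
    lambda g: g.assign(sum_values_A=g['values'].sum())
)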

Related

Remove groups from dataframe where a min value within that group is not below a threshold

The dataframe looks like this:
id1  id2  value
1    1    35
1    1    23
1    1    20
1    2    5
1    2    50
2    1    42
2    1    3
2    1    12
2    2    64
2    3    34
2    3    1
I want to group them by id1 and id2, and remove all rows of a group if the minimum value of that group is not less than 10.
So the result would look like this:
id1  id2  value
1    2    5
1    2    50
2    1    42
2    1    3
2    1    12
2    3    34
2    3    1
I have tried this:
dfmin = df.groupby(["id1", "id2"])["value"].min().reset_index()
df = df[
    dfmin.loc[
        (dfmin["id1"] == df["id1"]) & (dfmin["id1"] == df["id1"]),
        "value",
    ].iat[0]
    < 10
]
But I get the error "Can only compare identically-labeled Series objects".
What am I doing wrong and is there a better way?
Use GroupBy.filter, which keeps all rows of the groups for which the function returns True:
out = df.groupby(['id1', 'id2']).filter(lambda x: x['value'].min() < 10)
out
    id1  id2  value
3     1    2      5
4     1    2     50
5     2    1     42
6     2    1      3
7     2    1     12
9     2    3     34
10    2    3      1
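filter calls the lambda once per group in Python; an equivalent boolean mask built with transform is usually faster when there are many groups:
out = df[df.groupby(['id1', 'id2'])['value'].transform('min') < 10]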

python pandas: Remove duplicates by columns A, which is not satisfying a condition in column B

I have a dataframe with repeated values in column A. I want to drop duplicates, keeping the rows whose value in column B is greater than 0.
So this:
A  B
1  20
1  10
1  -3
2  30
2  -9
2  40
3  10
Should turn into this:
A  B
1  20
1  10
2  30
2  40
3  10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicate rows in the sample data, so filtering on column B alone is enough:
df = df[df['B'].gt(0)]
print (df)
   A   B
0  1  20
1  1  10
3  2  30
5  2  40
6  3  10
If there are duplicates:
print (df)
   A   B
0  1  20
1  1  10
2  1  10
3  1  10
4  1  -3
5  2  30
6  2  -9
7  2  40
8  3  10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
   A   B
0  1  20
1  1  10
5  2  30
7  2  40
8  3  10
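An equivalent chained form, filtering first and then dropping the fully identical rows that remain:
df = df[df['B'].gt(0)].drop_duplicates()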

groupby in pandas removes duplicates

I have a dataframe (df)
a  b  c
1  2  20
1  2  15
2  4  30
3  2  20
3  2  15
and I want to keep only the rows with the max values from column c.
I tried
a = df.loc[df.groupby('b')['c'].idxmax()]
but the groupby removes duplicates, so I get
a  b  c
1  2  20
2  4  30
It removes the a=3 rows because their b and c values are the same as those of the a=1 rows.
Is there any way to write the code so that it does not remove these duplicates?
Just also take column a into account when you do the groupby:
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
   a  b   c
0  1  2  20
2  2  4  30
3  3  2  20
I think you need:
df = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df)
   a  b   c
0  1  2  20
2  2  4  30
3  3  2  20
The difference shows up on modified data: idxmax keeps only the first max row per group, while transform keeps every row that equals the group max.
print (df)
   a  b   c
0  1  2  30
1  1  2  30
2  1  2  15
3  2  4  30
4  3  2  20
5  3  2  15
# only the first max row per (a, b) group
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
print (a)
   a  b   c
0  1  2  30
3  2  4  30
4  3  2  20
# all max rows per b group
df1 = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df1)
   a  b   c
0  1  2  30
1  1  2  30
3  2  4  30
# all max rows per (a, b) group
df2 = df[df['c'] == df.groupby(['a', 'b'])['c'].transform('max')]
print (df2)
   a  b   c
0  1  2  30
1  1  2  30
3  2  4  30
4  3  2  20

pandas reset_index after groupby.value_counts()

I am trying to groupby a column and compute value counts on another column.
import pandas as pd
dftest = pd.DataFrame({'A':[1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                       'Amt':[20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
print(dftest)
dftest looks like
    A  Amt
0   1   20
1   1   20
2   1   20
3   1   30
4   1   30
5   1   30
6   1   30
7   1   40
8   1   40
9   2   10
10  2   10
11  2   40
12  2   40
13  2   40
Perform the grouping:
grouper = dftest.groupby('A')
df_grouped = grouper['Amt'].value_counts()
which gives
A  Amt
1  30     4
   20     3
   40     2
2  40     3
   10     2
Name: Amt, dtype: int64
What I want is to keep the top two rows of each group.
Also, I was perplexed by an error when I tried to reset_index:
df_grouped.reset_index()
which gives the following error:
ValueError: cannot insert Amt, already exists
You need the name parameter in reset_index, because the Series name is the same as the name of one of the MultiIndex levels:
df_grouped.reset_index(name='count')
Another solution is to rename the Series:
print (df_grouped.rename('count').reset_index())
   A  Amt  count
0  1   30      4
1  1   20      3
2  1   40      2
3  2   40      3
4  2   10      2
A more common solution than value_counts is to aggregate with size:
df_grouped1 = dftest.groupby(['A','Amt']).size().reset_index(name='count')
print (df_grouped1)
   A  Amt  count
0  1   20      3
1  1   30      4
2  1   40      2
3  2   10      2
4  2   40      3
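For the other part of the question, keeping the top two rows of each group: value_counts sorts the counts in descending order within each group, so taking the first two rows per level-0 group works:
df_grouped.groupby(level=0).head(2)
A  Amt
1  30     4
   20     3
2  40     3
   10     2
Name: Amt, dtype: int64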

Pandas access length of group tuple key with only one part of that key

I've got a DataFrameGroupBy with a key that is of the structure Hour, ID.
I am trying to get the size of each group with each key from each hour.
Running mygroup.size() gives me output like:
Hour  ID
0     41       3
      55      10
      56       1
      60       7
      65       1
...
23    2218     5
      2222     9
      2223     5
      2225     2
What I want to be able to do is filter this list so I can get the total number in each group, based on the Hour part of the key (0-23).
Call count and pass level=0. For example:
In [21]:
df = pd.DataFrame({'a':[0,0,1,1,1,1],'b':[1,2,3,12,3,4],'c':np.arange(6)})
df
Out[21]:
   a   b  c
0  0   1  0
1  0   2  1
2  1   3  2
3  1  12  3
4  1   3  4
5  1   4  5
In [22]:
gp = df.groupby(['a','b'])
gp.size()
Out[22]:
a  b
0  1     1
   2     1
1  3     2
   4     1
   12    1
dtype: int64
In [23]:
gp.size().count(level=0)
Out[23]:
a
0    2
1    3
dtype: int64
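Note: the level argument of count was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent is a groupby on the index level:
gp.size().groupby(level=0).count()
And if the goal is the total number of rows per hour rather than the number of distinct sub-keys, sum the sizes instead: gp.size().groupby(level=0).sum().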
