pandas groupby excluding when a column takes some value - python

Is there a way to exclude rows that take certain values when aggregating?
For example:
ID | Company | Cost
1  | Us      | 2
1  | Them    | 1
1  | Them    | 1
2  | Us      | 1
2  | Them    | 2
2  | Them    | 1
I would like to do a groupby and sum, but ignoring rows where Company == "Us".
The result should be something like this:
ID | Sum of cost
1  | 2
2  | 3
I solved it by doing this, but I want to know if there's a smarter solution:
df_agg = df[df['Company']!="Us"][['ID','Cost']].groupby(['ID']).sum()

Use:
print (df)
   ID Company  Cost
0   1      Us     2
1   1    Them     1
2   1    Them     1
3   2      Us     1
4   2    Them     2
5   2    Them     1
6   3      Us     1   <- new row added to show the difference
If you need to filter first, and unmatched groups (if they exist) are not important, use boolean indexing or DataFrame.query:
df1 = df[df.Company != "Us"].groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3

df1 = df.query('Company != "Us"').groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3
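For reference, a minimal self-contained sketch of this filter-first approach, building the sample data inline (including the extra ID=3 row added above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3],
                   'Company': ['Us', 'Them', 'Them', 'Us', 'Them', 'Them', 'Us'],
                   'Cost': [2, 1, 1, 1, 2, 1, 1]})

# keep only non-"Us" rows, then sum Cost per ID
df1 = df[df.Company != "Us"].groupby('ID', as_index=False).Cost.sum()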
If you need to keep every ID group (so an ID whose only rows are "Us" shows up with Cost=0), first set the "Us" costs to 0 and then aggregate:
df2 = (df.assign(Cost=df.Cost.where(df.Company != "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())
print (df2)
   ID  Cost
0   1     2
1   2     3
2   3     0
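The same zero-out idea can also be written with Series.mask, the complement of where; this is just an equivalent sketch, not a different result:

# mask replaces values where the condition IS true, so the condition flips
df2 = (df.assign(Cost=df.Cost.mask(df.Company == "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())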

Related

How to aggregate 3 columns in DataFrame to have count and distribution of values in separated columns in Python Pandas?

I have a Pandas DataFrame like below:
data types:
ID - int
TIME - int
TG - int
 ID | TIME     | TG
111 | 20210101 | 0
111 | 20210201 | 0
111 | 20210301 | 1
222 | 20210101 | 0
222 | 20210201 | 1
333 | 20210201 | 1
And I need to aggregate above DataFrame so as to know:
how many IDs are per each value in TIME
how many "1" from TG are per each value in TIME
how many "0" from TG are per each value in TIME
So I need to something like below:
TIME     | num_ID | num_1 | num_0
---------|--------|-------|------
20210101 | 2      | 0     | 2
20210201 | 3      | 2     | 1
20210301 | 1      | 1     | 0
How can I do that in Python Pandas?
Use GroupBy.size to count rows per TIME, joined with pd.crosstab to count the number of 0 and 1 values:
df1 = (df.groupby('TIME').size().to_frame('num_ID')
         .join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
         .reset_index())
print (df1)
       TIME  num_ID  num_0  num_1
0  20210101       2      2      0
1  20210201       3      1      2
2  20210301       1      0      1
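For reference, a self-contained sketch of this crosstab approach, with the sample data from the question built inline:

import pandas as pd

df = pd.DataFrame({'ID':   [111, 111, 111, 222, 222, 333],
                   'TIME': [20210101, 20210201, 20210301,
                            20210101, 20210201, 20210201],
                   'TG':   [0, 0, 1, 0, 1, 1]})

# row counts per TIME joined with per-value counts of TG
df1 = (df.groupby('TIME').size().to_frame('num_ID')
         .join(pd.crosstab(df['TIME'], df['TG']).add_prefix('num_'))
         .reset_index())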
Another idea, if you only need to count the 0 and 1 values, is GroupBy.agg with helper columns:
df1 = (df.assign(num_0=df['TG'].eq(0),
                 num_1=df['TG'].eq(1))
         .groupby('TIME').agg(num_ID=('TG', 'size'),
                              num_1=('num_1', 'sum'),
                              num_0=('num_0', 'sum'))
         .reset_index())
print (df1)
       TIME  num_ID  num_1  num_0
0  20210101       2      0      2
1  20210201       3      2      1
2  20210301       1      1      0
Another option is to pass a dict of functions to agg, then rename the flattened columns:
dict1 = {'ID': pd.Series.nunique,
         'TG': [lambda x: x.eq(1).sum(), lambda x: x.eq(0).sum()]}
col1 = ['num_id', 'num_1', 'num_0']
df.groupby('TIME').agg(dict1).set_axis(col1, axis=1).reset_index()
Result:
       TIME  num_id  num_1  num_0
0  20210101       2      0      2
1  20210201       3      2      1
2  20210301       1      1      0
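Yet another equivalent route (a sketch, not from the answers above) is pivot_table with aggfunc='size', which counts occurrences per TIME/TG pair directly; num_ID is then the row sum of those counts:

counts = df.pivot_table(index='TIME', columns='TG', aggfunc='size', fill_value=0)
df1 = (counts.add_prefix('num_')
             .assign(num_ID=counts.sum(axis=1))
             .reset_index()[['TIME', 'num_ID', 'num_1', 'num_0']])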

Count occurrences of specific value in column based on categories of another column

I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like for it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories and compute, in Clicks_count, the number of times per category that Clicks takes the value 1, and in Clicks_prob the probability of Clicks being 1 (that is, the count of Clicks == 1 divided by the number of observations in category i).
How could I do this? To get the second column I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Try sum and mean on the condition Clicks==1. Since you're working with groups, put them in groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
            sum      mean
Categories
1             1  0.333333
2             2  0.500000
3             1  0.333333
4             1  0.500000
To match output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Click_counts='sum', Clicks_prob='mean')
Output:
            Click_counts  Clicks_prob
Categories
1                      1     0.333333
2                      2     0.500000
3                      1     0.333333
4                      1     0.500000
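An equivalent spelled out on the DataFrame itself, with named aggregation and Categories kept as a regular column (is_one is a helper name introduced here, not from the original answer):

out = (df.assign(is_one=df['Clicks'].eq(1))
         .groupby('Categories', as_index=False)
         .agg(Clicks_count=('is_one', 'sum'),
              Clicks_prob=('is_one', 'mean')))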

Python Pandas keep the first occurence of a specific value and drop the rest of rows with the same specific value

I cannot figure out how to drop rows that hold a specific value while keeping only the first occurrence of that value.
I tried drop_duplicates, but that drops too much; I only want to deduplicate rows carrying one specific value (within the same column).
Data is formatted like so:
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
1 | 5
5 | 6
I want it like (based on Col_A):
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
5 | 6
Use idxmax and check the index. This of course assumes your index is unique.
m = df.Col_A.eq(1) # replace 1 with your desired bad value
df.loc[~m | (df.index == m.idxmax())]
   Col_A  Col_B
0      5      1
1      5      2
2      1      3
3      5      4
5      5      6
Try this:
df1 = df.copy()
mask = df['Col_A'] == 5
# make the repeated 5s temporarily unique, so drop_duplicates only collapses the 1s
df1.loc[mask, 'Col_A'] = df1.loc[mask, 'Col_A'] + range(len(df1.loc[mask, 'Col_A']))
df1 = df1.drop_duplicates(subset='Col_A', keep='first')
print(df.iloc[df1.index])
Output:
   Col_A  Col_B
0      5      1
1      5      2
2      1      3
3      5      4
5      5      6
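A variant of the same idea that avoids touching the index or mutating values (a sketch): keep a row if it is not the bad value, or if it is the first bad row seen so far.

m = df['Col_A'].eq(1)               # rows holding the bad value
df_out = df[~m | m.cumsum().eq(1)]  # all good rows, plus only the first bad row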

Ranking columns by a single integer

Hi, I am basically trying to rank a column of a DataFrame into ranking positions.
It looks something like the table below, which is what I am trying to create. People with the same number of fruits sold should share the same ranking, so that when I sort by ranking there are no decimals. Can anyone advise me?
person | number of fruits sold | ranking
A | 5 | 2
B | 6 | 1
C | 2 | 4
D | 5 | 2
E | 3 | 3
You can use pd.factorize. A few tricks here: take care to negate your series, specify sort=True, add 1 for your desired result.
df['ranking'] = pd.factorize(-df['number of fruits sold'], sort=True)[0] + 1
Result:
  person  number of fruits sold  ranking
0      A                      5        2
1      B                      6        1
2      C                      2        4
3      D                      5        2
4      E                      3        3
Use Series.rank:
df['ranking'] = df['number of fruits sold'].rank(method='dense', ascending=False).astype(int)
print (df)
  person  number of fruits sold  ranking
0      A                      5        2
1      B                      6        1
2      C                      2        4
3      D                      5        2
4      E                      3        3
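The method argument is what avoids the decimals: the default 'average' gives tied rows a fractional rank, 'min' leaves gaps after ties, and 'dense' keeps the ranks consecutive. A quick sketch of the difference on this data:

s = df['number of fruits sold']                 # [5, 6, 2, 5, 3]
print (s.rank(ascending=False))                 # 'average': the tied 5s both get 2.5
print (s.rank(method='min', ascending=False))   # 'min': ties get 2, rank 3 is skipped
print (s.rank(method='dense', ascending=False)) # 'dense': ties get 2, next rank is 3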

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
  | person | period | value
0 | P22    | 1      | 0
1 | P23    | 1      | 0
2 | P24    | 1      | 1
3 | P25    | 1      | 0
4 | P26    | 1      | 1
5 | P22    | 2      | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
  | person | period | value | lastperiod
5 | P22    | 2      | 1     | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] == [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6      0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
  person  period  value  lastPeriod
1    P22       1      0         NaN
2    P23       1      0         NaN
3    P24       1      1         NaN
4    P25       1      0         NaN
5    P26       1      1         NaN
6    P22       2      1           0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
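In current pandas the apply/lambda is unnecessary: GroupBy.shift performs the per-group shift directly, so the column can be built in one line:

df['lastPeriod'] = df.groupby('person')['value'].shift()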
