pandas dataframe groupby apply multi columns and get count


I have an Excel sheet like this:
year   a   b
2021  12  23
2021  31   0
2021  15  21
2021  14   0
2022  32   0
2022  24  15
2022  28  29
2022  33   0
I want to count the rows where a >= 30 and b == 0, grouped by year.
The final output should look like this:
2021 1
2022 2
I want to implement this with a pandas DataFrame. Can anyone help? I'm quite new to Python.

To count the matching rows, combine both conditions with & (element-wise AND), then aggregate with sum; each True counts as 1 and each False as 0:
df1 = (((df.a >= 30) & (df.b == 0)).astype(int)
       .groupby(df['year']).sum().reset_index(name='count'))
print (df1)
year count
0 2021 1
1 2022 2
Similar idea with helper column:
df1 = (df.assign(count = ((df.a>=30) & (df.b==0)).astype(int))
.groupby('year', as_index=False)['count']
.sum())
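As a self-contained sketch of the approach above, the following rebuilds the sample data from the question and counts the matching rows per year. The frame is typed in by hand here rather than read from Excel, and the names mask and counts are only illustrative.

```python
import pandas as pd

# Sample data from the question, entered by hand (assumption: in practice
# it would be loaded from the Excel file).
df = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "a": [12, 31, 15, 14, 32, 24, 28, 33],
    "b": [23, 0, 21, 0, 0, 15, 29, 0],
})

# Boolean mask: True where both conditions hold for a row.
mask = (df["a"] >= 30) & (df["b"] == 0)

# Summing the mask (cast to int) per year counts the True values.
counts = (mask.astype(int)
              .groupby(df["year"])
              .sum()
              .reset_index(name="count"))
print(counts)
```

Grouping the mask by df["year"] avoids adding a helper column to the original frame.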

Related

Return a Cell from a Pandas Dataframe based on other values

I have the following dataset in a Pandas Dataframe:
Id  Year  Month     Total
 0  2020      9  11788.33
 1  2020     10  18373.99
 2  2020     11  31018.59
 3  2020     12  29279.30
 4  2021      1   1875.10
 5  2021      2   9550.06
 6  2021      3  33844.39
 7  2021      4  33126.53
 8  2021      5  12910.05
 9  2021      6  44628.63
10  2021      7  25830.03
11  2021      8  54463.08
12  2021      9  49723.93
13  2021     10  23753.81
14  2021     11  52532.49
15  2021     12   7467.32
16  2022      1  24333.54
17  2022      2  12394.11
18  2022      3  76575.46
19  2022      4  95119.82
20  2022      5  63048.05
I am trying to dynamically return the value from the Total column based on the first month (Month 1) from last year (Year 2021). Solution is 1875.10.
I am using Python in PyCharm to complete this.
Note: The "Id" column is the one that is automatically generated when using a pandas Dataframe. I believe it is called an index within Pandas.
Any help would be greatly appreciated.
Thank you.

You can use .loc[]:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total']
Which will give you:
4 1875.1
Name: Total, dtype: float64
To get the actual number you can add .iloc[] on the end:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total'].iloc[0]
Output:
1875.1

Another method:
df[df['Year']==2021].iloc[0]['Total']
The df[df['Year']==2021] part creates a new dataframe containing only the 2021 rows, and .iloc[0] takes the first of those rows, from which the 'Total' value is read. This relies on month 1 being the first 2021 row in the dataframe.

Would a simple filter suffice?
df[(df.Year == 2021) & (df.Month == 1)].Total
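To make the lookup "dynamic", one possible sketch is to derive the year from the data instead of hard-coding 2021. It assumes that "last year" means the latest year in the frame minus one, which is an interpretation of the question and not something the answers state; the names last_year, match and value are illustrative.

```python
import pandas as pd

# Data from the question, entered by hand.
df = pd.DataFrame({
    "Year": [2020] * 4 + [2021] * 12 + [2022] * 5,
    "Month": [9, 10, 11, 12] + list(range(1, 13)) + [1, 2, 3, 4, 5],
    "Total": [11788.33, 18373.99, 31018.59, 29279.30,
              1875.10, 9550.06, 33844.39, 33126.53, 12910.05, 44628.63,
              25830.03, 54463.08, 49723.93, 23753.81, 52532.49, 7467.32,
              24333.54, 12394.11, 76575.46, 95119.82, 63048.05],
})

# Assumption: "last year" is the most recent year in the data minus one.
last_year = df["Year"].max() - 1

# Select the Total for month 1 of that year; the result is a Series.
match = df.loc[(df["Year"] == last_year) & (df["Month"] == 1), "Total"]

# Take the scalar, guarding against the case where no row matches.
value = match.iloc[0] if not match.empty else None
print(value)
```

Checking match.empty before calling .iloc[0] avoids an IndexError when the requested year and month are not present.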

How to set order of sorting MultiIndex

I have dataframe like this:
import pandas as pd
import numpy as np
np.random.seed(123)
col_num = 1
row_num = 18
col_names = ['C' + str(x) for x in range(col_num)]
mix = pd.MultiIndex.from_product([['a', 'b'], [ '01 Jan 2011', '02 Feb 2000', '30 Apr 1999'], [1,2,3]])
df = pd.DataFrame(np.round(((np.random.rand(row_num,col_num)* 2 - 1)*100),2), columns = col_names, index = mix)
# df
                    C0
a 01 Jan 2011 1  39.29
              2 -42.77
              3 -54.63
  02 Feb 2000 1  10.26
              2  43.89
              3 -15.38
  30 Apr 1999 1  96.15
              2  36.97
              3  -3.81
b 01 Jan 2011 1 -21.58
              2 -31.36
              3  45.81
  02 Feb 2000 1 -12.29
              2 -88.06
              3 -20.39
  30 Apr 1999 1  47.60
              2 -63.50
              3 -64.91
How can I sort the MultiIndex so that the dates on level 1 are in chronological order, while the other levels keep their normal sorting and the usual level priority is preserved (first level 0, then level 1, and finally level 2)?
I need to keep the dates as strings in the final df. The final df will be pickled, so I am trying to fix the sort order of the dates before serializing rather than writing a sorting function to run after the df is loaded again.

Let's create a new MultiIndex with the level 1 values converted to datetime, then use argsort on this new index to get the positions that sort the original dataframe:
idx = df.index.set_levels(pd.to_datetime(df.index.levels[1]), level=1)
df1 = df.iloc[np.argsort(idx)]
print(df1)
                    C0
a 30 Apr 1999 1  96.15
              2  36.97
              3  -3.81
  02 Feb 2000 1  10.26
              2  43.89
              3 -15.38
  01 Jan 2011 1  39.29
              2 -42.77
              3 -54.63
b 30 Apr 1999 1  47.60
              2 -63.50
              3 -64.91
  02 Feb 2000 1 -12.29
              2 -88.06
              3 -20.39
  01 Jan 2011 1 -21.58
              2 -31.36
              3  45.81
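The same idea (sort by parsed dates, keep the strings) can also be sketched with a temporary frame of sort keys. This is a variation and not the code from the answer above; the names keys, order and df_sorted, and the key column names l0, l1, l2, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframe from the question.
np.random.seed(123)
mix = pd.MultiIndex.from_product(
    [["a", "b"], ["01 Jan 2011", "02 Feb 2000", "30 Apr 1999"], [1, 2, 3]]
)
df = pd.DataFrame(
    np.round((np.random.rand(18, 1) * 2 - 1) * 100, 2),
    columns=["C0"],
    index=mix,
)

# Temporary frame of sort keys: level 1 is parsed to datetime, the other
# levels are used as they are. Its default RangeIndex records row positions.
keys = pd.DataFrame({
    "l0": df.index.get_level_values(0),
    "l1": pd.to_datetime(df.index.get_level_values(1), format="%d %b %Y"),
    "l2": df.index.get_level_values(2),
})

# Sort by level 0, then date, then level 2, and read off the row positions.
order = keys.sort_values(["l0", "l1", "l2"]).index.to_numpy()

# Reorder the original frame; its index still holds the date strings.
df_sorted = df.iloc[order]
print(df_sorted)
```

Because only the row order changes, the index values in df_sorted remain plain strings, so the pickled frame does not carry datetime or categorical levels.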

If you want the df with a sorted index and don't mind a categorical index level, here is code that achieves it (there is probably a simpler way, but I couldn't find it).
Start with df from question above.
from datetime import datetime as dt
org_l1 = df.index.get_level_values(1).unique().tolist()
l1_as_date = [dt.strptime(x, '%d %b %Y') for x in org_l1]
l1_as_date.sort()
l1_sorted_as_str = [dt.strftime(x, '%d %b %Y') for x in l1_as_date]
df= df.reset_index()
df.level_1 = df.level_1.astype('category')
df.level_1 = df.level_1.cat.set_categories(l1_sorted_as_str, ordered=True)
df = df.set_index(['level_0', 'level_1', 'level_2'])
df.sort_index(inplace=True)

Need to group the data using pandas based on months in the column data

I would like to group the data based on the month January and February. Here is a sample of the data set that I have.
Date Count
01.01.2019 1
01.02.2019 7
02.01.2019 4
03.01.2019 4
04.01.2019 1
04.02.2019 5
I want to group the data as follows, where Total_Count is the sum of Count for month 1 (Jan) and month 2 (Feb):
Month Total_Count
Jan 10
Feb 12

Cast to datetime, group by the dt.month_name and sum:
(df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
.dt.month_name()
.str[:3])['Count']
.sum()
.rename_axis('Month')
.reset_index(name='Total_Count'))
Month Total_Count
0 Feb 12
1 Jan 10
To sort the index by month, we could instead do:
s = df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.month)['Count'].sum()
s.index = pd.to_datetime(s.index, format='%m').month_name().str[:3]
s.rename_axis('Month').reset_index(name='Total_Count')
Month Total_Count
0 Jan 10
1 Feb 12
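A self-contained sketch of the calendar-ordered variant, using the sample data from the question. Here the month numbers are mapped to abbreviations with the standard-library calendar module instead of a second to_datetime call; that substitution, and the names month_num and result, are choices made for this sketch. The abbreviations come out as Jan, Feb only under the default (English) locale.

```python
import calendar
import pandas as pd

# Sample data from the question; dates are day.month.year strings.
df = pd.DataFrame({
    "Date": ["01.01.2019", "01.02.2019", "02.01.2019",
             "03.01.2019", "04.01.2019", "04.02.2019"],
    "Count": [1, 7, 4, 4, 1, 5],
})

# Group by the numeric month so the result sorts in calendar order.
month_num = pd.to_datetime(df["Date"], format="%d.%m.%Y").dt.month
result = (df.groupby(month_num)["Count"]
            .sum()
            .rename_axis("Month")
            .reset_index(name="Total_Count"))

# Replace month numbers with abbreviations after sorting has been settled.
result["Month"] = result["Month"].map(lambda m: calendar.month_abbr[m])
print(result)
```

Grouping on the number and renaming afterwards keeps Jan ahead of Feb, whereas grouping directly on the abbreviation sorts alphabetically.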

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used pivot_table as follows:
pt = pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum") \
.reset_index() \
.set_index("Date")
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
I can then rename the columns using this loop:
for column in pt:
    pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.

Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
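A runnable sketch of this on the sample data from the question, typed in by hand. Note that pivot_table aggregates with the mean by default, so the values come back as floats, and the missing Day-3 entry for 02 January 2019 is NaN.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Date": ["01 January 2019", "02 January 2019", "02 January 2019",
             "01 January 2019", "01 January 2019"],
    "Datediff": [1, 1, 2, 2, 3],
    "Cumulative_sum": [5, 7, 15, 8, 13],
})

# Pivot Datediff into columns, then prefix every column label with "Day-".
pt = (pd.pivot_table(df, index="Date",
                     columns="Datediff",
                     values="Cumulative_sum")
        .add_prefix("Day-"))
print(pt)
```

add_prefix converts each column label to a string before prepending the prefix, so the integer labels 1, 2, 3 become Day-1, Day-2, Day-3 without a loop.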

pandas rename: change values of index for a specific column only

I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?

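Use set_levels to replace only the values of the Month level (level 1):
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
# set_levels returns a new index, so assign it back to df.index
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values,
    level=1)
print(df)
                Cost
Year Month ID
2016 Jan   10     40
     Feb   11     50
2017 Apr   1      60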
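For a version that does not need a hand-written mapping, here is a small sketch that rebuilds the frame from the question and takes the month names from calendar.month_abbr, as the question's own code does. The name new_months is illustrative, and the abbreviations are English only under the default locale.

```python
import calendar
import pandas as pd

# Rebuild the frame from the question with a three-level index.
index = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=["Year", "Month", "ID"],
)
df = pd.DataFrame({"Cost": [40, 50, 60]}, index=index)

# Translate only the unique values of the Month level (level 1).
new_months = [calendar.month_abbr[m] for m in df.index.levels[1]]

# set_levels returns a new MultiIndex; the ID level is left untouched.
df.index = df.index.set_levels(new_months, level="Month")
print(df)
```

Because only the Month level is replaced, an ID of 1 stays 1 instead of turning into Jan.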
