pandas dataframe groupby apply multi columns and get count - python

I have an Excel sheet like this:
year a b
2021 12 23
2021 31 0
2021 15 21
2021 14 0
2022 32 0
2022 24 15
2022 28 29
2022 33 0
I want to count, for each year, the rows where a >= 30 and b == 0.
The final output should look like this:
2021 1
2022 2
I want to do this with a pandas DataFrame. Can anyone help? I'm quite new to Python.

To count the matching rows, combine both conditions with & (element-wise AND) and aggregate with sum; each True is counted as 1 and each False as 0:
df1 = (((df.a >= 30) & (df.b == 0)).astype(int)
       .groupby(df['year']).sum().reset_index(name='count'))
print (df1)
year count
0 2021 1
1 2022 2
A similar idea using a helper column:
df1 = (df.assign(count = ((df.a>=30) & (df.b==0)).astype(int))
.groupby('year', as_index=False)['count']
.sum())
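For anyone who wants to try the first approach end to end, here is a minimal sketch that rebuilds the sample data from the question by hand (the names mask and counts are just illustrative). The astype(int) step is left out because summing a boolean Series already counts the True values.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "a": [12, 31, 15, 14, 32, 24, 28, 33],
    "b": [23, 0, 21, 0, 0, 15, 29, 0],
})

# Boolean mask: True where both conditions hold
mask = (df["a"] >= 30) & (df["b"] == 0)

# Summing booleans per year counts the True values
counts = mask.groupby(df["year"]).sum().reset_index(name="count")
print(counts)
```

A year with no matching rows still appears in the result with a count of 0, which would not be the case if the frame were filtered before grouping.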

Related

Return a Cell from a Pandas Dataframe based on other values

I have the following dataset in a Pandas Dataframe:
Id Year Month Total
0 2020 9 11788.33
1 2020 10 18373.99
2 2020 11 31018.59
3 2020 12 29279.30
4 2021 1 1875.10
5 2021 2 9550.06
6 2021 3 33844.39
7 2021 4 33126.53
8 2021 5 12910.05
9 2021 6 44628.63
10 2021 7 25830.03
11 2021 8 54463.08
12 2021 9 49723.93
13 2021 10 23753.81
14 2021 11 52532.49
15 2021 12 7467.32
16 2022 1 24333.54
17 2022 2 12394.11
18 2022 3 76575.46
19 2022 4 95119.82
20 2022 5 63048.05
I am trying to dynamically return the value from the Total column based on the first month (Month 1) from last year (Year 2021). Solution is 1875.10.
I am using Python in PyCharm to complete this.
Note: The "Id" column is the one that is automatically generated when using a pandas Dataframe. I believe it is called an index within Pandas.
Any help would be greatly appreciated.
Thank you.
You can use .loc[]:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total']
Which will give you:
4 1875.1
Name: Total, dtype: float64
To get the actual number you can add .iloc[] on the end:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total'].iloc[0]
Output:
1875.1
Another approach:
df[df['Year']==2021].iloc[0]['Total']
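The part df[df['Year']==2021] creates a new dataframe containing only the 2021 rows, and .iloc[0] takes the first of those rows, from which the 'Total' value is read. Note that this only gives the right answer if the Month 1 row is the first 2021 row in the dataframe.
Would a simple filter suffice?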
df[(df.Year == 2021) & (df.Month == 1)].Total
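Since the question asks for the value "dynamically", here is a small sketch that derives the year instead of hard-coding 2021. It assumes that "last year" means the latest year present in the data minus one; that reading is my assumption, not something stated in the question. The frame below is a shortened copy of the sample data and the variable names are illustrative.

```python
import pandas as pd

# Shortened copy of the sample data from the question
df = pd.DataFrame({
    "Year": [2020, 2021, 2021, 2022],
    "Month": [12, 1, 2, 1],
    "Total": [29279.30, 1875.10, 9550.06, 24333.54],
})

# Assumption: "last year" = latest year in the data minus one
last_year = df["Year"].max() - 1

# Select Total for month 1 of that year
match = df.loc[(df["Year"] == last_year) & (df["Month"] == 1), "Total"]

# Guard against an empty selection before taking the first value
value = match.iloc[0] if not match.empty else None
print(value)
```

If the reference year should come from the calendar instead, last_year could be computed from datetime.date.today().year - 1.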

How to set order of sorting MultiIndex

I have a dataframe like this:
import pandas as pd
import numpy as np
np.random.seed(123)
col_num = 1
row_num = 18
col_names = ['C' + str(x) for x in range(col_num)]
mix = pd.MultiIndex.from_product([['a', 'b'], [ '01 Jan 2011', '02 Feb 2000', '30 Apr 1999'], [1,2,3]])
df = pd.DataFrame(np.round(((np.random.rand(row_num,col_num)* 2 - 1)*100),2), columns = col_names, index = mix)
#df
C0
a 01 Jan 2011 1 39.29
2 -42.77
3 -54.63
02 Feb 2000 1 10.26
2 43.89
3 -15.38
30 Apr 1999 1 96.15
2 36.97
3 -3.81
b 01 Jan 2011 1 -21.58
2 -31.36
3 45.81
02 Feb 2000 1 -12.29
2 -88.06
3 -20.39
30 Apr 1999 1 47.60
2 -63.50
3 -64.91
How can I sort the MultiIndex so that the dates on level 1 are in chronological order, while keeping the sorting on the other levels as is, including the priority of the levels (i.e. first level 0, then level 1 and finally level 2)?
I need to keep the dates as strings in the final df. The final df will be pickled, so I am trying to set the sort order of the dates before serializing rather than writing a sorting function to run after retrieving the df.
Let's create a new MultiIndex after setting the level 1 values mapped to datetime then use argsort on this new index to get the indices that would sort the original dataframe:
idx = df.index.set_levels(pd.to_datetime(df.index.levels[1]), level=1)
df1 = df.iloc[np.argsort(idx)]
print(df1)
C0
a 30 Apr 1999 1 96.15
2 36.97
3 -3.81
02 Feb 2000 1 10.26
2 43.89
3 -15.38
01 Jan 2011 1 39.29
2 -42.77
3 -54.63
b 30 Apr 1999 1 47.60
2 -63.50
3 -64.91
02 Feb 2000 1 -12.29
2 -88.06
3 -20.39
01 Jan 2011 1 -21.58
2 -31.36
3 45.81
If you want the df with a sorted index and don't mind level 1 becoming a categorical index, here is code that achieves it (there is probably a simpler way, but I couldn't find it).
Start with the df from the question above.
from datetime import datetime as dt
org_l1 = df.index.get_level_values(1).unique().tolist()
l1_as_date = [dt.strptime(x, '%d %b %Y') for x in org_l1]
l1_as_date.sort()
l1_sorted_as_str = [dt.strftime(x, '%d %b %Y') for x in l1_as_date]
df = df.reset_index()
df.level_1 = df.level_1.astype('category')
df.level_1 = df.level_1.cat.set_categories(l1_sorted_as_str, ordered=True)
df = df.set_index(['level_0', 'level_1', 'level_2'])
df.sort_index(inplace=True)
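If the index has to stay as plain strings (no datetime or categorical level), another option is to build the sort keys in a throwaway frame and reorder the rows by position. This is only a sketch under the question's setup: the column names l0, l1, l2 and the placeholder values in C0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same index layout as in the question; C0 holds placeholder values 0..17
mix = pd.MultiIndex.from_product(
    [["a", "b"], ["01 Jan 2011", "02 Feb 2000", "30 Apr 1999"], [1, 2, 3]]
)
df = pd.DataFrame({"C0": np.arange(18)}, index=mix)

# Helper frame with one sort key per index level; only level 1 is parsed as dates
keys = pd.DataFrame({
    "l0": df.index.get_level_values(0),
    "l1": pd.to_datetime(df.index.get_level_values(1), format="%d %b %Y"),
    "l2": df.index.get_level_values(2),
})

# keys has a default 0..n-1 index, so the sorted index is the row order we need
order = keys.sort_values(["l0", "l1", "l2"]).index.to_numpy()
df_sorted = df.iloc[order]
print(df_sorted)
```

The original string labels are never replaced, so df_sorted can be pickled as is and comes back in the same row order.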

Need to group the data using pandas based on months in the column data

I would like to group the data based on the month January and February. Here is a sample of the data set that I have.
Date Count
01.01.2019 1
01.02.2019 7
02.01.2019 4
03.01.2019 4
04.01.2019 1
04.02.2019 5
I want to group the data as follows, where total count is summed up of count based on month 1(Jan) and 2(Feb):
Month Total_Count
Jan 10
Feb 12
Cast to datetime, group by the dt.month_name and sum:
(df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
.dt.month_name()
.str[:3])['Count']
.sum()
.rename_axis('Month')
.reset_index(name='Total_Count'))
Month Total_Count
0 Feb 12
1 Jan 10
To sort the index by month, we could instead do:
s = df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.month)['Count'].sum()
s.index = pd.to_datetime(s.index, format='%m').month_name().str[:3]
s.rename_axis('Month').reset_index(name='Total_Count')
Month Total_Count
0 Jan 10
1 Feb 12
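Here is a self-contained sketch of the month-ordered variant using the sample data from the question. It swaps the second to_datetime call for the standard library's calendar.month_abbr lookup, which is my substitution rather than what the answer above does; the names month_num, totals and result are illustrative.

```python
import calendar
import pandas as pd

# Sample data from the question (dates are day.month.year)
df = pd.DataFrame({
    "Date": ["01.01.2019", "01.02.2019", "02.01.2019",
             "03.01.2019", "04.01.2019", "04.02.2019"],
    "Count": [1, 7, 4, 4, 1, 5],
})

# Group by month number so the result is sorted Jan, Feb, ... rather than alphabetically
month_num = pd.to_datetime(df["Date"], format="%d.%m.%Y").dt.month
totals = df.groupby(month_num)["Count"].sum()

# Replace month numbers with abbreviated English names (1 -> 'Jan', 2 -> 'Feb')
totals.index = [calendar.month_abbr[m] for m in totals.index]
result = totals.rename_axis("Month").reset_index(name="Total_Count")
print(result)
```

Note that grouping by month alone merges the same month from different years; if the data spans several years, grouping by year and month together may be what you want.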

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot_table command like this:
pt = pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum") \
.reset_index() \
.set_index("Date")
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
I can then rename the columns using this loop:
for column in pt:
    pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix:
pt.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
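A runnable sketch of the same idea with the sample data from the question. The extra rename_axis call is my addition: it only drops the leftover "Datediff" label from the column axis and can be omitted.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": ["01 January 2019", "02 January 2019", "02 January 2019",
             "01 January 2019", "01 January 2019"],
    "Datediff": [1, 1, 2, 2, 3],
    "Cumulative_sum": [5, 7, 15, 8, 13],
})

pt = (
    df.pivot_table(index="Date", columns="Datediff", values="Cumulative_sum")
      .add_prefix("Day-")          # 1, 2, 3 -> 'Day-1', 'Day-2', 'Day-3'
      .rename_axis(columns=None)   # optional: remove the 'Datediff' axis label
)
print(pt)
```

pivot_table aggregates with the mean by default, so the values come back as floats, and the missing Day-3 entry for 02 January 2019 is NaN.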

pandas rename: change values of index for a specific column only

I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
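Use set_levels on the Month level only. Recent pandas versions no longer accept inplace=True here, so assign the result back to df.index:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(df.index.levels[1].map(m), level=1)
print(df)
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr 1 60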
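To avoid typing the mapping by hand, the month names can be taken from calendar.month_abbr, as the question already does. Below is a minimal sketch that rebuilds the sample frame; it assumes the index levels are named Year, Month and ID as shown in the question.

```python
import calendar
import pandas as pd

# Sample frame from the question with a three-level index
idx = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=["Year", "Month", "ID"],
)
df = pd.DataFrame({"Cost": [40, 50, 60]}, index=idx)

# Map only the labels of the Month level; Year and ID are left untouched
month_names = df.index.levels[1].map(lambda m: calendar.month_abbr[m])
df.index = df.index.set_levels(month_names, level="Month")
print(df)
```

Because only the Month level's labels are replaced, the ID value 1 in the last row stays 1 instead of turning into 'Jan'.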
