Say I have a data frame, sega_df:
MONTH Character Rings Chili Dogs Emeralds
0 Jun 2017 Sonic 25.0 10.0 6.0
5 Jun 2017 Sonic 19.0 15.0 0.0
8 Jun 2017 Shadow 4.0 1.0 0.0
9 Jun 2017 Shadow 23.0 1.0 0.0
12 Jun 2017 Knuckles 9.0 3.0 1.0
13 Jun 2017 Tails 10.0 6.0 0.0
22 Jul 2017 Sonic 5.0 20.0 0.0
23 Jul 2017 Shadow 3.0 3.0 7.0
24 Jul 2017 Knuckles 9.0 4.0 0.0
27 Jul 2017 Knuckles 11.0 2.0 0.0
28 Jul 2017 Tails 12.0 3.0 0.0
29 Jul 2017 Tails 12.0 5.0 0.0
My pivot_table command gives me a table of each character by row against each month, but the values are a scattering of NaN and 0. (The 0s are legitimate: there is more data with 0s in later months, and I only posted the first few rows.) The data types of the three value columns (Rings, Chili Dogs, and Emeralds) are numpy.float64, so I'm curious whether that affects it, or whether it's how I define aggfunc.
My values argument and pivot_table command are as follows:
values = list(sega_df.columns.values)
test = pd.pivot_table(data = sega_df, values = values, index = 'Character', columns = 'MONTH', aggfunc='sum')
Here is my desired pivot_table output, with the sum of the three columns per character per month (e.g. Sonic for the month of June is [25 + 10 + 6 + 19 + 15 + 0] = 75.0):
MONTH Jun 2017 Jul 2017
Character
0 Sonic 75.0 25.0
1 Shadow 29.0 13.0
2 Knuckles 13.0 26.0
3 Tails 16.0 32.0
You just need a groupby sum, then a sum along axis=1, then unstack:
df.groupby(['Character','MONTH']).sum().sum(axis=1).unstack()
Output:
MONTH Jul 2017 Jun 2017
Character
Knuckles 26.0 13.0
Shadow 13.0 29.0
Sonic 25.0 75.0
Tails 32.0 16.0
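For reference, the one-liner can be checked end-to-end on a small slice of the sample data (only a few rows rebuilt here):

```python
import pandas as pd

# rebuild a few rows of sega_df from the question
sega_df = pd.DataFrame({
    'MONTH': ['Jun 2017', 'Jun 2017', 'Jul 2017', 'Jul 2017'],
    'Character': ['Sonic', 'Sonic', 'Sonic', 'Shadow'],
    'Rings': [25.0, 19.0, 5.0, 3.0],
    'Chili Dogs': [10.0, 15.0, 20.0, 3.0],
    'Emeralds': [6.0, 0.0, 0.0, 7.0],
})

# sum the three numeric columns within each (Character, MONTH) group,
# add them together, then pivot MONTH out into columns
out = (sega_df.groupby(['Character', 'MONTH']).sum()
              .sum(axis=1)
              .unstack())
```

With these rows, `out.loc['Sonic', 'Jun 2017']` is 75.0, matching the hand-computed sum in the question.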
I have a pandas dataframe result, looks like this:
Weekday Day Store1 Store2 Store3 Store4 Store5
0 Mon 6 0.0 0.0 0.0 0.0 0.0
1 Tue 7 42.0 33.0 23.0 42.0 21.0
2 Wed 8 43.0 29.0 13.0 33.0 22.0
3 Thu 9 45.0 24.0 20.0 29.0 18.0
4 Fri 10 48.0 21.0 22.0 37.0 22.0
5 Sat 11 34.0 22.0 23.0 34.0 18.0
0 Mon 13 39.0 21.0 21.0 25.0 21.0
1 Tue 14 39.0 20.0 18.0 0.0 19.0
2 Wed 15 46.0 26.0 18.0 31.0 24.0
3 Thu 16 38.0 21.0 15.0 45.0 29.0
4 Fri 17 42.0 21.0 21.0 41.0 20.0
5 Sat 18 40.0 25.0 15.0 36.0 19.0
0 Mon 20 39.0 22.0 23.0 36.0 19.0
1 Tue 21 31.0 18.0 16.0 35.0 23.0
2 Wed 22 33.0 25.0 17.0 39.0 22.0
3 Thu 23 34.0 24.0 19.0 18.0 27.0
4 Fri 24 33.0 18.0 24.0 43.0 24.0
5 Sat 25 38.0 22.0 20.0 40.0 12.0
0 Mon 27 41.0 21.0 18.0 31.0 23.0
1 Tue 28 32.0 21.0 14.0 23.0 14.0
2 Wed 29 33.0 18.0 15.0 19.0 23.0
3 Thu 30 36.0 21.0 21.0 23.0 18.0
4 Fri 1 40.0 30.0 24.0 38.0 23.0
5 Sat 2 40.0 19.0 22.0 38.0 21.0
Notice how Day goes from 6 to 30, then back to 1, and 2. In this example, it's referring to
September 6, 2021 - October 2nd, 2021.
I currently have a variable PrimaryMonth = September and SecondaryMonth = October
I know that I can do result['Month'] = 'September', but that would set every Month value to September. I'd like to find a way, if possible, to iterate through the rows so that when it reaches the 1 and 2 at the bottom, the new Month column shows October.
Is it possible to do a For loop or some other iteration to accomplish this? I was initially brainstorming some pseudocode
#for row in result:
# while Day <= 31
#concat PrimaryMonth
#else concat SecondaryMonth
You can kind of get an idea of where I want to go with this.
Many things are easier if you use proper date formats...
date_str = 'Monday, September 6, 2021 - Saturday, October 2, 2021'
# build a complete daily index covering the whole range
new_index = pd.date_range(*map(pd.to_datetime, date_str.split(' - ')))
dates = pd.DataFrame({'Day': new_index.day}, index=new_index)
# outer-merge on Day so every row lines up with its calendar date
df = pd.merge(dates, df, how='outer')
df.index = dates.index
df['month'] = df.index.month_name()
print(df.dropna())
Output:
Day Weekday Store1 Store2 Store3 Store4 Store5 month
2021-09-06 6 Mon 0.0 0.0 0.0 0.0 0.0 September
2021-09-07 7 Tue 42.0 33.0 23.0 42.0 21.0 September
2021-09-08 8 Wed 43.0 29.0 13.0 33.0 22.0 September
2021-09-09 9 Thu 45.0 24.0 20.0 29.0 18.0 September
2021-09-10 10 Fri 48.0 21.0 22.0 37.0 22.0 September
2021-09-11 11 Sat 34.0 22.0 23.0 34.0 18.0 September
2021-09-13 13 Mon 39.0 21.0 21.0 25.0 21.0 September
2021-09-14 14 Tue 39.0 20.0 18.0 0.0 19.0 September
2021-09-15 15 Wed 46.0 26.0 18.0 31.0 24.0 September
2021-09-16 16 Thu 38.0 21.0 15.0 45.0 29.0 September
2021-09-17 17 Fri 42.0 21.0 21.0 41.0 20.0 September
2021-09-18 18 Sat 40.0 25.0 15.0 36.0 19.0 September
2021-09-20 20 Mon 39.0 22.0 23.0 36.0 19.0 September
2021-09-21 21 Tue 31.0 18.0 16.0 35.0 23.0 September
2021-09-22 22 Wed 33.0 25.0 17.0 39.0 22.0 September
2021-09-23 23 Thu 34.0 24.0 19.0 18.0 27.0 September
2021-09-24 24 Fri 33.0 18.0 24.0 43.0 24.0 September
2021-09-25 25 Sat 38.0 22.0 20.0 40.0 12.0 September
2021-09-27 27 Mon 41.0 21.0 18.0 31.0 23.0 September
2021-09-28 28 Tue 32.0 21.0 14.0 23.0 14.0 September
2021-09-29 29 Wed 33.0 18.0 15.0 19.0 23.0 September
2021-09-30 30 Thu 36.0 21.0 21.0 23.0 18.0 September
2021-10-01 1 Fri 40.0 30.0 24.0 38.0 23.0 October
2021-10-02 2 Sat 40.0 19.0 22.0 38.0 21.0 October
And no, no matter what you do, a for-loop is probably the wrong answer when it comes to pandas.
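That said, if all you have are the day numbers and the two known month names (the PrimaryMonth/SecondaryMonth variables from the question), a vectorized alternative is to detect the rollover where Day decreases. A minimal sketch, assuming the rows are in calendar order:

```python
import pandas as pd

# hypothetical day numbers that cross a month boundary
df = pd.DataFrame({'Day': [29, 30, 1, 2]})

# each time Day decreases, we have rolled over into the next month
rollover = df['Day'].diff().lt(0).cumsum()
df['Month'] = rollover.map({0: 'September', 1: 'October'})
```

This assigns 'September' to the rows before the rollover and 'October' after it, with no explicit loop.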
I have the below dataframe
Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec
store_id
S_1 8.0 20.0 13.0 21.0 17.0 20.0 24.0 17.0 16.0 9.0 7.0 6.0
S_10 14.0 23.0 20.0 11.0 12.0 13.0 19.0 6.0 5.0 22.0 17.0 16.0
and I want to calculate the mean of each store per quarter:
Q1 Q2 Q3 Q4
store_id
S_1 13.67 19.33 19.0 7.33
S_10 19.0 12.0 10.0 18.33
How can this be achieved?
Convert the column names to quarters with DatetimeIndex.quarter and aggregate; this works correctly even if the order of the columns is changed:
# if necessary: rename 'July' so every column matches the '%b' abbreviated-month format
df = df.rename(columns={'July':'Jul'})
df = (df.groupby(pd.to_datetime(df.columns, format='%b').quarter, axis=1)
.mean()
.add_prefix('Q')
.round(2))
print(df)
Q1 Q2 Q3 Q4
store_id
S_1 13.67 19.33 19.0 7.33
S_10 19.00 12.00 10.0 18.33
Assuming you have the columns in order, use groupby on axis=1:
import numpy as np
out = df.groupby([np.arange(df.shape[1])//3+1], axis=1).mean().add_prefix('Q')
output:
Q1 Q2 Q3 Q4
store_id
S_1 13.666667 19.333333 19.0 7.333333
S_10 19.000000 12.000000 10.0 18.333333
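Both answers use groupby(..., axis=1), which is deprecated in recent pandas (2.1+). An equivalent sketch that avoids it transposes, groups the rows, and transposes back (shown here for the S_1 row only):

```python
import pandas as pd

# one row of the sample data
df = pd.DataFrame(
    {'Jan': [8.0], 'Feb': [20.0], 'Mar': [13.0], 'Apr': [21.0],
     'May': [17.0], 'Jun': [20.0], 'Jul': [24.0], 'Aug': [17.0],
     'Sep': [16.0], 'Oct': [9.0], 'Nov': [7.0], 'Dec': [6.0]},
    index=pd.Index(['S_1'], name='store_id'))

# map each month name to its quarter, then group the transposed frame's rows
quarters = pd.to_datetime(df.columns, format='%b').quarter
out = df.T.groupby(quarters).mean().T.add_prefix('Q').round(2)
```

Note that with correct month sums, Q3 for S_1 is (24 + 17 + 16) / 3 = 19.0, as both answers' outputs show.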
Given this dataframe:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
If I order by column 2 using df.sort_values('2'), I get:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Is there a smart way to re-define the index column (from 0 to 11) preserving the new order I got?
Use reset_index:
df.sort_values('2').reset_index(drop=True)
Alternatively (this will replace the values of the original dataframe in place):
df[:] = df.sort_values('2').values
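Since pandas 1.0, sort_values can also renumber the index in one step via ignore_index=True; a small sketch with a few made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'0': [354.7, 85.6, 95.5],
                   '1': ['April', 'January', 'February'],
                   '2': [4.0, 1.0, 2.0]})

# sort and reset the index to 0..n-1 in one call (pandas >= 1.0)
out = df.sort_values('2', ignore_index=True)
```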
I have a pivot table using CategoricalDtype so I can get the month names in order. How can I drop the column name/label "Month" and then move the month abbreviation names to the same level as "Year"?
...
.pivot_table(index='Year',columns='Month',values='UpClose',aggfunc=np.sum))
Current output:
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Total
Year
1997 12.0 8.0 8.0 12.0 11.0 12.0 14.0 10.0 10.0 10.0 10.0 9.0 126.0
1998 10.0 12.0 14.0 12.0 9.0 11.0 10.0 8.0 11.0 10.0 10.0 12.0 129.0
Desired output:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Total
1997 12.0 8.0 8.0 12.0 11.0 12.0 14.0 10.0 10.0 10.0 10.0 9.0 126.0
1998 10.0 12.0 14.0 12.0 9.0 11.0 10.0 8.0 11.0 10.0 10.0 12.0 129.0
If I use data.columns.name = None it will remove the "Month" label, but it doesn't drop the month abbreviations to the same level as "Year".
You need to move Year out of the index and clear the columns label (see also: Renaming columns in dataframe w.r.t another specific column):
# turn the Year index into a regular column, then drop the 'Month' columns label
df = df.reset_index().rename_axis(columns=None)
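A minimal runnable sketch of that fix, using a two-column stand-in for the pivot output:

```python
import pandas as pd

# a frame shaped like the pivot_table output: 'Year' index, 'Month' columns label
df = pd.DataFrame({'Jan': [12.0, 10.0], 'Total': [126.0, 129.0]},
                  index=pd.Index([1997, 1998], name='Year'))
df.columns.name = 'Month'

# move Year down to a regular column and clear the 'Month' label
out = df.reset_index().rename_axis(columns=None)
```

Printing out now shows Year on the same header line as the month columns, with no "Month" label above them.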
I have a dataframe where I need to do a burndown starting from the baseline and subtracting the values along the way; essentially I'm looking for the opposite of DataFrame.cumsum():
In Use
Baseline 3705.0
February 2018 0.0
March 2018 2.0
April 2018 15.0
May 2018 30.0
June 2018 14.0
July 2018 797.0
August 2018 1393.0
September 2018 86.0
October 2018 374.0
November 2018 21.0
December 2018 0.0
January 2019 0.0
February 2019 0.0
March 2019 0.0
April 2019 2.0
unknown 971.0
I cannot find a function to do this, or maybe I'm not looking with the right tags / names.
How can this be achieved?
Use DataFrameGroupBy.diff, with groups created from diff compared with lt (<) and cumulatively summed:
g = df['Use'].diff().lt(0).cumsum()
df['new'] = df['Use'].groupby(g).diff().fillna(df['Use'])
print(df)
In Use new
0 Baseline 3705.0 3705.0
1 February 2018 0.0 0.0
2 March 2018 2.0 2.0
3 April 2018 15.0 13.0
4 May 2018 30.0 15.0
5 June 2018 14.0 14.0
6 July 2018 797.0 783.0
7 August 2018 1393.0 596.0
8 September 2018 86.0 86.0
9 October 2018 374.0 288.0
10 November 2018 21.0 21.0
11 December 2018 0.0 0.0
12 January 2019 0.0 0.0
13 February 2019 0.0 0.0
14 March 2019 0.0 0.0
15 April 2019 2.0 2.0
16 unknown 971.0 969.0
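The same idea, runnable on just the first few rows of the sample data:

```python
import pandas as pd

# first few rows of the question's data
df = pd.DataFrame({'In': ['Baseline', 'February 2018', 'March 2018',
                          'April 2018', 'May 2018'],
                   'Use': [3705.0, 0.0, 2.0, 15.0, 30.0]})

# negative jumps in the running total start a new group;
# within each group, diff undoes the cumulative sum
g = df['Use'].diff().lt(0).cumsum()
df['new'] = df['Use'].groupby(g).diff().fillna(df['Use'])
```

Here April 2018 becomes 15 - 2 = 13.0 and May 2018 becomes 30 - 15 = 15.0, matching the output above.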
You can use pd.Series.diff with fillna. Here's a demo (A is filled with random integers, so the printed values below are from one example run):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 10, 5)})
df['B'] = df['A'].cumsum()
df['C'] = df['B'].diff().fillna(df['B']).astype(int)
print(df)
A B C
0 1 1 1
1 4 5 4
2 4 9 4
3 2 11 2
4 1 12 1