Pivot and rename Pandas dataframe - python

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used pivot_table as follows:
pt = (pd.pivot_table(df, index = "Date",
                     columns = "Datediff",
                     values = "Cumulative_sum")
        .reset_index()
        .set_index("Date"))
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
I can then rename the columns using the loop
for column in pt:
    pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.

Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
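For reference, here is a minimal self-contained sketch of the same idea. The frame is rebuilt by hand from the sample data in the question, so the construction step is an assumption about how df was created:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["01 January 2019", "02 January 2019", "02 January 2019",
             "01 January 2019", "01 January 2019"],
    "Datediff": [1, 1, 2, 2, 3],
    "Cumulative_sum": [5, 7, 15, 8, 13],
})

# Pivot, then prefix every column label in one call instead of looping
pt = (pd.pivot_table(df, index="Date",
                     columns="Datediff",
                     values="Cumulative_sum")
        .add_prefix("Day-"))

print(pt)
```

Note that "Date" is already the index after pivot_table, so the reset_index / set_index round trip is not needed.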


pandas dataframe groupby apply multi columns and get count

I have an Excel sheet like this:
year a b
2021 12 23
2021 31 0
2021 15 21
2021 14 0
2022 32 0
2022 24 15
2022 28 29
2022 33 0
I want to count, for each year, the rows that satisfy the condition a>=30 and b==0.
The final output should look like this:
2021 1
2022 2
I want to do this with a pandas dataframe. Can anyone help? I'm quite new to Python.
To count the matching rows, chain both conditions with & (element-wise AND) and aggregate with sum; True is counted as 1 and False as 0:
df1 = (((df.a>=30) & (df.b==0)).astype(int)
       .groupby(df['year']).sum().reset_index(name='count'))
print (df1)
year count
0 2021 1
1 2022 2
Similar idea with helper column:
df1 = (df.assign(count = ((df.a>=30) & (df.b==0)).astype(int))
.groupby('year', as_index=False)['count']
.sum())
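A runnable sketch of the first approach, assuming the Excel data has already been loaded into a DataFrame (here it is typed in by hand from the question instead of being read from a file):

```python
import pandas as pd

# Data from the question, entered manually for illustration
df = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "a":    [12, 31, 15, 14, 32, 24, 28, 33],
    "b":    [23, 0, 21, 0, 0, 15, 29, 0],
})

# Boolean mask: True where both conditions hold
mask = (df["a"] >= 30) & (df["b"] == 0)

# Summing booleans per year counts the True values
counts = mask.groupby(df["year"]).sum().reset_index(name="count")
print(counts)
```

The astype(int) step is optional here, since summing a boolean Series already yields integer counts.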

converting str to YYYYmmdd format in python

I have year, month and day in three columns. I concatenate them into one column and then try to convert that column to YYYY/mm/dd format as follows:
dfyz_m_d['dt'] = '01'  # use the first day of each month
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].map(str) + dfyz_m_d['mon'].map(str) + dfyz_m_d['dt'].map(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y%m%d')
but for both month 1 (Jan) and month 10 (Oct) I only get Oct in the final result: the CalendarWeek column doesn't contain any January dates. All records are retained, but the January rows are also parsed as October.
The issue is that January is a single digit, so you end up with a string like 2021101, which is interpreted as October instead of January. Make sure your mon column is always converted to a two-digit month, with a leading zero if needed, using .str.zfill(2):
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.zfill(2) + dfyz_m_d['dt'].astype(str)
zfill example:
df = pd.DataFrame({'mon': [1,2,10]})
df.mon.astype(str).str.zfill(2)
0 01
1 02
2 10
Name: mon, dtype: object
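Putting it together, a minimal end-to-end sketch might look like this. The sample frame is made up for illustration; only the column names come from the question:

```python
import pandas as pd

# Hypothetical sample data using the question's column names
dfyz_m_d = pd.DataFrame({"year": [2021, 2021, 2021, 2021],
                         "mon": [1, 2, 10, 11]})
dfyz_m_d["dt"] = "01"  # first day of each month

# Zero-pad the month so every string is exactly YYYYMMDD
dfyz_m_d["CalendarWeek1"] = (dfyz_m_d["year"].astype(str)
                             + dfyz_m_d["mon"].astype(str).str.zfill(2)
                             + dfyz_m_d["dt"])

dfyz_m_d["CalendarWeek"] = pd.to_datetime(dfyz_m_d["CalendarWeek1"],
                                          format="%Y%m%d")
print(dfyz_m_d)
```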
I usually do
pd.to_datetime(df.mon,format='%m').dt.strftime('%m')
0 01
1 02
2 10
Name: mon, dtype: object
Also, if you name the columns year, month and day, pd.to_datetime can assemble the date directly from the DataFrame:
df['day'] = '01'
df['new'] = pd.to_datetime(df.rename(columns={'mon':'month'})).dt.strftime('%m/%d/%Y')
df
year mon day new
0 2020 1 1 01/01/2020
1 2020 1 1 01/01/2020
I like str.pad :)
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.pad(2, 'left', '0') + dfyz_m_d['dt'].astype(str)
It pads zeros on the left so that each string has length two, so 1 becomes 01 while 10 stays 10.
You should be able to use pandas.to_datetime with your input dataframe. You may need to rename your columns.
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'dt': [4, 5]})
print(pd.to_datetime(df.rename(columns={"dt": "day"})))
Output
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
You can add / between year, mon and dt and amend the format string to include it, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d')
Data Input
year mon dt
0 2021 1 01
1 2021 2 01
2 2021 10 01
3 2021 11 01
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021-01-01
1 2021 2 01 2021/2/01 2021-02-01
2 2021 10 01 2021/10/01 2021-10-01
3 2021 11 01 2021/11/01 2021-11-01
If you want the final output date format to be YYYY/mm/dd, you can additionally use .dt.strftime after pd.to_datetime, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d').dt.strftime('%Y/%m/%d')
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021/01/01
1 2021 2 01 2021/2/01 2021/02/01
2 2021 10 01 2021/10/01 2021/10/01
3 2021 11 01 2021/11/01 2021/11/01

How to set order of sorting MultiIndex

I have dataframe like this:
import pandas as pd
import numpy as np
np.random.seed(123)
col_num = 1
row_num = 18
col_names = ['C' + str(x) for x in range(col_num)]
mix = pd.MultiIndex.from_product([['a', 'b'], [ '01 Jan 2011', '02 Feb 2000', '30 Apr 1999'], [1,2,3]])
df = pd.DataFrame(np.round(((np.random.rand(row_num,col_num)* 2 - 1)*100),2), columns = col_names, index = mix)
#df
C0
a 01 Jan 2011 1 39.29
2 -42.77
3 -54.63
02 Feb 2000 1 10.26
2 43.89
3 -15.38
30 Apr 1999 1 96.15
2 36.97
3 -3.81
b 01 Jan 2011 1 -21.58
2 -31.36
3 45.81
02 Feb 2000 1 -12.29
2 -88.06
3 -20.39
30 Apr 1999 1 47.60
2 -63.50
3 -64.91
How can I sort the MultiIndex so that the dates on level 1 are in chronological order, while the sorting on the other levels is preserved as is, including the priority of the levels (i.e. first level 0, then level 1 and finally level 2)?
I need to keep the dates as strings in the final df, which will be pickled. I am trying to set the sort order of the dates before serializing rather than writing a sorting function after retrieving the df.
Let's create a new MultiIndex after setting the level 1 values mapped to datetime then use argsort on this new index to get the indices that would sort the original dataframe:
idx = df.index.set_levels(pd.to_datetime(df.index.levels[1]), level=1)
df1 = df.iloc[np.argsort(idx)]
print(df1)
C0
a 30 Apr 1999 1 96.15
2 36.97
3 -3.81
02 Feb 2000 1 10.26
2 43.89
3 -15.38
01 Jan 2011 1 39.29
2 -42.77
3 -54.63
b 30 Apr 1999 1 47.60
2 -63.50
3 -64.91
02 Feb 2000 1 -12.29
2 -88.06
3 -20.39
01 Jan 2011 1 -21.58
2 -31.36
3 45.81
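An alternative sketch that avoids argsort on a MultiIndex: parse the dates into a temporary helper column, sort on it, then drop it so the index keeps its string dates. The level names grp, date and n, the helper column _parsed and the integer values in C0 are all assumptions made for this illustration; the question's index has unnamed levels and random data.

```python
import numpy as np
import pandas as pd

# Same index shape as in the question, but with hypothetical level names
mix = pd.MultiIndex.from_product(
    [["a", "b"], ["01 Jan 2011", "02 Feb 2000", "30 Apr 1999"], [1, 2, 3]],
    names=["grp", "date", "n"],
)
df = pd.DataFrame({"C0": np.arange(18)}, index=mix)

tmp = df.reset_index()
# Temporary datetime column used only for ordering
tmp["_parsed"] = pd.to_datetime(tmp["date"], format="%d %b %Y")
tmp = tmp.sort_values(["grp", "_parsed", "n"]).drop(columns="_parsed")

# Rebuild the MultiIndex; the dates are still plain strings
df_sorted = tmp.set_index(["grp", "date", "n"])
print(df_sorted)
```

The row order is kept when the frame is pickled, but calling sort_index() after loading would sort the date strings lexicographically again.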
If you want to create the desired df with a sorted index and don't mind having a categorical index, here is some code to achieve it (there is probably a simpler way, but I couldn't find it :).
Start with the df from the question above.
from datetime import datetime as dt
org_l1 = df.index.get_level_values(1).unique().tolist()
l1_as_date = [dt.strptime(x, '%d %b %Y') for x in org_l1]
l1_as_date.sort()
l1_sorted_as_str = [dt.strftime(x, '%d %b %Y') for x in l1_as_date]
df= df.reset_index()
df.level_1 = df.level_1.astype('category')
df.level_1 = df.level_1.cat.set_categories(l1_sorted_as_str, ordered=True)
df = df.set_index(['level_0', 'level_1', 'level_2'])
df.sort_index(inplace=True)

convert year to a date with adding some number of day in pandas

I have a dataframe that looks like this:
Year vl
2017 20
2017 21
2017 22
2017 23
2017 24
2017 25
2017 26
...
I need to convert the year into the format dd.mm.yyyy, always starting from the first day of the year; for example, 2017 becomes 01.01.2017. Then I need to multiply each value in the column "vl" by 7 and add the result, row by row, as a number of days to that date, storing it in a new column in the same format.
The result should be something like this:
Year vl new_date
2017 20 21.05.2017
2017 21 28.05.2017
2017 22 04.06.2017
2017 23 11.06.2017
2017 24 18.06.2017
2017 25 25.06.2017
2017 26 02.07.2017
...
Here is one option: paste the Year (%Y) and the day of the year (%j) together, then parse and reformat the result:
from datetime import datetime
df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
#0 21.05.2017
#1 28.05.2017
#2 04.06.2017
#3 11.06.2017
#4 18.06.2017
#5 25.06.2017
#6 02.07.2017
#dtype: object
Assign the column back to the original data frame:
df['new_date'] = df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
Unfortunately, %U and %W aren't implemented in Pandas, but we can use the following vectorized approach:
In [160]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.vl.mul(7).astype(str) + ' days')
Out[160]:
0 2017-05-21
1 2017-05-28
2 2017-06-04
3 2017-06-11
4 2017-06-18
5 2017-06-25
6 2017-07-02
dtype: datetime64[ns]
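To get the dd.mm.yyyy strings the question asks for, the vectorized result can be formatted with .dt.strftime. A small sketch, with the sample frame rebuilt from the question and the timedelta built with unit="D" rather than from strings (a stylistic choice, not part of the answer above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"Year": [2017] * 7,
                   "vl": [20, 21, 22, 23, 24, 25, 26]})

# 1 January of each year, as datetimes
start = pd.to_datetime(df["Year"].astype(str), format="%Y")

# Add vl * 7 days, then format as dd.mm.yyyy strings
df["new_date"] = (start + pd.to_timedelta(df["vl"] * 7, unit="D")).dt.strftime("%d.%m.%Y")
print(df)
```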

pandas rename: change values of index for a specific column only

I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
Use set_levels, which returns a new index that you assign back:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values,
    level=1)
print(df)
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr 1 60
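A self-contained sketch that builds the mapping from calendar.month_abbr, as in the question, and only touches the Month level. The frame is reconstructed by hand from the question's printout, so its construction is an assumption:

```python
import calendar

import pandas as pd

# Frame rebuilt from the question's printout
idx = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=["Year", "Month", "ID"],
)
df = pd.DataFrame({"Cost": [40, 50, 60]}, index=idx)

# Replace only the labels of the Month level; the ID level is untouched
new_months = [calendar.month_abbr[m] for m in df.index.levels[1]]
df.index = df.index.set_levels(new_months, level="Month")
print(df)
```

Because set_levels returns a new index rather than modifying it in place, the result is assigned back to df.index.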
