Elegant way to sum over duplicate MultiIndex values - python

I have a DataFrame with a many-levelled MultiIndex.
I know that there are duplicates in the MultiIndex (because I don't care about a distinction that the underlying database does care about).
I want to sum over these duplicates:
>>> x = pd.DataFrame({'month':['Sep', 'Sep', 'Oct', 'Oct'], 'day':['Mon', 'Mon', 'Mon', 'Tue'], 'sales':[1,2,3,4]})
>>> x
   day month  sales
0  Mon   Sep      1
1  Mon   Sep      2
2  Mon   Oct      3
3  Tue   Oct      4
>>> x = x.set_index(['day', 'month'])
>>> x
           sales
day month
Mon Sep        1
    Sep        2
    Oct        3
Tue Oct        4
To give me:
day month
Mon Sep      3
    Oct      3
Tue Oct      4
Buried deep in this SO answer to a similar question is the suggestion:
df.groupby(level=df.index.names).sum()
But this seems to me to fail the 'readability counts' criterion of good Python code.
Does anyone know of a more human-readable way?
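One arguably more readable spelling, for comparison: groupby accepts index level names directly, so the levels can be named explicitly instead of going through `df.index.names`. A minimal sketch on the toy frame above:

```python
import pandas as pd

x = pd.DataFrame({'month': ['Sep', 'Sep', 'Oct', 'Oct'],
                  'day':   ['Mon', 'Mon', 'Mon', 'Tue'],
                  'sales': [1, 2, 3, 4]}).set_index(['day', 'month'])

# groupby resolves 'day' and 'month' to index levels,
# so the intent reads directly off the line
summed = x.groupby(['day', 'month']).sum()
```

This collapses the two (Mon, Sep) rows into one row with sales 3, leaving the other rows unchanged.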

Related

Pandas: Dataframe Calculation - New Rows with Division, New Columns with Sums and Averages

So I've got a Pandas DataFrame that looks like this:
import pandas as pd
df1 = pd.DataFrame([[5618, 5863, 8873, 7903, 9477, 7177, 7648, 9592],
                    [5698, 6009, 8242, 7356, 6191, 8817, 7340, 11781],
                    [5721, 6858, 8401, 6826, 6910, 6243, 6814, 9704]],
                   columns=["Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"])
Output:
    Jul   Aug   Sep   Oct   Nov   Dec   Jan    Feb
0  5618  5863  8873  7903  9477  7177  7648   9592
1  5698  6009  8242  7356  6191  8817  7340  11781
2  5721  6858  8401  6826  6910  6243  6814   9704
First I want to insert two new rows with index 3 and 4.
In the first one I want to divide the values of row 1 by the values of row 0:
      Jul     Aug    Sep    Oct    Nov    Dec    Jan     Feb
3  101,42  102,49  92,88  93,07  65,32  122,8  95,97  122,82
In the second one I want to divide the values of row 1 by the values of row 2:
     Jul    Aug    Sep     Oct    Nov     Dec     Jan     Feb
4  99,59  87,62  98,10  107,76  89,59  141,23  107,71  121,40
In the next step I want to create a new column holding the sum of the raw month values, and the average for the newly created rows.
df1["Sum_Avg"] = df1.sum(axis=1)
Output:
    Jul   Aug   Sep   Oct   Nov   Dec   Jan    Feb  Sum_Avg
0  5618  5863  8873  7903  9477  7177  7648   9592    62151
1  5698  6009  8242  7356  6191  8817  7340  11781    61434
2  5721  6858  8401  6826  6910  6243  6814   9704    57477
I don't know how to create the rows with index 3 and 4, so I also don't know how to put the averages in the same row as the sums.
At the end the full table should look like this (image omitted):
What I tried so far:
Making a new DataFrame with row 0:
df2 = pd.DataFrame(df1.iloc[[0]])
df2
Output:
    Jul   Aug   Sep   Oct   Nov   Dec   Jan   Feb
0  5618  5863  8873  7903  9477  7177  7648  9592
Making a new DataFrame with Row 1:
df3 = pd.DataFrame(df1.iloc[[1]])
df3
Output:
    Jul   Aug   Sep   Oct   Nov   Dec   Jan    Feb
1  5698  6009  8242  7356  6191  8817  7340  11781
Making a new DataFrame with the division of df2 and df3:
df4 = df3/df2
df4
Output:
   Jul  Aug  Sep  Oct  Nov  Dec  Jan  Feb
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
and here things got messed up, which is the reason I'm creating this post.
Use DataFrame.div with values shifted by DataFrame.shift, remove the first all-NaN row by indexing, and append to the original with concat:
df1["Sum_Avg"] = df1.sum(axis=1)
df = pd.concat([df1, df1.div(df1.shift()).iloc[1:]], ignore_index=True)
print(df)
           Jul          Aug          Sep          Oct          Nov  \
0  5618.000000  5863.000000  8873.000000  7903.000000  9477.000000
1  5698.000000  6009.000000  8242.000000  7356.000000  6191.000000
2  5721.000000  6858.000000  8401.000000  6826.000000  6910.000000
3     1.014240     1.024902     0.928885     0.930786     0.653266
4     1.004037     1.141288     1.019291     0.927950     1.116136

           Dec          Jan           Feb       Sum_Avg
0  7177.000000  7648.000000   9592.000000  62151.000000
1  8817.000000  7340.000000  11781.000000  61434.000000
2  6243.000000  6814.000000   9704.000000  57477.000000
3     1.228508     0.959728      1.228211      0.988464
4     0.708064     0.928338      0.823699      0.935589
Solution matching the expected output data:
df1["Sum_Avg"] = df1.sum(axis=1)
df = pd.concat([df1, df1.iloc[1].div(df1.iloc[[0,2]]) ], ignore_index=True)
print(df)
          Jul          Aug          Sep          Oct          Nov  \
0  5618.00000  5863.000000  8873.000000  7903.000000  9477.000000
1  5698.00000  6009.000000  8242.000000  7356.000000  6191.000000
2  5721.00000  6858.000000  8401.000000  6826.000000  6910.000000
3     1.01424     1.024902     0.928885     0.930786     0.653266
4     0.99598     0.876203     0.981074     1.077644     0.895948

           Dec          Jan           Feb       Sum_Avg
0  7177.000000  7648.000000   9592.000000  62151.000000
1  8817.000000  7340.000000  11781.000000  61434.000000
2  6243.000000  6814.000000   9704.000000  57477.000000
3     1.228508     0.959728      1.228211      0.988464
4     1.412302     1.077194      1.214035      1.068845
You could try this:
df = df1.T
df[3] = df[1] / df[0]
df[4] = df[1] / df[2]
df1 = df.T
df1["Sum_Avg"] = df1.sum(axis=1)
#           Jul          Aug  ...           Feb       Sum_Avg
# 0  5618.00000  5863.000000  ...   9592.000000  62151.000000
# 1  5698.00000  6009.000000  ...  11781.000000  61434.000000
# 2  5721.00000  6858.000000  ...   9704.000000  57477.000000
# 3     1.01424     1.024902  ...      1.228211      7.968526
# 4     0.99598     0.876203  ...      1.214035      8.530380
#
# [5 rows x 9 columns]
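Note that the answers above put a sum of ratios into the Sum_Avg cell of rows 3 and 4; if, as the question's wording suggests, those rows should instead hold an average, a minimal sketch on the question's data (the row labels 3 and 4 and the in-place `loc` enlargement are choices for illustration) could look like:

```python
import pandas as pd

df1 = pd.DataFrame([[5618, 5863, 8873, 7903, 9477, 7177, 7648, 9592],
                    [5698, 6009, 8242, 7356, 6191, 8817, 7340, 11781],
                    [5721, 6858, 8401, 6826, 6910, 6243, 6814, 9704]],
                   columns=["Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"])
months = df1.columns.tolist()

# new rows: row 1 divided by row 0, and row 1 divided by row 2
df1.loc[3] = df1.loc[1] / df1.loc[0]
df1.loc[4] = df1.loc[1] / df1.loc[2]

# sums for the raw month rows, means for the two ratio rows
df1["Sum_Avg"] = df1[months].sum(axis=1)
df1.loc[[3, 4], "Sum_Avg"] = df1.loc[[3, 4], months].mean(axis=1)
```

Restricting the aggregation to `months` keeps Sum_Avg itself out of the sums and means.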

How can I convert day of week, Month, Date to Year - Month - Date

I have dates from 2018 until 2021 in a pandas column and they look like this:
Date
Sun, Dec 30
Mon, Dec 31
Any idea how I can convert this to:
Date
Dec 30 2018
Dec 31 2018
That is, knowing the day of the week (Monday, Tuesday, etc.), is it possible to recover the year of that specific date?
I would take a look at this conversation. As mentioned, you will probably need to define a range of years, since it is possible that December 30th (for example) falls on a Sunday in more than one year. Otherwise, it is possible to collect a list of years where the input (Sun, Dec 30) is valid. You will probably need to use datetime to convert your strings to a Python readable format.
You can iterate over the years from 2018 to 2021, compute each target date's weekday name in that year, then find the matching year.
df = pd.DataFrame({'Date': {0: 'Sun, Dec 30',
                            1: 'Mon, Dec 31'}})
for col in range(2018, 2022):
    df[col] = '%s' % col + df['Date'].str.split(',').str[-1]
    df[col] = pd.to_datetime(df[col], format='%Y %b %d').dt.strftime('%a, %b %d')
dfn = df.set_index('Date').stack().reset_index()
cond = dfn['Date'] == dfn[0]
obj = dfn[cond].set_index('Date')['level_1'].rename('year')
result:
print(obj)
Date
Sun, Dec 30    2018
Mon, Dec 31    2018
Name: year, dtype: int64
print(df.join(obj, on='Date'))
          Date         2018         2019         2020         2021  year
0  Sun, Dec 30  Sun, Dec 30  Mon, Dec 30  Wed, Dec 30  Thu, Dec 30  2018
1  Mon, Dec 31  Mon, Dec 31  Tue, Dec 31  Thu, Dec 31  Fri, Dec 31  2018
df_result = obj.reset_index()
df_result['Date_new'] = df_result['Date'].str.split(',').str[-1] + ' ' + df_result['year'].astype(str)
print(df_result)
          Date  year     Date_new
0  Sun, Dec 30  2018  Dec 30 2018
1  Mon, Dec 31  2018  Dec 31 2018
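An alternative, arguably more direct sketch: for each candidate year, attach that year to the month/day part, parse it, and compare weekday names. The `find_year` helper and the 2018–2021 range are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Sun, Dec 30', 'Mon, Dec 31']})

def find_year(s, years=range(2018, 2022)):
    """Return the first year whose calendar matches the weekday named in `s`."""
    day_name, month_day = s.split(', ')
    for y in years:
        d = pd.to_datetime(f'{month_day} {y}', format='%b %d %Y')
        if d.day_name()[:3] == day_name:   # 'Sunday'[:3] == 'Sun'
            return y
    return None

df['year'] = df['Date'].apply(find_year)
df['Date_new'] = df['Date'].str.split(', ').str[-1] + ' ' + df['year'].astype(str)
```

As the other answer notes, only the first matching year is returned; the same month/day/weekday combination can recur in other years.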

Pandas DF convert date string to date year and month [duplicate]

This question already has answers here:
Extracting just Month and Year separately from Pandas Datetime column
(13 answers)
Closed 2 years ago.
Hi all, I have a column in a dataframe that looks like:
print(df['Date']):
29-Nov-16
4-Dec-16
1-Oct-16
30-Nov-19
30-Jun-20
28-Apr-16
24-May-16
And I am trying to get an output that looks like
print(df):
Date Month Year
29-Nov-16 Nov 2016
4-Dec-16 Dec 2016
1-Oct-16 Oct 2016
30-Nov-19 Nov 2019
30-Jun-20 Jun 2020
28-Apr-16 Apr 2016
24-May-16 May 2016
I have tried the following:
df['Month'] = pd.datetime(df['Date']).month
df['Year'] = pd.datetime(df['Date']).year
but am getting a TypeError: cannot convert the series to <class 'int'>
Any ideas or references to help out?
Thanks!
Use strftime and str.split, and assign the results to new columns:
df_final = df.assign(**pd.to_datetime(df['Date']).dt.strftime('%b-%Y')
.str.split('-', expand=True)
.set_axis(['Month','Year'], axis=1))
Out[32]:
        Date Month  Year
0  29-Nov-16   Nov  2016
1   4-Dec-16   Dec  2016
2   1-Oct-16   Oct  2016
3  30-Nov-19   Nov  2019
4  30-Jun-20   Jun  2020
5  28-Apr-16   Apr  2016
6  24-May-16   May  2016
You are missing `.dt`, and `pd.datetime` is just the `datetime` class, not a parser; use `pd.to_datetime` instead.
try this:
df['Month'] = pd.to_datetime(df['Date']).dt.month
df['Year'] = pd.to_datetime(df['Date']).dt.year
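For completeness, a sketch that parses once with an explicit format and derives both columns from the parsed result; the format string `%d-%b-%y` is inferred from the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['29-Nov-16', '4-Dec-16', '1-Oct-16', '30-Nov-19',
                            '30-Jun-20', '28-Apr-16', '24-May-16']})

# parse once, then derive both columns from the datetime values
parsed = pd.to_datetime(df['Date'], format='%d-%b-%y')
df['Month'] = parsed.dt.strftime('%b')   # abbreviated month name, e.g. 'Nov'
df['Year'] = parsed.dt.year              # integer year, e.g. 2016
```

This keeps Year numeric, unlike the str.split approach, which yields strings.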

Pandas - Convert multiple Series having multiple columns to Dataframe

How can I convert a Series to a DataFrame?
The problem is mapping the Series' index levels onto the DataFrame's columns.
I have a Series like this
(made with the groupby and concat functions):
CUS_ID  DAY
2       MON    0.176644
        TUE    0.246489
        WED    0.160569
        THU    0.234109
        FRI    0.170916
...
dtype: float64
And what I want to get is like this:
CUS_ID       MON       TUE       WED       THU       FRI
2       0.176644  0.246489  0.160569  0.234109  0.170916
The result must be a DataFrame!
Is there any way to get it without using a 'for' loop?
You can simply unstack the index:
s = pd.Series(data=[1, 2, 3, 4, 5],
              index=[[2, 2, 2, 2, 2], ['mon', 'tue', 'wed', 'thu', 'fri']])
2  mon    1
   tue    2
   wed    3
   thu    4
   fri    5
s.unstack()
   fri  mon  thu  tue  wed
2    5    1    4    2    3
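One wrinkle visible in the output above: unstack sorts the new columns lexically (fri, mon, ...), so restoring weekday order takes an explicit column selection. A sketch on the question's values (the weekday order list is an assumption):

```python
import pandas as pd

s = pd.Series([0.176644, 0.246489, 0.160569, 0.234109, 0.170916],
              index=pd.MultiIndex.from_product(
                  [[2], ['MON', 'TUE', 'WED', 'THU', 'FRI']],
                  names=['CUS_ID', 'DAY']))

df = s.unstack()                              # columns come out alphabetically
df = df[['MON', 'TUE', 'WED', 'THU', 'FRI']]  # restore weekday order
```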

Cleaning inconsistent date formatting in pandas dataframe

I have a very large dataframe in which one of the columns, ['date'] (dtype is still string), is formatted as below. Sometimes the time is displayed as hh:mm:ss and sometimes as h:mm:ss (for hours 9 and earlier):
Tue Mar 1 9:23:58 2016
Tue Mar 1 9:29:04 2016
Tue Mar 1 9:42:22 2016
Tue Mar 1 09:43:50 2016
pd.to_datetime() won't work when I'm trying to convert the string into datetime format so I was hoping to find some help in getting 0's in front of the time where missing.
Any help is greatly appreciated!
import pandas as pd
date_stngs = ('Tue Mar 1 9:23:58 2016', 'Tue Mar 1 9:29:04 2016',
              'Tue Mar 1 9:42:22 2016', 'Tue Mar 1 09:43:50 2016')
a = pd.Series([pd.to_datetime(date) for date in date_stngs])
print(a)
output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
Found this to work myself when rearranging the order:
time = df[0].str.split(' ').str.get(3).str.split('').str.get(0).str.strip().str[:8]
year = df[0].str.split('--').str.get(0).str[-5:].str.strip()
daynmonth = df[0].str[:10].str.strip()
df_1['date'] = daynmonth + ' ' + year + ' ' + time
df_1['date'] = pd.to_datetime(df_1['date'])
Assuming you have a one-column DataFrame with strings as above, and the column name is 0, the following will split the strings on whitespace and then zero-fill the time field with zfill:
Assuming the starting df:
                         0
0   Tue Mar 1 9:23:58 2016
1   Tue Mar 1 9:29:04 2016
2   Tue Mar 1 9:42:22 2016
3  Tue Mar 1 09:43:50 2016
df1 = df[0].str.split(expand=True)
df1[3] = df1[3].str.zfill(8)
pd.to_datetime(df1.apply(lambda x: ' '.join(x.tolist()), axis=1))
Output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
dtype: datetime64[ns]
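Alternatively, the zfill step can likely be skipped: the `%H` directive accepts one- or two-digit hours, so a single vectorized `to_datetime` call with an explicit format should parse these strings as-is. A sketch on the sample data:

```python
import pandas as pd

s = pd.Series(['Tue Mar 1 9:23:58 2016', 'Tue Mar 1 9:29:04 2016',
               'Tue Mar 1 9:42:22 2016', 'Tue Mar 1 09:43:50 2016'])

# %H matches both '9' and '09', so the inconsistent padding is harmless
out = pd.to_datetime(s, format='%a %b %d %H:%M:%S %Y')
```

On a very large frame this avoids both the per-element parsing of the list comprehension and the split/zfill/join round trip.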
