The DataFrame eventually gets converted to Excel...
I'm trying to add rows with the average and max of each column, above the existing column headers.
I do not want to disturb the original headers for the actual data.
I don't want to hard-code column names, as these will change, so the solution needs to be somewhat generic. I attempted to create a max row but failed. I need the max to appear above the column headers.
Try this. I don't know how to add the rows above the dataframe, but I believe that in the end it might still be a good solution:
import pandas as pd
df = {
'date and time':['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
'<PowerAC--->':[40, 20, 9, 12]
}
df = pd.DataFrame(df)
# columns to summarise (leaves the date column out)
cols = ['<PowerAC--->']
agg = df[cols].agg(['mean', 'max'])
# append the statistics as extra rows below the data
out = pd.concat([df, agg])
print(out)
A one-liner method which also removes the "NaN" values to make it visually nicer (I'm a bit OCD ;)):
df.append(df.agg({'<PowerAC--->' : ['mean', max]})).fillna('')
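Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same one-liner would need pd.concat instead (using the same df as above):
pd.concat([df, df.agg({'<PowerAC--->': ['mean', 'max']})]).fillna('')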
I would say it's a good idea to keep your data separated from the reporting on it - I don't really see the logic for an "additional row above the column".
I would generate statistics for the overall data as a separate dataframe.
import pandas as pd
import numpy as np
np.random.seed(1)
t = pd.date_range(start='2022-05-31', end='2022-06-07')
x = np.random.rand(len(t))
df = pd.DataFrame({'date': t, 'data': x})
print(df)
# The 'numeric_only=False' behaviour will become default in a future version of pandas
d_mean = df.mean(numeric_only=False)
d_max = df.max()
# We need to transpose this, as the `d_mean` and `d_max` are Series (columns), and we want them as rows
df_stats = pd.DataFrame({'mean': d_mean, 'max':d_max}).transpose()
print(df_stats)
df output:
date data
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114
3 2022-06-03 0.302333
4 2022-06-04 0.146756
5 2022-06-05 0.092339
6 2022-06-06 0.186260
7 2022-06-07 0.345561
df_stats output:
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
You could add this and the dataframe together with
pd.concat([df_stats, df])
which looks like
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
0 2022-05-31 00:00:00 0.417022
1 2022-06-01 00:00:00 0.720324
2 2022-06-02 00:00:00 0.000114
3 2022-06-03 00:00:00 0.302333
4 2022-06-04 00:00:00 0.146756
5 2022-06-05 00:00:00 0.092339
6 2022-06-06 00:00:00 0.18626
7 2022-06-07 00:00:00 0.345561
but I would keep them separate unless you've got a very good reason to.
There may be some way which makes sense using a multi-index, but that's a bit beyond me, and probably beyond the scope of this question.
Edit: If you don't infer any meaning from the max and mean of the date column but still want something compatible with that column (i.e. still a datetime dtype but effectively null), you could replace it with np.datetime64('NaT') (NaT is like NaN, but "not a time"):
df_stats['date'] = np.datetime64('NaT')
print(pd.concat([df_stats, df]).head())
output:
date data
mean NaT 0.276339
max NaT 0.720324
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114
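Since the original question mentions exporting to Excel, one way to keep the statistics separate in pandas but still place them above the data in the spreadsheet is to write both frames to the same sheet with DataFrame.to_excel and its startrow parameter. A rough sketch only, the file name 'report.xlsx' is a placeholder, and it assumes an xlsx engine such as openpyxl is installed:
with pd.ExcelWriter('report.xlsx') as writer:
    # statistics block at the top of the sheet
    df_stats.to_excel(writer, sheet_name='data', startrow=0)
    # original data, with its own untouched header row, a couple of rows below
    df.to_excel(writer, sheet_name='data', startrow=len(df_stats) + 2)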
Related
I have a dataframe like the following:
Index Diff
2019-03-14 11:32:21.583000+00:00 0
2019-03-14 11:32:21.583000+00:00 2
2019-04-14 11:32:21.600000+00:00 13
2019-04-14 11:32:21.600000+00:00 14
2019-05-14 11:32:21.600000+00:00 19
2019-05-14 11:32:21.600000+00:00 27
What would be the best approach to group by the month and take the difference inside of those months?
Using the .diff() option I am able to find the difference between each row, but I am trying to use the df.groupby(pd.Grouper(freq='M')) with no success.
Expected Output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Any help would be much appreciated!!
Depending on whether or not your dates are on the index, you may be able to skip the df1 = df.reset_index() step below. Also check that the index is a DatetimeIndex if the dates are on it; if not, convert it with df.index = pd.to_datetime(df.index). Then you can replace the Diff column with the within-month differences via df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff() and afterwards group the full dataframe by month:
input:
import pandas as pd
df = pd.DataFrame(
    {'Diff': [0, 2, 13, 14, 19, 27]},
    index=pd.to_datetime(['2019-03-14 11:32:21.583000+00:00',
                          '2019-03-14 11:32:21.583000+00:00',
                          '2019-04-14 11:32:21.600000+00:00',
                          '2019-04-14 11:32:21.600000+00:00',
                          '2019-05-14 11:32:21.600000+00:00',
                          '2019-05-14 11:32:21.600000+00:00']))
df.index.name = 'Index'
df.index = pd.to_datetime(df.index)
code:
df1 = df.reset_index()
df1['Diff'] = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff()
df1 = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].max().reset_index()
df1
output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Rookie here so please excuse my question format:
I have an event time series dataset covering two months (columns for "date/time" and "# of events", with each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and there is indeed an nsmallest counterpart: pandas.DataFrame.nsmallest.
df.nsmallest(n=10, columns=['col'])
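If you need the 10 lowest hours within each week rather than over the whole dataset, one possible sketch is to combine nsmallest with a per-week groupby. The column names 'datetime' and 'events' and the random sample data are assumptions for illustration:
import pandas as pd
import numpy as np

# hypothetical hourly data: one row per hour with an event count
rng = pd.date_range('2020-06-01', periods=24 * 21, freq='H')
df = pd.DataFrame({'datetime': rng,
                   'events': np.random.randint(0, 50, len(rng))})

# for each ISO week, keep the 10 rows with the fewest events
lowest = (df.groupby(df['datetime'].dt.isocalendar().week, group_keys=False)
            .apply(lambda g: g.nsmallest(10, 'events')))
print(lowest)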
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. See pandas.DataFrame.pivot_table.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
3. Then you can resample it to a weekly level and aggregate using sum:
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the printed output:
for row in df.itertuples():
print(sorted(row[1:]))
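Putting these steps together, a rough end-to-end sketch might look like the following. The column names 'date' and 'n_events' and the random sample data are assumptions for illustration:
import pandas as pd
import numpy as np

# hypothetical hourly event counts
rng = pd.date_range('2020-06-01', periods=24 * 14, freq='H')
df = pd.DataFrame({'date': rng,
                   'n_events': np.random.randint(0, 10, len(rng))})

df['hour'] = df['date'].dt.hour                     # step 1: hour of day

hourly = df.pivot_table(index=pd.Grouper(key='date', freq='D'),
                        columns='hour',
                        values='n_events',
                        aggfunc='sum')              # step 2: one column per hour

weekly = hourly.resample('W').sum()                 # step 3: weekly totals per hour

# the 10 hour-of-day columns with the fewest events in each week
for week, row in weekly.iterrows():
    print(week.date(), row.nsmallest(10).index.tolist())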
The data is given as follows:
return
2010-01-04 0.016676
2010-01-05 0.003839
...
2010-01-05 0.003839
2010-01-29 0.001248
2010-02-01 0.000134
...
What I want is to extract every value that falls on the last day of each month as it appears in the data:
2010-01-29 0.001248
2010-02-28 ......
If I directly use pandas resample, i.e. df.resample('M').last(), I select the correct rows but get the wrong index (it automatically uses the last calendar day of the month as the index):
2010-01-31 0.001248
2010-02-28 ......
How can I get the correct answer in a Pythonic way?
An assumption made here is that your date data is part of the index. If not, I recommend setting it first.
Single Year
I don't think the resampling or grouper functions would do. Let's group on the month number instead and call DataFrameGroupBy.tail.
df.groupby(df.index.month).tail(1)
Multiple Years
If your data spans multiple years, you'll need to group on the year and month. Using a single grouper created from dt.strftime—
df.groupby(df.index.strftime('%Y-%m')).tail(1)
Or, using multiple groupers—
df.groupby([df.index.year, df.index.month]).tail(1)
Note—if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
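A small worked example of the multi-year grouping, using made-up daily 'return' values in the shape of the question (the actual numbers are assumptions):
import pandas as pd
import numpy as np

idx = pd.date_range('2010-01-04', '2010-03-31', freq='B')
df = pd.DataFrame({'return': np.random.rand(len(idx)) / 100}, index=idx)

# last available row of each (year, month), keeping the original dates as index
print(df.groupby([df.index.year, df.index.month]).tail(1))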
Although this doesn't answer the question properly, I'll leave it here in case someone is interested.
An approach which only works if you are certain you have all days (IMPORTANT!) is to add one day with pd.Timedelta and check whether the resulting day == 1. In a small timing test it was about 6x faster than the groupby solution.
df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]
Or if index:
df[(df.index + pd.Timedelta(days=1)).day == 1]
Full example:
import pandas as pd
df = pd.DataFrame({
'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
'i': 1
}).set_index('dates')
dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)
Returns:
i
dates
2016-01-31 1
2016-02-29 1
2016-03-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 1
2016-07-31 1
2016-08-31 1
2016-09-30 1
2016-10-31 1
2016-11-30 1
2016-12-31 1
2017-01-31 1
2017-02-28 1
2017-03-31 1
2017-04-30 1
2017-05-31 1
2017-06-30 1
2017-07-31 1
2017-08-31 1
2017-09-30 1
2017-10-31 1
2017-11-30 1
2017-12-31 1
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
1. It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
2. It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
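For reference, a naive O(n^2) sketch of just the [MonthPrior, index) windowed mean described above, which reproduces the PreviousMonthMean column in the table; the efficient, non-overlapping backward monthly aggregation is the part still missing:
import pandas as pd
from pandas.tseries.offsets import DateOffset

rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - DateOffset(months=1)

# mean of x over the half-open window [MonthPrior, index) for each row
df['PreviousMonthMean'] = [
    df.loc[(df.index >= start) & (df.index < end), 'x'].mean()
    for start, end in zip(df['MonthPrior'], df.index)
]
print(df)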
I have a table with a number of dates (some dates will be NaN) and I need to find the oldest date.
A row may have DATE_MODIFIED, WITHDRAWN_DATE, SOLD_DATE, STATUS_DATE, etc.
For each row there will be a date in one or more of these fields; I want to find the oldest of those and put it in a new column in the dataframe.
Something like this: if I use just one column, e.g. DATE_MODIFIED, I get a result, but when I add a second as below
table['END_DATE']=min([table['DATE_MODIFIED']],[table['SOLD_DATE']])
I get:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
For that matter will this construct work to find the min date, assuming I create correct date columns initially?
Just apply the min function along axis=1.
In [1]: import pandas as pd
In [2]: df = pd.read_csv('test.csv', parse_dates=['d1', 'd2', 'd3'])
In [3]: df.loc[2, 'd1'] = None
In [4]: df.loc[1, 'd2'] = None
In [5]: df.loc[4, 'd3'] = None
In [6]: df
Out[6]:
d1 d2 d3
0 2013-02-07 00:00:00 2013-03-08 00:00:00 2013-05-21 00:00:00
1 2013-02-07 00:00:00 NaT 2013-05-21 00:00:00
2 NaT 2013-03-02 00:00:00 2013-05-21 00:00:00
3 2013-02-04 00:00:00 2013-03-08 00:00:00 2013-01-04 00:00:00
4 2013-02-01 00:00:00 2013-03-06 00:00:00 NaT
In [7]: df.min(axis=1)
Out[7]:
0 2013-02-07 00:00:00
1 2013-02-07 00:00:00
2 2013-03-02 00:00:00
3 2013-01-04 00:00:00
4 2013-02-01 00:00:00
dtype: datetime64[ns]
If table is your DataFrame, then use its min method on the relevant columns:
table['END_DATE'] = table[['DATE_MODIFIED','SOLD_DATE']].min(axis=1)
A slight variation on Felix Zumstein's answer:
table['END_DATE'] = table[['DATE_MODIFIED','SOLD_DATE']].min(axis=1).astype('datetime64[ns]')
The astype('datetime64[ns]') is necessary in the current version of pandas (July 2015) to avoid getting a float64 representation of the dates.