Resampling with percentiles - python

I have a data frame with some numerical values and a date-timestamp.
What I would like to do is aggregate the data into monthly intervals, outputting a high percentile value for each month.
What I have been doing so far is just using:
df = df.resample('M', on='ds').max()
Which gives me the max value for that month. However, from what I can see in my data, there are usually one or two spikes in each month. The result is that by using max() I will get that spike value, which is not correct. So, as a way to filter out the few high-value peaks, I was wondering if I could use a percentile function instead of max(), e.g.:
np.percentile(df['y'], 99)
As far as I can see, the resample function does not provide the option to use your own functions, but I might be wrong. In any case, how can this be accomplished?

Use a custom lambda function in GroupBy.agg:
df = df.resample('M', on='ds')['y'].agg(lambda x: np.percentile(x, 99))
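For context, here is a minimal runnable sketch of that approach; the column names ds and y come from the question, but the data itself is made up:

import numpy as np
import pandas as pd

# Hypothetical example data: one reading per day over three months
df = pd.DataFrame({
    'ds': pd.date_range('2023-01-01', periods=90, freq='D'),
    'y': np.random.default_rng(0).normal(100, 10, 90),
})

# 99th percentile of y for each month, which trims the top 1% of spikes
monthly = df.resample('M', on='ds')['y'].agg(lambda x: np.percentile(x, 99))

Note that the same result is available without numpy via df.resample('M', on='ds')['y'].quantile(0.99).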

Related

pandas computing new column as an average of two other conditions

So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal which represents the average temperature of a city in a month. In this dataset, city is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store, in a new column, the average temperature for each city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it calculates the mean across every city of the dataset, and I don't want that because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe produced by groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first one is using transform:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby and then merge it back to df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
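A minimal sketch of both approaches on a toy frame, using the column names from the question (the values are invented):

import pandas as pd

# Hypothetical temperature readings for two cities (estacao) and two months (mes)
df = pd.DataFrame({
    'estacao': ['A', 'A', 'A', 'B', 'B', 'B'],
    'mes':     [1,   1,   2,   1,   2,   2],
    'temp_ar': [20., 22., 25., 18., 27., 29.],
})

# Option 1: transform keeps the original shape, so direct assignment aligns
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

# Option 2: aggregate to a smaller frame, then merge it back
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')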
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there's nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.

Grouping date ranges in pandas

I was trying to get output grouped by weekly date ranges. I am able to group the dates and sum the values, but how do I get the output as per the image below? I tried pd.Grouper with a frequency and resample, with no luck. Can any other methods help?
I am looking for the desired output as per the image.
resample works on time series data. If you want to resample a DataFrame, it should either have a DateTime index or you need to pass the on parameter to resample.
This should work:
df.resample('W', on='Date').sum()
W is the weekly frequency; see here.
Another option you might explore is cut, but IMO resample will be better for what you need.
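For illustration, a small self-contained sketch; the Date and Value column names are assumptions, since the question's data is only shown in an image:

import pandas as pd

# Hypothetical daily values spanning three weeks
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-02', periods=21, freq='D'),
    'Value': range(21),
})

# Sum the values into weekly buckets; the index marks each week's end date
weekly = df.resample('W', on='Date').sum()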

Can I add a conditional count() to a groupby dataframe where the condition is a groupby result?

I have a two-column dataframe named limitData where the first column is CcyPair and the second is TradeNotional:
CcyPair,TradeNotional
USDCAD,1000000
USDCAD,7600
USDCAD,40000
GBPUSD,100000
GBPUSD,345000
etc
with a large number of CcyPairs and many TradeNotional values per CcyPair. From here I generate summary statistics as follows:
limitDataStats = limitData.groupby(['CcyPair']).describe()
This is easy enough. However, I would like to add a column to limitDataStats that contains the count of TradeNotional values greater than that CcyPair's 75th percentile, as determined by .describe() and stored in limitDataStats. I've searched a great deal and tried a number of variations but can't figure it out. I think it should be something along the lines of the below (I thought I could reference the index of the groupby as mentioned here, but that gives me the actual integer index):
limitData.groupby(['CcyPair'])['TradeNotional'].apply(lambda x: x[x > limitDataStats.loc[x.index, '75%']].count())
Any ideas? Thanks, Colin
You can compare each value with its group's 75th percentile and count how many are greater than or equal to that value (use .sum(), since a boolean series is returned from ge()):
limitData.groupby('CcyPair')['TradeNotional'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())
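For reference, a runnable sketch using the sample rows from the question:

import pandas as pd

limitData = pd.DataFrame({
    'CcyPair': ['USDCAD', 'USDCAD', 'USDCAD', 'GBPUSD', 'GBPUSD'],
    'TradeNotional': [1000000, 7600, 40000, 100000, 345000],
})

# For each CcyPair, count trades at or above that pair's 75th percentile
counts = limitData.groupby('CcyPair')['TradeNotional'].apply(
    lambda x: x.ge(x.quantile(.75)).sum())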

Python Pandas - Carrying the index over as the group name/index for a groupby produced dataframe

I have used groupby in pandas; however, the label for the groups is simply an arbitrary value, whereas I would like this label to be the index of the original dataframe (which is datetime), so that I can create a new dataframe which I can plot in terms of datetime.
grouped_data = df.groupby(
    ['X', df.X.ne(df.X.shift()).cumsum().rename('grp')])
grouped_data2 = grouped_data['Y'].agg(np.trapz).loc[2.0:4.0]
The column X has values changing from 1 to 4, and the second line of code is intended to integrate the column Y in the groups where X is either 2 or 3. These are repeating units, so I don't want all the 2s and all the 3s integrated together; I want the period of time where it goes 22222333333 as one group, and then to apply np.trapz again to the next group where it goes 2222233333. That way I should have a new dataframe with an index corresponding to the start of these time periods and values which are an integral of these periods.
If I understand correctly, you've already set your index to DateTime values? If yes, try pd.Grouper:
df.groupby(pd.Grouper(freq={appropriate offset alias}))
Without a sample dataset, I can't really provide a complete solution, but this should solve your indexing issue :)
Grouper Function tutorial
Offset aliases
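As an illustrative sketch, assuming a DataFrame with a DateTime index and a daily bucket size (both the index frequency and the bucket size are invented here):

import numpy as np
import pandas as pd

# Hypothetical frame indexed by hourly timestamps
df = pd.DataFrame(
    {'Y': np.arange(48.0)},
    index=pd.date_range('2023-01-01', periods=48, freq='H'),
)

# With a DateTime index, freq alone sets the bucket size; key= would only
# be needed to group by a column rather than the index
daily = df.groupby(pd.Grouper(freq='D'))['Y'].agg(np.trapz)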

Using a function to do a %change on a dataset

I am using pandas in Python but running into this issue. I have a dataset that has countries as the columns and dates (my months) as the rows. The data consists of the population of an item.
I am required to calculate the % change of population month by month. Is there a function that I can use to get the data into a dataset with the % change month by month, in the format attached?
I am trying to apply a function to the dataset, but getting the function to retrieve the previous month's population to do a % change is an issue.
Does anyone have any good ideas to get this done? Thanks
You can use pct_change:
df.pct_change()
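A quick sketch of what that does; the country columns and dates here are invented:

import pandas as pd

# Hypothetical monthly populations per country
df = pd.DataFrame(
    {'US': [100, 110, 99], 'BR': [50, 55, 66]},
    index=pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01']),
)

# Fractional change from the previous row (month); the first row is NaN
pct = df.pct_change()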
First order the data by month (if it isn't already), and then use the .shift() function for pandas dataframes
df['pct_change'] = (df.US - df.US.shift(1)) / df.US.shift(1)
.shift() allows you to shift rows up or down depending on the argument.
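As a quick check, a sketch (with invented numbers) showing that the shift-based formula, with the previous month in the denominator, agrees with pct_change up to floating-point rounding:

import numpy as np
import pandas as pd

df = pd.DataFrame({'US': [100.0, 110.0, 99.0]})

# Manual month-over-month change: (current - previous) / previous
manual = (df.US - df.US.shift(1)) / df.US.shift(1)

# Matches the built-in helper; equal_nan covers the NaN in the first row
assert np.allclose(manual, df.US.pct_change(), equal_nan=True)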
