Averaging over specific months in pandas - python

I'm having trouble creating averages using pandas. I want to average the months Nov, Dec, Jan, Feb, and March together for each winter, but they fall in different calendar years, so I can't just average the values falling within one calendar year. I have tried subsetting the data into two selections:
nd_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([11,12])]
jfm_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([1,2,3])]
However, I'm having trouble manipulating the dates (years) in order to do a simple average. I'm inexperienced with pandas and am wondering if there is a more elegant way than exporting to Excel and changing the year! The data is in the form:
Date
1899-01-01 00:00:00 100994.0
1899-02-01 00:00:00 100932.0
1899-03-01 00:00:00 100978.0
1899-11-01 00:00:00 100274.0
1899-12-01 00:00:00 100737.0
1900-01-01 100655.0
1900-02-01 100633.0
1900-03-01 100512.0
1900-11-01 101212.0
1900-12-01 100430.0

Interesting problem. Since you are averaging over five months that straddle a year boundary, resampling is trickier. You should be able to overcome this with logical indexing and by building a new dataframe. I assume your index holds datetime values.
import numpy as np
import pandas as pd

index = pd.date_range('1899-09-01', '1902-03-01', freq='M')
data = np.random.randint(0, 100, (index.size, 5))
df = pd.DataFrame(index=index, data=data, columns=list('ABCDE'))
# find rows that meet your criteria and average
idx1 = (df.index.year==1899) & (df.index.month >10)
idx2 = (df.index.year==1900) & (df.index.month < 4)
winterAve = df.loc[idx1 | idx2, :].mean(axis=0)
Just to visually check that the indexing/slicing is doing what we need....
>>> df.loc[idx1 | idx2, :]
Out[200]:
A B C D E
1899-11-30 48 91 87 29 47
1899-12-31 63 5 0 35 22
1900-01-31 37 8 89 86 38
1900-02-28 7 35 56 63 46
1900-03-31 72 34 96 94 35
You should be able to put this in a for loop to iterate over multiple years, etc.
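For example, a minimal sketch of that loop (assuming the random monthly df built above), labelling each winter by the two years it spans:
winter_means = {}
for year in range(1899, 1902):
    nd = (df.index.year == year) & (df.index.month >= 11)      # Nov, Dec of this year
    jfm = (df.index.year == year + 1) & (df.index.month <= 3)  # Jan, Feb, Mar of the next year
    winter_means[f'{year}-{year + 1}'] = df.loc[nd | jfm, :].mean(axis=0)
winter_means = pd.DataFrame(winter_means).T  # one row per winter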

Group the data by month using pd.Grouper:
g = df.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by month)
Then, for each group, calculate the average of only the 'A' column:
monthly_averages = g.aggregate({"A": np.mean})
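For the original winter-average question, a hedged alternative is to keep only the winter months and then group with an annual Grouper anchored in November, so each bin runs Nov–Oct and therefore contains exactly one Nov–Mar season (a sketch, assuming a monthly DatetimeIndex):
winter_months = df[df.index.month.isin([11, 12, 1, 2, 3])]
# 'AS-NOV' = annual bins starting each November, i.e. one bin per winter season
winter_means = winter_months.groupby(pd.Grouper(freq='AS-NOV')).mean()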

finding outliers in subset of df [duplicate]

Below is an example of the df I use (sales data). The df is big: several GB of data, a few thousand brands, data for the past 12 months, and hundreds of territories.
index date brand territory value
0 2019-01-01 A 1 63
1 2019-02-01 A 1 91
2 2019-03-01 A 1 139
3 2019-04-01 A 1 80
4 2019-05-01 A 1 149
I want to find outliers for each individual brand across all territories for all dates.
To find outliers within the whole df I can use
outliers = df[(np.abs(stats.zscore(df['value'])) > 3)]
or stats.zscore(df['value']) just to calculate the z-score.
I would like to add a column df['z-score'],
so I thought about something like this, but apparently it doesn't work:
df['z-score'] = df.groupby('brand', as_index=False)['value'].stats.zscore(df['value'])
Use transform
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)
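With the per-brand z-scores stored as a column, the same threshold from the question can then be applied directly, for example:
outliers = df[df['z-score'].abs() > 3]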

Segmenting a dataframe based on date with datetime column. Python

I have a dataframe named data as shown:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2019-06  23     66   90  44
The Date column is in a datetime format of Year-Month, starting from 2017-01 with the most recent being 2022-05. I want to write a piece of code that will extract data into separate dataframes. More specifically, I want one dataframe to contain the rows of the current month and year (2022-05), another dataframe to contain the data from the previous month (2022-04), and one more dataframe that contains data from 12 months ago (2021-05).
For my code I have the following:
import pandas as pd
from datetime import datetime as dt
data = pd.read_csv("results.csv")
current = data[data["Date"].dt.month == dt.now().month]
My results show the following:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2020-05  23     66   90  44
So I get the rows that match the current month, but I also need them to match the current year. I assumed I could just add multiple conditions to match the current month and current year, but that did not seem to work for me.
Also, is there a way to write the code such that I can extract the data from the previous month and from the previous year based on what the current month-year is? My first thought was to just subtract 1 from the month (and do the same for the year), with an exception when the current month is January, in which case I would subtract 1 from both the month and the year to get the previous month.
Split your DF into a dict of DFs and then access the one you want directly by the date (YYYY-MM).
index  Date     Value  X1   X2  X3
0      2017-04  15     23   65  98
1      2019-05  34     132  56  87
2      2021-06  23     66   90  44
dfs = {x: df[df.Date == x] for x in df.Date.unique()}
dfs['2017-04']
index  Date     Value  X1  X2  X3
0      2017-04  15     23  65  98
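If the three keys should follow today's date rather than being hard-coded, one possible sketch using monthly Period arithmetic (assuming the Date values are 'YYYY-MM' strings as shown, and that those months exist in the data):
this_month = pd.Timestamp.now().to_period('M')  # e.g. Period('2022-05')
current  = dfs[str(this_month)]       # current month
previous = dfs[str(this_month - 1)]   # previous month
year_ago = dfs[str(this_month - 12)]  # same month one year ago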
You can do this with a groupby operation, which is a first-class kind of thing in tabular analysis (SQL/pandas). In this case, you want to group by both year and month, creating dataframes:
dfs = []
for key, group_df in df.groupby([df.Date.dt.year, df.Date.dt.month]):
    dfs.append(group_df)
dfs will have the subgroups you want.
One thing: it's worth noting that there is a performance cost to breaking a dataframe into list items. It's just as likely that you could do whatever processing comes next directly in the groupby statement, such as df.groupby(...).X1.transform(sum) for example.
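If direct lookup by year and month is preferred over a list, the same groupby can also feed a dict keyed by (year, month) tuples (a sketch):
dfs = {key: group for key, group in df.groupby([df.Date.dt.year, df.Date.dt.month])}
current = dfs[(2022, 5)]  # rows for May 2022, assuming that month is present in the data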

Python convert daily column into a new dataframe with year as index week as column

I have a dataframe with the date as the index and one parameter. I want to convert the data column into a new dataframe with the year as the row index and the week number as the column name, with cells showing the weekly mean value. I would then use this information to plot with seaborn: https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into weekly averaged dataframe
wdf = df.groupby(df.index.strftime('%Y-%W')).data.mean()
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: Column name denotes the week number, index denotes the year. Cell denotes the sample's mean in that week.
      01   20   26   45
2019  15   NaN  35   NaN   # 15 is the mean of week 01 (10, 20) in the df above
2020  NaN  NaN  NaN  55
2021  NaN  75   NaN  NaN
I have no idea how to proceed from the solution obtained above to the expected answer.
You can use a pivot_table:
df['year'] = pd.DatetimeIndex(df['date']).year
df['week'] = pd.DatetimeIndex(df['date']).week
final_table = pd.pivot_table(data=df, index='year', columns='week', values='data', aggfunc=np.mean)
You need to use two dimensions in the groupby, and then unstack to lay out the data as a grid:
df.groupby([df.index.year,df.index.week])['data'].mean().unstack()
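On newer pandas versions, where Index.week is deprecated, the same idea can be written with isocalendar() (available since pandas 1.1); a sketch:
wdf = df.groupby([df.index.year, df.index.isocalendar().week])['data'].mean().unstack()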

Pythonic way to add Timestamps to pandas dataframe based on other Timestamps

The funky way that you index into pandas dataframes to change values is difficult for me. I can never figure out if I'm changing the value of a dataframe element, or if I'm changing a copy of that value.
I'm also new to python's syntax for operating on arrays, and struggle to turn loops over indexes (like in C++) into vector operations in python.
My problem is that I wish to add a column of pandas.Timestamp values to a dataframe, based on values in other columns. Let's say I start with a dataframe like
import pandas as pd
import numpy as np

mydata = np.transpose([[11, 22, 33, 44, 66, 77],
                       pd.to_datetime(['2015-02-26', '2015-02-27', '2015-02-25', np.NaN, '2015-01-24', '2015-03-24'], errors='coerce'),
                       pd.to_datetime(['2015-02-24', np.NaN, '2015-03-24', '2015-02-26', '2015-02-27', '2015-02-25'], errors='coerce')])
df = pd.DataFrame(columns=['ID', 'BEFORE', 'AFTER'], data=mydata)
df.head(6)
which returns
ID BEFORE AFTER
0 11 2015-02-26 2015-02-24
1 22 2015-02-27 NaT
2 33 2015-02-25 2015-03-24
3 44 NaT 2015-02-26
4 66 2015-01-24 2015-02-27
5 77 2015-03-24 2015-02-25
I want to find the lesser of the dates BEFORE and AFTER and then make a new column called RELEVANT_DATE with the results. I can then drop BEFORE and AFTER. There are a zillion ways to do this but, for me, almost all of them don't work. The best I can do is this
# fix up NaT's only in specific columns; real data has more columns
futureDate = pd.to_datetime('2099-01-01')
df.fillna({'BEFORE': futureDate, 'AFTER': futureDate}, inplace=True)
# super clunky solution
numRows = np.shape(df)[0]
relevantDate = []
for index in range(numRows):
    if df.loc[index, 'AFTER'] >= df.loc[index, 'BEFORE']:
        relevantDate.append(df.loc[index, 'BEFORE'])
    else:
        relevantDate.append(df.loc[index, 'AFTER'])
# add relevant date column to df
df['RELEVANT_DATE'] = relevantDate
# delete irrelevant dates
df.drop(labels=['BEFORE', 'AFTER'], axis=1, inplace=True)
df.head(6)
returning
ID RELEVANT_DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
This approach is super slow. With a few million rows it takes too long to be useful.
Can you provide a pythonic-style solution for this? Recall that I'm having trouble both with vectorizing these operations AND making sure they get set for real in the DataFrame.
Take the minimum across a row (axis=1). Set the index so you can bring 'ID' along for the ride.
df.set_index('ID').min(axis=1).rename('RELEVANT DATE').reset_index()
ID RELEVANT DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
Or assign the new column to your existing DataFrame:
df['RELEVANT DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)
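Note that min skips NaT by default, so the fillna step from the question isn't needed; a compact end-to-end sketch (assuming BEFORE and AFTER are real datetime64 columns; if they came in as object dtype, convert them with pd.to_datetime first):
df['RELEVANT_DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)  # NaT is ignored row-wise
df = df.drop(columns=['BEFORE', 'AFTER'])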

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get those rows from the dataframe:
that match on the group id, and...
that have a beginning date within the last 3 years of this row's start date, and...
that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, and append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gap counts over the
last 3-year period. It uses gaps.sql as its source extract. This query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID but whose effective dates
come after this row's own THREE_YEAR_AGO and whose end dates come before
this row's own beginning date). Those rows are added up and a new column
called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row id_nbr 11 has a value of 79 for the last 3 years, but id_nbr 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date in 2014.
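A hedged sketch of one way to avoid the row-by-row loop: iterate over groups instead of rows and build each row's 3-year window sum with a broadcast comparison inside the group (this assumes each GROUP_ID has few enough rows for an n-by-n boolean matrix in memory):
import numpy as np
import pandas as pd

parts = []
for _, group in df.groupby('GROUP_ID'):
    beg = group['BEG_DATE'].to_numpy()
    end = group['END_DATE'].to_numpy()
    ago = group['THREE_YEAR_AGO'].to_numpy()
    gap = group['GAP'].to_numpy()
    # in_window[i, j] is True when row j falls inside row i's 3-year look-back window
    in_window = (beg[None, :] >= ago[:, None]) & (end[None, :] <= beg[:, None])
    parts.append(pd.Series((in_window * gap).sum(axis=1) + gap, index=group.index))
df['GAP_THREE'] = pd.concat(parts)  # aligns on the original index
On the constructed dataset above this reproduces the expected GAP_THREE column (44, 79, 79, 0).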
