Pandas get the Month Ending Values from Series - python

I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.

Use pd.Grouper with GroupBy.last, forward fill the missing months with ffill, and turn the index back into a column with Series.reset_index:
# if necessary
# df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
# alternative
# df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print(df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
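For reference, a minimal self-contained sketch of the same idea, using a subset of the question's sample data (note: pandas >= 2.2 prefers the month-end alias 'ME' over 'M'):
import pandas as pd

df = pd.DataFrame({
    'date': ['2009-04-23', '2009-04-30', '2009-05-29', '2009-06-12', '2009-08-03'],
    'totalShrs': [10000.0, 40000.0, 80000.0, 110000.0, 120000.0],
})
df['date'] = pd.to_datetime(df['date'])

out = (df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs']
         .last()    # July has no rows, so last() yields NaN for 2009-07-31...
         .ffill()   # ...which ffill() fills with June's value
         .reset_index())
print(out)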

The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for. Note that grouping on the month string alone merges the same month across different years; it works here only because the sample data is all from 2009:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, take the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with the last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08

Related

Pandas resample MultiIndex dataframe with forward fill

I am trying to resample a MultiIndex dataframe to a less granular frequency (daily to month end) by taking the last valid daily observation in every month.
For example, given the dataframe below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')]*4
                           + [pd.to_datetime('2012-03-30')]*4
                           + [pd.to_datetime('2012-04-01')]*4,
                   'groups': [1, 2, 3, 4]*3,
                   'values': np.random.normal(size=12)})
df = df.set_index(['date', 'groups'])
                     values
date       groups
2012-03-29 1       0.013681
           2       0.359522
           3      -0.525454
           4      -0.282541
2012-03-30 1       0.155501
           2      -1.053596
           3       0.003049
           4      -0.165875
2012-04-01 1      -0.049135
           2       2.701785
           3       2.240875
           4       0.057297
The desired final dataframe is:
                     values
date       groups
2012-03-31 1       0.155501
           2      -1.053596
           3       0.003049
           4      -0.165875
In a regular dataframe (with single index), the desired output can be achieved with df.asfreq('M', method='ffill') as shown below.
df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')] + pd.date_range('2012-04-01', '2012-04-04').to_list(),
                   'values': np.random.normal(size=5)})
df = df.set_index('date')
df_monthly = df.asfreq('M', method='ffill')
Where df is:
values
date
2012-03-29 1.988554
2012-04-01 -1.054163
2012-04-02 -1.112537
2012-04-03 0.224515
2012-04-04 0.152175
and df_monthly is:
values
date
2012-03-31 1.988554
Any help is much appreciated. Thanks in advance.
Use:
df_monthly = df.reset_index(level=1).groupby('groups')[['values']].apply(lambda x: x.asfreq('M', method='ffill')).swaplevel(1,0)
print(df_monthly)
The values below differ from the desired output above because the sample data is randomly generated:
                     values
date       groups
2012-03-31 1      -2.951662
           2      -1.495653
           3      -0.948413
           4       0.066219
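For clarity, the same one-liner unpacked step by step (a sketch using the df built above; values will again differ because the data is random):
tmp = df.reset_index(level=1)          # 'groups' becomes a column; a plain DatetimeIndex remains
by_group = tmp.groupby('groups')[['values']]
monthly = by_group.apply(lambda g: g.asfreq('M', method='ffill'))  # month-end snapshot per group
df_monthly = monthly.swaplevel(1, 0)   # index order (groups, date) -> (date, groups)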

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100, which I did.
The second task is grouping id by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframe, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the suggestion for a solution here: count consecutive days python dataframe, but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a count column, because afterwards I need to count ids by ranges of those consecutive-day counts in the count column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that is not part of my question.
Thanks a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy format instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 would be parsed as 2018-02-01 instead of the expected 2018-01-02, and the day diff between adjacent entries would be around 30 as opposed to 1.
We added a sort step over the columns id and date to simplify the later grouping during the creation of the series s (illustrated in the sketch after this list).
In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we also do an extra .reset_index(name='count') to turn the Pandas series back into a dataframe and name the new column count.
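To see why the diff/cumsum trick works, here is a runnable sketch that prints the intermediate series, built from the sample rows in the question:
import pandas as pd

df = pd.DataFrame({'date': ['01/01/2018', '01/01/2018', '02/01/2018', '03/01/2018', '05/01/2018'],
                   'id': [3, 7, 3, 3, 7],
                   'sales': [101, 178, 120, 150, 205]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df2 = df.sort_values(['id', 'date'])

gaps = df2.groupby('id').date.diff().dt.days  # day gap to the previous row within each id; NaN for the first row
s = gaps.ne(1).cumsum()                       # any gap != 1 day starts a new block; cumsum labels the blocks
print(pd.concat([df2, gaps.rename('gap'), s.rename('block')], axis=1))
# id 3's three consecutive days share block 1; id 7's two isolated days get blocks 2 and 3.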

Python convert daily column into a new dataframe with year as index week as column

I have a data frame with the date as an index and a parameter. I want to convert column data into a new data frame with year as row index and week number as column name and cells showing weekly mean value. I would then use this information to plot using seaborn https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into a weekly averaged series (the index is a DatetimeIndex, so strftime is called on it directly)
wdf = df.groupby(df.index.strftime('%Y-%W'))['data'].mean()
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: Column name denotes the week number, index denotes the year. Cell denotes the sample's mean in that week.
        01   20   26   45
2019    15  NaN   35  NaN    # 15 is the mean of the 1st week (10, 20) in the df above
2020   NaN  NaN  NaN   55
2021   NaN   75  NaN  NaN
I have no idea how to proceed further to get the expected answer from the solution obtained above.
You can use a pivot_table. (This assumes date is a regular column; if it is the index, call reset_index() first. Also note that in recent pandas, .week has been removed in favour of .isocalendar().week.)
import numpy as np

df['year'] = pd.DatetimeIndex(df['date']).year
df['week'] = pd.DatetimeIndex(df['date']).week
final_table = pd.pivot_table(data=df, index='year', columns='week', values='data', aggfunc=np.mean)
You need to use two dimensions in the groupby, and then unstack to lay out the data as a grid:
df.groupby([df.index.year,df.index.week])['data'].mean().unstack()
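Both snippets rely on the old .week attribute, which was deprecated and later removed from pandas; here is a sketch of the groupby/unstack route using isocalendar() instead, built on the question's data:
import pandas as pd

df = pd.DataFrame({'data': [10, 20, 30, 40, 50, 60, 70, 80]},
                  index=pd.to_datetime(['2019-01-03', '2019-01-04', '2019-05-21', '2019-05-22',
                                        '2020-10-15', '2020-10-16', '2021-04-04', '2021-04-05']))

week = df.index.isocalendar().week  # replacement for the removed DatetimeIndex.week
out = df.groupby([df.index.year, week])['data'].mean().unstack()
print(out)  # rows: years, columns: week numbers, cells: weekly means
# note: ISO week numbers can differ from the strftime('%W') numbers used in the question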

Python Pandas - Get the rows of first and last day of particular months

My data set df looks as follows:
Date Value
...
2012-07-31 61.9443
2012-07-30 62.1551
2012-07-27 62.3328
... ...
2011-10-04 48.3923
2011-10-03 48.5939
2011-09-30 50.0327
2011-09-29 51.8350
2011-09-28 50.5555
2011-09-27 51.8470
2011-09-26 49.6350
... ...
2011-08-03 61.3948
2011-08-02 61.5476
2011-08-01 64.1407
2011-07-29 65.0364
2011-07-28 65.7065
2011-07-27 66.3463
2011-07-26 67.1508
2011-07-25 67.5577
... ...
2010-10-05 57.3674
2010-10-04 56.3687
2010-10-01 57.6022
2010-09-30 58.0993
2010-09-29 57.9934
Below are the data type of the two columns:
Type Column Name Example Value
-----------------------------------------------------------------
datetime64[ns] Date 2020-06-19 00:00:00
float64 Value 108.82
I would like to have a subset of df that contains only the rows where the first entry in October and the last entry of July are selected:
Date Value
...
2012-07-31 61.9443
2011-10-03 48.5939
2011-07-29 65.0364
2010-10-01 57.6022
Any idea how to do that?
You can sort by the date so the rows are in chronological order. After that, create two data frames: one where the month is 7, taking the last record of each group, and one where the month is 10, taking the first record of each group.
Then you can concatenate them.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by='Date')
j = df[df['Date'].dt.month == 7].groupby([df.Date.dt.year, df.Date.dt.month]).last()
o = df[df['Date'].dt.month == 10].groupby([df.Date.dt.year, df.Date.dt.month]).first()
pd.concat([j,o]).reset_index(drop=True)
Output
Date Value
0 2011-07-29 65.0364
1 2012-07-31 61.9443
2 2010-10-01 57.6022
3 2011-10-03 48.5939
Here's a solution which is based on Pandas only:
df = df.sort_values("Date")
october = df.groupby([df["Date"].dt.year, df["Date"].dt.month], as_index = False).first()
october = october[october.Date.dt.month == 10]
july = df.groupby([df["Date"].dt.year, df["Date"].dt.month], as_index = False).last()
july = july[july.Date.dt.month == 7]
pd.concat([july, october])
The result is:
Date Value
2 2011-07-29 65.0364
6 2012-07-31 61.9443
1 2010-10-01 57.6022
5 2011-10-03 48.5939
An elegant solution without groupby, just using the index of the sorted dataframe (note this selects a single July row and a single October row overall, not one per year):
# Sort your data by Date and convert the date strings to datetime
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by='Date')
# For the last July entry, subset the index where the month is 7 and take the last label, i.e. [-1]
jul = df.loc[[df.index[df['Date'].dt.month == 7].tolist()[-1]]]
# For the first October entry, subset the index where the month is 10 and take the first label, i.e. [0]
oct = df.loc[[df.index[df['Date'].dt.month == 10].tolist()[0]]]
# Finally concatenate both
pd.concat([jul, oct]).reset_index(drop=True)
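If you need one row per year, as in the question's expected output, a grouped variant is still required; one compact sketch (not from the original answers, using groupby head/tail):
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
jul = df[df['Date'].dt.month == 7]
jul_last = jul.groupby(jul['Date'].dt.year).tail(1)     # last July row of each year
octo = df[df['Date'].dt.month == 10]
oct_first = octo.groupby(octo['Date'].dt.year).head(1)  # first October row of each year
result = pd.concat([jul_last, oct_first]).sort_values('Date').reset_index(drop=True)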

Grouping by column groups on a data frame in Python pandas

I have a data frame with columns for every month of every year from 2000 to 2016
df.columns
output
Index(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=200)
and I would like to group these columns by quarters.
I made a dictionary, believing the best method would be to use groupby and then aggregate with mean:
m2q = {'2000q1': ['2000-01', '2000-02', '2000-03'],
'2000q2': ['2000-04', '2000-05', '2000-06'],
'2000q3': ['2000-07', '2000-08', '2000-09'],
...
'2016q2': ['2016-04', '2016-05', '2016-06'],
'2016q3': ['2016-07', '2016-08']}
but
df.groupby(m2q)
is not giving me the desired output.
In fact it's giving me an empty grouping.
Any suggestions to make this grouping work?
Or perhaps a more Pythonic solution to categorize by quarters, taking the mean of the specified columns?
You can convert your index to a DatetimeIndex (example 1) or a PeriodIndex (example 2).
Also see the Time Series / Date functionality documentation for more detail.
import numpy as np
import pandas as pd
idx = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12']
df = pd.DataFrame(np.arange(12), index=idx, columns=['SAMPLE_DATA'])
print(df)
SAMPLE_DATA
2000-01 0
2000-02 1
2000-03 2
2000-04 3
2000-05 4
2000-06 5
2000-07 6
2000-08 7
2000-09 8
2000-10 9
2000-11 10
2000-12 11
# Handle your time series data with pandas time series / date functionality
df.index = pd.to_datetime(df.index)
example 1
print(df.resample('Q').sum())
SAMPLE_DATA
2000-03-31 3
2000-06-30 12
2000-09-30 21
2000-12-31 30
example 2
print(df.to_period('Q').groupby(level=0).sum())
SAMPLE_DATA
2000Q1 3
2000Q2 12
2000Q3 21
2000Q4 30
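The examples above put the dates on the row index, but in the question the months are columns. The same idea can be applied along the columns (a sketch, assuming the 'YYYY-MM' column labels shown in the question):
import numpy as np
import pandas as pd

# Toy frame with the question's layout: one column per month
cols = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06']
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

# Map each 'YYYY-MM' label to its quarter, then average the columns within each quarter
quarters = pd.PeriodIndex(df.columns, freq='Q')  # '2000-01' -> Period('2000Q1')
df_q = df.groupby(quarters, axis=1).mean()
# note: groupby(axis=1) is deprecated in pandas >= 2.1; df.T.groupby(quarters).mean().T is equivalent
print(df_q)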
