I have this dataset, which has year, month, week and sales numbers:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2012,2012,2012]
df['month'] = [12,12,12,1,1,1]
df['week'] = [51,52,53,1,2,3]
df['sales'] = [10000,12000,11000,5000,12000,11000]
df['date_ix'] = df['year'] * 1000 + (df['week']-1) * 10 + 1
df['date_week'] = pd.to_datetime(df['date_ix'], format='%Y%W%w')
df
year month week sales date_ix date_week
0 2011 12 51 10000 2011501 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26
4 2012 1 2 12000 2012011 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09
Now date_week is the first day of the week (Monday). I want to keep date_week as the start date, except for the first week of the year, where I want to use the actual first day of the year instead (in this case 2012-01-01, which was a Sunday). I have tried this, but something's wrong.
df['date_start'] = np.where((df['year']==2012) & (df['week']==1),
                            pd.to_datetime(str(20120101), format='%Y%m%d'),
                            pd.to_datetime(df['date_ix'], format='%Y%W%w'))
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 1323648000000000000
1 2011 12 52 12000 2011511 2011-12-19 1324252800000000000
2 2011 12 53 11000 2011521 2011-12-26 1324857600000000000
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01 00:00:00
4 2012 1 2 12000 2012011 2012-01-02 1325462400000000000
5 2012 1 3 11000 2012021 2012-01-09 1326067200000000000
The expected result should be:
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01
4 2012 1 2 12000 2012011 2012-01-02 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09 2012-01-09
Please, any help will be greatly appreciated.
You need to enclose df['year']==2012 and df['week']==1 in parentheses because & binds more tightly than ==.
df['date_start'] = np.where((df['year']==2012) & (df['week']==1),
                            pd.to_datetime(str(20120101), format='%Y%m%d'),
                            pd.to_datetime(df['date_ix'], format='%Y%W%w'))
Then change pd.to_datetime(str(20120101), format='%Y%m%d') in np.where to pd.to_datetime(df['year'], format='%Y'), and use the existing df['date_week'] as the false branch, so that both branches are datetime64 Series and np.where no longer coerces the result to integer nanoseconds:
df['date_start'] = np.where((df['year']==2012) & (df['week']==1),
                            pd.to_datetime(df['year'], format='%Y'),
                            df['date_week'])
print(df)
year month week sales date_ix date_week date_start
0 2011 12 51 10000 2011501 2011-12-12 2011-12-12
1 2011 12 52 12000 2011511 2011-12-19 2011-12-19
2 2011 12 53 11000 2011521 2011-12-26 2011-12-26
3 2012 1 1 5000 2012001 2011-12-26 2012-01-01
4 2012 1 2 12000 2012011 2012-01-02 2012-01-02
5 2012 1 3 11000 2012021 2012-01-09 2012-01-09
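As a side note, np.where falls back to an object/integer result when its two branches have mixed types, which is exactly the symptom above. A minimal sketch (reusing the df built in the question) that stays in pandas via Series.mask and therefore keeps the datetime64 dtype:
import pandas as pd

# Series.mask keeps datetime64 dtype, unlike np.where with mixed branch types
first_week = (df['year'] == 2012) & (df['week'] == 1)
df['date_start'] = df['date_week'].mask(first_week,
                                        pd.to_datetime(df['year'], format='%Y'))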
What about this?
df['date_start'] = pd.to_datetime(df.week.astype(str) +
                                  df.year.astype(str).add('-1'), format='%V%G-%u')
This will give date_start as the date of the Monday of the week of interest.
(Note that there is a one-week shift relative to your current date_start; you might want to add a one-week timedelta to compensate for it.)
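To sanity-check the format string, here is a tiny example; this assumes Python/pandas versions where to_datetime supports the ISO directives %G, %V and %u:
import pandas as pd

# ISO week 1 of ISO year 2012, Monday -> 2012-01-02
print(pd.to_datetime('1-2012-1', format='%V-%G-%u'))  # 2012-01-02 00:00:00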
I have data files containing year, day of year (DOY), hour and minutes, as follows:
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts
0 300234065718160 2019 7 0 216.2920 216.2920 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.3750 216.3750 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.4170 216.4170 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.4580 216.4580 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.5000 216.5000 58.561 -23.910 14.60
In order to make my datetime, I used:
dt_raw = pd.to_datetime(df_buoy['Year'] * 1000 + df_buoy['DOY'], format='%Y%j')
# convert the timestamps to datetime.date objects
dt_buoy = [d.date() for d in dt_raw]
date = datetime.datetime.combine(dt_buoy[0], datetime.time(df_buoy.Hour[0], df_buoy.Min[0]))
My problem arises when the hours are not int, but float instead. For example:
BuoyID Year Hour Min DOY POS_DOY Lat Lon BP Ts
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792 1016.9 -0.01
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826 1016.8 3.36
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856 1016.8 3.28
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876 1016.8 3.22
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894 1016.8 3.18
What I tried to do was to convert the hours to str, take the first two characters to obtain the hour, then subtract this from 'Hour' and multiply by 60 to get the minutes.
int_hour = [(int(str(i)[0:2])) for i in df_buoy.Hour]
minutes = map(lambda x, y: (x - y)*60, df_buoy.Hour, int_hour)
But, of course, if you have '0.' as your hour, Python will complain:
ValueError: invalid literal for int() with base 10: '0.'
My question is: does anyone know a simple way to convert year, DOY, hour (either int or float) and minutes to datetime?
Use to_timedelta to convert the hours column and add it to the datetimes; this works with both integers and floats:
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
           pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts \
0 300234065718160 2019 7 0 216.292 216.292 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.375 216.375 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.417 216.417 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.458 216.458 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.500 216.500 58.561 -23.910 14.60
d
0 2019-08-04 07:00:00
1 2019-08-04 09:00:00
2 2019-08-04 10:00:00
3 2019-08-04 11:00:00
4 2019-08-04 12:00:00
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
           pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon \
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894
BP Ts d
0 1016.9 -0.01 2014-08-14 23:19:48
1 1016.8 3.36 2014-08-14 23:30:00
2 1016.8 3.28 2014-08-14 23:40:12
3 1016.8 3.22 2014-08-14 23:49:48
4 1016.8 3.18 2014-08-15 00:00:00
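Note that in this data the fractional part of DOY already encodes the time of day (e.g. 0.972 days ≈ 23.33 h), so Hour is redundant here. A sketch (assuming the same df) that derives the timestamp from DOY alone:
import pandas as pd

# build the date from the integer day-of-year, then add the fractional
# day (which already carries the hour) as a timedelta
doy_int = df['DOY'].astype(int)
df['d'] = (pd.to_datetime(df['Year'] * 1000 + doy_int, format='%Y%j')
           + pd.to_timedelta(df['DOY'] - doy_int, unit='D'))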
Code to generate a random DataFrame for the question (minimal reproducible example):
df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015',
                                        periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random
Output df_random
0 1 2 order_date customer_id
0 0.018473 0.970257 0.605428 2015-01-01 12
... ... ... ... ... ...
1999 0.800139 0.746605 0.551530 2020-06-22 11
Code to extract the mean number of days between transactions, month- and year-wise:
for y in (2015,2019):
    for x in (1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df2.sort_values(['customer_id','order_date'], inplace=True)
        df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
        df_mean = round(df2['days'].mean(), 2)
        data2 = data.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Expected output
Mean Month Year
0 5.00 1 2015
.......................
11 6.62 12 2015
... (rows 12-35: mean days between transactions for 2016 and 2017, Jan to Dec) ...
36 6.03 1 2018
..........................
47 6.76 12 2018
48 8.40 1 2019
.......................
59 8.40 12 2019
Basically I want a single dataframe covering January 2015 through December 2019.
Instead of the expected output, I am getting a dataframe from Jan 2015 to Dec 2018, then Jan 2015 data again, and then the entire dataset repeats from 2015 to 2018 many more times.
Please help
Try this:
data2 = pd.DataFrame([])
for y in range(2015,2020):
    for x in range(1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df_mean = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
        data2 = data2.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
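One caveat: DataFrame.append was removed in pandas 2.0. A sketch of the same loop that collects plain dicts and builds the frame once at the end (df_random as in the question):
import pandas as pd

rows = []
for y in range(2015, 2020):
    for x in range(1, 13):
        df2 = df_random[(df_random['order_date'].dt.month == x)
                        & (df_random['order_date'].dt.year == y)]
        # per-customer gap in days between consecutive orders
        days = (df2.sort_values(['customer_id', 'order_date'])
                   .groupby('customer_id')['order_date']
                   .diff().dt.days)
        rows.append({'Mean': round(days.mean(), 2), 'Month': x, 'Year': y})
data2 = pd.DataFrame(rows)
print(data2)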
Try this:
df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output
Output
0 1 2 month year
0 0.494818 0.476514 0.496059 1 2015
1 0.451611 0.437638 0.536607 2 2015
2 0.476262 0.567519 0.528129 3 2015
3 0.519229 0.475887 0.612433 4 2015
4 0.464781 0.430593 0.445455 5 2015
... ... ... ... ... ...
61 0.416540 0.564928 0.444234 2 2020
62 0.553787 0.423576 0.422580 3 2020
63 0.524872 0.470346 0.560194 4 2020
64 0.530440 0.469957 0.566077 5 2020
65 0.584474 0.487195 0.557567 6 2020
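The same grouping can also be done without setting the index, via Grouper's key argument. A sketch assuming pandas 2.2+, where the month-end frequency alias is "ME" (older versions spell it "M"):
# group directly on the order_date column of the original df_random
output = (df_random.groupby(pd.Grouper(key='order_date', freq='ME'))[[0, 1, 2]]
          .mean()
          .reset_index())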
Avoid any looping and simply include year and month in the groupby calculation:
np.random.seed(1022020)
...
# ASSIGN MONTH AND YEAR COLUMNS, THEN SORT COLUMNS
df_random = (df_random.assign(month=lambda x: x['order_date'].dt.month,
                              year=lambda x: x['order_date'].dt.year)
                      .sort_values(['customer_id', 'order_date']))
# GROUP BY CALCULATION
df_random["days"] = (df_random.groupby(["customer_id", "year", "month"])["order_date"]
                              .apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")))
# FINAL MEAN AGGREGATION BY YEAR AND MONTH
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"].mean().round(2)
                     .rename(columns={"days": "mean"}))
print(final_df.head())
# year month mean
# 0 2015 1 8.43
# 1 2015 2 5.87
# 2 2015 3 4.88
# 3 2015 4 10.43
# 4 2015 5 8.12
print(final_df.tail())
# year month mean
# 61 2020 2 8.27
# 62 2020 3 8.41
# 63 2020 4 8.81
# 64 2020 5 9.12
# 65 2020 6 7.00
For multiple aggregates, replace the single groupby.mean() with groupby.agg():
final_df = (df_random.groupby(["year", "month"])["days"]
                     .agg(['count', 'min', 'mean', 'median', 'max']))
print(final_df.head())
# count min mean median max
# year month
# 2015 1 14 1.0 8.43 5.0 25.0
# 2 15 1.0 5.87 5.0 17.0
# 3 16 1.0 4.88 5.0 9.0
# 4 14 1.0 10.43 7.5 23.0
# 5 17 2.0 8.12 8.0 17.0
print(final_df.tail())
# count min mean median max
# year month
# 2020 2 15 1.0 8.27 6.0 21.0
# 3 17 1.0 8.41 7.0 16.0
# 4 16 1.0 8.81 7.0 20.0
# 5 16 1.0 9.12 7.0 22.0
# 6 7 2.0 7.00 7.0 17.0
I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would be to create a new column that returns the 2019 percentage change for each corresponding "tdoy" (trading day of year) using df.loc; if I could figure that much out, I could then create another column with the simple difference between that year/day's percentage change and 2019's respective value. Below is what I tried (among other variations), to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.
The first step is to import the CSV properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Unnamed: 0 Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all years compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Ignoring the year filter above, create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Loop through the year columns, creating a new field for each year compared to 2019:
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019] - df1[y]) / df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019] - df1[y]) / df1[y]
df1
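For what it's worth, the loop can be vectorised. A sketch, assuming df1 is the fresh year-by-tdoy pivot table from above (before any _pct_change columns have been added):
# (df1[2019] - df1[y]) / df1[y] equals df1[2019] / df1[y] - 1, column-wise
pct = df1.rdiv(df1[2019], axis=0).sub(1).add_suffix('_pct_change')
df1 = df1.join(pct)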
I also came up with my own answer, more along the lines of what I was originally trying to accomplish. Here is the DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then I mapped this dictionary onto the tdoy column in the original DataFrame to create a column titled 2019 that has the corresponding 2019 percentage change value for that trading day
df['2019'] = df['tdoy'].map(perc19)
and then created a vs2019 column holding the squared relative difference between perc and the 2019 value, yielding
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
From here I can groupby in various ways and further calculate to find most similar trending percentage changes vs. the year I am comparing against (2019).
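For reference, the vs2019 numbers in the table above are consistent with a squared relative difference rather than a plain squared difference; a one-line sketch of that step:
# squared relative difference of each day's perc against its 2019 counterpart
df['vs2019'] = (df['perc'] / df['2019'] - 1) ** 2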
I have a pandas column like this:
yrmnt
--------
2015 03
2015 03
2013 08
2015 08
2014 09
2015 10
2016 02
2015 11
2015 11
2015 11
2017 02
How do I fetch the lowest year-month combination (2013 08) and the highest (2017 02),
and find the difference in months between these two, i.e. 40?
You can convert the column with to_datetime and then find the indices of the max and min values with idxmax and idxmin:
a = pd.to_datetime(df['yrmnt'], format='%Y %m')
print (a)
0 2015-03-01
1 2015-03-01
2 2013-08-01
3 2015-08-01
4 2014-09-01
5 2015-10-01
6 2016-02-01
7 2015-11-01
8 2015-11-01
9 2015-11-01
10 2017-02-01
Name: yrmnt, dtype: datetime64[ns]
print (df.loc[a.idxmax(), 'yrmnt'])
2017 02
print (df.loc[a.idxmin(), 'yrmnt'])
2013 08
Difference in months:
b = a.dt.to_period('M')
d = b.max() - b.min()
print (d)
42
Another solution works only with the month periods created by Series.dt.to_period:
b = pd.to_datetime(df['yrmnt'], format='%Y %m').dt.to_period('M')
print (b)
0 2015-03
1 2015-03
2 2013-08
3 2015-08
4 2014-09
5 2015-10
6 2016-02
7 2015-11
8 2015-11
9 2015-11
10 2017-02
Name: yrmnt, dtype: object
Then convert the minimal and maximal values to a custom format with Period.strftime:
min_d = b.min().strftime('%Y %m')
print (min_d)
2013 08
max_d = b.max().strftime('%Y %m')
print (max_d)
2017 02
And subtract for difference:
d = b.max() - b.min()
print (d)
42
I have a data frame with a datetime index, and I would like to multiply some columns by the number of days in that month.
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 60 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
Here, I would like to multiply all the columns starting with t by 31. That is, the expected output is
TUFNWGTP TELFS t070101 t070102 t070103 t070104
TUDIARYDATE
2003-01-03 8155462.672158 2 0 0 0 0
2003-01-04 1735322.527819 1 0 0 0 0
2003-01-04 3830527.482672 2 1860 0 0 0
2003-01-02 6622022.995205 4 0 0 0 0
2003-01-09 3068387.344956 1 0 0 0 0
I know there are ways to do this using calendar or similar, but since I'm already using pandas, I assume there must be an easier way.
There is no such datetime property, and while there is an M offset, I don't see how to use it without massive inefficiency.
There is now a Series.dt.days_in_month attribute for datetime series. Here is an example based on Jeff's answer.
In [3]: df = pd.DataFrame({'date': pd.date_range('20120101', periods=15, freq='M')})
In [4]: df['year'] = df['date'].dt.year
In [5]: df['month'] = df['date'].dt.month
In [6]: df['days_in_month'] = df['date'].dt.days_in_month
In [7]: df
Out[7]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
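Applied to the original question, a sketch (column names assumed from the frame in the question) that scales the t-columns by each row's month length:
import pandas as pd

# select the t* columns and multiply row-wise by each row's days-in-month
tcols = df.filter(regex='^t0').columns
dim = pd.to_datetime(df.index).days_in_month.values
df[tcols] = df[tcols].mul(dim, axis=0)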
pd.tslib.monthrange is an unadvertised/undocumented function that handles the days_in_month calculation (adjusting for leap years). This could/should probably be added as a property on Timestamp/DatetimeIndex.
In [34]: df = DataFrame({'date' : pd.date_range('20120101',periods=15,freq='M') })
In [35]: df['year'] = df['date'].dt.year
In [36]: df['month'] = df['date'].dt.month
In [37]: df['days_in_month'] = df.apply(lambda x: pd.tslib.monthrange(x['year'],x['month'])[1], axis=1)
In [38]: df
Out[38]:
date year month days_in_month
0 2012-01-31 2012 1 31
1 2012-02-29 2012 2 29
2 2012-03-31 2012 3 31
3 2012-04-30 2012 4 30
4 2012-05-31 2012 5 31
5 2012-06-30 2012 6 30
6 2012-07-31 2012 7 31
7 2012-08-31 2012 8 31
8 2012-09-30 2012 9 30
9 2012-10-31 2012 10 31
10 2012-11-30 2012 11 30
11 2012-12-31 2012 12 31
12 2013-01-31 2013 1 31
13 2013-02-28 2013 2 28
14 2013-03-31 2013 3 31
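(A historical note: pd.tslib has long since been removed; the stdlib equivalent is calendar.monthrange, and the suggested property did land as dt.days_in_month, as the previous answer shows.)
import calendar

# monthrange returns (weekday_of_first_day, days_in_month)
print(calendar.monthrange(2012, 2)[1])  # 29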
Here is a slightly clunky hand-made method to get the number of days in a month:
import datetime
def days_in_month(dt):
    # first day of the following month; integer division handles December
    next_month = datetime.datetime(dt.year + dt.month // 12, dt.month % 12 + 1, 1)
    start_month = datetime.datetime(dt.year, dt.month, 1)
    td = next_month - start_month
    return td.days
For example:
>>> days_in_month(datetime.datetime.strptime('2013-12-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-02-12', '%Y-%m-%d'))
28
>>> days_in_month(datetime.datetime.strptime('2012-02-12', '%Y-%m-%d'))
29
>>> days_in_month(datetime.datetime.strptime('2012-01-12', '%Y-%m-%d'))
31
>>> days_in_month(datetime.datetime.strptime('2013-11-12', '%Y-%m-%d'))
30
I'll let you figure out how to read your table and do the multiplication yourself :)
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df['dim'] = (pd.to_datetime(df.index) + MonthEnd(0)).day
You can omit pd.to_datetime() if your index is already a DatetimeIndex.
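To see why this works: MonthEnd(0) rolls each date forward to the end of its own month (dates already at a month end are left alone), and .day then reads off the month length. A quick check:
import pandas as pd
from pandas.tseries.offsets import MonthEnd

idx = pd.to_datetime(['2003-01-03', '2004-02-10'])
print((idx + MonthEnd(0)).day)  # [31, 29] -- January 2003, leap February 2004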