Pandas/Python: pulling end-of-month rows from a dataframe into a separate dataframe

Currently I have a time series data frame as follows:
dfMain =
Date Portfolio Value
0 2016-07-01 1.000000e+06
1 2016-07-08 1.025168e+06
2 2016-07-15 1.028053e+06
3 2016-07-22 1.024184e+06
4 2016-07-29 1.022491e+06
5 2016-08-05 1.023241e+06
6 2016-08-12 1.030325e+06
7 2016-08-19 1.032742e+06
8 2016-08-26 1.032567e+06
9 2016-09-02 1.028614e+06
10 2016-09-09 9.930876e+05
11 2016-09-16 9.956875e+05
12 2016-09-23 1.010174e+06
13 2016-09-30 1.010388e+06
14 2016-10-07 1.004989e+06
15 2016-10-14 9.924929e+05
16 2016-10-21 9.969708e+05
17 2016-10-28 9.816373e+05
18 2016-11-04 9.563689e+05
19 2016-11-11 9.869579e+05
20 2016-11-18 9.936929e+05
21 2016-11-25 1.009625e+06
Given that the dataframe can differ (I can't just pull specific rows from the example), what would be the best way to pull the rows whose dates are closest to the end of each month from the dataframe? For example, index 4 would be pulled because that is the date closest to the end of the month.
Any tips would be greatly appreciated!

Group on the month number and find the last record:
df.Date = pd.to_datetime(df.Date, errors='coerce')
df.groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
If rows aren't sorted by Date, call sort_values first:
df.sort_values('Date').groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
Should work in any case.
If you have dates spanning multiple years, it is better to group by both year and month:
df.sort_values('Date').groupby([df.Date.dt.year, df.Date.dt.month]).last()
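For reference, a minimal self-contained sketch of the same idea that also handles multiple years, grouping on a monthly period instead of a separate year/month pair (the sample rows below are loosely based on the question, with one 2017 row added to show the multi-year case):
import pandas as pd
df = pd.DataFrame({'Date': ['2016-07-01', '2016-07-29', '2016-08-26', '2016-09-30', '2017-01-27'],
                   'Portfolio Value': [1.000000e+06, 1.022491e+06, 1.032567e+06, 1.010388e+06, 9.816373e+05]})
df['Date'] = pd.to_datetime(df['Date'])
# group on a year-month period and keep the latest-dated row of each month
month_end = df.sort_values('Date').groupby(df['Date'].dt.to_period('M')).last()
print(month_end)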

You need to sort the dates and then find the last value for each group.
df['Date'] = pd.to_datetime(df['Date'])
grp = df.sort_values('Date').groupby(df['Date'].dt.month)
pd.DataFrame([grp.get_group(x).iloc[-1] for x in grp.groups])
Output:
Date Portfolio Value
4 2016-07-29 1022491.0
8 2016-08-26 1032567.0
13 2016-09-30 1010388.0
17 2016-10-28 981637.3
21 2016-11-25 1009625.0
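A hedged alternative to the list comprehension (not the answer's exact code): tail(1) on the sorted groupby returns the same rows and keeps their original index, matching the output above:
df['Date'] = pd.to_datetime(df['Date'])
last_per_month = df.sort_values('Date').groupby(df['Date'].dt.month).tail(1)
print(last_per_month)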

Related

Issues with date format in pandas

I am working with a dataset that contains dates in D-M-Y (day-first) format rather than the American M-D-Y format.
When I load the dataset into a Pandas data frame and change the column type to the date format the dates get messed up.
Example: in the data set the first date is written as (11/04/2015), which means the 11th of April 2015. But when I convert to DateTime and sort the data frame by date, the first date becomes (01/08/2015), which is incorrect. How can I change the column to DateTime without this mix-up?
dataset example :
IDX_CUSTOMER_ITEM_CODE IDX_COMPANY QtySold TotalOnHand Date
0 131 1 3 26 11/04/2015
1 134 1 3 17 11/04/2015
2 137 1 3 114 11/04/2015
3 140 1 3 18 11/04/2015
4 179 1 1 21 11/04/2015
... ... ... ... ... ...
1048570 1059 10 0 23 04/03/2017
1048571 1075 10 3 14 04/03/2017
1048572 2135 10 2 4 04/03/2017
1048573 1035 10 2 3 04/03/2017
1048574 1038 10 0 5 04/03/2017
The first date is the 11th of April 2015 and the last is the 4th of March 2017.
When I do:
transactions['Date'] = pd.to_datetime(transactions['Date'])
The oldest date becomes 01/08/2015 and the latest 31/12/2016, which is incorrect. So I tried:
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%dd-%mm-%yy')
Got the following error:
time data '11/04/2015' does not match format '%dd-%mm-%yy' (match)
You can also use the dayfirst parameter:
pd.to_datetime(df['Date'], dayfirst=True)
Output:
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Your format is wrong. You can refer to the Python strftime reference for the meaning of the % codes.
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%d/%m/%Y')
print(transactions['Date'])
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Name: Date, dtype: datetime64[ns]
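A self-contained sketch, using a tiny frame modeled on the question's data, showing that the two approaches (dayfirst and an explicit format) agree:
import pandas as pd
transactions = pd.DataFrame({'Date': ['11/04/2015', '04/03/2017']})  # day-first strings
parsed_dayfirst = pd.to_datetime(transactions['Date'], dayfirst=True)
parsed_format = pd.to_datetime(transactions['Date'], format='%d/%m/%Y')
print(parsed_dayfirst.equals(parsed_format))  # True
print(parsed_dayfirst.tolist())  # [Timestamp('2015-04-11 00:00:00'), Timestamp('2017-03-04 00:00:00')]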

Pandas, sort by month and create new column for each year

I have a dataframe that looks like this:
date data
2013-09-03 10
2013-09-04 9
2013-10-03 14
2014-09-02 13
2015-08-07 12
2016-09-02 17
I then apply the code below to select only month 9
import pandas as pd
import datetime as dt
df = df[df['date'].dt.month == 9]  # select only the 9th month (the column is named 'date' in the sample)
This gets me the following:
date data
2013-09-03 10
2013-09-04 9
2014-09-02 13
2016-09-02 17
But what I am trying to create is a separate column for each year in which the 9th month appears, like this:
date data 2013 2014 2016
2013-09-03 10 10
2013-09-04 9 9
2014-09-07 13 13
2016-09-08 17 17
I think I have to use the dt.year function in a for loop to create a column for each year, but I think there may be a simpler solution in pandas?
You can try crosstab:
s = pd.crosstab(index=df.index,columns=df.date.dt.year,values=df.data,aggfunc='sum').fillna('')
df = df.join(s)
df
Out[45]:
date data 2013 2014 2016
0 2013-09-03 10 10
1 2013-09-04 9 9
2 2014-09-02 13 13
3 2016-09-02 17 17
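For completeness, a runnable sketch of the same crosstab idea with the sample data built inline (column names follow the question; filling NaN with an empty string is purely cosmetic):
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2013-09-03', '2013-09-04', '2014-09-02', '2016-09-02']),
                   'data': [10, 9, 13, 17]})
# one column per year, holding each row's value only under its own year
s = pd.crosstab(index=df.index, columns=df.date.dt.year, values=df.data, aggfunc='sum').fillna('')
df = df.join(s)
print(df)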

How do you get the sum of miles using a dataframe?

index date miles
0 7/8/2015 14:00:00 10
1 7/8/2015 15:00:01 2
2 7/8/2015 16:00:01 5
3 7/9/2015 09:00:02 12
4 7/10/2015 12:00:00 4
5 7/11/2015 11:00:00 25
6 7/12/2015 04:34:33 10
7 7/12/2015 05:35:35 22
8 7/12/2015 23:11:11 14
9 7/13/2015 01:00:23 10
10 7/13/2015 03:00:03 2
I want to turn this table into the following:
7/8/2015 17
7/9/2015 12
7/10/2015 4
7/11/2015 25
7/12/2015 46
7/13/2015 12
How can I make something like this in Python? Group by date to get the sum of miles for each day.
If you are asking for a solution that adds up the miles of the same day onto one line: one way is to go through all of the dates with a for loop, add the rows that share the same date into a running total, and then print each line. For example:
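A minimal sketch of that loop idea in plain Python (assuming the timestamps only need their time portion stripped before comparing dates):
rows = [('7/8/2015 14:00:00', 10), ('7/8/2015 15:00:01', 2),
        ('7/8/2015 16:00:01', 5), ('7/9/2015 09:00:02', 12)]
totals = {}
for timestamp, miles in rows:
    day = timestamp.split(' ')[0]             # keep only the date part
    totals[day] = totals.get(day, 0) + miles  # accumulate miles for the same day
for day, total in totals.items():
    print(day, total)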
Using resample:
df.set_index('date', inplace=True)
ddf = df.resample('1D').sum()
resample needs a datetime index, so you need to set the index to 'date' before.
If df is your sample input, ddf will look like:
miles
date
2015-07-08 17
2015-07-09 12
2015-07-10 4
2015-07-11 25
2015-07-12 46
2015-07-13 12
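Putting it together with the string dates from the question (they need converting to datetime first, since resample works on a datetime index):
import pandas as pd
df = pd.DataFrame({'date': ['7/8/2015 14:00:00', '7/8/2015 15:00:01', '7/9/2015 09:00:02'],
                   'miles': [10, 2, 12]})
df['date'] = pd.to_datetime(df['date'])
ddf = df.set_index('date').resample('1D').sum()
print(ddf)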
As @Valentino mentioned:
import pandas
data = {
'date': ['7/8/2015 14:00:00', '7/8/2015 14:00:00', '7/8/2015 14:00:00', '7/9/2015 14:00:00'],
'miles': [10, 2, 5, 12]
}
df = pandas.DataFrame(data)
df['date'] = pandas.to_datetime(df.date)
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
Out:
date miles
0 7/8/2015 10
1 7/8/2015 2
2 7/8/2015 5
3 7/9/2015 12
print(df.groupby('date').sum())
Out:
date miles
7/8/2015 17
7/9/2015 12

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would just be to create a new column that returns the 2019 percentage change for each corresponding "tdoy" (Trading Day of Year) using df.loc; if I could figure this much out, I could then create yet another column with the simple difference between each year/day's percentage change and 2019's respective value. Below is what I tried (among other variations) to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.
The first step is to import the CSV properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Unnamed: 0 Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all years to be compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Ignoring the year filter this time, create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Loop over the year columns and create a new field comparing each year to 2019:
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1
I also came up with my own answer, more along the lines of what I was originally trying to accomplish. Here is the DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then I map these keys onto the tdoy column in the original DataFrame to create a column titled 2019 that holds the corresponding 2019 percentage change for that trading day:
df['2019'] = df['tdoy'].map(perc19)
and then create a vs2019 column where I find the difference of 2019 vs. perc and square it, yielding:
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
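One expression consistent with the vs2019 values shown (a hedged reconstruction, not necessarily the original code) is a squared relative difference:
df['vs2019'] = (df['perc'] / df['2019'] - 1) ** 2  # assumed reconstruction of the vs2019 column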
From here I can groupby in various ways and further calculate to find most similar trending percentage changes vs. the year I am comparing against (2019).

How long are Pandas groupby objects remembered?

I have the following example Python 3.4 script. It does the following:
creates a dataframe,
converts the date variable to datetime64 format,
creates a groupby object based on two categorical variables,
produces a dataframe that contains a count of the number items in each group,
merges count dataframe back with original dataframe to create a column containing the number of rows in each group
creates a column containing the difference in dates between sequential rows.
Here is the script:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
This script produces the following output:
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 243 days 06:18:00
2 old 2015-06-04 12:34:00 female 2 3 NaT
3 old 2015-09-04 23:03:00 female 3 3 92 days 10:29:00
4 old 2015-04-21 12:59:00 female 6 3 -137 days +13:56:00
5 old 2015-12-04 01:00:00 male 4 6 NaT
6 old 2015-04-15 07:12:00 male 5 6 -233 days +06:12:00
7 old 2015-06-05 11:12:00 male 9 6 51 days 04:00:00
8 old 2015-05-19 19:22:00 male 12 6 -17 days +08:10:00
9 old 2015-04-06 12:57:00 male 15 6 -44 days +17:35:00
10 old 2015-06-15 03:23:00 male 17 6 69 days 14:26:00
11 young 2015-12-05 14:19:00 female 11 4 NaT
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 163 days 18:28:00
And this is exactly what I'd expect. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, it seems to lead to a very different output in the diff column:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
# ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT *****
# tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
And this time the output is very different (and NOT what I wanted at all):
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 NaT
2 old 2015-06-04 12:34:00 female 2 3 92 days 10:29:00
3 old 2015-09-04 23:03:00 female 3 3 NaT
4 old 2015-04-21 12:59:00 female 6 3 -233 days +06:12:00
5 old 2015-12-04 01:00:00 male 4 6 -137 days +13:56:00
6 old 2015-04-15 07:12:00 male 5 6 NaT
7 old 2015-06-05 11:12:00 male 9 6 NaT
8 old 2015-05-19 19:22:00 male 12 6 51 days 04:00:00
9 old 2015-04-06 12:57:00 male 15 6 243 days 06:18:00
10 old 2015-06-15 03:23:00 male 17 6 NaT
11 young 2015-12-05 14:19:00 female 11 4 -17 days +08:10:00
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 -44 days +17:35:00
(In my real-life script the results seem to be a little erratic: sometimes it works and sometimes it doesn't. But in the above script, the different outputs seem to occur consistently.)
Why is it necessary to recreate the groupby object on what is, essentially, the same dataframe (albeit with an additional column added) immediately before using the .diff() function? This seems very dangerous to me.
It is not the same dataframe: the index has changed. For example:
tempDF.loc[1].id  # before the merge
2
tempDF.loc[1].id  # after the merge
10
So if you compute tempGroupby with the old tempDF and then change the indexes in tempDF when you do this:
tempDF['diff'] = tempGroupby['date'].diff()
the indexes do not match as you expect. You are assigning to each row the difference corresponding to the row that had that index in the old tempDF.
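A minimal self-contained illustration of the safe pattern, using a cut-down version of the question's data: rebuild the groupby from the merged frame you are assigning into, so the index labels line up:
import pandas as pd
tempDF = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                       'date': pd.to_datetime(['2015-04-02', '2015-04-06', '2015-04-09', '2015-04-12', '2015-04-15']),
                       'gender': ['male', 'female', 'female', 'male', 'male'],
                       'age': ['young', 'old', 'old', 'old', 'old']})
counts = tempDF.groupby(['gender', 'age'])['id'].count().rename('count').reset_index()
tempDF = tempDF.merge(counts, on=['gender', 'age'])  # merge builds a new frame, so old index labels no longer apply
# rebuild the groupby from the merged frame so .diff() aligns with its index
grouped = tempDF.sort_values(['gender', 'age', 'id']).groupby(['gender', 'age'])
tempDF['diff'] = grouped['date'].diff()
print(tempDF)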
