I am trying to understand the rolling function in pandas. Here is my example code:
# importing pandas as pd
import pandas as pd
# By default the "date" column was in string format,
# we need to convert it into date-time format
# parse_dates =["date"], converts the "date" column to date-time format
# Resampling works with time-series data only
# so convert "date" column to index
# index_col ="date", makes the "date" column the index
df = pd.read_csv("apple.csv", parse_dates = ["date"], index_col = "date")
print (df.close.rolling(3).sum())
print (df.close.rolling(3, win_type ='triang').sum())
The CSV input file has 255 entries, but I only get a few entries in the output, with "..." between 2018-10-04 and 2017-12-26. I verified the input file; it has many more valid entries between these dates.
date
2018-11-14 NaN
2018-11-13 NaN
2018-11-12 578.63
2018-11-09 590.87
2018-11-08 607.13
2018-11-07 622.91
2018-11-06 622.21
2018-11-05 615.31
2018-11-02 612.84
2018-11-01 631.29
2018-10-31 648.56
2018-10-30 654.38
2018-10-29 644.40
2018-10-26 641.84
2018-10-25 648.34
2018-10-24 651.19
2018-10-23 657.62
2018-10-22 658.47
2018-10-19 662.69
2018-10-18 655.98
2018-10-17 656.52
2018-10-16 659.36
2018-10-15 660.70
2018-10-12 661.62
2018-10-11 653.92
2018-10-10 652.92
2018-10-09 657.68
2018-10-08 667.00
2018-10-05 674.93
2018-10-04 676.05
...
2017-12-26 512.25
2017-12-22 516.18
2017-12-21 520.59
2017-12-20 524.37
2017-12-19 523.90
2017-12-18 525.31
2017-12-15 524.93
2017-12-14 522.61
2017-12-13 518.46
2017-12-12 516.19
2017-12-11 516.64
2017-12-08 513.74
2017-12-07 511.36
2017-12-06 507.70
2017-12-05 507.97
2017-12-04 508.45
2017-12-01 510.49
2017-11-30 512.70
2017-11-29 512.38
2017-11-28 514.40
2017-11-27 516.64
2017-11-24 522.13
2017-11-22 524.02
2017-11-21 523.07
2017-11-20 518.08
2017-11-17 513.27
2017-11-16 511.23
2017-11-15 510.33
2017-11-14 511.52
2017-11-13 514.39
Name: close, Length: 254, dtype: float64
thank you for your help ...
... just means that pandas isn't showing you all the rows, that's where the 'missing' ones are.
To display all rows:
with pd.option_context("display.max_rows", None):
    print(df.close.rolling(3, win_type='triang').sum())
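As a small self-contained illustration of the same idea, here is a sketch that uses a made-up 100-row series in place of apple.csv (the series and variable names are assumptions, not the asker's data). It compares the default truncated display with two ways of getting every row: the option_context block and Series.to_string().

```python
import pandas as pd

# Hypothetical stand-in for df.close: 100 rows is more than the default row limit
s = pd.Series(range(100), name="close").rolling(3).sum()

# Default display: pandas elides the middle rows
truncated = repr(s)

# Option 1: lift the row limit only inside the with-block
with pd.option_context("display.max_rows", None):
    full_repr = repr(s)

# Option 2: build the complete text directly, independent of display options
full_text = s.to_string()

print(len(truncated.splitlines()), len(full_repr.splitlines()), len(full_text.splitlines()))
```

Because option_context restores the previous setting when the block exits, later prints go back to the truncated form.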
I have a dataframe with 3 columns:
import glob
import pandas as pd
file = glob.glob('InputFile.csv')
for i in file:
    df = pd.read_csv(i)
    df['Date'] = pd.to_datetime(df['Date'])
    print(df)
Date X Y
0 2020-02-13 00:11:59 -91.3900 -31.7914
1 2020-02-13 01:11:59 -87.1513 -34.6838
2 2020-02-13 02:11:59 -82.9126 -37.5762
3 2020-02-13 03:11:59 -79.3558 -40.2573
4 2020-02-13 04:11:59 -73.2293 -44.2463
... ... ... ...
2034 2020-05-04 18:00:00 -36.4645 -18.3421
2035 2020-05-04 19:00:00 -36.5767 -16.8311
2036 2020-05-04 20:00:00 -36.0170 -14.9356
2037 2020-05-04 21:00:00 -36.4354 -11.0533
2038 2020-05-04 22:00:00 -40.3424 -11.4000
[2039 rows x 3 columns]
print(converted_file.dtypes)
Date datetime64[ns]
xTilt float64
yTilt float64
dtype: object
I would like the output to be:
Date X Y X_Diff Y_Diff
0 2020-02-16 00:11:59 -38.46270 -70.8352 -38.46270 -70.8352
1 2020-02-23 00:11:59 -80.70250 -7.1893 -42.23980 63.6459
2 2020-03-01 00:11:59 -47.38980 -39.2652 33.31270 -32.0759
3 2020-03-08 00:00:00 -35.65350 -64.5058 11.73630 -25.2406
4 2020-03-15 00:00:00 -43.03290 -15.8425 -7.37940 48.6633
5 2020-03-22 00:00:00 -19.77130 -25.5298 23.26160 -9.6873
6 2020-03-29 00:00:00 -13.18940 12.4093 6.58190 37.9391
7 2020-04-05 00:00:00 -8.49098 27.8407 4.69842 15.4314
8 2020-04-12 00:00:00 -19.05360 20.0445 -10.56262 -7.7962
9 2020-04-26 00:00:00 -25.61330 31.6306 -6.55970 11.5861
10 2020-05-03 00:00:00 -46.09250 -30.3557 -20.47920 -61.9863
I would like to search InputFile.csv for all dates that fall on a Sunday and, for each Sunday, extract only the first entry of that day (not the later times), along with the corresponding X and Y values. These rows should go into a new dataframe on which I can do subtraction on X and Y. For the very first row, X and Y are copied into X_Diff and Y_Diff, respectively. For every following row, X_Diff is the current X minus the previous X, and likewise for Y_Diff, until the end of the file.
Here is my solution.
1. Preparation: I will need to generate some random data to be worked on.
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
The data is like this:
Date X Y
0 2020-02-13 00:00:00 -12.044751 165.962038
1 2020-02-13 01:00:00 63.537406 65.137176
2 2020-02-13 02:00:00 67.555256 114.186898
... ... ... ..
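2. Filter the dataframe to keep Sundays only. pandas numbers weekdays from Monday = 0 to Sunday = 6, so Sunday is 6. Then, generate another column holding just the date, for grouping purposes. (The sample output in the following steps lists Mondays such as 2020-02-17 because it was produced with dayofweek == 0.)
df = df[df.Date.dt.dayofweek == 6].copy()  # .copy() avoids a SettingWithCopyWarning on the next line
df['date_only'] = df.Date.dt.date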
Then, it looks like this.
Date X Y date_only
96 2020-02-17 00:00:00 26.632391 120.311315 2020-02-17
97 2020-02-17 01:00:00 -14.111209 21.543440 2020-02-17
98 2020-02-17 02:00:00 -11.941086 -51.303122 2020-02-17
99 2020-02-17 03:00:00 -48.612563 137.023917 2020-02-17
100 2020-02-17 04:00:00 133.843010 -47.168805 2020-02-17
... ... ... ... ...
1796 2020-04-27 20:00:00 -158.310600 30.149292 2020-04-27
1797 2020-04-27 21:00:00 170.212825 181.626611 2020-04-27
1798 2020-04-27 22:00:00 59.773796 11.262186 2020-04-27
1799 2020-04-27 23:00:00 -99.757428 83.529157 2020-04-27
1944 2020-05-04 00:00:00 -168.435315 245.884281 2020-05-04
3. Next step, sort the data frame by "Date". Then, group the dataframe by "date_only". After that, take the first row of each group.
df = df.sort_values(by=['Date'])
df = df.groupby('date_only').apply(lambda g: g.head(1)).reset_index(drop=True).drop(columns=['date_only'])
Results:
Date X Y
0 2020-02-17 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274
2 2020-03-02 -231.596763 -46.989246
3 2020-03-09 76.561269 -40.188202
4 2020-03-16 -18.653363 52.376442
5 2020-03-23 106.758484 22.969963
6 2020-03-30 -133.601545 185.561830
7 2020-04-06 -57.748555 -187.878427
8 2020-04-13 57.648834 10.365917
9 2020-04-20 -47.959093 177.455676
10 2020-04-27 -30.527067 -37.046330
11 2020-05-04 -52.854252 -136.069205
4. Last step, get the difference for each X/Y value with their previous value.
df['X_Diff'] = df.X.diff()
df['Y_Diff'] = df.Y.diff()
Results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 NaN NaN
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
5. If you are not happy with the "NaN" for the first row, then just fill it with the X/Y columns' original values.
df['X_Diff'] = df['X_Diff'].fillna(df.X)
df['Y_Diff'] = df['Y_Diff'].fillna(df.Y)
Final results:
Date X Y X_Diff Y_Diff
0 2020-02-17 4.196690 -205.843619 4.196690 -205.843619
1 2020-02-24 -189.811351 -5.294274 -194.008042 200.549345
2 2020-03-02 -231.596763 -46.989246 -41.785412 -41.694972
3 2020-03-09 76.561269 -40.188202 308.158031 6.801044
4 2020-03-16 -18.653363 52.376442 -95.214632 92.564644
5 2020-03-23 106.758484 22.969963 125.411847 -29.406479
6 2020-03-30 -133.601545 185.561830 -240.360029 162.591867
7 2020-04-06 -57.748555 -187.878427 75.852990 -373.440257
8 2020-04-13 57.648834 10.365917 115.397389 198.244344
9 2020-04-20 -47.959093 177.455676 -105.607927 167.089758
10 2020-04-27 -30.527067 -37.046330 17.432026 -214.502006
11 2020-05-04 -52.854252 -136.069205 -22.327185 -99.022874
Note: There is no time displayed in the "Date" field in the final result. This is because the data I generated for those dates are hourly. So, the first row of each Sunday is XXXX-XX-XX 00:00:00, and the time 00:00:00 will not be displayed in pandas, although they actually exist.
Here is the Colab Link. You can have all my code in a notebook here.
https://colab.research.google.com/drive/1ecSSvJW0waCU19KPoj5uiiYmHp9SSQOf?usp=sharing
I will create a dataframe as Christopher did:
import pandas as pd
import numpy as np
df = pd.date_range('2020-02-13', '2020-05-04', freq='1H').to_frame(name='Date').reset_index(drop=True)
df['X'] = np.random.randn(df.shape[0]) * 100
df['Y'] = np.random.randn(df.shape[0]) * 100
Dataframe view
First, set the datetime column as the index:
df = df.set_index('Date')
Second, keep only the rows that fall on Sundays:
sunday_df= df[df.index.dayofweek == 6]
Third, resample to daily frequency, take the first value of each day (the question asks for the first entry of each Sunday), and drop the empty rows that resampling creates for the non-Sunday days:
sunday_df = sunday_df.resample('D').first().dropna()
Lastly, do the subtraction:
sunday_df['X_Diff'] = sunday_df.X.diff()
sunday_df['Y_Diff'] = sunday_df.Y.diff()
The last view of the new dataframe
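The steps from the two answers can be condensed into one short sketch. It assumes hourly random data as a stand-in for InputFile.csv (the seed, the variable names and the use of duplicated() to keep the first row per day are illustrative choices, not taken from either answer) and fills the first difference with the original X/Y value, as the question requests.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data standing in for InputFile.csv
rng = np.random.default_rng(0)
dates = pd.date_range("2020-02-13", "2020-05-04", freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "Date": dates,
    "X": rng.normal(scale=100, size=len(dates)),
    "Y": rng.normal(scale=100, size=len(dates)),
})

# Keep Sundays only (pandas: Monday = 0 ... Sunday = 6), ordered by time
sundays = df[df["Date"].dt.dayofweek == 6].sort_values("Date")

# Keep the first row of each calendar day: later rows of the same day are duplicates
sundays = sundays[~sundays["Date"].dt.normalize().duplicated()]
result = sundays.reset_index(drop=True)

# Difference to the previous Sunday; the first row falls back to its own value
for col in ["X", "Y"]:
    result[col + "_Diff"] = result[col].diff().fillna(result[col])

print(result)
```

With real data, only the construction of df would change; the filtering and differencing lines stay the same as long as the Date column has a datetime dtype.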
I have a csv file like this and this is the code I wrote to filter the date
example['date_1'] = pd.to_datetime(example['date_1'])
example['date_2'] = pd.to_datetime(example['date_2'])
example
date_1 ID date_2
2015-01-12 111 2016-01-20 08:34:00
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
2015-01-11 1140303 2015-12-15 08:43:00
2015-01-11 1140414 2015-12-15 08:43:00
example[(example['date_1'] <= '2016-11-01')
& (example['date_1'] >= '2015-11-01')
& (example['date_2'] <= '2016-12-16')
& (example['date_2'] >= '2015-12-15')]
Output:
2016-01-11 222 2016-12-15 08:34:00
2016-01-11 7770 2016-12-15 08:34:00
2016-01-10 7881 2016-11-17 08:32:00
2016-01-03 90243 2016-04-14 08:35:00
2016-01-03 90354 2016-04-14 08:35:00
I don't understand why it changes the format of the date. It also seems to mix up the month and day in the dates: with this conditional filter, the expected result should be the same as the original dataset, but several rows were dropped. Can someone help me with this? Many thanks.
Some locales format the date as dd/mm/YYYY, while others use mm/dd/YYYY. By default pandas uses the American format of mm/dd/YYYY unless it can infer the alternate format from the values (when a day number is greater than 12...).
So if you know that your input date format is dd/mm/YYYY, you must tell pandas:
example['date_1'] = pd.to_datetime(example['date_1'], dayfirst=True)
example['date_2'] = pd.to_datetime(example['date_2'], dayfirst=True)
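To make the ambiguity concrete, here is a minimal sketch with a single made-up date string (the value "11/01/2016" is an assumption for illustration, not taken from the asker's file). It shows the default month-first reading, the dayfirst reading, and an explicit format string, which removes any guessing.

```python
import pandas as pd

raw = "11/01/2016"  # ambiguous: 1 November or 11 January?

month_first = pd.to_datetime(raw)                    # default reading: 2016-11-01
day_first = pd.to_datetime(raw, dayfirst=True)       # day-first reading: 2016-01-11
explicit = pd.to_datetime(raw, format="%d/%m/%Y")    # explicit format: 2016-01-11

print(month_first.date(), day_first.date(), explicit.date())
```

When the input format is known in advance, passing format is usually the more predictable choice, since it does not depend on what pandas can infer from the values.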
Once pandas has a Timestamp column, it internally stores the number of nanoseconds since 1970-01-01 00:00 and by default displays it according to ISO 8601, stripping parts that are 0 for the whole column. Those parts can be the full time, the fractional seconds, or the nanoseconds.
You should not care about this if you only want to process the Timestamps. If at the end you want to force a format, explicitly change the column to its string representation:
df['date_1'] = df['date_1'].dt.strftime('%d/%m/%Y %H:%M')
I'm trying to figure out how to process a dataframe representing players in a game. The dataframe has unique users and a record for each day a particular user was active.
I am trying to get the average playtime and average moves for each week in each user's lifetime.
(A week is defined by a user's first record, i.e. if a user's first record is on the 3rd of January, their 1st week starts then and their 2nd week starts on the 10th of January.)
Example
userid date secondsPlayed movesMade
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 413.88188 85
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-01 82.67343 15
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 236.73809 39
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-10 112.69112 29
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-11 211.42790 44
-----------------------------------CONT----------------------------------
++/8ij1h8378h123123koF3oer1 2016-05-05 200.73809 11
++/8ij1h8378h123123koF3oer1 2016-05-10 51.69112 14
++/8ij1h8378h123123koF3oer1 2016-05-14 65.42790 53
The end result for this would be the following table:
userid date secondsPlayed_w movesMade_w
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 496.55531 100
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 236.73809 68
-----------------------------------CONT----------------------------------
++/8ij1h8378h123123koF3oer1 2016-05-05 252.42921 25
++/8ij1h8378h123123koF3oer1 2016-05-12 65.42790 53
Failed attempt #1:
So far I've tried doing a lot of different things, but the most useful dataframe I've managed to create was the following:
df_grouped = df.groupby('userid').apply(lambda x: x.set_index('date').resample('1D').first().fillna(0))
df_result = df_grouped.groupby(level=0)['secondsPlayed'].apply(lambda x: x.rolling(min_periods=1, window=7).mean()).reset_index(name='secondsPlayed_week')
This is a very slow and wasteful computation, but it can nonetheless be used as an intermediate step.
userid date secondsPlayed_w
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 4.138819e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-29 2.069409e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-30 1.379606e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-01 1.241388e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-02 9.931106e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-03 8.275922e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-04 7.093647e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 4.563022e+01
Failed attempt #2:
df_result = (df
    .reset_index()
    .set_index("date")
    .groupby(pd.Grouper(freq='W'))
    .agg({"userid":"first", "secondsPlayed":"sum", "movesMade":"sum"})
    .reset_index())
Which gave me the following dataframe, which has the fault of not being grouped by userids (the NaN problem is easily resolved).
date userid secondsPlayed_w movesMade_w
2016-04-10 +1kexX0Yk2Su639WaRKARcwjq5g= 2.581356e+03 320
2016-04-17 +1kexX0Yk2Su639WaRKARcwjq5g= 4.040738e+03 615
2016-04-24 NaN 0.000000e+00 0
2016-05-01 ++RBPf9KdTK6pTN+lKZHDLCXg10= 1.644130e+05 17453
2016-05-08 ++DndI7do036eqYh9iW7vekAnx0= 3.775905e+05 31997
2016-05-15 ++NjKpr/vyxNCiYcmeFK9qSqD9o= 4.993430e+05 34706
2016-05-22 ++RBPf9KdTK6pTN+lKZHDLCXg10= 3.940408e+05 23779
Immediate thought:
Can this problem be solved by using a groupby that groups by two columns? I'm not at all sure how to go about that with this particular problem.
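You can create a per-user week number to help with the groupby:
df.date = pd.to_datetime(df.date)
# Week number counted from each user's first date: cumulative days since that date (summed per user), floor-divided by 7
df['Newweeknumber'] = df.groupby('userid').date.diff().dt.days.fillna(0).groupby(df.userid).cumsum() // 7
# userid is already a grouping key, so it does not need to be aggregated
df.groupby(['userid','Newweeknumber']).agg({"secondsPlayed":"sum", "movesMade":"sum"})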
Update
Try
df1 = pd.DataFrame(index=pd.date_range('2015-04-24', periods = 50)).assign(value=1)
df2 = pd.DataFrame(index=pd.date_range('2015-04-28', periods = 50)).assign(value=1)
df3 = pd.concat([df1,df2], keys=['A','B'])
df3 = df3.rename_axis(['user','date']).reset_index()
df3.groupby('user').apply(lambda x: x.resample('7D', on='date').sum())
Output:
value
user date
A 2015-04-24 7
2015-05-01 7
2015-05-08 7
2015-05-15 7
2015-05-22 7
2015-05-29 7
2015-06-05 7
2015-06-12 1
B 2015-04-28 7
2015-05-05 7
2015-05-12 7
2015-05-19 7
2015-05-26 7
2015-06-02 7
2015-06-09 7
2015-06-16 1
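To tie this back to the sample rows in the question, here is a sketch that measures each record's offset from that user's first date and sums per user and week. The shortened user ids "A" and "B", the helper column "week", and the reconstruction of the week start date are assumptions made for illustration.

```python
import pandas as pd

# Sample rows from the question, with the long user ids shortened to "A" and "B"
df = pd.DataFrame({
    "userid": ["A"] * 5 + ["B"] * 3,
    "date": pd.to_datetime([
        "2016-04-28", "2016-05-01", "2016-05-05", "2016-05-10", "2016-05-11",
        "2016-05-05", "2016-05-10", "2016-05-14",
    ]),
    "secondsPlayed": [413.88188, 82.67343, 236.73809, 112.69112, 211.42790,
                      200.73809, 51.69112, 65.42790],
    "movesMade": [85, 15, 39, 29, 44, 11, 14, 53],
})

# Days since each user's own first record, turned into a 0-based week number
first_day = df.groupby("userid")["date"].transform("min")
df["week"] = (df["date"] - first_day).dt.days // 7

# Sum per user and per week of that user's lifetime
weekly = df.groupby(["userid", "week"], as_index=False)[["secondsPlayed", "movesMade"]].sum()

# Rebuild the date on which each user-specific week starts
starts = df.groupby("userid")["date"].min()
weekly["date"] = weekly["userid"].map(starts) + pd.to_timedelta(weekly["week"] * 7, unit="D")

print(weekly)
```

Replacing sum() with mean() would give per-week averages instead of totals.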
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex....
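For reference, the PreviousMonthMean column described above can be computed with a plain loop over the rows. This is only a straightforward sketch of the stated definition (mean of x over the half-open range [MonthPrior, index)), not an efficient or vectorised solution; the helper name previous_month_mean is made up for illustration.

```python
import pandas as pd

rng = pd.date_range("2017-01-03", periods=20, freq="8D")
df = pd.DataFrame({"x": rng.day}, index=rng)
df["MonthPrior"] = df.index - pd.DateOffset(months=1)


def previous_month_mean(frame):
    """Mean of x over [MonthPrior, row date) for every row (hypothetical helper)."""
    means = []
    for end, start in zip(frame.index, frame["MonthPrior"]):
        in_window = (frame.index >= start) & (frame.index < end)
        # An empty window (no earlier rows) yields NaN
        means.append(frame.loc[in_window, "x"].mean())
    return pd.Series(means, index=frame.index)


df["PreviousMonthMean"] = previous_month_mean(df)
print(df)
```

The loop is quadratic in the number of rows, which is fine for a few thousand rows but would need a different approach for long histories.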
My data:
ipdb> dta
plays
2015-03-01 401.0
2015-03-02 350.0
2015-03-03 448.0
2015-03-04 490.0
... ...
2015-08-23 655.0
2015-08-24 731.0
2015-08-25 684.0
2015-08-26 774.0
2015-08-27 808.0
2015-08-28 732.0
2015-08-29 694.0
2015-08-30 798.0
The data runs from 2015-03-01 to 2015-08-30, and I want to predict future values.
My code snippet:
arma_mod30 = sm.tsa.ARMA(dta, (d_level, 0)).fit()
result = arma_mod30.predict('2015-09-01', '2015-10-30', dynamic=True)
But the predict function returned the following error:
ValueError: date 2015-09-01 00:00:00 not in date index. Try giving a date that is in the dates index or use an integer
How can I predict future dates that are not in the date index? Thanks!