I've got the following Pandas dataframes:
>>> df
Qual_B temp_B relhum_B Qual_F temp_F relhum_F
Date
1948-01-01 01:00:00 5 -6.0 96 NaN NaN NaN
1948-01-01 02:00:00 5 -5.3 97 NaN NaN NaN
1948-01-01 03:00:00 5 -4.5 98 NaN 3.5 NaN
1948-01-01 04:00:00 5 -4.3 98 NaN 3.7 NaN
1948-01-01 05:00:00 5 -4.0 99 NaN NaN NaN
>>> test
Qual_B temp_B relhum_B Qual_F temp_F relhum_F
Date
1948-01-01 01:00:00 True True True False False False
1948-01-01 02:00:00 True True True False False False
1948-01-01 03:00:00 True True True False True False
1948-01-01 04:00:00 True True True False True False
1948-01-01 05:00:00 True True True False False False
which represents data availability (I created test with test = pandas.notnull(df)). I want an availability plot (or a stacked barplot) with time on the x-axis and the columns on the y-axis, and I have tried the following:
fig= plt.figure()
ax = fig.add_subplot(111)
ax.imshow(test.values, aspect='auto', cmap=plt.cm.gray, interpolation='nearest')
but it doesn't do anything, even though the type of the array is exactly the same as in the example I am following (both numpy.ndarray). My attempt to plot the original dataframe with
test.div(test, axis=0).T.plot(kind = 'barh', stacked=True, legend=False, color='b', edgecolor='none')
seems to be correct for the columns that are always present, but not for those that are only partly present. Can anyone help?
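For reference, a minimal sketch of how such an availability plot can be drawn with imshow, assuming the test frame above and matplotlib imported as plt; in a plain (non-interactive) script, the usual reason for seeing nothing is simply that plt.show() was never called:
import matplotlib.pyplot as plt

# Transpose so time runs along the x-axis and the columns along the y-axis;
# cast the booleans to ints so available/missing map to 1/0 in the gray colormap.
fig, ax = plt.subplots()
ax.imshow(test.values.T.astype(int), aspect='auto',
          cmap=plt.cm.gray, interpolation='nearest')
ax.set_yticks(range(len(test.columns)))
ax.set_yticklabels(test.columns)
plt.show()  # nothing is rendered in a plain script without this call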
I have a Pandas dataframe in the following format:
id name timestamp time_diff <=30min
1 movie3 2009-05-04 18:00:00+00:00 NaN False
1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True
1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False
2 movie7 2009-05-04 09:30:00+00:00 NaN False
2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False
3 movie1 2009-05-04 17:45:00+00:00 NaN False
3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True
3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True
3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True
4 movie1 2009-05-05 12:45:00+00:00 NaN False
5 movie7 2009-05-04 11:00:00+00:00 NaN False
5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True
The data shows the movies watched on a video streaming platform. id is the user id, name is the name of the movie and timestamp is the time at which the movie started. <=30min indicates whether the user started the movie within 30 minutes of the previously watched movie.
A movie session comprises one or more movies played by a single user, where each movie started within 30 minutes of the previous movie's start time (basically, a session is defined as consecutive rows in which df['<=30min'] == True).
The length of a session is defined as the timestamp of the last consecutive row with df['<=30min'] == True minus the timestamp of the first True of the session.
How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?
As a first step, I have tried something like this:
df.groupby((df['<=30min'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()
But it doesn't work (the cumsum does not reset when df['<=30min'] == False), and it looks very slow.
Also, I think it would make my life harder when I have to select the longest 3 sessions as I could get multiple values for the same session that could be selected in the longest 3.
Not sure I understood you correctly. If I did, then this may work.
Coerce timestamp to datetime:
df['timestamp']=pd.to_datetime(df['timestamp'])
Keep only the True values, which indicate consecutive watches. Group by id while calculating the difference between the maximum and minimum time. This is then joined to the main df:
df.join(df[df['<=30min']==True].groupby('id')['timestamp'].transform(lambda x:x.max()-x.min()).to_frame().rename(columns={'timestamp':'Max'}))
id name timestamp time_diff <=30min Max
0 1 movie3 2009-05-04 18:00:00+00:00 NaN False NaT
1 1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True 00:00:00
2 1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False NaT
3 2 movie7 2009-05-04 09:30:00+00:00 NaN False NaT
4 2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False NaT
5 3 movie1 2009-05-04 17:45:00+00:00 NaN False NaT
6 3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True 00:45:00
7 3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True 00:45:00
8 3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True 00:45:00
9 4 movie1 2009-05-05 12:45:00+00:00 NaN False NaT
10 5 movie7 2009-05-04 11:00:00+00:00 NaN False NaT
11 5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True 00:00:00
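As a further (hedged) sketch, assuming the columns shown in the question, sessions can also be grouped explicitly and the three longest picked out along with their movies:
import pandas as pd

# A new session starts at every row where '<=30min' is False
# (the first row of each user has NaN time_diff, hence False).
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['session'] = (~df['<=30min']).cumsum()

# Session length = last start timestamp minus first start timestamp in the session.
lengths = df.groupby(['id', 'session'])['timestamp'].agg(lambda s: s.max() - s.min())
top3 = lengths.sort_values(ascending=False).head(3)
top3_minutes = top3 / pd.Timedelta(minutes=1)  # lengths expressed in minutes

# Movies played during those three sessions.
movies = df.groupby(['id', 'session'])['name'].apply(list).loc[top3.index]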
I am trying to find a way to calculate an inverse cumsum in pandas, i.e. applying cumsum from bottom to top. Concretely, I'm trying to number the workable days of each month in Spain both from top to bottom (1st workable day = 1, 2nd = 2, 3rd = 3, etc.) and from bottom to top (last workable day = 1, day before last = 2, etc.).
So far I have managed to get the top-to-bottom numbering to work, but not the inverse order; I've searched a lot and couldn't find a way to perform an inverse cumulative sum:
import pandas as pd
from datetime import date
from workalendar.europe import Spain
import numpy as np
cal = Spain()
#print(cal.holidays(2019))
rng = pd.date_range('2019-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'Date': rng})
df['flag_workable'] = df['Date'].apply(lambda x: cal.is_working_day(x))
df_workable = df[df['flag_workable'] == True]
df_workable['month'] = df_workable['Date'].dt.month
df_workable['workable_day'] = df_workable.groupby('month')['flag_workable'].cumsum()
print(df)
print(df_workable.head(30))
Output for January:
Date flag_workable month workable_day
1 2019-01-02 True 1 1.0
2 2019-01-03 True 1 2.0
3 2019-01-04 True 1 3.0
6 2019-01-07 True 1 4.0
7 2019-01-08 True 1 5.0
Example for last days of January:
Date flag_workable month workable_day
24 2019-01-25 True 1 18.0
27 2019-01-28 True 1 19.0
28 2019-01-29 True 1 20.0
29 2019-01-30 True 1 21.0
30 2019-01-31 True 1 22.0
This would be the expected output after applying the inverse cumulative sum:
Date flag_workable month workable_day inv_workable_day
1 2019-01-02 True 1 1.0 22.0
2 2019-01-03 True 1 2.0 21.0
3 2019-01-04 True 1 3.0 20.0
6 2019-01-07 True 1 4.0 19.0
7 2019-01-08 True 1 5.0 18.0
Last days of January:
Date flag_workable month workable_day inv_workable_day
24 2019-01-25 True 1 18.0 5.0
27 2019-01-28 True 1 19.0 4.0
28 2019-01-29 True 1 20.0 3.0
29 2019-01-30 True 1 21.0 2.0
30 2019-01-31 True 1 22.0 1.0
Invert the row order of the DataFrame prior to grouping so that the cumsum is calculated in reverse order within each month.
df['inv_workable_day'] = df[::-1].groupby('month')['flag_workable'].cumsum()
df['workable_day'] = df.groupby('month')['flag_workable'].cumsum()
# Date flag_workable month inv_workable_day workable_day
#1 2019-01-02 True 1 5.0 1.0
#2 2019-01-03 True 1 4.0 2.0
#3 2019-01-04 True 1 3.0 3.0
#6 2019-01-07 True 1 2.0 4.0
#7 2019-01-08 True 1 1.0 5.0
#8 2019-02-01 True 2 1.0 1.0
Solution
Whichever column you want to apply cumsum to, you have two options:
Sort a copy of that column in descending order by index, apply cumsum, then sort back ascending by index and assign it back to the DataFrame column (a pandas one-liner for this is sketched after the numpy example below).
Use numpy:
import numpy as np
array = df.column_data.to_numpy()
array = np.flip(array)            # flip the order
array = np.cumsum(array)          # cumulative sum over the reversed values
array = np.flip(array)            # flip back to the original order
df['column_data_cumsum'] = array  # note: attribute assignment (df.column_data_cumsum = array) would not create a new column
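A compact pandas sketch of the first option, using the column_data column from the numpy example:
# Reverse the rows, take the cumulative sum, then reverse back; assignment
# realigns on the index, so the original row order is preserved.
df['column_data_cumsum'] = df['column_data'][::-1].cumsum()[::-1]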
I currently have some time series data to which I applied a rolling mean with a window of 17520.
Before that, the head of my data looked like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get it so that my data only begins where there is no NaN (while also making sure that the dates still match)?
You can try rolling with min_periods=1:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
Also, by using a for loop you do not need to write the columns one by one:
youcols = ['xxx', ..., 'xxx1']
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then:
data = data.dropna(subset=youcols, thresh=1)
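If, after the rolling mean, you want the data to start only where no NaNs remain (as asked above), here is a minimal sketch assuming the youcols list from this answer and a sorted index:
# First timestamp at which every rolled column has a value; everything before
# it is dropped, so the dates stay aligned across columns.
first_valid = data[youcols].dropna().index[0]
data = data.loc[first_valid:]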
I have a datetime-indexed DataFrame with 65 columns (only 9 shown for clarity), one per sensor, and x rows, one per observation (for the sample data I limited it to 700 rows to illustrate the issue I am having).
demo csv:
https://pastebin.com/mpSgJF94
swp_data = pd.read_csv(FILE_NAME, index_col=0, header=0, parse_dates=True, infer_datetime_format=True)
swp_data = swp_data.sort_index()
For each column, I need to find the point where the value reaches 95% of the column's max value, and then, between the beginning of the DataFrame and that 95% point, find where the difference between consecutive time steps is greater than a given value (0.2 in this case).
something similar to what would work in R (not actual code but an illustration)
for (i in 1 : 95% point) {
    difference[i] <- s[i] - s[(i-1)]
}
breaking <- which(difference > 0.2)[1]
Which would take the 95% point as the end index of a loop, and look at the differences between the time steps and return an index value where the difference > 0.2
In pandas I have calculated the following:
95% value
s95 = (swp_data.max() + (swp_data.max() * .05))
A1-24, -20.6260635,
A1-18, -17.863923,
A1-12, -11.605629,
A2-24, -16.755144,
A2-18, -17.6815275,
A2-12, -16.369584,
A3-24, -15.5030295,
95% time
s95_time = (swp_data >= (swp_data.max() + (swp_data.max() * .05))).idxmax()
A1-24, 10/2/2011 1:30,
A1-18, 10/3/2011 6:20,
A1-12, 10/2/2011 17:20,
A2-24, 10/3/2011 6:10,
A2-18, 10/3/2011 1:30,
A2-12, 10/2/2011 17:10,
A3-24, 10/2/2011 1:30,
Thus far, I have the max value, and the 95% value, as well as a series of timestamps where each column reached its 95% point.
I have tried to mask the DataFrame (trying to replicate R's which) by creating a boolean DataFrame of values <= the 95% point, and I have tried df.where using values >= 95%. Neither mask nor where has given me what I need: some of the sensors can already be above 95% of max when I start recording (mask returns NaN for these values), while where returns those values but not the values below the 95% threshold.
The output I am looking for would be something along the lines of
A1-24, A1-18, A1-12, A2-24, A2-18, A2-12, A3-24, A3-18, A3-12
BREAKING hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm
where hh:mm equals the time from the start of the data file to the breaking value.
So far, what I have found on SE and Google has left me unsure whether I can subset the columns of the dataframe by different values, and I'm having trouble figuring out what the operation I'm trying to do is called.
Edit (re: Prateek's comment):
What I am trying to do is find a way to somewhat automate this process, so that using the position of the 95% point I can have the breaking point returned. I have ~200 csv files to process, and would like as much of the filtering as possible to be done using the 95% and breaking positions.
A possible solution, from what I understand.
Note that I renamed swp_data to df in the example; the solution is tested on the csv sample file provided in your question.
Find duration from the start up to when value reaches 95% of column's max
Finding the first timepoint where each column reaches 95% of the max is done as you described:
idx = (df >= df.max(axis=0) * 1.05).idxmax()
>>> idx
Out[]:
A1-24 2011-10-02 01:30:00
A1-18 2011-10-03 06:20:00
A1-12 2011-10-02 17:20:00
A2-24 2011-10-03 06:10:00
A2-18 2011-10-03 01:30:00
A2-12 2011-10-02 17:10:00
A3-24 2011-10-02 01:30:00
dtype: datetime64[ns]
Note that using df.max() * 1.05 avoids computing the max twice, compared to s95 = (swp_data.max() + (swp_data.max() * .05)); otherwise it is the same.
The duration from the start of the dataframe is then obtained by subtracting the first timestamp:
>>> idx - df.index[0]
Out[]:
A1-24 0 days 00:00:00
A1-18 1 days 04:50:00
A1-12 0 days 15:50:00
A2-24 1 days 04:40:00
A2-18 1 days 00:00:00
A2-12 0 days 15:40:00
A3-24 0 days 00:00:00
dtype: timedelta64[ns]
This is, for each column, the time elapsed from the start of the record to the s95 point.
Time is 0 if the first recorded value is already above this point.
Mask the dataframe to cover this period
mask = pd.concat([pd.Series(df.index)] * df.columns.size, axis=1) < idx.values.T
df_masked = df.where(mask.values)
>>> df_masked.dropna(how='all')
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 01:30:00 NaN -18.63589 -16.90389 -17.26780 -19.20653 -19.59666 NaN
2011-10-02 01:40:00 NaN -18.64686 -16.93100 -17.26832 -19.22702 -19.62036 NaN
2011-10-02 01:50:00 NaN -18.65098 -16.92761 -17.26132 -19.22705 -19.61355 NaN
2011-10-02 02:00:00 NaN -18.64307 -16.94764 -17.27702 -19.22746 -19.63462 NaN
2011-10-02 02:10:00 NaN -18.66338 -16.94900 -17.27325 -19.25358 -19.62761 NaN
2011-10-02 02:20:00 NaN -18.66217 -16.95625 -17.27386 -19.25455 -19.64009 NaN
2011-10-02 02:30:00 NaN -18.66015 -16.96130 -17.27040 -19.25898 -19.64241 NaN
2011-10-02 02:40:00 NaN -18.66883 -16.96980 -17.27580 -19.27054 -19.65454 NaN
2011-10-02 02:50:00 NaN -18.68635 -16.97897 -17.27488 -19.28492 -19.65808 NaN
2011-10-02 03:00:00 NaN -18.68009 -16.99057 -17.28346 -19.28928 -19.67182 NaN
2011-10-02 03:10:00 NaN -18.68450 -17.00258 -17.28196 -19.32272 -19.68135 NaN
2011-10-02 03:20:00 NaN -18.68777 -17.01009 -17.29675 -19.30864 -19.68747 NaN
2011-10-02 03:30:00 NaN -18.70067 -17.01706 -17.29178 -19.32034 -19.69742 NaN
2011-10-02 03:40:00 NaN -18.70095 -17.03559 -17.29352 -19.32741 -19.70945 NaN
2011-10-02 03:50:00 NaN -18.70636 -17.03651 -17.28925 -19.33549 -19.71560 NaN
2011-10-02 04:00:00 NaN -18.70937 -17.03548 -17.28996 -19.33433 -19.71211 NaN
2011-10-02 04:10:00 NaN -18.70599 -17.04444 -17.29223 -19.33740 -19.72227 NaN
2011-10-02 04:20:00 NaN -18.71292 -17.05510 -17.29449 -19.35154 -19.72779 NaN
2011-10-02 04:30:00 NaN -18.72158 -17.06376 -17.28770 -19.35647 -19.73064 NaN
2011-10-02 04:40:00 NaN -18.72185 -17.06910 -17.30018 -19.36785 -19.74481 NaN
2011-10-02 04:50:00 NaN -18.72048 -17.06599 -17.29004 -19.37320 -19.73424 NaN
2011-10-02 05:00:00 NaN -18.73083 -17.07618 -17.29528 -19.37319 -19.75045 NaN
2011-10-02 05:10:00 NaN -18.72215 -17.08587 -17.29650 -19.38400 -19.75713 NaN
2011-10-02 05:20:00 NaN -18.73206 -17.10233 -17.29767 -19.39254 -19.76838 NaN
2011-10-02 05:30:00 NaN -18.73719 -17.09621 -17.29842 -19.39363 -19.76258 NaN
2011-10-02 05:40:00 NaN -18.73839 -17.10910 -17.29237 -19.40390 -19.76864 NaN
2011-10-02 05:50:00 NaN -18.74257 -17.12091 -17.29398 -19.40846 -19.78042 NaN
2011-10-02 06:00:00 NaN -18.74327 -17.12995 -17.29097 -19.41153 -19.77897 NaN
2011-10-02 06:10:00 NaN -18.74326 -17.04482 -17.28397 -19.40928 -19.77430 NaN
2011-10-02 06:20:00 NaN -18.73100 -16.86221 -17.28575 -19.40956 -19.78396 NaN
... ... ... ... ... ... ... ...
2011-10-03 01:20:00 NaN -18.16448 NaN -16.99797 -17.95030 NaN NaN
2011-10-03 01:30:00 NaN -18.15606 NaN -16.98879 NaN NaN NaN
2011-10-03 01:40:00 NaN -18.12795 NaN -16.97951 NaN NaN NaN
2011-10-03 01:50:00 NaN -18.12974 NaN -16.97937 NaN NaN NaN
2011-10-03 02:00:00 NaN -18.11848 NaN -16.96770 NaN NaN NaN
2011-10-03 02:10:00 NaN -18.11879 NaN -16.95256 NaN NaN NaN
2011-10-03 02:20:00 NaN -18.08212 NaN -16.95461 NaN NaN NaN
2011-10-03 02:30:00 NaN -18.09060 NaN -16.94141 NaN NaN NaN
2011-10-03 02:40:00 NaN -18.07000 NaN -16.93006 NaN NaN NaN
2011-10-03 02:50:00 NaN -18.07461 NaN -16.91700 NaN NaN NaN
2011-10-03 03:00:00 NaN -18.06039 NaN -16.91466 NaN NaN NaN
2011-10-03 03:10:00 NaN -18.04229 NaN -16.89537 NaN NaN NaN
2011-10-03 03:20:00 NaN -18.03514 NaN -16.89753 NaN NaN NaN
2011-10-03 03:30:00 NaN -18.03014 NaN -16.88813 NaN NaN NaN
2011-10-03 03:40:00 NaN -18.00851 NaN -16.88086 NaN NaN NaN
2011-10-03 03:50:00 NaN -18.01028 NaN -16.87721 NaN NaN NaN
2011-10-03 04:00:00 NaN -18.00227 NaN -16.86687 NaN NaN NaN
2011-10-03 04:10:00 NaN -17.98804 NaN -16.85424 NaN NaN NaN
2011-10-03 04:20:00 NaN -17.96740 NaN -16.84466 NaN NaN NaN
2011-10-03 04:30:00 NaN -17.96451 NaN -16.84205 NaN NaN NaN
2011-10-03 04:40:00 NaN -17.95414 NaN -16.82609 NaN NaN NaN
2011-10-03 04:50:00 NaN -17.93661 NaN -16.81903 NaN NaN NaN
2011-10-03 05:00:00 NaN -17.92905 NaN -16.80737 NaN NaN NaN
2011-10-03 05:10:00 NaN -17.92743 NaN -16.80582 NaN NaN NaN
2011-10-03 05:20:00 NaN -17.91504 NaN -16.78991 NaN NaN NaN
2011-10-03 05:30:00 NaN -17.89965 NaN -16.78469 NaN NaN NaN
2011-10-03 05:40:00 NaN -17.89945 NaN -16.77288 NaN NaN NaN
2011-10-03 05:50:00 NaN -17.88822 NaN -16.76610 NaN NaN NaN
2011-10-03 06:00:00 NaN -17.87259 NaN -16.75742 NaN NaN NaN
2011-10-03 06:10:00 NaN -17.87308 NaN NaN NaN NaN NaN
[173 rows x 7 columns]
To achieve this you have to compute a bool mask for each column (see also the broadcasting alternative sketched after this list):
create a dataframe with the DateTimeIndex values repeated over the same number of columns as df: pd.concat([pd.Series(df.index)] * df.columns.size, axis=1).
Here df.index must be turned into a pd.Series for concatenation, then repeated to match the number of columns df.columns.size.
create the mask itself with < idx.values.T, where values gets idx as a numpy.array and T transposes it in order to compare column-wise with the dataframe.
mask the dataframe with df.where(mask.values), where using values gets the mask as a numpy.array. This is needed as the mask does not have the same labels as df.
optionally only keep the rows where at least one value is not NaN using .dropna(how='all')
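As an aside, the same mask can also be built by broadcasting the index against the per-column cut-offs, which avoids the concat (a sketch, assuming df and idx as above):
# An (n_rows, 1) column of timestamps compared against the (n_cols,) cut-off
# timestamps broadcasts to an (n_rows, n_cols) boolean array.
mask = df.index.values[:, None] < idx.values
df_masked = df.where(mask)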
Filter masked data on the difference between each time point
If I understand correctly, you now want to filter your data on a difference > 0.2 between consecutive time points, and only within the selected period.
It remains a bit unclear to me so do not hesitate to discuss in the comments if I misunderstood.
This can be done with:
df[df_masked.diff(1) > 0.2]
But unfortunately for the provided dataset there is no value matching these conditions.
>>> df[df_masked.diff(1) > 0.2].any()
Out[]:
A1-24 False
A1-18 False
A1-12 False
A2-24 False
A2-18 False
A2-12 False
A3-24 False
dtype: bool
Edit: visualize results as a bool dataframe (follow-up from the comments)
Visualizing the results as a boolean dataframe with index and columns is done very simply with df_masked.diff(1) > 0.2.
However there will likely be a lot of unnecessary rows containing only False, so you can filter it this way:
df_diff = df_masked.diff(1) > 0.1 # Raising the threshold a bit to get some values
>>> df_diff[df_diff.any(axis=1)]
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 06:20:00 False False True False False False False
2011-10-02 06:30:00 False False True False False False False
2011-10-02 06:40:00 False False True False False False False
2011-10-02 06:50:00 False False True False False False False
2011-10-02 07:00:00 False False True False False False False
2011-10-02 07:10:00 False False True False False False False
2011-10-02 07:20:00 False False True False False False False
2011-10-02 07:30:00 False False True False False False False
2011-10-02 07:40:00 False False True False False False False
2011-10-02 07:50:00 False False True False False False False
2011-10-02 08:00:00 False False True False False False False
2011-10-02 08:10:00 False False True False False False False
2011-10-02 08:20:00 False False True False False False False
2011-10-02 08:30:00 False False True False False False False
2011-10-02 08:40:00 False False True False False False False
2011-10-02 08:50:00 False False True False False False False
2011-10-02 09:00:00 False False True False False False False
2011-10-02 09:10:00 False False True False False False False
2011-10-02 09:20:00 False False True False False False False
2011-10-02 09:30:00 False False True False False False False
2011-10-02 12:20:00 False False False False False True False
2011-10-02 12:50:00 False False False False True True False
2011-10-02 13:10:00 False False False False False True False
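To get something closer to the BREAKING row sketched in the question, one possible follow-up (a sketch, reusing df, the df_diff frame and the 0.1 threshold from above):
# First timestamp per column at which the step-to-step difference exceeds the
# threshold, expressed as time elapsed since the start of the file
# (NaT for columns that never exceed it).
breaking = df_diff.idxmax().where(df_diff.any()) - df.index[0]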
I have a DataFrame like this
OPEN HIGH LOW CLOSE VOL
2012-01-01 19:00:00 449000 449000 449000 449000 1336303000
2012-01-01 20:00:00 NaN NaN NaN NaN NaN
2012-01-01 21:00:00 NaN NaN NaN NaN NaN
2012-01-01 22:00:00 NaN NaN NaN NaN NaN
2012-01-01 23:00:00 NaN NaN NaN NaN NaN
...
OPEN HIGH LOW CLOSE VOL
2013-04-24 14:00:00 11700000 12000000 11600000 12000000 20647095439
2013-04-24 15:00:00 12000000 12399000 11979000 12399000 23997107870
2013-04-24 16:00:00 12399000 12400000 11865000 12100000 9379191474
2013-04-24 17:00:00 12300000 12397995 11850000 11850000 4281521826
2013-04-24 18:00:00 11850000 11850000 10903000 11800000 15546034128
I need to fill the NaNs according to this rule:
When OPEN, HIGH, LOW and CLOSE are all NaN:
set VOL to 0
set OPEN, HIGH, LOW and CLOSE to the previous candle's CLOSE value
otherwise keep the NaN
Since neither of the other two answers works, here's a complete answer.
I'm testing two methods here: the first is based on working4coin's comment on hd1's answer, and the second is a slower, pure-Python implementation. It seems obvious that the Python implementation should be slower, but I decided to time both methods to make sure and to quantify the results.
def nans_to_prev_close_method1(data_frame):
    # volume should always be 0 (if there were no trades in this interval)
    data_frame['volume'] = data_frame['volume'].fillna(0.0)
    # pull the last close forward into this close
    data_frame['close'] = data_frame['close'].fillna(method='pad')
    # now copy the close that was pulled down from the last timestep into this row, across into o/h/l
    data_frame['open'] = data_frame['open'].fillna(data_frame['close'])
    data_frame['low'] = data_frame['low'].fillna(data_frame['close'])
    data_frame['high'] = data_frame['high'].fillna(data_frame['close'])
Method 1 does most of the heavy lifting in C (inside the pandas code), and so should be quite fast.
The slow, python approach (method 2) is shown below
def nans_to_prev_close_method2(data_frame):
    prev_row = None
    for index, row in data_frame.iterrows():
        if np.isnan(row['open']):  # row.isnull().any():
            pclose = prev_row['close']
            # assumes first row has no nulls!!
            row['open'] = pclose
            row['high'] = pclose
            row['low'] = pclose
            row['close'] = pclose
            row['volume'] = 0.0
        prev_row = row
Testing the timing on both of them:
df = trades_to_ohlcv(PATH_TO_RAW_TRADES_CSV, '1s') # splits raw trades into secondly candles
df2 = df.copy()
wrapped1 = wrapper(nans_to_prev_close_method1, df)
wrapped2 = wrapper(nans_to_prev_close_method2, df2)
print("method 1: %.2f sec" % timeit.timeit(wrapped1, number=1))
print("method 2: %.2f sec" % timeit.timeit(wrapped2, number=1))
The results were:
method 1: 0.46 sec
method 2: 151.82 sec
Clearly method 1 is far faster (approx 330 times faster).
Here's how to do it via masking
Simulate a frame with some holes (A is your 'close' field)
In [20]: df = DataFrame(randn(10,3),index=date_range('20130101',periods=10,freq='min'),
columns=list('ABC'))
In [21]: df.iloc[1:3,:] = np.nan
In [22]: df.iloc[5:8,1:3] = np.nan
In [23]: df
Out[23]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 NaN NaN NaN
2013-01-01 00:02:00 NaN NaN NaN
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 NaN NaN
2013-01-01 00:06:00 -0.456790 NaN NaN
2013-01-01 00:07:00 -0.479704 NaN NaN
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401
The rows that are all NaN:
In [24]: mask_0 = pd.isnull(df).all(axis=1)
In [25]: mask_0
Out[25]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 False
2013-01-01 00:06:00 False
2013-01-01 00:07:00 False
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
The rows where we want to propagate A:
In [26]: mask_fill = pd.isnull(df['B']) & pd.isnull(df['C'])
In [27]: mask_fill
Out[27]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 True
2013-01-01 00:06:00 True
2013-01-01 00:07:00 True
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
Propagate first:
In [28]: df.loc[mask_fill,'C'] = df['A']
In [29]: df.loc[mask_fill,'B'] = df['A']
fill the 0's
In [30]: df.loc[mask_0] = 0
Done
In [31]: df
Out[31]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 0.000000 0.000000 0.000000
2013-01-01 00:02:00 0.000000 0.000000 0.000000
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 0.753991 0.753991
2013-01-01 00:06:00 -0.456790 -0.456790 -0.456790
2013-01-01 00:07:00 -0.479704 -0.479704 -0.479704
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401
This illustrates pandas' missing data behaviour. The incantation you're looking for is the fillna method, which takes a value:
In [1381]: df2
Out[1381]:
one two three four five timestamp
a NaN 1.138469 -2.400634 bar True NaT
c NaN 0.025653 -1.386071 bar False NaT
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h NaN -1.157886 -0.551865 bar False NaT
In [1382]: df2.fillna(0)
Out[1382]:
one two three four five timestamp
a 0.000000 1.138469 -2.400634 bar True 1970-01-01 00:00:00
c 0.000000 0.025653 -1.386071 bar False 1970-01-01 00:00:00
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h 0.000000 -1.157886 -0.551865 bar False 1970-01-01 00:00:00
You can even propagate them forward and backward:
In [1384]: df
Out[1384]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h NaN -1.157886 -0.551865
In [1385]: df.fillna(method='pad')
Out[1385]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h 1.053202 -1.157886 -0.551865
For your specific case, I think you'll need to do:
df['VOL'].fillna(0)
df.fillna(df['CLOSE'])