I have a datetime-indexed DataFrame with 65 columns (only 9 shown for clarity), one per sensor, and x rows, one per observation. For the sample data I limited it to 700 rows to illustrate the issue I am having.
demo csv:
https://pastebin.com/mpSgJF94
swp_data = pd.read_csv(FILE_NAME, index_col=0, header=0, parse_dates=True, infer_datetime_format=True)
swp_data = swp_data.sort_index()
For each column, I need to find the point where the value reaches 95% of the column's max, and then, looking from the beginning of the DataFrame up to that 95% point, find where the difference between consecutive time steps is greater than a given value (0.2 in this case).
Something similar to what would work in R (not actual code, just an illustration):
for (i in 1:(95% point)) {
  difference[i] <- s[i] - s[i - 1]
}
breaking <- which(difference > 0.2)[1]
This would take the 95% point as the end index of a loop, look at the differences between consecutive time steps, and return the first index where the difference > 0.2.
In pandas I have calculated the following:
95% value
s95 = (swp_data.max() + (swp_data.max() * .05))
A1-24, -20.6260635,
A1-18, -17.863923,
A1-12, -11.605629,
A2-24, -16.755144,
A2-18, -17.6815275,
A2-12, -16.369584,
A3-24, -15.5030295,
95% time
s95_time = (swp_data >= (swp_data.max() + (swp_data.max() * .05))).idxmax()
A1-24, 10/2/2011 1:30,
A1-18, 10/3/2011 6:20,
A1-12, 10/2/2011 17:20,
A2-24, 10/3/2011 6:10,
A2-18, 10/3/2011 1:30,
A2-12, 10/2/2011 17:10,
A3-24, 10/2/2011 1:30,
Thus far, I have the max value, and the 95% value, as well as a series of timestamps where each column reached its 95% point.
I have tried to mask the DataFrame (trying to replicate R's which) by creating a boolean DataFrame of values <= the 95% point, and I have tried df.where with values >= 95%. Neither mask nor where has given me what I need: some of the sensors can already be above 95% of max when recording starts (mask returns NaN for these), while where returns those values but not the values below the 95% threshold.
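Roughly what I tried (a sketch, not my exact code; s95 is the 95% value computed above):
below = swp_data <= s95                     # boolean frame: True where the 95% point is not yet reached
masked = swp_data.mask(swp_data >= s95)     # NaN once a sensor reaches the point
kept = swp_data.where(swp_data >= s95)      # keeps values at/after the point, NaN before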
The output I am looking for would be something along the lines of
A1-24, A1-18, A1-12, A2-24, A2-18, A2-12, A3-24, A3-18, A3-12
BREAKING hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm, hh:mm
where hh:mm equals the time from the start of the data file to the breaking value.
So far, what I have found on SE and Google has left me confused about whether I can subset the columns of the DataFrame by different values, and I am having trouble figuring out what the operation I am attempting is called.
Edit (responding to Prateek's comment):
What I am trying to do is find a way to somewhat automate this process, so that using the position of the 95% point I can have the breaking point returned. I have ~200 csv files to process, and I would like as much of the filtering as possible to be driven by the 95% and breaking positions.
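For context, here is a sketch of the batch loop I am hoping to end up with (the folder path and result handling are placeholders):
import glob
import pandas as pd

results = {}
for path in glob.glob('data/*.csv'):  # hypothetical folder holding the ~200 csv files
    df = pd.read_csv(path, index_col=0, header=0, parse_dates=True).sort_index()
    s95_time = (df >= df.max() * 1.05).idxmax()  # first time each column reaches its 95% point
    # the breaking-point logic would go here
    results[path] = s95_time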
A possible solution, from what I understand.
Note that I renamed swp_data to df in the example; the solution is tested on the csv sample file provided in your question.
Find duration from the start up to when value reaches 95% of column's max
Finding the first timepoint where each column reaches 95% of the max is done as you described:
idx = (df >= df.max(axis=0) * 1.05).idxmax()
>>> idx
Out[]:
A1-24 2011-10-02 01:30:00
A1-18 2011-10-03 06:20:00
A1-12 2011-10-02 17:20:00
A2-24 2011-10-03 06:10:00
A2-18 2011-10-03 01:30:00
A2-12 2011-10-02 17:10:00
A3-24 2011-10-02 01:30:00
dtype: datetime64[ns]
Note that using df.max() * 1.05 avoids computing the max twice; compared to s95 = (swp_data.max() + (swp_data.max() * .05)) it is otherwise the same.
The duration from the start of the dataframe is then obtained by subtracting the first timestamp:
>>> idx - df.index[0]
Out[]:
A1-24 0 days 00:00:00
A1-18 1 days 04:50:00
A1-12 0 days 15:50:00
A2-24 1 days 04:40:00
A2-18 1 days 00:00:00
A2-12 0 days 15:40:00
A3-24 0 days 00:00:00
dtype: timedelta64[ns]
This is, for each column, the time from the start of the record to the s95 point.
The time is 0 if the first recorded value is already above that point.
Mask the dataframe to cover this period
mask = pd.concat([pd.Series(df.index)] * df.columns.size, axis=1) < idx.values.T
df_masked = df.where(mask.values)
>>> df_masked.dropna(how='all')
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 01:30:00 NaN -18.63589 -16.90389 -17.26780 -19.20653 -19.59666 NaN
2011-10-02 01:40:00 NaN -18.64686 -16.93100 -17.26832 -19.22702 -19.62036 NaN
2011-10-02 01:50:00 NaN -18.65098 -16.92761 -17.26132 -19.22705 -19.61355 NaN
2011-10-02 02:00:00 NaN -18.64307 -16.94764 -17.27702 -19.22746 -19.63462 NaN
2011-10-02 02:10:00 NaN -18.66338 -16.94900 -17.27325 -19.25358 -19.62761 NaN
2011-10-02 02:20:00 NaN -18.66217 -16.95625 -17.27386 -19.25455 -19.64009 NaN
2011-10-02 02:30:00 NaN -18.66015 -16.96130 -17.27040 -19.25898 -19.64241 NaN
2011-10-02 02:40:00 NaN -18.66883 -16.96980 -17.27580 -19.27054 -19.65454 NaN
2011-10-02 02:50:00 NaN -18.68635 -16.97897 -17.27488 -19.28492 -19.65808 NaN
2011-10-02 03:00:00 NaN -18.68009 -16.99057 -17.28346 -19.28928 -19.67182 NaN
2011-10-02 03:10:00 NaN -18.68450 -17.00258 -17.28196 -19.32272 -19.68135 NaN
2011-10-02 03:20:00 NaN -18.68777 -17.01009 -17.29675 -19.30864 -19.68747 NaN
2011-10-02 03:30:00 NaN -18.70067 -17.01706 -17.29178 -19.32034 -19.69742 NaN
2011-10-02 03:40:00 NaN -18.70095 -17.03559 -17.29352 -19.32741 -19.70945 NaN
2011-10-02 03:50:00 NaN -18.70636 -17.03651 -17.28925 -19.33549 -19.71560 NaN
2011-10-02 04:00:00 NaN -18.70937 -17.03548 -17.28996 -19.33433 -19.71211 NaN
2011-10-02 04:10:00 NaN -18.70599 -17.04444 -17.29223 -19.33740 -19.72227 NaN
2011-10-02 04:20:00 NaN -18.71292 -17.05510 -17.29449 -19.35154 -19.72779 NaN
2011-10-02 04:30:00 NaN -18.72158 -17.06376 -17.28770 -19.35647 -19.73064 NaN
2011-10-02 04:40:00 NaN -18.72185 -17.06910 -17.30018 -19.36785 -19.74481 NaN
2011-10-02 04:50:00 NaN -18.72048 -17.06599 -17.29004 -19.37320 -19.73424 NaN
2011-10-02 05:00:00 NaN -18.73083 -17.07618 -17.29528 -19.37319 -19.75045 NaN
2011-10-02 05:10:00 NaN -18.72215 -17.08587 -17.29650 -19.38400 -19.75713 NaN
2011-10-02 05:20:00 NaN -18.73206 -17.10233 -17.29767 -19.39254 -19.76838 NaN
2011-10-02 05:30:00 NaN -18.73719 -17.09621 -17.29842 -19.39363 -19.76258 NaN
2011-10-02 05:40:00 NaN -18.73839 -17.10910 -17.29237 -19.40390 -19.76864 NaN
2011-10-02 05:50:00 NaN -18.74257 -17.12091 -17.29398 -19.40846 -19.78042 NaN
2011-10-02 06:00:00 NaN -18.74327 -17.12995 -17.29097 -19.41153 -19.77897 NaN
2011-10-02 06:10:00 NaN -18.74326 -17.04482 -17.28397 -19.40928 -19.77430 NaN
2011-10-02 06:20:00 NaN -18.73100 -16.86221 -17.28575 -19.40956 -19.78396 NaN
... ... ... ... ... ... ... ...
2011-10-03 01:20:00 NaN -18.16448 NaN -16.99797 -17.95030 NaN NaN
2011-10-03 01:30:00 NaN -18.15606 NaN -16.98879 NaN NaN NaN
2011-10-03 01:40:00 NaN -18.12795 NaN -16.97951 NaN NaN NaN
2011-10-03 01:50:00 NaN -18.12974 NaN -16.97937 NaN NaN NaN
2011-10-03 02:00:00 NaN -18.11848 NaN -16.96770 NaN NaN NaN
2011-10-03 02:10:00 NaN -18.11879 NaN -16.95256 NaN NaN NaN
2011-10-03 02:20:00 NaN -18.08212 NaN -16.95461 NaN NaN NaN
2011-10-03 02:30:00 NaN -18.09060 NaN -16.94141 NaN NaN NaN
2011-10-03 02:40:00 NaN -18.07000 NaN -16.93006 NaN NaN NaN
2011-10-03 02:50:00 NaN -18.07461 NaN -16.91700 NaN NaN NaN
2011-10-03 03:00:00 NaN -18.06039 NaN -16.91466 NaN NaN NaN
2011-10-03 03:10:00 NaN -18.04229 NaN -16.89537 NaN NaN NaN
2011-10-03 03:20:00 NaN -18.03514 NaN -16.89753 NaN NaN NaN
2011-10-03 03:30:00 NaN -18.03014 NaN -16.88813 NaN NaN NaN
2011-10-03 03:40:00 NaN -18.00851 NaN -16.88086 NaN NaN NaN
2011-10-03 03:50:00 NaN -18.01028 NaN -16.87721 NaN NaN NaN
2011-10-03 04:00:00 NaN -18.00227 NaN -16.86687 NaN NaN NaN
2011-10-03 04:10:00 NaN -17.98804 NaN -16.85424 NaN NaN NaN
2011-10-03 04:20:00 NaN -17.96740 NaN -16.84466 NaN NaN NaN
2011-10-03 04:30:00 NaN -17.96451 NaN -16.84205 NaN NaN NaN
2011-10-03 04:40:00 NaN -17.95414 NaN -16.82609 NaN NaN NaN
2011-10-03 04:50:00 NaN -17.93661 NaN -16.81903 NaN NaN NaN
2011-10-03 05:00:00 NaN -17.92905 NaN -16.80737 NaN NaN NaN
2011-10-03 05:10:00 NaN -17.92743 NaN -16.80582 NaN NaN NaN
2011-10-03 05:20:00 NaN -17.91504 NaN -16.78991 NaN NaN NaN
2011-10-03 05:30:00 NaN -17.89965 NaN -16.78469 NaN NaN NaN
2011-10-03 05:40:00 NaN -17.89945 NaN -16.77288 NaN NaN NaN
2011-10-03 05:50:00 NaN -17.88822 NaN -16.76610 NaN NaN NaN
2011-10-03 06:00:00 NaN -17.87259 NaN -16.75742 NaN NaN NaN
2011-10-03 06:10:00 NaN -17.87308 NaN NaN NaN NaN NaN
[173 rows x 7 columns]
To achieve this you have to compute a bool mask for each column:
create a dataframe with the DateTimeIndex values repeated over the same number of columns as df: pd.concat([pd.Series(df.index)] * df.columns.size, axis=1).
Here df.index must be turned into a pd.Series for concatenation, then repeated to match the number of columns df.columns.size.
create the mask itself with < idx.values.T, where values gets idx as a numpy.array and T transposes it in order to compare column-wise with the dataframe.
mask the dataframe with df.where(mask.values), where using values gets the mask as a numpy.array. This is needed as the mask does not have the same labels as df.
optionally, keep only the rows where at least one value is not NaN using .dropna(how='all')
Filter masked data on the difference between each time point
If I understand correctly, you want to filter your data on a difference > 0.2 between consecutive time points, for the selected period only.
It remains a bit unclear to me, so do not hesitate to discuss in the comments if I misunderstood.
This can be done with:
df[df_masked.diff(1) > 0.2]
Unfortunately, for the provided dataset there is no value matching these conditions:
>>> df[df_masked.diff(1) > 0.2].any()
Out[]:
A1-24 False
A1-18 False
A1-12 False
A2-24 False
A2-18 False
A2-12 False
A3-24 False
dtype: bool
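To turn this into a single BREAKING value per column (the analogue of R's which(difference > 0.2)[1]), a sketch along these lines should work once the threshold suits the data:
breaking = df_masked.diff(1) > 0.2               # True where a step exceeds the threshold
first_break = breaking.idxmax()                  # first True per column (first row if none)
first_break = first_break.where(breaking.any())  # NaT for columns that never break
duration = first_break - df.index[0]             # time from the start of the file to the break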
Edit: visualize results as a bool dataframe (comments follow-up)
Visualizing the results as a boolean dataframe with index and columns is done very simply with df_masked.diff(1) > 0.2.
However, there will likely be a lot of unnecessary rows containing only False, so you can filter them out this way:
df_diff = df_masked.diff(1) > 0.1  # lowering the threshold a bit to get some values
>>> df_diff[df_diff.any(axis=1)]
Out[]:
A1-24 A1-18 A1-12 A2-24 A2-18 A2-12 A3-24
Timestamp
2011-10-02 06:20:00 False False True False False False False
2011-10-02 06:30:00 False False True False False False False
2011-10-02 06:40:00 False False True False False False False
2011-10-02 06:50:00 False False True False False False False
2011-10-02 07:00:00 False False True False False False False
2011-10-02 07:10:00 False False True False False False False
2011-10-02 07:20:00 False False True False False False False
2011-10-02 07:30:00 False False True False False False False
2011-10-02 07:40:00 False False True False False False False
2011-10-02 07:50:00 False False True False False False False
2011-10-02 08:00:00 False False True False False False False
2011-10-02 08:10:00 False False True False False False False
2011-10-02 08:20:00 False False True False False False False
2011-10-02 08:30:00 False False True False False False False
2011-10-02 08:40:00 False False True False False False False
2011-10-02 08:50:00 False False True False False False False
2011-10-02 09:00:00 False False True False False False False
2011-10-02 09:10:00 False False True False False False False
2011-10-02 09:20:00 False False True False False False False
2011-10-02 09:30:00 False False True False False False False
2011-10-02 12:20:00 False False False False False True False
2011-10-02 12:50:00 False False False False True True False
2011-10-02 13:10:00 False False False False False True False
Related
I'm trying to use the usual times I take medication (plus 4 hours on top of that) to fill in a DataFrame column with a label of 2, 1, or 0: 1 while I am on the medication, 2 for the hour just after coming off it, and 0 otherwise.
As an example of the dataframe I am trying to add this column to:
id sentiment magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]
And the timestamps for when I usually take my medication:
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those timestamps to get this sort of column information?
Update
So, building on the question: how would I make sure it only adds medication labels where there is data available, and that the one-hour after-medication window is applied correctly?
Thanks
Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
import numpy as np
import pandas as pd

# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], how='all', inplace=True)
# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])
# generate time arrays
duration = 2 # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time
# define boolean masks by label
conditions = {
1: df.created.dt.floor('H').dt.time.isin(active),
2: df.created.dt.floor('H').dt.time.isin(after),
}
# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrate the active / after / nan scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0
I have a Pandas dataframe in the following format:
id name timestamp time_diff <=30min
1 movie3 2009-05-04 18:00:00+00:00 NaN False
1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True
1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False
2 movie7 2009-05-04 09:30:00+00:00 NaN False
2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False
3 movie1 2009-05-04 17:45:00+00:00 NaN False
3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True
3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True
3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True
4 movie1 2009-05-05 12:45:00+00:00 NaN False
5 movie7 2009-05-04 11:00:00+00:00 NaN False
5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True
The data shows movies watched on a video streaming platform: id is the user id, name is the name of the movie, and timestamp is when the movie was started. <=30min indicates whether the user started the movie within 30 minutes of the previous movie.
A movie session comprises one or more movies played by a single user, where each movie was started within 30 minutes of the previous movie's start time (basically, a session is a run of consecutive rows in which df['<=30min'] == True).
The length of a session is defined as the timestamp of the last consecutive df['<=30min'] == True row minus the timestamp of the first True row of the session.
How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?
As a first step, I have tried something like this:
df.groupby((df['<=30min'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()
But it doesn't work (the cumsum does not reset when df['<=30min'] is False), and it looks very slow.
Also, I think it would make my life harder when I have to select the 3 longest sessions, as I could get multiple values from the same session among the longest 3.
Not sure I understood you correctly. If I did, then this may work.
Coerce timestamp to datetime:
df['timestamp']=pd.to_datetime(df['timestamp'])
Keep only the True values, which indicate consecutive watching, and group by id while calculating the difference between the maximum and minimum time. This is then joined to the main df:
df.join(
    df[df['<=30min'] == True]
    .groupby('id')['timestamp']
    .transform(lambda x: x.max() - x.min())
    .to_frame()
    .rename(columns={'timestamp': 'Max'})
)
id name timestamp time_diff <=30min Max
0 1 movie3 2009-05-04 18:00:00+00:00 NaN False NaT
1 1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True 00:00:00
2 1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False NaT
3 2 movie7 2009-05-04 09:30:00+00:00 NaN False NaT
4 2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False NaT
5 3 movie1 2009-05-04 17:45:00+00:00 NaN False NaT
6 3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True 00:45:00
7 3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True 00:45:00
8 3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True 00:45:00
9 4 movie1 2009-05-05 12:45:00+00:00 NaN False NaT
10 5 movie7 2009-05-04 11:00:00+00:00 NaN False NaT
11 5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True 00:00:00
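If you need the grouping per session rather than per user, here is a sketch building on the cumsum idea from the question (assuming <=30min is a boolean column; the length follows the question's first-True-to-last-True definition):
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['session'] = (~df['<=30min']).cumsum()   # a new id starts whenever <=30min is False
in_session = df[df['<=30min']]              # only the consecutive True rows count
sessions = in_session.groupby('session').agg(
    length=('timestamp', lambda t: t.max() - t.min()),
    movies=('name', list),
)
sessions.nlargest(3, 'length')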
I have two dataframes: one with past data, the other with a prediction. I would like to merge them so that there are no duplicate columns.
My code looks like this:
Past =
X RealData
2019-03-27 12:30:00 8.295 True
2019-03-27 13:00:00 7.707 True
2019-03-27 13:30:00 7.518 True
2019-03-27 14:00:00 7.518 True
2019-03-27 14:30:00 7.518 True
2019-03-27 15:00:00 7.455 True
2019-03-27 15:30:00 7.518 True
2019-03-27 16:00:00 20.244 True
2019-03-27 16:30:00 20.895 True
2019-03-27 17:00:00 21.630 True
2019-03-27 17:30:00 24.360 True
2019-03-27 18:00:00 24.591 True
2019-03-27 18:30:00 26.460 True
2019-03-27 19:00:00 14.280 True
2019-03-27 19:30:00 12.180 True
2019-03-27 20:00:00 11.550 True
2019-03-27 20:30:00 9.051 True
2019-03-27 21:00:00 8.673 True
2019-03-27 21:30:00 7.791 True
Future =
X RealData
2019-03-27 22:30:00 8.450913 False
2019-03-27 23:00:00 8.494944 False
2019-03-27 23:30:00 9.058649 False
2019-03-28 00:00:00 22.055525 False
2019-03-28 00:30:00 23.344284 False
2019-03-28 01:00:00 24.793011 False
2019-03-28 01:30:00 26.203117 False
2019-03-28 02:00:00 27.897289 False
2019-03-28 02:30:00 14.187933 False
2019-03-28 03:00:00 14.110393 False
At the moment, I am trying:
past_future = pd.concat([Future, Past], axis=1, sort=True)
And I am getting this:
X RealData X RealData
2019-03-27 12:30:00 8.295 True NaN NaN
2019-03-27 13:00:00 7.707 True NaN NaN
2019-03-27 13:30:00 7.518 True NaN NaN
2019-03-27 14:00:00 7.518 True NaN NaN
2019-03-27 14:30:00 7.518 True NaN NaN
2019-03-27 15:00:00 7.455 True NaN NaN
2019-03-27 15:30:00 7.518 True NaN NaN
2019-03-27 16:00:00 20.244 True NaN NaN
2019-03-27 16:30:00 20.895 True NaN NaN
2019-03-27 17:00:00 21.630 True NaN NaN
2019-03-27 17:30:00 24.360 True NaN NaN
2019-03-27 18:00:00 24.591 True NaN NaN
2019-03-27 18:30:00 26.460 True NaN NaN
2019-03-27 19:00:00 14.280 True NaN NaN
2019-03-27 19:30:00 12.180 True NaN NaN
2019-03-27 20:00:00 11.550 True NaN NaN
2019-03-27 20:30:00 9.051 True NaN NaN
2019-03-27 21:00:00 8.673 True NaN NaN
2019-03-27 21:30:00 7.791 True NaN NaN
2019-03-27 22:30:00 NaN NaN 8.450913 False
2019-03-27 23:00:00 NaN NaN 8.494944 False
2019-03-27 23:30:00 NaN NaN 9.058649 False
2019-03-28 00:00:00 NaN NaN 22.055525 False
2019-03-28 00:30:00 NaN NaN 23.344284 False
2019-03-28 01:00:00 NaN NaN 24.793011 False
2019-03-28 01:30:00 NaN NaN 26.203117 False
2019-03-28 02:00:00 NaN NaN 27.897289 False
2019-03-28 02:30:00 NaN NaN 14.187933 False
2019-03-28 03:00:00 NaN NaN 14.110393 False
My expected output is just two columns:
X RealData
2019-03-27 12:30:00 8.295 True
2019-03-27 13:00:00 7.707 True
2019-03-27 13:30:00 7.518 True
2019-03-27 14:00:00 7.518 True
... ... ...
2019-03-27 22:30:00 8.450913 False
2019-03-27 23:00:00 8.494944 False
2019-03-27 23:30:00 9.058649 False
Any idea how to handle this?
My simple advice: keep everything in order. Then everything is easy.
import pandas as pd

df1 = pd.read_csv('c:/4/a1.csv')
df2 = pd.read_csv('c:/4/a2.csv')

df1.date = pd.to_datetime(df1.date)
df2.date = pd.to_datetime(df2.date)  # note: df2.date, not df1.date

df1.set_index(df1.date, inplace=True)
df2.set_index(df2.date, inplace=True)

df = df1.append(df2)       # in newer pandas: df = pd.concat([df1, df2])
df = df.sort_index()       # sort_index returns a new frame; assign it back
df.drop_duplicates('date', keep='last', inplace=True)
df
Just to formalise what ags29 wrote here: Best way to merge/concatenate/join two DataFrames with duplicate columns, but the different Datetime indices?
output = pd.concat([Future.reset_index(), Past.reset_index()], axis=0)
output.set_index('index', inplace=True)
While Wojciech Moszczyński's answer is much more thorough, this seems to do the job quite well.
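For reference, since both frames share the same columns and already carry the timestamps in their index, a plain axis-0 concat may be all that is needed (assuming the two indices do not overlap):
past_future = pd.concat([Past, Future]).sort_index()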
I currently have some time series data that I applied a rolling mean on with a window of 17520.
Before, the head of my data looked like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE ...
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I get it so that my data only begins where there are no NaNs (while also making sure that the dates still match)?
You can try rolling with min_periods=1:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
Also, try using a for loop; you do not need to write the columns one by one:
youcols = ['xxx', ..., 'xxx1']  # your column names
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then,
data = data.dropna(subset=youcols, thresh=1)
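For reference, a variant that keeps the full 17520-observation window and simply drops the warm-up rows, so the remaining rows stay aligned with their original dates (a sketch, assuming youcols holds the smoothed columns):
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=17520).mean()
data = data.dropna(subset=youcols)  # the data now starts at the first complete window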
I've got the following Pandas dataframes:
>>> df
Qual_B temp_B relhum_B Qual_F temp_F relhum_F
Date
1948-01-01 01:00:00 5 -6.0 96 NaN NaN NaN
1948-01-01 02:00:00 5 -5.3 97 NaN NaN NaN
1948-01-01 03:00:00 5 -4.5 98 NaN 3.5 NaN
1948-01-01 04:00:00 5 -4.3 98 NaN 3.7 NaN
1948-01-01 05:00:00 5 -4.0 99 NaN NaN NaN
>>> test
Qual_B temp_B relhum_B Qual_F temp_F relhum_F
Date
1948-01-01 01:00:00 True True True False False False
1948-01-01 02:00:00 True True True False False False
1948-01-01 03:00:00 True True True False True False
1948-01-01 04:00:00 True True True False True False
1948-01-01 05:00:00 True True True False False False
which represents data availability (I created test with test = pandas.notnull(df)). I want a plot like this, or a stacked barplot with time on the x-axis and the columns on the y-axis, and I have tried the following:
fig= plt.figure()
ax = fig.add_subplot(111)
ax.imshow(test.values, aspect='auto', cmap=plt.cm.gray, interpolation='nearest')
but it doesn't do anything, even though the type of the array is exactly the same as in the example above (both numpy.ndarray). My attempt to plot the original dataframe with
test.div(test, axis=0).T.plot(kind = 'barh', stacked=True, legend=False, color='b', edgecolor='none')
seems to be correct for the values that are always present, but not for those that are only partly present. Can anyone help?
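For reference, a minimal self-contained version of the imshow idea; the missing piece may simply be an explicit plt.show() (the random data below is only a stand-in for test):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

idx = pd.date_range('1948-01-01 01:00', periods=5, freq='H')
cols = ['Qual_B', 'temp_B', 'relhum_B', 'Qual_F', 'temp_F', 'relhum_F']
test = pd.DataFrame(np.random.rand(5, 6) > 0.3, index=idx, columns=cols)

fig, ax = plt.subplots()
ax.imshow(test.T.values, aspect='auto', cmap=plt.cm.gray, interpolation='nearest')
ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)  # columns on the y-axis, time along the x-axis
plt.show()                # without this, nothing appears when running a plain script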