I'm trying to create a new column with values that meet specific conditions. Below I have set out code that goes some way toward explaining the logic but does not produce the correct output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'],
'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0],
'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })
max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
(df['type'] == 1) & (df['colour'] == 'red')]
max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]
df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 12.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 6000.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is incorrect:
Value 12.0 in index row 1 for column df['pixelLimit'] is incorrect because this value comes from df['minPixel'] index row 6, which has a df['date'] datetime of 2019-08-08 17:00:00, greater than the 2019-08-08 16:00:00 df['colourDate'] datetime contained in index row 1.
Value 6000.0 in index row 3 for column df['pixelLimit'] is incorrect because this value comes from df['maxPixel'] index row 7, which has a df['date'] datetime of 2019-08-09 16:00:00, greater than the 2019-08-06 22:00:00 df['colourDate'] datetime contained in index row 3.
Correct output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is correct:
Value 14.0 in index row 1 for column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] among rows whose df['date'] datetime is less than the datetime in index row 1 for column df['colourDate'] and greater than or equal to the datetime in index row 1 for column df['date'].
Value 5184.0 in index row 3 for column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] among rows whose df['date'] datetime is less than the datetime in index row 3 for column df['colourDate'] and greater than or equal to the datetime in index row 3 for column df['date'].
Considerations:
Maybe np.select is not best suited for this task and some sort of function might serve the task better?
Also, maybe I need to create some sort of dynamic len to use as a starting point for each row?
Request
Please can anyone help me amend my code to achieve the correct output?
For matching problems like this, one possibility is to do the complete (cross) merge, then subset, using a Boolean Series, to all rows that satisfy your condition (for that row), and find the max or min among all the possible matches. Since this requires slightly different columns and different functions, I split the operations into two very similar pieces of code, one to deal with 1/blue and the other with 1/red.
First, some housekeeping: make the date columns datetime.
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])
Calculate the min pixel for 1/red between the times for each row
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
.groupby('index')['minPixel_y'].min()
.rename('pixel_limit'))
#index
#1 14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')
smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
.groupby('index')['maxPixel_y'].max()
.rename('pixel_limit'))
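For this data smax should contain (this preview is derived from the correct output table below, mirroring the commented smin output above):
#index
#3    5184
#Name: pixel_limit, dtype: int64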
Finally because the above groups over the original index (i.e. 'index') we can simply assign back to align with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])
date type colour maxPixel minPixel colourDate pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
If you need to bring along a lot of different information for the row with the min/max pixel, then instead of groupby min/max we can sort_values and then groupby + head or tail to get the min or max pixel row. For the min this would look like (with a slight renaming of suffixes):
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
# .merge(df[['date', 'minPixel']].reset_index().assign(t=1),
# on='t', suffixes=['', '_match']))
# Only keep rows between the dates, then among those find the min minPixel row.
# A bunch of renaming.
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
.sort_values('minPixel_match', ascending=True)
.groupby('index').head(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'minPixel_match': 'pixel_limit'}))
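For this data smin should hold a single matched row (this preview is derived from the final result shown further below, not from the original code's output):
#       index_match          date_match  pixel_limit
#index
#1                2 2019-08-06 18:00:00           14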
The max would then be similar, using .tail:
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
.sort_values('maxPixel_match', ascending=True)
.groupby('index').tail(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'maxPixel_match': 'pixel_limit'}))
And finally we concat along axis=1 now that we need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
date type colour maxPixel minPixel colourDate index_match date_match pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN NaN NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 2.0 2019-08-06 18:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN NaN NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 3.0 2019-08-06 21:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN NaN NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN NaN NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN NaN NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN NaN NaN
Related
I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When using fillna, you probably want to specify a method, like filling with the previous/next value, the mean of the column, etc. What we can do is something like this:
nulls_index = df['rain_gauge_value'].isnull()  # remember which rows were missing
df = df.fillna(method='ffill')  # use ffill as an example
nulls_after_fill = df[nulls_index]  # inspect the originally-missing rows after filling
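Note that in recent pandas versions the method= argument of fillna is deprecated, so the equivalent spelling is:
df = df.ffill()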
Take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0), time(0,0,1)], temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
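If you want the patched values back as regular columns rather than as an index, a small sketch (using the same dummy frames as above) is to reset the index afterwards:
patched = raw.set_index(['date', 'time']).fillna(patch.set_index(['date', 'time'])).reset_index()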
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column consecutive_hour such that whenever the value at a particular timestamp is less than 1000 we assign 3 hours, and for consecutive such occurrences the count keeps growing to 6, 9, and so on, as above.
Lastly, I want to summarize the table by counting the consecutive-hour occurrences and the number of days, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more, and spent several days on this, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by jezrael (note that the date column must be a datetime, e.g. via pd.to_datetime, for the daily Grouper below to work):
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
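# Quick sanity check (a hypothetical mini Series, not the question's data): the helper
# counts consecutive True values and resets to 0 on each False.
# cumcount_reset(pd.Series([True, True, False, True, True, True])).tolist() -> [1, 2, 0, 1, 2, 3]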
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table. The filter below keeps, for each day, only the rows where consecutive_hour is greater than the next value (i.e. the last row of each consecutive run), so every run is counted exactly once:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows whose start/end timeframe contains the alarm time? (Sorry for the poor wording there!)
eg.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Here a flag would go against lines (well, indexes) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
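As a quick check (hypothetical, not part of the original answer), you can confirm which interval contains a given alarm, since the IntervalIndex built above is still available:
intervals.get_loc(pd.Timestamp(alarms[0]))
# -> 4 for this data (the 2019-07-18 12:00-15:00 interval)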
You were calling your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
I'm trying to merge two dataframes with different datetime frequencies and fill the resulting missing values by duplicating the hourly values.
Dataframe df1 with minute frequency:
time
0 2017-06-01 00:00:00
1 2017-06-01 00:01:00
2 2017-06-01 00:02:00
3 2017-06-01 00:03:00
4 2017-06-01 00:04:00
Dataframe df2 with hourly frequency:
time2 temp hum
1 2017-06-01 00:00:00 13.5 90.0
2 2017-06-01 01:00:00 12.2 95.0
3 2017-06-01 02:00:00 11.7 96.0
4 2017-06-01 03:00:00 11.5 96.0
5 2017-06-01 04:00:00 11.1 97.0
So far I have merged these dataframes but get NaNs:
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
time time2 temp hum
0 2017-06-01 00:00:00 2017-06-01 13.5 90.0
1 2017-06-01 00:01:00 NaT NaN NaN
2 2017-06-01 00:02:00 NaT NaN NaN
3 2017-06-01 00:03:00 NaT NaN NaN
4 2017-06-01 00:04:00 NaT NaN NaN
My desired dataframe should look like this (NaNs filled with the hourly values from df2):
time temp hum
0 2017-06-01 00:00:00 13.5 90.0
1 2017-06-01 00:01:00 13.5 90.0
2 2017-06-01 00:02:00 13.5 90.0
3 2017-06-01 00:03:00 13.5 90.0
4 2017-06-01 00:04:00 13.5 90.0
...
So far I found this solution: merge series/dataframe with different time frequencies in python, but the datetime column is not my index. Does anyone know how to get there?
As suggested by Ben Pap, I did the following two steps as a solution:
import pandas as pd
data1 = {'time':pd.date_range('2017-06-01 00:00:00','2017-06-01 00:09:00', freq='T')}
data2 = {'time2':pd.date_range('2017-06-01 00:00:00','2017-06-01 04:00:00', freq='H'), 'temp':[13.5,12.2,11.7,11.5,11.1], 'hum':[90.0,95.0,96.0,96.0,97.0]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
m2o_merge.fillna(method='ffill', inplace=True)
filled_df = m2o_merge.drop(['time2'], axis=1)
filled_df.head()
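An alternative worth noting (a sketch, not from the original answer, assuming both frames are sorted by time) is pd.merge_asof, which matches every minute directly to the most recent hourly row and avoids the separate ffill step:
filled_df = (pd.merge_asof(df1, df2, left_on='time', right_on='time2', direction='backward')
             .drop(columns='time2'))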
I have a dataframe like this:
df = pd.DataFrame({'timestamp':pd.date_range('2018-01-01', '2018-01-02', freq='2h', closed='right'),'col1':[np.nan, np.nan, np.nan, 1,2,3,4,5,6,7,8,np.nan], 'col2':[np.nan, np.nan, 0, 1,2,3,4,5,np.nan,np.nan,np.nan,np.nan], 'col3':[np.nan, -1, 0, 1,2,3,4,5,6,7,8,9], 'col4':[-2, -1, 0, 1,2,3,4,np.nan,np.nan,np.nan,np.nan,np.nan]
})[['timestamp', 'col1', 'col2', 'col3', 'col4']]
which looks like this:
timestamp col1 col2 col3 col4
0 2018-01-01 02:00:00 NaN NaN NaN -2.0
1 2018-01-01 04:00:00 NaN NaN -1.0 -1.0
2 2018-01-01 06:00:00 NaN 0.0 NaN 0.0
3 2018-01-01 08:00:00 1.0 1.0 1.0 1.0
4 2018-01-01 10:00:00 2.0 NaN 2.0 2.0
5 2018-01-01 12:00:00 3.0 3.0 NaN 3.0
6 2018-01-01 14:00:00 NaN 4.0 4.0 4.0
7 2018-01-01 16:00:00 5.0 NaN 5.0 NaN
8 2018-01-01 18:00:00 6.0 NaN 6.0 NaN
9 2018-01-01 20:00:00 7.0 NaN 7.0 NaN
10 2018-01-01 22:00:00 8.0 NaN 8.0 NaN
11 2018-01-02 00:00:00 NaN NaN 9.0 NaN
Now, I want to find an efficient and pythonic way of chopping off (for each column! Not counting timestamp) everything before the first valid index and after the last valid index. In this example I have 4 columns, but in reality I have a lot more, 600 or so. I am looking for a way to chop off all the NaN values before the first valid index and all the NaN values after the last valid index.
One way would be to loop through, I guess... But is there a better way? It has to be efficient. I tried to "unpivot" the dataframe using melt, but that didn't help.
An obvious point is that each column would have a different number of rows after the chopping. So I would like the result to be a list of data frames (one for each column) having timestamp and the column in question. For instance:
timestamp col1
3 2018-01-01 08:00:00 1.0
4 2018-01-01 10:00:00 2.0
5 2018-01-01 12:00:00 3.0
6 2018-01-01 14:00:00 NaN
7 2018-01-01 16:00:00 5.0
8 2018-01-01 18:00:00 6.0
9 2018-01-01 20:00:00 7.0
10 2018-01-01 22:00:00 8.0
My try
I tried like this:
final = []
columns = [c for c in df if c !='timestamp']
for col in columns:
    first = df.loc[:, col].first_valid_index()
    last = df.loc[:, col].last_valid_index()
    final.append(df.loc[:, ['timestamp', col]].iloc[first:last+1, :])
One idea is to use a list or dictionary comprehension after setting your index as timestamp. You should test with your data to see if this resolves your issue with performance. It is unlikely to help if your limitation is memory.
df = df.set_index('timestamp')
final = {col: df[col].loc[df[col].first_valid_index(): df[col].last_valid_index()] \
for col in df}
print(final)
{'col1': timestamp
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
2018-01-01 16:00:00 5.0
2018-01-01 18:00:00 6.0
2018-01-01 20:00:00 7.0
2018-01-01 22:00:00 8.0
Name: col1, dtype: float64,
...
'col4': timestamp
2018-01-01 02:00:00 -2.0
2018-01-01 04:00:00 -1.0
2018-01-01 06:00:00 0.0
2018-01-01 08:00:00 1.0
2018-01-01 10:00:00 2.0
2018-01-01 12:00:00 3.0
2018-01-01 14:00:00 4.0
Name: col4, dtype: float64}
You can use the power of functional programming and apply a function to each column. This may speed things up. Also, as your timestamps look sorted, you can use them as the index of your DataFrame.
df.set_index('timestamp', inplace=True)
final = []
def func(col):
    first = col.first_valid_index()
    last = col.last_valid_index()
    final.append(col.loc[first:last])
    return
df.apply(func)
Also, you can compact everything into a one-liner:
final = []
df.apply(lambda col: final.append(col.loc[col.first_valid_index() : col.last_valid_index()]))
My approach is to compute, for each column, the cumulative count of non-null values both forward and backward, and keep the entries where both counts are greater than 0. Then I do a dict comprehension to return a dataframe for each column (you can change that to a list if that's what you prefer).
For your example we have
cols = [c for c in df.columns if c!='timestamp']
result_dict = {c: df[(df[c].notnull().cumsum() > 0) &
                     (df.loc[::-1, c].notnull().cumsum()[::-1] > 0)][['timestamp', c]]
               for c in cols}