pandas - Splitting date ranges on specific day boundary - python
I've got a DataFrame of date ranges (the actual DataFrame has more data attached to it but has the same start and end columns). The data ultimately needs to be analyzed week-by-week on a Sunday-Saturday basis. Thus, I'd like to go through the DataFrame, and split any date ranges (start to finish) that cross from a Saturday to Sunday. For example, given the DataFrame:
import pandas as pd
date_ranges = [
{'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
{'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
{'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
{'start': '2020-01-18 22:30:00', 'end': '2020-01-19 02:00:00'} # spans sat-sun, must split
]
data_df = pd.DataFrame(date_ranges)
I want my result to look like:
result_ranges = [
{'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
{'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
{'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
{'start': '2020-01-18 22:30:00', 'end': '2020-01-19 00:00:00'}, # split out saturday portion
{'start': '2020-01-19 00:00:00', 'end': '2020-01-19 02:00:00'} # and the sunday portion
]
result_df = pd.DataFrame(result_ranges)
Any thoughts on how to effectively do this in pandas would be greatly appreciated. Currently I am doing the bad thing, and iterating over rows, and it is quite slow when the data set gets large.
Manipulations like this are always difficult and at some level I think a loop is necessary. In this case, instead of looping over the rows, we can loop over the week edges. This should lead to a rather big gain in performance when the number of weeks your data spans is much smaller than the number of rows you have.
We define the edges and modify the DataFrame endpoints where necessary. In the end the desired DataFrame is whatever is left of the DataFrame we modified, plus all the separate timespans we stored in the list l. The original Index is preserved, so you can see exactly which rows were split. If a single timespan straddles N edges it gets split into N+1 separate rows.
Setup
import pandas as pd

df = data_df.copy()  # work on a copy of the question's DataFrame
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

edges = pd.date_range(df.start.min().normalize() - pd.Timedelta(days=7),
                      df.end.max().normalize() + pd.Timedelta(days=7), freq='W-SUN')
Code
l = []
for edge in edges:
    m = df.start.lt(edge) & df.end.gt(edge)   # Rows to modify
    l.append(df.loc[m].assign(end=edge))      # Clip end of modified rows
    df.loc[m, 'start'] = edge                 # Fix start for next edge

result = pd.concat(l + [df]).sort_values('start')
Output
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
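Since the question ultimately wants week-by-week analysis on a Sunday-Saturday basis, a possible follow-up (a sketch, not part of the answer above; the week column name is made up) is to label each split row with its week period and group on that:
result['week'] = result['start'].dt.to_period('W-SAT')   # 'W-SAT' periods run Sunday through Saturday
weekly = result.groupby('week')                           # e.g. weekly.size(), or aggregate any extra columns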
My solution is even more general than what you asked for: it creates a sequence of "week rows" from each source row, even if the two dates have, say, two Sat/Sun breaks between them.
To check that it works, I added one such row to your DataFrame, so that
it contains:
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 02:00:00
4 2020-01-25 20:30:00 2020-02-02 03:00:00
Note that the last row includes two Sat/Sun breaks: from Jan 25 to Jan 26 and from Feb 1 to Feb 2.
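For reference, a snippet like the following could be used to append that extra test row (a hypothetical reconstruction; the original answer does not show how the row was added):
extra = {'start': '2020-01-25 20:30:00', 'end': '2020-02-02 03:00:00'}
data_df = pd.concat([data_df, pd.DataFrame([extra])], ignore_index=True)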
Start from conversion of both columns to datetime:
data_df.start = pd.to_datetime(data_df.start)
data_df.end = pd.to_datetime(data_df.end)
To process your data, define the following function, to be applied to each row:
def weekRows(row):
    row.index = pd.DatetimeIndex(row)
    gr = row.resample('W-SUN', closed='left')
    ngr = gr.ngroups   # Number of groups
    i = 1
    data = []
    for key, grp in gr:
        dt1 = key - pd.Timedelta('7D')
        dt2 = key
        if i == 1:
            dt1 = row.iloc[0]
        if i == ngr:
            dt2 = row.iloc[1]
        data.append([dt1, dt2])
        i += 1
    return pd.DataFrame(data, columns=['start', 'end'])
Let's see, individually, how it operates on the last two rows.
When you run:
row = data_df.loc[3]
weekRows(row)
(for the last but one row), you will get:
start end
0 2020-01-18 22:30:00 2020-01-19 00:00:00
1 2020-01-19 00:00:00 2020-01-19 02:00:00
And when you run:
row = data_df.loc[4]
weekRows(row)
(for the last), you will get:
start end
0 2020-01-25 20:30:00 2020-01-26 00:00:00
1 2020-01-26 00:00:00 2020-02-02 00:00:00
2 2020-02-02 00:00:00 2020-02-02 03:00:00
And to get your desired result, run:
result = pd.concat(data_df.apply(weekRows, axis=1).values, ignore_index=True)
The result is:
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
4 2020-01-19 00:00:00 2020-01-19 02:00:00
5 2020-01-25 20:30:00 2020-01-26 00:00:00
6 2020-01-26 00:00:00 2020-02-02 00:00:00
7 2020-02-02 00:00:00 2020-02-02 03:00:00
The first 3 result rows come from your first 3 source rows.
The next two rows (index 3 and 4) result from the source row with index 3.
And the last 3 rows (index 5 through 7) result from the last source row.
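Note that weekRows() only returns the start and end columns, so any extra columns on the source rows are lost by the plain concat. One possible workaround (a sketch; the keys= trick is an assumption on top of this answer, not part of it) is to keep the source index and join the extras back:
parts = data_df.apply(weekRows, axis=1)                       # Series of per-row DataFrames
result = pd.concat(parts.tolist(), keys=parts.index)          # outer index level = source row label
result = result.droplevel(1)                                  # keep only the source row label
result = result.join(data_df.drop(columns=['start', 'end']))  # re-attach any extra columns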
Similar to #Valdi_Bo's answer, I looked into breaking down a single interval of (start, end) into a series of intervals, including all the midnights of Sundays in between.
This is accomplished by the following function:
def break_weekly(start, end):
    edges = list(pd.date_range(start, end, freq='W', normalize=True, closed='right'))
    if edges and edges[-1] == end:
        edges.pop()
    return pd.Series(list(zip([start] + edges, edges + [end])))
This code creates a weekly date range from "start" to "end", normalizing to midnight (so Sunday midnight) and keeping the interval open on the left (so the first edge is the Sunday following start).
There's a corner case when "end" falls exactly on a Sunday midnight: since the range has to be closed on one side, we keep it closed on the right, so we check whether the last edge equals "end" and drop it if the two match.
We then use zip() to pair up the dates, prepending the "start" timestamp to the left endpoints and appending the "end" timestamp to the right endpoints.
We finally return a pd.Series of those tuples, since that makes apply() do what we expect.
Example usage:
>>> break_weekly(pd.Timestamp('2020-01-18 22:30:00'), pd.Timestamp('2020-01-19 02:00:00'))
0 (2020-01-18 22:30:00, 2020-01-19 00:00:00)
1 (2020-01-19 00:00:00, 2020-01-19 02:00:00)
dtype: object
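For a range that does not cross a Saturday/Sunday boundary, the function should simply return the original interval unchanged; roughly (output abbreviated in the same style as above):
>>> break_weekly(pd.Timestamp('2020-01-17 04:30:00'), pd.Timestamp('2020-01-17 12:30:00'))
0    (2020-01-17 04:30:00, 2020-01-17 12:30:00)
dtype: object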
At this point, you can apply it to the original data frame to find the complete list of intervals.
First, convert the types of the columns to pd.Timestamp (you have strings in the columns in your example):
data_df = data_df.apply(pd.to_datetime)
Then you can find the whole list of intervals with:
intervals = (data_df
             .apply(lambda r: break_weekly(r.start, r.end), axis=1)
             .unstack().dropna().reset_index(level=0, drop=True)
             .apply(lambda r: pd.Series(r, index=['start', 'end'])))
The first step applies break_weekly() to the "start" and "end" columns, row by row. Since break_weekly() returns a pd.Series, it will end up producing a new DataFrame with one column per date interval (as many as there are weeks in an interval).
Then unstack() will merge those columns back together, and dropna() will drop the NaN that were generated because each row had a different number of columns (different number of intervals for each row.)
At this point we have a multi-index, so reset_index(level=0, drop=True) will drop the index level we don't care about and only keep the one that matches the original DataFrame.
Finally, the last apply() will convert the entries from Python tuples back to a pd.Series and will name the columns "start" and "end" again.
Looking at the result up until this point:
>>> intervals
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
Since the indices match the ones from your original DataFrame, you can use them to connect this DataFrame back to the original one. If you had more columns with values there and want to duplicate them here, it's just a matter of joining the frames together.
For example:
>>> data_df['value'] = ['abc', 'def', 'ghi', 'jkl']
>>> intervals.join(data_df.drop(['start', 'end'], axis=1))
start end value
0 2020-01-16 22:30:00 2020-01-17 01:00:00 abc
1 2020-01-17 04:30:00 2020-01-17 12:30:00 def
2 2020-01-18 10:15:00 2020-01-18 14:00:00 ghi
3 2020-01-18 22:30:00 2020-01-19 00:00:00 jkl
3 2020-01-19 00:00:00 2020-01-19 02:00:00 jkl
You'll notice that the value in the last row has been copied to both rows in that interval.
Related
How do subtraction between timestamp two rows per two with shift - Pandas Python
I would like to do a subtraction with date_time in pandas (Python), but with a shift of two rows; I don't know which function to use.

Timestamp
2020-11-26 20:00:00
2020-11-26 21:00:00
2020-11-26 22:00:00
2020-11-26 23:30:00

Explanation:
(2020-11-26 21:00:00) - (2020-11-26 20:00:00)
(2020-11-26 23:30:00) - (2020-11-26 22:00:00)

The result must be:
01:00:00
01:30:00
First you need to check whether this column is of type datetime. If not, do pd.to_datetime():

demo = pd.DataFrame(columns=['Timestamps'])
demotime = ['20:00:00', '21:00:00', '22:00:00', '23:30:00']
demo['Timestamps'] = demotime
demo['Timestamps'] = pd.to_datetime(demo['Timestamps'])

Your dataframe would look like:

            Timestamps
0  2020-11-29 20:00:00
1  2020-11-29 21:00:00
2  2020-11-29 22:00:00
3  2020-11-29 23:30:00

After that you can use either a for or a while loop and just do:

demo.iloc[i+1, 0] - demo.iloc[i, 0]
IIUC, you want to iterate over chunks of two rows and find the difference. One approach:

res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)

Output:

          Timestamp
1   0 days 01:00:00
3   0 days 01:30:00
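A diff()/slice-based variant should also work here (a sketch, assuming the Timestamp column is already datetime; the iloc[1::2] slice keeps only the within-pair differences):
diffs = df['Timestamp'].diff().iloc[1::2]   # rows 1 and 3: 01:00:00 and 01:30:00
print(diffs)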
Calculate the sum between the fixed time range using Pandas
My dataset looks like this:

time                 Open
2017-01-01 00:00:00  1.219690
2017-01-01 01:00:00  1.688490
2017-01-01 02:00:00  1.015285
2017-01-01 03:00:00  1.357672
2017-01-01 04:00:00  1.293786
2017-01-01 05:00:00  1.040048
2017-01-01 06:00:00  1.225080
2017-01-01 07:00:00  1.145402
....                 ....
2017-12-31 23:00:00  1.145402

I want to find the sum over a specified time range and save it to a new dataframe. Let's say I want the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, i.e. a six-hour window that spans two days. I want the sum over each such range (10 PM to 4 AM the next day) and to put it in a different data frame, for example df_timerange_sum. Note that the sum covers times on two different dates.

What did I do? I used sum() over the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df, not of the time range I specified. I also tried resample, but I cannot find a way to restrict it to a specific time window.
df['time'].dt.hour.between(10, 4) is always False, because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4, 21) and then negate that to get the other hours. Here's what I would do:

# mark those between 4AM and 10PM
# data we want is where s==False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)

# s.cumsum() marks the consecutive False blocks
# on which we will take the sum
blocks = s.cumsum()

# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False)   # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})   # time: min -- select the beginning of each block
)                                             # Open: sum -- compute sum of Open

Output for random data:

                 time      Open
0 2017-01-01 00:00:00  1.282701
1 2017-01-01 22:00:00  2.766324
2 2017-01-02 22:00:00  2.838216
3 2017-01-03 22:00:00  4.151461
4 2017-01-04 22:00:00  2.151626
5 2017-01-05 22:00:00  2.525190
6 2017-01-06 22:00:00  0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas:

df.set_index(['time'], inplace=True)   # make time the index col (not 100% necessary)

# new df that stores your desired output + start and end times if you need them
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])

df2['start_time'] = df[df.index.hour == 22].index   # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index      # gets/stores all end datetimes

for i, row in df2.iterrows():
    # sum 'Open' over each start/end window
    df2.at[i, 'sum_Open'] = df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()

You'd have to add an if statement or something to handle the last day, which ends at 11pm.
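Another possibility (a sketch, assuming 'time' is a datetime column with hourly rows as in the question; df_timerange_sum follows the question's naming) is to select the night hours with between_time and shift them back four hours so that each 10 PM to 4 AM window falls on a single calendar date before summing:
night = df.set_index('time').between_time('22:00', '03:59')                               # hours 22, 23, 0, 1, 2, 3
df_timerange_sum = night.groupby((night.index - pd.Timedelta('4H')).date)['Open'].sum()  # one sum per night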
Resampling a dataframe into a new one while doing some additional operations
I am working with a dataframe where each entry (row) comes with a start time, a duration and other attributes. I would like to create a new dataframe from this one in which each entry from the original is split into 15-minute intervals, while keeping all other attributes the same. The number of entries in the new dataframe per entry in the old one depends on the actual duration of the original one.
At first I tried using pd.resample but it did not do exactly what I expected. I then constructed a function using itertuples() that works quite well, but it took about half an hour on a dataframe of around 3000 rows. Now I want to do the same for 2 million rows, so I am looking for other possibilities.
Let's say I have the following dataframe:

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
            'duration':[22, 8, 35, 2],
            'Attribute_A':['abc', 'def', 'hij', 'klm'],
            'id': [1, 2, 3, 4]}
testdf = pd.DataFrame(testdict)
testdf.loc[:, ['start']] = pd.to_datetime(testdf['start'])
print(testdf)

>>>testdf
                start  duration Attribute_A  id
0 2018-01-05 11:48:00        22         abc   1
1 2018-05-04 09:05:00         8         def   2
2 2018-08-09 07:15:00        35         hij   3
3 2018-09-27 15:00:00         2         klm   4

And I would like my outcome to be like the following:

>>>resultdf
                start  duration Attribute_A  id
0 2018-01-05 11:45:00        12         abc   1
1 2018-01-05 12:00:00        10         abc   1
2 2018-05-04 09:00:00         8         def   2
3 2018-08-09 07:15:00        15         hij   3
4 2018-08-09 07:30:00        15         hij   3
5 2018-08-09 07:45:00         5         hij   3
6 2018-09-27 15:00:00         2         klm   4

This is the function that I built with itertuples which produced the desired result (the one shown just above):

def min15_divider(df, newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15  # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15:  # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # if not, divide it into 15 min intervals until the duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)
                else:  # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                                 'duration': final_dur, 'id': row.id}
                    newdf = newdf.append(to_append, ignore_index=True)
        else:  # when it is not an exact multiple of 15 min
            new_min = orig_min - remains  # convert to a multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains  # remaining minutes in the initial interval
            while cumu_dur < row.duration:  # divide the total into 15 min intervals until the duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15
                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # when we reach the last interval, or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration  # original duration less than the remaining minutes in the first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15)  # remaining duration in the last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': final_dur, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf

Is there any other way to do this without using itertuples that could save me some time? Thanks in advance.
PS. I apologize for anything that may seem a bit weird in my post, as it is the first time that I have asked a question here on Stack Overflow.
EDIT: Many entries can have the same starting time, so grouping by 'start' could be problematic. There is, however, a column with unique values for each entry called simply "id".
Using pd.resample is a good idea, but since you only have the starting time of each row, you need to build the end row before you can use it.
The code below assumes that each starting time in the 'start' column is unique, so that groupby can be used in a slightly unusual way: each group extracts exactly one row. I use groupby because it automatically regroups the dataframes produced by the custom function used by apply.
Note also that the 'duration' column is converted to a timedelta in minutes in order to better perform some math later.

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
            'duration':[22, 8, 35, 2],
            'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'T')
print(testdf)

def calcduration(df, starttime):
    if len(df) == 1:
        return
    elif len(df) == 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1] = df['duration'].iloc[1] - df['duration'].iloc[0]
    elif len(df) > 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1:-1] = pd.Timedelta(15, 'T')
        df['duration'].iloc[-1] = df['duration'].iloc[-1] - df['duration'].iloc[:-1].sum()

def expandtime(x):
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15T').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf

findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)

This code produces:

                        duration Attribute_A
    start
0   2018-01-05 11:45:00 00:12:00         abc
    2018-01-05 12:00:00 00:10:00         abc
1   2018-05-04 09:00:00 00:08:00         def
2   2018-08-09 07:15:00 00:15:00         hij
    2018-08-09 07:30:00 00:15:00         hij
    2018-08-09 07:45:00 00:05:00         hij
3   2018-09-27 15:00:00 00:02:00         klm

A bit of explanation: expandtime is the first custom function. It takes a dataframe of one row (because we assume that 'start' values are unique), builds a second row whose 'start' is equal to the 'start' of the first row plus the duration, and then uses resample to sample it in time intervals of 15 minutes. Values of all other columns are duplicated.
calcduration is used to do some math on the 'duration' column in order to calculate the correct duration of each row.
So, starting with your df:

import numpy as np   # needed for np.minimum below
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
            'duration':[22, 8, 35, 2],
            'Attribute_A':['abc', 'def', 'hij', 'klm']}
df = pd.DataFrame(testdict)
df.loc[:, ['start']] = pd.to_datetime(df['start'])
print(df)

First calculate an ending time for each row:

df['dur'] = pd.to_timedelta(df['duration'], unit='m')
df['end'] = df['start'] + df['dur']

Then create two new columns that hold the regular-interval (15 minute) start and end dates:

df['start15'] = df['start'].dt.floor('15min')
df['end15'] = df['end'].dt.floor('15min')

At this point, the dataframe looks like:

  Attribute_A  duration               start      dur                 end             start15               end15
0         abc        22 2018-01-05 11:48:00 00:22:00 2018-01-05 12:10:00 2018-01-05 11:45:00 2018-01-05 12:00:00
1         def         8 2018-05-04 09:05:00 00:08:00 2018-05-04 09:13:00 2018-05-04 09:00:00 2018-05-04 09:00:00
2         hij        35 2018-08-09 07:15:00 00:35:00 2018-08-09 07:50:00 2018-08-09 07:15:00 2018-08-09 07:45:00
3         klm         2 2018-09-27 15:00:00 00:02:00 2018-09-27 15:02:00 2018-09-27 15:00:00 2018-09-27 15:00:00

The start15 and end15 columns combine to have the right times, but you need to merge them:

df = pd.melt(df, ['dur', 'start', 'Attribute_A', 'end'], ['start15', 'end15'], value_name='start15')
df = df.drop('variable', 1).drop_duplicates('start15').sort_values('start15').set_index('start15')

Output:

                          dur               start Attribute_A
start15
2018-01-05 11:45:00  00:22:00 2018-01-05 11:48:00         abc
2018-01-05 12:00:00  00:22:00 2018-01-05 11:48:00         abc
2018-05-04 09:00:00  00:08:00 2018-05-04 09:05:00         def
2018-08-09 07:15:00  00:35:00 2018-08-09 07:15:00         hij
2018-08-09 07:45:00  00:35:00 2018-08-09 07:15:00         hij
2018-09-27 15:00:00  00:02:00 2018-09-27 15:00:00         klm

Looking good, but the 2018-08-09 07:30:00 row is missing. Fill in this and any other missing rows with groupby and resample:

df = df.groupby('start').resample('15min').ffill().reset_index(0, drop=True).reset_index()

Get the end15 column back; it was dropped during the melt operation earlier:

df['end15'] = df['end'].dt.floor('15min')

Then calculate the correct durations for each row. I split this into two calculations (durations that spread across multiple timesteps, and ones that don't) to keep it readable:

df.loc[df['start15'] != df['end15'], 'duration'] = np.minimum(df['end15'] - df['start'], pd.Timedelta('15min').to_timedelta64())
df.loc[df['start15'] == df['end15'], 'duration'] = np.minimum(df['end'] - df['end15'], df['end'] - df['start'])

Then just some clean-up to make it look like you wanted:

df['duration'] = (df['duration'].dt.seconds / 60).astype(int)
df = df[['start15', 'duration', 'Attribute_A']].copy()
print(df)

Result:

               start15  duration Attribute_A
0  2018-01-05 11:45:00        12         abc
1  2018-01-05 12:00:00        10         abc
2  2018-05-04 09:00:00         8         def
3  2018-08-09 07:15:00        15         hij
4  2018-08-09 07:30:00        15         hij
5  2018-08-09 07:45:00         5         hij
6  2018-09-27 15:00:00         2         klm

Please note, portions of this answer were based on this answer.
Merge DataFrames with Matching Values From Two Different Columns - Pandas [duplicate]
This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
I have two different DataFrames that I want to merge on the date and hours columns. I saw some threads on this, but I could not find a solution for my issue. I also read this document and tried different combinations; however, it did not work well. Example of my two different DataFrames:

DF1
          date     hours        var1       var2
0   2013-07-10  00:00:00  150.322617  52.225920
1   2013-07-10  01:00:00  155.250917  53.365296
2   2013-07-10  02:00:00  124.918667  51.158249
3   2013-07-10  03:00:00  143.839217  53.138251
.....
9   2013-09-10  09:00:00  148.135818  86.676341
10  2013-09-10  10:00:00  147.833517  53.658016
11  2013-09-10  12:00:00  149.580233  69.745368
12  2013-09-10  13:00:00  163.715317  14.524894
13  2013-09-10  14:00:00  168.856650  10.762779

DF2
          date     hours  myvar1  myvar2
0   2013-07-10  09:00:00   1.617   98.56
1   2013-07-10  10:00:00   2.917   23.60
2   2013-07-10  12:00:00  19.667   36.15
3   2013-07-10  13:00:00  14.217   45.16
.....
20  2013-09-10  20:00:00   1.517   53.56
21  2013-09-10  21:00:00   5.233   69.47
22  2013-09-10  22:00:00  13.717   14.25
23  2013-09-10  23:00:00  18.850   10.69

As you can see in both DataFrames, DF2 starts at 09:00:00 and I want to join it with the 09:00:00 row of DF1, which is basically matching the dates and times. So far, I tried many different combinations using previous threads and the documentation mentioned above. An example:

merged_df = DF2.merge(DF1, how='left', on=['date', 'hours'])

This introduces NaN values from the right DataFrame. I know I do not have to use both the date and hours columns; however, I still get the same result. I quickly tried it in R like this, which works perfectly fine:

merged_df <- left_join(DF1, DF2, by = 'date')

Is there any way in pandas to merge DataFrames only on matching values, without getting NaN values?
Use how='inner' in pd.merge:

merged_df = DF2.merge(DF1, how='inner', on=['date', 'hours'])

This will perform an "inner join", thereby omitting rows in each dataframe that do not match. Hence, no NaN in either the right or the left part of the merged dataframe.
Remove row where date is between two dates
I would like to delete the rows from dataframe df1 if the current date is between the ShiftScheduledStart and ShiftScheduledEnd values. My idea was the code below; however, this does not give the right result.

df1[(df1['ShiftScheduledEnd'] < CurrentDateVar) & (CurrentDateVar < df1['ShiftScheduledStart'])]

What is wrong? Thanks!
I don't know what you're expecting, but none of your rows satisfy your condition:

In [7]:
t = """ShiftScheduledEnd,ShiftScheduledStart
16-5-2015 14:30,16-5-2015 6:00
13-7-2015 22:00,13-7-2015 14:00
13-7-2015 22:30,13-7-2015 14:00
13-7-2015 22:00,13-7-2015 14:00"""
df1 = pd.read_csv(io.StringIO(t), parse_dates=[0, 1])
print(df1)
CurrentDateVar = pd.to_datetime('14-7-2015 23:45')
CurrentDateVar

    ShiftScheduledEnd ShiftScheduledStart
0 2015-05-16 14:30:00 2015-05-16 06:00:00
1 2015-07-13 22:00:00 2015-07-13 14:00:00
2 2015-07-13 22:30:00 2015-07-13 14:00:00
3 2015-07-13 22:00:00 2015-07-13 14:00:00

Out[7]: Timestamp('2015-07-14 23:45:00')

In [8]: df1[(df1['ShiftScheduledStart'] < CurrentDateVar) & (df1['ShiftScheduledEnd'] > CurrentDateVar)]

Out[8]:
Empty DataFrame
Columns: [ShiftScheduledEnd, ShiftScheduledStart]
Index: []
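To actually delete those rows rather than just inspect them, a small sketch (assuming both columns are already datetime) would build the containment mask and negate it:
inside = (df1['ShiftScheduledStart'] <= CurrentDateVar) & (CurrentDateVar <= df1['ShiftScheduledEnd'])
df1 = df1[~inside]   # keep only rows whose interval does not contain the current time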