I have two DataFrames that I want to merge on their date and hours columns. I looked through existing threads but could not find a solution for my issue. I also read the merge documentation and tried different combinations, but none of them worked.
Examples of my two DataFrames:
DF1
date hours var1 var2
0 2013-07-10 00:00:00 150.322617 52.225920
1 2013-07-10 01:00:00 155.250917 53.365296
2 2013-07-10 02:00:00 124.918667 51.158249
3 2013-07-10 03:00:00 143.839217 53.138251
.....
9 2013-09-10 09:00:00 148.135818 86.676341
10 2013-09-10 10:00:00 147.833517 53.658016
11 2013-09-10 12:00:00 149.580233 69.745368
12 2013-09-10 13:00:00 163.715317 14.524894
13 2013-09-10 14:00:00 168.856650 10.762779
DF2
date hours myvar1 myvar2
0 2013-07-10 09:00:00 1.617 98.56
1 2013-07-10 10:00:00 2.917 23.60
2 2013-07-10 12:00:00 19.667 36.15
3 2013-07-10 13:00:00 14.217 45.16
.....
20 2013-09-10 20:00:00 1.517 53.56
21 2013-09-10 21:00:00 5.233 69.47
22 2013-09-10 22:00:00 13.717 14.25
23 2013-09-10 23:00:00 18.850 10.69
As you can see in both DataFrames, DF2 starts at 09:00:00 and I want to join it with DF1 at 09:00:00, i.e. on the matching dates and times. So far, I have tried many different combinations using previous threads and the documentation mentioned above. An example:
merged_df = DF2.merge(DF1, how = 'left', on = ['date', 'hours'])
This introduces NaN values from the right DataFrame. I know I do not have to use both the date and hours columns, but I still get the same result. I tried it quickly in R like this, and it works perfectly fine:
merged_df <- left_join(DF1, DF2, by = 'date')
Is there any way in pandas to merge DataFrames on just the matching values, without getting NaN values?
Use how='inner' in pd.merge:
merged_df = DF2.merge(DF1, how = 'inner', on = ['date', 'hours'])
This will perform an "inner join", thereby omitting rows in each DataFrame that do not match. Hence, there are no NaN values in either the right or left part of the merged DataFrame.
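As a small illustration, here is a minimal runnable sketch with made-up values (the miniature DF1/DF2 below are assumptions, not the original data); only rows whose (date, hours) pair exists in both frames survive, so no NaN appears:
import pandas as pd

# Made-up miniature versions of DF1 and DF2
DF1 = pd.DataFrame({'date':  ['2013-07-10'] * 3,
                    'hours': ['08:00:00', '09:00:00', '10:00:00'],
                    'var1':  [150.3, 155.3, 124.9]})
DF2 = pd.DataFrame({'date':  ['2013-07-10'] * 2,
                    'hours': ['09:00:00', '10:00:00'],
                    'myvar1': [1.617, 2.917]})

merged_df = DF2.merge(DF1, how='inner', on=['date', 'hours'])
print(merged_df)   # 2 rows, only the matching timestamps, no NaN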
Related
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort by the column and take the first 10 rows:
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
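Since you need the 10 lowest per week, one possible way (a sketch, not from the answers above; the hourly DatetimeIndex named 'Date' and the 'n_events' column are assumptions based on your description) is to combine nsmallest with a weekly Grouper:
import pandas as pd
import numpy as np

# Hypothetical hourly event counts covering two weeks
idx = pd.date_range('2020-06-01', periods=24 * 14, freq='H', name='Date')
events = pd.DataFrame({'n_events': np.random.default_rng(1).integers(0, 10, len(idx))},
                      index=idx)

# The 10 hours with the fewest events within each calendar week
lowest = events.groupby(pd.Grouper(freq='W'))['n_events'].nsmallest(10)
print(lowest)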
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an Hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
Pivot the hour values into columns, with n_events as the values.
You'll then have one datetime index and 24 hour columns, with values denoting the number of events. See pandas.DataFrame.pivot_table.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly level and aggregate using sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe. But fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:]))
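Putting those steps together, here is a minimal end-to-end sketch; the column names 'date' and 'n_events' and the synthetic data are assumptions based on the sample shown:
import pandas as pd
import numpy as np

# Hypothetical hourly event data covering two weeks
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2020-06-01', periods=24 * 14, freq='H'),
    'n_events': rng.integers(0, 10, size=24 * 14),
})

df['hour'] = df['date'].dt.hour        # hour of the day
df['day'] = df['date'].dt.floor('D')   # calendar day, used as the pivot index

# One row per day, one column per hour of day, values = number of events
hourly = df.pivot_table(index='day', columns='hour', values='n_events', aggfunc='sum')

# Aggregate to a weekly level
weekly = hourly.resample('W').sum()

# For each week, print the 10 smallest hourly totals
for row in weekly.itertuples():
    print(row[0].date(), sorted(row[1:])[:10])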
I've got a DataFrame of date ranges (the actual DataFrame has more data attached to it but has the same start and end columns). The data ultimately needs to be analyzed week-by-week on a Sunday-Saturday basis. Thus, I'd like to go through the DataFrame, and split any date ranges (start to finish) that cross from a Saturday to Sunday. For example, given the DataFrame:
import pandas as pd
date_ranges = [
{'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
{'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
{'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
{'start': '2020-01-18 22:30:00', 'end': '2020-01-19 02:00:00'} # spans sat-sun, must split
]
data_df = pd.DataFrame(date_ranges)
I want my result to look like:
result_ranges = [
{'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
{'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
{'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
{'start': '2020-01-18 22:30:00', 'end': '2020-01-19 00:00:00'}, # split out saturday portion
{'start': '2020-01-19 00:00:00', 'end': '2020-01-19 02:00:00'} # and the sunday portion
]
result_df = pd.DataFrame(result_ranges)
Any thoughts on how to effectively do this in pandas would be greatly appreciated. Currently I am doing the bad thing, and iterating over rows, and it is quite slow when the data set gets large.
Manipulations like this are always difficult and at some level I think a loop is necessary. In this case, instead of looping over the rows, we can loop over the edges. This should lead to a rather big gain in performance when the number of weeks your data span is much smaller than the number of rows you have.
We define edges and modify the DataFrame endpoints where necessary. In the end the desired DataFrame is whatever is left of the DataFrame we modified, plus all the separate timespans we stored in l. The original Index is preserved, so you can see exactly what rows were split. If a single timespan straddles N edges it gets split into N+1 separate rows.
Setup
import pandas as pd

df = data_df.copy()   # work on a copy of the DataFrame from the question
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

edges = pd.date_range(df.start.min().normalize() - pd.Timedelta(days=7),
                      df.end.max().normalize() + pd.Timedelta(days=7), freq='W-SUN')
Code
l = []
for edge in edges:
    m = df.start.lt(edge) & df.end.gt(edge)  # Rows to modify
    l.append(df.loc[m].assign(end=edge))     # Clip end of modified rows
    df.loc[m, 'start'] = edge                # Fix start for next edge

result = pd.concat(l + [df]).sort_values('start')
Output
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
My solution is even more general than what you defined: it creates a sequence of "week rows" from each source row, even if the two dates contain, for example, two Sat/Sun breaks between them.
To check that it works, I added one such row to your DataFrame, so that
it contains:
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 02:00:00
4 2020-01-25 20:30:00 2020-02-02 03:00:00
Note that the last row includes two Sat/Sun breaks, from 25.01 to 26.01 and from 01.02 to 02.02.
Start from conversion of both columns to datetime:
data_df.start = pd.to_datetime(data_df.start)
data_df.end = pd.to_datetime(data_df.end)
To process your data, define the following function, to be applied to each row:
def weekRows(row):
    # Index the 2-element Series (start, end) by its own values,
    # then resample it into weeks ending on Sunday
    row.index = pd.DatetimeIndex(row)
    gr = row.resample('W-SUN', closed='left')
    ngr = gr.ngroups                    # Number of groups
    i = 1
    data = []
    for key, grp in gr:
        dt1 = key - pd.Timedelta('7D')  # Default slice start: previous Sunday midnight
        dt2 = key                       # Default slice end: this Sunday midnight
        if i == 1:
            dt1 = row.iloc[0]           # First slice starts at the original start
        if i == ngr:
            dt2 = row.iloc[1]           # Last slice ends at the original end
        data.append([dt1, dt2])
        i += 1
    return pd.DataFrame(data, columns=['start', 'end'])
Let's see individually how it operates on the last 2 rows.
When you run:
row = data_df.loc[3]
weekRows(row)
(for the last but one row), you will get:
start end
0 2020-01-18 22:30:00 2020-01-19 00:00:00
1 2020-01-19 00:00:00 2020-01-19 02:00:00
And when you run:
row = data_df.loc[4]
weekRows(row)
(for the last), you will get:
start end
0 2020-01-25 20:30:00 2020-01-26 00:00:00
1 2020-01-26 00:00:00 2020-02-02 00:00:00
2 2020-02-02 00:00:00 2020-02-02 03:00:00
And to get your desired result, run:
result = pd.concat(data_df.apply(weekRows, axis=1).values, ignore_index=True)
The result is:
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
4 2020-01-19 00:00:00 2020-01-19 02:00:00
5 2020-01-25 20:30:00 2020-01-26 00:00:00
6 2020-01-26 00:00:00 2020-02-02 00:00:00
7 2020-02-02 00:00:00 2020-02-02 03:00:00
The first 3 rows result from your first 3 source rows.
The next two rows (index 3 and 4) result from the source row with index 3.
And the last 3 rows (index 5 through 7) result from the last source row.
Similar to #Valdi_Bo's answer, I looked into breaking down a single interval of (start, end) into a series of intervals, including all the midnights of Sundays in between.
This is accomplished by the following function:
def break_weekly(start, end):
    edges = list(pd.date_range(start, end, freq='W', normalize=True, closed='right'))
    if edges and edges[-1] == end:
        edges.pop()
    return pd.Series(list(zip([start] + edges, edges + [end])))
This code creates a weekly date range from "start" to "end", normalized to midnight (so Sunday midnight), and keeps the interval open on the left (so the first edge is the Sunday following "start").
There's a corner case when "end" is exactly midnight on a Sunday: since the interval needs to be closed on one side, and we keep it closed on the right, we check whether the last edge equals "end" and drop it if it does.
We then use zip() to create a tuple for each pair of consecutive dates, adding the "start" timestamp at the beginning on the left and the "end" timestamp at the end on the right.
We finally return a pd.Series of those tuples, since that makes apply() do what we expect.
Example usage:
>>> break_weekly(pd.Timestamp('2020-01-18 22:30:00'), pd.Timestamp('2020-01-19 02:00:00'))
0 (2020-01-18 22:30:00, 2020-01-19 00:00:00)
1 (2020-01-19 00:00:00, 2020-01-19 02:00:00)
dtype: object
At this point, you can apply it to the original data frame to find the complete list of intervals.
First, convert the types of the columns to pd.Timestamp (you have strings in the columns in your example):
data_df = data_df.apply(pd.to_datetime)
Then you can find the whole list of intervals with:
intervals = (data_df
.apply(lambda r: break_weekly(r.start, r.end), axis=1)
.unstack().dropna().reset_index(level=0, drop=True)
.apply(lambda r: pd.Series(r, index=['start', 'end'])))
The first step applies break_weekly() to the "start" and "end" columns, row by row. Since break_weekly() returns a pd.Series, it will end up producing a new DataFrame with one column per date interval (as many as there are weeks in an interval).
Then unstack() will merge those columns back together, and dropna() will drop the NaN that were generated because each row had a different number of columns (different number of intervals for each row.)
At this point we have a multi-index, so reset_index(level=0, drop=True) will drop the index level we don't care about and only keep the one that matches the original DataFrame.
Finally, the last apply() will convert the entries from Python tuples back to a pd.Series and will name the columns "start" and "end" again.
Looking at the result up until this point:
>>> intervals
start end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
Since the indices match the ones from your original DataFrame, you can use this DataFrame to connect the intervals back to your original one. If you have more columns with values there and you want to duplicate those here, it's just a matter of joining them together.
For example:
>>> data_df['value'] = ['abc', 'def', 'ghi', 'jkl']
>>> intervals.join(data_df.drop(['start', 'end'], axis=1))
start end value
0 2020-01-16 22:30:00 2020-01-17 01:00:00 abc
1 2020-01-17 04:30:00 2020-01-17 12:30:00 def
2 2020-01-18 10:15:00 2020-01-18 14:00:00 ghi
3 2020-01-18 22:30:00 2020-01-19 00:00:00 jkl
3 2020-01-19 00:00:00 2020-01-19 02:00:00 jkl
You'll notice that the value in the last row has been copied to both rows in that interval.
I have a df1:
I have another df2:
I want to add another column in df1 Flag:
1 if the Timestamp for that Job lies in any of the time periods defined in df2 against that job (in any of the 4 JobName columns in df2),
0 otherwise.
I tried to be as clear as possible.
I have done it iterating over the df1 rows using iterrows, but we all know that is not the right/efficient way.
I'm looking for a vectorized operation to achieve the same!!
I assume that TimeStamp, StartTime and EndTime columns in your DataFrames
are of datetime type (not string).
First reformat df2 into df3 the following way:
df3 = df2.set_index(['Key', 'StartTime', 'EndTime']).stack()\
.replace('', np.nan).dropna().reset_index(level=[0,3], drop=True)\
.rename('JobName').reset_index()
For my sample data (first few rows of your df2) the result is:
StartTime EndTime JobName
0 2019-04-20 12:30:00 2019-04-20 13:20:00 A
1 2019-04-20 12:30:00 2019-04-20 13:20:00 S
2 2019-04-16 12:30:00 2019-04-21 13:20:00 B
3 2019-04-17 12:30:00 2019-04-22 13:20:00 C
4 2019-04-17 12:30:00 2019-04-22 13:20:00 A
5 2019-04-17 12:30:00 2019-04-22 13:20:00 G
Then define the following function, returning the Flag value for
the current row in df1:
def getFlag(row):
    return 1 if any(df3.JobName.eq(row.JobName) & df3.StartTime.le(row.TimeStamp)
                    & df3.EndTime.ge(row.TimeStamp)) else 0
And the last step is to apply this function and save the result in a new column:
df1['Flag'] = df1.apply(getFlag, axis=1)
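For completeness, here is a hypothetical pair of frames shaped like the ones implied above (the real df1 and df2 were shown as images in the question, so the data below is made up), run through the same steps:
import numpy as np
import pandas as pd

# Hypothetical df2: one row per Key, a time window, and up to 4 JobName columns
df2 = pd.DataFrame({
    'Key':       [1, 2],
    'StartTime': pd.to_datetime(['2019-04-20 12:30:00', '2019-04-17 12:30:00']),
    'EndTime':   pd.to_datetime(['2019-04-20 13:20:00', '2019-04-22 13:20:00']),
    'JobName1':  ['A', 'C'],
    'JobName2':  ['S', 'A'],
    'JobName3':  ['',  'G'],
    'JobName4':  ['',  ''],
})

# Hypothetical df1: one row per (JobName, TimeStamp) to flag
df1 = pd.DataFrame({
    'JobName':   ['A', 'B'],
    'TimeStamp': pd.to_datetime(['2019-04-20 12:45:00', '2019-04-20 12:45:00']),
})

df3 = df2.set_index(['Key', 'StartTime', 'EndTime']).stack()\
    .replace('', np.nan).dropna().reset_index(level=[0, 3], drop=True)\
    .rename('JobName').reset_index()

def getFlag(row):
    return 1 if any(df3.JobName.eq(row.JobName) & df3.StartTime.le(row.TimeStamp)
                    & df3.EndTime.ge(row.TimeStamp)) else 0

df1['Flag'] = df1.apply(getFlag, axis=1)
print(df1)   # Flag is 1 for job 'A' (inside a window), 0 for job 'B'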
I have an Excel file with a column named StartTime containing hh:mm:ss XX data, and the cells are in the 'h:mm:ss AM/PM' custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
sheet_name='Sheet1',
converters={'StartTime' : str},
)
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is it a bug or how do you overcome this? Thanks.
[Update: 7-Dec-2018]
I guess I may have made changes to the Excel file that made it behave oddly. I created another Excel file and present it here (I could not attach an Excel file here, and it would not be safe anyway):
I created the following code to test:
import pandas as pd
df = pd.read_excel('./Book1.xlsx',
sheet_name='Sheet1',
converters={'StartTime': str,
'EndTime': str
}
)
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df,'\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform a time delta from two "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question accordingly. Thank you to those who replied and tried it out.
The question is:
How do I represent the time value in hours instead of microseconds?
It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list these options explicitly despite them being available.
Like so:
pd.read_excel(r'./mydata.xls',
parse_dates=['StartTime'],
date_parser=lambda x: pd.datetime.strptime(x, '%I:%M:%S %p').time())
Given the update:
pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
I currently have two pandas data frames which are both indexed using the pandas DateTimeIndex format.
df1
datetimeindex value
2014-01-01 00:00:00 204.501667
2014-01-01 01:00:00 125.345000
2014-01-01 02:00:00 119.660000
df2 (where the year 1900 is a filler year I added during import. Actual year does not matter)
datetimeindex temperature
1900-01-01 00:00:00 48.2
1900-01-01 01:00:00 30.2
1900-01-01 02:00:00 42.8
I would like to use pd.merge to combine the data frames based on the left index, however, I would like to ignore the year altogether to yield this:
merged_df
datetimeindex value temperature
2014-01-01 00:00:00 204.501667 48.2
2014-01-01 01:00:00 125.345000 30.2
2014-01-01 02:00:00 119.660000 42.8
so far I have tried:
merged_df = pd.merge(df1, df2,
                     left_on=['df1.index.month', 'df1.index.day', 'df1.index.hour'],
                     right_on=['df2.index.month', 'df2.index.day', 'df2.index.hour'],
                     how='left')
which gave me the error KeyError: 'df2.index.month'
Is there a way to perform this merge as I have outlined it?
Thanks
You have to lose the quotes:
In [11]: pd.merge(df1, df2, left_on=[df1.index.month, df1.index.day, df1.index.hour],
right_on=[df2.index.month, df2.index.day, df2.index.hour])
Out[11]:
key_0 key_1 key_2 value temperature
0 1 1 0 204.501667 48.2
1 1 1 1 125.345000 30.2
2 1 1 2 119.660000 42.8
Here "df2.index.month" is a string whereas df2.index.month is the array of months.
Probably not as efficient because pd.to_datetime can be slow:
df2['NewIndex'] = pd.to_datetime(df2.index)
df2['NewIndex'] = df2['NewIndex'].apply(lambda x: x.replace(year=2014))
df2.set_index('NewIndex',inplace=True)
Then just do a merge on the whole index.
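A rough sketch of what that final step could look like (the miniature df1/df2 below are hypothetical reconstructions of the frames from the question, not the original data):
import pandas as pd

# Hypothetical reconstructions of the two frames
df1 = pd.DataFrame({'value': [204.501667, 125.345000, 119.660000]},
                   index=pd.date_range('2014-01-01', periods=3, freq='H'))
df2 = pd.DataFrame({'temperature': [48.2, 30.2, 42.8]},
                   index=pd.date_range('1900-01-01', periods=3, freq='H'))

# The steps from the answer above: replace the filler year, re-index
df2['NewIndex'] = pd.to_datetime(df2.index)
df2['NewIndex'] = df2['NewIndex'].apply(lambda x: x.replace(year=2014))
df2.set_index('NewIndex', inplace=True)

# Merge on the whole index
merged_df = df1.merge(df2, left_index=True, right_index=True, how='left')
print(merged_df)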