I need to match multiple criteria between two dataframes and then assign an ID.
This is complicated by the fact that one criterion needs to be 'like or similar' rather than exact, as it involves a time reference that differs slightly.
I need the timestamps to match within +/- half a second. I then would like to add a new column to DF2 that prints the matching ID:
DF1
TimeStamp ID Size
2018-07-12T03:34:54.228000Z 46236499 0.0013
2018-07-12T03:34:54.301000Z 46236500 0.01119422
DF2
TimeStamp Size ID #new column
2018-07-12T03:34:54.292Z 0.00 blank #no match/no data
2018-07-12T03:34:54.300Z 0.01119422 46236500 #size and timestamp match within tolerances
In the example above, the script would look at the timestamp column and find any timestamp in DF2 that matched "2018-07-12T03:34:54" +/- half a second and also had the exact same 'Size' value.
It needs to be done this way because the same 'Size' value could occur multiple times throughout the dataset.
It would then stamp the corresponding ID in the newly created 'ID' column within DF2, or if DF2 was copied to a new DF, I would just add the new 'ID' column to DF3.
Depending on which rows you need in the final dataframe, you may choose different join operators. One solution merges the two dataframes on the Size column and then filters the resulting rows based on the absolute time difference between the merged datetime columns.
import numpy
import pandas
from datetime import timedelta
df3 = df1.merge(df2, left_on='Size', right_on='Size', how='right')
df3['deltaTime'] = numpy.abs(df3['TimeStamp_x'] - df3['TimeStamp_y'])
df3 = df3[(df3['deltaTime'] < timedelta(milliseconds=500)) | pandas.isnull(df3['deltaTime'])]
Output:
TimeStamp_x ID_x Size TimeStamp_y ID_y deltaTime
0 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.300 46236500 00:00:00.001000
1 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.800 46236501 00:00:00.499000
3 NaT NaN 0.000000 2018-07-12 03:34:54.292 blank NaT
If you don't want any non-merged rows, just remove | pandas.isnull(df3['deltaTime']) and use an inner join.
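Note that the subtraction above only works if both TimeStamp columns are real datetimes; if they are still strings, as in the sample data, convert them before merging. A quick sketch:
import pandas
df1['TimeStamp'] = pandas.to_datetime(df1['TimeStamp'])
df2['TimeStamp'] = pandas.to_datetime(df2['TimeStamp'])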
Related
Is it possible to return the entirety of the data, not just the columns we are grouping by?
For example, I have a dataframe with 5 columns; one of those columns contains distance, another is a timestamp, and the last important one is name. I grouped the dataframe by timestamp and applied the min aggregation on distance. As a result I get a correctly grouped dataframe with timestamp and distance, but how can I add the name column there? If I group by name as well, then timestamp becomes duplicated, and it has to stay unique. As a final result I need to get a dataframe like this:
timestamp              name   distance
2020-03-03 15:30:235   Billy  123
2020-03-03 15:30:435   Johny  111
But instead I get this:
timestamp              distance
2020-03-03 15:30:235   123
2020-03-03 15:30:435   111
The whole table has more than 700k rows, so joining it back on distance produces an amount of rows my PC can't even handle.
Here is my groupby which gives me the 2nd table:
grouped_df = df1.groupby('timestamp')['distance'].min()
Here is what I tried in order to get name into the table:
grouped_df.merge(df1, how='left', left_on=['timestamp', 'distance'], right_on=['timestamp', 'distance'])
Just try
out = df.sort_values('distance').drop_duplicates('timestamp')
Then try with transform
m = df.groupby('timestamp')['distance'].transform('min')
dout = df[df.distance==m]
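For example, on a small frame shaped like the question's data (toy values assumed), both approaches keep one row per timestamp with the minimum distance while retaining the name column:
import pandas as pd

df = pd.DataFrame({'timestamp': ['2020-03-03 15:30', '2020-03-03 15:30', '2020-03-03 15:31'],
                   'name': ['Billy', 'Johny', 'Anna'],
                   'distance': [150, 123, 111]})

# sort so the smallest distance comes first, then keep the first row per timestamp
out = df.sort_values('distance').drop_duplicates('timestamp')

# or keep every row whose distance equals its group's minimum
m = df.groupby('timestamp')['distance'].transform('min')
dout = df[df.distance == m]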
You can use the GroupBy.agg method to apply min on the distance column and a relevant function on the name column (lambda x:x to simply return its data). This will return the dataframe with both columns back to you:
grouped_df = df1.groupby('timestamp').agg({'distance':'min','name':lambda x:x})
Also if you want timestamp back as a column and not as index, you can set as_index=False during groupby.
Output:
>>> grouped_df = df1.groupby('timestamp', as_index=False).agg({'distance':'min','name':lambda x:x})
>>> grouped_df
timestamp distance name
0 2020-03-03 15:30:235 123 Billy
1 2020-03-03 15:30:435 111 Johny
Say I have 2 dataframes df1 and df2.
import pandas as pd
df1 = pd.DataFrame({'weight': [1,2,3,4], 'weight_units': ['lb','oz','oz', 'lb']})
df2 = pd.DataFrame({'weight': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,1,2,3,4,5,6,7,8], 'price':[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,3,3.1,3.2,3.3]})
The first dataframe (df1) contains the weight of an object along with the units of measurement of that weight, oz & lb (ounces, pounds).
The second dataframe (df2) contains a column which has the weight value (in both pounds and ounces) and an associated price value. In the weight column, after the value hits 16 (ounces) it restarts at 1 and counts up (1, 2, 3, 4, ...), signifying it is now pounds.
Note: 1lb = 16oz
My Question is: How can I merge both of these dataframes on the weight column such that if the package has units of oz I start the merge using the first 16 values in df2, and if it is in lb I start the merge from the 17th value of df2 onwards? Or is there any other sensible way of performing this merge and getting the correct output whether it is in lb or oz?
Thoughts:
My main concern was that since the weight column is technically not unique (the numbers 1-16 repeat for pounds and ounces), you can't simply merge; if a package has a weight of 1, the merge wouldn't know which price value to take unless I can use the units as a condition.
Ideal Output:
A dataframe which has merged correctly, taking the right value depending on whether it is lb or oz (pounds or ounces):
df3 = pd.DataFrame({'weight': [1,2,3,4], 'weight_units': ['lb','oz','oz', 'lb'], 'price':[2.7, 1.1,1.2,2]})
One idea is to create a new column, e.g. by:
df2['weight_units'] = ['oz'] * 16 + ['lb'] * (len(df2) - 16)
Or:
df2['weight_units'] = df2['weight'].eq(1).cumsum().map({1:'oz', 2:'lb'})
And then merge with df1:
df = df1.merge(df2, on=['weight','weight_units'])
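A minimal end-to-end sketch on the question's data (it assumes, as described above, that the first 16 rows of df2 are ounces and the rest are pounds):
import pandas as pd

df1 = pd.DataFrame({'weight': [1, 2, 3, 4],
                    'weight_units': ['lb', 'oz', 'oz', 'lb']})
df2 = pd.DataFrame({'weight': list(range(1, 17)) + list(range(1, 9)),
                    'price': [round(1 + 0.1 * i, 1) for i in range(24)]})

# label each df2 row; the weight count restarts at 1 when the units switch from oz to lb
df2['weight_units'] = df2['weight'].eq(1).cumsum().map({1: 'oz', 2: 'lb'})

# the (weight, weight_units) pair is now unique, so a plain merge picks the right price
df3 = df1.merge(df2, on=['weight', 'weight_units'], how='left')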
I am new to python, and need some help with a question I am having regarding the date time function.
I have df_a which has a column titled time, and I am trying to create a new column id in this df_a.
I want the id column to be determined by whether the time falls within a range of times given by df_b's "date" and "date_new" columns. For example, the first row of df_b has a "date" of "2019-01-07 20:52:41" and a "date_new" of "2019-01-07 21:07:41" (a 15 minute interval); I would like the index of that row to appear as the id in df_a when the time is, say, "2019-01-07 20:56:30" (i.e. id=0), and so on for all the rows in df_a.
This question is similar, but I cannot figure out how to make it work with my data, as I keep getting the error shown below the code:
python assign value to pandas df if falls between range of dates in another df
s = pd.Series(df_b['id'].values, pd.IntervalIndex.from_arrays(df_b['date'], df_b['date_new']))
df_a['id'] = df_a['time'].map(s)
ValueError: cannot handle non-unique indices
One caveat is that the ranges in df_b are not always unique, meaning some of the intervals cover the same periods of time; in those cases it is fine to use the id of the first time period in df_b that the time falls in. Additionally, there are over 200 rows in df_b and 2000 in df_a, so defining each time period in a for-loop would take too long, unless there is an easier way than defining each one. Thank you in advance for all of your help! If this could use any clarification please let me know!
df_a
time id
2019-01-07 22:02:56 NaN
2019-01-07 21:57:12 NaN
2019-01-08 09:35:30 NaN
df_b
date date_new id
2019-01-07 21:50:56 2019-01-07 22:05:56 0
2019-01-08 09:30:30 2019-01-08 09:45:30 1
Expected Result
df_a
time id
2019-01-07 22:02:56 0
2019-01-07 21:57:12 0
2019-01-08 09:35:30 1
Let me rephrase your problem. For each row in dataframe df_a you want to check whether its value in df_a['time'] falls in the interval given by the values in columns df_b['date'] and df_b['date_new']. If so, set the value in df_a["id"] to the corresponding df_b["id"].
If this is your question, this is a (very rough) solution:
for ia, ra in df_a.iterrows():
    for ib, rb in df_b.iterrows():
        # check whether this df_a time falls inside this df_b row's [date, date_new] interval
        if (ra["time"] >= rb['date']) & (ra["time"] <= rb['date_new']):
            df_a.loc[ia, "id"] = rb["id"]
            break  # first matching interval wins
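A somewhat faster sketch of the same idea (assuming the time/date columns are already parsed as datetimes) uses numpy broadcasting to compare every df_a time against every df_b interval at once and keep the first matching interval:
import numpy as np

t = df_a['time'].values[:, None]  # shape (len(df_a), 1)
inside = (t >= df_b['date'].values) & (t <= df_b['date_new'].values)
first = inside.argmax(axis=1)     # index of the first matching interval per row
df_a['id'] = np.where(inside.any(axis=1), df_b['id'].values[first], np.nan)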
pandas doesn't have great support for non-equi joins, which is what you are looking for, but it does have a function merge_asof which you might want to check out:
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html
This should significantly speed up your join.
For example:
import pandas as pd

df_a = pd.DataFrame({'time': ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30']})
df_b = pd.DataFrame({'date': ['2019-01-07 21:50:56', '2019-01-08 09:30:30'], 'date_new': ['2019-01-07 22:05:56', '2019-01-08 09:45:30'], 'id': [0, 1]})
df_a['time'] = pd.to_datetime(df_a['time'])
df_b['date'] = pd.to_datetime(df_b['date'])
df_b['date_new'] = pd.to_datetime(df_b['date_new'])
# you need to sort df_a first before using merge_asof
df_a.sort_values('time', inplace=True)
result = pd.merge_asof(df_a, df_b, left_on='time', right_on='date')
# get rid of rows where df_a.time values are greater than df_b's new date
result = result[result.date_new > result.time]
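If you then want the matched ids written back onto df_a itself, a small sketch (assuming the time values in df_a are unique) is:
df_a['id'] = df_a['time'].map(result.set_index('time')['id'])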
In the docs, it is shown that using subset will drop rows where a cell in any of the listed columns has missing data. However, I want to drop a row only if there is missing data in ALL the columns I list (not to be confused with all the columns in my dataframe). So to be extra clear:
born name first_appearance
0 1940-04-25 Batman 1945-09-01
1 NaT Captain America 1960-02-03
2 NaT Pikachu NaT
I would only want to drop the last record. All dates are of the timestamp type.
From the docs, you can add thresh and axis parameters:
df.dropna(subset=['born', 'first_appearance'], axis=0, thresh=1)
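A quick sketch on the sample frame above (dates assumed to already be timestamps, as stated):
import pandas as pd

df = pd.DataFrame({'born': [pd.Timestamp('1940-04-25'), pd.NaT, pd.NaT],
                   'name': ['Batman', 'Captain America', 'Pikachu'],
                   'first_appearance': [pd.Timestamp('1945-09-01'), pd.Timestamp('1960-02-03'), pd.NaT]})

# keep a row if at least one of the listed columns is non-missing,
# i.e. drop it only when born AND first_appearance are both missing
df = df.dropna(subset=['born', 'first_appearance'], axis=0, thresh=1)
Equivalently, df.dropna(subset=['born', 'first_appearance'], how='all') should drop only the rows where every listed column is missing.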
I have a number of dataframes all which contain columns labeled 'Date' and 'Cost' along with additional columns. I'd like to add the numerical data in the 'Cost' columns across the different frames based on lining up the dates in the 'Date' columns to provide a timeseries of total costs for each of the dates.
There are different numbers of rows in each of the dataframes.
This seems like something that Pandas should be well suited to doing, but I can't find a clean solution.
Any help appreciated!
Here are two of the dataframes:
df1:
Date Total Cost Funded Costs
0 2015-09-30 724824 940451
1 2015-10-31 757605 940451
2 2015-11-15 788051 940451
3 2015-11-30 809368 940451
df2:
Date Total Cost Funded Costs
0 2015-11-30 3022 60000
1 2016-01-15 3051 60000
I want the resulting dataframe to have five rows (there are five different dates) and a single column with the total of the 'Total Cost' column from each of the dataframes. Initially I used the following:
totalFunding = df1['Total Cost'].values + df2['Total Cost'].values
This worked fine until there were different dates in each of the dataframes.
Thanks!
The solution posted below works great, except that I need to do this recursively as I have a number of data frames. I created the following function:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
    return dfTotal
Which works fine when adding the first two dataframes. However, the addition method appears to convert my Date column into an index in the resulting sum and therefore subsequent passes through the function fail. Here is what dfTotal looks like after the first two data frames are added together:
Total Cost Funded Costs Remaining Cost Total Employee Hours
Date
2015-09-30 1449648 1880902 431254 7410.6
2015-10-31 1515210 1880902 365692 7874.4
2015-11-15 1576102 1880902 304800 8367.2
2015-11-30 1618736 1880902 262166 8578.0
2015-12-15 1671462 1880902 209440 8945.2
2015-12-31 1721840 1880902 159062 9161.2
2016-01-15 1764894 1880902 116008 9495.0
Note that what was originally a column in the dataframe called 'Date' is now listed as the index causing df.set_index('Date') to generate an error on subsequent passes through my function.
DataFrame.add does exactly what you're looking for; it matches the DataFrames based on index, so:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
should do the trick. If you just want the Total Cost column and you want it as a DataFrame:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)[['Total Cost']]
See also the documentation for DataFrame.add at:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.add.html
Solution found. As mentioned, the add method converted the 'Date' column into the dataframe index. This was resolved using:
dfTotal['Date'] = dfTotal.index
The complete function is then:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
        dfTotal['Date'] = dfTotal.index
    return dfTotal
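For reference, a more direct way to total an arbitrary number of frames, sketched under the assumption that every frame has a 'Date' column and only numeric columns besides it, is to concatenate them and group by date:
import pandas as pd

def add_dataframes(*frames):
    # stack all frames on top of each other, then sum every numeric column per date
    return pd.concat(frames).groupby('Date', as_index=False).sum()
This avoids moving 'Date' in and out of the index on every pass.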