pandas compare two dataframes with criteria - python

I have two dataframes. df1 and df2.
I would like to get the values that are common to df1 and df2, but only where the dt value in df2 is greater than the dt value in df1.
In this case, the expected result is fee.
import pandas as pd

df1 = pd.DataFrame([['2015-01-01 06:00', 'foo'],
                    ['2015-01-01 07:00', 'fee'],
                    ['2015-01-01 08:00', 'fum']],
                   columns=['dt', 'value'])
df1.dt = pd.to_datetime(df1.dt)

df2 = pd.DataFrame([['2015-01-01 06:10', 'zoo'],
                    ['2015-01-01 07:10', 'fee'],
                    ['2015-01-01 08:10', 'feu'],
                    ['2015-01-01 09:10', 'boo']],
                   columns=['dt', 'value'])
df2.dt = pd.to_datetime(df2.dt)

One way would be to merge on the 'value' column, which produces only the matching rows; you can then filter the merged df using the 'dt_x' and 'dt_y' columns:
In [15]:
merged = df2.merge(df1, on='value')
merged[merged['dt_x'] > merged['dt_y']]
Out[15]:
                 dt_x value                dt_y
0 2015-01-01 07:10:00   fee 2015-01-01 07:00:00
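If you only need the matching values themselves (here 'fee') rather than the full rows, a small follow-up on the merged frame (not part of the original answer) could look like this:
common = merged.loc[merged['dt_x'] > merged['dt_y'], 'value']
print(common.tolist())  # ['fee']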
You can't do something like the following because the lengths don't match:
df2[ (df2['value'].isin(df1['value'])) & (df2['dt'] > df1['dt']) ]
raises:
ValueError: Series lengths must match to compare

Related

Pandas combine rows into strings separated by slash and aggregating by some other columns

I have the initial df and I want to aggregate the 'combo' column into a single string per group, separated by slashes, respecting the order given by the sort.
The desired_data dataframe below shows my final target.
raw_data = {'name': ['B', 'B', 'A', 'A', 'A', 'A', 'C'],
            'date': pd.to_datetime(pd.Series(['2017-04-03', '2017-04-03', '2017-03-31',
                                              '2017-03-31', '2017-03-31', '2017-04-04',
                                              '2017-04-04'])),
            'order': [2, 1, 4, 2, 1, 1, 1],
            'combo': ['x', 'y', 'x', 'y', 'z', 'x', 'x']}
df = pd.DataFrame(raw_data, columns=['name', 'date', 'order', 'combo'])
df = df.sort_values(['name', 'date', 'order'])
df

desired_raw = {'name': ['A', 'A', 'B', 'C'],
               'date': pd.to_datetime(pd.Series(['2017-03-31', '2017-04-04',
                                                 '2017-04-03', '2017-04-04'])),
               'combined_combo': ['z/y/x', 'x', 'y/x', 'x']}
desired_data = pd.DataFrame(desired_raw, columns=['name', 'date', 'combined_combo'])
desired_data

# What I did until now
df1 = df.groupby(['name', 'date'])['combo'].apply(list).reset_index(name='new')
df1
Here is one way:
combined_combo = df.groupby(['name', 'date'])['combo'].agg('/'.join).rename('combined_combo')
print(combined_combo)
Out:
name  date      
A     2017-03-31    z/y/x
      2017-04-04        x
B     2017-04-03      y/x
C     2017-04-04        x
Name: combined_combo, dtype: object
If you don't want the groups as the index use:
desired_data = combined_combo.reset_index()
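Putting the pieces together, here is a sketch of the full pipeline on the question's df; it assumes the sort from the question has been applied and that the result should match desired_data (the printed frame in the comment is approximate):
desired_data = (df.sort_values(['name', 'date', 'order'])
                  .groupby(['name', 'date'])['combo']
                  .agg('/'.join)
                  .rename('combined_combo')
                  .reset_index())
print(desired_data)
#   name       date combined_combo
# 0    A 2017-03-31          z/y/x
# 1    A 2017-04-04              x
# 2    B 2017-04-03            y/x
# 3    C 2017-04-04              x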

Pandas: How to exclude the matched one and get only the highlighted differences in dataframe with Multilevel Column Index

I am showing the differences between two dataframes using the code below:
import numpy as np
import pandas as pd

df_all = pd.concat([df_source.set_index('id'), df_target.set_index('id')],
                   axis='columns', keys=['First', 'Second']).drop_duplicates(keep=False)
df_final = df_all.swaplevel(axis='columns')[df_source.columns[1:]]

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)
Now how do I show only the rows that have a mismatch (in this case the rows for Id 103 and 106) and exclude the other rows?
You can use .filter() to split the columns into two subsets, First and Second, and .droplevel() to keep only one level of the column index so that both portions share the same columns and can be compared. Finally, use .compare() to show only the differences between the two portions, as follows:
df1 = df_final.filter(like='First').droplevel(level=1, axis=1)
df2 = df_final.filter(like='Second').droplevel(level=1, axis=1)
df1.compare(df2).rename({'self': 'First', 'other': 'Second'}, axis=1)
Output:
           lastname              profession        
              First Second            First  Second
id                                                  
103  Brenn_modified  Brenn              NaN     NaN
106             NaN    NaN  doctor_modified  doctor
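As a minimal, self-contained illustration of the .compare() step on flat frames (hypothetical data, not the question's df_source/df_target; the printed output in the comment is approximate):
import pandas as pd

first = pd.DataFrame({'lastname': ['Brenn_modified', 'Smith'],
                      'profession': ['doctor', 'nurse']}, index=[103, 104])
second = pd.DataFrame({'lastname': ['Brenn', 'Smith'],
                       'profession': ['doctor', 'nurse']}, index=[103, 104])

# .compare() keeps only the rows and columns that differ; renaming the
# 'self'/'other' column level gives the First/Second labels used above.
diff = first.compare(second).rename({'self': 'First', 'other': 'Second'}, axis=1)
print(diff)
#            lastname       
#               First Second
# 103  Brenn_modified  Brenn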

Trying to cross two dataframes, one with values the other with bools

I am trying to get a new dataframe from two source dataframes. The first would contain data, and the second would only contain True or False.
Both have the same column names, the same number of columns, and the same number of rows.
import pandas as pd

data1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df1 = pd.DataFrame(data1, columns=['Name', 'Age'])
data2 = [[True, False], [False, True], [False, False]]
df2 = pd.DataFrame(data2, columns=['Name', 'Age'])

# Pseudocode for the operation I am after:
df3 = df1 X df2
# Expected result:
df3 = [['Alex', ''], ['', 12], ['', '']]
I would like to get a dataframe where a field is empty when the corresponding value in df2 is False, and holds the value from df1 when the corresponding value in df2 is True.
Try this:
df3 = df1[df2].fillna('')
Explanation:
Since df1 and df2 have the same indexes, df1[df2] keeps the values where df2 is True and inserts NaN where it is False.
fillna('') then replaces all NaN values with empty strings.
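A self-contained sketch of the whole thing using the question's data; note that the Age value displays as 12.0 because the boolean mask introduces NaN, which casts the integer column to float:
import pandas as pd

df1 = pd.DataFrame([['Alex', 10], ['Bob', 12], ['Clarke', 13]], columns=['Name', 'Age'])
df2 = pd.DataFrame([[True, False], [False, True], [False, False]], columns=['Name', 'Age'])

# Keep values where df2 is True, put NaN where it is False, then blank out the NaN.
df3 = df1[df2].fillna('')
print(df3)
#    Name   Age
# 0  Alex
# 1        12.0
# 2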

Python sum with condition using a date and a condition

I have two dataframes and I am using pandas.
I want to compute a cumulative sum starting from a variable date, grouped by the value in a column.
I want to add a second column to df2 that shows the day on which the running sum of the AVG column first exceeds 100, counting from date2 in df2.
For example, with df1 and df2 being the dataframes I start with, df3 is what I want, and df3['date100'] is the day the sum of AVG exceeds 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014',
                              '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
                    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

*Something*

df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C'],
                    'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers, but most of them use groupby, and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution assumes that your DataFrame is sorted by date.
The solution:
# For this solution your DataFrame needs to be sorted by date.
limit = 100
df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014',
              '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

result = []
for row in df2.to_dict('records'):
    # For each date, select the rows that come ON or AFTER this one.
    # Then take the .cumsum(), because it's the agg you wish to do.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict and appending it to a list makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[(df['date1'] >= row['date2']) & (df['Place'] == row['Place'])]\
            .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ndf['avgsum'] >= limit]
    # Error handling, in case there is no avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1': 'date100'})
        result.append(final_df.to_dict())
    except IndexError:
        continue

df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)
#       date2 Place  avgsum   date100
# 0  1/1/2014     A   123.0  3/1/2014
# 1  2/1/2014     C     NaN       NaN
Here is a direct solution, with the following assumptions:
- df1 is sorted by date
- one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
          pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum())
                       .query('AVG>=100').iloc[0]).transpose()
          for d in df2.date2]).rename_axis('ix').reset_index())\
         .join(df1.drop(columns='AVG'), on='ix')\
         .rename(columns={'AVG': 'sum', 'date1': 'date100'})\
         .drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
- for each date in df2, find the first date when the cumulative sum of AVG reaches at least 100
- combine the results into one single dataframe, indexed by the index of that row in df1
- store that index in an ix column and reset the index, to join that dataframe to df2
- join that to df1 minus the AVG column, using the ix column
- rename the columns, remove the ix column, and re-order everything
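The same steps can also be written in a more explicit form. The following is only a sketch under the same assumptions (df1 sorted by date, a threshold-crossing row exists for every date in df2); like the chained version above it does not filter by Place, and the helper name first_date_over_limit is made up for illustration:
import pandas as pd

df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014',
                              '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
                    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})
limit = 100

def first_date_over_limit(start_date):
    # Rows on/after start_date, their running AVG total, and the first row
    # whose running total reaches the limit.
    after = df1.loc[df1.date1 >= start_date]
    cum = after.AVG.cumsum()
    hit = cum[cum >= limit].index[0]
    return pd.Series({'date100': df1.loc[hit, 'date1'], 'sum': cum.loc[hit]})

# Series.apply on a function that returns a Series expands into a DataFrame,
# which can then be joined back to df2 on the index.
df3 = df2.join(df2.date2.apply(first_date_over_limit))
print(df3)
#       date2 Place   date100  sum
# 0  1/1/2014     A  3/1/2014  123
# 1  2/1/2014     C  2/1/2014  135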

Inner join of dataframes based on datetime

I have two dataframes df1 and df2.
df1.index
DatetimeIndex(['2001-09-06', '2002-08-04', '2000-01-22', '2000-12-19',
               '2008-02-09', '2010-07-07', '2011-06-04', '2007-03-14',
               '2003-05-17', '2016-02-27', ...],
              dtype='datetime64[ns]', name=u'DateTime', length=6131, freq=None)
df2.index
DatetimeIndex(['2002-01-01 01:00:00', '2002-01-01 10:00:00',
               '2002-01-01 11:00:00', '2002-01-01 12:00:00',
               '2002-01-01 13:00:00', '2002-01-01 14:00:00', ...],
              dtype='datetime64[ns]', length=129273, freq=None)
i.e. df1 has dates as its index and df2 has datetimes as its index. I want to perform an inner join of df1 and df2 on the indexes, such that a row of df2 matches if the date corresponding to its hour is available in df1.
I want to obtain two dataframes, df11 and df22, as output. df11 will have the common dates and the corresponding columns from df1. df22 will have the common date-hours and the corresponding columns from df2.
E.g. '2002-08-04' in df1 and '2002-08-04 01:00:00' in df2 is considered present in both.
If however '1802-08-04' in df1 has no hour in df2, it is not present in df11.
If however '2045-08-04 01:00:00' in df2 has no date in df1, it is not present in df22.
Right now I am using numpy's in1d and pandas' normalize functions to achieve this task in a lengthy manner. I was looking for a more pythonic way to achieve this.
Consider a dummy DF constructed as shown:
import numpy as np
import pandas as pd

idx1 = pd.date_range(start='2000/1/1', periods=100, freq='12D')
idx2 = pd.date_range(start='2000/1/1', periods=100, freq='300H')
np.random.seed([42, 314])
DF whose DatetimeIndex carries only a date attribute:
df1 = pd.DataFrame(np.random.randint(0,10,(100,2)), idx1)
df1.head()
DF whose DatetimeIndex carries a date + time attribute:
df2 = pd.DataFrame(np.random.randint(0,10,(100,2)), idx2)
df2.head()
Get the common index, considering only the date part as the distinguishing parameter:
intersect = pd.Index(df2.index.date).intersection(df1.index)
First common-index DF, containing the columns of its original dataframe:
df11 = df1.loc[intersect]
df11
Second common-index DF, containing the columns of its original dataframe:
df22 = df2.iloc[np.where(df2.index.date.reshape(-1,1) == intersect.values)[0]]
df22
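An equivalent formulation (a sketch, not part of the answer above) uses the normalize() the question mentions, together with isin(), to build the two boolean masks directly:
# Dates of df2's rows, floored to midnight, so they are comparable to df1's index.
df2_dates = df2.index.normalize()

df11 = df1[df1.index.isin(df2_dates)]   # df1 rows whose date has at least one hour in df2
df22 = df2[df2_dates.isin(df1.index)]   # df2 rows whose date appears in df1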
