I have two large DataFrames. The first one contains data, consisting of a date column and a location column, followed by several data columns. The second DataFrame consists of a date column and a location column. I want to remove from df1 all rows whose date and location also appear in df2.
I have tried a few ways to fix this, including drop statements, drop statements within for loops, and redefining the dataframe based on multiple conditions. None of them work.
import pandas as pd

date = pd.to_datetime(['2019-01-01','2019-01-01','2019-01-02','2019-01-02','2019-01-03','2019-01-03'],format='%Y-%m-%d')
location = [1,2,1,2,1,2]
nr = [8,10,15,2,20,38]
df1 = pd.DataFrame(columns=['date','location','nr'])
df1['date']=date
df1['location']=location
df1['nr']=nr
This results in the following dataframe:
date location nr
0 2019-01-01 1 8
1 2019-01-01 2 10
2 2019-01-02 1 15
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
The second dataframe:
date2 = pd.to_datetime(['2019-01-01','2019-01-02'],format='%Y-%m-%d')
location2 = [2,1]
df2 = pd.DataFrame(columns=['date','location'])
df2['date']=date2
df2['location']=location2
resulting in the following dataframe:
date location
0 2019-01-01 2
1 2019-01-02 1
Then the drop statement:
for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
In this example, this results in the error:
KeyError: "['date' 'location' 'nr'] not found in axis"
However, in my larger dataframe it results in the error:
TypeError: 'NoneType' object is not iterable
What I need, however, is:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
What am I doing wrong?
df1.loc[(df1['date']==dayA)& (df1['location']==placeA)] is the DataFrame consisting of the rows where the date and location match. drop, however, expects the index of those rows, so you need df1.loc[(df1['date']==dayA)& (df1['location']==placeA)].index. That said, this is a very inefficient method; you can use merge instead, as the other answers discuss. Another option is df1 = df1.loc[~df1[['date','location']].apply(tuple,axis=1).isin(zip(df2.date,df2.location))].
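For example, the original loop works once the matching row labels are passed to drop (a quick sketch, using the df1 and df2 defined in the question):

for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # pass the index of the matching rows to drop, not the sub-DataFrame itself
    idx = df1.loc[(df1['date'] == dayA) & (df1['location'] == placeA)].index
    df1.drop(idx, inplace=True)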
I would use pandas merge and a little trick:
df2['temp'] = 2
df = pd.merge(df1, df2, how='outer', on=['date', 'location'])
df = df[pd.isna(df.temp)]
del df['temp']
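A variation on the same idea, if you prefer not to add a helper column, is merge's built-in indicator flag (a sketch, using the df1 and df2 from the question):

# a left merge keeps every df1 row and marks whether it also appears in df2
merged = pd.merge(df1, df2, how='left', on=['date', 'location'], indicator=True)
# keep only the rows that found no match in df2 (an "anti-join")
df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')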
The problem is with this line:
df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
You can achieve your target like this:
df1 = df1.loc[~((df1['date']==dayA) & (df1['location']==placeA))]
Basically, every time a row of df2 matches rows in df1, you remove those rows from df1, as in the sketch below.
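Put back into the loop from the question, that looks roughly like this:

for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # keep only the rows that do NOT match the current date/location pair
    df1 = df1.loc[~((df1['date'] == dayA) & (df1['location'] == placeA))]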
Output:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
Use pandas merge:
This should work (using merge's indicator flag to mark the rows of df1 that have a match in df2):
df1['index_col'] = df1.index
df = df1.merge(df2, on=['date','location'], how='left', indicator=True)
matched = df[df['_merge'] == 'both']
result_df = df1[~df1.index_col.isin(matched.index_col)].drop(columns='index_col')
I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.
df = pd.DataFrame({'ID':[1,1,2,2,2],
                   'revenue':[1,3,5,1,5],
                   'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
ID month revenue
0 1 2012-01-01 1
1 1 2012-01-01 3
2 2 2012-03-01 5
3 2 2014-01-01 1
4 2 2012-01-01 5
Doing the aggregation below.
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
ID revenue month
sum distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
Desired output is:
ID revenue distinct_m month
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg({'revenue':'sum'}), I would have got a column named revenue in precisely the same format I wanted, as shown below:
ID revenue
0 1 4
1 2 11
But, since I am creating 2 additional columns as well, using tuple form ('distinct_m','nunique'),('month_m','first'), I get column names spread across two rows.
Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using tuple form for 'revenue':'sum'. I am not looking for multiple operations afterwards to get the column names right. I am using Python 3.6.
To avoid this problem, use named aggregation, which works in pandas 0.25+ and makes it possible to specify each column name:
df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
                             distinct_m=('month','nunique'),
                             month_m=('month','first')
                             ).reset_index())
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
For older pandas versions, it is possible to flatten the MultiIndex columns and then rename:
df = df.groupby(['ID']).agg({'revenue':'sum',
                             'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
                        'month_distinct_m':'distinct_m',
                        'month_month_m':'month_m'})
df = df.reset_index()
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
I have a dataframe as shown below:
df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1':['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00','2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val':[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records for subjects who don't have 4 or more unique days.
In my sample dataframe, subject_id = 1 has only 3 unique days (3, 4 and 5), so I would like to drop subject_id = 1 completely. Subject_id = 2, however, has 5 unique days (4, 9, 11, 13, 14). Please note that the date values have a timestamp, hence I extract the day from each datetime field and check for unique values.
This is what I tried
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to contain only the rows for subject_id = 2.
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use GroupBy.filter, but this should be slower for a larger dataframe or many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
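To see why subject_id 1 is dropped, it can help to inspect the per-subject counts that transform('nunique') is built on (a quick check, run on the df from the question before filtering):

print(df.groupby('subject_id')['day'].nunique())
# subject_id
# 1    3
# 2    5
# Name: day, dtype: int64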
I have a reference dataframe:
ex:
time latitude longtitude pm2.5
0 . 0 0 0
1 . 0 5 1
......
And I have a query dataframe:
ex:
time latitude longtitude
0 . 1 3
1 . 0 5
.......
I want to get the pm2.5 value that matches each row in the query.
I have used row iteration, but it seems very slow.
predications_phy = []
for index, row in X_test.iterrows():
    Y = phyDf[(phyDf["time"] == row["time"]) & (phyDf["latitude"] == row["latitude"]) & (phyDf["longtitude"] == row["longtitude"])]
    predications_phy.append(Y)
What is the efficient and correct way to get the rows?
Given reference dataframe df1 and query dataframe df2, you can perform a left merge to extract your result:
res = df2.merge(df1, how='left')
print(res)
# time latitude longtitude pm2.5
# 0 0 1 3 NaN
# 1 1 0 5 1.0
Loops are highly discouraged unless your operation cannot be vectorised.
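If you only need the query rows that actually have a match in the reference data, the same merge with how='inner' drops the unmatched rows instead of leaving NaN (a small variation on the above):

# joins on the shared columns (time, latitude, longtitude) and keeps matches only
res_matched = df2.merge(df1, how='inner')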
I am trying to apply a left join to the two dataframes shown below.
outlier day season
0 11556.0 0 1
==========================================
date bikeid date2
0 1 16736 2016-06-06
1 1 16218 2016-06-13
2 1 15254 2016-06-20
3 1 16327 2016-06-27
4 1 17745 2016-07-04
5 1 16975 2016-07-11
6 1 17705 2016-07-18
7 1 16792 2016-07-25
8 1 18540 2016-08-01
9 1 17212 2016-08-08
10 1 11556 2016-08-15
11 1 17694 2016-08-22
12 1 14936 2016-08-29
outliers = pd.merge(outliers, sum_Day, how = 'left', left_on = ['outlier'], right_on = ['bikeid'])
outliers = outliers.dropna(axis=1, how='any')
trip_outlier day season
0 11556.0 0 1
As shown above, after applying the left join I dropped all columns containing NaN, which gives the result above. However, the desired result should be as shown below:
trip_outlier day season date2
0 11556.0 0 1 2016-08-15
It seems the dtype of the outlier column in outliers is float. You need the same dtype in both join columns.
Check it by:
print (outliers['outlier'].dtype)
print (sum_Day['bikeid'].dtype)
So use astype to convert:
outliers['outlier'] = outliers['outlier'].astype(int)
#if not int
#sum_Day['bikeid'] = sum_Day['bikeid'].astype(int)
EDIT:
If there are NaNs in the outlier column, converting to int is not possible, so it is necessary to remove the NaNs first:
outliers = outliers.dropna(subset=['outlier'])
outliers['outlier'] = outliers['outlier'].astype(int)
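With the NaNs removed and the dtypes aligned, the original left merge should then carry date2 through; a minimal sketch, assuming the outliers and sum_Day frames from the question:

result = pd.merge(outliers, sum_Day, how='left',
                  left_on='outlier', right_on='bikeid')
# drop the helper columns brought in from sum_Day if they are not needed
result = result.drop(columns=['date', 'bikeid'])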
One way to get the desired result would be using the below code:
outliers = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}), on = 'outlier', \
how = 'left')
I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of the date column in df, and then concat with sort_values; for alignment you need to rename the column:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0