I have two large DataFrames. The first one contains data, consisting of a date column and a location column, followed by several data columns. The second DataFrame consists of a date column and a location column. I want to remove from df1 all rows whose date and location also appear in df2.
I have tried a few ways to fix this, including drop statements, drop statements within for loops, and redefining the dataframe based on multiple conditions. None of them work.
import pandas as pd

date = pd.to_datetime(['2019-01-01','2019-01-01','2019-01-02','2019-01-02','2019-01-03','2019-01-03'],format='%Y-%m-%d')
location = [1,2,1,2,1,2]
nr = [8,10,15,2,20,38]
df1 = pd.DataFrame(columns=['date','location','nr'])
df1['date']=date
df1['location']=location
df1['nr']=nr
This results in the following dataframe:
date location nr
0 2019-01-01 1 8
1 2019-01-01 2 10
2 2019-01-02 1 15
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
The second dataframe:
date2 = pd.to_datetime(['2019-01-01','2019-01-02'],format='%Y-%m-%d')
location2 = [2,1]
df2 = pd.DataFrame(columns=['date','location'])
df2['date']=date2
df2['location']=location2
resulting in the following dataframe:
date location
0 2019-01-01 2
1 2019-01-02 1
Then the drop statement:
for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
In this example, this results in the error:
KeyError: "['date' 'location' 'nr'] not found in axis"
However, in my larger dataframe it results in the error:
TypeError: 'NoneType' object is not iterable
What I need, however, is:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
What am I doing wrong?
df1.loc[(df1['date']==dayA)& (df1['location']==placeA)] is the DataFrame consisting of the rows where the date and location match. drop, however, expects the index of those rows, so you need df1.loc[(df1['date']==dayA)& (df1['location']==placeA)].index. That said, this is a very inefficient method; you can use merge instead, as the other answers discuss. Another option is df1 = df1.loc[~df1[['date','location']].apply(tuple,axis=1).isin(zip(df2.date,df2.location))].
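For example, the original loop works once the matching row labels are passed to drop (a quick sketch, using the df1 and df2 defined in the question):

for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # pass the index of the matching rows to drop, not the sub-DataFrame itself
    idx = df1.loc[(df1['date'] == dayA) & (df1['location'] == placeA)].index
    df1.drop(idx, inplace=True)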
I would use pandas merge and a little trick:
df2['temp'] = 2
df = pd.merge(df1, df2, how='outer', on=['date', 'location'])
df = df[pd.isna(df.temp)]
del df['temp']
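A variation on the same idea, if you prefer not to add a helper column, is merge's built-in indicator flag (a sketch, using the df1 and df2 from the question):

# a left merge keeps every df1 row and marks whether it also appears in df2
merged = pd.merge(df1, df2, how='left', on=['date', 'location'], indicator=True)
# keep only the rows that found no match in df2 (an "anti-join")
df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')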
The problem is with this line:
df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
You can achieve your target like this:
df1 = df1.loc[~((df1['date']==dayA) & (df1['location']==placeA))]
Basically, every time a row of df2 matches rows in df1, you remove those rows from df1, as in the sketch below.
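Put back into the loop from the question, that looks roughly like this:

for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # keep only the rows that do NOT match the current date/location pair
    df1 = df1.loc[~((df1['date'] == dayA) & (df1['location'] == placeA))]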
Output:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
Use pandas merge:
This should work (using merge's indicator flag to mark the rows of df1 that have a match in df2):
df1['index_col'] = df1.index
df = df1.merge(df2, on=['date','location'], how='left', indicator=True)
matched = df[df['_merge'] == 'both']
result_df = df1[~df1.index_col.isin(matched.index_col)].drop(columns='index_col')
I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.
df = pd.DataFrame({'ID':[1,1,2,2,2],
                   'revenue':[1,3,5,1,5],
                   'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
ID month revenue
0 1 2012-01-01 1
1 1 2012-01-01 3
2 2 2012-03-01 5
3 2 2014-01-01 1
4 2 2012-01-01 5
Doing the aggregation below.
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
ID revenue month
sum distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
Desired output is:
ID revenue distinct_m month
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg({'revenue':'sum'}), I would have got a column named revenue in precisely the same format I wanted, as shown below:
ID revenue
0 1 4
1 2 11
But, since I am creating 2 additional columns as well, using tuple form ('distinct_m','nunique'),('month_m','first'), I get column names spread across two rows.
Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using tuple form for 'revenue':'sum'. I am not looking for multiple operations afterwards to get the column names right. I am using Python 3.6.
To avoid this problem, use named aggregation, which works in pandas 0.25+ and makes it possible to specify each column name:
df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
                             distinct_m=('month','nunique'),
                             month_m=('month','first')
                             ).reset_index())
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
For older pandas versions, it is possible to flatten the MultiIndex columns and then rename:
df = df.groupby(['ID']).agg({'revenue':'sum',
                             'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
                        'month_distinct_m':'distinct_m',
                        'month_month_m':'month_m'})
df = df.reset_index()
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
I have a dataframe as shown below:
df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1':['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00','2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val':[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records for subjects who don't have 4 or more unique days.
In my sample dataframe, subject_id = 1 has only 3 unique days (3, 4 and 5), so I would like to drop subject_id = 1 completely. Subject_id = 2, however, has 5 unique days (4, 9, 11, 13, 14). Please note that the date values have a timestamp, hence I extract the day from each datetime field and check for unique values.
This is what I tried
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to contain only the rows for subject_id = 2.
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use GroupBy.filter, but this should be slower for a larger dataframe or many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
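To see why subject_id 1 is dropped, it can help to inspect the per-subject counts that transform('nunique') is built on (a quick check, run on the df from the question before filtering):

print(df.groupby('subject_id')['day'].nunique())
# subject_id
# 1    3
# 2    5
# Name: day, dtype: int64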
I have a reference dataframe:
ex:
time latitude longtitude pm2.5
0 . 0 0 0
1 . 0 5 1
......
And I have a query dataframe:
ex:
time latitude longtitude
0 . 1 3
1 . 0 5
.......
I want to get the pm2.5 value that matches each row in the query.
I have used row iteration, but it seems very slow.
predications_phy = []
for index, row in X_test.iterrows():
    Y = phyDf[(phyDf["time"] == row["time"]) & (phyDf["latitude"] == row["latitude"]) & (phyDf["longtitude"] == row["longtitude"])]
    predications_phy.append(Y)
What is the efficient and correct way to get the rows?
Given reference dataframe df1 and query dataframe df2, you can perform a left merge to extract your result:
res = df2.merge(df1, how='left')
print(res)
# time latitude longtitude pm2.5
# 0 0 1 3 NaN
# 1 1 0 5 1.0
Loops are highly discouraged unless your operation cannot be vectorised.
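If you only need the query rows that actually have a match in the reference data, the same merge with how='inner' drops the unmatched rows instead of leaving NaN (a small variation on the above):

# joins on the shared columns (time, latitude, longtitude) and keeps matches only
res_matched = df2.merge(df1, how='inner')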
I am trying to apply a left join to the two dataframes shown below.
outlier day season
0 11556.0 0 1
==========================================
date bikeid date2
0 1 16736 2016-06-06
1 1 16218 2016-06-13
2 1 15254 2016-06-20
3 1 16327 2016-06-27
4 1 17745 2016-07-04
5 1 16975 2016-07-11
6 1 17705 2016-07-18
7 1 16792 2016-07-25
8 1 18540 2016-08-01
9 1 17212 2016-08-08
10 1 11556 2016-08-15
11 1 17694 2016-08-22
12 1 14936 2016-08-29
outliers = pd.merge(outliers, sum_Day, how = 'left', left_on = ['outlier'], right_on = ['bikeid'])
outliers = outliers.dropna(axis=1, how='any')
trip_outlier day season
0 11556.0 0 1
As shown above, after applying the left join I dropped all columns containing NaN, which gives the result above. However, the desired result should be as shown below:
trip_outlier day season date2
0 11556.0 0 1 2016-08-15
It seems the dtype of the outlier column in outliers is float. You need the same dtype in both join columns.
Check it by:
print (outliers['outlier'].dtype)
print (sum_Day['bikeid'].dtype)
So use astype to convert:
outliers['outlier'] = outliers['outlier'].astype(int)
#if not int
#sum_Day['bikeid'] = sum_Day['bikeid'].astype(int)
EDIT:
If there are NaNs in the outlier column, converting to int is not possible, so it is necessary to remove the NaNs first:
outliers = outliers.dropna(subset=['outlier'])
outliers['outlier'] = outliers['outlier'].astype(int)
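With the NaNs removed and the dtypes aligned, the original left merge should then carry date2 through; a minimal sketch, assuming the outliers and sum_Day frames from the question:

result = pd.merge(outliers, sum_Day, how='left',
                  left_on='outlier', right_on='bikeid')
# drop the helper columns brought in from sum_Day if they are not needed
result = result.drop(columns=['date', 'bikeid'])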
One way to get the desired result would be using the below code:
outliers = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}), on = 'outlier', \
how = 'left')
I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of the date column in df, and then concat with sort_values; for alignment you need to rename the column:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0