Working in Python / pandas / dataframes
I have these two dataframes:
Dataframe one:
1 2 3
1 Stockholm 100 250
2 Stockholm 150 376
3 Stockholm 105 235
4 Stockholm 109 104
5 Burnley 145 234
6 Burnley 100 250
Dataframe two:
1 2 3
1 Stockholm 100 250
2 Stockholm 117 128
3 Stockholm 105 235
4 Stockholm 100 250
5 Burnley 145 234
6 Burnley 100 953
I would like to find the rows of Dataframe one that also appear in Dataframe two and remove them from Dataframe one. Rows 1, 3 and 5 of Dataframe one also appear in Dataframe two, so they would be removed from Dataframe one, giving the result below:
1 2 3
1 Stockholm 150 376
2 Stockholm 109 104
3 Burnley 100 250
Use:
df_merge = pd.merge(df1, df2, on=[1,2,3], how='inner')
df1 = pd.concat([df1, df_merge], ignore_index=True) # DataFrame.append was removed in pandas 2.0
df1['Duplicated'] = df1.duplicated(keep=False) # keep=False marks every copy of a duplicated row as True
df_final = df1[~df1['Duplicated']] # select only the rows that are not duplicated
df_final = df_final.drop(columns='Duplicated') # drop the indicator column
The idea is as follows:
1. Do an inner join on all the columns.
2. Append the output of the inner join to df1.
3. Identify the duplicated rows in df1.
4. Select the rows of df1 that are not duplicated.
Each step corresponds to one line of the code above, in order; the last line only removes the helper column.
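The steps above can be sketched end to end on the data from the question. This is only a minimal sketch: the helper name `drop_rows_found_in` is hypothetical, and `pd.concat` is used for the append step.

```python
import pandas as pd

cols = [1, 2, 3]
df1 = pd.DataFrame(
    [["Stockholm", 100, 250], ["Stockholm", 150, 376], ["Stockholm", 105, 235],
     ["Stockholm", 109, 104], ["Burnley", 145, 234], ["Burnley", 100, 250]],
    columns=cols,
)
df2 = pd.DataFrame(
    [["Stockholm", 100, 250], ["Stockholm", 117, 128], ["Stockholm", 105, 235],
     ["Stockholm", 100, 250], ["Burnley", 145, 234], ["Burnley", 100, 953]],
    columns=cols,
)


def drop_rows_found_in(left, right):
    """Hypothetical helper: keep the rows of `left` that do not occur in `right`."""
    # Step 1: inner join on every column finds the shared rows.
    common = pd.merge(left, right, on=list(left.columns), how="inner")
    # Step 2: stack the shared rows under the original frame.
    stacked = pd.concat([left, common], ignore_index=True)
    # Step 3: flag every row that now occurs more than once.
    is_dup = stacked.duplicated(keep=False)
    # Step 4: keep only the rows that were never flagged.
    return stacked[~is_dup]


result = drop_rows_found_in(df1, df2)
print(result)
```

One caveat with this approach: a row that is repeated inside df1 itself is also flagged as a duplicate and dropped, even if it never appears in df2.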
A one-liner:
df1.merge(df2, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
Output:
1 2 3
2 Stockholm 150 376
4 Stockholm 109 104
6 Burnley 100 250
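To see why the one-liner works, it may help to look at the `_merge` column that `indicator=True` adds. The sketch below rebuilds the question's data and uses `how="left"` instead of `"outer"`, on the assumption that only rows of df1 matter; the variable names are illustrative.

```python
import pandas as pd

cols = [1, 2, 3]
df1 = pd.DataFrame(
    [["Stockholm", 100, 250], ["Stockholm", 150, 376], ["Stockholm", 105, 235],
     ["Stockholm", 109, 104], ["Burnley", 145, 234], ["Burnley", 100, 250]],
    columns=cols,
)
df2 = pd.DataFrame(
    [["Stockholm", 100, 250], ["Stockholm", 117, 128], ["Stockholm", 105, 235],
     ["Stockholm", 100, 250], ["Burnley", 145, 234], ["Burnley", 100, 953]],
    columns=cols,
)

# With no `on` argument the merge uses all shared columns (1, 2 and 3 here).
# indicator=True adds a _merge column holding 'both', 'left_only' or 'right_only'.
merged = df1.merge(df2, how="left", indicator=True)

# Keep the rows that exist only in df1, then drop the helper column.
only_in_df1 = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(only_in_df1)
```

Unlike the append-and-deduplicate approach, this keeps rows that are repeated within df1 as long as they do not occur in df2.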
I have a dataframe that I want to group by multiple columns and then add a calculated column (a mean) based on the grouping. Can someone give me a hand?
I have tried the grouping and it works fine, but adding the calculated (rolling mean) column is proving to be a hassle.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], list('AAAAAAAABBBBBBBB'), ['RED','BLUE','GREEN','YELLOW','RED','BLUE','GREEN','YELLOW','RED','BLUE','GREEN','YELLOW','RED','BLUE','GREEN','YELLOW'], ['1','1','1','1','2','2','2','2','1','1','1','1','2','2','2','2'],[100,112,99,120,105,114,100,150,200,134,167,150,134,189,172,179]]).T
df.columns = ['id','Station','Train','month_code','total']
df2 = df.groupby(['Station','Train','month_code','total']).size().reset_index().groupby(['Station','Train','month_code'])['total'].max()
Looking at getting an outcome similar to this below
Station Train  month_code total average
A       BLUE   1          112
               2          114   113
        GREEN  1          99    106.5
               2          100   99.5
        RED    1          100   100
               2          105   102.5
        YELLOW 1          120   112.5
               2          150   135
B       BLUE   1          134   142
               2          189   161.5
        GREEN  1          167   178
               2          172   169.5
        RED    1          200   186
               2          134   167
        YELLOW 1          150   142
               2          179   164.5
How about you change your initial groupby to keep the column name 'total'.
df3 = df.groupby(['Station','Train','month_code']).sum()
>>> df3.head()
                          id  total
Station Train month_code
A       BLUE  1            2    112
              2            6    114
        GREEN 1            3     99
              2            7    100
        RED   1            1    100
Then do a rolling mean on the total column.
df3['average'] = df3['total'].rolling(2).mean()
>>> df3.head()
                          id  total  average
Station Train month_code
A       BLUE  1            2    112      NaN
              2            6    114    113.0
        GREEN 1            3     99    106.5
              2            7    100     99.5
        RED   1            1    100    100.0
You can then still remove the id column if you don't want it.
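For reference, here is a self-contained sketch of the answer above. It builds the frame column by column so that id and total are numeric (building it from a transposed list of lists leaves every column as object dtype, which can trip up the rolling mean). The `average_per_train` column is an extra, hypothetical variant for the case where the window should restart for each Station/Train pair; the question's expected output does not ask for it.

```python
import pandas as pd

# Same values as in the question, with integer dtypes for id and total.
df = pd.DataFrame({
    "id": range(1, 17),
    "Station": list("AAAAAAAABBBBBBBB"),
    "Train": ["RED", "BLUE", "GREEN", "YELLOW"] * 4,
    "month_code": (["1"] * 4 + ["2"] * 4) * 2,
    "total": [100, 112, 99, 120, 105, 114, 100, 150,
              200, 134, 167, 150, 134, 189, 172, 179],
})

df3 = df.groupby(["Station", "Train", "month_code"])[["id", "total"]].sum()

# Window of 2 over the whole sorted frame: it crosses group boundaries,
# which is what the expected output in the question shows.
df3["average"] = df3["total"].rolling(2).mean()

# Variant (assumption): restart the window inside each Station/Train pair.
df3["average_per_train"] = (
    df3.groupby(level=["Station", "Train"])["total"]
       .transform(lambda s: s.rolling(2).mean())
)
print(df3.head())
```

In the per-train variant the first month of every pair is NaN, because there is no earlier row inside that group to average with.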
There is a column in my dataset that looks like this:
col1
100
100
100
101
101
102
102
103
103
103
103
104
104
I want to create a column that holds an increasing counter within each group. Specifically, the first row where col1 is 100 gets 01, the next 100 gets 02, and so on. When the rows reach 101, the counter starts again: the first 101 gets 01, the next 101 gets 02, just as it did for 100.
I tried it and I can't make it do what I am planning:
I have to make a new column first
df['nc'] = df.groupby(col1)
which is wrong.
Desired output:
col1 nc
100 01
100 02
100 03
101 01
101 02
102 01
102 02
103 01
103 02
103 ........ and so on
103
104
104
I think you're looking for this.
df['nc'] = df.groupby('col1').cumcount()+1
Which gives:
col1 nc
0 100 1
1 100 2
2 100 3
3 101 1
4 101 2
5 102 1
6 102 2
7 103 1
8 103 2
9 103 3
10 103 4
11 104 1
12 104 2
You can format the numbers as necessary if you require leading zeros.
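If the leading zeros are needed, one way is to turn the counter into a string and pad it. This is a small sketch on the column from the question; the width of 2 is an assumption taken from the desired output.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [100, 100, 100, 101, 101, 102, 102, 103, 103, 103, 103, 104, 104]}
)

# cumcount() numbers the rows of each group from 0, so add 1 to start at 1,
# then convert to text and left-pad with zeros to two characters.
df["nc"] = (df.groupby("col1").cumcount() + 1).astype(str).str.zfill(2)
print(df)
```

Note that the column then holds strings, not integers, so it sorts and compares as text.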
I have a dataframe from which I need to calculate a number of features. The dataframe df looks something like this, with one row per object (id) and event:
id event_id event_date age money_spent rank
1 100 2016-10-01 4 150 2
2 100 2016-09-30 5 10 4
1 101 2015-12-28 3 350 3
2 102 2015-10-25 5 400 5
3 102 2015-10-25 7 500 2
1 103 2014-04-15 2 1000 1
2 103 2014-04-15 3 180 6
From this I need to know, for each id and event_id (basically each row), the number of days since the last event date, the total money spent up to that date, the average money spent up to that date, the rank in the last 3 events, etc.
What is the best way to handle this kind of problem in pandas, where each row needs information from all rows with the same id up to and including the date of that row? I want to return a new dataframe with the corresponding calculated features, like:
id event_id event_date days_last_event avg_money_spent total_money_spent
1 100 2016-10-01 278 500 1500
2 100 2016-09-30 341 196.67 590
1 101 2015-12-28 622 675 1350
2 102 2015-10-25 558 290 580
3 102 2015-10-25 0 500 500
1 103 2014-04-15 0 1000 1000
2 103 2014-04-15 0 180 180
I came up with the following solution:
df1 = df.sort_values(by="event_date")  # oldest first, so cumulative values run forward in time
g = df1.groupby("id")
df1["total_money_spent"] = g["money_spent"].cumsum()  # running total per id
df1["count"] = g.cumcount()  # number of earlier events for the same id
df1["avg_money_spent"] = df1["total_money_spent"] / (df1["count"] + 1)
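A fuller sketch that also covers days_last_event might look like the following. It assumes event_date is parsed as a datetime and that "up to that date" includes the current row, which is how the expected table reads; the rank column is left out.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 1, 2],
    "event_id": [100, 100, 101, 102, 102, 103, 103],
    "event_date": pd.to_datetime([
        "2016-10-01", "2016-09-30", "2015-12-28", "2015-10-25",
        "2015-10-25", "2014-04-15", "2014-04-15",
    ]),
    "money_spent": [150, 10, 350, 400, 500, 1000, 180],
})

# Sort so that, within each id, events run from oldest to newest.
out = df.sort_values(["id", "event_date"]).copy()

# Days since the previous event of the same id; 0 for the first event.
out["days_last_event"] = (
    out.groupby("id")["event_date"].diff().dt.days.fillna(0).astype(int)
)

# Running total and running mean of money_spent per id, current row included.
out["total_money_spent"] = out.groupby("id")["money_spent"].cumsum()
out["avg_money_spent"] = (
    out.groupby("id")["money_spent"].transform(lambda s: s.expanding().mean())
)

# Restore the original row order.
out = out.sort_index()
print(out)
```

A "last 3 events" feature could follow the same pattern with `rolling(3)` in place of `expanding()`, though what exactly should be computed on rank is not specified in the question.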
I have two DataFrames
df1:
mat inv
0 100 23
1 101 35
2 102 110
df2:
mat sale
0 100 45
1 101 100
2 102 90
I merged the DataFrames into df:
mat inv sale
0 100 23 45
1 101 35 100
2 102 110 90
so I could create another column days:
df['days'] = df.inv / df.sale * 30
then I deleted the sale column and got this result:
df:
mat inv days
0 100 23 15
1 101 35 10
2 102 110 36
Can I create the days column directly in df1 without first merging the DataFrames? I don't need the columns of df2, just the sale value for the days calculation, and I would rather not merge only to delete the column again afterwards.
You can create the new column directly if you make sure the mat columns align properly:
df1 = df1.set_index('mat')
df2 = df2.set_index('mat')
df1['days'] = df1.inv.div(df2.sale).mul(30)
     inv   days
mat
100   23  15.33
101   35  10.50
102  110  36.67
You can also do it this way:
In [181]: df1['days'] = (df1.inv / df1['mat'].map(df2.set_index('mat')['sale']) * 30).astype(int)
In [182]: df1
Out[182]:
mat inv days
0 100 23 15
1 101 35 10
2 102 110 36
surely df1['days'] = df1['inv'] / df2['sale'] * 30 works?
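The direct division in the comment above does work for the frames as posted, but only because both share the same default index in the same row order. The sketch below uses a deliberately reordered df2 (an assumption for illustration, not the asker's data layout) to show the difference between aligning by position and aligning by mat.

```python
import pandas as pd

df1 = pd.DataFrame({"mat": [100, 101, 102], "inv": [23, 35, 110]})
# Same values as df2 in the question, but stored in a different row order.
df2 = pd.DataFrame({"mat": [102, 100, 101], "sale": [90, 45, 100]})

# Both frames have the default index 0..2, so this pairs rows by position
# and divides each inv by the sale of a different mat.
by_position = df1["inv"] / df2["sale"] * 30

# Look up the sale value for each mat first, then divide.
sale_by_mat = df1["mat"].map(df2.set_index("mat")["sale"])
by_key = df1["inv"] / sale_by_mat * 30

print(by_position.round(2).tolist())
print(by_key.round(2).tolist())
```

Setting mat as the index on both frames, or mapping through mat as here, makes the result independent of row order.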