I have a dataframe df1:
id value
1 100
2 100
3 100
4 100
5 100
I have another dataframe df2:
id value
2 50
5 30
I want to replace the values in df1 for the ids present in df2 with the values from df2.
Final modified df1:
id value
1 100
2 50
3 100
4 100
5 30
I will be running this in a loop, i.e. df2 will change from time to time (df1 is created outside the loop).
What would be the best way to change the values?
Use combine_first, but first set_index by id in both DataFrames:
Notice: the id column in df2 has to be unique.
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print(df)
id value
0 1 100.0
1 2 50.0
2 3 100.0
3 4 100.0
4 5 30.0
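If df2 might ever contain duplicate ids, one possible guard is to deduplicate before combining (keeping the last occurrence is an assumption about which value should win):
# hypothetical guard: keep one row per id before combine_first
df2 = df2.drop_duplicates(subset='id', keep='last')
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()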
A loc-based solution:
i = df1.set_index('id')
j = df2.set_index('id')
i.loc[j.index, 'value'] = j['value']
df1 = i.reset_index()
df1
id value
0 1 100
1 2 50
2 3 100
3 4 100
4 5 30
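Since df1 stays fixed while df2 changes on every iteration, a sketch using DataFrame.update may fit the loop better; update aligns on the index and overwrites matching rows in place (new_batches below is a hypothetical iterable yielding the changing df2 frames):
base = df1.set_index('id')            # build the index once, outside the loop
for df2 in new_batches:               # hypothetical source of df2 frames
    base.update(df2.set_index('id'))  # overwrite values for matching ids in place
df1 = base.reset_index()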
I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply a condition: if a duplicated Name also has a duplicated value in a column, for example the duplicated row A has the duplicate value 180 in the rent column, I keep only one (without summing). Otherwise I make the sum, for example the duplicated row A has different values 2 & 5 in the sale column, and the duplicated row M has different values in both the rent and sale columns.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code, but it's not working as I want:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', 'sum'),
    sale=('sale', 'sum'))
print(df2)
I got this output
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
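For reference, a self-contained run of that one-liner on the sample data (note that, unlike sum_unique, the first groupby sums sale values that repeat within a (Name, rent) pair, which makes no difference for this sample):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
# collapse duplicate (Name, rent) pairs first, then total per Name
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
print(df2)
#   Name  rent  sale
# 0    A   180     7
# 1    B     1     4
# 2    M    14    20
# 3    O    10     1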
I have two dataframes with common columns. I would like to create a new column that contains the difference between two columns (one from each dataframe) based on a condition from a third column.
df_a:
Time Volume ID
1 5 1
2 6 2
3 7 3
df_b:
Time Volume ID
1 2 2
2 3 1
3 4 3
The output appends a new column to df_a with the difference between the Volume columns (df_a.Volume - df_b.Volume) where the two IDs are equal.
Expected df_a:
Time Volume ID Diff
1 5 1 2
2 6 2 4
3 7 3 3
If ID is unique per row in each dataframe:
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(df_b.set_index('ID')['Volume'])
Output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
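As a side note, map raises if the lookup Series has a duplicated index, so if df_b could ever repeat an ID, a defensive variant (keeping the first duplicate is an arbitrary assumption):
lookup = df_b.drop_duplicates('ID').set_index('ID')['Volume']  # one Volume per ID
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(lookup)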
An option is to merge the two dfs on ID and then calculate Diff:
df_a = df_a.merge(df_b.drop(['Time'], axis=1), on="ID", suffixes=['', '2'])
df_a['Diff'] = df_a['Volume'] - df_a['Volume2']
df_a:
Time Volume ID Volume2 Diff
0 1 5 1 3 2
1 2 6 2 2 4
2 3 7 3 4 3
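If the helper column Volume2 is not wanted in the final result, an optional cleanup step:
df_a = df_a.drop(columns='Volume2')  # drop the helper column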
Merge the two dataframes on 'ID', then take the difference (this relies on the merged frame keeping df_a's row order and default index, which holds here because the IDs are unique):
import pandas as pd

df_a = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [5, 6, 7], 'ID': [1, 2, 3]})
df_b = pd.DataFrame({'Time': [1, 2, 3], 'Volume': [2, 3, 4], 'ID': [2, 1, 3]})
merged = pd.merge(df_a, df_b, on='ID')
df_a['Diff'] = merged['Volume_x'] - merged['Volume_y']
print(df_a)
# output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
I am joining two tables, left_table and right_table, on non-unique keys, which results in row explosion. I then want to aggregate the rows to match the number of rows in left_table. To do this I aggregate over the left_table columns.
Weirdly, when I save the table, the left_table columns double. It seems like the columns of left_table become an index for the resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'
left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting the index with left_join.reset_index(), but I get:
ValueError: cannot insert target, already exists
How do I fix the column-doubling issue?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to store the result in a csv. By default it includes your index columns in the generated csv. You can do to_csv(index=False) to avoid storing them.
reset_index, dropping it: you can use left_join.reset_index(drop=True) to discard the index columns rather than adding them back to the dataframe. By default reset_index inserts the current index columns into the dataframe, which generates the ValueError you got.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This results in a dataframe with repeated rows, since indexes 1 and 2 from the left table can each be joined to indexes 0 and 2 of the right table. If that is the behavior you expected and you just want to get rid of the duplicated rows, you can try:
left_join = left_join.drop_duplicates()
before aggregating. This won't prevent the row duplication; it just eliminates the duplicates afterwards so they don't cause any trouble.
You can also pass the parameter as_index=False to the groupby call, like this:
left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
to stop getting the grouping columns as indexes.
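Putting the pieces together for the example, a minimal sketch (the column lists are read off the sample tables, so treat them as assumptions about the real data):
keys_to_agg_over = ['k1', 'k2', 's', 'v', 'c', 'target']  # left_table columns
# aggregate only the right-table columns; the grouping keys stay plain columns
dic = {col: 'median' for col in ['s2', 'v2']}
left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
left_join.to_csv('result.csv', index=False)  # and don't write the index out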
I have a table in a pandas dataframe df:
Leafid pidx pidy value
100 1 3 10
100 2 6 12
200 5 7 48
300 7 1 11
I have another dataframe df2, which has:
pid price
1 10
2 20
3 30
4 40
5 50
6 60
7 70
I want to merge df and df2 such that I have two more columns, price_pidx and price_pidy, and then also compute the division price_pidy/price_pidx.
For example:
Leafid pidx pidy value price_pidx price_pidy price_pidy/price_pidx
100 1 3 10 10 30 3
My final df should have the columns:
pidx pidy value price_pidx/price_pidy
I don't want to use .map() for this.
Is there any way to do it using pd.merge?
I know how to bring in one column, price_pidx, for example:
pd.merge(df, df2[['pid', 'price']], how='left', left_on='pidx', right_on='pid')
But how do I bring in both price_pidx and price_pidy?
Without map it is more complicated, because you need to reshape with melt, then merge, and finally unstack:
# melt pidx/pidy into one 'pid' column (this assumes 'value' is unique per row)
df = pd.melt(df, id_vars='value', value_vars=['pidx', 'pidy'], value_name='pid', var_name='g')
df2 = pd.merge(df, df2[['pid', 'price']], how='left', on='pid')
df2 = df2.set_index(['value', 'g']).unstack()
df2.columns = ['_'.join(col) for col in df2.columns]
df2['col'] = df2.price_pidy / df2.price_pidx
df2 = df2.rename(columns={'pid_pidx': 'pidx', 'pid_pidy': 'pidy'})
print(df2)
pidx pidy price_pidx price_pidy col
value
10 1 3 10 30 3.000000
11 7 1 70 10 0.142857
12 2 6 20 60 3.000000
48 5 7 50 70 1.400000
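A more direct merge-only route, for comparison, is to merge twice against the original df, once per key column, renaming df2's columns before each merge (a sketch using the sample column names):
out = df.merge(df2.rename(columns={'pid': 'pidx', 'price': 'price_pidx'}),
               on='pidx', how='left')
out = out.merge(df2.rename(columns={'pid': 'pidy', 'price': 'price_pidy'}),
                on='pidy', how='left')
out['price_pidy/price_pidx'] = out['price_pidy'] / out['price_pidx']
print(out)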
I have two data frames with the same IDs and an identical structure:
X, Y, Value, ID
The only difference between the two should be the values in column Value; they may need to be sorted by ID first so both have the same row order.
I want to compare these two data frames row by row based on column Value and keep the row from the first or the second depending on where the Value is bigger. I would also like to see an example of how to add an additional column SUM holding the sum of the Value columns from both data frames.
I will be glad for any example, including one using numpy if you feel it is better suited for this than pandas.
Edit: after testing the example from the first answer, I realized that my data frames are completely missing the rows whose Value was null, which leaves the two data frames with different numbers of rows. So could you please also include how to make them the same size before the comparison, i.e. adding the rows with the missing ids from each other, with the IDs and zeros?
import numpy as np
import pandas as pd

# create a new dataframe where Value is the row-wise maximum
val1 = df1['Value']
val2 = df2['Value'][val1.index]  # align to val1 (assumes matching indexes)
df = df1.copy()
df['Value'] = np.maximum(val1, val2)

# add a SUM column (total of each frame's Value column):
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()

# or do both at once by stacking the frames and aggregating per row key:
df = (pd.concat([df1, df2])
        .groupby(['ID', 'X', 'Y'], as_index=False)
        .agg(Value=('Value', 'max'), SUM=('Value', 'sum')))
I use reindex_like to align the dataframes, and then where and loc to fill column Value of the new dataframe df:
print(df1)
X Y Value ID
0 1 4 10 0
1 2 5 55 1
2 3 6 21 2
print(df2)
X Y Value ID
0 2 5 7 1
1 3 6 34 2
# align the dataframes
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df2 = df2.reindex_like(df1)
print(df2)
X Y Value
ID
0 NaN NaN NaN
1 2 5 7
2 3 6 34
# create the new df
df = df1.copy()
df['Value'] = df1['Value'].where(df1['Value'] > df2['Value'], df2['Value'])
# if Value is NaN in df2, take the value from df1
df.loc[df2['Value'].isnull(), 'Value'] = df1['Value']
print(df)
X Y Value
ID
0 1 4 10
1 2 5 55
2 3 6 34
# total of column Value stored in column SUM
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()
print(df1)
X Y Value SUM
ID
0 1 4 10 86
1 2 5 55 86
2 3 6 21 86
# remove rows with NaN
print(df2.dropna())
X Y Value SUM
ID
1 2 5 7 41
2 3 6 34 41
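Regarding the edit about frames of different lengths: before the comparison, both frames (already indexed by ID, as above) can be aligned to the union of their ids, copying the X/Y coordinates across and zero-filling the missing Values. A minimal sketch, assuming X and Y describe the same grid in both frames:
# align both frames to the union of their ids
all_ids = df1.index.union(df2.index)
df1 = df1.reindex(all_ids)
df2 = df2.reindex(all_ids)
# copy coordinates across where one frame lacks the row, then zero-fill Value
df1[['X', 'Y']] = df1[['X', 'Y']].combine_first(df2[['X', 'Y']])
df2[['X', 'Y']] = df2[['X', 'Y']].combine_first(df1[['X', 'Y']])
df1['Value'] = df1['Value'].fillna(0)
df2['Value'] = df2['Value'].fillna(0)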