I want to add two columns from two different DataFrames, matching rows where the name is the same:
import pandas as pd
df1 = pd.DataFrame([("Apple",2),("Litchi",4),("Orange",6)], columns=['a','b'])
df2 = pd.DataFrame([("Apple",200),("Orange",400),("Litchi",600)], columns=['a','c'])
Now I want to add columns b and c where the name in column a is the same.
I tried df1['b+c']=df1['b']+df2['c'], but it simply adds columns b and c row by row, so the result is
a b b+c
0 Apple 2 202
1 Litchi 4 404
2 Orange 6 606
but I want
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
I guess I have to use isin, but I can't figure out how.
In a sum, columns b and c are aligned by index values, so you need to make column a the index with DataFrame.set_index first:
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = (s1+s2).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
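EDIT: If you need to keep the original value for names that have no match, use Series.add with the parameter fill_value=0 (df1 below is the original DataFrame from the question, with columns a and b):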
df2 = pd.DataFrame([("Apple",200),("Apple",400),("Litchi",600)], columns=['a','c'])
print (df2)
a c
0 Apple 200
1 Apple 400
2 Litchi 600
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = s1.add(s2, fill_value=0).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202.0
1 Apple 402.0
2 Litchi 604.0
3 Orange 6.0
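As an alternative to aligning on the index, the same pairing by name can be sketched with a merge on column a. This is only a minimal illustration using the question's sample data; the names merged and result are introduced here and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame([("Apple", 2), ("Litchi", 4), ("Orange", 6)], columns=["a", "b"])
df2 = pd.DataFrame([("Apple", 200), ("Orange", 400), ("Litchi", 600)], columns=["a", "c"])

# An inner merge pairs rows by the value in column 'a', not by row position.
merged = df1.merge(df2, on="a", how="inner")

# After the merge, b and c sit on the same row for each name, so a plain sum is safe.
merged["b+c"] = merged["b"] + merged["c"]
result = merged[["a", "b+c"]]
print(result)
```

Unlike the fill_value=0 variant, an inner merge drops names that appear in only one of the two DataFrames.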
I need help transforming the data as follows:
From a dataset in this version (df1)
ID apples oranges pears apples_pears oranges_pears
0 1 1 0 0 1 0
1 2 0 1 0 1 0
2 3 0 1 1 0 1
to a data set like the following (df2):
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
What I'm trying to accomplish is to get the total number of apples from all columns whose name contains the word "apple". E.g. in df1 there are 2 column names in which the word "apple" appears. If you sum up all the apples from the first row, there would be a total of 2. I want a single column for apples in the new dataset (df2). Note that a 1 for apples_pears counts as a 1 for EACH of apples and pears.
The idea is to split the DataFrame into two: in the first, rename every column to the part before _; in the second, keep only the columns containing _ (via DataFrame.filter) and rename each to the part after _. Finally, join them with concat and sum the columns that share a name:
df1 = df.set_index('ID')
df2 = df1.filter(like='_')
df1.columns = df1.columns.str.split('_').str[0]
df2.columns = df2.columns.str.split('_').str[1]
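# sum(level=0, axis=1) was removed in pandas 2.0; group the transposed frame by column name instead
df = pd.concat([df1, df2], axis=1).T.groupby(level=0).sum().T.reset_index()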
print (df)
ID apples oranges pears
0 1 2 0 1
1 2 1 1 1
2 3 0 2 2
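The same totals can also be computed more directly by looping over the base names, which may be easier to follow. The sketch below rebuilds the question's sample data and assumes the list of fruit names is known in advance; fruits and totals are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "apples": [1, 0, 0],
    "oranges": [0, 1, 1],
    "pears": [0, 0, 1],
    "apples_pears": [1, 1, 0],
    "oranges_pears": [0, 0, 1],
})

# Assumption: the base names are known up front.
fruits = ["apples", "oranges", "pears"]

totals = df[["ID"]].copy()
for fruit in fruits:
    # Pick every column whose '_'-separated parts include this fruit.
    cols = [c for c in df.columns if fruit in c.split("_")]
    totals[fruit] = df[cols].sum(axis=1)

print(totals)
```

Splitting on _ rather than testing for a substring avoids accidental matches when one name is contained in another.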
I have two DataFrames of different lengths:
df1:
ESTACION DZ
0 ALAMOR 1
1 EL TIGRE 1
2 SAN PEDRO 1
3 TABACONAS 1
4 BATAN 2
5 CACAO 2
6 CHOTANO 2
7 CIRATO 2
8 LLAUCANO 2
9 NARANJOS 2
10 MAGUNCHAL 2
11 PUCHACA 2
12 MAYGASBAMBA 2
df2:
Estacion Co Dre
56 ALAMOR C 1
89 LAGARTERA C 1
90 PUENTE PIURA C 1
211 PUENTE SULLANA C 1
249 PALTASHACO C 1
250 TAMBO GRANDE C 1
342 VENTANILLAS C 2
421 CACAO C 2
466 DESAGUADERO C 2
508 QUEBRADA HONDA C 2
I want to save the values common to df1['ESTACION'] and df2['Estacion'] in another DataFrame (df3),
so I tried this code:
duplicates = pd.concat([df1,df2])[pd.concat([df1,df2])
.duplicated(subset=['ESTACION','Estacion'], keep=False)]
But I'm not getting the common values. I hope you can give me an answer or some advice. Thanks!
Edited to make the answer more specific to your situation
You can use merge, which does an inner join by default. And if you insist on having a DataFrame with strictly the common values of a single column, try this:
df3=pd.merge(df1, df2, left_on=['ESTACION'], right_on=['Estacion'])
df3.drop(columns=df3.columns.difference(['ESTACION']), inplace=True)
If you want to get the common values regardless of where they appear and how many times, you can simply do this:
common_values = list(set(np.unique(df1['ESTACION'].values)).intersection(set(np.unique(df2['Estacion'].values))))
You need to have run import numpy as np beforehand, of course.
This gives you a list of all the values found in both columns. You can then put them in a new DataFrame, for example
df3 = pd.DataFrame({'common_values': common_values}), or do whatever else you want with those values.
I think you want to merge the dataframes, like this:
df3 = pd.merge(df1,df2, left_on=['ESTACION'],right_on=['Estacion'])
As mentioned in the comments, the column ESTACION is not the same as Estacion, which is why left_on and right_on are needed.
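If only the shared station names are needed, and not the extra columns a merge brings along, a boolean mask built with Series.isin is another option. The sketch below uses a shortened subset of the question's data; mask is a name introduced here for illustration.

```python
import pandas as pd

# Shortened sample of the question's data.
df1 = pd.DataFrame({"ESTACION": ["ALAMOR", "EL TIGRE", "BATAN", "CACAO"],
                    "DZ": [1, 1, 2, 2]})
df2 = pd.DataFrame({"Estacion": ["ALAMOR", "LAGARTERA", "CACAO"],
                    "Co": ["C", "C", "C"],
                    "Dre": [1, 1, 2]})

# True for each row of df1 whose station name also occurs in df2.
mask = df1["ESTACION"].isin(df2["Estacion"])

# Keep only the station column, one row per common name.
df3 = df1.loc[mask, ["ESTACION"]].drop_duplicates().reset_index(drop=True)
print(df3)
```

Note that isin compares exact strings, so differences in case or stray whitespace in the names would prevent a match.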
I have two separate pandas dataframes (df1 and df2) which have multiple columns with some common columns.
I would like to find every row in df2 that does not have a match in df1. Match between df1 and df2 is defined as having the same values in two different columns A and B in the same row.
df1
A B C text
45 2 1 score
33 5 2 miss
20 1 3 score
df2
A B D text
45 3 1 shot
33 5 2 shot
10 2 3 miss
20 1 4 miss
Result df (Only Rows 1 and 3 are returned as the values of A and B in df2 have a match in the same row in df1 for Rows 2 and 4)
A B D text
45 3 1 shot
10 2 3 miss
Is it possible to use the isin method in this scenario?
This works:
# set index (as selecting columns)
df1 = df1.set_index(['A','B'])
df2 = df2.set_index(['A','B'])
# now .isin will work
df2[~df2.index.isin(df1.index)].reset_index()
A B D text
0 45 3 1 shot
1 10 2 3 miss
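Another way to get the unmatched rows, without changing the index of either DataFrame, is a left merge with indicator=True. This is a sketch on the question's data, not the answer's method; keys, flagged and unmatched are names introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [45, 33, 20], "B": [2, 5, 1],
                    "C": [1, 2, 3], "text": ["score", "miss", "score"]})
df2 = pd.DataFrame({"A": [45, 33, 10, 20], "B": [3, 5, 2, 1],
                    "D": [1, 2, 3, 4], "text": ["shot", "shot", "miss", "miss"]})

# Only the key columns of df1 are needed; dropping duplicates avoids row blow-up.
keys = df1[["A", "B"]].drop_duplicates()

# indicator=True adds a '_merge' column saying where each row was found.
flagged = df2.merge(keys, on=["A", "B"], how="left", indicator=True)

# 'left_only' marks rows of df2 with no (A, B) match in df1.
unmatched = (
    flagged.loc[flagged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(unmatched)
```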
I am trying to merge two dataframes, one with columns: customerId, full name, and emails and the other dataframe with columns: customerId, amount, and date. I want to have the first dataframe be the main dataframe and the other dataframe information be included but only if the customerIds match up; I tried doing:
merge = pd.merge(df, df2, on='customerId', how='left')
but the dataframe that is produced contains a lot of repeats and looks wrong:
customerId full name emails amount date
0 002963338 Star shine star.shine#cdw.com $2,910.94 2016-06-14
1 002963338 Star shine star.shine#cdw.com $9,067.70 2016-05-27
2 002963338 Star shine star.shine#cdw.com $6,507.24 2016-04-12
3 002963338 Star shine star.shine#cdw.com $1,457.99 2016-02-24
4 986423367 palm tree tree.palm#snapchat.com,tree#.com $4,604.83 2016-07-16
This can't be right, please help!
The problem is that you have duplicates in the customerId column, so every matching pair of rows is repeated in the result.
One solution is to remove them, e.g. with drop_duplicates:
df2 = df2.drop_duplicates('customerId')
Sample:
df = pd.DataFrame({'customerId':[1,2,1,1,2], 'full name':list('abcde')})
print (df)
customerId full name
0 1 a
1 2 b
2 1 c
3 1 d
4 2 e
df2 = pd.DataFrame({'customerId':[1,2,1,2,1,1], 'full name':list('ABCDEF')})
print (df2)
customerId full name
0 1 A
1 2 B
2 1 C
3 2 D
4 1 E
5 1 F
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 1 a C
2 1 a E
3 1 a F
4 2 b B
5 2 b D
6 1 c A
7 1 c C
8 1 c E
9 1 c F
10 1 d A
11 1 d C
12 1 d E
13 1 d F
14 2 e B
15 2 e D
df2 = df2.drop_duplicates('customerId')
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 2 b B
2 1 c A
3 1 d A
4 2 e B
I do not see whole rows repeated, but there are repetitions in customerId. You could remove them using:
df.drop_duplicates('customerId', inplace=True)
where df could be the DataFrame holding the amounts or the one obtained after the merge. In case you want to keep a few rows per customer (say n), you could use:
df.groupby('customerId').head(n)
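Dropping duplicates discards all but one amount per customer, which may not be what is wanted. As a sketch under assumed data, the merge's validate parameter can be used to detect repeated keys, and the amounts can be aggregated before merging. The frames customers and payments and their numeric values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one row per customer, several payments per customer.
customers = pd.DataFrame({"customerId": [1, 2], "full name": ["a", "b"]})
payments = pd.DataFrame({"customerId": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# validate="one_to_one" makes merge raise MergeError when a key repeats on either side.
try:
    pd.merge(customers, payments, on="customerId", how="left", validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# One option: collapse payments to a single total per customer, then merge.
totals = payments.groupby("customerId", as_index=False)["amount"].sum()
merged = pd.merge(customers, totals, on="customerId", how="left", validate="one_to_one")
print(merged)
```

Whether to sum, take the latest payment, or keep all rows depends on what the merged table is meant to represent.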
here are my two dataframes
index = pd.MultiIndex.from_product([['a','b'],[1,2]],names=['one','two'])
df = pd.DataFrame({'col':[10,20,30,40]}, index = index)
df
col
one two
a 1 10
2 20
b 1 30
2 40
index_1 = pd.MultiIndex.from_product([['a','b'],[1.,2],['abc','mno','xyz']], names = ['one','two','three'])
temp = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10,11,12]}, index = index_1)
temp
col1
one two three
a 1.0 abc 1
mno 2
xyz 3
2.0 abc 4
mno 5
xyz 6
b 1.0 abc 7
mno 8
xyz 9
2.0 abc 10
mno 11
xyz 12
how can I merge both of them?
I have tried this:
pd.merge(left = temp, right = df, left_on = temp.index.levels[0], right_on = df.index.levels[0])
but this does not work
KeyError: "Index([u'a', u'b'], dtype='object', name=u'one') not in index"
If I convert the index into columns through reset_index(), then the merge works. However, I wish to achieve this while preserving the index structure.
my desired output is:
method 1
reset_index + merge
df.reset_index().merge(temp.reset_index()).set_index(index_1.names)
method 2
join with reset_index partial
df.join(temp.reset_index('three')).set_index('three', append=True)
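For reference, here is a self-contained sketch of the reset_index + merge approach using the question's data, with the join keys spelled out. It is written as a left merge from temp so that the row order of temp is kept; the name merged is introduced here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["one", "two"])
df = pd.DataFrame({"col": [10, 20, 30, 40]}, index=index)

index_1 = pd.MultiIndex.from_product(
    [["a", "b"], [1.0, 2.0], ["abc", "mno", "xyz"]],
    names=["one", "two", "three"],
)
temp = pd.DataFrame({"col1": range(1, 13)}, index=index_1)

# Turn the index levels into columns, merge on the shared levels,
# then rebuild the three-level index.
merged = (
    temp.reset_index()
    .merge(df.reset_index(), on=["one", "two"], how="left")
    .set_index(index_1.names)
)
print(merged)
```

Each row of temp picks up the col value of the matching (one, two) pair from df, and the result is indexed by one, two and three again.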