This question already has answers here:
Pandas merge two dataframes summing values [duplicate]
(2 answers)
how to merge two dataframes and sum the values of columns
(2 answers)
Closed 4 years ago.
I am new to pandas; could you please help me with the case below?
I have 2 DF:
df1 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['1', '32', '22', '13']})
df2 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['12', '2', '42', '15']})
df1
A number
0 name 1
1 color 32
2 city 22
3 animal 13
df2
A number
0 name 12
1 color 2
2 city 42
3 animal 15
I need to get the sum of the number column, e.g.
DF1
A number
0 name 13
1 color 34
2 city 64
3 animal 28
but if I do new = df1 + df2 I get:
NEW
A number
0 namename 13
1 colorcolor 34
2 citycity 64
3 animalanimal 27
I even tried merge with on="A", but it did not help.
Can anyone enlighten me, please?
Thank you
Here are two different ways: one with add, and one with concat and groupby. In either case, you need to make sure that your number columns are numeric first (your example dataframes have strings):
# set `number` to numeric (could be float, I chose int here)
df1['number'] = df1['number'].astype(int)
df2['number'] = df2['number'].astype(int)
# method 1, set the index to `A` in each and add the two frames together:
df1.set_index('A').add(df2.set_index('A')).reset_index()
# method 2, concatenate the two frames, groupby A, and get the sum:
pd.concat((df1,df2)).groupby('A',as_index=False).sum()
Output:
A number
0 animal 28
1 city 64
2 color 34
3 name 13
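Both methods above assume that every key in A appears in both frames. Here is a minimal sketch of what you could do otherwise, using made-up frames whose keys only partly overlap (the frames and the choice of fill_value=0 are my assumptions, not part of the question):

```python
import pandas as pd

# Illustrative frames (not from the question): 'city' and 'animal'
# each appear in only one of the two frames.
left = pd.DataFrame({'A': ['name', 'color', 'city'], 'number': [1, 32, 22]})
right = pd.DataFrame({'A': ['name', 'color', 'animal'], 'number': [12, 2, 15]})

# fill_value=0 treats a key missing from one frame as 0 instead of producing NaN.
total = (
    left.set_index('A')
    .add(right.set_index('A'), fill_value=0)
    .astype(int)       # alignment introduces floats, so cast back to int
    .reset_index()
)
print(total)
```

Without fill_value, the rows for 'city' and 'animal' would come out as NaN, because add only sums where both sides have a value.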
Merging isn't a bad idea; you just need to remember to convert the number columns to a numeric dtype, select the column to merge on, and then sum across the numeric columns via select_dtypes:
df1['number'] = pd.to_numeric(df1['number'])
df2['number'] = pd.to_numeric(df2['number'])
df = df1.merge(df2, on='A')
df['number'] = df.select_dtypes(include='number').sum(axis=1) # include='number' selects the numeric columns (number_x and number_y)
df = df[['A', 'number']]
print(df)
A number
0 name 13
1 color 34
2 city 64
3 animal 28
I am using the code below to make a search on a .csv file and match a column in both files and grab a different column I want and add it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above searches on the identically named column ('name') in both files and returns the column I request ([3]) from df2. I want it to match on both 'name' and another column, 'price', and take the value at [3] only if both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
merge represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
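As a minimal sketch of the above, using the sample data from the question (the variable name merged is mine), the left merge followed by a cast back to int could look like this:

```python
import pandas as pd

df1 = pd.DataFrame({'name': list('abcde'),
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'),
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

# Left merge on both keys: rows of df1 without a (name, price) match get NaN in 'want'.
merged = df1.merge(df2, on=['name', 'price'], how='left')

# Fill only the 'want' column, then restore the integer dtype lost to NaN.
merged['want'] = merged['want'].fillna(0).astype(int)
print(merged)
```

Filling just the one column avoids silently replacing NaN in unrelated columns, which a blanket fillna(0) on the whole frame would do.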
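If you are matching the two dataframes on name and price, you can use df.where and df.isin. Note that isin with a DataFrame argument compares values at the same index and column labels, so this only works when the rows of df1 and df2 are already aligned: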
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge so that it raises an error if the keys are duplicated on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" df1 where you already added a constant valued column. Would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
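# build the mask on df1's own index so the boolean indexing aligns; copy to avoid SettingWithCopyWarning
not_merged = ~pd.Series(list(zip(df1.name, df1.price)), index=df1.index).isin(merged_pairs)
df_anti = df1.loc[not_merged].copy()
df_anti['want'] = 0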
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
Looks complicated, but should be quite performant because it filters by set. It might be possible to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
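For what it's worth, here is a rough sketch of the index-based variant, assuming the sample data from the question and that (name, price) pairs are unique in df2. The names left, right, matched and unmatched are just illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'name': list('abcde'),
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'),
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

# Use (name, price) as a MultiIndex on both sides.
left = df1.set_index(['name', 'price'])
right = df2.set_index(['name', 'price'])

# Inner join keeps only the pairs present in both frames, so no NaN appears.
matched = left.join(right, how='inner')

# Rows of df1 whose (name, price) pair is absent from df2 get the constant 0.
unmatched = left.loc[~left.index.isin(right.index.tolist())].assign(want=0)

result = pd.concat([matched, unmatched]).sort_index().reset_index()
print(result)
```

Because NaN never enters the want column, it stays an integer column without any cast.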
#Try this code it will give you expected results
import pandas as pd
df1 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,10,10,10,10],
'value' : [35,21,33,20,88]})
df2 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,5,10,10,5],
'want' : [123,222,944,104 ,213]})
new = pd.merge(df1,df2, how='left', left_on=['name','price'], right_on=['name','price'])
print(new.fillna(0))
This question already has answers here:
Merge two dataframes by index
(7 answers)
Closed 1 year ago.
I am working with an adult dataset where I split the dataframe to label encode categorical columns. Now I want to append the new dataframe with the original dataframe. What is the simplest way to perform the same?
Original Dataframe-
age salary
32 3000
25 2300
After label encoding few columns
country gender
1 1
4 2
I want to append the above dataframe and the final result should be the following.
age salary country gender
32 3000 1 1
25 2300 4 2
Any insights are helpful.
Let's consider two dataframes named df1 and df2; then:
df1.merge(df2,left_index=True, right_index=True)
You can use .join() if the dataframes' rows are matched by index. By default, .join() performs a left join on the index:
df1.join(df2)
In addition to simple syntax, it has the extra advantage that when you put your master/original dataframe on the left, left join ensures that the dataframe indexes of the master are retained in the result.
Result:
age salary country gender
0 32 3000 1 1
1 25 2300 4 2
You may find your solution in pandas.concat.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([[32,3000],[25,2300]]), columns=['age', 'salary'])
df2 = pd.DataFrame(np.array([[1,1],[4,2]]), columns=['country', 'gender'])
pd.concat([df1, df2], axis=1)
age salary country gender
0 32 3000 1 1
1 25 2300 4 2
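One thing to watch out for: concat with axis=1 (like join and an index-based merge) lines rows up by index label, not by position. A small sketch with a deliberately mismatched index (the index values 10 and 11 are made up for illustration) shows the difference and one possible fix:

```python
import pandas as pd

# df1 kept a non-default index (e.g. after filtering); df2 has the default 0, 1.
df1 = pd.DataFrame({'age': [32, 25], 'salary': [3000, 2300]}, index=[10, 11])
df2 = pd.DataFrame({'country': [1, 4], 'gender': [1, 2]})

# Index labels do not overlap, so this yields 4 rows padded with NaN.
misaligned = pd.concat([df1, df2], axis=1)

# Resetting both indexes makes the rows line up by position instead.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(misaligned)
print(aligned)
```

If the encoded frame was derived from the original one without reordering or dropping rows, the indexes already match and no reset is needed.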
I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines it with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared feeling column to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling':1,'count':10,'code':'X'},
{'feeling':2,'count':5,'code':'Y'},{'feeling':3,'count':1,'code':'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
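To get exactly the [feeling][countryCount][code] layout from the question, one option is to merge only the columns you need and then select them in the desired order. A small sketch with invented sample data (the values and the inner join are assumptions on my part):

```python
import pandas as pd

# Invented sample data using the column names from the question.
feelingsDF = pd.DataFrame({'feeling': ['happy', 'sad', 'angry'],
                           'count': [10, 5, 1],
                           'code': ['H', 'S', 'A']})
countryDF = pd.DataFrame({'feeling': ['happy', 'sad', 'wow'],
                          'countryCount': [7, 3, 2]})

# Bring over only 'code' from feelingsDF, matched on 'feeling',
# then fix the column order explicitly.
result = countryDF.merge(
    feelingsDF[['feeling', 'code']], on='feeling', how='inner'
)[['feeling', 'countryCount', 'code']]
print(result)
```

With how='left' instead, 'wow' would be kept with NaN in the code column.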
I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3,3],"col1":[1,1.5,2,2.5,3,3.5,4],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002","4.1.2002"], "col3":[11,12,13,14,15,16,17],"col4":[21,22,23,24,25,26,27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So the output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscores from the column names. This unpivots the dataframe so that you can merge it with df2, and then pivot back using unstack:
m =df1.rename(columns=lambda x: x.replace('_',''))
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
.set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
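If you prefer to name every column explicitly, as the question asks, a plain merge also works once the df2 columns are renamed to match the df1 side. This is only a sketch under the assumption that each (ID, date) pair occurs at most once in df2; the names lookup and renamed are mine:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3],
                    "col1": [1, 1.5, 2, 2.5, 3, 3.5],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001",
                             "2.1.2002", "3.1.2001", "3.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16],
                    "col4": [21, 22, 23, 24, 25, 26]})

lookup = df2[["ID", "date", "col3"]]
df3 = df1.copy()

# One left merge per date column: (df1 date column, name of the new column).
for date_col, new_col in [("date1", "colB_1"), ("date2", "colB_2")]:
    renamed = lookup.rename(columns={"date": date_col, "col3": new_col})
    # validate="m:1" raises if (ID, date) is duplicated in df2.
    df3 = df3.merge(renamed, on=["ID", date_col], how="left", validate="m:1")

print(df3)
```

Rows of df2 that match neither date column are simply never picked up by the left merges, which covers case (3).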
This question already has answers here:
How to change the order of DataFrame columns?
(41 answers)
Closed 3 years ago.
I have a dataframe with over 200 columns. The issue is as they were generated the order is
['Q1.3','Q6.1','Q1.2','Q1.1',......]
I need to sort the columns as follows:
['Q1.1','Q1.2','Q1.3',.....'Q6.1',......]
Is there some way for me to do this within Python?
df = df.reindex(sorted(df.columns), axis=1)
This assumes that sorting the column names will give the order you want. If your column names won't sort lexicographically (e.g., if you want column Q10.3 to appear after Q9.1), you'll need to sort differently, but that has nothing to do with pandas.
You can also do more succinctly:
df.sort_index(axis=1)
Make sure you assign the result back:
df = df.sort_index(axis=1)
Or, do it in-place:
df.sort_index(axis=1, inplace=True)
You can just do:
df[sorted(df.columns)]
Edit: Shorter is
df[sorted(df)]
For a small number of columns, you can list the columns in the order you want:
# ['A', 'B', 'C'] <- this is your current column order
df = df[['C', 'B', 'A']]
This example shows sorting and slicing columns:
d = {'col1':[1, 2, 3], 'col2':[4, 5, 6], 'col3':[7, 8, 9], 'col4':[17, 18, 19]}
df = pandas.DataFrame(d)
You get:
col1 col2 col3 col4
1 4 7 17
2 5 8 18
3 6 9 19
Then do:
df = df[['col3', 'col2', 'col1']]
Resulting in:
col3 col2 col1
7 4 1
8 5 2
9 6 3
Tweet's answer can be passed to BrenBarn's answer above with
data.reindex(sorted(data.columns, key=lambda x: float(x[1:])), axis=1)
So for your example, say:
from numpy.random import randint
from pandas import DataFrame
vals = randint(low=16, high=80, size=25).reshape(5,5)
cols = ['Q1.3', 'Q6.1', 'Q1.2', 'Q9.1', 'Q10.2']
data = DataFrame(vals, columns = cols)
You get:
data
Q1.3 Q6.1 Q1.2 Q9.1 Q10.2
0 73 29 63 51 72
1 61 29 32 68 57
2 36 49 76 18 37
3 63 61 51 30 31
4 36 66 71 24 77
Then do:
data.reindex(sorted(data.columns, key=lambda x: float(x[1:])), axis=1)
resulting in:
Q1.2 Q1.3 Q6.1 Q9.1 Q10.2
0 63 73 29 51 72
1 32 61 29 68 57
2 76 36 49 18 37
3 51 63 61 30 31
4 71 36 66 24 77
If you need an arbitrary sequence instead of sorted sequence, you could do:
sequence = ['Q1.1','Q1.2','Q1.3',.....'Q6.1',......]
your_dataframe = your_dataframe.reindex(columns=sequence)
I tested this in 2.7.10 and it worked for me.
Don't forget to add "inplace=True" to Wes' answer or set the result to a new DataFrame.
df.sort_index(axis=1, inplace=True)
The quickest method is:
df.sort_index(axis=1)
Be aware that this creates a new instance. Therefore you need to store the result in a new variable:
sortedDf=df.sort_index(axis=1)
The sort method and sorted function allow you to provide a custom function to extract the key used for comparison:
>>> ls = ['Q1.3', 'Q6.1', 'Q1.2']
>>> sorted(ls, key=lambda x: float(x[1:]))
['Q1.2', 'Q1.3', 'Q6.1']
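Note that a float key treats 'Q1.10' as 1.1, so it would sort before 'Q1.2'. If your names can look like that, a key that compares the numeric parts as integers may be safer. A minimal sketch (question_key is a hypothetical helper name, and the column names are made up):

```python
import re

import pandas as pd


def question_key(name):
    """Hypothetical helper: turn 'Q1.10' into (1, 10) so parts compare as integers."""
    return tuple(int(part) for part in re.findall(r'\d+', name))


df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['Q1.10', 'Q10.3', 'Q1.2', 'Q9.1', 'Q1.1'])

# Select the columns in natural (numeric) order rather than lexicographic order.
df = df[sorted(df.columns, key=question_key)]
print(list(df.columns))
```

This assumes every column name contains at least one number; names without digits would all get an empty tuple as key and sort first.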
One use-case is that you have named (some of) your columns with some prefix, and you want the columns sorted with those prefixes all together and in some particular order (not alphabetical).
For example, you might start all of your features with Ft_, labels with Lbl_, etc, and you want all unprefixed columns first, then all features, then the label. You can do this with the following function (I will note a possible efficiency problem using sum to reduce lists, but this isn't an issue unless you have a LOT of columns, which I do not):
import re

def sortedcols(df, groups = ['Ft_', 'Lbl_'] ):
    # first the columns matching none of the prefixes, then each prefix group in order
    return df[ sum([list(filter(re.compile(r).search, list(df.columns).copy())) for r in (lambda l: ['^(?!(%s))' % '|'.join(l)] + ['^%s' % i for i in l ] )(groups) ], []) ]
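The same idea can be written more readably with plain string prefixes instead of regular expressions. This is a sketch rather than a drop-in replacement: sorted_cols_by_prefix is a made-up name, and unlike the version above it treats the groups as literal prefixes, not regex patterns.

```python
import pandas as pd


def sorted_cols_by_prefix(df, groups=('Ft_', 'Lbl_')):
    """Return df with unprefixed columns first, then each prefix group in order."""
    prefixes = tuple(groups)
    # Columns that start with none of the prefixes keep their relative order.
    unprefixed = [c for c in df.columns if not c.startswith(prefixes)]
    # Then collect the columns of each group, in the order the groups are given.
    grouped = [c for g in prefixes for c in df.columns if c.startswith(g)]
    return df[unprefixed + grouped]


# Made-up column names for illustration.
df = pd.DataFrame([[0, 1, 2, 3]], columns=['Lbl_y', 'Ft_b', 'id', 'Ft_a'])
ordered = sorted_cols_by_prefix(df)
print(list(ordered.columns))
```

Within each group the original column order is preserved; wrap the list comprehensions in sorted() if you also want alphabetical order inside a group.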
print(df.sort_values(by='Frequency', ascending=False))
where by is the name of the column, if you want to sort the rows of the dataset based on the values in a column.