I am working with the adult dataset, where I split the dataframe to label encode the categorical columns. Now I want to append the new dataframe to the original dataframe. What is the simplest way to do this?
Original dataframe:
   age  salary
0   32    3000
1   25    2300
After label encoding a few columns:
   country  gender
0        1       1
1        4       2
I want to append the above dataframe, and the final result should be the following:
   age  salary  country  gender
0   32    3000        1       1
1   25    2300        4       2
Any insights are helpful.
Let's consider two dataframes named df1 and df2. Then:
df1.merge(df2,left_index=True, right_index=True)
You can use .join() if the dataframes' rows are matched by index. .join() performs a left join by default and joins on the index by default:
df1.join(df2)
Besides the simple syntax, it has the extra advantage that, when you put your master/original dataframe on the left, the left join ensures that the master's index is retained in the result.
Result:
age salary country gender
0 32 3000 1 1
1 25 2300 4 2
You may find your solution by checking pandas.concat.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([[32,3000],[25,2300]]), columns=['age', 'salary'])
df2 = pd.DataFrame(np.array([[1,1],[4,2]]), columns=['country', 'gender'])
pd.concat([df1, df2], axis=1)
   age  salary  country  gender
0   32    3000        1       1
1   25    2300        4       2
I have two data frames with exactly the same columns, but one of them (df1) has 1000 rows and the other (df2) has 500. The rows of df2 are all found in df1, but I want the rows that are not.
For example, let's say this is df1:
Gender Age
1 F 43
3 F 56
33 M 76
476 F 30
810 M 29
and df2:
Gender Age
3 F 56
476 F 30
I want a new data frame, df3, that has the unshared rows:
Gender Age
1 F 43
33 M 76
810 M 29
How can I do that?
Use pd.Index.difference:
df3 = df1.loc[df1.index.difference(df2.index)]
There are several ways to do this; I know three.

First way:
df = df1[~df1.index.isin(df2.index)]

Second way: left-merge the two dataframes, then filter for the rows that are only in df1.

Third way: add a column to both dataframes that identifies the source, concat the two dataframes along axis=0, then count each index and keep the indices that appear once and have source=df1.

Finally: use the first way, it is much faster.
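The second way (left merge, then filter) can be sketched with merge's indicator flag; the frames below are hypothetical stand-ins shaped like the question's example:

```python
import pandas as pd

# hypothetical frames shaped like the question's example
df1 = pd.DataFrame({"Gender": ["F", "F", "M", "F", "M"],
                    "Age": [43, 56, 76, 30, 29]},
                   index=[1, 3, 33, 476, 810])
df2 = df1.loc[[3, 476]]  # the rows shared with df1

# indicator=True adds a "_merge" column saying which side each row came from
merged = df1.merge(df2, how="left", indicator=True)
# rows marked "left_only" exist only in df1
df3 = df1[(merged["_merge"] == "left_only").to_numpy()]
print(df3)
```

Note this merges on all common columns, so it assumes the value-rows of df2 are unique; the index-based first way avoids that caveat.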
You can concatenate two tables and delete any rows that have duplicates:
df3 = pd.concat([df1, df2]).drop_duplicates(keep=False)
The keep parameter controls which duplicates to keep: with keep='first' (the default), one copy of each duplicated row is kept; with keep=False, all rows that have duplicates are dropped, which is what we want here. (Note this assumes df1 itself contains no duplicate rows.)
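A minimal sketch of the keep options, on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, 3]})

# keep="first" (the default) keeps one copy of each duplicated row
print(df.drop_duplicates(keep="first")["x"].tolist())  # [1, 2, 3]

# keep=False drops every row that has a duplicate anywhere
print(df.drop_duplicates(keep=False)["x"].tolist())    # [1, 3]
```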
I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3,3],"col1":[1,1.5,2,2.5,3,3.5,4.0],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002","4.1.2002"], "col3":[11,12,13,14,15,16,17],"col4":[21,22,23,24,25,26,27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So the output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after replacing the underscores; this unpivots the dataframe, which you can then merge with df2 and pivot back using unstack:
m =df1.rename(columns=lambda x: x.replace('_',''))
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
.set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
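Another option, since merge accepts differently named keys (via left_on/right_on, or a per-merge rename), is to merge df2 into df1 once per date column, with every column named explicitly as the question asks. A sketch using the question's data (unmatched df2 rows are simply dropped by the left merge):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001",
                             "2.1.2002", "3.1.2001", "3.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16]})

df3 = df1.copy()
for n in (1, 2):
    # rename df2's key/value columns so the merge keys match df1's names
    right = df2.rename(columns={"date": f"date{n}", "col3": f"colB_{n}"})
    df3 = df3.merge(right[["ID", f"date{n}", f"colB_{n}"]],
                    on=["ID", f"date{n}"], how="left")
print(df3)
```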
I have a DataFrame and I need to turn one column into multiple columns, and then create another column that labels the values of those new columns.
import pandas as pd
df = pd.DataFrame({'state':['AK','AK','AK','AK','AL','AL','AL','AL'], 'county':['Cnty1','Cnty1','Cnty2','Cnty2','Cnty3','Cnty3','Cnty4','Cnty4'],
'year':['2000','2001','2000','2001','2000','2001','2000','2001'], 'count1':[5,7,4,8,9,1,0,1], 'count2':[8,1,4,6,7,3,8,5]})
Using pivot_table() and reset_index() I'm able to move the values of year into columns, but not able to dis-aggregate it by the other columns.
Using:
pivotDF = pd.pivot_table(df, index = ['state', 'county'], columns = 'year')
pivotDF = pivotDF.reset_index()
Gets me close, but not what I need.
What I need is another column that labels count1 and count2, with the values in the year columns. Something that looks like this:
I realize a DataFrame would have all the values for 'state' and 'county' filled in, which is fine, but I'm outputting this to Excel and need it to look just like this so if there's a way to have this format that would be a bonus.
Many thanks.
You are looking for pivot_table followed by stack:
s=df.pivot_table(index=['state','county'],columns='year',values=['count1','count2'],aggfunc='mean').stack(level=0)
s
Out[142]:
year 2000 2001
state county
AK Cnty1 count1 5 7
count2 8 1
Cnty2 count1 4 8
count2 4 6
AL Cnty3 count1 9 1
count2 7 3
Cnty4 count1 0 1
count2 8 5
You've got most of the answer down. Just add a stack with level=0 to stack on that level rather than the default year level.
pd.pivot_table(df, index=['state', 'county'], columns='year', values=['count1', 'count2']) \
.stack(level=0)
I have the following two pandas dataframes:
df1 = pd.DataFrame([[21,80,180],[23,95,191],[36,83,176]], columns = ["age", "weight", "height"])
df2 = pd.DataFrame([[22,88,184],[39,84,196],[23,95,190]], columns = ["age", "weight", "height"])
df1:
age weight height
0 21 80 180
1 23 95 191
2 36 83 176
df2:
age weight height
0 22 88 184
1 39 84 196
2 23 95 190
I would like to compare the two dataframes and get the indices of both dataframes where age and weight in one dataframe are equal to age and weight in the second dataframe. The result in this case would be:
matching_indices = [1,2] #[df1 index, df2 index]
I know how to achieve this with iterrows(), but I would prefer something less time-consuming since my dataset is relatively large. Do you have any ideas?
Use merge (an inner join by default) together with reset_index, which converts each index to a column so that information is not lost:
df = df1.reset_index().merge(df2.reset_index(), on=['age','weight'], suffixes=('_df1','_df2'))
print (df)
index_df1 age weight height_df1 index_df2 height_df2
0 1 23 95 191 2 190
print (df[['index_df1','index_df2']])
index_df1 index_df2
0 1 2
I am new to pandas; could you please help me with the case below?
I have 2 DataFrames:
df1 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['1', '32', '22', '13']})
df2 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['12', '2', '42', '15']})
df1
A number
0 name 1
1 color 32
2 city 22
3 animal 13
df2
A number
0 name 12
1 color 2
2 city 42
3 animal 15
I need to get the sum of the number column, e.g.:
new
A number
0 name 13
1 color 34
2 city 64
3 animal 28
but if I do new = df1 + df2, I get:
NEW
A number
0 namename 112
1 colorcolor 322
2 citycity 2242
3 animalanimal 1315
I even tried merge with on="A", but nothing worked.
Can anyone enlighten me, please?
Thank you
Here are two different ways: one with add, and one with concat and groupby. In either case, you need to make sure that your number columns are numeric first (your example dataframes have strings):
# set `number` to numeric (could be float, I chose int here)
df1['number'] = df1['number'].astype(int)
df2['number'] = df2['number'].astype(int)
# method 1, set the index to `A` in each and add the two frames together:
df1.set_index('A').add(df2.set_index('A')).reset_index()
# method 2, concatenate the two frames, groupby A, and get the sum:
pd.concat((df1,df2)).groupby('A',as_index=False).sum()
Output:
A number
0 animal 28
1 city 64
2 color 34
3 name 13
Merging isn't a bad idea; you just need to remember to convert the number series to numeric, select the columns to merge on, and then sum the numeric columns via select_dtypes:
df1['number'] = pd.to_numeric(df1['number'])
df2['number'] = pd.to_numeric(df2['number'])
df = df1.merge(df2, on='A')
df['number'] = df.select_dtypes(include='number').sum(1) # 'number' means numeric columns
df = df[['A', 'number']]
print(df)
A number
0 name 13
1 color 34
2 city 64
3 animal 28