Get the rows that are not shared between two data frames [duplicate] - python

This question already has answers here:
How to remove common rows in two dataframes in Pandas?
(4 answers)
Closed 25 days ago.
I have two data frames with exactly the same columns, but one of them (df1) has 1000 rows and the other (df2) has 500. Every row of df2 also appears in df1, but I want the rows of df1 that do not appear in df2.
For example, let's say this is df1:
Gender Age
1 F 43
3 F 56
33 M 76
476 F 30
810 M 29
and df2:
Gender Age
3 F 56
476 F 30
I want a new data frame, df3, that has the unshared rows:
Gender Age
1 F 43
33 M 76
810 M 29
How can I do that?

Use pd.Index.difference:
df3 = df1.loc[df1.index.difference(df2.index)]
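For illustration, a minimal sketch with the question's sample data (note that Index.difference compares index labels, not column values, so this assumes the shared rows carry the same index in both frames, as in the example):
import pandas as pd

df1 = pd.DataFrame({"Gender": ["F", "F", "M", "F", "M"],
                    "Age": [43, 56, 76, 30, 29]},
                   index=[1, 3, 33, 476, 810])
df2 = df1.loc[[3, 476]]  # the shared rows

# labels of df1 that are absent from df2
df3 = df1.loc[df1.index.difference(df2.index)]
print(df3)
#     Gender  Age
# 1        F   43
# 33       M   76
# 810      M   29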

There are several ways to do this; I know three.
First way:
df = df1[~df1.index.isin(df2.index)]
Second way:
Left-merge the two dataframes and then keep only the rows that exist just in df1 (see the sketch after this list).
Third way:
Add a column to both dataframes that records the source, concatenate the two dataframes (along axis=0), then count each index and keep the indices that appear only once and whose record has source=df1.
Finally: use the first way. It is by far the fastest.
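A minimal sketch of the second way, using merge's indicator flag and reusing the sample df1/df2 from the sketch above (this variant compares rows by their column values rather than by index):
# indicator=True adds a "_merge" column saying where each row came from
merged = df1.merge(df2, how="left", indicator=True)
df3 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
# note: merge returns a fresh RangeIndex, so the original index labels are lost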

You can concatenate two tables and delete any rows that have duplicates:
df3 = pd.concat([df1, df2]).drop_duplicates(keep=False)
The keep parameter controls what happens to duplicated rows: keep='first' (the default) or keep='last' keeps one copy of each duplicated row, while keep=False deletes every row that has a duplicate, which is what we want here.
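A quick usage note, reusing the df1/df2 from the sketch in the first answer above; this approach compares whole row values, so it assumes df2 really is a subset of df1 (a row of df2 that never appears in df1 would survive the drop and end up in df3 too):
# rows present in both frames become exact duplicates, and keep=False drops them all
df3 = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(df3)
#     Gender  Age
# 1        F   43
# 33       M   76
# 810      M   29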

Related

Different aggregation for dataframe with several columns

I am looking for some short-cut to reduce the manual grouping required:
I have a dataframe with many columns. When grouping the dataframe by 'Level', I want to aggregate two columns using nunique(), but all other columns (ca. 60 columns representing years from 2021 onward) using mean().
Does anyone have an idea how to define 'the rest' of the columns?
Thanks!
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'X':[1,1,1,2,2,2],'A':[1,2,3,4,5,6],'B':[1,2,3,4,5,6],'C':[7,8,9,10,11,12],'D':[13,14,15,16,17,18],'E':[19,20,21,22,23,24]})
aggdct = dict.fromkeys(df.columns, pd.Series.mean)
del aggdct['X']
aggdct['A'] = pd.Series.nunique
print(df.groupby('X').agg(aggdct))
output
A B C D E
X
1 3 2 8 14 20
2 3 5 11 17 23
Explanation: I prepare the aggregation dict with dict.fromkeys, which gives a dict whose keys are the column names and whose values are pd.Series.mean. I then remove the column used in the groupby and switch the selected column to pd.Series.nunique instead of pd.Series.mean.
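Applied to the asker's situation, a minimal sketch (the names 'Level', 'ColA' and 'ColB' are hypothetical placeholders; the assumption is that every remaining column is a year column to be averaged):
# mean for every column except the grouping key, nunique for the two selected columns
agg_map = {col: "mean" for col in df.columns if col != "Level"}
agg_map["ColA"] = "nunique"
agg_map["ColB"] = "nunique"
result = df.groupby("Level").agg(agg_map)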

How to append two dataframes in python pandas [duplicate]

This question already has answers here:
Merge two dataframes by index
(7 answers)
Closed 1 year ago.
I am working with an adult dataset where I split the dataframe to label encode categorical columns. Now I want to append the new dataframe to the original dataframe. What is the simplest way to do this?
Original Dataframe:
age  salary
32   3000
25   2300
After label encoding a few columns:
country  gender
1        1
4        2
I want to append the above dataframes, and the final result should be the following:
age  salary  country  gender
32   3000    1        1
25   2300    4        2
Any insights are helpful.
Let's consider two dataframes named df1 and df2; then:
df1.merge(df2, left_index=True, right_index=True)
You can use .join() if the dataframes' rows are matched by index, as follows. .join() is a left join by default and joins on the index by default:
df1.join(df2)
In addition to the simple syntax, it has the extra advantage that when you put your master/original dataframe on the left, the left join ensures that the master's indexes are retained in the result.
Result:
age salary country gender
0 32 3000 1 1
1 25 2300 4 2
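A self-contained sketch of this answer with the question's data:
import pandas as pd

df1 = pd.DataFrame({"age": [32, 25], "salary": [3000, 2300]})
df2 = pd.DataFrame({"country": [1, 4], "gender": [1, 2]})

# left join on the index keeps df1's index and pulls in df2's columns
result = df1.join(df2)
print(result)
#    age  salary  country  gender
# 0   32    3000        1       1
# 1   25    2300        4       2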
You may find your solution in pandas.concat.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([[32,3000],[25,2300]]), columns=['age', 'salary'])
df2 = pd.DataFrame(np.array([[1,1],[4,2]]), columns=['country', 'gender'])
pd.concat([df1, df2], axis=1)
   age  salary  country  gender
0   32    3000        1       1
1   25    2300        4       2

How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, which are made up of one column each, but have different lengths.
Could I please know how to merge them while maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df which has 3 columns (week, coeff1, coeff2); as df2 is longer, I expect some NaNs in coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
...   ...        ...
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week key is not unique.
So, on a left join, pandas is going to join all the Coeff2-Values where df2.Week == 1 on every single row in df1 where df1.Week == 1. And that is why you get these millions of rows.
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is to concatenate the Dataframes "per week".
You can achieve that by iterating over every week, building a per-week subset that concatenates df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    # align this week's coefficient values side by side (df2's column is named "Coeff" in the question)
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff"].rename("Coeff2").reset_index(drop=True)
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional but I recommend it anyways!
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
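A hedged sketch of that cleanup, assuming df1 holds Week/Coeff1 and df2 holds Week/Coeff and that they are concatenated in that order (the column positions below follow from that assumption):
combined = pd.concat([df1, df2], ignore_index=True, axis=1)
# with ignore_index=True the columns come out as 0..3: df1.Week, df1.Coeff1, df2.Week, df2.Coeff
combined.columns = ["Week1", "Coeff1", "Week", "Coeff2"]
combined = combined.drop(columns="Week1")  # keep the Week column of the longer frame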
According to pandas' merge documentation, you can use merge like this. What you are looking for is a left join; however, the default is an inner join, so you can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps the rows of the bigger df and fills them with NaN where there is no match in the shorter df.

How do you remove values from a data frame based on whether they are present in another data frame?

I have 2 data frames. The 1st contains a list of values I am looking to work with, and the second contains these values plus a large number of other values. I am looking for the best way to remove the values that do not appear in the 1st data frame from the 2nd data frame, to reduce the number of entries I am working with.
Example
Input
DF1
Alpha  code
A      1
D      2
E      3
F      4
DF2
Alpha  code
A      23
B      12
C      1
D      32
E      23
F      45
G      51
H      26
Desired Output:
DF1
Alpha  code
A      1
D      2
E      3
F      4
DF2
Alpha  code
A      23
D      32
E      23
F      45
Assuming that your first column in DF1 is called "Alpha", you can do this:
my_list_DF1 = DF1['Alpha'].unique().tolist() # gets all unique values of first column from DF1 into a list
Then, you can filter your DF2, to include only those values, using isin:
new_DF2 = DF2[DF2['Alpha'].isin(my_list_DF1)]
This will result in a smaller DF2 that only includes the rows whose values in the 'Alpha' column also appear in DF1.
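A self-contained sketch with the question's data:
import pandas as pd

DF1 = pd.DataFrame({"Alpha": list("ADEF"), "code": [1, 2, 3, 4]})
DF2 = pd.DataFrame({"Alpha": list("ABCDEFGH"),
                    "code": [23, 12, 1, 32, 23, 45, 51, 26]})

# keep only the DF2 rows whose Alpha value appears in DF1
new_DF2 = DF2[DF2["Alpha"].isin(DF1["Alpha"].unique())]
print(new_DF2)
#   Alpha  code
# 0     A    23
# 3     D    32
# 4     E    23
# 5     F    45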
You could do an inner join, which drops all rows that don't have a matching entry and merges all the others:
pd.merge(DF1, DF2, on='Alpha', how='inner')
But then you would subsequently have to drop the columns you don't need, and possibly rename some if they share a name.
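A hedged sketch of that cleanup, continuing from the frames above (the '_x'/'_y' suffixes are pandas' defaults for the shared 'code' column):
# inner join on Alpha; the shared "code" column is suffixed automatically
merged = pd.merge(DF1, DF2, on="Alpha", how="inner")
# keep DF2's code column and restore its original name
new_DF2 = merged.drop(columns="code_x").rename(columns={"code_y": "code"})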

How to find the indices of identical rows based on two columns in two different pandas dataframe? [duplicate]

This question already has answers here:
python panda: return indexes of common rows
(2 answers)
Closed 4 years ago.
I have the following two pandas dataframes:
df1 = pd.DataFrame([[21,80,180],[23,95,191],[36,83,176]], columns = ["age", "weight", "height"])
df2 = pd.DataFrame([[22,88,184],[39,84,196],[23,95,190]], columns = ["age", "weight", "height"])
df1:
age weight height
0 21 80 180
1 23 95 191
2 36 83 176
df2:
age weight height
0 22 88 184
1 39 84 196
2 23 95 190
I would like to compare the two dataframes and get the indices of both dataframes where age and weight in one dataframe are equal to age and weight in the second dataframe. The result in this case would be:
matching_indices = [1,2] #[df1 index, df2 index]
I know how to achieve this with iterrows(), but I prefer something less time consuming since the dataset I have is relatively large. Do you have any ideas?
Use merge with the default inner join, and reset_index to convert the index into a column so this information is not lost:
df = df1.reset_index().merge(df2.reset_index(), on=['age','weight'], suffixes=('_df1','_df2'))
print (df)
index_df1 age weight height_df1 index_df2 height_df2
0 1 23 95 191 2 190
print (df[['index_df1','index_df2']])
index_df1 index_df2
0 1 2
