Concatenate two dataframes and remove duplicate rows based on condition - python

I am trying to concatenate two dataframes, and in case of duplication I'd like to keep the row that has the maximum value for a column C.
I tried this command:
df = pd.concat([df1, df2]).max(level=0)
So if two rows have the same values for columns A and B, I want to keep only the row with the maximum value for column C.

You can sort by column C, then drop duplicates by columns A & B:
df = pd.concat([df1, df2])\
.sort_values('C')\
.drop_duplicates(subset=['A', 'B'], keep='last')
Your attempt exhibits a couple of misunderstandings:
pd.DataFrame.max is used to calculate maximum values, not to filter a dataframe.
The level parameter is relevant only for MultiIndex dataframes.
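For illustration, a minimal self-contained sketch of that approach on made-up data (df1, df2 and the values below are invented here, not taken from the question):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [10, 5]})
df2 = pd.DataFrame({'A': [1, 3], 'B': ['x', 'z'], 'C': [7, 8]})

# sort by C so the largest value per (A, B) pair comes last,
# then keep only that last occurrence
df = (pd.concat([df1, df2])
        .sort_values('C')
        .drop_duplicates(subset=['A', 'B'], keep='last')
        .reset_index(drop=True))

print(df)  # (1, 'x') keeps C=10; (2, 'y') and (3, 'z') are unchanged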

Related

Pandas data frame merge unique values

I have two dataframes (df3 & dfA): one with unique values only (df3), the other with multiple values (dfA).
I want to add a column 'Trdr' from dfA to df3 based on the column 'ISIN'.
The issue is that dfA has multiple lines with the same ISIN, each with a 'Trdr' value. Therefore, when I try to merge the datasets, it adds a row for every ISIN that has a 'Trdr' value.
In layman's terms, I want to Vlookup the first value that pandas can find in dfA and assign it to df3.
df3 = pd.merge(df3, dfA[['Trdr', 'ISIN']], how="left", on='ISIN')
df3 before the merge, showing unique ISINs only
dfA, showing the Trdr values I'm trying to merge
df3 after the merge, showing multiple lines of VODAFONE rather than 1 unique value as per the initial screenshot
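One common way to get the "first match only" behaviour described here is to drop duplicate ISINs from dfA before merging. A minimal sketch with invented placeholder data (names and values are not from the post):
import pandas as pd

df3 = pd.DataFrame({'ISIN': ['XS111', 'XS222']})
dfA = pd.DataFrame({'ISIN': ['XS111', 'XS111', 'XS222'],
                    'Trdr': ['Alice', 'Bob', 'Carol']})

# keep only the first Trdr per ISIN, so the left merge cannot multiply rows in df3
dfA_first = dfA.drop_duplicates(subset='ISIN', keep='first')
df3 = pd.merge(df3, dfA_first[['ISIN', 'Trdr']], how='left', on='ISIN')

print(df3)  # one row per ISIN, Trdr taken from the first match in dfA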

Replace/change duplicate column values where the column name is the same but the values differ, then drop the duplicate columns

Is there any way to drop duplicate columns, but replace their values depending on conditions?
In the table below, I would like to remove the duplicate/second A and B columns, but I want to replace the values of the primary A and B columns (1st and 2nd column) where the value is 0 but is 1 in the duplicate columns.
Example: in the 3rd row, where A and B have the value 0, they should be replaced with the 1 from their respective duplicate columns.
Input Data:
Output Data:
This is an example of the problem I'm working on; my real data has around 200 columns, so I'm hoping to find an optimal solution without hardcoding column names for removal.
Use DataFrame.any per duplicated column name if the columns contain only 1/0 values:
df = df.any(axis=1, level=0).astype(int)
Or, if you need the maximal value per duplicated column name:
df = df.max(axis=1, level=0)
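On recent pandas versions, where the level argument of reductions is no longer available, a transpose-and-groupby equivalent can be used. A small sketch with an invented frame of duplicated A/B columns:
import pandas as pd

df = pd.DataFrame([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]],
                  columns=['A', 'B', 'A', 'B'])

# group the transposed frame by column name and take the max, then transpose back;
# this mirrors df.max(axis=1, level=0) for duplicated column names
out = df.T.groupby(level=0).max().T
print(out)
#    A  B
# 0  1  1
# 1  0  1
# 2  1  1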

Comparing two dataframes and storing results in another dataframe

I have two data frames like this: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: for the first column of dataframe B, I compare the first row to the first row of dataframe A, then the second row of B to the second row of A, etc.)
Basically, I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on what is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.ge(dfA.values).any()]
Explanation: dfA.values returns the numpy array with shape (720, 1). Then dfB.ge(dfA.values) checks each column of dfB against that single column from dfA; this returns a boolean dataframe of the same size as dfB. Finally, .any() checks each column of that boolean dataframe for any True.
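A tiny self-contained illustration of that pattern (the numbers below are made up):
import pandas as pd

dfA = pd.DataFrame({'a': [5, 5, 5]})           # single column
dfB = pd.DataFrame({'x': [1, 2, 3],            # never reaches 5
                    'y': [6, 7, 8],            # always >= 5
                    'z': [0, 9, 0]})           # >= 5 at least once

# keep only the columns of dfB where the comparison holds for at least one row
kept = dfB.loc[:, dfB.ge(dfA.values).any()]
print(kept.columns.tolist())  # ['y', 'z']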
How about this:
pd.DataFrame(np.where(A.to_numpy() <= B.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(how='all')
You can replace the np.nan in the np.where condition with whatever value you wish, including keeping the original values of dataframe 'B'.

Find duplicated rows, multiply a certain column by number of duplicates, drop duplicated rows

I have a pandas dataframe of about 70000 rows, and 4500 of them are duplicates of an original. The columns are a mix of string columns and number columns. The column I'm interested in is the value column. I'd like to look through the entire dataframe to find rows that are completely identical, count the number of duplicated rows per row (inclusive of the original), and multiply the value in that row by the number of duplicates.
I'm not really sure how to go about this from the start, but I've tried using df[df.duplicated(keep = False)] to obtain a dataframe df1 of duplicated rows (inclusive of the original rows). I appended a column of Trues to the end of df1. I tried to use .groupby with a combination of columns to sum up the number of Trues, but the result did not capture the true number of duplicates (I obtained about 3600 unique duplicated rows in this case).
Here's my actual code:
duplicate_bool = df.duplicated(keep = False)
df['duplicate_bool'] = duplicate_bool
df1= df[duplicate_bool]
f = {'duplicate_bool':'sum'}
df2= df1.groupby(['Date', 'Exporter', 'Buyer', \
'Commodity Description', 'Partner Code', \
'Quantity', 'Price per MT'], as_index = False).agg(f)
My idea here was to obtain a separate dataframe df2 with no duplicates, so I could multiply the entry in the value column by the number stored in the summed duplicate_bool column. Then I'd simply append df2 to my original dataframe after removing all the duplicates identified by .duplicated.
However, if I use groupby with all the columns, I get an empty dataframe. If I don't use all the columns, I don't get the true number of duplicates, and I won't be able to append it in any way.
I think I'd like a better way to do this, since I'm confusing myself.
I think this question boils down to figuring out how to get a count of the occurrences of each unique row. If a row occurs only once, this count is one. If it occurs more often, it will be > 1. You can then use this count to multiply, filter, etc.
This nice one-liner (taken from How to count duplicate rows in pandas dataframe?) creates an extra column with the number of occurrences of each row:
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'dup_count'})
To then calculate the true value of each row:
df['total_value'] = df['value'] * df['dup_count']
And to filter we can use the dup_count column to remove all duplicate rows:
dff = df[df['dup_count'] == 1]
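For concreteness, a minimal sketch of the whole pipeline on invented data (two identical rows for exporter 'X'; the column names are placeholders):
import pandas as pd

df = pd.DataFrame({'Exporter': ['X', 'X', 'Y'],
                   'Quantity': [10, 10, 3],
                   'value':    [100, 100, 50]})

# count how often each unique row occurs
df = (df.groupby(df.columns.tolist())
        .size()
        .reset_index()
        .rename(columns={0: 'dup_count'}))

# scale the value by the number of occurrences
df['total_value'] = df['value'] * df['dup_count']
print(df)
#   Exporter  Quantity  value  dup_count  total_value
# 0        X        10    100          2          200
# 1        Y         3     50          1           50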

Creating new pandas dataframe by extracting columns from other dataframes - ValueError

I have to extract columns from different pandas dataframes and merge them into a single new dataframe. This is what I am doing:
newdf = pd.DataFrame()
newdf['col1'] = sorted(df1.columndf1.unique())
newdf['col2'] = df2.columndf2.unique()
newdf['col3'] = df3.columndf3.unique()
newdf
I am sure that the three columns have the same length (I have checked) but I get the error
ValueError: Length of values does not match length of index
I have tried to pass them as pd.Series but the result is the same. I am on Python 2.7.
It seems the problem is that the lengths of the unique values differ.
One possible solution is to concat all the data together and apply unique per column.
If the unique values are not the same length, you get NaNs in the last values of the shorter columns.
newdf = (pd.concat([df1.columndf1, df2.columndf2, df3.columndf3], axis=1)
           .apply(lambda x: pd.Series(x.unique())))
EDIT:
Another possible solution:
a = sorted(df1.columndf1.unique())
b = list(df2.columndf2.unique())
c = list(df3.columndf3.unique())
newdf=pd.DataFrame({'col1':a, 'col2':b, 'col3':c})
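A small sketch of the padding behaviour of the concat-and-unique approach, with invented placeholder frames (columndf1/columndf2 stand in for the real column names, as in the question):
import pandas as pd

df1 = pd.DataFrame({'columndf1': ['a', 'b', 'a', 'c']})  # 3 unique values
df2 = pd.DataFrame({'columndf2': [1, 1, 2, 2]})          # 2 unique values

newdf = (pd.concat([df1.columndf1, df2.columndf2], axis=1)
           .apply(lambda x: pd.Series(x.unique())))

print(newdf)
#   columndf1  columndf2
# 0         a        1.0
# 1         b        2.0
# 2         c        NaN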
