pandas DataFrame: remove all rows whose keys appear in another DataFrame - python

I have a pandas DataFrame like the one below:
DataFrame 1 (name: df)
As you can see, each (A, B, C) combination has n values of X and V.
I built an outlier DataFrame as:
df_outlier = df[(df["V"] > 150)]
Then I want to remove every row whose (A, B, C) combination appears in df_outlier.
For example, if df_outlier looks like this:
I want to remove these rows from the original DataFrame:
First, I tried the code below:
df_filtered = pd.merge(df, df_outlier, indicator=True, how = 'outer').query('_merge=="left_only"').drop(['_merge'],axis=1)
However, it only removes the rows that are themselves in df_outlier, not all rows that share an (A, B, C) combination with df_outlier.
Sorry for my poor English if this is hard to understand.

Select only the key columns of df_outlier for the merge, so that rows are matched on (A, B, C) alone:
df_filtered = pd.merge(df, df_outlier[['A','B','C']], indicator=True, how = 'outer').query('_merge=="left_only"').drop(['_merge'],axis=1)
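The same idea can be sketched on made-up data. This assumes the key columns are named A, B, and C as in the question; the sample values, the de-duplication of the keys, and the use of how='left' are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: two (A, B, C) groups, one of which has an outlier in V.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": [1, 1, 1, 2, 2],
    "C": [1, 1, 1, 2, 2],
    "X": [0, 1, 2, 0, 1],
    "V": [10, 200, 30, 40, 50],
})
df_outlier = df[df["V"] > 150]

# Keep only the key columns and drop repeated key combinations.
keys = df_outlier[["A", "B", "C"]].drop_duplicates()

# A left merge with indicator=True marks rows of df whose keys have no match in `keys`.
merged = df.merge(keys, on=["A", "B", "C"], how="left", indicator=True)
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(df_filtered)
```

Here every row of the (1, 1, 1) group is removed because one of its rows is an outlier, while the (2, 2, 2) group is kept.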

Related

How to assign values to the rows of a data frame which satisfy certain conditions?

I have two data frames:
df1 = pd.read_excel("test1.xlsx")
df2 = pd.read_excel("test2.xlsx")
I am trying to assign values from df1 to df2 where a certain condition is met (where Col1 in one frame equals Col1 in the other, assign the values of ColY to ColX).
df1.loc[df1['Col1'] == df2['Col1'],'ColX'] = df2['ColY']
This results in an error because df2['ColY'] is the whole column. How do I assign only for the rows that match?
You can use numpy.where:
import numpy as np
df1['ColX'] = np.where(df1['Col1'].eq(df2['Col1']), df2['ColY'], df1['ColX'])
Since you wanted to assign from df1 to df2, your code should have been
df2.loc[df1['Col1'] == df2['Col1'], 'ColX'] = df1['ColY']
The code you wrote doesn't assign the values from df1 to df2, but from df2 to df1.
Also, if you could clarify which DataFrame ColX and ColY belong to, I could help more (or do both DataFrames have them?).
Your code is pretty much right! Only swap df1 and df2 as above.
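Both answers compare the two Col1 columns row by row, which only makes sense when the frames are aligned. If they are not, a key-based lookup is one possible alternative. The sketch below assumes that ColY lives in df1, ColX lives in df2, and Col1 is the lookup key; the sample data is made up, and the question does not confirm this layout.

```python
import pandas as pd

# Hypothetical frames with different row order and a key ("z") that has no match.
df1 = pd.DataFrame({"Col1": ["a", "b", "c"], "ColY": [1, 2, 3]})
df2 = pd.DataFrame({"Col1": ["c", "a", "z"], "ColX": [0, 0, 0]})

# Build a Col1 -> ColY lookup; duplicates are dropped so the index is unique.
lookup = df1.drop_duplicates("Col1").set_index("Col1")["ColY"]

# Map each key in df2 to its ColY value; unmatched keys become NaN
# and fall back to the existing ColX value.
matched = df2["Col1"].map(lookup)
df2["ColX"] = matched.fillna(df2["ColX"])
print(df2)
```

Rows of df2 whose key is missing from df1 keep their original ColX, so only matching rows are overwritten.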

How to find the top N% of a DataFrame?

I want to find the top 1% of my DataFrame and append all the values to a list. Then I can check the first value in the list and use it as a filter on the DataFrame. Any idea how to do it? Or is there a simpler way?
You can find the DataFrame I use here:
https://raw.githubusercontent.com/srptwice/forstack/main/resultat_projet.csv
What I tried was to inspect my DataFrame with a heatmap (from Seaborn) and use a filter like this:
df4 = df2[df2 > 50700]
You can use df.<column name>.quantile(<percentile>) to get the cutoff value for the top % of a column. For example, the code below gets the rows of df where the bfly column is in the top 10% (above the 90th percentile):
import pandas as pd
df = pd.read_csv('./resultat_projet.csv')
df.columns = df.columns.str.replace(' ', '') # remove blank spaces in columns
df2 = df[df.bfly > df.bfly.quantile(0.9)]
print(df2)
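For the top 1% asked about in the question, the same pattern can be used with quantile(0.99) and the result converted to a list. This is only a sketch on made-up data: the column name bfly is taken from the answer above, and the values are hypothetical stand-ins for the CSV.

```python
import pandas as pd

# Hypothetical data standing in for the real CSV: 200 values in one column.
df = pd.DataFrame({"bfly": range(1, 201)})

# Cutoff at the 99th percentile; values strictly above it form the top 1%.
threshold = df["bfly"].quantile(0.99)
top = df[df["bfly"] > threshold]

# Collect the top values in a plain Python list, as the question asks.
top_values = top["bfly"].tolist()
print(threshold, top_values)
```

The threshold itself can serve directly as the filter value, so building the list is optional.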

How to rank (in percent) each column in a dataframe in place?

The df is as shown below...
The code below ranks only one column and adds the result to df in place. I would like to rank all columns and put the rank values in a separate df.
df['rank_2020-06-23'] = df['2020-06-23'].rank(pct=True)
print(df)
Something like that should work:
df_ranks=pd.concat([pd.DataFrame(df[col].rank(pct=True)) for col in df.columns], axis=1)
It simply applies your function in a list comprehension, wrapping each result in a DataFrame to get a list of DataFrames:
list_df_ranks=[pd.DataFrame(df[col].rank(pct=True)) for col in df.columns]
Then concatenating them into one:
df_ranks=pd.concat(list_df_ranks, axis=1)
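A shorter variant is possible because DataFrame.rank ranks each column independently by default. The sketch below uses made-up data; the second date column and the rank_ prefix are illustrative choices that mirror the naming in the question.

```python
import pandas as pd

# Hypothetical data with one column per date, as in the question.
df = pd.DataFrame({
    "2020-06-23": [10, 30, 20, 40],
    "2020-06-24": [4, 3, 2, 1],
})

# Rank every column as a percentile and store the result in a separate frame.
df_ranks = df.rank(pct=True).add_prefix("rank_")
print(df_ranks)
```

The original df is left unchanged, and df_ranks has one rank column per input column.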

Pandas: Replacing a row with another data frame

I have two data frames: df1
and df2
Now I want to replace one of the rows of df1 (highlighted in red) with all the rows of df2. I tried the following code, but it didn't give the desired result:
df1[df1['Category_2']=='Specified Functionality'].update(df2)
I also tried:
df1[df1['Category_2']=='Specified Functionality'] = df2
Could anyone point out where I am making a mistake?
You can insert the rows like this:
row = 13
df2 = df2.rename(columns={'Functionality': 'Category_2'})
df = pd.concat([df1[0:row], df2, df1[row+1:]]).reset_index(drop=True)
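Instead of hard-coding the row number, the position can be derived from the condition in the question. This is a sketch on made-up data: the column names Category_2 and Functionality come from the thread, while the Value column and all sample values are hypothetical. It assumes at least one row matches.

```python
import pandas as pd

# Hypothetical frames; only the 'Category_2' and 'Functionality' names come from the thread.
df1 = pd.DataFrame({
    "Category_2": ["Input", "Specified Functionality", "Output"],
    "Value": [1, 2, 3],
})
df2 = pd.DataFrame({"Functionality": ["Login", "Search"], "Value": [20, 21]})

# Position of the first row that matches the condition (assumes a match exists).
mask = df1["Category_2"] == "Specified Functionality"
pos = int(mask.to_numpy().argmax())

# Align the column names, then splice df2 in place of the matched row.
df2_renamed = df2.rename(columns={"Functionality": "Category_2"})
result = pd.concat(
    [df1.iloc[:pos], df2_renamed, df1.iloc[pos + 1:]]
).reset_index(drop=True)
print(result)
```

Using iloc makes the slicing explicitly positional, which avoids surprises when df1 has a non-default index.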

How do I replace items on one DataFrame with items from other DataFrame?

I have two DataFrames:
df = pd.DataFrame({'ID': ['bumgm001', 'lestj001',
                          'tanam001', 'hellj001', 'chacj001']})
df1 = pd.DataFrame({'playerID': ['bumgama01', 'lestejo01',
                                 'tanakama01', 'hellije01', 'chacijh01'],
                    'retroID': ['bumgm001', 'lestj001', 'tanam001', 'hellj001', 'chacj001']})
OR
df            df1
ID            playerID       retroID
'bumgm001'    'bumgama01'    'bumgm001'
'lestj001'    'lestejo01'    'lestj001'
'tanam001'    'tanakama01'   'tanam001'
'hellj001'    'hellije01'    'hellj001'
'chacj001'    'chacijh01'    'chacj001'
Now, my actual DataFrames are a little more complicated than this, but I simplified it here so it's clearer what I'm trying to do.
I would like to take all of the ID's in df and replace them with the corresponding playerID's in df1.
My final df should look like this:
df
ID
'bumgama01'
'lestejo01'
'tanakama01'
'hellije01'
'chacijh01'
I have tried to do it using the following method:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID']==row[1], 'playerID']
    df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
        df.loc[df['ID']==row[1], 'ID'], value=playerID)
The code seems to run just fine. But my retroID's in df have been changed to NaN rather than the proper playerIDs.
This strikes me as a datatype problem, but I'm not familiar enough with Pandas to diagnose any further.
EDIT:
Unfortunately, I made my example too simplistic. I edited to better represent the issue I'm having. I'm trying to look up the item from one DataFrame in a second DataFrame, then I want to replace the item from the first Dataframe with an item from the corresponding row of the second Dataframe. The columns DO NOT have the same name.
You can use the second dataframe as a dictionary for replacement:
to_replace = df1.set_index('retroID')['playerID'].to_dict()
df['ID'] = df['ID'].replace(to_replace)
According to your example, this is what you want:
df['ID'] = df1['playerID']
If the data is not in the same order (row 1 of df does not correspond to row 1 of df1), then use
df['ID']=df1.set_index('retroID').reindex(df['ID'])['playerID'].values
Credit to Wen for the second approach.
Output
ID
0 bumgama01
1 lestejo01
2 tanakama01
3 hellije01
4 chacijh01
Let me know if it's correct
OK, I've figured out a solution. As it turns out, my problem was a type problem. I updated my code from:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID']==row[1], 'playerID']
    df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
        df.loc[df['ID']==row[1], 'ID'], value=playerID)
to:
for row in df.itertuples():  # row[1] == the retroID column
    playerID = df1.loc[df1['retroID']==row[1], 'playerID'].values[0]
    df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
        df.loc[df['ID']==row[1], 'ID'], value=playerID)
This works because "playerID" is now a scalar (thanks to .values[0]) rather than a one-element Series, which was not compatible with the replacement in the DataFrame.
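The row-by-row loop can also be avoided entirely with a vectorized lookup. The sketch below reuses the example frames from the question and assumes that retroID values in df1 are unique; falling back to the original ID when no match exists is an illustrative choice, not something the thread specifies. Note that Series.replace returns a new object, so any result has to be assigned back to df to take effect.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['bumgm001', 'lestj001',
                          'tanam001', 'hellj001', 'chacj001']})
df1 = pd.DataFrame({'playerID': ['bumgama01', 'lestejo01',
                                 'tanakama01', 'hellije01', 'chacijh01'],
                    'retroID': ['bumgm001', 'lestj001', 'tanam001',
                                'hellj001', 'chacj001']})

# Series mapping retroID -> playerID (assumes retroID values are unique).
lookup = df1.set_index('retroID')['playerID']

# Look up every ID at once; IDs without a match keep their original value.
df['ID'] = df['ID'].map(lookup).fillna(df['ID'])
print(df)
```

Because the lookup is by value, this does not depend on the two frames being in the same row order.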
