I have two dataframes (df3 and dfA): one with unique values only (df3), the other with multiple values (dfA).
I want to add a column 'Trdr' from dfA to df3 based on column 'ISIN'.
The issue is: dfA has multiple lines with the same ISIN, each with a 'Trdr' value. Therefore, when I try to merge the datasets, it adds a row for every occurrence of an ISIN that has a 'Trdr' value.
In layman's terms, I want to VLOOKUP the first value that pandas can find in dfA and assign it to df3.
df3 = pd.merge(df3, dfA[['Trdr', 'ISIN']], how='left', on='ISIN')
df3 before the merge, showing unique ISINs only
dfA, showing the 'Trdr' values I'm trying to merge
df3 after the merge, showing multiple lines of VODAFONE rather than one unique value as in the initial screenshot
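One way to get the VLOOKUP-style "first match" behaviour described above is to de-duplicate dfA on 'ISIN' before merging, so the left merge cannot multiply rows in df3. A minimal sketch with made-up data (the ISINs and trader names are invented for illustration):

```python
import pandas as pd

# Hypothetical data: df3 has unique ISINs, dfA repeats them
df3 = pd.DataFrame({'ISIN': ['GB00BH4HKS39', 'US0378331005']})
dfA = pd.DataFrame({
    'ISIN': ['GB00BH4HKS39', 'GB00BH4HKS39', 'US0378331005'],
    'Trdr': ['Alice', 'Bob', 'Carol'],
})

# Keep only the first 'Trdr' per ISIN, then merge as before;
# df3 keeps exactly one row per ISIN
first_trdr = dfA.drop_duplicates(subset='ISIN', keep='first')
out = pd.merge(df3, first_trdr[['Trdr', 'ISIN']], how='left', on='ISIN')
print(out)
```

With `keep='first'`, the first 'Trdr' value pandas finds for each ISIN wins, which matches the VLOOKUP semantics asked for.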
I'm trying to drop duplicated rows in a dataframe by using .drop_duplicates with a subset list, so that it only drops the rows that have the same values in the given columns. But for some reason, it didn't drop all of them.
This is the dataframe before dropping...
This is the code that I used to drop the rows...
df_combined.drop_duplicates(subset = ['Anonymized_ID', 'COURSE', 'GRADE'], keep='last', inplace=True)
This is the dataframe after dropping...
I was expecting to see only two rows after dropping, since they have the same values for the specified columns.
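When `drop_duplicates` leaves rows behind like this, the values in the subset columns usually differ in some invisible way (trailing whitespace, case, or dtype), so pandas does not consider them duplicates. A sketch with invented data showing the effect and a whitespace-normalization fix:

```python
import pandas as pd

# Hypothetical rows: two exact duplicates, plus one that differs
# only by a trailing space in GRADE
df_combined = pd.DataFrame({
    'Anonymized_ID': ['s1', 's1', 's1'],
    'COURSE': ['MATH101', 'MATH101', 'MATH101'],
    'GRADE': ['A', 'A', 'A '],
})

# Exact duplicates collapse, but 'A' and 'A ' are different strings,
# so the third row survives
deduped = df_combined.drop_duplicates(
    subset=['Anonymized_ID', 'COURSE', 'GRADE'], keep='last')
print(len(deduped))  # 2

# Normalizing the strings first makes the comparison robust
df_combined['GRADE'] = df_combined['GRADE'].str.strip()
fully = df_combined.drop_duplicates(
    subset=['Anonymized_ID', 'COURSE', 'GRADE'], keep='last')
print(len(fully))  # 1
```

Whether this is the cause in the question depends on the actual data, but it is worth checking with `df['col'].unique()` before anything else.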
I have two dataframes (which do not have the same number of rows).
dfA contains two columns, "CCLE_ID" and "Name", amongst other unimportant ones.
dfB contains two columns, "CCLE ID" and "Cell line", amongst other unimportant ones.
Right now, dfB['CCLE ID'] values are set to 0.
What I want to do is compare all the values in the dfA['Name'] column and the dfB['Cell line'] column. They are all strings standing for the shorthand names of cell lines. If a value in dfA['Name'] matches one in dfB['Cell line'], then I want to replace the 0 in the dfB['CCLE ID'] column with the string from the dfA['CCLE_ID'] column for that matched cell name.
I am honestly so lost as to how to do this (pandas beginner).
dfB:

   CCLE ID  Cell line Cancer Query Cancer Label     Score
0        0      CAOV4           OV          yes  0.604027
1        0  KURAMOCHI           OV          yes  0.592132
2        0     OVSAHO           OV          yes  0.586126
First, we presume dfA and dfB have the same number of rows, because if they don't, things are more complicated and you have two choices: either reshape the dataframes to have the same number of rows, or use other techniques to perform the transformation.
Based on this initial presumption that the dataframes have the same number of rows, I'm going to try and break this down for you step by step.
With the two dataframes, dfA and dfB, start by merging the data. You can remove the extra columns from dfB later.
To merge the dfA columns into dfB for simplicity, add two columns, dfaName and dfa_CCLE_ID:
dfB['dfaName'] = dfA['Name']
dfB['dfa_CCLE_ID'] = dfA['CCLE_ID']
Then use pandas.DataFrame.apply() to conditionally transform your data. Note that the lambda must return a scalar in both branches (returning the row x itself would put a whole Series in the cell), and the existing column must be included so it can be kept when there is no match:
dfB['CCLE ID'] = dfB.apply(lambda x: x['dfa_CCLE_ID'] if x['dfaName'] == x['Cell line'] else x['CCLE ID'], axis=1)
A nice extra is to use a boolean mask to generate and inspect the comparison. It is a good step for viewing and testing your data transformation. In this example, create an extra column in dfB with True/False values for the comparison:
dfB['column_matcher'] = dfB['dfaName'] == dfB['Cell line']
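If the two frames do not line up row by row, a lookup with Series.map avoids the alignment assumption entirely. A sketch using the column names from the question (the CCLE_ID strings here are invented):

```python
import pandas as pd

# Hypothetical data: dfA and dfB have different lengths and orders
dfA = pd.DataFrame({'CCLE_ID': ['ACH-000001', 'ACH-000002'],
                    'Name': ['CAOV4', 'KURAMOCHI']})
dfB = pd.DataFrame({'CCLE ID': [0, 0, 0],
                    'Cell line': ['CAOV4', 'KURAMOCHI', 'OVSAHO']})

# Build a Name -> CCLE_ID lookup from dfA, map it onto dfB's cell lines;
# unmatched cell lines keep their original 0
lookup = dfA.set_index('Name')['CCLE_ID']
dfB['CCLE ID'] = dfB['Cell line'].map(lookup).fillna(dfB['CCLE ID'])
print(dfB)
```

This treats dfA as a dictionary keyed by cell-line name, which is usually what a "replace by matched name" requirement means.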
I need to concatenate two DataFrames, where both dataframes have a column named 'sample ids'. The first dataframe has all the relevant information needed; however, its 'sample ids' column is missing all the sample ids that are in the second dataframe. Is there a way to insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first dataframe using the second dataframe?
I have tried the following:
pd.concat([DF1,DF2],axis=1)
this did retain all information from both DataFrames, but the sample ids from the two dataframes were separated into different columns.
pd.merge(DF1,DF2,how='outer/inner/left/right')
this did not produce the desired outcome in the least...
I have shown the templates of the two dataframes below. Please help, my brain is exploding!
DataFrame 2
DataFrame 1
If you want to:
insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first
dataframe using the second dataframe
you can use an outer join by .merge() with how='outer', as follows:
df_out = df1.merge(df2, on="samp_id", how='outer')
To further ensure the samp_id values are IN SEQUENTIAL ORDER, you can sort on samp_id using .sort_values(), as follows:
df_out = df1.merge(df2, on="samp_id", how='outer').sort_values('samp_id', ignore_index=True)
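To see the effect, here is the same outer merge on tiny invented frames, where df2 supplies the sample ids missing from df1:

```python
import pandas as pd

# Hypothetical data: df1 has gaps in its sample ids, df2 has them all
df1 = pd.DataFrame({'samp_id': [1, 3, 5], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'samp_id': [1, 2, 3, 4, 5]})

# Outer merge keeps every samp_id from both frames; sorting puts
# the inserted ids in sequential order with a fresh index
df_out = df1.merge(df2, on='samp_id', how='outer').sort_values(
    'samp_id', ignore_index=True)
print(df_out['samp_id'].tolist())  # [1, 2, 3, 4, 5]
```

The rows introduced from df2 get NaN in df1's other columns, which is how the "missing" ids show up.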
Try this:
df = df1.merge(df2, on="samp_id")
Say I have two dataframes: dataframe A has an a_id column, and dataframe B has a b_id column and a b_value column. How can I join A and B on a_id = b_id and get C with the id and max(b_value)?
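A merge followed by a groupby gives the per-id maximum. A sketch with invented values:

```python
import pandas as pd

# Hypothetical data: B has several b_values per id
A = pd.DataFrame({'a_id': [1, 2]})
B = pd.DataFrame({'b_id': [1, 1, 2], 'b_value': [10, 30, 20]})

# Join on the ids, then keep the maximum b_value for each id
C = (A.merge(B, left_on='a_id', right_on='b_id')
       .groupby('a_id', as_index=False)['b_value'].max())
print(C)
```

`as_index=False` keeps the id as a regular column, so C has exactly the id/max(b_value) shape asked for.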
You can use the concat function in pandas to append either columns or rows from one DataFrame to another.
Here's an example:
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
# Reset the index values so the second dataframe appends properly
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values
When I concatenate DataFrames, I need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, I need to make sure both datasets have the same columns and associated column formats. When I stack horizontally, I want to make sure what I am doing makes sense (i.e. the data are related in some way).
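The two axis behaviours described above can be seen on a pair of tiny invented frames:

```python
import pandas as pd

top = pd.DataFrame({'x': [1, 2]})
bottom = pd.DataFrame({'x': [3, 4]})

# axis=0 stacks `bottom` UNDER `top`; ignore_index renumbers the rows
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked['x'].tolist())  # [1, 2, 3, 4]

# axis=1 places the second frame's columns to the RIGHT of the first
right = pd.DataFrame({'y': [10, 20]})
side_by_side = pd.concat([top, right], axis=1)
print(list(side_by_side.columns))  # ['x', 'y']
```

With axis=1, rows are aligned by index, which is why resetting the index (as in the snippet above) matters before stacking.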
I am trying to concatenate two dataframes, and in case of duplication I'd like to keep the row that has the maximum value for column C.
I tried this command:
df = pd.concat([df1, df2]).max(level=0)
So if two rows have the same values for columns A and B, I want to keep just the row with the maximum value for column C.
You can sort by column C, then drop duplicates by columns A & B:
df = pd.concat([df1, df2])\
.sort_values('C')\
.drop_duplicates(subset=['A', 'B'], keep='last')
Your attempt exhibits a couple of misunderstandings:
pd.DataFrame.max is used to calculate maximum values, not to filter a dataframe.
The level parameter is relevant only for MultiIndex dataframes.
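A quick check of the sort-then-dedupe approach from the answer, on invented data:

```python
import pandas as pd

# Hypothetical frames with overlapping (A, B) pairs
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [5, 1]})
df2 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [3, 9]})

# Sort by C so the largest value per (A, B) pair comes last,
# then keep that last row when dropping duplicates
df = (pd.concat([df1, df2])
        .sort_values('C')
        .drop_duplicates(subset=['A', 'B'], keep='last'))
print(df['C'].tolist())  # [5, 9]
```

Each (A, B) pair survives exactly once, carrying its maximum C, which is the behaviour the question asked for.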