I need help because I'm trying to obtain a value by comparing columns across different dataframes.
First of all, I tried a for loop to reach the goal, but I have millions of rows, so it takes a lot of time.
Now, I would like to use numpy.where, in this way:
I have 2 data frames:
- df1, where each row is different from the others (the column ID is the unique primary key) --> df1['ID', 'status', 'boolean']
- df2, which contains only a few rows, each different from the others --> df2['code', 'segment', 'value']
Now, I need to create a new column in df1 called 'weight'.
I tried to create the column 'weight' in this way:
df1['weight'] = numpy.where(df1['boolean'] == 1, df2[(df2['code'] == df1['ID']) & (df2['segment'] == df1['status'])]['value'], 0)
The columns 'code' + 'segment' form a unique key, so the lookup returns one and only one value.
The program execution shows this error:
"ValueError: Can only compare identically-labeled Series objects"
Can anyone help me to understand it?
Thank you.
You could do this with a left join. Something like this might work; without sample data I can't check it in detail:
df_merged = df1.join(df2.set_index(['code', 'segment']), how='left', on=['ID', 'status'])
df1['weight'] = df_merged['value'].reindex(df1.index).fillna(0)
The set_index() call is needed because of how join's on parameter is documented:
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns are given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
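For example, with made-up sample data (the numpy.where at the end is my addition, to satisfy the original requirement that weight be 0 when boolean != 1):

import numpy as np
import pandas as pd

# Made-up sample data, just to illustrate the lookup
df1 = pd.DataFrame({'ID': ['a', 'b', 'c'],
                    'status': ['x', 'y', 'x'],
                    'boolean': [1, 0, 1]})
df2 = pd.DataFrame({'code': ['a', 'c'],
                    'segment': ['x', 'x'],
                    'value': [10.0, 20.0]})

# Left join: look up 'value' where (ID, status) matches (code, segment)
df_merged = df1.join(df2.set_index(['code', 'segment']), how='left', on=['ID', 'status'])
# Keep the looked-up value only where boolean == 1, otherwise 0
df1['weight'] = np.where(df1['boolean'] == 1, df_merged['value'].fillna(0), 0)
print(df1)
#   ID status  boolean  weight
# 0  a      x        1    10.0
# 1  b      y        0     0.0
# 2  c      x        1    20.0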
Related
I have two data frames and I am comparing two of their columns, df_luad['Tumor_Sample_Barcode'] and df_tmb['Patient ID']. If the values in the two columns are equal, I want to add a column from the second dataframe, df_tmb['TMB (nonsynonymous)'], to a new third dataframe df1 as the column df1['tmb_value'].
df1['tmb_value'] = np.where(df_luad['Tumor_Sample_Barcode'].eq(df_tmb['Patient ID']), 'True', df_tmb['TMB (nonsynonymous)'])
However, I am getting this error:
*** ValueError: Length of values (586) does not match length of index (521)
which is related to the row counts: df_luad has 521 rows and df_tmb has 586 rows. How can I add the values of df_tmb['TMB (nonsynonymous)'] only for the matching rows (records) in df_luad?
(Screenshots of the df_tmb and df_luad sample data followed here.)
If I understand your question correctly, you may wish to use pandas' merge method. It would look something like this:
df1 = df_luad.merge(
    df_tmb, left_on='Tumor_Sample_Barcode', right_on='Patient ID', how='left'
)
In this case, all rows of df_tmb where the value of 'Patient ID' is also present in the 'Tumor_Sample_Barcode' column of df_luad will be merged in. The merge will include all columns of df_tmb, including 'TMB (nonsynonymous)'. If there are other columns you do not want, you will have to remove them manually, e.g. as sketched below.
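For instance (untested without the real data; all column names are taken from the question), you could keep only the key and the value column from df_tmb before merging, and rename the result to the tmb_value column you wanted:

df1 = df_luad.merge(
    df_tmb[['Patient ID', 'TMB (nonsynonymous)']],
    left_on='Tumor_Sample_Barcode', right_on='Patient ID', how='left'
)
df1 = df1.drop(columns='Patient ID').rename(columns={'TMB (nonsynonymous)': 'tmb_value'})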
If this doesn't produce the required effect then please post a full reproducible example including expected output so we can better understand your question.
Before I start: I have found similar questions and tried the corresponding answers; however, I am still running into an issue and can't figure out why.
I have 6 data frames. I want one resulting data frame that merges all 6 into one, based on their common index column Country. Things to note: the data frames have different numbers of rows, and some countries do not have corresponding values, resulting in NaN.
Here is what I have tried:
from functools import reduce

data_frames = [WorldPopulation_df, WorldEconomy_df, WorldEducation_df, WorldAggression_df, WorldCorruption_df, WorldCyberCapabilities_df]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Country'], how='outer'), data_frames)
This doesn't work: the final resulting data frame pairs up values with the wrong countries. Any suggestions?
Let's see: pd.merge is used when you want to add new columns based on a key. In case you have 6 dataframes with the same number of columns, in the same order, you can instead try this:
columns_order = ['country', 'column_1']
concat_ = pd.concat(
    [data_1[columns_order], data_2[columns_order], data_3[columns_order],
     data_4[columns_order], data_5[columns_order], data_6[columns_order]],
    ignore_index=True,
    axis=0
)
From here, if you want a single row per "country" value, you can apply a groupby:
concat_.groupby(by=['country']).agg({'column_1': 'max'}).reset_index()
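As a rough illustration with two tiny, made-up frames:

import pandas as pd

data_1 = pd.DataFrame({'country': ['A', 'B'], 'column_1': [1, 2]})
data_2 = pd.DataFrame({'country': ['A', 'C'], 'column_1': [3, 4]})
columns_order = ['country', 'column_1']
concat_ = pd.concat([data_1[columns_order], data_2[columns_order]],
                    ignore_index=True, axis=0)
print(concat_.groupby(by=['country']).agg({'column_1': 'max'}).reset_index())
#   country  column_1
# 0       A         3
# 1       B         2
# 2       C         4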
I am trying to merge two dataframes with an inner join and append the values. I was able to perform the join, but for some reason the values are not ordered properly in each column.
To explain more about this:
Please find the screenshot below, where my first dataframe has the stored values of each column.
My second dataframe has the string values which need to be replaced with the values stored in dataframe 1 above.
Below is the output I got, but when you look at the values and compare them with dataframe 2, they are not assigned properly. For example, if you consider row 1 in dataframe 2, Column 1 should have the value 1.896552 (i.e. the second column in dataframe 2), but in my output I have something else.
Below is the code I worked with to achieve the above result.
Joined_df_wna_test = pd.DataFrame()
for col in Unseen_cleaned_rmv_un_2:
    Joined_data = pd.merge(Output_df_unseen, my_dataframe_categorical, on=col, how='inner')
    Joined_df = pd.DataFrame(Joined_data)
    Joined_df_wna_test[col] = Joined_df.Value
Joined_df_wna_test
Could someone please help me overcome this issue?
Found the answer for this.
The how='left' part is what actually makes it keep the order:
Joined_data = Unseen_cleaned_rmv_un.merge(my_dataframe_categorical, on=col, how='left')
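To illustrate with made-up data: a left merge keeps the row order of the calling (left) frame, which is why the values line up again:

import pandas as pd

left = pd.DataFrame({'col': ['c', 'a', 'b']})
right = pd.DataFrame({'col': ['a', 'b', 'c'], 'Value': [1, 2, 3]})
print(left.merge(right, on='col', how='left'))
#   col  Value
# 0   c      3
# 1   a      1
# 2   b      2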
When merging 2 datasets, I get NaN on the second row of each category.
This is a toy dataset to illustrate the problem:
import pandas as pd

df1 = pd.DataFrame({'Num': [1, 1, 2, 3, 3],
                    'date': ['1995-09-01', '1995-10-04', '1995-11-07', '1995-11-10', '1995-11-25'],
                    'A': [42.5, 40, 38, 40, 28],
                    'B': [13.3, 12.3, 12.2, 11, 10]})
df2 = pd.DataFrame({'Num': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
                    'date': ['1995-09-01', '1995-09-02', '1995-10-03', '1995-10-04', '1995-10-05',
                             '1995-11-07', '1995-11-08', '1995-11-09', '1995-11-10', '1995-11-25'],
                    'C': [42.5, 39.5, 37.2, 40, 41, 38, 38.2, 39.7, 40, 28],
                    'D': [13.3, 12.8, 12.1, 12.3, 13.3, 12.2, 12.4, 12.8, 11, 10]})
After running the following code:
data = pd.merge(df1, df2, how='left', left_on=['Num', 'date'], right_on=['Num', 'date'])
Here is what I should obtain (and do obtain with this toy dataset). With my real dataset, however, I get NaN on some rows, as described above.
I have checked the datatypes and they match, and no nulls or NaNs show up in the keys. Num is formatted as int64 and date as datetime64.
If you ever face a situation like the problem described above, here's what I did to solve it:
- Check that the dtypes of the common keys match in both dataframes.
- If the problem persists, check the rows of the key column(s) you are merging on (see the indicator sketch below).
In my case, the 'Num' key was fine. However, the 'date' key had rows in df2 that differed from df1. This explains why, after the merge, some rows contained data (on the right side) and some didn't.
Given the merge type I had chosen (how='left'):
- The resulting shape of the merged dataframe was correct.
- All rows contained correct info for the left dataframe (df1).
- Some rows on the right side of the merged dataframe contained NaN, given the lack of a match (on one of the 2 keys) between the first and the second dataframe.
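A quick way to find such mismatches (a sketch, reusing the toy frames from the question) is merge's indicator flag, which adds a _merge column showing whether each row found a match:

check = df1.merge(df2, on=['Num', 'date'], how='left', indicator=True)
print(check[check['_merge'] == 'left_only'])  # rows of df1 with no match in df2
# If the dates are the culprit, normalizing both sides first can help, e.g.:
# df1['date'] = pd.to_datetime(df1['date'])
# df2['date'] = pd.to_datetime(df2['date'])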
I would go with:
df1.merge(df2, on=['Num','date'])
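With the toy data above, every (Num, date) pair of df1 also appears in df2, so the default inner join matches all five rows:

   Num        date     A     B     C     D
0    1  1995-09-01  42.5  13.3  42.5  13.3
1    1  1995-10-04  40.0  12.3  40.0  12.3
2    2  1995-11-07  38.0  12.2  38.0  12.2
3    3  1995-11-10  40.0  11.0  40.0  11.0
4    3  1995-11-25  28.0  10.0  28.0  10.0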
I have a dataframe (df1) with a date range as its index and no columns specified, and another dataframe (df2) with float values in every column.
I tried joining a specific column from df2 to df1 using the .join() method and ended up with all values as NaN in df1. What should I do to solve this?
It's unclear what you mean without an example of the data or their shape, and without more details about what kind of 'join' you're trying to do. It sounds like you are trying to concatenate dataframes without relying on a column or index level name to join on. Joining on common values is what join and merge do, so if you don't have common values in the on parameter of the join, you'll end up with NaNs. If I'm correct and you just want a concatenation of dataframes, then you can use concat. I can't provide exact code without more details, but it would look something like this:
new_df = pd.concat([df1, df2[['whatever_column_you_want_to_concatenate']]], axis=1)
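One caveat (an assumption, since the data wasn't shown): concat with axis=1 aligns rows by index label, so if df2's index differs from df1's date-range index you will still get NaNs. In that case, aligning positionally first may be what you want:

col = df2[['whatever_column_you_want_to_concatenate']].copy()
col.index = df1.index  # assumes df1 and df2 have the same number of rows
new_df = pd.concat([df1, col], axis=1)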