I use Python 2.7 and pandas 0.13.
I am trying to merge two dataframes:
dfa[['descriptor_ref','issue_ref']]
dfi[['descriptor', 'cfit']]
dfm = pd.merge(dfa, dfi, left_on='descriptor_ref', right_on='descriptor', how='outer')
Both descriptor_ref and descriptor are dtype object.
The columns from the right dataframe (dfi) are present in the result (dfm), but the values are not: every column from dfi is empty in dfm. The number of rows in dfm is equal to the number of rows in dfa.
What could cause this?
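A common cause is key values that never actually compare equal even though both columns are dtype object, e.g. stray whitespace on one side. A quick diagnostic sketch (frame and column names are taken from the question; the whitespace explanation is an assumption):

# Count how many keys the two sides actually share; 0 would explain the empty columns
left_keys = set(dfa['descriptor_ref'].dropna())
right_keys = set(dfi['descriptor'].dropna())
print(len(left_keys & right_keys))

# If stray whitespace is the culprit, strip both sides before merging
dfa['descriptor_ref'] = dfa['descriptor_ref'].str.strip()
dfi['descriptor'] = dfi['descriptor'].str.strip()
dfm = pd.merge(dfa, dfi, left_on='descriptor_ref', right_on='descriptor', how='outer')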
I am trying to merge two dataframes based on a date column with this code:
data_df = (pd.merge(data, one_min_df, on='date', how='outer'))
The first dataframe has 3784 rows and the second dataframe has 3764. Every date in the second dataframe is also present in the first. I would like to merge the dataframes on the date column, with the dates that only the longer dataframe has left blank or NaN.
The code I have here gives the 3764 values followed by 20 empty rows, rather than correctly matching them.
Try this:
# Parse both key columns to datetime64 so the merge keys actually match
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])
# A left join keeps every row of the longer frame; unmatched dates get NaN
data_df = pd.merge(data, one_min_df, on='date', how='left')
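The conversion matters because a date stored as a string and one stored as datetime64 will not line up as merge keys (depending on the pandas version, the merge either matches nothing or raises). A minimal sketch with invented dates:

import pandas as pd

data = pd.DataFrame({'date': ['2016-01-04', '2016-01-05', '2016-01-06'], 'x': [1, 2, 3]})
one_min_df = pd.DataFrame({'date': ['2016-01-04', '2016-01-05'], 'y': [10, 20]})

# Parse both key columns so they share the datetime64 dtype before merging
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])

data_df = pd.merge(data, one_min_df, on='date', how='left')
print(data_df)   # 2016-01-06 keeps its row, with y = NaN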
I need to concatenate two DataFrames that both have a column named 'sample ids'. The first dataframe has all the relevant information needed; however, its sample ids column is missing all the sample ids that are in the second dataframe. Is there a way to insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first dataframe using the second dataframe?
I have tried the following:
pd.concat([DF1,DF2],axis=1)
this did retain all information from both DataFrames, but the sample ids from both dataframes were separated into different columns.
pd.merge(DF1,DF2,how='outer/inner/left/right')
this did not produce the desired outcome in the least...
I have shown the templates of the two dataframes below. Please help, my brain is exploding!!!
[Templates of DataFrame 1 and DataFrame 2 were shown as images in the original post.]
If you want to:
insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first
dataframe using the second dataframe
you can use an outer join by .merge() with how='outer', as follows:
df_out = df1.merge(df2, on="samp_id", how='outer')
To further ensure the samp_id values are IN SEQUENTIAL ORDER, you can sort on samp_id using .sort_values(), as follows:
df_out = df1.merge(df2, on="samp_id", how='outer').sort_values('samp_id', ignore_index=True)
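(The ignore_index argument of .sort_values() needs pandas 1.0+; on older versions chain .reset_index(drop=True) instead.) To see what the outer join plus sort produces, here is a small example with invented samp_id values:

import pandas as pd

df1 = pd.DataFrame({'samp_id': [1, 3, 5], 'info': ['a', 'b', 'c']})
df2 = pd.DataFrame({'samp_id': [1, 2, 3, 4, 5]})

# The outer join keeps every samp_id from both frames; ids missing from df1
# appear with NaN in its columns, then the sort restores sequential order
df_out = df1.merge(df2, on='samp_id', how='outer').sort_values('samp_id', ignore_index=True)
print(df_out)
#    samp_id info
# 0        1    a
# 1        2  NaN
# 2        3    b
# 3        4  NaN
# 4        5    c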
Try this:
df = df1.merge(df2, on="samp_id")
Note that merge defaults to how='inner', so sample ids present in only one dataframe are dropped; use how='outer' as above if they need to be kept.
I have two CSV files: CSV_Cleaned, which has 891 rows, and CSV_Uncleaned, which has 945 rows. I wish to get only those rows from CSV_Uncleaned whose index value matches an index in CSV_Cleaned. How do I do it?
NOTE: My data frame has no column named 'index'; I am talking about the index values that are automatically generated to the left of the 1st column.
Assuming the column of interest is called "index" in the CSV files, you can do this using merge:
import pandas as pd

df1 = pd.read_csv('CSV_Cleaned.csv')
df2 = pd.read_csv('CSV_Uncleaned.csv')
df = df1.merge(df2, left_on='index', right_on='index', how='left')
in case you already have DataFrames that need to be merged by their index:
df = df1.merge(df2, left_index=True, right_index=True, how='left')
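Since the question says the index is the auto-generated one, an equivalent alternative to the merge is a plain isin filter on the index (a sketch; file names from the question):

import pandas as pd

df_cleaned = pd.read_csv('CSV_Cleaned.csv')      # 891 rows
df_uncleaned = pd.read_csv('CSV_Uncleaned.csv')  # 945 rows

# Keep only the CSV_Uncleaned rows whose auto-generated index
# also appears in CSV_Cleaned's index
result = df_uncleaned[df_uncleaned.index.isin(df_cleaned.index)]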
I believe the merge type in R is a left outer join. The merge I implemented in Python returned a dataframe with the same shape as the merged result in R. However, when I dropped duplicates (df2.drop_duplicates()), 4000 rows were dropped in Python, as opposed to the 50 rows dropped when applying drop-duplicates to the post-merge R data frame.
The dataframes I need to merge are df1 and df2.
R:
df2 <- merge(df2[, -which(names(df2) %in% c(column9, column10))], df1[, c(column1, column2, column4, column5)], by.x=c(column1, column2), by.y=c(column2, column4), all.x=TRUE)
Python:
df2 = df2[[column1, column2, column3...column8]].merge(df1[[column1, column2, column4, column5]], how='left', left_on=[column1, column2], right_on=[column2, column4])
df2[column1] and df2[column2] are the columns I want to merge on; their counterparts in df1 are named df1[column2] and df1[column4], but they hold the same row values.
My gut tells me that the issue stems from this portion of the code, which I might be misinterpreting: -which(names(df2) %in% c(column9, column10))
Please feel free to send some tips my way if I'm messing up somewhere
First, subsetting columns with a plain list is no longer recommended in pandas when labels may be missing. Instead, use reindex to subset columns, which handles missing labels.
The R expression -which(names(df2) %in% c(column9, column10)) translates in pandas to ~df2.columns.isin([column9, column10]). Because isin returns a boolean array, subset with DataFrame.loc:
df2 = (df2.loc[:, ~df2.columns.isin([column9, column10])]
          .merge(df1.reindex([column1, column2, column4, column5], axis='columns'),
                 how='left',
                 left_on=[column1, column2],
                 right_on=[column2, column4])
      )
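To see what the boolean mask does, here is a self-contained sketch with invented column names:

import pandas as pd

df2 = pd.DataFrame(columns=['column1', 'column2', 'column9', 'column10'])

# True for every column NOT in the drop list, mirroring R's
# -which(names(df2) %in% c('column9', 'column10'))
keep = ~df2.columns.isin(['column9', 'column10'])
print(df2.loc[:, keep].columns.tolist())   # ['column1', 'column2']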
I have two pandas dataframes, both indexed with datetime entries. df1 has non-unique time indices, whereas df2's are unique. I would like to add the column df2.a to df1 in the following way: for every row in df1 with timestamp ts, df1.a should contain the most recent value of df2.a whose timestamp is less than ts.
For example, let's say that df2 is sampled every minute, and there are rows with timestamps 08:00:15, 08:00:47, 08:02:35 in df1. In this case I would like the value from df2.a[08:00:00] to be used for the first two rows, and df2.a[08:02:00] for the third. How can I do this?
You are describing an asof join, which was just released in pandas 0.19 as pd.merge_asof. Since both frames are indexed by timestamp and you want the most recent value strictly before each ts, pass allow_exact_matches=False:
df1 = pd.merge_asof(df1, df2, left_index=True, right_index=True, allow_exact_matches=False)
Alternatively, apply over the rows of df1 and look each timestamp up in df2 with a forward-fill reindex (note this takes the value at or before each timestamp, not strictly before it):
df1['df2.a'] = df1.apply(lambda x: df2.a.reindex([x.name], method='ffill').iloc[0], axis=1)
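A small sketch reproducing the example from the question (the date part of the timestamps is invented):

import pandas as pd

df2 = pd.DataFrame({'a': [1.0, 2.0, 3.0]},
                   index=pd.to_datetime(['2016-01-01 08:00:00',
                                         '2016-01-01 08:01:00',
                                         '2016-01-01 08:02:00']))
df1 = pd.DataFrame({'x': [10, 20, 30]},
                   index=pd.to_datetime(['2016-01-01 08:00:15',
                                         '2016-01-01 08:00:47',
                                         '2016-01-01 08:02:35']))

# For each df1 row, take the latest df2 value strictly before its timestamp
out = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    allow_exact_matches=False)
print(out['a'].tolist())   # [1.0, 1.0, 3.0]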