How to merge two differently sized dataframes based on date - python

I am trying to merge two dataframes based on a date column with this code:
data_df = (pd.merge(data, one_min_df, on='date', how='outer'))
The first dataframe has 3784 rows and the second dataframe has 3764. Every date in the second dataframe also appears in the first. I would like to merge the dataframes on the date column, leaving blank or NaN values for any dates that only the longer dataframe has.
The code I have here gives the 3764 values followed by 20 empty rows, rather than matching them correctly.

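Try converting both date columns to datetime first, so that the merge keys share a dtype, and then use a left merge to keep every row of the longer dataframe: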
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])
data_df = pd.merge(data, one_min_df, on='date', how='left')
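A minimal sketch of why the conversion can matter, using made-up data (the columns a and b, the sample values, and the two string formats are assumptions, not taken from the question). If the two date columns hold values that look alike but do not compare equal, an outer merge finds no matches and simply appends the unmatched rows:

```python
import pandas as pd

# Hypothetical sample data: the same dates written in two different string formats.
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"], "a": [1, 2, 3]})
one_min_df = pd.DataFrame({"date": ["01/01/2021", "01/03/2021"], "b": [10, 30]})

# Merging the raw strings finds no equal keys, so the outer join stacks unmatched rows.
unaligned = pd.merge(data, one_min_df, on="date", how="outer")

# Converting both key columns to datetime makes equal dates compare equal.
data["date"] = pd.to_datetime(data["date"])
one_min_df["date"] = pd.to_datetime(one_min_df["date"], format="%m/%d/%Y")

# A left join keeps every row of the longer frame; dates missing from
# one_min_df get NaN in column 'b'.
data_df = pd.merge(data, one_min_df, on="date", how="left")
print(data_df)
```

Checking data.dtypes and one_min_df.dtypes before merging is a quick way to see whether the key columns actually differ in type.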

Related

Merge two dataframes one common column

I have two data frames, one with three columns and another with two columns. Two columns are common in both data frames:
I need to update the Marks column of df1 from df2 only where the value is missing, keeping the existing values in df1 unchanged.
I tried pd.merge, but the result contained a separate Marks column, which was not intended.
The following worked for me on the merged result (where the two Marks columns appear as Marks_x and Marks_y):
df1['Mark'] = df1.Marks_x.combine_first(df1.Marks_y)
df1['Marks_x'] = df1['Mark']
df1 = df1.drop(['Marks_y', 'Mark'], axis=1)
df1 = df1.rename(columns = {'Marks_x':'Marks'})
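For reference, a self-contained sketch of the same idea from start to finish. The key column Name, the extra column Class, and the sample values are assumptions made for illustration, since the original tables are not shown:

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 has gaps in Marks, df2 has a value for every Name.
df1 = pd.DataFrame({"Name": ["A", "B", "C"], "Class": [1, 1, 2], "Marks": [50.0, np.nan, np.nan]})
df2 = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [99.0, 60.0, 70.0]})

# The merge yields Marks_x (from df1) and Marks_y (from df2).
merged = df1.merge(df2, on="Name", how="left", suffixes=("_x", "_y"))

# combine_first keeps Marks_x where present and falls back to Marks_y otherwise.
merged["Marks"] = merged["Marks_x"].combine_first(merged["Marks_y"])

# Drop the two helper columns so only a single Marks column remains.
result = merged.drop(columns=["Marks_x", "Marks_y"])
print(result)
```

Note that the existing value for A (50.0) is kept even though df2 holds a different one.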

Concatenating DataFrames where DataFrame1 contains the missing values of DataFrame2 (Column Specific). DataFrame1 does not have NaN values

I need to concatenate two DataFrames where both dataframes have a column named 'sample ids'. The first dataframe has all the relevant information needed, however the sample ids column in the first dataframe is missing all the sample ids that are within the second dataframe. Is there a way to insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first dataframe using the second dataframe?
I have tried the following:
pd.concat([DF1,DF2],axis=1)
This did retain all information from both DataFrames, but the sample ids from both dataframes were separated into different columns.
pd.merge(DF1,DF2,how='outer/inner/left/right')
this did not produce the desired outcome in the least...
I have shown the templates of the two dataframes below. Please help my brain is exploding!!!
DataFrame 2
DataFrame 1
If you want to:
insert the 'missing' sample ids (IN SEQUENTIAL ORDER) into the first
dataframe using the second dataframe
you can use an outer join by .merge() with how='outer', as follows:
df_out = df1.merge(df2, on="samp_id", how='outer')
To ensure the samp_id values are in sequential order, you can additionally sort on samp_id using .sort_values(), as follows:
df_out = df1.merge(df2, on="samp_id", how='outer').sort_values('samp_id', ignore_index=True)
Try this (an inner join, which keeps only the samp_id values present in both dataframes):
df = df1.merge(df2, on="samp_id")
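A small sketch comparing the outer join plus sort with the default inner join. The info column and the sample ids are made up for illustration:

```python
import pandas as pd

# Hypothetical data: df1 lacks the sample ids 3 and 4, which only df2 contains.
df1 = pd.DataFrame({"samp_id": [1, 2, 5], "info": ["a", "b", "e"]})
df2 = pd.DataFrame({"samp_id": [3, 4]})

# Outer join keeps ids from both frames; sorting puts them in sequential order.
df_out = df1.merge(df2, on="samp_id", how="outer").sort_values("samp_id", ignore_index=True)

# The default (inner) join keeps only ids present in both frames.
df_inner = df1.merge(df2, on="samp_id")

print(df_out)
print(len(df_inner))
```

With data shaped like this, where the two frames share no ids, the inner join returns an empty frame, so the outer join is the one that inserts the missing ids.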

How do I merge two CSV files on common index values in Python using pandas?

I have two CSV files: CSV_Cleaned, which has 891 rows, and CSV_Uncleaned, which has 945 rows. I want to get only those rows from CSV_Uncleaned whose index value matches one in CSV_Cleaned. How do I do it?
NOTE: My data frame has no column named 'index', I am talking about the index values that are automatically generated on the left of the 1st column.
Assuming the column of interest is called "index" in the CSV files, you can do this using merge:
df1 = pd.read_csv('CSV_cleaned.csv')
df2 = pd.read_csv('CSV_Uncleaned.csv')
df = df1.merge(df2, left_on='index', right_on='index', how='left')
In case you already have DataFrames that need to be merged by their index:
df = df1.merge(df2, left_index=True, right_index=True, how='left')
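If the goal is only to keep the rows of the uncleaned data whose index labels also occur in the cleaned data, a boolean filter on the index may be enough. This is a sketch with made-up frames standing in for the two CSV files; it assumes the cleaned frame still carries the original index labels (for example because it was saved with its index and read back with index_col=0):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
df_uncleaned = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
# Cleaning is simulated by dropping two rows; the remaining index labels are preserved.
df_cleaned = df_uncleaned.drop(index=[1, 3])

# Keep only the uncleaned rows whose index label also appears in the cleaned frame.
matching = df_uncleaned.loc[df_uncleaned.index.isin(df_cleaned.index)]
print(matching)
```

If the cleaned CSV was written without its index, the labels are regenerated as 0..n-1 on reading and no longer identify the original rows, so this filter would not be meaningful.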

How to fill df2 with data from df1 by matching column values from df1 which match df2 index and column names

I have a large dataframe df1 with many data columns, two of which are dates and colNum. I have built a second dataframe df2 which spans the date range and colNum of df1. I now want to fill df2 with a third column (any of the many other data columns) of df1 which meet the criteria of dates and colNum from df1 that match dateIndex and colNum of df2.
I've tried various incarnations of MERGE with no success.
I can loop through the combinations, but df1 is very large (270k, 2k) so it takes forever to do fill one df2 from one of df1's columns, let alone all of them.
Slow looping version
dataList = ['revt']
for i in dataList:
    goodRows = df1.index[~np.isnan(df1[i])].tolist()
    for j in goodRows:
        df2.loc[df1['dates'][j], str(df1['colNum'][j])] = df1[i][j]
Input
Desired Output
Convert the index to a column, e.g.:
df1 = df1.reset_index()  # as per your statement, the date seems to be in the index
df2 = df2.reset_index()
df2 = pd.merge(df2, df1, on=['dateIndex', 'colNum'], how='left')  # use "left" or "inner" as needed
Update:
Alternatively, you can keep the date in the index, since pd.merge also has an option to join on the index.
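Since the loop in the question fills a wide df2 (dates as the index, colNum values as string column names), a vectorized alternative is to pivot df1 rather than merge. This is a swapped-in technique, not the one suggested above, and it assumes each (dates, colNum) pair occurs at most once among the non-missing rows. The sample data is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format df1 with one data column, 'revt'.
df1 = pd.DataFrame({
    "dates": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "colNum": [1, 2, 1],
    "revt": [5.0, np.nan, 7.0],
})

# Hypothetical empty df2: dates as index, colNum values (as strings) as columns.
df2 = pd.DataFrame(
    np.nan,
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
    columns=["1", "2"],
)

# Drop missing values, then spread colNum across the columns.
wide = df1.dropna(subset=["revt"]).pivot(index="dates", columns="colNum", values="revt")

# Match df2's string column labels, then align to df2's exact shape.
wide.columns = wide.columns.astype(str)
filled = wide.reindex(index=df2.index, columns=df2.columns)
print(filled)
```

Repeating the pivot for each name in dataList would give one filled frame per data column without a row-by-row loop.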

Pandas: add column with the most recent values

I have two pandas dataframes, both indexed with datetime entries. The df1 has non-unique time indices, whereas df2 has unique ones. I would like to add a column df2.a to df1 in the following way: for every row in df1 with timestamp ts, df1.a should contain the most recent value of df2.a whose timestamp is less than ts.
For example, let's say that df2 is sampled every minute, and there are rows with timestamps 08:00:15, 08:00:47, 08:02:35 in df1. In this case I would like the value from df2.a[08:00:00] to be used for the first two rows, and df2.a[08:02:00] for the third. How can I do this?
You are describing an asof-join, which was released in pandas 0.19 as pd.merge_asof. Both dataframes must be sorted by their index:
pd.merge_asof(df1, df2, left_index=True, right_index=True, allow_exact_matches=False)
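Alternatively, apply to the rows of df1 and reindex df2.a with a forward fill (df2 must be sorted by its index; unlike the requirement above, this also accepts an exact timestamp match):
df1['df2.a'] = df1.apply(lambda x: df2.a.reindex([x.name], method='ffill').iloc[0], axis=1)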
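A minimal sketch of the asof-join on the example from the question, using pd.merge_asof on the two datetime indexes. The date, the values of a, and the extra column x are made up; allow_exact_matches=False is used because the question asks for timestamps strictly less than ts:

```python
import pandas as pd

# df2 is sampled every minute and has a unique index.
df2 = pd.DataFrame(
    {"a": [1, 2, 3]},
    index=pd.to_datetime(
        ["2016-01-01 08:00:00", "2016-01-01 08:01:00", "2016-01-01 08:02:00"]
    ),
)

# df1 holds the timestamps from the question (x is a hypothetical payload column).
df1 = pd.DataFrame(
    {"x": [10, 20, 30]},
    index=pd.to_datetime(
        ["2016-01-01 08:00:15", "2016-01-01 08:00:47", "2016-01-01 08:02:35"]
    ),
)

# merge_asof requires both sides to be sorted by the join key.
result = pd.merge_asof(
    df1.sort_index(),
    df2.sort_index(),
    left_index=True,
    right_index=True,
    direction="backward",        # take the most recent earlier row of df2
    allow_exact_matches=False,   # strictly earlier than ts
)
print(result)
```

The first two rows pick up the 08:00:00 value of a and the third picks up the 08:02:00 value, as described in the question.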
