I have two pandas dataframes, both indexed by datetime entries. df1 has non-unique time indices, whereas df2's are unique. I would like to add a column df2.a to df1 in the following way: for every row in df1 with timestamp ts, df1.a should contain the most recent value of df2.a whose timestamp is less than ts.
For example, say df2 is sampled every minute, and df1 has rows with timestamps 08:00:15, 08:00:47, 08:02:35. In this case I would like the value from df2.a[08:00:00] to be used for the first two rows, and df2.a[08:02:00] for the third. How can I do this?
You are describing an asof join, which was just released in pandas 0.19 as pd.merge_asof. Since the timestamps are the index on both sides:

pd.merge_asof(df1, df2, left_index=True, right_index=True, allow_exact_matches=False)

allow_exact_matches=False enforces the strictly-less-than condition; the default matches on <=.
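For reference, a minimal runnable sketch using the timestamps from the question (column b is invented for illustration):

import pandas as pd

df2 = pd.DataFrame({'a': [1, 2, 3]},
                   index=pd.to_datetime(['08:00:00', '08:01:00', '08:02:00']))
df1 = pd.DataFrame({'b': [10, 20, 30]},  # 'b' is a dummy data column
                   index=pd.to_datetime(['08:00:15', '08:00:47', '08:02:35']))

# both indices must be sorted for merge_asof
out = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    allow_exact_matches=False)
# out['a'] == [1, 1, 3]: 08:00:00 serves the first two rows, 08:02:00 the third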
Alternatively, reindex df2.a onto df1's index with forward-fill (works on older pandas versions; note this matches on <= rather than strictly <, and assumes df2's index is sorted):

df1['a'] = df2['a'].reindex(df1.index, method='ffill').values

The method='ffill' has to go inside the reindex call; reindexing first and forward-filling afterwards only propagates values between the reindexed rows, never from df2's earlier timestamps.
I have a DataFrame called "df" with 4 numerical columns [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and am not understanding how to do) is this: I want to check whether df1 and df2 have the same values in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with matching id values) to df1.
For example: if df1 has rows with id values (1, 6, 4, 8) and df2 has id values (12, 7, 8, 10), I want to concatenate the df2 rows with id value 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i + 30)]
    df2 = df[df['frame'].between(i - 30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
To gain more control and avoid any errors that might stem from updating df1 and df2 elsewhere, you may want to take this one-liner apart.

look_for_vals = set(df1['id'].tolist())   # id values to search for
# ... other work can happen here ...
need_ix = df2[df2['id'].isin(look_for_vals)].index   # df2 rows with matching ids
# ... and here ...
df3 = pd.concat([df1, df2.loc[need_ix, :]], axis=0)
Instead of set() you may also use df1['id'].unique()
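A quick toy run with the id values from the question (the x column is invented for illustration):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 6, 4, 8], 'x': [0.1, 0.2, 0.3, 0.4]})
df2 = pd.DataFrame({'id': [12, 7, 8, 10], 'x': [1.1, 1.2, 1.3, 1.4]})

df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
# df3 holds all four df1 rows plus the one df2 row with id == 8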
I am trying to merge two dataframes based on a date column with this code:
data_df = pd.merge(data, one_min_df, on='date', how='outer')

The first dataframe has 3784 rows and the second has 3764. Every date in the second dataframe is also in the first. I would like the dataframes to merge on the date column, with any dates that only the longer dataframe has left blank or NaN.
The code I have gives the 3764 values followed by 20 empty rows, rather than matching them correctly.
The likely cause is that the two date columns have different dtypes (e.g. strings on one side, datetimes on the other), so the keys never compare equal. Convert both to datetime first, then merge:
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])
data_df = pd.merge(data, one_min_df, on='date', how='left')
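If NaNs persist after that, it is worth double-checking that the key dtypes really agree:

print(data['date'].dtype, one_min_df['date'].dtype)   # both should be datetime64[ns]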
I have a large dataframe df1 with many data columns, two of which are dates and colNum. I have built a second dataframe df2 which spans the date range and colNum of df1. I now want to fill df2 with values from a third column (any of the many other data columns) of df1, taken from the rows of df1 whose dates and colNum match the dateIndex and colNum of df2.
I've tried various incarnations of MERGE with no success.
I can loop through the combinations, but df1 is very large (270k x 2k), so it takes forever to fill df2 from just one of df1's columns, let alone all of them.
Slow looping version

dataList = ['revt']
for i in dataList:
    # rows where column i actually has a value
    goodRows = df1.index[~np.isnan(df1[i])].tolist()
    for j in goodRows:
        # write that value into df2 at the matching (date, colNum) cell
        df2.loc[df1['dates'][j], str(df1['colNum'][j])] = df1[i][j]
Convert the index to a column, e.g.:

df1 = df1.reset_index()   # as per your statement, the date seems to be in the index
df2 = df2.reset_index()

df2 = pd.merge(df2, df1, on=['dateIndex', 'colNum'], how='left')   # keep either 'left' or 'inner' as per your convenience

Update: alternatively, you can keep the date in the index; pd.merge also has options (left_index / right_index) to join via the index.
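A hedged sketch of that index-based variant, assuming 'revt' is the df1 column being transferred (the key names are taken from the question):

lookup = df1.set_index(['dates', 'colNum'])['revt']   # values keyed by (date, colNum)
df2 = df2.join(lookup, on=['dateIndex', 'colNum'], how='left')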
I am trying to concatenate two dataframes, and in case of duplication I'd like to keep the row that has the maximum value for a column C.
I tried this command :
df = pd.concat([df1, df2]).max(level=0)
So if two rows have the same values for columns A and B, I just want to take the row with the maximum value for column C.
You can sort by column C, then drop duplicates by columns A & B:
df = pd.concat([df1, df2])\
       .sort_values('C')\
       .drop_duplicates(subset=['A', 'B'], keep='last')
Your attempt exhibits a couple of misunderstandings:
pd.DataFrame.max is used to calculate maximum values, not to filter a dataframe.
The level parameter is relevant only for MultiIndex dataframes.
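A toy run of this approach (values invented for illustration):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [5, 3]})
df2 = pd.DataFrame({'A': [1, 3], 'B': ['x', 'z'], 'C': [9, 1]})

df = pd.concat([df1, df2])\
       .sort_values('C')\
       .drop_duplicates(subset=['A', 'B'], keep='last')
# the duplicated (A=1, B='x') pair keeps the row with C == 9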
I use Python 2.7 and pandas 0.13.
I am trying to merge two dataframes:
dfa[['descriptor_ref','issue_ref']]
dfi[['descriptor', 'cfit']]
dfm = pd.merge(dfa, dfi, left_on='descriptor_ref', right_on='descriptor', how='outer')
Both descriptor_ref and descriptor are of dtype object.
The columns from the right dataframe (dfi) are present in the result (dfm), but the values are not: all columns from dfi are empty in dfm. The number of rows in dfm is equal to the number of rows in dfa.
What could cause this?
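Without seeing the data it is hard to say definitively, but a common culprit is keys that never actually match (stray whitespace, or values that only look equal). A quick check along these lines might help; the names are taken from the question:

print(len(dfi))
print(dfa['descriptor_ref'].isin(dfi['descriptor']).sum())   # how many keys actually match
# if this prints 0, no descriptor_ref ever equals a descriptor,
# which would explain the empty dfi columns in the result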