I have two dataframes like this.
df1
MainId,Time,info1,info2
100,2018-07-12 08:05:00,a,b
100,2018-07-12 08:07:00,x,y
101,2018-07-14 16:00:00,c,d
100,2018-07-14 19:30:00,d,e
104,2018-07-14 03:30:00,g,h
and
df2
Id,MainId,startTime,endTime,value
1,100,2018-07-12 08:00:00,2018-07-12 08:10:00,1001
2,150,2018-07-14 10:05:00,2018-07-14 17:05:00,1002
3,101,2018-07-12 0:05:00,2018-07-12 19:05:00,1003
4,100,2018-07-12 08:05:00,2018-07-12 08:15:00,1004
df2 is the main dataframe and df1 is the sub-dataframe. I would like to check the startTime and endTime of df2 against the Time in df1 with respect to MainId. If df1.Time falls between df2's startTime and endTime for the matching MainId, I want to include the info1 and info2 columns of df1 in df2. If there are no matching values, I would like to enter just NaN.
I want my output like this:
Id,MainId,info1,info2,value
1,100,a,b,1001
1,100,x,y,1001
2,150,nan,nan,1002
3,101,nan,nan,1003
4,100,a,b,1004
4,100,x,y,1004
Here I have two rows with the same Id and MainId (for Id 1) in the output because they have different info1 and info2, and I want to include those too.
This is what I am doing in pandas:
df2['info1'] = np.where((df2['MainId'] == df1['MainId'])& (df1['Time'].isin([df2['startTime'], df2['endTime']])),df1['info1'], np.nan)
but it is throwing an error:
ValueError: Can only compare identically-labeled Series objects
How can I fix this error? Is there a better way?
df1 and df2 have different indexes (you can check this by inspecting df1.index and df2.index). Hence, when you do df2['MainId'] == df1['MainId'], you have two Series objects that are not comparable.
Try using a left join, something like:
df3 = df2.join(df1.set_index('MainId'), on='MainId')
This should give you the dataframe you want. You can then use it to run your comparisons.
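Putting the join together with the time-range check, here is a minimal sketch of the full solution (using merge plus Series.between; the frames below are rebuilt from the sample data in the question):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'MainId': [100, 100, 101, 100, 104],
    'Time': pd.to_datetime(['2018-07-12 08:05:00', '2018-07-12 08:07:00',
                            '2018-07-14 16:00:00', '2018-07-14 19:30:00',
                            '2018-07-14 03:30:00']),
    'info1': ['a', 'x', 'c', 'd', 'g'],
    'info2': ['b', 'y', 'd', 'e', 'h'],
})
df2 = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'MainId': [100, 150, 101, 100],
    'startTime': pd.to_datetime(['2018-07-12 08:00:00', '2018-07-14 10:05:00',
                                 '2018-07-12 00:05:00', '2018-07-12 08:05:00']),
    'endTime': pd.to_datetime(['2018-07-12 08:10:00', '2018-07-14 17:05:00',
                               '2018-07-12 19:05:00', '2018-07-12 08:15:00']),
    'value': [1001, 1002, 1003, 1004],
})

# Left-join on MainId, then blank out info1/info2 where Time is outside the window
merged = df2.merge(df1, on='MainId', how='left')
in_window = merged['Time'].between(merged['startTime'], merged['endTime'])
merged.loc[~in_window, ['info1', 'info2']] = np.nan

# Keep the in-window rows, plus one NaN row per Id that has no match at all
has_match = merged.groupby('Id')['info1'].transform(lambda s: s.notna().any())
out = (merged.loc[in_window | ~has_match,
                  ['Id', 'MainId', 'info1', 'info2', 'value']]
       .drop_duplicates()
       .reset_index(drop=True))
```

This reproduces the expected output, including the two rows for Id 1 and Id 4 and the NaN rows for Ids 2 and 3.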
Related
I have two data frames:
df1 = pd.read_excel("test1.xlsx")
df2 = pd.read_excel("test2.xlsx")
I am trying to assign values of df1 to df2 where a certain condition is met (Column1 is equal to Column1 then assign values of ColY to ColX).
df1.loc[df1['Col1'] == df2['Col1'],'ColX'] = df2['ColY']
This results in an error as df2['ColY'] is the whole column. How do I assign only the rows that match?
You can use numpy.where:
import numpy as np
df1['ColX'] = np.where(df1['Col1'].eq(df2['Col1']), df2['ColY'], df1['ColX'])
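A minimal runnable sketch of this with made-up data (note that np.where compares the two Series positionally, so both frames must have the same length and row order):

```python
import numpy as np
import pandas as pd

# Hypothetical frames with aligned row order
df1 = pd.DataFrame({'Col1': [1, 2, 3, 4], 'ColX': [10, 20, 30, 40]})
df2 = pd.DataFrame({'Col1': [1, 9, 3, 9], 'ColY': [100, 200, 300, 400]})

# Where Col1 matches row-by-row, take ColY from df2; otherwise keep ColX
df1['ColX'] = np.where(df1['Col1'].eq(df2['Col1']), df2['ColY'], df1['ColX'])
# rows 0 and 2 match, so ColX becomes [100, 20, 300, 40]
```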
Since you wanted to assign from df1 to df2, your code should have been:
df2.loc[df1['Col1'] == df2['Col1'], 'ColX'] = df1['ColY']
The code you wrote won't assign the values from df1 to df2, but from df2 to df1.
Also, if you could clarify which dataframe ColX and ColY belong to, I could help more (or do both dataframes have them?).
Your code is pretty much right! Just swap df1 and df2 as above.
I have the following dataframes site_1_df and site_2_df (both are similar):
site_1_df:
And the following dataframe:
site_1_index_df = pd.DataFrame(site_1_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
site_2_index_df = pd.DataFrame(site_2_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
on=["WeekNumber", "PG"], how="inner")[["WeekNumber", "PG"]]
index_intersection:
Consequently, it is clear that site_1_df and site_2_df are multi-level-indexed dataframes. Therefore, I would like to use index_intersection to index the above dataframes. Or, if I am indexing from site_1_df, then I want a subset of the rows from that same dataframe. Technically, I should get back a dataframe with (8556 rows x 6 columns), i.e., the same number of rows as index_intersection. How can I achieve that efficiently in pandas?
I tried:
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
on=["WeekNumber", "PG"], how="inner")[["SiteNumber_x", "WeekNumber", "PG"]]
index_intersection = index_intersection.rename(columns={"SiteNumber_x": "SiteNumber"})
index_intersection = index_intersection.set_index(["SiteNumber", "WeekNumber", "PG"])
index_intersection
And I got:
However, indexing the dataframe using another dataframe such as:
site_2_df.loc[index_intersection]
# or
site_2_df.loc[index_intersection.index]
# or
site_2_df.loc[index_intersection.index.values]
will give me an error:
NotImplementedError: Indexing a MultiIndex with a DataFrame key is not implemented
Any help is much appreciated!!
So I figured out that I can find the intersection of 2 dataframes, based on their index through:
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
left_index=True, right_index=True, how="inner")
The reset_index([0]) above is used to ignore SiteNumber, since it is totally different from one dataframe to the other. Consequently, I am able to find the inner join between the two dataframes from their indexes.
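A small self-contained sketch of that approach, using toy stand-ins for site_1_df and site_2_df (the index values and the columns val1/val2 are made up for illustration):

```python
import pandas as pd

# Toy frames with a (SiteNumber, WeekNumber, PG) MultiIndex, as in the question
site_1_df = pd.DataFrame(
    {'val1': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10, 'A'), (1, 11, 'B'), (1, 12, 'C')],
        names=['SiteNumber', 'WeekNumber', 'PG']))
site_2_df = pd.DataFrame(
    {'val2': [4, 5]},
    index=pd.MultiIndex.from_tuples(
        [(2, 10, 'A'), (2, 12, 'C')],
        names=['SiteNumber', 'WeekNumber', 'PG']))

# Drop SiteNumber (level 0) into a column on both sides, then inner-join
# on the remaining (WeekNumber, PG) index levels
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]),
                             right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how='inner')
```

Only the (WeekNumber, PG) pairs present in both frames survive, and the two SiteNumber columns come back suffixed as SiteNumber_x and SiteNumber_y.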
I have an array of dataframes dfs = [df0, df1, ...]. Each of them has a date column of varying size (some dates might be in one dataframe but not another).
What I'm trying to do is this:
pd.concat(dfs).groupby("date", as_index=False).sum()
But with date no longer being a column but an index (dfs = [df.set_index("date") for df in dfs]).
I've seen you can pass df.index to groupby (.groupby(df.index)) but df.index might not include all the dates.
How can I do this?
The goal here is to call .sum() on the groupby, so I'm not tied to using groupby nor concat if there's any alternative method to do so.
If I understand correctly, maybe you want something like this:
df = pd.concat(dfs)
df.groupby(df.index).sum()
Here's a small example:
tmp1 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-03'],'value':[1,1,1]}).set_index('date')
tmp2 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-04','2019-09-05'],'value':[2,2,2,2]}).set_index('date')
df = pd.concat([tmp1,tmp2])
df.groupby(df.index).sum()
Say, I have a dataframe df where one of the columns is:
df['letters'] = pd.Series(['a','a','m','a'])
and I want to add a message column to the df, like below:
(If something like index() existed, that would be nice.)
message_col = np.where(df['letters']=='a','found','missing at index'+str(df['letters'].index()))
The result would be:
df['message_col'] = pd.Series(['found','found','missing at index 2','found'])
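There is no callable .index() on a Series, but the index labels themselves can be stringified and handed to np.where; a minimal sketch of that idea:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'letters': ['a', 'a', 'm', 'a']})

# Build the per-row message from the index labels, then pick per row
msgs = 'missing at index ' + df.index.astype(str)
df['message_col'] = np.where(df['letters'] == 'a', 'found', msgs)
# -> ['found', 'found', 'missing at index 2', 'found']
```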
I have 2 data frames, DF1 and DF2.
DF1 has columns ['IDS', 'DateTime'] and DF2 has columns ['IDS', 'Deadline'].
IDS is a str and the other two columns are datetimes.
I want to add a column to DF1 called ['Late']
My code is
DF1['Late'] = DF1.where(DF1['IDS']==DF2['IDS'] and DF1['DateTime'] > DF2['Deadline'], "Late","Not Late")
I get the following:
ValueError: Can only compare identically-labeled Series objects
So I made a new column in DF1 called ['Deadline']:
DF1['Late'] = DF1.where(DF1['IDS']==DF2['IDS'] and DF1['Deadline'] > DF2['Deadline'], "Late","Not Late")
But I get the same error.
Thanks,
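One way to avoid the error is to align the two frames on IDS first (with a merge) and only then compare; `and` only works on scalars, and comparing differently-indexed Series raises exactly this ValueError. A hedged sketch with made-up data (it assumes IDS is unique in DF2, so the merge keeps DF1's length and row order):

```python
import numpy as np
import pandas as pd

# Hypothetical data matching the description: IDS as str, the rest datetimes
DF1 = pd.DataFrame({'IDS': ['a', 'b', 'c'],
                    'DateTime': pd.to_datetime(['2021-01-05', '2021-01-02',
                                                '2021-01-09'])})
DF2 = pd.DataFrame({'IDS': ['a', 'b', 'c'],
                    'Deadline': pd.to_datetime(['2021-01-04', '2021-01-03',
                                                '2021-01-10'])})

# Align each DateTime with its Deadline by IDS, then compare row-by-row
merged = DF1.merge(DF2, on='IDS', how='left')
DF1['Late'] = np.where(merged['DateTime'] > merged['Deadline'],
                       'Late', 'Not Late')
```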