How to join 2 dataframes based on some logic - python

I have a dataframe with the biweekly data below:
date value
15-06-2012 20
30-06-2012 30
And I need to join it with another dataframe that has the data below:
date cost
2-05-2011 5
3-04-2012 80
2-06-2012 10
3-06-2012 10
4-06-2012 30
5-06-2012 20
10-06-2012 10
15-06-2012 10
18-06-2012 30
20-06-2012 20
21-06-2012 30
22-06-2012 30
29-06-2012 20
29-10-2012 30
I need to join the 2 dataframes in such a way that, from the other dataframe, I get the average cost between 01-06-2012 and 15-06-2012 to fill the 15-06-2012 cost, and similarly for the 30-06-2012 cost I get the average between 16-06-2012 and 30-06-2012, giving the results below:
date value cost
15-06-2012 20 15 which is (10+10+30+20+10+10)/6
30-06-2012 30 26 which is (30+20+30+30+20)/5

Convert your date columns to datetime, then use merge_asof:
# df.date = pd.to_datetime(df.date, dayfirst=True)
# df1.date = pd.to_datetime(df1.date, dayfirst=True)
df['keepkey'] = df.date   # keep the biweekly date so we can group on it after the merge
mergedf = pd.merge_asof(df1, df, on='date', direction='forward')
mergedf.groupby('keepkey', as_index=False).mean()
Out[373]:
keepkey cost value
0 2012-06-15 15 20
1 2012-06-30 26 30
Update:
df['keepkey'] = df.date
df['key'] = df.date.dt.strftime('%Y-%m')    # restrict matches to the same calendar month
df1['key'] = df1.date.dt.strftime('%Y-%m')
mergedf = pd.merge_asof(df1, df, on='date', by='key', direction='forward')
mergedf.groupby('keepkey', as_index=False).mean()
Out[417]:
keepkey cost key value
0 2012-06-15 15 6 20.0
1 2012-06-30 26 6 30.0
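For reference, here is a self-contained sketch of the by='key' version above, built from the question's sample frames (dd-mm-yyyy dates assumed) and selecting only the numeric columns before the mean so it also runs on newer pandas:
import pandas as pd

df = pd.DataFrame({'date': ['15-06-2012', '30-06-2012'], 'value': [20, 30]})
df1 = pd.DataFrame({'date': ['2-05-2011', '3-04-2012', '2-06-2012', '3-06-2012',
                             '4-06-2012', '5-06-2012', '10-06-2012', '15-06-2012',
                             '18-06-2012', '20-06-2012', '21-06-2012', '22-06-2012',
                             '29-06-2012', '29-10-2012'],
                    'cost': [5, 80, 10, 10, 30, 20, 10, 10, 30, 20, 30, 30, 20, 30]})

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df1['date'] = pd.to_datetime(df1['date'], dayfirst=True)

df['keepkey'] = df['date']                    # keep the period end for the final groupby
df['key'] = df['date'].dt.strftime('%Y-%m')   # restrict matches to the same month
df1['key'] = df1['date'].dt.strftime('%Y-%m')

# each daily row is matched forward to the next biweekly date within its month
merged = pd.merge_asof(df1, df, on='date', by='key', direction='forward')
print(merged.groupby('keepkey', as_index=False)[['cost', 'value']].mean())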

This would need a merge followed by a groupby:
m = df.merge(df2, on='date', how='outer')
m['date'] = pd.to_datetime(m.date, dayfirst=True)
m = m.sort_values('date')
# start a new group right after each row that already carries a biweekly value
(m.groupby(m['value'].notnull().shift().fillna(False).cumsum(), as_index=False)
  .agg({'date': 'last', 'cost': 'mean', 'value': 'last'}))
date cost value
0 2012-06-15 15.0 20.0
1 2012-06-30 26.0 30.0
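An alternative sketch, not taken from either answer: explicitly bucket each daily date into its biweekly period end (days 1-15 to the 15th, days 16+ to the month end), average the cost per bucket, then merge onto the biweekly frame. It assumes df (biweekly) and df1 (daily) already hold datetime 'date' columns as in the sketch above.
import numpy as np
import pandas as pd

# map days 1-15 to the 15th of the month and days 16+ to the month end
month_start = df1['date'].dt.to_period('M').dt.to_timestamp()
bucket = np.where(df1['date'].dt.day <= 15,
                  month_start + pd.Timedelta(days=14),   # the 15th
                  df1['date'] + pd.offsets.MonthEnd(0))  # the month end

avg_cost = (df1.assign(bucket=bucket)
               .groupby('bucket', as_index=False)['cost'].mean())
result = df.merge(avg_cost, left_on='date', right_on='bucket', how='left').drop(columns='bucket')
print(result)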

Counting the number of dates falling within a date range from a different dataframe [duplicate]

Existing dataframe:
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected Dataframe :
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1 and avg_time_2 are to be created from df_1.
count_of_dates is calculated using min_date and action_date, i.e. the number of dates from df_1 falling between min_date and action_date.
avg_time_1 and avg_time_2 are calculated over those same rows.
I'm stuck applying the date condition :-( Any leads?
If the data is small, it is possible to filter per row with a custom function:
df_1['dates'] = df_1['dates'].apply(pd.to_datetime)
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime)

def f(x):
    # rows of df_1 with the same Id whose dates fall in the [min_date, action_date] range
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())

df = df_2.join(df_2.apply(f, axis=1))
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3.0 21.666667 30.0
1 2 2022-06-02 2022-10-02 1.0 10.000000 10.0
If Id in df_2 is unique, it is possible to improve performance by merging with df_1 and aggregating with size and mean:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                 .groupby('Id').agg(**d), on='Id')
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3 21.666667 30
1 2 2022-06-02 2022-10-02 1 10.000000 10
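For completeness, a self-contained sketch of the merge-and-filter approach with the question's sample data; the dayfirst parsing of the dd/mm/yyyy dates and the output column names avg_time_1/avg_time_2 are assumptions to match the expected frame:
import pandas as pd

df_1 = pd.DataFrame({'Id': [1, 1, 1, 1, 2, 2],
                     'dates': ['02/02/2022', '04/02/2022', '03/02/2022',
                               '06/02/2022', '10/02/2022', '11/02/2022'],
                     'time(sec)_1': [15, 20, 30, 50, 10, 15],
                     'time(sec)_2': [20, 30, 40, 40, 10, 20]})
df_2 = pd.DataFrame({'Id': [1, 2],
                     'min_date': ['02/02/2022', '06/02/2022'],
                     'action_date': ['04/02/2022', '10/02/2022']})

df_1['dates'] = pd.to_datetime(df_1['dates'], dayfirst=True)
df_2[['min_date', 'action_date']] = df_2[['min_date', 'action_date']].apply(pd.to_datetime, dayfirst=True)

d = {'count_of_dates': ('Id', 'size'),
     'avg_time_1': ('time(sec)_1', 'mean'),
     'avg_time_2': ('time(sec)_2', 'mean')}
merged = df_2.merge(df_1, on='Id')
mask = merged['dates'].between(merged['min_date'], merged['action_date'])
out = df_2.join(merged[mask].groupby('Id').agg(**d), on='Id')
print(out)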

Sum two columns only if the values of one column are greater than 0

I've got the following dataframe
lst = [['01012021', '', 100], ['01012021', '', '50'], ['01022021', 140, 5],
       ['01022021', 160, 12], ['01032021', '', 20], ['01032021', 200, 25]]
df1 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA'])
I am looking for code that sums the columns AuM and NNA only if column AuM contains a value. The result is shown below:
lst = [['01012021', '', 100, ''], ['01012021', '', '50', ''], ['01022021', 140, 5, 145],
       ['01022021', 160, 12, 172], ['01032021', '', 20, '']]
df2 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA', 'Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']]                      # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce')   # convert to numeric ('' becomes NaN)
              .sum(axis=1, skipna=False)               # row sum; NaN if any value is NaN
              .fillna('')                              # fill NaN with empty string (bad practice)
              )
output:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
          .fillna(""))
print(df2)
Result:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
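A minimal end-to-end sketch combining the two ideas above (to_numeric with errors='coerce' plus a condition on AuM); the empty-string placeholders are kept only to mirror the question's format:
import pandas as pd

lst = [['01012021', '', 100], ['01012021', '', '50'], ['01022021', 140, 5],
       ['01022021', 160, 12], ['01032021', '', 20], ['01032021', 200, 25]]
df1 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA'])

# coerce '' (and the string '50') to numeric, then sum only where AuM has a value
num = df1[['AuM', 'NNA']].apply(pd.to_numeric, errors='coerce')
df1['Sum'] = (num['AuM'] + num['NNA']).where(num['AuM'].notna(), '')
print(df1)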

Pandas: Comparing each row's value with index and replacing adjacent column's value

I have a dataframe as shown below:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 45 20 200 100
2022-05-09 09:28:01 3 100 10 80 50
2022-05-09 09:28:02 4 30 30 60 10
In this dataframe, the values in column A are present as part of the column names. That is, the values 0, 3 and 4 of column A appear in the column names ans_0, ans_3 and ans_4.
My goal is, for each row, to compare the value in column A with the column names (the row's index) and, if it matches, take the value present in that particular column and put it in column B.
The output should look as shown below:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
For example, in the first row the value 0 from column A matches the column ans_0. The value there, 20, is put in column B, replacing the previous value of 45.
Is there an easier way to do this?
Thanks!
You need to use an indexing lookup; for this you first need to ensure that the names built from A match the column names (0 -> 'ans_0'):
import numpy as np

idx, cols = pd.factorize('ans_' + df['A'].astype(str))
df['B'] = (df.reindex(cols, axis=1).to_numpy()
             [np.arange(len(df)), idx])
output:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
You could reindex the ans columns with A column values; then get the values on the diagonal:
import numpy as np

df.columns = df.columns.str.split('_', expand=True)  # MultiIndex columns: ('ans', '0'), ('A', NaN), ...
df['B'] = np.diag(df['ans'].reindex(df['A'].squeeze().astype('string'), axis=1))
df.columns = [f"{i}_{j}" if j == j else i for i, j in df.columns]  # flatten the column names back
Output:
A B ans_0 ans_3 ans_4
timestamp
2022-05-09 09:28:00 0 20 20 200 100
2022-05-09 09:28:01 3 80 10 80 50
2022-05-09 09:28:02 4 10 30 60 10
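For small frames, a plain row-wise apply expresses the same lookup more directly, though it is much slower than the vectorized approaches above. A sketch with the question's data rebuilt as assumed here:
import pandas as pd

df = pd.DataFrame(
    {'A': [0, 3, 4], 'B': [45, 100, 30],
     'ans_0': [20, 10, 30], 'ans_3': [200, 80, 60], 'ans_4': [100, 50, 10]},
    index=pd.to_datetime(['2022-05-09 09:28:00', '2022-05-09 09:28:01',
                          '2022-05-09 09:28:02']).rename('timestamp'))

# look up, per row, the column named "ans_<A>" and copy its value into B
df['B'] = df.apply(lambda row: row[f"ans_{row['A']}"], axis=1)
print(df)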

Python: convert a daily column into a new dataframe with year as index and week as column

I have a data frame with the date as the index and a parameter. I want to convert the data column into a new data frame with the year as the row index, the week number as the column name, and cells showing the weekly mean value. I would then use this information to plot with seaborn: https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into weekly averaged dataframe
wdf = df.groupby(df.index.strftime('%Y-%W')).data.mean()  # a DatetimeIndex has no .dt accessor; call strftime directly
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: the column name denotes the week number, the index denotes the year, and each cell holds the mean of the samples in that week.
01 20 26 45
2019 15 NaN 35 NaN # 15 is mean of 1st week (10,20) in above df
2020 NaN NaN NaN 55
2021 NaN 75 NaN NaN
I have no idea how to proceed further to get the expected answer from the result obtained above.
You can use a pivot_table:
import numpy as np

df['year'] = pd.DatetimeIndex(df['date']).year   # assumes the date is a column; use df.index if it is the index
df['week'] = pd.DatetimeIndex(df['date']).week   # .week was removed in newer pandas; use .isocalendar().week there
final_table = pd.pivot_table(data=df, index='year', columns='week', values='data', aggfunc=np.mean)
You need to use two dimensions in the groupby, and then unstack to lay out the data as a grid:
df.groupby([df.index.year, df.index.week])['data'].mean().unstack()  # on newer pandas use df.index.isocalendar().week
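A self-contained sketch of this groupby/unstack idea using DatetimeIndex.isocalendar(), which also runs on newer pandas where .week was removed; note that ISO week numbers can bucket dates near year boundaries slightly differently than strftime('%W'):
import pandas as pd

df = pd.DataFrame(
    {'data': [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.to_datetime(['2019-01-03', '2019-01-04', '2019-05-21', '2019-05-22',
                          '2020-10-15', '2020-10-16', '2021-04-04', '2021-04-05']))

iso = df.index.isocalendar()                    # columns: year, week, day
wdf = (df.groupby([iso['year'], iso['week']])['data'].mean()
         .unstack('week'))                      # years as rows, week numbers as columns
print(wdf)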

How to remove duplicate entries but keep the first row's value for some columns and the last row's value for others?

I'm creating charts in Periscope Data and using pandas to derive our results. I'm facing difficulties when removing duplicates from the results.
This is what our data looks like in the final dataframe after calculating:
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 20 10 15
B2345 01/01/2015 15 50 20 45
B2345 02/01/2015 45 4 30 19
I want to remove the duplicate entries based on vendor_ID and date, but keep the first entry's opening and the last entry's closing,
i.e. the expected result I want:
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 70 30 45
B2345 02/01/2015 45 4 30 19
I've tried the code below to remove the duplicates, but it did not give the expected result.
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
How do I code it so that the duplicates are removed and the first and last values are kept as in the example above?
Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum specified per output column:
Note: thanks @Erfan - if you need the column minimum and maximum instead of first and last, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 01/01/2015 5 70 30 45
1 B2345 02/01/2015 45 4 30 19
Also, if you are not sure the datetimes are sorted:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
You can also create the dictionary dynamically: sum all columns except the key columns and the ones used for first and last:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
d = {'opening':'first', 'closing':'last'}
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])
final_d = {**dict.fromkeys(sum_cols,'sum'), **d}
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg(final_d)
         .reindex(df.columns, axis=1))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
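The same aggregation can also be written with named aggregation (pandas >= 0.25), which keeps the output column names explicit; a sketch with the question's frame rebuilt:
import pandas as pd

df = pd.DataFrame({'vendor_ID': ['B2345', 'B2345', 'B2345'],
                   'date': ['01/01/2015', '01/01/2015', '02/01/2015'],
                   'opening': [5, 15, 45], 'purchase': [20, 50, 4],
                   'paid': [10, 20, 30], 'closing': [15, 45, 19]})

df1 = (df.groupby(['vendor_ID', 'date'], as_index=False)
         .agg(opening=('opening', 'first'),
              purchase=('purchase', 'sum'),
              paid=('paid', 'sum'),
              closing=('closing', 'last')))
print(df1)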
