I have two dataframes, one with news and the other with stock price. Both the dataframes have a "Date" column. I want to merge them on a gap of 5 days.
Let's say my news dataframe is df1 and the price dataframe is df2.
My df1 looks like this:
News_Dates News
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30 ABC Corp. suffers a loss
2018-10-01 ABC Corp to Sell stakes
2018-12-20 We are going to comeback strong said ABC CEO
2018-12-22 Shares are down massively for ABC Corp.
My df2 looks like this:
Dates Price
2018-10-04 120
2018-12-24 131
First method of merging I do is:
pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']),
              left_on=['Dates'], right_on=['News_Dates'],
              tolerance=pd.Timedelta('5d'), direction='backward')
The resulting df is:
Dates News_Dates News Price
2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131
The second way of merging I do is:
pd.merge_asof(df_n.sort_values(by=['News_Dates']), df1_zscore.sort_values(by=['Dates']),
              left_on=['News_Dates'], right_on=['Dates'],
              tolerance=pd.Timedelta('5d'), direction='forward').dropna()
And the resulting df as:
News_Dates News Dates Price
2018-09-29 Huge blow to ABC Corp. as they lost the 2012 tax case 2018-10-04 120
2018-09-30 ABC Corp. suffers a loss 2018-10-04 120
2018-10-01 ABC Corp to Sell stakes 2018-10-04 120
2018-12-22 Shares are down massively for ABC Corp. 2018-12-24 131
Both merges produce separate dfs, but in both cases some matches are missing. In the first case, for the 4 October price, the news from 29 and 30 September should also have been merged; in the second case, for the 24 December price, the 20 December news should also have been merged.
So I'm not quite able to figure out where I am going wrong.
P.S. My objective is to merge the price df with all the news that came in the last 5 days before the price date.
You can swap the left and right dataframes, so that each news row searches for a price within the tolerance:
df = pd.merge_asof(
df1,
df2,
left_on='News_Dates',
right_on='Dates',
tolerance=pd.Timedelta('5D'),
direction='nearest'
)
df = df[['Dates', 'News_Dates', 'News', 'Price']]
print(df)
Dates News_Dates News Price
0 2018-10-04 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t... 120
1 2018-10-04 2018-09-30 ABC Corp. suffers a loss 120
2 2018-10-04 2018-10-01 ABC Corp to Sell stakes 120
3 2018-12-24 2018-12-20 We are going to comeback strong said ABC CEO 131
4 2018-12-24 2018-12-22 Shares are down massively for ABC Corp. 131
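A minimal sketch (with shortened headline text) of why the original attempts dropped rows: merge_asof attaches at most one right-hand match per left-hand row, so with the two-row price frame on the left you can never get back more than two headlines. With the news frame on the left, every headline can find its price.

```python
import pandas as pd

news = pd.DataFrame({
    'News_Dates': pd.to_datetime(['2018-09-29', '2018-09-30', '2018-10-01']),
    'News': ['tax case', 'suffers a loss', 'sell stakes'],
})
price = pd.DataFrame({'Dates': pd.to_datetime(['2018-10-04']), 'Price': [120]})

# price on the left: one row in, one row out -- only the nearest headline survives
left_price = pd.merge_asof(price, news, left_on='Dates', right_on='News_Dates',
                           tolerance=pd.Timedelta('5D'), direction='backward')

# news on the left: every headline keeps the price that follows it
left_news = pd.merge_asof(news, price, left_on='News_Dates', right_on='Dates',
                          tolerance=pd.Timedelta('5D'), direction='forward')
```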
Here is my solution using NumPy:
import numpy as np
import pandas as pd

df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'), ('2018-09-30', 'ABC Corp. suffers a loss'), ('2018-10-01', 'ABC Corp to Sell stakes'), ('2018-12-20', 'We are going to comeback strong said ABC CEO'), ('2018-12-22', 'Shares are down massively for ABC Corp.')], columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))
df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values  # double brackets keep this 2-D so the subtraction broadcasts
## subtract each pair of p_dates and n_dates to create a difference matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')
## boolean matrix: True where the difference is between 0 and 5 days,
## to be used as an index into the original arrays
comparison = (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))
## flat cell numbers (0 .. matrix size - 1) that meet the condition
ind = np.arange(len(n_dates) * len(p_dates))[comparison.ravel()]
## recover row and column indices from the cell numbers to index the dfs
pd.concat([df1_zscore.iloc[ind // len(n_dates)].reset_index(drop=True),
           df_n.iloc[ind % len(n_dates)].reset_index(drop=True)], sort=False, axis=1)
Result
Dates Price News_Dates News
0 2018-10-04 120 2018-09-29 Huge blow to ABC Corp. as they lost the 2012 t...
1 2018-10-04 120 2018-09-30 ABC Corp. suffers a loss
2 2018-10-04 120 2018-10-01 ABC Corp to Sell stakes
3 2018-12-24 131 2018-12-20 We are going to comeback strong said ABC CEO
4 2018-12-24 131 2018-12-22 Shares are down massively for ABC Corp.
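For completeness, a pandas-only sketch of the same many-to-many window join (assuming pandas >= 1.2 for how='cross'; headline text shortened): build the full cross product, then keep the pairs whose gap is between 0 and 5 days.

```python
import pandas as pd

df_n = pd.DataFrame({
    'News_Dates': pd.to_datetime(['2018-09-29', '2018-09-30', '2018-10-01',
                                  '2018-12-20', '2018-12-22']),
    'News': ['tax case', 'suffers a loss', 'sell stakes',
             'comeback', 'shares down'],
})
df1_zscore = pd.DataFrame({
    'Dates': pd.to_datetime(['2018-10-04', '2018-12-24']),
    'Price': [120, 131],
})

# every (price, news) pair
merged = df1_zscore.merge(df_n, how='cross')

# keep news published between 0 and 5 days before the price date
gap = merged['Dates'] - merged['News_Dates']
result = merged[(gap >= pd.Timedelta('0D')) & (gap <= pd.Timedelta('5D'))]
```

Like the NumPy broadcast above, this materializes len(df1_zscore) * len(df_n) rows, so it is fine for modest frames but not for millions of rows.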
I have a data frame like the one shown below:
customer organization currency        volume revenue Duration
Peter    XYZ Ltd      CNY, INR       20     3,000   01-Oct-2022
John     abc Ltd      INR            7      184     01-Oct-2022
Mary     aaa Ltd      USD            3      43      03-Oct-2022
John     bbb Ltd      THB            17     2,300   04-Oct-2022
Dany     ccc Ltd      CNY, INR, KRW  45     15,100  04-Oct-2022
If I pivot as shown below
df = pd.pivot_table(df, values=['runs', 'volume','revenue'],
index=['customer', 'organization', 'currency'],
columns=['Duration'],
aggfunc=sum,
fill_value=0
)
With this, level 0 of the column MultiIndex becomes volume / revenue / runs, each repeated for every Duration at level 1.
I would instead like Duration as level 0, with volume, revenue and runs under it as level 1. How do I achieve this?
You can use swaplevel on the columns of your current pivot; try this:
df1 = df.pivot_table(index=['customer', 'organization', 'currency'],
columns=['Duration'],
aggfunc=sum,
fill_value=0).swaplevel(0,1, axis=1).sort_index(axis=1)
Hope this helps.
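A self-contained toy sketch of the effect, using made-up data shaped like the question's table (without a runs column):

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['Peter', 'John', 'Mary'],
    'organization': ['XYZ Ltd', 'abc Ltd', 'aaa Ltd'],
    'currency': ['CNY, INR', 'INR', 'USD'],
    'volume': [20, 7, 3],
    'revenue': [3000, 184, 43],
    'Duration': ['01-Oct-2022', '01-Oct-2022', '03-Oct-2022'],
})

pivot = df.pivot_table(index=['customer', 'organization', 'currency'],
                       columns=['Duration'],
                       aggfunc='sum', fill_value=0)
# before: column level 0 = revenue/volume, level 1 = Duration
# after:  column level 0 = Duration, level 1 = revenue/volume
swapped = pivot.swaplevel(0, 1, axis=1).sort_index(axis=1)
```

The sort_index(axis=1) call is what regroups the columns so each Duration's sub-columns sit together after the swap.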
I have a pandas data frame like this:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
710830 282.81 Invoice 282.81 asd5a Abc 1/1/2018 2018-1 Postal
445800 4868.71 Invoice 3926.4 adc6ac Def 1/1/2018 2018-1 R&D
710350 282.81 Invoice 282.81 fgn6 Ghi 2/9/2018 2018-2 Other
710510 282.81 Invoice 282.81 dg jkl 2/9/2018 2018-2 Electricity
710630 841.59 Invoice 707.07 dfvbfbf mno 3/2/2018 2018-3 Repairs
710610 841.59 Invoice 707.07 rrcv pqr 3/2/2018 2018-3 Leasing
710810 12.14 Invoice 10.12 btbfd stu 1/1/2019 2019-1 Telephone
704300 81517.6 Invoice 65740 dfbtt vwx 1/1/2019 2019-1 Statutory
710510 2105.64 Invoice 1776.53 dfdftb5 dfb 2/9/2019 2019-2 Electricity
710510 2105.64 Invoice 1776.53 ebdfb5b bcd 2/9/2019 2019-2 Electricity
710920 66.96 Invoice 54 dfrrt65 efg 3/2/2019 2019-3 Data
700330 239.47 Invoice 239.47 aae3a11 hij 3/2/2019 2019-3 Coffee
What I want is to add rows at the bottom of the data frame that hold, for each month, the average of that month over the last three years.
For example:
For YearMonth 2020-1 the calculation should be: sum(Net Sum Company) in 2019-1 + sum(Net Sum Company) in 2018-1 + sum(Net Sum Company) in 2017-1, divided by the number of years considered, i.e. 3, so only the last three years are considered. That gives the average, which I append as a new row at the bottom containing nothing but the YearMonth and the average of the Net Sum Company column.
The end goal is to get a data frame like this:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
710830 282.81 Invoice 282.81 asd5a Abc 1/1/2018 2018-1 Postal
445800 4868.71 Invoice 3926.4 adc6ac Def 1/1/2018 2018-1 R&D
710350 282.81 Invoice 282.81 fgn6 Ghi 2/9/2018 2018-2 Other
710510 282.81 Invoice 282.81 dg jkl 2/9/2018 2018-2 Electricity
710630 841.59 Invoice 707.07 dfvbfbf mno 3/2/2018 2018-3 Repairs
710610 841.59 Invoice 707.07 rrcv pqr 3/2/2018 2018-3 Leasing
710810 12.14 Invoice 10.12 btbfd stu 1/1/2019 2019-1 Telephone
704300 81517.6 Invoice 65740 dfbtt vwx 1/1/2019 2019-1 Statutory
710510 2105.64 Invoice 1776.53 dfdftb5 dfb 2/9/2019 2019-2 Electricity
710510 2105.64 Invoice 1776.53 ebdfb5b bcd 2/9/2019 2019-2 Electricity
710920 66.96 Invoice 54 dfrrt65 efg 3/2/2019 2019-3 Data
700330 239.47 Invoice 239.47 aae3a11 hij 3/2/2019 2019-3 Coffee
- - - 34979.66 - - - 2020-1 -
- - - 2059.34 - - - 2020-2 -
- - - 853.805 - - - 2020-3 -
I am new to pandas so any guidance is appreciated. This has to be strictly done using pandas only.
IIUC, you want to:
find the next year per month in the dataframe
sum per month the Net Sum Company column over the 3 previous years
divide each sum by the number of years present (2 in the sample) to get a monthly average
add those averages to the dataframe with the new year and the month in the YearMonth column
Code could be:
# extract Year and Month Series from the dataframe
year = df['YearMonth'].str.slice(stop=4).astype(int)
month = df['YearMonth'].str.slice(start=5)
# compute the new year per month as max(year) + 1
newyear_month = year.groupby(month).max() + 1
# build a Series aligned with the dataframe from that new year
newyear = pd.DataFrame(month).merge(
pd.DataFrame(newyear_month),
left_on='YearMonth', right_index=True, suffixes=('_x', '')
)['YearMonth'].sort_index()
# compute the sum of relevant years per month
tmp = df.loc[(newyear - 3 <= year) & (year <= newyear - 1),
             'Net Sum Company'].groupby(month).sum()
# divide each sum by the number of distinct years for that month
tmp /= df.groupby(month)['YearMonth'].nunique()
# compute a YearMonth column for that new dataframe
tmp = pd.concat([newyear_month.astype(str), tmp], axis=1)
tmp['YearMonth'] = tmp['YearMonth'] + '-' + tmp.index # tmp is indexed by month
# force the type of Account Id to object to allow it to contain null values
df['Account Id'] = df['Account Id'].astype(object)
# concat the new rows to the dataframe and reset the index
new_df = pd.concat([df, tmp], sort=False).reset_index(drop=True)  # df.append was removed in pandas 2.0
With your sample, new_df gives:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
0 710830 282.81 Invoice 282.810 asd5a Abc 1/1/2018 2018-1 Postal
1 445800 4868.71 Invoice 3926.400 adc6ac Def 1/1/2018 2018-1 R&D
2 710350 282.81 Invoice 282.810 fgn6 Ghi 2/9/2018 2018-2 Other
3 710510 282.81 Invoice 282.810 dg jkl 2/9/2018 2018-2 Electricity
4 710630 841.59 Invoice 707.070 dfvbfbf mno 3/2/2018 2018-3 Repairs
5 710610 841.59 Invoice 707.070 rrcv pqr 3/2/2018 2018-3 Leasing
6 710810 12.14 Invoice 10.120 btbfd stu 1/1/2019 2019-1 Telephone
7 704300 81517.60 Invoice 65740.000 dfbtt vwx 1/1/2019 2019-1 Statutory
8 710510 2105.64 Invoice 1776.530 dfdftb5 dfb 2/9/2019 2019-2 Electricity
9 710510 2105.64 Invoice 1776.530 ebdfb5b bcd 2/9/2019 2019-2 Electricity
10 710920 66.96 Invoice 54.000 dfrrt65 efg 3/2/2019 2019-3 Data
11 700330 239.47 Invoice 239.470 aae3a11 hij 3/2/2019 2019-3 Coffee
12 NaN NaN NaN 34979.665 NaN NaN NaN 2020-1 NaN
13 NaN NaN NaN 2059.340 NaN NaN NaN 2020-2 NaN
14 NaN NaN NaN 853.805 NaN NaN NaN 2020-3 NaN
Remarks:
finding the new year per month lets the code work on a rolling year (from July 2017 to June 2019, for example)
you can replace NaN with empty strings (or whatever) with new_df = new_df.fillna('')
For a simple 3y rolling average, do something like this:
df1['Date Completed'] = pd.to_datetime(df1['Date Completed'])
df1['roll_3y_avg'] = df1.rolling(window='1096D', on='Date Completed', closed='right')['Net Sum Company'].mean()
I have this DataFrame:
year vehicule number_of_passengers
2017-01-09 bus 100
2017-11-02 car 150
2018-08-01 car 180
2016-08-09 bus 100
...
I would like to have something like this (the average number of passengers per year and per vehicule) :
year vehicule avg_number_of_passengers
2018 car 123.5
2018 bus 213.7
2017 ... ...
...
I've tried some groupby() calls but can't find the right command. Can you help me?
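A sketch of one way to do it, assuming the year column holds full dates as in the sample: group by the calendar year of the date and by vehicule, then take the mean.

```python
import pandas as pd

df = pd.DataFrame({
    'year': pd.to_datetime(['2017-01-09', '2017-11-02', '2018-08-01', '2016-08-09']),
    'vehicule': ['bus', 'car', 'car', 'bus'],
    'number_of_passengers': [100, 150, 180, 100],
})

# group by the calendar year of the date and by vehicule, then average
avg = (df.groupby([df['year'].dt.year, 'vehicule'])['number_of_passengers']
         .mean()
         .reset_index(name='avg_number_of_passengers'))
```

Passing df['year'].dt.year as a grouping key avoids having to create a separate year column first.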
I need to combine 2 pandas dataframes where df1.date falls within the 2 months before df2.date. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried the approach listed below, but found it far too complicated. I believe there is a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
30/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
30/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','30/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
# bound the window on both sides: the team_1 trade must not postdate the team_2 trade
df_2m = df_[(df_['date_x'] >= df_['date_y']) &
            (df_['date_x'] < df_['date_y'] + MonthEnd(2))] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
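An alternative sketch that bounds the window on both sides and aggregates per (symbol, date) pair, approximating "2 months" as 62 days (the _t1 suffix and column names team_2_traders / team_2_shares_bought follow the question; the rest is my own naming):

```python
import pandas as pd

df1 = pd.DataFrame({'symbol': ['FDX', 'GOOGL', 'ORCL', 'ORCL'],
                    'date': pd.to_datetime(['31/12/2013', '30/06/2016',
                                            '21/07/2015', '18/07/2015'], dayfirst=True),
                    'shares': [154, 2367, 293, 304],
                    'trader': ['Max', 'Max', 'Max', 'Sam']})
df2 = pd.DataFrame({'symbol': ['ORCL', 'FB', 'ACER', 'HP', 'ABBV'],
                    'date': pd.to_datetime(['23/08/2015', '04/07/2014', '06/12/2013',
                                            '30/11/2012', '05/06/2010'], dayfirst=True),
                    'shares': [345, 567, 221, 889, 445],
                    'trader': ['John', 'John', 'Sally', 'John', 'Kate']})

# pair every team_2 trade with every team_1 trade in the same stock
m = df2.merge(df1, on='symbol', suffixes=('', '_t1'))

# keep team_1 trades dated in the ~2-month window before the team_2 date
in_window = (m['date'] - m['date_t1']).between(pd.Timedelta('0D'),
                                               pd.Timedelta('62D'))

agg = (m[in_window].groupby(['symbol', 'date'])
       .agg(team_2_traders=('trader_t1', 'nunique'),
            team_2_shares_bought=('shares_t1', 'sum'))
       .reset_index())

out = df2.merge(agg, on=['symbol', 'date'], how='left')
out[['team_2_traders', 'team_2_shares_bought']] = (
    out[['team_2_traders', 'team_2_shares_bought']].fillna(0).astype(int))
```

Note the same caveat as the accepted approach: the intermediate merge on symbol is a cross product per stock, so with millions of rows you would want to chunk it by symbol.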