I have a pandas dataframe that looks like this:
SCORE ZIP CODE DATE
0 95.2 90210 2016-01-01
1 98.36 70024 2019-03-02
2 78.2 34567 2017-09-01
3 99.25 00345 2018-05-02
4 ..... ..... .....
For each ZIP CODE, I need to calculate the mean, mode, and median of SCORE for each year in the DATE column.
How can I do that?
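A minimal sketch of one way to do this (assuming DATE can be parsed as a datetime, and taking the first value when a group has several modes):
import pandas as pd

df = pd.DataFrame({
    "SCORE": [95.2, 98.36, 78.2, 99.25],
    "ZIP CODE": ["90210", "70024", "34567", "00345"],
    "DATE": ["2016-01-01", "2019-03-02", "2017-09-01", "2018-05-02"],
})
df["DATE"] = pd.to_datetime(df["DATE"])

stats = (
    df.groupby(["ZIP CODE", df["DATE"].dt.year.rename("YEAR")])["SCORE"]
      .agg(mean="mean",
           median="median",
           mode=lambda s: s.mode().iloc[0])   # first mode if there are ties
      .reset_index()
)
print(stats)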
I have 2 tables.
Table A has 105 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000J9HHN8 2018-12-31 13562.328 0.000000
1 BBG000J9HHN8 2019-01-07 34717.536 1.559851
2 BBG000J9HHN8 2019-01-14 28300.218 -0.184844
3 BBG000J9HHN8 2019-01-21 35370.134 0.249818
4 BBG000J9HHN8 2019-01-28 36104.512 0.020763
... ... ... ... ...
100 BBG000J9HHN8 2020-11-30 62065.827 0.278765
101 BBG000J9HHN8 2020-12-07 62145.445 0.001283
102 BBG000J9HHN8 2020-12-14 63516.146 0.022056
103 BBG000J9HHN8 2020-12-21 51283.187 -0.192596
104 BBG000J9HHN8 2020-12-28 51306.951 0.000463
Table B has 257970 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000B9WJ55 2018-12-31 34.612737 0.000000
1 BBG000B9WJ55 2019-01-07 70.618471 1.040245
2 BBG000B9WJ55 2019-01-14 89.123337 0.262040
3 BBG000B9WJ55 2019-01-21 90.377643 0.014074
4 BBG000B9WJ55 2019-01-28 90.527678 0.001660
... ... ... ... ...
257965 BBG00YFR2NJ6 2020-12-21 30.825000 -0.251275
257966 BBG00YFR2NJ6 2020-12-28 40.960000 0.328792
257967 BBG00YM46B38 2020-12-14 0.155900 -0.996194
257968 BBG00YM46B38 2020-12-21 0.372860 1.391661
257969 BBG00YM46B38 2020-12-28 0.535650 0.436598
Table A contains only one group of stocks (CCPM), but table B has many different stock groups. I want to run a linear regression of table B's pct_change vs table A's (CCPM) pct_change so I can see how the stocks in table B move with respect to the CCPM stocks over the period covered by the dt column. The problem is that I only have 105 rows in table A, and when I group table B by bbgid I always get a different number of rows, so I get an error saying X and y must be the same size.
Both tables have already been grouped by week and their pct_change has been calculated weekly. I need to compare the variations in pct_change from table B with those in table A, matched by date, one table B group at a time against the CCPM stocks' pct_change.
I would like to extract the slope from each regression, store the slopes in a column inside the same table, and associate each slope with its corresponding group.
I have tried the solutions in this post and this post without success.
Is there any workaround for this, or am I doing something wrong? Please help me fix this.
Thank you very much in advance.
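A hedged sketch of one way around the size mismatch: align every table B group with the CCPM series by date first, then regress group by group. The variable names table_a and table_b are assumptions, and scipy's linregress stands in for whatever regression routine you are using:
import pandas as pd
from scipy.stats import linregress

# Rename the CCPM series so it can sit next to each table B group after the merge.
ccpm = table_a[["dt", "weekly_pct_change"]].rename(
    columns={"weekly_pct_change": "ccpm_pct_change"})

# Inner-joining on dt keeps only the weeks both series share,
# so X and y have the same length inside every group.
merged = table_b.merge(ccpm, on="dt", how="inner")

slopes = (
    merged.groupby("bbgid")
          .apply(lambda g: linregress(g["ccpm_pct_change"],
                                      g["weekly_pct_change"]).slope)
          .rename("slope")
          .reset_index()
)

# Attach each slope back to its group in table B.
table_b = table_b.merge(slopes, on="bbgid", how="left")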
I'm trying to calculate the stock returns for my portfolio which requires "geometrically averaging" the percentages by year.
For simplicity, I have a dataframe that looks like this:
Date Returns
2013-06-01 1%
2013-07-01 5%
2013-08-01 -4%
2014-01-01 12%
2014-02-01 -9%
I'd like the output to show:
Date Geo Return
2013 1.8%
2014 1.9%
which is derived from (1 + .01)(1 + .05)(1 - .04) = 1.01808, i.e. roughly 1.8% for 2013, and (1 + .12)(1 - .09) = 1.0192, i.e. roughly 1.9% for 2014.
I am able to use the groupby function by year, but it only sums for me and I can't get the geometric average to work. Could someone please help?
Thanks!
Note that you have requested the cumulative product, which is different from the usual definition of the geometric mean.
df["returns"] = 1 + .01*df.Returns.str.split("%").str[0].astype(int)
df["geom_ave"] = df.groupby(df.Date.dt.year).returns.transform("prod")
output:
Date Returns returns geom_ave
0 2013-06-01 1% 1.01 1.01808
1 2013-07-01 5% 1.05 1.01808
2 2013-08-01 -4% 0.96 1.01808
3 2014-01-01 12% 1.12 1.01920
4 2014-02-01 -9% 0.91 1.01920
If instead you want the geometric mean, you can try:
from scipy import stats
series = df.groupby(df.Date.dt.year).returns.apply(stats.gmean)
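And if the goal is the yearly figure shown in the question (an assumption about the desired output format), the per-year product can be converted back into a percentage:
geo_return = (df.groupby(df.Date.dt.year).returns.prod() - 1) * 100
# Date
# 2013    ~1.808
# 2014    ~1.920
# Name: returns, dtype: float64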
I have a dataset of 1,281,695 rows and 4 columns containing six years of monthly data from 2013 to 2019, so repeated dates are expected. I want to arrange the data by date in ascending order, like Jan 2013, Feb 2013, ... Dec 2013, Jan 2014, ... Dec 2019. I want ascending order across the whole dataset, but it comes out ascending for some of the data and in random order for the rest.
I tried sort_values from the pandas library, something like this:
data = df.sort_values(['SKU', 'Region', 'FMonth'], axis=0, ascending=[False, True, True]).reset_index()
where SKU, Region, FMonth are my independent variables. FMonth is the date variable.
The code sorts the beginning of the data but not the end. For example, when I run:
data.head()
result:
index SKU Region FMonth sh
0 8264 855019.133127 3975.495636 2013-01-01 67640.0
1 20022 855019.133127 3975.495636 2013-02-01 73320.0
2 31972 855019.133127 3975.495636 2013-03-01 86320.0
3 43897 855019.133127 3975.495636 2013-04-01 98040.0
4 55642 855019.133127 3975.495636 2013-05-01 73240.0
And,
data.tail()
result:
index SKU Region FMonth sh
1281690 766746 0.000087 7187.170501 2017-03-01 0.0
1281691 881816 0.000087 7187.170501 2017-09-01 0.0
1281692 980113 0.000087 7187.170501 2018-02-01 0.0
1281693 1020502 0.000087 7187.170501 2018-04-01 0.0
1281694 1249130 0.000087 7187.170501 2019-03-01 0.0
where 'sh' is my dependent variable.
The data is not very pretty, but please focus on the FMonth (date) column only.
As you can see, the last rows are not in ascending order, while the first rows are in the specified order. If I flip the ascending flag for the FMonth column in the code above (i.e. descending order), the first rows come out in that order but the last rows still do not.
What am I doing wrong? What should I do to get ascending order across the whole dataset? What is happening, and why?
Do you just need to prioritize Month?
z = pd.read_clipboard()
z.columns = [i.strip() for i in z.columns]
z.sort_values(['FMonth', 'Region', 'SKU'], axis=0, ascending=[True, True, True])
Out[23]:
index SKU Region FMonth sh
1 20022 8 52 1/1/2013 73320
0 8264 1 67 1/1/2013 67640
3 43897 5 34 3/1/2013 98040
2 31972 3 99 3/1/2013 86320
4 55642 4 98 5/1/2013 73240
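Applied to the original code, that would look something like this (a sketch assuming FMonth is, or can be converted to, a datetime; SKU is kept descending as in the original attempt):
import pandas as pd

# Parse FMonth into real datetimes, then sort by FMonth first so the whole
# dataset is chronological rather than only chronological within each SKU/Region.
df["FMonth"] = pd.to_datetime(df["FMonth"])
data = (df.sort_values(["FMonth", "Region", "SKU"],
                       ascending=[True, True, False])
          .reset_index(drop=True))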
I have 2 datasets in which there are multiple repeated rows for each date, due to the different recording times of each health attribute pertaining to that date.
I want to shrink my dataset by aggregating the values of each column that pertain to the same day. I don't want to create a new data frame, because I then need to merge these datasets with other datasets. After trying the code below, my df's shape still has the same number of rows. Any help would be appreciated.
sample data:
count calorie update_time speed distance date
101 4.290000 2018-04-30 18:35:00.291 1.527778 78.420000 2018-04-30
25 0.960000 2018-04-13 19:55:00.251 1.027778 14.360000 2018-04-13
38 1.530000 2018-04-02 10:14:58.210 1.194444 24.190000 2018-04-02
35 1.450000 2018-04-27 10:55:01.281 1.500000 27.450000 2018-04-27
0 0.000000 2018-04-21 13:46:36.801 0.000000 0.000000 2018-04-21
34 1.820000 2018-04-01 08:35:05.481 2.222222 30.260000 2018-04-01
df_SC['date']=df_SC.groupby('date').agg({"distance": "sum","calorie":"sum",
"count":"sum","speed":"mean"}).reset_index()
I expect the sum of distance, calorie, and count, and the mean of speed, to show up under their respective columns, with one row per date.
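A minimal sketch of one way to get that. The key change is assigning the aggregated frame back to df_SC itself rather than to a single 'date' column, which is why the shape never changed:
df_SC = (df_SC.groupby("date", as_index=False)
              .agg({"distance": "sum",
                    "calorie": "sum",
                    "count": "sum",
                    "speed": "mean"}))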
Below are the data frames I have: esh -> earnings surprise history and sph -> stock price history.
earnings surprise history
ticker reported_date reported_time_code eps_actual
0 ABC 2017-10-05 AMC 1.01
1 ABC 2017-07-04 BMO 0.91
2 ABC 2017-03-03 BMO 1.08
3 ABC 2016-10-02 AMC 0.5
stock price history
ticker date adj_open ad_close
0 ABC 2017-10-06 12.10 13.11
1 ABC 2017-12-05 11.11 11.87
2 ABC 2017-12-04 12.08 11.40
3 ABC 2017-12-03 12.01 13.03
..
101 ABC 2017-07-04 9.01 9.59
102 ABC 2017-07-03 7.89 8.19
I'd like to build a new dataframe by merging the two datasets; it should have the columns shown below. If the reported_time_code from the earnings surprise history is AMC, the record taken from the stock price history should be the next day; if the reported_time_code is BMO, the record taken from the stock price history should be the same day. If I use a straight merge on the reported_date column of esh and the date column of sph, it breaks these conditions. I'm looking for an efficient way of transforming the data.
Here is the resultant transformed data set
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
101 ABC 2017-07-04 9.01 9.59 0.91
Let's add a new column, 'date', to the earnings history dataframe based on reported_time_code using np.where, drop the unwanted columns, and then merge with the stock price history dataframe:
import numpy as np
import pandas as pd

eh['reported_date'] = pd.to_datetime(eh.reported_date)
sph['date'] = pd.to_datetime(sph.date)
eh_new = eh.assign(date=np.where(eh.reported_time_code == 'AMC',
                                 eh.reported_date + pd.DateOffset(days=1),
                                 eh.reported_date)).drop(['reported_date', 'reported_time_code'], axis=1)
sph.merge(eh_new, on=['ticker', 'date'])
Output:
ticker date adj_open ad_close eps_actual
0 ABC 2017-10-06 12.10 13.11 1.01
1 ABC 2017-07-04 9.01 9.59 0.91
It is great that your offset is only one day. Then you can do something like the following:
import pandas as pd

esh['reported_date'] = pd.to_datetime(esh['reported_date'])  # needed so the timedelta offset can be added
mask = esh['reported_time_code'] == 'AMC'
# The mask is basically an array of 0's and 1's; all we have to do is convert
# it into timedelta objects standing for the number of days to offset.
offset = mask.values.astype('timedelta64[D]')
# The D inside the brackets stands for the unit of time to attach to the
# number. In this case, we want [D]ays.
esh['date'] = esh['reported_date'] + offset
# Keep the merge result in a variable; calling drop(..., inplace=True) on the
# temporary merge result would discard it.
result = (esh.merge(sph, on=['ticker', 'date'])
             .drop(['reported_date', 'reported_time_code'], axis=1))