I am trying to optimise a function so it can work on a much larger dataframe.
I have a dataframe (called test_data) that looks like this
To create a toy example I have filtered this dataframe like so:
value_list = ["DDD","MMM","AAPL","MSFT","AMZN","TSLA"]
test_data2 = test_data[test_data['Asset'].isin(value_list)]
I have written a basic function to generate the required output:
def generate_stock_price_dataframe():
    price_dataframe = pd.DataFrame()
    for stock in test_data2['Asset'].unique():
        data = pd.DataFrame(index=test_data2.index.unique())
        data[stock] = pd.DataFrame(test_data2.query("Asset==@stock")['Price'])
        price_dataframe = pd.concat([price_dataframe, data], axis=1)
    stock_price_data = price_dataframe
    return stock_price_data
and this gives the required output.
This works nicely for the toy example with only a few assets.
However, when I run this on the full dataframe with thousands of assets, it just doesn't work.
Where's the best place to start to speed this up?
Thank you
EDIT: Here is some code to recreate the question.
import pandas as pd
assets = ['AAPL','AAPL','AAPL','AAPL','AAPL','MSFT','MSFT','MSFT','MSFT','MSFT','AMZN','AMZN','AMZN','AMZN','AMZN']
dates = ['05/01/2021','05/02/2021','05/03/2021','05/04/2021','05/05/2021','05/01/2021','05/02/2021','05/03/2021','05/04/2021','05/05/2021','05/01/2021','05/02/2021','05/03/2021','05/04/2021','05/05/2021']
prices = range(1, 16)
test_data2 = pd.DataFrame(index=dates)
test_data2['Asset'] = assets
test_data2['Price'] = prices
df = generate_stock_price_dataframe()
df.tail()
df = test_data2.pivot(columns='Asset')
Output
Price
Asset AAPL AMZN MSFT
05/01/2021 1 11 6
05/02/2021 2 12 7
05/03/2021 3 13 8
05/04/2021 4 14 9
05/05/2021 5 15 10
If we want to drop Price from the MultiIndex columns and the Asset name on the columns axis:
df = test_data2.pivot(columns='Asset').droplevel(0, 1).rename_axis(None, axis='columns')
df
Output
AAPL AMZN MSFT
05/01/2021 1 11 6
05/02/2021 2 12 7
05/03/2021 3 13 8
05/04/2021 4 14 9
05/05/2021 5 15 10
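One caveat worth adding (not part of the original answer): .pivot() raises a ValueError if the full dataframe contains duplicate (date, Asset) pairs, which can easily happen at the scale of thousands of assets. A hedged sketch of a pivot_table alternative that aggregates the duplicates instead; the mean aggregation here is just an example choice:
df = (test_data2.pivot_table(index=test_data2.index,
                             columns='Asset',
                             values='Price',
                             aggfunc='mean')  # aggregates any duplicate date/Asset rows
                .rename_axis(None, axis='columns'))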
df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The output I need is (Graph No.1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want a sum of Volume for each year (Year_Sum_Volume is the total volume of that year).
This is the code I tried for calculating the sum of volume in each year, but how can I add this data
back to the daily data? I want to add Year_Sum_Volume to df, like in Graph No.1:
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
# select only the numeric column before summing so the datetime column is not aggregated
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a": "a_sum"})
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
                  date    a  year  a_sum
0  2021-12-30 00:00:00  1.0  2021    3.1
1  2021-12-31 00:00:00  2.1  2021    3.1
2  2022-01-01 00:00:00  3.2  2022    3.2
Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column built by repeating that value for every row:
df.join(pd.DataFrame( {'Year_Sum_Volume': [your_sum_val] * len(df['Volume'])} ))
Try the code below (after converting the date column with pd.to_datetime):
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))
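For reference, a minimal runnable sketch of that one-liner, reusing the small date/a frame from the answer above (the column names date and a are just carried over from that toy example):
import pandas as pd
df = pd.DataFrame({"date": ['2021-12-30', '2021-12-31', '2022-01-01'], "a": [1, 2.1, 3.2]})
df["date"] = pd.to_datetime(df["date"])
# transform('sum') broadcasts each year's total back onto that year's rows
df = df.assign(Year_Sum_Volume=df.groupby(df['date'].dt.year)['a'].transform('sum'))
print(df)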
I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df['Date'] = pd.to_datetime(df['Date'])  # needed so the one-day timedelta shift works
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date', 'Value']]
     )
df['PercentDifference'] = [f'{x:.2%}' for x in
                           (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member has helped me with the code above; I am also trying to incorporate the value difference as shown in my desired output.
Note - Is there a way to incorporate a 'period' - say, checking the percent difference and value difference over a 7-day period, a 30-day period, and so on?
Any suggestion is appreciated.
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you can use df.assign:
df.assign(
percentageDiff=df["Value"].pct_change().mul(100),
ValueDiff=df["Value"].diff()
)
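On the 'period' part of the question: both methods accept a periods argument, so a hedged sketch for a 7-row lookback (assuming one row per calendar day; otherwise resample to daily first) would be:
df['PercentDiff_7'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff_7'] = df['Value'].diff(periods=7)
The column names PercentDiff_7 and ValueDiff_7 are just illustrative; the same pattern works for a 30-row window.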
df2 = df_cleaned.groupby('company').size()
df2.columns = ['company', 'frequency']
#df2.sort_values('frequency') # error : No axis named frequency for object type <class 'type'>
df2
I have a dataframe "df_cleaned" with a 'company' column, and I'm trying to create a new dataframe "df2" with an extra 'frequency' column that shows how many times each company has been mentioned. I am unable to create the new frequency column. It seems like I'm doing something wrong; please help me out.
Screenshot showing no frequency column
You didn't provide the data for us, so let's generate it:
import numpy as np
import pandas as pd
source = ['3Com', '3M', 'A-T-O', 'A.H. Robins']
cmp = [source[i] for i in np.random.randint(4, size = 20)]
df = pd.DataFrame(cmp, columns = ['company'])
Out[1]:
company
0 A.H. Robins
1 3M
2 A.H. Robins
3 A.H. Robins
4 3M
5 3M
6 3Com
7 A-T-O
8 3Com
9 A-T-O
10 3M
11 3M
12 A-T-O
13 3M
14 3M
15 A.H. Robins
16 A-T-O
17 A-T-O
18 A-T-O
19 3Com
df.groupby('company')[['company']].count().rename(columns = {'company':'frequency'})
Out[2]:
frequency
company
3Com 3
3M 7
A-T-O 6
A.H. Robins 4
Use:
df2 = df_cleaned.groupby('company').size().to_frame('frequency')
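An equivalent one-liner, if you prefer value_counts() over groupby (just an alternative, not required):
df2 = df_cleaned['company'].value_counts().rename_axis('company').reset_index(name='frequency')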
I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the NEXT year.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd prefer something more elegant and better suited to time-series data, so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
print()
print(name)
print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on
the 31st of July.
Then you can do whatever you want with your groups.
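For example, a quick per-season aggregate (just one possibility):
season_totals = gr[['SCORE_X', 'SCORE_Y']].sum()
print(season_totals)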
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.
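If you also want the season(2016-2017) style labels from the question, here is a small sketch built on the same resample; the SAMPLE column and the dates are just the toy data from above:
import pandas as pd
idx = pd.to_datetime(['2009-01-30', '2009-07-10', '2009-11-20',
                      '2010-01-01', '2010-05-13', '2010-08-01'])
df = pd.DataFrame({'SAMPLE': [1, 4, 3, 5, 1, 1]}, index=idx.rename('DATE'))
totals = df.resample('A-JUL').sum()
# each bin ends on 31 July of year Y, i.e. the season ran from August (Y-1) to July Y
totals.index = [f'season({ts.year - 1}-{ts.year})' for ts in totals.index]
print(totals)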
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates 7 months back, so a game that happened in August looks like it happened in January, aligning the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()
I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis which looks like below.
date hours_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
This is how I want to group it:
cycle hours_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-31/11/2016 8 2
16/11/2016-31/11/2016 8 1
I am trying to do this with Grouper and a frequency in pandas, something like below.
data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving intervals of 15 days, not 1st to 15th and 16th to end of month.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
freq='SM' stands for semi-month, which uses the 15th and the last day of every month as the period ends.
Put DateTime-Values into Bins
If I understand you correctly, you basically want to put the values in your date column into bins. For this, pandas includes the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
df = pd.DataFrame({
    'hours' : 8,
    'emp_id' : [1, 1, 2, 1],
    'date' : [pd.Timestamp(2016, 11, 9),
              pd.Timestamp(2016, 11, 15),
              pd.Timestamp(2016, 11, 22),
              pd.Timestamp(2016, 11, 23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id'])['hours'].sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.