A Yearly data to Daily data in Python - python

df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The performance i needed is : (Graph NO.1)
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want a sum of Volume in different year (the Year_Sum_Volume is the volume of each year)
This is the code i try to calculate the sum of volume in each year but how can i add this data
to daily data , i want to add Year_Sum_Volume to df,like(Graph no.1)
df.resample('Y', on='Date')['Volume'].sum()
thanks you for answering

I believe groupby.sum() and merge should be your friends
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
df_sums = df.groupby("year").sum().rename(columns={"a":"a_sum"})
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
date
a
year
a_sum
0
2021-12-30 00:00:00
1
2021
3.1
1
2021-12-31 00:00:00
2.1
2021
3.1
2
2022-01-01 00:00:00
3.2
2022
3.2

Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column of a scaled list:
df.join(pd.DataFrame( {'Year_Sum_Volume': [your_sum_val] * len(df['Volume'])} ))

Try below code (after converting date column to pd.to_datetime)
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))

Related

Count ratios conditional on 2 columns

I am new to pandas and trying to figure out the following how to calculate the percentage change (difference) between 2 years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
company_df = df2[df2.company== company]
company_df.sort_values(by ="date")
company_df_year = company_df.amount.tolist()
company_df_year.pop()
company_df_year.insert(0,0)
company_df["value_year_before"] = company_df_year
if any in company_df.value_year_before == None:
company_df["diff"] = 0
else:
company_df["diff"] = (company_df.amount- company_df.value_year_before)/company_df.value_year_before
df2["ratio"] = company_df["diff"]
But I keep getting >NAN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, normally when using Pandas if you are starting to use a for loop then you are doing something wrong and there is an easier way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio of each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
Groupby will keep the order of the rows within each group so we sort before to ensure that the order of the dates is correct and fillna replace any nans with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
Apply an anonymous function that calculate the change percentage and returns that if there is more than one values. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
Input df:
Output:

Calculate a count of groupby rows that occur within a rolling window of days in Pandas

I have the following dataframe:
import pandas as pd
#Create DF
d = {'Name': ['Jim','Jim','Jim', 'Jim','Jack','Jack'],
'Date': ['08/01/2021','27/01/2021','05/02/2021','10/02/2021','26/01/2021','20/02/2021']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date,format='%d/%m/%Y')
df
I would like to add a column (to this same dataframe) calculating the how many have occurred in the last 28 days grouped by Name. Does anyone know the most efficient way to do this over 200,000 rows of code? with about 1000 different Name's?
The new column values should be 1,2,3,3,1,2. Any help would be much appreciated! Thanks!
Set the index of dataframe to Date, then group the frame by Name and apply rolling count with a closed window having offset of 28 days
df['count'] = df.set_index('Date')\
.groupby('Name', sort=False)['Name']\
.rolling('28d', closed='both').count().tolist()
Name Date count
0 Jim 2021-01-08 1.0
1 Jim 2021-01-27 2.0
2 Jim 2021-02-05 3.0
3 Jim 2021-02-10 3.0
4 Jack 2021-01-26 1.0
5 Jack 2021-02-20 2.0

Output raw value difference from one period to the next using Python

I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
on='Date')
.assign(Value = lambda x: x['Value_y']-x['Value_x'])
[['Date','Value']]
)
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) -
1).fillna(0)]
A member has helped me with the code above, I am also trying to incorporate the value difference as shown in my desired output.
Note - Is there a way to incorporate a 'period' - say, checking the percent difference and value difference over a 7 day period and 30 day period and so on?
Any suggestion is appreciated
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you use df.assign
df.assign(
percentageDiff=df["Value"].pct_change().mul(100),
ValueDiff=df["Value"].diff()
)

Pandas get the Month Ending Values from Series

I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward filling missing values by ffill with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
Following gives you the information you want, i.e. end of month values, though the format is not exactly what you asked:
df['month'] = df['date'].str.split('-', expand = True)[1] # split date column to get month column
newdf = pd.DataFrame(columns=df.columns) # create a new dataframe for output
grouped = df.groupby('month') # get grouped values
for g in grouped: # for each group, get last row
gdf = pd.DataFrame(data=g[1])
newdf.loc[len(newdf),:] = gdf.iloc[-1,:] # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1) # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season(year) they were played in. Each season starts in August and ends the NEXT year in july.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd rather prefer something more elegant and more inclusive of time-series data so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
print()
print(name)
print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts just on 1-st of August and ends on
31-st of July.
They you can do with your groups whatever you want.
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.
You could build a season column and group by that. In below code, I used pandas.DateOffset() to move all dates 7 months back so a game that happened in August would look like it happened in January to align the season year with the calendar year. Building season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']]
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()

Categories

Resources