Cumulative Sales of a Last Period - python

I have the following code that starts like this:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connection to Drive
from google.colab import drive
drive.mount('/content/drive')
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(10)
The file that I import can be downloaded from here: Data
And it looks like this:
Then what I do is group the values and create metrics called Rolling Year (RY_ACTUAL) and (RY_LAST); these tell me the sales of each category, for example the Blue category, over the last twelve months. This metric works fine:
# ROLLING YEAR
# I want to make a Rolling Year for each category, i.e. how much each category sold from 12 months ago to the current month
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling window
f = lambda x:x.rolling(12).sum()
df_group["RY_ACTUAL"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 to compare the current RY vs the last RY
f_1 = lambda x:x.rolling(24).sum()
df_group["RY_24"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the correct amount, i.e. the amount of RY-1 vs RY
df_group["RY_LAST"] = df_group["RY_24"] - df_group["RY_ACTUAL"]
My problem is with the metric called Year To Date, which is simply the accumulated sales of each category from January up to the month where you read the table. For example, if I stop at March 2015, I want to know how much each category sold from January to March. The column I created called YTD_ACTUAL does exactly that, and I compute it like this:
# YTD_ACTUAL
df_group['YTD_ACTUAL'] = df_group.groupby(["CATEGORY","DATE"]).Sales.cumsum()
However, what I have not been able to do is the YTD_LAST column, i.e. the same figure for the previous period. Recalling the example where we stopped at March 2015: for the Blue category it should return the accumulated sales from January to March, but of 2014.
My try >.<
#YTD_LAST
df_group['YTD_LAST'] = df_group.groupby(["CATEGORY", "DATE"]).Sales.apply(f)
Could someone help me to make this column correctly?
Thank you in advance, community!
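For what it's worth, here is a sketch of one possible approach (not necessarily the only one; it assumes df_group has exactly one monthly row per CATEGORY, sorted by DATE — the toy frame below stands in for the real CSV): compute the YTD cumulative sum within each category and calendar year, then shift it by 12 monthly rows within each category.

```python
import pandas as pd

# Toy monthly data standing in for df_group (column names follow the question)
df_group = pd.DataFrame({
    'CATEGORY': ['Blue'] * 24,
    'DATE': pd.date_range('2014-01-01', periods=24, freq='MS'),
    'Sales': range(1, 25),
})

# YTD_ACTUAL: cumulative sales within each category and calendar year
df_group['YTD_ACTUAL'] = df_group.groupby(
    ['CATEGORY', df_group['DATE'].dt.year])['Sales'].cumsum()

# YTD_LAST: the same month's YTD one year (12 monthly rows) earlier
df_group['YTD_LAST'] = df_group.groupby('CATEGORY')['YTD_ACTUAL'].shift(12)
```

With this, the March 2015 row holds the January-to-March 2015 total in YTD_ACTUAL and the January-to-March 2014 total in YTD_LAST.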

Related

Return Earliest Date based on value within dataset

I am working with REIGN data that documents elections and leaders in countries around the world (https://www.oneearthfuture.org/datasets/reign)
In the dataset there is a boolean election-anticipation variable that turns from 0 to 1 to denote that an election is anticipated within at least the next 6 months, possibly sooner.
Excel sheet of data in question
I want to create a new column that returns the earliest date of when anticipation (column N) turns 1 (i.e. when was the election first anticipated).
So for example, with Afghanistan we have an election in 2014 and in 2017.
In column N we see it turn from 0 to 1 in Oct 2014 (election anticipated), then go back to 0 in July 2014 (election concluded), until it goes back to 1 in Jan 2019 (election anticipated) and then turns back to 0 in Oct 2019.
So if successful, I would capture Oct, 2014 (election anticipated) and Jan, 2019 (election anticipated) as election announcement dates in a new column along with any other dates an election was anticipated.
Currently I have the following:
#bringing in Reign CSV
regin = pd.read_csv('REIGN_2021_7(1).csv')
#shows us the first 5 rows to make sure they look good
print(regin.head())
#show us the rows and columns in the file
regin.shape
#Getting our index
print(regin.columns)
#adding in a date column that concatenates year and month
regin['date'] = pd.to_datetime(regin[['year', 'month']].assign(DAY=1))
regin.head()
def conditions(s):
    if s['anticipation'] == 1:
        return s['date']
    return 0

regin['announced_date'] = regin.apply(conditions, axis=1)
print(regin.head())
The biggest issue for me is that while this returns the date when a 1 appears, it does not give the earliest date. How can I loop through the anticipation column and return the minimum date, and do so multiple times, since a country will have many elections over the years and there are therefore multiple instances in column N of the anticipation turning on (1) and off (0) for one country?
Thanks in advance for any assistance! Let me know if anything is unclear.
If you can loop over your dates, you will probably want to use the datetime module (assuming all dates have the same format):
from datetime import datetime
[...]
earliest_date = datetime.today()
[... loop over data, by country ...]
    date1 = datetime.strptime(input_date_string1, date_string_format)
    if date1 < earliest_date:
        earliest_date = date1
[...]
This module supports (among other things):
parsing date objects from a string (.strptime(in_str, format))
comparison of date objects (date1 > date2)
datetime object from current date + time (.today())
datetime object from arbitrary date (.date(year, month, day))
docs: https://docs.python.org/3/library/datetime.html
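Alternatively, the 0 -> 1 transitions can be picked out directly in pandas (a sketch; the column names follow the question, and the toy frame below stands in for the REIGN data):

```python
import pandas as pd

# Toy frame standing in for the REIGN data (values are made up)
regin = pd.DataFrame({
    'country': ['Afghanistan'] * 8,
    'date': pd.to_datetime(['2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01',
                            '2018-12-01', '2019-01-01', '2019-02-01', '2019-10-01']),
    'anticipation': [0, 0, 1, 1, 0, 1, 1, 0],
})

# A row starts an anticipation episode when it is 1 and the previous
# row for the same country was 0.
prev = regin.groupby('country')['anticipation'].shift(1).fillna(0)
announced = regin[(regin['anticipation'] == 1) & (prev == 0)][['country', 'date']]
```

This keeps one announcement date per episode, so a country with several elections contributes several rows.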

Python drop random numbers of a beta distribution

I have a question about beta distributions and random variables. My data includes performance data from 2012 to 2016 on an hourly basis. I recalculated the data monthly, so I have only one value for every month. After that, I created a new df with all the values of a month as shown in my code sample.
import numpy as np
import pandas as pd
from scipy.stats import beta
import matplotlib.pyplot as plt
output = pd.read_csv("./data/external/power_output_hourly.csv", delimiter=",", parse_dates=True, index_col=[0])
print(output.head())
output_month = output.resample('1M').sum()
print(output_month.head())
jan = output_month[:1]
jan = jan.append(output_month[12:13])
jan = jan.append(output_month[24:25])
jan = jan.append(output_month[36:37])
jan = jan.append(output_month[48:49])
print(jan)
...
months = [jan, feb, mar, apr, mai, jun, jul, aug, sep, okt, nov, dez]
My next step is to pull random numbers from a beta distribution based on the past values of each month. Therefore, I want to use the scipy package and numpy.random. The problem is that I don't know how. I need only 20 numbers, but I don't know how I can determine the a and b values. Do I just have to try random values, or can I extract the corresponding values from my past data? I am thankful for every help!
Try fitting (i.e. finding the parameters of) the beta distribution for each month using scipy.stats.beta.fit(MONTH). See here for a short description of its outputs, or read the source code for details (poorly documented, unfortunately).
FYI, more discussion about fitting a beta distribution can be found in this post; I haven't used the function much myself.
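A minimal sketch of that fit-then-sample idea (the sample values below are made up, and scaling the data into (0, 1) beforehand is an assumption, since the standard beta distribution lives on that interval):

```python
import numpy as np
from scipy.stats import beta

# Made-up past January values, already scaled into (0, 1)
jan_values = np.array([0.42, 0.55, 0.48, 0.61, 0.39])

# Fix loc=0 and scale=1 so only the shape parameters a and b are estimated
a, b, loc, scale = beta.fit(jan_values, floc=0, fscale=1)

# Draw 20 random numbers from the fitted distribution
samples = beta.rvs(a, b, loc=loc, scale=scale, size=20)
```

Repeating this per month (one fit per entry of the months list) gives a set of fitted a and b values extracted from the past data rather than guessed.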

Adjust datetime in Pandas to get CustomBusinessWeek

I have a long series of daily stock prices and I am trying to get weekly prices to do some calculations. I have been reading the documentation and I see you can set offsets to get a specific day of the week, which is what I want. This is the code; assume stock is part of a loop I am running.
df_clean_BW['WEEKLY_PricesFriday'] = stock.resample('W-FRI').last()
But for the US stock market there are many Fridays that are holidays, so I saw you can adjust this with the US holiday calendar. This is the code I was using:
from pandas.tseries.offsets import CustomBusinessDay
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
But I don't know how to combine the two so that if there is a holiday on a Friday it takes the day prior (the Thursday) instead. So something like this, but it throws an error:
df_clean_BW['WEEKLY_PricesFriday'] = stock.resample('W-FRI' & bday_us).last()
I have a long list of dates, so I don't want to create a list of exception days because it would be too long. Here is an example of the output I would want. In this case Jan 1, 2016 was a Friday, so I just want to take December 31, 2015 instead. This must be a common request for anyone who looks at stock data, but I can't figure out a way to do it.
Date        Price      Week Price
12/30/2015  103.3227
12/31/2015  101.3394
1/4/2016    101.426    101.3394   << take 12/31, as 1/1 is a holiday
1/5/2016     98.8844
1/6/2016     96.9492
1/7/2016     92.8575
1/8/2016     93.3485    93.3485
First generate your array of Fridays including holidays. Then use np.busday_offset() to offset them like this:
np.busday_offset(fridays, 0, roll='backward', busdaycal=bday_us.calendar)
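Putting the two pieces together, a self-contained sketch might look like this (the date range is arbitrary, chosen to cover the example above):

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# All weekly Fridays in the range, holidays included
fridays = pd.date_range('2015-12-28', '2016-01-31', freq='W-FRI')

# Roll each holiday Friday back to the previous business day
adjusted = np.busday_offset(fridays.values.astype('datetime64[D]'), 0,
                            roll='backward', busdaycal=bday_us.calendar)
```

Here Jan 1, 2016 (a holiday Friday) comes out as Dec 31, 2015, while ordinary Fridays pass through unchanged; the adjusted dates can then be used to pick the week-end prices out of stock.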

Iterating through dates over a specific period of time in python

I am currently working with a large dataset that has lists of daily inventory. I want to compare the inventory over 2 days to see what has changed, and continue that process for an entire month. For example for the month of January, I want to see the change between January 1 and 2, and then January 2 and 3, and so on until I reach January 31st.
I was able to write code to compare inventory between 2 dates. But how do I iterate so that the code continues running for the next set of days? I am new to programming and would appreciate any help.
In the code below, I created 2 subsets: the 1st for the inventory on October 14 and the 2nd for inventory on October 15. In the 3rd line, I calculate what has changed between the 2 days using the unique identifier in the dataset (image).
cars_date_1 = cars_extract_drop[(df['as_of_date'] > '2015-10-14') &
                                (df['as_of_date'] < '2015-10-15')]
cars_date_2 = cars_extract_drop[(df['as_of_date'] > '2015-10-15') &
                                (df['as_of_date'] < '2015-10-16')]
cars_sold = cars_date_1[~cars_date_1['image'].isin(cars_date_2['image'])]
Use the pandas pd.date_range() function and iterate through its elements:
rng = pd.date_range('1/1/2011', periods=365, freq='D')
for i in range(len(rng) - 1):
    day_1 = rng[i]
    day_2 = rng[i + 1]
    difference_function(day_1, day_2)  # your comparison between the two days
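Putting that together with the comparison from the question, a sketch might look like this (the toy frame and its 'image' identifiers are made up for illustration):

```python
import pandas as pd

# Toy inventory standing in for the real data: one row per car per day,
# with 'image' as the unique identifier (column names follow the question)
df = pd.DataFrame({
    'as_of_date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-02']),
    'image': ['a', 'b', 'a'],
})

days = pd.date_range('2016-01-01', '2016-01-03', freq='D')
sold_per_day = {}
for day_1, day_2 in zip(days[:-1], days[1:]):
    inv_1 = df[df['as_of_date'] == day_1]
    inv_2 = df[df['as_of_date'] == day_2]
    # Cars present on day_1 but missing on day_2 count as sold
    sold_per_day[day_1] = inv_1[~inv_1['image'].isin(inv_2['image'])]
```

The zip over consecutive elements avoids indexing past the end of the range, and each day pair reuses exactly the isin comparison from the question.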

Group DataFrame by Business Day of Month

I am trying to group a Pandas DataFrame that is indexed by date by the business day of month, approx 22/month.
I would like to return a result that contains 22 rows with the mean of some value in the DataFrame.
I can group by day of month but can't seem to figure out how to do it by business day.
Is there a function that will return the business day of month of a date?
If someone could provide a simple example, that would be most appreciated.
Assuming your dates are in the index (if not, use set_index):
df.groupby(pd.Grouper(freq='B'))
See time series functionality.
I think what the question is asking is to groupby business day of month - the other answer just seems to resample the data to the nearest business day (at least for me).
This code groups by the business day of month and returns the mean for each of the roughly 22 business days:
from datetime import date
import pandas as pd
import numpy as np

d = pd.Series(np.random.randn(1000),
              index=pd.bdate_range(start='01 Jan 2018', periods=1000))

def to_bday_of_month(dt):
    month_start = date(dt.year, dt.month, 1)
    return np.busday_count(month_start, dt)

day_of_month = [to_bday_of_month(dt) for dt in d.index.date]
d.groupby(day_of_month).mean()
