Python: draw random numbers from a beta distribution - python

I have a question about beta distributions and random variables. My data consists of hourly performance data from 2012 to 2016. I resampled the data monthly, so I have only one value for every month. After that, I created a new df with all the values of a given month across the years, as shown in my code sample.
import numpy as np
import pandas as pd
from scipy.stats import beta
import matplotlib.pyplot as plt
output = pd.read_csv("./data/external/power_output_hourly.csv", delimiter=",", parse_dates=True, index_col=[0])
print(output.head())
output_month = output.resample('1M').sum()
print(output_month.head())
# collect the five January rows (one per year, 2012-2016)
jan = pd.concat([output_month[0:1], output_month[12:13],
                 output_month[24:25], output_month[36:37],
                 output_month[48:49]])
print(jan)
...
months = [jan, feb, mar, apr, mai, jun, jul, aug, sep, okt, nov, dez]
My next step is to draw random numbers from a beta distribution based on the past values of each month. For that, I want to use the scipy package and numpy.random. The problem is that I don't know how. I only need 20 numbers, but I don't know how to determine the a and b parameters. Do I just have to try random values, or can I extract the corresponding values from my past data? I am thankful for any help!

Try fitting (i.e., finding the parameters of) the beta distribution for each month using scipy.stats.beta.fit(MONTH). See here for a short description of its outputs, or read the source code for details (poorly documented, unfortunately).
FYI, more discussion about fitting a beta distribution can be found in this post; I haven't used the function much myself.
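For example, a minimal sketch of that idea, reusing the jan frame from the question (the min/max rescaling is my assumption, since a standard beta distribution lives on (0, 1) while raw power output does not):
import numpy as np
from scipy.stats import beta

values = jan.to_numpy().ravel()
lo, hi = values.min(), values.max()  # assumed bounds; use the real physical limits if you know them
scaled = np.clip((values - lo) / (hi - lo), 1e-6, 1 - 1e-6)  # keep data strictly inside (0, 1)
a, b, loc, scale = beta.fit(scaled, floc=0, fscale=1)  # fix loc/scale, estimate only a and b
samples = beta.rvs(a, b, size=20)  # the 20 random numbers, on the (0, 1) scale
samples = lo + samples * (hi - lo)  # map back to the original units
With only five past values per month, the fit will be rough, so treat the estimated parameters with caution.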

Related

Cumulative Sales of the Last Period

I have the following code that starts like this:
# Import libraries
import numpy as np
import pandas as pd
import datetime as dt
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
ruta = '/content/drive/MyDrive/example.csv'
df = pd.read_csv(ruta)
df.head(10)
The file that I import can be downloaded from here: Data
And it looks like this:
Then what I do is group the values and create metrics called Rolling Year (RY_ACTUAL) and (RY_LAST); these tell me the sales of each category, for example the Blue category, over the twelve months up to each date, and the same figure one year earlier. This metric works fine:
# ROLLING YEAR
# I want to make a Rolling Year for each category, i.e. how much each category
# sold from 12 months ago TO the current month.
# RY_ACTUAL: one year has 12 months, so I pass 12 as the rolling window
f = lambda x: x.rolling(12).sum()
df_group["RY_ACTUAL"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f)
# RY_24: a rolling window of 24 to compare the actual RY vs the last RY
f_1 = lambda x: x.rolling(24).sum()
df_group["RY_24"] = df_group.groupby(["CATEGORY"])['Sales'].apply(f_1)
# RY_LAST: subtract RY_ACTUAL from RY_24 to get the amount of RY-1
df_group["RY_LAST"] = df_group["RY_24"] - df_group["RY_ACTUAL"]
My problem is with the metric called Year To Date, which is nothing more than the accumulated sales of each category from JANUARY to the month where you read the table. For example, if I stop at March 2015, I want to know how much each category sold from January to March. The column I created called YTD_ACTUAL does just that, and I achieve it like this:
# YTD_ACTUAL
df_group['YTD_ACTUAL'] = df_group.groupby(["CATEGORY","DATE"]).Sales.cumsum()
However, what I have not been able to do is the YTD_LAST column, that is, the same accumulation for the previous period. Recalling the example where we stopped at March 2015: for the blue category, it should return the accumulated sales for blue from JANUARY to MARCH, but of the year 2014.
My try >.<
#YTD_LAST
df_group['YTD_LAST'] = df_group.groupby(["CATEGORY", "DATE"]).Sales.apply(f)
Could someone help me build this column correctly?
Thank you in advance, community!
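One possible approach, as a minimal sketch (it assumes df_group has one monthly row per CATEGORY and is sorted by DATE, which the question does not show): compute the YTD within each (CATEGORY, year), then shift it by 12 monthly rows within each category:
df_group["DATE"] = pd.to_datetime(df_group["DATE"])
df_group = df_group.sort_values(["CATEGORY", "DATE"])
# year-to-date sum within each calendar year, per category
df_group["YTD_ACTUAL"] = df_group.groupby(["CATEGORY", df_group["DATE"].dt.year])["Sales"].cumsum()
# last year's YTD at the same month: shift by 12 monthly rows within the category
df_group["YTD_LAST"] = df_group.groupby("CATEGORY")["YTD_ACTUAL"].shift(12)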

Processing data with incorrect dates like 30th of February

In trying to process a large number of bank account statements given in CSV format I realized that some of the dates are incorrect (30th of February, which is not possible).
So this snippet fails [1] telling me that some dates are incorrect:
df_from_csv = pd.read_csv( csv_filename
, encoding='cp1252'
, sep=";"
, thousands='.', decimal=","
, dayfirst=True
, parse_dates=['Buchungstag', 'Wertstellung']
)
I could of course pre-process those CSV files and replace the 30th of Feb with the 28th of Feb (or whatever day February ended on in that year).
But is there a way to do this in Pandas, while importing? Like "If this column fails, set it to X"?
Sample row
775945;28.02.2018;30.02.2018;;901;"Zinsen"
As you can see, the date 30.02.2018 is not correct, because there ain't no 30th of Feb. But this seems to be a known problem in Germany. See [2].
[1] Here's the error message:
ValueError: day is out of range for month
[2] https://de.wikipedia.org/wiki/30._Februar
Here is how I solved it:
I added a custom date-parser:
import calendar
import datetime

def mydateparser(dat_str):
    """Given a date like `30.02.2020`, create a correct date `28.02.2020`."""
    if dat_str.startswith("30.02"):
        (d, m, y) = [int(el) for el in dat_str.split(".")]
        # monthrange returns the first weekday and the last day of the given year/month:
        (first, last) = calendar.monthrange(y, m)
        # Use the correct last day (`last`) to build a new date string:
        dat_str = f"{last:02d}.{m:02d}.{y}"
    return datetime.datetime.strptime(dat_str, "%d.%m.%Y")
# ... and used it in `read_csv` (glob is needed for the file loop):
import glob

for csv_filename in glob.glob(f"{path}/*.csv"):
    # read each CSV into a DataFrame
    df_from_csv = pd.read_csv(csv_filename,
                              encoding='cp1252',
                              sep=";",
                              thousands='.', decimal=",",
                              dayfirst=True,
                              parse_dates=['Buchungstag', 'Wertstellung'],
                              date_parser=mydateparser)
This allows me to fix those incorrect "30.02.XX" dates and lets pandas convert the two date columns (['Buchungstag', 'Wertstellung']) into dates instead of objects.
You could load it all up as text, then run it through a regex to identify illegal dates, to which you could then apply some adjustment function.
A sample regex you might apply could be:
import re
ok_date_pattern = re.compile(r"^(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-(19|20)[0-9]{2}\b")
This finds dates in DD-MM-YYYY format where DD is constrained to 01 through 31 (i.e., a day of 42 would be considered illegal), MM is constrained to 01 through 12, and YYYY is constrained to the range 1900 to 2099.
There are other regexes that go into more depth, such as some of the inventive answers found here.
What you then need is a working adjustment function, perhaps one that parses the date as best it can and returns the nearest legal date. I'm not aware of anything that does that out of the box, but such a function could be written to deal with the most common edge cases.
It would then be a case of tagging legal and illegal dates using an appropriate regex and applying a suitable date-conversion function to each class of dates.
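A hedged sketch of such an adjustment function, assuming DD-MM-YYYY strings as matched above: clamp the day to the last legal day of that month.
import calendar
import re

date_pattern = re.compile(r"^(\d{2})-(\d{2})-(\d{4})$")

def nearest_legal_date(dat_str):
    m = date_pattern.match(dat_str)
    if not m:
        return dat_str  # leave anything unparseable untouched
    d, mo, y = (int(g) for g in m.groups())
    last = calendar.monthrange(y, mo)[1]  # number of days in that month
    return f"{min(d, last):02d}-{mo:02d}-{y}"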

Add time in pandas beyond year 2262

I would like to add months to a date in pandas, possibly ending up beyond the year 2262. There is a solution for a relatively small number of months:
import numpy as np
import pandas as pd
pd.Timestamp('2018-01-22 00:00:00') + np.timedelta64(12, 'M')
which results in
Timestamp('2019-01-22 05:49:12')
However, when I add a larger number (which, as a result, exceeds the year 2262):
pd.Timestamp('2018-01-22 00:00:00') + np.timedelta64(3650, 'M')
Python does return a result, but it has silently overflowed:
Timestamp('1737-09-02 14:40:26.290448384')
How to cope with this?
pandas.Timestamp aims to handle much finer time resolution, down to the nanosecond. This precision takes up enough of the 64 bits allocated to it that it can only go up to the year 2262. However, datetime.datetime does not have this limitation and can go up to the year 9999. If you start working with datetime objects instead of Timestamp objects, you'll lose some functionality, but you will be able to go beyond 2262.
Also, your number of months went beyond the maximum number of days a Timedelta can represent.
Let's begin by picking a more reasonably sized number of months: 48 (four years).
d = pd.Timedelta(48, 'M')
And our date
t = pd.Timestamp('2018-01-22')
A multiplier representing how many times our 48 months fit inside the desired 3650 months.
m = 3650 / 48
Then we can use to_pydatetime and to_pytimedelta
t.to_pydatetime() + d.to_pytimedelta() * m
datetime.datetime(2322, 3, 24, 14, 15, 0, 1)
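An alternative sketch, assuming the third-party python-dateutil package is available: relativedelta adds whole calendar months exactly and works for any year that datetime supports.
from datetime import datetime
from dateutil.relativedelta import relativedelta

t = datetime(2018, 1, 22)
t + relativedelta(months=3650)  # datetime.datetime(2322, 3, 22, 0, 0)
The small difference from the result above comes from the 'M' timedelta unit being an average month length rather than a calendar month.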

Python Timedelta Arithmetic With noleap Calendars

I'm working with CMIP5 data that has time units of "days since 1-1-1850". To find the date of a given data point in the file, I would normally just add its time value (in days) to 1-1-1850 as a timedelta. However, CMIP5 (or at least the file I'm using) uses a 'noleap' calendar, meaning that all years have only 365 days.
In my current case, when dealing with the data point that corresponds to January 1, 1980, I add its time value of 47450 days to the origin date of January 1, 1850. However, I get back an answer of December 1, 1979, because all the Feb. 29ths between 1850 and 1980 are not counted. Is there an additional argument to timedelta, or in datetime in general, that deals with calendars that exclude leap days?
netCDF num2date is the function you are looking for:
import netCDF4
ncfile = netCDF4.Dataset('./foo.nc', 'r')
time = ncfile.variables['time'] # note that we do not cast to numpy array yet
time_convert = netCDF4.num2date(time[:], time.units, time.calendar)
Note that the CMIP5 models do not have a standard calendar, so the time.calendar argument is important to include when doing this conversion.
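As a quick standalone check of the same conversion, here is a sketch that assumes the cftime package (the library that backs netCDF4's date handling):
import cftime

# 47450 noleap days after 1850-01-01 is exactly 130 noleap years:
print(cftime.num2date(47450, 'days since 1850-01-01', calendar='noleap'))
# -> 1980-01-01 00:00:00, a cftime.DatetimeNoLeap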

Group DataFrame by Business Day of Month

I am trying to group a Pandas DataFrame that is indexed by date by the business day of the month, approx. 22 per month.
I would like to return a result that contains 22 rows with the mean of some value in the DataFrame.
I can group by day of month, but can't seem to figure out how to group by business day.
Is there a function that will return the business day of month of a date?
If someone could provide a simple example, that would be most appreciated.
Assuming your dates are in the index (if not, use set_index):
df.groupby(pd.Grouper(freq='B'))
See time series functionality.
I think what the question is asking is to group by business day of month - the other answer just seems to resample the data to the nearest business day (at least for me).
This code groups by the business day of the month and returns the mean for each of the roughly 22 business days:
from datetime import date
import pandas as pd
import numpy as np

d = pd.Series(np.random.randn(1000),
              index=pd.bdate_range(start='01 Jan 2018', periods=1000))

def to_bday_of_month(dt):
    # count business days from the first of the month up to the given date
    month_start = date(dt.year, dt.month, 1)
    return np.busday_count(month_start, dt)

day_of_month = [to_bday_of_month(dt) for dt in d.index.date]
d.groupby(day_of_month).mean()
