I am looking at a history of daily min/max temperatures for the past ~40 years for a specific city (the precipitation variable isn't needed).
I imported the CSV file with the aim of calculating the average low and high temperature for each winter (I consider the range November-March as winter). I suppose a solution could be to loop over the years and create a column of the form "Winter&year" (for instance, the 1st of December 2018 fell in winter 2018, and the 23rd of February 2019 also fell in winter 2018). I found plenty of examples of aggregating days/months into seasons, but nothing where the year changes, and that is the bit I am struggling with.
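To make the "Winter&year" idea concrete, here is a minimal sketch of what I have in mind (the file name and the columns date, tmin and tmax are hypothetical):

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv("temperatures.csv", parse_dates=["date"])

# Keep only the winter months (November-March)
winter = df[df["date"].dt.month.isin([11, 12, 1, 2, 3])].copy()

# Label each row with the winter's starting year: November and December
# keep their own year, January-March belong to the previous year's winter,
# so 2018-12-01 and 2019-02-23 both land in winter 2018.
winter["winter"] = winter["date"].dt.year.where(
    winter["date"].dt.month >= 11, winter["date"].dt.year - 1
)

print(winter.groupby("winter")[["tmin", "tmax"]].mean())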
The structure of the data is the following:
Could anyone point me in the right direction?
Many thanks
Please forgive the use of the photo; I tried copying out the dataframe, but it wasn't coming out the way I wanted it to.
The number of sales is represented by the number of rows of the dataframe, which is 30255.
Above is a sample of the dataframe I am working with.
Let n be the number of days, starting at n = 1 for 1st January 1995 and ending at n = 9131 for 31st December 2019.
I want to consider the number of sales of 'D' over each 365-day period, representing each datapoint for the yearly sales using day 183 as the midpoint of the first 365-day period.
My problem is how to split the data into these 365-day periods.
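A minimal sketch of what I am aiming for (the file name and the date column are assumptions; one row per sale):

import pandas as pd

# Hypothetical setup: one row per sale, with a 'date' column
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Day number n, with n = 1 on 1995-01-01 and n = 9131 on 2019-12-31
n = (df["date"] - pd.Timestamp("1995-01-01")).dt.days + 1

# Integer division puts each sale into a 365-day period (0, 1, 2, ...)
df["period"] = (n - 1) // 365

# Number of sales per 365-day period
yearly = df.groupby("period").size()

# Midpoint day of each period: 183, 548, 913, ...
midpoints = yearly.index * 365 + 183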
Any help would be much appreciated
I am trying to add precisely 148.328971 years to the date 01.01.2000 using pandas. I first converted this to days by multiplying by 365. So here is my question, albeit probably a dumb one.
Does pandas consider leap years when calculating days? The obvious answer is yes, because it is calculating days, but I have to make sure; the precision of dates and times is important in the analysis I am trying to do.
I get this result when calculating with pandas, which I am not sure is entirely correct:
03.25.2148 1:47:09 AM
The code being used:
import pandas as pd
start = "01/01/2000"
end = pd.to_datetime(start) + pd.DateOffset(days=54140.074415)
print(end)
Any help would be greatly appreciated! Sorry in advance if this seems to be basic knowledge, but I have to be certain.
Your question is flawed as stated: a year is not a fixed length of time. So what does "precisely 148.328971 years" even mean? Do you mean 148 calendar years plus 0.328971 of the way through the next calendar year? And if you're counting to a precision of 0.000001 year – a millionth of a year, or about 30 seconds – then "01.01.2000" is a pretty imprecise starting point; from what time on January 1st, 2000 are you counting?
Let's assume you mean civil calendar years from midnight UTC. Then 148.0 years would get you to January 1, 2148, still at midnight UTC. Since 2148 is a leap year, 0.328971 of the way through it would add 0.328971 × 366 = 120.403 more days, which gets you to April 30, 2148 at 09:40 UTC.
Maybe you mean to be counting some "year" value that really is fixed? We do that in other contexts; the light year is based on the mean Julian calendar year, and so is defined as the distance light travels in exactly 365.25 atomic days. If you mean 148.328971 of those years, that'd be 54,177.1567 days, which would get you to May 1, 2148 at 03:45 UTC instead.
But we don't use the Julian calendar anymore for civil purposes. Maybe you want instead the mean year of the Gregorian calendar, which replaced it in the West? That's exactly 365.2425 days; 148.328971 of those years is 54,176.0442 days, which from 2000 Jan 1 gets you back to April 30, 2148, only now at 01:03 UTC.
Then again, in parts of the world where the Eastern Orthodox Church is dominant, they instead use the Revised Julian calendar, whose mean year is 365.2422… days (exactly 365 days, 5 hours, 48 minutes, 48 seconds). 148.328971 of those years is only 54,176.0030 days, which still gets you to April 30th, but just barely over the line from the 29th, at less than 5 minutes after midnight: 00:04 UTC.
So if you're counting calendar years of some description, you wind up somewhere on April 30th or May 1st, 2148. I trust this is helpful.
But maybe you mean to toss out calendars and go directly to the value they're trying to approximate: the mean tropical year! But then we have to ask where in the year you're measuring from, because December solstice to December solstice is a different length on average than March equinox to March equinox (because the length of the year itself is constantly changing). When we need a fixed value we tend to use the average of averages, as it were, taking the mean of the length values across the whole year. As of 2000 that value was about 365.24219 days. 148.328971 of those is only 54,175.9982 days, which leaves you even earlier: April 29th, 2148 at 23:57 UTC.
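If you want to reproduce these figures yourself, here is a quick sketch (taking midnight UTC on 2000-01-01 as the start, per the assumption above):

import pandas as pd

start = pd.Timestamp("2000-01-01")  # midnight assumed
years = 148.328971

# Civil-calendar reading: 148 whole years, then 0.328971 of leap year 2148
print("civil", start + pd.DateOffset(years=148) + pd.Timedelta(days=0.328971 * 366))

# Fixed-length "years" of the various descriptions above
for name, length in [("Julian", 365.25),
                     ("Gregorian", 365.2425),
                     ("Revised Julian", 365 + 20928 / 86400),
                     ("mean tropical (2000)", 365.24219)]:
    print(name, start + pd.Timedelta(days=years * length))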
Then there's the sidereal year, but even though it arguably has the greatest claim at being the "real" period of the Earth's orbit, it doesn't see much use outside astronomy; probably not that.
Anyway, the real question is: where does this "148.328971" figure come from, and what is the intent behind it? Once you know what the desired answer actually is, it will be easy enough to find its value.
Yes, it does. However, your conversion from years to days already ignores leap years. You can multiply by 365.25 (or 365.242, as suggested in the comments) instead, which gives better results.
You can check the accuracy of the results on wolfram alpha: https://www.wolframalpha.com/input/?i=148.328971+years++from+01%2F01%2F2000
In addition, you can use pandas DateOffset with years. However, currently only integer values are supported.
import pandas as pd
start = "01/01/2000"
end = pd.to_datetime(start) + pd.DateOffset(years=148, days=0.328971 * 365.242)
print(end)
# 2148-04-30 03:45:35.229600
It seems to work well, but it misses by a few hours.
I want to generate daily sales based on a weekday distribution and a monthly sum, and I want the result to look smooth, without jumps from month to month; another condition is not to change the monthly sums. The idea of how it should look is below; it resembles a normal distribution:
(blue line - current sales distribution, red - approximately what I would like to get)
My own method was to decrease sales at the beginning of the month and increase them at the end (or increase them at the beginning if the next month's sales are higher than the current month's). I generated a list of multipliers and then multiplied the sales by that list. But that doesn't really work, because in some cases the decrease at the beginning of the month is too large when the next month's sales are much higher than the current month's.
Standard time-series smoothing techniques change the monthly sums, and even bringing the monthly sums back to the desired values by adding the error (abs(new_sales_after_smooth - desired_monthly_sales) / 30) to every day in the month does not change the situation; there are still sharp ups and downs from month to month.
Sales from month to month can decrease as well as increase.
Preserving the weekly seasonality is also important.
I would be grateful for any ideas on how to solve this problem. Example data is below (the numbers in the proportion column are fractions of the monthly sales), followed by a rough sketch of one idea.
weekday     proportion
Monday      0.040088
Tuesday     0.028345
Wednesday   0.027814
Thursday    0.034188
Friday      0.035997
Saturday    0.031616
Sunday      0.032600
month       sales
July        16263212
August      17422652
September   18028792
October     20588807
November    26466756
December    40903354
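One idea I have not fully worked out: linearly interpolate an average-daily-sales baseline between month midpoints, overlay the weekday profile, then rescale each month so its sum is unchanged. A sketch (assuming the months above are July-December 2020):

import numpy as np
import pandas as pd

# Data from the tables above; the year (2020) is an assumption
monthly = pd.Series(
    [16263212, 17422652, 18028792, 20588807, 26466756, 40903354],
    index=pd.period_range("2020-07", "2020-12", freq="M"),
)
weekday_prop = np.array([0.040088, 0.028345, 0.027814, 0.034188,
                         0.035997, 0.031616, 0.032600])  # Mon..Sun

days = pd.date_range("2020-07-01", "2020-12-31", freq="D")

# Smooth baseline: average daily sales, interpolated between month midpoints
midpoints = monthly.index.to_timestamp() + pd.Timedelta(days=14)
avg_daily = monthly.to_numpy() / monthly.index.days_in_month
baseline = np.interp(days.asi8, midpoints.asi8, avg_daily)
daily = pd.Series(baseline, index=days)

# Overlay the weekly seasonality, normalised so it averages to 1
daily *= weekday_prop[days.dayofweek] / weekday_prop.mean()

# Rescale each month so the monthly sums stay exactly as given
per = daily.index.to_period("M")
daily *= np.asarray(per.map(monthly), dtype=float) / daily.groupby(per).transform("sum")

The final rescaling can reintroduce small steps at the month boundaries, so it might need to be iterated with the smoothing a few times.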
This is my first time asking in the community, although I have used the website for help extensively in the past. I was not able to find a solution to this specific problem, and I am fairly amateur at Python, so I am having a hard time putting the logic into code, although I think the logic is clear enough. I am using Python via Google Colab and have shared a Google Sheet with data at the end.
In my scenario, we have a start month, a length of time, and a payout month. The end month can be calculated from the length. A person can be part of multiple groups and thus can have multiple start, end and payout months.
The goal is to find how much a member is expected to have paid as of today.
E.g. a group begins in Jan 2020, is 10 months long, and will end in Oct 2020. The monthly contribution is 5k. The payout month is, let's say, Mar 2020. While we technically should be getting 10 payments (it is a 10-month group), we expect only 9 payments, i.e. 45k, because when the payout month comes around, the member is not expected to pay for that month. If the group instead began in Dec 2020 and was 10 months long, then as of today we would only expect 5 payments (Dec to Apr 21).
These scenarios get complicated when, for example, a member is part of 3 groups, so 3 start dates, 3 end dates, 3 payout dates, and likely 3 different instalment amounts. Let's say the start dates are Jan 20, Feb 20, Mar 20 and all groups are 10 months long. Let's also say there is a payout in Apr 20. In Apr 20 all the groups will be active (the end month has not been reached yet), so in Apr 20 (the payout month) we will expect no payments from any of the groups.
Meaning that, if there are 3 groups and a payout falls between any group's start and end month, then we will not expect a payment for that group in that month. If two payouts fall between the start and end months of the groups, then we will not expect 6 payments, 2 for each group, and so on. If there are 3 groups and 1 payout falls between the dates of only 2 groups, then we will not expect instalments for only those two groups (whatever the instalment is for those groups).
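To make the rule concrete, here is a minimal sketch with months coded as integers (Jan 2020 = 1 and so on, as in my attempts below), hypothetical column names mirroring the sheet, and a single payout per group; multiple payouts would repeat the same comparison per payout.

import pandas as pd

# Hypothetical frame mirroring the sheet, one payout per group for simplicity
df = pd.DataFrame({
    "Group ID": ["G1", "G2", "G3"],
    "Member Code": ["M1", "M1", "M1"],
    "Coded Start Month": [1, 2, 3],
    "Group Length": [10, 10, 10],
    "Instalment": [5000, 5000, 5000],
    "Coded Payout Month": [4, 4, 4],
})
today = 16  # e.g. April 2021 in the same coding

df["Coded End Month"] = df["Coded Start Month"] + df["Group Length"] - 1

# Months each group has been running so far, capped at the group length
elapsed = (today - df["Coded Start Month"] + 1).clip(0, None).clip(upper=df["Group Length"])

# One instalment is waived when the payout month falls inside the group's
# window (this sketch assumes the payout month has already passed)
waived = ((df["Coded Payout Month"] >= df["Coded Start Month"])
          & (df["Coded Payout Month"] <= df["Coded End Month"])).astype(int)

df["Expected To Date"] = (elapsed - waived) * df["Instalment"]
print(df[["Group ID", "Expected To Date"]])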
The following Google Sheet has some sample data.
The Group ID column is entirely unique and will have no duplicates (you can think of it as an invoice, since all invoices are unique). The Member Code column can have duplicates, since a member can have more than one group. Do not worry about the days in the dates; what matters is the month and year. We have the start month, group length and payout month. We also have how much money is owed monthly by a member for that group.
https://docs.google.com/spreadsheets/d/1nAXlifIQdYiN1MWTv7vs2FqbFu2v6ykCzQjrJNPTBWI/edit#gid=0
Any help or advice would be great.
EDITED -> I have tried the following but got an error (I coded the months, i.e. Jan 2020 = 1, Feb 2020 = 2 and so on, so I don't have to mess around with dates):
deal_list = df['Group ID'].tolist()

def instalment(deal_list):
    for member in df['Member Code'].unique():
        if df['Coded Payout Month'] >= df['Coded Start Month'] and df['Coded Payout Month'] <= df['Coded End Month']:
            count_months = count_months + 1
    return count_months * df['Instalment']

instalment(deal_list)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
EDITED - I have also tried the following just now (with help from Pandas: Groupby and iterate with conditionals within groups?). It sort of worked, in that it gave me a count of 1 for each row. I was trying to get the number of times each payout month appears within the dates of a group.
grouped = df.groupby('Member Code')
for g_idx, group in grouped:
    for r_idx, row in group.iterrows():
        if ((row['Coded Payout Month'] >= group['Coded Start Month']).any()
                & (row['Coded Payout Month'] <= group['Coded End Month']).any()):
            df.loc[r_idx, 'payout_cut'] =+ 1
print(df)
I found a way around it. Essentially, rather than trying to iterate through all the rows, I transformed my data into long form first in Google Sheets via transpose and filter (I filtered for all payout months for a member and transposed the results into the rows). I then pushed that into Colab and, through pd.melt, transformed the data back into unique rows per deal with the additional payouts as required. Then running the condition was simple enough, and finally I summed all the true values.
I can explain a bit more if anyone needs it.
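A rough sketch of that reshape (the wide payout columns here are hypothetical; my real data has more):

import pandas as pd

# Hypothetical wide frame: one row per group, with that member's payout
# months spread across columns (NaN where there are fewer payouts)
df = pd.DataFrame({
    "Group ID": ["G1", "G2"],
    "Coded Start Month": [1, 2],
    "Coded End Month": [10, 11],
    "Instalment": [5000, 3000],
    "Payout 1": [4, 4],
    "Payout 2": [7, None],
})

# Long form: one row per (group, payout) pair
long = df.melt(
    id_vars=["Group ID", "Coded Start Month", "Coded End Month", "Instalment"],
    value_name="Coded Payout Month",
).dropna(subset=["Coded Payout Month"])

# The condition is now a simple vectorised test on each row
waived = ((long["Coded Payout Month"] >= long["Coded Start Month"])
          & (long["Coded Payout Month"] <= long["Coded End Month"]))

print(long.loc[waived, "Instalment"].sum())  # total of waived instalments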
I took inspiration from here:
https://youtu.be/pKvWD0f18Pc
I am trying to calculate the 30-year temperature normal (the 1981-2010 average) for the NARR daily gridded data set linked below.
In the end, for each grid point I want an array of 365 values, each of which is the average temperature for that day, calculated from the 30 years of data. For example, the first value in each grid point's array would be the average Jan 1 temperature, calculated from the 30 years (1981-2010) of Jan 1 temperature data for that grid point. My end goal is to use this new 30yrNormal array to calculate daily temperature anomalies.
So far I have only been able to calculate anomalies from one year's worth of data. The problem with this is that it takes the difference between the daily temperature and the average for the whole year, rather than the difference between the daily temperature and the 30-year average for that day:
import numpy as np
from netCDF4 import Dataset

file = 'air.sfc.2018.nc'
ncin = Dataset(file, 'r')
# put data into numpy arrays
lons = ncin.variables['lon'][:]
lats = ncin.variables['lat'][:]
lats1 = ncin.variables['lat'][:, 0]
temp = ncin.variables['air'][:]
ncin.close()
# time mean over the whole year (axis 0 is time)
AvgT = np.mean(temp[:, :, :], axis=0)
# compute anomalies by removing the time mean
T_anom = temp - AvgT
Data:
ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/
For the years 1981-2010
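For reference, here is a hedged sketch of the kind of calculation I am after, using xarray instead of raw netCDF4 (an assumption, as is the air.sfc.YYYY.nc naming). Note that grouping by day of year leaves day 366 with only leap-year samples:

import xarray as xr

# Assumed naming: one file per year, air.sfc.1981.nc ... air.sfc.2010.nc
files = [f"air.sfc.{year}.nc" for year in range(1981, 2011)]
ds = xr.open_mfdataset(files, combine="by_coords")

# 30-year mean for each day of the year, at every grid point
normal = ds["air"].groupby("time.dayofyear").mean("time")

# Daily anomalies relative to the 30-year normal
anom = ds["air"].groupby("time.dayofyear") - normal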
This is most easily solved using CDO.
You can use my package, nctoolkit (https://nctoolkit.readthedocs.io/en/latest/ & https://pypi.org/project/nctoolkit/) if you are working with Python on Linux. This uses CDO as a backend.
Assuming the 30 files are in a list called ff_list, the code below should work.
First, you would create the 30-year daily-mean climatology.
import nctoolkit as nc
# build the 30-year daily-mean climatology
mean_30 = nc.open_data(ff_list)
mean_30.merge_time()
mean_30.drop(month=2, day=29)  # remove leap days
mean_30.tmean("day")  # mean for each day of the year
mean_30.run()
Then you would subtract this from the daily figures to get the anomalies.
anom_30 = nc.open_data(ff_list)
anom_30.cdo_command("del29feb")  # remove leap days here too
anom_30.subtract(mean_30)  # daily value minus the day-of-year mean
anom_30.run()
This should give you the anomalies.
One issue is whether the files contain leap days, and how you want to handle them if they do. CDO has an undocumented command, del29feb, which I have used above.