Pandas pivot table function puts values into wrong rows - python

I'm making a pivot table from a CSV file (cl_total_data.csv) using pandas' pd.pivot_table() and need to fix values that end up in the wrong rows.
[Original CSV File]
The error occurs when a year has 53 weeks (i.e. 53 values) instead of 52: the first value of the 53-week year shows up as the last value of that year's column in the pivot table.
[Pivot Table with wrong values top]
[Pivot Table with wrong values bottom]
[Original CSV 2021 w/ 53 values]
The last value in the pivot table's 2021 column, at row 53 (1123544), is actually the first value of the year, for 2021-01-01 (1123544), in the original CSV.
I figured out how to fix this in the pivot table after making it.
First, I find the columns with 53 values:
cl_total_p.columns[~cl_total_p.isnull().any()]  # columns with no NaN, i.e. all 53 rows filled
Then I take the values for the corresponding year from the original CSV and replace the values in the pivot table:
cl_total_p[2021] = cl_total_data.loc['2021'].Quantity.values  # overwrite the 2021 column with the raw values in date order
My problem is:
I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
My manual solution takes a lot of time, especially when I'm working with 10+ CSV files and have to fix every misplacement in the 53-week columns. Is there a for loop I can write to go through all the columns with 53 weeks and replace them with their corresponding year's values?
I tried
import numpy as np
import pandas as pd

year_range = np.arange(1982, 2023)
week_range = np.arange(54)
for i in year_range:
    for y in week_range:
        cl_total_p[i] = cl_total_data.loc['y'].Quantity.values
But I get an error :( How can I fix the pivot table value misplacement, and/or write a for loop that takes the original values and puts them back in the pivot table?

I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
The problem here lies in the definition of the ISO week number. Let's look at this line of code:
cl_total_p = pd.pivot_table(cl_total_data, index = cl_total_data.index.isocalendar().week, columns = cl_total_data.index.year, values = 'Quantity')
This line uses the ISO week number to determine the row position, and the non-ISO year to determine the column position.
ISO week 1 is defined as the first week with a majority of its days (four or more) in the new year. This means the start of week 1 does not necessarily line up with the first day of the year. For that reason, the ISO week number is meant to be used alongside the ISO year number, under which the days before week 1 belong to the previous year.
Consequently, January 1st, 2021 was not in the first week of 2021 in the ISO system; it was in the 53rd week of 2020. When you mix the ISO week with the non-ISO year, you get the result that it was the 53rd week of 2021, a date which is a year off.
Here's how to confirm this with the Linux date program:
$ date -d "Jan 1 2021" "+%G-%V"
2020-53
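The same check in Python's standard library:
import datetime
# isocalendar() returns the ISO year, ISO week, and ISO weekday.
print(datetime.date(2021, 1, 1).isocalendar())
# -> ISO year 2020, week 53, weekday 5 (Friday)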
You have a few options:
Use both the ISO week and the ISO year for consistency. The isocalendar() function can provide both the ISO week and the ISO year (see the sketch after these options).
If you don't want the ISO system, you can come up with your own definition of "week" which avoids having the year's first day belong to the previous year. One approach you could take is to take the day of year, divide by seven, and round down. Unfortunately, this does mean that the week will start on a different day each year.
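For example, a minimal sketch of both options, reusing the names from your own pivot call (cl_total_data with a DatetimeIndex and a Quantity column):
import pandas as pd

# Option 1: take both the row and the column labels from the ISO calendar, so
# week 53 of ISO year 2020 lands in the 2020 column instead of spilling into 2021.
iso = cl_total_data.index.isocalendar()  # DataFrame with year, week, day columns
cl_total_p = pd.pivot_table(cl_total_data, index=iso.week, columns=iso.year, values='Quantity')

# Option 2: a home-made week number counted from the start of the year
# (day of year divided by seven, rounded down), paired with the plain year.
own_week = cl_total_data.index.dayofyear // 7
cl_total_p2 = pd.pivot_table(cl_total_data, index=own_week, columns=cl_total_data.index.year, values='Quantity')
Either way, every value stays attached to the year it came from, so no column needs patching afterwards.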

Related

Looking to find the sum of a unique member's payment based on whether some dates fall in between a certain time in python

This is my first time asking on the community, although I have used the website for help extensively in the past. I was not able to find a solution to this specific problem, and I am fairly amateur at Python, so I am having a hard time putting the logic into code, although I think the logic itself is clear enough. I am using Python via Google Colab for this and have shared a Google Sheet with data at the end.
In my scenario, we have a start month, a length of time, and a payout month. The end month can be calculated from the length. A person can be part of multiple groups and thus can have multiple start, end, and payout months.
The goal is to find how much a member is expected to have paid as of today.
E.g. a group begins in Jan 2020, is 10 months long, and will end in Oct 2020. The monthly contribution is 5k. The payout month is, let's say, Mar 2020. While we technically should be getting 10 payments (10-month group), we will expect only 9 payments, i.e. 45k, because when the payout month comes around, the member is not expected to pay for that month. If the group instead began in Dec 2020 and was 10 months long, then as of today we would only expect 5 payments (Dec to Apr '21).
These scenarios get complicated when, for example, a member is part of 3 groups, so 3 start dates, 3 end dates, 3 payout dates, and likely 3 different instalment amounts. Let's say the start dates are Jan '20, Feb '20, Mar '20, and all groups are 10 months long. Let's also say there is a payout in Apr '20. In Apr '20 all the groups will be active (the end month has not been reached yet), so in Apr '20 (the payout month) we will expect no payments from any of the groups.
Meaning that, if there are 3 groups and a payout falls between any group's start and end month, then we will not expect a payment for that group in that month. If there are two payouts that fall between the start and end months of the groups, then we will not expect 6 payments for that month, 2 for each group, and so on. If, say, there are 3 groups and 1 payout falls between the dates of only 2 groups, then we will not expect instalments for only those two groups (whatever the instalment is for those groups).
The following google sheet has some sample data.
The Group ID column is entirely unique and will have no dups (you can think of it as an invoice, since all invoices are unique). The Member Code column can have duplicates, since a member can have more than one group. Do not worry about the days in the dates; what matters is the month and year. We have the start month, group length, and payout month. We also have how much money is owed monthly by a member for that group.
https://docs.google.com/spreadsheets/d/1nAXlifIQdYiN1MWTv7vs2FqbFu2v6ykCzQjrJNPTBWI/edit#gid=0
any help or advice would be great.
EDITED -> I have tried the following but got an error (I coded the months, i.e. Jan 2020 = 1, Feb 2020 = 2 and so on, so I don't have to mess around with dates):
deal_list = df['Group ID'].tolist()

def instalment(deal_list):
    for member in df['Member Code'].unique():
        if df['Coded Payout Month'] >= df['Coded Start Month'] and df['Coded Payout Month'] <= df['Coded End Month']:
            count_months = count_months + 1
    return count_months * df['Instalment']

instalment(deal_list)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
EDITED - I have also tried the following just now (took help from Pandas: Groupby and iterate with conditionals within groups?). It sort of worked, in that it gave me a count of 1 for each row. I was trying to get the number of times each payout month appears within the dates of a group.
grouped = df.groupby('Member Code')
for g_idx, group in grouped:
    for r_idx, row in group.iterrows():
        if (((row['Coded Payout Month'] >= group['Coded Start Month']).any())
                & (row['Coded Payout Month'] <= group['Coded End Month']).any()):
            df.loc[r_idx, 'payout_cut'] =+ 1
print(df)
I found a way around it. Essentially, rather than trying to iterate through all the rows, I transformed my data into long form first in Google Sheets via transpose and filter (I filtered for all payout months for a member and transposed the results into the rows). I then pushed that into Colab and, through pd.melt, transformed the data back into unique rows per deal with the additional payouts as required. Then running the condition was simple enough, and finally I summed all the true values.
I can explain a bit more if anyone needs.
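A rough sketch of that melt step, with hypothetical column names Payout 1 through Payout 3 standing in for whatever the transpose in Sheets produced:
import pandas as pd

# Wide layout assumed: one row per deal, payout months spread across
# the hypothetical Payout 1..Payout 3 columns after the Sheets transpose/filter.
long_df = pd.melt(
    df,
    id_vars=['Group ID', 'Member Code', 'Coded Start Month', 'Coded End Month', 'Instalment'],
    value_vars=['Payout 1', 'Payout 2', 'Payout 3'],
    value_name='Coded Payout Month',
)

# A payout that falls inside a group's window means one skipped instalment.
hit = long_df['Coded Payout Month'].between(long_df['Coded Start Month'], long_df['Coded End Month'])
skipped = long_df[hit].groupby('Group ID')['Instalment'].sum()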
I took inspiration from here:
https://youtu.be/pKvWD0f18Pc

pandas create a column to compare a value with one week ago

I have a pandas dataframe whose index column is time with hourly precision. I want to create a new column that compares the value of the "Sales" column at each hour with the value at the exact same time one week ago.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how I can take advantage of the datetime format of the index. I mean, are there any alternatives to using shift(7*24) when the index is a datetime index?
Try something like:
df['Sales'].shift(7,freq='D')
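For instance, a self-contained sketch with made-up hourly data; the point of the freq-based shift is that it moves the index by calendar time, so the subtraction lines up on actual timestamps even when some hours are missing from the data:
import pandas as pd

idx = pd.date_range('2023-01-01', periods=24 * 21, freq='h')  # three weeks, hourly
df = pd.DataFrame({'Sales': range(len(idx))}, index=idx)

# shift(7, freq='D') relabels the index 7 days later instead of moving values
# by row count, so subtraction aligns each hour with the same hour a week earlier.
df['compare'] = df['Sales'] - df['Sales'].shift(7, freq='D')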

Get sum of business days in dataframe python with resample

I have a time series where I want to get the sum of the business-day values for each week. A snapshot of the dataframe (df) used is shown below; note that 2017-06-01 is a Friday, and hence the missing days represent the weekend.
I use resample to group the data by week, and my aim is to get the sum. When I apply this function, however, I get results I can't justify. I was expecting the first row to be 0, the sum of the values contained in the first week, then 15 for the next week, etc.
df_resampled = df.resample('W', label='left').sum()
df_resampled.head()
Can someone explain what I am missing, since it seems I have not understood the resampling function correctly?
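For reference, a tiny reproducible version of the setup (made-up values) that makes the binning visible: the 'W' rule closes each bin on a Sunday, and label='left' stamps the bin with the previous Sunday, so the first labeled row can hold a partial week rather than zero:
import pandas as pd

idx = pd.bdate_range('2017-06-01', '2017-06-14')  # business days only, weekends missing
df = pd.DataFrame({'value': 1}, index=idx)

# 'W' means weeks ending on Sunday; label='left' uses each bin's left edge.
print(df.resample('W', label='left').sum())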

Extract future timeseries data and join on past timeseries that are 12 hours apart?

I am in a data science course and my instructor isn't very strong in python.
Use a shift function to pull prices by 12 hours (aligning prices 12 hours in the future with a row's current prices). Then create a new column populated with this info.
So I should have my index, column 1, and a new column.
I have tried a few different ways: extracting the 12 hours into a list and merging, using .slice, and creating a function.
https://imgur.com/a/AYaM1Ye
This seemed to work:
sliced = currency[currency.index.min():currency.index.max()]
# Move the datetime values forward by 12 hours
shifted = sliced.shift(periods=1, freq='12H')
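For comparison, a value-based shift gives a similar alignment when the rows are strictly hourly; a sketch with made-up data and a hypothetical price column:
import pandas as pd

idx = pd.date_range('2023-01-01', periods=48, freq='h')
currency = pd.DataFrame({'price': range(48)}, index=idx)

# shift(-12) pulls the value from 12 rows (12 hours) ahead into the current
# row, i.e. it aligns each row with the price 12 hours in the future.
currency['price_12h_ahead'] = currency['price'].shift(-12)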

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates that are in a new year yet whose week number is still the last week of the previous year.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Here df['date'] is a very large column of dates and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output of my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and week numbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year with this:
(df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1)))
Basically, it subtracts 1 from the year value when the week number is greater than 50 and the month is January.
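Put together, a runnable sketch of that fix (using dt.isocalendar().week, since Series.dt.week has been removed from recent pandas):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01 02:11:27', '2017-01-05 03:44:00'])})
week = df.date.dt.isocalendar().week
# Subtracting the boolean subtracts 1 exactly for January dates in week >50.
year = df.date.dt.year - ((week > 50) & (df.date.dt.month == 1))
df['WEEK_NUMBER'] = year.astype(str).str.cat(week.astype(str), sep='-')
# -> '2016-52' for the New Year edge case, '2017-1' for the other date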
