Python: Date conversion to year-weeknumber, issue at switch of year - python

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, e.g. 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates at the start of a new year whose week number still belongs to the last week of the previous year.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!

Replace df.date.dt.year with this:
(df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1)))
Basically, this subtracts 1 from the year value whenever the week number is greater than 50 and the month is January.
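An alternative sketch (assuming pandas >= 1.1): dt.isocalendar() returns the ISO year already adjusted at year boundaries, so no manual subtraction is needed.

```python
import pandas as pd

# made-up sample covering the problematic year boundary
df = pd.DataFrame({"date": pd.to_datetime(["2017-01-01 02:11:27",
                                           "2017-01-05 03:44:00"])})

# isocalendar() gives ISO year/week/day; the year column already
# belongs to the ISO week, so 2017-01-01 maps to 2016-52
iso = df["date"].dt.isocalendar()
df["WEEK_NUMBER"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)
```

This yields "2016-52" for the first row and "2017-1" for the second, matching the expected output from the question.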

Related

getting previous week highs and lows in pandas dataframe using 30 min data

I have a dataset where the index is based on 30-minute data from Monday to Friday. There might be some missing dates (possibly because of holidays), but I would like to find the highest value from column high and the lowest from column low for each past week. For example, when calculating for today, the previous week's high and low are marked in yellow in the attached image.
I tried using rolling and resampling, but somehow it is not working. Can anyone help?
You really should add sample data to your question (by that I mean a piece of code/text that can easily be used to create a dataframe for illustrating how the proposed solution works).
Here's a suggestion. With df your dataframe, and column datetime with datetimes (and not strings):
df["week"] = (
    df["datetime"].dt.isocalendar().year.astype(str)
    + df["datetime"].dt.isocalendar().week.astype(str).str.zfill(2)  # zero-pad so weeks sort chronologically
)
mask = df["high"] == df.groupby("week")["high"].transform("max")
df = df.merge(
    df[mask].rename(columns={"low": "high_low"})
    .groupby("week").agg({"high_low": "min"}).shift(),
    on="week", how="left"
).drop(columns="week")
Add a week column to df (year + week) for grouping along weeks.
Extract the rows with the weekly maximum highs by mask (there could be more than one for a week).
Build a corresponding dataframe with the weekly minimum of the lows corresponding to the weekly maximum highs (column named high_low), shift it once to get the value from the previous week, and .merge it to df.
If column datetime doesn't contain datetimes:
df["datetime"] = pd.to_datetime(df["datetime"])
If I have understood correctly, the solution should be:
get the week number from the date
group by the week number and fetch the max and min values
group by the week and fetch the max date to get the last date of each week
merge all the dataframes into one on the date key
Once these steps are done, you can apply any formatting as required.
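The steps above can be sketched as follows, assuming 30-minute rows indexed by datetime with high/low columns (the sample data here is made up):

```python
import numpy as np
import pandas as pd

# hypothetical 30-minute weekday data covering two ISO weeks
idx = pd.date_range("2023-01-02", "2023-01-13 23:30", freq="30min")
idx = idx[idx.weekday < 5]  # keep Monday..Friday only
rng = np.random.default_rng(0)
df = pd.DataFrame({"high": rng.random(len(idx)) + 10,
                   "low": rng.random(len(idx))}, index=idx)

# 1. week number from the date
df["week"] = df.index.isocalendar().week.values

# 2. weekly max of 'high' and min of 'low'
weekly = df.groupby("week").agg(prev_high=("high", "max"),
                                prev_low=("low", "min"))

# 3./4. shift by one week and merge back, so every row in a given
# week sees the *previous* week's high/low
df = df.merge(weekly.shift(), left_on="week", right_index=True, how="left")
```

The column names prev_high/prev_low are placeholders; the first week naturally gets NaN since it has no previous week.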

Replacing dates that fall on weekends to next business day in dataframe

I have a dataframe with a bunch of dates in it. I would like to check whether each entry is a weekday or a weekend; if it is a weekend, I would like to move the date forward to the next weekday. What is the most pythonic way of doing this? I was thinking about using a list comprehension, something like
days = pd.date_range(start='1/1/2020', end='1/08/2020')
dates = pd.DataFrame(days,columns=['dates'])
dates['dates'] = [day+pd.DateOffset(days=1) if day.weekday() >4 else day for day in dates['dates']]
How could I adjust the code to apply day + pd.DateOffset(days=1) for Sunday and day + pd.DateOffset(days=2) for Saturday, and get an updated column with the shifted dates? Running the code twice with +1 would work but is certainly not pretty.
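One way to cover both cases in a single expression (a sketch, not from the original post): since weekday() gives Mon=0 .. Sat=5, Sun=6, the offset 7 - weekday() is exactly +2 for Saturday and +1 for Sunday.

```python
import pandas as pd

days = pd.date_range(start="1/1/2020", end="1/08/2020")
dates = pd.DataFrame(days, columns=["dates"])

# weekday(): Mon=0 .. Sat=5, Sun=6, so 7 - weekday() rolls Saturday
# forward by 2 days and Sunday by 1 day, both landing on Monday
dates["dates"] = [
    day + pd.DateOffset(days=7 - day.weekday()) if day.weekday() > 4 else day
    for day in dates["dates"]
]
```

For a fully vectorized alternative, numpy.busday_offset with roll='forward' performs the same roll to the next business day (after converting the dates to datetime64[D]).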

Pandas pivot table function values into wrong rows

I'm making a pivot table from a CSV file (cl_total_data.csv) using pandas pd.pivot_table() and need to fix values that end up in the wrong rows.
[Original CSV File]
The error occurs when a year has 53 weeks (i.e., 53 values) instead of 52: the first value of a 53-week year ends up as the last value of that year's column in the pivot table.
[Pivot Table with wrong values top]
[Pivot Table with wrong values bottom]
[Original CSV 2021 w/ 53 values]
The last value in the pivot table's 2021 column, row 53 (1123544), is actually the first value of 2021 (2021-01-01, 1123544) in the original CSV.
I figured out how to fix this in the pivot table after making it. First, find the columns with 53 values:
cl_total_p.columns[~cl_total_p.isnull().any()]
Then take the values from the original CSV files to its corresponding year and replace the values in the pivot table
cl_total_p[2021] = cl_total_data.loc['2021'].Quantity.values
My problem is:
I can't figure out what I'm doing wrong in the pivot table call that causes this misplacement of values. Is there a better way to write it?
My manual fix takes a lot of time, especially when I'm using 10+ CSV files and have to repair every misplaced column with 53 weeks. Is there a for loop I can write to go through all the columns with 53 weeks and replace them with their corresponding year's values?
I tried
import numpy
import pandas
year_range = np.arange(1982,2023)
week_range = np.arange(54)
for i in year_range:
    for y in week_range:
        cl_total_p[i] = cl_total_data.loc['y'].Quantity.values
But I get an error :( How can I fix the pivot table value misplacement, and/or write a for loop that takes the original values and replaces them in the pivot table?
I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
The problem here lies in the definition of the ISO week number. Let's look at this line of code:
cl_total_p = pd.pivot_table(cl_total_data, index = cl_total_data.index.isocalendar().week, columns = cl_total_data.index.year, values = 'Quantity')
This line uses the ISO week number to determine the row position, and the non-ISO year to determine the column position.
The ISO week number counts the weeks of a year, where the first week is defined as the one with a majority of its days in that year. This means the first ISO week does not necessarily start on the first day of the year. For that reason, the ISO week number is used alongside the ISO year number, which assigns the days before the first week to the previous year.
For that reason, January 1st, 2021 was not in the first week of 2021 in the ISO system; it was in the 53rd week of 2020. When you mix the ISO week with the non-ISO year, you get the result that it was the 53rd week of 2021, which is a year off.
Here's an example of how to show this with the linux program date:
$ date -d "Jan 1 2021" "+%G-%V"
2020-53
You have a few options:
Use both the ISO week and the ISO year for consistency. The isocalendar() function can provide both the ISO week and ISO year.
If you don't want the ISO system, you can come up with your own definition of "week" which avoids having the year's first day belong to the previous year. One approach you could take is to take the day of year, divide by seven, and round down. Unfortunately, this does mean that the week will start on a different day each year.
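The first option can be sketched with a small hypothetical series crossing the 2020/2021 boundary (the quantities are made up):

```python
import pandas as pd

# three weekly values around the year boundary (made-up numbers)
idx = pd.to_datetime(["2020-12-25", "2021-01-01", "2021-01-08"])
df = pd.DataFrame({"Quantity": [100, 200, 300]}, index=idx)

# use the ISO year *and* ISO week together, so 2021-01-01 lands in
# (year 2020, week 53) instead of (year 2021, week 53)
iso = df.index.isocalendar()
pivot = pd.pivot_table(df, index=iso.week, columns=iso.year,
                       values="Quantity")
```

Here the 2021-01-01 value appears at row 53 of the 2020 column, and the 2021 column starts cleanly at week 1.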

pandas mis-identiying month as day and day as month for a date-time column

I have a date-time column as 'yyyy-mm-dd' in string format. I changed it to datetime format using pd.to_datetime. Now when I try to get the month from this date using df['col'].iloc[0].month, dd is output instead of mm, and similarly df['col'].iloc[0].day outputs mm instead of dd. I tried adding new columns: year containing yyyy, month containing the df['col'].iloc[0].day output, and day containing the df['col'].iloc[0].month output, and then combining them to get back the date. Is there an easier way to swap the days and months?
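If the strings really encode the day before the month after the year, an explicit format string in pd.to_datetime avoids any swapping afterwards. A sketch with made-up values:

```python
import pandas as pd

# hypothetical strings that are actually "yyyy-dd-mm";
# %d and %m are swapped in the format to parse them correctly
s = pd.Series(["2021-05-01", "2021-12-03"])
parsed = pd.to_datetime(s, format="%Y-%d-%m")
```

With an explicit format there is no ambiguity, so no column surgery on year/month/day is needed afterwards.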

Get sum of business days in dataframe python with resample

I have a time series where I want to sum the business-day values for each week. A snapshot of the dataframe (df) used is shown below. Note that 2017-06-01 is a Friday, and hence the missing days represent the weekend.
I use resample to group the data by week, with the aim of getting the sum. When I apply this function, however, I get results I can't justify. I was expecting the first row to be 0, the sum of the values contained in the first week, then 15 for the next week, and so on...
df_resampled = df.resample('W', label='left').sum()
df_resampled.head()
Can someone explain what I am missing, since it seems I have not understood the resample function correctly?
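What is likely happening here is the anchoring of 'W': it is an alias for 'W-SUN', so each weekly bin ends on a Sunday, and with label='left' the row is stamped with the Sunday *before* the bin rather than the first date inside it. A sketch with made-up values:

```python
import pandas as pd

# business days only (weekend rows absent), hypothetical values 0..6
idx = pd.bdate_range("2017-06-01", "2017-06-09")
df = pd.DataFrame({"val": range(len(idx))}, index=idx)

# 'W' == 'W-SUN': each bin covers (Mon..Sun]; label='left' labels the
# bin with the preceding Sunday, which can look like an unexpected date
weekly = df.resample("W", label="left").sum()
```

Here the first bin (labeled 2017-05-28) sums only the Thursday and Friday values, which may explain sums that look "wrong" when read against the first date in the data.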
