Incorrect Parsing of dates in pandas dataframe - python

Hello, I am trying to change my dataframe dates into a format I can use to extract useful information.
The dataset comes with a 'week' feature in the form DD/MM/YY, as follows:
In [128]: df_train[['week', 'units_sold']]
Out[128]:
week units_sold
0 17/01/11 20
1 17/01/11 28
2 17/01/11 19
3 17/01/11 44
4 17/01/11 52
I have changed the dates as follows:
df_train['new_date'] = pd.to_datetime(df_train['week'])
new_date units_sold
0 2011-01-17 20.0
1 2011-01-17 28.0
2 2011-01-17 19.0
3 2011-01-17 44.0
4 2011-01-17 52.0
Using the 'new_date' feature I created, I did the following for some information extraction:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].apply(lambda x: x.quarter) #current quarter of the year
df_train['month'] = df_train['new_date'].apply(lambda x: x.month) #current month
df_train['year'] = df_train['new_date'].dt.year #current year
However, when reviewing my data I ran into some errors. For example, a certain date in my dataset is 07/02/11, which should translate to a month of 2, except my parsing shows that the month is 7, which I know is incorrect: see entry 3483
Out[127]:
week month
18 17/01/11 1
1173 24/01/11 1
2328 31/01/11 1
3483 07/02/11 7
4638 14/02/11 2
Can anyone tell me where I went wrong?
Any help is appreciated!

Use the dayfirst=True parameter:
df_train['new_date'] = pd.to_datetime(df_train['week'], dayfirst=True)
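Alternatively, since every value here is DD/MM/YY, you can pass an explicit format string, which is unambiguous and typically faster than letting pandas infer the format (a sketch equivalent to the line above):
df_train['new_date'] = pd.to_datetime(df_train['week'], format='%d/%m/%y')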
Then use the .dt accessor instead of apply for better performance, because apply loops under the hood:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].dt.quarter #current quarter of the year
df_train['month'] = df_train['new_date'].dt.month #current month
df_train['year'] = df_train['new_date'].dt.year
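Note that Series.dt.weekofyear was deprecated in pandas 1.1 and removed in 2.0; on recent versions, a sketch of the replacement is:
df_train['weekday'] = df_train['new_date'].dt.isocalendar().week # week of the year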


How do I convert the month and day in dataset to datetime so that I can index it?

End goal: I want to create a graph where the x-axis is the date and there are two y-axes, one for Ontario and one for CMB. Ultimately there would be 5 graphs (Ontario,2 vs CMB,2; Ontario,3 vs CMB,3; etc.).
However, my datasets have the date as "mm-dd" in the Datestamp column, stored as object type. Also, the two tables do not have all the same dates.
dfplot_ont.tail()
Datestamp Ontario,2 Ontario,3 Ontario,4 Ontario,5 Ontario,7
18 12-29 -0.664715 0.245738 0.668187 0.016819 -0.493384
19 12-30 0.491311 0.302230 1.140404 1.421685 1.552911
20 01-02 1.213827 0.471704 1.400124 1.599767 1.621120
21 01-03 1.502834 0.048018 0.927907 0.956694 1.052705
22 01-04 -1.965244 -2.917788 0.597355 0.234474 -0.857170
dfplot_cmb.tail()
Datestamp CMB,2 CMB,3 CMB,4 CMB,5 CMB,7
15 12-28 0.907092 0.937362 0.991568 1.030808 1.139708
16 12-29 0.900410 0.919994 0.992267 0.991359 1.034978
17 12-30 1.181259 1.193806 1.272700 1.283576 1.265860
18 01-03 0.751646 0.752037 0.681900 0.686982 0.600167
19 01-04 0.606714 0.532544 0.339825 0.282894 0.127186
I need to change this to datetime, but it seems like I need to include a year to do so. How do I code "if the month is 12, then the year is 2022, and if the month is 1, then the year is 2023"? I will also need to swap this out so that the year is always 2023 once there is data at the end of the year.
I have tried this, but it does not change Datestamp to datetime type:
dfplot_ont['Datestamp'] = pd.to_datetime(dfplot_ont['Datestamp'], format='%m-%d').dt.strftime('%m-%d')
I have also tried this, but then the index ends up not being mm-dd:
dfplot_ont = dfplot_ont.set_index(pd.to_datetime(dfplot_ont['Datestamp'], format='%MM-%dd'))
Datestamp Ontario,2 Ontario,3 Ontario,4 Ontario,5 Ontario,7
Datestamp
1900-01-22 00:12:00 12-22 0.708066 -0.149703 -0.724853 -1.200072 -0.356965
1900-01-23 00:12:00 12-23 -0.520212 0.415213 -1.362347 -1.140712 -0.970853
1900-01-26 00:12:00 12-26 -0.014450 0.612933 -1.149849 -0.952737 -0.925380
1900-01-27 00:12:00 12-27 0.202305 0.669425 -1.102627 -0.893376 -0.925380
1900-01-28 00:12:00 12-28 -0.953721 0.302230 -0.394301 -0.042542 0.302397
I tried this as well, but similar to the above, the Datestamp is not correct:
dfplot_cmb['Datestamp'] = pd.to_datetime(dfplot_cmb['Datestamp'], format='%M-%d')
dfplot_cmb.set_index('Datestamp', inplace=True)
dfplot_cmb.head()
CMB,2 CMB,3 CMB,4 CMB,5 CMB,7
Datestamp
1900-01-19 00:12:00 -1.559724 -1.663136 -1.719869 -1.771499 -1.778253
1900-01-20 00:12:00 -1.311374 -1.250774 -1.156484 -1.076946 -1.038540
1900-01-21 00:12:00 -1.220269 -1.156733 -1.106780 -1.077736 -1.057990
1900-01-22 00:12:00 -0.554371 -0.517907 -0.513658 -0.517146 -0.498735
1900-01-23 00:12:00 0.298617 0.252807 0.218531 0.167709 0.205619
How do I code "if the month is 12, then year is 2022 and if the month is 1, then year is 2023"?
First, create a function to implement the above logic:
import datetime

def create_dt(ds):
    mm = ds[:2]
    dd = ds[-2:]
    # Modify the logic accordingly if you have other months
    if mm == '01':
        dt = f'2023-{mm}-{dd}'
    else:
        dt = f'2022-{mm}-{dd}'
    return datetime.datetime.strptime(dt, '%Y-%m-%d').date()
Then apply this function to the Datestamp column to create a new column:
dfplot_ont['datetime']= dfplot_ont['Datestamp'].apply(create_dt)
Use the new column datetime as your x-axis.
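From there, a rough sketch of the dual-axis plot described in the question (assuming create_dt has been applied to both frames; the column choices are just examples):
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax1.plot(dfplot_ont['datetime'], dfplot_ont['Ontario,2'], label='Ontario,2')
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(dfplot_cmb['datetime'], dfplot_cmb['CMB,2'], color='tab:orange', label='CMB,2')
ax1.set_xlabel('Date')
plt.show()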
Here's a method that will add an offset for each year.
data = {'date': ['12-29', '12-30', '12-31', '01-02', '01-03', '01-04']}
df = pd.DataFrame(data)
Set a start year. In this case we'll start at 2022.
start_year = 2022
df['date'] = pd.to_datetime(df['date'].str.slice(0, 2) + '-'
                            + df['date'].str.slice(3, 5) + '-'
                            + str(start_year))
Next we need to create a date offset to increment each year accordingly. Whenever the current month is less than the previous month, we know we've started a new year. We can turn this into a Boolean and cumulatively sum it to get each year's offset.
df['offset'] = df['date'].dt.month.lt(df['date'].dt.month.shift()).cumsum()
Before applying cumsum(), the data looks like this; each True marks a year rollover, and cumulatively summing the Booleans turns them into per-row year offsets.
date offset
0 2022-12-29 False
1 2022-12-30 False
2 2022-12-31 False
3 2022-01-02 True
4 2022-01-03 False
5 2022-01-04 False
Finally, we add the offset to the date by year.
df['date'] = df['date'] + df['offset'].astype('timedelta64[Y]')
date offset
0 2022-12-29 0
1 2022-12-30 0
2 2022-12-31 0
3 2023-01-02 1
4 2023-01-03 1
5 2023-01-04 1
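Note: casting an integer column with astype('timedelta64[Y]') is no longer supported in pandas 2.x; a version-safe sketch of the same step uses DateOffset instead:
df['date'] = [d + pd.DateOffset(years=int(y)) for d, y in zip(df['date'], df['offset'])]  # add 'offset' years to each date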

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id': ['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
                   'Date': ['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
                   'Quality': [2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
                 Quality
Id   Date
A4G8 2016-1-1          2
     2016-1-15         4
     2016-1-30         6
P9N3 2017-2-12         1
     2017-2-28         5
     2017-3-10        10
     2019-1-1         10
C7R5 2018-6-1          2
L4U7 2019-8-6          2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below:
                 Quality Time From First Time To Prev
Id   Date
A4G8 2016-1-1          2          0 days      NA days
     2016-1-15         4         14 days      14 days
     2016-1-30         6         29 days      14 days
P9N3 2017-2-12         1          0 days      NA days
     2017-2-28         5         15 days      15 days
     2017-3-10        10         24 days       9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The dataframe is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck on how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob, and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the per-group minimum (GroupBy.transform('min')) from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
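Note: diff works in row order, so if the rows aren't already sorted by date within each Id, sort first (a minimal guard):
df = df.sort_values(['Id', 'Date'])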
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
    df.groupby(["Id"]).Date.first(),
    on="Id",
    how="left",
    suffixes=["", "_first"]
)
df["Time From First"] = df.Date - df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df

How to show the average sales for each year within ten years for a specific city in Pandas?

What would be the correct way to show the average sales volume in Carlisle for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
                   'RegionName': ['Carlisle'] * 18,
                   'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
sales_vol = carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning at '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
...        & (df["RegionName"] == "Carlisle")] \
...     .groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further: the only difference from @nocibambi's answer is the groupby parameter, particularly the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for all freq aliases and anchoring suffixes.
You can access year with the datetime accessor:
df[
    (df["RegionName"] == "Carlisle")
    & (df["Date"].dt.year >= 2010)
    & (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64

Create and append rows based on average of previous rows and condition columns

I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as main key). I'm trying to create a new column expected_balance by matching the date of previous months, calculating an average and using it as expected future value. I'll explain in detail now:
The dataset is generated by appending and parsing multiple JSON values; once I finish working on it, I get this:
date balance account day month year fdate
0 2018-04-13 470.57 SP014 13 4 2018 201804
1 2018-04-14 375.54 SP014 14 4 2018 201804
2 2018-04-15 375.54 SP014 15 4 2018 201804
3 2018-04-16 229.04 SP014 16 4 2018 201804
4 2018-04-17 216.62 SP014 17 4 2018 201804
... ... ... ... ... ... ... ...
414857 2019-02-24 381.26 KO012 24 2 2019 201902
414858 2019-02-25 181.26 KO012 25 2 2019 201902
414859 2019-02-26 160.82 KO012 26 2 2019 201902
414860 2019-02-27 0.82 KO012 27 2 2019 201902
414861 2019-02-28 109.50 KO012 28 2 2019 201902
Each account value has 365 values (a starting date when the information was obtained plus a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all values except the last 2 months of information, and test is those last 2 months (the last month is not necessarily full: if the last/max date value is 20-04-2019, then train runs from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage that:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in df_test, calculated by averaging all train values for that particular day of the month (this is the method requested, so I must follow the instructions).
Here is an example of the expected output/result. I'm only printing account AA000's values for a specific day of the last 2 train months (suppose these values always repeat for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of the same-day value across all months of df_train:
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
So then I can calculate an MSE for that prediction for that account.
First of all, I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd
import numpy as np

# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01', periods=n).values, 2),
                   'Account': np.repeat(['A', 'B'], n),
                   'Balance': range(2 * n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day==5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
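To carry this over to the question's train/test split, a sketch (assuming df_train and df_test with the lowercase account/day/balance columns shown in the question): compute the per-(account, day) means on the train data, then merge them onto the test rows.
avg = df_train.groupby(['account', 'day'])['balance'].mean().rename('exp_bal').reset_index()
df_test = df_test.merge(avg, on=['account', 'day'], how='left')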

Add Months to Data Frame using a period column

I'm looking to add a %Y%m%d date column to my dataframe using a period column that has integers 1-32, which represent monthly data points starting at a defined environment variable "odate" (e.g. if odate=20190531 then period 1 should be 20190531, period 2 should be 20190630, etc.)
I tried defining a dictionary with the period numbers in the column as the keys and the values being odate + MonthEnd(period - 1).
This works well enough; however, I want to improve the code to be flexible to changes in the number of periods.
Is there a function that will allow me to fill the date columns with the odate in period 1 and then subsequent month ends for subsequent periods?
example dataset:
odate=20190531
period value
1 5.5
2 5
4 6.2
3 5
5 40
11 5
desired dataset:
odate=20190531
period value date
1 5.5 2019-05-31
2 5 2019-06-30
4 6.2 2019-08-31
3 5 2019-07-31
5 40 2019-09-30
11 5 2020-03-31
You can use pd.date_range():
pd.date_range(start='2019-05-31', periods=100, freq='M')
You can change the total periods depending on what you need; freq='M' means month-end frequency.
Here is a list of Offset Aliases you can use for the freq parameter.
If you just want to add or subtract some period to a date, you can use pd.DateOffset:
odate = pd.Timestamp('20191031')
odate
>> Timestamp('2019-10-31 00:00:00')
odate - pd.DateOffset(months=4)
>> Timestamp('2019-06-30 00:00:00')
odate + pd.DateOffset(months=4)
>> Timestamp('2020-02-29 00:00:00')
To map the given period column to month ends:
odate = pd.Timestamp('20190531')
df['date'] = df.period.apply(lambda x: odate + pd.offsets.MonthEnd(x-1))
df
period value date
0 1 5.5 2019-05-31
1 2 5.0 2019-06-30
2 4 6.2 2019-08-31
3 3 5.0 2019-07-31
4 5 40.0 2019-09-30
5 11 5.0 2020-03-31
To improve performance, use a list comprehension:
df['date'] = [odate + pd.offsets.MonthEnd(period-1) for period in df.period]
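Or, a fully vectorized sketch combining the two ideas above (assumes periods start at 1):
month_ends = pd.date_range(start=odate, periods=df['period'].max(), freq='M')  # all month ends from odate
df['date'] = month_ends[df['period'] - 1].values  # period k maps to the (k-1)th month end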
