I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where, for month 0, the date is start_dt minus one month, and for each subsequent month the date advances by one month.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 from the month column, add start_dt converted to a monthly period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, convert the column to month offsets (after subtracting 1) and add the datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
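As a cross-check against the question's own numbers, here is a minimal sketch that applies the same month arithmetic to start_dt = 201901. It is not the code from the answer above: it builds each date with a plain list comprehension over pd.DateOffset, and the small frame is reconstructed from the question.

```python
import pandas as pd

# Frame and start value taken from the question.
df = pd.DataFrame({'month': [0, 1, 2, 3, 4]})
start_dt = 201901

# Parse YYYYMM into a Timestamp for the first day of that month.
start = pd.to_datetime(str(start_dt), format='%Y%m')

# Month 0 maps to start - 1 month, month 1 to start, and so on.
df['date'] = [start + pd.DateOffset(months=int(m) - 1) for m in df['month']]

print(df)
```

Month 0 lands on 2018-12-01 and month 4 on 2019-04-01, which matches the table requested in the question.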
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, each iteration of the for loop offsets the initial datetime by n months, with n starting at -1.
I'm receiving the following error:
ValueError: time data '2013' does not match format '%Y%m%d' (match)
Here is the section of code where the error is occurring:
# Convert periodEndDate from string to datetime to epoch timestamp
df['periodEndDate'] = df['periodEndDate'].apply(lambda x: pd.to_datetime(int(x), format='%Y%m%d').timestamp())
df['periodEndDate'] = df['periodEndDate'].astype(int)
df['periodTypeId'] = 1
return df.to_dict('records')
output:
0 2013
1 2012
2 2015
3 20111231
4 2016
5 2014
6 2017
7 2018
I understand that the code fails because '2013' does not match the format. Is it possible to insert a day and month to resolve this issue?
Don't specify the format. Let pandas infer it.
df['periodEndDate'] = pd.to_datetime(df["periodEndDate"])
>>> df['periodEndDate']
0 2013-01-01
1 2012-01-01
2 2015-01-01
3 2011-12-31
4 2016-01-01
5 2014-01-01
6 2017-01-01
7 2018-01-01
Name: periodEndDate, dtype: datetime64[ns]
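Format inference can behave differently across pandas versions when a column mixes layouts, and integer values passed without a format are read as epoch offsets rather than as years. If that is a concern, one explicit alternative is to do what the question suggests and insert a month and day yourself. The sketch below assumes the column holds only 4-digit years or 8-digit YYYYMMDD values, and that January 1 is the wanted default; both are assumptions, not something the question confirms.

```python
import pandas as pd

# Sample values copied from the question's output.
df = pd.DataFrame({'periodEndDate': ['2013', '2012', '2015', '20111231', '2016']})

# Work on strings so that lengths can be checked.
raw = df['periodEndDate'].astype(str)

# Keep 8-character values as they are; append '0101' (assumed default) to bare years.
padded = raw.where(raw.str.len() == 8, raw + '0101')

# Every value now matches a single explicit format.
df['periodEndDate'] = pd.to_datetime(padded, format='%Y%m%d')

print(df)
```

Swapping '0101' for '1231' would treat a bare year as a year-end date instead.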
I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
I am still in the dark about the format of your date column, so I will assume that the Date column holds strings and the Hr column holds int64 values. To create the TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis=1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
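The same result can be obtained without a row-wise apply, since both to_datetime and to_timedelta accept whole columns. This is a sketch under the same assumptions as above (Date as month/day/year strings, Hr as integers); the explicit format string is an assumption about the data.

```python
import pandas as pd

# A few rows shaped like the frame above.
df = pd.DataFrame({
    'Date': ['12/01/2010', '12/01/2010', '12/02/2010'],
    'Hr': [1, 2, 1],
})

# Parse the whole date column at once and add the hours as timedeltas.
df['TimeStamp'] = (
    pd.to_datetime(df['Date'], format='%m/%d/%Y')
    + pd.to_timedelta(df['Hr'], unit='h')
)

# Use the combined timestamp as the index.
df = df.set_index('TimeStamp')

print(df)
```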
I need to sum a column "qtd" taking into account the last 6 months of a reference date.
prod date qtd sum
proda 2018-01-01 2 2
proda 2018-02-01 2 4
proda 2018-04-01 1 5
proda 2018-05-01 4 9
proda 2018-06-01 2 11
proda 2018-07-01 1 11
I need to figure out how to calculate the column "sum".
Note that I don't always have every month on my dataframe, for example I don't have March.
Given a reference date (date) I need to calculate 6 months back and sum the column "qtd"
Thanks!
The cumsum() method returns the cumulative sum of a given column. It is a pandas Series method (NumPy offers a function with the same name).
df['sum'] = df['qtd'].cumsum()
Ok. In case you want to extract only the slice and calc cumsum(), you can use:
start_date = '2018-01-01'
end_date = '2018-05-01'
between = (df['date'] >= start_date) & (df['date'] <= end_date)
df2 = df[between].copy()  # copy so the new column is not assigned to a slice of df
df2['sum'] = df2['qtd'].cumsum()
df2
prod date qtd sum
0 proda 2018-01-01 2 2
1 proda 2018-02-01 2 4
2 proda 2018-04-01 1 5
3 proda 2018-05-01 4 9
Or if you want to calculate it only between specific dates and add it to your data frame, you can use:
start_date = '2/1/18'
end_date = '6/1/18'
def total(start, end, df):
    # Note: the dates here are strings, so the comparisons below are string comparisons
    sum_col = []
    for i in range(df.shape[0]): # Loop over all rows
        if df['date'][i] < start:
            # If before start date, NA (you could change to 0 too)
            sum_col.append('NaN')
        elif df['date'][i] == start: # start to sum
            sum_col.append(df['qtd'][i])
        # sum between your start and end dates
        elif (df['date'][i] > start) and (df['date'][i] <= end):
            sum_col.append(df['qtd'][i]+sum_col[i-1])
        # after end date, it just adds NAs. You can change to repeat the last total
        elif df['date'][i] > end:
            sum_col.append('NaN')
    return sum_col
df['sum'] = total(start_date, end_date, df)
df
output:
prod date qtd sum
0 proda 1/1/18 2 NaN
1 proda 2/1/18 2 2
2 proda 4/1/18 1 3
3 proda 5/1/18 4 7
4 proda 6/1/18 2 9
5 proda 7/1/18 1 NaN
Hope this helps.
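A cumulative sum never drops old months, so it differs from a trailing six-month total once the data spans more than six months. The sketch below is one possible reading of the question, not part of the answer above: for each row it sums qtd over rows of the same prod whose date falls in the window (date - 6 months, date]. The helper name trailing_sum and the choice of an open lower bound are assumptions.

```python
import pandas as pd

# Data copied from the question; March is missing on purpose.
df = pd.DataFrame({
    'prod': ['proda'] * 6,
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-04-01',
                            '2018-05-01', '2018-06-01', '2018-07-01']),
    'qtd': [2, 2, 1, 4, 2, 1],
})

def trailing_sum(frame, months=6):
    """Sum qtd per row over the previous `months` months (hypothetical helper)."""
    totals = []
    for prod, date in zip(frame['prod'], frame['date']):
        window_start = date - pd.DateOffset(months=months)
        in_window = (
            (frame['prod'] == prod)
            & (frame['date'] > window_start)   # assumed: lower bound excluded
            & (frame['date'] <= date)
        )
        totals.append(int(frame.loc[in_window, 'qtd'].sum()))
    return totals

df['sum'] = trailing_sum(df)
print(df)
```

With this boundary the totals are 2, 4, 5, 9, 11, 10. The last value is 10 rather than the 11 shown in the question, so the intended window may be defined differently. The loop is quadratic in the number of rows, which is fine for small frames but slow for large ones.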
I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar six-column dataframe with a DatetimeIndex covering 2019.
You can derive three additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
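Once both frames carry these key columns, the join itself can be sketched as follows. The frames, column names (temp, load_2018) and the helper with_keys are made up for illustration, and a left merge is assumed so that every 2019 row is kept.

```python
import pandas as pd

# Tiny stand-ins for the 2019 meteo data and the 2018 consumption data.
meteo = pd.DataFrame(
    {'temp': [1.5, 2.0, 2.5]},
    index=pd.to_datetime(['2019-01-01 00:00', '2019-01-01 01:00', '2019-01-01 02:00']),
)
power_2018 = pd.DataFrame(
    {'load_2018': [10.0, 20.0, 30.0]},
    index=pd.to_datetime(['2018-01-01 00:00', '2018-01-01 01:00', '2018-01-01 02:00']),
)

keys = ['month', 'day', 'hour']

def with_keys(frame):
    """Move the index into a column and add month/day/hour keys (hypothetical helper)."""
    out = frame.rename_axis('timestamp').reset_index()
    out['month'] = out['timestamp'].dt.month
    out['day'] = out['timestamp'].dt.day
    out['hour'] = out['timestamp'].dt.hour
    return out

left = with_keys(meteo)
# Drop the 2018 timestamps so that only the keys and the values are merged in.
right = with_keys(power_2018).drop(columns='timestamp')

merged = left.merge(right, on=keys, how='left').set_index('timestamp')
print(merged)
```

The 2017 data can be merged the same way under a different column name.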
I'm new to Python and I want to aggregate (groupby) ID's in my first column.
The values in the second column are timestamps (datetime format), and by aggregating the IDs I want to get the average difference between the timestamps (in days) per ID. My table looks like df1 and I want something like df2, but since I'm an absolute beginner, I have no idea how to do this.
import pandas as pd
import numpy as np
from datetime import datetime
In[1]:
# df1
ID = np.array([1,1,1,2,2,3])
Timestamp = np.array([
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-08 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-03-15 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-02-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S")])
df = pd.DataFrame({'ID': ID, 'Timestamp': Timestamp})
Out[1]:
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
In[2]:
#df2
ID = np.array([1,2,3])
Avg_Difference = np.array([7, 1, "nan"])
df2 = pd.DataFrame({'ID': ID, 'Avg_Difference': Avg_Difference})
Out[2]:
ID Avg_Difference
0 1 7
1 2 1
2 3 nan
You could do something like this:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
In your case, it looks like:
>>> df
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
>>> df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Name: Timestamp, dtype: timedelta64[ns]
If you want it as a dataframe with the column named Avg_Difference, just add to_frame at the end:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).to_frame('Avg_Difference')
Avg_Difference
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Edit: Based on your comment, if you want to drop the time component and keep only the whole number of days, you can do the following:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).dt.days.to_frame('Avg_Difference')
Avg_Difference
ID
1 36.0
2 31.0
3 NaN
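Note that .dt.days truncates to whole days, so 36 days 12:00:00 becomes 36. If fractional days are wanted, one alternative (a sketch, not the approach used above) is to compute the per-ID gaps first, convert them to days as floats, and then average them. The variable names gap_days and avg are chosen for illustration.

```python
import pandas as pd

# Same data as df1 in the question.
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Timestamp': pd.to_datetime([
        '2018-01-01 18:07:02', '2018-01-08 18:07:02', '2018-03-15 18:07:02',
        '2018-01-01 18:07:02', '2018-02-01 18:07:02', '2018-01-01 18:07:02',
    ]),
})

# Gap to the previous timestamp within each ID, expressed in days as a float.
gap_days = df.groupby('ID')['Timestamp'].diff().dt.total_seconds() / 86400

# Average gap per ID; an ID with a single row has no gaps and yields NaN.
avg = gap_days.groupby(df['ID']).mean().to_frame('Avg_Difference')

print(avg)
```

For ID 1 the gaps are 7 and 66 days, so the average is 36.5.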