I have an Excel file in which the date column is not stored as a date type, so for the date 2018.10 the trailing zero has been dropped and it reads 2018.1.
date
2018.12
2018.11
2018.1
2018.9
2018.8
2018.7
2018.6
2018.5
2018.4
2018.3
2018.2
2018.1
How can I convert this column to year-month format correctly? Thank you.
I tried df['date'] = pd.to_datetime(df['date'].map('{:.1f}'.format), format='%Y.%m'), but I get this:
8 2018-01-01
9 2018-01-01
10 2018-01-01
11 2018-09-01
12 2018-08-01
13 2018-07-01
14 2018-06-01
15 2018-05-01
16 2018-04-01
17 2018-03-01
18 2018-02-01
First convert the values to strings and then to datetimes.
Then correct October: test whether the previous month is 11, the next month is 9 and the incorrect value is 1:
# parse the numeric values as strings in year.month form
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y.%m')
mo = df['date'].dt.month
# an October mis-parsed as January sits between a November row and a September row
mask = mo.shift().eq(11) & mo.eq(1) & mo.shift(-1).eq(9)
# DateOffset(month=10) (singular) replaces the month with 10 instead of adding 10 months
df.loc[mask, 'date'] = df.loc[mask, 'date'] + pd.offsets.DateOffset(month=10)
print (df)
date
0 2018-12-01
1 2018-11-01
2 2018-10-01
3 2018-09-01
4 2018-08-01
5 2018-07-01
6 2018-06-01
7 2018-05-01
8 2018-04-01
9 2018-03-01
10 2018-02-01
11 2018-01-01
It might be easiest to fix this in the Excel file! If you've got a lot of data (thousands of rows) then maybe it's worth writing code. Code options are:
look at the row above/below and try to infer whether .1 means January or October
ignore the column; if you have data for every month, just regenerate the correct sequence (see the sketch below)
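For the second option, a minimal sketch (assuming the rows are consecutive months in descending order with no gaps, ending at 2018.12 in the first row) rebuilds the dates from the row positions instead of parsing the ambiguous column:
import pandas as pd
# hypothetical rebuild: one row per month, newest (2018-12) first to match the data order
n = len(df)
df['date'] = pd.period_range(end='2018-12', periods=n, freq='M')[::-1].to_timestamp()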
I am trying to find a way to extract the date cell where the date column starts changing its time frequency. In this case, it's 2021-06-30. Any suggestions on how to do this with a pandas approach?
df1 = pd.DataFrame(pd.date_range(start='2021-07', end='2025-07', freq='Y'))
df2 = pd.DataFrame(pd.date_range(start='2020-07', end='2021-07', freq='M'))
pd.concat([df2, df1]).reset_index(drop = True)
Answer: 2021-06-30
0
0 2020-07-31
1 2020-08-31
2 2020-09-30
3 2020-10-31
4 2020-11-30
5 2020-12-31
6 2021-01-31
7 2021-02-28
8 2021-03-31
9 2021-04-30
10 2021-05-31
11 2021-06-30
12 2021-12-31
13 2022-12-31
14 2023-12-31
15 2024-12-31
Since the frequency changes from month to year, you can find the date after which the difference between consecutive dates becomes more than 31 days.
df = pd.concat([df2, df1]).reset_index(drop = True)
df.loc[df[0].diff(1).apply(lambda d: d.days > 31).idxmax()-1]
This gives 2021-06-30 as output. df[0].diff(1) calculates the difference between consecutive dates, and then we check when the difference becomes more than 31 days. idxmax() returns the index of the first date that is more than 31 days away from the previous one, so we decrease it by 1 to get the last date before the frequency changes.
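An equivalent, slightly more compact form (same dataframe assumed) uses the .dt accessor instead of apply:
# timedelta differences expose .dt.days; the leading NaN compares as False
df.loc[df[0].diff().dt.days.gt(31).idxmax() - 1]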
I'm receiving the following error:
ValueError: time data '2013' does not match format '%Y%m%d' (match)
Here is the section of code where the error is occurring:
# Convert periodEndDate from string to datetime to epoch timestamp
df['periodEndDate'] = df['periodEndDate'].apply(lambda x: pd.to_datetime(int(x), format='%Y%m%d').timestamp())
df['periodEndDate'] = df['periodEndDate'].astype(int)
df['periodTypeId'] = 1
return df.to_dict('records')
output:
0 2013
1 2012
2 2015
3 20111231
4 2016
5 2014
6 2017
7 2018
I understand that the code is failing because '2013' does not match the format. Is it possible to insert a day and month to resolve this issue?
Don't specify the format. Let pandas infer it.
df['periodEndDate'] = pd.to_datetime(df["periodEndDate"])
>>> df
0 2013-01-01
1 2012-01-01
2 2015-01-01
3 2011-12-31
4 2016-01-01
5 2014-01-01
6 2017-01-01
7 2018-01-01
Name: periodEndDate, dtype: datetime64[ns]
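If you do want to keep an explicit format, a minimal sketch (assuming every value is either a bare 4-digit year or a full YYYYMMDD string) appends '0101' to the bare years first:
s = df['periodEndDate'].astype(str)
# give 4-digit years a January 1st so every value matches %Y%m%d
s = s.where(s.str.len() == 8, s + '0101')
df['periodEndDate'] = pd.to_datetime(s, format='%Y%m%d')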
I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where, for month 0, the date is start_dt minus 1 month, and each subsequent month increments the date by one month.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 and add the datetime converted to a month period with Timestamp.to_period, then convert the output back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, you can convert the column to month offsets (subtracting 1) and add the datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, in each iteration of the loop the initial datetime is incremented by the corresponding number of months (starting at -1).
I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex belonging to 2019.
You can create 3 additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
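With those columns in place, a joining sketch (with made-up frame and column names, assuming hourly data) could look like this:
import pandas as pd
# hypothetical example frames: meteo data for 2019, consumption for 2017
meteo = pd.DataFrame({'temp': range(3)},
                     index=pd.date_range('2019-07-01', periods=3, freq='H'))
cons = pd.DataFrame({'power': [10, 20, 30]},
                    index=pd.date_range('2017-07-01', periods=3, freq='H'))
for d in (meteo, cons):
    d['hour'] = d.index.hour
    d['day'] = d.index.day
    d['month'] = d.index.month
# join on the calendar parts, ignoring the year
merged = meteo.merge(cons, on=['hour', 'day', 'month'], how='left')
print(merged)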
I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
# extract the year from the index column
df['year'] = (df['index'].astype(int) / 100).astype(int)

def get_stavg(df, year, month):
    # @ refers to the function's local variables inside query
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
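Note that rolling(5) includes the current row, so each value averages the current year plus the four before it. If, as in the question, the window should cover only the five previous years of the same month, one sketch is to shift within each month group first:
# shift by one year within each month so the current year is excluded
df["st_prev"] = df.groupby("month")["st"].shift()
df["st_mean_prev"] = df.groupby("month")["st_prev"]\
    .rolling(5).mean()\
    .reset_index(0, drop=True)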