I am reading a csv file of the number of employees in the US by year and month (in thousands). It starts out like this:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863
...
I want my pandas DataFrame to have the datetime as the index for each month's value. I'm doing this so I can later add values for specific time ranges. I want it to look something like this:
1961-01-01 45119.0
1961-02-01 44969.0
1961-03-01 45051.0
1961-04-01 44997.0
1961-05-01 45119.0
...
I did some research and thought that if I stacked the years and months together, I could combine them into a datetime. Here is what I have done:
import pandas as pd
import numpy as np
df = pd.read_csv("BLS_private.csv", header=5, index_col="Year")
df.columns = range(1, 13) # I transformed months into numbers 1-12 for easier datetime conversion
df = df.stack() # Months are no longer columns
print(df)
Here is my output:
Year
1961 1 45119.0
2 44969.0
3 45051.0
4 44997.0
5 45119.0
...
I do not know how to combine the year and the months in the stacked indices. Does stacking the indices help at all in my case? I am also not the most familiar with pandas datetime, so any explanation of how I could use it would be very helpful. Also, if anyone has alternative solutions to making datetime the index, I welcome ideas.
After the stack, create the DatetimeIndex from the current (year, month) index:
from datetime import datetime
dt_index = pd.to_datetime([datetime(year=year, month=month, day=1)
                           for year, month in df.index.values])
df.index = dt_index
df.head(3)
# 1961-01-01 45119
# 1961-02-01 44969
# 1961-03-01 45051
import pandas as pd
df = pd.read_csv("BLS_private.csv", index_col="Year")
# Note: closed='left' was removed in pandas 2.0; use inclusive='left' there instead.
dates = pd.date_range(start=str(df.index[0]), end=str(df.index[-1] + 1), closed='left', freq="MS")
df = df.stack()
df.index = dates
df = df.to_frame()
s = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863"""
df = pd.read_csv(StringIO(s))
# set index and stack
stack = df.set_index('Year').stack().reset_index()
# create a new index
stack.index = pd.to_datetime(stack['Year'].astype(str) +'-'+ stack['level_1'])
# keep only the value column
final = stack[0].to_frame()
1961-01-01 45119
1961-02-01 44969
1961-03-01 45051
1961-04-01 44997
1961-05-01 45119
1961-06-01 45289
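Whichever variant you use, once the index is a DatetimeIndex, partial-string slicing covers the stated goal of adding values over specific time ranges. A minimal sketch (my addition), assuming the stacked series df from the first answer:
# Label slices on a DatetimeIndex select whole date ranges directly;
# here, the total from March 1961 through February 1962 (both inclusive).
total = df.loc["1961-03":"1962-02"].sum()
print(total)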
Related
I have a pandas dataframe with the daily returns of stocks; the columns are date and return rate.
But if I only want to keep the last day of each week, and the data has some missing days, what can I do?
import pandas as pd
df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
for last_day in ('2017-01-01' + 7n for n in range(count)):  # pseudocode, not valid Python
Actually, my brain stops working at this point with my limited imagination... Maybe the biggest problem is that this "+7n" kind of arithmetic is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then randomly sample 90 percent of it to simulate the missing dates.
The key here is that you need to convert your date column into datetime if it isn't already, and make sure your df is sorted by the date.
Then you can group by ISO year/week and take the last value. If you run this repeatedly, you'll see that the selected dates can change when the dropped value was the last day of its week.
Based on that:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022',periods=40, freq='D')
df['return'] = np.random.uniform(size=40)
# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)
# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])
# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')
df = df.groupby([df['date'].dt.isocalendar().year,
                 df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
date return
0 2022-04-24 0.299958
1 2022-05-01 0.248471
2 2022-05-08 0.506919
3 2022-05-15 0.541929
4 2022-05-22 0.588768
5 2022-05-27 0.504419
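As a side note (my addition, not part of the answer above): with the dates as the index, resample can express the same idea. A minimal sketch, starting from the sorted sample frame before the groupby; the caveat is that the resulting index carries each week's Sunday label rather than the actual last available day:
# Weekly bins ending on Sunday; last() takes the last available row in
# each bin, dropna() discards weeks with no data at all.
weekly = df.set_index('date').resample('W').last().dropna()
print(weekly)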
I have a question somewhat similar to the one discussed here: How to add a year to a column of dates in pandas.
However, in my case, the number of years to add to the date column is stored in another column. This is my non-working code:
import datetime
import pandas as pd
df1 = pd.DataFrame( [ ["Tom",5], ['Jane',3],['Peter',1]], columns = ["Name","Years"])
df1['Date'] = datetime.date.today()
df1['Final_Date'] = df1['Date'] + pd.offsets.DateOffset(years=df1['Years'])
The goal is to add 5 years to the current date in row 1, 3 years to the current date in row 2, and so on.
Any suggestions? Thank you
Convert the years to days, turn that into a timedelta, and add it to the datetime-converted column:
df1['Final_Date'] = pd.to_datetime(df1['Date']) \
+ pd.to_timedelta(df1['Years'] * 365, unit='D')
Use of to_timedelta with unit='Y' for years is deprecated and throws ValueError.
Edit: if you need calendar-exact changes, you will need to go row by row and update the date objects accordingly; the other answers cover this.
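A quick sketch (my addition) of the drift that the 365-day approximation introduces over a five-year span containing two leap days:
import pandas as pd

d = pd.Timestamp('2020-01-01')
print(d + pd.to_timedelta(5 * 365, unit='D'))  # 2024-12-30, two leap days short
print(d + pd.DateOffset(years=5))              # 2025-01-01, calendar-exact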
Assuming the number of distinct values in Years is limited, you can try groupby and do the operation with pd.DateOffset:
df1['new_date'] = (
    df1.groupby('Years')['Date']
       .apply(lambda x: x + pd.DateOffset(years=x.name))
)
print(df1)
Name Years Date new_date
0 Tom 5 2021-07-13 2026-07-13
1 Jane 3 2021-07-13 2024-07-13
2 Peter 1 2021-07-13 2022-07-13
Alternatively, you can extract the year, month, and day, add the Years column to the year, and recreate a datetime column:
df1['Date'] = pd.to_datetime(df1['Date'])
df1['new_date'] = (
    df1.assign(year=lambda x: x['Date'].dt.year + x['Years'],
               month=lambda x: x['Date'].dt.month,
               day=lambda x: x['Date'].dt.day,
               new_date=lambda x: pd.to_datetime(x[['year', 'month', 'day']]))
       ['new_date']
)
The result is the same.
import datetime
import pandas as pd
df1 = pd.DataFrame([["Tom", 5], ['Jane', 3], ['Peter', 1]], columns=["Name", "Years"])
df1['Date'] = datetime.date.today()
# apply row-wise so DateOffset receives one integer at a time
df1['Final_date'] = df1.apply(lambda g: g['Date'] + pd.offsets.DateOffset(years=g['Years']), axis=1)
print(df1)
Try this: you were passing the whole column when you called pd.offsets.DateOffset(years=df1['Years']), instead of a single value from the column.
EDIT: I changed from iterrows to a row-wise apply due to iterrows's poor performance.
This question already has answers here:
how to highlight weekends for time series line plot in python
(3 answers)
I plotted a daily line plot for flights and I would like to highlight all the Saturdays and Sundays. I'm trying to do it with axvspan, but I'm struggling with its use. Any suggestions on how this could be coded?
(flights.loc[flights['date'].dt.month.between(1, 2), 'date']
        .dt.to_period('D')
        .value_counts()
        .sort_index()
        .plot(kind="line", figsize=(12, 6))
)
Thanks in advance for any help.
Using a date column of type pandas Timestamp, you can get the weekday of a date directly using pandas.Timestamp.weekday. Then you can use df.iterrows() to check whether or not each date is a Saturday or Sunday and include a shape in the figure like this:
for index, row in df.iterrows():
    if row['date'].weekday() == 5 or row['date'].weekday() == 6:
        fig.add_shape(...)
With a setup like this, you would get a line indicating whether or not each date is a Saturday or Sunday. But given that you're dealing with a continuous time series, it would probably make more sense to illustrate these periods as an area covering the whole weekend instead of highlighting each individual day. So just identify each Saturday and extend the shaded period to that Saturday plus pd.DateOffset(1), as in the complete example below.
Complete code with sample data
# imports
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import datetime
pd.set_option('display.max_rows', None)
# data sample
cols = ['signal']
nperiods = 20
np.random.seed(12)
df = pd.DataFrame(np.random.randint(-2, 2, size=(nperiods, len(cols))),
                  columns=cols)
datelist = pd.date_range(datetime.datetime(2020, 1, 1).strftime('%Y-%m-%d'),periods=nperiods).tolist()
df['date'] = datelist
df = df.set_index(['date'])
df.index = pd.to_datetime(df.index)
df.iloc[0] = 0
df = df.cumsum().reset_index()
df['signal'] = df['signal'] + 100
# plotly setup
fig = px.line(df, x='date', y=df.columns[1:])
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0,0,255,0.1)')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0,0,255,0.1)')
for index, row in df.iterrows():
    if row['date'].weekday() == 5:  # or row['date'].weekday() == 6:
        fig.add_shape(type="rect",
                      xref="x",
                      yref="paper",
                      x0=row['date'],
                      y0=0,
                      # x1=row['date'],
                      x1=row['date'] + pd.DateOffset(1),
                      y1=1,
                      line=dict(color="rgba(0,0,0,0)", width=3),
                      fillcolor="rgba(0,0,0,0.1)",
                      layer='below')
fig.show()
You can use pandas' dt.weekday to get an integer corresponding to the weekday of a given date: 5 corresponds to Saturday and 6 to Sunday (Monday is 0). You can use this information as an additional way to slice your dataframe and filter those entries that belong to either Saturdays or Sundays. As you mentioned, they can be highlighted with axvspan, and matplotlib versions >3 are able to use datetime objects as input. One day has to be added via datetime.timedelta, because no rectangle will be drawn if xmin equals xmax.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
#create sample data and dataframe
datelist = pd.date_range(start="2014-12-09",end="2015-03-02").tolist()
datelist += datelist #test to see whether it works with multiple entries having the same date
flights = pd.DataFrame(datelist, columns=["date"])
#plot command, save object in variable
plot = (flights.loc[flights['date'].dt.month.between(1, 2), 'date']
               .dt.to_period('D')
               .value_counts()
               .sort_index()
               .plot(kind="line", figsize=(12, 6)))
#filter out saturdays and sundays from the date range needed
weekends = flights.loc[(flights['date'].dt.month.between(1, 2)) & ((flights['date'].dt.weekday == 5) | (flights['date'].dt.weekday == 6)), 'date']
#5 = Saturday, 6 = Sunday
#plot axvspan for every sat or sun, set() to get unique dates
for day in set(weekends.tolist()):
    plot.axvspan(day, day + datetime.timedelta(days=1))
I have this kind of dataframe:
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one), but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, and this entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (because a later day could be a measurement for the end of the month). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data would need to be:
The purpose is to calculate a single consumption figure per month.
A solution is to iterate over the dataframe (as an array) and apply some if-condition statements. However, I wonder if there is a "simpler" alternative to achieve this.
Thank you
You can normalize the dates with MonthEnd, then drop duplicates based off that column to keep the last value of each month.
from pandas.tseries.offsets import MonthEnd
df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Defining the dataframe, converting the index to datetime, defining helper columns, using them with the shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([[1254],
                   [1265],
                   [1277],
                   [1301],
                   [1345],
                   [1541]],
                  columns=["Value"],
                  index=[dt.strptime("05-10-19", '%d-%m-%y'),
                         dt.strptime("29-10-19", '%d-%m-%y'),
                         dt.strptime("30-10-19", '%d-%m-%y'),
                         dt.strptime("04-11-19", '%d-%m-%y'),
                         dt.strptime("30-11-19", '%d-%m-%y'),
                         dt.strptime("03-02-20", '%d-%m-%y')])
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
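As a follow-up (my addition) to the question's stated goal of one consumption figure per month, diff() turns the cumulative readings kept above into per-period consumption; a negative result would flag a counter replacement:
consumption = df['Value'].diff()
print(consumption)
# 2019-10-05      NaN
# 2019-10-30     23.0
# 2019-11-04     24.0
# 2019-11-30     44.0
# 2020-02-03    196.0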
A small snippet from my dataframe:
I have separate columns for month and day. I need to parse only the month and day into a pandas datetime type (other datetime types would also help), so that I can plot a time-series line plot.
I tried this piece of code:
df['newdate'] = pd.to_datetime(df[['Days','Month']], format='%d%m')
but it threw me an error:
KeyError: "['Days' 'Month'] not in index"
How should I approach this error?
An illustration of my comment: if you take the columns as type string, you can join and strptime them easily as follows:
import pandas as pd
df = pd.DataFrame({'Month': [1,2,11,12], 'Days': [1,22,3,23]})
pd.to_datetime(df['Month'].astype(str)+' '+df['Days'].astype(str), format='%m %d')
# 0 1900-01-01
# 1 1900-02-22
# 2 1900-11-03
# 3 1900-12-23
# dtype: datetime64[ns]
You could also add a 'Year' column to your df with an arbitrary year number and use the method you originally intended:
df = pd.DataFrame({'Month': [1,2,11,12], 'Days': [1,22,3,23]})
df['Year'] = 2020
pd.to_datetime(df[['Year', 'Month', 'Days']])
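For reference, with the sample frame above this should return:
# 0   2020-01-01
# 1   2020-02-22
# 2   2020-11-03
# 3   2020-12-23
# dtype: datetime64[ns]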