I have a list of dates spanning 2-3 calendar years in a DataFrame. I want to tag them in 5-week buckets, like below:
date 5week
2015-01-01 1
2015-01-02 1
. .
2015-01-25 2
2015-01-26 2
. .
2015-02-22 3
In other words, I want to make buckets of 5-week intervals. What would be the most elegant way to do this? I am trying a loop, but it has some bugs. The code is:
for i in range(len(df)):
    if df.loc[i,'week'] < 5:
        df.loc[i,'5we']=0
    elif df.loc[i,'week']%5==0:
        df.loc[i,'5we']=count
        if (df.loc[i,'week']!=df.loc[i-1,'week']):
            count+=1
    else:
        df.loc[i,'5we']=count
I think this is a clumsy way to do it even if I get it to work (it currently does not). Please share your expert knowledge.
I figured it out. It is rather simple.
First I need to extract the week of the year from the date. This can be done by:
df['week']=df['date'].dt.strftime("%U")
date week
2015-01-01 0
2015-01-05 1
Now I simply divide the week of the year by 5 and cast the result to int instead of float:
df['5week']=df['week'].astype(int)/5
df['5week']=df['5week'].astype(int)
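The two steps above can also be written with integer (floor) division, which avoids the intermediate float column. This is only a minimal sketch: the date range below is made-up sample data, and the column names simply follow the ones used above.

```python
import pandas as pd

# Made-up sample data: one row per day for the first quarter of 2015.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2015-03-31", freq="D")})

# %U is the week of the year with Sunday as the first day of the week;
# days before the first Sunday of the year fall in week 0.
week = df["date"].dt.strftime("%U").astype(int)

# Floor division groups weeks 0-4 into bucket 0, weeks 5-9 into bucket 1, and so on.
df["5week"] = week // 5

print(df.iloc[[0, 30, 31]])
```

Note that the week number restarts every January, so with several years of data the bucket label should probably be combined with the year (for example by grouping on both the year and the 5week column).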
I am trying to figure out an exercise on string manipulation and sorting.
The exercise asks to extract the words that contain a time reference (e.g., hours, days) from the text, and to sort the rows in ascending order based on the extracted time.
An example of data is:
Customer Text
1 12 hours ago — the customer applied for a discount
2 6 hours ago — the customer contacted the customer service
3 1 day ago — the customer reported an issue
4 1 day ago — no answer
4 2 days ago — Open issue
5
In this task I can identify several difficulties:
- time reference can be expressed as hours/days/weeks
- there are null values or no reference to time
- get a time format suitable and more general, e.g., based on the current datetime
On the first point, I noted that the time references generally come before the —, when present, so they should be easy to extract.
On the second point, an if statement could avoid error messages caused by incomplete or missing fields.
I do not know how to address the third point, though.
My expected result would be:
Customer Text Sort by
1 12 hours ago — the customer applied for a discount 1
2 6 hours ago — the customer contacted the customer service 2
3 1 day ago — the customer reported an issue 2
4 1 day ago — no answer 2
4 2 days ago — Open issue 3
5
Given the DataFrame sample, I will assume that for this exercise the first two words of the text are what you are after. I am unclear on how the sorting works, but for the third point, a more suitable time would be the current time minus the timedelta derived from the Text column.
You can apply an if-else lambda function to the first two words of each row of Text and convert them to a pandas Timedelta object. For example, pd.Timedelta("1 day") will return a Timedelta object.
Then you can subtract the Timedelta column from the current time, which you can obtain with pd.Timestamp.now():
df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]
Output:
>>> df
Customer Text Timedelta Time
0 1 12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1 2 6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2 3 1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3 4 1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4 4 2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5 5 NaN NaT NaT
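For the sorting part, one possible approach is to sort on the extracted duration directly. The sketch below is an assumption-laden variant, not the method above: it pulls the number and the unit out with a regular expression (so text without a leading "N unit ago" becomes NaT instead of raising), and the hours_per_unit mapping is a helper introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": [1, 2, 3, 4, 4, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        "1 day ago — no answer",
        "2 days ago — Open issue",
        None,
    ],
})

# Capture the leading number and the unit; rows that do not match yield NaN.
parts = df["Text"].str.extract(r"^(\d+)\s+(hour|day|week)s?\s+ago")

# Hypothetical helper: convert every supported unit to hours.
hours_per_unit = {"hour": 1, "day": 24, "week": 168}
hours = pd.to_numeric(parts[0]) * parts[1].map(hours_per_unit)

df["Timedelta"] = pd.to_timedelta(hours, unit="h")

# Most recent first; rows without a time reference go to the end.
df_sorted = df.sort_values("Timedelta", kind="stable", na_position="last")
print(df_sorted[["Customer", "Timedelta"]])
```

Sorting on the Timedelta column gives the same order as sorting the computed Time column in descending order, so either one can serve as the sort key.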
I have a pandas DataFrame with the following data:
df1[['interval','answer']]
interval answer
0 0 days 06:19:17.767000 no
1 0 days 00:26:35.867000 no
2 0 days 00:29:12.562000 no
3 0 days 01:04:36.362000 no
4 0 days 00:04:28.746000 yes
5 0 days 02:56:56.644000 yes
6 0 days 00:20:13.600000 no
7 0 days 02:31:17.836000 no
8 0 days 02:33:44.575000 no
9 0 days 00:08:08.785000 no
10 0 days 03:48:48.183000 no
11 0 days 00:22:19.327000 no
12 0 days 00:05:05.253000 question
13 0 days 01:08:01.338000 unsubscribe
14 0 days 15:10:30.503000 no
15 0 days 11:09:05.824000 no
16 1 days 12:56:07.526000 no
17 0 days 18:10:13.593000 no
18 0 days 02:25:56.299000 no
19 2 days 03:54:57.715000 no
20 0 days 10:11:28.478000 no
21 0 days 01:04:55.025000 yes
22 0 days 13:59:40.622000 yes
The format of the df is:
id object
datum datetime64[ns]
datum2 datetime64[ns]
answer object
interval timedelta64[ns]
dtype: object
As a result the boxplot looks like:
[image of the boxplot not available]
Any idea?
Any help is appreciated...
Robert
Seaborn may help you achieve what you want.
First of all, one needs to make sure the columns have the intended types.
To recreate your problem, I created the same dataframe (and gave it the same name, df1). Here one can see the data types of the columns:
[In]: df1.dtypes
[Out]:
interval object
answer object
dtype: object
For the column "answer", one can use pandas.factorize as follows:
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1
That will create a new column and assign the value 1 to "no", 2 to "yes", 3 to "question", and 4 to "unsubscribe".
With this, one can already create a box plot using sns.boxplot:
ax = sns.boxplot(x="interval", y="NewAnswer", hue="answer", data=df1)
Which results in the following
Many other combinations are possible, so I will leave it at this one, as the OP did not specify the requirements or give an example of the expected output.
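One thing that may help with the plot is converting the timedelta column to a plain number first, since plotting libraries generally expect a numeric axis. The sketch below uses only a handful of rows from the question and a new column name, hours, which is an assumption made for illustration. It prints the per-answer summary statistics that a box plot would draw.

```python
import pandas as pd

# A few rows copied from the question, with interval parsed as timedelta64.
df1 = pd.DataFrame({
    "interval": pd.to_timedelta([
        "0 days 06:19:17.767000",
        "0 days 00:04:28.746000",
        "0 days 02:56:56.644000",
        "1 days 12:56:07.526000",
        "0 days 00:05:05.253000",
        "0 days 01:08:01.338000",
    ]),
    "answer": ["no", "yes", "yes", "no", "question", "unsubscribe"],
})

# Express each interval as a float number of hours.
df1["hours"] = df1["interval"].dt.total_seconds() / 3600

# Quartiles per answer category, i.e. the numbers behind a box plot.
summary = df1.groupby("answer")["hours"].describe()
print(summary)

# With seaborn installed, the numeric column can then be plotted, e.g.:
# sns.boxplot(x="answer", y="hours", data=df1)
```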
Notes:
Make sure you have the required libraries installed.
There may be other visualizations that would work better with this dataframe; the seaborn example gallery shows several.
I have a large dataset spanning many years and I want to subset this data frame by selecting data based on a specific day of the month using python.
This is simple enough and I have achieved with the following line of code:
df[df.index.day == 12]
This selects data from the 12th of each month for all years in the data set. Great.
The problem, however, is that the original data set only contains working days. The 12th might therefore fall on a weekend or a national holiday and not appear in the data set, in which case nothing is returned for that month.
What I would like to happen is to select the 12th where available, else select the next working day in the data set.
All help appreciated!
Here's a solution that looks at three days from every month (the 12th, 13th, and 14th) and then picks the earliest one that exists. If the 12th falls on a weekend it won't exist in the original dataframe, and you'll get the 13th. If the 13th is missing as well, you'll get the 14th.
Here's the code:
import pandas as pd
# Create dummy data - initial range
df = pd.DataFrame(pd.date_range("2018-01-01", "2020-06-01"), columns = ["date"])
# Create dummy data - Drop weekends
df = df[df.date.dt.weekday.isin(range(5))]
# get only the 12, 13, and 14 of every month
# group by year and month.
# get the minimum
df[df.date.dt.day.isin([12, 13, 14])].groupby(by=[df.date.dt.year, df.date.dt.month], as_index=False).min()
Result:
date
0 2018-01-12
1 2018-02-12
2 2018-03-12
3 2018-04-12
4 2018-05-14
5 2018-06-12
6 2018-07-12
7 2018-08-13
8 2018-09-12
9 2018-10-12
...
Edit
Per a question in the comments about national holidays: the same solution applies. Instead of picking 3 days (12, 13, 14), pick a larger number (e.g. 12-18). Then, get the minimum of these that actually exists in the dataframe - and that's the first working day starting with the 12th.
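Taken to its limit, the window can simply be "every day from the 12th onwards", keeping the first row per month. The following is a rough sketch of that idea for a frame indexed by date, as in the question. The business-day index and the value column are made-up sample data.

```python
import pandas as pd

# Made-up sample data: one row per weekday of 2018, indexed by date.
idx = pd.bdate_range("2018-01-01", "2018-12-31")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Keep only rows on or after the 12th of each month ...
candidates = df[df.index.day >= 12]

# ... and take the first remaining row within each (year, month) group.
first_on_or_after_12 = candidates.groupby(
    [candidates.index.year, candidates.index.month]
).head(1)

print(first_on_or_after_12.index)
```

Because only rows that actually exist in the data are considered, holidays missing from the original frame are skipped automatically, without needing a holiday calendar.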
You can first backfill the dataframe to fill in the missing days, and then select the date you want:
df = df.asfreq('D', method='bfill')
Then you can do df[df.index.day == 12].
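A small sketch of the backfill approach follows, on made-up data for May 2018 (where the 12th is a Saturday). The source_date column is an addition for illustration only: after backfilling, the row is labelled with the 12th, so keeping the original date in a column shows which working day the values really came from.

```python
import pandas as pd

# Made-up sample data: weekdays of May 2018, indexed by date.
idx = pd.bdate_range("2018-05-01", "2018-05-31")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Remember the real date of each row before the index is filled in.
df["source_date"] = df.index

# Insert the missing calendar days, copying values from the next available row.
filled = df.asfreq("D", method="bfill")

picked = filled[filled.index.day == 12]
print(picked)
```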
This is my approach, I will explain each line below the code. Please feel free to add a comment if there's something unclear:
!pip install workalendar #Install the module
import pandas as pd #Import pandas
from workalendar.usa import NewYork #Import the required country and city
df = pd.DataFrame(pd.date_range(start='1/1/2018', end='12/31/2018')).rename(columns={0:'Dates'}) #Create a dataframe with dates for the year 2018
cal = NewYork() #Instantiate the calendar
df['Is_Working_Day'] = df['Dates'].map(lambda x: cal.is_working_day(x)) #Create an extra column, True for working days, False otherwise
df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()
Essentially, this last line keeps all days on or after the 12th that are actual working days; we then group them by month and return the first day in each month where the condition is met (day >= 12 and Is_Working_Day == True).
Output:
Dates
1 2018-01-12
2 2018-02-13
3 2018-03-12
4 2018-04-12
5 2018-05-14
6 2018-06-12
7 2018-07-12
8 2018-08-13
9 2018-09-12
10 2018-10-12
11 2018-11-13
12 2018-12-12
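If installing an extra package is not an option, a similar result can probably be obtained with the holiday calendar that ships with pandas. Note that this is a different calendar from the New York one used above: USFederalHolidayCalendar only knows US federal holidays, so state holidays such as the one behind 2018-02-13 in the output above are not skipped. A minimal sketch:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Business-day offset that skips weekends and US federal holidays.
bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# The 12th of every month in 2018 ...
twelfths = [pd.Timestamp(year=2018, month=m, day=12) for m in range(1, 13)]

# ... rolled forward to the next working day when the 12th is not one.
picked = [bday.rollforward(ts) for ts in twelfths]

for ts in picked:
    print(ts.date())
```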
I am trying to fetch the data for the same weekday in previous weeks and then average the values ("current_demand") to forecast today's value.
For example:
Today is Monday, so I want to fetch the data from the last two Mondays for the same time block and average the ["current_demand"] values to predict today's value.
Input Data:
current_demand Date Blockno weekday
18839 01-06-2018 1 4
18836 01-06-2018 2 4
12256 02-06-2018 1 5
12266 02-06-2018 2 5
17957 08-06-2018 1 4
17986 08-06-2018 2 4
18491 09-06-2018 1 5
18272 09-06-2018 2 5
Expecting result:
18398 15-06-2018 1 4
Something like that. I want to take the values for the same block and the same weekday from the previous two weeks, then average them to get the next value.
I have tried something like this:
def forecast(DATA):
    df = DATA
    day = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}
    df.friday = day - timedelta(days=day.weekday() + 3)
    print(df)
forecast(DATA)
Please suggest something. Thank you in advance.
I like relativedelta for this kind of job:
import datetime
from dateutil.relativedelta import relativedelta
(datetime.datetime.today() + relativedelta(weeks=-2)).date()
Output:
datetime.date(2018, 7, 23)
Without the actual structure of your df, it's hard to provide a solution tailored to your needs.
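Going by the sample data in the question, a forecast along these lines could be sketched as below. The function signature (target_date, weeks) is an assumption made for illustration, and the dates are assumed to be in day-month-year format.

```python
import pandas as pd

df = pd.DataFrame({
    "current_demand": [18839, 18836, 12256, 12266, 17957, 17986, 18491, 18272],
    "Date": ["01-06-2018", "01-06-2018", "02-06-2018", "02-06-2018",
             "08-06-2018", "08-06-2018", "09-06-2018", "09-06-2018"],
    "Blockno": [1, 2, 1, 2, 1, 2, 1, 2],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")


def forecast(df, target_date, weeks=2):
    """Average current_demand per block over the same weekday of the previous weeks."""
    target = pd.Timestamp(target_date)
    # Same weekday, 1..weeks weeks before the target date.
    past_dates = [target - pd.Timedelta(weeks=k) for k in range(1, weeks + 1)]
    history = df[df["Date"].isin(past_dates)]
    out = history.groupby("Blockno", as_index=False)["current_demand"].mean()
    out.insert(1, "Date", target)
    return out


pred = forecast(df, "2018-06-15")
print(pred)
```

For block 1 this averages 18839 and 17957, which gives the 18398 shown in the expected result.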
I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
import numpy as np
from pandas import Series, date_range
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I opened a GitHub issue regarding your question; the relevant feature needs to be added to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for the others you'll have to do some contorting right now, which will hopefully be remedied by the GitHub issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.
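In the meantime, the three rules from the question can be approximated by hand. This is only a sketch, not the feature tracked in the issue: it labels every weekly date with its calendar month, looks up the monthly value for that month, and then masks everything except the first or last week of each month. All variable names are made up for illustration.

```python
import pandas as pd

# Monthly series from the question, built with explicit month-end dates.
s = pd.Series(
    [0.0, 1.0, 2.0],
    index=pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"]),
)

# Weekly dates (Sundays) covering the same span.
weeks = pd.date_range("2012-01-01", "2012-03-31", freq="W-SUN")
week_months = weeks.to_period("M")

# Re-key the monthly values by calendar month so weeks can look them up.
by_month = s.copy()
by_month.index = s.index.to_period("M")

# Case (3): every week ending in a month gets that month's value.
all_weeks = pd.Series(by_month.reindex(week_months).to_numpy(), index=weeks)

# Case (1): only the last week ending in each month keeps the value.
last_week = all_weeks.where(~week_months.duplicated(keep="last"))

# Case (2): only the first week ending in each month keeps the value.
first_week = all_weeks.where(~week_months.duplicated(keep="first"))

print(last_week)
```

For case (1) this leaves values only on 2012-01-29, 2012-02-26 and 2012-03-25, matching the output requested in the question.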