I am trying to figure out an exercise on string manipulation and sorting.
The exercise asks to extract words that carry a time reference (e.g., hours, days) from the text, and to sort the rows by the extracted time in ascending order.
An example of data is:
Customer Text
1 12 hours ago — the customer applied for a discount
2 6 hours ago — the customer contacted the customer service
3 1 day ago — the customer reported an issue
4 1 day ago — no answer
4 2 days ago — Open issue
5
In this task I can identify several difficulties:
- the time reference can be expressed in hours/days/weeks
- there are null values, or rows with no time reference at all
- the extracted time should be converted into a more general, suitable format, e.g., one based on the current datetime
On the first point, I noted that the time references generally appear before the —, when present, so they should be easy to extract.
On the second point, an if statement could avoid error messages caused by incomplete/missing fields.
I do not know how to address the third point, though.
My expected result would be:
Customer Text Sort by
1 12 hours ago — the customer applied for a discount 1
2 6 hours ago — the customer contacted the customer service 2
3 1 day ago — the customer reported an issue 2
4 1 day ago — no answer 2
4 2 days ago — Open issue 3
5
Given the DataFrame sample, I will assume that for this exercise the first two words of the text are what you are after. I am unclear on how the sorting works, but for the third point, a more suitable time would be the current time minus the timedelta extracted from the Text column.
You can apply an if-else lambda function to the first two words of each row of Text and convert them to a pandas Timedelta object: for example, pd.Timedelta("1 day") returns a Timedelta object.
Then you can subtract the Timedelta column from the current time which you can obtain with pd.Timestamp.now():
df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]
Output:
>>> df
Customer Text Timedelta Time
0 1 12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1 2 6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2 3 1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3 4 1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4 4 2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5 5 NaN NaT NaT
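The question also asked for an ascending sort on the extracted time; once the column holds real timedelta64 values, that is a single sort_values call. A minimal sketch of the whole pipeline (the DataFrame construction below just mirrors the sample data; pd.to_timedelta is used instead of the lambda so missing texts become NaT automatically):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": [1, 2, 3, 4, 4, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        "1 day ago — no answer",
        "2 days ago — Open issue",
        None,
    ],
})

# First two words ("12 hours", "1 day", ...) parsed as timedelta64;
# a missing Text propagates to NaT without any explicit if-check.
df["Timedelta"] = pd.to_timedelta(df["Text"].str.split().str[:2].str.join(" "))

# Ascending by elapsed time, rows without a time reference last;
# the stable sort keeps tied rows in their original order.
df_sorted = df.sort_values("Timedelta", kind="stable", na_position="last")
```

For the sample data this puts customer 2 (6 hours) first and customer 5 (no text) last.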
I am looking for a way to check the frequency of dates in a column. I have dates with a frequency of every week, but sometimes there is a gap of 2 or 3 weeks, and pd.infer_freq then returns None.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small (and too irregular) for pd.infer_freq to infer a frequency. You could instead find the most common time difference between consecutive dates and use that as the inferred frequency:
s = pd.to_datetime(pd.Series(dates))
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
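Put together, a runnable sketch (the pd.to_datetime parse step is an assumption; if the column is already datetime64 it can be skipped):

```python
import pandas as pd

# Sample dates from the question; they must be datetime, not strings,
# for the subtraction below to work.
s = pd.to_datetime(pd.Series(["2022-01-01", "2022-01-08",
                              "2022-01-23", "2022-01-30"]))

# Most common gap between consecutive dates; s.diff() is shorthand
# for s - s.shift(1).
step = s.diff().mode()[0]

# If an offset alias is needed downstream, the Timedelta converts cleanly.
freq = pd.tseries.frequencies.to_offset(step)
```

Here `step` comes out as Timedelta('7 days'), i.e., a weekly frequency.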
I am trying to calculate the time difference (in days) between the customer's previous visit's out time and the customer's latest visit's in time.
time difference = latest in time - previous out time
Here is a sample of input data
sample output table
The approach I have tried so far is to group by customer ID and rank:
temp['RANK'] = temp.groupby('customer ID')['in time'].rank(ascending=True)
but I am unsure how to calculate the difference.
You can use GroupBy.shift() to get the previous out time within each group and subtract it from the current in time. Then use dt.days to get the number of days in the resulting timedelta, as follows:
# convert date strings to datetime format
df['out time'] = pd.to_datetime(df['out time'], dayfirst=True)
df['in time'] = pd.to_datetime(df['in time'], dayfirst=True)
df['Visit diff (in days)'] = (df['in time'] - df['out time'].groupby(df['customer ID']).shift()).dt.days
Data Input:
print(df)
customer ID out time in time
0 1 05-12-1999 15:20:07 05-12-1999 14:23:31
1 1 21-12-1999 09:59:34 21-12-1999 09:41:09
2 2 05-12-1999 11:53:34 05-12-1999 11:05:37
3 2 08-12-1999 19:55:00 08-12-1999 19:40:10
4 3 01-12-1999 15:15:26 01-12-1999 13:08:11
5 3 16-12-1999 17:10:09 16-12-1999 16:34:10
Result:
print(df)
customer ID out time in time Visit diff (in days)
0 1 1999-12-05 15:20:07 1999-12-05 14:23:31 NaN
1 1 1999-12-21 09:59:34 1999-12-21 09:41:09 15.0
2 2 1999-12-05 11:53:34 1999-12-05 11:05:37 NaN
3 2 1999-12-08 19:55:00 1999-12-08 19:40:10 3.0
4 3 1999-12-01 15:15:26 1999-12-01 13:08:11 NaN
5 3 1999-12-16 17:10:09 1999-12-16 16:34:10 15.0
You may try the following:
temp.groupby('customer ID').apply(lambda x: (x['in time'].max() - x['out time'].min()).days )
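Note that the two snippets compute different things: the shift() version gives a gap per visit, while this groupby version gives one overall span per customer (latest in time minus earliest out time). A self-contained comparison on the sample data above (the apply is swapped for plain max/min aggregations, which give the same result here):

```python
import pandas as pd

df = pd.DataFrame({
    "customer ID": [1, 1, 2, 2, 3, 3],
    "out time": ["05-12-1999 15:20:07", "21-12-1999 09:59:34",
                 "05-12-1999 11:53:34", "08-12-1999 19:55:00",
                 "01-12-1999 15:15:26", "16-12-1999 17:10:09"],
    "in time":  ["05-12-1999 14:23:31", "21-12-1999 09:41:09",
                 "05-12-1999 11:05:37", "08-12-1999 19:40:10",
                 "01-12-1999 13:08:11", "16-12-1999 16:34:10"],
})
df["out time"] = pd.to_datetime(df["out time"], dayfirst=True)
df["in time"] = pd.to_datetime(df["in time"], dayfirst=True)

# Per-visit gap: current in time minus the same customer's previous
# out time (NaN on each customer's first row).
df["Visit diff (in days)"] = (
    df["in time"] - df.groupby("customer ID")["out time"].shift()
).dt.days

# Per-customer span: latest in time minus earliest out time.
g = df.groupby("customer ID")
span = (g["in time"].max() - g["out time"].min()).dt.days
```

On this data both happen to give 15, 3, 15, because each customer has exactly two visits; with three or more visits per customer they would diverge.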
I have a Pandas DataFrame with the following data:
df1[['interval','answer']]
interval answer
0 0 days 06:19:17.767000 no
1 0 days 00:26:35.867000 no
2 0 days 00:29:12.562000 no
3 0 days 01:04:36.362000 no
4 0 days 00:04:28.746000 yes
5 0 days 02:56:56.644000 yes
6 0 days 00:20:13.600000 no
7 0 days 02:31:17.836000 no
8 0 days 02:33:44.575000 no
9 0 days 00:08:08.785000 no
10 0 days 03:48:48.183000 no
11 0 days 00:22:19.327000 no
12 0 days 00:05:05.253000 question
13 0 days 01:08:01.338000 unsubscribe
14 0 days 15:10:30.503000 no
15 0 days 11:09:05.824000 no
16 1 days 12:56:07.526000 no
17 0 days 18:10:13.593000 no
18 0 days 02:25:56.299000 no
19 2 days 03:54:57.715000 no
20 0 days 10:11:28.478000 no
21 0 days 01:04:55.025000 yes
22 0 days 13:59:40.622000 yes
The format of the df is:
id object
datum datetime64[ns]
datum2 datetime64[ns]
answer object
interval timedelta64[ns]
dtype: object
As a result the boxplot looks like:
(screenshot of the boxplot omitted)
Any idea?
Any help is appreciated...
Robert
Seaborn may help you achieve what you want.
First of all, one needs to make sure the columns are of the type one wants.
In order to recreate your problem, I created the same dataframe (and gave it the same name, df1). Here one can see the data types of the columns:
[In]: df1.dtypes
[Out]:
interval object
answer object
dtype: object
For the column "answer", one can use pandas.factorize as follows:
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1
That will create a new column and assign the values 1 to "no", 2 to "yes", 3 to "question", and 4 to "unsubscribe".
With this, one can already create a box plot using sns.boxplot:
ax = sns.boxplot(x="interval", y="NewAnswer", hue="answer", data=df1)
Which results in the following plot (screenshot omitted).
There are many possible combinations, so I will leave it at these, since the OP didn't specify their requirements nor give an example of the expected output.
Notes:
Make sure you have the required libraries installed.
There may be other visualizations that work better with this dataframe; the seaborn gallery has examples.
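As a side note on the dtype issue: when the interval column comes back as object (e.g., read from a CSV), it has to be parsed into real timedeltas and put on a numeric scale before a plotting library can place it on an axis. A pandas-only sketch (three hypothetical rows taken from the data above; the sns.boxplot call itself is omitted):

```python
import pandas as pd

df1 = pd.DataFrame({
    "interval": ["0 days 06:19:17.767000", "0 days 00:26:35.867000",
                 "1 days 12:56:07.526000"],
    "answer": ["no", "no", "yes"],
})

# Parse the object column into timedelta64, then express it in hours
# so the boxplot gets a numeric axis.
df1["interval"] = pd.to_timedelta(df1["interval"])
df1["hours"] = df1["interval"].dt.total_seconds() / 3600

# Numeric codes for the answer categories, as in the factorize step above.
df1["NewAnswer"] = pd.factorize(df1["answer"])[0] + 1
```

With this, `sns.boxplot(x="hours", y="NewAnswer", hue="answer", data=df1)` would receive plain floats instead of object-dtype strings.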
I have a list of dates spanning 2-3 calendar years in a DataFrame. I want to tag them in a 5-weekly fashion like below:
date 5week
2015-01-01 1
2015-01-02 1
. .
2015-01-25 2
2015-01-26 2
. .
2015-02-22 3
or make buckets of 5-week intervals. What would be the most elegant way to do this? I am trying a loop, with some bugs. The code is:
for i in range(len(df)):
    if df.loc[i, 'week'] < 5:
        df.loc[i, '5we'] = 0
    elif df.loc[i, 'week'] % 5 == 0:
        df.loc[i, '5we'] = count
        if df.loc[i, 'week'] != df.loc[i-1, 'week']:
            count += 1
    else:
        df.loc[i, '5we'] = count
but I think this is a clumsy way to do this even if I get it to work (it does not currently). Please share your expert knowledge.
I figured it out. It is rather simple.
First I need to extract the week of the year from the date. This can be done by:
df['week']=df['date'].dt.strftime("%U")
date week
2015-01-01 0
2015-01-05 1
Now, I simply divide the week of the year by 5, using integer division so the result is already an int:
df['5week'] = df['week'].astype(int) // 5
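The two steps can also be collapsed into one vectorized line, assuming the date column is already datetime64:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2015-01-01", "2015-01-05", "2015-01-25", "2015-02-22"])})

# %U is the Sunday-based week of the year; integer division by 5
# buckets the weeks directly, without an intermediate float column.
df["5week"] = df["date"].dt.strftime("%U").astype(int) // 5
```

Note that %U counts days before the year's first Sunday as week 0, so with this scheme the buckets start at 0 rather than 1.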
I am trying to fetch data from the same weekday of previous weeks and then take the average of the value ("current_demand") for today's forecast (prediction).
for example:
Today is Monday, so I want to fetch data from the last two weeks' Mondays at the same time/block, and then average the "current_demand" values to predict today's value.
Input Data:
current_demand Date Blockno weekday
18839 01-06-2018 1 4
18836 01-06-2018 2 4
12256 02-06-2018 1 5
12266 02-06-2018 2 5
17957 08-06-2018 1 4
17986 08-06-2018 2 4
18491 09-06-2018 1 5
18272 09-06-2018 2 5
Expecting result:
18398 15-06-2018 1 4
something like that. I want to take the same block and same weekday of the previous two weeks, then average those values to compute the next value.
I have tried something like:
def forecast(DATA):
    df = DATA
    day = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
           4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
    df.friday = day - timedelta(days=day.weekday() + 3)
    print(df)

forecast(DATA)
Please suggest something. Thank you in advance.
I like relativedelta for this kind of job:
import datetime
from dateutil.relativedelta import relativedelta

(datetime.datetime.today() + relativedelta(weeks=-2)).date()
Output:
datetime.date(2018, 7, 23)
Without the actual structure of your df it's hard to provide a solution tailored to your needs.
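That said, under the assumption that the frame looks like the sample in the question (day-first Date strings and a Blockno column), the forecast the OP describes, i.e., the mean of the same block on dates exactly one and two weeks earlier, can be sketched as follows (the predict helper is hypothetical, not from the OP's code):

```python
import pandas as pd

df = pd.DataFrame({
    "current_demand": [18839, 18836, 12256, 12266,
                       17957, 17986, 18491, 18272],
    "Date": ["01-06-2018", "01-06-2018", "02-06-2018", "02-06-2018",
             "08-06-2018", "08-06-2018", "09-06-2018", "09-06-2018"],
    "Blockno": [1, 2, 1, 2, 1, 2, 1, 2],
})
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

def predict(df, target_date, block):
    """Mean demand for the given block on the dates exactly one and
    two weeks before target_date (hypothetical helper)."""
    target = pd.Timestamp(target_date)
    prior = df[df["Date"].isin([target - pd.Timedelta(weeks=1),
                                target - pd.Timedelta(weeks=2)])
               & (df["Blockno"] == block)]
    return prior["current_demand"].mean()
```

For 2018-06-15 (a Friday, like 01-06 and 08-06) and block 1 this averages 18839 and 17957, giving 18398.0, which matches the expected row in the question.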