I have a pandas DataFrame with the following data:
df1[['interval','answer']]
interval answer
0 0 days 06:19:17.767000 no
1 0 days 00:26:35.867000 no
2 0 days 00:29:12.562000 no
3 0 days 01:04:36.362000 no
4 0 days 00:04:28.746000 yes
5 0 days 02:56:56.644000 yes
6 0 days 00:20:13.600000 no
7 0 days 02:31:17.836000 no
8 0 days 02:33:44.575000 no
9 0 days 00:08:08.785000 no
10 0 days 03:48:48.183000 no
11 0 days 00:22:19.327000 no
12 0 days 00:05:05.253000 question
13 0 days 01:08:01.338000 unsubscribe
14 0 days 15:10:30.503000 no
15 0 days 11:09:05.824000 no
16 1 days 12:56:07.526000 no
17 0 days 18:10:13.593000 no
18 0 days 02:25:56.299000 no
19 2 days 03:54:57.715000 no
20 0 days 10:11:28.478000 no
21 0 days 01:04:55.025000 yes
22 0 days 13:59:40.622000 yes
The format of the df is:
id object
datum datetime64[ns]
datum2 datetime64[ns]
answer object
interval timedelta64[ns]
dtype: object
As a result the boxplot looks like:
[image: the resulting boxplot]
Any idea?
Any help is appreciated...
Robert
Seaborn may help you achieve what you want.
First of all, make sure the columns have the types you want.
To recreate your problem, I created the same dataframe (and gave it the same name, df1). Here are the data types of its columns:
[In]: df1.dtypes
[Out]:
interval object
answer object
dtype: object
For the column "answer", one can use pandas.factorize as follows:
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1
That will create a new column and assign the value 1 to "no", 2 to "yes", 3 to "question", and 4 to "unsubscribe".
With this, one can already create a box plot using sns.boxplot:
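As a quick check of that mapping, here is a minimal sketch on a hand-made Series (the six sample values are illustrative, not the OP's full data). pd.factorize numbers the labels in order of first appearance, so the codes depend on the row order.

```python
import pandas as pd

# Illustrative sample only; the labels mirror the ones in the question.
answers = pd.Series(["no", "no", "yes", "question", "unsubscribe", "no"])

# factorize returns (codes, uniques); codes start at 0, hence the + 1.
codes, uniques = pd.factorize(answers)
new_answer = codes + 1

print(list(new_answer))  # [1, 1, 2, 3, 4, 1]
print(list(uniques))     # ['no', 'yes', 'question', 'unsubscribe']
```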
ax = sns.boxplot(x="interval", y="NewAnswer", hue="answer", data=df1)
Which results in the following
Many other combinations are possible, so I will leave only this one, as the OP didn't specify requirements or give an example of the expected output.
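Since the question's interval column is timedelta64[ns], it may be easier to plot a plain numeric duration instead. A minimal sketch, assuming hours are an acceptable unit; interval_hours is a hypothetical column name, and the three rows are copied from the sample in the question:

```python
import pandas as pd

# Three rows taken from the question's sample data.
df1 = pd.DataFrame({
    "interval": pd.to_timedelta([
        "0 days 06:19:17.767000",
        "0 days 00:04:28.746000",
        "1 days 12:56:07.526000",
    ]),
    "answer": ["no", "yes", "no"],
})

# Convert each timedelta to a float number of hours.
df1["interval_hours"] = df1["interval"].dt.total_seconds() / 3600

# The numeric column can then be passed to a plotting call, for example:
# sns.boxplot(x="answer", y="interval_hours", data=df1)
print(df1[["answer", "interval_hours"]])
```

Grouping by the original answer labels on one axis also avoids the need for the factorized column in the plot itself.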
Notes:
Make sure you have the required libraries installed.
There may be other visualizations that would work better with this dataframe; the seaborn example gallery shows many options.
*Input:*
df["waiting_time"].value_counts()
*Output:*
2 days 6724
4 days 5290
1 days 5213
7 days 4906
6 days 4037
...
132 days 1
125 days 1
117 days 1
146 days 1
123 days 1
Name: waiting_time, Length: 128, dtype: int64
I tried:
df['wait_dur'] = df['waiting_time'].values.astype(str)
and I've tried apply as well. No changes to the data type, it stays the same.
You need to skip the 'values' part in your code:
df['wait_dur'] = df['waiting_time'].astype(str)
If you check the first row, for example, you will get:
type(df['wait_dur'][0])
<class 'str'>
df = df.applymap(str)
This should work: it applies str to every element of the DataFrame. Note that in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.
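To see the astype(str) conversion on its own, here is a minimal sketch with a made-up three-row waiting_time column (the values are illustrative, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-in for the asker's timedelta column.
df = pd.DataFrame({"waiting_time": pd.to_timedelta([2, 4, 1], unit="D")})

# Convert the Series itself, not its underlying NumPy array.
df["wait_dur"] = df["waiting_time"].astype(str)

first = df["wait_dur"].iloc[0]
print(type(first))
```

Every element of wait_dur should now be a Python str, while waiting_time keeps its timedelta dtype.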
I am trying to figure out an exercise on string manipulation and sorting.
The exercise asks to extract the words that contain a time reference (e.g., hours, days) from the text, and to sort the rows by the extracted time in ascending order.
An example of data is:
Customer Text
1 12 hours ago — the customer applied for a discount
2 6 hours ago — the customer contacted the customer service
3 1 day ago — the customer reported an issue
4 1 day ago — no answer
4 2 days ago — Open issue
5
In this task I can identify several difficulties:
- time reference can be expressed as hours/days/weeks
- there are null values or no reference to time
- get a time format suitable and more general, e.g., based on the current datetime
On the first point, I noted that the time references generally come before the —, when present, so they should be easy to extract.
On the second point, an if statement could avoid error messages due to incomplete or missing fields.
I do not know how to address the third point, though.
My expected result would be:
Customer Text Sort by
1 12 hours ago — the customer applied for a discount 1
2 6 hours ago — the customer contacted the customer service 2
3 1 day ago — the customer reported an issue 2
4 1 day ago — no answer 2
4 2 days ago — Open issue 3
5
Given the DataFrame sample, I will assume that for this exercise the first two words of the text are what you are after. I am unclear on how the sorting works, but for the third point, a more suitable time would be the current time minus the timedelta derived from the Text column.
You can apply an if-else lambda function to the first two words of each row of Text and convert them to a pandas Timedelta object; for example, pd.Timedelta("1 day") returns a Timedelta object.
Then you can subtract the Timedelta column from the current time which you can obtain with pd.Timestamp.now():
df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]
Output:
>>> df
Customer Text Timedelta Time
0 1 12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1 2 6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2 3 1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3 4 1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4 4 2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5 5 NaN NaT NaT
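Since the intended "Sort by" values are unclear, the following is only a sketch of one plausible reading: sort the rows by the parsed Timedelta, most recent first, with missing texts last. parse_age is a hypothetical helper name, and the four rows are taken from the sample in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": [1, 2, 3, 5],
    "Text": [
        "12 hours ago — the customer applied for a discount",
        "6 hours ago — the customer contacted the customer service",
        "1 day ago — the customer reported an issue",
        None,
    ],
})

def parse_age(text):
    """Turn the first two words (e.g. '12 hours') into a Timedelta."""
    if pd.isnull(text):
        return pd.NaT
    return pd.Timedelta(" ".join(text.split(" ")[:2]))

# to_timedelta makes sure the column ends up with a timedelta dtype.
df["Timedelta"] = pd.to_timedelta(df["Text"].apply(parse_age))

# Smallest age first; rows without a time reference go to the end.
df_sorted = df.sort_values("Timedelta", na_position="last")
print(df_sorted[["Customer", "Timedelta"]])
```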
I have a dataframe which looks like this
In []: df.head()
Out [] :
DATE NAME AMOUNT CURRENCY
2018-07-27 John 100 USD
2018-06-25 Jane 150 GBP
...
The contents under the DATE column are of date type.
I want to aggregate the data so that I can see, for each day of the month, the number of transactions that happened on that day.
I also wanted to group it by year as well as day.
The end result I wanted would have looked something like this
YEAR DAY COUNT
2018 1 0
2 1
3 0
4 0
5 3
6 4
and so on
I used the following code but the numbers are all wrong. Please help
In []: df = pd.DataFrame({'DATE':pd.date_range(start=dt.datetime(2018,7,27),end=dt.datetime(2020,7,21))})
df.groupby([df['DATE'].dt.year, df['DATE'].dt.day]).agg({'count'})
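One possible way to get the YEAR / DAY / COUNT layout described above, sketched under assumptions: the four sample rows are made up, and days without transactions are filled with 0 by reindexing against every (year, day 1 to 31) pair.

```python
import pandas as pd

# Made-up transactions; only the DATE column matters for the count.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-07-27", "2018-06-25", "2018-07-27", "2019-01-05"]),
    "NAME": ["John", "Jane", "John", "Jane"],
})

# Grouping keys: calendar year and day of month.
year = df["DATE"].dt.year.astype("int64").rename("YEAR")
day = df["DATE"].dt.day.astype("int64").rename("DAY")

# size() counts the rows in each (YEAR, DAY) group.
counts = df.groupby([year, day]).size().rename("COUNT")

# Add the missing days with a count of 0.
full_index = pd.MultiIndex.from_product(
    [sorted(year.unique()), range(1, 32)], names=["YEAR", "DAY"]
)
counts = counts.reindex(full_index, fill_value=0)
print(counts.loc[2018].head(6))
```

Note that this counts by day of month across all months of a year, which is how I read the desired output.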
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data by your Time column.
You can build a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use the result for slicing (as suggested in the questions "python pandas dataframe slicing by date conditions" and "Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval").
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data: 15 rows spaced 1 ms apart
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)}, index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define the ticks that delimit the groups (one tick every 5 ms)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# Find the row positions nearest to the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# Make a dataframe to hold the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# Set the group ids; iloc slices by position and excludes the end position
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# Update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
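The fixed ticks used above are not quite the same as the window described in the question, where a new window starts with the first transaction that falls outside the previous one. A minimal sketch of that rule, assuming the times are already sorted; assign_windows is a hypothetical helper, the date part of the timestamps is made up, and transactions that end up alone in a window would still need their ID blanked to match the question's table:

```python
import pandas as pd

def assign_windows(times, width):
    """Label each timestamp with a window number.

    A window opens at the first timestamp not covered by the previous
    window and covers everything up to and including `width` later.
    """
    labels = []
    window_start = None
    current = -1
    for t in times:  # assumes ascending order
        if window_start is None or t - window_start > width:
            window_start = t
            current += 1
        labels.append(current)
    return labels

clock = ["08:10:02", "08:10:03", "08:10:50", "08:10:55", "08:11:00",
         "08:11:01", "08:11:02", "08:11:03", "08:11:04", "08:15:00"]
times = pd.to_datetime(["2020-01-01 " + c for c in clock])

labels = assign_windows(times, pd.Timedelta(seconds=2))
print(labels)  # [0, 0, 1, 2, 3, 3, 3, 4, 4, 5]
```

For the millisecond case, passing pd.Timedelta(milliseconds=50) as the width should be enough.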
With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DatetimeIndex for the dates.
I would like to be able to reformat this data into average hourly week and weekend profile results, with the week profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in Pandas that do this directly; I used several aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating monthly and daily profiles.
I was thinking of using the business day frequency to create a new DatetimeIndex of working days and comparing that to my DataFrame's DatetimeIndex for every half hour. I would then return values for working days and weekend days when true or false respectively to create a new dataset, but I am not sure how to do this.
PS; I am just getting into Python and Pandas.
Dummy data (for future reference, you are more likely to get an answer if you post some in a copy-paste-able form)
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30min'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
                             end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
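def group_day(date):
    # Weekends first, then business days; whatever remains is a holiday
    if date.weekday() in [5, 6]:
        return 'weekend'
    # Compare the midnight Timestamp against the business-day index
    elif date.normalize() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)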
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
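The output above uses sum(); since the question asks for average profiles, here is a vectorized sketch of the same grouping with mean(). It assumes the US federal holiday calendar is appropriate (swap in your own calendar otherwise) and uses random dummy data, so the numbers themselves are meaningless.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Dummy half-hourly data, as in the answer above.
idx = pd.date_range("2000-01-01", periods=1000, freq="30min")
df = pd.DataFrame({"a": np.random.randn(1000)}, index=idx)

# Holidays that fall inside the data range.
holidays = USFederalHolidayCalendar().holidays(start=idx.min(), end=idx.max())

is_weekend = idx.weekday >= 5
is_holiday = idx.normalize().isin(holidays)

# Weekend takes precedence over holiday, mirroring group_day.
df["day_group"] = np.where(is_weekend, "weekend",
                           np.where(is_holiday, "holiday", "weekday"))
df["hour"] = idx.hour

# Average value per day group and hour of day.
profile = df.groupby(["day_group", "hour"])["a"].mean()
print(profile.head())
```

Both half-hour readings of each hour fall into the same hour bucket, so the result is an hourly average per day group.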