I have the following dataframe df:
timestamp objectId result
0 2015-11-24 09:00:00 Stress 3
1 2015-11-24 09:00:00 Productivity 0
2 2015-11-24 09:00:00 Abilities 4
3 2015-11-24 09:00:00 Challenge 0
4 2015-11-24 10:00:00 Productivity 87
5 2015-11-24 10:00:00 Abilities 84
6 2015-11-24 10:00:00 Challenge 58
7 2015-11-24 10:00:00 Stress 25
8 2015-11-24 11:00:00 Productivity 93
9 2015-11-24 11:00:00 Abilities 93
10 2015-11-24 11:00:00 Challenge 93
11 2015-11-24 11:00:00 Stress 19
12 2015-11-24 12:00:00 Challenge 90
13 2015-11-24 12:00:00 Abilities 96
14 2015-11-24 12:00:00 Stress 94
15 2015-11-24 12:00:00 Productivity 88
16 2015-11-24 13:00:00 Productivity 12
17 2015-11-24 13:00:00 Challenge 17
18 2015-11-24 13:00:00 Abilities 89
19 2015-11-24 13:00:00 Stress 13
I would like to achieve a bar chart like the following, where instead of a, b, c, d there would be the labels from the objectId column, the y-axis would correspond to the result column, and the x-axis would show the values grouped by the timestamp column.
I tried several things but nothing worked. This was the closest, but the plot() method doesn't take any customisation via parameters (e.g. kind='bar' doesn't work).
groups = df.groupby('objectId')
sgb = groups['result']
sgb.plot()
Any other ideas?
import seaborn as sns
In [36]: df.timestamp = df.timestamp.factorize()[0]
In [39]: df.objectId = df.objectId.map({'Stress': 'a', 'Productivity': 'b', 'Abilities': 'c', 'Challenge': 'd'})
In [41]: df
Out[41]:
timestamp objectId result
0 0 a 3
1 0 b 0
2 0 c 4
3 0 d 0
4 1 b 87
5 1 c 84
6 1 d 58
7 1 a 25
8 2 b 93
9 2 c 93
10 2 d 93
11 2 a 19
12 3 d 90
13 3 c 96
14 3 a 94
15 3 b 88
16 4 b 12
17 4 d 17
18 4 c 89
19 4 a 13
In [40]: sns.barplot(x='timestamp', y='result', hue='objectId', data=df);
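If you want to keep the original labels instead of a/b/c/d, you can simply skip the map step; this is just a sketch, assuming df still holds the original timestamp and objectId strings:
import seaborn as sns

# one group of bars per distinct timestamp, coloured by objectId
sns.barplot(x='timestamp', y='result', hue='objectId', data=df)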
@NaderHisham's answer is a very easy solution!
But just as a reference, if for some reason you cannot use seaborn, this is a pure pandas/matplotlib solution:
You need to reshape your data so the different objectIds become the columns:
In [20]: df.set_index(['timestamp', 'objectId'])['result'].unstack()
Out[20]:
objectId Abilities Challenge Productivity Stress
timestamp
09:00:00 4 0 0 3
10:00:00 84 58 87 25
11:00:00 93 93 93 19
12:00:00 96 90 88 94
13:00:00 89 17 12 13
If you make a bar plot of this, you get the desired result:
In [24]: df.set_index(['timestamp', 'objectId'])['result'].unstack().plot(kind='bar')
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0xc44a5c0>
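The same reshaping can also be written with pivot; this is an equivalent sketch using the same column names:
df.pivot(index='timestamp', columns='objectId', values='result').plot(kind='bar')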
I am trying to retrieve values from a dataframe (df2) that fall within a datetime range that is specified in another dataframe (df1).
import datetime as dt
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range("20180101", periods=5)}, index=list('ABCDE'))
df1['t-2'] = df1['date'] + dt.timedelta(days=-2)
df1['t+2'] = df1['date'] + dt.timedelta(days=2)
df1:
date t-2 t+2
A 2018-01-01 2017-12-30 2018-01-03
B 2018-01-02 2017-12-31 2018-01-04
C 2018-01-03 2018-01-01 2018-01-05
D 2018-01-04 2018-01-02 2018-01-06
E 2018-01-05 2018-01-03 2018-01-07
df2 = pd.DataFrame(np.random.randint(0,100,size=(5, 50)),
index=list('ABCDE'), columns = pd.date_range("20171201", periods=50))
df2:
2017-12-01 2017-12-02 2017-12-03 ... 2018-01-17 2018-01-18 2018-01-19
A 58 61 45 ... 72 77 68
B 88 94 68 ... 93 68 24
C 97 47 21 ... 22 48 89
D 44 8 62 ... 57 29 29
E 0 21 26 ... 65 46 36
I would like to take the 5 cells from df2 that fall between the dates in df1's t-2 and t+2 columns and append these to df1 so that it looks like the following:
date t-2 t+2 t-2d t-1d t t+1d t+2d
A 2018-01-01 2017-12-30 2018-01-03 21 28 25 7 6
B 2018-01-02 2017-12-31 2018-01-04 28 25 7 6 45
C 2018-01-03 2018-01-01 2018-01-05 25 7 6 45 74
D 2018-01-04 2018-01-02 2018-01-06 7 6 45 74 23
E 2018-01-05 2018-01-03 2018-01-07 6 45 74 23 57
So far I have tried any number of combinations of the following:
pd.merge(df1, df2.loc[:, df1['t-2']:df1['t+2']], how='inner', left_index=True, right_index=True)
but I receive a TypeError. Any help is greatly appreciated.
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [A 2017-12-30
B 2017-12-31
C 2018-01-01
D 2018-01-02
E 2018-01-03
Name: t-2, dtype: datetime64[ns]] of type Series
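The error occurs because a .loc slice needs scalar bounds, while df1['t-2'] and df1['t+2'] are whole Series. A minimal per-row sketch (the t-2d ... t+2d column names are just taken from the desired output above):
import pandas as pd

# Slice df2 one row at a time with scalar bounds taken from df1,
# then glue the five values back onto df1.
vals = pd.DataFrame(
    [df2.loc[i, r['t-2']:r['t+2']].to_numpy() for i, r in df1.iterrows()],
    index=df1.index,
    columns=['t-2d', 't-1d', 't', 't+1d', 't+2d'],
)
result = df1.join(vals)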
I am working with log data structured as follows (here is a pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well when dealing with dates at a daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs are batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by column id together with a helper Series built with Series.cumsum computed from the back - hence the reversed indexing with Series.iloc:
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
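To make the helper grouping a bit more concrete, here is a tiny self-contained illustration of the reversed cumulative sum (the toy has_err values are only for demonstration):
import pandas as pd

has_err = pd.Series([0, 0, 1, 0, 0, 1])
# Reverse, take the cumulative sum, then reverse back: every run of rows
# ending in an error shares one group label.
g = has_err.iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())  # [2, 2, 2, 1, 1, 1]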
EDIT: To compute the cumulative sum of date differences, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.days.cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
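If you prefer integer day counts rather than the floats produced by diff/fillna, the column can simply be cast back afterwards:
df['days_to_err'] = df['days_to_err'].astype(int)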
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
I have four columns of data imported using pandas:
DayOfYear Time Field Distance
1 09:00:00 50 100
1 10:00:00 51 110
1 11:00:00 52 130
2 09:00:00 54 170
2 10:00:00 55 200
2 11:00:00 56 220
3 09:00:00 58 250
3 10:00:00 59 280
3 11:00:00 60 300
4 09:00:00 61 320
4 10:00:00 63 350
4 11:00:00 65 400
5 09:00:00 66 420
5 10:00:00 68 450
5 11:00:00 70 500
6 09:00:00 72 520
6 10:00:00 74 560
6 11:00:00 75 600
7 09:00:00 77 630
7 10:00:00 79 670
7 11:00:00 80 700
...
So far I have needed to plot Field against Distance for whatever range of days I need, which I have done using:
startday = 1
endday= 6
plt.plot(rawdata[rawdata['Day'].between(startday,endday)].set_index('Distance')['Field'])
Now, on the same plot, I would like to highlight a region for a specific time range. So I'd like to highlight, along the distance axis, day 3 between 8 AM and 10 AM.
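One possible sketch for the highlighting uses matplotlib's axvspan over the Distance range covered by the chosen day and time window. This assumes the day column is called 'DayOfYear' as in the sample shown above (the snippet above refers to it as 'Day') and that Time holds strings like '09:00:00':
import matplotlib.pyplot as plt

mask = (rawdata['DayOfYear'] == 3) & rawdata['Time'].between('08:00:00', '10:00:00')
if mask.any():
    lo = rawdata.loc[mask, 'Distance'].min()
    hi = rawdata.loc[mask, 'Distance'].max()
    plt.axvspan(lo, hi, color='orange', alpha=0.3)  # shaded band along the distance axis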
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 are your original index: reset the index, set Date as the index, resample with a forward fill, then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the following for that:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
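If the goal is the DateGroup expansion shown in the question rather than a plain daily resample, a rough sketch is a within-group self-merge that keeps only rows whose Date falls on or before each DateGroup (this assumes ID1, ID2 and Date are ordinary columns and Date is already datetime; call reset_index() first if ID1/ID2 form the index):
# Build one 'DateGroup' row per original date, then attach every earlier
# or equal date from the same ('ID1', 'ID2') group.
dates = df[['ID1', 'ID2', 'Date']].rename(columns={'Date': 'DateGroup'})
out = dates.merge(df, on=['ID1', 'ID2'], how='left')
out = out[out['Date'] <= out['DateGroup']]
out = out.sort_values(['ID1', 'ID2', 'DateGroup', 'Date']).reset_index(drop=True)
Note that the self-merge is roughly quadratic within each group, so on ~20 million rows it may be too memory-hungry; processing one ('ID1', 'ID2') group at a time is a safer route in that case.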
I have the following df of values for various slices across time:
date A B C
0 2016-01-01 5 7 2
1 2016-01-02 6 12 15
...
2 2016-01-08 9 5 16
...
3 2016-12-24 5 11 13
4 2016-12-31 3 52 22
I would like to create a new dataframe that calculates the w-w change in each slice, by date. For example, I want the new table to be blank for all slices from jan 1 - jan 7. I want the value of jan 8 to be the jan 8 value for the given slice minus the value of the jan 1 value of that slice. I then want the value of jan 9 to be the jan 9 value for the given slice minus the value of the jan 2 slice. So and so forth, all the way down.
The example table would look like this:
date A B C
0 2016-01-01 0 0 0
1 2016-01-02 0 0 0
...
2 2016-01-08 4 -2 14
...
3 2016-12-24 4 12 2
4 2016-12-31 -2 41 9
You may assume the offset is ALWAYS 7. In other words, there are no missing dates.
@Unatiel's answer is correct in this case, where there are no missing dates, and should be accepted.
But I wanted to post a modification here for cases with missing dates, for anyone interested. From the docs:
The shift method accepts a freq argument which can accept a DateOffset class or other timedelta-like object or also an offset alias
from pandas.tseries.offsets import Week
res = ((df - df.shift(1, freq=Week()).reindex(df.index))
.fillna(value=0)
.astype(int))
print(res)
A B
date
2016-01-01 0 0
2016-01-02 0 0
2016-01-03 0 0
2016-01-04 0 0
2016-01-05 0 0
2016-01-06 0 0
2016-01-07 0 0
2016-01-08 31 46
2016-01-09 4 20
2016-01-10 -51 -65
2016-01-11 56 5
2016-01-12 -51 24
.. ..
2016-01-20 34 -30
2016-01-21 -28 19
2016-01-22 24 8
2016-01-23 -28 -46
2016-01-24 -11 -60
2016-01-25 -34 -7
2016-01-26 -12 -28
2016-01-27 -41 42
2016-01-28 -2 48
2016-01-29 35 -51
2016-01-30 -8 62
2016-01-31 -6 -9
If we know the offset is always 7, then use shift(). Here is a quick example showing how it works:
import pandas

df = pandas.DataFrame({'x': range(30)})
df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 0.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
...
So with this you can do:
df - df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 7.0
8 7.0
...
In your case, don't forget to set_index('date') beforehand.
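Following that advice, a small sketch applying it to the question's frame (column names taken from the question):
res = df.set_index('date')
weekly_change = (res - res.shift(7)).fillna(0).astype(int)  # first week becomes 0, as in the expected output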