Pandas - sort and head inside groupby - python

I have the following dataframe:
uniq_id value
2016-12-26 11:03:10 001 342
2016-12-26 11:03:13 004 5
2016-12-26 12:03:13 005 14
2016-12-26 12:03:13 008 114
2016-12-27 11:03:10 009 343
2016-12-27 11:03:13 013 5
2016-12-27 12:03:13 016 124
2016-12-27 12:03:13 018 114
I need to get the top N records for each day, sorted by value.
Something like this (for N=2):
2016-12-26 001 342
008 114
2016-12-27 009 343
016 124
Please suggest the right way to do that in pandas 0.19.x.

Unfortunately, there is not yet a method such as DataFrameGroupBy.nlargest(), which would allow us to do the following:
df.groupby(...).nlargest(2, columns=['value'])
So here is a slightly ugly but working solution:
In [73]: df.set_index(df.index.normalize()).reset_index().sort_values(['index','value'], ascending=[1,0]).groupby('index').head(2)
Out[73]:
index uniq_id value
0 2016-12-26 1 342
3 2016-12-26 8 114
4 2016-12-27 9 343
6 2016-12-27 16 124
PS: I think there must be a better one...
UPDATE: if your DF doesn't have duplicated index values, the following solution should work as well:
In [117]: df
Out[117]:
uniq_id value
2016-12-26 11:03:10 1 342
2016-12-26 11:03:13 4 5
2016-12-26 12:03:13 5 14
2016-12-26 12:33:13 8 114 # <-- i've intentionally changed this index value
2016-12-27 11:03:10 9 343
2016-12-27 11:03:13 13 5
2016-12-27 12:03:13 16 124
2016-12-27 12:33:13 18 114 # <-- i've intentionally changed this index value
In [118]: df.groupby(pd.TimeGrouper('D')).apply(lambda x: x.nlargest(2, 'value')).reset_index(level=1, drop=1)
Out[118]:
uniq_id value
2016-12-26 1 342
2016-12-26 8 114
2016-12-27 9 343
2016-12-27 16 124
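For readers on newer pandas releases: pd.TimeGrouper was deprecated and later removed, and pd.Grouper(freq='D') plays the same role. The following is a minimal sketch of the same per-day top-2 selection under that assumption, reusing the sample data with unique timestamps from above. The variable names top and top2 are only illustrative.

```python
import pandas as pd

# Sample data from the answer above (timestamps are unique).
idx = pd.to_datetime([
    "2016-12-26 11:03:10", "2016-12-26 11:03:13",
    "2016-12-26 12:03:13", "2016-12-26 12:33:13",
    "2016-12-27 11:03:10", "2016-12-27 11:03:13",
    "2016-12-27 12:03:13", "2016-12-27 12:33:13",
])
df = pd.DataFrame(
    {"uniq_id": [1, 4, 5, 8, 9, 13, 16, 18],
     "value": [342, 5, 14, 114, 343, 5, 124, 114]},
    index=idx,
)

# Two largest values per calendar day; the result is indexed by (day, timestamp).
top = df.groupby(pd.Grouper(freq="D"))["value"].nlargest(2)

# Look the full rows up by their original timestamp (this needs a unique index).
top2 = df.loc[top.index.get_level_values(-1)]
```

Like the TimeGrouper version, this relies on the index having no duplicated timestamps, because the final lookup is by label.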

df.set_index('uniq_id', append=True) \
.groupby(df.index.date).value.nlargest(2) \
.rename_axis([None, None, 'uniq_id']).reset_index(-1)
uniq_id value
2016-12-26 2016-12-26 11:03:10 1 342
2016-12-26 12:03:13 8 114
2016-12-27 2016-12-27 11:03:10 9 343
2016-12-27 12:03:13 16 124

A solution that is easier to remember might be:
df.sort_values(by='value', ascending=False).groupby('date').head(2)
This gives, for each date, the two rows with the highest values in the value column. The sort must be descending, because head(2) keeps the first two rows of each group.
For the example in the original question, you would first need to set df['date'] = df.index.normalize(), because the grouping key is the day part of the index rather than an existing column.
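As a rough, self-contained sketch of the sort-then-head idea on the question's data (including its duplicated timestamps): the helper name top_n_per_day and the added date column are assumptions made for illustration, not part of the original answers.

```python
import pandas as pd

# Data from the question; note the duplicated timestamps.
idx = pd.to_datetime([
    "2016-12-26 11:03:10", "2016-12-26 11:03:13",
    "2016-12-26 12:03:13", "2016-12-26 12:03:13",
    "2016-12-27 11:03:10", "2016-12-27 11:03:13",
    "2016-12-27 12:03:13", "2016-12-27 12:03:13",
])
df = pd.DataFrame(
    {"uniq_id": ["001", "004", "005", "008", "009", "013", "016", "018"],
     "value": [342, 5, 14, 114, 343, 5, 124, 114]},
    index=idx,
)

def top_n_per_day(frame, n=2):
    """Return the n rows with the largest 'value' for each calendar day."""
    # Strip the time of day so that all rows of one day share a key.
    out = frame.assign(date=frame.index.normalize().values)
    # Days ascending, values descending, so head(n) keeps the largest values.
    out = out.sort_values(["date", "value"], ascending=[True, False])
    return out.groupby("date").head(n)

top2 = top_n_per_day(df)
```

Because nothing here looks rows up by index label, duplicated timestamps are not a problem for this variant.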

Related

fb prophet daily prediction does not give accurate result for missing values

My dataframe (df) contains two inputs, UnitShrtDescr and SchShrtDescr.
For a particular UnitShrtDescr and SchShrtDescr, the model must predict the next value. But my data contains lots of missing values (the output for the in-between dates is 0).
During prediction, Prophet predicts a value for each and every day, without treating the output on the in-between dates as empty. How can I resolve this?
>df #(main dataframe)
>
UnitShrtDescr SchShrtDescr y ds id
8110 50 93 1 2011-12-01 243
3437 29 87 1 2011-12-21 133
6867 43 75 1 2011-12-23 204
1102 8 23 1 2011-12-28 36
5271 36 14 1 2011-12-28 166
... ... ... ... ... ...
13138 83 0 1 2018-05-18 390
14424 92 3 1 2018-05-18 432
11556 69 0 1 2018-05-18 334
11767 69 5 1 2018-05-18 338
4458 30 102 1 2018-05-18 141
15950 rows × 5 columns
code:
model = Prophet(daily_seasonality=True)
model.add_regressor("UnitShrtDescr")
model.add_regressor("SchShrtDescr")
model.fit(df)
The regressor values I want to predict for are UnitShrtDescr=40 and SchShrtDescr=93, so I built the future dataframe with make_future_dataframe:
future = model.make_future_dataframe(periods=100, include_history=False)
future["UnitShrtDescr"]=40
future["SchShrtDescr"]=93
The previous values for UnitShrtDescr=40 and SchShrtDescr=93 were:
>dfx[(dfx['UnitShrtDescr']==40) & (dfx['SchShrtDescr']==93)].tail(10)
>
UnitShrtDescr SchShrtDescr y ds id
6293 40 93 1 2018-02-27 189
6294 40 93 3 2018-02-28 189
6295 40 93 1 2018-03-17 189
6296 40 93 1 2018-03-29 189
6297 40 93 1 2018-03-30 189
6298 40 93 4 2018-03-31 189
6299 40 93 1 2018-04-26 189
6300 40 93 1 2018-04-27 189
6301 40 93 4 2018-04-30 189
6302 40 93 1 2018-05-16 189
Please note that the gaps between dates are large, which means y is 0 on the dates in between.
So when I make a prediction, it should also predict 0 for such in-between dates.
But in this case it predicts a nonzero y for every day, without treating the in-between y values as 0:
output = model.predict(future)
>output[['ds','yhat']].head(10)
>
ds yhat
0 2018-05-19 2.959505
1 2018-05-20 2.631181
2 2018-05-21 2.418850
3 2018-05-22 2.411914
4 2018-05-23 2.386383
5 2018-05-24 2.444841
6 2018-05-25 2.409294
7 2018-05-26 2.937428
8 2018-05-27 2.588136
9 2018-05-28 2.358953
Please suggest changes or a better alternative for my case.
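One possible first step, sketched here with pandas only: since the question states that a missing date means y is 0, those zeros can be made explicit by reindexing one series onto a complete daily calendar before fitting. This is an illustration under that assumption, not a verified fix for the forecast. The names obs, full_days and filled are hypothetical, and the four rows are taken from the tail of the sample output above.

```python
import pandas as pd

# A few observed rows for one (UnitShrtDescr, SchShrtDescr) pair.
obs = pd.DataFrame({
    "ds": pd.to_datetime(["2018-04-26", "2018-04-27", "2018-04-30", "2018-05-16"]),
    "y": [1, 1, 4, 1],
})

# Every calendar day between the first and last observation.
full_days = pd.date_range(obs["ds"].min(), obs["ds"].max(), freq="D")

# Reindex onto the full calendar; days without an observation get y = 0.
filled = (
    obs.set_index("ds")
       .reindex(full_days, fill_value=0)
       .rename_axis("ds")
       .reset_index()
)
```

With the zeros present in the training data, a model fitted per series at least sees that most days have no output. Whether this gives acceptable forecasts for such sparse counts would still need to be checked.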

Filtering static/stationary areas

I am trying to filter my sensor data. My objective is to find the stretches where the data is more or less stationary over a period of time. Can anyone help me with this?
time sensor
1 121
2 115
3 122
4 123
5 116
6 117
7 113
8 116
9 113
10 114
11 115
12 112
13 116
14 129
15 123
16 125
17 130
18 120
19 121
20 122
This is sample data. I need to take the first reading and compare it with the next 20 seconds of data. If all 20 readings are within +/- 10 of it, I need to copy those 20 readings to another column, and then continue this filtering process.
Your question is not very clear, but as I understand it: within a 20-second window, if a sensor reading is within +10 and -10 of the first reading, you want to append it to a new column, and readings above or below that range should not be considered. I tried replicating your DataFrame, and you could proceed this way:
import pandas as pd
data = {'time':[1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
'sensor':[121, 115, 122, 123,116,117,113,116,113,114,115,112,116,129,123,125,130,120,121,122,123,124,144]}
df_new = pd.DataFrame(data) #I am taking time duration of 23 seconds where 23rd second data is out of range as 144 - 121 > 10
time sensor
0 1 121
1 2 115
2 3 122
3 4 123
4 5 116
5 6 117
6 7 113
7 8 116
8 9 113
9 10 114
10 11 115
11 12 112
12 13 116
13 14 129
14 15 123
15 16 125
16 17 130
17 18 120
18 19 121
19 20 122
20 21 123
21 22 124
22 23 144
result = []  # renamed from `list` to avoid shadowing the built-in
for i in range(len(df_new['sensor'])):
    # use 20 here for the stated requirement; 23 is used only to include the out-of-range value 144
    if 0 <= df_new['time'][i] - df_new['time'][0] <= 23:
        if -10 < df_new['sensor'][0] - df_new['sensor'][i] < 10:
            result.append(df_new['sensor'][i])
        else:
            result.append('out of range')
    else:
        # stop at the end of the window; the assignment below needs one entry per row
        break
df_new['result'] = result
df_new
time sensor result
0 1 121 121
1 2 115 115
2 3 122 122
3 4 123 123
4 5 116 116
5 6 117 117
6 7 113 113
7 8 116 116
8 9 113 113
9 10 114 114
10 11 115 115
11 12 112 112
12 13 116 116
13 14 129 129
14 15 123 123
15 16 125 125
16 17 130 130
17 18 120 120
18 19 121 121
19 20 122 122
20 21 123 123
21 22 124 124
22 23 144 out of range
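The same comparison against the first reading can also be written without an explicit loop. This is a small sketch under the assumptions used above (strict +/- 10 tolerance around the first reading); it marks out-of-range rows with NaN instead of the string 'out of range' so the column stays numeric. The names tolerance, reference and in_range are illustrative.

```python
import pandas as pd

df_new = pd.DataFrame({
    "time": list(range(1, 24)),
    "sensor": [121, 115, 122, 123, 116, 117, 113, 116, 113, 114, 115, 112,
               116, 129, 123, 125, 130, 120, 121, 122, 123, 124, 144],
})

tolerance = 10
reference = df_new["sensor"].iloc[0]  # first reading in the window

# True where the reading is strictly within +/- tolerance of the reference.
in_range = (df_new["sensor"] - reference).abs() < tolerance

# Keep in-range readings; everything else becomes NaN.
df_new["result"] = df_new["sensor"].where(in_range)
```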
No sample data was provided at first, so I generated some. The time filter could be any two datetimes; I have just picked certain hours. As a stand-in for "stable", this example selects values between the 45th and 55th percentiles.
import datetime as dt
import numpy as np
import pandas as pd
t = pd.date_range(dt.date(2021,1,10), dt.date(2021,1,11), freq="min")
df = pd.DataFrame({"time":t, "val":np.random.dirichlet(np.ones(len(t)),size=1)[0]})
# filter on hour and val. val between 45th and 55th percentile
df2 = df[df.time.dt.hour.between(3,4) & df.val.between(df.val.quantile(.45), df.val.quantile(.55))]
output
time val
2021-01-10 03:13:00 0.000499
2021-01-10 03:41:00 0.000512
2021-01-10 04:00:00 0.000541
2021-01-10 04:39:00 0.000413
rolling window
The question was updated to define "stable" as the next window rows all lying within +/- rng of the first one, with the output placed in a new column.
With this definition, use rolling() with a lambda function to check that all subsequent rows in the window are within tolerance of the first observation in the window. Any observation that fails this test returns NaN. The last window - 1 rows also return NaN, because too few rows remain to run the test.
import pandas as pd
import io
import datetime as dt
import numpy as np
from distutils.version import StrictVersion
df = pd.read_csv(io.StringIO("""sensor
121
115
122
123
116
117
113
116
113
114
115
112
116
129
123
125
130
120
121
122"""))
df["time"] = pd.date_range(dt.date(2021,1,10), freq="s", periods=len(df))
# how many rows to compare
window = 5
# +/- range
rng = 10
if StrictVersion(pd.__version__) < StrictVersion("1.0.0"):
    # older pandas passes each window to the lambda as a NumPy array
    df["stable"] = df["sensor"].rolling(window).apply(lambda x: np.where(pd.Series(x).between(x[0]-rng,x[0]+rng).all(), x[0], np.nan)).shift(-(window-1))
else:
    # pandas >= 1.0 passes each window as a Series; roll over the sensor column only
    df["stable"] = df["sensor"].rolling(window).apply(lambda x: np.where(x.between(x.values[0]-rng,x.values[0]+rng).all(), x.values[0], np.nan)).shift(-(window-1))
output
sensor time stable
121 2021-01-10 00:00:00 121.0
115 2021-01-10 00:00:01 115.0
122 2021-01-10 00:00:02 122.0
123 2021-01-10 00:00:03 123.0
116 2021-01-10 00:00:04 116.0
117 2021-01-10 00:00:05 117.0
113 2021-01-10 00:00:06 113.0
116 2021-01-10 00:00:07 116.0
113 2021-01-10 00:00:08 113.0
114 2021-01-10 00:00:09 NaN
115 2021-01-10 00:00:10 NaN
112 2021-01-10 00:00:11 NaN
116 2021-01-10 00:00:12 NaN
129 2021-01-10 00:00:13 129.0
123 2021-01-10 00:00:14 123.0
125 2021-01-10 00:00:15 125.0
130 2021-01-10 00:00:16 NaN
120 2021-01-10 00:00:17 NaN
121 2021-01-10 00:00:18 NaN
122 2021-01-10 00:00:19 NaN
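For comparison, the same forward-looking window test can be sketched without rolling().apply() or a version check by using NumPy's sliding_window_view (available in NumPy 1.20 and later). This is an alternative formulation, not the method of the answer above; it uses the same window and rng values and an inclusive tolerance, matching Series.between.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

sensor = pd.Series([121, 115, 122, 123, 116, 117, 113, 116, 113, 114,
                    115, 112, 116, 129, 123, 125, 130, 120, 121, 122])
window = 5   # how many rows to compare
rng = 10     # +/- range

# One row per starting position: shape (len(sensor) - window + 1, window).
windows = sliding_window_view(sensor.to_numpy(), window)

# A window is stable if every value is within rng of the window's first value.
first = windows[:, [0]]
ok = (np.abs(windows - first) <= rng).all(axis=1)

# Stable windows report their first value; everything else stays NaN,
# including the last window - 1 rows, which have no complete window.
stable = pd.Series(np.nan, index=sensor.index)
stable.iloc[: len(ok)] = np.where(ok, windows[:, 0], np.nan)
```

On the sample data this marks the same rows as stable as the rolling() output shown above.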

How to reshape dataframe with pandas

I have a dataframe as shown in the picture below. How can I efficiently extract the value after the last ":" in each cell and build a new dataframe? For instance, "cnt:1" should become "1", "ack:dsn:113" should become "113", and so on.
You can use rsplit with a limit of 1 and select the second element of each resulting list:
df = df.applymap(lambda x: x.rsplit(':', 1)[1])
Or:
df = df.apply(lambda x: x.str.rsplit(':', 1).str[1])
print (df)
dsn cnt retry rssir lqir rssif lqif
0 113 1 1 -24 6 -49 5
1 114 2 1 -24 10 -49 15
2 115 3 1 -24 5 -59 14
3 116 4 1 -24 8 -58 11
4 117 5 1 -24 12 -57 14
Or, more simply, as Anton vBR pointed out:
df = df.applymap(lambda x: x.rsplit(':')[-1])
df = df.apply(lambda x: x.str.rsplit(':').str[-1])
With pandas.DataFrame.replace using regex=True
df.replace('(.*:)', '', regex=True)
dsn cnt retry rssir lqir rssif lqif
0 113 1 1 -24 6 -49 5
1 114 2 1 -24 10 -49 15
2 115 3 1 -24 5 -59 14
3 116 4 1 -24 8 -58 11
4 117 5 1 -24 12 -57 14
More cumbersome, with NumPy string functions:
import numpy as np
pd.DataFrame(
    np.array(
        # np.char.rsplit is the public name for numpy.core.defchararray.rsplit
        [t[1] for t in np.char.rsplit(
            df.values.ravel().astype(str), ':', 1
        )]
    ).reshape(df.shape),
    df.index, df.columns
)
dsn cnt retry rssir lqir rssif lqif
0 113 1 1 -24 6 -49 5
1 114 2 1 -24 10 -49 15
2 115 3 1 -24 5 -59 14
3 116 4 1 -24 8 -58 11
4 117 5 1 -24 12 -57 14
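Since the original picture is not available, here is a small self-contained sketch of the rsplit approach on made-up input. Only "cnt:1" and "ack:dsn:113" come from the question; the other cell values are hypothetical. It also converts the extracted strings to numbers, which the answers above leave as text. In recent pandas releases DataFrame.applymap is deprecated in favour of DataFrame.map, so the column-wise .str form is used here.

```python
import pandas as pd

# Toy input in the "label:value" style described in the question.
df = pd.DataFrame({
    "dsn": ["ack:dsn:113", "ack:dsn:114"],
    "cnt": ["cnt:1", "cnt:2"],
})

# Split each cell once from the right and keep the part after the last colon.
cleaned = df.apply(lambda col: col.str.rsplit(":", n=1).str[-1])

# The extracted pieces are still strings; convert them column by column.
numeric = cleaned.apply(pd.to_numeric)
```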

Adding a row from a dataframe into another by matching columns with NaN values in row pandas python

The Scenario:
I have two dataframes, fc0 and yc0. fc0 is a cluster, and yc0 is another dataframe that needs to be merged into fc0.
The nature of the data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and only a few hundred values in yc0. I need the yc0 row to go into fc0, matched on column names.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but it wasn't really helpful.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but ended up inserting NaN values for the first 16 columns, with the rest of the data shifted by that many columns
Link2 Couldn't match column keys, besides I tried it for row.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with Join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MCVE. Does this small sample show the behavior you are expecting?
df1 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
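Applied to frames shaped like fc0 and yc0, the same idea might look like the sketch below. It assumes that only fc0's columns should survive, so columns that exist only in yc0 (such as 9 and 15 in the question) are dropped, and it uses small toy values rather than the real 1682-column data. The names aligned and combined are illustrative.

```python
import pandas as pd

# Toy stand-ins for the question's frames.
fc0 = pd.DataFrame(
    {"uid": [235, 236], 1: [4.0, 3.5], 2: [4.1, 3.1], 3: [4.1, 3.7]},
    index=[234, 235],
)
yc0 = pd.DataFrame({"uid": [944], 1: [5.0], 2: [3.0], 9: [3.0]})

# Align yc0 to fc0's columns: missing columns become NaN, extras are dropped.
aligned = yc0.reindex(columns=fc0.columns)

# Stack the aligned row under fc0; concat matches columns by name.
combined = pd.concat([fc0, aligned], ignore_index=True)
```

Skipping the reindex step and calling pd.concat([fc0, yc0]) directly would instead keep the union of both column sets, as in the df1/df2 example above.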

pandas: dropping columns based on value in last row

Starting out with data like this:
np.random.seed(314)
df = pd.DataFrame({
'date':[pd.date_range('2016-04-01', '2016-04-05')[r] for r in np.random.randint(0,5,20)],
'cat':['ABCD'[r] for r in np.random.randint(0,4,20)],
'count': np.random.randint(0,100,20)
})
cat count date
0 B 84 2016-04-04
1 A 95 2016-04-05
2 D 89 2016-04-02
3 D 39 2016-04-05
4 A 39 2016-04-01
5 C 61 2016-04-05
6 C 58 2016-04-04
7 B 49 2016-04-03
8 D 20 2016-04-02
9 B 54 2016-04-01
10 B 87 2016-04-01
11 D 36 2016-04-05
12 C 13 2016-04-05
13 A 79 2016-04-04
14 B 91 2016-04-03
15 C 83 2016-04-05
16 C 85 2016-04-05
17 D 93 2016-04-01
18 C 32 2016-04-02
19 B 29 2016-04-03
Next, I calculate totals by date, pivot cat into columns, and calculate running totals for each column:
summary = df.groupby(['date','cat']).sum().unstack().fillna(0).cumsum()
cat A B C D
date
2016-04-01 80 235 99 0
2016-04-02 85 295 153 14
2016-04-03 111 363 224 14
2016-04-04 111 379 296 50
2016-04-05 111 511 296 50
Now I want to remove columns whose value in the last row is less than some value, say 150. The result should look like:
cat B C
date
2016-04-01 235 99
2016-04-02 295 153
2016-04-03 363 224
2016-04-04 379 296
2016-04-05 511 296
I've figured out one part of it:
mask = summary[-1:].squeeze() > 150
cat
count A False
B True
C True
D False
will give me a mask for dropping columns. What I can't figure out is how to use it with a call to summary.drop(...). Any hints?
Instead of dropping the columns you do not want, you can also select the ones you want (using the mask with boolean indexing):
In [16]: mask = summary[-1:].squeeze() > 220
In [17]: summary.loc[:, mask]
Out[17]:
count
cat B D
date
2016-04-01 141.0 94.0
2016-04-02 235.0 94.0
2016-04-03 235.0 144.0
2016-04-04 326.0 144.0
2016-04-05 384.0 229.0
(I used 220 instead of 150, otherwise all columns were selected)
Further, a better way to calculate the mask is probably the following:
mask = summary.iloc[-1] > 220
which just selects the last row (by position) instead of using squeeze.
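To tie this back to the original question about summary.drop(...), here is a short sketch showing both routes on the running totals from the question. For simplicity it assumes flat column labels A to D rather than the (count, cat) MultiIndex produced by unstack(). The names kept and dropped are illustrative.

```python
import pandas as pd

# Running totals from the question, with flat column labels.
summary = pd.DataFrame(
    {"A": [80, 85, 111, 111, 111],
     "B": [235, 295, 363, 379, 511],
     "C": [99, 153, 224, 296, 296],
     "D": [0, 14, 14, 50, 50]},
    index=pd.date_range("2016-04-01", periods=5, name="date"),
)

# Boolean mask over the columns, taken from the last row by position.
mask = summary.iloc[-1] > 150

# Route 1: select the columns to keep.
kept = summary.loc[:, mask]

# Route 2: drop the columns where the mask is False.
dropped = summary.drop(columns=mask.index[~mask])
```

Both routes give the same frame; selecting with .loc avoids having to invert the mask.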
