Select groups using slicing based on the group index in pandas DataFrame - python

I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how can I achieve the equivalent of iloc on the groups, so that I am able to slice them?

I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
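For reference, a similar selection can be written with GroupBy.ngroup, which numbers each group in order; this is a minimal, self-contained sketch using a shortened version of the data above:
import pandas as pd
df = pd.DataFrame({
    'user_id': [0, 0, 3, 3, 3, 8, 8, 15, 23, 24],
    'date': pd.to_datetime([
        '2019-04-13 02:00:00', '2019-04-13 03:00:00',
        '2019-02-18 22:00:00', '2019-02-18 23:00:00', '2019-02-19 00:00:00',
        '2019-04-05 04:00:00', '2019-04-05 05:00:00',
        '2019-04-28 19:00:00', '2019-06-24 02:00:00', '2019-03-27 12:00:00',
    ]),
})
# ngroup() labels every row with its group number (0, 1, 2, ...),
# so keeping labels below 3 selects all rows of the first three users.
first_three = df[df.groupby('user_id').ngroup() < 3]
print(first_three)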

Related

Pandas Dataframe - Search by index

I have a dataframe where the index is a timestamp.
DATE VALOR
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
... ...
2021-04-30 19:00:00 0.77059
2021-04-30 20:00:00 0.49285
2021-04-30 21:00:00 0.49057
2021-04-30 22:00:00 0.50339
2021-04-30 23:00:00 0.48792
I'm searching for a specific date:
drop.loc['2020-12-01 04:00:00']
VALOR 0.0108
Name: 2020-12-01 04:00:00, dtype: float64
I want to get the positional index of the search above; in this case it is row 5. Then I want to use that value to slice the dataframe:
drop[:5]
Thanks!
It looks like you want to subset drop up to index '2020-12-01 04:00:00'.
Then simply do this: drop.loc[:'2020-12-01 04:00:00']
No need to manually get the line number.
output:
VALOR
DATE
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
If you really want to get the position:
pos = drop.index.get_loc(key='2020-12-01 04:00:00') ## returns: 4
drop[:pos+1]
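For completeness, a minimal runnable sketch of both approaches, using a small reconstruction of the frame from the question (the variable name drop follows the question):
import pandas as pd
idx = pd.to_datetime([
    '2020-12-01 00:00:00', '2020-12-01 01:00:00', '2020-12-01 02:00:00',
    '2020-12-01 03:00:00', '2020-12-01 04:00:00',
])
drop = pd.DataFrame({'VALOR': [0.00635, 0.00941, 0.01151, 0.00281, 0.01080]}, index=idx)
# Label-based: slice up to and including the timestamp.
print(drop.loc[:'2020-12-01 04:00:00'])
# Position-based: get_loc returns the integer position of the label.
pos = drop.index.get_loc('2020-12-01 04:00:00')  # 4
print(drop.iloc[:pos + 1])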

Assign first element of groupby to a column yields NaN

Why does this not work?
I get the right results if I just print it, but if I use the same expression to assign it to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_Last, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but I want to get the first elements.
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output contains the values aggregated by the cumsum column with the aggregation function first.
The index therefore holds the unique cumsum values, so if you assign this to a new column there is a mismatch with the original index and the result is all NaN.
The solution is GroupBy.transform, which repeats the aggregated values in a Series of the same size as the original DataFrame, so the index matches the original and the assignment works as expected:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")

How to add a new categorical column with numbering as per time Interval in Pandas

Value
2021-07-15 00:00:00 10
2021-07-15 06:00:00 10
2021-07-15 12:00:00 10
2021-07-15 18:00:00 10
2021-07-16 00:00:00 20
2021-07-16 06:00:00 10
2021-07-16 12:00:00 10
2021-07-16 18:00:00 20
I want to add a column such that when the time is:
00:00:00 1
06:00:00 2
12:00:00 3
18:00:00 4
Eventually, I want something like this
Value Number
2021-07-15 00:00:00 10 1
2021-07-15 06:00:00 10 2
2021-07-15 12:00:00 10 3
2021-07-15 18:00:00 10 4
2021-07-16 00:00:00 20 1
2021-07-16 06:00:00 10 2
2021-07-16 12:00:00 10 3
2021-07-16 18:00:00 20 4
and so on
I want the Number column such that whenever the time is 00:00:00 it says 1, whenever it's 06:00:00 it says 2, whenever it's 12:00:00 it says 3, and whenever it's 18:00:00 it says 4. That way I will have a categorical column containing only the values 1, 2, 3, 4.
Sorry, new here, so I don't have enough rep to comment. But @Keiku's solution is closer than you realise. If you replace .time with .hour, you get the hour of the day. Integer-divide that by 6 to get the 0-3 categories for 0:00 to 18:00. If you must have them in the range 1-4 specifically, simply add 1.
To borrow @Keiku's example code:
import pandas as pd
df = pd.DataFrame([
    '2021-07-15 00:00:00 0.48',
    '2021-07-15 06:00:00 80.00',
    '2021-07-15 12:00:00 6.10',
    '2021-07-15 18:00:00 1400.00',
    '2021-07-16 00:00:00 1400.00',
], columns=['value'])
df['date'] = pd.to_datetime(df['value'].str[:19])
df.sort_values(['date'], ascending=[True], inplace=True)
df['category'] = df['date'].dt.hour // 6  # + 1 if you want this to be 1-4
You can use pd.to_datetime to convert to datetime and .dt.time to extract the time. You can use pd.factorize for 1,2,3,4 categories.
import pandas as pd
df = pd.DataFrame([
    '2021-07-15 00:00:00 0.48',
    '2021-07-15 06:00:00 80.00',
    '2021-07-15 12:00:00 6.10',
    '2021-07-16 00:00:00 1400.00',
    '2021-07-15 18:00:00 1400.00',
], columns=['value'])
df
# value
# 0 2021-07-15 00:00:00 0.48
# 1 2021-07-15 06:00:00 80.00
# 2 2021-07-15 12:00:00 6.10
# 3 2021-07-16 00:00:00 1400.00
# 4 2021-07-15 18:00:00 1400.00
df['date'] = pd.to_datetime(df['value'].str[:19])
df.sort_values(['date'], ascending=[True], inplace=True)
df['time'] = df['date'].dt.time
df['index'], _ = pd.factorize(df['time'])
df['index'] += 1
df
# value date time index
# 0 2021-07-15 00:00:00 0.48 2021-07-15 00:00:00 00:00:00 1
# 1 2021-07-15 06:00:00 80.00 2021-07-15 06:00:00 06:00:00 2
# 2 2021-07-15 12:00:00 6.10 2021-07-15 12:00:00 12:00:00 3
# 4 2021-07-15 18:00:00 1400.00 2021-07-15 18:00:00 18:00:00 4
# 3 2021-07-16 00:00:00 1400.00 2021-07-16 00:00:00 00:00:00 1
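If the timestamps already form the DataFrame index, as in the question's sample, the hour-based idea from the first answer can be applied directly; a minimal sketch (treating the index as a DatetimeIndex is an assumption):
import pandas as pd
df = pd.DataFrame(
    {'Value': [10, 10, 10, 10, 20, 10, 10, 20]},
    index=pd.to_datetime([
        '2021-07-15 00:00:00', '2021-07-15 06:00:00',
        '2021-07-15 12:00:00', '2021-07-15 18:00:00',
        '2021-07-16 00:00:00', '2021-07-16 06:00:00',
        '2021-07-16 12:00:00', '2021-07-16 18:00:00',
    ]),
)
# Integer-divide the hour by 6 (0 -> 0, 6 -> 1, 12 -> 2, 18 -> 3), then add 1.
df['Number'] = df.index.hour // 6 + 1
print(df)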

How can I standardize time series data?

I'm working on OHLC trading data and I have different datasets with different ranges of prices. For example, on one dataset the price will range from 100 to 150, on another from 2 to 3, on another from 0.5 to 0.8, and so on, so very different magnitudes.
On each dataset I'm looping through the data and, for each point, computing the slope over the last five prices using np.polyfit().
Here is my code:
import numpy as np

x = df['Date'].to_numpy()
y = df['Close'].to_numpy()
fits = []
for idx, j in enumerate(y):
    arr_y = y[:idx]
    arr_x = x[:idx]
    p_y = arr_y[-5:]
    p_x = arr_x[-5:]
    if len(p_y) >= 4 and len(p_x) >= 4:
        fit = np.polyfit(p_x, p_y, 1)
        ang_coeff = fit[0]
        intercept = fit[1]
        fits.append(ang_coeff)
    else:
        fits.append(np.nan)
df['SLOPE'] = fits
Here is what the code does: loop through the prices and, for each price, calculate the slope based on the last five prices.
This code works well, but the problem is that, since I'm working with multiple datasets where prices are very different, it becomes hard for me to perform any kind of analysis: a very high slope value on one dataset will be very low on another. My question is: how can I standardize or normalize (I know they are two different things) this data? How can I process my slope values so that a "high" slope value on one dataset is also high on another dataset?
Here is a sample of my outputs:
Date Close Slope
2021-01-17 00:00:00 34031.098338 29.572362
2021-01-17 04:00:00 34034.475090 20.097445
2021-01-17 08:00:00 34034.982351 8.655060
2021-01-17 12:00:00 34044.665386 3.914707
2021-01-17 16:00:00 34049.372571 4.538112
2021-01-17 20:00:00 34059.458965 4.673876
2021-01-18 00:00:00 34063.656831 6.435797
2021-01-18 04:00:00 34070.819559 7.214254
2021-01-18 08:00:00 34086.331298 6.659261
2021-01-18 12:00:00 34099.272005 8.527805
2021-01-18 16:00:00 34099.560423 10.230055
2021-01-18 20:00:00 34106.109568 10.025963
2021-01-19 00:00:00 34110.932662 8.380914
2021-01-19 04:00:00 34122.312205 5.604029
2021-01-19 08:00:00 34134.855812 5.745264
2021-01-19 12:00:00 34162.275141 8.679342
2021-01-19 16:00:00 34190.550778 13.625430
2021-01-19 20:00:00 34211.505419 19.919917
2021-01-20 00:00:00 34222.969489 23.408140
2021-01-20 04:00:00 34237.699255 22.545763
2021-01-20 08:00:00 34240.094551 18.326694
2021-01-20 12:00:00 34239.827609 12.528138
2021-01-20 16:00:00 34239.900596 7.376944
2021-01-20 20:00:00 34246.295214 3.599057
2021-01-21 00:00:00 34248.790292 1.699797
2021-01-21 04:00:00 34251.656251 2.385909
2021-01-21 08:00:00 34211.135875 3.254698
2021-01-21 12:00:00 34150.903010 -5.216841
2021-01-21 16:00:00 34127.857586 -22.843883
2021-01-21 20:00:00 34072.463679 -34.261865
2021-01-22 00:00:00 34018.425804 -44.166343
2021-01-22 04:00:00 33974.399053 -46.385947
2021-01-22 08:00:00 33946.475779 -46.243970
2021-01-22 12:00:00 33929.852159 -46.082824
2021-01-22 16:00:00 33927.598892 -35.717306
2021-01-22 20:00:00 33918.627401 -22.620072
2021-01-23 00:00:00 33905.044709 -13.042019
2021-01-23 04:00:00 33894.973038 -9.408690
2021-01-23 08:00:00 33861.417022 -9.231243
And a different dataset:
Date Close Slope
2021-02-18 04:00:00 0.492204 4.013722e-04
2021-02-18 08:00:00 0.492488 4.721365e-04
2021-02-18 12:00:00 0.493027 4.831912e-04
2021-02-18 16:00:00 0.493569 4.591663e-04
2021-02-18 20:00:00 0.494286 4.463141e-04
2021-02-19 00:00:00 0.494799 5.245110e-04
2021-02-19 04:00:00 0.495515 5.880476e-04
2021-02-19 08:00:00 0.496172 6.204948e-04
2021-02-19 12:00:00 0.496634 6.435782e-04
2021-02-19 16:00:00 0.497133 6.069365e-04
2021-02-19 20:00:00 0.497526 5.787601e-04
2021-02-20 00:00:00 0.497712 4.983345e-04
2021-02-20 04:00:00 0.497762 3.972312e-04
2021-02-20 08:00:00 0.497956 2.835458e-04
2021-02-20 12:00:00 0.498307 1.880521e-04
2021-02-20 16:00:00 0.498692 1.804976e-04
2021-02-20 20:00:00 0.498813 2.505608e-04
2021-02-21 00:00:00 0.499153 2.839021e-04
2021-02-21 04:00:00 0.499364 2.901245e-04
2021-02-21 08:00:00 0.499471 2.574213e-04
2021-02-21 12:00:00 0.499556 2.107408e-04
2021-02-21 16:00:00 0.499902 1.803125e-04
2021-02-21 20:00:00 0.500177 1.690260e-04
2021-02-22 00:00:00 0.500221 2.059057e-04
2021-02-22 04:00:00 0.501403 2.121462e-04
2021-02-22 08:00:00 0.502194 4.012434e-04
2021-02-22 12:00:00 0.502318 5.809102e-04
2021-02-22 16:00:00 0.502852 6.255775e-04
2021-02-22 20:00:00 0.503182 6.177676e-04
2021-02-23 00:00:00 0.503209 4.214821e-04
2021-02-23 04:00:00 0.503271 2.893487e-04
2021-02-23 08:00:00 0.502459 2.262497e-04
2021-02-23 12:00:00 0.502190 -6.951268e-05
2021-02-23 16:00:00 0.501697 -2.733434e-04
2021-02-23 20:00:00 0.501526 -4.105911e-04
2021-02-24 00:00:00 0.501506 -4.251799e-04
2021-02-24 04:00:00 0.501420 -2.571382e-04
2021-02-24 08:00:00 0.501332 -1.730550e-04
2021-02-24 12:00:00 0.501099 -8.359633e-05
2021-02-24 16:00:00 0.500684 -1.027447e-04
2021-02-24 20:00:00 0.500341 -1.962963e-04
2021-02-25 00:00:00 0.500027 -2.806065e-04
2021-02-25 04:00:00 0.499747 -3.368647e-04
2021-02-25 08:00:00 0.499428 -3.361539e-04
2021-02-25 12:00:00 0.499212 -3.105732e-04
2021-02-25 16:00:00 0.498883 -2.857117e-04
So these two datasets have very different Close values, which means the slope values are on completely different scales, so a very "high" slope value on the second dataset is nothing compared to the first dataset's slope values. Is there any way I can solve this? Do I have to apply some sort of normalization or standardization? Or do I need to use a different kind of calculation or metric? Thanks in advance!
The Close values can be scaled using sklearn's MinMaxScaler()
You can also simplify the polyfit loop by using Rolling.apply() with a window size of 5
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
for df in [df1, df2]:
    df['Close'] = scaler.fit_transform(df['Close'].to_numpy().reshape(-1, 1))
    df['Slope'] = df['Close'].rolling(5, center=True).apply(lambda x: np.polyfit(x.index, x, 1)[0])
>>> df1
Date Close Slope
0 2021-01-17 00:00:00 0.434814 NaN
1 2021-01-17 04:00:00 0.443467 NaN
2 2021-01-17 08:00:00 0.444766 0.011977
3 2021-01-17 12:00:00 0.469580 0.016492
4 2021-01-17 16:00:00 0.481642 0.018487
...
34 2021-01-22 16:00:00 0.169593 -0.024110
35 2021-01-22 20:00:00 0.146603 -0.023655
36 2021-01-23 00:00:00 0.111797 -0.039980
37 2021-01-23 04:00:00 0.085988 NaN
38 2021-01-23 08:00:00 0.000000 NaN
>>> df2
Date Close Slope
0 2021-02-18 04:00:00 0.000000 NaN
1 2021-02-18 08:00:00 0.025662 NaN
2 2021-02-18 12:00:00 0.074365 0.047393
3 2021-02-18 16:00:00 0.123340 0.053140
4 2021-02-18 20:00:00 0.188127 0.056077
...
41 2021-02-25 00:00:00 0.706876 -0.028065
42 2021-02-25 04:00:00 0.681576 -0.025815
43 2021-02-25 08:00:00 0.652751 -0.025508
44 2021-02-25 12:00:00 0.633234 NaN
45 2021-02-25 16:00:00 0.603506 NaN
Recommend you adjust the scale by first calculating the Average True Range (ATR, see https://www.investopedia.com/terms/a/atr.asp) of one of the datasets and figure out a reasonable scale that gives a representative slope for that one. Then, for other datasets, calculate the ratio of their ATR to the standardized dataset's ATR and adjust the slope by that ratio.
For example if a new dataset has an ATR which is only a tenth of your "standard" ATR, then you multiply its slope measurements by 10 to put it to the same scale.
I recommend you use unit length scaling (scaling to unit length) or unit normal scaling (standardization) if you want the series to maintain their statistical properties but be scale-free. It doesn't matter which one you use since you're just looking at slopes, and the fitted slopes between the two methods are identical (Montgomery et al., section 3.9).
Essentially, for unit normal scaling, take the z-score of all of your regressors and the response variable and fit the transformed data without an intercept. For unit length scaling, take the mean-deviated regressor and response values divided by the square root of the corrected sum of squares.
There are other methods you can try. They fall under the heading of feature scaling and include min-max normalization and mean normalization (Wikipedia, 2021).
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). John Wiley & Sons, Inc.
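As a rough sketch of the standardization idea described above (z-score the Close series before fitting, so slopes from different datasets share a scale); the function name and the window size of 5 are illustrative, not from the original answers:
import numpy as np
import pandas as pd

def standardized_slopes(df, window=5):
    # z-score the prices so every dataset has zero mean and unit variance,
    # then fit a straight line over each rolling window and keep its slope.
    z = (df['Close'] - df['Close'].mean()) / df['Close'].std()
    return z.rolling(window).apply(
        lambda w: np.polyfit(np.arange(len(w)), w, 1)[0], raw=True)

# Usage (assumes df1 and df2 each have a 'Close' column):
# df1['Slope_std'] = standardized_slopes(df1)
# df2['Slope_std'] = standardized_slopes(df2)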

Flagging list of datetimes within date ranges in pandas dataframe

I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items are all within a row (i.e. between a start and end time) in the dataframe. Is there an easy way to locate the rows that contain the time window in which each alarm time falls? (Sorry for the poor wording there!)
e.g.
for i in alarms:
    df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Here, a flag would go against rows (well, indices) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
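An alternative sketch that avoids swapping the index, using IntervalIndex.get_indexer to map each alarm to the row whose interval contains it (-1 means no match); it assumes the df with start_date / end_Date columns and the alarms list built in the question:
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
rows = intervals.get_indexer(alarms)
df.loc[df.index[rows[rows >= 0]], 'flag'] = 'alarm'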
You were calling your columns start_date and end_Date, but in your for loop you use start_time and end_time.
Try this:
import pandas as pd

df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())

# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})

for i in alarms:
    df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'

print(df[df['Flag'] == 'Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
