Why does this not work out?
I get the right results if I just print it out, but if I use the same expression to assign it to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_First, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements.
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is the Date values aggregated by the column cumsum with the aggregation function first.
So the index holds the unique cumsum values; if you assign that to a new column there is a mismatch with the original index, and the result is all NaNs.
The solution is GroupBy.transform, which repeats each aggregated value across a Series (column) of the same size as the original DataFrame, so the index matches the original and the assignment works perfectly:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")
I have a dataframe where the index is a timestamp.
DATE VALOR
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
... ...
2021-04-30 19:00:00 0.77059
2021-04-30 20:00:00 0.49285
2021-04-30 21:00:00 0.49057
2021-04-30 22:00:00 0.50339
2021-04-30 23:00:00 0.48792
I'm searching for a specific date:
drop.loc['2020-12-01 04:00:00']
VALOR 0.0108
Name: 2020-12-01 04:00:00, dtype: float64
I want to get back the row position of the search above.
In this case it is row 5. Afterwards I want to use this value to slice the dataframe:
drop[:5]
Thanks!
It looks like you want to subset drop up to index '2020-12-01 04:00:00'.
Then simply do this: drop.loc[:'2020-12-01 04:00:00']
No need to manually get the line number.
output:
VALOR
DATE
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
If you really want to get the position:
pos = drop.index.get_loc('2020-12-01 04:00:00')  # returns: 4
drop.iloc[:pos + 1]
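For completeness, a small self-contained sketch; the values are copied from the first rows shown in the question:
import pandas as pd

# Frame with a DatetimeIndex named DATE
idx = pd.date_range('2020-12-01 00:00', periods=5, freq='H', name='DATE')
drop = pd.DataFrame({'VALOR': [0.00635, 0.00941, 0.01151, 0.00281, 0.01080]}, index=idx)

# Label-based slice: the end label is included
print(drop.loc[:'2020-12-01 04:00:00'])

# Position-based: get_loc returns the integer position of the label
pos = drop.index.get_loc('2020-12-01 04:00:00')  # 4
print(drop.iloc[:pos + 1])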
I have a dataframe that's indexed by datetime and has one column of integers and another column where I want to put a string if a condition on the integers is met. I need the condition to compare the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day the condition still uses it, and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out whether that will work or not.
Not sure if this is the best way to do it, but it gets you the answer; group by the calendar date so the shift never crosses a day boundary:
import numpy as np

df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
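As a quick check, a minimal sketch on made-up data spanning a day boundary (column names follow the question):
import numpy as np
import pandas as pd

# Hypothetical frame crossing midnight
idx = pd.date_range('2021-01-01 22:00', periods=4, freq='H')
df = pd.DataFrame({'IntCol': [1, 5, 9, 2]}, index=idx)

df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1),
                           'Success', 'Failure')
print(df)
# The 00:00 row gets 'Failure' even though 9 > 5, because the previous row is on a different day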
I think this is what you want. You were probably closer to the answer than you thought...
There are two dataframes below to show that your logic works whether the data is random or a sorted range of integers.
You will need to import random (and pandas) to run the examples.
import random
import pandas as pd

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))
def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x
#### Will show Success in all rows except where the date changes, because it's a numerically ordered range
df = pd.DataFrame({'IntCol': range(10,26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
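To attach this directly to the 168-row predictions frame (the variable and column names below are assumptions, not from the question), you can let periods=len(...) build the range:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 168 hourly ARIMA forecasts
predictions = pd.DataFrame({'cars': np.random.randint(0, 100, size=168)})

# One timestamp per row, starting at 2021-01-01 00:00 and stepping by one hour
predictions['datetime'] = pd.date_range(start='2021-01-01 00:00',
                                        periods=len(predictions), freq='H')
print(predictions.head())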
I'm new to Python and I'm having some trouble writing the following in a Pythonic way, without using too many loops that could slow down performance.
I have a dataframe that looks like this:
Datetime Status Value
2014-09-28 18:00:00 On 3
2014-09-28 19:00:00 On 3
2014-09-28 20:00:00 On 3
2014-09-28 21:00:00 Off 4
2014-09-28 22:00:00 Off 6
2014-09-28 23:00:00 Unknown nan
2014-09-29 00:00:00 Unknown nan
2014-09-29 01:00:00 Unknown nan
2014-09-29 02:00:00 Unknown nan
2014-09-29 03:00:00 On 1
2014-09-29 04:00:00 On 3
2014-09-29 05:00:00 On 5
2014-09-29 06:00:00 Off 3
2014-09-29 07:00:00 Off 2
And I need to create another dataframe with the initial date, final date, and duration for which the machine was in a certain status. In addition, I would like to determine the average value for each status (consecutive lines with the same status). For example:
Initial_date Final_date Duration Value Status
2014-09-28 18:00:00 2014-09-28 20:00:00 3 3 On
2014-09-28 21:00:00 2014-09-28 22:00:00 2 5 Off
2014-09-28 23:00:00 2014-09-29 02:00:00 4 nan Unknown
2014-09-29 03:00:00 2014-09-29 05:00:00 3 3 On
2014-09-29 06:00:00 2014-09-29 07:00:00 2 2.5 Off
Could you please help me? Thanks in advance!
Try constructing your dataframe with something like this:
import numpy as np
import pandas as pd

idx = df[df.ne(df.shift(-1)).Status].index
idx2 = pd.cut(df.index, bins=np.append([0], idx),
              include_lowest=True, right=True)

df2 = pd.DataFrame({
    'Initial_date': df[df.ne(df.shift()).Status].Datetime.values,
    'Final_date': df[df.ne(df.shift(-1)).Status].Datetime.values,
    'Duration': df.groupby(idx2).size().values,
    'Value': df.groupby(idx2).Value.mean().values,
    'Status': df.groupby(idx2).Status.first().values,
})
In this script I identify where the variable 'Status' changes by comparing the dataframe with a shifted version of itself, combining the pandas ne and shift functions. The data is then segmented with pandas cut and aggregated with groupby to obtain the statistics you requested (duration, mean value and status). This can easily be extended to other meaningful quantities if you like (e.g. standard deviation).
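If you prefer a more compact route, here is a sketch of an equivalent approach (not the code above) that numbers consecutive runs with a cumulative sum of status changes, shown on the sample data from the question:
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'Datetime': pd.date_range('2014-09-28 18:00', periods=14, freq='H'),
    'Status': ['On'] * 3 + ['Off'] * 2 + ['Unknown'] * 4 + ['On'] * 3 + ['Off'] * 2,
    'Value': [3, 3, 3, 4, 6, np.nan, np.nan, np.nan, np.nan, 1, 3, 5, 3, 2],
})

# Each change of Status starts a new group number
grp = df['Status'].ne(df['Status'].shift()).cumsum()

df2 = df.groupby(grp).agg(
    Initial_date=('Datetime', 'first'),
    Final_date=('Datetime', 'last'),
    Duration=('Datetime', 'size'),
    Value=('Value', 'mean'),
    Status=('Status', 'first'),
).reset_index(drop=True)
print(df2)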
I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how could I achieve the same thing as iloc on the groups, so that I am able to slice them?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
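A small sketch of that filter on a made-up miniature frame (the user_ids and dates here are hypothetical):
import pandas as pd

df = pd.DataFrame({
    'user_id': [0, 0, 3, 3, 8, 15, 23, 24],
    'date': pd.date_range('2019-04-13 02:00', periods=8, freq='H'),
})

# Keep every row belonging to the first 3 distinct user_ids, in order of appearance
first3 = df[df['user_id'].isin(df['user_id'].unique()[:3])]
print(first3)
When user_id is already sorted, as in the question, df[df.groupby('user_id').ngroup() < 3] gives the same result.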