Access a pandas group as new data frame - python

I am new to data analysis with Python/pandas, coming from a Matlab background. I am trying to group data and then process the individual groups. However, I cannot figure out how to actually access the grouping result.
Here is my setup: I have a pandas DataFrame df with a regularly spaced DatetimeIndex named timestamp at 10-minute frequency. My data spans several weeks in total. I now want to group the data by days, like so:
grouping = df.groupby(pd.Grouper(level="timestamp", freq="D"))
Note that I do not want to aggregate the groups (contrary to most examples and tutorials, it seems). I simply want to take each group in turn and process it individually, like so (does not work):
for g in grouping:
    g_df = g.toDataFrame()
    some_processing(g_df)
How do I do that? I haven't found any way to extract daily dataframe objects from the DataFrameGroupBy object.

Expand your groups into a dictionary of dataframes:
data = dict(list(df.groupby(df.index.date.astype(str))))
>>> data.keys()
dict_keys(['2021-01-01', '2021-01-02'])
>>> data['2021-01-01']
value
timestamp
2021-01-01 00:00:00 0.405630
2021-01-01 01:00:00 0.262235
2021-01-01 02:00:00 0.913946
2021-01-01 03:00:00 0.467516
2021-01-01 04:00:00 0.367712
2021-01-01 05:00:00 0.849070
2021-01-01 06:00:00 0.572143
2021-01-01 07:00:00 0.423401
2021-01-01 08:00:00 0.931463
2021-01-01 09:00:00 0.554809
2021-01-01 10:00:00 0.561663
2021-01-01 11:00:00 0.537471
2021-01-01 12:00:00 0.461099
2021-01-01 13:00:00 0.751878
2021-01-01 14:00:00 0.266371
2021-01-01 15:00:00 0.954553
2021-01-01 16:00:00 0.895575
2021-01-01 17:00:00 0.752671
2021-01-01 18:00:00 0.230219
2021-01-01 19:00:00 0.750243
2021-01-01 20:00:00 0.812728
2021-01-01 21:00:00 0.195416
2021-01-01 22:00:00 0.178367
2021-01-01 23:00:00 0.607105
Note: I changed your group keys to make indexing easier: '2021-01-01' instead of Timestamp('2021-01-01 00:00:00', freq='D').
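If you only want to loop over the groups, note that iterating a DataFrameGroupBy already yields (key, group) pairs in which each group is a regular DataFrame, so no conversion is needed. A minimal sketch of the loop from the question, assuming some_processing is your own function:
for day, g_df in df.groupby(pd.Grouper(level="timestamp", freq="D")):
    some_processing(g_df)  # g_df is already a DataFrame
A single group can also be pulled out by key with grouping.get_group(key).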

Related

Pandas Dataframe - Search by index

I have a dataframe where the index is a timestamp.
DATE VALOR
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
... ...
2021-04-30 19:00:00 0.77059
2021-04-30 20:00:00 0.49285
2021-04-30 21:00:00 0.49057
2021-04-30 22:00:00 0.50339
2021-04-30 23:00:00 0.48792
I'm searching for a specific date:
drop.loc['2020-12-01 04:00:00']
VALOR 0.0108
Name: 2020-12-01 04:00:00, dtype: float64
I want to get the positional index of the search above. In this case it is row 5. Then I want to use this value to slice the dataframe:
drop[:5]
Thanks!
It looks like you want to subset drop up to index '2020-12-01 04:00:00'.
Then simply do this: drop.loc[:'2020-12-01 04:00:00']
No need to manually get the line number.
output:
VALOR
DATE
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
If you really want to get the position:
pos = drop.index.get_loc(key='2020-12-01 04:00:00')  # returns 4
drop[:pos+1]
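If you prefer an explicitly positional slice, iloc does the same thing and makes the intent clearer:
drop.iloc[:pos + 1]  # same rows, selected by position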

How to apply a condition to Pandas dataframe rows, but only apply the condition to rows of the same day?

I have a dataframe that's indexed by datetime and has one column of integers and another column that I want to put in a string if a condition of the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day the condition will still use it, and I want it to ignore any rows on a different day. I've tried various iterations of groupby(df.index.date) but can't figure out whether that will work.
Not sure if this is the best way to do it, but it gets you the answer:
import numpy as np

df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
I think this is what you want. You were probably closer to the answer than you thought...
Here are two dataframes to show that your logic works whether the data is random or the integers are a sorted range.
You will need to import random to reproduce the data.
import random
import pandas as pd

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))

def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x

# will show Success in all rows except where the date changes,
# because the values are a range in numerical order
df = pd.DataFrame({'IntCol': range(10, 26)}, index=dates)
df.groupby(df.index.date).apply(compare)
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
# random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0, 16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success

Assign first element of groupby to a column yields NaN

Why does this not work out?
I get the right results if I just print it, but if I use the same expression to assign to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_First, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements:
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
What happens here?
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is the values aggregated by the column cumsum with the aggregation function first.
So the index holds the unique cumsum values; if you assign this to a new column, it does not match the original index and the output is NaNs.
The solution is to use GroupBy.transform, which broadcasts the aggregated values to a Series of the same size as the original DataFrame, so the index matches the original and the assignment works:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")

Pandas, insert datetime values that increase one hour for each row

I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x = pd.to_datetime('2021-01-01 00:00')
y = pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x, y, freq='H'))
output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
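To attach this to your predictions, assuming your DataFrame is named df and has exactly 168 rows, you can also let periods do the counting instead of computing the end time:
df['datetime'] = pd.date_range('2021-01-01 00:00', periods=168, freq='H')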

Finding skipped index time steps and filling the values in a Pandas DataFrame

I am working with a pandas DataFrame that has an index that skips one or more time steps which in my case is one or more hours. I want to know if there is a way to find these time step skips and possibly insert these missing time steps.
Example of what I have:
[In]: df
[Out]:
point_value
Timestamp
2016-01-01 00:00:00 2550.63
2016-01-01 01:00:00 2535.97
2016-01-01 02:00:00 2538.25
2016-01-01 04:00:00 2548.63
2016-01-01 05:00:00 2555.16
Example of what I am looking for:
[In]: df
[Out]:
point_value
Timestamp
2016-01-01 02:00:00 2538.25
2016-01-01 04:00:00 2548.63
Ideally after finding these time step gaps I'd want to fill them with the time steps missing as such:
[In]: df
[Out]:
point_value
Timestamp
2016-01-01 00:00:00 2550.63
2016-01-01 01:00:00 2535.97
2016-01-01 02:00:00 2538.25
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 2548.63
2016-01-01 05:00:00 2555.16
I have searched on Stack Overflow and can't seem to find anything that pertains to the index itself. If this is a duplicate question I will be happy to take it down. Thanks for the help in advance.
DataFrame.reindex should achieve what you are looking for. Just define a new index and apply it to your dataframe:
new_index = pd.date_range(start='1/1/2016 0:0:0', end='1/1/2016 5:0:0', periods=6)
df = df.reindex(index=new_index)  # reindex returns a new DataFrame, it does not modify in place
With exactly hourly timestamps you can use resample:
df.resample('H').first()
point_value
Timestamp
2016-01-01 00:00:00 2550.63
2016-01-01 01:00:00 2535.97
2016-01-01 02:00:00 2538.25
2016-01-01 03:00:00 NaN
2016-01-01 04:00:00 2548.63
2016-01-01 05:00:00 2555.16
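To first locate the skipped timestamps (the middle example in the question), one approach is to take the difference against a complete hourly range; a sketch, assuming a sorted hourly index:
full_range = pd.date_range(df.index.min(), df.index.max(), freq='H')
missing = full_range.difference(df.index)  # DatetimeIndex of the skipped hours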
