I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
Output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
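If you only know the number of rows, you can also let date_range count the periods for you and assign it straight onto the prediction frame. A minimal sketch (the predictions frame and its cars column are placeholder names, not from the question):
import pandas as pd

# placeholder frame standing in for the 168 hourly ARIMA forecasts
predictions = pd.DataFrame({"cars": range(168)})
# 168 hourly timestamps starting at 00:00 on 2021-01-01
predictions["datetime"] = pd.date_range("2021-01-01 00:00", periods=168, freq="H")
print(predictions.head())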
I have 2 Series with a datetime index. I want to edit the values (float) of the first one at each corresponding daily minimum of the second one.
I tried
ser_1.loc[ser_2.groupby(ser_2.index.day_of_year).idxmin()] += 1
But I get this error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Series 2 and 1, respectively, look like:
2019-01-01 00:00:00 0.04980
2019-01-01 01:00:00 0.04426
2019-01-01 02:00:00 0.05100
2019-01-01 03:00:00 0.04627
2019-01-01 04:00:00 0.03978
...
2019-12-31 19:00:00 0.04773
2019-12-31 20:00:00 0.04600
2019-12-31 21:00:00 0.04220
2019-12-31 22:00:00 0.03974
2019-12-31 23:00:00 0.03888
Name: 0, Length: 8760, dtype: float64
2019-01-01 23:00:00 0.000
2019-01-02 00:00:00 0.000
2019-01-02 01:00:00 0.000
2019-01-02 02:00:00 0.000
2019-01-02 03:00:00 0.000
...
2019-12-13 06:00:00 1.534
2019-12-13 07:00:00 2.425
2019-12-13 08:00:00 1.622
2019-12-13 09:00:00 1.974
2019-12-13 10:00:00 1.729
Freq: H, Name: 1, Length: 8292, dtype: float64
Could it be a non-corresponding index format, or just bad use of the function?
Found my problem: the function I used is correct, but ser_1 is incomplete, so some of the idxmin timestamps have no corresponding entry in it.
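For reference, one way to make the original expression robust to the missing timestamps is to keep only the daily-minimum timestamps that actually exist in ser_1 before indexing. A minimal sketch with toy data standing in for the real series:
import numpy as np
import pandas as pd

# toy hourly data over two days, standing in for the real ser_1 / ser_2
idx = pd.date_range("2019-01-01", periods=48, freq="H")
ser_2 = pd.Series(np.random.rand(48), index=idx)
ser_1 = pd.Series(0.0, index=idx).drop(idx[5])   # deliberately incomplete

# timestamps of each day's minimum of ser_2
min_times = ser_2.groupby(ser_2.index.day_of_year).idxmin()

# keep only the timestamps that exist in ser_1, then increment those values
valid_times = ser_1.index.intersection(min_times)
ser_1.loc[valid_times] += 1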
So I have a dataset that has electricity load over 24 hours:
Time_of_Day = loadData.groupby(loadData.index.hour).mean()
Time_of_Day
Time Load
2019-01-01 01:00:00 38.045
2019-01-01 02:00:00 30.675
2019-01-01 03:00:00 22.570
2019-01-01 04:00:00 22.153
2019-01-01 05:00:00 21.085
... ...
2019-12-31 20:00:00 65.565
2019-12-31 21:00:00 53.513
2019-12-31 22:00:00 49.096
2019-12-31 23:00:00 44.409
2020-01-01 00:00:00 45.744
How do I plot a random day (24 hrs) from the 8760 hours, please?
With the following toy dataframe:
import pandas as pd
import random
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="H")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]
Time Load
0 2019-01-01 00:00:00 53.36
1 2019-01-01 01:00:00 34.20
2 2019-01-01 02:00:00 64.19
3 2019-01-01 03:00:00 89.18
4 2019-01-01 04:00:00 27.82
... ... ...
8732 2019-12-30 20:00:00 38.26
8733 2019-12-30 21:00:00 49.66
8734 2019-12-30 22:00:00 64.15
8735 2019-12-30 23:00:00 23.97
8736 2019-12-31 00:00:00 3.72
[8737 rows x 2 columns]
Here is one way to do it using the choice function from the Python standard library random module. Note that picking the month and the day with two independent choices can land on a combination that has no rows (e.g. February 30), so it is safer to pick one random date and filter on it:
# In a Jupyter cell
random_date = random.choice(df["Time"].dt.date.unique())
df[df["Time"].dt.date == random_date].plot(x="Time")
Output: a line plot of the Load values over the randomly selected 24-hour window.
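If, as in the question, the load data is indexed by datetime rather than carrying a Time column, a similar sketch (assuming loadData is the hourly, datetime-indexed frame from the question) could be:
import random

# loadData is assumed to be the question's hourly, datetime-indexed DataFrame
# pick one calendar day at random and slice all of its hours via partial string indexing
random_day = random.choice(loadData.index.normalize().unique())
loadData.loc[str(random_day.date())].plot()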
I have a dataframe that's indexed by datetime and has one column of integers and another column that I want to put in a string if a condition of the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day, the condition will still use it, and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out whether that will work or not.
Not sure if this is the best way to do it, but it gets you the answer. The key is to group by the date part of the index so the shift never crosses a day boundary (np here requires import numpy as np):
df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
I think this is what you want. You were probably closer to the answer than you thought...
Below are two dataframes used to show that your logic works whether the integers are random or a sorted range.
You will need import pandas as pd and import random to reproduce the data.
dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))
def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x
#### Will show Success in all rows except where the date changes, because the values are a range in numerical order
df = pd.DataFrame({'IntCol': range(10,26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success
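If you would rather avoid apply, the same per-day comparison can be written with a grouped shift while keeping the .loc style from the question. A sketch, assuming the same IntCol / StringCol names and a DatetimeIndex:
# assumes df has a DatetimeIndex and an 'IntCol' column, as in the question
# shift within each calendar day, so the first row of a day compares against NaN and never becomes 'Success'
prev_same_day = df.groupby(df.index.date)['IntCol'].shift(1)
df.loc[df['IntCol'] > prev_same_day, 'StringCol'] = 'Success'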
Why does this not work out?
I get the right results if I just print it out, but if I use the same expression to assign it to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_First, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but I want to get the first elements:
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
If you use
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is the Date values aggregated per cumsum group with the aggregation function first.
So the index of that result consists of the unique cumsum values; if you assign it to a new column, there is a mismatch with the original index and the output is all NaNs.
The solution is to use GroupBy.transform, which repeats the aggregated value across a Series of the same size (and index) as the original DataFrame, so the index matches and the assignment works perfectly:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")
The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.
Script
import pandas as pd
from datetime import datetime
import random
df = pd.DataFrame(columns=['network','time','active_clients','throughput','speed'])
networks = ['ALPHA','BETA','GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00','%Y-%m-%d %H:%M:%S'),datetime.strptime('2021-01-01 12:00:00','%Y-%m-%d %H:%M:%S'),7).tolist()
for n in networks:
    for t in times:
        # note: DataFrame.append was removed in pandas 2.0; on newer versions collect the rows in a list and build the frame once with pd.DataFrame
        df = df.append({'network':n,'time':t,'active_clients':random.randint(10,30),'throughput':random.randint(1500,5000),'speed':random.randint(10000,12000)},ignore_index=True)
df.set_index(['network','time'],inplace=True)
print(df.to_string())
Output
active_clients throughput speed
network time
ALPHA 2021-01-01 00:00:00 16 4044 11023
2021-01-01 02:00:00 17 2966 10933
2021-01-01 04:00:00 10 4649 11981
2021-01-01 06:00:00 23 3629 10113
2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 00:00:00 17 3073 11798
2021-01-01 02:00:00 20 1941 10640
2021-01-01 04:00:00 17 1980 11869
2021-01-01 06:00:00 23 3346 10002
2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 00:00:00 21 4366 11587
2021-01-01 02:00:00 22 3404 11669
2021-01-01 04:00:00 20 1608 10344
2021-01-01 06:00:00 28 1849 10278
2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer-based index location instead. What's the most efficient way of slicing the DataFrame to achieve the following?
Desired output
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
My attempts
Returns the full dataframe:
df_sel = df.iloc[:,-3:]
Raises an error because loc doesn't support using integer values on datetime objects:
df_sel = df.loc[:,-3:]
Returns the last three entries in the second level, but only for the last entry in the first level:
df_sel = df.loc[:].iloc[-3:]
I have 2 methods to solve this problem:
Method 1:
As mentioned in the first comment by Quang Hoang, you can use groupby to do this, which I believe is the shortest code:
df.groupby(level=0).tail(3)
Method 2:
You can also slice the last three rows for each network and then concat them:
pd.concat([df.loc[[i]][-3:] for i in networks])
Both of these methods will output the result you want.
Another method is to do some reshaping:
df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()
Output:
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 26 4081 11325
2021-01-01 10:00:00 13 3370 10716
2021-01-01 12:00:00 13 3691 10737
BETA 2021-01-01 08:00:00 28 2105 10465
2021-01-01 10:00:00 21 2444 10158
2021-01-01 12:00:00 24 1947 11226
GAMMA 2021-01-01 08:00:00 13 1850 10288
2021-01-01 10:00:00 23 2241 11521
2021-01-01 12:00:00 30 3515 11138
Details:
unstack the outermost index level, level=0
Use iloc to select the last three records of the reshaped dataframe
stack that level back into the index, then swaplevel and sort_index to restore the original layout
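For reference, the same pipeline broken into steps with comments (assuming df is the multi-indexed frame built above; the intermediate names are arbitrary):
# df is the (network, time) multi-indexed frame from the question
wide = df.unstack(0)                     # move 'network' into the columns; index is just the time level
last3 = wide.iloc[-3:]                   # last three timestamps across all networks at once
out = last3.stack()                      # push 'network' back into the index as (time, network)
out = out.swaplevel(0, 1).sort_index()   # restore the (network, time) order and sort
print(out)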