Selecting rows in a DataFrame using datetime.datetime - Python

I am new to Python.
I want to select a range of rows using the datetime, which is also the index.
I am not sure if having the datetime as the index is a problem or not.
My dataframe looks like this:
gradient
date
2022-04-15 10:00:00 0.013714
2022-04-15 10:20:00 0.140792
2022-04-15 10:40:00 0.148240
2022-04-15 11:00:00 0.016510
2022-04-15 11:20:00 0.018219
...
2022-05-02 15:40:00 0.191208
2022-05-02 16:00:00 0.016198
2022-05-02 16:20:00 0.043312
2022-05-02 16:40:00 0.500573
2022-05-02 17:00:00 0.955833
I have made variables containing the start and end dates of the rows I want to select. They look like this:
A_start_646 = datetime.datetime(2022,4,27, 11,0,0)
S_start_646 = datetime.datetime(2022,4,28, 3,0,0)
D_start_646 = datetime.datetime(2022,5,2, 15,25,0)
D_end_646 = datetime.datetime(2022,5, 2, 15,50,0)
So I would like to make a new dataframe. I saw some examples on the internet, but they use another way of expressing the date.
Does someone know a solution?

I feel kind of stupid and smart at the same time now, because I have already answered my own question; my apologies.
So this is the answer:
new_df = data_646_mean[A_start_646 : S_start_646]
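For completeness, a minimal self-contained sketch of the same idea (the DataFrame below is hypothetical, standing in for data_646_mean); .loc is the explicit spelling of the slice, and both endpoints are inclusive:
import datetime
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_646_mean, with a proper DatetimeIndex
idx = pd.date_range('2022-04-15 10:00', '2022-05-02 17:00', freq='20min')
df = pd.DataFrame({'gradient': np.random.rand(len(idx))}, index=idx)

A_start_646 = datetime.datetime(2022, 4, 27, 11, 0, 0)
S_start_646 = datetime.datetime(2022, 4, 28, 3, 0, 0)

new_df = df.loc[A_start_646:S_start_646]  # both endpoints included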

Related

Convert a column to a specific time format which contains different types of time formats in python

This is my data frame:
df = pd.DataFrame({
'Time': ['10:00PM', '15:45:00', '13:40:00AM','5:00']
})
Time
0 10:00PM
1 15:45:00
2 13:40:00AM
3 5:00
I need to convert the times to a specific format; my expected output is given below.
Time
0 22:00:00
1 15:45:00
2 01:40:00
3 05:00:00
I tried using the str split and endswith functions, which makes for a complicated solution. Is there a better way to achieve this?
Thanks in advance!
Here you go. One thing to mention, though: 13:40:00AM will result in an error, since (a) 13 is the wrong format, as AM/PM hours only go from 1 to 12, and (b) 13 would be PM, which cannot at the same time be AM. :)
Cheers
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'])  # parse each string, inferring its format
print(df['Time'].dt.time)               # keep only the time component
<<< 22:00:00
<<< 15:45:00
<<< 01:40:00
<<< 05:00:00
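One version note (an assumption about your pandas version): pandas 2.x no longer falls back to per-element format inference for a column that mixes formats, so there you would have to say so explicitly:
import pandas as pd
df = pd.DataFrame({'Time': ['10:00PM', '15:45:00', '01:40:00AM', '5:00']})
df['Time'] = pd.to_datetime(df['Time'], format='mixed')  # pandas >= 2.0
print(df['Time'].dt.time)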

Custom resample function: only sample similar values hourly - Irregular time series

I am quite new to the game and can't seem to find an answer to my problem online.
I have a somewhat irregular time series in Python (mostly I use Pandas to work with it), which has a datetime index (roughly every 15 minutes) and multiple columns of values. I know that those values change approximately every hour, but the changes don't quite line up with the index I have. It looks something like this:
Values
2019-08-27 02:15:00 91.45
2019-08-27 02:30:00 91.44
2019-08-27 02:45:00 91.44
2019-08-27 03:00:00 91.43
2019-08-27 03:15:00 91.43
2019-08-27 03:30:00 91.43
2019-08-27 03:45:00 91.42
This is just an example, but one can see that the values change at random times (:15, :45, :00), and even though they should change every hour, sometimes there are only two 15-minute intervals with a given value, so I can't just say: take a group of 4 values and resample them to one hour.
So my idea was to use an if/else to create something like this:
if a value is the same as the next one: resample those to an hour
else: add one hour to the resampled index.
How could I accomplish that in Python, and does my idea even make sense?
Thanks in advance for any kind of help!
You can use pandas.resample.
Ex:
import pandas as pd
index = pd.date_range('2019-08-27 02:15:00', periods=30, freq='15min')
series = pd.Series(range(30), index=index)
series.resample('1H').mean()
2019-08-27 02:00:00 1.0
2019-08-27 03:00:00 4.5
2019-08-27 04:00:00 8.5
2019-08-27 05:00:00 12.5
2019-08-27 06:00:00 16.5
2019-08-27 07:00:00 20.5
2019-08-27 08:00:00 24.5
2019-08-27 09:00:00 28.0
Freq: H, dtype: float64
Pandas is not Python.
When you use plain Python, you have a simple, nice procedural language, and you iterate over the values in containers. When you use Pandas, you should try hard to avoid explicit loops at the Python level. The rationale is that Pandas (and numpy, for the underlying containers) runs C-optimized code, so you get a large gain by using pandas and numpy tools; this is called vectorization.
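As a tiny illustration of the difference, using the Values column from the example above:
# Slow: an explicit Python-level loop over the rows
doubled = [v * 2 for v in df['Values']]
# Fast: the vectorized equivalent, executed in optimized C
doubled = df['Values'] * 2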
What you want already exists in Pandas; it is called resample.
In your example, and provided the index is a true DatetimeIndex (*), you just do:
df2 = df.resample('1H').mean()
It gives:
Values
2019-08-27 02:00:00 91.443333
2019-08-27 03:00:00 91.427500
(*) If not, convert it first with: df.index = pd.to_datetime(df.index)
From your edit, I think that you want to get one value for each period. A possible way would be to take the most frequent value in the interval [H-15T, H+30T].
You could then use:
pd.DataFrame(df['Values']
             .resample('60T', base=45, loffset=pd.Timedelta(minutes=15))
             .agg(lambda x: x.value_counts().index[0])
             .rename('Values'))
This gives:
Values
2019-08-27 02:00:00 91.45
2019-08-27 03:00:00 91.43
2019-08-27 04:00:00 91.42
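A note for newer pandas (an assumption about your version): base= and loffset= were deprecated in pandas 1.1 and later removed. A sketch of the equivalent using the newer offset= argument, shifting the labels by hand afterwards:
out = (df['Values']
       .resample('60min', offset='45min')   # bins start at :45, as base=45 did
       .agg(lambda x: x.value_counts().index[0]))
out.index = out.index + pd.Timedelta(minutes=15)  # what loffset used to do
out = out.to_frame()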

Where am I going wrong with downsampling this data frame?

I have two dataframes.
One lists 30-minute observations of particular values (actual and forecast) over a 24-hour period (48 observations):
> api_df.head()
from to actual forecast index
0 2019-11-24T23:30Z 2019-11-25T00:00Z 200 210 moderate
1 2019-11-25T00:00Z 2019-11-25T00:30Z 200 199 moderate
2 2019-11-25T00:30Z 2019-11-25T01:00Z 198 200 moderate
3 2019-11-25T01:00Z 2019-11-25T01:30Z 189 204 moderate
4 2019-11-25T01:30Z 2019-11-25T02:00Z 191 199 moderate
The other is observations of another value (KW) over an entire day:
> csv_extr.head()
Date Time KW
3764 2019-11-25 13:45:26.1050000 -424.437988
2911 2019-11-25 16:41:12.4040000 -465.325989
1786 2019-11-25 13:06:54.0290000 -431.795013
4352 2019-11-25 18:42:19.9360000 -452.528992
4634 2019-11-25 19:35:19.9230000 -457.210999
I want to get csv_extr to resemble api_df as closely as possible, so I decided to downsample it. I ended up with something that is almost what (I think) I'm looking for, but there are some clear issues, namely the Time values and the NaN observations:
> x.head()
Date Time KW time
0 2019-11-25 00:00:01.6470000 -100.0 0
1 NaN NaN NaN 0
2 NaN NaN NaN 1
3 2019-11-25 01:57:04.7700000 0.0 0
4 NaN NaN NaN 2
I have looked for possible reasons why, and I can only assume these results occur because Time is given to seven decimal places, which somehow keeps the output from aligning neatly to 30-minute blocks.
I achieve this final output (x) using some code that I found online, but I cannot find an explanation of what precisely the code is doing, so I would like some guidance on that as well:
t = pd.to_timedelta(csv_extr.Time, unit = "min")
s = csv_extr.set_index(t).resample('30min').last().reset_index(drop = True)
x = s.assign(time = s.groupby("Time").cumcount())
A single error appears when this code segment runs:
ValueError: only leading negative signs are allowed
As before, I have looked for what this might mean but haven't yet found anything that clearly explains it.
I am happy to provide data for reprex purposes; the reason I haven't provided anything (yet) is that I am unclear about the best way to do so in Python/pandas (some guidance here would be good, too). I am also hoping the solution is simply a case of a more experienced Python user looking at the code and spotting something obvious; otherwise I am happy to provide the data required for a reprex.
You converted "Time" to a timedelta, but it looks like a wall-clock timestamp to me, so I think you want pd.to_datetime; that mismatch is also the likely source of the ValueError, since pd.to_timedelta cannot read those strings as durations. Converting is what I did to get the following approximation of your data. I also set the index to the new "DateTime" column and drop the old columns:
csv_extr['DateTime'] = pd.to_datetime(csv_extr.Date + ' ' + csv_extr.Time)
csv_extr = csv_extr[['KW','DateTime']].set_index('DateTime')
KW
DateTime
2019-11-25 13:45:26.105 -424.437988
2019-11-25 16:41:12.404 -465.325989
2019-11-25 13:06:54.029 -431.795013
2019-11-25 18:42:19.936 -452.528992
2019-11-25 19:35:19.923 -457.210999
It's pretty straightforward after that. I'll show 60-minute resampling here to keep the output more compact, but it works just the same for 30-minute sampling, of course:
csv_extr.resample('60 min').last()
KW
DateTime
2019-11-25 13:00:00 -424.437988
2019-11-25 14:00:00 NaN
2019-11-25 15:00:00 NaN
2019-11-25 16:00:00 -465.325989
2019-11-25 17:00:00 NaN
2019-11-25 18:00:00 -452.528992
2019-11-25 19:00:00 -457.210999
I assume you want to fill in the missing values there. Without knowing more about your data, I'd suggest a simple linear interpolation like the following (but pandas and numpy have plenty of other options if you want something more complicated):
csv_extr.resample('60 min').last().interpolate()
KW
DateTime
2019-11-25 13:00:00 -424.437988
2019-11-25 14:00:00 -438.067322
2019-11-25 15:00:00 -451.696655
2019-11-25 16:00:00 -465.325989
2019-11-25 17:00:00 -458.927490
2019-11-25 18:00:00 -452.528992
2019-11-25 19:00:00 -457.210999
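As for what the snippet you found online was doing, a hedged line-by-line reading (exact behavior depends on your pandas version):
# Tries to read "13:45:26.1050000" as a duration rather than a wall-clock
# time; this is likely where the ValueError comes from.
t = pd.to_timedelta(csv_extr.Time, unit="min")
# Indexes by that (wrong) timedelta, keeps the last row in each 30-minute
# bin, then throws the bin labels away again.
s = csv_extr.set_index(t).resample('30min').last().reset_index(drop=True)
# Numbers the repeated "Time" values within each group, which, combined
# with the NaNs from empty bins, produces the odd "time" column you saw.
x = s.assign(time=s.groupby("Time").cumcount())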

Using asfreq to resample a pandas dataframe

EDIT: I had made a mistake and my index was starting at 00:00:00, not at 06:00:00 (see below). So this question is spurious, but of course Wen's solution is correct.
I have a dataframe whose index goes like this:
2017-11-01 06:00:00
2017-11-02 06:00:00
2017-11-03 06:00:00
...
and so on. But I have the suspicion there are missing entries; for instance, the index for 2017-11-04 06:00:00 could be missing. I have used
df = df.asfreq(freq="1D")
to fill the missing values with NaN, but it creates a new index that doesn't take the hours into account; it goes 2017-11-01, 2017-11-02 and so on, so the values in the adjacent column are all NaN!
How can I fix this? I don't see any option in asfreq that can solve it. Perhaps another tool? Thanks in advance.
It works fine on my side:
import numpy as np
import pandas as pd

l = ['2017-11-01 06:00:00',
     '2017-11-03 06:00:00']
ts = pd.Series(np.random.randn(len(l)), index=l)
ts.index = pd.to_datetime(ts.index)
ts.asfreq(freq="D")
Out[745]:
2017-11-01 06:00:00 -0.467919
2017-11-02 06:00:00 NaN
2017-11-03 06:00:00 1.610024
Freq: D, dtype: float64
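A side note, since the EDIT says the index actually started at 00:00:00: asfreq anchors its generated stamps to the first index value, so the time-of-day component is carried along automatically. If you ever need to force a particular anchor regardless of the first stamp, one option (a hedged sketch) is to reindex against an explicit date_range:
full = pd.date_range('2017-11-01 06:00:00', '2017-11-03 06:00:00', freq='D')
ts = ts.reindex(full)  # NaN wherever a 06:00 stamp is missing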

How to sum field across two DataFrames when the indexes don't line up?

I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling that pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day, 3pm in this case. If a file doesn't have a record at exactly 3pm on a given day, I want to use the most recent record from earlier that day.
Let me give a concrete example. I have data in two CSV files. Here are a couple small examples:
datetime value
2013-02-28 09:30:00 0.565019720442
2013-03-01 09:30:00 0.549536266504
2013-03-04 09:30:00 0.5023031467
2013-03-05 09:30:00 0.698370467751
2013-03-06 09:30:00 0.75834927162
2013-03-07 09:30:00 0.783620442226
2013-03-11 09:30:00 0.777265379462
2013-03-12 09:30:00 0.785787872851
2013-03-13 09:30:00 0.784873183044
2013-03-14 10:15:00 0.802959366653
2013-03-15 10:15:00 0.802959366653
2013-03-18 10:15:00 0.805413095911
2013-03-19 09:30:00 0.80816233134
2013-03-20 10:15:00 0.878912249996
2013-03-21 10:15:00 0.986393922571
and the other:
datetime value
2013-02-28 05:00:00 0.0373634672097
2013-03-01 05:00:00 -0.24700085273
2013-03-04 05:00:00 -0.452964976056
2013-03-05 05:00:00 -0.2479288295
2013-03-06 05:00:00 -0.0326855588777
2013-03-07 05:00:00 0.0780461766619
2013-03-08 05:00:00 0.306247682656
2013-03-11 06:00:00 0.0194146154407
2013-03-12 05:30:00 0.0103653153719
2013-03-13 05:30:00 0.0350377752558
2013-03-14 05:30:00 0.0110884755383
2013-03-15 05:30:00 -0.173216846788
2013-03-19 05:30:00 -0.211785013352
2013-03-20 05:30:00 -0.891054563968
2013-03-21 05:30:00 -1.27207563599
2013-03-22 05:30:00 -1.28648629004
2013-03-25 05:30:00 -1.5459897419
Note that (a) neither file actually has a 3pm record, and (b) the two files don't always have records for any given day: 2013-03-08 is missing from the first file, 2013-03-18 is missing from the second, and the first file ends before the second. As output, I envision a dataframe like this (perhaps just the date, without the time):
datetime value
2013-Feb-28 15:00:00 0.6023831876517
2013-Mar-01 15:00:00 0.302535413774
2013-Mar-04 15:00:00 0.049338170644
2013-Mar-05 15:00:00 0.450441638251
2013-Mar-06 15:00:00 0.7256637127423
2013-Mar-07 15:00:00 0.8616666188879
2013-Mar-08 15:00:00 0.306247682656
2013-Mar-11 15:00:00 0.7966799949027
2013-Mar-12 15:00:00 0.7961531882229
2013-Mar-13 15:00:00 0.8199109582998
2013-Mar-14 15:00:00 0.8140478421913
2013-Mar-15 15:00:00 0.629742519865
2013-Mar-18 15:00:00 0.805413095911
2013-Mar-19 15:00:00 0.596377317988
2013-Mar-20 15:00:00 -0.012142313972
2013-Mar-21 15:00:00 -0.285681713419
2013-Mar-22 15:00:00 -1.28648629004
2013-Mar-25 15:00:00 -1.5459897419
I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do it. Further complicating my thinking about this problem, more complex CSV files might have multiple records for a single day (same date, different times).
It seems that I need either to generate a new pair of input dataframes with times at 15:00 and then sum across their value columns keying on just the date, or, during the sum operation, to select the record with the greatest time on any given day with time <= 15:00:00. I suspect I have to group rows together by date, then within each group select only the row nearest to (but not greater than) 3pm. Kind of at that point my brain explodes.
I got nowhere looking at the documentation, as I don't really understand all the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.
First combine your DataFrames:
df3 = pd.concat([df1, df2])  # df1.append(df2) in older pandas; DataFrame.append was removed in pandas 2.0
so that everything is in one table. Next, use groupby to sum across timestamps:
df4 = df3.groupby('datetime').aggregate(sum)
Now df4 has a value column that is the sum of all rows sharing the same datetime.
Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at any stage:
filtered = df[df['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]
I'm not sure exactly what you are trying to do; you may need to parse your timestamp columns before filtering.
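To get all the way to the asked-for output (the last record at or before 15:00 each day, with days absent from one file contributing nothing), here is a hedged sketch; it assumes each frame has datetime and value columns parsed as in the samples above:
import pandas as pd

def daily_asof_3pm(df):
    # Keep only records at or before 15:00, then take the last one per day
    s = df.set_index('datetime').sort_index().between_time('00:00', '15:00')['value']
    daily = s.groupby(s.index.date).last()
    daily.index = pd.to_datetime(daily.index) + pd.Timedelta(hours=15)
    return daily

# Outer-add the two daily series; a day missing from one file simply
# contributes zero from that file
total = daily_asof_3pm(df1).add(daily_asof_3pm(df2), fill_value=0)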
