Pandas datetime index range to change column values - python

I have a dataframe indexed using a 12hr frequency datetime:
                      id  mm  ls
date
2007-09-27 00:00:00    1   0   0
2007-09-27 12:00:00    1   0   0
2007-09-28 00:00:00    1  15   0
2007-09-28 12:00:00  NaN NaN   0
2007-09-29 00:00:00  NaN NaN   0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28' from which I wish to update all 'ls' values from 0 to 1.
                      id  mm  ls
date
2007-09-27 00:00:00    1   0   0
2007-09-27 12:00:00    1   0   0
2007-09-28 00:00:00    1  15   1
2007-09-28 12:00:00  NaN NaN   1
2007-09-29 00:00:00  NaN NaN   0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done by filtering on another column, e.g.:
data.loc[data['id'] == 1, 'ls'] = 1
yet this does not work using datetime index.
Could you let me know what the method for datetime index is?

You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
then you can modify your df using:
df.loc[pd.DatetimeIndex(df.index.date).isin(days), 'ls'] = 1
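A minimal runnable sketch of this approach (the frame mirrors the question's layout; the values are illustrative):

```python
import pandas as pd

# Frame indexed at a 12-hour frequency, as in the question.
idx = pd.date_range('2007-09-27', periods=5, freq='12h')
df = pd.DataFrame({'id': [1, 1, 1, None, None],
                   'mm': [0, 0, 15, None, None]}, index=idx)
df['ls'] = 0  # binary flag, default 0

days = ['2007-09-28']  # days whose rows should be flagged

# Compare only the date part of the index against the list of days.
mask = pd.DatetimeIndex(df.index.date).isin(pd.to_datetime(days))
df.loc[mask, 'ls'] = 1
print(df)
```

Both 12-hour rows falling on a listed day get `ls = 1`; all other rows keep the default 0.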

Related

Identify Dates in DataFrame - Pandas

I have a dataframe:
Datetime
0 2022-06-01 00:00:00 0
1 2022-06-01 00:01:00 0
2 2022-06-01 00:02:00 0
3 2022-06-01 00:03:00 0
4 2022-06-01 00:04:00 0
How can I identify whether the hour is "00", and likewise for the minutes and seconds? My requirement is to later wrap this in a function.
You can use:
s = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S 0') # what is the 0?
df['hour_0'] = s.dt.hour.eq(0)
df['min_0'] = s.dt.minute.eq(0)
df['sec_0'] = s.dt.second.eq(0)
Output:
Datetime hour_0 min_0 sec_0
0 2022-06-01 00:00:00 0 True True True
1 2022-06-01 00:01:00 0 True False True
2 2022-06-01 00:02:00 0 True False True
3 2022-06-01 00:03:00 0 True False True
4 2022-06-01 00:04:00 0 True False True
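As a runnable sketch of the trick above: the trailing " 0" in each string is matched as a literal in the format (the two-row frame below is a stand-in for the question's data):

```python
import pandas as pd

# Two-row stand-in for the question's data; note the trailing " 0".
df = pd.DataFrame({'Datetime': ['2022-06-01 00:00:00 0',
                                '2022-06-01 00:01:00 0']})

# The " 0" at the end of the format string matches the literal trailing token.
s = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S 0')
df['hour_0'] = s.dt.hour.eq(0)
df['min_0'] = s.dt.minute.eq(0)
df['sec_0'] = s.dt.second.eq(0)
print(df)
```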
So, your question is a bit unclear to me, but if I understand correctly you just need to extract the hours from your DataFrame? If so, the easiest way is to use Pandas' built-in datetime functionality. For example:
import pandas as pd
df = pd.DataFrame([["2022-12-12 01:59:00"], ["2022-12-13 01:59:00"]])
print(df)
This will yield:
0
0 2022-12-12 01:59:00
1 2022-12-13 01:59:00
Now you can do:
df['timestamp'] = pd.to_datetime(df[0])
df['hour'] = df['timestamp'].dt.hour
You can do this for minutes and seconds etc. Hope that helps.
You can extract hours, minutes, and seconds directly from the datetime string (what is the extra 0?). If the strings carry extra tokens, simply filter them out first, then parse.
df['new'] = pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
df['hour'] = df['new'].dt.hour
df['minute'] = df['new'].dt.minute
df['second'] = df['new'].dt.second
del df['new']
Gives:
Datetime hour minute second
0 2022-06-01 00:00:00 0 0 0 0
1 2022-06-01 00:01:00 0 0 1 0
2 2022-06-01 00:02:00 0 0 2 0
3 2022-06-01 00:03:00 0 0 3 0
4 2022-06-01 00:04:00 0 0 4 0
Explanation:
Your date string looks like this:
2022-06-01 00:02:00 0
Analysis:
2022 - year    - %Y
06   - month   - %m
01   - day     - %d
00:  - hours   - %H
02:  - minutes - %M
00   - seconds - %S
There is an extra 0 in the date format; to filter it out, I split the string on spaces:
df['Datetime'].str.split(' ').str[1]
For a single string this is equivalent to:
'2022-06-01 00:02:00 0'.split(' ')
which wraps the space-separated elements into a list:
['2022-06-01', '00:02:00', '0']
Analysis:
0th element in the list = '2022-06-01'
1st element in the list = '00:02:00'
2nd element in the list = '0'
We are interested in the time, which is the 1st element: '00:02:00'.
pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
Pandas has inbuilt time series functions - pandas.Series.dt.minute

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
I am still in the dark concerning the format of your Date column, so I will assume Date is a string object and Hr is an int64 object. To create a TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis=1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
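The row-wise apply works, but the same column can be built without apply by operating on whole columns at once; a sketch assuming the same Date/Hr layout (the explicit format string is an assumption about the MM/DD/YYYY dates shown):

```python
import pandas as pd

# Small stand-in for the frame above.
df = pd.DataFrame({'Date': ['12/01/2010', '12/01/2010', '12/02/2010'],
                   'Hr': [1, 2, 4]})

# Vectorised: parse all dates at once, then add the hours as timedeltas.
df['TimeStamp'] = (pd.to_datetime(df['Date'], format='%m/%d/%Y')
                   + pd.to_timedelta(df['Hr'], unit='h'))

# The new column can then serve as the index.
df = df.set_index('TimeStamp')
print(df)
```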

Pandas GroupBy with CumulativeSum per Contiguous Groups

I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output, where every time we reach an 'Event', the 'Abnormal State Grouping' resets and starts counting again.
We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).
I've written the following code to get my AbnormalStateGroupings to increment into relevant groupings for my analysis which has worked fine. However, we want to 'reset' the count of our 'AbnormalStates' after each event (i.e. lightbulb failure):
dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState']!=dataframe['AbnormalState'].shift()).cumsum()
I have created an additional column which lets me know which 'event' we are at via:
dataframe['Event_Or_Not'].cumsum() #I have a boolean representation of the Event Column represented and we use .cumsum() to get the relevant groupings (i.e. 1st Event, 2nd Event, 3rd Event etc.)
I've come close previously using the following:
eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift())&eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] =dataframe.groupby(['Event',eventMask]).cumcount().add(1)
However, this hasn't given me the desired output that I'm after (as per below).
I think I'm close however - Could anyone please advise what I could try to do next so that for each lightbulb failure, the abnormal state count resets and starts counting the # of abnormal states we have gone through before the next lightbulb failure?
State I want to get to with AbnormalStateGrouping
You would note that when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.
Current State of Dataframe
Please find an attached data source below:
https://filebin.net/ctjwk7p3gulmbgkn
I assume that your source DataFrame has only Date/Time (either string
or datetime), Event (string) and AbnormalState (int) columns.
To compute your grouping column, run:
dataframe['AbnormalStateGrouping'] = dataframe.groupby(
dataframe['Event'][::-1].notnull().cumsum()).AbnormalState\
.apply(lambda grp: (grp != grp.shift()).cumsum())
The result, for your initial source data, included as a picture, is:
Date/Time Event AbnormalState AbnormalStateGrouping
0 2018-01-01 01:00 NaN 0 1
1 2018-01-01 02:00 NaN 0 1
2 2018-01-01 03:00 NaN 1 2
3 2018-01-01 04:00 NaN 1 2
4 2018-01-01 05:00 NaN 0 3
5 2018-01-01 06:00 NaN 0 3
6 2018-01-01 07:00 NaN 0 3
7 2018-01-01 08:00 NaN 1 4
8 2018-01-01 09:00 NaN 1 4
9 2018-01-01 10:00 NaN 0 5
10 2018-01-01 11:00 NaN 0 5
11 2018-01-01 12:00 NaN 0 5
12 2018-01-01 13:00 NaN 1 6
13 2018-01-01 14:00 NaN 1 6
14 2018-01-01 15:00 NaN 0 7
15 2018-01-01 16:00 Event 0 7
16 2018-01-01 17:00 NaN 1 1
17 2018-01-01 18:00 NaN 1 1
18 2018-01-01 19:00 NaN 0 2
19 2018-01-01 20:00 NaN 0 2
Note the way of grouping:
dataframe['Event'][::-1].notnull().cumsum()
Due to [::-1], cumsum function is computed from the last row
to the first.
Thus:
rows with hours 01:00 thru 16:00 are in group 1,
remaining rows (hour 17:00 thru 20:00) are in group 0.
Then a lambda function is applied to AbnormalState separately for each group, so each cumulative sum restarts from 1 within its own group (i.e. after each Event).
Edit following the comment as of 22:18:12Z
The reason why I compute the cumsum for grouping in reversed order
is that when you run it in normal order:
dataframe['Event'].notnull().cumsum()
then:
rows with index 0 thru 14 (before the row with Event) have
this sum == 0,
row with index 15 and following rows have this sum == 1.
Try both versions yourself, with and without [::-1].
The result in normal order (without [::-1]) is that the Event row falls in the same group as the following rows, so the reset occurs on the Event row itself.
To check the whole result, run my code without [::-1] and you will see
that the ending part of the result contains:
Date/Time Event AbnormalState AbnormalStateGrouping
14 2018-01-01 15:00:00 NaN 0 7
15 2018-01-01 16:00:00 Event 0 1
16 2018-01-01 17:00:00 NaN 1 2
17 2018-01-01 18:00:00 NaN 1 2
18 2018-01-01 19:00:00 NaN 0 3
19 2018-01-01 20:00:00 NaN 0 3
so the Event row has AbnormalStateGrouping == 1.
But you want this row to continue the sequence of the preceding grouping states (in this case 7), with the reset occurring from the next row on.
So the Event row should be in the same group as the preceding rows, which is exactly what my code produces.

Calculating average differences with groupby in Python

I'm new to Python and I want to aggregate (groupby) ID's in my first column.
The values in the second column are timestamps (datetime format), and by aggregating the IDs I want to get the average difference between the timestamps (in days) per ID. My table looks like df1 and I want something like df2, but since I'm an absolute beginner, I have no idea how to do this.
import pandas as pd
import numpy as np
from datetime import datetime
In[1]:
# df1
ID = np.array([1,1,1,2,2,3])
Timestamp = np.array([
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-08 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-03-15 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-02-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S")])
df = pd.DataFrame({'ID': ID, 'Timestamp': Timestamp})
Out[1]:
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
In[2]:
#df2
ID = np.array([1,2,3])
Avg_Difference = np.array([7, 1, "nan"])
df2 = pd.DataFrame({'ID': ID, 'Avg_Difference': Avg_Difference})
Out[2]:
ID Avg_Difference
0 1 7
1 2 1
2 3 nan
You could do something like this:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
In your case, it looks like:
>>> df
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
>>> df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Name: Timestamp, dtype: timedelta64[ns]
If you want it as a dataframe with the column named Avg_Difference, just add to_frame at the end:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).to_frame('Avg_Difference')
Avg_Difference
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Edit: Based on your comment, if you want to remove the time element and just get the number of days, you can do the following:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).dt.days.to_frame('Avg_Difference')
Avg_Difference
ID
1 36.0
2 31.0
3 NaN

python to_date wrong values

Command:
dataframe.date.head()
Result:
0 12-Jun-98
1 7-Aug-2005
2 28-Aug-66
3 11-Sep-1954
4 9-Oct-66
5 NaN
Command:
pd.to_datetime(dataframe.date.head())
Result:
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 2066-08-28 00:00:00
3 1954-09-11 00:00:00
4 2066-10-09 00:00:00
5 NaN
I don't want to get 2066; it should be 1966. What can I do?
The year range is supposed to be from 1920 to 2017, and the dataframe contains null values.
You can subtract 100 years where dt.year is greater than 2017:
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].mask(df['date'].dt.year > 2017,
df['date'] - pd.Timedelta(100, unit='Y'))
print (df)
date
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 1966-08-28 18:00:00
3 1954-09-11 00:00:00
4 1966-10-09 18:00:00
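Note the 18:00:00 in rows 2 and 4: pd.Timedelta(100, unit='Y') is 100 average (365.25-day) years, so the shifted dates pick up a time-of-day offset. A calendar-exact alternative (a sketch on the two-digit-year rows only, with pd.DateOffset swapped in for the Timedelta above):

```python
import pandas as pd

# Only the two-digit-year rows from the question, for illustration.
df = pd.DataFrame({'date': ['12-Jun-98', '28-Aug-66', '9-Oct-66']})
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')

# strptime parses two-digit years 69-99 as 19xx and 00-68 as 20xx, so 66 -> 2066.
# Shift anything beyond the valid range back a full calendar century.
mask = df['date'].dt.year > 2017
df.loc[mask, 'date'] = df.loc[mask, 'date'] - pd.DateOffset(years=100)
print(df)
```

Because DateOffset moves by whole calendar years, the corrected dates stay at midnight instead of acquiring an 18:00:00 component.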
