Identify Dates in DataFrame - Pandas - python

I have a dataframe:
Datetime
0 2022-06-01 00:00:00 0
1 2022-06-01 00:01:00 0
2 2022-06-01 00:02:00 0
3 2022-06-01 00:03:00 0
4 2022-06-01 00:04:00 0
How can I identify whether the hour is "00", and likewise for the minutes and seconds? My requirement is to later put these checks in a function.

You can use:
s = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S 0')  # trailing ' 0' matches the literal 0 in the data
df['hour_0'] = s.dt.hour.eq(0)
df['min_0'] = s.dt.minute.eq(0)
df['sec_0'] = s.dt.second.eq(0)
Output:
Datetime hour_0 min_0 sec_0
0 2022-06-01 00:00:00 0 True True True
1 2022-06-01 00:01:00 0 True False True
2 2022-06-01 00:02:00 0 True False True
3 2022-06-01 00:03:00 0 True False True
4 2022-06-01 00:04:00 0 True False True
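The OP mentioned wanting to put this in a function later; a minimal sketch of that (the column name and the trailing ` 0` in the format string are taken from the question's data):

```python
import pandas as pd

def flag_zero_components(df, col='Datetime'):
    """Add boolean columns flagging whether hour/minute/second equal 0."""
    # The trailing ' 0' in the format string matches the extra literal 0
    # at the end of each Datetime value in the question's data.
    s = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S 0')
    return df.assign(hour_0=s.dt.hour.eq(0),
                     min_0=s.dt.minute.eq(0),
                     sec_0=s.dt.second.eq(0))

df = pd.DataFrame({'Datetime': ['2022-06-01 00:00:00 0',
                                '2022-06-01 00:01:00 0']})
out = flag_zero_components(df)
```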

So, your question is a bit unclear to me, but if I understand correctly you just need to extract the hours from your DataFrame? If so, the easiest way to do this is to use Pandas' built-in datetime functionality. For example:
import pandas as pd
df = pd.DataFrame([["2022-12-12 01:59:00"], ["2022-12-13 01:59:00"]])
print(df)
This will yield:
0
0 2022-12-12 01:59:00
1 2022-12-13 01:59:00
Now you can do:
df['timestamp'] = pd.to_datetime(df[0])
df['hour'] = df['timestamp'].dt.hour
You can do this for minutes and seconds etc. Hope that helps.

You can easily extract hours, minutes, and seconds directly from a datetime string. What is the extra 0? If you have extra strings, simply filter them out first, then extract the parameters.
df['new'] = pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
df['hour'] = df['new'].dt.hour
df['minute'] = df['new'].dt.minute
df['second'] = df['new'].dt.second
del df['new']
Gives:
Datetime hour minute second
0 2022-06-01 00:00:00 0 0 0 0
1 2022-06-01 00:01:00 0 0 1 0
2 2022-06-01 00:02:00 0 0 2 0
3 2022-06-01 00:03:00 0 0 3 0
4 2022-06-01 00:04:00 0 0 4 0
Explanation:
Your date string looks like this:
2022-06-01 00:02:00 0
Analysis:
2022 - Year - %Y
06 - Month - %m
01 - Day - %d
00 - Hours - %H
02 - Minutes - %M
00 - Seconds - %S
You have an extra 0 in the date format; to filter that out, I've split the string by spaces.
df['Datetime'].str.split(' ').str[1],format='%H:%M:%S'
Logically this implies:
'2022-06-01 00:02:00 0'.split(' ')
which splits the string into a list of elements separated by spaces:
['2022-06-01', '00:02:00', '0']
Analysis
0th element in list = 2022-06-01
1st element in list = 00:02:00
2nd element in list = 0
Currently we are interested in the time, which is the 1st element in the list = 00:02:00
pd.to_datetime(df['Datetime'].str.split(' ').str[1],format='%H:%M:%S')
Pandas has inbuilt time series functions - pandas.Series.dt.minute
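The split-based approach, assembled as one runnable sketch (sample data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({'Datetime': ['2022-06-01 00:00:00 0',
                                '2022-06-01 00:02:00 0']})
# Take the middle element (the time part) after splitting on spaces,
# then parse it with a time-only format.
t = pd.to_datetime(df['Datetime'].str.split(' ').str[1], format='%H:%M:%S')
df['hour'] = t.dt.hour
df['minute'] = t.dt.minute
df['second'] = t.dt.second
```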

Pandas create a column iteratively - increasing after specific threshold

I have a simple table on which the datetime is formatted correctly:
Datetime              Diff
2021-01-01 12:00:00   0
2021-01-01 12:02:00   2
2021-01-01 12:04:00   2
2021-01-01 12:10:00   6
2021-01-01 12:20:00   10
2021-01-01 12:22:00   2
I would like to add a label/batch name that increments whenever the difference exceeds a specific threshold/cutoff. The output (with a threshold of diff > 7) I am hoping to achieve is:
Datetime              Diff  Batch
2021-01-01 12:00:00   0     A
2021-01-01 12:02:00   2     A
2021-01-01 12:04:00   2     A
2021-01-01 12:10:00   6     A
2021-01-01 12:20:00   10    B
2021-01-01 12:22:00   2     B
Batch doesn't need to be 'A','B','C' - probably easier to increase numerically.
I cannot find a solution online but I'm assuming there is a method to split the table on all values below the threshold, apply the batch label and concatenate again. However I cannot seem to get it working.
Any insight appreciated :)
Since True and False values represent 1 and 0 when summed, you can use this to create a cumulative sum on a boolean column made by df.Diff > 7:
df['Batch'] = (df.Diff > 7).cumsum()
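A quick check of this on the question's Diff values (note the first batch is labelled 0 here, not A):

```python
import pandas as pd

df = pd.DataFrame({'Diff': [0, 2, 2, 6, 10, 2]})
# Rows where Diff exceeds the threshold start a new batch; the running
# sum of the boolean column yields an increasing integer batch label.
df['Batch'] = (df.Diff > 7).cumsum()
```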
You can use:
df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
.cumsum().add(65).apply(chr)
print(df)
# Output:
Datetime Diff Batch
0 2021-01-01 12:00:00 0 A
1 2021-01-01 12:02:00 2 A
2 2021-01-01 12:04:00 2 A
3 2021-01-01 12:10:00 6 A
4 2021-01-01 12:20:00 10 B
5 2021-01-01 12:22:00 2 B
Update
For a side question: apply(chr) goes through A-Z; what method would you use to achieve AA, AB, ... for batches greater than 26?
Try something like this:
# Adapted from openpyxl
def chrext(i):
    s = ''
    while i > 0:
        i, r = divmod(i, 26)
        i, r = (i, r) if r > 0 else (i - 1, 26)
        s += chr(r - 1 + 65)
    return s[::-1]
df['Batch'] = df['Datetime'].diff().dt.total_seconds().gt(7*60) \
.cumsum().add(1).apply(chrext)
For demonstration purposes, if you replace 1 with 27:
>>> df
Datetime Diff Batch
0 2021-01-01 12:00:00 0 AA
1 2021-01-01 12:02:00 2 AA
2 2021-01-01 12:04:00 2 AA
3 2021-01-01 12:10:00 6 AA
4 2021-01-01 12:20:00 10 AB
5 2021-01-01 12:22:00 2 AB
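A sanity check of chrext on the boundary values, so the A→Z→AA rollover is visible:

```python
def chrext(i):
    # Convert a 1-based integer to an Excel-style column label
    # (1 -> A, 26 -> Z, 27 -> AA, ...). Adapted from openpyxl.
    s = ''
    while i > 0:
        i, r = divmod(i, 26)
        i, r = (i, r) if r > 0 else (i - 1, 26)
        s += chr(r - 1 + 65)
    return s[::-1]

labels = [chrext(n) for n in (1, 26, 27, 28, 52, 53)]
```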
You can achieve this by creating a custom grouping that has the properties you want. After you group the values, your batch is simply the group number. You don't have to use groupby with only an existing column; you can pass a custom index, and that is really powerful.
from datetime import timedelta
df['batch'] = df.groupby((df['Datetime'] - df['Datetime'].min()) // timedelta(minutes=7)).ngroup()
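A runnable sketch of the grouping idea (note this buckets rows into fixed 7-minute windows measured from the first timestamp, which is not the same as the gap-based threshold in the other answers):

```python
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({'Datetime': pd.to_datetime([
    '2021-01-01 12:00:00', '2021-01-01 12:02:00', '2021-01-01 12:04:00',
    '2021-01-01 12:10:00', '2021-01-01 12:20:00', '2021-01-01 12:22:00'])})
# Bucket rows into fixed 7-minute windows counted from the first timestamp;
# ngroup() numbers the resulting groups 0, 1, 2, ...
df['batch'] = df.groupby(
    (df['Datetime'] - df['Datetime'].min()) // timedelta(minutes=7)).ngroup()
```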

Create a date counter variable starting with a particular date

I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where, for month 0, the date is start_dt minus 1 month, and for each subsequent month the date increments by one month.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1, add the datetime converted to a month period by Timestamp.to_period, and then convert the output back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Or it is possible to convert the column to month offsets, subtracting 1 and adding the datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
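An equivalent sketch using plain Period arithmetic, which sidesteps Series/Period broadcasting and reproduces the question's expected output for start_dt = 201901:

```python
import pandas as pd

start = pd.Period('2019-01', freq='M')  # from start_dt = 201901
df = pd.DataFrame({'month': [0, 1, 2, 3, 4]})
# Month 0 maps to one month before the start period; each subsequent
# month advances the period by one before converting to a timestamp.
df['date'] = [(start + m - 1).to_timestamp() for m in df['month']]
```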
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, in each iteration of the for loop, the initial datetime is being incremented by the corresponding number (starting at -1) of the iteration.

Calculating average differences with groupby in Python

I'm new to Python and I want to aggregate (groupby) the IDs in my first column.
The values in the second column are timestamps (datetime format), and by aggregating the IDs I want to get the average difference between the timestamps (in days) per ID. My table looks like df1 and I want something like df2, but since I'm an absolute beginner, I have no idea how to do this.
import pandas as pd
import numpy as np
from datetime import datetime
In[1]:
# df1
ID = np.array([1,1,1,2,2,3])
Timestamp = np.array([
    datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
    datetime.strptime('2018-01-08 18:07:02', "%Y-%m-%d %H:%M:%S"),
    datetime.strptime('2018-03-15 18:07:02', "%Y-%m-%d %H:%M:%S"),
    datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
    datetime.strptime('2018-02-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
    datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S")])
df = pd.DataFrame({'ID': ID, 'Timestamp': Timestamp})
Out[1]:
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
In[2]:
#df2
ID = np.array([1,2,3])
Avg_Difference = np.array([7, 1, "nan"])
df2 = pd.DataFrame({'ID': ID, 'Avg_Difference': Avg_Difference})
Out[2]:
ID Avg_Difference
0 1 7
1 2 1
2 3 nan
You could do something like this:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
In your case, it looks like:
>>> df
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
>>> df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Name: Timestamp, dtype: timedelta64[ns]
If you want it as a dataframe with the column named Avg_Difference, just add to_frame at the end:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).to_frame('Avg_Difference')
Avg_Difference
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Edit Based on your comment, if you want to remove the time element, and just get the number of days, you can do the following:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).dt.days.to_frame('Avg_Difference')
Avg_Difference
ID
1 36.0
2 31.0
3 NaN
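Putting the accepted approach together with the question's data as a runnable check:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Timestamp': pd.to_datetime([
        '2018-01-01 18:07:02', '2018-01-08 18:07:02', '2018-03-15 18:07:02',
        '2018-01-01 18:07:02', '2018-02-01 18:07:02', '2018-01-01 18:07:02'])})
# Per-ID mean of consecutive timestamp differences, reported in whole days.
# A single-row group has no consecutive pair, so its mean is NaT -> NaN.
out = (df.groupby('ID')['Timestamp']
         .apply(lambda x: x.diff().mean())
         .dt.days
         .to_frame('Avg_Difference'))
```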

python to_date wrong values

Command:
dataframe.date.head()
Result:
0 12-Jun-98
1 7-Aug-2005
2 28-Aug-66
3 11-Sep-1954
4 9-Oct-66
5 NaN
Command:
pd.to_datetime(dataframe.date.head())
Result:
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 2066-08-28 00:00:00
3 1954-09-11 00:00:00
4 2066-10-09 00:00:00
5 NaN
I don't want to get 2066; it should be 1966. What can I do?
The year range is supposed to be from 1920 to 2017. The dataframe contains null values.
You can subtract 100 years if dt.year is greater than 2017:
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].mask(df['date'].dt.year > 2017,
df['date'] - pd.Timedelta(100, unit='Y'))
print (df)
date
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 1966-08-28 18:00:00
3 1954-09-11 00:00:00
4 1966-10-09 18:00:00
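The Timedelta(100, unit='Y') subtraction lands 18 hours off (visible in rows 2 and 4 above) because it uses an average year length. A sketch that shifts by exactly 100 calendar years with pd.DateOffset instead (two-digit-year rows only, to keep a uniform parse format):

```python
import pandas as pd

df = pd.DataFrame({'date': ['12-Jun-98', '28-Aug-66']})
# %y pivots two-digit years: 69-99 -> 1900s, 00-68 -> 2000s,
# so '66' initially parses as 2066.
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')
# Shift any parsed year beyond the valid range back by exactly one century.
mask = df['date'].dt.year > 2017
df.loc[mask, 'date'] -= pd.DateOffset(years=100)
```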

PYTHON: Pandas datetime index range to change column values

I have a dataframe indexed using a 12hr frequency datetime:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 0
2007-09-28 12:00:00 NaN NaN 0
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28' from which I wish to update all 'ls' values from 0 to 1.
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 1
2007-09-28 12:00:00 NaN NaN 1
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done using another column variable, i.e.:
data.ix[data.id == '1', 'ls'] = 1
yet this does not work using the datetime index.
Could you let me know what the method for datetime index is?
You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
then you can modify your df using:
df['ls'][pd.DatetimeIndex(df.index.date).isin(days)] = 1
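The chained df['ls'][...] = 1 assignment can raise SettingWithCopyWarning on modern pandas; an equivalent .loc-based sketch (index reconstructed from the question):

```python
import pandas as pd

idx = pd.date_range('2007-09-27', periods=5, freq='12h')
df = pd.DataFrame({'ls': 0}, index=idx)
days = ['2007-09-28']
# Match rows whose calendar date is in the list, regardless of the time part.
mask = pd.DatetimeIndex(df.index.date).isin(pd.to_datetime(days))
df.loc[mask, 'ls'] = 1
```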
