Extract hour from timestamp with Python

I have a dataframe df_energy2
df_energy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29974 entries, 0 to 29973
Data columns (total 4 columns):
TIMESTAMP 29974 non-null datetime64[ns]
P_ACT_KW 29974 non-null int64
PERIODE_TARIF 29974 non-null object
P_SOUSCR 29974 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 936.8+ KB
with this structure:
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
Is there a Python function which can extract the hour from TIMESTAMP?
Kind regards

I think you need dt.hour:
print(df.TIMESTAMP.dt.hour)
0 0
1 0
Name: TIMESTAMP, dtype: int64
df['hours'] = df.TIMESTAMP.dt.hour
print(df)
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR hours
0 2016-01-01 00:00:00 116 HC 250 0
1 2016-01-01 00:10:00 121 HC 250 0

Given your data:
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
You have the timestamp as the index. To extract hours from the timestamp when it is the index of the dataframe:
hours = df_energy2.index.hour
Edit: Yes, jezrael, you're right. To restate his point: a pandas datetime column has an accessor for this, namely .dt:
<dataframe>.<ts_column>.dt.hour
Example in your context, where the column with the date is TIMESTAMP:
df.TIMESTAMP.dt.hour
A similar question - Pandas, dataframe with a datetime64 column, querying by hour
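For completeness, a minimal runnable sketch covering both answers above (the sample data here is made up to mirror the question):
import pandas as pd

# TIMESTAMP as a regular column: use the .dt accessor
df = pd.DataFrame({'TIMESTAMP': pd.to_datetime(['2016-01-01 00:00:00', '2016-01-01 00:10:00']),
                   'P_ACT_KW': [116, 121]})
df['hours'] = df['TIMESTAMP'].dt.hour

# TIMESTAMP as the index: a DatetimeIndex exposes .hour directly
df_indexed = df.set_index('TIMESTAMP')
hours = df_indexed.index.hour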

Related

How to change datetime format in pandas, fastest way?

I import from a .csv and get object-dtype columns:
begin end
0 2019-03-29 17:02:32.838469+00 2019-04-13 17:32:19.134874+00
1 2019-06-13 19:22:19.331201+00 2019-06-13 19:51:21.987534+00
2 2019-03-27 06:56:51.138795+00 2019-03-27 06:56:54.834751+00
3 2019-05-28 11:09:29.320478+00 2019-05-29 06:47:21.794092+00
4 2019-03-24 07:03:03.582679+00 2019-03-24 09:50:32.595199+00
I need to get them in datetime format:
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
What I do (first I convert them to datetime format, then cut them to dates only, then convert them to datetime format again):
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
df['begin'] = df['begin'].dt.date
df['end'] = df['end'].dt.date
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
Is there any short way to do this without converting 2 times?
You can use .apply() on the 2 columns to call pd.to_datetime once per column, then dt.normalize() to remove the time info while keeping the datetime format. (dt.tz_localize(None) is also used, to remove the timezone info):
df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())
Result:
print(df)
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 begin 5 non-null datetime64[ns] <=== datetime format
1 end 5 non-null datetime64[ns] <=== datetime format
dtypes: datetime64[ns](2)
memory usage: 120.0 bytes
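If you prefer an explicit loop to .apply(), the same chain works per column; a sketch, assuming the columns are named begin and end as above and that the '+00' suffix parses as a UTC offset:
for col in ['begin', 'end']:
    df[col] = (pd.to_datetime(df[col], utc=True)  # parse strings as tz-aware UTC
                 .dt.tz_localize(None)            # drop the timezone info
                 .dt.normalize())                 # zero out the time-of-day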

Datetime fails when setting astype, date mangled

I am importing a csv of 20 variables and 1500 records. There are 5 date columns that are in UK date format dd/mm/yyyy, and they import as str.
I need to be able to subtract one date from another. They are hospital admissions; I need to subtract the admission date from the discharge date to get the length of stay.
I have had a number of problems.
To illustrate, I have used 2 columns.
import pandas as pd
import numpy as np
from datetime import datetime
# import the .csv
df = pd.read_csv("/Users........csv", usecols = ['ADMIDATE', 'DISDATE'])
df
ADMIDATE DISDATE
0 04/02/2018 07/02/2018
1 25/07/2017 1801-01-01
2 28/06/2017 01/07/2017
3 22/06/2017 1801-01-01
4 11/12/2017 15/12/2017
... ... ...
1503 25/01/2019 27/01/2019
1504 31/08/2018 1801-01-01
1505 20/09/2018 05/11/2018
1506 28/09/2018 1801-01-01
1507 21/02/2019 24/02/2019
1508 rows × 2 columns
I removed about 100 records with a DISDATE of 1801-01-01; these are likely bad data from the patient still being in hospital when the data was collected.
To convert the dates to datetime, I have used .astype('datetime64[ns]')
This is because I didn't know how to use pd.to_datetime on multiple columns.
df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null int64
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB
So, the conversion appears to have worked.
However, on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
Unnamed: 0 ADMIDATE DISDATE
0 0 2018-04-02 2018-07-02
1 2 2017-06-28 2017-01-07
2 4 2017-11-12 2017-12-15
3 5 2017-09-04 2017-12-04
4 6 2017-05-30 2017-01-06
5 7 2017-02-08 2017-07-08
6 8 2017-11-17 2017-11-18
7 9 2018-03-14 2018-03-20
8 10 2017-04-26 2017-03-05
9 11 2017-05-16 2017-05-17
10 12 2018-01-17 2018-01-19
11 13 2017-12-18 2017-12-20
12 14 2017-02-10 2017-04-10
13 16 2017-03-30 2017-07-04
14 17 2017-01-12 2017-12-18
15 18 2017-12-07 2017-07-14
16 19 2017-05-04 2017-08-04
17 20 2017-10-30 2017-01-11
18 21 2017-06-19 2017-06-22
19 22 2017-04-05 2017-08-05
So when I subtract the ADMIDATE from the DISDATE I get negative values.
df['DISDATE'] - df['ADMIDATE']
0 91 days
1 -172 days
2 33 days
3 91 days
4 -144 days
...
1394 188 days
1395 -291 days
1396 2 days
1397 -132 days
1398 3 days
Length: 1399, dtype: timedelta64[ns]
I would like a method that works on all my date columns, keeps the UK format and allows me to do basic operations on the date fields.
After the suggestion from @code-different below, which seems very sensible:
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
The format is unchanged despite dayfirst=True.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null datetime64[ns]
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB
df.head()
Unnamed: 0 ADMIDATE DISDATE
0 1970-01-01 00:00:00.000000000 2018-04-02 2018-07-02
1 1970-01-01 00:00:00.000000002 2017-06-28 2017-01-07
2 1970-01-01 00:00:00.000000004 2017-11-12 2017-12-15
3 1970-01-01 00:00:00.000000005 2017-09-04 2017-12-04
4 1970-01-01 00:00:00.000000006 2017-05-30 2017-01-06
I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific, adequate format, e.g.:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017',
                                '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01',
                               '01/07/2017', '1801-01-01', '15/12/2017']}).replace({'1801-01-01': np.datetime64('NaT')})
for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
# df
# ADMIDATE DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25 NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22 NaT
# 4 2017-12-11 2017-12-15
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ADMIDATE 5 non-null datetime64[ns]
# 1 DISDATE 3 non-null datetime64[ns]
# dtypes: datetime64[ns](2)
Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns, so you just loop over the columns one by one. The strings are in UK format (day-first), so you simply tell to_datetime that:
df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
astype('datetime64[ns]') is too inflexible for what you need.
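Once both columns parse correctly as day-first dates, the original goal (length of stay) is a plain subtraction; a sketch against the df built in the first answer above, where the bad discharge dates are already NaT:
df['LOS'] = df['DISDATE'] - df['ADMIDATE']  # timedelta64[ns]; rows with NaT stay NaT
df['LOS_days'] = df['LOS'].dt.days          # day counts (NaN where DISDATE is NaT)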

How to add a column name for Timeseries when it is indexed

Input: df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2019-01-16 to 2018-08-23 - I want to add this as my first column for the analysis.
Data columns (total 5 columns):
open 100 non-null float64
high 100 non-null float64
low 100 non-null float64
close 100 non-null float64
volume 100 non-null float64
dtypes: float64(5)
memory usage: 9.7+ KB
df = df.assign(time=df.index)
df.head()
Out[76]:
1. open 2. high 3. low 4. close 5. volume time
2019-01-16 105.2600 106.2550 104.9600 105.3800 29655851 2019-01-16
2019-01-15 102.5100 105.0500 101.8800 105.0100 31587616 2019-01-15
2019-01-14 101.9000 102.8716 101.2600 102.0500 28437079 2019-01-14
2019-01-11 103.1900 103.4400 101.6400 102.8000 28314202 2019-01-11
2019-01-10 103.2200 103.7500 102.3800 103.6000 30067556 2019-01-10
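If you would rather have the dates as an ordinary first column instead of the index, one alternative is reset_index; a sketch, assuming the DatetimeIndex shown in df.info() above:
df = df.rename_axis('time').reset_index()  # the index becomes a regular 'time' column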

Pandas: Adding varying numbers of days to a date in a dataframe

I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I am having trouble finding something that will work for a Series rather than an individual int value. Any and all help would be very much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
df = pd.DataFrame({'Date_Min': [dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)],
                   'Mean_Days': [1, 2, 3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
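An equivalent spelling uses pd.to_timedelta, which also vectorises over the whole column; a sketch against the asker's frame (column names as in the question):
merged['Recency_Date'] = merged['Date_Min'] + pd.to_timedelta(merged['Mean_Days'], unit='D')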

Pandas df.fillna(method = 'pad') not working on 28000 row df

I have been trying to replace the NaN values in my DataFrame with the last valid item, but this does not seem to do the job. Just wondering if anyone else has this same issue, or what could be causing this problem.
In [16]: ABCW.info()
Out[16]:<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
In [18]: ABCW.fillna(method = 'pad')
In [19]: ABCW.info()
Out [19]: <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 692 entries, 2014-10-22 10:30:00 to 2015-05-21 16:00:00
Data columns (total 6 columns):
Price 692 non-null float64
Volume 692 non-null float64
Symbol_Num 692 non-null object
Actual Price 577 non-null float64
Market Cap Rank 577 non-null float64
Market Cap 577 non-null float64
dtypes: float64(5), object(1)
memory usage: 37.8+ KB
There is no change in the number of non-null values, and all the preexisting NaN values are still in the data frame.
You are using the 'pad' method. This is basically a forward fill. See examples at http://pandas.pydata.org/pandas-docs/stable/missing_data.html
I am reproducing the relevant example here:
In [33]: df
Out[33]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [34]: df.fillna(method='pad')
Out[34]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h -2.104569 -0.706771 -1.039575
This method will not do a backfill. You should also do a backfill if you want all your NaNs to go away. Also, inplace=False by default, so you probably want to assign the result of the operation back to ABCW, like so:
ABCW = ABCW.fillna(method = 'pad')
ABCW = ABCW.fillna(method = 'bfill')
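Note that in recent pandas versions fillna(method=...) is deprecated in favour of the dedicated ffill/bfill methods; the same assign-back pattern applies (a sketch):
ABCW = ABCW.ffill().bfill()  # forward-fill, then back-fill any leading NaNs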
