Pandas, dataframe with a datetime64 column, querying by hour - python

I have a pandas dataframe df which has one datetime64 column, e.g.:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1471 entries, 0 to 2940
Data columns (total 2 columns):
date 1471 non-null values
id 1471 non-null values
dtypes: datetime64[ns](1), int64(1)
I would like to sub-sample df using the hour of the day as the criterion (independently of the other information in date). E.g., in pseudocode:
df_sub = df[ (HOUR(df.date) > 8) & (HOUR(df.date) < 20) ]
for some function HOUR.
I guess the problem can be solved via a preliminary conversion from datetime64 to datetime. Can this be handled more efficiently?

Found a simple solution.
df['hour'] = df.date.apply(lambda x: x.hour)
df_sub = df[(df.hour > 8) & (df.hour < 20)]
EDIT:
There is a .dt accessor specifically introduced to handle this problem. The query becomes:
df_sub = df[(df.date.dt.hour > 8) & (df.date.dt.hour < 20)]
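For reference, a minimal runnable sketch of the .dt approach (the sample values here are made up):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01 07:30', '2013-01-01 12:00', '2013-01-01 21:15']),
    'id': [1, 2, 3],
})
# keep only rows whose hour of the day is strictly between 8 and 20
df_sub = df[(df.date.dt.hour > 8) & (df.date.dt.hour < 20)]
print(df_sub)  # only the 12:00 row survives the filter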

Related

How to transform a Dataframe into a Series with Darts including the DatetimeIndex?

My dataframe, temperature measurements over time:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17545 entries, 2020-01-01 00:00:00+00:00 to 2022-01-01 00:00:00+00:00
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 T (degC) 17545 non-null float64
dtypes: float64(1)
memory usage: 274.1 KB
After transforming the dataframe into a Time Series with
df_series = TimeSeries.from_dataframe(df)
df_series
the result looks like:
For this reason, I can't plot the Series:
TypeError: Plotting requires coordinates to be numeric, boolean, or dates of type numpy.datetime64, datetime.datetime, cftime.datetime or pandas.Interval. Received data of type object instead.
I expected this to work, based on the darts doc (https://unit8co.github.io/darts/):
df
    The DataFrame
time_col
    The time column name. If set, the column will be cast to a pandas DatetimeIndex.
    If not set, the DataFrame index will be used. In this case the DataFrame must contain an index that is
    either a pandas DatetimeIndex or a pandas RangeIndex. If a DatetimeIndex is used, it is better if it has
    no holes; alternatively, setting fill_missing_dates can in some cases solve these issues (filling holes
    with NaN, or with the provided fillna_value numeric value, if any).
Given the method description above, I don't know why it changed my DatetimeIndex to object.
Any suggestions on that?
Thanks.
I had the same issue. Darts doesn't work with datetime64[ns, UTC], but works with datetime64[ns]: it doesn't recognise datetime64[ns, UTC] as a datetime type.
This fixes it by converting datetime64[ns, UTC] -> datetime64[ns]:
def set_index(df):
    df['open_time'] = pd.to_datetime(df['open_time'], infer_datetime_format=True).dt.tz_localize(None)
    df.set_index(keys='open_time', inplace=True, drop=True)
    return df
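If the timestamps already live in the index, as in the question, the same idea applies directly to the index; a minimal sketch with made-up data, assuming a tz-aware UTC DatetimeIndex like the one above:

import pandas as pd

# stand-in for the question's frame: hourly temperatures with a UTC-aware index
idx = pd.date_range('2020-01-01', periods=3, freq='H', tz='UTC')
df = pd.DataFrame({'T (degC)': [1.2, 1.1, 0.9]}, index=idx)

# drop the timezone: datetime64[ns, UTC] -> datetime64[ns]
df.index = df.index.tz_localize(None)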

pd.to_datetime() not properly converting the type to datetime

I am looking to parse data with multiple timezones in a single column. I am using the pd.to_datetime function.
df = pd.DataFrame({'timestamp':['2019-05-21 12:00:00-06:00', '2019-05-21 12:15:00-07:00']})
df['timestamp'] = pd.to_datetime(df.timestamp)
df.info()
This results in:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 2 non-null object
dtypes: object(1)
memory usage: 144.0+ bytes
I did some testing and noticed that the same does not happen when the offsets are all the same:
df = pd.DataFrame({'timestamp':['2019-05-21 12:00:00-06:00', '2019-05-21 12:15:00-06:00']})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 timestamp 2 non-null datetime64[ns, pytz.FixedOffset(-360)]
dtypes: datetime64[ns, pytz.FixedOffset(-360)](1)
memory usage: 144.0 bytes
If this error is confirmed, it will have direct implications for the datetime accessors, but it also breaks some compatibility (or assumed compatibility) with libraries that perform conversions on these types. pd.to_datetime() successfully converts every element to a datetime.datetime, but libraries like pyarrow will apply a fixed tz offset to the column.
Based on many questions on StackOverflow (ex: Convert pandas column with multiple timezones to single timezone) this was not the behavior of pandas in previous versions.
I am on pandas 1.2.4 (I updated from 1.2.2 that shows the same). Python 3.7.9.
Should I report this as a GitHub issue?
I'd suggest keeping the original timestamp column with the offset (so you don't lose that info) and working in UTC (utc=True). If you know the time zone that produced the offsets in the data, you can also tz_convert.
Example (a cleaned-up version of the linked question):
import pandas as pd
# sample data
df = pd.DataFrame({'timestamp': ['2019-05-21 12:00:00-06:00',
                                 '2019-02-21 12:15:00-07:00']})
# assuming we know the origin time zone
zone = 'America/Boise'
# skip the .dt.tz_convert(zone) part if you don't have the specific zone
df['datetime'] = pd.to_datetime(df['timestamp'], utc=True).dt.tz_convert(zone)
df
timestamp datetime
0 2019-05-21 12:00:00-06:00 2019-05-21 12:00:00-06:00
1 2019-02-21 12:15:00-07:00 2019-02-21 12:15:00-07:00
df['datetime']
0 2019-05-21 12:00:00-06:00
1 2019-02-21 12:15:00-07:00
Name: datetime, dtype: datetime64[ns, America/Boise]
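Once the column has a single datetime dtype like this, the .dt accessors work again; a quick check, continuing the snippet above:

df['datetime'].dt.hour  # 12 for both rows: wall-clock hours in the Boise zone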

How to plot a large dataframe

This is what my dataframe looks like :
Date, Sales, location
There are a total of 20k entries. Dates range from 2016 to 2019. I need dates on the x axis and sales on the y axis. This is how I have done it:
df1.plot(x="DATE", y=["Total_Sales"], kind="bar", figsize=(1000,20))
Unfortunately, even with this the dates aren't clearly visible. How do I make sure that they are legibly plotted? Is there a way to create bins or something?
Edit: output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 382 entries, 0 to 18116
Data columns (total 5 columns):
DATE 382 non-null object
Total_Sales 358 non-null float64
Total_Sum 24 non-null float64
Total_Units 382 non-null int64
locationkey 382 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 17.9+ KB
Edit: Maybe I can divide it into years stacked on top of each other. So, for instance, Jan to Dec 2016 would be the first, and then succeeding years would be plotted with it. How do I do that?
I recommend that you do this:
df.DATE = pd.to_datetime(df.DATE)
df = df.set_index('DATE')
Now the dataframe's index is the date. This is very convenient. For example, you can do:
df.resample('Y').sum()
You should also be able to plot:
df.Total_Sales.plot()
And pandas will take care of making the x-axis linear in time, formatting the date, etc.
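Putting it all together, a sketch with made-up data standing in for the question's DATE and Total_Sales columns; resampling into monthly bins keeps the x-axis readable across several years:

import numpy as np
import pandas as pd

# hypothetical stand-in for the question's data: daily sales over 2016-2019
df1 = pd.DataFrame({
    'DATE': pd.date_range('2016-01-01', '2019-12-31', freq='D'),
    'Total_Sales': np.random.rand(1461) * 100,
})
df1 = df1.set_index('DATE')

# aggregate into monthly bins, then plot a single line over the four years
df1['Total_Sales'].resample('M').sum().plot()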

How do I combine dataframe columns

I've a dataframe df that looks like:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810 entries, 0 to 809
Data columns (total 21 columns):
event_type 810 non-null object
datetime 810 non-null datetime64[ns]
person 810 non-null object
...
from_file 0 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(16)
memory usage: 133.0+ KB
(There are 21 columns, but I'm only interested in the four above, so I've omitted the rest.)
I want to create a second dataframe df_b that has two columns, where one is a combination of df's event_type, person and from_file columns, and the other is df's datetime. Did I explain that well? (So df_b has two columns derived from df's four, with three of the above combined into one.)
I thought of creating a new dataframe df_b as:
df_b = pandas.DataFrame({'event_type+person+from_file': [], 'datetime': []})
Then selecting all rows with:
df.loc[:, ['event_type','person','from_file','datetime']]
But beyond that I don't know how to achieve the rest, and I keep thinking I'm going to end up with datetime values that don't correspond to the original rows' datetime values in df.
So can you show me how to:
select: event_type, person, from_file, datetime from all rows in df
combine: event_type, person, from_file with '+' between the values
and then put (event_type+person+from_file), datetime into df_b
?
To drop NaN values use:
df_clean = df.dropna(subset=['event_type', 'person', 'from_file'])
Concatenating string columns in Pandas is as easy as:
df_clean['event_type+person+from_file'] = (df_clean['event_type'] + '+'
                                           + df_clean['person'] + '+'
                                           + df_clean['from_file'].astype(str))
(from_file is float64 in the frame above, so it needs the astype(str) cast before concatenation.)
And then:
df_b = df_clean[['event_type+person+from_file', 'datetime']].copy()
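An equivalent sketch (same column names assumed) that builds the key in one step with astype(str) and agg, which scales better if more columns join the key; because both columns are taken from the same rows of df_clean, the datetime values stay aligned with the combined key:

df_b = pd.DataFrame({
    'event_type+person+from_file':
        df_clean[['event_type', 'person', 'from_file']].astype(str).agg('+'.join, axis=1),
    'datetime': df_clean['datetime'],
})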

pandas, python - how to select specific times in timeseries

I have now worked for quite some time with python and pandas for analysing a set of hourly data, and I find it quite nice (coming from Matlab).
Now I am kind of stuck. I created my DataFrame like that:
SamplingRateMinutes=60
index = DateRange(initialTime,finalTime, offset=datetools.Minute(SamplingRateMinutes))
ts=DataFrame(data, index=index)
What I want to do now is to select the data for all days at the hours 10 to 13 and 20 to 23, to use for further calculations.
So far I sliced the data using
selectedData=ts[begin:end]
And I am sure I could come up with some kind of dirty loop to select the data needed. But there must be a more elegant way to index exactly what I want. I am sure this is a common problem, and the solution in pseudocode should look somewhat like this:
myIndex=ts.index[10<=ts.index.hour<=13 or 20<=ts.index.hour<=23]
selectedData=ts[myIndex]
To mention: I am an engineer and no programmer :) ... yet
In upcoming pandas 0.8.0, you'll be able to write
hour = ts.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))
data = ts[selector]
Here's an example that does what you want:
In [32]: from datetime import datetime as dt
In [33]: dr = p.DateRange(dt(2009,1,1),dt(2010,12,31), offset=p.datetools.Hour())
In [34]: hr = dr.map(lambda x: x.hour)
In [35]: dt = p.DataFrame(rand(len(dr),2), dr)
In [36]: dt
Out[36]:
<class 'pandas.core.frame.DataFrame'>
DateRange: 17497 entries, 2009-01-01 00:00:00 to 2010-12-31 00:00:00
offset: <1 Hour>
Data columns:
0 17497 non-null values
1 17497 non-null values
dtypes: float64(2)
In [37]: dt[(hr >= 10) & (hr <=16)]
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Index: 5103 entries, 2009-01-01 10:00:00 to 2010-12-30 16:00:00
Data columns:
0 5103 non-null values
1 5103 non-null values
dtypes: float64(2)
As it looks messy in my comment above, I decided to provide another answer: a syntax update for pandas 0.10.0 of Marc's answer, combined with Wes' hint:
import pandas as pd
from datetime import datetime
from numpy.random import rand

dr = pd.date_range(datetime(2009, 1, 1), datetime(2010, 12, 31), freq='H')
dt = pd.DataFrame(rand(len(dr), 2), dr)
hour = dt.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20<=hour) & (hour<=23))
data = dt[selector]
Pandas DataFrame has a built-in method for this:
pandas.DataFrame.between_time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2),
                  index=pd.date_range(start='2017-01-01', freq='10min', periods=1000))
Create 2 data frames, one for each time window:
df1 = df.between_time(start_time='10:00', end_time='13:00')
df2 = df.between_time(start_time='20:00', end_time='23:00')
The data frame you want is df1 and df2 merged and sorted:
pd.concat([df1, df2], axis=0).sort_index()
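A closely related alternative, reusing df from above, is a single boolean mask over the hour, as in the earlier answers. Note the boundary difference: between_time('10:00', '13:00') selects by clock time up to 13:00 itself, while the hour mask keeps the whole 13:xx hour:

hour = df.index.hour
df_sub = df[((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))]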
