int64 to HHMM string - python

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.

I am still unsure about the format of your date column, so I will assume that the Date column holds strings and the Hr column holds int64 values. To create a TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
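A vectorized variant of the same idea may be worth trying on larger frames, since it avoids the row-wise apply. This is only a sketch: it assumes the Date strings are month/day/year (which matches the output above), and the HHMM column and the name `indexed` are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Small sample in the same shape as the frame above
df = pd.DataFrame({
    "Date": ["12/01/2010", "12/01/2010", "12/02/2010"],
    "Hr": [1, 2, 1],
})

# Parse the whole Date column once, then add the hours as timedeltas
# (assumption: dates are month/day/year)
df["TimeStamp"] = (
    pd.to_datetime(df["Date"], format="%m/%d/%Y")
    + pd.to_timedelta(df["Hr"], unit="h")
)

# Optional 'HH:MM' string for the hour, e.g. 1 -> '01:00'
df["HHMM"] = df["Hr"].astype(str).str.zfill(2) + ":00"

# Use the new column as the index
indexed = df.set_index("TimeStamp")
print(indexed)
```

If the dates are actually day-first, the format string would need to change to "%d/%m/%Y".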

Related

Create a date counter variable starting with a particular date

I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where for month 0, the date is the start_dt - 1 month, and for subsequent months, the date is a month + 1 increment.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 from the month column, add the start date converted to a month period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('M')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, convert the column to month offsets (after subtracting 1) and add the start datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, each iteration of the loop offsets the initial datetime by n months, with n starting at -1.
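For the frame in the question itself (start_dt = 201901 and a month column running from 0), another option might be to build a month-start calendar with pd.date_range and look each month number up in it. This is a sketch under the assumption that the month values are non-negative integers; the names `first` and `calendar` are made up for illustration.

```python
import pandas as pd

start_dt = 201901
df = pd.DataFrame({"month": [0, 1, 2, 3, 4]})

# Parse YYYYMM, then step back one month so that month 0 maps to Dec 2018
start = pd.to_datetime(str(start_dt), format="%Y%m")
first = start - pd.DateOffset(months=1)

# One month-start date per possible month number
calendar = pd.date_range(start=first, periods=int(df["month"].max()) + 1, freq="MS")

# Positional lookup: month n -> n-th entry of the calendar
df["date"] = calendar[df["month"].to_numpy()]
print(df)
```

The lookup also works when month numbers repeat or appear out of order, because each row is resolved independently.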

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex covering 2019.
You can derive three additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
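The join itself is not shown above, so here is a rough sketch of how it could look once both frames carry the month, day and hour columns. The frames, the column names temp and power_2018, and the helper add_keys are all invented for the example; the real datasets will differ.

```python
import pandas as pd

# Hypothetical stand-ins for the 2019 meteo data and the 2018 consumption data
meteo = pd.DataFrame(
    {"temp": [1.5, 2.0, 2.5]},
    index=pd.date_range("2019-01-01", periods=3, freq="h"),
)
power_2018 = pd.DataFrame(
    {"power_2018": [10.0, 20.0, 30.0]},
    index=pd.date_range("2018-01-01", periods=3, freq="h"),
)

KEYS = ["month", "day", "hour"]

def add_keys(frame):
    """Return a copy with month/day/hour columns taken from the DatetimeIndex."""
    out = frame.copy()
    out["month"] = out.index.month
    out["day"] = out.index.day
    out["hour"] = out.index.hour
    return out

# Keep the 2019 timestamps as a column so they survive the merge
left = add_keys(meteo).rename_axis("timestamp").reset_index()
right = add_keys(power_2018).reset_index(drop=True)

# Left join on the calendar keys, then restore the 2019 index
joined = (
    left.merge(right, on=KEYS, how="left")
        .set_index("timestamp")
        .drop(columns=KEYS)
)
print(joined)
```

The same merge can be repeated for the 2017 data. With how="left", any 2019 hour without a counterpart in the other year simply gets NaN.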

How to count by time frequency using groupby - pandas

I'm trying to count a frequency of 2 events by the month using 2 columns from my df. What I have done so far has counted all events by the unique time which is not efficient enough as there are too many results. I wish to create a graph with the results afterwards.
I've tried adapting my code using the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but cannot seem to get the command working when I input freq='day' within the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170000 results in the structure of the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly for other values as well, for example hour/day/month/year. With the answers please could you explain what is going on in your code as I am new and learning pandas and wish to understand the process. Thank you.
One possible solution is to convert the datetime column to month periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('M')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'Create Time':pd.date_range('2019-01-01', freq='10D', periods=10),
'Priority':np.random.choice([0,1], size=10)})
print (df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('M')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
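Since the question also asks about other frequencies, here is a small sketch that wraps the to_period approach in a helper so the frequency can be swapped. The helper count_by and the sample rows are made up for illustration; 'D', 'M' and 'Y' are the period aliases for day, month and year. An hourly alias works the same way, though its spelling ('h' or 'H') depends on the pandas version.

```python
import pandas as pd

# Tiny sample in the same shape as the question's data
df = pd.DataFrame({
    "Create Time": pd.to_datetime([
        "2011-01-01 00:00:00", "2011-01-01 00:01:11", "2011-01-01 13:02:10",
        "2011-01-02 00:01:25", "2011-02-01 00:01:35",
    ]),
    "Priority": [1.0, 1.0, 2.0, 1.0, 2.0],
})

def count_by(frame, freq):
    """Count rows per Priority and per period of the given frequency."""
    # Truncate each timestamp to its containing period (day, month, year, ...)
    periods = frame["Create Time"].dt.to_period(freq).rename("period")
    # size() counts the rows in each (Priority, period) group
    return frame.groupby(["Priority", periods]).size()

by_day = count_by(df, "D")
by_month = count_by(df, "M")
by_year = count_by(df, "Y")

# Wide layout (one column per month), which is usually easier to plot
table = by_month.unstack(fill_value=0)
print(table)
```

Calling table.plot(kind="bar") on the wide table would then give one group of bars per priority, assuming matplotlib is installed.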

Pandas, add date column to a series

I have a timeseries dataframe that is date agnostic and uses a period number instead of a date.
I would like to at some point add in dates, using the period.
My dataframe looks like
period custid
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
I would like to be able to pick a random starting date, for example 1/1/2018, and that would be period 1 so you would end up with
period custid date
1 1 1/1/2018
2 1 2/1/2018
3 1 3/1/2018
1 2 1/1/2018
2 2 2/1/2018
1 3 1/1/2018
2 3 2/1/2018
3 3 3/1/2018
4 3 4/1/2018
You could create a column of timedeltas based on the period column, where each row is a timedelta of (period - 1) days, so that period 1 corresponds to an offset of 0. Then add the timedeltas to your start_date, which you can define as a datetime object:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
period custid date
0 1 1 2018-01-01
1 2 1 2018-01-02
2 3 1 2018-01-03
3 1 2 2018-01-01
4 2 2 2018-01-02
5 1 3 2018-01-01
6 2 3 2018-01-02
7 3 3 2018-01-03
8 4 3 2018-01-04
Edit: In your comment, you said you were trying to add months, not days. For this, you could use your method, or alternatively, the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] -1) * MonthBegin()
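If the multiplication with MonthBegin gives trouble on a given pandas version, a plainer (if slower) way to get monthly steps is to add a pd.DateOffset per row. This is a sketch on a shortened copy of the frame from the question, not a replacement for the approach above.

```python
import pandas as pd

df = pd.DataFrame({"period": [1, 2, 3, 1, 2], "custid": [1, 1, 1, 2, 2]})
start_date = pd.to_datetime("1/1/2018")

# Period 1 maps to start_date itself, period 2 to one month later, and so on
df["date"] = [start_date + pd.DateOffset(months=int(p) - 1) for p in df["period"]]
print(df)
```

DateOffset(months=n) keeps the day of the month where possible, so a start date on the 1st always lands on the 1st of later months.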

Python pandas groupby manipulation?

Example of what the df looks like:
customer order_datetime
a 01-03-2017 12:00:00 PM
b 01-04-2017 12:00:00 PM
c 01-07-2017 12:00:00 PM
a 01-08-2017 12:00:00 PM
b 01-09-2017 12:00:00 PM
a 01-11-2017 12:00:00 PM
There are 2 things that I want to achieve. I'm still in the learning process, so I'd really appreciate any help to guide me in the right direction:
Create a list of "time between orders" where I can find the min, mean, max
Find out if "time between order" gets faster/slower, i.e. time between order_3 and order_2 vs time between order_2 and order_1
This example should set you in the right direction for your assignment.
First I'm creating a DataFrame similar to the one you show in the question:
import pandas as pd
import numpy as np
import datetime as dt
orders = pd.DataFrame({
'client': np.random.randint(65, 70, size=15),
'date': np.random.randint(0, 30, size=15)})
orders.client = orders.client.apply(chr)
# Convert the integer day offsets to dates counted from 2017-01-01
orders.date = pd.to_datetime(orders.date, unit='D', origin=pd.Timestamp('2017-01-01'))
# Sort by client and date so that the per-client diff below compares consecutive orders
orders.sort_values(['client', 'date'], inplace=True)
orders.reset_index(inplace=True, drop=True)
orders.head()
>>>>
client date
0 A 2017-01-27
1 A 2017-01-29
2 A 2017-01-30
3 B 2017-01-03
4 B 2017-01-13
The key to the solution is the line orders.groupby('client').date.diff().
First we use groupby to group the orders using client as a key, then we select the date column only, and finally we use diff to compute the difference between each record and the previous one in the same group (which is why the dates in each group must already be sorted, as done with sort_values above).
The rest of the code is just to visualize the result, i.e. renaming the Series you obtain and concatenating it with the initial DataFrame.
diff_df = pd.concat([
orders,
orders.groupby('client').date.diff().rename('diff')], axis=1)
diff_df.head(10)
>>>>
client date diff
0 A 2017-01-27 NaT
1 A 2017-01-29 2 days
2 A 2017-01-30 1 days
3 B 2017-01-03 NaT
4 B 2017-01-13 10 days
5 B 2017-01-18 5 days
6 B 2017-01-24 6 days
7 C 2017-01-01 NaT
8 C 2017-01-02 1 days
9 C 2017-01-03 1 days
Once you have the time differences you can compute all kinds of in-group metrics you need.
First you can try pd.Series.describe:
diff_df.groupby('client')['diff'].describe()
>>>>
count mean std min \
client
A 1 5 days 00:00:00 NaT 5 days 00:00:00
B 1 12 days 00:00:00 NaT 12 days 00:00:00
C 3 4 days 00:00:00 1 days 17:34:09.189773 2 days 00:00:00
D 1 4 days 00:00:00 NaT 4 days 00:00:00
E 4 5 days 00:00:00 3 days 03:53:40.789838 2 days 00:00:00
25% 50% 75% max
client
A 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
B 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00
C 3 days 12:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
D 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00
E 2 days 18:00:00 4 days 12:00:00 6 days 18:00:00 9 days 00:00:00
If that is not enough you can define your own aggregations.
You will need a list of functions if you work on a single Series:
metrics = [pd.Series.count, pd.Series.nunique, pd.Series.min, pd.Series.max, pd.Series.mean]
diff_df.groupby('client')['diff'].aggregate(metrics)
>>>>
count nunique min max mean
client
A 1 1 5 days 5 days 5 days
B 1 1 12 days 12 days 12 days
C 3 2 2 days 5 days 4 days
D 1 1 4 days 4 days 4 days
E 4 4 2 days 9 days 5 days
Or a dictionary of {column -> function, column -> function_list} if you work on the whole DataFrame:
metrics = {
'date': [pd.Series.count, pd.Series.nunique],
'diff': [pd.Series.min, pd.Series.max, pd.Series.mean],
}
diff_df.groupby('client').aggregate(metrics)
>>>>
diff date
min max mean count nunique
client
A 5 days 5 days 5 days 2 2
B 12 days 12 days 12 days 2 2
C 2 days 5 days 4 days 4 4
D 4 days 4 days 4 days 2 2
E 2 days 9 days 5 days 5 5
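The second part of the question (whether the time between orders is getting shorter or longer) is not covered above. One possible reading is to take the difference of the gaps themselves, per customer. The sketch below uses the rows from the question and assumes the dates are month-day-year; the column names gap and gap_change are made up for the example.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "b", "c", "a", "b", "a"],
    "order_datetime": pd.to_datetime(
        [
            "01-03-2017 12:00:00 PM", "01-04-2017 12:00:00 PM",
            "01-07-2017 12:00:00 PM", "01-08-2017 12:00:00 PM",
            "01-09-2017 12:00:00 PM", "01-11-2017 12:00:00 PM",
        ],
        format="%m-%d-%Y %I:%M:%S %p",  # assumption: month-day-year
    ),
})

# Consecutive orders of a customer must be adjacent and in time order
orders = orders.sort_values(["customer", "order_datetime"])

# Time since the same customer's previous order (NaT for the first order)
orders["gap"] = orders.groupby("customer")["order_datetime"].diff()

# Change in gap: negative means orders are arriving faster than before
orders["gap_change"] = orders.groupby("customer")["gap"].diff()

# Per-customer summary of the gaps
summary = orders.groupby("customer")["gap"].agg(["min", "mean", "max"])
print(orders)
print(summary)
```

For customer a the gaps are 5 days and then 3 days, so gap_change for the third order is -2 days, meaning the third order came sooner than the previous interval would suggest.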
