Consider a CSV file:
string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
I can read this in, and reformat the date column into datetime format:
import pandas as pd

b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')
I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?
What I am currently trying is re-indexing by the date:
b.index = b['date']
I can access the month like so:
b.index.month
However I can't seem to find a function to lump together by month.
Managed to do it:
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])
Or
b.groupby(pd.Grouper(freq='M')) # update for v0.21+
(Update 2018)
Note that pd.TimeGrouper is deprecated and will be removed. Instead use:
df.groupby(pd.Grouper(freq='M'))
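For completeness, a minimal sketch (assuming the same b.dat file from the question) showing the grouped object actually being aggregated rather than just created:

import pandas as pd

# minimal sketch, assuming the b.dat file from the question
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'], format='%m/%d/%y %I:%M%p')

# sum the 'number' column per calendar month (labels fall on month end)
monthly = b.groupby(pd.Grouper(freq='M'))['number'].sum()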
To group time-series data you can use the resample method. For example, to group by month:
df.resample(rule='M', on='date')['Values'].sum()
You can find the list of offset aliases here.
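A couple of other aliases follow the same pattern; a sketch, assuming the same df with 'date' and 'Values' columns as in the example above:

# sketch: the same pattern with other offset aliases
df.resample(rule='W', on='date')['Values'].sum()   # weekly (week ending Sunday)
df.resample(rule='Q', on='date')['Values'].sum()   # quarterly (quarter end)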
One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.
Normalise day of month
df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
                   'Values': [5, 10, 15, 20]})
# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)
Then use groupby as normal:
g = df.groupby('YearMonth')
res = g['Values'].sum()
# YearMonth
# 2017-09-01 20
# 2017-10-01 30
# Name: Values, dtype: int64
Comparison with pd.Grouper
The subtle benefit of this solution is that, unlike with pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, so you can easily extract groups via get_group:
some_group = g.get_group('2017-10-01')
Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.
Comparison with string conversion
An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).
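As a rough illustration of that trade-off, a sketch comparing the memory footprint of a datetime column with its string representation (the sizes quoted in the comments are approximate):

import pandas as pd

# sketch: datetime64 values are fixed-width numbers; strings are Python objects
dates = pd.Series(pd.date_range('2017-01-01', periods=100000, freq='H'))
strings = dates.dt.strftime('%Y-%m')

dates.memory_usage(deep=True)    # ~0.8 MB (8 bytes per element)
strings.memory_usage(deep=True)  # several times larger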
A slightly different take on @jpp's solution, but outputting a YearMonth string:
df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month}'.format(year=x.year, month=x.month))
res = df.groupby('YearMonth')['Values'].sum()
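One caveat worth noting (my addition, not part of the original answer): the month above is not zero-padded, so the resulting strings do not sort chronologically ('2017-10' sorts before '2017-9'); strftime gives a padded, sortable alternative:

# sketch: zero-padded year-month strings that sort chronologically
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m')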
Related
I have a full-year hourly series, that we may call "calendar":
from pandas import date_range, Series
calendar = Series(
index=date_range("2006-01-01", "2007-01-01", freq="H", closed="left", tz="utc"),
data=range(365 * 24)
)
Now I have a new index, which is another hourly series, but starting and ending at arbitrary datetimes:
index = date_range("2019-01-01", "2020-10-02", freq="H", tz="utc")
I would like to create a new series result that has the same index as index and, for each month-day-hour combination, it takes the value from the corresponding month-day-hour in the calendar.
I could iterate to have a working solution like so, with a try-except just to ignore February 29th:
result = Series(index=index, dtype="float")
for timestamp in result.index:
    try:
        calendar_timestamp = timestamp.replace(year=2006)
    except ValueError:  # February 29th does not exist in 2006
        continue
    result.loc[timestamp] = calendar.loc[calendar_timestamp]
This, however, is very inefficient, so does anybody know how to do it better? By better I mainly mean faster (CPU-time-wise).
Constraints/notes:
No Numba, nor Cython, just CPython and Pandas/NumPy
It is fine to leave February 29th with NaN values, since that day is not represented in the calendar
We can always assume that the index is properly sorted and has no gaps (the same applies to the calendar)
Let's try extracting the combination as string and map:
import pandas as pd

cal1 = pd.Series(calendar.values,
                 index=calendar.index.strftime('%m%d%H'))
result = index.to_series().dt.strftime('%m%d%H').map(cal1)
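As a quick sanity check (a sketch using the index and result defined above), February 29th comes out as NaN automatically, because keys starting with '0229' never occur in the 2006 calendar:

# sketch: 2006 has no Feb 29, so those keys are missing from cal1 and map to NaN
result.loc["2020-02-29"].isna().all()   # True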
I have a df like this:
Dat
10/01/2016
11/01/2014
12/02/2013
The column 'Dat' has object type, so I am trying to convert it to datetime using the pandas to_datetime() function like this:
from functools import partial

to_datetime_rand = partial(pd.to_datetime, format='%m/%d/%Y')
df['DAT'] = df['DAT'].apply(to_datetime_rand)
Everything works well, but I run into performance issues when my df has more than 2 billion rows; in that case this method stalls and does not work well.
Does the pandas to_datetime() function have a way to do the conversion in chunks, or perhaps iteratively by looping?
Thanks.
If performance is a concern I would advise to use the following function to convert those columns to date_time:
def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
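Usage would then look something like this (a sketch, assuming the 'DAT' column from the question):

df['DAT'] = lookup(df['DAT'])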
Timings:
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
UPDATE: This enhancement has been incorporated into pandas 0.23.0 as the cache parameter of pd.to_datetime:
cache : boolean, default False
    If True, use a cache of unique, converted dates to apply the datetime
    conversion. May produce significant speed-up when parsing duplicate
    date strings, especially ones with timezone offsets.
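So on pandas 0.23.0 or later you can get much the same speed-up without a helper function; a minimal sketch, assuming the 'DAT' column from the question:

# sketch: let pandas cache the unique parsed dates itself (pandas >= 0.23.0)
df['DAT'] = pd.to_datetime(df['DAT'], format='%m/%d/%Y', cache=True)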
You could split your huge dataframe into smaller chunks; for example, this function does it and lets you decide the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize=10000):
    """Split df into a list of consecutive chunks of at most chunkSize rows."""
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i * chunkSize:(i + 1) * chunkSize])
    return listOfDf
After you have chunks, you can apply the datetime function on each chunk separately.
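For example, a sketch of applying the conversion per chunk and stitching the pieces back together (assuming the splitDataFrameIntoSmaller function above and the 'DAT' column from the question):

# sketch: convert each chunk separately, then reassemble
converted = []
for chunk in splitDataFrameIntoSmaller(df, chunkSize=100000):
    chunk = chunk.copy()  # work on a copy rather than a view of df
    chunk['DAT'] = pd.to_datetime(chunk['DAT'], format='%m/%d/%Y')
    converted.append(chunk)
df = pd.concat(converted)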
I just came across this same issue myself. Thanks to SerialDev for the excellent answer. To build on that, I tried using datetime.strptime instead of pd.to_datetime:
from datetime import datetime as dt
dates = {date : dt.strptime(date, '%m/%d/%Y') for date in df['DAT'].unique()}
df['DAT'] = df['DAT'].apply(lambda v: dates[v])
The strptime method was 6.5x faster than the to_datetime method for me.
Inspired by the previous answers, for the case of having both performance problems and multiple date formats, I suggest the following solution.
from datetime import datetime

dates = {}
for date in df['DAT'].unique():
    for ft in ['%Y/%m/%d', '%Y']:
        try:
            dates[date] = datetime.strptime(date, ft) if date else None
        except ValueError:
            continue
df['DAT'] = df['DAT'].apply(lambda v: dates[v])
This is the closest to what I'm looking for that I've found.
Let's say my dataframe looks something like this:
import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}
df = pd.DataFrame(data=d)
I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.
I imagine this will look something along the lines of:
g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())
However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.
for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i, 'date'], '%Y-%m-%d').day
Apparently location based indexing only allows for integers? How could I solve this problem?
"However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with."
Why?
To fix your code, you need:
consecutive['date'] = pd.to_datetime(consecutive['date'])
g = consecutive.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))
Note the following:
The code above avoids repetitions. That is a basic programming principle: Don't Repeat Yourself
It converts 1 to timedelta for proper comparison.
It takes the absolute difference.
Tip: write a top-level function for your work instead of a lambda, as it gives better readability, brevity, and aesthetics:
def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

g['date'].apply(differencer, day_dif=1)
Explanation:
It is pretty straightforward. The dates are converted to Timestamp type, then subtracted. The difference is a timedelta, which must be compared with a timedelta object, hence the conversion of 1 (or day_dif) to a timedelta. The result of the comparison is a Boolean Series. Booleans are represented by 0 for False and 1 for True, so the sum of a Boolean Series returns the total number of True values in the Series.
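A tiny illustration of the pieces described above:

# the comparison yields a Boolean Series; its sum counts the True values
deltas = pd.Series(pd.to_timedelta(['1 days', '4 days', '1 days']))
mask = deltas == pd.to_timedelta(1, unit='D')
mask.sum()   # 2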
One solution would be to use pivot tables to count the number of times a Comp_ID and an item_number were observed on consecutive days.
import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}
df = pd.DataFrame(data=d).sort_values(['item_number', 'Comp_ID', 'date'])
df['date'] = pd.to_datetime(df['date'])
df['delta'] = df['date'] - df['date'].shift(1)
df = df[(df['delta'] == pd.Timedelta(days=1)) &
        (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
        (df['item_number'] == df['item_number'].shift(1))].pivot_table(
            index=['item_number', 'Comp_ID'], values=['date'],
            aggfunc='count').reset_index()
df.rename(columns={'date': 'consecutive_days'}, inplace=True)
Results in
item_number Comp_ID consecutive_days
0 AKD098008 988797387 1
1 K208UL 998798098 1
I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time-series dataframe that has weekly summaries (or daily summaries, or summaries per custom date_range object) of quantities A and B.
I can generate week number and aggregate sales based on those, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want, as it starts at the earliest datetime, correct to the nanosecond. You can replace it with datetimes whose time component is 00:00:00 by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect but passing a multiple like '2D' results in the 'ugly' index, that is, starting at the earliest correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is stick to singles like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
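Putting that together for the example data in the question, a sketch of a weekly summary restricted to the quantity columns (to avoid aggregating the datetime column itself):

# sketch: weekly sum and count of the quantity columns with the newer API
weekly = df[['quantityA', 'quantityB']].resample('W').agg(['sum', 'count'])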
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may start not at 00:30 of a day but at any time of day, and the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, then put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
import pandas as pd

s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
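If you still need the minimum-count threshold from the question, pd.Grouper can be combined with a custom aggregation; a sketch, using the 40-value cutoff from the question:

import numpy as np

# sketch: daily sum, but NaN whenever a day has fewer than 40 non-NaN values
d = s.groupby(pd.Grouper(freq='1D')).agg(
    lambda x: x.sum() if x.count() >= 40 else np.nan)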