I have a pandas dataframe that contains two date columns, a start date and an end date, that define a range. I'd like to collect a total count for every date across all rows in the dataframe, as defined by these columns.
For example, the table looks like:
index  start_date    end_date
0      '2015-01-01'  '2015-01-17'
1      '2015-01-03'  '2015-01-12'
And the result would be a per date aggregate, like:
date          count
'2015-01-01'  1
'2015-01-02'  1
'2015-01-03'  2
and so on.
My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.
Currently I'm doing:
date = pd.date_range(min(df.start_date), max(df.end_date))
df2 = pd.DataFrame(index=date)
df2['count'] = 0
for index, row in df.iterrows():
    dates = pd.date_range(row['start_date'], row['end_date'])
    for date in dates:
        df2.loc[date, 'count'] += 1
After stacking the relevant columns as suggested by @Sam, just use value_counts.
df[['start_date', 'end_date']].stack().value_counts()
EDIT:
Given that you also want to count the dates between the start and end dates:
start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)
>>> pd.Series(dt.date() for group in
...               [pd.date_range(start, end) for start, end in zip(start_dates, end_dates)]
...           for dt in group).value_counts()
Out[178]:
2015-01-07 2
2015-01-06 2
2015-01-12 2
2015-01-05 2
2015-01-04 2
2015-01-10 2
2015-01-03 2
2015-01-09 2
2015-01-08 2
2015-01-11 2
2015-01-16 1
2015-01-17 1
2015-01-14 1
2015-01-15 1
2015-01-02 1
2015-01-01 1
2015-01-13 1
dtype: int64
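If you want that Series in the date / count shape the question shows, a small follow-up sketch (assuming the value_counts output above has been stored in a variable, here hypothetically named counts):
counts.sort_index().rename_axis('date').reset_index(name='count')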
I think the solution here is to 'stack' your two date columns, group by the date, and do a count. Play around with the df.stack() function. Here is something I threw together that yields a good solution:
import datetime
import pandas as pd
df = pd.DataFrame({'Start': [datetime.date(2016, 5, i) for i in range(1, 30)],
                   'End': [datetime.date(2016, 5, i) for i in range(1, 30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()
I would use melt() method for that:
In [76]: df
Out[76]:
start_date end_date
index
0 2015-01-01 2015-01-17
1 2015-01-03 2015-01-12
2 2015-01-03 2015-01-17
In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01 1
2015-01-03 2
2015-01-12 1
2015-01-17 2
dtype: int64
Related
I have the following time series df in Pandas:
date value
2015-01-01 0
2015-01-02 0
2015-01-03 0
2015-01-04 3
2015-01-05 0
2015-01-06 4
2015-01-07 0
I would like to remove the leading and trailing zeroes, so as to have the following df:
date value
2015-01-04 3
2015-01-05 0
2015-01-06 4
Simply dropping rows with 0s in them would lead to deleting the 0s in the middle as well, which I don't want.
I thought of writing a forward loop that starts from the first row and continues until the first non-zero value, and a second backwards loop that goes back from the end and stops at the last non-zero value. But that seems like overkill; is there a more efficient way of doing this?
General solution (it returns an empty DataFrame if all values are 0): build a mask of non-zero values, take its cumulative sum both forwards and backwards (the reversed direction via [::-1]), test each cumulative sum for not being 0, chain the two with bitwise AND, and filter with boolean indexing:
s = df['value'].ne(0)
df = df[s.cumsum().ne(0) & s[::-1].cumsum().ne(0)]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
If there is always at least one non-zero value, you can convert the 0s to missing values and use DataFrame.loc with Series.first_valid_index and Series.last_valid_index:
s = df['value'].mask(df['value'] == 0)
df = df.loc[s.first_valid_index():s.last_valid_index()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
Another idea is to use Series.idxmax or Series.idxmin:
s = df['value'].eq(0)
df = df.loc[s.idxmin():s[::-1].idxmin()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
s = df['value'].ne(0)
df = df.loc[s.idxmax():s[::-1].idxmax()]
You can get a list of the indexes where value is greater than 0, and then find the min.
data = [
    ['2015-01-01', 0],
    ['2015-01-02', 0],
    ['2015-01-03', 0],
    ['2015-01-04', 3],
    ['2015-01-05', 0],
    ['2015-01-06', 4]
]
df = pd.DataFrame(data, columns=['date', 'value'])
print(min(df.index[df['value'] > 0].tolist()))
# 3
Then filter the main df like this:
df.iloc[3:]
Or even better:
df.iloc[min(df.index[df['value'] > 0].tolist()):]
And you get:
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
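The same idea extends to trailing zeros by also taking the max of those indexes; a minimal sketch (assuming the default RangeIndex of the example frame above, so positions and labels coincide):
nonzero = df.index[df['value'] > 0].tolist()
df.iloc[min(nonzero):max(nonzero) + 1]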
I have the following time series dataset of the number of sales happening for a day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now if I have to include the missing data points here (i.e. the missing dates) with a constant value (zero) and want to make it look the following way, how can I do this efficiently (assuming the data frame is ~50MB) using Pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
**Missing rows which have been added to the data frame.
Any help will be appreciated.
You can first cast the date column with to_datetime, then set_index and reindex by a date_range spanning the min and max of the index, then reset_index and, if necessary, change the format back with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format back:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
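A shorter variant of the same reindex idea, just as a sketch (assuming the date column has already been parsed with to_datetime as above and your pandas is recent enough for asfreq to accept fill_value):
df = df.set_index('date').asfreq('D', fill_value=0).reset_index()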
I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of the date column in df, and then concat with sort_values; to align the columns, the split column needs to be renamed:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0
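If the duplicated index labels in the result above bother you, a small follow-up (just a sketch) is to reset the index after sorting:
df = (pd.concat([df, sf.rename(columns={'split':'date'})])
        .sort_values('date')
        .reset_index(drop=True))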
I had no success looking for answers to this question in the forum since it is hard to put it into keywords. Any keyword suggestions are appreciated so that I can make this question more accessible and others can benefit from it.
The closest question I found doesn't really answer mine.
My problem is the following:
I have one DataFrame that I called ref, and a dates list called pub. ref has dates for indexes but those dates are different (there will be a few matching values) from the dates in pub. I want to create a new DataFrame that contains all the dates from pub but fill it with the "last available data" from ref.
Thus, say ref is:
Dat col1 col2
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
And pub
2015-01-01
2015-01-04
2015-01-06
I'd like to create a DataFrame like:
Dat col1 col2
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
For this matter performance is an issue, so I'm looking for the fastest (or at least a fast) way of doing that.
Thanks in advance.
You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.
import datetime as dt
dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])
>>> (ref
...     .merge(pub, on='Dat', how='outer')
...     .set_index('Dat')
...     .sort_index()
...     .ffill()
...     .reindex(pub.Dat))
col1 col2
Dat
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
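As an alternative for this kind of 'last available value' alignment, newer pandas (0.19+) has pd.merge_asof, which does the backward-looking match directly; a sketch, assuming Dat is an ordinary datetime64 column in both frames (reset_index on ref first if the dates are its index) and both are sorted ascending:
merged = pd.merge_asof(pub.sort_values('Dat'), ref.sort_values('Dat'), on='Dat')  # assumes both Dat columns share a datetime64 dtype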
Use np.searchsorted to find the index just after each date (the 'right' option is needed to handle equality properly):
In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']
In [28]: df
Out[28]:
col1 col2
Dat
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
In [29]: y = np.searchsorted(list(df.index), pub, 'right')
# array([1, 2, 3], dtype=int64)
Then just rebuild:
In [30]: pd.DataFrame(df.iloc[y - 1].values, index=pub)
Out[30]:
0 1
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
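The searchsorted step is essentially what DataFrame.asof does built in (it returns, for each requested label, the last row whose index is at or before it); a sketch, assuming the Dat index holds datetimes and is sorted:
df.asof(pd.to_datetime(pub))  # assumes a sorted DatetimeIndex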
I have data for a number of events with start and end times like this:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
Out:
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
Now I need to calculate the number of events active at the same time and, e.g., the sum of their values. So the result should look something like this:
date count sum
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-08 0 0
2015-01-09 0 0
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
Any ideas for how to do this? I was thinking about using a custom Grouper for groupby, but as far as I can see a Grouper can only assign a row to a single group so that doesn't look useful.
EDIT: After some testing I found this rather ugly way to get the desired result:
df['count'] = 1
dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
start = df[['start', 'value', 'count']].set_index('start').reindex(dates)
end = df[['end', 'value', 'count']].set_index('end').reindex(dates).shift(1)
rstart = pd.rolling_sum(start, len(start), min_periods=1)
rend = pd.rolling_sum(end, len(end), min_periods=1)
rstart.subtract(rend, fill_value=0).fillna(0)
However, this only works with sums, and I can't see an obvious way to make it work with other functions. For example, is there a way to get it to work with median instead of sum?
If I were using SQL, I would do this by joining an all-dates table to the events table, and then grouping by date. Pandas doesn't make this approach especially easy, since there's no way to left-join on a condition, but we can fake it using dummy columns and reindexing:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df['dummy'] = 1
Then:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
cross_join = date_df.merge(df, on='dummy')
cond_join = cross_join[(cross_join.start <= cross_join.date) & (cross_join.date <= cross_join.end)]
grp_join = cond_join.groupby(['date'])
final = (
pd.DataFrame(dict(
val_count=grp_join.size(),
val_sum=grp_join.value.sum(),
val_median=grp_join.value.median()
), index=date_series)
.fillna(0)
.reset_index()
)
The fillna(0) isn't perfect, since it makes nulls in the val_median column into 0s, when they should really remain nulls.
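One way around that (just a sketch) is to drop the blanket .fillna(0) from the chain above and fill only the count and sum columns afterwards, leaving val_median as NaN where no events are active:
final[['val_count', 'val_sum']] = final[['val_count', 'val_sum']].fillna(0)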
Alternatively, with pandas-ply we can code that up as:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
final = (
date_df
.merge(df, on='dummy')
.ply_where(X.start <= X.date, X.date <= X.end)
.groupby('date')
.ply_select(val_count=X.size(), val_sum=X.value.sum(), median=X.value.median())
.reindex(date_series)
.ply_select('*', val_count=X.val_count.fillna(0), val_sum=X.val_sum.fillna(0))
.reset_index()
)
which handles nulls a bit better.
This is what I came up with. I've got to think there's a better way.
Given your frame
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
and then
dList = []
vList = []
d = {}
def buildDict(row):
    for x in pd.date_range(row["start"], row["end"]):  # build a date range for each row
        dList.append(x)             # date list
        vList.append(row["value"])  # value list
df.apply(buildDict, axis=1)  # each row in df is passed to buildDict
# this d will be used to create our new frame
d["date"] = dList
d["value"] = vList
# from here you can use whatever agg functions you want
pd.DataFrame(d).groupby("date").agg(["count", "sum"])
yields
value
count sum
date
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
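Since the question also asks about other aggregations such as the median, the same grouped frame takes any aggregation function; a sketch:
pd.DataFrame(d).groupby("date")["value"].agg(["count", "sum", "median"])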
You can avoid the cross join by exploding the dates, then imputing the missing rows with complete from pyjanitor before aggregating on the dates:
# pip install pyjanitor
import pandas as pd
import janitor
(df.assign(dates=[pd.date_range(start, end, freq='1D')
                  for start, end in zip(df.start, df.end)])
   .explode('dates')
   .loc[:, ['value', 'dates']]
   .complete({'dates': lambda df: pd.date_range(df.min(), df.max(), freq='1D')})
   .groupby('dates')
   .agg(['size', 'sum'])
   .droplevel(level=0, axis='columns')
)
size sum
dates
2015-01-05 1 3.0
2015-01-06 1 3.0
2015-01-07 1 3.0
2015-01-08 1 0.0
2015-01-09 1 0.0
2015-01-10 1 4.0
2015-01-11 2 9.0
2015-01-12 2 9.0
2015-01-13 2 9.0
2015-01-14 1 4.0
2015-01-15 1 4.0
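If you'd rather not add the pyjanitor dependency, a plain-pandas sketch of the same explode idea (assuming pandas 0.25+ for explode) gets a similar result by reindexing the grouped output over the full date range:
exploded = (df.assign(dates=[pd.date_range(s, e, freq='1D')
                             for s, e in zip(df.start, df.end)])
              .explode('dates'))
exploded['dates'] = pd.to_datetime(exploded['dates'])  # explode leaves an object column; make it datetime64
full_range = pd.date_range(exploded['dates'].min(), exploded['dates'].max(), freq='1D')
exploded.groupby('dates')['value'].agg(['size', 'sum']).reindex(full_range, fill_value=0)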