This is the closest to what I'm looking for that I've found.
Let's say my dataframe looks something like this:
d = {'item_number': ['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID': ['998798098','988797387','12398787','998798098','988797387'],
     'date': ['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
df = pd.DataFrame(data=d)
I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.
I imagine this will look something along the lines of:
g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())
However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.
for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i,'date'],'%Y-%m-%d').day
Apparently location based indexing only allows for integers? How could I solve this problem?
However, I would need to extract the day from each date as an int
before comparing, which I'm also having trouble with.
Why?
To fix your code, you need:
consecutive['date'] = pd.to_datetime(consecutive['date'])
g = consecutive.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))
Note the following:
The code above avoids repetition. That is a basic programming principle: Don't Repeat Yourself (DRY).
It converts 1 to timedelta for proper comparison.
It takes the absolute difference.
Tip: write a top-level function for your work instead of a lambda, as it gives better readability, brevity, and aesthetics:
def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

g['date'].apply(differencer, day_dif=1)
Explanation:
It is pretty straightforward. The dates are converted to Timestamp type, then subtracted. The difference is a timedelta, which needs to be compared with another timedelta object, hence the conversion of 1 (or day_dif) to a timedelta. The result of that comparison is a Boolean Series. Booleans are represented by 0 for False and 1 for True, so the sum of a Boolean Series returns the total number of True values in the Series.
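Putting the pieces together on the question's data (a minimal end-to-end sketch; the counts land in a Series indexed by Comp_ID and item_number, and sorting by date first guards against out-of-order rows):

import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}
df = pd.DataFrame(data=d)

df['date'] = pd.to_datetime(df['date'])   # Timestamps instead of strings
df = df.sort_values('date')               # keep dates ascending within each group
g = df.groupby(['Comp_ID', 'item_number'])

# 1 for each pair of rows exactly one day apart, 0 for the lone DF900A row
consecutive_counts = g['date'].apply(
    lambda x: sum(abs(x.shift(-1) - x) == pd.to_timedelta(1, unit='D')))
print(consecutive_counts)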
One solution would be to use pivot tables to count the number of times a Comp_ID and an item_number were observed on consecutive days.
import pandas as pd
d = {'item_number': ['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID': ['998798098','988797387','12398787','998798098','988797387'],
     'date': ['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
df = pd.DataFrame(data=d).sort_values(['item_number','Comp_ID'])
df['date'] = pd.to_datetime(df['date'])
# Difference between each row's date and the previous row's date
df['delta'] = df['date'] - df['date'].shift(1)
# Keep only rows exactly one day after the previous row for the same pair,
# then count them per (item_number, Comp_ID)
df = df[(df['delta'] == pd.Timedelta(days=1)) &
        (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
        (df['item_number'] == df['item_number'].shift(1))].pivot_table(
            index=['item_number','Comp_ID'], values=['date'],
            aggfunc='count').reset_index()
df.rename(columns={'date': 'consecutive_days'}, inplace=True)
Results in
item_number Comp_ID consecutive_days
0 AKD098008 988797387 1
1 K208UL 998798098 1
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A'] (at current time) / df['A'] (at current time - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index in order to use get_loc, and I create a new dataframe new_df starting 1 minute after the start of df. This way I can be sure the data from 1 minute earlier exists even for the earliest rows of new_df.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta] # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
A simple example:
import pandas as pd
import numpy as np
n_vals = 50
# Create a DataFrame with random values and 'unusual times'
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state' at
# the time 1 min since the index. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
# Note that there will be some NaNs to deal with; consider .fillna()
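Applied to the original question, the same idea gives column B without a loop (a sketch, assuming df is indexed by its timestamps and sorted; note that .asof() picks the most recent value at or before the requested time, not the nearest one as get_loc(method='nearest') did):

df = df.sort_index()
# Value of A as of one minute before each row's timestamp; .values strips the
# shifted index so the division aligns by position.
one_min_ago = df['A'].asof(df.index - pd.Timedelta('1min')).values
df['B'] = df['A'] / one_min_ago
# Rows within the first minute have no earlier value and come out as NaN.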
I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time-series dataframe with weekly summaries (or daily, or based on a custom date_range object) of quantities A and B.
I can generate week number and aggregate sales based on those, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want, as it starts at the earliest datetime correct to the nanosecond. You could replace it with datetimes whose time component is 00:00:00 by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect but passing a multiple like '2D' results in the 'ugly' index, that is, starting at the earliest correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is stick to singles like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
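Putting it together for the question's frame (a sketch; selecting only the quantity columns and relabelling the bins by their week start are my additions, not part of the original answer):

import numpy as np
import pandas as pd

# Weekly sums and counts of the two quantity columns (sell_date is the index)
weekly = df[['quantityA', 'quantityB']].resample('W').agg([np.sum, len])

# 'W' labels each bin by its end date (a Sunday); to label by the week start
# instead, shift the index back six days.
weekly.index = weekly.index - pd.Timedelta(days=6)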
I have rows representing ranges (from->to). Here is a subset of the data.
import pandas as pd
from pandas import DataFrame

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
from to
0 2015-08-24 2015-08-26
1 2015-08-24 2015-08-31
I want to count the number of business days for each day in the ranges. Here is my code.
# Creating a business day index by taking the min and max values from the ranges
b_range = pd.bdate_range(start=min(df['from']), end=max(df['to']))
# Init of a new DataFrame with this index and the count at 0
result = DataFrame(0, index=b_range, columns=['count'])
# Iterating over the range to select the index in the result and update the count column
for index, row in df.iterrows():
    result.loc[pd.bdate_range(row['from'],row['to']),'count'] += 1
print(result)
count
2015-08-24 2
2015-08-25 2
2015-08-26 2
2015-08-27 1
2015-08-28 1
2015-08-31 1
It works, but does anyone know a more pythonic way of doing that (i.e. without the for loop) ?
Caveat: I sort of hate this answer, but on this tiny dataframe it is over 2x faster, so I'll throw it out there as a workable, if not elegant, alternative.
# One column holding the full business-day range of each row
df2 = df.apply(lambda x: [pd.bdate_range(x['from'], x['to'])], axis=1)
# Stack all the dates together and count occurrences of each unique date
arr = np.unique(np.hstack(df2.values), return_counts=True)
result = pd.DataFrame(arr[1], index=arr[0])
Basically all I'm doing here is making a column with all the dates in it and then using numpy's unique (the analog of pandas value_counts) to add everything up. I was hoping to come up with something more elegant and readable, but this is what I have at the moment.
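On the sample data this should reproduce the same counts as the loop version; the only cosmetic difference is the unnamed 0 column, which can be renamed (a small addition, not part of the original answer):

result.columns = ['count']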
Here is a method that uses cumsum(). It should be faster than the for loop if you have a lot of ranges:
import pandas as pd
df = pd.DataFrame({
'from': ['2015-08-24','2015-08-24'],
'to': ['2015-08-26','2015-08-31']})
df = df.apply(pd.to_datetime)
from_date = min(df['from'])
to_date = max(df['to'])
b_range = pd.bdate_range(start=from_date, end=to_date)
d_range = pd.date_range(start=from_date, end=to_date)
# Daily series of zeros spanning the whole period
s = pd.Series(0, index=d_range)
# +1 on each range's start date...
from_count = df["from"].value_counts()
# ...and -1 on the day after each range's end date
to_count = df["to"].value_counts()
# Running total = number of ranges active on each day; finally restrict to business days
s.add(from_count, fill_value=0).sub(to_count.shift(freq="D"), fill_value=0).cumsum().reindex(b_range)
I was not completely satisfied with these solutions, so I kept searching, and I think I found a rather elegant and fast one.
It's inspired by the section "Pivoting 'long' to 'wide' format" explained in the Wes McKinney book: Python for Data Analysis.
I have put a lot of comments in my code but I think it's preferable to print out each step to understand it.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.tseries.offsets import Day

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
# Convert boundaries to datetime
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d')
df['to'] = pd.to_datetime(df['to'], format='%Y-%m-%d')
# Resetting the index to create a row id named 'index'
df = df.reset_index(level=0)
# Pivoting the data to obtain 'from' as row index and the row id ('index') as columns,
# each cell containing the 'to' date.
# As a consequence, each range (from-to pair) gets its own column.
pivoted = df.pivot(index='from', columns='index', values='to')
# Reindexing the table with a range of business dates (i.e. working days)
pivoted = pivoted.reindex(index=pd.bdate_range(start=min(df['from']),
end=max(df['to'])))
# Filling the NA values forward to copy the to date
# now each row of each column contains the corresponding to date
pivoted = pivoted.fillna(method='ffill')
# Computing ('to' + 1 day) - index for each cell and converting the result to days,
# i.e. the number of days between the date in the index and the 'to' date
# Note: one day is added to include the right side of the interval
pivoted = pivoted.apply(lambda x: (x + Day() - x.index) / np.timedelta64(1, 'D'),
                        axis=0)
# Clipping values below 0 (dates past the range) to 0
# and values above 1 to 1 (at most one per day and per id)
pivoted = pivoted.clip(lower=0, upper=1)
# Summing along the columns and that's it
pivoted.sum(axis=1)
I have a DataFrame, and I want the difference between the maximum and the second maximum in each row appended to the DataFrame as a new column.
The data frame looks like this for example (this is quite a huge DataFrame):
gene_id Time_1 Time_2 Time_3
a 0.01489251 8.00246 8.164309
b 6.67943235 0.8832114 1.048761
So far I have tried the following, but it's only operating on the column headers,
largest = max(df)
second_largest = max(item for item in df if item < largest)
and returning the header value alone.
You can define a func which takes the values, sorts them in descending order, slices the top 2 values ([:2]), calculates the difference, and returns the second value (the first is NaN). Apply this with axis=1 to work row-wise:
In [195]:
def func(x):
    return -x.sort_values(ascending=False)[:2].diff()[1]
df['diff'] = df.loc[:,'Time_1':].apply(func, axis=1)
df
Out[195]:
gene_id Time_1 Time_2 Time_3 diff
0 a 0.014893 8.002460 8.164309 0.161849
1 b 6.679432 0.883211 1.048761 5.630671
Here is my solution:
import numpy as np
import pandas as pd

# Load data
data = {'a': [0.01489251, 8.00246, 8.164309], 'b': [6.67943235, 0.8832114, 1.048761]}
df = pd.DataFrame.from_dict(data, 'index')
The trick is to do a linear-time partial sort of the values and keep the indices of the top 2 using numpy.argpartition.
You then take the absolute difference of the 2 maximum values. The function is applied row-wise.
def f(x):
    ind = np.argpartition(x.values, -2)[-2:]
    return np.abs(x.iloc[ind[0]] - x.iloc[ind[1]])

df.apply(f, axis=1)
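For reference, np.argpartition only guarantees that the last two positions hold the indices of the two largest values, in no particular order, which is why the absolute difference is taken; a quick illustration on one row of the data:

import numpy as np

a = np.array([0.01489251, 8.00246, 8.164309])
np.argpartition(a, -2)[-2:]   # indices of the two largest values, e.g. array([1, 2])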
Here's an elegant solution that doesn't involve sorting or defining any functions. It's also fully vectorized as it avoids use of the apply method.
maxes = df.max(axis=1)
less_than_max = df.where(df.lt(maxes, axis='rows'))
seconds = less_than_max.max(axis=1)
df['diff'] = maxes - seconds
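With the question's frame, which also carries the non-numeric gene_id column, it may be safer to restrict the computation to the time columns first (a sketch; the 'Time_1':'Time_3' slice is my assumption based on the layout shown above):

times = df.loc[:, 'Time_1':'Time_3']
maxes = times.max(axis=1)
seconds = times.where(times.lt(maxes, axis='index')).max(axis=1)
df['diff'] = maxes - seconds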
Consider a CSV file:
string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
I can read this in, and reformat the date column into datetime format:
b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?
What I am currently trying is re-indexing by the date:
b.index = b['date']
I can access the month like so:
b.index.month
However I can't seem to find a function to lump together by month.
Managed to do it:
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])
Or
b.groupby(pd.Grouper(freq='M')) # update for v0.21+
(update: 2018)
Note that pd.TimeGrouper is deprecated and will be removed. Use instead:
df.groupby(pd.Grouper(freq='M'))
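The groupby on its own only defines the groups; an aggregation still has to be applied. For example, to sum the numeric column per month (a sketch using the question's b):

b.groupby([b.index.year, b.index.month])['number'].sum()
# or, with the Grouper-based form (requires the DatetimeIndex set above):
b.groupby(pd.Grouper(freq='M'))['number'].sum()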
To group time-series data you can use the method resample. For example, to group by month:
df.resample(rule='M', on='date')['Values'].sum()
The list of offset aliases can be found in the pandas documentation.
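Applied to the question's frame (a sketch, assuming b as read in above with its date column already parsed):

b.resample(rule='M', on='date')['number'].sum()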
One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.
Normalise day of month
df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
'Values': [5, 10, 15, 20]})
# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)
Then use groupby as normal:
g = df.groupby('YearMonth')
res = g['Values'].sum()
# YearMonth
# 2017-09-01 20
# 2017-10-01 30
# Name: Values, dtype: int64
Comparison with pd.Grouper
The subtle benefit of this solution is that, unlike with pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:
some_group = g.get_group('2017-10-01')
Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.
Comparison with string conversion
An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).
A slightly alternative solution to @jpp's, but outputting a YearMonth string:
df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month:02d}'.format(year=x.year, month=x.month))
res = df.groupby('YearMonth')['Values'].sum()