I'm trying to transform my dataframe based on certain conditions. Following is my input dataframe
In [11]: df
Out[11]:
DocumentNumber I_Date N_Date P_Date Amount
0 1234 2016-01-01 2017-01-01 2017-10-23 38.38
1 2345 2016-01-02 2017-01-02 2018-03-26 41.00
2 1324 2016-01-12 2017-01-03 2018-03-26 30.37
3 5421 2016-01-13 2017-01-02 2018-03-06 269.00
4 5532 2016-01-15 2017-01-04 2018-06-30 271.00
Desired solution:
Each row is a unique document, and my aim is to find the number of documents and their total amount that meet the conditions shown in the code below, for each combination of date and delta.
I am able to get my desired result with a for-loop, but I know it is not ideal and it gets slower as my data grows. Since I am new to Python, I need help replacing the loop with a list comprehension or any other faster option.
Code:
import datetime
import pandas as pd

d1 = datetime.date(2017, 1, 1)
d2 = datetime.date(2017, 1, 15)
mydates = pd.date_range(d1, d2).tolist()
Delta = pd.Series(range(0,5)).tolist()
df_A = []
for i in mydates:
    for j in Delta:
        A = df[(df["I_Date"]<i) & (df["N_Date"]>i+j) & (df["P_Date"]>i)]
        A["DateCutoff"] = i
        A["Delta"] = j
        A = A.groupby(['DateCutoff','Delta'],as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_A.append(A)
df_A = pd.concat(df_A, sort = False)
Output:
In [14]: df_A
Out[14]:
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
I don't see a way to remove the loop from your code, because the loop is creating individual dataframes based on the contents of mydates and Delta.
In this example you are creating 75 different dataframes (15 dates × 5 deltas).
On each dataframe you .groupby, then .agg the sum of payments and the count of document numbers.
Each dataframe is appended to a list.
pd.concat then combines the complete list into a single dataframe.
One significant improvement
Check the Boolean condition before creating the dataframe and performing the remaining operations. In this example, operations were performed on 69 empty dataframes. By checking the condition first, operations will only be performed on the 6 dataframes containing data.
condition.any() returns True as long as at least one element is True
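As a minimal sketch of that guard, using one hypothetical date/delta pair (the complete version is in the updated code below):
i, j = pd.Timestamp('2017-01-01'), pd.Timedelta(days=1)  # illustrative values only
condition = (df["I_Date"] < i) & (df["N_Date"] > i + j) & (df["P_Date"] > i)
if condition.any():  # skip the groupby/agg work when no row matches
    A = df[condition].copy()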
Minor changes
datetime + int is deprecated, so change that to datetime + timedelta(days=x)
pd.Series(range(0,5)).tolist() is overkill for making a list. Now timedelta objects are needed, so use [timedelta(days=x) for x in range(5)]
Instead of iterating with two for-loops, use itertools.product on mydates and Delta. This yields tuples of the form (Timestamp('2017-01-01 00:00:00', freq='D'), datetime.timedelta(0))
Use .copy() when creating dataframe A, to prevent SettingWithCopyWarning
Note:
List comprehensions were mentioned in the question. They are just a Pythonic way of writing a for-loop and don't necessarily improve performance.
All of the calculations use pandas methods, not for-loops; the for-loop only creates each dataframe from the condition.
Updated Code:
from itertools import product
import pandas as pd
from datetime import date, timedelta
d1 = date(2017, 1, 1)
d2 = date(2017, 1, 15)
mydates = pd.date_range(d1, d2)
Delta = [timedelta(days=x) for x in range(5)]
df_list = list()
for t in product(mydates, Delta):
    condition = (df["I_Date"]<t[0]) & (df["N_Date"]>t[0]+t[1]) & (df["P_Date"]>t[0])
    if condition.any():
        A = df[condition].copy()
        A["DateCutoff"] = t[0]
        A["Delta"] = t[1]
        A = A.groupby(['DateCutoff','Delta'],as_index=False).agg({'Amount':'sum','DocumentNumber':'count'})
        A.columns = ['DateCutoff','Delta','A_PaymentAmount','A_DocumentNumber']
        df_list.append(A)
df_CutOff = pd.concat(df_list, sort = False)
Output
The same as the original
DateCutoff Delta A_PaymentAmount A_DocumentNumber
0 2017-01-01 0 611.37 4
0 2017-01-01 1 301.37 2
0 2017-01-01 2 271.00 1
0 2017-01-02 0 301.37 2
0 2017-01-02 1 271.00 1
0 2017-01-03 0 271.00 1
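As an optional further simplification (a sketch building on the updated code above, not part of the original answer): since each filtered frame holds exactly one DateCutoff/Delta pair, the groupby/agg step can be skipped by collecting plain summary rows and constructing a single dataframe at the end.
rows = []
for cutoff, delta in product(mydates, Delta):
    condition = (df["I_Date"] < cutoff) & (df["N_Date"] > cutoff + delta) & (df["P_Date"] > cutoff)
    if condition.any():
        sub = df[condition]
        rows.append({'DateCutoff': cutoff, 'Delta': delta,
                     'A_PaymentAmount': sub['Amount'].sum(),
                     'A_DocumentNumber': sub['DocumentNumber'].count()})
df_CutOff = pd.DataFrame(rows)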
Related
Let's assume a dataframe using datetimes as the index, with a column named 'score', initially set to 10:
score
2016-01-01 10
2016-01-02 10
2016-01-03 10
2016-01-04 10
2016-01-05 10
2016-01-06 10
2016-01-07 10
2016-01-08 10
I want to subtract a fixed value (let's say 1) from the score, but only when the index is between certain dates (for example between the 3rd and the 6th):
score
2016-01-01 10
2016-01-02 10
2016-01-03 9
2016-01-04 9
2016-01-05 9
2016-01-06 9
2016-01-07 10
2016-01-08 10
Since my real dataframe is big, and I will be doing this for different date ranges with a different fixed value N for each one, I'd like to achieve this without having to create a new column set to -N for each case.
Something like numpy's where function, but for a certain range, that lets me add to or subtract from the current value when the condition is met and do nothing otherwise. Is there something like that?
Use index slicing:
df.loc['2016-01-03':'2016-01-06', 'score'] -= 1
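Since you mention different date ranges with a different N for each, you can simply repeat the sliced assignment per range; no helper column is needed (the second range and value below are made up purely for illustration). This relies on the DatetimeIndex being sorted:
df.loc['2016-01-03':'2016-01-06', 'score'] -= 1
df.loc['2016-02-10':'2016-02-14', 'score'] -= 3  # hypothetical second range with N = 3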
Assuming the dates have datetime dtype (note that selecting on .day matches the day of the month, so this only works while the range stays within a single month):
# if the date is its own column:
df.loc[df['date'].dt.day.between(3, 6), 'score'] = df['score'] - 1
# if the date is the index:
df.loc[df.index.day.isin(range(3, 7)), 'score'] = df['score'] - 1
I would do something like that using query:
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": np.random.randint(1, 10, 100)},
                  index=pd.date_range(start="2018-01-01", periods=100))
start = "2018-01-05"
stop = "2018-04-08"
df.query('@start <= index <= @stop') - 1
Edit: note that eval, which evaluates to a boolean mask, can also be used, but in a different manner, because pandas where acts on the values where the condition is False.
df.where(~df.eval('@start <= index <= @stop'),
         df['score'] - 1, axis=0, inplace=True)
See how I inverted the boolean mask (with ~) in order to get what I wanted. It's efficient but not really clear. Of course, you can also use np.where and all is good in the world.
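For completeness, a small sketch of the np.where route mentioned above, reusing the same start and stop variables:
mask = (df.index >= start) & (df.index <= stop)
df['score'] = np.where(mask, df['score'] - 1, df['score'])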
I have the following dataframe in pandas:
code bucket
0 08:30:00-9:00:00
1 10:00:00-11:00:00
2 12:00:00-13:00:00
I want to replace the character at string position 7 (the final seconds digit, a 0) with 1; my desired dataframe is
code bucket
0 08:30:01-9:00:00
1 10:00:01-11:00:00
2 12:00:01-13:00:00
How to do it in pandas?
Use indexing with str:
df['bucket'] = df['bucket'].str[:7] + '1' + df['bucket'].str[8:]
Or list comprehension:
df['bucket'] = [x[:7] + '1' + x[8:] for x in df['bucket']]
print (df)
code bucket
0 0 08:30:01-9:00:00
1 1 10:00:01-11:00:00
2 2 12:00:01-13:00:00
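If you prefer to stay with string methods, pandas also offers str.slice_replace, which should give the same result in a single call (an alternative sketch, not part of the answer above):
df['bucket'] = df['bucket'].str.slice_replace(7, 8, '1')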
Avoid string operations where possible
You lose a considerable amount of functionality by working with strings only. While this may be a one-off operation, you will find that repeated string manipulations will quickly become expensive in terms of time and memory efficiency.
Use pd.to_datetime instead
You can add additional series to your dataframe with datetime objects. Below is an example which, in addition, creates an object dtype series in the format you desire.
# split by '-' into 2 series
dfs = df.pop('bucket').str.split('-', expand=True)
# convert to datetime
dfs = dfs.apply(pd.to_datetime, axis=1)
# add 1s to first series
dfs[0] = dfs[0] + pd.Timedelta(seconds=1)
# create object series from 2 times
form = '%H:%M:%S'
dfs[2] = dfs[0].dt.strftime(form) + '-' + dfs[1].dt.strftime(form)
# join to original dataframe
res = df.join(dfs)
print(res)
code 0 1 2
0 0 2018-10-02 08:30:01 2018-10-02 09:00:00 08:30:01-09:00:00
1 1 2018-10-02 10:00:01 2018-10-02 11:00:00 10:00:01-11:00:00
2 2 2018-10-02 12:00:01 2018-10-02 13:00:00 12:00:01-13:00:00
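As a small illustration of the functionality you gain by keeping real datetimes, you could now, for example, compute each bucket's duration directly (an illustrative follow-up, not part of the original example):
res['duration'] = dfs[1] - dfs[0]  # Timedelta column, e.g. 0 days 00:29:59 for the first row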
I have a dataframe with a consecutive index (date for every calendar day) and a reference vector that does not contain every date (only working days).
I want to reindex the dataframe to only the dates in the reference vector with the missing data being aggregated to the latest entry before a missing-date-section (i.e. weekend data shall be aggregated together to the last Friday).
Currently I have implemented this by looping over the reversed index and collecting the weekend data, then adding it later in the loop. I'm asking if there is a more efficient "array-way" to do it.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
index=pd.date_range(start="2018-01-01", periods=10))
print(df)
ref_dates = pd.date_range(start="2018-01-01", periods=10)
ref_dates = ref_dates[:5].append(ref_dates[7:]) # omit 2018-01-06 and -07
# inefficient approach by reverse-traversing the dates, collecting the data
# and aggregating it together with the first date that's in ref_dates
df.sort_index(ascending=False, inplace=True)
collector = []
for dt in df.index:
    if collector and dt in ref_dates:
        # data from previous iteration was collected -> aggregate it and reset collector
        # first append also the current data
        collector.append(df.loc[dt, :].values)
        collector = np.array(collector)
        # applying aggregation function, here sum as example
        aggregates = np.sum(collector, axis=0)
        # setting the new data
        df.loc[dt, :] = aggregates
        # reset collector
        collector = []
    if dt not in ref_dates:
        collector.append(df.loc[dt, :].values)
df = df.reindex(ref_dates)
print(df)
Gives the output (first: source dataframe, second: target dataframe)
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 4 16
2018-01-06 5 25
2018-01-07 6 36
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81
x y
2018-01-01 0 0
2018-01-02 1 1
2018-01-03 2 4
2018-01-04 3 9
2018-01-05 15 77 # contains the sum of Jan 5th, 6th and 7th
2018-01-08 7 49
2018-01-09 8 64
2018-01-10 9 81
Still has a list comprehension loop, but works.
import pandas as pd
import numpy as np
# Create dataframe which contains all days
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10)**2},
index=pd.date_range(start="2018-01-01", periods=10))
# create the list of reference dates - here only week-days, or whatever dates you need
ref_dates = [x for x in df.index if x.weekday() < 5]
# Set the index of df to a forward filled version of the ref days
df.index = pd.Series([x if x in ref_dates else float('nan') for x in df.index]).fillna(method='ffill')
# Group by unique dates and sum
df = df.groupby(level=0).sum()
print(df)
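If you want to drop the Python-level loops entirely, here is a possible loop-free sketch applied to the original daily dataframe. It assumes ref_dates is sorted and includes the first index date, and maps every date to the latest reference date at or before it:
ref_idx = pd.DatetimeIndex(ref_dates)
# position of the latest reference date at or before each date
pos = ref_idx.searchsorted(df.index, side='right') - 1
df_agg = df.groupby(ref_idx[pos]).sum()
print(df_agg)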
I have a highly sparse dataframe (only one non-zero value per row) indexed by non-regular Timestamps for which I am trying to do the following.
For each non-zero value in a given column, I want to count the numbers of other non-zero values in other columns within a given timedelta. In a way, I am trying to compute something similar to a rolling cross_tab.
My solution so far is ugly and slow as I haven't figured out how to do this using slicing and rolling. It looks something like:
delta = 1
values = pd.DataFrame(0,index= df.columns,columns= df.columns)
for j in df.columns:
    for i in range(len(df[df[j]!=0].index)-1):
        # min is used to avoid overlapping
        values[j] += df[(df.index<min((df[df[j]!=0].index + pd.tseries.timedeltas.to_timedelta(delta, unit='h'))[i],df[df[j]!=0].index[i+1]))&(df.index>=df[df[j]!=0].index[i])].astype(bool).sum()
values = values.T
and a toy-example dataframe is:
df = pd.DataFrame.from_dict({"2016-01-01 10:00.00":[0,1],
"2016-01-01 10:30.00":[1,0],
"2016-01-01 12:00.00":[0,1],
"2016-01-01 14:00.00":[1,0]},
orient="index")
df.columns=['a','b']
df.index = pd.to_datetime(df.index)
a b
2016-01-01 10:00:00 0 1
2016-01-01 10:30:00 1 0
2016-01-01 12:00:00 0 1
2016-01-01 14:00:00 1 0
The desired output should look like (with the counts depending on the timedelta):
a b
a 1 0
b 1 1
Hard to tell what exactly you want, but it sounded kind of like this.
I want to use a feature new in pandas 0.19: time-aware rolling. In order to use it, we need a sorted index.
d1 = df.sort_index()
Now, let's count the nearby non-zero values for each timestamp. Start by adding two hours to every element of the index
d1.index = d1.index + pd.offsets.Hour(2)
Then we'll roll through, looking back four hours. Note that because every timestamp is shifted by the same amount (and shifted back again below), this is equivalent to a plain trailing four-hour window relative to the original timestamps, which is what the output shows.
d2 = d1.rolling('4H').sum()
d2.index = d2.index - pd.offsets.Hour(2)
d2
a b
2016-01-01 10:00:00 0.0 1.0
2016-01-01 10:30:00 1.0 1.0
2016-01-01 12:00:00 1.0 2.0
2016-01-01 14:00:00 2.0 1.0
I have a Pandas DataFrame with the columns:
UserID, Date, (other columns that we can ignore here)
I'm trying to select out only users that have visited on multiple dates. I'm currently doing it with groupby(['UserID', 'Date']) and a for loop, where I drop users with only one result, but I feel like there is a much better way to do this.
Thanks
It depends on the exact format of the output you want, but you can count distinct Dates inside each UserID and get all where this count > 1 (like having count(distinct Date) > 1 in SQL):
>>> df
Date UserID
0 2013-01-01 00:00:00 1
1 2013-01-02 00:00:00 2
2 2013-01-02 00:00:00 2
3 2013-01-02 00:00:00 1
4 2013-01-02 00:00:00 3
>>> g = df.groupby('UserID').Date.nunique()
>>> g
UserID
1 2
2 1
3 1
>>> g > 1
UserID
1 True
2 False
3 False
dtype: bool
>>> g[g > 1]
UserID
1 2
You see that you get UserID = 1 as a result; it's the only user who visited on multiple dates.
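To then pull those users' rows back out of the original dataframe, you can reuse g; an equivalent one-step option is df.groupby('UserID').filter(lambda x: x.Date.nunique() > 1).
>>> df[df.UserID.isin(g[g > 1].index)]
Date UserID
0 2013-01-01 00:00:00 1
3 2013-01-02 00:00:00 1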
To count the number of unique dates for every UserID:
df.groupby("UserID").Date.agg(lambda s: len(s.unique()))
Then you can drop users with only one count.
For the sake of adding another answer, you can also use indexing with a list comprehension:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'UserID' : [1, 1, 2, 3, 4, 4, 5], 'Data': np.random.rand(7)})
DF.loc[[row for row in DF.index if list(DF.UserID).count(DF.UserID[row]) > 1]]
This might be as much work as your for-loop, but it's just another option for you to consider.