I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (not interested in any other months). So far I have tried.
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype(' m8 [ns] ') to dtype(' m8 [d] ')
I should add here that I had to change the dates to datetime[ns64] format manually.
Any pointers gratefully received. Thanks!
The issue with your np.where function call is that it is trying to pass the entire series df["Termination_Date"] as an argument to np.busday_count. The count function fails because it requires arguments to be in the np.datetime64[D] format (i.e., value only specified to the day), and the Series cannot be easily converted to this format.
One solution is to write a custom function that only calls that np.busday_count on elements that are not NaTs, converting those to the datetime64[D] type before calling np.busday_count. Then, you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
DATE_FORMAT = "%d-%m-%Y"
# Reproduce raw data
raw_data = [
["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
["C3", 35, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]
# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
try:
spl = s.split("/")
ds = "%s-%s-%s" %(spl[2], spl[1], spl[0])
except:
ds = "NAT"
return np.datetime64(ds)
for line in raw_data:
line[2] = parse_raw_dates(line[2])
# Create dataframe
df = pd.DataFrame(
data = raw_data,
columns = [
"Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
"Salary_Paid", "Multiplier", "Hourly_Pay"],
)
# Create special conversion function
def myfunc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 0
else:
return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)
def format_date(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return ""
else:
return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:
d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': ['35', '35', '30'], 'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more useable format - this is where pd.to_datetime() comes in handy -the adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect; whereby the dates are correctly represented and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
def workday_calc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 30.44
else:
d = d.astype(str)
d = dt.datetime.strptime(d, '%Y-%m-%d')
e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of the datetime64[D] is mandatory in the first line of the function - you can't use pd.to_datetime(). This is because the datetime64[D] format is a pre-requisite to being able to call the np.isnat() function.
However, the minute we deal with the NaT in the dataframe, we need to switch back to a string format, which is needed for the datetime.strptime() function.
Using the datetime.strptime() feature, we tell Python that the date is a) represented in the ISO format, and we need to retain it as a string. The advantage with both datetime.strptime() and np.busday_count() is that they are both built to handle strings.
Also, the np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46
I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account Id and divide the sum by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks
There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we’ll mention only two :
Day-N retention : if a user registered on day 0, did she log in on day N ? (Logging on day N+1 does not affect this metric). To measure it, you need to keep track of all the logs of your users.
Rolling retention : if a user registered on day 0, did she log in on day N or any day after that ? (Logging in on day N+1 affects this metric). To measure it, you just need the last know logs of your users.
If I understand your table correctly, you have two relevant variables to build your cohort table : registration date, and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it :
import pandas as pd
import numpy as np
import math
import datetime as dt
np.random.seed(0) # so that we all have the same results
def random_date(start, end,p=None):
# Return a date randomly chosen between two dates
if p is None:
p = np.random.random()
return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))
n_samples = 1000 # How many users do we want ?
index = range(1,n_samples+1)
# A range of signup dates, say, one year.
end = dt.datetime.today()
from dateutil.relativedelta import relativedelta
start = end - relativedelta(years=1)
# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x : random_date(start, end,x))
# last logs randomly distributed within 10 weeks of singing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x : random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this :
users.head()
Here is some code to build a cohort table :
### Some useful functions
def add_weeks(sourcedate,weeks):
return sourcedate + dt.timedelta(days=7*weeks)
def first_day_of_week(sourcedate):
return sourcedate - dt.timedelta(days = sourcedate.weekday())
def last_day_of_week(sourcedate):
return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))
def retained_in_interval(users,signup_week,n_weeks,end_date):
'''
For a given list of users, returns the number of users
that signed up in the week of signup_week (the cohort)
and that are retained after n_weeks
end_date is just here to control that we do not un-necessarily fill the bottom right of the table
'''
# Define the span of the given week
cohort_start = first_day_of_week(signup_week)
cohort_end = last_day_of_week(signup_week)
if n_weeks == 0:
# If this is our first week, we just take the number of users that signed up on the given period of time
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)])
elif pd.to_datetime(add_weeks(cohort_end,n_weeks)) > pd.to_datetime(end_date) :
# If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
# We return some easily recognizable date (not 0 as it would cause confusion)
return float("Inf")
else:
# Otherwise, we count the number of users that signed up on the given period of time,
# and whose last known log was later than the number of weeks added (rolling retention)
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)
& pd.to_datetime((users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x,n_weeks))))
])
With this we can create the actual function :
def cohort_table(users,cohort_number=6,period_number=6,cohort_span='W',end_date=None):
'''
For a given dataframe of users, return a cohort table with the following parameters :
cohort_number : the number of lines of the table
period_number : the number of columns of the table
cohort_span : the span of every period of time between the cohort (D, W, M)
end_date = the date after which we stop counting the users
'''
# the last column of the table will end today :
if end_date is None:
end_date = dt.datetime.today()
# The index of the dataframe will be a list of dates ranging
dates = pd.date_range(add_weeks(end_date,-cohort_number), periods=cohort_number, freq=cohort_span)
cohort = pd.DataFrame(columns=['Sign up'])
cohort['Sign up'] = dates
# We will compute the number of retained users, column-by-column
# (There probably is a more pythonesque way of doing it)
range_dates = range(0,period_number+1)
for p in range_dates:
# Name of the column
s_p = 'Week '+str(p)
cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users,row['Sign up'],p,end_date), axis=1)
cohort = cohort.set_index('Sign up')
# absolute values to percentage by dividing by the value of week 0 :
cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'),axis='index')
return cohort
Now you can call it and see the result :
cohort_table(users)
Hope it helps
Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(signup_date, last_log)
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
.replace(0, pd.NaT)
.add(totals, axis=0)
)
retention = retention_counts.div(totals, axis=0)
realined = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realined, index=retention.index)
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
Above is the sample data.
Data is sorted according to email addresses and the file is very large around 1.5Gb
I want output in another csv file something like this
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH#GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days
i.e if entry occurs 1st time i need to append 1 if it occurs 2nd time i need to append 2 and likewise i mean i need to count no of occurences of an email address in the file and if an email exists twice or more i want difference among dates and remember dates are not sorted so we have to sort them also against a particular email address and i am looking for a solution in python using numpy or pandas library or any other library that can handle this type of huge data without giving out of bound memory exception i have dual core processor with centos 6.3 and having ram of 4GB
make sure you have 0.11, read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (esp the 'merging on millions of rows'
Here is a solution that seems to work. Here is the workflow:
read data from your csv by chunks and appending to an hdfstore
iterate over the store, which creates another store that does the combiner
Essentially we are taking a chunk from the table and combining with a chunk from every other part of the file. The combiner function does not reduce, but instead calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as you go, and taking the latest data after each loop. Kind of like a recursive reduce almost.
This should be O(num_of_chunks**2) memory and calculation time
chunksize could be say 1m (or more) in your case
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
count date diff email
4 1 2011-06-24 00:00:00 0 0000.ANU#GMAIL.COM
1 1 2011-06-24 00:00:00 0 00000.POO#GMAIL.COM
0 1 2010-07-26 00:00:00 0 00000000#11111.COM
2 1 2013-01-01 00:00:00 0 0000650000#YAHOO.COM
3 1 2013-01-26 00:00:00 0 00009.GAURAV#GMAIL.COM
5 1 2011-10-29 00:00:00 0 0000MANNU#GMAIL.COM
6 1 2011-11-21 00:00:00 0 0000PRANNOY0000#GMAIL.COM
7 1 2011-06-26 00:00:00 0 0000PRANNOY0000#YAHOO.CO.IN
8 1 2012-10-25 00:00:00 0 0000RAHUL#GMAIL.COM
9 1 2011-05-10 00:00:00 0 0000SS0#GMAIL.COM
12 1 2010-12-09 00:00:00 0 0001HARISH#GMAIL.COM
11 2 2010-12-12 00:00:00 3 0001HARISH#GMAIL.COM
10 3 2010-12-22 00:00:00 13 0001HARISH#GMAIL.COM
14 1 2012-11-28 00:00:00 0 000AYUSH#GMAIL.COM
15 2 2012-11-29 00:00:00 1 000AYUSH#GMAIL.COM
17 3 2012-12-08 00:00:00 10 000AYUSH#GMAIL.COM
18 4 2012-12-12 00:00:00 14 000AYUSH#GMAIL.COM
13 5 2013-01-25 00:00:00 58 000AYUSH#GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime
# your data
data = """
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""
# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')
def dp(x, **kwargs):
return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]
chunksize=5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
header=0,usecols=['email','date'],parse_dates=['date'],
date_parser=dp, chunksize=chunksize)
for i, chunk in enumerate(reader):
chunk['indexer'] = chunk.index + i*chunksize
# create the global index, and keep it in the frame too
df = chunk.set_index('indexer')
# need to set a minimum size for the email column
store.append('data',df,min_itemsize={'email' : 100})
store.close()
# define the combiner function
def combiner(x):
# given a group of emails (the same), return a combination
# with the new data
# sort by the date
y = x.sort('date')
# calc the diff in days (an integer)
y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')
return y
# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)
# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
print "processing [%s] [%s]" % (chunki,in_store_file)
out_store_file = 'datastore_%s.h5' % chunki
out_store = pd.HDFStore(out_store_file,'w')
# iter on store 2
in_store2 = pd.HDFStore(in_store_file)
for df2 in in_store2.select('data',chunksize=chunksize):
# concat & drop dups
df = pd.concat([df1,df2]).drop_duplicates(['email','date'])
# group and combine
result = df.groupby('email').apply(combiner)
# remove the mi (that we created in the groupby)
result = result.reset_index('email',drop=True)
# only store those rows which are in df2!
result = result.reindex(index=df2.index).dropna()
# store to the out_store
out_store.append('data',result,min_itemsize={'email' : 100})
in_store2.close()
out_store.close()
in_store_file = out_store_file
in_store1.close()
# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
Use the built-in sqlite3 database: you can insert the data, sort and group as necessary, and there's no problem using a file which is larger than available RAM.
Another possible (system-admin) way, avoiding database and SQL queries plus a whole lot of requirements in runtime processes and hardware resources.
Update 20/04 Added more code and simplified approach:-
Convert the timestamp to seconds (from Epoch) and use UNIX sort, using email and this new field (that is: sort -k2 -k4 -n -t, < converted_input_file > output_file)
Initialize 3 variable, EMAIL, PREV_TIME and COUNT
Interate over each line, if new email is encountered, add "1,0 day". Update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email
Next line: 3 possible scenario
a) if same email, different timestamp: calculate days, increment COUNT=1, update PREV_TIME, add "Count, Difference_in_days"
b) If same email, same timestamp: increment COUNT, add "COUNT, 0 day"
c) If new email, start from 3.
Alternative to 1. is to add a new field TIMESTAMP and remove it upon printing out the line.
Note: If 1.5GB is too huge to sort at a go, split it into smaller chuck, using email as the split point. You can run these chunks in parallel on different machine
/usr/bin/gawk -F'","' ' {
split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
for (i=1; i<=12; i++) mdigit[month[i]]=i;
print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00"
)}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
output_file.txt:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...
You pipe the output to Perl, Python or AWK script to process step 2. through 4.