Find a user/customer's max count of continuous dates using Python - python

Scenario: I have a sample data frame like below
user_id | date_login
--------|-----------
101 | 2015-10-11
101 | 2015-10-12
101 | 2015-11-01
101 | 2015-11-02
101 | 2015-11-03
102 | 2015-10-12
102 | 2015-10-13
...
I would like to know each user's max active days, meaning the count of continuous days he/she keeps logging into the system. For the sample data frame above, the desired result should look like the below:
user_id | max_continuous_login_count
--------|---------------------------
101     | 3
102     | 2
I'm thinking of converting the dates into numbers to compare. Is that necessary, and is there a good practice for this?
Thanks for the help,

Solution:
import datetime
from collections import defaultdict
from functools import reduce

dataset = [(101, "2015-10-11"), (101, "2015-10-12"), (102, "2015-10-13")]

data = defaultdict(list)
for user, date in dataset:
    data[user].append(datetime.datetime.strptime(date, "%Y-%m-%d").date())
    data[user].sort()

def count_days(acc, new_date):
    max_days, current_max, last_date = acc
    # Check if there's one day difference, else reset back to 1.
    if abs((new_date - last_date).days) != 1:
        current_max = 0
    current_max += 1
    return max(max_days, current_max), current_max, new_date

result = {}
for user, dates in data.items():
    result[user] = reduce(count_days, dates, (0, 0, datetime.date.min))[0]
What I did here was first convert the dataset into a dict mapping each user to his login dates. Along the way, I converted the dates to date objects and sorted them into the correct order (just in case the dataset is garbled).
I then created a function count_days() which checks whether the difference between two consecutive dates is exactly one day. If it is, it increments the current streak; otherwise it resets it. Then, using reduce, I created a result dict mapping each user id to max_days.
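Since the question mentions a pandas dataframe, here is a minimal pandas sketch of the same streak logic (the column names user_id and date_login are taken from the sample; it assumes at most one row per user per day, so drop duplicates first if a user can log in several times on the same date):

import pandas as pd

df = pd.DataFrame({
    "user_id": [101, 101, 101, 101, 101, 102, 102],
    "date_login": ["2015-10-11", "2015-10-12", "2015-11-01",
                   "2015-11-02", "2015-11-03", "2015-10-12", "2015-10-13"],
})
df["date_login"] = pd.to_datetime(df["date_login"])
df = df.sort_values(["user_id", "date_login"])

def max_streak(dates):
    # A new streak starts whenever the gap to the previous login is not exactly 1 day.
    streak_id = dates.diff().dt.days.ne(1).cumsum()
    return streak_id.value_counts().max()

result = df.groupby("user_id")["date_login"].apply(max_streak)
print(result)  # 101 -> 3, 102 -> 2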

Related

Count over PySpark dataframe by Running-window using a combination of two columns

I have a Spark DataFrame (v2.2.0) in which I want to count (by a group key) all events that occurred within a certain time frame, for example within 5 days from the start_date of the event in each row to the end_date of the other rows. So for example:
# For the sake of simplicity I included only 1 user, but there are multiple users
+-------------------+-------------------+-----+
|end_date |start_date |uid |
+-------------------+-------------------+-----+
|2020-11-26 09:30:28|2020-11-26 08:30:22|user1|
|2020-11-26 10:41:00|2020-11-26 10:00:00|user1|
|2020-11-22 12:40:27|2020-11-22 08:37:18|user1|
|2020-11-22 15:22:20|2020-11-22 13:32:30|user1|
|2020-11-20 17:20:07|2020-11-20 16:04:04|user1|
I defined a window
days = lambda i: i * 86400 # 60*60*24 = number of seconds in a day
w = (Window()
     .partitionBy(col("uid"))
     .orderBy(col("end_date").cast("timestamp").cast("long"))
     .rangeBetween(-days(5), 0))
And I calculated over the window:
by_end = df.select(col("start_date"), f.count("end_date").over(w).alias("count"))
df = df.join(by_end, 'start_date', how='left')
I'll get this:
+-------------------+-------------------+-----+-------+
|end_date |start_date |uid |count |
+-------------------+-------------------+-----+-------+
|2020-11-26 09:30:28|2020-11-26 08:30:22|user1| 4 |
|2020-11-26 10:41:00|2020-11-26 10:00:00|user1| 3 |
|2020-11-22 12:40:27|2020-11-22 08:37:18|user1| 3 |
|2020-11-22 15:22:20|2020-11-22 13:32:30|user1| 2 |
|2020-11-20 17:20:07|2020-11-20 16:04:04|user1| 1 |
But, to my understanding, this will do the rolling-window count by end_date, which is almost correct: I need it from the start_date of the current event to the end_date of all the others.
Any suggestions?
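One possible direction, sketched here as an assumption rather than taken from the original thread: replace the range window keyed on end_date with a conditional self-join on uid, so that the 5-day interval is anchored on each row's start_date. The column names uid, start_date and end_date come from the question; the direction and inclusivity of the 5-day condition, and whether a row should count itself, are guesses to adjust.

from pyspark.sql import functions as F

# Assumes start_date / end_date are already timestamp columns; cast them first if they are strings.
left = df.alias("l")
right = df.alias("r")

cond = (
    (F.col("l.uid") == F.col("r.uid"))
    & F.col("r.end_date").between(
        F.col("l.start_date") - F.expr("INTERVAL 5 DAYS"),  # 5 days before this row's start_date
        F.col("l.start_date"),
    )
)

counts = (
    left.join(right, cond, "left")
        .groupBy("l.uid", "l.start_date", "l.end_date")
        .agg(F.count("r.end_date").alias("count"))
)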

How can I get pandas to adjust my formula based on a specific value in a dataframe?

I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (I'm not interested in any other months). So far I have tried:
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However, the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]')
I should add here that I had to change the dates to datetime64[ns] format manually.
Any pointers gratefully received. Thanks!
The issue with your np.where call is that it tries to pass the entire series df["Termination_Date"] as an argument to np.busday_count. The count function fails because it requires its arguments to be in the np.datetime64[D] format (i.e., values specified only down to the day), and the Series cannot be easily converted to this format.
One solution is to write a custom function that only calls np.busday_count on elements that are not NaT, converting those to the datetime64[D] type before the call. Then you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd

DATE_FORMAT = "%d-%m-%Y"

# Reproduce raw data
raw_data = [
    ["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
    ["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
    ["C3", 30, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]

# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
    try:
        spl = s.split("/")
        ds = "%s-%s-%s" % (spl[2], spl[1], spl[0])
    except (AttributeError, IndexError):
        # None or non-date strings become NaT
        ds = "NAT"
    return np.datetime64(ds)

for line in raw_data:
    line[2] = parse_raw_dates(line[2])  # Hire_Date
    line[3] = parse_raw_dates(line[3])  # Termination_Date

# Create dataframe
df = pd.DataFrame(
    data=raw_data,
    columns=[
        "Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
        "Salary_Paid", "Multiplier", "Hourly_Pay"],
)

# Create special conversion function: only call np.busday_count on non-NaT values
def myfunc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 0
    else:
        return np.busday_count('2020-04-01', d)

df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)

def format_date(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return ""
    else:
        return pd.to_datetime(d).strftime(DATE_FORMAT)

df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:
import datetime as dt
import numpy as np
import pandas as pd

d = {'Emp_ID': ['A1', 'B2', 'C3'],
     'Weekly Hours': [35, 35, 30],
     'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
     'Termination_Date': ['06/04/2020', np.nan, np.nan],
     'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more usable format; this is where pd.to_datetime() comes in handy, and the only adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect: the dates are correctly represented, and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
def workday_calc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 30.44
    else:
        d = d.astype(str)
        d = dt.datetime.strptime(d, '%Y-%m-%d')
        e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
        return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted my earlier error while reviewing the numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of datetime64[D] is mandatory in the first line of the function; you can't use pd.to_datetime(). This is because the datetime64[D] format is a prerequisite for being able to call np.isnat().
However, the minute we deal with the NaT in the dataframe, we need to switch back to a string format, which is what datetime.strptime() needs.
Using datetime.strptime(), we tell Python that the date is represented in ISO format and keep working with it as a string; the advantage of both datetime.strptime() and np.busday_count() is that they are built to handle strings.
Also, np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in between are counted (a quick check of this behaviour is shown below). This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.
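A quick illustrative check of that end-exclusive behaviour (1 April 2020 was a Wednesday):

import numpy as np

# np.busday_count excludes the end date from the count:
np.busday_count('2020-04-01', '2020-04-06')  # 3 -> Wed 1, Thu 2, Fri 3 April
np.busday_count('2020-04-01', '2020-04-07')  # 4 -> plus Mon 6 April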
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46

RuntimeWarning: invalid value encountered in longlong_scalars

What I'm trying to do
I want to report the weekly rejection rate for multiple users. I use a for loop to go through a monthly dataset to get the numbers for every user. The final dataframe, rates, should look something like:
The end product, rates
Description
I have an initial dataframe (numbers) that contains only the ACCEPT, REJECT and REVIEW numbers, to which I added these rows and columns:
Rows: Grand Total, Rejection Rate
Columns: Grand Total
Here's what numbers looks like:
|---|--------|--------|--------|--------|-------------|
| | Week 1 | Week 2 | Week 3 | Week 4 | Grand Total |
|---|--------|--------|--------|--------|-------------|
| 0 | 994 | 699 | 529 | 877 | 3099 |
|---|--------|--------|--------|--------|-------------|
| 1 | 27 | 7 | 8 | 13 | 55 |
|---|--------|--------|--------|--------|-------------|
| 2 | 100 | 86 | 64 | 107 | 357 |
|---|--------|--------|--------|--------|-------------|
| 3 | 1121 | 792 | 601 | 997 | 3511 |
|---|--------|--------|--------|--------|-------------|
The indexes represent the following values:
0 - ACCEPT
1 - REJECT
2 - REVIEW
3 - TOTAL (Accept+Reject+Review)
I wrote 2 pre-defined functions:
get_decline_rates(df): Gets the decline rates by week from the numbers dataframe.
copy(empty_df, data): To transfer all data to a new dataframe with "double" headers (for reporting purposes).
Here's my code where I add rows and columns to numbers, then re-format it:
# Adding "Grand Total" column and rows
totals = numbers.sum(axis=0) # column sum
numbers = numbers.append(totals, ignore_index=True)
grand_total = numbers.sum(axis=1) # row sum
numbers.insert(len(numbers.columns), "Grand Total", grand_total)
# Adding "Rejection Rate" and re-indexing numbers
decline_rates = get_decline_rates(numbers)
numbers = numbers.append(decline_rates, ignore_index=True)
numbers.index = ["ACCEPT","REJECT","REVIEW","Grand Total","Rejection Rate"]
# Creating a new df with report format requirements
final = pd.DataFrame(0, columns=numbers.columns, index=["User A"]+list(numbers.index))
final.ix["User A",:] = final.columns
# Copying data from numbers to newly formatted df
copy(final,numbers)
# Append final df of this user to the final dataframe
rates = rates.append(final)
I'm using Python 3.5.2 and Pandas 0.19.2. If it helps, here's how the initial dataset looks:
Data format
I do a resampling on the date column to get the data by week.
What's going wrong
Here's the funny part - the code runs fine and I get all the required information in rates. However, I'm seeing this warning message:
RuntimeWarning: invalid value encountered in longlong_scalars
If I break the code down and run it line by line, this message does not appear. Even the message looks weird (what does longlong_scalars even mean?). Does anyone know what this warning message means, and what's causing it?
UPDATE:
I just ran a similar script that takes in exactly the same input and produces a similar output (except I get daily rejection rates instead of weekly). I get the same Runtime warning, except more information is given:
RuntimeWarning: invalid value encountered in longlong_scalars
rej_rate = str(int(round((col.ix[1]/col.ix[3])*100))) + "%"
I suspect something must have gone wrong when I was trying to calculate the decline rates with my pre-defined function, get_decline_rates(df). Could it be due to the dtype of the values? All columns on the input df, numbers, are int64.
Here's the code for my pre-defined function (the input, numbers, can be found under Description):
# Description: Get rejection rates for all weeks.
# Parameters: Pandas Dataframe with ACCEPT, REJECT, REVIEW count by week.
# Output: Pandas Series with rejection rates for all days in input df.
def get_decline_rates(df):
    decline_rates = []
    for i in range(len(df.columns)):
        col = df.ix[:,i]
        try:
            rej_rate = str(int(round((col[1]/col[3])*100))) + "%"
        except ValueError:
            rej_rate = "0%"
        decline_rates.append(rej_rate)
    return pd.Series(decline_rates, index=df.columns)
I had the same RuntimeWarning, and after looking into the data, it turned out to be caused by a zero division. I did not have the time to look into your sample, but you could look around id=0, or other records where a zero division (or similar) could occur.
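A minimal sketch of how that warning can arise with numpy integer scalars (the exact wording, long_scalars vs longlong_scalars, depends on the platform and numpy version), together with an explicit guard for a function like get_decline_rates:

import numpy as np

a, b = np.int64(0), np.int64(0)
x = a / b                # emits the RuntimeWarning and evaluates to nan
print(x)                 # nan
try:
    int(round(x))        # this is what the original "except ValueError" ends up catching
except ValueError as e:
    print(e)             # cannot convert float NaN to integer

# Checking the denominator explicitly avoids the warning altogether:
def safe_rate(rejected, total):
    return "0%" if total == 0 else str(int(round(rejected / total * 100))) + "%"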

Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account Id and divide the sum by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks
There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we'll mention only two:
Day-N retention: if a user registered on day 0, did she log in on day N? (Logging in on day N+1 does not affect this metric.) To measure it, you need to keep track of all the logs of your users.
Rolling retention: if a user registered on day 0, did she log in on day N or any day after that? (Logging in on day N+1 affects this metric.) To measure it, you just need the last known logs of your users; a toy example of the difference follows below.
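A toy illustration of the difference (not from the original answer; the numbers are made up):

import pandas as pd

# One user who registered on day 0 and then logged in on days 1, 2 and 9.
login_days = pd.Series([1, 2, 9])
N = 7

day_n_retained = bool((login_days == N).any())    # False: no login exactly on day 7
rolling_retained = bool((login_days >= N).any())  # True: a login happened on day 7 or later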
If I understand your table correctly, you have two relevant variables with which to build your cohort table: registration date and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it:
import pandas as pd
import numpy as np
import math
import datetime as dt

np.random.seed(0)  # so that we all have the same results

def random_date(start, end, p=None):
    # Return a date randomly chosen between two dates
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))

n_samples = 1000  # How many users do we want?
index = range(1, n_samples+1)

# A range of signup dates, say, one year.
end = dt.datetime.today()
from dateutil.relativedelta import relativedelta
start = end - relativedelta(years=1)

# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
                     index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x: random_date(start, end, x))

# last logs randomly distributed within 10 weeks of signing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x: random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this:
users.head()
Here is some code to build a cohort table:
### Some useful functions
def add_weeks(sourcedate, weeks):
    return sourcedate + dt.timedelta(days=7*weeks)

def first_day_of_week(sourcedate):
    return sourcedate - dt.timedelta(days=sourcedate.weekday())

def last_day_of_week(sourcedate):
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))

def retained_in_interval(users, signup_week, n_weeks, end_date):
    '''
    For a given list of users, returns the number of users
    that signed up in the week of signup_week (the cohort)
    and that are retained after n_weeks.
    end_date is just here to control that we do not unnecessarily fill the bottom right of the table.
    '''
    # Define the span of the given week
    cohort_start = first_day_of_week(signup_week)
    cohort_end = last_day_of_week(signup_week)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users that signed up in the given period of time
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)])
    elif pd.to_datetime(add_weeks(cohort_end, n_weeks)) > pd.to_datetime(end_date):
        # If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
        # we return some easily recognizable value (not 0, as it would cause confusion)
        return float("Inf")
    else:
        # Otherwise, we count the number of users that signed up in the given period of time,
        # and whose last known log was later than the number of weeks added (rolling retention)
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)
                         & (pd.to_datetime(users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x, n_weeks))))
                         ])
With this we can create the actual function:
def cohort_table(users, cohort_number=6, period_number=6, cohort_span='W', end_date=None):
    '''
    For a given dataframe of users, return a cohort table with the following parameters:
    cohort_number : the number of lines of the table
    period_number : the number of columns of the table
    cohort_span : the span of every period of time between the cohorts (D, W, M)
    end_date : the date after which we stop counting the users
    '''
    # the last column of the table will end today:
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe will be a list of dates ranging
    dates = pd.date_range(add_weeks(end_date, -cohort_number), periods=cohort_number, freq=cohort_span)
    cohort = pd.DataFrame(columns=['Sign up'])
    cohort['Sign up'] = dates
    # We will compute the number of retained users, column-by-column
    # (There probably is a more pythonesque way of doing it)
    range_dates = range(0, period_number+1)
    for p in range_dates:
        # Name of the column
        s_p = 'Week ' + str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users, row['Sign up'], p, end_date), axis=1)
    cohort = cohort.set_index('Sign up')
    # from absolute values to percentages, by dividing by the value of week 0:
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'), axis='index')
    return cohort
Now you can call it and see the result:
cohort_table(users)
Hope it helps.
Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(users.signup_date, users.last_log)
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
                    .replace(0, pd.NaT)
                    .add(totals, axis=0)
                    )
retention = retention_counts.div(totals, axis=0)
realigned = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realigned, index=retention.index)

Need to compare very large files around 1.5GB in python

"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
Above is the sample data.
The data is sorted by email address and the file is very large, around 1.5 GB.
I want the output in another csv file, something like this:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH#GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days
That is, if an entry occurs for the first time I need to append 1; if it occurs a second time I need to append 2, and so on. In other words, I need to count the number of occurrences of each email address in the file, and if an email occurs twice or more I also want the difference between the dates. Remember the dates are not sorted, so they have to be sorted per email address as well. I am looking for a solution in Python using numpy, pandas, or any other library that can handle this amount of data without running out of memory; I have a dual-core processor with CentOS 6.3 and 4 GB of RAM.
Make sure you have pandas 0.11; read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially the 'merging on millions of rows' one).
Here is a solution that seems to work. The workflow is:
read data from your csv in chunks, appending each chunk to an HDFStore
iterate over that store, creating another store that holds the combined results
Essentially we are taking a chunk from the table and combining it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as you go and keeping the latest data after each loop. Kind of like a recursive reduce.
This should be O(num_of_chunks**2) in memory and calculation time.
chunksize could be, say, 1M (or more) in your case.
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
count date diff email
4 1 2011-06-24 00:00:00 0 0000.ANU#GMAIL.COM
1 1 2011-06-24 00:00:00 0 00000.POO#GMAIL.COM
0 1 2010-07-26 00:00:00 0 00000000#11111.COM
2 1 2013-01-01 00:00:00 0 0000650000#YAHOO.COM
3 1 2013-01-26 00:00:00 0 00009.GAURAV#GMAIL.COM
5 1 2011-10-29 00:00:00 0 0000MANNU#GMAIL.COM
6 1 2011-11-21 00:00:00 0 0000PRANNOY0000#GMAIL.COM
7 1 2011-06-26 00:00:00 0 0000PRANNOY0000#YAHOO.CO.IN
8 1 2012-10-25 00:00:00 0 0000RAHUL#GMAIL.COM
9 1 2011-05-10 00:00:00 0 0000SS0#GMAIL.COM
12 1 2010-12-09 00:00:00 0 0001HARISH#GMAIL.COM
11 2 2010-12-12 00:00:00 3 0001HARISH#GMAIL.COM
10 3 2010-12-22 00:00:00 13 0001HARISH#GMAIL.COM
14 1 2012-11-28 00:00:00 0 000AYUSH#GMAIL.COM
15 2 2012-11-29 00:00:00 1 000AYUSH#GMAIL.COM
17 3 2012-12-08 00:00:00 10 000AYUSH#GMAIL.COM
18 4 2012-12-12 00:00:00 14 000AYUSH#GMAIL.COM
13 5 2013-01-25 00:00:00 58 000AYUSH#GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime

# your data
data = """
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""

# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')

def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize=5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                     header=0,usecols=['email','date'],parse_dates=['date'],
                     date_parser=dp, chunksize=chunksize)

for i, chunk in enumerate(reader):
    chunk['indexer'] = chunk.index + i*chunksize
    # create the global index, and keep it in the frame too
    df = chunk.set_index('indexer')
    # need to set a minimum size for the email column
    store.append('data',df,min_itemsize={'email' : 100})
store.close()

# define the combiner function
def combiner(x):
    # given a group of emails (the same), return a combination
    # with the new data
    # sort by the date
    y = x.sort('date')
    # calc the diff in days (an integer)
    y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
    y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')
    return y

# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki,in_store_file)
    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file,'w')
    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data',chunksize=chunksize):
        # concat & drop dups
        df = pd.concat([df1,df2]).drop_duplicates(['email','date'])
        # group and combine
        result = df.groupby('email').apply(combiner)
        # remove the mi (that we created in the groupby)
        result = result.reset_index('email',drop=True)
        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()
        # store to the out_store
        out_store.append('data',result,min_itemsize={'email' : 100})
    in_store2.close()
    out_store.close()
    in_store_file = out_store_file
in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
Use the built-in sqlite3 database: you can insert the data, sort and group as necessary, and there's no problem using a file which is larger than available RAM.
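A minimal sketch of that route (the file, table and column names here are made up for illustration): load the CSV into an on-disk SQLite database, storing the date in a sortable form, then let SQLite hand the rows back ordered by email and date so the counting can be done while streaming.

import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect("bookings.db")   # on-disk, so available RAM is not the limit
conn.execute("""CREATE TABLE IF NOT EXISTS bookings
                (kind TEXT, email TEXT, ref TEXT, date TEXT, chan TEXT, amount TEXT)""")

def rows(path):
    with open(path, newline="") as f:
        for kind, email, ref, date, chan, amount in csv.reader(f):
            # Store dates in ISO format so that ORDER BY sorts them chronologically.
            iso = datetime.strptime(date, "%d%b%Y").date().isoformat()
            yield kind, email, ref, iso, chan, amount

conn.executemany("INSERT INTO bookings VALUES (?,?,?,?,?,?)", rows("input.csv"))
conn.commit()

# Stream the rows back grouped by email and ordered by date; the occurrence count
# and the day differences can then be computed one email at a time, without ever
# holding the whole file in memory.
for row in conn.execute("SELECT * FROM bookings ORDER BY email, date"):
    pass  # same per-email counting logic as in the other answers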
Another possible (system-admin) way, avoiding databases and SQL queries plus a whole lot of runtime-process and hardware requirements.
Update 20/04: added more code and a simplified approach:
1. Convert the timestamp to seconds (from the Epoch) and use UNIX sort, using the email and this new field (that is: sort -k2 -k4 -n -t, < converted_input_file > output_file)
2. Initialize 3 variables, EMAIL, PREV_TIME and COUNT
3. Iterate over each line; if a new email is encountered, add "1,0 day". Update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email
4. Next line: 3 possible scenarios
a) if same email, different timestamp: calculate days, increment COUNT, update PREV_TIME, add "Count, Difference_in_days"
b) if same email, same timestamp: increment COUNT, add "COUNT, 0 day"
c) if new email, start from 3.
An alternative to 1. is to add a new field TIMESTAMP and remove it upon printing out the line.
Note: If 1.5 GB is too huge to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.
/usr/bin/gawk -F'","' '{
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]]=i;
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
output_file.txt:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...
You then pipe the output to a Perl, Python or AWK script to process steps 2 through 4.
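A hedged sketch of such a post-processing script in Python (field positions follow the gawk output above: the email is the second field and the appended epoch timestamp is the last; the names and exact output format are illustrative):

import csv
import sys

SECONDS_PER_DAY = 86400
email, prev_ts, count = None, None, 0

for row in csv.reader(sys.stdin):
    *fields, ts = row            # the gawk step appended the epoch timestamp as the last field
    ts = int(ts)
    if fields[1] != email:       # new email address: restart the running count
        email, count, diff_days = fields[1], 1, 0
    else:                        # same email: one more occurrence, days since the previous one
        count += 1
        diff_days = (ts - prev_ts) // SECONDS_PER_DAY
    prev_ts = ts
    print(",".join('"%s"' % f for f in fields) + ",%d,%d days" % (count, diff_days))

Run it as, for example: gawk ... | sort ... | python3 postprocess.py > final.csv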
