I start off with my wants, using this simplified example:
data = {'dg1_1': [1, 0],
        'dg1_2': [0, 1],
        'dg2_1': [0, 1],
        'dg2_2': [1, 0],
        'cont1': [13.0, 13.0]}
wants = pd.DataFrame(data)
I do not actually have this; it is what I want to generate. I have 2 one-hot encoded groups, dg1 and dg2. This is obviously simplified, and dg1 and dg2 can contain different numbers of columns. From some observations (a sample) I can also get them like this:
dg1_indeces = observations.columns[wants.columns.str.startswith("dg1")]
dg2_indeces = observations.columns[wants.columns.str.startswith("dg2")]
Given one observation (ab)using my wants to explain:
one_observation = wants.head(1)
I want to create all possible combinations given one_observation, so that only one column in each "one hot encoded group" is turned on at a time. So I can do:
haves = pd.concat([haves]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves.loc[:, dg1_indeces] = 0
haves.loc[:, dg2_indeces] = 0
print(haves)
This gives me all rows with the one-hot encoded groups all zero. I now want to get to my wants (see at the top) in the most efficient way, I guess avoiding loops, so I can then score the data using an existing model. Hope this makes sense?
PS:
This is my naïve way of possibly achieving this:
row = 0
for dg1 in dg1_indeces:
    for dg2 in dg2_indeces:
        haves.loc[row, dg1] = 1
        haves.loc[row, dg2] = 1
        row += 1
You can build it from the bottom up with pd.MultiIndex.from_product, or merge with how='cross':
s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]
# if s1 and s2 were DataFrames: idx = s1.merge(s2, how='cross')
idx = pd.MultiIndex.from_product([s1,s2]).map('|'.join)
pd.Series(idx).str.get_dummies('|')
Out[115]:
dg1_1 dg1_2 dg2_1 dg2_2
0 1 0 1 0
1 1 0 0 1
2 0 1 1 0
3 0 1 0 1
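If you also need the continuous columns (cont1 in the example) carried along, a possible final step is a cross merge back onto the observation. A minimal sketch, assuming df holds the single observation and a pandas version that supports how='cross' (1.2+):

perms = pd.Series(idx).str.get_dummies('|')
df[['cont1']].merge(perms, how='cross')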
Let's add a third attribute to the dg2 group and change the cont1 value of the second row to make things less confusing:
data = {'dg1_1': [1, 0],
        'dg1_2': [0, 1],
        'dg2_1': [0, 1],
        'dg2_2': [1, 0],
        'dg2_3': [0, 0],
        'cont1': [13.0, 14.0]}
wants = pd.DataFrame(data)
So now you have 2 groups, one with 2 attributes and one with 3 attributes. Only one attribute can be "hot" per group. If we lay out a 2 x 3 matrix and fill each cell (i, j) with (2**i, 2**j):
0 1 2
0 (1, 1) (1, 2) (1, 4)
1 (2, 1) (2, 2) (2, 4)
Then convert the matrix to binary:
0 1 2
0 (01, 001) (01, 010) (01, 100)
1 (10, 001) (10, 010) (10, 100)
It essentially satisfies our requirement that only one attribute per group is "hot". If you unravel (i.e. flatten) it:
dg1 dg2
01 001
01 010
01 100
10 001
10 010
10 100
It becomes the list of permutations that you can cross join against every observation.
import numpy as np

# Get the columns we are interested in
cols = wants.columns[wants.columns.str.startswith("dg")].to_series()
# shape is an (n1, n2, n3, ...) tuple where n_i is the number of attributes per group
shape = cols.str.split("_", expand=True).groupby(0).size().to_numpy()

rows = []
# Make the matrix
for i in range(shape.prod()):
    string = ''
    for dim, index in enumerate(np.unravel_index(i, shape)):
        string += bin(2 ** index)[2:].zfill(shape[dim])
    rows.append(map(int, list(string)))

permutations = pd.DataFrame(rows, columns=cols)

# Result
wants[["cont1"]].merge(permutations, how="cross")
I have an example DataFrame below:
df = pd.DataFrame([[1, 1, 1,'2016-09-01','pay',1], [1, 2, 1, '2016-09-01','claims',1], [2, 3, 3, '2016-09-02','claims',1],[2,4,3,'2016-10-02','pay',2],[3,5,4,'2016-09-02','pay',1],[3,6,5,'2016-09-04','pay',2],[3,7,4,'2016-09-06','claims',3],[3,8,6,'2016-09-08','pay',4]], columns=['claim_id', 'payment_id', 'provider_id','date','dataset','date_rank'])
df['date'] = pd.to_datetime(df['date']) # this column ought to be date
df
There are duplicate payments that cannot be removed using a simple drop_duplicates() because the decision to drop a row or not depends on its relationship to other payment rows of the same claim_id.
I would like to create a new column called 'dup' which labels the rows that are duplicates so that I can review them before dropping them from the DataFrame.
The logic needed to accurately remove the duplicates is:
For each claim_id in df:
For the payment where df['dataset'] == 'claims', check if there is another payment for the same claim_id that has the same provider_id and that occurs prior to or on the same df['date']. If there is, label the new column df['dup'] as True for the payment where df['dataset'] == 'claims'. Otherwise, label the new column df['dup'] as False.
In this example, payment_ids 2 and 7 should have a value of True in the new column 'dup', while all other payment_ids should be False:
df_out = pd.DataFrame([[1, 1, 1,'2016-09-01','pay',1,False], [1, 2, 1, '2016-09-01','claims',1,True], [2, 3, 3, '2016-09-02','claims',1,False],[2,4,3,'2016-10-02','pay',2,False],[3,5,4,'2016-09-02','pay',1,False],[3,6,5,'2016-09-04','pay',2,False],[3,7,4,'2016-09-06','claims',3,True],[3,8,6,'2016-09-08','pay',4,False]], columns=['claim_id', 'payment_id', 'provider_id','date','dataset','date_rank','dup'])
df_out['date'] = pd.to_datetime(df_out['date']) # this column ought to be date
df_out
I have tried many different things including trying to break this down into steps but have not been successful. In one of these attempts I created the date_rank column which labels the payments by the date order that they appear. I have included this here in case it is helpful.
It's a little bit ugly, but it works:
import pandas as pd
df = pd.DataFrame([[1, 1, 1,'2016-09-01','pay',1], [1, 2, 1, '2016-09-01','claims',1], [2, 3, 3, '2016-09-02','claims',1],[2,4,3,'2016-10-02','pay',2],[3,5,4,'2016-09-02','pay',1],[3,6,5,'2016-09-04','pay',2],[3,7,4,'2016-09-06','claims',3],[3,8,6,'2016-09-08','pay',4]], columns=['claim_id', 'payment_id', 'provider_id','date','dataset','date_rank'])
df['date'] = pd.to_datetime(df['date']) # this column ought to be date
df['dup'] = False
provider_list = list(set(df['provider_id'].tolist()))
for provider in provider_list:
    temp_df = df.loc[df['provider_id'] == provider]
    claim_list = list(set(df['claim_id'].tolist()))
    for claim in claim_list:
        temp_df2 = temp_df.loc[temp_df['claim_id'] == claim]
        if len(set(temp_df2['dataset'].tolist())) == 2:
            pay_date = temp_df2.loc[temp_df2['dataset'] == 'pay', 'date'].iloc[0]
            claim_date = temp_df2.loc[temp_df2['dataset'] == 'claims', 'date'].iloc[0]
            if pay_date <= claim_date:
                payment_id = temp_df2.loc[temp_df2['dataset'] == 'claims', 'payment_id'].iloc[0]
                df.loc[df['payment_id'] == payment_id, 'dup'] = True
        else:
            continue
df
output:
claim_id payment_id provider_id date dataset date_rank dup
0 1 1 1 2016-09-01 pay 1 False
1 1 2 1 2016-09-01 claims 1 True
2 2 3 3 2016-09-02 claims 1 False
3 2 4 3 2016-10-02 pay 2 False
4 3 5 4 2016-09-02 pay 1 False
5 3 6 5 2016-09-04 pay 2 False
6 3 7 4 2016-09-06 claims 3 True
7 3 8 6 2016-09-08 pay 4 False
EDIT
Because you have to iterate over each provider_id (I didn't find a different way) and you have more than 1M rows, I recommend splitting the DataFrame into chunks of ~1000 providers.
In addition, I would save the result as a csv file for each chunk, just in case the run collapses (in that case I would continue from the place where it collapsed):
import pandas as pd
import itertools
chunk_n = 1000
df = pd.DataFrame([[1, 1, 1,'2016-09-01','pay',1], [1, 2, 1, '2016-09-01','claims',1], [2, 3, 3, '2016-09-02','claims',1],[2,4,3,'2016-10-02','pay',2],[3,5,4,'2016-09-02','pay',1],[3,6,5,'2016-09-04','pay',2],[3,7,4,'2016-09-06','claims',3],[3,8,6,'2016-09-08','pay',4]], columns=['claim_id', 'payment_id', 'provider_id','date','dataset','date_rank'])
df['date'] = pd.to_datetime(df['date']) # this column ought to be date
df['dup'] = False
list_of_df = []
provider_list = list(set(df['provider_id'].tolist()))
provider_groups = [list(group) for key, group in itertools.groupby(provider_list, lambda k: k//chunk_n)]
i = 0
for group in provider_groups:
    print('starting group', i)
    temp_df = df.loc[df['provider_id'].isin(group)]
    for provider in group:
        claim_list = list(set(temp_df.loc[temp_df['provider_id'] == provider, 'claim_id'].tolist()))
        for claim in claim_list:
            temp_df2 = temp_df.loc[(temp_df['claim_id'] == claim) & (temp_df['provider_id'] == provider)]
            if len(set(temp_df2['dataset'].tolist())) == 2:
                pay_date = temp_df2.loc[temp_df2['dataset'] == 'pay', 'date'].iloc[0]
                claim_date = temp_df2.loc[temp_df2['dataset'] == 'claims', 'date'].iloc[0]
                if pay_date <= claim_date:
                    payment_id = temp_df2.loc[temp_df2['dataset'] == 'claims', 'payment_id'].iloc[0]
                    temp_df.loc[temp_df['payment_id'] == payment_id, 'dup'] = True
            else:
                continue
    list_of_df.append(temp_df)
    # temp_df.to_csv('group_{}.csv'.format(i))
    # i += 1
df = pd.concat(list_of_df)
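For completeness, a vectorized sketch (not part of the answer above) that avoids the Python-level loops: merge the 'claims' rows against the 'pay' rows on claim_id and provider_id and compare the dates. It assumes the same df as above:

claims = df[df['dataset'] == 'claims']
pays = df[df['dataset'] == 'pay']
# pair every claims row with the pay rows of the same claim_id / provider_id
pairs = claims.merge(pays, on=['claim_id', 'provider_id'], suffixes=('', '_pay'))
# a claims payment is a duplicate if some matching pay row falls on or before its date
dup_ids = pairs.loc[pairs['date_pay'] <= pairs['date'], 'payment_id'].unique()
df['dup'] = df['payment_id'].isin(dup_ids)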
My dates come in the format 11122020 (ddmmyyyy) in a pandas column.
I use
datapdf["wholetime"]=pd.to_datetime(datapdf["wholetime"],format='%d%m%Y)
to convert it to a datetime and do processing on it.
Recently my code failed for the date 3122020 with
ValueError: day is out of range for month
Python is interpreting it as 31 2 2020 instead of 3 12 2020, which causes the error. Does anyone have a solution for this?
One way would be to use str.zfill to ensure that each date has 8 digits:
s = pd.Series(["11122020", "3122020"])
pd.to_datetime(s.str.zfill(8), format="%d%m%Y")
Output:
0 2020-12-11
1 2020-12-03
dtype: datetime64[ns]
Note that this answer only handles the missing 0 in the day. It won't be able to parse more ambiguous items such as 332020, where the month part also requires a leading 0.
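A quick illustration of that caveat (errors='coerce' is used so the failure shows up as NaT instead of an exception):

pd.to_datetime(pd.Series(['332020']).str.zfill(8), format='%d%m%Y', errors='coerce')
# '332020' zero-fills to '00332020'; day '00' and month '33' cannot be parsed, so the result is NaT.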
A little bit of a newbie approach: using apply, I created a custom parser for the dates. If you have some other formats in the data, you can tweak the function w.r.t. your date formats.
import pandas as pd

data = {
    # assuming your dates are a mix of ddmmyyyy, dmmyyyy, dmyyyy
    'date': ['11122020', '3122020', '572020', '', '222019', '3112019']
}
df = pd.DataFrame(data)

def parser(elem):
    res = ''
    if len(elem) > 7:
        res = elem
    elif len(elem) > 6:
        d = '0' + elem[0]
        m = elem[1:3]
        y = elem[3:]
        res = d + m + y
    elif len(elem) > 5:
        d = '0' + elem[0]
        m = '0' + elem[1]
        y = elem[2:]
        res = d + m + y
    else:
        res = ''
    return pd.to_datetime(res, format='%d%m%Y', errors='coerce')

df['date'] = df['date'].apply(parser)
df
output:
date
0 2020-12-11
1 2020-12-03
2 2020-07-05
3 NaT
4 2019-02-02
5 2019-11-03
For customer segmentation purposes, I want to analyse how many transactions each customer did in the prior 10 days and 20 days, based on the given table of transaction records with dates.
In this table, the last 2 columns were added using the following code, but I'm not satisfied with this code; please suggest improvements.
import pandas as pd
df4 = pd.read_excel(path)
# Since there are two customers, A and B, two separate dataframes are created
df4A = df4[df4['Customer_ID'] == 'A']
df4B = df4[df4['Customer_ID'] == 'B']
from datetime import date
from dateutil.relativedelta import relativedelta
txn_prior_10days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_10days_date = current_date - relativedelta(days=10)
    if df4.iloc[i, 1] == 'A':
        No_of_txn = ((df4A['Transaction_Date'] >= prior_10days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)
    elif df4.iloc[i, 1] == 'B':
        No_of_txn = ((df4B['Transaction_Date'] >= prior_10days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)

txn_prior_20days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_20days_date = current_date - relativedelta(days=20)
    if df4.iloc[i, 1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_20days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)
    elif df4.iloc[i, 1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_20days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)

df4['txn_prior_10days'] = txn_prior_10days
df4['txn_prior_20days'] = txn_prior_20days
df4
Your code would be very difficult to write if you had e.g. 10 different Customer_IDs.
Fortunately, there is a much shorter solution:
When you read your file, convert Transaction_Date to datetime,
e.g. passing parse_dates=['Transaction_Date'] to read_excel.
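For example (a minimal sketch, assuming path points to your Excel file):

df = pd.read_excel(path, parse_dates=['Transaction_Date'])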
Define a function counting how many dates in a group (gr) are
within the range between tDtl (a Timedelta) before the current date (dd)
and 1 day before it:
def cntPrevTr(dd, gr, tDtl):
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()
It will be applied twice to each member of the current group
by Customer_ID (actually to the Transaction_Date column only),
once with tDtl == 10 days and a second time with tDtl == 20 days.
Define a function computing both columns with the number of previous
transactions, for the current group of transaction dates:
def priorTx(td):
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})
Generate the result:
df[['txn_prior_10days', 'txn_prior_20days']] = df.groupby('Customer_ID')\
.Transaction_Date.apply(priorTx)
The code above:
groups df by Customer_ID,
takes from the current group only the Transaction_Date column,
applies the priorTx function to it,
saves the result in the 2 target columns.
The result, with Transaction_ID shortened a bit, is:
Transaction_ID Customer_ID Transaction_Date txn_prior_10days txn_prior_20days
0 912410 A 2019-01-01 0 0
1 912341 A 2019-01-03 1 1
2 312415 A 2019-01-09 2 2
3 432513 A 2019-01-12 2 3
4 357912 A 2019-01-19 2 4
5 912411 B 2019-01-06 0 0
6 912342 B 2019-01-11 1 1
7 312416 B 2019-01-13 2 2
8 432514 B 2019-01-20 2 3
9 357913 B 2019-01-21 3 4
You cannot use a rolling computation, because:
the rolling window extends forward from the current row, but you
want to count previous transactions,
rolling calculations include the current row, whereas
you want to exclude it.
This is why I came up with the above solution (just 8 lines of code).
Details of how my solution works
To see all details, create the test DataFrame the following way:
import io
txt = '''
Transaction_ID Customer_ID Transaction_Date
912410 A 2019-01-01
912341 A 2019-01-03
312415 A 2019-01-09
432513 A 2019-01-12
357912 A 2019-01-19
912411 B 2019-01-06
912342 B 2019-01-11
312416 B 2019-01-13
432514 B 2019-01-20
357913 B 2019-01-21'''
df = pd.read_fwf(io.StringIO(txt), skiprows=1,
widths=[15, 12, 16], parse_dates=[2])
Perform groupby, but for now retrieve only group with key 'A':
gr = df.groupby('Customer_ID')
grp = gr.get_group('A')
It contains:
Transaction_ID Customer_ID Transaction_Date
0 912410 A 2019-01-01
1 912341 A 2019-01-03
2 312415 A 2019-01-09
3 432513 A 2019-01-12
4 357912 A 2019-01-19
Let's start from the most detailed issue: how cntPrevTr works.
Retrieve one of dates from grp:
dd = grp.iloc[2,2]
It contains Timestamp('2019-01-09 00:00:00').
To test example invocation of cntPrevTr for this date, run:
cntPrevTr(dd, grp.Transaction_Date, pd.Timedelta(10, 'D'))
i.e. you want to check how many prior transactions this customer performed
before this date, but not earlier than 10 days back.
The result is 2.
To see how the whole first column is computed, run:
td = grp.Transaction_Date
td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D')))
The result is:
0 0
1 1
2 2
3 2
4 2
Name: Transaction_Date, dtype: int64
The left column is the index and the right one holds the values returned
by the cntPrevTr call for each date.
And the last thing to show is how the result for the whole group
is generated. Run:
priorTx(grp.Transaction_Date)
The result (a DataFrame) is:
tx10 tx20
0 0 0
1 1 1
2 2 2
3 2 3
4 2 4
The same procedure takes place for all other groups, then
all partial results are concatenated (vertically) and the last
step is to save both columns of the whole DataFrame in
respective columns of df.
I have the following python pandas dataframe:
          Number of visits per year
user id    2013   2014   2015   2016
A             4      3      6      0
B             3      0      7      3
C            10      6      3      0
I want to calculate the percentage of users who returned based on their number of visits. I am sorry, I don't have any code yet; I wasn't sure how to start this.
This is the end result I am looking for:
| Number of visits in the year |
Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
2014 7% 3% 4% 15% 6% 7% 18% 17% 3% 2%
2015 3% ....
2016
So based on the above I can say that 15% of clients who visited the store 4 times in 2013 came back to the store in 2014.
Thank you very much.
UPDATE: This is what I did, maybe there is a better way through a loop?
For each year, I had a csv like this:
user_id | NR_V
A 4
B 3
C 10
NR_V stands for number of visits.
So I uploaded each csv as its own df and I had df_2009, df_2010, ... until df_2016.
For each file I added a column with 0/1 indicating whether they shopped the next year.
df_2009['shopped2010'] = np.where(df_2009['user_ID'].isin(df_2010['user_ID']), 1, 0)
Then I pivoted each dataframe.
pivot_2009 = pd.pivot_table(df_2009,index=["NR_V"],aggfunc={"NR_V":len, "shopped2010":np.sum})
Next, for each dataframe I created a new dataframe with a column calculating the percentage by number of visits.
p_2009 = pd.DataFrame()
p_2009['%returned2010'] = (pivot_2009['shopped2010']/pivot_2009['NR_V'])*100
Finally, I merged all those dataframes into one.
dfs = [p_2009, p_2010, p_2011, p_2012, p_2013, p_2014, p_2015 ]
final = pd.concat(dfs, axis=1)
Consider the sample visits dataframe df
df = pd.DataFrame(
np.random.randint(1, 10, (100, 5)),
pd.Index(['user_{}'.format(i) for i in range(1, 101)], name='user id'),
[
['Number of visits per year'] * 5,
[2012, 2013, 2014, 2015, 2016]
]
)
df.head()
You can apply pd.value_counts with parameter normalize=True.
Also, since an entry of 8 represents 8 separate visits, it should count 8 times. I'll use repeat to accomplish this prior to value_counts
def count_visits(col):
    v = col.values
    return pd.value_counts(v.repeat(v), normalize=True)

df.apply(count_visits).stack().unstack(0)
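If you then want the table expressed as percentages, as in the question's desired output, one option (a presentation choice, not spelled out in the original answer) is to scale and round the normalized frequencies:

pct = df.apply(count_visits).stack().unstack(0).mul(100).round(1)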
I used the index value of every visitor and checked if the same index value (aka the same Visitor_ID) was more than 0 the next year. This was then added to a dictionary in the form of True or False, which you could use for a bar chart. I also made two lists (times_returned and returned_at_all) for additional data manipulation.
import pandas as pd
# Part 1, Building the dataframe.
df = pd.DataFrame({
    'Visitor_ID': [1, 2, 3],
    '2010': [4, 3, 10],
    '2011': [3, 0, 6],
    '2012': [6, 7, 3],
    '2013': [0, 3, 0]
})
df.set_index("Visitor_ID", inplace=True)

# Part 2, preparing the required variables.
def dictionary(max_visitors):
    dictionary = {}
    for x in range(max_visitors):
        dictionary["number_{}".format(x)] = []
    # print(dictionary)
    return dictionary

# Part 3, Figuring out if the customer returned.
def compare_yearly_visits(current_year, next_year):
    index = 1
    years = df.columns
    for x in df[current_year]:
        # print(df[years][current_year][index], 'this year.')
        # print(df[years][next_year][index], 'Next year.')
        how_many_visits = df[years][current_year][index]
        did_he_return = df[years][next_year][index]
        if did_he_return > 0:
            # If the visitor returned, add to a bunch of formats:
            returned_at_all.append([how_many_visits, True])
            times_returned.append([how_many_visits, did_he_return])
            dictionary["number_{}".format(x)].append(True)
        else:
            # If the visitor did not return, add to a bunch of formats:
            returned_at_all.append([how_many_visits, False])
            dictionary["number_{}".format(x)].append(False)
        index = index + 1

# Part 4, The actual program:
highest_amount_of_visits = 11  # should be done automatically, max(visits)?
relevant_years = len(df.columns) - 1
times_returned = []
returned_at_all = []
dictionary = dictionary(highest_amount_of_visits)

for column in range(relevant_years):
    # print(dictionary)
    this_year = df.columns[column]
    next_year = df.columns[column + 1]
    compare_yearly_visits(this_year, next_year)
    print("cumulative dictionary up to:", this_year, "\n", dictionary)
Please find below my solution. As a note, I am pretty positive that this can be improved.
import pandas as pd

# step 0: create data frame
df = pd.DataFrame({'2013': [4, 3, 10], '2014': [3, 0, 6], '2015': [6, 7, 3], '2016': [0, 3, 0]}, index=['A', 'B', 'C'])

# container list of dataframes to be concatenated
frames = []

# iterate through the dataframe one column at a time and determine its value_counts (freq table)
for name, series in df.iteritems():
    frames.append(series.value_counts())

# Merge the frequency tables for all columns into a dataframe
temp_df = pd.concat(frames, axis=1).transpose().fillna(0)

# Find the full range of visit counts in the new dataframe, and append any missing ones as zero columns
cols = temp_df.columns
min_visits = cols.min()
max_visits = cols.max()
for i in range(min_visits, max_visits):
    if i not in cols:
        temp_df[i] = 0

# Calculate percentage
final_df = temp_df.div(temp_df.sum(axis=1), axis=0)
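If you prefer the cells as percentages rather than fractions, a possible last step (an assumption about the desired output format):

# scale the row-normalized fractions to percentages
final_df = final_df.mul(100).round(1)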