I'm performing a cohort analysis using Python, and I am having trouble creating a new column that sums up the total number of months a user has stayed with us.
I know the math behind the answer; all I have to do is:
Subtract the start year from the cancellation year.
Multiply that by 12.
Subtract the start month from the cancellation month.
Add those two numbers together.
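For example, a customer who started on 2016-02-07 and canceled on 2018-07-07 gives (2018 - 2016) * 12 + (7 - 2) = 29 months.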
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
Column C is the date the customer canceled, and column B is the date they started.
The problem is that I am very new to Python and Pandas, and I am having trouble translating that function into Python.
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns the error TypeError: 'Series' object is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc(Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
    df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
    df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you need to first convert the columns with to_datetime and then use dt.year and
dt.month:
df = pd.DataFrame({
'Plan_Cancel_Date': ['2018-07-07','2019-03-05','2020-10-08'],
'Plan_Start_Date': ['2016-02-07','2017-01-05','2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
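As a side note, the same month arithmetic can be written with monthly periods; a small sketch, assuming a pandas version (0.24+) in which subtracting period series returns offsets whose .n attribute holds the month count:
# Equivalent month difference via monthly periods; .n extracts the integer count.
df['Lifetime'] = (df.Plan_Cancel_Date.dt.to_period('M') -
                  df.Plan_Start_Date.dt.to_period('M')).apply(lambda off: off.n)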
I'm a new Python user and I'm trying to learn this so I can complete a research project on cryptocurrencies. What I want to do is retrieve the value right after having found a condition, and retrieve the value 7 rows later into another variable.
I'm working with an Excel spreadsheet which has 2250 rows and 25 columns. By adding 4 columns as detailed just below, I get to 29 columns. It has lots of 0s (where no pattern has been found) and a few 100s (where a pattern has been found). I want my program to get the row right after the one where 100 is present and return its Close Price. That way, I can see the difference between the day of the pattern and the day after the pattern. I also want to do this for seven days down the line, to find the performance of the pattern over a week.
Here's a screenshot of the spreadsheet to illustrate this
You can see -100 cells too; those indicate bearish pattern recognition. For now, I just want to work with the "100" cells so I can at least make this work.
I want this to happen:
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
df['Next Close'] = np.nan_to_num(0) #adding these next four columns to my dataframe so I can fill them up with the later variables#
df['Variation2'] = np.nan_to_num(0)
df['Next Week Close'] = np.nan_to_num(0)
df['Next Week Variation'] = np.nan_to_num(0)
df['Close'].astype(float)
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = np.where(row[7:23] == row[7:23]+1)[0] #(I want this to be the next row after having found the condition)#
        if (row.Index + 7 < len(df)):
            nextweekclose = np.where(row[7:23] == row[7:23]+7)[0] #(I want this to be the 7th row after having found the condition)#
        else:
            nextweekclose = 0
The reason I want these values is to later compare them with these variables:
variation2 = (nextclose - row.Close) / row.Close * 100
nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)
My errors come from the fact that I do not know how to retrieve the row+1 and row+7 values. I have searched high and low all day online and haven't found a concrete way to do this. Whichever idea I come up with gives me either a "can only concatenate tuple (not "int") to tuple" error or an "AttributeError: 'Series' object has no attribute 'close'". The second one I get when I try:
for row in df.itertuples(index=True):
    str(row[7:23])
    if ((row[7:23]) == 100):
        nextclose = df.iloc[row.Index + 1,:].close
        if (row.Index + 7 < len(df)):
            nextweekclose = df.iloc[row.Index + 7,:].close
        else:
            nextweekclose = 0
I would really love some help on this.
Using Jupyter Notebook.
EDIT: FIXED
I have finally succeeded! As often seems to be the case with programming (yeah, I'm new here...), the mistakes were down to my inability to think outside the box. I was convinced a certain part of my code was the problem, when the issues ran deeper than that.
Thanks to BenB and Michael Gardner, I have fixed my code and it is now returning what I wanted. Here it is.
import pandas as pd
import talib
import csv
import numpy as np
my_data = pd.read_excel('candlesticks-patterns-excel.xlsx')
df = pd.DataFrame(my_data)
#Creating my four new columns. In my first message I thought I needed to fill them up
#with 0s (or NaNs) and then fill them up with their respective content later.
#It is actually much simpler to make the operations right now, keeping in mind
#that I need to reference df['Column Of Interest'] every time.
df['Next Close'] = df['Close'].shift(-1)
df['Variation2'] = (((df['Next Close'] - df['Close']) / df['Close']) * 100)
df['Next Week Close'] = df['Close'].shift(-7)
df['Next Week Variation'] = (((df['Next Week Close'] - df['Close']) / df['Close']) * 100)
#The only use of this is for me to have a visual representation of my newly created columns#
print(df)
for row in df.itertuples(index=True):
    if 100 in row[7:23] or -100 in row[7:23]:
        nextclose = df['Next Close']
        if row.Index + 7 < len(df):
            nextweekclose = df['Next Week Close']
        else:
            nextweekclose = 0
        variation2 = (nextclose - row.Close) / row.Close * 100
        nextweekvariation = (nextweekclose - row.Close) / row.Close * 100
        df.append({'Next Close': nextclose, 'Variation2': variation2, 'Next Week Close': nextweekclose, 'Next Week Variation': nextweekvariation}, ignore_index = True)
df.to_csv('gatherinmahdata3.csv')
If I understand correctly, you should be able to use shift to move the rows by the amount you want and then do your conditional calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Close': np.arange(8)})
df['Next Close'] = df['Close'].shift(-1)
df['Next Week Close'] = df['Close'].shift(-7)
df.head(10)
Close Next Close Next Week Close
0 0 1.0 7.0
1 1 2.0 NaN
2 2 3.0 NaN
3 3 4.0 NaN
4 4 5.0 NaN
5 5 6.0 NaN
6 6 7.0 NaN
7 7 NaN NaN
df['Conditional Calculation'] = np.where(df['Close'].mod(2).eq(0), df['Close'] * df['Next Close'], df['Close'])
df.head(10)
Close Next Close Next Week Close Conditional Calculation
0 0 1.0 7.0 0.0
1 1 2.0 NaN 1.0
2 2 3.0 NaN 6.0
3 3 4.0 NaN 3.0
4 4 5.0 NaN 20.0
5 5 6.0 NaN 5.0
6 6 7.0 NaN 42.0
7 7 NaN NaN 7.0
From your update it becomes clear that the first if statement checks that there is the value "100" in your row. You would do that with
if 100 in row[7:23]:
This checks whether the integer 100 is in one of the elements of the tuple containing the columns 7 to 23 (23 itself is not included) of the row.
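To illustrate with a made-up row tuple:
row = (0, 'AAPL', 1.2, 1.3, 1.1, 1.25, 0, 0, 100, 0)  # hypothetical itertuples output
print(100 in row[7:23])  # True: `in` tests each element of the tuple slice
print(row[7:23] == 100)  # False: this compares the whole tuple to an int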
If you look closely at the error messages you get, you see where the problems are:
TypeError: can only concatenate tuple (not "int") to tuple
comes from
nextclose = np.where(row[7:23] == row[7:23]+1)[0]
row is a tuple and slicing it will just give you a shorter tuple to which you are trying to add an integer, as is said in the error message. Maybe have a look at the documentation of numpy.where and see how it works in general, but I think it is not really needed in this case.
This brings us to your second error message:
AttributeError: 'Series' object has no attribute 'close'
This is case sensitive and for me it works if I just capitalize the close to "Close" (same reason why Index has to be capitalized):
nextclose = df.iloc[row.Index + 1,:].Close
You could in principle use the shift method mentioned in the other reply, and I would suggest it for simplicity, but I want to point out some other methods, because I think understanding them is important for working with dataframes:
nextclose = df.iloc[row[0]+1]["Close"]
nextclose = df.iloc[row[0]+1].Close
nextclose = df.loc[row.Index + 1, "Close"]
All of them work and there are probably even more possibilities. I can't really tell you which one is the fastest or whether there are any differences, but they are very commonly used when working with dataframes. Therefore, I would recommend having a closer look at the documentation of the methods you used, and especially at what kind of data type they return. I hope that helps you understand the topic a bit more.
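To make the comparison concrete, here is a minimal sketch on a toy frame (hypothetical data) showing that all three lookups agree; note that iloc is positional while loc is label-based, so they only coincide here because the default RangeIndex makes labels equal to positions:
import pandas as pd
df = pd.DataFrame({'Close': [10.0, 11.5, 12.25]})
row_index = 1
print(df.iloc[row_index + 1]["Close"])  # 12.25: positional row, then column
print(df.iloc[row_index + 1].Close)     # 12.25: same, with attribute access
print(df.loc[row_index + 1, "Close"])   # 12.25: label-based row and column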
I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a fresh column reflecting the business days worked (4 in the above example) for all April leavers (I'm not interested in any other months). So far I have tried:
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However, the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]')
I should add here that I had to change the dates to datetime64[ns] format manually.
Any pointers gratefully received. Thanks!
The issue with your np.where function call is that it passes the entire series df["Termination_Date"] as an argument to np.busday_count. The call fails because busday_count requires its arguments to be in the datetime64[D] format (i.e., values specified only to the day), and the Series cannot be easily converted to that format.
One solution is to write a custom function that only calls np.busday_count on elements that are not NaT, converting those to the datetime64[D] type first. Then you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
DATE_FORMAT = "%d-%m-%Y"
# Reproduce raw data
raw_data = [
    ["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
    ["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
    ["C3", 35, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]
# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
    try:
        spl = s.split("/")
        ds = "%s-%s-%s" % (spl[2], spl[1], spl[0])
    except:
        ds = "NAT"
    return np.datetime64(ds)
for line in raw_data:
    line[2] = parse_raw_dates(line[2])
    line[3] = parse_raw_dates(line[3])  # both date columns need parsing
# Create dataframe
df = pd.DataFrame(
    data=raw_data,
    columns=[
        "Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
        "Salary_Paid", "Multiplier", "Hourly_Pay"],
)
# Create special conversion function
def myfunc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 0
    else:
        return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)
def format_date(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return ""
    else:
        return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
Posting my approach here in case it helps others in the future. Firstly, the code for creating the dataframe:
d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': [35.0, 35.0, 30.0],  # hours as numbers, so the division later works
     'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
     'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more usable format; this is where pd.to_datetime() comes in handy. The adjustment needed was to specify the format:
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect, whereby the dates are correctly represented and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
import datetime as dt  # needed for strptime and timedelta below

def workday_calc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 30.44
    else:
        d = d.astype(str)
        d = dt.datetime.strptime(d, '%Y-%m-%d')
        e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
        return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing the numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of datetime64[D] is mandatory in the first line of the function; you can't use pd.to_datetime() there, because the datetime64[D] format is a prerequisite for calling the np.isnat() function.
However, once past the NaT check, we need to switch back to a string format, which is what the datetime.strptime() function expects.
Using datetime.strptime(), we tell Python that the date is represented in the ISO format, parsing it from its string form. The advantage of both datetime.strptime() and np.busday_count() is that they are built to handle strings.
Also, np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee worked 4 business days in April.
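A quick check of that end-exclusive behaviour:
import numpy as np
print(np.busday_count('2020-04-01', '2020-04-06'))  # 3: counts 1, 2 and 3 April; the 6th itself is excluded
print(np.busday_count('2020-04-01', '2020-04-07'))  # 4: counts 1, 2, 3 and 6 April (the 4th and 5th are a weekend)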
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier, using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46
I'm trying to get real prices for my data in pandas. Right now, I am just playing with one year's worth of data (3,962,050 rows), and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
year quarter fare
0 1994 1 213.98
1 1994 1 214.00
2 1994 1 214.00
3 1994 1 214.50
4 1994 1 214.50
import time
import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust for inflation the series of values in column of the
    dataframe data, using the cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, "seconds to run")
    return df
df['real_fare'] = inflate_column(df, 'fare')
You have multiple rows for each year, so you can just call cpi.inflate once per year, store the results in a dict, and then use the stored value instead of calling cpi.inflate every time.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = ...  # apply here: dict_years[row['year']] * row['fare']
You can fill in the last line using apply, or do it in some other way, like df['real_fare'] = df['fare'] * ...
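For instance, a minimal sketch of that last step using map instead of apply (this assumes the dict_years built above, and relies on cpi.inflate scaling linearly with the amount, so the factor for 1.0 can simply be multiplied by each fare):
# Look up each row's precomputed inflation factor, then scale the fare by it.
df['real_fare'] = df['year'].map(dict_years) * df['fare']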
I am trying to perform the following operation:
pd.concat([A,B], axis = 1).groupby("status_reason")["closing_time"].mean()
Where
A is a Series named "closing_time" (Timedelta values)
B is a Series named "status_reason" (categorical values)
Example:
In : A.head(5)
Out:
0 -1 days +11:35:00
1 -10 days +07:13:00
2 NaT
3 NaT
4 NaT
Name: closing_time, dtype: timedelta64[ns]
In : B.head(5)
Out:
0 Won
1 Canceled
2 In Progress
3 In Progress
4 In Progress
Name: status_reason, dtype: object
The following error occurs:
DataError: No numeric types to aggregate
Please note: I tried to compute the mean even after isolating every single category.
Now, I saw a few questions similar to mine online, so I tried this:
pd.to_timedelta(pd.concat([pd.to_numeric(A),B], axis = 1).groupby("status_reason")["closing_time"].mean())
This simply converts the Timedelta to int64 and vice versa. But the result was quite strange (the numbers were far too high).
In order to investigate the situation, I wrote the following code:
xxx = pd.concat([A,B], axis = 1)
xxx.closing_time.mean()
#xxx.groupby("status_reason")["closing_time"].mean()
The second row WORKS FINE, without converting the Timedelta to int64. The third row DOES NOT work and returns the DataError again.
I'm so confused here! What am I missing?
I would like to see the mean of the "closing times" for each "status reason"!
EDIT
If I try to do this: (Isolate the rows with a specific status without grouping)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy["closing_time"].mean()
The result is:
Timedelta('310 days 21:18:05.454545')
But if I do this: (Isolate the rows with a specific status grouping)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.groupby("status_reason")["closing_time"].mean()
The result is again:
DataError: No numeric types to aggregate
Lastly, if I do this (converting and converting back; let's call this the Special Example):
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.closing_time = pd.to_numeric(yyy.closing_time)
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean())
We go back to the first problem I noticed:
status_reason
In Progress -105558 days +10:08:05.605064
Name: closing_time, dtype: timedelta64[ns]
EDIT2
If I do this: (convert to seconds and convert back)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.closing_time = A.dt.seconds
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean(), unit="s" )
The result is
status_reason
In Progress 08:12:38.181818
Name: closing_time, dtype: timedelta64[ns]
The same result happens if I remove the NaNs, or if I fill them with 0:
yyy = xxx[xxx["status_reason"] == "In Progress"].dropna()
yyy.closing_time = A.dt.seconds
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean(), unit="s" )
BUT the numbers are very different from what we saw in the first edit! (Special Example)
-105558 days +10:08:05.605064
Also, let me run the same code (Special Example) with dropna():
310 days 21:18:05.454545
And again, let's run the same code (Special Example) with fillna(0):
3 days 11:14:22.819472
This is going nowhere. I should probably prepare an export of those data, and post them somewhere: Here we go
From reading the discussion of this issue on GitHub here, you can solve it by specifying numeric_only=False for the mean calculation, as follows:
pd.concat([A,B], axis = 1).groupby("status_reason")["closing_time"] \
.mean(numeric_only=False)
The problem might be that In Progress only has NaT times, which might not be allowed in groupby().mean(). Here's a test:
df = pd.DataFrame({'closing_time':['11:35:00', '07:13:00', np.nan,np.nan, np.nan],
'status_reason':['Won','Canceled','In Progress', 'In Progress', 'In Progress']})
df.closing_time = pd.to_timedelta(df.closing_time)
df.groupby('status_reason').closing_time.mean()
gives the exact error. To overcome this, do:
def custom_mean(x):
    try:
        return x.mean()
    except:
        return pd.to_timedelta([np.nan])
df.groupby('status_reason').closing_time.apply(custom_mean)
which gives:
status_reason
Canceled 07:13:00
In Progress NaT
Won 11:35:00
Name: closing_time, dtype: timedelta64[ns]
I cannot say why groupby's mean() method does not work, but the following slight modification of your code should work: first, convert the timedelta column to seconds with the total_seconds() method, then group by and take the mean, and finally convert the seconds back to a timedelta:
pd.to_timedelta(pd.concat([ A.dt.total_seconds(), B], axis = 1).groupby("status_reason")["closing_time"].mean(), unit="s")
For the example dataframe below, the code
df = pd.DataFrame({'closing_time':['2 days 11:35:00', '07:13:00', np.nan,np.nan, np.nan],'status_reason':['Won','Canceled','In Progress', 'In Progress', 'In Progress']})
df.loc[:,"closing_time"] = \
pd.to_timedelta(df.closing_time).dt.days*24*3600 \
+ pd.to_timedelta(df.closing_time).dt.seconds
# or alternatively use total_seconds() to get total seconds in timedelta as follows
# df.loc[:,"closing_time"] = pd.to_timedelta(df.closing_time).dt.total_seconds()
pd.to_timedelta(df.groupby("status_reason")["closing_time"].mean(), unit="s")
produces
status_reason
Canceled 0 days 07:13:00
In Progress NaT
Won 2 days 11:35:00
Name: closing_time, dtype: timedelta64[ns]
After some investigation, here is what I found:
Most of the confusion comes from the fact that in one case I was calling SeriesGroupBy.mean() and in the other case Series.mean().
These functions are actually different and have different behaviours; I had not realized that.
The second important point is that converting to numeric, or to seconds, leads to a totally different behaviour when it comes to handling NaN values.
To overcome this situation, the first thing you have to do is decide how to handle NaN values. The best approach depends on what we want to achieve. In my case, it's fine to have even a simple categorical result, so I can do something like this:
import datetime

def define_time(row):
    if pd.isnull(row["closing_time"]):
        return "Null"
    elif row["closing_time"] < datetime.timedelta(days=100):
        return "<100"
    elif row["closing_time"] > datetime.timedelta(days=100):
        return ">100"
time_results = pd.concat([A,B], axis = 1).apply(lambda row:define_time(row), axis = 1)
In the end the result is like this:
In :
time_results.value_counts()
Out :
>100 1452
<100 1091
Null 1000
dtype: int64
I have the following two dataframes:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values; I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, and year. I made a test earlier in which a df1.catcode with only one value per cell was matched against the values in another df.catcode that had more than one value per cell, and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method; I'm looking at the documentation now.
Cleaning the catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
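As a side note, on pandas 0.25 or newer, Series.explode does the same un-grouping in one step; a minimal sketch:
# findall returns a list per row; explode repeats the index once per list element.
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()
catcode_fixed.name = 'catcode'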