Issues with replicating results from R to python by writing customised function

I am trying to convert the following R code to Python, either by writing a custom function or without one:
customers_df$segment = "NA"
customers_df$segment[which(customers_df$recency > 365*3)] = "inactive"
customers_df$segment[which(customers_df$recency <= 365*3 & customers_df$recency > 365*2)] = "cold"
customers_df$segment[which(customers_df$recency <= 365*2 & customers_df$recency > 365*1)] = "warm"
customers_df$segment[which(customers_df$recency <= 365)] = "active"
customers_df$segment[which(customers_df$segment == "warm" & customers_df$first_purchase <= 365*2)] = "new warm"
customers_df$segment[which(customers_df$segment == "warm" & customers_df$amount < 100)] = "warm low value"
customers_df$segment[which(customers_df$segment == "warm" & customers_df$amount >= 100)] = "warm high value"
customers_df$segment[which(customers_df$segment == "active" & customers_df$first_purchase <= 365)] = "new active"
customers_df$segment[which(customers_df$segment == "active" & customers_df$amount < 100)] = "active low value"
customers_df$segment[which(customers_df$segment == "active" & customers_df$amount >= 100)] = "active high value"
table(customers_2015$segment)
active high value   active low value               cold           inactive
              573               3313               1903               9158
       new active           new warm    warm high value     warm low value
             1512                938                119                901
Python Function
I tried to replicate the code above in Python by writing functions. However, I do not get the same categories as in R, and the number of rows in each category also differs.
def mang_segment (s):
    if (s['recency'] > 365*3):
        return ("inactive")
    elif (s['recency'] <= 365*3) & (s['recency'] > 365*2):
        return ("cold")
    elif (s['recency'] <= 365*2) & (s['recency'] > 365*1):
        return ("warm")
    elif (s['recency'] <= 365):
        return ("active")
def mang_segment_up (s):
    # if (s['recency'] > 365*3):
    #     return ("inactive")
    # elif (s['recency'] <= 365*3 & s['recency'] > 365*2):
    #     return ("cold")
    # elif (s['recency'] <= 365*2 & s['recency'] > 365*1):
    #     return ("warm")
    if (s['segment'] == "warm") & (s['first_purchase'] <= 365*2):
        return ("new warm")
    elif (s['segment'] == "warm") & (s['amount'] < 100):
        return ("warm low value")
    elif (s['segment'] == "warm") & (s['amount'] >= 100):
        return ("warm high value")
    elif (s['segment'] == "active") & (s['first_purchase'] <= 365):
        return ("new active")
    elif (s['segment'] == "active") & (s['amount'] < 100):
        return ("active low value")
    elif (s['segment'] == "active") & (s['amount'] >= 100):
        return ("active high value")
active low value 19664
warm low value 4083
active high value 3374
new active 1581
new warm 980
warm high value 561
Any pointer/suggestion would be appreciated.
Thanks in advance

I am a little confused about the purpose of the function (and whether it is working as you expect). If you want to mimic your R code within a function, your syntax can stay much closer to the original code than it currently does. Assuming you are using pandas/numpy:
import numpy as np
import pandas as pd
# toy example
s = pd.DataFrame({'rec' : [2000, 1500, 3000, 750]})
def mang_segment (s):
    s.loc[(s['rec'] > 365*3), 'seg'] = "inactive" # creating a new column seg with our values
    s.loc[(s['rec'] <= 365*3) & (s['rec'] > 365*2), 'seg'] = "cold"
    # etc...
# run function
mang_segment(s)
# get value counts
s['seg'].value_counts()
Here we add a column to our dataframe to capture our values, which we can later summarize. This differs from your functions, which return a value: even when they work and are called correctly, the returned value is not written to your data frame unless you assign it to a column yourself.
There are other functions and ways to get at this, too. Check out np.where as another option.
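As a rough illustration of the vectorised route, here is a minimal sketch using np.select, a close relative of np.where that takes a list of conditions and picks the first one that matches. The column names recency, first_purchase and amount come from the question; the function name add_segment and the toy data are made up for this example, and the sketch assumes there are no missing values. Because the R code overwrites labels step by step, the conditions are listed in the order of precedence that the overwrites produce ("new warm" wins over the value-based warm labels, and so on).

```python
import numpy as np
import pandas as pd

def add_segment(df):
    """Return a copy of df with a 'segment' column (hypothetical helper)."""
    rec = df["recency"]
    warm = (rec > 365) & (rec <= 365 * 2)
    active = rec <= 365
    # np.select uses the FIRST condition that is True for each row
    conditions = [
        rec > 365 * 3,
        (rec > 365 * 2) & (rec <= 365 * 3),
        warm & (df["first_purchase"] <= 365 * 2),
        warm & (df["amount"] < 100),
        warm & (df["amount"] >= 100),
        active & (df["first_purchase"] <= 365),
        active & (df["amount"] < 100),
        active & (df["amount"] >= 100),
    ]
    labels = [
        "inactive", "cold",
        "new warm", "warm low value", "warm high value",
        "new active", "active low value", "active high value",
    ]
    out = df.copy()
    out["segment"] = np.select(conditions, labels, default="NA")
    return out

# toy data, one row per expected segment
toy = pd.DataFrame({
    "recency":        [2000, 900, 500, 500, 500, 100, 100, 100],
    "first_purchase": [2500, 1200, 600, 1500, 1500, 200, 800, 800],
    "amount":         [50, 50, 50, 50, 150, 50, 50, 150],
})
segmented = add_segment(toy)
print(segmented["segment"].value_counts())
```

Unlike the row-wise functions in the question, this never leaves "cold" or "inactive" rows without a label, which may be one reason the counts differed.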

Related

How do I create a time series with 15min buckets in pyspark?

I'm trying to create a report that shows the total number of minutes worked by a group of employees in 15 minute increments.
The source table has the time in/out and total minutes worked, one record for each employee.
I've created a row-wise RDD mapping function that loops through the hours in a day, with an inner loop for each 15-minute increment.
Each loop should add a column to the RDD row dictionary.
I've confirmed the resulting schema contains these new columns, but I'm missing lots of data in the final output.
I'm not sure if it's a problem with the row iteration or the stacking.
This is the starting schema -
Any ideas?
final schema -
Updated code -
def create_time_block_columns(row_dict):
    inhour = row_dict['inhour']
    outhour = row_dict['outhour']
    inminute = row_dict['inminute']
    outminute = row_dict['outminute']
    # loop through hours of day
    for i in range(24):
        # loop through quarter hour blocks
        for j in range(1,5):
            lowerBound = (j-1)*15
            upperBound = j*15
            # create column names like 't_0_0', 't_0_15', 't_0_30', 't_0_45', 't_1_0', etc...
            timeBlockColumnName = F't_{i}_{lowerBound}'
            # Add a new key in the dictionary with the new column name and value.
            # initialized to 0
            row_dict[timeBlockColumnName] = 0
            # if the employee was currently clocked in
            if (inhour <= i) & (outhour >= i):
                # if the inhour is the current time block hour and the outhour is in a future time block
                # this means they worked the rest of the hour
                # start_during_end_after
                if (i == inhour) & (outhour > i):
                    if (inminute >= lowerBound):
                        row_dict[timeBlockColumnName] = (upperBound - inminute)
                    else:
                        row_dict[timeBlockColumnName] = 15
                # if the current row is completely within the current time block [hour and minutes]
                # this means they worked all 15 minutes of each hour quarter
                elif (i < inhour) & (i > outhour):
                    row_dict[timeBlockColumnName] = 15
                # if the inhour is before the current timeblock hour, and outhour is the current hour
                # this means they worked all minutes in the current block up-to the outminute
                elif (i < inhour) & (i == outhour):
                    if (outminute < lowerBound):
                        row_dict[timeBlockColumnName] = outminute - lowerBound
                    else:
                        row_dict[timeBlockColumnName] = 15
                # if the inhour and outhour are the current timeblock hour, and they are the same hour,
                # we'll calculate the difference between minutes
                elif (i == inhour) & (i == outhour):
                    if (inminute >= lowerBound) & (outminute <= upperBound):
                        row_dict[timeBlockColumnName] = outminute - inminute
                    elif (inminute < lowerBound) & (outminute >= upperBound):
                        row_dict[timeBlockColumnName] = 15
                    elif (inminute >= lowerBound) & (outminute >= upperBound):
                        row_dict[timeBlockColumnName] = upperBound - inminute
                    elif (inminute < lowerBound) & (outminute <= upperBound):
                        row_dict[timeBlockColumnName] = outminute - lowerBound
            # else: we don't do anything because the employee wasn't clocked in
    return row_dict
mappedDF = Map.apply(frame = dyF, f = create_time_block_columns).toDF()
# output some interesting logs for debugging
mappedDF.printSchema()
# Build expression to stack new columns as rows
stack_expression = F"stack({24*4}"
for i in range(24):
    for j in range(1,5):
        stack_expression += F", 't_{i}_{(j-1)*15}', t_{i}_{(j-1)*15}"
stack_expression += ') as (time_block, minutes_worked)'
timeBlockDF = mappedDF.select('pos_key', 'p_dob', 'dob', 'employee', 'rate', 'jobcode', 'pay', 'overpay', 'minutes', F.expr(stack_expression))
timeBlockDF = timeBlockDF.filter('minutes_worked > 0') \
    .withColumn("dob",F.col("dob").cast(DateType()))
# create time block identifier column
time_pattern = r't_(\d+)_(\d+)'
timeBlockDF = timeBlockDF.withColumn('time_block_hour', F.regexp_extract('time_block', time_pattern, 1).cast(IntegerType())) \
    .withColumn('time_block_min', F.regexp_extract('time_block', time_pattern, 2).cast(IntegerType())) \
    .drop('time_block') \
    .withColumn('time_block_time', F.concat_ws(':', F.format_string("%02d", F.col('time_block_hour')), F.format_string("%02d", F.col('time_block_min')))) \
    .withColumn('time_block_temp', F.concat_ws(' ', F.col('dob'), F.col('time_block_time'))) \
    .withColumn('time_block_datetime', F.to_timestamp(F.col('time_block_temp'), 'yyyy-MM-dd HH:mm')) \
    .withColumn('time_block_pay', ((F.col('pay') + F.col('overpay')) / F.col('minutes')) * F.col('minutes_worked')) \
    .drop('time_block_temp', 'pay', 'overpay', 'minutes')
# output some interesting logs for debugging
timeBlockDF.printSchema()

The problem was with the udf.
There were several cases not handled by the conditions, but the stack expression was working fine.
Here is a working example [without considering shifts that span midnight].
from pyspark.sql import functions as F
from pyspark.sql.types import DateType
from awsglue.transforms import Map  # AWS Glue; dyF below is the source DynamicFrame

def create_time_block_columns(row_dict):
    inhour = row_dict['inhour']
    outhour = row_dict['outhour']
    inminute = row_dict['inminute']
    outminute = row_dict['outminute']
    # loop through hours of day
    for i in range(24):
        # loop through quarter hour blocks
        for j in range(1,5):
            lowerBound = (j-1)*15
            upperBound = j*15
            # create column names like 't_0_0', 't_0_15', 't_0_30', 't_0_45', 't_1_0', etc...
            timeBlockColumnName = F't_{i}_{lowerBound}'
            # Add a new key in the dictionary with the new column name and value.
            # initialized to 0
            row_dict[timeBlockColumnName] = 0
            # if the employee was currently clocked in
            if (inhour <= i) & (outhour >= i):
                # if the inhour is the current time block hour and the outhour is in a future time block
                # this means they worked the rest of the hour
                # start_during_end_after
                if (i == inhour) & (outhour > i):
                    if (inminute >= lowerBound):
                        row_dict[timeBlockColumnName] = (upperBound - inminute)
                    else:
                        row_dict[timeBlockColumnName] = 15
                # if the current hour lies strictly between the inhour and the outhour
                # this means they worked all 15 minutes of each hour quarter
                elif (inhour < i) & (i < outhour):
                    row_dict[timeBlockColumnName] = 15
                # if the inhour is before the current timeblock hour, and outhour is the current hour
                # this means they worked all minutes in the current block up-to the outminute
                elif (inhour < i) & (i == outhour):
                    if (outminute < upperBound):
                        # partial block; values <= 0 (block starts after clock-out) are filtered out later
                        row_dict[timeBlockColumnName] = outminute - lowerBound
                    else:
                        row_dict[timeBlockColumnName] = 15
                # if the inhour and outhour are the current timeblock hour, and they are the same hour,
                # we'll calculate the difference between minutes
                elif (i == inhour) & (i == outhour):
                    if (inminute >= lowerBound) & (outminute <= upperBound):
                        row_dict[timeBlockColumnName] = outminute - inminute
                    elif (inminute < lowerBound) & (outminute >= upperBound):
                        row_dict[timeBlockColumnName] = 15
                    elif (inminute >= lowerBound) & (outminute >= upperBound):
                        row_dict[timeBlockColumnName] = upperBound - inminute
                    elif (inminute < lowerBound) & (outminute <= upperBound):
                        row_dict[timeBlockColumnName] = outminute - lowerBound
            # else: we don't do anything because the employee wasn't clocked in
    return row_dict
mappedDF = Map.apply(frame = dyF, f = create_time_block_columns).toDF()
# output some interesting logs for debugging
mappedDF.printSchema()
# Build expression to stack new columns as rows
stack_expression = F"stack({24*4}"
for i in range(24):
    for j in range(1,5):
        stack_expression += F", 't_{i}_{(j-1)*15}', t_{i}_{(j-1)*15}"
stack_expression += ') as (time_block, minutes_worked)'
timeBlockDF = mappedDF.select('pos_key', 'p_dob', 'dob', 'employee', 'rate', 'jobcode', 'pay', 'overpay', 'minutes', F.expr(stack_expression))
# negative or zero values mean the block lies outside the shift, so drop them
timeBlockDF = timeBlockDF.filter('minutes_worked > 0') \
    .withColumn("dob",F.col("dob").cast(DateType()))
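The branch-by-branch conditions are easy to get wrong, so it may help to compare them against a simpler formulation. The sketch below is not the code from this answer: it is a plain-Python alternative that converts clock-in and clock-out to minutes since midnight and computes the overlap of the shift with each 15-minute block. The function name minutes_per_block is hypothetical, the key format mirrors the 't_{hour}_{minute}' column names used above, and, like the answer, it assumes the shift does not span midnight.

```python
def minutes_per_block(inhour, inminute, outhour, outminute, block=15):
    """Map 't_{hour}_{minute}' block names to minutes worked in that block.

    Hypothetical helper; assumes clock-out is on the same day as clock-in.
    """
    start = inhour * 60 + inminute   # clock-in, minutes since midnight
    end = outhour * 60 + outminute   # clock-out, minutes since midnight
    worked = {}
    for block_start in range(0, 24 * 60, block):
        block_end = block_start + block
        # length of the intersection of [start, end) and [block_start, block_end)
        overlap = min(end, block_end) - max(start, block_start)
        if overlap > 0:
            name = f"t_{block_start // 60}_{block_start % 60}"
            worked[name] = overlap
    return worked

# a shift from 09:20 to 11:40 is 140 minutes long
blocks = minutes_per_block(9, 20, 11, 40)
print(sum(blocks.values()))
```

A function like this could serve as a reference when checking the output of the row-mapping function on a few sample shifts, since the per-block minutes should always add up to the shift length.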

Returns None instead of False

So I'm working on a question on CodingBat, a website that provides JS and Python practice problems, and I've encountered an unexpected output. Here's the link to the question: https://codingbat.com/prob/p135815 . In theory my code should return False, but it returns None when I run print(squirrel_play(50, False)).
Code:
def squirrel_play(temp, is_summer):
    if is_summer:
        if temp <= 100:
            if temp >= 60:
                return True
        elif temp <= 60:
            return False
        elif temp >= 100:
            return False
    if not is_summer:
        if temp <= 90:
            if temp >= 60:
                return True
        elif temp >= 90:
            return False
        elif temp <= 60:
            return False
When I run that with print(squirrel_play(50, False)), I get None, but I should get False.
Why?

With is_summer set to False, you're in the second conditional block:
if not is_summer:
    if temp <= 90:
        if temp >= 60:
            return True
    elif temp >= 90:
        return False
    elif temp <= 60:
        return False
Now follow this block. Is the temp less than or equal to 90? Yes, so now we're in this block:
    if temp <= 90:
        if temp >= 60:
            return True
What is happening here is that you never get to the elif temp <= 60 because you are in the first conditional instead. You could only ever get to the elif below if you didn't satisfy the first condition.
At the end of this if temp <= 90 block the entire conditional chain ends and your function returns the default value of None because you didn't provide another return value.
You can maybe more clearly see this by making the entire code read:
def squirrel_play(temp, is_summer):
    if is_summer:
        if temp <= 100:
            if temp >= 60:
                return True
        elif temp <= 60:
            return False
        elif temp >= 100:
            return False
    if not is_summer:
        if temp <= 90:
            if temp >= 60:
                return True
            else:
                return "This is where I'm returning with 50 and False as my parameters"
        elif temp >= 90:
            return False
        elif temp <= 60:
            return False

Did you try to debug it?
With squirrel_play(50, False) it will fall into:
def squirrel_play(temp, is_summer):
    if is_summer:
        if temp <= 100:
            if temp >= 60:
                return True
        elif temp <= 60:
            return False
        elif temp >= 100:
            return False
    if not is_summer:
        if temp <= 90:
            if temp >= 60:
                return True
            # HERE ( 50 is less than 90 but not greater than 60 )
            # and you have no return statement for this case
        elif temp >= 90:
            return False
        elif temp <= 60:
            return False

If you don't return a value from a Python function, None is returned by default. Because you are using elif statements, once the clause if temp <= 90: (inside if not is_summer:) is entered, the final clause elif temp <= 60 is never reached. The function therefore gets past all of the if/elif statements without returning a value, and returns None.
A simple solution is to replace all of the elifs with ifs. Then print(squirrel_play(50, False)) returns False (for me at least).
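To make the default None visible in isolation, here is a small sketch that keeps only the non-summer branch. The function names falls_through and with_fallback are invented for this illustration; the second one shows one possible way (not the only one) to guarantee a boolean result by ending with an unconditional return.

```python
def falls_through(temp):
    if temp <= 90:
        if temp >= 60:
            return True
        # temp below 60 ends up here: no return statement is executed
    elif temp <= 60:
        # never reached for temp <= 90, because the first `if` already matched
        return False

def with_fallback(temp):
    if temp <= 90:
        if temp >= 60:
            return True
    # every path that did not return True lands here
    return False

print(falls_through(50))   # implicit None
print(with_fallback(50))   # explicit False
```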

The way that you have currently coded it, in your
if temp <= 90:
    if temp >= 60:
        return True
elif ....
if the first if test evaluates True but the second one evaluates False, then no return statement is reached (bear in mind that the subsequent elif tests are not performed because the first if evaluated true), and the function therefore returns None.
In fact you can simplify the function making use of chained comparison operators:
def squirrel_play(temp, is_summer):
    if is_summer:
        return 60 <= temp <= 100
    else:
        return 60 <= temp <= 90

I keep getting None

My code:
def get_feedback(mark, out_of):
    percentage = int((mark / out_of) * 100)
    if percentage >= 80:
        print("Excellent")
    if 60 < percentage < 70:
        print("Good")
    if 50 < percentage < 59:
        print("Pass")
    if percentage < 50:
        print("Not a pass")
I know I have to use a return statement somewhere but I'm not really sure how it works or when to use it. If anyone could help, that would be great thank you!

def get_feedback(mark, out_of):
    percentage = int((mark / out_of) * 100)
    remark = ''
    if percentage >= 80:
        remark = "Excellent"
    elif 60 <= percentage <= 79:
        remark = "Good"
    elif 50 <= percentage <= 59:
        remark = "Pass"
    else:  # percentage < 50 (an else clause takes no condition)
        remark = "Not a pass"
    return remark
Some suggestions:
I believe you need inclusive ranges, so use <= instead of <.
Once one condition is satisfied, there is no need to check the rest. So instead of using a separate if for every check, use an if/elif/else chain.
Also, your question says the range for the grade 'Good' is 60 to 79, but your code only checks up to 70.

Use return in place of print, for example: return "Excellent".

Another way to do it :
def get_feedback(mark, out_of):
    percentage = int((mark / out_of) * 100)
    if percentage >= 80:
        return "Excellent"
    elif 60 <= percentage <= 79:
        return "Good"
    elif 50 <= percentage <= 59:
        return "Pass"
    else:
        return "Not a pass"

You can store the result in a variable and return it at the end.
def get_feedback(mark, out_of):
    percentage = int((mark / out_of) * 100)
    if percentage >= 80:
        feedback = "Excellent"
    elif 60 <= percentage <= 79:
        feedback = "Good"
    elif 50 <= percentage <= 59:
        feedback = "Pass"
    else:
        feedback = "Not a pass"
    return feedback
At the very end, we use return to hand the result of the function back to the caller. Also, notice that I used elif statements: once one condition matches, the remaining ones are skipped, whereas a series of separate if statements checks every condition.
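The difference between printing and returning is easiest to see from the caller's side. In this small sketch (the two function names are made up for the comparison), the first function only prints, so the value the caller receives is None; the second returns the string, so the caller can store or print it.

```python
def feedback_printed(percentage):
    # shows the text on screen, but hands nothing back to the caller
    if percentage >= 80:
        print("Excellent")

def feedback_returned(percentage):
    # hands the text back; the caller decides what to do with it
    if percentage >= 80:
        return "Excellent"
    return "Not excellent"

printed = feedback_printed(85)     # prints "Excellent" as a side effect
returned = feedback_returned(85)

print(printed)    # the function had no return statement, so this is None
print(returned)
```

This is why wrapping a print-only function in print(...) shows None after the expected text.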

Forwarding variables between functions

I'm trying to do a calculation in the following code:
def traffic_intensity(count):
    """Number of cars passing by"""
    int(count)
    if count < 5000:
        level = "Very Low"
    elif 5000 <= count < 10000:
        level = "Low"
    elif 10000 <= count < 18000:
        level = "Moderate"
    elif count >= 18000:
        level = "High"
    return level
def number_of_busy_days(counts):
    """Busy days based on traffic flow"""
    daily_counts = 0
    for count in counts:
        if traffic_intensity(level) == "Moderate" or "High":
            daily_counts = daily_counts + 1
    return daily_counts
counts = [18000, 10000, 500, 9999, 12000]
print(number_of_busy_days(counts))
What I'm trying to achieve is to use the traffic_intensity function to calculate the number of busy days - a busy day is defined as having more than the "Moderate" amount of traffic given in the traffic_intensity function. I've tried a lot of different ways but I'm running out of ideas at this point.
The problem I'm encountering is that it doesn't find the level variable from the first function. I get the following error:
if traffic_intensity(level) == "Moderate" or "High":
NameError: name 'level' is not defined
Is anyone able to help me? Thanks! ^_^

You were passing a variable named level into the function traffic_intensity, but level only exists inside that function, not in number_of_busy_days. I presume you meant to pass in the variable count.
Likewise, a condition of the form x == "Moderate" or "High" is always true, since a non-empty string is truthy in Python.
def traffic_intensity(count):
    """Number of cars passing by"""
    count = int(count)  # int() returns a new value, so assign it back
    if count < 5000:
        level = "Very Low"
    elif 5000 <= count < 10000:
        level = "Low"
    elif 10000 <= count < 18000:
        level = "Moderate"
    elif count >= 18000:
        level = "High"
    return level
def number_of_busy_days(counts):
    """Busy days based on traffic flow"""
    daily_counts = 0
    for count in counts:
        state = traffic_intensity(count)
        if state == "Moderate" or state == "High":
            daily_counts = daily_counts + 1
    return daily_counts
counts = [18000, 10000, 500, 9999, 12000]
print(number_of_busy_days(counts))
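The `or "High"` pitfall can be checked directly. In the sketch below (variable names are illustrative only), the expression from the question is evaluated for a state that is neither "Moderate" nor "High"; it still produces a truthy value, because `or` returns its second operand when the first is false. A membership test with `in` is a common, more compact alternative to spelling out both comparisons.

```python
state = "Low"

# parsed as (state == "Moderate") or "High": the comparison is False,
# so the whole expression evaluates to the non-empty string "High"
pitfall = state == "Moderate" or "High"

# compares state against each allowed value
membership = state in ("Moderate", "High")

print(repr(pitfall), bool(pitfall))
print(membership)
```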

Using for/while loop in python to generate variables and do computation

I want to know whether I can use a loop in Python to turn the following into a single block of code that gives me best_result1_conf, best_result2_conf and best_result3_conf.
if best_score1 == '1.0':
    best_result1_conf='High'
elif best_score1 > '0.85' and best_score1 < '1.0':
    best_result1_conf='Medium'
else: best_result1_conf='Low'
if best_score2 == '1.0':
    best_result2_conf='High'
elif best_score2 > '0.85' and best_score2 < '1.0':
    best_result2_conf='Medium'
else: best_result2_conf='Low'
if best_score3 == '1.0':
    best_result3_conf='High'
elif best_score3 > '0.85' and best_score3 < '1.0':
    best_result3_conf='Medium'
else: best_result3_conf='Low'

As a function:
def s_to_r(s):
    if 0.85 < s < 1.0:
        return "Medium"
    elif s == 1.0:
        return "High"
    else:
        return "Low"
results = [s_to_r(score) for score in [best_score1, best_score2, best_score3] ]
Though usually this is when I'd like to introduce OOP. I'm assuming these scores BELONG to someone, so maybe:
class Competitor(object):
    def __init__(self, name):
        self.name = name
        self.scores = list()
    def addScore(self,score):
        self.scores.append(score)
    def _getScoreValue(self,index):
        score = self.scores[index]
        if score <= 0.85:
            return "Low"
        elif 0.85 < score < 1.0:
            return "Medium"
        else:
            return "High"
    def getScore(self,index):
        # methods must be called through self
        return {"score":self.scores[index],"value":self._getScoreValue(index)}
This will let you do things like:
import random
competitors = [Competitor("Adam"),Competitor("Steven"),Competitor("George"),
               Competitor("Charlie"),Competitor("Bob"),Competitor("Sally")]
# generating test data
for competitor in competitors:
    for _ in range(5):
        competitor.addScore(round(random.random(),2))
# printing the results
for competitor in competitors:
    for i,score in enumerate(competitor.scores):
        if i==0: name = competitor.name
        else: name = ""
        print("{name:20}{scoredict[score]:<7}{scoredict[value]}".format(name=name,
              scoredict=competitor.getScore(i)))

You can use lists.
best_scores = [1.0, 0.9, 0.7]
results = []
for score in best_scores:
    if score == 1.0:
        results.append('High')
    elif score > 0.85 and score < 1.0:
        results.append('Medium')
    else:
        results.append('Low')
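One detail worth noting: the code in the question compares against quoted values such as '1.0' and '0.85', which suggests the scores may be strings, and strings compare character by character rather than numerically. The sketch below is one possible way to handle that, assuming the scores can be converted with float(); the helper name confidence and the sample values are invented for the example. It also keeps the three results in a dictionary keyed by the variable names from the question instead of creating three separate variables.

```python
def confidence(score):
    """Classify a score given as a float or a numeric string (hypothetical helper)."""
    value = float(score)  # '0.9' and 0.9 are treated the same
    if value == 1.0:
        return "High"
    if 0.85 < value < 1.0:
        return "Medium"
    return "Low"

# sample string scores standing in for best_score1..best_score3
best_scores = {
    "best_result1_conf": "1.0",
    "best_result2_conf": "0.9",
    "best_result3_conf": "0.7",
}
confs = {name: confidence(score) for name, score in best_scores.items()}
print(confs)
```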

Using a function:
def find_result(result):
    if result > 1.0:
        print("Cannot calculate")
        word_result = None  # no label for out-of-range scores
    elif result == 1.0:
        word_result = 'High'
    elif result > 0.85:
        word_result = 'Medium'
    else:
        word_result = 'Low'
    return word_result
best_result1_conf = find_result(best_score1)
best_result2_conf = find_result(best_score2)
best_result3_conf = find_result(best_score3)
Hope this helps!
