I have a function that operates on rows of a pandas DataFrame. It works with pandas.apply() but it does not work with np.vectorize(). The function is below:
def AMTTL(inputData, amortization=[]):
    rate = inputData['EIR']
    payment = inputData['INSTALMENT']
    amount = inputData['OUTSTANDING']
    amortization = [amount]  # start the schedule with the current outstanding amount
    if amount - payment <= 0:
        return amortization
    else:
        while amount > 0:
            amount = BALTL(rate, payment, amount)  # BALTL (defined elsewhere) computes the remaining balance
            if amount <= 0:
                continue
            amortization.append(amount)
        return amortization
The function receives inputData as a row of the pandas DataFrame; EIR, INSTALMENT and OUTSTANDING are column names. The function works well with pandas.apply():
data.apply(AMTTL, axis = 1)
However, it does not work when I try np.vectorize() with the code below:
vfunc = np.vectorize(AMTTL)
vfunc(data)
It raised an error like 'Timestamp' object is not subscriptable. So I tried dropping the other, unused columns, but then I got another error: invalid index to scalar variable.
I am not sure how to adapt the pandas.apply() call to np.vectorize().
Any suggestion? Thank you in advance.
np.vectorize is nothing more than a map function applied element-wise over the input arrays, meaning you cannot differentiate between the columns within the function. It has no idea of column names like EIR or INSTALMENT, so your current implementation will not work with numpy.
From the docs:
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
Based on your problem, you should try np.apply_along_axis instead, where you can refer to the different columns by their indexes.
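As an illustration, here is a minimal sketch of that index-based approach, assuming the three columns are selected in the order EIR, INSTALMENT, OUTSTANDING and that data and BALTL are the same objects as in the question. Because AMTTL returns lists of varying length, np.apply_along_axis may reject the ragged output, so the sketch also shows a plain loop over the underlying NumPy array:

import numpy as np

def AMTTL_row(values):
    # values is a plain ndarray row; assumed order: 0 = EIR, 1 = INSTALMENT, 2 = OUTSTANDING
    rate, payment, amount = values[0], values[1], values[2]
    amortization = [amount]
    if amount - payment <= 0:
        return amortization
    while amount > 0:
        amount = BALTL(rate, payment, amount)  # BALTL as defined in your own code
        if amount > 0:
            amortization.append(amount)
    return amortization

arr = data[['EIR', 'INSTALMENT', 'OUTSTANDING']].to_numpy()
# np.apply_along_axis(AMTTL_row, 1, arr) also works when every row yields the same
# number of payments; for ragged results a plain loop over the rows is more robust:
schedules = [AMTTL_row(row) for row in arr]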
I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply by a value two orders of magnitude smaller, the multiplication works, and a subsequent multiplication by 100 (to arrive at the value intended in the first place) also works:
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication with the 7-digit number works directly?
I am using Python 3.8. The numbers in the dataframe are float32, in a range between 5.0e-5 and -5.0e-5, with some values having an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method likely raises a SettingWithCopyWarning: the chained indexing SLICE["Var"]["Jan"] can operate on a temporary copy, so the assignment never reaches the actual dataframe. Use .loc with the full column tuple instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
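As a rough illustration of the difference on a small made-up frame with the same two-level column layout (the values and shape here are placeholders):

import numpy as np
import pandas as pd

# Hypothetical frame mimicking the unstacked SLICE: two-level columns, float32 values
cols = pd.MultiIndex.from_product([["Var"], ["Jan", "Feb"]])
SLICE = pd.DataFrame(np.full((3, 2), 1e-6, dtype="float32"), columns=cols)

# Chained assignment may operate on a temporary copy and be silently lost
SLICE["Var"]["Jan"] = 2678400 * SLICE["Var"]["Jan"]

# Assigning through .loc with the full column tuple always targets the original frame
SLICE.loc[:, ("Var", "Jan")] = SLICE.loc[:, ("Var", "Jan")] * 2678400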
This question is related to the question I posted yesterday, which can be found here.
So, I went ahead and implemented the solution provided by Jan to the entire data set. The solution is as follows:
import re
def is_probably_english(row, threshold=0.90):
    regular_expression = re.compile(r'[-a-zA-Z0-9_ ]')
    ascii = [character for character in row['App'] if regular_expression.search(character)]
    quotient = len(ascii) / len(row['App'])
    passed = True if quotient >= threshold else False
    return passed
google_play_store_is_probably_english = google_play_store_no_duplicates.apply(is_probably_english, axis=1)
google_play_store_english = google_play_store_no_duplicates[google_play_store_is_probably_english]
So, from what I understand, we are filtering the google_play_store_no_duplicates DataFrame using the is_probably_english function and storing the result, which is a boolean, into another DataFrame (google_play_store_is_probably_english). The google_play_store_is_probably_english is then used to filter out the non-English apps in the google_play_store_no_duplicates DataFrame, with the end result being stored in a new DataFrame.
Does this make sense and does it seem like a sound way to approach the problem? Is there a better way to do this?
This makes sense, and I think this is the best way to do it. The result of the function is a boolean, as you said, so applying it row-wise gives you a pd.Series of booleans, which is usually called a boolean mask. This concept is very useful in pandas whenever you want to filter rows by some condition.
Here is an article about boolean masks in pandas.
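For illustration, here is the same pattern on a tiny made-up frame, using str.isascii() instead of the regex threshold just to keep the sketch short:

import pandas as pd

apps = pd.DataFrame({"App": ["Notes", "日記帳", "Calendar"], "Rating": [4.2, 4.7, 3.9]})

# Applying a row-wise function that returns booleans yields a boolean mask (a pd.Series)
mask = apps.apply(lambda row: row["App"].isascii(), axis=1)

# The mask then keeps only the rows where it is True
english_apps = apps[mask]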
I am trying to find the average CTR for a set of emails which I would like to categorize by the time that they were sent in order to determine if the CTR is affected by the time they were sent. But for some reason, pandas just doesn't want to let me find the mean of the CTR values.
As you'll see below, I have tried using the mean function to find the mean of the CTR for each of the times, but I continually get the error:
DataError: No numeric types to aggregate
This to me would imply that my CTR figures are not integers or floats, but are instead strings. However, though they came in as strings, I have already converted them to floats. I know this too because if I use the sum() function in lieu of the average function, it works just fine.
The line of code is very simple:
df.groupby("TIME SENT", as_index=False)['CTR'].mean()
I can't imagine why the sum function would work and the mean function would fail, especially if the error is the one described above. Anyone got any ideas?
EDIT: The code I used to turn the CTR column from a string percentage (e.g. 85.8%) to a float:
i = 0
for index, row in df.iterrows():
    df.loc[i, "CTR"] = float(row['CTR'].strip('%'))/100
    i += 1
Link to df.head() : https://ethercalc.org/zw6xmf2c7auw
df['CTR']= (df['CTR'].str.strip('%').astype('float'))/100
The above code strips the % from the CTR column and then casts it to float, so the whole column ends up with a numeric dtype; you can then do your groupby. The loop in your edit converted each cell individually, which leaves the column's dtype as object: groupby().sum() still works on object columns (it falls back to Python-level addition), but mean() refuses to aggregate non-numeric dtypes, hence the error.
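A small sketch of that fix on made-up data (column names follow the question, the values are placeholders):

import pandas as pd

df = pd.DataFrame({
    "TIME SENT": ["09:00", "09:00", "14:00"],
    "CTR": ["85.8%", "12.4%", "30.0%"],
})

# Converting the whole column at once gives it a numeric dtype...
df["CTR"] = df["CTR"].str.strip("%").astype("float") / 100
print(df["CTR"].dtype)  # float64

# ...so mean() has numeric types to aggregate
print(df.groupby("TIME SENT", as_index=False)["CTR"].mean())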
For my thesis I need the implied volatility of options, and I have already created the following function for it:
import numpy as np
from scipy.stats import norm

# Implied volatility solver
def Implied_Vol_Solver(s_t, K, t, r_f, option, step_size):
    # s_t = current stock price, K = strike price, t = time until maturity,
    # r_f = risk-free rate, option = option price, step_size = precision of the search
    # sigma starts at step_size so the search begins one step above zero
    sigma = step_size
    while sigma < 1:
        # Regular Black-Scholes formula (currently only call options; will also be used for puts)
        d_1 = (np.log(s_t / K) + (r_f + (sigma ** 2) / 2) * t) / (sigma * np.sqrt(t))
        d_2 = d_1 - np.sqrt(t) * sigma
        P_implied = s_t * norm.cdf(d_1) - K * np.exp(-r_f * t) * norm.cdf(d_2)
        if option - P_implied < step_size:
            # convert step_size to a string to find the decimal point (couldn't be done with a float)
            step_size = str(step_size)
            # round sigma to the precision of the step size
            return round(sigma, step_size[::-1].find('.'))
        sigma += step_size
    return "Could not find the right volatility"
The variables I need are located in a Pandas DataFrame and I already created a loop for it, to test if it works (I will add the other variables when it works correctly):
for x in df_option_data['Settlement_Price']:
    df_option_data['Implied_Volatility'] = Implied_Vol_Solver(100, 100, 1, 0.01, x, 0.001)
However, when I run this loop I get 0.539 for the whole Implied_Volatility column, and these numbers need to be different. What am I doing wrong? Or is there an easier solution?
I also tried the following:
df_option_data['Implied_Volatility']=Implied_Vol_Solver(100,100,1,0.01,np.array(df_option_data['Settlement_Price']),0.001)
But then I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Essentially what I need is the following: A dataframe with 5 columns for the input variables and 1 column with the output variables (the implied volatility) which is calculated by the function.
On every iteration of your loop you are assigning the single result from Implied_Vol_Solver to the entire column rather than to a specific cell, so the column ends up filled with the value computed for the last price.
Try the following:
df_option_data['Implied_Volatility'] = df_option_data['Settlement_Price'].apply(lambda x: Implied_Vol_Solver(100,100,1,0.01,x,0.001))
The apply function can apply a function to all the elements in a data column so that you don't need to do the for loop yourself.
Instead of having the input variables passed into the function, you could pass in the row (as a series) and pluck off the values from that. Then, use an apply function to get your output frame. This would look something like this:
def Implied_Vol_Solver(row):
    s_t = row['s_t']  # or whatever the column is called in the dataframe
    k = row['k']      # and so on, then leave the rest of your logic as is
Once you've modified the function, you can use it like this:
df_option_data['Implied_Volatility'] = df_option_data.apply(Implied_Vol_Solver, axis=1)
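An alternative to editing the solver itself is a small wrapper that plucks the values from the row and forwards them to the original function. Assuming the input columns are named s_t, K, t, r_f and Settlement_Price (rename to match your actual dataframe), that could look like this:

def Implied_Vol_Solver_row(row, step_size=0.001):
    # Column names below are assumptions; adjust them to your dataframe
    return Implied_Vol_Solver(
        row['s_t'],               # current stock price
        row['K'],                 # strike price
        row['t'],                 # time until maturity
        row['r_f'],               # risk-free rate
        row['Settlement_Price'],  # observed option price
        step_size,
    )

df_option_data['Implied_Volatility'] = df_option_data.apply(Implied_Vol_Solver_row, axis=1)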
I am writing a function that computes a conditional probability for all columns in a pd.DataFrame with ~800 columns. I wrote a few versions of the function and found a very big difference in compute time between two primary options:
col_sums = data.sum() #Simple Column Sum over 800 x 800 DataFrame
Option #1:
{'col_sums' and 'data' are a Series and DataFrame respectively}
[This is contained within a loop over index1 and index2 to get all combinations]
joint_occurance = data[index1] * data[index2]
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])
cond_prob = sum_joint_occurance / max_single_occurance #Symmetric Conditional Prob
results[index1][index2] = cond_prob
Vs.
Option #2: [While looping over index1 and index2 to get all combinations]
The only difference is that instead of using the DataFrame, I exported the data to a np.array prior to looping:
new_data = data.T.as_matrix() [Type: np.array]
Option #1 Runtime is ~1700 sec
Option #2 Runtime is ~122 sec
Questions:
Is converting the contents of DataFrames to np.array's best for computational tasks?
Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to the label-based access to the data?
Why are these runtimes so different?
While reading the documentation I came across:
Section 7.1.1 Fast scalar value getting and setting
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:
In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059
Best Guess:
Because I am accessing individual data elements many times from the dataframe (on the order of ~640,000 lookups per matrix), I think the slowdown comes from how I referenced the data (i.e. "indexing with [] handles a lot of cases"), and that I should instead be using the get_value() method for accessing scalars, similar to a matrix lookup.
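A rough way to test that guess (just a sketch, not a rigorous benchmark; note that in current pandas the get_value method quoted above has been superseded by .at / .iat for scalar access):

import timeit

import numpy as np
import pandas as pd

n = 800
df = pd.DataFrame(np.random.randint(0, 2, size=(n, n)))
arr = df.to_numpy()

# Label-based scalar access resolves the index on every lookup...
t_pandas = timeit.timeit(lambda: df.at[5, 7], number=100_000)

# ...while plain ndarray indexing is a direct memory offset
t_numpy = timeit.timeit(lambda: arr[5, 7], number=100_000)

print(f"pandas .at: {t_pandas:.3f}s, numpy indexing: {t_numpy:.3f}s")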