Python Optimization Package

I'm new to the Python and analytics world, and I know there are lots of packages out there to solve specific problems. My problem is as follows:
I have some independent variables with a value for each row.
I can calculate some ratios (such as paid amount, paid count, not-paid amount and not-paid count, relative to the totals of the dataset).
I have to find the value for each independent variable that maximizes the paid amount and count ratios and minimizes the not-paid ones.
For example:
amount  payed  var1  var2  var3
10      1      25    5     10
25      0      21    8     15
30      1      35    3     8
As you can see, for this data set the paid and not-paid ratios over the totals would be:
Paid count: 2/3 ~ 0.67
Paid amount: 40/65 ~ 0.62
Not-paid count: 1/3 ~ 0.33
Not-paid amount: 25/65 ~ 0.38
I was thinking of comparing each split of the independent variables against the totals and checking whether there is a gain in these ratios; if so, I can take it as a possible solution. For example, for var1 >= 25:
Paid count: 2/2 ~ 1
Paid amount: 40/40 ~ 1
Not-paid count: 0/2 ~ 0
Not-paid amount: 0/40 ~ 0
These numbers are better than the original ones if we calculate the gain:
Paid count: 1 - 0.67 = 0.33
Paid amount: 1 - 0.62 = 0.38
Not-paid count: 0.33 - 0 = 0.33
Not-paid amount: 0.38 - 0 = 0.38
Total gain: 0.33 + 0.38 + 0.33 + 0.38 = 1.42
And so on for each independent variable, until we get the splits with the maximum gain.
So it seems like the best solution would be:
Var1: >= 25
Var2: <= 5
Var3: <= 10
Is there any tool to solve this? I was thinking of an optimization package, but any other approach is welcome.
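In case it helps, here is a minimal brute-force sketch of the gain search described above (assuming the data sits in a pandas DataFrame with the column names from the example; the helper names are made up for illustration). A decision-tree learner such as scikit-learn's DecisionTreeClassifier performs essentially this kind of greedy split search, so that may also be worth a look.

import pandas as pd

# Toy data from the question ("payed" is the 1/0 flag).
df = pd.DataFrame({
    "amount": [10, 25, 30],
    "payed":  [1, 0, 1],
    "var1":   [25, 21, 35],
    "var2":   [5, 8, 3],
    "var3":   [10, 15, 8],
})

def ratios(d):
    """Paid count, paid amount, not-paid count, not-paid amount ratios of a subset."""
    paid = d["payed"] == 1
    total_amount = d["amount"].sum()
    return (paid.mean(),
            d.loc[paid, "amount"].sum() / total_amount,
            (~paid).mean(),
            d.loc[~paid, "amount"].sum() / total_amount)

base = ratios(df)

def gain(subset):
    """Positive when the subset has better paid ratios and lower not-paid ratios than the totals."""
    if subset.empty:
        return float("-inf")
    pc, pa, nc, na = ratios(subset)
    return (pc - base[0]) + (pa - base[1]) + (base[2] - nc) + (base[3] - na)

# Brute force: for each variable, try every observed value as a >= or <= threshold.
best = {}
for var in ["var1", "var2", "var3"]:
    candidates = []
    for v in df[var].unique():
        candidates.append((gain(df[df[var] >= v]), f"{var} >= {v}"))
        candidates.append((gain(df[df[var] <= v]), f"{var} <= {v}"))
    best[var] = max(candidates)

print(best)  # best (gain, rule) per variable; var1 >= 25 scores ~1.44 here (1.42 with the rounded ratios above)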

How to count things within string in Python

I have data where one column is a string. This column contains text, such as:
#  financial_covenants
1  Max. Debt to Cash Flow: Value is 6.00
2  Max. Debt to Cash Flow: Decreasing from 4.00 to 3.00, Min. Fixed Charge Coverage Ratio: Value is 1.20
3  Min. Interest Coverage Ratio: Value is 3.00
4  Max. Debt to Cash Flow: Decreasing from 4.00 to 3.50, Min. Interest Coverage Ratio: Value is 3.00
5  Max. Leverage Ratio: Value is 0.6, Tangible Net Worth: 7.88e+008, Min. Fixed Charge Coverage Ratio: Value is 1.75, Min. Debt Service Coverage Ratio: Value is 2.00
I want a new column that counts how many covenants there are in "financial_covenants".
As you can see, the covenants are separated by commas.
I want my final result to look like this:
financial_covenants | num_of_cov
Max. Debt to Cash Flow: Value is 6.00 | 1
Max. Debt to Cash Flow: Decreasing from 4.00 to 3.00, Min. Fixed Charge Coverage Ratio: Value is 1.20 | 2
Max. Debt to Cash Flow: Value is 3.00 | 1
Max. Debt to Cash Flow: Decreasing from 4.00 to 3.50, Min. Interest Coverage Ratio: Value is 3.00 | 2
Max. Leverage Ratio: Value is 0.6, Tangible Net Worth: 7.88e+008, Min. Fixed Charge Coverage Ratio: Value is 1.75, Min. Debt Service Coverage Ratio: Value is 2.00 | 4
The data set is large (3000 rows), and these phrases differ among themselves only in their values, e.g.
Max. Debt to Cash Flow: Value is 3.00 vs. Max. Debt to Cash Flow: Value is 6.00. I am not interested in these values; I just want to know how many covenants there are.
Do you have any idea how to do this in Python?
It looks to me like you could use:
counts = []  # structure to store the results
for financial_covenant in financial_covenants:  # your structure containing the rows
    parts = financial_covenant.split(',')  # split the sentence using commas as delimiters
    count = len(parts)  # count the number of parts obtained
    counts.append(count)  # store the result in the list
print(counts)  # displays [1, 2, 1, 2, 4]
On the assumption that your data is in a pandas DataFrame called df with columns as labelled, you could use:
df['num_of_cov'] = df['financial_covenants'].map(lambda row : len(row.split(',')))
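If the data is indeed in a DataFrame, a vectorized variant (a sketch under the same column-name assumptions) avoids calling split on every row:

# Each comma separates two covenants, so count commas and add 1.
df['num_of_cov'] = df['financial_covenants'].str.count(',') + 1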

Pandas: Sum down a column based on other columns

I've managed to successfully confuse myself with this problem. Here's a sample of my dataframe:
Model Rank Prediction Runtime
0.05 1 0.516267 250.500
0.05 2 0.504968 253.875
0.05 3 0.482915 310.875
0.05 4 0.470865 251.375
0.05 5 0.459580 277.250
. . . .
. . . .
. . . .
0.50 96 0.130696 250.500
0.50 97 0.130696 220.375
0.50 98 0.130696 314.625
0.50 99 0.130696 232.000
0.50 100 0.130696 258.000
And my use case is as follows:
I would, for each Model, like to calculate the total Runtime with respect to its Rank. By that I mean, the Runtime at Rank 1 should be the sum of all Runtimes (for its respective Model) and the Runtime at Rank 100 should be only the Runtime for Rank 100 (for its respective Model).
So for instance,
If the Rank is 1, the Runtime column at that row should represent the total sum of all Runtimes for Model 0.05
If the Rank is 2, it should be all of the Runtimes for Model 0.05 minus the Runtime for Model 0.05 at Rank 1
...
If the Rank is 100, it should be only the Runtime for Model 0.05 at Rank 100.
I have the idea in my head, but I'm not sure how this is achieved in Pandas. I know how to sum the column, but not how to sum based on a condition like this. If any more data or explanation is required, I'd be happy to attach it.
If I understand correctly, what you're asking for is essentially a reversed cumulative sum, which you can do by a reverse, cumsum, reverse operation:
In [4]: df["model_runtimes"] = df[::-1].groupby("Model")["Runtime"].cumsum()[::-1]
In [5]: df
Out[5]:
Model Rank Prediction Runtime model_runtimes
0 0.05 1 0.516267 250.500 1343.875
1 0.05 2 0.504968 253.875 1093.375
2 0.05 3 0.482915 310.875 839.500
3 0.05 4 0.470865 251.375 528.625
4 0.05 5 0.459580 277.250 277.250
5 0.50 96 0.130696 250.500 1275.500
6 0.50 97 0.130696 220.375 1025.000
7 0.50 98 0.130696 314.625 804.625
8 0.50 99 0.130696 232.000 490.000
9 0.50 100 0.130696 258.000 258.000
I would frame your problem in two steps.
First, each model is independent, so you can split the dataframe by the model field, and solve each one independently. This is a good place to use groupby().
Second, your problem is like a cumulative sum, except that a normal cumulative sum starts at the top and carries the sum down, and you want to do the opposite. You can solve this by reversing the dataframe, or by sorting in descending order.
With that in mind, here's how I would approach this problem. (The first block below just sets up a sample dataset.)
import pandas as pd
import numpy as np

np.random.seed(42)

# Only used for setting up the example dataframe. Ignore.
def flatten(t):
    return [item for sublist in t for item in sublist]

df = pd.DataFrame({
    "Model": flatten([list(c * 20) for c in "ABCDE"]),
    "Rank": flatten([range(1, 21) for i in range(5)]),
    "Prediction": np.random.rand(100),
    "Runtime": np.random.rand(100),
})

def add_sum(d):
    d["Runtime"] = d["Runtime"].cumsum()
    return d

df = (df.sort_values(by=["Model", "Rank"], ascending=False)
        .groupby("Model")
        .apply(add_sum)
        .sort_index())
print(df)

Finding MACD Divergence

I want to create a loop to automate finding MACD divergence with a specific scenario/criterion, but I am finding it difficult to execute, although it's very easy to spot when looking at a chart by eye. Note: you can easily get this as a readily available scanner, but I want to improve my Python knowledge, and I hope someone will be able to help me with this.
My main issue is how to make it reference 40 rows up and test forward - I couldn't get my head around the logic itself.
The rules are as follows: let's say we have the table below.
Date        Price  MACD Hist
04/08/2021  30     1
05/08/2021  29     0.7
06/08/2021  28     0.4
07/08/2021  27     0.1
08/08/2021  26     -0.15
09/08/2021  25     -0.70
10/08/2021  26     -0.1
11/08/2021  27     0.2
12/08/2021  28     0.4
13/08/2021  29     0.5
14/08/2021  30     0.55
15/08/2021  31     0.6
16/08/2021  30     0.55
17/08/2021  29     0.5
18/08/2021  28     0.4225
19/08/2021  27     0.4
20/08/2021  26     0.35
21/08/2021  25     0.3
22/08/2021  24     0.25
23/08/2021  23     0.2
24/08/2021  22     0.15
25/08/2021  21     0.1
26/08/2021  20     0.05
27/08/2021  19     0
28/08/2021  18     -0.05
29/08/2021  17     -0.1
30/08/2021  16     -0.25
I want the code to:
look back 40 days from today; within these 40 days, get the lowest point reached in MACDHist and the Price corresponding to it (i.e. the price of 25$ on 09/08/2021 in this example, with MACDHist -0.70)
compare it with today's Price & MACDHist and report divergence or not based on the 3 rules below:
If today's price < the price recorded in point 1 (16$ < 25$ in this example) AND
today's MACDHist > the MACDHist recorded in point 1, i.e. lower in absolute terms (ABS(-0.25) < ABS(-0.70)) AND
during the same period over which those Price and MACDHist values were recorded (between 09/08/2021 and today) the MACDHist was positive at least once.
I am sorry if my explanation isn't very clear; the picture below might help illustrate the scenario I am after:
A. The lowest MACDHist in the specified period
B. Within the same period, MACDHist was positive at least once
C. The price is lower than at point A, and MACDHist is higher than at point A (i.e. lower in absolute terms)
In a similar case I have used backtrader. It's a feature-rich Python framework for backtesting and trading, and you can also use it to generate lots of predefined indicators. In addition, with this framework you are able to develop your own custom indicator as shown here. It's very easy to use and it supports lots of data formats, like pandas data frames. Please take a look!
I found the answer in this great post. It's not a direct implementation, but the logic is the same, and by replacing the RSI info with MACDHist you get to the same conclusion:
How to implement RSI Divergence in Python
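In the same spirit, here is a minimal sketch of the 40-row lookback check described in the question, assuming the data is in a pandas DataFrame with Price and MACDHist columns sorted by date (the function and column names are just illustrative, not from any library):

import pandas as pd

def bullish_divergence(df, lookback=40):
    """Rough check of the three rules for the last row of df; a sketch, not a tested scanner."""
    window = df.tail(lookback)
    today = window.iloc[-1]
    low_idx = window["MACDHist"].idxmin()                      # point A: lowest MACDHist in the window
    low = window.loc[low_idx]
    since_low = window.loc[low_idx:]                           # rows from point A up to today
    return (today["Price"] < low["Price"]                      # rule 1: price made a lower low
            and abs(today["MACDHist"]) < abs(low["MACDHist"])  # rule 2: MACDHist did not (lower in ABS terms)
            and (since_low["MACDHist"] > 0).any())             # rule 3: MACDHist was positive at least once

With the table above loaded in date order, bullish_divergence(df) would flag 30/08/2021: the price made a lower low (16 < 25), |MACDHist| shrank (0.25 < 0.70), and MACDHist turned positive in between.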

Calculating the proportional weighted value in a specific segment: Make it more Pythonic

I have to make the following calculation (or similar) many times in my code and it takes a long time to run. I was wondering if it was possible to make the code more pythonic (reduce the time to run).
I am calculating the weighting of the "loan_size" proportional to all other loans that have the same origination month:
loan_plans['weighting'] = loan_plans.loan_size / loan_plans.apply(lambda S: loan_plans.loc[loan_plans.origination_month == S.origination_month, 'loan_size'].sum(), axis=1)
The following is a set of example data with the desired result:
loan_size origination_month weighting
1000 01-2018 0.25
2000 02-2018 0.2
3000 01-2018 0.75
8000 02-2018 0.8
Update (per OP update):
There's nothing wrong with your approach; you might use groupby instead to get origination_month sums, and then do the weighting:
loan_plans = loan_plans.reset_index().merge(
    loan_plans.groupby("origination_month").loan_size.sum().reset_index(),
    on="origination_month"
)
loan_plans["weighting"] = loan_plans.loan_size_x / loan_plans.loan_size_y
loan_plans.sort_values("index").set_index("index")
loan_size_x origination_month loan_size_y weighting
index
0 1000 01-2018 4000 0.25
1 2000 02-2018 10000 0.20
2 3000 01-2018 4000 0.75
3 8000 02-2018 10000 0.80
Cosmetics:
(loan_plans
 .sort_values("index")
 .set_index("index")
 .rename(columns={"loan_size_x": "loan_size"})
 .drop(columns="loan_size_y"))
loan_size origination_month weighting
index
0 1000 01-2018 0.25
1 2000 02-2018 0.20
2 3000 01-2018 0.75
3 8000 02-2018 0.80
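For completeness, a groupby/transform variant (a sketch under the same column-name assumptions) produces the same weighting column without the merge and the _x/_y renaming:

loan_plans["weighting"] = (
    loan_plans["loan_size"]
    / loan_plans.groupby("origination_month")["loan_size"].transform("sum")
)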
Earlier answer
You can use div and sum, no need for apply:
loan_plans.loan_size.div(
    loan_plans.loc[loan_plans.loan_number.eq(1), "loan_size"].sum()
)
Output:
0 0.024714
1 0.053143
2 0.012143
3 0.010929
4 0.039643
...
Data:
import numpy as np
import pandas as pd

N = 100
data = {"loan_size": np.random.randint(100, 1000, size=N),
        "loan_number": np.random.binomial(n=1, p=.3, size=N)}
loan_plans = pd.DataFrame(data)

Using Pandas & Pivot table how to use column(level) groupby sum values for the next steps analysis?

I want to find out how many samples will be taken from each level using the proportional allocation method.
I have 3 levels in total: [Small, Medium, Large].
First, I want to take the sum of Workers for these 3 levels.
Next, I want to find the probability for these 3 levels.
Next, I want to multiply this probability by the number of villages in each level.
And the last step is: the samples are selected as the top villages of each level.
Data :
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Let me explain (I am attaching an Excel screenshot).
In the first step I want to get the sum of Workers for each level -> Small, Medium and Large, i.e. (10+32+34+42) = 118 for the Small level.
In the next step I want to find the probability for each level, rounded to 2 decimals,
i.e. (118/786) = 0.15 for the Small level.
Then, using the length (size) of each level multiplied by its probability, I find out how many samples (villages) are taken from each level,
i.e. for the Medium level we have probability 0.25, and we have 3 villages in the Medium level, so 0.25*3 = 0.75 samples will be taken from the Medium level.
This is rounded up to the next whole number, 0.75 ~ 1, so 1 sample is taken from the Medium level, and it takes the top village in that level; so in the Medium level the village "Dhokari" will be selected.
I have done some work:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=True)
df['level'] = pd.qcut(df['Workers'], 3, ['Small', 'Medium', 'Large'])
df
I use this command to get the sums for the levels; what to do next I am not sure about:
df = df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible in Python to get the village names that I get when using Excel?
You can use:
transform with sum, to get a column of the same length
division by the overall sum with div, then round
another transform with size
and finally a custom function
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected village'] = (df.groupby('Level')
                            .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0]))
                            .reset_index(level=0)['Village'])
df['Selected village'] = df['Selected village'].fillna('')
print (df)
print (df)
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
You can try to debug with a custom function:
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if (len(x) < len(a)):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a

df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
