I'm new to Pyomo and am trying to use data from my pandas dataframe as parameters in the optimisation model. The dataframe looks like this:
Ticker Margin Avg. Volume M_ratio V_ratio
Index
0 ES1 6600.00 1250970 0.126036 0.212996
1 TY1 1150.00 1232311 0.021961 0.209819
2 FV1 700.00 488906 0.013367 0.083244
3 TU1 570.00 293885 0.010885 0.050038
4 ED3 500.00 137802 0.009548 0.023463
5 NQ1 7500.00 427061 0.143223 0.072713
6 FDAX1 24074.12 98838 0.459728 0.016829
7 FESX1 2641.28 832836 0.050439 0.141803
8 FGBL1 2502.75 546878 0.047793 0.093114
9 FGBM1 1042.10 330517 0.019900 0.056275
10 FGBS1 262.97 232801 0.005022 0.039638
11 F2MX1 4822.81 398 0.092098 0.000068
The model I'm constructing aims to find the maximum number of contracts one may hold in all assets, based on balance and a number of constraints.
I need to iterate through the rows in order to add all the relevant data to model.utilisation:
model.Vw = Param() #<- V_ratio from df
model.M = Param() #<- Margin from df
model.L = Var(domain=NonNegativeReals)
model.utilisation = Objective(expr = model.M * model.L, sense=maximize)
Effectively it needs to take in the Margin for each ticker and determine how many contracts of it you can hold relative to balance, i.e.
(ES1 Margin * model.L) + (TY1 Margin * model.L) etc. throughout the dataframe.
I've tested the logic by plugging in dummy data and it seems to work, but it isn't efficient to write out each piece of data and add it to the utilisation objective by hand, as I have hundreds of rows in my dataframe.
Apologies if there are some blinding errors, I'm very new to Pyomo.
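For reference, this is the direction I'm experimenting with: indexing everything by a set of tickers built from the dataframe, so each row feeds the model automatically. A minimal sketch (df is the dataframe above; balance is a placeholder for the account balance, and the budget constraint stands in for my real constraints):

import pandas as pd
from pyomo.environ import (ConcreteModel, Constraint, NonNegativeReals,
                           Objective, Param, Set, Var, maximize)

model = ConcreteModel()

# one set element per ticker, so every dataframe row becomes a parameter value
model.tickers = Set(initialize=df['Ticker'].tolist())
model.M = Param(model.tickers, initialize=df.set_index('Ticker')['Margin'].to_dict())
model.Vw = Param(model.tickers, initialize=df.set_index('Ticker')['V_ratio'].to_dict())

# one contract-count variable per ticker instead of a single scalar L
model.L = Var(model.tickers, domain=NonNegativeReals)

# total margin used: (ES1 Margin * L[ES1]) + (TY1 Margin * L[TY1]) + ...
model.utilisation = Objective(
    expr=sum(model.M[t] * model.L[t] for t in model.tickers),
    sense=maximize)

# placeholder constraint: keep total margin within the available balance
model.budget = Constraint(
    expr=sum(model.M[t] * model.L[t] for t in model.tickers) <= balance)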
I need to generate a mapping dataframe with each unique code and a prioritised description, based on a set of prioritisation options. For example, the starting dataframe might look like this:
Filename TB Period Company Code Desc. Amount
0 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 98 100
1 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 7 200
2 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1000 ZX -100
3 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1000 29 -200
4 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1001 BA 100
5 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1001 9 200
6 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 ARC -100
7 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 86 -200
The options I have for prioritisation of descriptions are:
Firstly, to search for viable options in each Period: for example Closing first, then Opening if nothing is found, then Prior.
If multiple descriptions exist in the prioritised period, prioritise either the longest string or the first instance.
So for example, with prioritisation of Closing, then Opening, then Prior, taking the longest string, I should get a mapping dataframe that looks like this:
Code New Desc.
FOXTROT__1000 29
FOXTROT__1001 ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1,000 and up to 30,000 inconsistent codes) it becomes impractical to generate a GUI, so for large volumes I need a way to auto-generate the mapping dataframe directly from the initial data while circumventing tkinter entirely.
import numpy as np
import pandas as pd

# Create a new column which encodes the hierarchy given the value of Period
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                        np.where(df['Period'] == 'Opening', 2,
                        np.where(df['Period'] == 'Prior', 3, None)))

df = df.sort_values(by=['NewFilterColumn', 'Code', 'Desc.'], ascending=True)
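To illustrate where I'm trying to get to, here is a sketch of the fully automatic version, assuming the Period values are exactly 'Closing', 'Opening' and 'Prior' as in the filter above: rank each row by period priority and description length, sort so the best candidate per code comes first, then keep the first row per code.

import pandas as pd

priority = {'Closing': 1, 'Opening': 2, 'Prior': 3}  # assumed exact Period labels

ranked = df.assign(rank=df['Period'].map(priority),
                   desc_len=df['Desc.'].astype(str).str.len())

# best candidate first: highest-priority period, then longest description;
# for "first instance" instead of "longest", drop desc_len (the stable
# mergesort keeps the original row order within each rank)
ranked = ranked.sort_values(['Code', 'rank', 'desc_len'],
                            ascending=[True, True, False], kind='mergesort')

mapping = (ranked.drop_duplicates('Code')[['Code', 'Desc.']]
                 .rename(columns={'Desc.': 'New Desc.'})
                 .reset_index(drop=True))

On the sample above this gives FOXTROT__1000 -> 29 and FOXTROT__1001 -> ARC, as required.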
I am trying to build a loop that iterates over the rows of several DataFrames in order to create two new columns. The original dataframes contain two columns (time, velocity), can vary in length, and are stored in nested dictionaries. Here is an example of one of them:
time velocity
0 0.000000 0.136731
1 0.020373 0.244889
2 0.040598 0.386443
3 0.060668 0.571861
4 0.080850 0.777680
5 0.101137 1.007287
6 0.121206 1.207533
7 0.141284 1.402833
8 0.161388 1.595385
9 0.181562 1.762003
10 0.201640 1.857233
11 0.221788 2.006104
12 0.241866 2.172649
The two new columns should be a normalisation of the 'time' and 'velocity' columns, respectively. Each row of the new columns should therefore be equal to the following transformation:
t_norm = (time(n) - time(n-1)) / (time(max) - time(min))
vel_norm = (velocity(n) - velocity(n-1)) / (velocity(max) - velocity(min))
Also, the first value of the two new columns should be set to 0.
My problem is that I don't know how to properly tell Python how to access the n and n-1 values to perform these operations, and I don't know whether it could be done using pd.DataFrame.iterrows() or .iloc.
I have come up with the following piece of code, but it is missing the crucial parts:
for nested_dict in dict_all_raw.values():
for dflist in nested_dict.values():
dflist['t_norm'] = ? / (dflist['time'].max() - dflist['time'].min())
dflist['vel_norm'] = ? / (dflist['velocity'].max() - dflist['velocity'].min())
dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']
Any help is welcome..! :)
If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
# time velocity
# 0 0.000000 0.136731
# 1 0.020373 0.244889
# 2 0.040598 0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
# time velocity normtime difftime
# 0 0.000000 0.136731 0.000000 NaN
# 1 0.020373 0.244889 0.501823 0.501823
# 2 0.040598 0.386443 1.000000 0.498177
You can use Series.shift (see the documentation) to create lagged columns:
df['time_n-1']=df['time'].shift(1)
As for the requirement that the first value of the two new columns be set to 0: use df['column'] = df['column'].fillna(0) after your calculations.
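Putting both ideas together for the nested-dictionary layout from the question, a sketch (dict_all_raw and the column names are taken from your code):

for nested_dict in dict_all_raw.values():
    for df in nested_dict.values():
        # diff() gives time(n) - time(n-1); the first row becomes NaN
        df['t_norm'] = df['time'].diff() / (df['time'].max() - df['time'].min())
        df['vel_norm'] = df['velocity'].diff() / (df['velocity'].max() - df['velocity'].min())
        # set the first value of both new columns to 0, as required
        df[['t_norm', 'vel_norm']] = df[['t_norm', 'vel_norm']].fillna(0)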
I have a dataframe similar to the one below:
   Price  return  indicator
0      5    0.05          1
1      6    0.20         -1
2      5   -0.16          1
Where the indicator is based upon the forecasted return on the following day.
What I would like to achieve is a strategy where, when the indicator is positive 1, I buy the stock at the price on that date/row, and if the indicator is negative we sell at that price. Then I would like to create a new column which represents the value of the portfolio on each day. Assuming I have $1000 to invest, the value of the portfolio should equal the holdings plus the cash amount. I'm assuming that any fraction of a stock can be purchased.
I'm unsure where to start with this one. I tried calculating a buy-and-hold strategy using:
df['Holding'] = df['return'].add(1).cumprod() * 5000
This worked for a buy-and-hold strategy, but modifying it for the new strategy seems difficult.
I tried;
df['HOLDING'] = (df['return'].add(1).cumprod() * 5000 * df['indicator'])
# to get the value of the buy or the sell
# then using
df['HOLDING'] = np.where(df['HOLDING'] > 0, df['HOLDING'], df['HON HOLDING 2'] * -1)
# my logic was: if it's positive it's the value of the stock holding, and if it's
# negative it's a cash inflow, so I made it positive as it would be cash
The issue is that my logic is massively flawed: if the holding is cash, the return shouldn't apply to it. Further, I don't think using cumprod is correct with this strategy.
Has anyone used this strategy before and can offer tips on how to make it work?
Thank you.
I'm not sure the returns and prices are in the right place: they shouldn't really be in the same row if they represent the buying price (presumably yesterday's close) and the daily return (assuming the position was held for the whole day). But anyway...
import pandas as pd
# the data you provided
df = pd.read_csv("Data.csv", header=0)
# an initial starting row (explanation provided)
starting = pd.DataFrame({'Price': [0], 'return': [0], 'indicator': [0]})
# concatenate so starting is first row
df = pd.concat([starting, df]).reset_index(drop=True)
# setting holding to 0 at start (no shares), and cash at 1000 (therefore portfolio = 1000)
df[["Holding", "Cash", "Portfolio"]] = [0, 1000, 1000]
# buy/sell is the difference (explanation provided)
df["BuySell"] = df["indicator"].diff()
# simulating every day
for i in range(1, len(df)):
    # buying: move all cash into shares at today's price
    if df["BuySell"].iloc[i] > 0:
        df.at[i, "Holding"] += df.at[i - 1, "Cash"] / df.at[i, "Price"]
        df.at[i, "Cash"] = 0
    # selling: convert all shares to cash at today's price
    elif df["BuySell"].iloc[i] < 0:
        df.at[i, "Cash"] = df.at[i - 1, "Holding"] * df.at[i, "Price"]
        df.at[i, "Holding"] = 0
    # holding position: carry forward yesterday's cash and shares
    else:
        df.at[i, "Cash"] = df.at[i - 1, "Cash"]
        df.at[i, "Holding"] = df.at[i - 1, "Holding"]
    # multiply holding by return (assuming all-in, so holding=0 not affected);
    # .at is used throughout because chained .iloc assignment does not
    # reliably write back to the frame
    df.at[i, "Holding"] *= (1 + df.at[i, "return"])
    df.at[i, "Portfolio"] = df.at[i, "Holding"] * df.at[i, "Price"] + df.at[i, "Cash"]
Explanations:
Starting row:
This is needed so that the loop can refer to the previous holdings and cash (it would be more of an inconvenience to add an if statement in the loop for i = 0).
Buy/Sell:
The difference is needed here: if the position changes from buy to sell, the shares are sold (and vice versa). If the previous row was the same buy/sell signal as the current row, there is no change (diff = 0) and no shares are bought or sold.
Portfolio:
This is an "equivalent" amount (the amount you would hold if you converted all shares to cash at the time).
Holding:
This is the number of shares held.
NOTE: from what I understood of your question, this is an all-in strategy - there is no percentage exposure - which makes the strategy more simplistic, but easier to code.
Output:
#Out:
# Price return indicator Holding Cash Portfolio BuySell
#0 0 0.00 0 0.00 1000 1000.0 NaN
#1 5 0.05 1 210.00 0 1050.0 1.0
#2 6 0.20 -1 0.00 1260 1260.0 -2.0
#3 5 -0.16 1 211.68 0 1058.4 2.0
Hopefully this will give you a good starting point to create something more to your specification and more advanced, such as with multiple shares, or being a certain percentage exposed, etc.
I am using Lifetimes to compute the CLV of some of my customers.
I have transactional data and, by means of summary_data_from_transaction_data (the implementation can be found here), I would like to compute
the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute the frequency correctly.
Here is the code for testing my dataset:
import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_library = summary_data_from_transaction_data(df_test,
                                                      'Customer',
                                                      'Transaction date',
                                                      observation_period_end='2020-02-12',
                                                      freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customers 1171970 and 1188564 made 69 and 14 transactions respectively, so their frequencies should have been 68 and 13.
Printing the size of each customer confirms that:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to use the underlying code of summary_data_from_transaction_data directly, like this:
import numpy as np

# make sure the dates are datetimes so the subtractions below work
df_test['Transaction date'] = pd.to_datetime(df_test['Transaction date'])

RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
    pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)

# subtract 1 from count, as we ignore their first order
RFT_native["frequency"] = RFT_native["count"] - 1
RFT_native["T"] = (observation_period_end - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
RFT_native["recency"] = (RFT_native["max"] - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in frequency and/or recency greatly alters the fit of the BGF model.
What am I missing? Is there something I should consider when using the method?
I might be a bit late to answer your query, but here it goes.
The documentation for the Lifetimes package defines frequency as:
frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.
So, it's basically the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirms that 1188564 and 1171970 each made 2 purchases on a single day (13 Jan 2019 and 15 Jun 2019, respectively). Each such pair is counted as 1 when calculating frequency, which is why the frequency calculated by summary_data_from_transaction_data is 2 less than your manual count.
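You can verify this on your data by counting distinct purchase days instead of raw rows; a quick sketch, assuming 'Transaction date' parses as a datetime:

import pandas as pd

# frequency as lifetimes defines it: distinct days with a purchase, minus 1
days = pd.to_datetime(df_test['Transaction date']).dt.normalize()
freq_check = df_test.assign(day=days).groupby('Customer')['day'].nunique() - 1
print(freq_check)  # should match summary_data_from_transaction_data (18, 67, 12)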
According to the documentation, you need to set:
include_first_transaction = True
include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included while calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in the lifetimes package.
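For example, a sketch reusing the call from the question (per the caveat above, leave it at False if you are going to feed the summary to the lifetimes fitters):

RFT_incl_first = summary_data_from_transaction_data(df_test,
                                                    'Customer',
                                                    'Transaction date',
                                                    observation_period_end='2020-02-12',
                                                    freq='D',
                                                    include_first_transaction=True)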
I have the following code, which reads a csv file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make it faster?
import pandas as pd

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the raw_data dataframe:
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which have a specific patient id. If you have a lot of patients, you will need to run this query a lot of times. A simpler way would be to skip the filter on the patient id and just count all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for instead of size (size is the number of cells in the dataframe).
Finally, another source of error in your current code is that you were not keeping the count for every patient id: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more Pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had the same confusion between size and len.
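As an aside, if 'Finding Labels' is pipe-delimited as in your sample, both counts can be computed in one vectorized pass with str.get_dummies, which also matches label names exactly instead of the substring matching done by str.contains; a sketch:

# one indicator column per illness, one row per image
dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')

# rows (images) in which each illness appears
times_shown = dummies.sum()

# number of distinct patients having each illness
patients_having = dummies.groupby(raw_data['Patient ID']).max().sum()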
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which involves repeated in-memory copying. This is usually a sign of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key']))

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return grp.groupby('Patient ID').size().size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({
    'Patient ID': np.random.choice(data_tools, 25),
    'Finding Labels': np.core.defchararray.add(
        np.core.defchararray.add(
            np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
            np.random.choice(ills, 25).astype('str')),
        np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2