Create a binary matrix after comparing columns' values in a dataframe - python

The text is long but the question is simple!
I have two dataframes that brings different informations about two variables and I need to create a binary matrix as my output after following some steps.
Let's say my dataframes are these:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
variableA variableB variableC variableD variableE variableF variableG
0 1.0 NaN 9 18 36 99 42
1 2.0 2.0 10 25 11 10 19
2 3.0 NaN 15 43 12 98 27
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(0.14,np.nan,0.03), 'variableG': (1.4,0.134,0.111)})
variableA variableB variableC variableD variableE variableF variableG
0 0.1 0.500 0.9 0.12 NaN 1.4 0.141
1 0.2 NaN 0.1 0.11 0.13 NaN 0.134
2 0.3 0.303 0.4 0.09 0.21 0.03 0.111
And I need to follow these steps:
1 - Check if two columns in my 'market_values' df have at least one
value that is equal (for the same row)
2 - If a pair of columns has one value that is equal (for the same row),
then I need to compare these same columns in my
'negociation_values' df
3 - Then I have to discover which variable has the higher
negociation value (for a given row)
4 - Finally I need to create a binary matrix.
For those equal values' variable, I'll put 1 where one
negociation value is higher and 0 for the other. If a column
doesn't have an equal value with another column, I'll just put 1
for the entire column.
The desired output matrix will be like:
variableA variableB variableC variableD variableE variableF variableG
0 0 1 0 1 1 1 1
1 1 0 1 1 1 0 1
2 0 1 1 1 1 0 1
The main difficult is at steps 3 and 4.
I've done steps 1 and 2 so far. They're above:
arr = market_values.to_numpy()
is_equal = ((arr == arr[None].T).any(axis=1))
is_equal[np.tril_indices_from(is_equal)] = False
inds_of_same_cols = [*zip(*np.where(is_equal))]
equal_cols = [market_values.columns[list(inds)].tolist() for inds in inds_of_same_cols]
print(equal_cols)
-----------------
[['variableA', 'variableB'], ['variableC', 'variableF']]
h = []
for i in equal_cols:
op = pd.DataFrame(negociation_values[i])
h.append(op)
print(h)
-------
[ variableA variableB
0 0.1 0.500
1 0.2 NaN
2 0.3 0.303,
variableC variableF
0 0.9 0.14
1 0.1 NaN
2 0.4 0.03]
The code above returns me the negociation values for the columns that have at least one equal value in the market values df.
Unfortunately, I don't know where to go from here. I need to write a code that says something like: "If variableA > variableB (for a row), insert '1' in a new matrix under variableA column and a '0' under variableB column for that row. keep doing that and then do that for the others". Also, I need to say "If a variable doesn't have an equal value in some other column, insert 1 for all values in this binary matrix"

your negociation_values definition and presented table are not the same:
here is the definition I used
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(1.4,np.nan,0.03), 'variableG': (0.141,0.134,0.111)})
The following code gives me the required matrix (though there are a number of edge cases you will need to consider)
cols = market_values.columns.values
bmatrix = pd.DataFrame(index=market_values.index, columns=cols, data=1)
for idx,col in enumerate(cols):
print(cols[idx+1:])
df_m = market_values[cols[idx+1:]]
df_n = negociation_values[cols[idx+1:]]
v = df_n.loc[:,df_m.sub(market_values[col],axis=0).eq(0).any()].sub(negociation_values[col], axis=0).applymap(lambda x: 1 if x > 0 else 0)
if v.columns.size > 0:
bmatrix[v.columns[0]] = v
bmatrix[col] = 1 - v
The result is as required:
The pseudo code is:
for each column of the market matrix:
subtract from the later columns,
keep columns with any zeros (edge case: more than one column),
from column with zero , find difference between corresponding negoc. matrix,
set result to 1 if > 0, otherwise 0,
enter into binary matrix
Hope that makes sense.

Related

How to multiply the previous value of an other column with the value of x column (shift)

I have the following pandas df which consists of 2 factor-columns and 2 signal-columns.
import pandas as pd
data = [
[0.1,-0.1,0.1],
[-0.1,0.2,0.3],
[0.3,0.1,0.3],
[0.1,0.3,-0.2]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B', 'factor_C'])
for col in df:
new_name = col + '_signal'
df[new_name] = [1 if x>0 else -1 for x in df[col]]
print(df)
This gives me the following output:
factor_A factor_B factor_C factor_A_signal factor_B_signal factor_C_signal
0 0.1 -0.1 0.1 1 -1 1
1 -0.1 0.2 0.3 -1 1 1
2 0.3 0.1 0.3 1 1 1
3 0.1 0.3 -0.2 1 1 -1
Now in a 1 month holding period I have to multiply factor_A with the previous factor_A_signal + factor_B with the previous factor_B_signal divided by the number of factors (in this case "2") and add a new column ("ret_1m). At the moment I am not able to say how much factors I will have as an input so therefore I have to work with a for loop. 
In a 2 month holding period I have to multiply the t+1 factor_A with the previous factor_A_signal + the t+1 factor_B with the previous factor_B_signal divided by the number of factors and add a new column ("ret_2m") and so on to the 12th month.
To show you an example I would do that for 2 factors for 3 month holding period as follow:
import pandas as pd
data = [
[0.1,-0.1],
[-0.1,0.2],
[0.3,0.1],
[0.1,0.3]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])
for col in df:
new_name = col + '_signal'
df[new_name] = [1 if x>0 else -1 for x in df[col]]
print(df)
def one_three(n_factors):
df["ret_1m"] = (df['factor_A_signal'].shift() * df["factor_A"] +
df['factor_B_signal'].shift() * df["factor_B"])/n_factors
df["ret_2m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-1) +
df['factor_B_signal'].shift() * df["factor_B"].shift(-1))/n_factors
df["ret_3m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-2) +
df['factor_B_signal'].shift() * df["factor_B"].shift(-2))/n_factors
return df
one_three(2)
Output:
factor_A factor_B factor_A_signal factor_B_signal ret_1m ret_2m ret_3m
0 0.1 -0.1 1 -1 NaN NaN NaN
1 -0.1 0.2 -1 1 -0.15 0.1 -0.1
2 0.3 0.1 1 1 -0.10 0.1 NaN
3 0.1 0.3 1 1 0.20 NaN NaN
How could I automate this with a for loop? Thank you very much in advance.
A for loop for your function def one_three(n_factors):
# Create list of columns in dataframe that are not signals
factors = [x for x in df.columns if not x.endswith("_signal")]
# Looking through range from 1 to 1 + number of months (in your example 3)
for i in range(1, 3+1):
name = "ret_" + str(i) + "m"
df[name] = 0
for x in factors:
df[name] += df[str(x + "_signal")].shift() * df[x].shift(1 - i)
df[name] /= len(factors)
Assuming you know already populated the factor_ columns, then run the signal loop. The first section finds all columns that do not end with _signal and returns a list - otherwise you could use a list of [factor_A, factor_B, ...]. Looping through the number of months, here I used 3 following your example, the computation loops through all items in the list.
The output for this matched your output with the given input data.

Loop through rows of dataframe at specific row values

My dataframe contains three different replications for each treatment. I want to loop through both, so I want to loop through each treatment, and for each treatment calculate a model for each replication. I managed to loop through the treatments, but I need to also loop through the replications of each treatment. Ideally, the output should be saved into a new dataframe that contains 'treatment' and 'replication'. Any suggestion?
The dataframe (df) looks like this:
treatment replication time y
**8 1 1 0.1**
8 1 2 0.1
8 1 3 0.1
**8 2 1 0.1**
8 2 2 0.1
8 2 3 0.1
**10 1 1 0.1**
10 1 2 0.1
10 1 3 0.1
**10 2 1 0.1**
10 2 2 0.1
10 2 3 0.1
for i, g in df.groupby('treament'):
k = g.iloc[0].y
popt, pcov = curve_fit(model, x, y)
fit_m = popt
I now apply iterrows, but then I can no longer use the index of NPQ [0] to get the initial value. Any idea how to solve this? The error message reads as:
for index, row in HL.iterrows():
g = (index, row['filename'], row['hr'], row['time'], row['NPQ'])
k = g.iloc[0]['NPQ'])
AttributeError: 'tuple' object has no attribute 'iloc'
Thank you in advance
grouped_df = HL.groupby(["hr", "filename"])
for key, g in grouped_df:
k = g.iloc[0].y
popt, pcov = curve_fit(model, x, y)
fit_m = popt

Sum of values based on a condition in pandas

ID change SX Supresult
0 UNITY NaN 0 NaN
1 UNITY -0.009434 100 -0.015283 (P1)
2 UNITY 0.003463 0 NaN
3 TRINITY 0.008628 100 -0.043363
4 TRINITY -0.027374 100 0.008423 (P2)
5 TRINITY -0.011002 0 NaN
6 TRINITY -0.004987 100 NaN
7 TRINITY 0.007566 0 NaN
I use the following program which creates a new column 'Supresult' if 'SX' is equal to 100. The new column stores the sum of NEXT three 'change' values. For instance in index 1 the supresult is a sum of change in index 2,3 & 4.
df['Supresult'] = df[df.SX == 100].index.to_series().apply(lambda x: df.change.shift(-1).iloc[x: x + 3].sum())
However, I am facing two problems that I need assistance with:
(P1): I want the sum to be 'ID' specific. For instance the result in index 1 goes ahead and take the sum of one value from UNITY and two from TRINITY. Sum should be made as long as it in within the same 'ID'. I tried to add .groupby('ID') at the end of my code but it gave a keyerror.
(P2): Since the index 3 already gave sum of next three change, index 4 shouldn't go ahead and make the sum of the next three days. The next sum should only be taken once the previous calculation period is complete i.e. index 6 and onwards.
Intended result:
ID change SX Supresult
0 UNITY NaN 0 NaN
1 UNITY -0.009434 100 NaN
2 UNITY 0.003463 0 NaN
3 TRINITY 0.008628 100 -0.043363
4 TRINITY -0.027374 100 NaN
5 TRINITY -0.011002 0 NaN
6 TRINITY -0.004987 100 NaN
7 TRINITY 0.007566 0 NaN
Little help will be appreciated, THANKS!
Given your complex requirements, I think a loop is appropriate:
# If your data frame is not indexed sequentially, this will make it so.
# The algorithm needs the frame to be indexed 0, 1, 2, ...
df.reset_index(inplace=True)
# Every row starts off in "unconsumed" state
consumed = np.repeat(0, len(df))
result = np.repeat(np.nan, len(df))
for i, sx in df['SX'].iteritems():
# The next three rows
slc = slice(i+1, i+4)
# A row is considered a match if:
# * It has SX == 100
# * The next three rows have the same ID
# * The next three rows are not involved in a previous summation
match = (
(sx == 100) and
(df.loc[slc, 'ID'].nunique() == 1) and
(consumed[i] == 0)
)
if match:
consumed[slc] = 1
result[i] = df.loc[slc, 'Supresult'].sum()
df['Supresult'] = result

index counter for if conditions python pandas

I wanted to generate some sort of cycle for my dataFrame. One cycle in the example below has the length of 4. The last column is how is supposed to look like, the rest are attempts on my behalf.
My current code looks like this:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
i = 0
for i in range(0,length):
df.loc[i,'new_cycle']=i+1
df['want_cycle']= [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)
I do need an if conditions in the code, too only increase in the value of df['new_cycle'] if the index counter for example 4. But so far I failed to find a proper way to implement such conditions.
Try this with the default range index, because your dataframe row index is a range starting with 0, the default index of a dataframe, you can use floor divide to calculate your cycle:
df['cycle'] = df.index//4 + 1
Output:
time A B cycle
0 0.000000 0.0 0 1
1 0.909091 5.0 300 1
2 1.818182 0.6 20 1
3 2.727273 -4.8 -280 1
4 3.636364 -0.3 -25 2
5 4.545455 4.9 290 2
6 5.454545 0.2 30 2
7 6.363636 -4.7 -270 2
8 7.272727 0.5 40 3
9 8.181818 5.0 300 3
10 9.090909 0.1 -10 3
11 10.000000 -4.6 -260 3
Now, if your dataframe index isn't the default, the you can use something like this:
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
I've added just 1 thing for you, a new variable called new_cycle which will keep the count you're after.
In the for loop we're checking to see whether or not i is divisible by 4 without a remainder, if it is we're adding 1 to the new variable, and filling the data frame with this value the same way you did.
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
new_cycle = 0
for i in range(0,length):
if i % 4 == 0:
new_cycle += 1
df.loc[i,'new_cycle']= new_cycle
df['want_cycle'] = [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)

Pandas Timeseries Data - Calculating product over intervals of varying length

I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before the an increase/decrease streak ends, and store the value in the row where the streak started. That is, what I want is a column 'TotalPriceChange' :
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.31
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need this "turn" at all). I am pretty sure I can write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange', but given that I plan on doing this on a fairly large data set (think minute or hour data for couple of years), I imagine nested for-loops will be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100

Categories

Resources