Pandas Timeseries Data - Calculating product over intervals of varying length - python

I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store the value in the row where the streak started. That is, what I want is a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.32
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding (or perhaps we don't need "turn" at all). I am pretty sure I can write a nested for-loop that goes through the entire DataFrame row by row, calculates what I need and populates the column 'TotalPriceChange'. However, I plan on doing this on a fairly large data set (think minute or hour data for a couple of years), so I imagine nested for-loops will be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!

The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
Putting it all together,
import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
                   'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
# Mark rows where the sign flips relative to the previous row
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
# Each streak gets its own group number
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
# Cleanup: keep the product only on the first row of each streak, zero elsewhere
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100
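As the question suspected, the helper column "turn" is not strictly needed. The following is a minimal sketch of the same groupby/product idea that derives the streak groups directly from sign changes. Note one assumption that differs from the code above: here a change of exactly 0 is treated as its own sign, so it would start a new streak.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TimeStamp": [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286],
    "PercPriceChange": [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
})

# A streak starts wherever the sign differs from the previous row.
# The first row compares against NaN, so it always counts as a start.
sign = np.sign(df["PercPriceChange"])
starts = sign != sign.shift()

# Running count of streak starts gives one label per streak.
group = starts.cumsum()

# Compound the changes within each streak, then keep the value only on the start row.
total = (df["PercPriceChange"] + 1).groupby(group).transform("prod") - 1
df["TotalPriceChange"] = total.where(starts, 0.0)
print(df)
```

Everything here is vectorized, so it should scale to minute or hourly data far better than a row-by-row loop.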

Related

Bin values into groups

The relevant data in my dataframe looks as follows:
Datapoint Values
1 0.2
2 0.8
3 0.4
4 0.1
5 1.0
6 0.6
7 0.7
8 0.2
9 0.5
10 0.1
I am hoping to group the numbers in the Values column into three categories: less than 0.25 as 'low', between 0.25 and 0.75 as 'middle', and greater than 0.75 as 'high'.
I want to create a new column which returns 'low', 'middle' or 'high' for each row based on the data in the Values column.
What I have tried:
def categorize_values("Values"):
    if "Values" > 0.75:
        return 'high'
    elif 'Values' < 0.25:
        return 'low'
    else:
        return 'middle'
However this is returning an error for me.
If you're using a DataFrame, pandas has a built-in function for this: pd.cut().
import pandas as pd
from io import StringIO
# sep=r'\s+' splits on any whitespace, so the sample parses whether it is tab- or space-separated
df = pd.read_csv(StringIO('''Datapoint Values
1 0.2
2 0.8
3 0.4
4 0.1
5 1.0
6 0.6
7 0.7
8 0.2
9 0.5
10 0.1'''), sep=r'\s+')
# Bins are right-inclusive by default: (0, 0.25], (0.25, 0.75], (0.75, max]
df['category'] = pd.cut(df['Values'], [0, 0.25, 0.75, df['Values'].max()], labels=['low', 'middle', 'high'])
#output
>>> df
Datapoint Values category
0 1 0.2 low
1 2 0.8 high
2 3 0.4 middle
3 4 0.1 low
4 5 1.0 high
5 6 0.6 middle
6 7 0.7 middle
7 8 0.2 low
8 9 0.5 middle
9 10 0.1 low
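One thing to watch with fixed bin edges is what happens at and beyond the boundaries: a value of exactly 0 falls outside the interval (0, 0.25] and would come back as NaN. The sketch below is an illustration with made-up sample values (not the question's data) that uses open-ended edges so every number lands in some bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample chosen to hit the edges 0, 0.25 and 0.75 exactly.
values = pd.Series([0.0, 0.2, 0.25, 0.5, 0.75, 0.8, 1.0])

# Infinite outer edges mean no value can fall outside the bins.
bins = [-np.inf, 0.25, 0.75, np.inf]

# With the default right=True, each bin includes its right edge:
# (-inf, 0.25] -> low, (0.25, 0.75] -> middle, (0.75, inf) -> high.
category = pd.cut(values, bins=bins, labels=["low", "middle", "high"])
print(category.tolist())
```

With these settings 0.25 is labelled 'low' and 0.75 is labelled 'middle'. Passing right=False would make each bin include its left edge instead.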
First of all, a function parameter must be a name, not a string literal.
You need to fix your function first, like this:
def categorize_values(Values):
    if Values > 0.75:
        return 'high'
    elif Values < 0.25:
        return 'low'
    else:
        return 'middle'
Then you can apply that function to your 'Values' column as below.
df['Category'] = df['Values'].apply(categorize_values)
df.head()
It will generate a DataFrame like this:
Values Category
DataPoint
1 0.22 low
2 0.32 middle
3 0.55 middle
4 0.75 middle
5 0.12 low
You should remove the quotes around Values.
That would look like this:
def categorize_values(Values):
    if Values > 0.75:
        return 'high'
    elif Values < 0.25:
        return 'low'
    else:
        return 'middle'
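For larger frames, the same rules can be applied to the whole column at once instead of calling a Python function per row. This is a sketch using np.select, which is a swapped-in technique rather than anything from the answers here. It follows the comparisons in categorize_values exactly, so 0.25 and 0.75 themselves both come out as 'middle'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Datapoint": range(1, 11),
    "Values": [0.2, 0.8, 0.4, 0.1, 1.0, 0.6, 0.7, 0.2, 0.5, 0.1],
})

# Conditions are checked in order; rows matching none of them get the default.
conditions = [df["Values"] > 0.75, df["Values"] < 0.25]
choices = ["high", "low"]
df["category"] = np.select(conditions, choices, default="middle")
print(df)
```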

Create pandas column using previous calculated value

I am trying to construct a column without using a for loop. However, each value depends on the value I calculated at the previous step.
Example :
f-
0 0.3
1 0.4
2 0.45
3 0.6
f+
0 0.21
1 0.78
2 0.54
3 0.9
Matrix P
0 1
0 0.9 0.1
1 0.1 0.9
DataFrame df to be filled
0 1 2
0 0.5 ... ...
1 0.5 .... ...
Then column 1 of DataFrame df becomes:
prob = f-[0,0] * df[0,0] / (f-[0,0] * df[0,0] + f+[0,0] * df[0,1])
df[0] = P.dot([prob, 1 - prob])
Here is the code with a for loop:
for i in np.arange(0, n):
    p = f_m[i+1] * df.iloc[0, i] / (f_m[i+1] * df.iloc[0, i] + f_p[i+1] * df.iloc[1, i])
    xi[i + 1] = p_matrix.dot(np.array([p, 1 - p]))
Does anyone have a solution that builds this without a for loop? Each subsequent column is calculated the same way.
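Because each column depends on the one before it, this is a true recurrence, and there may be no straightforward way to express it as a single vectorized pandas operation. One option is to keep the loop but run it on plain NumPy arrays, which avoids the per-step overhead of df.iloc. The sketch below assumes the indexing of the loop in the question, takes its numbers from the tables above, and stores the result in a 2 x (n + 1) array called xi. Those layout choices are assumptions.

```python
import numpy as np

# Values copied from the f-, f+ and P tables in the question.
f_m = np.array([0.3, 0.4, 0.45, 0.6])
f_p = np.array([0.21, 0.78, 0.54, 0.9])
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

n = len(f_m) - 1
xi = np.empty((2, n + 1))
xi[:, 0] = [0.5, 0.5]  # initial column of the frame to be filled

for i in range(n):
    # Weight the previous column by f- and f+, then normalize.
    num = f_m[i + 1] * xi[0, i]
    p = num / (num + f_p[i + 1] * xi[1, i])
    # Propagate through P to get the next column.
    xi[:, i + 1] = P @ np.array([p, 1 - p])

print(xi)
```

If a DataFrame is needed afterwards, pd.DataFrame(xi) wraps the finished array in one step.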

Create a binary matrix after comparing columns' values in a dataframe

The text is long but the question is simple!
I have two dataframes that hold different information about the same set of variables, and I need to create a binary matrix as my output after following some steps.
Let's say my dataframes are these:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
variableA variableB variableC variableD variableE variableF variableG
0 1.0 NaN 9 18 36 99 42
1 2.0 2.0 10 25 11 10 19
2 3.0 NaN 15 43 12 98 27
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(0.14,np.nan,0.03), 'variableG': (1.4,0.134,0.111)})
variableA variableB variableC variableD variableE variableF variableG
0 0.1 0.500 0.9 0.12 NaN 1.4 0.141
1 0.2 NaN 0.1 0.11 0.13 NaN 0.134
2 0.3 0.303 0.4 0.09 0.21 0.03 0.111
And I need to follow these steps:
1 - Check if two columns in my 'market_values' df have at least one value that is equal (for the same row).
2 - If a pair of columns has one value that is equal (for the same row), then I need to compare these same columns in my 'negociation_values' df.
3 - Then I have to discover which variable has the higher negotiation value (for a given row).
4 - Finally I need to create a binary matrix. For the variables with equal values, I'll put 1 where the negotiation value is higher and 0 for the other. If a column doesn't have an equal value with another column, I'll just put 1 for the entire column.
The desired output matrix will be like:
variableA variableB variableC variableD variableE variableF variableG
0 0 1 0 1 1 1 1
1 1 0 1 1 1 0 1
2 0 1 1 1 1 0 1
The main difficulty is at steps 3 and 4.
I've done steps 1 and 2 so far. They're below:
arr = market_values.to_numpy()
is_equal = ((arr == arr[None].T).any(axis=1))
is_equal[np.tril_indices_from(is_equal)] = False
inds_of_same_cols = [*zip(*np.where(is_equal))]
equal_cols = [market_values.columns[list(inds)].tolist() for inds in inds_of_same_cols]
print(equal_cols)
-----------------
[['variableA', 'variableB'], ['variableC', 'variableF']]
h = []
for i in equal_cols:
    op = pd.DataFrame(negociation_values[i])
    h.append(op)
print(h)
-------
[ variableA variableB
0 0.1 0.500
1 0.2 NaN
2 0.3 0.303,
variableC variableF
0 0.9 0.14
1 0.1 NaN
2 0.4 0.03]
The code above returns me the negociation values for the columns that have at least one equal value in the market values df.
Unfortunately, I don't know where to go from here. I need to write code that says something like: "If variableA > variableB (for a row), insert '1' in a new matrix under the variableA column and a '0' under the variableB column for that row. Keep doing that, and then do the same for the other pairs." I also need to say: "If a variable doesn't have an equal value in some other column, insert 1 for all of its values in this binary matrix."
Your negociation_values definition and the presented table are not the same.
Here is the definition I used:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(1.4,np.nan,0.03), 'variableG': (0.141,0.134,0.111)})
The following code gives me the required matrix (though there are a number of edge cases you will need to consider):
cols = market_values.columns.values
bmatrix = pd.DataFrame(index=market_values.index, columns=cols, data=1)
for idx, col in enumerate(cols):
    print(cols[idx+1:])  # debug output: the later columns compared against col
    df_m = market_values[cols[idx+1:]]
    df_n = negociation_values[cols[idx+1:]]
    # Keep later columns that equal col in at least one row, then flag rows where their negotiation value is higher.
    # Note: applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there.
    v = df_n.loc[:,df_m.sub(market_values[col],axis=0).eq(0).any()].sub(negociation_values[col], axis=0).applymap(lambda x: 1 if x > 0 else 0)
    if v.columns.size > 0:
        bmatrix[v.columns[0]] = v
        bmatrix[col] = 1 - v
The result is as required.
The pseudo code is:
for each column of the market matrix:
    subtract it from the later columns,
    keep the columns with any zeros (edge case: more than one column),
    for a column with a zero, find the difference between the corresponding negotiation matrix columns,
    set the result to 1 if > 0, otherwise 0,
    enter it into the binary matrix
Hope that makes sense.
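If the list of matching column pairs from steps 1 and 2 is already available, steps 3 and 4 can also be written as one vectorized comparison per pair. The following is a minimal sketch under two assumptions: equal_cols is hard-coded to the pairs printed in the question, and negociation_values uses the definition given in this answer. A comparison against NaN is False, so when the second column of a pair is missing a value, the first column gets the 1. A NaN in the first column would need separate handling.

```python
import numpy as np
import pandas as pd

negociation_values = pd.DataFrame({
    "variableA": (0.1, 0.2, 0.3),
    "variableB": (0.5, np.nan, 0.303),
    "variableC": (0.9, 0.10, 0.4),
    "variableD": (0.12, 0.11, 0.09),
    "variableE": (np.nan, 0.13, 0.21),
    "variableF": (1.4, np.nan, 0.03),
    "variableG": (0.141, 0.134, 0.111),
})

# Pairs found in steps 1 and 2 of the question.
equal_cols = [["variableA", "variableB"], ["variableC", "variableF"]]

# Start from all ones: columns without a matching partner stay 1.
bmatrix = pd.DataFrame(1, index=negociation_values.index,
                       columns=negociation_values.columns)

for left, right in equal_cols:
    # True where the right-hand column has the strictly higher value.
    right_wins = negociation_values[right] > negociation_values[left]
    bmatrix[right] = right_wins.astype(int)
    bmatrix[left] = 1 - bmatrix[right]

print(bmatrix)
```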

How to multiply the previous value of an other column with the value of x column (shift)

I have the following pandas df which consists of 3 factor columns and 3 signal columns.
import pandas as pd
data = [
[0.1,-0.1,0.1],
[-0.1,0.2,0.3],
[0.3,0.1,0.3],
[0.1,0.3,-0.2]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B', 'factor_C'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x>0 else -1 for x in df[col]]
print(df)
This gives me the following output:
factor_A factor_B factor_C factor_A_signal factor_B_signal factor_C_signal
0 0.1 -0.1 0.1 1 -1 1
1 -0.1 0.2 0.3 -1 1 1
2 0.3 0.1 0.3 1 1 1
3 0.1 0.3 -0.2 1 1 -1
Now, for a 1-month holding period, I have to multiply factor_A by the previous factor_A_signal, add factor_B multiplied by the previous factor_B_signal, divide by the number of factors (in this case 2), and store the result in a new column ("ret_1m"). At the moment I cannot say how many factors I will have as input, so I have to work with a for loop.
For a 2-month holding period, I have to multiply the t+1 factor_A by the previous factor_A_signal, add the t+1 factor_B multiplied by the previous factor_B_signal, divide by the number of factors, and add a new column ("ret_2m"), and so on up to the 12th month.
As an example, here is how I would do that for 2 factors and a 3-month holding period:
import pandas as pd
data = [
[0.1,-0.1],
[-0.1,0.2],
[0.3,0.1],
[0.1,0.3]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x>0 else -1 for x in df[col]]
print(df)
def one_three(n_factors):
    df["ret_1m"] = (df['factor_A_signal'].shift() * df["factor_A"] +
                    df['factor_B_signal'].shift() * df["factor_B"])/n_factors
    df["ret_2m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-1) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-1))/n_factors
    df["ret_3m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-2) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-2))/n_factors
    return df
one_three(2)
Output:
factor_A factor_B factor_A_signal factor_B_signal ret_1m ret_2m ret_3m
0 0.1 -0.1 1 -1 NaN NaN NaN
1 -0.1 0.2 -1 1 -0.15 0.1 -0.1
2 0.3 0.1 1 1 -0.10 0.1 NaN
3 0.1 0.3 1 1 0.20 NaN NaN
How could I automate this with a for loop? Thank you very much in advance.
Here is a for loop that replaces your function one_three(n_factors):
# Create list of columns in dataframe that are not signals
factors = [x for x in df.columns if not x.endswith("_signal")]
# Loop through range from 1 to 1 + number of months (in your example 3)
for i in range(1, 3+1):
    name = "ret_" + str(i) + "m"
    df[name] = 0
    for x in factors:
        # Previous signal times the factor shifted forward by (i - 1) months
        df[name] += df[str(x + "_signal")].shift() * df[x].shift(1 - i)
    df[name] /= len(factors)
This assumes you have already populated the factor_ columns and run the signal loop. The first section finds all columns that do not end with _signal and returns them as a list; alternatively you could pass an explicit list such as [factor_A, factor_B, ...]. The outer loop runs over the number of months (3 here, following your example), and the inner loop accumulates the contribution of every factor in the list.
The output for this matched your output with the given input data.
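The same idea can be wrapped in a function so that the factor list and the number of holding months become parameters. This is a sketch, not part of the answer above: holding_returns and its arguments are hypothetical names, and the row sum is done in NumPy so that a missing value in any factor propagates as NaN.

```python
import numpy as np
import pandas as pd

def holding_returns(df, factors, max_months):
    """Return a copy of df with ret_1m ... ret_<max_months>m columns added."""
    out = df.copy()
    # Signals from the previous row, one column per factor.
    prev_signals = out[[f + "_signal" for f in factors]].shift().to_numpy()
    for m in range(1, max_months + 1):
        # Factor values shifted forward by (m - 1) rows.
        future_factors = out[factors].shift(1 - m).to_numpy()
        out[f"ret_{m}m"] = (prev_signals * future_factors).sum(axis=1) / len(factors)
    return out

df = pd.DataFrame({"factor_A": [0.1, -0.1, 0.3, 0.1],
                   "factor_B": [-0.1, 0.2, 0.1, 0.3]})
for col in ["factor_A", "factor_B"]:
    df[col + "_signal"] = np.where(df[col] > 0, 1, -1)

res = holding_returns(df, ["factor_A", "factor_B"], max_months=3)
print(res)
```

Calling it with max_months=12 would produce the columns up to ret_12m mentioned in the question.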

Summing subsets of many dataframes

I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file will begin, until all files have been processed. As you can see, this approach relies on df.loc for the subset and df.add for the addition. However, despite the ease of coding, it is very inefficient. On my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes? For example, by replacing loc with a faster slicing method.
Example:
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list
window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    # iterrows yields (index, row) pairs; only the index is needed here
    for index, _ in df1.iterrows():
        # Skip positions where a full window does not fit
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head. Then run a groupby aggregation after the individual elements are concatenated.
window_size = 3
def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]    # ROW AND COLUMN SLICE
                         .head(window_size)           # SELECT FIRST WINDOW ROWS
                         .reset_index(drop=True)      # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                   # COMBINE WINDOW DFS
                .groupby(level=0).sum())              # AGGREGATE BY INDEX
    return sum_df
# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(file) for file in os.listdir('.')])
# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, below is each element of window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
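The final df_sum agrees with what DataFrame.add() would give over these windows. Note that the two trailing partial windows are included, so the totals differ from the expected output in the question: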
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
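The question mentions that a NumPy-based implementation is much faster. As a rough illustration of that route (a different technique from the concat/groupby approach above), the sketch below sums only the full windows of one frame using numpy.lib.stride_tricks.sliding_window_view, which requires NumPy 1.20 or later. The frame is the df1 sample from the question; reading the files and accumulating across them is left out.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df1 = pd.DataFrame({
    "A": [0.1, 0.0, 0.5, 0.4, 0.0, 0.1, 0.1],
    "B": [0.5, 0.0, 0.1, 0.5, 0.0, 0.5, 0.5],
    "C": [0.2, 0.8, 0.1, 0.1, 0.8, 0.2, 0.2],
    "D": list("CCHHCCC"),
})
window_size = 3
cols = ["A", "B", "C"]

values = df1[cols].to_numpy()
# Shape (n_windows, n_cols, window_size): windows[i, c, k] == values[i + k, c].
windows = sliding_window_view(values, window_size, axis=0)
# Sum over all windows, then transpose to (window_size, n_cols).
summed = windows.sum(axis=0).T

df_sum = pd.DataFrame(summed, columns=cols)
print(df_sum)
```

Since only full windows are used, this reproduces the expected output given in the question. For many files, the per-file arrays could be added together before wrapping the total in a DataFrame once at the end.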
