Apply a rolling function with multiple arguments on a DataFrame - python

Let's say I have the following dataframe:
df = pd.DataFrame({"quantity": [101, 102, 103], "price":[12, 33, 44]})
price quantity
0 12 101
1 33 102
2 44 103
I have been struggling to find how to apply a rolling complex function on it.
For simplicity, let's assume this function f is just the product of quantity and price. In this case, how do I apply this function on a rolling window of size 1, with a scaling parameter, say:
scaling = 10
such that the resulting dataframe would be:
price quantity value
0 12 101 NaN
1 33 102 12120.0
2 44 103 33660.0
with value[i] = price[i-1]*quantity[i-1]*scaling
I have tried:
def f(x, scaling):
    return x['quantity'] * x['price'] * scaling
df.rolling(window=1).apply(lambda x: f(x, scaling))
and
def f(quantity, price, scaling):
    return quantity * price * scaling
df.rolling(window=1).apply(lambda x: f(x['quantity'], x['price'], scaling))
Could you please help me fix this without simply doing:
df['value'] = df['quantity'].shift(1)*df['price'].shift(1)*scaling
?

Assuming what you want is indeed value[i] = price[i-1] * quantity[i-1] * scaling , then:
scaling = 10
df['value'] = df.shift(1).apply(lambda x: x['quantity'] * x['price'] * scaling, axis=1)
df
quantity price value
0 101 12 NaN
1 102 33 12120.0
2 103 44 33660.0
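Both attempts in the question fail because rolling(...).apply works on one column at a time and passes the function a 1-D window, so x['quantity'] is not available inside it. If a true rolling window over several columns is needed, one possible workaround is to roll over the row positions and use them to slice the dataframe. This is only a sketch: the helper series row_pos and the choice to sum the products within the window are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"quantity": [101, 102, 103], "price": [12, 33, 44]})
scaling = 10

def f(window, scaling):
    # window is a DataFrame slice holding every column for the rows in the window
    return (window["quantity"] * window["price"]).sum() * scaling

# Hypothetical helper: roll over integer row positions instead of the data itself
row_pos = pd.Series(np.arange(len(df)), index=df.index)

rolled = row_pos.rolling(window=1).apply(
    lambda w: f(df.iloc[w.astype(int)], scaling),  # w is a float ndarray of positions
    raw=True,
)

# Shift by one row so that value[i] is computed from row i-1
df["value"] = rolled.shift(1)
print(df)
```

With window=1 the sum has a single term, so this matches the shift-based answer; a larger window would aggregate the products of several rows.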

Related

Create a master data set comprised of multiple data frames

I have been stuck on this problem for a while now! Included below is a very simplified version of my program, along with some context. Essentially, what I want to view is one large dataframe that has all of my desired permutations based on my input variables. This is in the context of scenario analysis, and it will help me avoid doing on-demand calculations through my BI tool when the user wants to change variables to visualise the output.
I have tried:
Creating a function out of my code and trying to apply the function with each of the step size changes of my input variables (no idea what I am doing there).
Literally manually changing the input variables myself (as a noob I realise this is not the way to go but had to first see my code was working to append df's).
Essentially what I want to achieve is as follows:
use the variables "date_offset" and "cost" and vary each of them by the required step size and number of steps
As an example, if there are 2 values for date_offset (step size 1) and two values for cost (step size one) there are a possible 4 combinations, therefore the data set will be 4 times the size of the df in my code below.
Now I have all of the permutations of the input variable and the corresponding data frame to go with each of those permutations, I would like to append each one of the data frames together.
I should be left with one data frame for all of the possible scenarios which I can then visualise with a BI tool.
I hope you guys can help :)
Here is my code.....
import pandas as pd
import numpy as np
#want to iterate through starting at a date_offset of 0 with a total of 5 steps and a step size of 1
date_offset = 0
steps_1 = 5
stepsize_1 = 1
#want to iterate though starting at a cost of 5 with a total number of steps of 5 and a step size of 1
cost = 5
steps_2 = 4
step_size = 1
df = {'id':['1a', '2a', '3a', '4a'],'run_life':[10,20,30,40]}
df = pd.DataFrame(df)
df['date_offset'] = date_offset
df['cost'] = cost
df['calc_col1'] = df['run_life']*cost
Are you trying to do something like this:
from itertools import product
data = {'id': ['1a', '2a', '3a', '4a'], 'run_life': [10, 20, 30, 40]}
df = pd.DataFrame(data)
date_offset = 0
steps_1 = 5
stepsize_1 = 1
cost = 5
steps_2 = 4
stepsize_2 = 1
df2 = pd.DataFrame(
    product(
        range(date_offset, date_offset + steps_1 * stepsize_1 + 1, stepsize_1),
        range(cost, cost + steps_2 * stepsize_2 + 1, stepsize_2)
    ),
    columns=['offset', 'cost']
)
result = df.merge(df2, how='cross')
result['calc_col1'] = result['run_life'] * result['cost']
Output:
id run_life offset cost calc_col1
0 1a 10 0 5 50
1 1a 10 0 6 60
2 1a 10 0 7 70
3 1a 10 0 8 80
4 1a 10 0 9 90
.. .. ... ... ... ...
115 4a 40 5 5 200
116 4a 40 5 6 240
117 4a 40 5 7 280
118 4a 40 5 8 320
119 4a 40 5 9 360
[120 rows x 5 columns]
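The row count follows from the cross join: 4 base rows x 6 offsets x 5 costs = 120. Note that how='cross' was added in pandas 1.2. On older versions, a similar result can be sketched by merging on a constant key; the column name key and the variable names base and grid below are hypothetical choices for this illustration.

```python
import pandas as pd

base = pd.DataFrame({"id": ["1a", "2a", "3a", "4a"], "run_life": [10, 20, 30, 40]})

# Every (offset, cost) combination, built without itertools
grid = pd.MultiIndex.from_product(
    [range(0, 6), range(5, 10)], names=["offset", "cost"]
).to_frame(index=False)

# Emulate a cross join: give both frames the same constant key and merge on it
result = (
    base.assign(key=1)
    .merge(grid.assign(key=1), on="key")
    .drop(columns="key")
)
result["calc_col1"] = result["run_life"] * result["cost"]

print(len(result))  # 4 * 6 * 5 rows
```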

determine the range of a value using a look up table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note the vanilla dataframe above has 3 ranges, however this dataframe gets generated dynamically. It could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
    [50, 'out_of_range'],
    [65, 'out_of_range'],
    [75, 'C'],
    [85, 'B'],
    [90, 'overlap']  # could be A or B
])
I solved this with a for loop but this doesn't scale well to a big dataset I am using. Also code is too extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row2.range_min <= row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        # elif (other cases...):
        #     ...and so on...
How could I do this?
You can use a few vectorized NumPy operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values  # NumPy array of numbers
r = ranges.set_index('range')  # dataframe of min/max with labels as index
m1 = (a >= r['range_min'].values[:, None]).T  # is each number at or above each min?
m2 = (a <= r['range_max'].values[:, None]).T  # is each number at or below each max? (limits are inclusive)
m3 = m1 & m2  # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1)  # how many ranges match each number?
# 0 -> out_of_range
# 2 or more -> overlap
# 1 -> get the range label
# now we select the label according to the conditions
numbers['detected_range'] = np.select(
    [m4 == 0, m4 >= 2],  # out_of_range and overlap
    ['out_of_range', 'overlap'],
    # otherwise take the label of the single matching range
    default=np.take(r.index, m3.argmax(1))
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
edit:
It works with any number of intervals in ranges
Example output with an extra range ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
A pandas IntervalIndex fits here; however, since your ranges overlap, a for loop is the approach I'll use (for unique, non-overlapping intervals, IntervalIndex.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')
numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
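For the non-overlapping case mentioned above, the loop can be avoided entirely with IntervalIndex.get_indexer, which returns -1 for values that fall in no interval. The sketch below assumes a modified, hypothetical lookup table whose ranges do not overlap (A has been narrowed to 96-100); it does not apply to the overlapping table from the question.

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({"number": [50, 65, 75, 85, 90]})

# Hypothetical lookup table with non-overlapping, inclusive ranges
ranges = pd.DataFrame({
    "range": ["C", "B", "A"],
    "range_min": [70, 85, 96],
    "range_max": [80, 95, 100],
})

intervals = pd.IntervalIndex.from_arrays(
    ranges["range_min"], ranges["range_max"], closed="both"
)

# Position of the interval containing each number, or -1 when there is none
pos = intervals.get_indexer(numbers["number"].to_numpy())

labels = ranges["range"].to_numpy()
# labels[pos] picks the last label for -1, but np.where masks those rows out
numbers["detected_range"] = np.where(pos == -1, "out_of_range", labels[pos])
print(numbers)
```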
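First, explode the ranges into one row per integer value (range() excludes its stop value, so use ss.range_max + 1 if the upper limit must be inclusive):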
df1=ranges.assign(col1=ranges.apply(lambda ss:range(ss.range_min,ss.range_max),axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
Second, check whether each number in the first dataframe appears in the exploded table:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.

Calculating rolling beta in Pandas

I am trying to calculate a rolling beta between two Series in pandas.
My understanding is that to get the beta, I need to get the covariance matrix and then divide the cells (0, 1) by (1, 1)
So I created a function:
def calc_beta(A, B):
    covariance = np.cov(A, B)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
If I just wanted to run it for the entire series, I would do:
calc_beta(A, B)
But I'm not sure how to do that on a rolling basis, I tried A.rolling(10).apply(calc_beta, raw=False, B) unsuccessfully.
Then I just tried calculating the cov matrix on a rolling basis, which I can do:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame([A, B]).transpose()
df.rolling(10).cov(df, pairwise=True)
Now I have a covariance matrix, but how do I perform the beta calculation, i.e. (covariance[0,1]/covariance[1,1]), on a rolling basis (and then get the mean)?
It might not be the best answer (read: the most compact), but I think this could do the trick. You were actually on the right track to begin with. So, assume you have the two series you gave and make them into a df:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.concat([A, B], axis=1)
Define the beta and the rolling in the following way:
def calc_beta(df):
    np_array = df.values
    s = np_array[:, 0]
    m = np_array[:, 1]
    covariance = np.cov(s, m)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
def rolling(df, period, function, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df) + 1):
        df2 = df.iloc[max(i - period, 0):i, :]  # I edited here
        if len(df2) >= min_periods:
            idx = df2.index[-1]
            result[idx] = function(df2)
    return result
And do the following:
calc_beta(df)
which returns 0.15350171576854774
and
rolling(df, 12,calc_beta, min_periods=None)
(Of course, you can choose any period)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.034478
12 0.019883
13 -0.093483
14 0.140603
15 0.137694
16 -0.004115
17 -0.144355
18 -0.079803
19 -0.023759
20 0.099539
21 0.186670
22 0.199526
23 0.113457
24 0.152232
25 0.149928
26 0.079760
27 0.032097
28 0.056294
29 0.070176
30 0.076560
31 0.013778
32 0.080279
33 0.058864
34 0.006916
35 0.303566
36 0.133580
37 0.238668
38 0.312243
39 0.406835
40 0.337503
41 0.370470
42 0.237132
43 0.253779
44 0.160348
45 0.103425
46 0.261430
47 0.130407
48 0.314028
49 0.322890
dtype: float64
So I appreciate the answer, @Serge, but I felt I could do it in a slightly cleaner way. This is what I've come up with, which works for me. Let me know if you have any comments on it. Thanks again.
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame({'A' : A, 'B' : B})
df.rolling(10).cov(df, pairwise=True).drop(['A'], axis=1) \
.unstack(1) \
.droplevel(0, axis=1) \
.apply(lambda row: row['A'] / row['B'], axis=1) \
.mean()
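Since beta here is cov(A, B) / var(B), the same quantity can also be sketched directly from the pairwise rolling covariance and the rolling variance, without building the full covariance matrix. The fixed seed and the variable names window, rolling_beta and mean_beta are assumptions added for reproducibility and illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
A = pd.Series(rng.integers(1, 101, 50))
B = pd.Series(rng.integers(1, 101, 50))

window = 10
# Rolling covariance of A with B, divided by the rolling variance of B
rolling_beta = A.rolling(window).cov(B) / B.rolling(window).var()
mean_beta = rolling_beta.mean()  # NaN values from the warm-up rows are skipped

# Cross-check the last window against the np.cov-based definition
cov = np.cov(A.iloc[-window:], B.iloc[-window:])
print(rolling_beta.iloc[-1], cov[0, 1] / cov[1, 1])
```

Both pandas rolling statistics and np.cov default to the sample (ddof=1) estimator, so the two printed values should agree up to floating-point error.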

How to multiply a specific row in pandas dataframe by a condition

I have a column of 10th marks, but some specific rows are not scaled properly, i.e. they are out of 10. I want to create a function that will help me detect which values are <= 10 and then multiply them to 100. I tried creating a function, but it failed.
Following is the Column:
data['10th']
0 0
1 0
2 0
3 10.00
4 0
...
2163 0
2164 0
2165 0
2166 76.50
2167 64.60
Name: 10th, Length: 2168, dtype: object
I am not sure what you mean by "multiply to 100", but you should be able to use apply with a lambda, similar to this:
df = pd.DataFrame({"a": [1, 3, 5, 23, 76, 43 ,12, 3 ,5]})
df['a'] = df['a'].apply(lambda x: x*100 if x < 10 else x)
print(df)
0 100
1 300
2 500
3 23
4 76
5 43
6 12
7 300
8 500
If I have misunderstood you, you can replace the action and condition in the lambda function to suit your purpose.
It looks like you need to change the data type first: data["10th"] = pd.to_numeric(data["10th"])
I assume you want to multiply by 10, not 100, to put these marks on the same scale as the other scores out of 100. You can try this: np.where(data["10th"]<10, data["10th"]*10, data["10th"])
Assign it back to the dataframe using: data["10th"] = np.where(data["10th"]<10, data["10th"]*10, data["10th"])
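Putting the two steps together, a minimal sketch might look like the following. The sample values are hypothetical, errors="coerce" is an added assumption so that unparseable strings become NaN rather than raising, and the threshold uses <= 10 as stated in the question (the answer above uses < 10).

```python
import pandas as pd

# Hypothetical sample: marks stored as strings, as the object dtype suggests
data = pd.DataFrame({"10th": ["0", "10.00", "7.5", "76.50", "64.60"]})

# Step 1: convert to numbers; anything unparseable becomes NaN
marks = pd.to_numeric(data["10th"], errors="coerce")

# Step 2: keep marks above 10 as they are, rescale the rest from /10 to /100
data["10th"] = marks.where(marks > 10, marks * 10)
print(data)
```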

I want to multiply two columns in a pandas DataFrame and add the result into a new column

I'm trying to multiply two existing columns in a pandas Dataframe (orders_df): Prices (stock close price) and Amount (stock quantities) and add the calculation to a new column called Value. For some reason when I run this code, all the rows under the Value column are positive numbers, while some of the rows should be negative. Under the Action column in the DataFrame there are seven rows with the 'Sell' string and seven with the 'Buy' string.
for i in orders_df.Action:
    if i == 'Sell':
        orders_df['Value'] = orders_df.Prices * orders_df.Amount
    elif i == 'Buy':
        orders_df['Value'] = -orders_df.Prices * orders_df.Amount
Please let me know what I'm doing wrong!
I think an elegant solution is to use the where method (also see the API docs):
In [37]: values = df.Prices * df.Amount
In [38]: df['Values'] = values.where(df.Action == 'Sell', other=-values)
In [39]: df
Out[39]:
Prices Amount Action Values
0 3 57 Sell 171
1 89 42 Sell 3738
2 45 70 Buy -3150
3 6 43 Sell 258
4 60 47 Sell 2820
5 19 16 Buy -304
6 56 89 Sell 4984
7 3 28 Buy -84
8 56 69 Sell 3864
9 90 49 Buy -4410
Furthermore, because it is vectorized, this should be one of the fastest solutions.
You can use the DataFrame apply method:
orders_df['Value'] = orders_df.apply(lambda row: (row['Prices'] * row['Amount']
                                                  if row['Action'] == 'Sell'
                                                  else -row['Prices'] * row['Amount']),
                                     axis=1)
It is usually faster to use these methods than to loop with for.
If we're willing to sacrifice the succinctness of Hayden's solution, one could also do something like this:
In [22]: orders_df['C'] = orders_df.Action.apply(
lambda x: (1 if x == 'Sell' else -1))
In [23]: orders_df # New column C represents the sign of the transaction
Out[23]:
Prices Amount Action C
0 3 57 Sell 1
1 89 42 Sell 1
2 45 70 Buy -1
3 6 43 Sell 1
4 60 47 Sell 1
5 19 16 Buy -1
6 56 89 Sell 1
7 3 28 Buy -1
8 56 69 Sell 1
9 90 49 Buy -1
Now we have eliminated the need for the if statement. Using Series.apply(), we also do away with the for loop. As Hayden noted, vectorized operations are generally faster.
In [24]: orders_df['Value'] = orders_df.Prices * orders_df.Amount * orders_df.C
In [25]: orders_df # The resulting dataframe
Out[25]:
Prices Amount Action C Value
0 3 57 Sell 1 171
1 89 42 Sell 1 3738
2 45 70 Buy -1 -3150
3 6 43 Sell 1 258
4 60 47 Sell 1 2820
5 19 16 Buy -1 -304
6 56 89 Sell 1 4984
7 3 28 Buy -1 -84
8 56 69 Sell 1 3864
9 90 49 Buy -1 -4410
This solution takes two lines of code instead of one, but is a bit easier to read. I suspect that the computational costs are similar as well.
Since this question came up again, I think a good clean approach is using assign.
The code is quite expressive and self-describing:
df = df.assign(Value=lambda x: x.Prices * x.Amount * x.Action.replace({'Buy': -1, 'Sell': 1}))
To make things neat, I take Hayden's solution but make a small function out of it.
def create_value(row):
    if row['Action'] == 'Sell':
        return row['Prices'] * row['Amount']
    else:
        return -row['Prices'] * row['Amount']
so that when we want to apply the function to our dataframe, we can do..
df['Value'] = df.apply(lambda row: create_value(row), axis=1)
...and any modifications only need to occur in the small function itself.
Concise, Readable, and Neat!
Good solution from bmu. I think it's more readable to put the values inside the parentheses vs outside.
df['Values'] = np.where(df.Action == 'Sell',
df.Prices*df.Amount,
-df.Prices*df.Amount)
Using some pandas built in functions.
df['Values'] = np.where(df.Action.eq('Sell'),
df.Prices.mul(df.Amount),
-df.Prices.mul(df.Amount))
For me, this is the clearest and most intuitive:
for action in ['Sell', 'Buy']:
    mask = orders_df['Action'] == action  # rows for this action
    amounts = orders_df.loc[mask, 'Amount'].values
    if action == 'Sell':
        prices = orders_df.loc[mask, 'Prices'].values
    else:
        prices = -1 * orders_df.loc[mask, 'Prices'].values
    # write back through the mask so each value lands on its own row
    orders_df.loc[mask, 'Values'] = amounts * prices
The .values attribute returns a NumPy array, which makes element-wise multiplication easy, and assigning through the boolean mask keeps each result aligned with its row.
First, multiply the columns Prices and Amount. Afterwards use mask to negate the values if the condition is True:
df.assign(
    Values=(df["Prices"] * df["Amount"]).mask(df["Action"] == "Buy", lambda x: -x)
)
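As for why the loop in the question produces only positive numbers: each pass assigns the whole Value column at once, so the sign applied on the final pass (the last entry in Action) overwrites every earlier pass. The sketch below reproduces that on a small, hypothetical dataframe whose last row is a 'Sell', and contrasts it with a per-row sign computed with np.where.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row is a 'Sell'
orders_df = pd.DataFrame({
    "Prices": [3, 45, 6],
    "Amount": [57, 70, 43],
    "Action": ["Sell", "Buy", "Sell"],
})

# The loop from the question: every pass overwrites the entire column
looped = orders_df.copy()
for action in looped["Action"]:
    if action == "Sell":
        looped["Value"] = looped["Prices"] * looped["Amount"]
    elif action == "Buy":
        looped["Value"] = -looped["Prices"] * looped["Amount"]
print(looped["Value"].tolist())  # all positive, because the last action is 'Sell'

# Row-wise alternative: one sign per row, applied in a single vectorized step
sign = np.where(orders_df["Action"] == "Sell", 1, -1)
orders_df["Value"] = orders_df["Prices"] * orders_df["Amount"] * sign
print(orders_df["Value"].tolist())
```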
