Python Pandas Cumulative Multiplication with IF case

I created a small dataframe, and I want to multiply the previous result by 0.99 (and so on, cumulatively) whenever the "IF case" is true; otherwise I want to put x[i] itself.
In:
1
6
2
8
4
Out:
1.00
0.99
2.00
1.98
1.96
With help from someone, based on a similar problem, I tried the following, but it does not work:
import numpy as np
import pandas as pd

x = pd.DataFrame([1, 6, 2, 8, 4])
y = np.zeros(x.shape)
yd = pd.DataFrame(y)
yd = np.where(x < 3, x, pd.Series(.99, yd.index).cumprod() / .99)
Any idea? Thank you

This is more of a groupby problem: whenever the value is less than 3, you reset the cumulative product.
y = x[0]
mask = y < 3  # rows where the cumulative product resets
y.where(mask, 0.99).groupby(mask.cumsum()).cumprod()
Out[122]:
0 1.0000
1 0.9900
2 2.0000
3 1.9800
4 1.9602
Name: 0, dtype: float64
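To see why this works, here is a small sketch (reusing the same x as above): mask.cumsum() increases by one at every row below 3, so each such row starts a new group, and within a group every other value has already been replaced by 0.99 before cumprod runs.
import pandas as pd

x = pd.DataFrame([1, 6, 2, 8, 4])
y = x[0]
mask = y < 3                          # True where the product resets
print(mask.cumsum().tolist())         # [1, 1, 2, 2, 2] -> groups {0, 1} and {2, 3, 4}
print(y.where(mask, 0.99).tolist())   # [1.0, 0.99, 2.0, 0.99, 0.99]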
Failing that, we always have the for loop (if the above does not work):
your = []
for t, v in enumerate(x[0]):
    if v < 3:
        your.append(v)
    else:
        your.append(your[t-1]*0.99)
your
Out[129]: [1, 0.99, 2, 1.98, 1.9602]

This checks whether the value of x in the current row is less than 3. If it is, it keeps it as is; otherwise it multiplies the previous row by 0.99. Note that shift(1) looks one row back in the original column, not in the result, so this is not cumulative: row 4 becomes 8 * 0.99 = 7.92 rather than the 1.96 the question expects.
x = pd.DataFrame([1, 6, 2, 8, 4])
x['out'] = np.where(x[0] < 3, x[0], x[0].shift(1) * 0.99)
Output:
x['out']
0 1.00
1 0.99
2 2.00
3 1.98
4 7.92

Related

Pandas, millions and billions

I have a dataframe with this kind of data
1 400.00M
2 1.94B
3 2.72B
4 -400.00M
5 13.94B
I would like to convert the data to billions so that the output would be something like this
1 0.40
2 1.94
3 2.72
4 -0.40
5 13.94
Note that the dtype is object.
Use replace with a dictionary and then map pd.eval:
Sample df:
Out[1629]:
val
1 400.00M
2 1.94B
3 2.72B
4 -400.00M
5 13.94B
d = {'M': '*0.001', 'B': ''}  # '400.00M' -> '400.00*0.001', '1.94B' -> '1.94'
s_convert = df.val.replace(d, regex=True).map(pd.eval)
Out[1633]:
1 0.40
2 1.94
3 2.72
4 -0.40
5 13.94
Name: val, dtype: float64
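If you would rather avoid building expression strings for pd.eval, a minimal alternative sketch (assuming, as in the sample, that every value ends in exactly one suffix, M or B) that splits off the suffix and maps it to a scale factor:
import pandas as pd

df = pd.DataFrame({'val': ['400.00M', '1.94B', '2.72B', '-400.00M', '13.94B']})
factors = {'M': 0.001, 'B': 1.0}  # scale everything to billions
s_convert = df['val'].str[:-1].astype(float) * df['val'].str[-1].map(factors)
print(s_convert)  # 0.40, 1.94, 2.72, -0.40, 13.94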
You can use a lambda expression if you know for a fact that you only have millions or billions:
import pandas as pd

amount = ["400.00M", "1.94B", "2.72B", "-400.00M", "13.94B"]
df = pd.DataFrame(amount, columns=["amount"])
df.amount.apply(lambda x: float(x[:-1]) if x[-1] == "B" else float(x[:-1])/1000)
Or a list comprehension...
data = {'value': ['400.00M', '1.94B', '2.72B', '-400.00M', '13.94B']}
df = pd.DataFrame(data, index = [1, 2, 3, 4, 5])
df['value'] = [float(n[:-1])/1000 if n[-1:] == 'M' else float(n[:-1]) for n in df['value']]
...though @Andy's answer is more concise.

Vectorized way of finding the index of a previously occurring element

Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a number, say 5, and return the index where it previously occurred. So for the 5 at index 6, I should get 4, as the element appears at indices 4 and 6. Now I want to do this for all of the elements of the series, which can easily be done using a for loop:
result = []
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()  # index of the first occurrence of x
    result.append(idx_prev if idx_prev < idx else np.nan)
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby on the values and shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
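To make the mechanics explicit, here is a runnable version of the same one-liner on the sample data: the index is turned into data, grouped by the values of num, and shift() then pairs every occurrence with the index of the previous occurrence of the same value.
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])
# Within each value group, shift() yields the previous index label
# at which that value appeared (NaN for first occurrences).
prev_idx = num.index.to_series().groupby(num).shift()
print(prev_idx.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]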
It's possible to keep working in NumPy.
The equivalent of [num[num == x].idxmax() for idx, x in enumerate(num)] in NumPy is:
_, out = np.unique(num.values, return_inverse=True)
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the invalid entries of out (where no earlier occurrence exists) to NaN like this:
out_series = pd.Series(out)
out_series[out >= np.arange(len(out))] = np.nan
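Putting the pieces together, a runnable sketch on the sample series. One caveat: return_inverse indexes into the sorted unique array, so this matches the first-occurrence index only when, as here, each value first appears in sorted order.
import numpy as np
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])
_, out = np.unique(num.values, return_inverse=True)  # position in the sorted unique values
out_series = pd.Series(out, dtype=float)
out_series[out >= np.arange(len(out))] = np.nan      # no earlier occurrence -> NaN
print(out_series.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]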

How to implement this difference operation efficiently?

I would like to build a data frame from an existing one, where each value in a row depends on the previous one. I have an initial value v0 as a starting point. Let me give an example:
In [126]:import pandas as pd
In [127]: df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
In [128]: df_result = df.copy()
In [129]: v0 = 10
In [130]: for i in range(1, len(df.index)):
     ...:     df_result.iloc[i, 0] = df.iloc[i, 0]*df_result.iloc[i-1, 0]
     ...:
In [131]: df_result
Out[131]:
0
0 1.000
1 1.100
2 1.320
3 1.716
In [132]:
My question is about the for loop. How can I write this more efficiently?
I believe you need to first numpy.insert the value v0 at the first position and then call numpy.cumprod:
import numpy as np
import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
df['n'] = np.cumprod(np.insert(df['r'].values[1:], 0, v0))
print(df)
r n
0 1.0 10.00
1 1.1 11.00
2 1.2 13.20
3 1.3 17.16
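An equivalent pandas-only sketch of the same idea, staying with the original single-column frame: seed the first element with v0 and take the cumulative product directly.
import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
v0 = 10
s = df[0].copy()
s.iloc[0] = v0          # replace the first ratio with the starting value
df['n'] = s.cumprod()   # n[i] = v0 * r[1] * ... * r[i]
print(df)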

Efficient way to find price momentum in python: averaging last n entries of a column

I'm defining price momentum as an average of the given stock's momentum over the past n days.
Momentum, in turn, is a classification: each day is labeled 1 if the closing price that day is higher than the day before, and −1 if the price is lower than the day before.
I have stock change percentages as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['close in percent'] = np.array([0.27772152, 1.05468772,
                                   0.124156, -0.39298394,
                                   0.56415267, 1.67812005])
momentum = df['close in percent'].apply(lambda x: 1 if x > 0 else -1).values
Momentum should be: [1,1,1,-1,1,1].
So if I'm finding the average momentum for the last n = 3 days, I want my price momentum to be:
Price_momentum = [NaN, NaN, 1, 1/3, 1/3, 1/3]
I managed to get it working with the following code, but it is extremely slow (the dataset is 5000+ rows and it takes 10 minutes to execute):
df['3_day_momentum'] = np.nan
for i in range(3, len(df)+1):
    data = np.array(momentum[i-3:i])
    df['3_day_momentum'].iloc[i-1] = data.mean()
You can create a rolling object:
df = pd.DataFrame()
df['close_in_percent'] = np.array([0.27772152, 1.05468772,
0.124156 , -0.39298394,
0.56415267, 1.67812005])
df['momentum'] = np.where(df['close_in_percent'] > 0, 1, -1)
df['3_day_momentum'] = df.momentum.rolling(3).mean()
Here, np.where is an alternative to apply(), which is generally slow and should be used as a last resort.
close_in_percent momentum 3_day_momentum
0 0.2777 1 NaN
1 1.0547 1 NaN
2 0.1242 1 1.0000
3 -0.3930 -1 0.3333
4 0.5642 1 0.3333
5 1.6781 1 0.3333
You can use np.where + pd.Rolling.mean -
s = df['close in percent']
pd.Series(np.where(s > 0, 1, -1)).rolling(3).mean()
0 NaN
1 NaN
2 1.000000
3 0.333333
4 0.333333
5 0.333333
dtype: float64
For pandas v0.17 or below, there's also pd.rolling_mean (removed in later versions), which works with arrays directly.
pd.rolling_mean(np.where(s > 0, 1, -1), window=3)
array([ nan, nan, 1. , 0.33333333, 0.33333333,
0.33333333])
Those rolling averages are basically uniform filtered values. Hence, we can use SciPy's uniform filter -
from scipy.ndimage import uniform_filter1d

def rolling_mean(ar, W=3):
    hW = (W-1)//2  # shift the window so it ends at the current element
    out = uniform_filter1d(ar.astype(float), size=W, origin=hW)
    out[:W-1] = np.nan  # not enough history for a full window
    return out

momentum = 2*(df['close in percent'] > 0) - 1
df['out'] = rolling_mean(momentum, W=3)
Benchmarking
Timing pandas.rolling and SciPy's uniform filter -
In [463]: df = pd.DataFrame({'close in percent':np.random.randn(1000000)})
In [464]: df['momentum'] = np.where(df['close in percent'] > 0, 1, -1)
In [465]: momentum = 2*(df['close in percent'] > 0) - 1
# From @Brad Solomon's soln
In [466]: %timeit df['3_day_momentum'] = df.momentum.rolling(3).mean()
10 loops, best of 3: 27.3 ms per loop
# SciPy's uniform filter
In [467]: %timeit df['3_day_momentum_out'] = rolling_mean(momentum, W=3)
100 loops, best of 3: 7.69 ms per loop

How to count continuous numbers in numpy

I have a NumPy one-dimensional array of 1s and 0s, e.g.
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
I want to count the continuous 0s and 1s in the array and output something like this
[1,3,7,1,1,2,3,2,2]
What I do at the moment is
np.diff(np.where(np.abs(np.diff(a)) == 1)[0])
and it outputs
array([3, 7, 1, 1, 2, 3, 2])
as you can see, it is missing the first count, 1.
I've tried np.split and then getting the sizes of each segment, but it does not seem optimal.
Is there a more elegant "pythonic" solution?
Here's one vectorized approach -
np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
Sample run -
In [208]: a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
In [209]: np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
Out[209]: array([1, 3, 7, 1, 1, 2, 3, 2, 2])
Faster one with boolean concatenation -
np.diff(np.flatnonzero(np.concatenate(([True], a[1:] != a[:-1], [True]))))
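Wrapped as a small reusable helper (a sketch based on the boolean-concatenation variant above), which also returns the value of each run alongside its length:
import numpy as np

def run_lengths(a):
    # indices where a new run starts, plus both array boundaries
    edges = np.flatnonzero(np.concatenate(([True], a[1:] != a[:-1], [True])))
    return a[edges[:-1]], np.diff(edges)  # run values, run lengths

a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
vals, lens = run_lengths(a)
print(lens.tolist())  # [1, 3, 7, 1, 1, 2, 3, 2, 2]
print(vals.tolist())  # [0, 1, 0, 1, 0, 1, 0, 1, 0]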
Runtime test
For the setup, let's create a bigger dataset with islands of 0s and 1s, and for a fair benchmark against the given sample, let's have the island lengths vary between 1 and 7 -
In [257]: n = 100000 # thus would create 100000 pair of islands
In [258]: a = np.repeat(np.arange(n)%2, np.random.randint(1,7,(n)))
# Approach #1 proposed in this post
In [259]: %timeit np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
100 loops, best of 3: 2.13 ms per loop
# Approach #2 proposed in this post
In [260]: %timeit np.diff(np.flatnonzero(np.concatenate(([True], a[1:]!= a[:-1], [True] ))))
1000 loops, best of 3: 1.21 ms per loop
# @Vineet Jain's soln
In [261]: %timeit [ sum(1 for i in g) for k,g in groupby(a)]
10 loops, best of 3: 61.3 ms per loop
Using groupby from itertools
from itertools import groupby
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
grouped_a = [ sum(1 for i in g) for k,g in groupby(a)]
I found a method similar to yours, except that this code finds the first and the last counts separately. The approach is detailed in the code below:
import numpy as np
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
print(f'a: {a}')
diff_a = np.diff(a)
print(f'diff_a: {diff_a}')
non_zero_pos_arr = np.where(diff_a != 0)[0]
print(f'Array of positions where non zero elements are present in diff_a array: {non_zero_pos_arr}')
diff_non_zero_pos_arr = np.diff(non_zero_pos_arr)
print(f'Result Array except for first and last element: {diff_non_zero_pos_arr}')
ans_first_ele = non_zero_pos_arr[0] + 1
ans_last_ele = len(diff_a) - non_zero_pos_arr[-1]
ans = np.array([], dtype=np.int8)
ans = np.append(ans, ans_first_ele)
ans = np.append(ans, diff_non_zero_pos_arr)
ans = np.append(ans, ans_last_ele)
print(f'Result Array: {ans}')
Output:
a: [0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0]
diff_a: [ 1 0 0 -1 0 0 0 0 0 0 1 -1 1 0 -1 0 0 1 0 -1 0]
Array of positions where non zero elements are present in diff_a array:
[ 0 3 10 11 12 14 17 19]
Result Array except for first and last element: [3 7 1 1 2 3 2]
Result Array: [1 3 7 1 1 2 3 2 2]
