Python Pandas cumulative sum with non-fixed coefficients

I would like to compute the following formula:
NVI(t) = NVI(t-1) + ROC(t)*NVI(t-1)
Both NVI and ROC are Series of the same length.
I am not sure whether this can be done without a for loop.
=============================================
Maybe I wasn't clear before: we only have NVI(0)=100 and ROC is a Series; we need to generate the NVI(1...t) series progressively from the formula above.

If the NVI values are already known, the ROC(t)*NVI(t-1) term is easy to compute with the shift method.
In [21]: df = pd.DataFrame({'nvi': np.random.uniform(0, 1, 10), 'roc': np.random.uniform(0, 1, 10)})
In [22]: df
Out[22]:
nvi roc
0 0.237223 0.256954
1 0.583694 0.473751
2 0.441392 0.734422
3 0.111818 0.947311
4 0.798595 0.537202
5 0.782228 0.053902
6 0.806241 0.640266
7 0.568911 0.945149
8 0.020364 0.331894
9 0.193462 0.090610
In [23]: df['nvi_t'] = df.nvi.shift() * df.roc
In [24]: df
Out[24]:
nvi roc nvi_t
0 0.237223 0.256954 NaN
1 0.583694 0.473751 0.112385
2 0.441392 0.734422 0.428678
3 0.111818 0.947311 0.418135
4 0.798595 0.537202 0.060069
5 0.782228 0.053902 0.043046
6 0.806241 0.640266 0.500834
7 0.568911 0.945149 0.762018
8 0.020364 0.331894 0.188818
9 0.193462 0.090610 0.001845

You could use a for loop:
import numpy as np
from pandas import DataFrame, Series
ROC = Series(np.random.randn(10))
NVI = Series(np.zeros(len(ROC)), index=ROC.index)
NVI[0] = 100
for ii in range(1, len(ROC)):
    # NVI(t) = NVI(t-1) + ROC(t)*NVI(t-1) = NVI(t-1)*(1 + ROC(t))
    NVI[ii] = NVI[ii-1]*(1 + ROC[ii])
DataFrame({'NVI':NVI, 'ROC':ROC})
Which gives
Out[163]:
NVI ROC
0 100.000000 -0.671116
1 175.200037 0.752000
2 190.944391 0.089865
3 213.050742 0.115774
4 285.011333 0.337763
5 654.873638 1.297711
6 1970.284505 2.008648
7 3738.327575 0.897354
8 -3640.266184 -1.973769
9 -8171.676652 1.244802
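Because the recurrence factors as NVI(t) = NVI(t-1)*(1 + ROC(t)), each value is NVI(0) times a running product of the growth factors, so the loop can probably be replaced by cumprod. This is a minimal sketch, assuming NVI(0) = 100 and that ROC(0) is unused (as in the loop above). The ROC values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ROC values for illustration; ROC[0] is never used by the recurrence.
ROC = pd.Series([0.0, 0.10, -0.05, 0.20])

# Growth factor per step: NVI(t) / NVI(t-1) = 1 + ROC(t).
growth = 1 + ROC
growth.iloc[0] = 1.0  # keep NVI(0) at its starting value

# NVI(t) = NVI(0) * product of the growth factors up to t.
NVI = 100 * growth.cumprod()

# Reference: the explicit loop, kept only to compare results.
NVI_loop = [100.0]
for r in ROC.iloc[1:]:
    NVI_loop.append(NVI_loop[-1] * (1 + r))

print(NVI.tolist())
```

Both versions should agree up to floating-point rounding. The cumprod form avoids element-by-element assignment into a Series.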

Related

Pandas dataframe from numpy array with multiindex

I'm working with a numpy array called array_test with shape (5, 359, 2). This is checked with array_test.shape. The array reflects mean and uncertainty for observations in 5 repetitions of an experiment.
The goal is to estimate the mean value of each observation across the 5 repetitions of the experiment, and to estimate the total uncertainty per observation, also as a mean across the 5 repetitions.
I would need to create a pandas dataframe from it, I believe with a multiindex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second one which would be 'mean' and 'uncertainty'.
Suggestions are more than welcome!
IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
columns=['mean', 'uncertainty']).stack()
Output (a Series):
1  mean           0.499102
   uncertainty    0.511757
2  mean           0.480295
   uncertainty    0.473132
3  mean           0.500507
   uncertainty    0.519352
4  mean           0.505443
   uncertainty    0.493672
5  mean           0.514302
   uncertainty    0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
                  value
1 mean         0.499102
  uncertainty  0.511757
2 mean         0.480295
  uncertainty  0.473132
3 mean         0.500507
  uncertainty  0.519352
4 mean         0.505443
  uncertainty  0.493672
5 mean         0.514302
  uncertainty  0.519299
I would approach it by using a normal Dataframe, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# This returns an array like [0, 0, ..., 0, 1, 1, ..., 4, 4] (each experiment number repeated n_observations times)
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, ..., 9, 0, 1, ..., 8, 9] (the observation numbers repeated once per experiment)
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The Dataframe now looks like this:
print(df.head(15))
mean uncertainty experiment observation
0 0.741436 0.775086 0 0
1 0.401934 0.277716 0 1
2 0.148269 0.406040 0 2
3 0.852485 0.702986 0 3
4 0.240930 0.644746 0 4
5 0.309648 0.914761 0 5
6 0.479186 0.495845 0 6
7 0.154647 0.422658 0 7
8 0.381012 0.756473 0 8
9 0.939797 0.764821 0 9
10 0.994342 0.019140 1 0
11 0.300225 0.992146 1 1
12 0.265698 0.823469 1 2
13 0.791907 0.555051 1 3
14 0.503281 0.249237 1 4
Now you can analyze the Dataframe (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
mean uncertainty
observation
0 0.699324 0.506369
1 0.382288 0.456324
2 0.333396 0.324469
3 0.690545 0.564583
4 0.365198 0.555231
5 0.453545 0.596149
6 0.526988 0.395162
7 0.565689 0.569904
8 0.425595 0.415944
9 0.731776 0.375612
Or with more advanced aggregate functions, which are probably useful for your use case:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
                 mean                     uncertainty
                 mean       min       max        mean       min       max
observation
0            0.699324  0.297030  0.994342    0.506369  0.019140  0.974842
1            0.382288  0.063046  0.810411    0.456324  0.108774  0.992146
2            0.333396  0.148269  0.698921    0.324469  0.009539  0.823469
3            0.690545  0.175471  0.895190    0.564583  0.260557  0.721265
4            0.365198  0.015501  0.726352    0.555231  0.249237  0.929258
5            0.453545  0.111355  0.807582    0.596149  0.101421  0.914761
6            0.526988  0.323945  0.786167    0.395162  0.007105  0.691998
7            0.565689  0.154647  0.813336    0.569904  0.302157  0.964782
8            0.425595  0.116968  0.567544    0.415944  0.014439  0.756473
9            0.731776  0.411324  0.939797    0.375612  0.085988  0.764821
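If a MultiIndex is preferred over separate experiment and observation columns, the same reshaped array can be indexed with MultiIndex.from_product. This is only a sketch under assumptions: the level names "experiment" and "observation" and the 1-based experiment labels are illustrative choices, and the array is random stand-in data.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same layout: (experiments, observations, [mean, uncertainty]).
rng = np.random.default_rng(0)
a = rng.random((5, 10, 2))
n_experiments, n_observations, n_values = a.shape

# One index entry per (experiment, observation) pair, in the same
# row-major order that reshape(-1, n_values) produces.
index = pd.MultiIndex.from_product(
    [range(1, n_experiments + 1), range(n_observations)],
    names=["experiment", "observation"],  # illustrative level names
)
df = pd.DataFrame(a.reshape(-1, n_values), index=index,
                  columns=["mean", "uncertainty"])

# Average each observation across the experiments.
per_observation = df.groupby(level="observation").mean()
print(per_observation.head())
```

Grouping by the "observation" level should give the same numbers as a.mean(axis=0), so the DataFrame route is mainly useful when further pandas aggregation is planned.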

How to efficiently average values between the preceding and succeeding row indices with Pandas

The objective is to get the average over the n preceding and n succeeding rows for a given row index.
That is, for a given index, take the average over a list of indices.
For example
index index of average
0 0,1,2
1 0,1,2,3
2 0,1,2,3,4
...
9 7,8,9,10,11
10 8,9,10,11,12
This can be achieved as below:
import pandas as pd
arr=[[6772],
[7182],
[8570],
[11078],
[11646],
[13426],
[16996],
[17514],
[18408],
[22128],
[22520],
[23532],
[26164],
[26590],
[30636],
[3119],
[32166],
[34774]]
df=pd.DataFrame(arr,columns=['a'])
df['cal']=0.0  # float column, since the means are floats
idx_c=2
for idx in range(len(df)):
    idx_l=idx-idx_c
    idx_t=idx+idx_c
    idx_l=0 if idx_l<0 else idx_l
    idx_t=len(df) if idx_t>len(df) else idx_t
    df.loc[idx,'cal']=df['a'][df.index.isin(range(idx_l,idx_t+1))].mean()
However, I wonder whether there is a more efficient way of achieving the above task?
Series.rolling
The trick here is to use a rolling window of size 2*w + 1 with the optional parameter center=True, which centers the result of the rolling computation. For example, if w=2 the window size is 2*w + 1 = 5 and each result is stored at the middle (third) position of its window. Setting min_periods=1 lets the shorter windows at both ends still produce a value.
w = 2
df['avg'] = df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
print(df)
a avg
0 6772 7508.00
1 7182 8400.50
2 8570 9049.60
3 11078 10380.40
4 11646 12343.20
5 13426 14132.00
6 16996 15598.00
7 17514 17694.40
8 18408 19513.20
9 22128 20820.40
10 22520 22550.40
11 23532 24186.80
12 26164 25888.40
13 26590 22008.20
14 30636 23735.00
15 3119 25457.00
16 32166 25173.75
17 34774 23353.00
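As a quick sanity check, the centered rolling mean can be compared against an explicit slice-based average on a few rows. This sketch uses the first six values of the example column, and the variable names are illustrative.

```python
import pandas as pd

# First six values of column 'a' from the example above.
s = pd.Series([6772, 7182, 8570, 11078, 11646, 13426])
w = 2

# Centered window of 2*w + 1 rows; min_periods=1 keeps the truncated edge windows.
rolled = s.rolling(2 * w + 1, center=True, min_periods=1).mean()

# Explicit version: average rows i-w .. i+w, clipped at the start of the Series
# (slicing past the end is clipped automatically).
manual = pd.Series(
    [s.iloc[max(i - w, 0): i + w + 1].mean() for i in range(len(s))]
)

print(pd.DataFrame({"rolled": rolled, "manual": manual}))
```

The two columns should match, which suggests the rolling call reproduces the loop's windowing, including at the edges.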

Pandas groupby aggregation with percentages

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
val cat
0 3 Z
1 3 X
2 7 Y
3 2 Z
4 4 Y
5 7 X
6 2 X
7 1 X
8 2 X
9 1 Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X 46.875 #15/32
Y 37.500 #12/32
Z 15.625 #5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), alongside df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts implements it with normalize=True, but for groupby aggregation I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
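One compact way to write the same computation is to normalize the grouped sums by their own total, which avoids the separate total_sum variable. This is a sketch rather than a dedicated pandas feature. The data is typed in from the table above so it does not depend on the random seed.

```python
import pandas as pd

# Same rows as the table above, entered explicitly.
df = pd.DataFrame({
    "val": [3, 3, 7, 2, 4, 7, 2, 1, 2, 1],
    "cat": ["Z", "X", "Y", "Z", "Y", "X", "X", "X", "X", "Y"],
})

# Sum per category, then divide by the total of those sums.
sums = df.groupby("cat")["val"].sum()
pct = sums / sums.sum() * 100

print(pct)
# X    46.875
# Y    37.500
# Z    15.625
```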

Selecting columns from a pandas dataframe based on row conditions

I have a pandas dataframe
In [1]: df = DataFrame(np.random.randn(10, 4))
Is there a way to select only the columns whose last-row value is > 0?
The desired output would be a new dataframe containing all rows of the columns where the last row is > 0.
In [201]: df = pd.DataFrame(np.random.randn(10, 4))
In [202]: df
Out[202]:
0 1 2 3
0 -1.380064 0.391358 -0.043390 -1.970113
1 -0.612594 -0.890354 -0.349894 -0.848067
2 1.178626 1.798316 0.691760 0.736255
3 -0.909491 0.429237 0.766065 -0.605075
4 -1.214366 1.907580 -0.583695 0.192488
5 -0.283786 -1.315771 0.046579 -0.777228
6 1.195634 -0.259040 -0.432147 1.196420
7 -2.346814 1.251494 0.261687 0.400886
8 0.845000 0.536683 -2.628224 -0.238449
9 0.246398 -0.548448 -0.295481 0.076117
In [203]: df.iloc[:, (df.iloc[-1] > 0).values]
Out[203]:
0 3
0 -1.380064 -1.970113
1 -0.612594 -0.848067
2 1.178626 0.736255
3 -0.909491 -0.605075
4 -1.214366 0.192488
5 -0.283786 -0.777228
6 1.195634 1.196420
7 -2.346814 0.400886
8 0.845000 -0.238449
9 0.246398 0.076117
Basically this solution uses very basic Pandas indexing, in particular the iloc indexer.
You can use the boolean series generated from the condition to index the columns of interest:
In [30]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[30]:
0 1 2 3
0 -0.667736 -0.744761 0.401677 -1.286372
1 1.098134 -1.327454 1.409357 -0.180265
2 -0.105780 0.446195 -0.562578 -0.746083
3 1.366714 -0.685103 0.982354 1.928026
4 0.091040 -0.689676 0.425042 0.723466
5 0.798305 -1.454922 -0.017695 0.515961
6 -0.786693 1.496968 -0.112125 -1.303714
7 -0.211216 -1.321854 -0.892023 -0.583492
8 1.293255 0.936271 1.873870 0.790086
9 -0.699665 -0.953611 0.139986 -0.200499
In [32]:
df[df.columns[df.iloc[-1]>0]]
Out[32]:
2
0 0.401677
1 1.409357
2 -0.562578
3 0.982354
4 0.425042
5 -0.017695
6 -0.112125
7 -0.892023
8 1.873870
9 0.139986
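The same boolean Series can also be passed straight to .loc as the column selector. This is a small sketch with made-up, deterministic values so the selected columns are easy to verify by eye.

```python
import pandas as pd

# Made-up data; the last row is [0.5, -0.7, 0.0, 0.9].
df = pd.DataFrame({
    0: [1.0, -2.0, 0.5],
    1: [0.3, 0.1, -0.7],
    2: [-1.0, 2.0, 0.0],
    3: [-0.4, -0.2, 0.9],
})

# Boolean Series indexed by column label: True where the last row is > 0.
mask = df.iloc[-1] > 0

# Keep all rows, and only the columns where the mask is True.
selected = df.loc[:, mask]

print(selected)
```

Here only columns 0 and 3 have a positive last-row value, so those are the ones kept.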
Check out pandasql: https://pypi.python.org/pypi/pandasql
This blog post is a great tutorial for using SQL for Pandas DataFrames: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
This should get you started:
from pandasql import *
import pandas
def pysqldf(q):
    # Run the query against the DataFrames visible in the global namespace
    return sqldf(q, globals())
q = """
SELECT
*
FROM
df
WHERE
value > 0
ORDER BY 1;
"""
df = pysqldf(q)

Python Pandas: Get row by median value

I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And data.median()['column'] is subsequently:
data.median()['performance']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering, but it's not reliable when multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd
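def get_median_index(d):
    # Percentile rank of each value, between 0 and 1
    ranks = d.rank(pct=True)
    # Distance of each rank from the middle of the distribution
    close_to_median = abs(ranks - 0.5)
    # Label of the row whose rank is closest to 0.5
    return close_to_median.idxmin()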
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
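One caveat with comparing percentile ranks to 0.5: with an odd number of rows, the ranks just below and just above 0.5 appear to be equally close, so the function may return a neighbour of the median rather than the median row itself. An alternative sketch, not part of the original answer, measures the distance to the median value directly. The function name is hypothetical.

```python
import pandas as pd

def nearest_to_median_label(s):
    """Return the index label of the value closest to the median of s.

    If several rows are equally close, idxmin returns the first one.
    """
    return (s - s.median()).abs().idxmin()

# Single Series: the median of [5, 1, 9, 3, 7] is 5, stored at label 'a'.
s = pd.Series([5.0, 1.0, 9.0, 3.0, 7.0], index=list("abcde"))
label = nearest_to_median_label(s)

# Applied column by column to a DataFrame.
df = pd.DataFrame({"x": [5.0, 1.0, 9.0, 3.0, 7.0],
                   "y": [2.0, 8.0, 4.0, 6.0, 0.0]})
labels = df.apply(nearest_to_median_label)

print(label)
print(labels)
```

With an even number of rows the median falls between two values, so this returns the first of the two closest rows. Whether that is acceptable depends on the use case.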
Maybe just: data[data.performance == data.median()['performance']].
