Pandas dataframe from numpy array with multiindex - python

I'm working with a numpy array called array_test with shape (5, 359, 2), as checked with array_test.shape. The array holds the mean and uncertainty of observations in 5 repetitions of an experiment.
The goal is to estimate the mean value of each observation across the 5 repetitions of the experiment, and likewise to estimate the total uncertainty per observation as a mean across the 5 repetitions.
I need to create a pandas DataFrame from it, I believe with a MultiIndex in which the first level has the 5 values from the first dimension (named simply '1', '2', etc.), and a second level with 'mean' and 'uncertainty'.
Suggestions are more than welcome!

IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
import numpy as np
import pandas as pd

a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack()
Output (a Series):
1  mean           0.499102
   uncertainty    0.511757
2  mean           0.480295
   uncertainty    0.473132
3  mean           0.500507
   uncertainty    0.519352
4  mean           0.505443
   uncertainty    0.493672
5  mean           0.514302
   uncertainty    0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
                  value
1 mean         0.499102
  uncertainty  0.511757
2 mean         0.480295
  uncertainty  0.473132
3 mean         0.500507
  uncertainty  0.519352
4 mean         0.505443
  uncertainty  0.493672
5 mean         0.514302
  uncertainty  0.519299
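Note that a.mean(1) averages over the 359 observations within each repetition, which matches the requested MultiIndex layout. If instead you want per-observation statistics across the 5 repetitions, as the question's stated goal suggests, aggregate over axis 0; a minimal sketch (the name per_obs is mine):
# Variant (assumption, not from the answer): average across the
# 5 repetitions, giving a (359, 2) frame with one row per observation.
per_obs = pd.DataFrame(a.mean(0), columns=['mean', 'uncertainty'])
per_obs.index.name = 'observation'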

I would approach it by using a normal DataFrame, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create DataFrame and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# This returns an array like [0, 0, ..., 0, 1, 1, ..., 1, ..., 4, 4, ..., 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# This returns an array like [0, 1, ..., 9, 0, 1, ..., 9, ...]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The DataFrame now looks like this:
print(df.head(15))
        mean  uncertainty  experiment  observation
0   0.741436     0.775086           0            0
1   0.401934     0.277716           0            1
2   0.148269     0.406040           0            2
3   0.852485     0.702986           0            3
4   0.240930     0.644746           0            4
5   0.309648     0.914761           0            5
6   0.479186     0.495845           0            6
7   0.154647     0.422658           0            7
8   0.381012     0.756473           0            8
9   0.939797     0.764821           0            9
10  0.994342     0.019140           1            0
11  0.300225     0.992146           1            1
12  0.265698     0.823469           1            2
13  0.791907     0.555051           1            3
14  0.503281     0.249237           1            4
Now you can analyze the DataFrame (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
                 mean  uncertainty
observation
0            0.699324     0.506369
1            0.382288     0.456324
2            0.333396     0.324469
3            0.690545     0.564583
4            0.365198     0.555231
5            0.453545     0.596149
6            0.526988     0.395162
7            0.565689     0.569904
8            0.425595     0.415944
9            0.731776     0.375612
Or with more advanced aggregate functions, which are probably useful for your use case:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
                 mean                     uncertainty
                 mean       min       max        mean       min       max
observation
0            0.699324  0.297030  0.994342    0.506369  0.019140  0.974842
1            0.382288  0.063046  0.810411    0.456324  0.108774  0.992146
2            0.333396  0.148269  0.698921    0.324469  0.009539  0.823469
3            0.690545  0.175471  0.895190    0.564583  0.260557  0.721265
4            0.365198  0.015501  0.726352    0.555231  0.249237  0.929258
5            0.453545  0.111355  0.807582    0.596149  0.101421  0.914761
6            0.526988  0.323945  0.786167    0.395162  0.007105  0.691998
7            0.565689  0.154647  0.813336    0.569904  0.302157  0.964782
8            0.425595  0.116968  0.567544    0.415944  0.014439  0.756473
9            0.731776  0.411324  0.939797    0.375612  0.085988  0.764821
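If you still want the MultiIndex layout described in the question, a minimal sketch on top of this DataFrame (using the experiment and observation columns built above):
# Turn the helper columns into a MultiIndex, so each
# (experiment, observation) pair labels one row.
df_indexed = df.set_index(["experiment", "observation"]).sort_index()
print(df_indexed.head())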

Related

Dataframe column: to find (cumulative) local maxima

In the DataFrame below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros; these vectors correspond to the non-zero elements of column "Portfolio". I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorized or other methods) column "CumRetperTrade" into the column "PeakCumRet" (desired result), which gives, for every vector (i.e. each subset where 'Portfolio' = 1) contained in column "CumRetperTrade", the cumulative maximum of all its previous values. A numeric example is below. Thanks in advance!
PS In other words, I guess we need cummax(), but applied only to the consecutive subsets of 'CumRetperTrade' where 'Portfolio' = 1.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [2, 3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [2, 3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
   Portfolio  CumRetperTrade  PeakCumRet
0          1               2           2
1          1               3           3
2          1               2           3
3          1               1           3
4          0               0           0
5          0               0           0
6          0               0           0
7          1               4           4
8          1               2           4
9          1               1           4
PPS I already asked a similar question previously (Dataframe column: to find local maxima) and received a correct answer, but in that question I did not explicitly mention the requirement of cumulative local maxima.
You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the local maxima.
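The grouper is worth a word too: diff().ne(0).cumsum() starts a new group id every time "Portfolio" changes value, so each consecutive run of equal values becomes its own group. A small illustration on the example data (the intermediate name run_id is mine):
# Each run of equal "Portfolio" values gets its own id:
# 1,1,1,1,0,0,0,1,1,1 becomes groups 1,1,1,1,2,2,2,3,3,3.
run_id = df1["Portfolio"].diff().ne(0).cumsum()
print(run_id.tolist())  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]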

pandas apply with different arg for each column/row

Assume I have an M (rows) by N (columns) DataFrame
df = pandas.DataFrame([...])
and a vector of length N
windows = [1, 2, ..., N]
I would like to apply a moving-average function to each column in df, but would like the moving average to have a different length for each column (e.g. column 1 has MA length 1, column 2 has MA length 2, etc.); these lengths are contained in windows.
Are there built-in functions to do this quickly? I'm aware of df.apply(lambda a: f(a), axis=0, args=...), but it's unclear how to pass different args for each column.
Here's one way to do it:
In [15]: dfrm
Out[15]:
          A         B         C
0  0.948898  0.587032  0.131551
1  0.385582  0.275673  0.107135
2  0.849599  0.696882  0.313717
3  0.993080  0.510060  0.287691
4  0.994823  0.441560  0.632076
5  0.711145  0.760301  0.813272
6  0.932131  0.531901  0.393798
7  0.965915  0.812821  0.287819
8  0.782890  0.478565  0.960353
9  0.908078  0.850664  0.912878
In [16]: windows
Out[16]: [1, 2, 3]
In [17]: pandas.DataFrame(
    ...:     {c: dfrm[c].rolling(windows[i]).mean()
    ...:      for i, c in enumerate(dfrm.columns)}
    ...: )
Out[17]:
          A         B         C
0  0.948898       NaN       NaN
1  0.385582  0.431352       NaN
2  0.849599  0.486277  0.184134
3  0.993080  0.603471  0.236181
4  0.994823  0.475810  0.411161
5  0.711145  0.600931  0.577680
6  0.932131  0.646101  0.613049
7  0.965915  0.672361  0.498296
8  0.782890  0.645693  0.547323
9  0.908078  0.664614  0.720350
As @Manish Saraswat mentioned in the comments, the same thing could also be expressed with the old pd.rolling_mean(dfrm[c], windows[i]) API, which has since been removed in favor of .rolling(). Further, weighted windows (e.g. via win_type in .rolling()) let you express a custom window shape (size and weights), and the same dict-comprehension pattern works with any of the other rolling aggregations and keywords.
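In current pandas, one way to sketch such a weighted window is rolling().apply() (the weights here are illustrative assumptions, not from the answer):
import numpy as np
import pandas as pd

s = pd.Series(range(10), dtype=float)
w = np.array([0.2, 0.3, 0.5])  # assumed example weights, summing to 1

# Weighted moving average: dot each 3-wide window with the weights.
wma = s.rolling(len(w)).apply(lambda x: np.dot(x, w), raw=True)
print(wma.head())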

Rolling mean with customized window with Pandas

Is there a way to customize the window of the rolling_mean function?
data
1
2
3
4
5
6
7
8
Let's say the window is set to 2, meaning we calculate the average of the 2 data points before and after the observation, including the observation itself. Take the 3rd observation: in this case we get (1+2+3+4+5)/5 = 3. And so on.
Compute the usual rolling mean with a forward (or backward) window and then use the shift method to re-center it as you wish.
data_mean = pd.rolling_mean(data, window=5).shift(-2)
If you want to average over the 2 data points before and after the observation (for a total of 5 data points), then set window=5.
For example,
import pandas as pd
data = pd.Series(range(1, 9))
data_mean = pd.rolling_mean(data, window=5).shift(-2)
print(data_mean)
yields
0    NaN
1    NaN
2      3
3      4
4      5
5      6
6    NaN
7    NaN
dtype: float64
As kadee points out, if you wish to center the rolling mean, then use
pd.rolling_mean(data, window=5, center=True)
For more recent versions of pandas (see the 0.23.4 documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html), rolling_mean no longer exists. Instead, you use
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
For your example, it will be:
df.rolling(5, center=True).mean()
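Putting it together for the example data, a minimal sketch with the modern API, reproducing the values above:
import pandas as pd

data = pd.Series(range(1, 9))
# Centered 5-wide window: 2 points before, the point itself, 2 after.
print(data.rolling(5, center=True).mean())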

Python Pandas cumulative sum with non fixed coefficients

I would like to compute the following formula:
NVI(t) = NVI(t-1) + ROC(t)*NVI(t-1)
Both NVI and ROC are Series of the same length.
Not sure if this can be done without a for loop.
=============================================
Maybe I wasn't being clear before: we only have NVI(0) = 100 and ROC as a Series; we need to generate the NVI(1...t) series progressively from the formula above.
Quite easily with the shift method.
In [21]: df = pd.DataFrame({'nvi': np.random.uniform(0, 1, 10), 'roc': np.random.uniform(0, 1, 10)})
In [22]: df
Out[22]:
        nvi       roc
0  0.237223  0.256954
1  0.583694  0.473751
2  0.441392  0.734422
3  0.111818  0.947311
4  0.798595  0.537202
5  0.782228  0.053902
6  0.806241  0.640266
7  0.568911  0.945149
8  0.020364  0.331894
9  0.193462  0.090610
In [23]: df['nvi_t'] = df.nvi.shift() * df.roc
In [24]: df
Out[24]:
        nvi       roc     nvi_t
0  0.237223  0.256954       NaN
1  0.583694  0.473751  0.112385
2  0.441392  0.734422  0.428678
3  0.111818  0.947311  0.418135
4  0.798595  0.537202  0.060069
5  0.782228  0.053902  0.043046
6  0.806241  0.640266  0.500834
7  0.568911  0.945149  0.762018
8  0.020364  0.331894  0.188818
9  0.193462  0.090610  0.001845
You could use a for loop:
import numpy as np
from pandas import DataFrame, Series

ROC = Series(np.random.randn(10))
NVI = Series(np.zeros(len(ROC)), index=ROC.index)
NVI[0] = 100
for ii in range(1, len(ROC)):
    NVI[ii] = NVI[ii-1]*(1 + ROC[ii])
DataFrame({'NVI': NVI, 'ROC': ROC})
Which gives
Out[163]:
            NVI       ROC
0    100.000000 -0.671116
1    175.200037  0.752000
2    190.944391  0.089865
3    213.050742  0.115774
4    285.011333  0.337763
5    654.873638  1.297711
6   1970.284505  2.008648
7   3738.327575  0.897354
8  -3640.266184 -1.973769
9  -8171.676652  1.244802
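Since NVI(t) = NVI(t-1)*(1 + ROC(t)), the recursion unrolls into a running product, so it can also be done without a loop. A sketch reusing the names above:
# Vectorized alternative: NVI(t) = NVI(0) * prod(1 + ROC(1..t)).
# Neutralize the first factor so the product starts at exactly NVI(0) = 100.
factors = 1 + ROC
factors.iloc[0] = 1
NVI_vec = 100 * factors.cumprod()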

Weird Data manipulation in Pandas

I'm reading Python for Data Analysis by Wes McKinney, and I was surprised by this data manipulation. You can see the whole procedure here, but I will try to summarize it. Assume you have something like this:
In [133]: agg_counts = by_tz_os.size().unstack().fillna(0)
Out[133]:
a                    Not Windows  Windows
tz
                             245      276
Africa/Cairo                   0        3
Africa/Casablanca              0        1
Africa/Ceuta                   0        2
Africa/Johannesburg            0        1
Africa/Lusaka                  0        1
America/Anchorage              4        1
...
tz means time zone, and Not Windows and Windows are categories extracted from the user agent in the original data, so we can see that there are 3 Windows users and 0 non-Windows users in Africa/Cairo in the collected data.
Then in order to get "the top overall time zones" we have:
In [134]: indexer = agg_counts.sum(1).argsort()
Out[134]:
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
America/Bogota                    62
...
At that point, I would have thought that, according to the documentation, I was summing over columns (in sum(1)) and then sorting by the result, with argsort returning the arguments as usual. First of all, I'm not sure what "columns" means in the context of this Series, because sum(1) is actually summing the Not Windows and Windows users, keeping each value in the same row as its time zone. Furthermore, I can't see a correlation between the argsort values and agg_counts. For example, Pacific/Auckland has an "argsort value" (in In [134]) of 0, and it only has a sum of 11 Windows and Not Windows users. Asia/Harbin has an argsort value of 1 and appears with a sum of 3 Windows and Not Windows users.
Can someone explain to me what is going on there? Obviously I'm misunderstanding something.
sum(1) means sum over axis = 1. The terminology comes from numpy.
For a 2+ dimensional object, the 0-axis refers to the rows. Summing over the 0-axis means summing over the rows, which amounts to summing "vertically" (when looking at the table).
The 1-axis refers to the columns. Summing over the 1-axis means summing over the columns, which amounts to summing "horizontally".
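A tiny sketch of the two axes (the demo frame is a made-up two-row slice of the table above):
import pandas as pd

demo = pd.DataFrame({"Not Windows": [0, 4], "Windows": [3, 1]},
                    index=["Africa/Cairo", "America/Anchorage"])
print(demo.sum(0))  # sums "vertically": one total per column
print(demo.sum(1))  # sums "horizontally": one total per row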
numpy.argsort returns an array of indices which tell you how to sort an array. For example:
In [72]: import numpy as np
In [73]: x = np.array([521, 3, 1, 2, 1, 1, 5])
In [74]: np.argsort(x)
Out[74]: array([2, 4, 5, 3, 1, 6, 0])
The 2 in the array returned by np.argsort means the smallest value in x is x[2], which equals 1. The next smallest is x[4] which is also 1. And so on.
If we define
totals = df.sum(1)
print(totals)
# tz                     521
# Africa/Cairo             3
# Africa/Casablanca        1
# Africa/Ceuta             2
# Africa/Johannesburg      1
# Africa/Lusaka            1
# America/Anchorage        5
then totals.argsort() is argsorting the values [521, 3, 1, 2, 1, 1, 5]. We've seen the result; it is the same as numpy.argsort:
[2, 4, 5, 3, 1, 6, 0]
These values are simply made into a Series, with the same index as totals:
print(totals.argsort())
# tz                     2
# Africa/Cairo           4
# Africa/Casablanca      5
# Africa/Ceuta           3
# Africa/Johannesburg    1
# Africa/Lusaka          6
# America/Anchorage      0
Associating totals.index with these argsort indices does not appear to have intrinsic meaning, but if you compute totals[totals.argsort()] you see the rows of totals in sorted order:
print(totals[totals.argsort()])
# Africa/Casablanca        1
# Africa/Johannesburg      1
# Africa/Lusaka            1
# Africa/Ceuta             2
# Africa/Cairo             3
# America/Anchorage        5
# tz                     521
I loved unutbu's clarification. In the second-to-last table above (print(totals.argsort())), ignore the first column. What we need is the second column, which gives the positions we need. This is so cool!
Here are some examples of the take method: https://pandas-docs.github.io/pandas-docs-travis/advanced.html#take-methods
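For what it's worth, in current pandas the argsort-and-index dance can be written more directly (my own suggestion, not from the book):
# Modern idioms: sort the row totals directly, or take the
# largest entries without building an indexer at all.
totals_sorted = totals.sort_values()
top10 = agg_counts.sum(1).nlargest(10)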
