I have a dataframe and I am finding the confidence intervals across each row. My actual dataframe is hundreds of rows long, but here is an example:
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
df['CI'] = df.apply(lambda row: stats.t.interval(0.95, len(df)-1,
    loc=np.mean(row), scale=stats.sem(row)), axis=1).apply(lambda x: np.round(x, 2))
I also want to calculate the width of each confidence interval. I tried the following, but it did not work:
df['width'] = df.apply(lambda row: stats.t.interval(0.95, len(df)-1,
    loc=np.mean(row), scale=stats.sem(row)), axis=1)[1] - df.apply(lambda row:
    stats.t.interval(0.95, len(df)-1,
    loc=np.mean(row), scale=stats.sem(row)), axis=1)[0]
IIUC, you want to compute the difference between the upper and lower bounds of the confidence interval. You can try this:
df['CI'].apply(lambda x: x[1] - x[0])
If you have this:
>>> import pandas as pd
>>> from scipy import stats
>>> import numpy as np
>>> df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
>>> df['CI']=df.apply(lambda row: stats.t.interval(0.95, len(df)-1, loc=np.mean(row), scale=stats.sem(row)), axis=1).apply(lambda x: np.round(x,2))
>>> df['CI']
0 [-6.71, 13.37]
1 [-4.65, 11.32]
2 [-1.92, 13.26]
Name: CI, dtype: object
you get this:
>>> df['width'] = df['CI'].apply(lambda x: x[1] - x[0])
>>> df
nums_1 nums_2 nums_3 CI width
0 1 1 8 [-6.71, 13.37] 20.08
1 2 1 7 [-4.65, 11.32] 15.97
2 3 5 9 [-1.92, 13.26] 15.18
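If you only need the width, a minimal sketch (reusing the same degrees of freedom and rounding as above; width_direct is just an illustrative column name) computes it directly as twice the t critical value times the standard error, without storing the interval first:
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

# two-sided 95% critical value with the same degrees of freedom used above (len(df) - 1)
t_crit = stats.t.ppf(0.975, len(df) - 1)

# width = upper - lower = 2 * t_crit * standard error of the row
df['width_direct'] = df.apply(lambda row: 2 * t_crit * stats.sem(row), axis=1).round(2)
# width_direct: 20.08, 15.97, 15.18 -- matching the widths above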
I have a DF with labels and values as below:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
Now, I want to calculate the std. dev as below:
for each row:
row with A: (std dev of B and C labels) (first 2 rows would have std dev of all other rows)
row with B: (std dev of A and C labels) (third row would have std dev of all other rows)
row with C: (std dev of A and B labels) (last 2 rows would have std dev of all other rows)
How can I achieve this?
Update
To optimise, precompute std dev for each label:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
labels = df.labels.unique()
std_map = {l:df[df.labels != l]["val"].std() for l in labels}
df["std_dev"] = df["labels"].apply(lambda l: std_map[l])
Iterate the dataframe, filter the rows with other labels, and compute the std dev:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
df["std_dev"] = df.apply(lambda row: df[df.labels != row["labels"]]["val"].std(), axis=1)
[Out]:
labels val std_dev
0 A 1 1.000000
1 A 2 1.000000
2 B 3 1.825742
3 C 4 1.000000
4 C 5 1.000000
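This isn't from the original answer, but as a further sketch: when there are many labels, the same leave-one-label-out standard deviation can be derived from aggregate sums and sums of squares instead of filtering the frame once per label (sample variance with ddof=1, matching pandas' std):
import pandas as pd

df = pd.DataFrame({'labels': ['A', 'A', 'B', 'C', 'C'], 'val': [1, 2, 3, 4, 5]})

# per-label count, sum and sum of squares
per_label = df.assign(val_sq=df['val'] ** 2).groupby('labels')[['val', 'val_sq']].sum()
n_label = df['labels'].value_counts()

# totals over the whole frame
n_tot, s_tot, q_tot = len(df), df['val'].sum(), (df['val'] ** 2).sum()

# statistics of "all other rows" for each label
n_other = n_tot - n_label
s_other = s_tot - per_label['val']
q_other = q_tot - per_label['val_sq']
var_other = (q_other - s_other ** 2 / n_other) / (n_other - 1)

df['std_dev'] = df['labels'].map(var_other ** 0.5)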
Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index=lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 is the 'address' of df2 in row, col format (1,2 is the first row, second column, which is 2; 2,2 is 4; 3,2 is 6, etc.)
I need to add in the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear from your question why df1 is a DataFrame or whether it can have more than one row, so I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
I have a data frame consisting of lists as elements. I want to find the closest matching values within a percentage of a given value.
My code:
df = pd.DataFrame({'A':[[1,2],[4,5,6]]})
df
A
0 [1, 2]
1 [4, 5, 6]
# in each row, let's find the values and their index that match 5 with 20% tolerance
val = 5
tol = 0.2 # find values matching 5 or 20% within 5 (4 or 6)
df['Matching_index'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 is the closest to 5, but this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
The idea is to get the absolute difference from val, replace values outside the tolerance with missing values, and then use np.nanargmin, which raises an error if all values are missing, so an extra condition with m.any() is added:
def f(x):
    # absolute distance of each element from val
    a = np.abs(np.array(x) - val)
    # mask of elements within the tolerance
    m = a <= val * tol
    # position of the closest in-tolerance element, or NaN if none match
    return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan
df['Matching_index'] = df['A'].map(f)
print (df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
I'm not sure if you want all the indexes or just a count.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr, val, tol):
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val*tol]
    result = len(idxs) if len(idxs) != 0 else np.nan
    return result
df['Matching_index'] = df['A'].apply(closest, args=(val,tol,))
df
If you want all the indexes, just return idxs instead of len(idxs).
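For example, a sketch of that variant, reusing df, val and tol from above (matching_indexes is just an illustrative name); it returns the list of matching positions, or NaN when nothing is within tolerance:
def matching_indexes(arr, val, tol):
    # positions of elements within val*tol of val
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val*tol]
    return idxs if idxs else np.nan

df['Matching_indexes'] = df['A'].apply(matching_indexes, args=(val, tol))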
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
A B
0 1 2
1 1 4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
groupby.filter can be very slow for larger datasets / a large number of groups. A faster approach is to use groupby.transform.
Here's an example; first, create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
y A
0 1.375980 1
1 -0.023861 1
2 -0.474707 1
3 -0.151859 2
4 -1.696823 2
... ... ...
2424 0.276737 998
2425 -0.142171 999
2426 -0.718891 999
2427 -0.621315 999
2428 1.335450 999
[2429 rows x 2 columns]
Time it:
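The timing run itself isn't reproduced here, but as a minimal sketch (assuming a min_size threshold of 2 on the frame built above), the two approaches being compared look like this:
min_size = 2

# groupby.filter: calls the lambda once per group (slow when there are many groups)
out_filter = df.groupby('A').filter(lambda x: len(x) >= min_size)

# groupby.transform: vectorised group sizes, then a single boolean mask
out_transform = df[df.groupby('A')['y'].transform('size') >= min_size]

# both keep exactly the same rows
assert out_filter.equals(out_transform)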
I have a reasonably sized DataFrame of time-series data and I would like to have rolling pairwise correlation data in a reasonable format.
Pandas has a very interesting 'rolling' feature that does the correct calculations
dfCorrelations = dfReturns.rolling(correlation_window).corr()
but the output time series of correlation grids is inconvenient for my later use (sample output for a subset on a given date not shown here).
Is there a way to do the same calculation, but get the output in a simple time-series DataFrame with only the unique, non-diagonal correlations? Say, with a column index that looks something like
['III LN x ABN NA', 'III LN x AGN NA', 'III LN x AGS BB', 'ABN NA x AGN NA', 'ABN NA x AGS BB', ...]
from itertools import combinations
# Create sample dataset.
idx = pd.MultiIndex(
    levels=[['2017-1-1', '2017-1-2'], ['A', 'B', 'C']],
    codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
    names=['date', 'ticker'])
df = pd.DataFrame(np.random.randn(6, 3), index=idx, columns=list('ABC'))
# set the diagonal of each 3x3 correlation block to 1
for tup in zip(range(6), list(range(3)) * 2):
    df.iloc[tup] = 1
>>> df
A B C
date ticker
2017-1-1 A 1.000000 0.440276 -1.087536
B -0.809949 1.000000 -0.548897
C 0.922866 -0.788699 1.000000
2017-1-2 A 1.000000 -0.106493 0.034319
B 0.080990 1.000000 0.218323
C 0.051651 -0.680358 1.000000
# Unstack and remove duplicates.
tickers = df.columns.tolist()
df = df.unstack().sort_index(axis=1)
pairs = df.columns.tolist()
df.columns = ["{0} vs. {1}".format(*pair) for pair in pairs]
mask = [n for n, pair in enumerate(pairs) if pair in list(combinations(tickers, 2))]
df = df.iloc[:, mask]
>>> df
A vs. B A vs. C B vs. C
date
2017-1-1 -0.809949 0.922866 -0.788699
2017-1-2 0.080990 0.051651 -0.680358
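And as a sketch of the same reshaping applied to an actual rolling correlation (the ticker names, window length and random returns below are assumptions standing in for the question's dfReturns):
from itertools import combinations
import numpy as np
import pandas as pd

np.random.seed(0)
dfReturns = pd.DataFrame(
    np.random.randn(100, 3),
    columns=['III LN', 'ABN NA', 'AGN NA'],
    index=pd.date_range('2017-01-01', periods=100, freq='D'))

corr = dfReturns.rolling(20).corr()               # index: (date, ticker); columns: tickers
wide = corr.unstack()                             # columns become (ticker, ticker) tuples
keep = list(combinations(dfReturns.columns, 2))   # unique, non-diagonal pairs
wide = wide[keep]
wide.columns = ['{0} x {1}'.format(a, b) for a, b in keep]
print(wide.dropna().head())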