My data imported from Excel has been multi-indexed with the timestamp column in Pandas. I would like to place the remaining columns into their respective hierarchical groups: the frequency band columns ('delta', 'theta', 'alpha', 'beta', 'high beta', 'gamma') under the hierarchy label 'EMG', and the biological measures ('Heart Rate Variabilty' and 'GSR') under 'Biofeedback'.
Is there a straightforward way to do this?
The second part is: how can a single-level DataFrame be appended to this multi-indexed hierarchical DataFrame without flattening the hierarchy created in the first part?
You can create a MultiIndex with MultiIndex.from_arrays and then reindex:
import pandas as pd

cols = ['delta', 'theta', 'alpha', 'beta', 'Heart Rate Variabilty', 'high beta', 'gamma', 'GSR']
df = pd.DataFrame([[4, 5, 8, 3, 1, 0, 9, 2]], columns=cols)
print (df)
delta theta alpha beta Heart Rate Variabilty high beta gamma GSR
0 4 5 8 3 1 0 9 2
c1 = ['delta', 'theta', 'alpha', 'beta','high beta', 'gamma']
c2 = ['Heart Rate Variabilty', 'GSR']
mux = pd.MultiIndex.from_arrays([ ['EMG'] * len(c1) + ['Biofeedback'] * len(c2),c1 + c2])
print (mux)
MultiIndex(levels=[['Biofeedback', 'EMG'],
['GSR', 'Heart Rate Variabilty', 'alpha', 'beta',
'delta', 'gamma', 'high beta', 'theta']],
labels=[[1, 1, 1, 1, 1, 1, 0, 0], [4, 7, 2, 3, 6, 5, 1, 0]])
df = df.reindex(columns=mux, level=1)
print (df)
EMG Biofeedback
delta theta alpha beta high beta gamma Heart Rate Variabilty GSR
0 4 5 8 3 0 9 1 2
EDIT by comment — thank you to the OP for the final solution to the second part (appending the single-level DataFrame df1 without flattening the hierarchy):
df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
df = pd.concat([df, df1], axis=1)
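Note that the tuples passed to MultiIndex.from_tuples must have as many elements as the existing column MultiIndex has levels (presumably three in the OP's real data, hence (c, '', '')). For the two-level example above, a minimal sketch would be (the 'notes' column is just an illustrative placeholder):
df1 = pd.DataFrame([[7]], columns=['notes'])  # single-level columns, same index as df
df1.columns = pd.MultiIndex.from_tuples([(c, '') for c in df1.columns])  # pad to two levels
df = pd.concat([df, df1], axis=1)  # the hierarchy of df is preserved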
Related
I have a DF with labels and values as below:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
Now, I want to calculate the std dev as below, for each row:
row with label A: std dev of the values labeled B and C (so the first 2 rows get the std dev of all other rows)
row with label B: std dev of the values labeled A and C (the third row gets the std dev of all other rows)
row with label C: std dev of the values labeled A and B (the last 2 rows get the std dev of all other rows)
How can I achieve this?
Update
To optimise, precompute std dev for each label:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
labels = df.labels.unique()
std_map = {l:df[df.labels != l]["val"].std() for l in labels}
df["std_dev"] = df["labels"].apply(lambda l: std_map[l])
The straightforward approach is to iterate over the DataFrame and, for each row, filter to the rows with other labels and compute the std dev:
df = pd.DataFrame({'labels' : ['A','A', 'B', 'C', 'C'],'val' : [1, 2, 3, 4, 5]})
df["std_dev"] = df.apply(lambda row: df[df.labels != row["labels"]]["val"].std(), axis=1)
[Out]:
labels val std_dev
0 A 1 1.000000
1 A 2 1.000000
2 B 3 1.825742
3 C 4 1.000000
4 C 5 1.000000
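If even the per-label loop above is too slow, a fully vectorized alternative (a sketch, not part of the original answer) can derive each leave-one-label-out std dev from running totals:
import numpy as np
import pandas as pd

df = pd.DataFrame({'labels': ['A', 'A', 'B', 'C', 'C'], 'val': [1, 2, 3, 4, 5]})

# Totals over the whole column.
n, s, s2 = len(df), df['val'].sum(), (df['val'] ** 2).sum()

# Per-label totals.
g = df.groupby('labels')['val'].agg(count='count', sum='sum', sum_sq=lambda x: (x ** 2).sum())

# "Rest" statistics: everything except the current label.
n_rest = n - g['count']
s_rest = s - g['sum']
s2_rest = s2 - g['sum_sq']

# Sample std dev of the remaining rows (ddof=1, matching pandas' .std() default).
std_rest = np.sqrt((s2_rest - s_rest ** 2 / n_rest) / (n_rest - 1))
df['std_dev'] = df['labels'].map(std_rest)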
Please advise how to get the following output:
import pandas as pd

df1 = pd.DataFrame([['1, 2', '2, 2', '3, 2', '1, 1', '2, 1', '3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index=lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)  # 1-based row and column labels
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format ('1, 2' is the first row, second column, which is 2; '2, 2' is 4; '3, 2' is 6; etc.).
I need to combine the looked-up values with the 3rd and 4th columns to get something like (2*100 x, 4*200 y, 6*300 x, 1*100 x, 3*200 y, 5*300 x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear from your question why df1 is a DataFrame or whether it can have more than one row, so I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
i, j = map(int, c.split(','))
df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
I have two data sources I can join by a field and want to summarize them in a chart:
Data
The two DataFrames share column A:
import numpy as np
import pandas as pd

ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                             'B': np.random.randint(0, 60, size=ROWS),
                             'C': np.random.randint(0, 100, size=ROWS)})
df.head()
A B C
0 0 10 11
1 1 7 64
2 2 22 12
3 3 1 67
4 4 34 57
And other, which I joined as follows:
other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()
A B C D
0 0 10 11 One
1 1 7 64 Two
2 2 22 12 One
3 3 1 67 Two
4 4 34 57 One
Question
What is a proper way to get a column chart with the count of:
C is GTE50 and D is One
C is GTE50 and D is Two
C is LT50 and D is One
C is LT50 and D is Two
Grouped by B, binned into 0, 1-10, 11-20, 21-30, 31-40, 41+.
IIUC, this can be dramatically simplified to a single groupby, taking advantage of clip and np.ceil to form your groups. A single unstack with 2 levels gives us the B-grouping as our x-axis with bars for each D-C combination:
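A minimal sketch of that single groupby (before adding the nicer labels shown next) might look like:
# df as constructed in the question (columns A, B, C, D).
(df.groupby(['D', df.C.ge(50), np.ceil(df.B.clip(lower=0, upper=41) / 10)])
   .size()
   .unstack([0, 1])
   .plot.bar())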
If you want slightly nicer labels, you can map the groupby values:
(df.groupby(['D',
df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})
])
.size().unstack([0,1]).plot.bar())
Also it's equivalent to group B on:
pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
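Putting that together, a sketch that forms the bins with pd.cut instead of the clip/ceil trick:
bins = pd.cut(df['B'],
              bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
              right=False,
              labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])

(df.groupby(['D', df.C.ge(50).map({True: 'GE50', False: 'LT50'}), bins])
   .size()
   .unstack([0, 1])
   .plot.bar())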
I arrived at this solution after days of grinding, going back and forth, but there are many things I consider code smells:
groupby returns a sort-of pivot table, and melt's purpose is to unpivot data.
The use of dummies for Cx, but not for D? Ultimately they are both categorical data with 2 options. After two days, when I got this first solution, I needed a break before trying another branch that treats these two equally.
reset_index, only to set_index a few lines later; having to sort_values before set_index.
That last summary.unstack().unstack() reads like a big hack.
# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()
# %% Bins
df['B_binned'] = pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()
# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)
df.head()
# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})
summary.reset_index(inplace=True)
summary = pd.melt(summary,
id_vars=['B_binned', 'Cx'],
value_vars=['One', 'Two'],
var_name='D',
value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary
# %% Chart
summary.unstack().unstack().plot(kind='bar')
Numpy
Using numpy arrays to count, then constructing the DataFrame to plot:
labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['GE50', 'LT50'])
u, d = np.unique(df.D.values, return_inverse=True)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B)
ltge = (df.C.values < 50).astype(int)  # 0 -> 'GE50', 1 -> 'LT50', matching the ge_lbl order
shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)
pd.concat({
d_: pd.DataFrame(o, labels, ge_lbl)
for d_, o in zip(u, out)
}, names=['D', 'Cx'], axis=1).plot.bar()
I tried a different way of doing it:
import matplotlib.pyplot as plt

df['Bins'] = np.where(df['B'].isin([0]), '0',
             np.where(df['B'].isin(range(1, 11)), '1-10',
             np.where(df['B'].isin(range(11, 21)), '11-20',
             np.where(df['B'].isin(range(21, 31)), '21-30',
             np.where(df['B'].isin(range(31, 41)), '31-40', '41+')))))
df['Class_type'] = np.where((df['C'] >= 50) & (df['D'] == 'One'), 'C is GTE50 and D is One',
                   np.where((df['C'] >= 50) & (df['D'] == 'Two'), 'C is GTE50 and D is Two',
                   np.where((df['C'] < 50) & (df['D'] == 'One'), 'C is LT50 and D is One',
                            'C is LT50 and D is Two')))

# Count rows per (bin, class) rather than summing C.
df.groupby(['Bins', 'Class_type'])['C'].count().unstack().plot(kind='bar')
plt.show()
#### Output ####
WARNING: I'm not sure how optimal this solution is; it also adds extra columns, so memory usage may increase.
I have a reasonably sized DataFrame of time-series data and I would like to have rolling pairwise correlation data in a reasonable format.
Pandas has a very interesting 'rolling' feature that does the correct calculations
dfCorrelations = dfReturns.rolling(correlation_window).corr()
but the output time series of correlation grids is inconvenient for my later use (for each date the output is a full correlation matrix).
Is there a way to do the same calculation, but get the output in a simple time-series DataFrame with only the unique, non-diagonal correlations? Say, with a column index that looks something like
['III LN x ABN NA', 'III LN x AGN NA', 'III LN x AGS BB', 'ABN NA x AGN NA', 'ABN NA x AGS BB', ...]
from itertools import combinations
import numpy as np
import pandas as pd

# Create a sample dataset shaped like rolling().corr() output:
# a (date, ticker) MultiIndex on the rows and one column per ticker.
idx = pd.MultiIndex(
    levels=[[u'2017-1-1', u'2017-1-2'], [u'A', u'B', u'C']],
    codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
    names=[u'date', u'ticker'])
df = pd.DataFrame(np.random.randn(6, 3), index=idx, columns=list('ABC'))
# Put 1s on each date's diagonal to mimic a correlation matrix.
for tup in zip(range(6), list(range(3)) * 2):
    df.iloc[tup] = 1
>>> df
A B C
date ticker
2017-1-1 A 1.000000 0.440276 -1.087536
B -0.809949 1.000000 -0.548897
C 0.922866 -0.788699 1.000000
2017-1-2 A 1.000000 -0.106493 0.034319
B 0.080990 1.000000 0.218323
C 0.051651 -0.680358 1.000000
# Unstack and remove duplicates.
tickers = df.columns.tolist()
df = df.unstack().sort_index(axis=1)
pairs = df.columns.tolist()
df.columns = ["{0} vs. {1}".format(*pair) for pair in pairs]
mask = [n for n, pair in enumerate(pairs) if pair in list(combinations(tickers, 2))]
df = df.iloc[:, mask]
>>> df
A vs. B A vs. C B vs. C
date
2017-1-1 -0.809949 0.922866 -0.788699
2017-1-2 0.080990 0.051651 -0.680358
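For the rolling case, the same unstack-and-filter idea should carry over to the rolling().corr() output, since it has the same (date, ticker) row MultiIndex. A sketch, assuming dfReturns (one column per ticker) and correlation_window exist as in the question:
from itertools import combinations

dfCorrelations = dfReturns.rolling(correlation_window).corr()

tickers = dfReturns.columns.tolist()
wide = dfCorrelations.unstack().sort_index(axis=1)
pairs = wide.columns.tolist()
wide.columns = ["{0} x {1}".format(*pair) for pair in pairs]

# Keep one column per unordered ticker pair, dropping the diagonal and duplicates.
keep = [n for n, pair in enumerate(pairs) if pair in list(combinations(tickers, 2))]
wide = wide.iloc[:, keep]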
I have data from users who have left star ratings (1, 2 or 3 stars) on items in various categories, where each item may belong to multiple categories. In my current dataframe, each row represents a rating and the categories are one-hot encoded, like so:
import numpy as np
import pandas as pd
df_old = pd.DataFrame({
'user': [1, 1, 2, 2, 2],
'rate': [3, 2, 1, 1, 2],
'cat1': [1, 0, 1, 1, 1],
'cat2': [0, 1, 0, 0, 1]
})
# user rate cat1 cat2
# 0 1 3 1 0
# 1 1 2 0 1
# 2 2 1 1 0
# 3 2 1 1 0
# 4 2 2 1 1
I want to convert this to a new DataFrame, multi-indexed by user and rate, which shows the per-category bincounts for each star rating. I'm currently doing this with loops:
multi_idx = pd.MultiIndex.from_product(
[df_old.user.unique(), range(1,4)],
names=['user', 'rate']
)
df_new = pd.DataFrame( # preallocate in an attempt to speed up the code
{'cat1': np.nan, 'cat2': np.nan},
index=multi_idx
)
df_new.sort_index(inplace=True)
idx = pd.IndexSlice
for uid in df_old.user.unique():
for cat in ['cat1', 'cat2']:
df_new.loc[idx[uid, :], cat] = np.bincount(
df_old.loc[(df_old.user == uid) & (df_old[cat] == 1),
'rate'].values, minlength=4)[1:]
# cat1 cat2
# user rate
# 1 1 0.0 0.0
# 2 0.0 1.0
# 3 1.0 0.0
# 2 1 2.0 0.0
# 2 1.0 1.0
# 3 0.0 0.0
Unfortunately the above code is hopelessly slow on my real dataframe, which is long and contains many categories. How can I eliminate the loops please?
With your multi-index, you can aggregate your old data frame, and reindex it:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx).fillna(0)
Or, as @piRSquared commented, do the reindex and fill the missing values in one step:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx, fill_value=0)
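For reference, a complete runnable sketch using the sample data from the question:
import pandas as pd

df_old = pd.DataFrame({
    'user': [1, 1, 2, 2, 2],
    'rate': [3, 2, 1, 1, 2],
    'cat1': [1, 0, 1, 1, 1],
    'cat2': [0, 1, 0, 0, 1]
})

multi_idx = pd.MultiIndex.from_product(
    [df_old.user.unique(), range(1, 4)],
    names=['user', 'rate'])

# One groupby replaces both loops: sum the one-hot category flags per (user, rate),
# then reindex to the full (user, rate) grid, filling absent combinations with 0.
df_new = (df_old.groupby(['user', 'rate'])
                .sum()
                .reindex(multi_idx, fill_value=0))
print(df_new)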