I have a dataframe that uses MultiIndex for both index and columns.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2], [1, 2, 3], [4, 5]], names=['i', 'j', 'k']),
                  columns=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=12)
x       1       2
y       1   2   1   2
i j k
1 1 4  10  13   0  76
    5  92  37  52  40
  2 4  88  77  50  22
    5  75  31  19   1
  3 4  61  23   5  47
    5  43  68  10  21
2 1 4  23  15  17   5
    5  47  68   6  94
  2 4   0  12  24  54
    5  83  27  46  19
  3 4   7  22   5  15
    5   7  10  89  79
I want to group the values by one level of the index and one level of the columns.
Each such group is then a 2D array of numbers (rather than a Series), and I want to aggregate a single std() over all entries in that 2D array.
For example, say I group by ['i', 'x']: one group holds the values for i=1 and x=1. I want to compute the std of each of these 2D arrays and produce a DataFrame with the i values as index and the x values as columns.
What is the best way to achieve this?
If I do stack() to get x as an index level, I will still be computing several std() values instead of one, because there will still be multiple columns.
You can use nested list comprehensions. For your example, with the given kind of DataFrame (not the same, as the values are random; you may want to fix a seed so that results are comparable) and i and x as the levels of interest, it works like this:
# get the values of the top-level row index ('i'); sorted() is needed
# because pandas rejects a raw set as an index (sets are unordered)
rows = sorted(set(df.index.get_level_values(0)))
# get the values of the top-level column index ('x')
columns = sorted(set(df.columns.get_level_values(0)))
# for every sub-dataframe (every combination of top-level values),
# compute the sample standard deviation (ddof=1) across all entries
df_groupSD = pd.DataFrame([[df.loc[(row,)][(col,)].values.std(ddof=1)
                            for col in columns] for row in rows],
                          index=rows, columns=columns)
# show the result (display() assumes an IPython/Jupyter session)
display(df_groupSD)
Output:
           1          2
1  31.455115  25.433812
2  29.421699  33.748962
There may be better ways, of course.
You can use stack to move the 'y' level of the columns into the index, and then group by i only:
print(df.stack(level='y').groupby('i').std())
x          1          2
i
1  32.966811  23.933462
2  28.668825  28.541835
Alternatively, group by the first index level and stack within each group. grp.stack() moves the 'y' column level into the index, leaving one column per x, so a single std() per column covers all j, k, y entries of the group:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
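As a quick sanity check (a sketch, assuming the df built above), this one-liner agrees with the stack-then-groupby answer:
a = df.groupby(level=0).apply(lambda grp: grp.stack().std())
b = df.stack(level='y').groupby('i').std()
# both take the sample std over all j, k, y entries per (i, x) cell
pd.testing.assert_frame_equal(a, b)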
Related
I have a data frame with many columns that I need to divide by one column to compute proportions. Can someone help with a for loop for this? In the example data below I want to add columns c1p = c1/ct, c2p = c2/ct, c3p = c3/ct, c4p = c4/ct.
id c1 c2 c3 c4 ct
1 6 8 8 12 34
2 5 3 11 6 25
3 3 9 6 12 30
4 14 10 10 3 37
The power of DataFrames is that you can work column by column, like this:
for i in range(1, 5):
    df[f'c{i}p'] = df[f'c{i}'] / df['ct']
Here is a vectorized approach that builds another dataframe df_p containing all the proportions you need:
df_p = df.filter(regex='c[0-9]').divide(df['ct'], axis=0)
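To attach them to df under the names from the question (c1p through c4p), a possible follow-up (a sketch) is:
df = df.join(df_p.add_suffix('p'))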
I have this DataFrame df:
ID EVAL
11 1
11 0
22 0
11 1
33 0
44 0
22 1
11 1
I need to compute the % of rows with EVAL equal to 1 and to 0 for two groups: Group 1 contains the IDs that appear at least 3 times in df; Group 2 contains the IDs that appear fewer than 3 times.
The result should be this one:
GROUP EVAL_0 EVAL_1
1 25 75
2 75 25
You can get the fraction of IDs that are repeated three or more times with value_counts() and then taking the mean of the resulting boolean Series (the mean of booleans is the fraction of True values).
>>> (df.ID.value_counts() >= 3).mean()
0.25
This is the gist of the work; if you want output shaped like yours, you can then build a DataFrame from it:
>>> g1_perc = (df.ID.value_counts() >= 3).mean()
>>> pd.DataFrame(dict(group=[1, 2], perc_group=[g1_perc*100, (1-g1_perc)*100]))
group perc_group
0 1 25.0
1 2 75.0
The second column with the opposite percentage looks a bit needless to me.
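If you do need the full per-group EVAL percentages from the question, here is one possible sketch (it assumes the df above; the group labels follow the question's definition):
# label each row 1 or 2 depending on how often its ID occurs
counts = df['ID'].map(df['ID'].value_counts())
group = (counts >= 3).map({True: 1, False: 2})
# per-group share of EVAL values, as percentages
result = (df.groupby(group)['EVAL']
            .value_counts(normalize=True)
            .unstack()
            .mul(100))
result.index.name = 'GROUP'
result.columns = ['EVAL_0', 'EVAL_1']
For the sample data this gives 25/75 for group 1 and 75/25 for group 2, matching the expected output.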
I have a dataset of different sections of a race in a pandas dataframe from which I need to calculate certain features. It looks something like this:
id distance timeto1000m timeto800m timeto600m timeto400m timeto200m timetoFinish
1 1400m 10 21 30 39 50 60
2 1200m 0 19 31 42 49 57
3 1800m 0 0 0 38 49 62
4 1000m 0 0 29 40 48 61
So, what I need to do is, for each row, find the first timetoXXm column that is non-zero and the corresponding distance XXm. For instance, for id=1 that would be 1000m, for id=3 that would be 400m, etc.
I can do this with a series of if..elif..else conditions but was wondering if there is a better way of doing this kind of lookup in pandas/numpy?
You can do it like this: first slice out the columns of interest, then mask the zeros and call idxmin, which returns for each row the column holding the smallest remaining value; since the times accumulate along the row, that is the first non-zero column:
In [11]:
df_slice = df.loc[:, df.columns.str.startswith('time')]
df_slice[df_slice != 0].idxmin(axis=1)
Out[11]:
0 timeto1000m
1 timeto800m
2 timeto400m
3 timeto600m
dtype: object
In [15]:
df['first_valid'] = df_slice[df_slice != 0].idxmin(axis=1)
df[['id','first_valid']]
Out[15]:
id first_valid
0 1 timeto1000m
1 2 timeto800m
2 3 timeto400m
3 4 timeto600m
Use idxmax(axis=1): ne(0) turns the frame into booleans marking the non-zero entries, and idxmax then returns the first column that is True in each row:
df.set_index(['id', 'distance']).ne(0).idxmax(axis=1)
id distance
1 1400m timeto1000m
2 1200m timeto800m
3 1800m timeto400m
4 1000m timeto600m
dtype: object
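If you also want the numeric distance rather than the column name, one possible sketch (it assumes the timetoXXXm naming above; timetoFinish would yield NaN):
first_col = df.set_index(['id', 'distance']).ne(0).idxmax(axis=1)
distance_m = first_col.str.extract(r'timeto(\d+)m', expand=False).astype(float)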
I'm trying to take an existing DataFrame and append a new column.
Let's say I have this DataFrame (just some random numbers):
a b c d e
0 2.847674 0.890958 -1.785646 -0.648289 1.178657
1 -0.865278 0.696976 1.522485 -0.248514 1.004034
2 -2.229555 -0.037372 -1.380972 -0.880361 -0.532428
3 -0.057895 -2.193053 -0.691445 -0.588935 -0.883624
And I want to create a new column 'f' that multiplies each row by a 'costs' vector, for instance [1,0,0,0,0]. So, for row zero, the output in column f should be 2.847674.
Here's the function I currently use:
def addEstimate(df, costs):
    row_iterator = df.iterrows()
    for i, row in row_iterator:
        df.ix[i, 'f'] = np.dot(costs, df.ix[i])
I'm doing this with a 15-element vector, over ~20k rows, and I'm finding that this is super-duper slow (half an hour). I suspect that using iterrows and ix is inefficient, but I'm not sure how to correct this.
Is there a way that I can apply this to the entire DataFrame at once, rather than looping through rows? Or do you have other suggestions to speed this up?
You can create the new column with df['f'] = df.dot(costs).
dot is already a DataFrame method: applying it to the DataFrame as a whole will be much quicker than looping over the DataFrame and applying np.dot to individual rows.
For example:
>>> df # an example DataFrame
a b c d e
0 0 1 2 3 4
1 12 13 14 15 16
2 24 25 26 27 28
3 36 37 38 39 40
>>> costs = [1, 0, 0, 0, 2]
>>> df['f'] = df.dot(costs)
>>> df
a b c d e f
0 0 1 2 3 4 8
1 12 13 14 15 16 44
2 24 25 26 27 28 80
3 36 37 38 39 40 116
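One caveat: with a plain list, dot multiplies positionally, whereas with a Series it aligns on the index. Labelling the costs (a sketch, using the example columns above) guards against column-order mistakes:
costs = pd.Series([1, 0, 0, 0, 2], index=['a', 'b', 'c', 'd', 'e'])
df['f'] = df[costs.index].dot(costs)  # aligned by column name, not position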
Pandas has a dot function as well. Does
df['dotproduct'] = df.dot(costs)
do what you are looking for?
I have a pandas data frame in which one of the columns contains real values. I would like a new column of integers indicating what place each real number takes: 1 would mean it is the largest value in the column, 2 the second largest, and so on.
DataFrame has a rank method:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, 10)})
df['rank'] = df['a'].rank(ascending=False)
    a  rank
0  16   8.0
1  91   1.0
2  58   4.0
3  36   6.0
4  15   9.0
5  69   3.0
6  35   7.0
7  78   2.0
8  48   5.0
9   5  10.0
Make sure you check out the optional method keyword, which sets the behavior in the case of ties (equal values).
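To illustrate (a small sketch of the tie-handling options):
s = pd.Series([10, 20, 20, 30])
s.rank(ascending=False, method='average')  # ties share the mean rank: 4.0, 2.5, 2.5, 1.0
s.rank(ascending=False, method='min')      # ties take the lowest rank: 4.0, 2.0, 2.0, 1.0
s.rank(ascending=False, method='dense')    # no gaps after ties: 3.0, 2.0, 2.0, 1.0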