Group data by frequency and estimate % per group - python

I have this DataFrame df:
ID EVAL
11 1
11 0
22 0
11 1
33 0
44 0
22 1
11 1
I need to estimate the % of rows with EVAL equal to 1 and 0 for two groups: Group 1 contains the IDs that appear 3 or more times in df; Group 2 contains the IDs that appear fewer than 3 times.
The result should be this one:
GROUP EVAL_0 EVAL_1
1 25 75
2 75 25

You can get the percentage of IDs that are repeated three or more times with value_counts(), then take the mean of the resulting boolean Series:
>>> (df.ID.value_counts() >= 3).mean()
0.25
This is the gist of the work. Depending on what you want to do with it, if you want output shaped like yours, you can put both percentages in a DataFrame:
>>> g1_perc = (df.ID.value_counts() >= 3).mean()
>>> pd.DataFrame(dict(group=[1, 2], perc_group=[g1_perc*100, (1-g1_perc)*100]))
group perc_group
0 1 25.0
1 2 75.0
The second column with the opposite percentage looks a bit needless to me.
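Note that value_counts() here measures the share of unique IDs, which only coincidentally matches the 25/75 split. If you want the EVAL percentages per group exactly as in the expected output, here is a minimal sketch (my own construction, not from the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':   [11, 11, 22, 11, 33, 44, 22, 11],
                   'EVAL': [1, 0, 0, 1, 0, 0, 1, 1]})

# Group 1: rows whose ID occurs 3 or more times; Group 2: the rest
counts = df['ID'].map(df['ID'].value_counts())
group = np.where(counts >= 3, 1, 2)

out = (df.groupby(group)['EVAL']
         .value_counts(normalize=True)   # share of EVAL==0/1 within each group
         .unstack(fill_value=0)
         .mul(100)
         .rename(columns={0: 'EVAL_0', 1: 'EVAL_1'})
         .rename_axis(index='GROUP', columns=None)
         .reset_index())
print(out)
#    GROUP  EVAL_0  EVAL_1
# 0      1    25.0    75.0
# 1      2    75.0    25.0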

Getting max values based on sliced column

Let's consider this Dataframe:
$> df
a b
0 6 50
1 2 20
2 9 60
3 4 40
4 5 20
I want to compute column d based on the max value between:
integer 0
a slice of column b from that row's index to the end
So I have created a column c (all zeroes) in my dataframe in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to feed it the sliced input values. The expected result would be:
$> df
a b c d
0 6 50 0 60
1 2 20 0 60
2 9 60 0 60
3 4 40 0 40
4 5 20 0 20
So essentially, d's 3rd row is computed (pseudo-code) as max(df[3:,"b"], df[3:,"c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row for D without having to loop, as this is slow.
Seems like this could work: reverse b, find the cummax, then reverse it back and assign it to d. Then use where on d to replace any value below 0 with 0:
df['d'] = df['b'][::-1].cummax()[::-1]
df['d'] = df['d'].where(df['d']>0, 0)
We can replace the last line with clip (thanks @Either), and drop the second reversal (assignment realigns on the index), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
a b d
0 6 50 60
1 2 20 60
2 9 60 60
3 4 40 40
4 5 20 20
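For reference, a self-contained version of the one-liner, with the frame rebuilt from the question's data and comments added:
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5],
                   'b': [50, 20, 60, 40, 20]})

# Reversing 'b' and taking cummax yields, for each row, the running max of
# 'b' from that row to the end; clip(lower=0) plays the role of column 'c'.
# Assignment realigns on the index, so the second reversal is unnecessary.
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)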

use groupby and custom agg in a dataframe pandas

I have this dataframe:
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and aggregate rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it's far too slow with over a million rows.
So the expected outcome would be :
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First I can get the required difference by using df['diff'] = df['start'].shift(-1) - df['end']. How can I gather the rows based on this condition within each id?
Thanks!
I believe you can create groups by subtracting the per-id shifted end (DataFrameGroupBy.shift) from start, checking whether the gap is greater than 10, and taking the cumulative sum; then pass the grouper to GroupBy.agg:
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id', g])
        .agg({'start': 'first', 'end': 'last'})
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
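To see why this works, it helps to inspect the intermediate grouper g; a minimal sketch rebuilt from the question's data:
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 2, 2],
                   'start': [1, 13, 30, 36, 2, 8, 25],
                   'end':   [2, 27, 35, 40, 5, 10, 30]})

# Gap between this row's start and the previous row's end within the same id
gap = df['start'].sub(df.groupby('id')['end'].shift())

# A gap greater than 10 starts a new run; cumsum numbers the runs
g = gap.gt(10).cumsum()
print(g.tolist())  # [0, 1, 1, 1, 1, 1, 2]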

Pandas: Groupby names in index and columns

I have a dataframe that uses MultiIndex for both index and columns.
For example:
df = pd.DataFrame(
    index=pd.MultiIndex.from_product([[1, 2], [1, 2, 3], [4, 5]],
                                     names=['i', 'j', 'k']),
    columns=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=(12, 1))
x        1       2
y        1   2   1   2
i j k
1 1 4   10  13   0  76
    5   92  37  52  40
  2 4   88  77  50  22
    5   75  31  19   1
  3 4   61  23   5  47
    5   43  68  10  21
2 1 4   23  15  17   5
    5   47  68   6  94
  2 4    0  12  24  54
    5   83  27  46  19
  3 4    7  22   5  15
    5    7  10  89  79
I want to group the values by a name in the index and by a name in the columns.
For each such group, we will have a 2D array of numbers (rather than a Series). I want to aggregate std() of all entries in that 2D array.
For example, let's say I groupby ['i', 'x'], one group would be with values of i=1 and x=1. I want to compute std for each of these 2D arrays and produce a DataFrame with i values as index and x values as columns.
What is the best way to achieve this?
If I do stack() to get x as an index, I will still be computing several std() instead of one as there will still be multiple columns.
You can use nested list comprehensions. For your example, with the given kind of DataFrame (not the same, as the values are random; you may want to fix a seed value so that results are comparable) and i and x as the indices of interest, it would work like this:
# get values of the top-level row index (sorted, so the order is deterministic)
rows = sorted(set(df.index.get_level_values(0)))
# get values of the top-level column index
columns = sorted(set(df.columns.get_level_values(0)))
# for every sub-dataframe (every combination of top-level indices),
# compute the sample standard deviation (1 degree of freedom) across all values
df_groupSD = pd.DataFrame([[df.loc[(row, )][(col, )].values.std(ddof=1)
                            for col in columns] for row in rows],
                          index=rows, columns=columns)
# show result
display(df_groupSD)
Output:
1 2
1 31.455115 25.433812
2 29.421699 33.748962
There may be better ways, of course.
You can use stack to move the 'y' level of the columns into the index, and then group by 'i' only:
print(df.stack(level='y').groupby(['i']).std())
x 1 2
i
1 32.966811 23.933462
2 28.668825 28.541835
Try the following code, which stacks the innermost column level ('y') within each i group and then takes the column-wise std:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
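For reproducibility, here is a self-contained version of the stack-based answer above; the fixed seed is my addition, so the numbers will differ from the sample outputs shown:
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed (my addition) for reproducible numbers
df = pd.DataFrame(
    index=pd.MultiIndex.from_product([[1, 2], [1, 2, 3], [4, 5]],
                                     names=['i', 'j', 'k']),
    columns=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=12)

# Move 'y' into the index so only the 'x' level remains in the columns;
# grouping by 'i' then reduces each (i, x) block of 12 values to one std.
print(df.stack(level='y').groupby(['i']).std())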

Pandas: group by two columns, sum up the first value in the first column group

In Python, I have a pandas data frame df.
ID Ref Dist
A 0 10
A 0 10
A 1 20
A 1 20
A 2 30
A 2 30
A 3 5
A 3 5
B 0 8
B 0 8
B 1 40
B 1 40
B 2 7
B 2 7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID Ref Dist
A 0 10
A 1 20
A 2 30
A 3 5
B 0 8
B 1 40
B 2 7
And I want to sum up the Dist column in each ID group.
ID Sum
A 65
B 55
I tried this for the first step, but it gives me the Dist values with the original row index, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody could help me with this. Thank you!
I believe this is what you're looking for.
For the first step, use first() since you want the first row in each group. Then use reset_index() so you can group by ID afterwards and sum it up.
df.groupby(['ID', 'Ref'])['Dist'].first()\
  .reset_index().groupby('ID')['Dist'].sum()
ID
A 65
B 55
Just drop_duplicates before the groupby. The default keep='first' retains the first row of each (ID, Ref) duplicate set, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
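For reference, a self-contained version of the drop_duplicates approach, with the frame rebuilt from the question's data:
import pandas as pd

df = pd.DataFrame({'ID':   list('AAAAAAAA') + list('BBBBBB'),
                   'Ref':  [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2],
                   'Dist': [10, 10, 20, 20, 30, 30, 5, 5, 8, 8, 40, 40, 7, 7]})

# Keep the first row of each (ID, Ref) pair, then sum Dist per ID
out = df.drop_duplicates(['ID', 'Ref']).groupby('ID')['Dist'].sum()
print(out)  # A 65, B 55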

Pandas: How to find the first valid column among a series of columns

I have a dataset of different sections of a race in a pandas dataframe from which I need to calculate certain features. It looks something like this:
id distance timeto1000m timeto800m timeto600m timeto400m timeto200m timetoFinish
1 1400m 10 21 30 39 50 60
2 1200m 0 19 31 42 49 57
3 1800m 0 0 0 38 49 62
4 1000m 0 0 29 40 48 61
So, what I need to do is, for each row, find the first timetoXXm column that is non-zero and the corresponding distance XX. For instance, for id=1 that would be 1000m, for id=3 that would be 400m, etc.
I can do this with a series of if..elif..else conditions but was wondering if there is a better way of doing this kind of lookup in pandas/numpy?
You can do it like this: first filter the cols of interest and take a slice, then mask the zeros and call idxmin on the slice. Since the times increase from left to right, the row-wise minimum of the non-zero values is also the first non-zero column:
In [11]:
df_slice = df.loc[:, df.columns.str.startswith('time')]
df_slice[df_slice!=0].idxmin(axis=1)
Out[11]:
0 timeto1000m
1 timeto800m
2 timeto400m
3 timeto600m
dtype: object
In [15]:
df['first_valid'] = df_slice[df_slice!=0].idxmin(axis=1)
df[['id','first_valid']]
Out[15]:
id first_valid
0 1 timeto1000m
1 2 timeto800m
2 3 timeto400m
3 4 timeto600m
Use idxmax(1). On the boolean frame produced by ne(0), idxmax returns the first column that is True in each row:
df.set_index(['id', 'distance']).ne(0).idxmax(1)
id distance
1 1400m timeto1000m
2 1200m timeto800m
3 1800m timeto400m
4 1000m timeto600m
dtype: object
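The question also asks for the corresponding distance XX; a minimal sketch (my extension of the idxmax answer, rebuilt from the question's data) that strips the prefix from the winning column name:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'distance': ['1400m', '1200m', '1800m', '1000m'],
                   'timeto1000m': [10, 0, 0, 0],
                   'timeto800m':  [21, 19, 0, 0],
                   'timeto600m':  [30, 31, 0, 29],
                   'timeto400m':  [39, 42, 38, 40],
                   'timeto200m':  [50, 49, 49, 48],
                   'timetoFinish': [60, 57, 62, 61]})

# First non-zero timetoXX column per row: booleans + idxmax(axis=1)
first_col = df.filter(like='timeto').ne(0).idxmax(axis=1)

# Extract the distance part of the column name, e.g. 'timeto1000m' -> '1000m'
df['first_valid_dist'] = first_col.str.replace('timeto', '', regex=False)
print(df[['id', 'first_valid_dist']])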
