Pandas groupby then drop groups below specified size - python

I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.

You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
A B
0 1 2
1 1 4

The number of rows is given by .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).

groupby.filter can be very slow for larger datasets or a large number of groups. A faster approach is to use groupby.transform.
Here's an example; first, create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
y A
0 1.375980 1
1 -0.023861 1
2 -0.474707 1
3 -0.151859 2
4 -1.696823 2
... ... ...
2424 0.276737 998
2425 -0.142171 999
2426 -0.718891 999
2427 -0.621315 999
2428 1.335450 999
[2429 rows x 2 columns]
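The transform-based filter itself isn't shown in this excerpt; here is a minimal sketch of the usual idiom on the dataset above (the threshold min_size = 2 is only for illustration):
min_size = 2  # illustrative threshold

# transform('size') broadcasts each group's row count onto every row of that
# group, so the filter becomes a single vectorized boolean mask:
fast = df[df.groupby('A')['y'].transform('size') > min_size]

# Equivalent (but slower) groupby.filter version, for comparison:
slow = df.groupby('A').filter(lambda g: len(g) > min_size)

assert fast.equals(slow)
The transform version avoids one Python-level function call per group, which is where filter spends most of its time when there are many groups.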
Time both versions on your data to see the difference.

How to apply multiplication within pandas dataframe

Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2', '3, 2', '1, 1', '2, 1', '3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), inplace=False, axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 is an 'address' into df2 in (row, col) format (1,2 is the first row, second column, which is 2; 2,2 is 4; 3,2 is 6; etc.).
I need to combine the addressed values with the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the fraction of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear from your question why df1 is a DataFrame, or whether it can have more than one row, so I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j] * df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.

How to compare the sizes of confidence intervals in python

I have a dataframe and I am finding the confidence intervals across each row. My actual dataframe is hundreds of rows long, but here is an example:
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
df['CI'] = df.apply(lambda row: stats.t.interval(0.95, len(df) - 1,
                                                 loc=np.mean(row),
                                                 scale=stats.sem(row)),
                    axis=1).apply(lambda x: np.round(x, 2))
I also want to calculate the width of each confidence interval. I tried the following, but it did not work:
df['width'] = (df.apply(lambda row: stats.t.interval(0.95, len(df) - 1,
                                                     loc=np.mean(row),
                                                     scale=stats.sem(row)), axis=1)[1]
               - df.apply(lambda row: stats.t.interval(0.95, len(df) - 1,
                                                       loc=np.mean(row),
                                                       scale=stats.sem(row)), axis=1)[0])
IIUC, you want to compute the difference between the upper and lower bounds of the confidence interval. (Your attempt fails because indexing the apply result with [1] selects the row at index 1, a whole interval, rather than the upper bound of each row.) You can try this:
df['CI'].apply(lambda x: x[1] - x[0])
If you have this:
>>> from scipy import stats
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
>>> df['CI']=df.apply(lambda row: stats.t.interval(0.95, len(df)-1, loc=np.mean(row), scale=stats.sem(row)), axis=1).apply(lambda x: np.round(x,2))
>>> df['CI']
0 [-6.71, 13.37]
1 [-4.65, 11.32]
2 [-1.92, 13.26]
Name: CI, dtype: object
you get this:
>>> df['width'] = df['CI'].apply(lambda x: x[1] - x[0])
>>> df
nums_1 nums_2 nums_3 CI width
0 1 1 8 [-6.71, 13.37] 20.08
1 2 1 7 [-4.65, 11.32] 15.97
2 3 5 9 [-1.92, 13.26] 15.18
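As a side note (not from the original answers): since the t-interval is symmetric about the mean, its width is simply 2 * t_crit * sem, so it can be computed directly without building the interval first. A minimal sketch under that assumption:
from scipy import stats
import pandas as pd

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

# Two-sided 95% critical value, with the same len(df)-1 degrees of freedom
# used in the question's code:
t_crit = stats.t.ppf(0.975, len(df) - 1)

# DataFrame.sem uses ddof=1 by default, matching scipy.stats.sem:
df['width'] = 2 * t_crit * df.sem(axis=1)
print(df['width'].round(2))  # 20.08, 15.97, 15.18 (same widths as above)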

Pandas dataframe, merge by intersection of spans?

I would like to merge two dataframes based on the overlap of spans (indicated by pairs (s, e): s is the start of a span, e its end). While I have pretty bad code for doing it, I would like to know if there is a good way to implement it. Here is an example:
df1 = pd.DataFrame({'s': [0, 10, 20, 33, 424, 5345],
                    'e': [3, 17, 30, 39, 1000, 10987],
                    'data1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'s': [1, 45, 0],
                    'e': [50, 46, 90],
                    'data2': [1, 2, 3]})

def overlap(a1, a2, b1, b2):
    if type(b1) == list or type(b1) == np.ndarray:
        assert len(b1) == len(b2)
        return np.asarray([overlap(a1, a2, b1[k], b2[k]) for k in range(len(b1))])
    else:
        return max((a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1, 0)

overlaps = [overlap(df1['s'].iloc[i], df1['e'].iloc[i], df2['s'].values, df2['e'].values) > 0
            for i in range(len(df1))]
df1['data2'] = [df2['data2'][o].tolist() for o in overlaps]
Output is:
s e data1 data2
0 0 3 1 [1, 3]
1 10 17 2 [1, 3]
2 20 30 3 [1, 3]
3 33 39 4 [1, 3]
4 424 1000 5 []
5 5345 10987 6 []
Edit: also, in my particular case I am guaranteed that the spans of df1 are non-overlapping and sequential (i.e. s[i] > s[i-1], e[i] > s[i], e[i] < s[i+1]).
Edit 2: you can generate an arbitrary amount of almost-valid fake data (here there is no guarantee that the spans in the first df are non-overlapping):
N = int(1e3)
sdf1 = np.random.randint(0, high=10 * N, size=(N,))
sdf1.sort()
edf1 = sdf1 + np.random.randint(1, high=10, size=(N,))
data1 = range(N)
sdf2 = np.random.randint(0, high=10 * N, size=(N,))
edf2 = sdf2 + np.random.randint(1, high=10, size=(N,))
data2 = range(N)
df1 = pd.DataFrame({'s': sdf1, 'e': edf1, 'data1': data1})
df2 = pd.DataFrame({'s': sdf2, 'e': edf2, 'data2': data2})
When it comes to pandas DataFrames, you should avoid explicit for loops for processing rows/columns and use apply, transform, or other pandas functions instead. For example, to get the overlaps you can do:
def has_overlap(a1, a2, b1, b2):
    '''Return True if the spans overlap, otherwise False.'''
    return (a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1 > 0

def find_overlap(row1):
    '''Return the data2 values of the df2 rows that overlap the given row of df1, as a list.'''
    df2['has_overlap'] = df2.apply(lambda row2: has_overlap(row1.s, row1.e, row2.s, row2.e), axis=1)
    return list(df2['data2'].loc[df2['has_overlap']])

df1['data2'] = df1.apply(lambda row: find_overlap(row), axis=1)
print('df1: {}'.format(df1))
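For what it's worth (a sketch, not from the original answers): both versions above still call a Python function per row via apply. With NumPy broadcasting the whole overlap matrix can be computed at once, at the cost of an O(len(df1) * len(df2)) boolean array. Two inclusive spans [a1, a2] and [b1, b2] overlap iff a1 <= b2 and a2 >= b1, which is equivalent to the overlap formula above being positive:
import numpy as np

s1 = df1['s'].to_numpy()[:, None]   # shape (len(df1), 1)
e1 = df1['e'].to_numpy()[:, None]
s2 = df2['s'].to_numpy()[None, :]   # shape (1, len(df2))
e2 = df2['e'].to_numpy()[None, :]

mask = (s1 <= e2) & (e1 >= s2)      # (len(df1), len(df2)) overlap matrix

data2 = df2['data2'].to_numpy()
df1['data2'] = [data2[row].tolist() for row in mask]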

Pandas Split Dataframe into two Dataframes at a specific row

I have a pandas DataFrame which I have composed with concat. One row consists of 96 values; I would like to split the DataFrame at column 72,
so that the first 72 values of a row are stored in DataFrame1, and the next 24 values of a row in DataFrame2.
I create my DF as follows:
temps = DataFrame(myData)
datasX = concat(
    [temps.shift(72), temps.shift(71), temps.shift(70), temps.shift(69), temps.shift(68), temps.shift(67),
     temps.shift(66), temps.shift(65), temps.shift(64), temps.shift(63), temps.shift(62), temps.shift(61),
     temps.shift(60), temps.shift(59), temps.shift(58), temps.shift(57), temps.shift(56), temps.shift(55),
     temps.shift(54), temps.shift(53), temps.shift(52), temps.shift(51), temps.shift(50), temps.shift(49),
     temps.shift(48), temps.shift(47), temps.shift(46), temps.shift(45), temps.shift(44), temps.shift(43),
     temps.shift(42), temps.shift(41), temps.shift(40), temps.shift(39), temps.shift(38), temps.shift(37),
     temps.shift(36), temps.shift(35), temps.shift(34), temps.shift(33), temps.shift(32), temps.shift(31),
     temps.shift(30), temps.shift(29), temps.shift(28), temps.shift(27), temps.shift(26), temps.shift(25),
     temps.shift(24), temps.shift(23), temps.shift(22), temps.shift(21), temps.shift(20), temps.shift(19),
     temps.shift(18), temps.shift(17), temps.shift(16), temps.shift(15), temps.shift(14), temps.shift(13),
     temps.shift(12), temps.shift(11), temps.shift(10), temps.shift(9), temps.shift(8), temps.shift(7),
     temps.shift(6), temps.shift(5), temps.shift(4), temps.shift(3), temps.shift(2), temps.shift(1), temps,
     temps.shift(-1), temps.shift(-2), temps.shift(-3), temps.shift(-4), temps.shift(-5), temps.shift(-6),
     temps.shift(-7), temps.shift(-8), temps.shift(-9), temps.shift(-10), temps.shift(-11), temps.shift(-12),
     temps.shift(-13), temps.shift(-14), temps.shift(-15), temps.shift(-16), temps.shift(-17), temps.shift(-18),
     temps.shift(-19), temps.shift(-20), temps.shift(-21), temps.shift(-22), temps.shift(-23)], axis=1)
The question is: how can I split them? :)
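(As an aside, not part of the original question: the long concat above can be written equivalently with a comprehension.)
# range(72, -24, -1) yields 72, 71, ..., 1, 0, -1, ..., -23, i.e. exactly
# the shifts listed above (temps.shift(0) equals temps itself):
datasX = concat([temps.shift(i) for i in range(72, -24, -1)], axis=1)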
Use iloc:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(iloc docs)
use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible - let's split an original DF into 3 DFs at columns with indexes [2,3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974
I generally use array_split because of its simpler syntax, and it scales better to more than two partitions.
import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)
np.split(df, [100, 200, 300], axis=0) wants explicit index numbers, which may or may not be desirable.
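To illustrate the difference with a quick sketch: array_split tolerates section counts that don't divide the length evenly, while split with an integer count requires an exact division:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(5)})

parts = np.array_split(df, 2)    # 5 rows -> sizes 3 and 2
print([len(p) for p in parts])   # [3, 2]

# np.split(df, 2) would raise a ValueError here, because 5 rows
# cannot be divided into 2 equal sections.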

Python: Scaling numbers column by column with pandas

I have a Pandas data frame 'df' in which I'd like to perform some scalings column by column.
In column 'a', I need the maximum number to be 1, the minimum number to be 0, and all other to be spread accordingly.
In column 'b', however, I need the minimum number to be 1, the maximum number to be 0, and all other to be spread accordingly.
Is there a Pandas function to perform these two operations? If not, numpy would certainly do.
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
This is how you can do it using scikit-learn and its preprocessing module, which has many functions for scaling and centering data.
In [0]: from sklearn.preprocessing import MinMaxScaler
In [1]: df = pd.DataFrame({'A': [14, 90, 90, 96, 91],
                           'B': [103, 107, 110, 114, 114]}).astype(float)
In [2]: df
Out[2]:
A B
0 14 103
1 90 107
2 90 110
3 96 114
4 91 114
In [3]: scaler = MinMaxScaler()
In [4]: df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
In [5]: df_scaled
Out[5]:
A B
0 0.000000 0.000000
1 0.926829 0.363636
2 0.926829 0.636364
3 1.000000 1.000000
4 0.939024 1.000000
You could subtract the min, then divide by the max (beware of 0/0). Note that after subtracting the min, the new max is the original max minus the min.
In [11]: df
Out[11]:
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
In [12]: df -= df.min() # equivalent to df = df - df.min()
In [13]: df /= df.max() # equivalent to df = df / df.max()
In [14]: df
Out[14]:
a b
A 0.000000 0.000000
B 0.926829 0.363636
C 0.926829 0.636364
D 1.000000 1.000000
E 0.939024 1.000000
To switch the order of a column (from 1 to 0 rather than 0 to 1):
In [15]: df['b'] = 1 - df['b']
An alternative method is to negate the b column first (df['b'] = -df['b']) before scaling.
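To see why negating first works (a small sketch, not in the original answer): negation turns the original max into the new min, so the same subtract-min/divide-by-max pipeline sends the original max to 0 and the original min to 1:
import pandas as pd

df = pd.DataFrame({'b': [103, 107, 110, 114, 114]}, index=list('ABCDE'))

df['b'] = -df['b']   # the original max (114) is now the minimum
df -= df.min()
df /= df.max()
print(df['b'])       # A 1.000000, B 0.636364, C 0.363636, D 0.0, E 0.0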
In case you want to scale only one column in the dataframe, you can do the following:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Col1_scaled'] = scaler.fit_transform(df['Col1'].values.reshape(-1,1))
This is not very elegant, but the following works for this two-column case:
#Create dataframe
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
#Apply operates on each row or column with the lambda function
#axis = 0 -> act on columns, axis = 1 act on rows
#x is a variable for the whole row or column
#This line will scale minimum = 0 and maximum = 1 for each column
df2 = df.apply(lambda x:(x.astype(float) - min(x))/(max(x)-min(x)), axis = 0)
#Want to now invert the order on column 'B'
#Use apply function again, reverse numbers in column, select column 'B' only and
#reassign to column 'B' of original dataframe
df2['B'] = df2.apply(lambda x: 1-x, axis = 1)['B']
If I find a more elegant way (for example, using the column index, (0 or 1) mod 2 - 1, to select the sign in the apply operation so it can be done with just one apply command), I'll let you know.
I think Acumenus's comment in this answer should be mentioned explicitly as an answer, as it is a one-liner.
>>> import pandas as pd
>>> from sklearn.preprocessing import minmax_scale
>>> df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
>>> minmax_scale(df)
array([[0. , 0. ],
[0.92682927, 0.36363636],
[0.92682927, 0.63636364],
[1. , 1. ],
[0.93902439, 1. ]])
Given a data frame:
df = pd.DataFrame({'A': [14, 90, 90, 96, 91], 'B': [103, 107, 110, 114, 114]})
Scale to mean 0 and variance 1:
df.apply(lambda x: (x - np.mean(x)) / np.std(x), axis=0)
Scale to the range between 0 and 1 (note that x / np.max(x) alone only works when the minimum is 0, so use the full min-max formula):
df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)), axis=0)
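Putting the pieces together for the exact ask in the question (a sketch combining the answers above): scale column a so min -> 0 and max -> 1, and column b so min -> 1 and max -> 0:
import pandas as pd

df = pd.DataFrame({'a': [14, 90, 90, 96, 91],
                   'b': [103, 107, 110, 114, 114]},
                  index=list('ABCDE'))

scaled = (df - df.min()) / (df.max() - df.min())  # both columns to [0, 1]
scaled['b'] = 1 - scaled['b']                     # flip 'b': min -> 1, max -> 0
print(scaled)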
