Pandas randomly select n groups from a larger dataset - python

If I have a dataframe with groups like so
  val label
0  x1     A
1  x2     A
2  x3     B
3  x4     B
4  x5     C
5  x6     C
6  x7     D
7  x8     D
how can I randomly pick out n groups without replacement?

You can use numpy.random.choice with loc:
N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
print (vals)
['C' 'A' 'B']
df = df.set_index('label').loc[vals].reset_index()
print (df)
  label val
0     C  x5
1     C  x6
2     A  x1
3     A  x2
4     B  x3
5     B  x4
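If you do not need the sampled rows reordered to follow vals, a boolean mask works too. A minimal sketch (the DataFrame construction below is my reconstruction of the example data, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['x1','x2','x3','x4','x5','x6','x7','x8'],
                   'label': ['A','A','B','B','C','C','D','D']})

N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
sampled = df[df['label'].isin(vals)]   # keeps the original row order
print(sampled)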

Related

Pandas : If a column is having duplicates then copy values of corresponding column and copy it to new column

If my dataframe is like this,
X  Y  Z
1  a
1  b
2  c
the output should be
X  Y  Z
1  a  a,b
1  b  a,b
2  c
Condition:
If a value of X has duplicates, then Z should take all the Y values of that duplicated X, joined as comma-separated values.
df["Z"] = df.X.map(df.groupby("X").agg(list).apply(lambda x: "" if len(x.Y) == 1 else ",".join(x.Y), axis=1))
Use a groupby.transform and mask:
g = df.groupby('X')['Y']
df['Z'] = g.transform(','.join).mask(g.transform('size')==1, '')
output:
   X  Y    Z
0  1  a  a,b
1  1  b  a,b
2  2  c
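For reference, a minimal reproducible sketch of the transform/mask answer above (the DataFrame construction is my reconstruction of the example data):
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 2], 'Y': ['a', 'b', 'c']})

g = df.groupby('X')['Y']
# join all Y values per X, then blank out the groups of size 1
df['Z'] = g.transform(','.join).mask(g.transform('size') == 1, '')
print(df)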

Split rows to create new rows in Pandas Dataframe with same other row values

I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry only where the number of commas is >= 2. For example, a should become b:
In [7]: a
Out[7]:
      var1  var2 var3
0  a,b,c,d     1   X1
1  a,b,c,d     1   X2
2  a,b,c,d     1   X3
3  a,b,c,d     1
4    e,f,g     2   Y1
5    e,f,g     2   Y2
6    e,f,g     2
7      h,i     3   Z1
7 h,i 3 Z1
In [8]: b
Out[8]:
  var1  var2 var3
0  a,d     1   X1
1  b,d     1   X2
3  c,d     1   X3
4  e,g     2   Y1
5  f,g     2   Y2
6  h,i     3   Z1
You could use a custom function:
def custom_split(r):
    if pd.notna(r['var3']):
        s = r['var1']
        i = int(r['var3'][1:]) - 1   # e.g. 'X2' -> position 1
        l = s.split(',')
        return l[i] + ',' + l[-1]    # keep the i-th item plus the last item

df['var1'] = df.apply(custom_split, axis=1)
df = df.dropna()
output:
  var1  var2 var3
0  a,d     1   X1
1  b,d     1   X2
2  c,d     1   X3
4  e,g     2   Y1
5  f,g     2   Y2
7  h,i     3   Z1
# Alternative: number repeated var1 strings with cumcount, then use that
# position to pick the matching item out of the split list.
df['cc'] = df.groupby('var1')['var1'].cumcount()
df['var1'] = df['var1'].str.split(',')
df['var1'] = df[['cc','var1']].apply(lambda x: x['var1'][x['cc']] + ',' + x['var1'][-1], axis=1)
df = df.dropna().drop(columns=['cc']).reset_index(drop=True)
df
You can do so by splitting var1 on the comma into lists. The integer in var3 minus 1 can be interpreted as the index of which item in the var1 list to keep:
import pandas as pd
import io
data = ''' var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1'''
df = pd.read_csv(io.StringIO(data), sep = r'\s\s+', engine='python')
df['var1'] = df["var1"].str.split(',').apply(lambda x: [[i,x[-1]] for i in x[:-1]]) #split the string to list and create combinations of all items with the last item in the list
df = df[df['var3'].notnull()] # drop rows where var3 is None
df['var1'] = df.apply(lambda x: x['var1'][0 if not x['var3'] else int(x['var3'][1:])-1], axis=1) #keep only the element in the list in var1 where the index is the integer in var3 minus 1
Output:
         var1  var2 var3
0  ['a', 'd']     1   X1
1  ['b', 'd']     1   X2
2  ['c', 'd']     1   X3
4  ['e', 'g']     2   Y1
5  ['f', 'g']     2   Y2
7  ['h', 'i']     3   Z1
Run df['var1'] = df['var1'].str.join(',') to reconvert var1 to a string.
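A more compact variant of the same idea (my own sketch, not from the answers), starting again from the original frame a and assuming the missing var3 values were read in as NaN:
out = a[a['var3'].notna()].copy()                  # drop rows without a var3 label
parts = out['var1'].str.split(',')
idx = out['var3'].str[1:].astype(int) - 1          # 'X2' -> 1
out['var1'] = [p[i] + ',' + p[-1] for p, i in zip(parts, idx)]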

Pandas Count Group Number

Given the following dataframe:
df = pd.DataFrame({'col1':['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'col2':['x','x','y','z','y','y','x','y','y','z','z','x'],
                   })
df
col1 col2
0 A x
1 A x
2 A y
3 A z
4 A y
5 A y
6 B x
7 B y
8 B y
9 B z
10 B z
11 B x
I'd like to create a new column, col3 which classifies the values in col2 sequentially, grouped by the values in col1:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
In the above example, col3[0:1] has a value of x1 because it's the first group of x values in col2 for col1 = A. col3[4:5] has values of y2 because it's the second group of y values in col2 for col1 = A, etc.
I hope the description makes sense. I was unable to find an answer partially because I can't find an elegant way to articulate what I'm looking for.
Here's my approach:
groups = (df.assign(s=df.groupby('col1')['col2']   # group col2 by col1
                        .shift().ne(df['col2'])    # check if col2 is different from the previous (shift)
                        .astype(int)               # convert to int
                    )  # the new column s marks the beginning of consecutive blocks with `1`
            .groupby(['col1','col2'])['s']         # group `s` by `col1` and `col2`
            .cumsum()                              # cumsum by group
            .astype(str)
          )
df['col3'] = df['col2'] + groups
Output:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
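An equivalent way to express the same idea (my own sketch, not taken from the answer): give every consecutive run of col2 values within col1 a block id, then number the blocks of each (col1, col2) pair in order of first appearance.
import pandas as pd

# df is the frame from the question
blocks = df['col2'].ne(df.groupby('col1')['col2'].shift()).cumsum()
occurrence = (df.assign(block=blocks)
                .groupby(['col1', 'col2'])['block']
                .transform(lambda b: pd.factorize(b)[0] + 1))
df['col3'] = df['col2'] + occurrence.astype(str)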

Split row into multiple rows in pandas

I have a DataFrame with a format like this (simplified)
a b 43
a c 22
I would like this to be split up in the following way.
a b 20
a b 20
a b 1
a b 1
a b 1
a c 20
a c 1
a c 1
Where I have as many rows of 20 as the number divides by 20, and then one row of 1 for each unit of the remainder. I have a solution that basically iterates over the rows and fills up a dictionary, which can then be converted back to a DataFrame, but I was wondering if there is a better solution.
You can use floor division with modulo first, then create a new DataFrame with the constructor and numpy.repeat.
Finally, build C with numpy.concatenate and a list comprehension:
a,b = df.C // 20, df.C % 20
#print (a, b)
cols = ['A','B']
df = pd.DataFrame({x: np.repeat(df[x], a + b) for x in cols})
df['C'] = np.concatenate([[20] * x + [1] * y for x,y in zip(a,b)])
print (df)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=['a', 'a'], B=['b', 'c'], C=[43, 22]))
df
A B C
0 a b 43
1 a c 22
np.divmod and np.repeat
m = np.array([20, 1])
dm = list(zip(*np.divmod(df.C.values, m[0])))
# [(2, 3), (1, 2)]
rep = [sum(x) for x in dm]
new = np.concatenate([m.repeat(x) for x in dm])
df.loc[df.index.repeat(rep)].assign(C=new)
A B C
0 a b 20
0 a b 20
0 a b 1
0 a b 1
0 a b 1
1 a c 20
1 a c 1
1 a c 1
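A third option (my own sketch, not from either answer) that avoids building the repeat counts by hand: turn each value of C into the list of 20s and 1s it decomposes into, then explode it into rows (explode requires pandas >= 0.25).
import pandas as pd

df = pd.DataFrame(dict(A=['a', 'a'], B=['b', 'c'], C=[43, 22]))
# each value becomes as many 20s as fit, plus one 1 per unit of the remainder
df['C'] = df['C'].apply(lambda c: [20] * (c // 20) + [1] * (c % 20))
print(df.explode('C'))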

how to convert column names into column values in pandas - python

df=pd.DataFrame(index=['x','y'], data={'a':[1,2],'b':[3,4]})
how can I convert column names into values of a column? This is my desired output
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
Or:
print (df.stack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2',0:'c1'})
Out[279]:
c2 c1
level_0
x a 1
x b 3
y a 2
y b 4
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
            .set_index('index'))
Out[62]:
c2 c1
index
x a 1
y a 2
x b 3
y b 4
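A variant of the stack approach (my own sketch, not from the answers) that avoids renaming level_1 by naming the column axis first:
df1 = (df.rename_axis(columns='c2')   # name the column axis so stack produces a 'c2' level
         .stack()
         .rename('c1')
         .reset_index('c2')[['c1', 'c2']])
print(df1)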
