pandas sample based on category for each row - python

Let's say I have a pandas dataframe
   rid category
0    0       c2
1    1       c3
2    2       c2
3    3       c3
4    4       c2
5    5       c2
6    6       c1
7    7       c3
8    8       c1
9    9       c3
I want to add 2 columns, pid and nid, such that for each row pid contains a random id (other than rid) that belongs to the same category as rid, and nid contains a random id that belongs to a different category than rid.
An example dataframe would be:
   rid category  pid  nid
0    0       c2    2    1
1    1       c3    7    4
2    2       c2    0    1
3    3       c3    1    5
4    4       c2    5    7
5    5       c2    4    6
6    6       c1    8    5
7    7       c3    9    8
8    8       c1    6    2
9    9       c3    1    2
Note that pid should not be the same as rid. Right now, I am just brute forcing it by iterating through the rows and sampling each time, which seems very inefficient.
Is there a better way to do this?
EDIT 1: For simplicity let us assume that each category is represented at least twice, so that at least one id can be found that is not rid but has the same category.
EDIT 2: For further simplicity let us assume that in a large dataframe the probability of ending up with the same id as rid is zero. If that is the case, I believe the solution should be easier. I would prefer not to make this assumption, though.
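(For reference, the brute-force row loop described above might look like the following sketch; this is an illustration written from the description, not the original code.)
import numpy as np

pids, nids = [], []
for _, row in df.iterrows():
    # same category, excluding the row's own rid
    same = df.loc[(df['category'] == row['category']) & (df['rid'] != row['rid']), 'rid']
    # any other category
    diff = df.loc[df['category'] != row['category'], 'rid']
    pids.append(np.random.choice(same))
    nids.append(np.random.choice(diff))
df['pid'] = pids
df['nid'] = nids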

For the pid column use Sattolo's algorithm (a cyclic shuffle, so no element stays in its original position), and for the nid column take the set difference between all rid values and the rid values of the group, then sample from it with numpy.random.choice:
from random import randrange
import numpy as np

# https://stackoverflow.com/questions/7279895
def sattoloCycle(items):
    items = list(items)
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return items

def outsideGroupRand(x):
    return np.random.choice(list(set(df['rid']).difference(x)),
                            size=len(x),
                            replace=False)

df['pid1'] = df.groupby('category')['rid'].transform(sattoloCycle)
df['nid1'] = df.groupby('category')['rid'].transform(outsideGroupRand)
print (df)
   rid category  pid  nid  pid1  nid1
0    0       c2    2    1     4     6
1    1       c3    7    4     7     4
2    2       c2    0    1     5     3
3    3       c3    1    5     1     0
4    4       c2    5    7     2     9
5    5       c2    4    6     0     8
6    6       c1    8    5     8     3
7    7       c3    9    8     9     5
8    8       c1    6    2     6     5
9    9       c3    1    2     3     6
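Because Sattolo's algorithm produces one full cycle, every element is moved to a different position (given EDIT 1's assumption that every category appears at least twice), so pid1 can never equal rid; and nid1 is drawn from ids outside the row's own category, so it cannot equal rid either. A quick sanity check on the frame above, added here as an illustration:
# the cyclic shuffle has no fixed points, so no row keeps its own rid
assert (df['pid1'] != df['rid']).all()
# nid1 comes from a different category, so it never equals rid either
assert (df['nid1'] != df['rid']).all()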

import pandas as pd
import numpy as np

## generate dummy data
raw = {
    "rid": range(10),
    "cat": np.random.choice("c1,c2,c3".split(","), 10)
}
df = pd.DataFrame(raw)

def get_random_ids(x):
    pids, nids = [], []
    sh = x.copy()
    for _ in x:
        ## do a circular shift, then choose a random value other than cur_value
        cur_value = sh.iloc[0]
        sh = sh.shift(-1)
        sh[-1:] = cur_value
        pids.append(np.random.choice(sh[:-1]))
    ## randomly choose values from the other categories
    nids = np.random.choice(df[df["cat"] != x.name]["rid"], len(x))
    return pd.DataFrame({"pid": pids, "nid": nids}, index=x.index)

new_ids = df.groupby("cat")["rid"].apply(lambda x: get_random_ids(x))
df.join(new_ids).sort_values("cat")
Output:
   rid cat  pid  nid
5    5  c1  8.0    9
8    8  c1  5.0    6
0    0  c2  6.0    1
2    2  c2  0.0    8
3    3  c2  0.0    9
6    6  c2  2.0    4
7    7  c2  3.0    1
1    1  c3  9.0    5
4    4  c3  9.0    0
9    9  c3  4.0    2

Start by defining a function computing pid:
def getPid(elem, grp):
    return grp[grp != elem].sample().values[0]
Parameters:
elem - the current rid from the group,
grp - the whole group of rid values.
The idea is to:
select the "other" elements from the current group (for some category),
call sample,
return the single value from the Series that sample returns.
Then define a second function, generating both new ids:
def getIds(grp):
    pids = grp.rid.apply(getPid, grp=grp.rid)
    rowNo = grp.rid.size
    currGrp = grp.name
    nids = df.query('category != @currGrp').rid\
             .sample(rowNo, replace=True)
    return pd.DataFrame({'pid': pids, 'nid': nids.values}, index=grp.index)
Note that:
all nid values for the current group can be computed with
a single call to sample,
from a Series of rids for the "other" categories.
But pid values must be computed separately, applying getPid to each
element (rid) of the current group.
The reason is that each time a different element must be excluded
from the current group before sample is called.
And to get the result, run a single instruction:
pd.concat([df, df.groupby('category').apply(getIds)], axis=1)

Related

Joining dataframes whose columns have the same name

I would like to ask how to join (or merge) multiple dataframes (an arbitrary number) whose columns may have the same name. I know this has been asked several times, but I could not find a clear answer in any of the questions I have looked at.
import os
from posixpath import join
import numpy as np
import pandas as pd
import re
import pickle

np.random.seed(1)
n_cols = 3
col_names = ["Ci"] + ["C" + str(i) for i in range(n_cols)]

def get_random_df():
    values = np.random.randint(0, 10, size=(4, n_cols))
    index = np.arange(4).reshape([4, -1])
    return pd.DataFrame(np.concatenate([index, values], axis=1),
                        columns=col_names).set_index("Ci")

dfs = []
for i in range(3):
    dfs.append(get_random_df())
print(dfs[0])
print(dfs[1])
with output:
    C0  C1  C2
Ci
0    5   8   9
1    5   0   0
2    1   7   6
3    9   2   4
    C0  C1  C2
Ci
0    5   2   4
1    2   4   7
2    7   9   1
3    7   0   6
If I try to join two dataframes per iteration:
# concatenate two per iteration
df = dfs[0]
for df_ in dfs[1:]:
    df = df.join(df_, how="outer", rsuffix="_r")
print("** 1 **")
print(df)
the final dataframe has columns with the same name: for example, C0_r is repeated for each joined dataframe.
** 1 **
    C0  C1  C2  C0_r  C1_r  C2_r  C0_r  C1_r  C2_r
Ci
0    5   8   9     5     2     4     9     9     7
1    5   0   0     2     4     7     6     9     1
2    1   7   6     7     9     1     0     1     8
3    9   2   4     7     0     6     8     3     9
This could easily be solved by providing a different suffix per iteration. However, the docs on join say: "Efficiently join multiple DataFrame objects by index at once by passing a list." If I try what follows:
# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer")
# fails
# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer", rsuffix="_r")
# fails
All steps fail due to duplicate columns:
Indexes have overlapping values: Index(['C0', 'C1', 'C2'], dtype='object')
Question: is there a way to automatically join multiple dataframes without explicitly providing a different suffix every time?
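(For reference, the per-iteration-suffix workaround mentioned above could look like the sketch below; the suffix naming is only an illustration.)
# join iteratively, giving each joined frame its own suffix
df = dfs[0]
for i, df_ in enumerate(dfs[1:], start=1):
    df = df.join(df_, how="outer", rsuffix=f"_r{i}")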
Instead of join, concatenate along columns
# concatenate along columns
# use keys to differentiate different dfs
res = pd.concat(dfs, keys=range(len(dfs)), axis=1)
# flatten column names
res.columns = [f"{j}_{i}" for i,j in res.columns]
res
Wouldn't it be more readable to display your data like this?
By adding this line of code at the end:
pd.concat([x for x in dfs], axis=1, keys=[f'DF{str(i+1)}' for i in range(len(dfs))])
# output
   DF1        DF2        DF3
    C0 C1 C2   C0 C1 C2   C0 C1 C2
Ci
0    5  8  9    5  2  4    9  9  7
1    5  0  0    2  4  7    6  9  1
2    1  7  6    7  9  1    0  1  8
3    9  2  4    7  0  6    8  3  9
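Each original frame can then be recovered from the hierarchical columns by its key (a small usage sketch, assuming the concat result above is assigned to a name, say out):
out = pd.concat(dfs, axis=1, keys=[f'DF{i+1}' for i in range(len(dfs))])
out['DF2']  # the C0/C1/C2 columns that came from dfs[1]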

How to correctly use pandas' slice/replace with values from another column and then explode a row into two

I'm working with a DataFrame that looks like this:
   ID         SEQ  BEG_GAP  END_GAP
0  A1     ABCDEFG        2        4
1  B1     YUUAAMN        4        6
2  C1  LKHUTYYYYA        7        9
What I'm trying to do is basically first replace the part of the string in "SEQ" that's in between the values of "BEG_GAP" and "END_GAP", and then explode the two remaining pieces of the string into two different lines (probably using pandas' explode).
I.e. the first expected result:
ID SEQ BEG_GAP END_GAP
0 A1 AB---FG 2 4
1 B1 YUUA--- 4 6
2 C1 LKHUTY--YA 7 8
To then get:
ID SEQ BEG_GAP END_GAP
0 A1 AB 2 4
1 A1 FG 2 4
2 B1 YUUA 4 6
3 C1 LKHUTY 7 8
4 C1 YA 7 8
I'm trying to use the following code:
import pandas as pd
df = pd.read_csv("..\path_to_the_csv.csv")
for index, rows in df.iterrows():
    start = df["BEG_GAP"].astype(float)
    stop = df["END_GAP"].astype(float)
    df["SEQ"] = df["SEQ"].astype(str)
    df['SEQ'] = df['SEQ'].str.slice_replace(start=start, stop=stop, repl='-')
But the column "SEQ" that I'm getting is full of NaN. I suppose it has to do with how I'm using start and stop. I could use some help with this, and also with how to later divide the rows according to the gaps.
I hope I was clear enough, thanks in advance!
Let's try:
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
Output:
   ID      SEQ  BEG_GAP  END_GAP
0  A1       AB        2        4
0  A1       FG        2        4
1  B1     YUUA        4        6
2  C1  LKHUTYY        7        8
2  C1        A        7        8
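If a clean sequential index is wanted, as in the expected result, reset_index can be appended (a small addition to the answer above):
output = df.explode('SEQ').query('SEQ != ""').reset_index(drop=True)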

python pandas - remove duplicates in a column and keep rows according to a complex criteria

Suppose I have this DF:
import pandas as pd

s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
   id  qual  nm
0   1    10   0
1   1    20   0
2   2    10   0
3   2     5   0
4   2    10   1
5   3     7   1
6   3     7   0
7   3     3   2
8   4    10   0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1,2,3,4. The row that should be kept should be chosen based on the following criteria: take the one with smallest nm, if equal, take the one with largest qual, if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
   id  qual  nm
0   1    20   0
1   2    10   0
2   3     7   0
3   4    10   0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then the rows with the maximum qual, and finally, if duplicates remain, remove them with DataFrame.drop_duplicates:
# get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
# get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
# if still dupes, keep the first row per id
df1 = df1.drop_duplicates('id')
print (df1)
   id  qual  nm
1   1    20   0
2   2    10   0
6   3     7   0
8   4    10   0
Or use:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform(min) == df['nm']) &
       (grouper['qual'].transform(max) == df['qual']), :].drop_duplicates(subset=['id'])
Output
   id  qual  nm
1   1    20   0
2   2    10   0
6   3     7   0
8   4    10   0
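The same criteria can also be expressed as a single sort (a sketch of an alternative, not taken from the answers above): order by nm ascending and qual descending, then keep the first row per id:
out = (df.sort_values(['nm', 'qual'], ascending=[True, False])
         .drop_duplicates('id')
         .sort_values('id'))
print(out)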

Randomly combine pandas group objects

Question:
How can one use pandas df.groupby() function to create randomly selected groups of groups?
Example:
I would like to group a dataframe into random groups of size n where n corresponds to the number of unique values in a given column.
I have a dataframe with a variety of columns including "id". Some rows have unique ids whereas others may have the same id. For example:
  c1  id  c2
0  a   1   4
1  b   2   6
2  c   2   2
3  d   5   7
4  y   9   3
In reality this dataframe can have up to 1000 or so rows.
I would like to be able to group this dataframe using the following criteria:
each group should contain at most n unique ids
no id should appear in more than one group
the specific ids in a given group should be selected randomly
each id should appear in exactly one group
For example the example dataframe (above) could become:
group1:
  c1  id  c2
0  a   1   4
4  y   9   3
group2:
  c1  id  c2
1  b   2   6
2  c   2   2
3  d   5   7
where n = 2
Thanks for your suggestions.
It seems difficult to do with a single groupby statement. One way to do it:
import random
import numpy as np

uniq = df['id'].unique()
random.shuffle(uniq)
groups = np.split(uniq, 2)
dfr = df.set_index(df['id'])
for gp in groups:
    print(dfr.loc[gp])
which prints, for example:
   c1  id  c2
id
9   y   9   3
1   a   1   4
   c1  id  c2
id
5   d   5   7
2   b   2   6
2   c   2   2
If the group size n doesn't divide len(uniq), you can use np.split(uniq, range(n, len(uniq), n)) instead.
Here's a way to do it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2
shuffled_ids = np.random.permutation(df['id'].unique())
id_groups = [shuffled_ids[i:i+n] for i in range(0, len(shuffled_ids), n)]
groups = [df['id'].apply(lambda x: x in g) for g in id_groups]
Output:
In [1]: df[groups[0]]
Out[1]:
  c1  c2  id
1  b   6   2
2  c   2   2
3  d   7   5
In [2]: df[groups[1]]
Out[2]:
  c1  c2  id
0  a   4   1
4  y   3   9
This approach doesn't involve changing the index, in case you need to keep it.

speeding up dataframe processing by concomitant filtering on pandas joins

I have two dataframes; one is pretty big, the other is really huge.
df1: "classid"(text), "c1" (numeric), "c2"(numeric)
df2: "classid"(text), "c3" (numeric), "c4"(numeric)
I want to filter df2 based on values on df1. In pseudocode one would formulate it like this:
df2[(df2.classid == df1.classid) & (df2.c3 < df1.c1) & (df2.c4 < df1.c2)]
Right now I do this by iterating over the rows of df1 and making some 40k filter calls on df2, which is a 3-million-row table. Obviously this is far too slow.
parts = []
for row in df1.itertuples():
    dft = df2[(df2.classid == row.classid) & (df2.c3 < row.c1) & (df2.c4 < row.c2)]
    parts.append(dft)
df = pd.concat(parts)
I guess the best option is to do an inner join and then apply the (df2.c3 < df1.c1) & (df2.c4 < df1.c2) filtering, but the problem is that the inner join would create a huge table, since classid is neither an index nor a unique row identifier. If the filtering could be applied concomitantly, that might just work. Any ideas?
Iterating should be a last resort. I'd merge the other df's columns c1 and c2 onto df:
df = df.merge(df1, on='classid', how='left')
Then I would group by classid and filter the rows, as in the following example:
In [95]:
df = pd.DataFrame({'classid':[0,0,1,1,1,2,2], 'c1':np.arange(7), 'c2':np.arange(7), 'c3':3, 'c4':4})
df
Out[95]:
   c1  c2  c3  c4  classid
0   0   0   3   4        0
1   1   1   3   4        0
2   2   2   3   4        1
3   3   3   3   4        1
4   4   4   3   4        1
5   5   5   3   4        2
6   6   6   3   4        2
In [100]:
df.groupby('classid').filter(lambda x: len(x[x['c3'] < x['c1']] & x[x['c4'] < x['c2']]) > 0)
Out[100]:
   c1  c2  c3  c4  classid
2   2   2   3   4        1
3   3   3   3   4        1
4   4   4   3   4        1
5   5   5   3   4        2
6   6   6   3   4        2
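Another sketch that stays closer to the pseudocode in the question is to merge df1 onto df2 and filter the merged rows directly; as the question notes, the intermediate merge can get large when classid repeats heavily on both sides:
# the merge brings c1/c2 alongside c3/c4 for every matching classid pair
merged = df2.merge(df1, on='classid', how='inner')
result = merged[(merged['c3'] < merged['c1']) & (merged['c4'] < merged['c2'])]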
