Randomly combine pandas group objects - python

Question:
How can one use the pandas df.groupby() function to create randomly selected groups of groups?
Example:
I would like to split a dataframe into random groups, where each group contains at most n unique values of a given column.
I have a dataframe with a variety of columns including "id". Some rows have unique ids whereas others may have the same id. For example:
c1 id c2
0 a 1 4
1 b 2 6
2 c 2 2
3 d 5 7
4 y 9 3
In reality this dataframe can have up to 1000 or so rows.
I would like to be able to group this dataframe using the following criteria:
each group should contain at most n unique ids
no id should appear in more than one group
the specific ids in a given group should be selected randomly
each id should appear in exactly one group
For example, the dataframe above could become:
group1:
c1 id c2
0 a 1 4
4 y 9 3
group2:
c1 id c2
1 b 2 6
2 c 2 2
3 d 5 7
where n = 2
Thanks for your suggestions.

This seems difficult to do with a single groupby statement. One way to do it:
import random
import numpy as np

uniq = df['id'].unique()        # the unique ids
random.shuffle(uniq)            # shuffle them in place
groups = np.split(uniq, 2)      # split into 2 equal-sized groups of ids
dfr = df.set_index(df['id'])    # index by id so rows can be selected per group
for gp in groups:
    print(dfr.loc[gp])
For the example dataframe this prints:
c1 id c2
id
9 y 9 3
1 a 1 4
c1 id c2
id
5 d 5 7
2 b 2 6
2 c 2 2
If the group size n does not evenly divide len(uniq), you can use np.split(uniq, range(n, len(uniq), n)) instead.
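For reference, a minimal self-contained sketch of this approach, rebuilding the example dataframe from the question and using the uneven-split form of np.split:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2

uniq = df['id'].unique()
random.shuffle(uniq)
# split points every n ids, so this also works when n does not divide len(uniq)
groups = np.split(uniq, range(n, len(uniq), n))

dfr = df.set_index(df['id'])
for gp in groups:
    print(dfr.loc[gp])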

Here's a way to do it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2
# shuffle the unique ids, then chunk them into groups of at most n ids
shuffled_ids = np.random.permutation(df['id'].unique())
id_groups = [shuffled_ids[i:i+n] for i in range(0, len(shuffled_ids), n)]
# build a boolean mask per group selecting the rows whose id falls in that group
groups = [df['id'].apply(lambda x: x in g) for g in id_groups]
Output:
In [1]: df[groups[0]]
Out[1]:
c1 c2 id
1 b 6 2
2 c 2 2
3 d 7 5
In [2]: df[groups[1]]
Out[2]:
c1 c2 id
0 a 4 1
4 y 3 9
This approach doesn't involve changing the index, in case you need to keep it.
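As a small aside (not part of the original answer), the same boolean masks can be built with the vectorized Series.isin instead of apply, which should give identical results here:

groups = [df['id'].isin(g) for g in id_groups]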

Related

Remove a row in Python Dataframe if there are repeating values in different columns

This might be quite an easy problem, but I can't deal with it properly and didn't find the exact answer here. So, let's say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I kept the ID values unchanged on purpose, to make the comparison easier.) Please let me know if you have any ideas. Thanks!
You can compare the length of the set of each row's values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test with Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
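Putting it together, a minimal runnable sketch of the setup and the nunique variant (assuming ID is the index, as the printed outputs above suggest):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 0],
                   'b': [3, 8, 3, 1],
                   'c': [4, 8, 10, 3],
                   'd': [9, 3, 12, 0]},
                  index=pd.Index([0, 1, 2, 3], name='ID'))

lc = len(df.columns)
# keep only rows where every value in the row is distinct
df1 = df[df.nunique(axis=1).eq(lc)]
print(df1)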

How to correctly use pandas' slice/replace with values from another column and then explode a row into two

I'm working with a DataFrame that looks like this:
ID SEQ BEG_GAP END_GAP
0 A1 ABCDEFG 2 4
1 B1 YUUAAMN 4 6
2 C1 LKHUTYYYYA 7 9
What I'm trying to do is basically first replace the part of the string in "SEQ" that's between the values of "BEG_GAP" and "END_GAP", and then explode the two remaining pieces of the string into two different rows (probably using pandas' explode).
I.e: First expected result:
ID SEQ BEG_GAP END_GAP
0 A1 AB---FG 2 4
1 B1 YUUA--- 4 6
2 C1 LKHUTY--YA 7 8
To then get:
ID SEQ BEG_GAP END_GAP
0 A1 AB 2 4
1 A1 FG 2 4
2 B1 YUUA 4 6
3 C1 LKHUTY 7 8
4 C1 YA 7 8
I'm trying to use the following code:
import pandas as pd

df = pd.read_csv("..\path_to_the_csv.csv")
for index, rows in df.iterrows():
    start = df["BEG_GAP"].astype(float)
    stop = df["END_GAP"].astype(float)
    df["SEQ"] = df["SEQ"].astype(str)
    df['SEQ'] = df['SEQ'].str.slice_replace(start=start, stop=stop, repl='-')
But the column "SEQ" that I'm getting is full of NaN. I suppose it has to do with how I'm using start and stop. I could use some help with this, and also with how to later divide the rows according to the gaps.
I hope I was clear enough, thanks in advance!
Let's try:
# split SEQ into the part before BEG_GAP and the part after END_GAP
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
# one row per piece, dropping empty pieces
output = df.explode('SEQ').query('SEQ!=""')
Output:
ID SEQ BEG_GAP END_GAP
0 A1 AB 2 4
0 A1 FG 2 4
1 B1 YUUA 4 6
2 C1 LKHUTYY 7 8
2 C1 A 7 8
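For reference, a self-contained sketch of this answer with the sample data typed in directly (END_GAP = 8 is used for C1, as in the asker's first expected-result table, so the output matches the one shown above):

import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'B1', 'C1'],
                   'SEQ': ['ABCDEFG', 'YUUAAMN', 'LKHUTYYYYA'],
                   'BEG_GAP': [2, 4, 7],
                   'END_GAP': [4, 6, 8]})

# keep the part before BEG_GAP and the part after END_GAP
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP + 1:]], axis=1)
output = df.explode('SEQ').query('SEQ != ""')
print(output)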

How to find the values of a column such that no value in another column takes a value greater than 3

I want to find the values of one column such that no value in another column takes a value greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which all the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My code below comes close to it.
df.groupby('a')['c'].max()>3
a
1 True
2 False
3 True
4 False
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1,3]
Is there a better and more efficient way to do this on a very large data frame (more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in the c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
And now take a look at each group.
Only group 2 meets the condition that each element in the c column is less than 3.
So actually you need a groupby and filter, passing only groups
meeting the above condition:
To get full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only the values from the a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code: df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True groups for which max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds the particular group key to the result.
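For the 30-million-row concern, here is a hedged sketch of a fully vectorized variant that avoids Python-level lambdas; it assumes the same "all c values less than 3 per group" condition and a reasonably recent pandas where transform accepts the 'all' alias:

# True per row when every c value in that row's a-group is < 3
good = df['c'].lt(3).groupby(df['a']).transform('all')
keys = df.loc[good, 'a'].unique().tolist()
print(keys)   # [2] for the modified DataFrame above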
import pandas as pd
#Create DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
# Write a function that returns the first value greater than 3 in a group, if any
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)
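If the goal is the list of 'a' values itself (an assumption about intent, since this answer stops at printing grp), the groups where grt returned nothing come back as missing values and can be dropped:

# groups whose c values never exceed 3 have a missing aggregate
result = grp.dropna().index.tolist()
print(result)   # [1, 3] for the DataFrame above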

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DataFrames of unequal sizes. For example:
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run 2 for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but for two files of 400,000 lines in one and 5,000 in the other, I need an efficient Pythonic + pandas way.
import pandas as pd

data1 = {'id': ['a', 'b', 'c', 'd'],
         'value': [2, 3, 22, 5]}
data2 = {'id': ['c', 'a'],
         'value': [22, 2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Output
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, keep only the rows whose id is duplicated across the two frames (keep=False marks every occurrence), then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
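A possibly simpler equivalent (an aside, not from the original answers) keeps id as a plain column and uses the vectorized isin:

df = df1[df1['id'].isin(df2['id'])]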

sum two pandas dataframe columns, keep non-common rows

I just asked a similar question, but then realized it wasn't the right question.
What I'm trying to accomplish is to combine two data frames that actually have the same columns, but may or may not have common rows (indices of a MultiIndex). I'd like to combine them by taking the sum of one of the columns, but leaving the other columns as they are.
According to the accepted answer, the approach may be something like:
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    try:
        return ["%s%s" % (prefix, i) for i in range(n)]
    except TypeError:
        return ["%s%s" % (prefix, i) for i in n]

mi1 = pd.MultiIndex.from_product([mklbl('A', 4), mklbl('C', 2)])
mi2 = pd.MultiIndex.from_product([mklbl('A', [2, 3, 4]), mklbl('C', 2)])
df2 = pd.DataFrame({'a': np.arange(len(mi1)), 'b': np.arange(len(mi1)),
                    'c': np.arange(len(mi1)), 'd': np.arange(len(mi1))[::-1]},
                   index=mi1).sort_index().sort_index(axis=1)
df1 = pd.DataFrame({'a': np.arange(len(mi2)), 'b': np.arange(len(mi2)),
                    'c': np.arange(len(mi2)), 'd': np.arange(len(mi2))[::-1]},
                   index=mi2).sort_index().sort_index(axis=1)
df1 = df1.add(df2.pop('b'))
but the problem is this will fail as the indices don't align.
This is close to what I'm trying to achieve, except that I lose rows that are not common to the two dataframes:
df1['b'] = df1['b'].add(df2['b'], fill_value=0)
But this gives me:
Out[197]:
a b c d
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
When I want:
In [197]: df1
Out[197]:
a b c d
A0 C0 0 0 0 7
C1 1 2 1 6
A1 C0 2 4 2 5
C1 3 6 3 4
A2 C0 0 4 0 5
C1 1 6 1 4
A3 C0 2 8 2 3
C1 3 10 3 2
A4 C0 4 4 4 1
C1 5 5 5 0
Note: in response to #RandyC's comment about the XY problem... the specific problem is that I have a class which reads data and returns a dataframe of 1e9 rows. The columns of the data frame are latll, latur, lonll, lonur, concentration, elevation. The data frame is indexed by a MultiIndex (lat, lon, time) where time is a datetime. The rows of the two dataframes may/may not be the same (IF they exist for a given date, the lat/lon will be the same... they are grid cell centers). latll, latur, lonll, lonur are calculated from lat/lon. I want to sum the concentration column as I add two data frames, but not change the others.
Self-answering: there was an error in the comment above that caused double adding. This is correct:
# set aside the column to be summed, then combine the remaining columns
newdata = df2.pop('b')
result = df1.combine_first(df2)
# add the summed column back, keeping rows present in only one frame
result['b'] = result['b'].add(newdata, fill_value=0)
seems to provide the solution to my use-case.
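As a recap, a minimal sketch of the pattern wrapped in a hypothetical helper (the column name 'b' matches the example above; df_a's values win wherever both frames have a row, and non-common rows are kept from both):

def combine_sum_column(df_a, df_b, col):
    # set aside the column to be summed; note this mutates df_b, as in the answer
    extra = df_b.pop(col)
    # union of rows and columns, preferring df_a where both frames have a value
    out = df_a.combine_first(df_b)
    # add the summed column back, keeping rows present in only one frame
    out[col] = out[col].add(extra, fill_value=0)
    return out

result = combine_sum_column(df1, df2, 'b')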
