Shuffling one Column of a DataFrame By Group Efficiently - python

I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
group some_value label
0 1 8 1
1 1 7 0
2 1 6 2
3 1 5 2
4 2 1 0
5 2 2 0
6 2 3 1
7 2 4 2
8 3 2 1
9 3 4 1
10 3 2 1
11 3 4 2
I want to group by the group column, shuffle the label column within each group, and write the result back to the dataframe, preferably in place. The some_value column should remain intact. The result should look something like the following:
group some_value label
0 1 8 1
1 1 7 2
2 1 6 2
3 1 5 0
4 2 1 1
5 2 2 0
6 2 3 0
7 2 4 2
8 3 2 1
9 3 4 2
10 3 2 1
11 3 4 1
I used np.random.permutation but found it was very slow.
df["label"] = df.groupby("group")["label"].transform(np.random.permutation
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and in place?

We can use sample. Notice this assumes df = df.sort_values('group'):
df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values
Or we can do it with:
df['New']=df.sample(len(df)).sort_values('group').label.values
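For reference, here is a minimal, self-contained sketch of the sample-based shuffle; the data below is just a hypothetical reconstruction of the example in the question:

import pandas as pd

# Hypothetical data mirroring the example in the question.
df = pd.DataFrame({
    "group":      [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "some_value": [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
    "label":      [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2],
})

df = df.sort_values("group")  # the positional write-back below relies on this
# sample(frac=1) draws every row of the group exactly once, i.e. a permutation.
df["label"] = df.groupby("group")["label"].apply(lambda x: x.sample(frac=1)).values
print(df)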

What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
This SO explanation, which prints out what gets passed into the custom function by transform, is helpful.

Related

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups whose value is bigger than the previous group's value by more than one. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8 <- here the value of this group (if I group by Value) is larger than
7 8    the previous group's value by 6, so I want to remove this group
8 3    from the dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the differences with the shifted values, and last filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
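A minimal end-to-end sketch of this approach, using a hypothetical reconstruction of the example data:

import pandas as pd

# Hypothetical data mirroring the example in the question.
df = pd.DataFrame({'Value': [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})

s = df['Value'].drop_duplicates()   # one row per group value: 1, 2, 8, 3
v = s[s.diff().gt(s.shift())]       # values that jump too far from the previous group: 8
df = df[~df['Value'].isin(v)]       # drop every row belonging to those groups
print(df)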
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can check whether the difference is less than or equal to 5 or is NaN. After that we check for duplicates and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3

Reset count of group after pandas dataframe concatenation [duplicate]

This question already has answers here:
Pandas recalculate index after a concatenation
(3 answers)
Closed 3 years ago.
I am doing data processing and I am having trouble figuring out how to reset a group counter after concatenating pandas dataframes. Here is an example to illustrate my problem:
For example I have two dataframes:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
Counter Value
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
after concatenation I get:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
and I want to reset the counter so that it stays sequential, with each new group's counter value one bigger than the last group's:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 3 8
1 3 10
2 4 2
3 4 4
4 4 10
I tried shifting the whole dataframe up by one row, comparing the shifted values with the originals, and, where the original value was bigger than the shifted one, adding the original value to all values below it. But this solution does not always work because the raw data is noisy and inconsistent.
You can just add the maximum value in the Counter column in the first dataframe to the second before concatenating:
df2.Counter += df1.Counter.max()
pd.concat([df1, df2], ignore_index=True)
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
5 3 8
6 3 10
7 4 2
8 4 4
9 4 10
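As a minimal sketch of this approach (the two frames below are hypothetical reconstructions of the example):

import pandas as pd

# Hypothetical frames mirroring the example in the question.
df1 = pd.DataFrame({'Counter': [1, 1, 1, 2, 2], 'Value': [3, 4, 2, 4, 10]})
df2 = pd.DataFrame({'Counter': [1, 1, 2, 2, 2], 'Value': [8, 10, 2, 4, 10]})

# Shift the second frame's counters past the first frame's maximum,
# then concatenate with a fresh RangeIndex.
df2['Counter'] += df1['Counter'].max()
result = pd.concat([df1, df2], ignore_index=True)
print(result)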
Or another way using shift():
df=pd.concat([df1,df2])
df=df.assign(Counter_1=df.Counter.ne(df.Counter.shift()).cumsum())
# to overwrite the same column: df = df.assign(Counter=df.Counter.ne(df.Counter.shift()).cumsum())
Counter Value Counter_1
0 1 3 1
1 1 4 1
2 1 2 1
3 2 4 2
4 2 10 2
0 1 8 3
1 1 10 3
2 2 2 4
3 2 4 4
4 2 10 4
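A minimal sketch of the shift/cumsum variant, again with hypothetical frames mirroring the example:

import pandas as pd

# Hypothetical frames mirroring the example in the question.
df1 = pd.DataFrame({'Counter': [1, 1, 1, 2, 2], 'Value': [3, 4, 2, 4, 10]})
df2 = pd.DataFrame({'Counter': [1, 1, 2, 2, 2], 'Value': [8, 10, 2, 4, 10]})

df = pd.concat([df1, df2])
# A new group starts wherever Counter differs from the previous row,
# so the cumulative sum yields sequential group ids across both frames.
df['Counter'] = df.Counter.ne(df.Counter.shift()).cumsum()
print(df)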

Python Pandas Update Value Based on Index using .iloc

I have a dataframe:
a=pd.DataFrame([[1,1,9],[2,1,9],[3,2,9],[4,2,9]],columns=['a','b','c'])
a b c
0 1 1 9
1 2 1 9
2 3 2 9
3 4 2 9
if I run
a['c'].iloc[0]=100
it works and I get
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
But if I want to update the first observation of group b==2 by running
a['c'][a['b']==2].iloc[0]=100
It doesn't do what I want it to do. I still get the same dataframe:
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
I wonder why? and what's a possible solution for this?
Thank you for your help.
You should use .loc like this; chaining .iloc and .loc can sometimes cause this issue:
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided.
a.loc[a.index[a.b==2][0],'c']=10000
a
Out[761]:
a b c
0 1 1 9
1 2 1 9
2 3 2 10000
3 4 2 9
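A minimal sketch contrasting the chained assignment with a single .loc call; first_b2 is just an illustrative name:

import pandas as pd

a = pd.DataFrame([[1, 1, 9], [2, 1, 9], [3, 2, 9], [4, 2, 9]], columns=['a', 'b', 'c'])

# Chained indexing first materialises a subset (often a copy), so the
# assignment may never reach the original frame:
# a['c'][a['b'] == 2].iloc[0] = 100   # unreliable, triggers a pandas warning

# Single .loc call: look up the label of the first row where b == 2
# and set 'c' there directly on the original frame.
first_b2 = a.index[a['b'] == 2][0]
a.loc[first_b2, 'c'] = 100
print(a)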

Finding the min (or max) of a class in Pandas

I'm working on a large dataset (with pandas in Python) and I have a dataframe structured similarly to the following:
class value
0 1 6
1 1 4
2 1 5
3 5 6
4 5 2
...
n 225 3
The class values grow continuously through the dataframe, although some values are missing, as shown in the example. I was wondering how I can get simple stats like the min or max of each class and assign them to a new feature:
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
...
n 225 3 3
The only solution I can come up with is a time-consuming loop.
By using transform:
df['min']=df.groupby('class')['value'].transform('min')
df
Out[497]:
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
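A minimal, self-contained sketch of the transform approach, using a hypothetical reconstruction of the example data (the max column is added only to show the symmetric case):

import pandas as pd

# Hypothetical data mirroring the example in the question.
df = pd.DataFrame({"class": [1, 1, 1, 5, 5], "value": [6, 4, 5, 6, 2]})

# transform broadcasts each group's aggregate back to that group's rows,
# so the result aligns with the original index.
df["min"] = df.groupby("class")["value"].transform("min")
df["max"] = df.groupby("class")["value"].transform("max")
print(df)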

How to select rows with same index to form a new DataFrame in python pandas?

Say I have a dataframe like this, filename is the index:
filename a b c
1 1 2 3
1 1 3 4
2 2 2 2
2 3 2 5
2 8 9 9
3 4 8 6
3 1 1 1
I want to divide this dataframe into three dataframes and then process them one by one in a loop. Each dataframe contains the rows with the same filename, like this:
dataframe1:
filename a b c
1 1 2 3
1 1 3 4
dataframe2:
filename a b c
2 2 2 2
2 3 2 5
2 8 9 9
dataframe3:
filename a b c
3 4 8 6
3 1 1 1
Also, in my situation I don't actually know in advance how many sub-dataframes I will get, so I want the program to figure this out too, and then I can use a loop to process each sub-dataframe.
How can I do this in python pandas? Thanks!
You can simply do this if you want to get the number of groups:
group = df.groupby('filename')
group.ngroups
and if you want to apply a custom function you can use apply, which takes your custom function as a parameter and passes each group to it:
group.apply(your_custom_function)
You can try this simple function to understand what input your custom function receives:
def print_group(df):
    print(df)
    print('-------------------')

group.apply(print_group)
a b c
0 1 2 3
1 1 3 4
-------------------
a b c
2 2 2 2
3 3 2 5
4 8 9 9
-------------------
a b c
5 4 8 6
6 1 1 1
-------------------
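As a minimal, self-contained sketch (the data is a hypothetical reconstruction of the example; iterating the GroupBy object directly is another way to process one sub-dataframe at a time):

import pandas as pd

# Hypothetical data mirroring the example; 'filename' is the index.
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 8, 4, 1],
     'b': [2, 3, 2, 2, 9, 8, 1],
     'c': [3, 4, 2, 5, 9, 6, 1]},
    index=pd.Index([1, 1, 2, 2, 2, 3, 3], name='filename'),
)

group = df.groupby('filename')
print(group.ngroups)            # number of sub-dataframes (3 here)

# Each iteration yields the filename and the matching sub-dataframe.
for name, sub_df in group:
    print(name)
    print(sub_df)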
