I'm working on a large dataset with pandas in Python, and I have a dataframe structured similarly to the following:
class value
0 1 6
1 1 4
2 1 5
3 5 6
4 5 2
...
n 225 3
The class values increase through the dataframe, but some values are skipped, as shown in the example. I was wondering how I can get simple stats like the min or max of each class and assign them to a new feature.
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
...
n 225 3 3
The only solution I can come up with is a time-consuming loop.
By using transform:
df['min'] = df.groupby('class')['value'].transform('min')
df
Out[497]:
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
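For reference, a minimal runnable sketch of the same approach, rebuilding the small example frame from the question (max is added the same way, since the question mentions both):

import pandas as pd

# rebuild the small example from the question
df = pd.DataFrame({'class': [1, 1, 1, 5, 5],
                   'value': [6, 4, 5, 6, 2]})

# transform computes the per-group statistic and broadcasts it
# back onto every row of the group
df['min'] = df.groupby('class')['value'].transform('min')
df['max'] = df.groupby('class')['value'].transform('max')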
I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
group some_value label
0 1 8 1
1 1 7 0
2 1 6 2
3 1 5 2
4 2 1 0
5 2 2 0
6 2 3 1
7 2 4 2
8 3 2 1
9 3 4 1
10 3 2 1
11 3 4 2
I want to group by the group column, shuffle the label column within each group, and write it back to the dataframe, preferably in place. The some_value column should remain intact. The result should look something like the following:
group some_value label
0 1 8 1
1 1 7 2
2 1 6 2
3 1 5 0
4 2 1 1
5 2 2 0
6 2 3 0
7 2 4 2
8 3 2 1
9 3 4 2
10 3 2 1
11 3 4 1
I used np.random.permutation but found it was very slow.
df["label"] = df.groupby("group")["label"].transform(np.random.permutation)
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and in place?
We can use sample. Note this assumes df = df.sort_values('group'):
df['New'] = df.groupby('group').label.apply(lambda x: x.sample(len(x))).values
Or we can do it this way:
df['New'] = df.sample(len(df)).sort_values('group').New.values
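Put together as a runnable sketch, rebuilding the frame from the question (as noted, the first variant assumes the frame is sorted by group, because .values assigns the shuffled labels positionally):

import pandas as pd

df = pd.DataFrame({'group': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'some_value': [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
                   'label': [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2]})

# shuffle 'label' within each group; .values strips the shuffled index,
# so alignment is positional, hence the sort-by-group assumption
df['label'] = df.groupby('group')['label'].apply(lambda x: x.sample(len(x))).values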
What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
This SO explanation of printing out what is passed into the custom function by transform is helpful; a quick sketch of the idea follows.
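For instance, a hypothetical peek helper (the name and prints are only for inspection):

def peek(x):
    # x is the 'label' Series of one group
    print(type(x))
    print(x)
    return x  # transform expects a same-length result

df.groupby("group")["label"].transform(peek)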
I am doing data processing and I am having trouble figuring out how to reset a group counter after concatenating pandas dataframes. Here is an example to illustrate my problem.
For example, I have two dataframes:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
Counter Value
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
after concatenation I get:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 1 8
1 1 10
2 2 2
3 2 4
4 2 10
and I want to reset the counter so it stays sequential, with the second dataframe's groups continuing one above the last group of the first:
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
0 3 8
1 3 10
2 4 2
3 4 4
4 4 10
I tried shifting the whole dataframe up by one row and comparing the shifted values with the originals; where the original value was bigger than the shifted one, I added the original value to all values below it. But this solution does not always work due to noisy and inconsistent raw data.
You can just add the maximum value in the Counter column in the first dataframe to the second before concatenating:
df2.Counter += df1.Counter.max()
pd.concat([df1, df2], ignore_index=True)
Counter Value
0 1 3
1 1 4
2 1 2
3 2 4
4 2 10
5 3 8
6 3 10
7 4 2
8 4 4
9 4 10
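As a self-contained sketch with the two example frames (note that pd.concat returns a new frame, so the result is assigned):

import pandas as pd

df1 = pd.DataFrame({'Counter': [1, 1, 1, 2, 2], 'Value': [3, 4, 2, 4, 10]})
df2 = pd.DataFrame({'Counter': [1, 1, 2, 2, 2], 'Value': [8, 10, 2, 4, 10]})

# push the second frame's groups past the first frame's last group
df2['Counter'] += df1['Counter'].max()
df = pd.concat([df1, df2], ignore_index=True)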
Or another way using shift():
df = pd.concat([df1, df2])
df = df.assign(Counter_1=df.Counter.ne(df.Counter.shift()).cumsum())
# to overwrite the same column: df = df.assign(Counter=df.Counter.ne(df.Counter.shift()).cumsum())
Counter Value Counter_1
0 1 3 1
1 1 4 1
2 1 2 1
3 2 4 2
4 2 10 2
0 1 8 3
1 1 10 3
2 2 2 4
3 2 4 4
4 2 10 4
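As a runnable sketch, starting again from the original df1 and df2: every row where Counter differs from the previous row starts a new group, and the cumulative sum numbers those starts 1, 2, 3, ... (this assumes adjacent groups never share the same counter value across a boundary):

df = pd.concat([df1, df2])
# True wherever the counter changes; cumsum turns the group
# starts into a sequential group id
df['Counter'] = df['Counter'].ne(df['Counter'].shift()).cumsum()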
I have a really huge dataframe (thousands of rows), but let's assume it is like this:
A B C D E F
0 2 5 2 2 2 2
1 5 2 5 5 5 5
2 5 2 5 2 5 5
3 2 2 2 2 2 2
4 5 5 5 5 5 5
For each row, I need to see which value appears most frequently in a group of columns: for instance, the most frequent value across columns A, B, and C, and across columns D, E, and F, each written to a new column. In this example, my expected output is:
ABC DEF
2 2
5 5
5 5
2 2
5 5
How can I do it in Python?
Thanks!
Here is one way, using groupby on the columns:
mapper = {'A': 'ABC', 'B': 'ABC', 'C': 'ABC', 'D': 'DEF', 'E': 'DEF', 'F': 'DEF'}
df.groupby(mapper, axis=1).agg(lambda x: x.mode()[0])
Out[826]:
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5
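One caveat: groupby(..., axis=1) is deprecated in recent pandas (2.1+). An equivalent that avoids it is to transpose, group the rows with the same mapper, and transpose back:

df.T.groupby(mapper).agg(lambda x: x.mode()[0]).T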
For good performance you can work with the underlying NumPy arrays, and use scipy.stats.mode to compute the mode:
from scipy import stats
cols = ['ABC','DEF']
a = df.values.reshape(-1, df.shape[1]//2)
pd.DataFrame(stats.mode(a, axis=1).mode.reshape(-1,2), columns=cols)
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5
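A version note: scipy.stats.mode gained a keepdims argument in SciPy 1.9, and its default changed later, so pinning keepdims=True keeps the reshape above working across versions (a sketch under that assumption):

m = stats.mode(a, axis=1, keepdims=True).mode  # shape (10, 1)
pd.DataFrame(m.reshape(-1, 2), columns=cols)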
You can try using column-header filtering:
grp = ['ABC','DEF']
pd.concat([df.loc[:,[*g]].mode(1).set_axis([g], axis=1, inplace=False) for g in grp], axis=1)
Output:
ABC DEF
0 2 2
1 5 5
2 5 5
3 2 2
4 5 5
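A small compatibility note: set_axis no longer accepts inplace in pandas 2.x, and False was already the default in pandas 1.x, so the argument can simply be dropped:

pd.concat([df.loc[:, [*g]].mode(1).set_axis([g], axis=1) for g in grp], axis=1)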
Looking for advice on how to solve the following problem.
I have a pandas dataframe, let's say 1,000,000 rows by 10 columns (A, B, C, ..., J). The data type is float.
The task is to remove every row i from the dataframe if there exists another row j whose values are all equal to or greater than the values in row i:
(Ai<=Aj) and (Bi<=Bj) and (Ci<=Cj) ... and (Ji<=Jj)
I wonder whether any tools exist in the pandas toolkit, or any other Python analytics module, to solve this problem efficiently.
I have a very inefficient solution with multiple iterations over a plain array. Hoping to find something more promising.
Simplified example, original data:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 1
0 0 0 0 0 8
Result to be:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 8
Here is a way with numpy:
import numpy as np
df[np.any(np.sum(df.values >= df.values[:, None], 1) == 1, 1)]
Out[40]:
A B C D E F
0 0 1 5 4 4 2
1 2 5 6 4 3 7
2 -2 5 6 5 3 7
4 0 0 0 0 0 8
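Since the one-liner is dense, here is an annotated sketch of the same logic: a row is kept if it holds the unique maximum in at least one column, which guarantees no other row dominates it.

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 5, 4, 4, 2],
                   [2, 5, 6, 4, 3, 7],
                   [-2, 5, 6, 5, 3, 7],
                   [0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 8]], columns=list('ABCDEF'))

v = df.values
# ge[i, j, k] is True when row j is >= row i in column k
ge = v >= v[:, None]
# count, per row i and column k, how many rows are >= it; a count
# of exactly 1 means only row i itself, i.e. a unique column maximum
keep = (ge.sum(axis=1) == 1).any(axis=1)
df[keep]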
I have a large file with 2.2 million rows.
Value Label
4 1
6 1
2 2
6 2
3 2
5 3
8 3
7 3
1 4
5 4
2 5
4 5
1 5
I want to know the fastest way to get the following output, where 'Max' stores the maximum value within each label:
Label Max
1 6
2 6
3 8
4 5
5 4
I implemented the straightforward logic using for and while loops in Python, but it takes hours. I expect pandas has something for tackling this.
Call max on a groupby object:
In [116]:
df.groupby('Label').max()
Out[116]:
Value
Label
1 6
2 6
3 8
4 5
5 4
If you want to restore the Label column from the index then call reset_index:
In [117]:
df.groupby('Label').max().reset_index()
Out[117]:
Label Value
0 1 6
1 2 6
2 3 8
3 4 5
4 5 4
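To get exactly the Label/Max layout from the question, as_index=False plus a rename does it in one chain:

df.groupby('Label', as_index=False)['Value'].max().rename(columns={'Value': 'Max'})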