What I have:
import pandas as pd

df = pd.DataFrame({'SERIES1': ['A','A','A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','C','C'],
                   'SERIES2': [1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'SERIES3': [10,12,20,10,12,4,8,8,1,10,12,12,13,13,9,8,7,7,7]})
SERIES1 SERIES2 SERIES3
0 A 1 10
1 A 1 12
2 A 1 20
3 A 1 10
4 A 2 12
5 A 2 4
6 B 1 8
7 B 1 8
8 B 1 1
9 B 1 10
10 B 1 12
11 B 1 12
12 B 1 13
13 B 1 13
14 C 1 9
15 C 1 8
16 C 1 7
17 C 1 7
18 C 1 7
What I need is to group by SERIES1 and SERIES2 and to convert the values in SERIES3 to the minimum of that group. i.e.:
df2 = pd.DataFrame({'SERIES1': ['A','A','A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','C','C'],
                    'SERIES2': [1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1],
                    'SERIES3': [10,10,10,10,4,4,1,1,1,1,1,1,1,1,7,7,7,7,7]})
SERIES1 SERIES2 SERIES3
0 A 1 10
1 A 1 10
2 A 1 10
3 A 1 10
4 A 2 4
5 A 2 4
6 B 1 1
7 B 1 1
8 B 1 1
9 B 1 1
10 B 1 1
11 B 1 1
12 B 1 1
13 B 1 1
14 C 1 7
15 C 1 7
16 C 1 7
17 C 1 7
18 C 1 7
I have a feeling this can be done with .groupby(), but I'm not sure how to replace the values in the existing DataFrame, or how to add the result as a new series.
I'm able to get:
df.groupby(['SERIES1', 'SERIES2']).min()
SERIES3
SERIES1 SERIES2
A 1 10
2 4
B 1 1
C 1 7
which are the correct minimums per group, but I can't figure out a simple way to pop that back into the original DataFrame.
You can use groupby.transform, which gives back a same-length series that you can assign back to the data frame:
df['SERIES3'] = df.groupby(['SERIES1', 'SERIES2']).SERIES3.transform('min')
df
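This overwrites SERIES3 in place with each group's minimum, reproducing your df2. If you'd rather keep the original values and add the result as a new series instead (the column name SERIES3_MIN here is just illustrative):

df['SERIES3_MIN'] = df.groupby(['SERIES1', 'SERIES2'])['SERIES3'].transform('min')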
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the rows whose value in column B is greater than 0.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicates in your sample data, so you only need to filter on B:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
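Equivalently, you can filter first and then call drop_duplicates, which keeps the first occurrence of each fully duplicated row; because identical rows share the same B value, this gives the same result as the mask above:

df = df[df['B'].gt(0)].drop_duplicates()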
I have the following dask dataframe
a b c
1 a 30
1 a 11
2 b 99
2 b 55
3 c 21
4 d 21
I want to sequence the duplicate rows based on the size of each row's c field. Below is example output:
a b c seq
1 a 30 2
1 a 11 1
2 b 99 2
2 b 55 1
3 c 21 1
4 d 21 1
Is there an easy way to do this in dask?
Before you ask, I'm replicating an existing process and I don't know why the duplicate rows are sequenced using the c field.
Try with rank; the default ascending order gives the smallest c in each group rank 1, matching your expected seq:
df['new'] = df.groupby('a')['c'].rank().astype(int)
Out[29]:
0 2
1 1
2 2
3 1
4 1
5 1
Name: c, dtype: int32
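The snippet above is plain pandas; whether groupby(...).rank() runs directly on a dask dataframe depends on your dask version. A more portable, hedged sketch is groupby.apply, which executes the pandas rank inside each group (the meta argument declares the output schema; note that apply shuffles data between partitions, so it can be expensive):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1, 1, 2, 2, 3, 4],
                    'b': ['a', 'a', 'b', 'b', 'c', 'd'],
                    'c': [30, 11, 99, 55, 21, 21]})
ddf = dd.from_pandas(pdf, npartitions=2)

# run pandas' rank within each 'a' group and attach it as 'seq'
result = ddf.groupby('a').apply(
    lambda g: g.assign(seq=g['c'].rank().astype(int)),
    meta={'a': 'int64', 'b': 'object', 'c': 'int64', 'seq': 'int64'})
print(result.compute())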
I have a dataframe (df)
a b c
1 2 20
1 2 15
2 4 30
3 2 20
3 2 15
and I want to select only the max values from column c.
I tried
a = df.loc[df.groupby('b')['c'].idxmax()]
but the groupby removes duplicates, so I get
a b c
1 2 20
2 4 30
It removes the a=3 rows because they fall into the same b=2 group as the a=1 rows, and idxmax returns only one row per group.
Is there any way to write the code so it doesn't remove them?
Just also take column a into account when you do the groupby:
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
a b c
0 1 2 20
2 2 4 30
3 3 2 20
I think you need:
df = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df)
a b c
0 1 2 20
2 2 4 30
3 3 2 20
The difference between the two approaches shows up with modified data:
print (df)
a b c
0 1 2 30
1 1 2 30
2 1 2 15
3 2 4 30
4 3 2 20
5 3 2 15
# only 1 max row per (a, b) group
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
print (a)
a b c
0 1 2 30
3 2 4 30
4 3 2 20
# all max rows per b group
df1 = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df1)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
# all max rows per (a, b) group
df2 = df[df['c'] == df.groupby(['a', 'b'])['c'].transform('max')]
print (df2)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
4 3 2 20
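In short: idxmax returns exactly one row per group even when the maximum is tied, while comparing against transform('max') keeps every row that attains the group maximum.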
I have a dataset which I read in with:
data = pd.read_excel(r'....\data.xlsx')
data = data.fillna(0)
and I converted the columns to strings:
data['Block']=data['Block'].astype(str)
data['Concentration']=data['Concentration'].astype(str)
data['Name']=data['Name'].astype(str)
data looks like this
Block Con Name
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
I inserted a column 'replicate':
data['replicate'] = ''
data now looks like this
Block Con Name replicate
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
Each Block|Con|Name combination has 3 replicates; how would I fill out the 'replicate' column with 1, 2, 3 going down the column?
desired output would be
Block Con Name replicate
1 100 A 1
1 100 A 2
1 100 A 3
1 33 B 1
1 33 B 2
1 33 B 3
1 0 c 1
1 0 c 2
1 0 c 3
2 100 A 1
2 100 A 2
2 100 A 3
2 100 B 1
2 100 B 2
2 100 B 3
2 33 B 1
2 33 B 2
2 33 B 3
2 0 c 1
2 0 c 2
2 0 c 3
...
...
24 0 E 3
pseudo code would be:
for b in data.block:
    for c in data.con:
        for n in data.name:
            for each b|c|n combination:
                if the same:
                    assign '1' to data.replicate
                    assign '2' to data.replicate
                    assign '3' to data.replicate
I have searched online and have not found any solution, and I'm not sure which function to use for this.
That looks like a groupby cumcount:
In [11]: df["Replicate"] = df.groupby(["Block", "Con", "Name"]).cumcount() + 1
In [12]: df
Out[12]:
Block Con Name Replicate
0 1 100 A 1
1 1 100 A 2
2 1 100 A 3
3 1 33 B 1
4 1 33 B 2
5 1 33 B 3
6 1 0 c 1
7 1 0 c 2
8 1 0 c 3
9 2 100 A 1
10 2 100 A 2
11 2 100 A 3
12 2 100 B 1
13 2 100 B 2
14 2 100 B 3
15 2 33 B 1
16 2 33 B 2
17 2 33 B 3
18 2 0 c 1
19 2 0 c 2
20 2 0 c 3
cumcount enumerates the rows in each group (from 0).
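A minimal illustration of that zero-based numbering, on toy data echoing the question's columns:

import pandas as pd

s = pd.DataFrame({'Block': [1, 1, 1, 2],
                  'Con': [100, 100, 100, 33],
                  'Name': ['A', 'A', 'A', 'B']})
print(s.groupby(['Block', 'Con', 'Name']).cumcount().tolist())
# [0, 1, 2, 0] -- the count restarts at 0 for each group, hence the "+ 1"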
You can use numpy.tile:
import numpy as np
replicate_arr = np.tile(['1', '2', '3'], len(data) // 3)  # reps must be an integer, so use floor division
data['replicate'] = replicate_arr
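Note that this assumes the rows are already ordered so each Block|Con|Name triple is consecutive, and that the total row count is an exact multiple of 3; the cumcount approach above needs neither assumption.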