adding values to a column by order in pandas (python)

I have a dataset which I read in with
data = pd.read_excel('....\data.xlsx')
data = data.fillna(0)
and I converted the columns to strings:
data['Block'] = data['Block'].astype(str)
data['Concentration'] = data['Concentration'].astype(str)
data['Name'] = data['Name'].astype(str)
data looks like this
Block Con Name
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
I inserted an empty column 'replicate':
data['replicate'] = ''
data now looks like this
Block Con Name replicate
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
Each Block|Con|Name combination has 3 replicates; how would I fill the 'replicate' column with 1, 2, 3 going down within each combination?
desired output would be
Block Con Name replicate
1 100 A 1
1 100 A 2
1 100 A 3
1 33 B 1
1 33 B 2
1 33 B 3
1 0 c 1
1 0 c 2
1 0 c 3
2 100 A 1
2 100 A 2
2 100 A 3
2 100 B 1
2 100 B 2
2 100 B 3
2 33 B 1
2 33 B 2
2 33 B 3
2 0 c 1
2 0 c 2
2 0 c 3
...
...
24 0 E 3
pseudo code would be:
for b in data.Block:
    for c in data.Con:
        for n in data.Name:
            for each b|c|n combination:
                if the same:
                    assign '1' to data.replicate
                    assign '2' to data.replicate
                    assign '3' to data.replicate
I have searched online and have not found a solution; I'm not sure which function to use for this.

That looks like a groupby cumcount:
In [11]: df["Replicate"] = df.groupby(["Block", "Con", "Name"]).cumcount() + 1
In [12]: df
Out[12]:
Block Con Name Replicate
0 1 100 A 1
1 1 100 A 2
2 1 100 A 3
3 1 33 B 1
4 1 33 B 2
5 1 33 B 3
6 1 0 c 1
7 1 0 c 2
8 1 0 c 3
9 2 100 A 1
10 2 100 A 2
11 2 100 A 3
12 2 100 B 1
13 2 100 B 2
14 2 100 B 3
15 2 33 B 1
16 2 33 B 2
17 2 33 B 3
18 2 0 c 1
19 2 0 c 2
20 2 0 c 3
cumcount enumerates the rows in each group (from 0).
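For reference, a minimal self-contained version of this, with toy data rebuilt from the question:

```python
import pandas as pd

# Toy frame mirroring the question: 3 replicate rows per Block/Con/Name combination
df = pd.DataFrame({
    "Block": [1, 1, 1, 1, 1, 1],
    "Con":   [100, 100, 100, 33, 33, 33],
    "Name":  ["A", "A", "A", "B", "B", "B"],
})

# cumcount numbers the rows within each group starting at 0, so add 1
df["Replicate"] = df.groupby(["Block", "Con", "Name"]).cumcount() + 1
print(df["Replicate"].tolist())  # [1, 2, 3, 1, 2, 3]
```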

You can use numpy.tile:
import numpy as np

# integer division: np.tile needs an int repeat count (len(data)/3 is a float in Python 3)
replicate_arr = np.tile(['1', '2', '3'], len(data) // 3)
data['replicate'] = replicate_arr
Note this assumes the row count is an exact multiple of 3 and that every combination occupies three consecutive rows; unlike the groupby approach, it never looks at the values.
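A quick sanity check of the tiling idea, with a hypothetical row count standing in for len(data):

```python
import numpy as np

n_rows = 9  # stand-in for len(data); must be a multiple of 3
replicate_arr = np.tile(['1', '2', '3'], n_rows // 3)
print(replicate_arr.tolist())  # ['1', '2', '3', '1', '2', '3', '1', '2', '3']
```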

Related

python pandas: Remove duplicates by columns A, which is not satisfying a condition in column B

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row which has its value > 0 in column B
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
In the sample data there are no duplicates, so filtering on column B is enough:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
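A runnable version of the duplicate-aware filter, using the sample data from this answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1, 2, 2, 2, 3],
    "B": [20, 10, 10, 10, -3, 30, -9, 40, 10],
})

# keep rows where B > 0 AND the row is not an exact duplicate of an earlier one
out = df[df["B"].gt(0) & ~df.duplicated()]
print(out["B"].tolist())  # [20, 10, 30, 40, 10]
```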

Can I search for values to match results from another dataframe?

I want to do is line up a value together from 2 dataframes but they differ in shape and size.
Say I want to extract column D from one of the dataframe and append it to another
DataFrame1:
A B C D
1 1 0 2
1 4 0 1
1 0 2 4
2 2 3 0
2 1 0 1
Dataframe2
A B C D
1 1 0 54
1 4 0 10
1 0 2 54
2 2 3 55
2 1 0 34
outcome I'm looking for:
A B C D newD
1 1 0 2 54
1 4 0 1 10
1 0 2 4 54
2 2 3 0 55
2 1 0 1 34
I tried this
DataFrame1['newD'] = DataFrame2.loc[DataFrame1[['A', 'B', 'C']] == DataFrame2['A', 'B', 'C']]['D']
but I got a keyword error: KeyError: ('A', 'B', 'C')
Is there an easy way to get this result?
bonus question - Is it possible to have multiple criteria in search(i.e. D not null or something?)?
This is just a merge:
pd.merge(df1,df2, on=['A','B','C'], how='left')
Output:
A B C D_x D_y
0 1 1 0 2 54
1 1 4 0 1 10
2 1 0 2 4 54
3 2 2 3 0 55
4 2 1 0 1 34
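To get the exact newD column from the question instead of D_x/D_y, you can set the merge suffixes and rename; a sketch using the question's data (the "_new" suffix is an arbitrary choice):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [1, 4, 0, 2, 1],
                    "C": [0, 0, 2, 3, 0], "D": [2, 1, 4, 0, 1]})
df2 = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [1, 4, 0, 2, 1],
                    "C": [0, 0, 2, 3, 0], "D": [54, 10, 54, 55, 34]})

# suffixes=("", "_new") leaves df1's D untouched and marks df2's copy,
# which is then renamed to newD
out = (df1.merge(df2, on=["A", "B", "C"], how="left", suffixes=("", "_new"))
          .rename(columns={"D_new": "newD"}))
print(out["newD"].tolist())  # [54, 10, 54, 55, 34]
```

For the bonus question, extra criteria can be applied to the right frame before the merge, e.g. df2[df2["D"].notna()].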

Add a column at a specific location in Dataframe

I have a dataframe-
data={'a':[1,2,3,6],'b':[5,6,7,6],'c':[45,77,88,99]}
df=pd.DataFrame(data)
Now I want to add a column whose values start two rows down in the dataframe.
The updated dataframe should look like-
l=[4,5] #column to add
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
I tried this:
df.loc[:2,'f'] = pd.Series(l)
The idea is to add a Series whose index matches the last len(l) rows:
df['d'] = pd.Series(l, index=df.index[-len(l):])
print (df)
a b c d
0 1 5 45 NaN
1 2 6 77 NaN
2 3 7 88 4.0
3 6 6 99 5.0
Finally, to get 0 instead of NaN, reindex by the original index with fill_value=0:
df['d'] = pd.Series(l, index=df.index[-len(l):]).reindex(df.index, fill_value=0)
print (df)
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
Another idea is to pad with zeros for the difference in lengths and append l:
df['d'] = [0] * (len(df) - len(l)) + l
print (df)
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
You can add a column of 0s and then assign the last rows by position:
>>> df
a b c
0 1 5 45
1 2 6 77
2 3 7 88
3 6 6 99
>>> df['d'] = 0
>>> df.iloc[-2:, df.columns.get_loc('d')] = [4,5]
>>> df
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5

group by pandas removes duplicates

I have a dataframe (df)
a b c
1 2 20
1 2 15
2 4 30
3 2 20
3 2 15
and I want to recognize only max values from column c
I tried
a = df.loc[df.groupby('b')['c'].idxmax()]
but the groupby keeps only one max row per b group, so I get
a b c
1 2 20
2 4 30
the rows with a = 3 are dropped because they fall into the same b groups as the a = 1 rows.
Is there any way to write this so those rows are kept?
Just also take column a into account when you do the groupby:
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
a b c
0 1 2 20
2 2 4 30
3 3 2 20
I think you need:
df = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df)
a b c
0 1 2 20
2 2 4 30
3 3 2 20
The difference between the approaches shows up with modified data:
print (df)
a b c
0 1 2 30
1 1 2 30
2 1 2 15
3 2 4 30
4 3 2 20
5 3 2 15
#only 1 max rows per groups a and b
a = df.loc[df.groupby(['a', 'b'])['c'].idxmax()]
print (a)
a b c
0 1 2 30
3 2 4 30
4 3 2 20
#all max rows per groups b
df1 = df[df['c'] == df.groupby('b')['c'].transform('max')]
print (df1)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
#all max rows per groups a and b
df2 = df[df['c'] == df.groupby(['a', 'b'])['c'].transform('max')]
print (df2)
a b c
0 1 2 30
1 1 2 30
3 2 4 30
4 3 2 20

Python convert variables to cases

I'm trying to transform a DataFrame from this
id track var1 text1 var1 text2
1 1 10 a 11 b
2 1 17 b 19 c
3 2 20 c 33 d
Into this:
id track var text
1 1 10 a
1 1 11 b
2 1 17 b
2 1 19 c
3 2 20 c
3 2 33 d
I'm trying pandas' stack() method, but it reshapes all columns and does not keep the fixed columns (i.e. id, track) in place.
Any ideas?
Try pd.wide_to_long, after making the column names unique:
df.columns=['id','track','var1','text1','var2','text2']
pd.wide_to_long(df,['var','text'],i=['id','track'],j='drop').reset_index(level=[0,1])
Out[238]:
id track var text
drop
1 1 1 10 a
2 1 1 11 b
1 2 1 17 b
2 2 1 19 c
1 3 2 20 c
2 3 2 33 d
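A runnable sketch of that answer, rebuilding the frame from the question (the duplicate var1 header is assumed to be a typo for var2; the suffix is kept as a "num" column here to make the final sort deterministic):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 10, "a", 11, "b"],
                   [2, 1, 17, "b", 19, "c"],
                   [3, 2, 20, "c", 33, "d"]],
                  columns=["id", "track", "var1", "text1", "var2", "text2"])

# wide_to_long gathers the var1/var2 and text1/text2 pairs into "var" and
# "text" stubs, keyed by id/track; j captures the 1/2 suffix
long_df = (pd.wide_to_long(df, ["var", "text"], i=["id", "track"], j="num")
             .reset_index()
             .sort_values(["id", "num"])
             .reset_index(drop=True))
print(long_df[["id", "track", "var", "text"]].to_string(index=False))
```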
