Finding the closest value to the median in duplicated rows of a dataframe - python

I have a DataFrame which contains more than 2000 rows.
Here is a part of my DataFrame:
In [2]: df
Out[2]:
A B C D
0 a b -1 3.5
1 a b -1 52
2 a b -1 2
3 a b -1 0
4 a b 0 15
5 a c -1 1612
6 a c 1 17
7 a e 1 52
8 a d -1 412
9 a d -1 532
I would like to find, within each group of A, B and C, the value of D closest to the group's median (the next one up, on ties) and add a new column Next_Med to flag it.
Here is the expected result:
A B C D Next_Med
0 a b -1 3.5 1
1 a b -1 52 0
2 a b -1 2 0
3 a b -1 0 0
4 a b 0 15 1
5 a c -1 1612 1
6 a c 1 17 1
7 a e 1 52 1
8 a d -1 412 0
9 a d -1 532 1
For example, for the a, b, -1 combination the median value is 2.75, so I'd like to label 3.5 as Next_Med.
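A quick check of that group shows why ties matter here (3.5 and 2 are both 0.75 away from the median, and the expected output keeps the larger one):
group = df.loc[(df.A == 'a') & (df.B == 'b') & (df.C == -1), 'D']
print(group.median())                           # 2.75
print((group - group.median()).abs().tolist())  # [0.75, 49.25, 0.75, 2.75]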

Try the following one-liner with groupby and transform with a lambda. transform keeps the result aligned on the original index, and sorting each group's values in descending order first makes ties resolve to the larger ("next") value:
>>> df['Next_Med'] = df.groupby([*'ABC'])['D'].transform(lambda x: x == min(sorted(x, reverse=True), key=lambda y: abs(y - x.median()))).astype(int)
>>> df
A B C D Next_Med
0 a b -1 3.5 1
1 a b -1 52.0 0
2 a b -1 2.0 0
3 a b -1 0.0 0
4 a b 0 15.0 1
5 a c -1 1612.0 1
6 a c 1 17.0 1
7 a e 1 52.0 1
8 a d -1 412.0 0
9 a d -1 532.0 1

How to subtract all pandas dataframe elements from each other in an easier way?

let's say I have a dataframe like this
name time
a 10
b 30
c 11
d 13
now I want a new dataframe like this
name1 name2 time_diff
a a 0
a b -20
a c -1
a d -3
b a 20
b b 0
b c 19
b d 17
.....
.....
d d 0
Nested for loops with a lambda function can be used, but once the number of elements goes above 200 the loops just take too much time to finish, or should I say, I always have to interrupt the process. Does someone know a pandas way, or something quicker and easier? The shape of my dataframe is 1600x2.
Solution with itertools:
import itertools
d = pd.DataFrame(list(itertools.product(df.name, df.name)), columns=['name1', 'name2'])
dic = dict(zip(df.name, df.time))
d['time_diff'] = d.name1.map(dic) - d.name2.map(dic)
print(d)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
Use a cross join first, via merge on a helper column, then take the difference and select only the necessary columns:
df = df.assign(A=1)
df = pd.merge(df, df, on='A', suffixes=('1','2'))
df['time_diff'] = df['time1'] - df['time2']
df = df[['name1','name2','time_diff']]
print (df)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
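On pandas 1.2+ the helper column isn't needed, since merge supports a cross join directly; a minimal equivalent:
out = pd.merge(df, df, how='cross', suffixes=('1', '2'))
out['time_diff'] = out['time1'] - out['time2']
out = out[['name1', 'name2', 'time_diff']]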
Another solution with MultiIndex.from_product and reindex by first and second level:
df = df.set_index('name')
mux = pd.MultiIndex.from_product([df.index, df.index], names=['name1','name2'])
df = (df['time'].reindex(mux, level=0)
        .sub(df.reindex(mux, level=1)['time'])
        .rename('time_diff')
        .reset_index())
Another way would be df.apply:
df = pd.DataFrame({'col': ['a', 'b', 'c', 'd'], 'col1': [10, 30, 11, 13]})
index = pd.MultiIndex.from_product([df['col'], df['col']], names=["name1", "name2"])
res = pd.DataFrame(index=index).reset_index()
res['time_diff'] = df.apply(lambda x: x['col1'] - df['col1'], axis=1).values.flatten()
Output:
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
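With 1600 rows (2,560,000 pairs) a plain NumPy broadcast is also worth trying; a minimal sketch, assuming the name/time columns from the question:
import numpy as np
names = df['name'].to_numpy()
times = df['time'].to_numpy()
diff = times[:, None] - times[None, :]  # diff[i, j] = time_i - time_j
out = pd.DataFrame({'name1': np.repeat(names, len(names)),
                    'name2': np.tile(names, len(names)),
                    'time_diff': diff.ravel()})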

Pad a discontinuous dataframe column

I have the following dataframe:
Name B C D E
1 A 1 2 2 7
2 A 7 1 1 7
3 B 1 1 3 4
4 B 2 1 3 4
5 B 3 1 3 4
What I'm trying to do is obtain a new dataframe in which, for rows sharing the same "Name", the values in column "B" are continuous. In this example the rows with "Name" = A would be padded with B values ranging from 1 to 7, with 0 in columns C, D and E for the added rows.
Name B C D E
1 A 1 2 2 7
2 A 2 0 0 0
3 A 3 0 0 0
4 A 4 0 0 0
5 A 5 0 0 0
6 A 6 0 0 0
7 A 7 1 1 7
8 B 1 1 3 4
9 B 2 1 3 4
10 B 3 1 3 4
What I've done so far is to turn the B column values for the same "Name" into continuous values:
new_idx = df_.groupby('Name').apply(lambda x: np.arange(x.index.min(), x.index.max() + 1)).apply(pd.Series).stack()
and then reindexing the original df (with B set as the index) using this new Series, but I'm having trouble reindexing with duplicates. Any help would be appreciated.
You can use:
import numpy as np

def f(x):
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name')
             .reset_index()
             .reindex(columns=df.columns))
print(new_idx)
Name B C D E
0 A 1 2 2 7
1 A 2 0 0 0
2 A 3 0 0 0
3 A 4 0 0 0
4 A 5 0 0 0
5 A 6 0 0 0
6 A 7 1 1 7
7 B 1 1 3 4
8 B 2 1 3 4
9 B 3 1 3 4
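An equivalent merge-based sketch, rebuilding the sample frame from the question (integer dtypes for C, D and E are an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'],
                   'B': [1, 7, 1, 2, 3],
                   'C': [2, 1, 1, 1, 1],
                   'D': [2, 1, 3, 3, 3],
                   'E': [7, 7, 4, 4, 4]})

# every (Name, B) pair that should exist, gaps included
full = pd.concat([pd.DataFrame({'Name': name,
                                'B': np.arange(g['B'].min(), g['B'].max() + 1)})
                  for name, g in df.groupby('Name')])
# left-merge the real rows back in and zero-fill the padding
out = (full.merge(df, on=['Name', 'B'], how='left')
           .fillna(0)
           .astype({'C': int, 'D': int, 'E': int}))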

rename index using index and name column

I have the dataframe df
import numpy as np
import pandas as pd

b = np.array([0, 1, 2, 2, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 0, 1, 0, 0]).reshape(-1, 1)
c = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd', 'e']).reshape(-1, 1)
df = pd.DataFrame(np.hstack([b, c]), columns=['Start', 'File'])
df
df
Out[22]:
Start File
0 0 a
1 1 a
2 2 a
3 2 a
4 0 b
5 1 b
6 2 b
7 2 b
8 3 b
9 4 b
10 4 b
11 4 b
12 5 b
13 6 b
14 0 c
15 1 c
16 0 d
17 0 e
I would like to rename the index using index_File,
in order to have 0_a, 1_a, ... 17_e as indices.
You can use set_index, with or without inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
At the expense of a few more code characters, we can quicken this up and take care of the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
Start File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
You can directly assign to the index: first convert the default index to str using astype, then concatenate the strings as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
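Note that the set_index/radd and direct-assignment variants leave the new index named File; if that's unwanted, it can be cleared afterwards:
df.index.name = None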

Adding values to a column in order - pandas python

I have a dataset which I read in with:
data = pd.read_excel('....\data.xlsx')
data = data.fillna(0)
and I cast them all to strings:
data['Block']=data['Block'].astype(str)
data['Concentration']=data['Concentration'].astype(str)
data['Name']=data['Name'].astype(str)
data looks like this
Block Con Name
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
I inserted a column 'replicate':
data['replicate'] = ''
data now looks like this
Block Con Name replicate
1 100 A
1 100 A
1 100 A
1 33 B
1 33 B
1 33 B
1 0 c
1 0 c
1 0 c
2 100 A
2 100 A
2 100 A
2 100 B
2 100 B
2 100 B
2 33 B
2 33 B
2 33 B
2 0 c
2 0 c
2 0 c
...
...
24 0 E
Each Block|Con|Name combination has 3 replicates; how would I fill out the 'replicate' column with 1, 2, 3 going down the column?
desired output would be
Block Con Name replicate
1 100 A 1
1 100 A 2
1 100 A 3
1 33 B 1
1 33 B 2
1 33 B 3
1 0 c 1
1 0 c 2
1 0 c 3
2 100 A 1
2 100 A 2
2 100 A 3
2 100 B 1
2 100 B 2
2 100 B 3
2 33 B 1
2 33 B 2
2 33 B 3
2 0 c 1
2 0 c 2
2 0 c 3
...
...
24 0 E 3
Pseudocode would be:
for b in data.block:
    for c in data.con:
        for n in data.name:
            for each b|c|n combination:
                if the same:
                    assign '1' to data.replicate
                    assign '2' to data.replicate
                    assign '3' to data.replicate
I have searched online and have not found any solution, and I'm not sure which function to use for this.
That looks like a groupby cumcount:
In [11]: df["Replicate"] = df.groupby(["Block", "Con", "Name"]).cumcount() + 1
In [12]: df
Out[12]:
Block Con Name Replicate
0 1 100 A 1
1 1 100 A 2
2 1 100 A 3
3 1 33 B 1
4 1 33 B 2
5 1 33 B 3
6 1 0 c 1
7 1 0 c 2
8 1 0 c 3
9 2 100 A 1
10 2 100 A 2
11 2 100 A 3
12 2 100 B 1
13 2 100 B 2
14 2 100 B 3
15 2 33 B 1
16 2 33 B 2
17 2 33 B 3
18 2 0 c 1
19 2 0 c 2
20 2 0 c 3
cumcount enumerates the rows in each group (from 0).
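A quick illustration of cumcount on a toy frame:
In [13]: pd.DataFrame({'g': ['x', 'x', 'y']}).groupby('g').cumcount()
Out[13]:
0    0
1    1
2    0
dtype: int64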
You can use numpy.tile, provided the rows already come in complete, ordered triples:
import numpy as np
# integer division: len(data) must be an exact multiple of 3
replicate_arr = np.tile(['1', '2', '3'], len(data) // 3)
data['replicate'] = replicate_arr

Pivoting a table with hierarchical index

This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
import itertools
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(0, 3, 45),
              index=pd.MultiIndex.from_tuples(list(itertools.product('pqr', [0, 1, 2], 'abcde')),
                                              names=['Index1', 'Index2', 'Index3']),
              name='P')
s = s.map({0: 'A', 1: 'B', 2: 'C'})
So it looks like
Index1 Index2 Index3
p 0 a A
b A
c C
d B
e C
1 a B
b C
c C
d B
e B
q 0 a B
b C
c C
d C
e C
1 a A
b A
c B
d C
e A
I want to do a frequency count by value so that the output looks like
Index1 Index2 P
p 0 A 2
B 1
C 2
1 A 0
B 3
C 2
q 0 A 0
B 1
C 4
1 A 3
B 1
C 1
You can apply value_counts to the Series groupby:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1 Index2
p 0 C 2
A 2
B 1
1 B 3
A 2
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 B 2
C 2
A 1
2 C 3
B 1
A 1
r 0 A 3
B 1
C 1
1 B 3
C 2
2 B 3
C 1
A 1
dtype: int64
If you want to include the 0s (which the above won't), you could use crosstab (the keyword names are index and columns on current pandas, and plain counting needs no aggfunc):
In [21]: ct = pd.crosstab(index=[s.index.get_level_values(0), s.index.get_level_values(1)],
                          columns=s.values,
                          rownames=s.index.names[:2],
                          colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3 A B C
Index1 Index2
p 0 2 1 2
1 2 3 0
2 3 1 1
q 0 3 1 1
1 1 2 2
2 1 1 3
r 0 3 1 1
1 0 3 2
2 1 3 1
In [23]: ct.stack()
Out[23]:
Index1 Index2 Index3
p 0 A 2
B 1
C 2
1 A 2
B 3
C 0
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 A 1
B 2
C 2
2 A 1
B 1
C 3
r 0 A 3
B 1
C 1
1 A 0
B 3
C 2
2 A 1
B 3
C 1
dtype: int64
Which may be slightly faster...
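On recent pandas you can also get the zero-filled counts without crosstab by round-tripping through unstack; a minimal sketch, reusing s from above:
counts = (s.groupby(level=[0, 1])
            .value_counts()
            .unstack(fill_value=0)
            .stack())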
