simple mapping of a pandas Series to 0s and 1s given a threshold - python

I am sorry for asking such a simple question (yes, I googled). Do I really need two steps to map a pandas Series of floats between 0 and 1 to 0s and 1s given a threshold? Here is a reproducible example:
import pandas as pd

series = pd.Series([0.0, 0.3, 0.6, 1.0])
threshold = 0.5
print(series)
series[series > threshold] = 1.0
series[series <= threshold] = 0.0
print(series)
It works, producing:
0 0.0
1 0.0
2 1.0
3 1.0
from:
0 0.0
1 0.3
2 0.6
3 1.0

You can use the > operator and cast the resulting boolean Series to int:
series = (series > threshold).astype(int)
print(series)
Output:
0 0
1 0
2 1
3 1
dtype: int32
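(The int32 dtype shown is platform dependent; you may see int64.) If you want to keep the question's float values 0.0/1.0, casting to float works the same way; a minimal sketch:

import pandas as pd

series = pd.Series([0.0, 0.3, 0.6, 1.0])
threshold = 0.5
# Booleans cast to float: True -> 1.0, False -> 0.0
print((series > threshold).astype(float))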

You could also apply a function to every element using map(), like:
series = series.map(lambda x: 1.0 if x > threshold else 0.0)

I'd use numpy.where:
np.where(series > threshold, 1, 0)
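Note that np.where returns a plain NumPy array rather than a Series; if you need a Series with the original index, you can re-wrap it yourself (a sketch):

import numpy as np
import pandas as pd

series = pd.Series([0.0, 0.3, 0.6, 1.0])
threshold = 0.5
# Re-wrap the ndarray so the result keeps the original index
result = pd.Series(np.where(series > threshold, 1, 0), index=series.index)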


Pandas column that depends on its previous value (row)?

I would like to create a third column in my dataframe that depends on both the new and existing columns in the previous row.
This new column should start at 0.
If its previous value was 0, its next value is its previous value plus df.below_lo[i]; if its previous value was 1, its next value is its previous value plus df.above_hi[i].
I think I have two issues: how to initialize this third column and how to make it depend on itself.
import pandas as pd
import math

data = {'below_lo': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        'above_hi': [0, 0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df['pos'] = math.nan
df['pos'][0] = 0
for i in range(len(df.below_lo)):
    if df.pos[i] == 0:
        df.pos[i+1] = df.pos[i] + df.below_lo[i]
    if df.pos[i] == 1:
        df.pos[i+1] = df.pos[i] + df.above_hi[i]
print(df)
The desired output is:
below_lo above_hi pos
0 0.0 0.0 0.0
1 1.0 0.0 0.0
2 0.0 -1.0 1.0
3 0.0 0.0 0.0
4 0.0 -1.0 0.0
5 0.0 0.0 0.0
6 0.0 -1.0 0.0
7 0.0 0.0 0.0
8 0.0 0.0 0.0
9 1.0 0.0 0.0
10 0.0 0.0 1.0
11 0.0 0.0 1.0
12 0.0 0.0 1.0
13 NaN NaN 1.0
The above code produces the correct output, except I am also getting a few of these warnings:
A value is trying to be set on a copy of a slice from a DataFrame
How do I clean this code up so that it runs without throwing this warning?
Use .loc, which assigns into the original DataFrame in a single indexing step instead of the chained indexing (df.pos[i+1] = ...) that triggers the warning:
df.loc[0, 'pos'] = 0
for i in range(len(df.below_lo)):
    if df.loc[i, 'pos'] == 0:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'below_lo']
    if df.loc[i, 'pos'] == 1:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'above_hi']
I appreciate that there is already an accepted, and perfectly good, answer by @Michael O., but if you dislike iterating over rows as not quite Pandas-esque, here is a solution without an explicit loop over rows:
from functools import reduce

res = reduce(lambda d, _:
             d.fillna({'pos': d['pos'].shift(1)
                       + (d['pos'].shift(1) == 0) * d['below_lo']
                       + (d['pos'].shift(1) == 1) * d['above_hi']}),
             range(len(df)), df)
res
produces
below_lo above_hi pos
-- ---------- ---------- -----
0 0 0 0
1 1 0 1
2 0 -1 0
3 0 0 0
4 0 -1 0
5 0 0 0
6 0 -1 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 1
11 0 0 1
12 0 0 1
It is, admittedly, somewhat less efficient, and the syntax is a bit more obscure. But it can be written on a single line (even if I split it over a few for clarity)!
The idea is that we can use the fillna(..) function, passing it a value calculated from the previous value of 'pos' (hence shift(1)) and the current values of 'below_lo' and 'above_hi'. The extra complication is that this operation only fills a NaN with a non-NaN in the row just below one that already holds a non-NaN value. Hence we need to apply the function repeatedly until all NaNs are filled, and this is where reduce comes into play.
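To make the mechanics visible without reduce, here is a single pass of that fill step written out explicitly (a sketch; each pass fills at most one more row from the top, so it must be repeated until no NaNs remain):

# One pass of the fill step: compute the candidate next value for every row
prev = df['pos'].shift(1)
step = prev + (prev == 0) * df['below_lo'] + (prev == 1) * df['above_hi']
# Only rows where 'pos' is still NaN get filled; a NaN in prev keeps them NaN
df = df.fillna({'pos': step})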

Pandas conditional map/fill/replace

import pandas as pd

d1 = pd.DataFrame({'x': ['a', 'b', 'c', 'c'], 'y': [-1, -2, -3, 0]})
d2 = pd.DataFrame({'x': ['d', 'c', 'a', 'b'], 'y': [0.1, 0.2, 0.3, 0.4]})
I want to replace d1.y where y < 0 with the corresponding y in d2, matched by x. It's something like VLOOKUP in Excel. The core problem is replacing y according to x rather than simply manipulating y. What I want is:
Out[40]:
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
Use Series.map with a condition:
s = d2.set_index('x')['y']
d1.loc[d1.y < 0, 'y'] = d1['x'].map(s)
print(d1)
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
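If you prefer an expression over the in-place .loc assignment, Series.mask does the same thing (a sketch reusing the lookup Series s from above):

# Where y < 0, take the value looked up from d2 via x; elsewhere keep y
d1['y'] = d1['y'].mask(d1['y'] < 0, d1['x'].map(s))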
You can also try this, but note that it aligns the two frames by index position rather than matching on x, so it only produces the desired result when the rows of d2 happen to line up with those of d1:
d1.loc[d1.y < 0, 'y'] = d2.loc[d1.y < 0, 'y']

Reclassification by column name in pandas

I would like to apply a test to a pandas dataframe, and create flags in a corresponding dataframe based on the test results. I've gotten this far:
import numpy as np
import pandas as pd
matrix = pd.DataFrame({'a': [1, 11, 2, 3, 4], 'b': [5, 6, 22, 8, 9]})
flags = pd.DataFrame(np.zeros(matrix.shape), columns=matrix.columns)
flag_values = pd.Series({"a": 100, "b": 200})
flags[matrix > 10] = flag_values
but this raises the error
ValueError: Must specify axis=0 or 1
Where can I specify the axis in this situation? Is there a better way to accomplish this?
Edit:
The result I'm looking for in this example for "flags" is
a b
0 0
100 0
0 200
0 0
0 0
You could define flags = (matrix > 10) * flag_values:
In [35]: (matrix > 10) * flag_values
Out[35]:
a b
0 0 0
1 100 0
2 0 200
3 0 0
4 0 0
This relies on True having numeric value 1 and False having numeric value 0.
It also relies on Pandas' nifty automatic alignment of DataFrames (and Series) based on labels before performing arithmetic operations.
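You can see both mechanisms by printing the intermediate boolean frame (a sketch):

# The boolean mask shares column labels with flag_values,
# and True/False behave as 1/0 in the multiplication
mask = matrix > 10
print(mask)
#        a      b
# 0  False  False
# 1   True  False
# 2  False   True
# 3  False  False
# 4  False  False
flags = mask * flag_values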
Use mask with mul:
flags.mask(matrix > 10, 1).mul(flag_values, axis=1)
Out[566]:
a b
0 0.0 0.0
1 100.0 0.0
2 0.0 200.0
3 0.0 0.0
4 0.0 0.0

pandas replace contents of multiple columns at a time for multiple conditions

I have a df as follows:
CHROM POS SRR4216489 SRR4216675 SRR4216480
0 1 127536 ./. ./. ./.
1 1 127573 ./. 0/1:0,5:5:0:112,1,10 ./.
2 1 135032 ./. 1/1:13,0:13:3240:0,30,361 0/0:13,0:13:3240:0,30,361
3 1 135208 ./. 0/0:5,0:5:3240:0,20,160 0/1:5,0:5:3240:0,20,160
4 1 138558 1/1:5,0:5:3240:0,29,177 0/0:0,5:5:0:112,1,10 ./.
I would like to replace the contents of the sample columns depending on certain conditions. The sample columns are SRR4216489, SRR4216675, and SRR4216480. I am looking to replace './.' with 0.5, anything starting with 0/0 with 0.0, and anything starting with 0/1 or 1/1 with 1.0. I appreciate that this involves several steps, most of which I can do independently, but I don't know the syntax to tie them together. For example, I could do this for sample SRR4216675:
df['SRR4216675'][df.SRR4216675 == './.'] = 0.5
This works well, courtesy of here, but I'm not sure how to apply it to all of the sample columns simultaneously. I thought to use a loop:
sample_cols = df.columns[2:]
for s in sample_cols:
    df[s][df.s == './.'] = 0.5
but this firstly doesn't seem very idiomatic, and it also fails because df.s looks up a column literally named 's' rather than the string held by the loop variable.
The next challenge is how to parse the variable strings that populate the other parts of the sample columns. I have tried using the split function:
df=df['SRR4216675'][df.SRR4216675.split(':') == '0/0' ] = 0.0
but I get:
TypeError: 'float' object is not subscriptable
I am sure that a good way to solve this would be using a lambda such as this, but being new to pandas and lambdas I'm finding it tricky. I got to here:
col=df['SRR4216675'][df.SRR4216675.apply(lambda x: x.split(':')[0])]
which looks like it's almost there, but it needs further processing to replace the value, and it also looks like it has two columns and won't let me reintegrate it into the existing df:
SRR4216675
./. NaN
0/1 NaN
1/1 NaN
0/0 NaN
0/0 NaN
df['SRR4216675'] = col
ValueError: cannot reindex from a duplicate axis
I appreciate that this is several problems in one, but I am new to pandas and would really like to get to grips with it. I could solve these problems using basic loops and Python's standard list, iteration, and string-parsing functions, but at scale that would be really slow, as my full-size df is millions of lines long and contains over 500 sample columns.
You can do this by using df.apply and defining a function, like this:
In [10]: cols = ('SRR4216675', 'SRR4216480', 'SRR4216489')
In [11]: def replace_vals(row):
    ...:     for col in cols:
    ...:         if row[col] == './.':
    ...:             row[col] = 0.5
    ...:         elif row[col].startswith('0/0'):
    ...:             row[col] = 0
    ...:         elif row[col].startswith('0/1') or row[col].startswith('1/1'):
    ...:             row[col] = 1
    ...:     return row
    ...:
    ...:
In [12]: df.apply(replace_vals, axis=1)
Out[12]:
CHROM POS SRR4216480 SRR4216489 SRR4216675
0 1 127536 0.5 0.5 0.5
1 1 127573 0.5 0.5 1.0
2 1 135032 0.0 0.5 1.0
3 1 135208 1.0 0.5 0.0
4 1 138558 0.5 1.0 0.0
And here's a faster way to do this:
First, let's create a larger data frame so that we can meaningfully measure differences in time, and let's import a timer so that we can measure.
In [70]: from timeit import default_timer as timer
In [71]: long_df = pd.DataFrame()
In [72]: for i in range(10000):
    ...:     long_df = pd.concat([long_df, df])
Using the function we defined above, we get:
In [76]: start = timer(); long_df.apply(replace_vals, axis=1); end = timer()
In [77]: end - start
Out[77]: 8.662535898998613
Now, we define a new function (to make timing easy) in which we loop over the columns and apply the same replacement logic as above, except we do it using the vectorized str.startswith method on each column:
In [78]: def modify_vectorized():
    ...:     start = timer()
    ...:     for col in cols:
    ...:         long_df.loc[long_df[col] == './.', col] = 0.5
    ...:         long_df.loc[long_df[col].str.startswith('0/0', na=False), col] = 0
    ...:         long_df.loc[long_df[col].str.startswith('0/1', na=False), col] = 1
    ...:         long_df.loc[long_df[col].str.startswith('1/1', na=False), col] = 1
    ...:     end = timer()
    ...:     return end - start
We recreate the large dataframe and we run the new function on it, getting a significant speedup:
In [79]: long_df = pd.DataFrame()
In [80]: for i in range(10000):
    ...:     long_df = pd.concat([long_df, df])
    ...:
In [81]: time_elapsed = modify_vectorized()
In [82]: time_elapsed
Out[82]: 0.44004046998452395
The resulting dataframe looks like this:
In [83]: long_df
Out[83]:
CHROM POS SRR4216480 SRR4216489 SRR4216675
0 1 127536 0.5 0.5 0.5
1 1 127573 0.5 0.5 1
2 1 135032 0 0.5 1
3 1 135208 1 0.5 0
4 1 138558 0.5 1 0
0 1 127536 0.5 0.5 0.5
1 1 127573 0.5 0.5 1
2 1 135032 0 0.5 1
3 1 135208 1 0.5 0
4 1 138558 0.5 1 0
0 1 127536 0.5 0.5 0.5
1 1 127573 0.5 0.5 1
2 1 135032 0 0.5 1
3 1 135208 1 0.5 0
4 1 138558 0.5 1 0
0 1 127536 0.5 0.5 0.5
...
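As an aside, growing long_df with pd.concat inside a loop copies the accumulated frame on every iteration; collecting the pieces first and concatenating once is the usual pattern (a sketch):

# Build the large frame in one concat instead of 10000 incremental ones
long_df = pd.concat([df] * 10000)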

constrain a series or array to a range of values

I have a series of values that I want to constrain to lie between -1 and +1.
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000))
I know I can use apply, but is there a simple vectorized approach?
s_ = s.apply(lambda x: min(max(x, -1), 1))
s_.head()
0 -0.256117
1 0.879797
2 1.000000
3 -0.711397
4 -0.400339
dtype: float64
Use clip:
s = s.clip(-1, 1)
Example Input:
s = pd.Series([-1.2, -0.5, 1, 1.1])
0 -1.2
1 -0.5
2 1.0
3 1.1
Example Output:
0 -1.0
1 -0.5
2 1.0
3 1.0
You can use the between Series method:
In [11]: s[s.between(-1, 1)]
Out[11]:
0 -0.256117
1 0.879797
3 -0.711397
4 -0.400339
5 0.667196
...
Note: This discards the values outside of the between range.
Use nested np.where:
pd.Series(np.where(s < -1, -1, np.where(s > 1, 1, s)))
One more suggestion:
s[s<-1] = -1
s[s>1] = 1
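If you only need to bound one side, clip accepts lower and upper separately (a sketch):

# Clip only the lower bound; values above 1 are left as-is
s_low = s.clip(lower=-1)
# Or cap only the upper bound
s_high = s.clip(upper=1)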
