convert values in a series to either one of two values - python

I have a series y which has values between -3 and 3.
I want to convert numbers that are above 0 to 1 and numbers that are less than or equal to zero to 0.
What is the best way to do this?
I wrote the code below, but it doesn't give me the expected output. The first line works, but after running the second line the values that were 1 change to something that looks random, which I don't understand.
import numpy as np
y_final = np.where(y > 0, 1, y).tolist()
y_final = np.where(y <= 0, 0, y).tolist()

I think you need Series.clip if values are integers:
import pandas as pd
y = pd.Series(range(-3, 4))
print (y)
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
print (y.clip(lower=0, upper=1))
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int64
Your solution can be simplified by setting 1 and 0 directly:
y_final = np.where(y > 0, 1, 0)
print (y_final)
[0 0 0 0 1 1 1]
Or convert the boolean mask for values greater than 0 to integers:
y_final = y.gt(0).astype(int)
#alternative
#y_final = (y > 0).astype(int)
print (y_final)
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int32

You can also use a simple map:
numbers = range(-3,4)
print(list(map(lambda n: 1 if n > 0 else 0, numbers)))
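As for why the original two-line attempt misbehaves: the second np.where call reads from y again rather than from the result of the first call, so the positive entries fall back to their original values, which can look random. A minimal sketch of this, using a made-up float series as sample data (the question only says the values lie between -3 and 3):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question.
y = pd.Series([-3.0, -1.5, 0.0, 0.5, 2.0, 3.0])

first = np.where(y > 0, 1, y)                  # positives -> 1, the rest unchanged
second_wrong = np.where(y <= 0, 0, y)          # starts again from y, so positives keep their old values
second_fixed = np.where(first <= 0, 0, first)  # continues from the first result instead

print(second_wrong.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0, 3.0]
print(second_fixed.tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Chaining on the first result gives the intended values, although the single-call np.where(y > 0, 1, 0) shown above is simpler.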

Related

Calculating probability of consecutive events with python pandas

Given a dataframe, how do I calculate the probability of consecutive events using python pandas?
For example,
Time   A   B   C
   1   1   1   1
   2  -1  -1  -1
   3   1   1   1
   4  -1  -1  -1
   5   1   1   1
   6  -1  -1  -1
   7   1   1   1
   8  -1   1   1
   9   1  -1   1
  10  -1   1  -1
In this dataframe, B has two consecutive "1" values at t=7 and t=8, and C has three consecutive "1" values from t=7 to t=9.
The probability that two consecutive "1" values appear is 3/27.
The probability that three consecutive "1" values appear is 1/24.
How can I do this using python pandas?
Try this code. (It can be used with other dataframes, i.e. with more columns or rows.)
import pandas as pd

def consecutive(num):
    # sample data from the question; replace it with your own dataframe
    df = pd.DataFrame({
        'Time': [i for i in range(1, 11)],
        'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
        'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
        'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1]
    })
    print(df)
    row_num = df.shape[0]
    col_num = df.shape[1]
    cnt = 0  # the number of consecutive runs found
    for col_index in range(1, col_num):  # count in each column, skipping 'Time'
        col_tmp = df.iloc[:, col_index]
        consec = 0
        for i in range(row_num):
            if col_tmp[i] == 1:
                consec += 1
            # if -1 comes after 1, reset consec to 0
            else:
                consec = 0
            # after counting a run of length num, subtract 1 from consec
            # so that overlapping runs are counted as well
            if consec == num:
                cnt += 1
                consec -= 1
    all_cases = (row_num - num + 1) * (col_num - 1)  # col_num - 1 because of the 'Time' column
    prob = cnt / all_cases
    return prob
When you execute it on the given dataframe with this code
print(f'two consecutive : {consecutive(2)}')
print(f'three consecutive : {consecutive(3)}')
Output :
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
two consecutive : 0.1111111111111111
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
three consecutive : 0.041666666666666664
You can compare rows with previous rows using shift. So, to find out how often two consecutive values are equal, you can do
>>> (df.C == df.C.shift()).sum()
2
To find three consecutive equal values, you'd have to compare the column with itself shifted by 1 (the default) and additionally, shifted by 2.
>>> ((df.C == df.C.shift()) & (df.C == df.C.shift(2))).sum()
1
Another variation of this using the pd.Series.eq function instead of the == is:
>>> m = df.C.eq(df.C.shift(1)) & df.C.eq(df.C.shift(2))
>>> m.sum()
1
In this case, since the target value is 1 (and True == 1 is True; it won't work for other target values as is, see below), the pattern can be generalized with functools.reduce to:
from functools import reduce
def combos(column, n):
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])
You can apply this function to df like so, which will give you the numerator:
>>> df[['A', 'B', 'C']].apply(combos, n = 2).values.sum()
3
>>> df[['A', 'B', 'C']].apply(combos, n = 3).values.sum()
1
To get the denominator, you can do, e.g.,
n = 2
rows, cols = df[['A', 'B', 'C']].shape
denominator = (rows - n + 1) * cols
An idea for a generalized version of the combos function that should work with other target values is
from operator import and_ # equivalent of &
def combos_generalized(col, n):
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])
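To put numerator and denominator together in one place, here is a minimal sketch that swaps the shift/reduce pattern for a rolling sum: it counts the length-n windows made up entirely of the target value and divides by the number of possible windows. The function name run_probability and its target parameter are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': range(1, 11),
    'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})

def run_probability(frame, n, target=1):
    """Share of length-n windows (per column) that consist only of `target`."""
    hits = frame.eq(target).astype(int)
    # a window ending at a given row is a full run when all n of its entries are hits
    full_windows = hits.rolling(n).sum().eq(n)
    rows, cols = frame.shape
    return int(full_windows.to_numpy().sum()) / ((rows - n + 1) * cols)

print(run_probability(df[['A', 'B', 'C']], 2))  # 3/27
print(run_probability(df[['A', 'B', 'C']], 3))  # 1/24
```

Unlike comparing shifted values for equality, this only counts runs of the target value, so a pair such as -1, -1 would not be counted.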

Count only first occurrence of each sequence python

I have some acceleration data, and I have set up a new column that gives a 1 if the value in the accelpos column is >= 2.5, using the following code
frame["new3"] = np.where((frame.accelpos >=2.5), '1', '0')
I end up getting data in sequences like so
0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0
I want to add a second column to give a 1 just at the start of each sequence as follows
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Any help would be much appreciated.
You can compare each value with the previous one using Series.shift and keep only the rows equal to '1': chain both conditions with & (bitwise AND), then cast to integers to map True/False to 1/0:
df = pd.DataFrame({'col':'0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0'.split(',')})
df['new'] = (df['col'].ne(df['col'].shift()) & df['col'].eq('1')).astype(int)
Or test the difference; because the first value may itself be a 1, the missing value that diff produces there has to be replaced by the original value with fillna:
s = df['col'].astype(int)
df['new'] = s.diff().fillna(s).eq(1).astype(int)
print (df)
col new
0 0 0
1 0 0
2 0 0
3 0 0
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 0
10 0 0
11 0 0
12 1 1
13 1 0
14 0 0
15 0 0
16 0 0
17 1 1
18 1 0
19 1 0
20 1 0
21 1 0
22 1 0
23 1 0
24 1 0
25 1 0
26 1 0
27 0 0
28 0 0
29 0 0
30 0 0
I am not familiar with the where function, so I will try to help from an algorithmic point of view.
Assume we have a list a = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, ..., 0]
From an algorithmic point of view, if you want to replace each run of 1s with a single 1 at the beginning of that run, here is what you want to do:
walk through the list
check whether the current item is a one or a zero
if it is a one, set each following item to 0 until you reach an item that is already a zero
You might want something like this:
a = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
for i in range(len(a)-1):
    if a[i] == 1:
        for j in range(1, len(a)-i):
            if a[i+j] == 1:
                a[i+j] = 0
            else:
                break
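The same steps can also be sketched as a single pass that compares each item with its predecessor and builds a new list instead of modifying a in place. The function name mark_run_starts is made up for this illustration.

```python
def mark_run_starts(values):
    """Return a list with 1 only where a run of 1s begins, 0 elsewhere."""
    result = []
    previous = 0  # treat the position before the list as a zero
    for value in values:
        # a run starts when the current item is 1 and the previous one was not
        result.append(1 if value == 1 and previous != 1 else 0)
        previous = value
    return result

a = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
print(mark_run_starts(a))  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```

This visits every item once, whereas the nested loops above rescan each run.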

why condition for my dataframe is not working?

Here is the code:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'0':[1,0,11,0],'1':[0,11,4,0]})
print(df1.head(5))
df2 = df1.copy()
columns=list(df2.columns)
print(columns)
for i in columns:
    idx1 = np.where((df2[i]>0) & (df2[i] < 10))
    df2.loc[idx1] = 1
    idx3 = np.where(df2[i] == 0)
    df2.loc[idx3] = 0
    idx2 = np.where(df2[i] > 10)
    df2.loc[idx2] = 0
print(df2.head(5))
output:
    0   1
0   1   0
1   0  11
2  11   4
3   0   0
['0', '1']
   0  1
0  1  1
1  0  0
2  0  0
3  0  0
The part that concerns me is:
idx1 = np.where((df2[i]>0) & (df2[i] < 10))
df2.loc[idx1] = 1
Why isn't this logic working?
According to this logic, this is what my output should be:
expected:
   0  1
0  1  1
1  0  0
2  0  1
3  0  0
This can be done much more simply. You can operate directly on the dataframe as a whole; there is no need to cycle through the columns individually.
Also, you don't need numpy.where to grab indices; you can use the boolean dataframe from the selection directly.
sel = (df2 > 0) & (df2 < 10)
df2[sel] = 1
df2[df2 == 0] = 0
df2[df2 > 10] = 0
(The first line is only to make the second line not overly complicated to the eye.)
Given your conditions, however, the result is
   0  1
0  1  0
1  0  0
2  0  1
3  0  0
because you only set numbers between 0 and 10 (exclusive) to 1. A number like 11 is set to 0, while your expected output somehow shows 1 for entries with 11. And 0 is also set to 0, not to 1 (the latter shows in your expected output).
Your expected output does not seem to align with your logic. It looks like anything between 0 and 10 (exclusive) should be 1 and everything else should be 0.
If so, try this:
df2 = pd.DataFrame(np.where((0 < df1) & (df1 < 10), 1, 0))
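If you prefer to keep the column loop, the original attempt seems to go wrong for two reasons: judging by the printed output, df2.loc[idx1] = 1 selects whole rows, so every column in those rows is overwritten, and the later conditions are then evaluated on values that were already modified. A minimal sketch that names the column in .loc and builds the mask from the untouched df1, assuming the intended rule is "1 if strictly between 0 and 10, else 0":

```python
import pandas as pd

df1 = pd.DataFrame({'0': [1, 0, 11, 0], '1': [0, 11, 4, 0]})
df2 = df1.copy()

for col in df2.columns:
    # build the mask from the original values so earlier assignments cannot interfere
    in_range = (df1[col] > 0) & (df1[col] < 10)
    df2.loc[in_range, col] = 1   # only this column, only the matching rows
    df2.loc[~in_range, col] = 0

print(df2)
```

With the sample data this gives 1, 0, 0, 0 in column '0' and 0, 0, 1, 0 in column '1', the same values as the whole-dataframe approach above.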

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example the result should look like :
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Resetting the count
Since I realize your example is ambiguous, here is an additional method, using groupby, in case you want to reset the counts for each increasing/decreasing group.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
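The resetting variant can also be sketched as one small helper that starts a new group at every False and takes a cumulative sum inside each group. The helper name streak is made up for this illustration, and it is only one possible way to express the same idea.

```python
import pandas as pd

def streak(mask):
    """Length of the current run of True values; 0 where mask is False."""
    run_id = (~mask).cumsum()  # a new group id starts at every False
    return mask.astype(int).groupby(run_id).cumsum()

col = pd.Series([1, 2, 3, 2, 1, 2, 3, 1])
diff = col.diff()
inc = streak(diff.gt(0))
dec = streak(diff.lt(0))

print(inc.tolist())  # [0, 1, 2, 0, 0, 1, 2, 0]
print(dec.tolist())  # [0, 0, 0, 1, 2, 0, 0, 1]
```

For the sample column these values match the inc2 and dec2 columns in the output above.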
data = {
'input': [1,2,3,2,1]
}
df = pd.DataFrame(data)
diffs = df['input'].diff()
df['a'] = (df['input'] > df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] > df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] > df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
df['b'] = (df['input'] < df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] < df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] < df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Coding this manually using numpy might look like this
import numpy as np
input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)  # integer counters instead of the default floats
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0
increase # array([0, 1, 2, 0, 0])
decrease # array([0, 0, 0, 1, 2])

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to mark the positions that fall within the first 2 elements of each consecutive run (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook, the section on grouping: "Grouping like Python's itertools.groupby"
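To make the limiting value a parameter rather than the hard-coded <= 1, the same combination can be wrapped in a small function. This is only a sketch; the name cap_runs and its signature are made up for this illustration.

```python
import pandas as pd

def cap_runs(series, limit):
    """Keep at most `limit` leading 1s in each consecutive run of 1s."""
    run_id = series.ne(series.shift()).cumsum()    # new id whenever the value changes
    position = series.groupby(run_id).cumcount()   # 0-based position inside each run
    return (series.eq(1) & position.lt(limit)).astype(int)

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = cap_runs(df['A'], 2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```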
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy
a = df['A'].to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the input is left unmodified
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
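To see what the convolution step computes, here is a small self-contained check on the column from the question. The variable names counts and trimmed are only for illustration.

```python
import numpy as np

a = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
cutoff = 2

# number of 1s among each element and the `cutoff` elements before it
counts = np.convolve(a, np.ones(cutoff + 1, dtype=int))[:-cutoff]
print(counts.tolist())   # [1, 2, 3, 2, 2, 2, 3, 3, 2, 2]

# positions where the count exceeds the cutoff lie beyond the allowed run length
trimmed = np.where(counts > cutoff, 0, a)
print(trimmed.tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

The positions with a count of 3 are exactly the ones that get zeroed, which reproduces column B.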
