Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values have fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding NumPy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is anyone aware of a more elegant NumPy-style solution to the above problem?

This solution uses pandas to perform a groupby:
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0

>>> np.where(
        s > threshold,
        s
        .to_frame()            # Convert series to dataframe.
        .assign(_dummy_=1)     # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)                     # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
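For illustration (my annotation, not part of the original answer), here is what the shift-cumsum group key evaluates to for this series:

key = (s.gt(threshold) != s.gt(threshold).shift()).cumsum()
# key: [1, 1, 2, 2, 2, 2, 3, 4, 4, 5]
# Each consecutive run gets its own group label; the runs above the
# threshold (groups 2 and 4) are cumsummed separately, and everything
# else is masked to zero by np.where.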

Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting part is taken care of by assigning appropriate (negative) values at the end of each interval, so that the cumulative sum resets the count at the end of each interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices of each interval where mask is True
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
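To see how the reset works, here is a hand trace on the question's sample input (my annotation; the intermediate values follow from the code above):

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
# mask (as int)                  : [0 0 1 1 1 1 0 1 1 0]
# s0 (starts), s1 (stops)        : [2 7], [6 9]
# out after stop-value assignment: [0 0 1 1 1 1 -4 1 1 -2]
# out.cumsum()                   : [0 0 1 2 3 4 0 1 2 0]
# The negative value written at each stop cancels exactly the run's
# accumulated sum, which is what resets the ramp to zero.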
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do a fair benchmark is to tile the sample posted in the question a large number of times and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop

Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
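For reference (my annotation, not from the original answer), the intermediate pieces evaluate as follows for the sample input:

# mask    : [F F T T T T F T T F]
# indices : [2 6 7 9]              (positions where the mask flips)
# splits  : [F F] [T T T T] [F] [T T] [F]
# cumsum of each split, concatenated:
# [0 0] [1 2 3 4] [0] [1 2] [0]  ->  [0 0 1 2 3 4 0 1 2 0]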

Related

Print all rotations of [1,2,3,4,5,6], e.g. (2,3,4,5,6,1) (3,4,5,6,1,2) (4,5,6,1,2,3) - I have to rotate each number using a for loop in Python

Code:
a = ['1','2','3','4','5','6']
for i in range(1,6):
    for j in range(i+1):
        for k in range(j+1):
            for l in range(k+1):
                for m in range(l+1):
                    for p in range(m+1):
                        print(i,j,k,l,m,p)
Output:
1 0 0 0 0 0
1 1 0 0 0 0
1 1 1 0 0 0
1 1 1 1 0 0
1 1 1 1 1 0
1 1 1 1 1 1
2 0 0 0 0 0
2 1 0 0 0 0
2 1 1 0 0 0
2 1 1 1 0 0
2 1 1 1 1 0
2 1 1 1 1 1
2 2 0 0 0 0
2 2 1 0 0 0
2 2 1 1 0 0
2 2 1 1 1 0
2 2 1 1 1 1
2 2 2 0 0 0
2 2 2 1 0 0
2 2 2 1 1 0
2 2 2 1 1 1
2 2 2 2 0 0
2 2 2 2 1 0
2 2 2 2 1 1
2 2 2 2 2 0
2 2 2 2 2 1
2 2 2 2 2 2
3 0 0 0 0 0
3 1 0 0 0 0
3 1 1 0 0 0
3 1 1 1 0 0
and so on....
This is the code I have tried, but I'm not getting the desired output. Can someone please explain? Thank you.
It would appear to me that you need to look at the examples very carefully, and do something with the use of the word 'rotate'.
A fairly simple solution:
def rotate(xs):
    for i in range(len(xs)):
        yield tuple(xs[i:] + xs[:i])

for result in rotate([1,2,3,4,5,6]):
    print(result)
Output:
(1, 2, 3, 4, 5, 6)
(2, 3, 4, 5, 6, 1)
(3, 4, 5, 6, 1, 2)
(4, 5, 6, 1, 2, 3)
(5, 6, 1, 2, 3, 4)
(6, 1, 2, 3, 4, 5)
Do you mean this?
# Create new list containing all "rotated" versions of lst
lst = list(range(7))
new_lists = [lst[-i:] + lst[:-i] for i in range(len(lst))]

# Print results
for l in new_lists:
    print(l)
Output:
[0, 1, 2, 3, 4, 5, 6]
[6, 0, 1, 2, 3, 4, 5]
[5, 6, 0, 1, 2, 3, 4]
[4, 5, 6, 0, 1, 2, 3]
[3, 4, 5, 6, 0, 1, 2]
[2, 3, 4, 5, 6, 0, 1]
[1, 2, 3, 4, 5, 6, 0]
Use the itertools library and import permutations. It is an easy method: it will find and print every possible ordering of your input (note that this gives all permutations, a superset of the rotations).
from itertools import permutations

a = [1,2,3,4,5,6]
perm = permutations(a)
for i in list(perm):
    print(i)
I got the desired output using for loops, hence I am posting this answer so it can help any of you struggling with the same question.
import numpy as np

count = 0
n = "kashmeen"
for i in range(0,8):
    for j in range(0,8):
        for k in range(0,8):
            for l in range(0,8):
                for m in range(0,8):
                    for o in range(0,8):
                        for p in range(0,8):
                            for q in range(len(n)):
                                new = [i, j, k, l, m, o, p, q]
                                new = np.array(new)
                                new_1 = np.unique(new)
                                if len(new_1) == 8:
                                    print(n[i],n[j],n[k],n[l],n[m],n[o],n[p],n[q])
                                    count += 1
It is clear that the intent of this exercise is for you to demonstrate knowledge of slicing operations. Using slices, this can be done in a single for-loop:
>>> n = "kashmeen"
>>> for index in range(len(n)):
...     print(n[index:] + n[:index])
which gives the output
kashmeen
ashmeenk
shmeenka
hmeenkas
meenkash
eenkashm
enkashme
nkashmee
This solution will work for a string of any length (not just 8) and will work equally well for lists or tuples.
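For instance, the same slice-based rotation applied directly to a list (my example):

xs = [1, 2, 3, 4, 5, 6]
for index in range(len(xs)):
    print(xs[index:] + xs[:index])
# [1, 2, 3, 4, 5, 6]
# [2, 3, 4, 5, 6, 1]
# ... and so on for the remaining rotations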

Pandas replace all but first in consecutive group

The problem description is simple, but I cannot figure out how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
Which replaces all values of 2 with 0, except for the first one. Is this possible?
You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)

df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0
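Since the question asked for a function foo(df, 2, 0), the same shift trick can be wrapped in such a helper (a sketch under the question's signature, with the column name 'A' hardcoded as in the example):

def foo(df, val, repl):
    out = df.copy()
    # Replace `val` wherever it repeats its immediate predecessor,
    # keeping the first element of each consecutive run.
    out.loc[(out.A == val) & (out.A == out.A.shift(1)), 'A'] = repl
    return out

foo(df, 2, 0)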
Try this if 'A' is duplicated further down the dataframe and is monotonically increasing:
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)

foo(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I'm not sure if this is the best way, but I came up with this solution; I hope it's helpful:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replecate(df, number, replacement):
    i = 1
    for column in df.columns:
        for index, value in enumerate(df[column]):
            if i == 1 and value == number:
                i = 0          # first of the run: keep it, mark as seen
            elif value == number and i != 1:
                df[column][index] = replacement
        i = 1                  # reset before the next column
    return df

replecate(df, 2, 0)
Output
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I've managed a solution to this problem by shifting the column down by one and checking whether the values align. Also included is a function which can take multiple values to check for (not just 2).
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df

df = replace_recurring(df, 'A', offset=1, values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

Pandas Mapping Numbers to another Number

I have ~5000 rows, and all values in my 'Round' column go from -1 to 7. I'm trying to create a new column where -1 maps to 0 and anything from 1-7 maps to 1. I tried a simple map and listed all the mappings, but this doesn't work.
combine['Drafted'] = combine.Round.map({'-1':0,'1':1,'2':1,'3':1,'4':1,'5':1,'6':1,'7':1})
Is there something wrong with the logic above that it wouldn't work?
I guess you can achieve it using the code below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Round': [-1, 1, 0, 7, -1, 2, 3, 5, -1, 4, 6]})
df['Drafted'] = np.where(df['Round'] == -1, 0, 1)
print(df)
And the output is as below:
Round Drafted
0 -1 0
1 1 1
2 0 1
3 7 1
4 -1 0
5 2 1
6 3 1
7 5 1
8 -1 0
9 4 1
10 6 1
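As for why the original attempt didn't work: the dictionary keys are strings ('-1', '1', ...) while the 'Round' column presumably holds integers, so no key matches and map returns NaN everywhere. With integer keys the map itself should also work (my sketch, assuming Round only contains -1 and 1-7 as the question states):

combine['Drafted'] = combine.Round.map({-1: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1})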

Ranking groups based on size

Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the largest cluster id with 0 and the second largest with 1 and so on and so forth. Output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.
The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:

1. Integer factorization: find an integer representation where each unique value in the column gets its own integer, starting from zero.
2. Get the counts of each of these unique values.
3. Rank the unique values by their counts.
4. Assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we did u[i] we get back the original df.cluster.values
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort but it's confusing. So I'll try to show it:
np.row_stack([c, (-c).argsort()])

array([[2, 3, 4, 2, 1, 1],
       [2, 1, 0, 3, 4, 5]])
What argsort does in general is place in the top spot (position 0) the position to draw from in the originating array.
# position 2
# is best
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# top spot
# from
# position 2
# position 1
# goes to
# pen-ultimate spot
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# pen-ultimate spot
# from
# position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)

df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count values. The reason to use this approach is that NumPy's unique actually sorts the values in the midst of factorizing and counting, while pandas.factorize does not. For larger data sets, big O is our friend, as this remains O(n) while the NumPy approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster').transform('count')\
                  .rank(ascending=False, method='dense')\
                  .sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
By using category and value_counts
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8
This isn't the cleanest solution, but it does work. Feel free to suggest improvements:

valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)

count = 0
original = df.cluster.copy()   # snapshot, so relabeled values don't collide
for i in valueCounts_sorted.index.values:
    temp = df[original == i]
    idx = temp.index.values
    df.loc[idx, "cluster"] = count
    count += 1

Encoding column labels in Pandas for machine learning

I am working on the car evaluation dataset for machine learning, and the dataset looks like this:
buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
I want to convert these strings to unique enumerated integers column-wise. I see that pandas.factorize() is the way to go, but it only works on one column. How do I factorize the dataframe in one go with one command?
I tried a lambda function, but it is not working:
df.apply(lambda c: pd.factorize(c), axis=1)
Output:
0 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4 ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...
I see the encoded values, but I can't pull them out of the above array.
Factorize returns a tuple of (values, labels). You'll just want the values in the DataFrame.
In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']
In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]:
buying maint lug_boot safety class
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 0 2 0
3 0 0 1 0 0
4 0 0 1 1 0
5 0 0 1 2 0
Then concat that to the numeric data.
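For instance (my sketch, assuming df holds the CSV above, where 'doors' and 'persons' are the numeric columns):

encoded = df[cols].apply(lambda x: pd.factorize(x)[0])
pd.concat([encoded, df[['doors', 'persons']]], axis=1)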
A word of warning though: this implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies:
In [37]: dummies = []
In [38]: for col in cols:
....: dummies.append(pd.get_dummies(df[col]))
....:
In [39]: pd.concat(dummies, axis=1)
Out[39]:
vhigh vhigh med small high low med unacc
0 1 1 0 1 0 1 0 1
1 1 1 0 1 0 0 1 1
2 1 1 0 1 1 0 0 1
3 1 1 1 0 0 1 0 1
4 1 1 1 0 0 0 1 1
5 1 1 1 0 1 0 0 1
get_dummies has some optional parameters to control the naming, which you'll probably want.
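For example, the prefix parameter keeps the dummy column names unambiguous:

dummies = [pd.get_dummies(df[col], prefix=col) for col in cols]
pd.concat(dummies, axis=1)
# columns become e.g. 'buying_vhigh', 'safety_low', 'class_unacc', ...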
