Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm until the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please give me some advice on whether there is any possibility to do this without a loop?
IIUC, this is easily done with a rolling.max:
N = 3  # window size
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
The incomplete windows at the end produce NaN, which the version above maps to -1; to keep the last items as-is instead:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]),
                     1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
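For intuition, it can help to print the helper series m for the sample data (a quick check of my own, assuming the window size N = 3): m[i] holds the maximum of df[0][i:i+N] and is NaN where a full window no longer fits.
N = 3
m = df[0].rolling(N).max().shift(1 - N)
print(m.tolist())
# [3.0, 3.0, 5.0, 5.0, 5.0, 6.0, 7.0, nan, nan]
# Comparing m element-wise with check_df[0] = [3, 2, 5, 4, 3, 6, 4, 2, 1]
# yields the 1/-1 pattern above; the trailing NaNs mark the rows kept as-is.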
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater" else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7
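As a side note, on NumPy 1.20+ the raw as_strided call can be swapped for the safer sliding_window_view; a minimal sketch of the "smaller" mask under that assumption:
from numpy.lib.stride_tricks import sliding_window_view

window = 3
windows = sliding_window_view(df[0].values, window)[:-1]      # one window per compared row
mask = check_df[0].values[:-window].reshape(-1, 1) < windows  # "smaller" comparison
print(mask.any(axis=1))  # should match roller.rolling_window_mask("smaller")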
The problem description is simple, but I cannot figure out how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
This replaces all values of 2 with 0, except for the first one. Is this possible?
You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0
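If you want this wrapped in the foo(df, 2, 0) form from the question, a minimal sketch (my wording, column 'A' assumed) could be:
def foo(df, val, repl):
    # Replace consecutive repeats of `val` (all but the first of each run) with `repl`.
    out = df.copy()
    out.loc[(out.A == val) & (out.A == out.A.shift(1)), 'A'] = repl
    return out

foo(df, 2, 0)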
Try this if 'A' can be duplicated further down the dataframe and is monotonically increasing:
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)
foo(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I'm not sure if this is the best way, but I came up with this solution; I hope it's helpful:
import pandas as pd
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)
def replecate(df, number, replacement):
    i = 1
    for column in df.columns:
        for index, value in enumerate(df[column]):
            if i == 1 and value == number:
                i = 0
            elif value == number and i != 1:
                df.loc[index, column] = replacement  # .loc avoids chained-assignment issues
            else:
                i = 1
    return df
replecate(df, 2, 0)
Output
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I've managed a solution to this problem by shifting the rows down by one and checking whether the values align. I've also included a function that can take multiple values to check for (not just 2).
import pandas as pd
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)
def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df
df = replace_recurring(df,'A',offset=1,values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
I have a list of zeros and ones, and I want to print them in two different columns with headings and index numbers, something like this:
list = [1,0,1,1,1,0,1,0,1,0,0]
ones   zeros
1 1    2 0
3 1    6 0
4 1    8 0
5 1    10 0
7 1    11 0
9 1
This is the desired output.
I tried this:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    if ele==1:
        print(index,ele,end=" ")
    elif ele==0:
        print(" ")
        print(index,ele,end=" ")
    else:
        print()
But this gives output like this:
ones zeros
1 1
2 0 3 1 4 1 5 1
6 0 7 1
8 0 9 1
10 0
11 0
How do I get the desired output?
Any help is appreciated.
You can use itertools.zip_longest, str.ljust, f-strings (for formatting), and some calculations for the printing part, and use two lists to hold the indices of both zeros and ones:
l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
ones, zeros = [], []
max_len_zeros = max_len_ones = 0
for index, num in enumerate(l, 1):
    if num == 0:
        zeros.append(index)
        max_len_zeros = max(max_len_zeros, len(str(index)))
    else:
        ones.append(index)
        max_len_ones = max(max_len_ones, len(str(index)))

from itertools import zip_longest

print('ones' + ' ' * (max_len_ones + 2) + 'zeros')
for ones_index, zeros_index in zip_longest(ones, zeros, fillvalue=''):
    one = '1' if ones_index else ' '
    this_one_index = str(ones_index).ljust(max_len_ones)
    zero = '0' if zeros_index else ''
    this_zero_index = str(zeros_index).ljust(max_len_zeros)
    print(f'{this_one_index} {one} {this_zero_index} {zero}')
Output:
ones   zeros
1 1    2 0
3 1    6 0
4 1    8 0
5 1    10 0
7 1    11 0
9 1
List with more zeros than ones:
In: l = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
Out:
ones    zeros
1 1     2 0
4 1     3 0
7 1     5 0
9 1     6 0
10 1    8 0
14 1    11 0
        12 0
        13 0
        15 0
List with equal number of zeros and ones:
In: l = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
Out:
ones    zeros
1 1     2 0
3 1     4 0
5 1     6 0
8 1     7 0
9 1     10 0
11 1    13 0
12 1    14 0
15 1    16 0
18 1    17 0
20 1    19 0
It's hard to do what you need in an iterative way. I have kind of a "broken" solution that both shows how you could better do what you are trying to do and why an iterative approach is limited in this case.
I updated your code as following:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    # First check if extra space OR new lines OR both are needed
    if index > 1:
        if ele==1:
            print()
        elif ele==0:
            if list[index-2]==1:
                print('', end=' \t')
            else:
                print('', end='\n\t\t')
    # THEN, write your desired output without any end
    if ele==1:
        print(index,ele,end="")
    elif ele==0:
        print(index,ele,end="")
# Finally an empty line
print()
It gives the following output:
ones   zeros
1 1    2 0
3 1
4 1
5 1    6 0
7 1    8 0
9 1    10 0
       11 0
As you can see, its limitation is that you can't go "up" and rewrite old lines.
However, if you need the display EXACTLY as you've shown, you need to construct an intermediate data structure (for example a dict) and then display it using zip, as sketched below.
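A minimal sketch of that idea (my own illustration): collect the indices per value into a dict first, then print the two columns side by side with itertools.zip_longest:
from itertools import zip_longest

l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
cols = {1: [], 0: []}  # intermediate structure: indices per value
for idx, val in enumerate(l, start=1):
    cols[val].append(idx)

print('ones\tzeros')
for one_idx, zero_idx in zip_longest(cols[1], cols[0]):
    left = f'{one_idx} 1' if one_idx is not None else ''
    right = f'{zero_idx} 0' if zero_idx is not None else ''
    print(f'{left}\t{right}')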
Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the id of the largest cluster with 0, the id of the second largest with 1, and so forth. The output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.
The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
1. Integer factorization: find an integer representation where each unique value in the column gets its own integer. We'll start with zero.
2. We then need the counts of each of these unique values.
3. We need to rank the unique values by their counts.
4. We assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we do u[i], we get back the original df.cluster.values:
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort, but it's confusing, so I'll try to illustrate it:
np.row_stack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
[2, 1, 0, 3, 4, 5]])
What argsort does in general is place, in the top spot (position 0), the position to draw from in the originating array.
# position 2
# is best
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# top spot
# from
# position 2
# position 1
# goes to
# pen-ultimate spot
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# pen-ultimate spot
# from
# position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count the values. The reason to use this approach is that numpy.unique actually sorts the values in the midst of factorizing and counting, while pandas.factorize does not. For larger data sets, big O is our friend: this remains O(n), while the NumPy approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
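One caveat worth flagging (my note, not part of the original answer): (-c).argsort() happens to be its own inverse for these particular counts, which is why indexing it with i gives the right labels here. In general, mapping each unique value's count to its rank needs the inverse permutation, i.e. a double argsort:
# General-case variant (a sketch): rank = argsort of argsort.
df.assign(cluster=(-c).argsort().argsort()[i])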
You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster').transform('count')\
    .rank(ascending=False, method='dense')\
    .sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
By using category and value_counts
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8
This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)
original = df.cluster.copy()  # match against the original labels so relabelling can't collide
count = 0
for i in valueCounts_sorted.index.values:
    idx = original[original == i].index.values
    df.loc[idx, "cluster"] = count
    count += 1
I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. The resulting array should tell me how many consecutive values fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, however I cannot get past the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0

np.where(
    s > threshold,
    s
    .to_frame()          # Convert series to dataframe.
    .assign(_dummy_=1)   # Add column of ones.
    .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
    .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
    0)  # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
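The heart of this is the shift-cumsum pattern used as the groupby key; here is what it builds for the sample (my illustration):
cond = s.gt(threshold)
grp = (cond != cond.shift()).cumsum()
print(grp.tolist())
# [1, 1, 2, 2, 2, 2, 3, 4, 4, 5]
# Each run of equal condition values gets its own id, so cumsumming
# the column of ones within each group counts positions inside every run.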
Yes, we can use diff-styled differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting part is taken care of by assigning appropriate values at the end of each interval, so that the cumulative sum drops back to zero after every interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
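To make the reset trick concrete, here is the intermediate out array for the question's sample before the final cumsum (my illustration): each interval's stop index holds the negative interval length, so the running sum falls back to zero there.
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
# mask:              [F, F, T, T, T, T, F,  T, T, F]
# out before cumsum: [0, 0, 1, 1, 1, 1, -4, 1, 1, -2]
# out.cumsum():      [0, 0, 1, 2, 3, 4, 0,  1, 2, 0]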
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to take the posted sample from the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))
stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])