I have a NumPy one-dimensional array of 1s and 0s, e.g.
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
I want to count the runs of consecutive 0s and 1s in the array and output something like this:
[1,3,7,1,1,2,3,2,2]
What I do at the moment is:
np.diff(np.where(np.abs(np.diff(a)) == 1)[0])
and it outputs
array([3, 7, 1, 1, 2, 3, 2])
As you can see, it is missing the first count (1).
I've tried np.split and then getting the sizes of each segment, but it does not seem optimal.
Is there a more elegant, "pythonic" solution?
Here's one vectorized approach -
np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
Sample run -
In [208]: a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
In [209]: np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
Out[209]: array([1, 3, 7, 1, 1, 2, 3, 2, 2])
Faster one with boolean concatenation -
np.diff(np.flatnonzero(np.concatenate(([True], a[1:] != a[:-1], [True]))))
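A quick check (not part of the original answer) confirming that this variant also includes the first and last runs for the sample array a above:
np.diff(np.flatnonzero(np.concatenate(([True], a[1:] != a[:-1], [True]))))
# -> array([1, 3, 7, 1, 1, 2, 3, 2, 2])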
Runtime test
For the setup, let's create a bigger dataset with islands of 0s and 1s. For a fair benchmark in line with the given sample, let's have the island lengths vary between 1 and 6 -
In [257]: n = 100000 # this creates 100000 pairs of islands
In [258]: a = np.repeat(np.arange(n)%2, np.random.randint(1,7,(n)))
# Approach #1 proposed in this post
In [259]: %timeit np.diff(np.r_[0,np.flatnonzero(np.diff(a))+1,a.size])
100 loops, best of 3: 2.13 ms per loop
# Approach #2 proposed in this post
In [260]: %timeit np.diff(np.flatnonzero(np.concatenate(([True], a[1:]!= a[:-1], [True] ))))
1000 loops, best of 3: 1.21 ms per loop
# @Vineet Jain's soln
In [261]: %timeit [ sum(1 for i in g) for k,g in groupby(a)]
10 loops, best of 3: 61.3 ms per loop
Using groupby from itertools
from itertools import groupby
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
grouped_a = [ sum(1 for i in g) for k,g in groupby(a)]
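For the sample array above, this reproduces the expected counts (a quick check, not from the original answer):
print(grouped_a)
# [1, 3, 7, 1, 1, 2, 3, 2, 2]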
I found a method similar to yours, except that this code finds the first and the last counts separately. The approach is detailed in the code below:
import numpy as np
a = np.array([0,1,1,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0])
print(f'a: {a}')
diff_a = np.diff(a)
print(f'diff_a: {diff_a}')
non_zero_pos_arr = np.where(diff_a != 0)[0]
print(f'Array of positions where non zero elements are present in diff_a array: {non_zero_pos_arr}')
diff_non_zero_pos_arr = np.diff(non_zero_pos_arr)
print(f'Result Array except for first and last element: {diff_non_zero_pos_arr}')
ans_first_ele = non_zero_pos_arr[0] + 1
ans_last_ele = len(diff_a) - non_zero_pos_arr[-1]
ans = np.array([], dtype=np.int8)
ans = np.append(ans, ans_first_ele)
ans = np.append(ans, diff_non_zero_pos_arr)
ans = np.append(ans, ans_last_ele)
print(f'Result Array: {ans}')
Output:
a: [0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0]
diff_a: [ 1 0 0 -1 0 0 0 0 0 0 1 -1 1 0 -1 0 0 1 0 -1 0]
Array of positions where non zero elements are present in diff_a array:
[ 0 3 10 11 12 14 17 19]
Result Array except for first and last element: [3 7 1 1 2 3 2]
Result Array: [1 3 7 1 1 2 3 2 2]
Problem statement:
As stated in the title, I want to remove parts of a 1D array that consist of consecutive zeros with length equal to or above a threshold.
My solution:
I produced the solution shown in the following MRE:
import numpy as np
THRESHOLD = 4
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
print("Input: " + str(a))
# Find the indices of the parts that meet threshold requirement
gaps_above_threshold_inds = np.where(np.diff(np.nonzero(a)[0]) - 1 >= THRESHOLD)[0]
# Delete these parts from array
for idx in gaps_above_threshold_inds:
    a = np.delete(a, list(range(np.nonzero(a)[0][idx] + 1, np.nonzero(a)[0][idx + 1])))
print("Output: " + str(a))
Output:
Input: [1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1]
Output: [1 1 0 1 1 1 0 0 0 1 1]
Question:
Is there a less complicated and more efficient way to do this on a numpy array?
Edit:
Based on @mozway's comments, I'm editing my question to provide some more information.
Basically, the problem domain is:
I have 1D signals of length ~20,000 samples
Some parts of the signals have been zeroed due to noise
The rest of the signal has non-zero values, in the range ~[50, 250]
Leading and trailing zeros have been removed
My goal is to remove the zero parts above a length threshold as I have already said.
More detailed questions:
As far as efficient numpy handling is concerned, is there a better solution than the one above?
As far as efficient signal processing techniques are concerned, is there a more suitable way to achieve the desired goal than using numpy?
Comments on answers:
Regarding my first concern about efficient numpy handling, @mathfux's solution is really great and basically what I was looking for. That's why I accepted it.
However, the approach by @Jérôme Richard answers my second question and presents a really high-performance solution; really useful if the dataset is extremely big.
Thanks for your great answers!
np.delete creates a new array every time it is called, which is very inefficient. A faster solution is to store all the values to keep in a boolean mask and then filter the input array at once. However, this would still likely require a pure-Python loop if done only with NumPy. A simpler and faster solution is to use Numba (or Cython) to do that. Here is an implementation:
import numpy as np
import numba as nb
@nb.njit('int_[:](int_[:], int_)')
def filterZeros(arr, threshold):
    n = len(arr)
    res = np.empty(n, dtype=arr.dtype)
    count = 0
    j = 0
    for i in range(n):
        if arr[i] == 0:
            count += 1
        else:
            if count >= threshold:
                j -= count
            count = 0
        res[j] = arr[i]
        j += 1
    if n > 0 and arr[n-1] == 0 and count >= threshold:
        j -= count
    return res[0:j]
a = np.array((1,1,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1))
a = filterZeros(a, 4)
print("Output: " + str(a))
Here are the results with a random binary array containing 100_000 items on my machine:
Reference implementation: 5982 ms
Mozway's solution: 23.4 ms
This implementation: 0.11 ms
Thus, this solution is about 54381 times faster than the initial solution and 212 times faster than Mozway's. The code can even be ~30% faster by working in-place (destroying the input array) and by telling Numba the array is contiguous in memory (using ::1 instead of :).
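To illustrate the contiguity hint mentioned above, here is a minimal, hypothetical example (not the author's code) showing the ::1 signature syntax; the actual gain would come from applying it to filterZeros itself:
import numpy as np
import numba as nb
# '::1' declares the array as C-contiguous, which lets Numba generate
# faster code than the generic strided signature '[:]'.
@nb.njit('int64[::1](int64[::1])')
def double_contig(arr):
    out = np.empty_like(arr)
    for i in range(arr.size):
        out[i] = 2 * arr[i]
    return out
print(double_contig(np.arange(5, dtype=np.int64)))  # [0 2 4 6 8]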
It's also possible to find the differences of the indices of nonzero items, fix the ones that exceed the threshold, and reconstruct the sequence in a correct way.
def numpy_fix(a):
    # STEP 1. find indices of nonzero items: [0 1 3 8 9 13 19]
    idx = np.flatnonzero(a)
    # STEP 2. Find differences along these indices (also insert a leading zero): [0 1 2 5 1 4 6]
    df = np.diff(idx, prepend=0)
    # STEP 3. Fix differences of indices larger than THRESHOLD: [0 1 2 1 1 4 1]
    df[df > THRESHOLD] = 1
    # STEP 4. Given differences on indices, reconstruct indices themselves: [0 1 3 4 5 9 10]
    cs = np.cumsum(df)
    z = np.zeros(cs[-1] + 1, dtype=int)  # create a list of zeros
    z[cs] = 1  # pad it with ones within indices found
    return z
>>> numpy_fix(a)
array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1])
(Note that it's correct only if a has no leading or trailing zeros)
%timeit numpy_fix(np.tile(a, (1, 50000)))
39.3 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
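If the input could contain leading or trailing zeros (see the note above), one option, suggested here rather than taken from the original answer, is to strip them first:
a = np.trim_zeros(a)  # removes leading and trailing zeros before calling numpy_fix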
A quite efficient method is to use itertools.groupby+itertools.chain:
from itertools import groupby, chain
a2 = np.array(list(chain(*(l for k,g in groupby(a)
                           if len(l:=list(g))<THRESHOLD or k))))
output:
array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1])
This works relatively fast, for instance on 1 million items:
# a = np.random.randint(2, size=1000000)
%%timeit
np.array(list(chain(*(l for k,g in groupby(a)
if len(l:=list(g))<THRESHOLD or k))))
# 254 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" based on a condition on df's columns?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
A B
0 1 4
1 2 5
2 3 6
I'd like to add a "weight" column where, if df['B'] >= 6, then df['weight'] = 20; otherwise df['weight'] = 1.
So my output will be:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another possibly faster one, using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multi-cores with numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as mentioned by the OP) with random data in the range [0,9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being 1 or 20, we can safely use lower precision, uint8, for a turbo speedup over the already discussed ones, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
There is a 2D numpy array of about 500000 rows with 512 values per row:
[
[1,0,1,...,0,0,1], # 512 1's or 0's
[0,1,0,...,0,1,1],
...
[0,0,1,...,1,0,1], # row number 500000
]
How do I sort the rows in ascending order as if each row were a long 512-bit integer?
[
[0,0,1,...,1,0,1],
[0,1,0,...,0,1,1],
[1,0,1,...,0,0,1],
...
]
Instead of converting to strings, you can also use a void view of the data (as from @Jaime here) and argsort by that.
def sort_bin(b):
    b_view = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
    return b[np.argsort(b_view.ravel())]  # as per @Divakar's suggestion
Testing
np.random.seed(0)
b = np.random.randint(0, 2, (10,5))
print(b)
print(sort_bin(b))
[[0 1 1 0 1]
[1 1 1 1 1]
[1 0 0 1 0]
...,
[1 0 1 1 0]
[0 1 0 1 1]
[1 1 1 0 1]]
[[0 0 0 0 1]
[0 1 0 1 1]
[0 1 1 0 0]
...,
[1 1 1 0 1]
[1 1 1 1 0]
[1 1 1 1 1]]
This should be much faster and less memory-intensive, since b_view is just a view into b:
t = np.random.randint(0,2,(2000,512))
%timeit sort_bin(t)
100 loops, best of 3: 3.09 ms per loop
%timeit np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, t))])
1 loop, best of 3: 3.29 s per loop
About 1000x faster actually
You could sort them in a stable way 512 times, starting with the right-most bit first.
Sort by last bit
Sort by second-last bit, stable (to not mess up results of previous sort)
...
...
Sort by first bit, stable
A smaller example: assume you want to sort these three 2-bit numbers by bits:
11
01
00
In the first step, you sort by the right bit, resulting in:
00
11
01
Now you sort by the first bit; in this case there are two 0s in that column. If your sorting algorithm were not stable, it would be allowed to put these equal items in any order in the result, which could cause 01 to appear before 00. We do not want that, so we use a stable sort for the first column, keeping the relative order of equal items and resulting in the desired:
00
01
11
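Here is a minimal NumPy sketch of the stable multi-pass sort described above (an illustration added for clarity, not part of the original answer); np.lexsort produces the same ordering in one call:
import numpy as np
def sort_rows_bitwise(b):
    # Walk from the least significant (right-most) column to the most significant,
    # using a stable sort at each pass so earlier passes are not disturbed.
    order = np.arange(len(b))
    for col in range(b.shape[1] - 1, -1, -1):
        order = order[np.argsort(b[order, col], kind='stable')]
    return b[order]
# Equivalent one-liner: b[np.lexsort(b.T[::-1])]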
Creating a string of each row and then applying np.sort()
So if we have an array to test on:
a = np.array([[1,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,1,1]])
We can create strings of each row by using np.apply_along_axis:
a = np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 1, a)
which would make a now:
array(['1000', '0000', '1111', '0011'], dtype='<U4')
and so now we can sort the strings with np.sort():
a = np.sort(a)
making a:
array(['0000', '0011', '1000', '1111'], dtype='<U4')
we can then convert back to the original format with:
a = np.array([[int(i) for i in r] for r in a])
which makes a:
array([[0, 0, 0, 0],
       [0, 0, 1, 1],
       [1, 0, 0, 0],
       [1, 1, 1, 1]])
And if you wanted to cram this all into one line:
a = np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 1, a))])
This is slow but does the job.
def sort_col(arr, col_num=0):
    # if we have sorted over all columns, return the array
    if col_num >= arr.shape[1]:
        return arr
    # sort array over the given column
    arr_sorted = arr[arr[:, col_num].argsort()]
    # if the number of 1s in the given column is neither equal to the total
    # number of rows nor equal to 0, split on 1 and 0, sort and then merge
    if len(arr) > np.sum(arr_sorted[:, col_num]) > 0:
        arr_sorted0s = sort_col(arr_sorted[arr_sorted[:, col_num]==0], col_num+1)
        arr_sorted1s = sort_col(arr_sorted[arr_sorted[:, col_num]==1], col_num+1)
        # change the order of stacking if you want descending order
        return np.vstack((arr_sorted0s, arr_sorted1s))
    # if the number of 1s in the given column is equal to the total number
    # of rows or equal to 0, just go to the next column
    return sort_col(arr_sorted, col_num + 1)
np.random.seed(0)
a = np.random.randint(0, 2, (5, 4))
print(a)
print(sort_col(a))
# prints
[[0 1 1 0]
[1 1 1 1]
[1 1 1 0]
[0 1 0 0]
[0 0 0 1]]
[[0 0 0 1]
[0 1 0 0]
[0 1 1 0]
[1 1 1 0]
[1 1 1 1]]
Edit: Or better yet, use Daniel's solution. I didn't check for new answers before I posted my code.
I have a list of 4 dataframes, called df.
I'd like to add a "number" column to each dataframe (df[i]['number']) that represents the dataframe number.
I tried to use list comprehension for that:
df=[df['number']=(x+1) for x in range(0,4)]
Which resulted in
File "<ipython-input-52-0b708f543fbb>", line 1
df=[df['number']=(x+1) for x in range(0,4)]
^
SyntaxError: invalid syntax
I also tried:
df=[x['number']=(y+1) for x,y in enumerate(df)]
With the same result, pointing at the '=' sign.
What am I doing wrong?
Use enumerate, starting from 1, and assign to each dataframe in your list:
for i, d in enumerate(df, 1):
    d['number'] = i
In-place assignment is much cheaper than assignment in a list comprehension.
df[0]
id marks
0 1 100
1 2 200
2 3 300
df[1]
name score flag
0 'abc' 100 T
1 'zxc' 300 F
for i, d in enumerate(df, 1):
    d['number'] = i
df[0]
id marks number
0 1 100 1
1 2 200 1
2 3 300 1
df[1]
name score flag number
0 'abc' 100 T 2
1 'zxc' 300 F 2
Performance
Small
1000 loops, best of 3: 278 µs per loop # mine
vs
1000 loops, best of 3: 567 µs per loop # John Galt
Large (df * 10000)
1000 loops, best of 3: 607 µs per loop # mine
vs
1000 loops, best of 3: 1.16 ms per loop # John Galt - assign
1 loop, best of 1: 1.42 ms per loop # John Galt - side effects
Note that the loop-based assignment is also space efficient.
Use
1)
In [454]: df = [x.assign(number=i) for i, x in enumerate(df, 1)]
In [455]: df[0]
Out[455]:
0 1 number
0 0.068330 0.708835 1
1 0.877747 0.586654 1
In [456]: df[1]
Out[456]:
0 1 number
0 0.430418 0.477923 2
1 0.049980 0.018981 2
The good part is that you can assign the result to a new variable without altering the old list, like
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
2)
If you want it in-place using a list comprehension:
In [474]: [x.insert(x.shape[1] ,'number', i) for i, x in enumerate(df, 1)]
Out[474]: [None, None, None, None]
In [475]: df[0]
Out[475]:
0 1 number
0 0.207806 0.315701 1
1 0.464864 0.976156 1
I'm trying to compute the Hamming distance between all strings in a column of a large dataframe. I have over 100,000 rows in this column, so all pairwise combinations come to about 10x10^9 comparisons. These strings are short DNA sequences. I would like to quickly convert every string in the column to a list of integers, where a unique integer represents each character in the string. E.g.
"ACGTACA" -> [0, 1, 2, 3, 1, 2, 1]
Then I can use scipy.spatial.distance.pdist to quickly and efficiently compute the Hamming distance between all of these. Is there a fast way to do this in Pandas?
I have tried using apply but it is pretty slow:
mapping = {"A":0, "C":1, "G":2, "T":3}
df.apply(lambda x: np.array([mapping[char] for char in x]))
get_dummies and other categorical operations don't apply because they operate on a per-row level, not within the row.
Since Hamming distance doesn't care about magnitude differences, I can get about a 40-60% speedup just replacing df.apply(lambda x: np.array([mapping[char] for char in x])) with df.apply(lambda x: map(ord, x)) on made-up datasets.
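A small made-up example (not from the original comment) showing why any one-to-one character-to-integer mapping such as ord works here: Hamming distance only checks positions for equality, so the actual integer values never matter.
import numpy as np
from scipy.spatial.distance import pdist
seqs = ["ACGTACA", "ACGTTCA", "TCGTACA"]
codes = np.array([[ord(c) for c in s] for s in seqs])
# pdist returns the fraction of differing positions; multiply by the string
# length to get raw Hamming counts.
print(pdist(codes, metric='hamming') * codes.shape[1])  # [1. 1. 2.]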
Create your test data
In [39]: pd.options.display.max_rows=12
In [40]: N = 100000
In [41]: chars = np.array(list('ABCDEF'))
In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))
In [45]: s
Out[45]:
0 BEBC
1 BEEC
2 FEFA
3 BBDA
4 CCBB
5 CABE
...
99994 EEBC
99995 FFBD
99996 ACFB
99997 FDBE
99998 BDAB
99999 CCFD
dtype: object
These don't actually have to be the same length the way we are doing it.
In [43]: maxlen = s.str.len().max()
In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
In [47]: result
Out[47]:
0 1 2 3
0 1 4 1 2
1 1 4 4 2
2 5 4 5 0
3 1 1 3 0
4 2 2 1 1
5 2 0 1 4
... .. .. .. ..
99994 4 4 1 2
99995 5 5 1 3
99996 0 2 5 1
99997 5 3 1 4
99998 1 3 0 1
99999 2 2 5 3
[100000 rows x 4 columns]
So you get a factorization according to the same categories (i.e. the codes are meaningful).
And pretty fast
In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
I didn't test the performance of this, but you could also try something like
atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]
results in:
[3, 2, 1, 0, 3, 2, 3]
Edit: Ok, so testing it with atest = "ACTACA"*100000 takes a while :/
Maybe not the best idea...
Edit 5:
Another improvement:
import datetime
import numpy as np
class Test(object):
    def __init__(self):
        self.mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG"*10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()
Takes about 16 seconds for 110,000,000 characters and keeps your processor busy instead of your RAM:
[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
There doesn't seem to be much difference between using ord or a dictionary-based lookup that exactly maps A->0, C->1 etc:
import pandas as pd
import numpy as np
bases = ['A', 'C', 'T', 'G']
rowlen = 4
nrows = 1000000
dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))
lookup = dict(zip(bases, range(4)))
%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop
%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop
Jeff's solution is also not far off in terms of performance:
%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop
A major advantage of this approach over mapping the rows to lists of ints is that the categories can then be viewed as a single (nrows, rowlen) uint8 array via the .values attribute, which could then be passed directly to pdist.
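A hedged sketch of that last step, assuming the concatenated codes DataFrame from Jeff's answer (called result there): view the codes as a compact uint8 array and hand it to pdist. It is shown on a 1000-row slice, since all pairs over 100,000 rows would be an enormous result.
import numpy as np
from scipy.spatial.distance import pdist
codes = result.values.astype(np.uint8)         # shape (nrows, rowlen) array of category codes
dists = pdist(codes[:1000], metric='hamming')  # pairwise Hamming on the slice
print(dists.shape)                             # (499500,) condensed distance vector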