How can i remove consecutive pairs of equal numbers with opposite signs from a Pandas dataframe?
Assuming i have this input dataframe
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
'idx': range(len(incremental_changes)),
'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
Can groupby consecutive equal numbers and transform
import itertools
def remove_duplicates(s):
''' Generates booleans that indicate when a pair of ints with
opposite signs are found.
'''
iter_ = iter(s)
for (a,b) in itertools.zip_longest(iter_, iter_):
if b is None:
yield False
else:
yield a+b == 0
yield a+b == 0
>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
['incremental_changes'] \
.transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
Just do rolling, then we filter the multiple combine
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
How I can rotate list clockwise one time? I have some temporary solution, but I'm sure there is a better way to do it.
I want to get from this
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
to this:
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
And my temporary "solution" is just:
temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
test = [None] * len(temporary)
test[0] = temporary[0]
for index in range(1, len(temporary)):
test[index] = temporary[index - 1]
You might use temporary.pop() to discard the last item and temporary.insert(0, 0) to add 0 to the front.
Alternatively in one line:
temporary = [0] + temporary[:-1]
Following the StackOverflow post Elegantly calculate mean of first three values of a list I have tweaked the code to find the maximum.
However, I also require to know the position/index of the max.
So the code below calculates the max value for the first 3 numbers and then the max value for the next 3 numbers and so on.
For example for a list of values [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]. The code below takes the first 3 values 6,3,7 and outputs the max as 7 and then for the next 3 values 4,6,9 outputs the value 9 and so on.
But I also want to find which position/index they are at, 1.e 7 is at position 2 and 9 at position 5. The final result [2,5,8,11,12,...]. Any ideas on how to calculate the index. Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
print(index)
t = [[1, 2, 3, 4, 5, 6], [2, 4, 6, 8,
i = 0
a = 0
for i in range(len(t)):
for a in range(len(t[i])):
print(a+1, t[i][a], end=' ')
print('\n', end='')
My expected output would be (in more general terms):
a+1 t[i][a] t[i]+[a+2]...
a+2 t[i+1][a]...
and so on.
For example (with 2D-array t):
1 1 2 3 4 5 6
2 2 4 6 8 10 12
Instead with the above code I get:
1 1 2 2 3 3 4 4 5 5 6 6
1 2 2 4 3 6 4 8 5 10 6 12
And I cannot pinpoint exactly why. Any ideas?
You have a small mistake, as you print a+1 inside the inner loop. You should print i+1 inside the outer loop.
for i, _ in enumerate(t):
print(i+1, end=' ');
for a, _ in enumerate(t[i]):
print(t[i][a], end=' ')
print('\n', end='')
This is a typical use case for FEM/FVM equation systems, so is perhaps of broader interest. From a triangular mesh à la
I would like to create a scipy.sparse.csr_matrix. The matrix rows/columns represent values at the nodes of the mesh. The matrix has entries on the main diagonal and wherever two nodes are connected by an edge.
Here's an MWE that first builds a node->edge->cells relationship and then builds the matrix:
import numpy
import meshzoo
from scipy import sparse
nx = 1600
ny = 1000
verts, cells = meshzoo.rectangle(0.0, 1.61, 0.0, 1.0, nx, ny)
n = len(verts)
nds = cells.T
nodes_edge_cells = numpy.stack([nds[[1, 2]], nds[[2, 0]],nds[[0, 1]]], axis=1)
# assign values to each edge (per cell)
alpha = numpy.random.rand(3, len(cells))
vals = numpy.array([
[alpha**2, -alpha],
[-alpha, alpha**2],
])
# Build I, J, V entries for COO matrix
I = []
J = []
V = []
#
V.append(vals[0][0])
V.append(vals[0][1])
V.append(vals[1][0])
V.append(vals[1][1])
#
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[1])
I.append(nodes_edge_cells[1])
#
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
# Create suitable data for coo_matrix
I = numpy.concatenate(I).flat
J = numpy.concatenate(J).flat
V = numpy.concatenate(V).flat
matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()
With
python -m cProfile -o profile.prof main.py
snakeviz profile.prof
one can create and view a profile of the above:
The method tocsr() takes the lion share of the runtime here, but this is also true when building alpha is more complex. Consequently, I'm looking for ways to speed this up.
What I've already found:
Due to the structure of the data, the values on the diagonal of the matrix can be summed up in advance, i.e.,
V.append(vals[0, 0, 0] + vals[1, 1, 2])
I.append(nodes_edge_cells[0, 0]) # == nodes_edge_cells[1, 2]
J.append(nodes_edge_cells[0, 0]) # == nodes_edge_cells[1, 2]
This makes I, J, V shorter and thus speeds up tocsr.
Right now, edges are "per cell". I could identify equal edges with each other using numpy.unique, effectively saving about half of I, J, V. However, I found that this too takes some time. (Not surprising.)
One other thought that I had was that that I could replace the diagonal V, I, J by a simple numpy.add.at if there was a csr_matrix-like data structure where the main diagonal is kept separately. I know that this exists in some other software packages, but couldn't find it in scipy. Correct?
Perhaps there's a sensible way to construct CSR directly?
I would try creating the csr structure directly, especially if you are resorting to np.unique since this gives you sorted keys, which is half the job done.
I'm assuming you are at the point where you have i, j sorted lexicographically and overlapping v summed using np.add.at on the optional inverse output of np.unique.
Then v and j are already in csr format. All that's left to do is creating the indptr which you simply get by np.searchsorted(i, np.arange(M+1)) where M is the column length. You can pass these directly to the sparse.csr_matrix constructor.
Ok, let code speak:
import numpy as np
from scipy import sparse
from timeit import timeit
def tocsr(I, J, E, N):
n = len(I)
K = np.empty((n,), dtype=np.int64)
K.view(np.int32).reshape(n, 2).T[...] = J, I
S = np.argsort(K)
KS = K[S]
steps = np.flatnonzero(np.r_[1, np.diff(KS)])
ED = np.add.reduceat(E[S], steps)
JD, ID = KS[steps].view(np.int32).reshape(-1, 2).T
ID = np.searchsorted(ID, np.arange(N+1))
return sparse.csr_matrix((ED, np.array(JD, dtype=int), ID), (N, N))
def viacoo(I, J, E, N):
return sparse.coo_matrix((E, (I, J)), (N, N)).tocsr()
#testing and timing
# correctness
N = 1000
A = np.random.random((N, N)) < 0.001
I, J = np.where(A)
E = np.random.random((2, len(I)))
D = np.zeros((2,) + A.shape)
D[:, I, J] = E
D2 = tocsr(np.r_[I, I], np.r_[J, J], E.ravel(), N).A
print('correct:', np.allclose(D.sum(axis=0), D2))
# speed
N = 100000
K = 10
I, J = np.random.randint(0, N, (2, K*N))
E = np.random.random((2 * len(I),))
I, J, E = np.r_[I, I, J, J], np.r_[J, J, I, I], np.r_[E, E]
print('N:', N, ' -- nnz (with duplicates):', len(E))
print('direct: ', timeit('f(a,b,c,d)', number=10, globals={'f': tocsr, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')
print('via coo:', timeit('f(a,b,c,d)', number=10, globals={'f': viacoo, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')
Prints:
correct: True
N: 100000 -- nnz (with duplicates): 4000000
direct: 7.702431229001377 secs for 10 iterations
via coo: 41.813509466010146 secs for 10 iterations
Speedup: 5x
So, in the end this turned out to be the difference between COO's and CSR's sum_duplicates (just like #hpaulj suspected). Thanks to the efforts of everyone involved here (particularly #paul-panzer), a PR is underway to give tocsr a tremendous speedup.
SciPy's tocsr does a lexsort on (I, J), so it helps organizing the indices in such a way that (I, J) will come out fairly sorted already.
For for nx=4, ny=2 in the above example, I and J are
[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]
[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]
First sorting each row of cells, then the rows by the first column like
cells = numpy.sort(cells, axis=1)
cells = cells[cells[:, 0].argsort()]
produces
[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]
[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]
For the number in the original post, sorting cuts down the runtime from about 40 seconds to 8 seconds.
Perhaps an even better ordering can be achieved if the nodes are numbered more appropriately in the first place. I'm thinking of Cuthill-McKee and friends.