efficient way to convert pandas series of strings to numpy frequency matrix

efficient way to convert pandas series of strings to numpy frequency matrix - python

I have:
a list of english chars, a to z: char_list
series of strings
I want to create a numpy matrix where each row correspond to the string at the same row in the series, and each column to the character at the same index in the list.
example:
series: [[ab],[ac],[aa]]
chars = [a,b,c]
result = [[110],[101],[200]]
here is how I do it now:
def create_char_matrix(strings, symbol_list):
mat = np.zeros((strings.shape[0],len(symbol_list)))
for i, line in enumerate(strings):
for c in line:
mat[i,symbol_list.index(c)] += 1
return mat
this is not very fast, considering there is often a better solution than nested for loops.
any ideas about how to accelerate this process?

You can split the string to the characters, then crosstab:
s = pd.Series(['ab','ac','aa'])
chars=['a','b','c']
a = s.str.split('').str[1:-1].explode()
pd.crosstab(a.index, a).reindex(chars, axis=1, fill_value=0).values
Output:
array([[1, 1, 0],
[1, 0, 1],
[2, 0, 0]])

Related

compare each row with each column in matrix using only pure python

I have a certain function that I made and I want to run it on each column and each row of a matrix, to check if there are rows and columns that produce the same output.
for example:
matrix = [[1,2,3],
[7,8,9]]
I want to run the function, lets call it myfun, on each column [1,7], [2,8] and [3,9] separatly, and also run it on each row [1,2,3] and [7,8,9]. If there is a row and a column that produce the same result, the counter ct would go up 1. All of this is found in another function, called count_good, which basically counts rows and columns that produce the same result.
here is the code so far:
def count_good(mat):
ct = 0
for i in mat:
for j in mat:
if myfun(i) == myfun(j):
ct += 1
return ct
However, when I use print to check my code I get this:
mat = [[1,2,3],[7,8,9]]

for i in mat:
for j in mat:
print(i,j)

[1, 2, 3] [1, 2, 3]
[1, 2, 3] [7, 8, 9]
[7, 8, 9] [1, 2, 3]
[7, 8, 9] [7, 8, 9]
I see that the code does not return what I need' which means that the count_good function won't work. How can I run a function on each row and each column? I need to do it without any help of outside libraries, no map,zip or stuff like that, only very pure python.

Let's start by using itertools and collections for this, then translate it back to "pure" python.
from itertools import product, starmap, chain # combinations?
from collections import Counter
To iterate in a nested loop efficiently, you can use itertools.product. You can use starmap to expand the arguments of a function as well. Here is a generator of the values of myfun over the rows:
starmap(myfun, product(matrix, repeat=2))
To transpose the matrix and iterate over the columns, use the zip(* idiom:
starmap(myfun, product(zip(*matrix), repeat=2))
You can use collections.Counter to map all the repeats for each possible return value:
Counter(starmap(myfun, chain(product(matrix, repeat=2), product(zip(*matrix), repeat=2))))
If you want to avoid running myfun on the same elements, replace product(..., repeat=2) with combinations(..., 2).
Now that you have the layout of how to do this, replace all the external library stuff with equivalent builtins:
counter = {}
for i in range(len(matrix)):
for j in range(len(matrix)):
result = myfun(matrix[i], matrix[j])
counter[result] = counter.get(result, 0) + 1
for i in range(len(matrix[0])):
for j in range(len(matrix[0])):
c1 = [matrix[row][i] for row in range(len(matrix))]
c2 = [matrix[row][j] for row in range(len(matrix))]
result = myfun(c1, c2)
counter[result] = counter.get(result, 0) + 1
If you want combinations instead, replace the loop pairs with
for i in range(len(...) - 1):
for j in range(i + 1, len(...)):

Using native python:
def count_good(mat):
ct = 0
columns = [[row[col_idx] for row in mat] for col_idx in range(len(mat[0]))]
for row in mat:
for column in columns:
if myfun(row) == myfun(column):
ct += 1
return ct
However, this is very inefficient as it is a triple nested for-loop. I would suggest using numpy instead.
e.g.
def count_good(mat):
ct = 0
mat = np.array(mat)
for row in mat:
for column in mat.T:
if myfun(row) == myfun(column):
ct += 1
return ct

TL;DR
To get a column from a 2D list of N lists of M elements, first flatten the list to a 1D list of N×M elements, then choosing elements from the 1D list with a stride equal to M, the number of columns, gives you a column of the original 2D list.
First, I create a matrix of random integers, as a list of lists of equal
length — Here I take some liberty from the objective of "pure" Python, the OP
will probably input by hand some assigned matrix.
from random import randrange, seed
seed(20220914)
dim = 5
matrix = [[randrange(dim) for column in range(dim)] for row in range(dim)]
print(*matrix, sep='\n')
We need a function to be applied to each row and each column of the matrix,
that I intend must be supplied as a list. Here I choose a simple summation of
the elements.
def myfun(l_st):
the_sum = 0
for value in l_st: the_sum = the_sum+value
return the_sum
To proceed, we are going to do something unexpected, that is we unwrap the
matrix, starting from an empty list we do a loop on the rows and "sum" the
current row to unwrapped, note that summing two lists gives you a single
list containing all the elements of the two lists.
unwrapped = []
for row in matrix: unwrapped = unwrapped+row
In the following we will need the number of columns in the matrix, this number
can be computed counting the elements in the last row of the matrix.
ncols = 0
for value in row: ncols = ncols+1
Now, we can compute the values produced applying myfunc to each column,
counting how many times we have the same value.
We use an auxiliary variable, start, that is initialized to zero and
incremented in every iteration of the following loop, that scans, using a
dummy variable, all the elements of the current row, hence start has the
values 0, 1, ..., ncols-1, so that unwrapped[start::ncols] is a list
containing exactly one of the columns of the matrix.
count_of_column_values = {}
start = 0
for dummy in row:
column_value = myfun(unwrapped[start::ncols])
if column_value not in count_of_column_values:
count_of_column_values[column_value] = 1
else:
count_of_column_values[column_value] = count_of_column_values[column_value] + 1
start = start+1
At this point, we are ready to apply myfun to the rows
count = 0
for row in matrix:
row_value = myfun(row)
if row_value in count_of_column_values: count = count+count_of_column_values[row_value]
print(count)
Executing the code above prints
[1, 4, 4, 1, 0]
[1, 2, 4, 1, 4]
[1, 4, 4, 0, 1]
[4, 0, 3, 1, 2]
[0, 0, 4, 2, 2]
3

Find the index of first non-zero element to the right of given elements in python

I have a 2D numpy.ndarray. Given a list of positions, I want to find the positions of first non-zero elements to the right of the given elements in the same row. Is it possible to vectorize this? I have a huge array and looping is taking too much time.
Eg:
matrix = numpy.array([
[1, 0, 0, 1, 1],
[1, 1, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 1, 1, 1, 1],
[1, 0, 0, 0, 1]
])
query = numpy.array([[0,2], [2,1], [1,3], [0,1]])
Expected Result:
>> [[0,3], [2,4], [1,4], [0,3]]
Currently I'm doing this using for loops as follows
for query_point in query:
y, x = query_point
result_point = numpy.min(numpy.argwhere(self.matrix[y, x + 1:] == 1)) + x + 1
print(f'{y}, {result_point}')
PS: I also want to find the first non-zero element to the left as well. I guess, the solution to find the right point can be easily tqeaked to find the left point.

If your query array is sufficiently dense, you can reverse the computation: find an array of the same size as matrix that gives the index of the next nonzero element in the same row for each location. Then your problem becomes one of just one of applying query to this index array, which numpy supports directly.
It is actually much easier to find the left index, so let's start with that. We can transform matrix into an array of indices like this:
r, c = np.nonzero(matrix)
left_ind = np.zeros(matrix.shape, dtype=int)
left_ind[r, c] = c
Now you can find the indices of the preceding nonzero element by using np.maximum similarly to how it is done in this answer: https://stackoverflow.com/a/48252024/2988730:
np.maximum.accumulate(left_ind, axis=1, out=left_ind)
Now you can index directly into ind to get the previous nonzero column index:
left_ind[query[:, 0], query[:, 1]]
or
left_ind[tuple(query.T)]
Now to do the same thing with the right index, you need to reverse the array. But then your indices are no longer ascending, and you risk overwriting any zeros you have in the first column. To solve that, in addition to just reversing the array, you need to reverse the order of the indices:
right_ind = np.zeros(matrix.shape, dtype=int)
right_ind[r, c] = matrix.shape[1] - c
You can use any number larger than matrix.shape[1] as your constant as well. The important thing is that the reversed indices all come out greater than zero so np.maximum.accumulate overwrites the zeros. Now you can use np.maximum.accumulate in the same way on the reversed array:
right_ind = matrix.shape[1] - np.maximum.accumulate(right_ind[:, ::-1], axis=1)[:, ::-1]
In this case, I would recommend against using out=right_ind, since right_ind[:, ::-1] is a view into the same buffer. The operation is buffered, but if your line size is big enough, you may overwrite data unintentionally.
Now you can index the array in the same way as before:
right_ind[(*query.T,)]
In both cases, you need to stack with the first column of query, since that's the row key:
>>> row, col = query.T
>>> np.stack((row, left_ind[row, col]), -1)
array([[0, 0],
[2, 0],
[1, 1],
[0, 0]])
>>> np.stack((row, right_ind[row, col]), -1)
array([[0, 3],
[2, 4],
[1, 4],
[0, 3]])
>>> np.stack((row, left_ind[row, col], right_ind[row, col]), -1)
array([[0, 0, 3],
[2, 0, 4],
[1, 1, 4],
[0, 0, 3]])
If you plan on sampling most of the rows in the array, either at once, or throughout your program, this will help you speed things up. If, on the other hand, you only need to access a small subset, you can apply this technique only to the rows you need.

I came up with a solution to get both your wanted indices,
i.e. to the left and to the right from the indicated position.
First define the following function, to get the row number and both indices:
def inds(r, c, arr):
ind = np.nonzero(arr[r])[0]
indSlice = ind[ind < c]
iLeft = indSlice[-1] if indSlice.size > 0 else None
indSlice = ind[ind > c]
iRight = indSlice[0] if indSlice.size > 0 else None
return r, iLeft, iRight
Parameters:
r and c are row number (in the source array) and the "starting"
index in this row,
arr is the array to look in (matrix will be passed here).
Then define the vectorized version of this function:
indsVec = np.vectorize(inds, excluded=['arr'])
And to get the result, run:
result = np.vstack(indsVec(query[:, 0], query[:, 1], arr=matrix)).T
The result is:
array([[0, 0, 3],
[2, 0, 4],
[1, 1, 4],
[0, 0, 3]], dtype=int64)
Your expected result is the left and right column (row number
and the index of first non-zero element after the "starting" position.
The middle column is the index of last non-zero element before the "starting" position.
This solution is resistant to "non-existing" case (if there are no
any "before" or "after" non-zero element). In such case the respective
index is returned as None.

Python: Counting Zeros in multiple array columns and store them efficently

I create an array:
import numpy as np
arr = [[0, 2, 3], [0, 1, 0], [0, 0, 1]]
arr = np.array(arr)
Now I count every zero per column and store it in a variable:
a = np.count_nonzero(arr[:,0]==0)
b = np.count_nonzero(arr[:,1]==0)
c = np.count_nonzero(arr[:,2]==0)
This code works fine. But in my case I have many more columns with over 70000 values in each. This would be many more lines of code and a very messy variable expolorer in spyder.
My questions:
Is there a possibility to make this code more efficient and save the values only in one type of data, e.g. a dictionary, dataframe or tuple?
Can I use a loop for creating the dic, dataframe or tuple?
Thank you

You can construct a boolean array arr == 0 and then take its sum along the rows.
>>> (arr == 0).sum(0)
array([3, 1, 1])

Use an ordered dict from the collections module:
from collections import OrderedDict
import numpy as np
from pprint import pprint as pp
import string
arr = np.array([[0, 2, 3], [0, 1, 0], [0, 0, 1]])
letters = string.ascii_letters
od = OrderedDict()
for i in range(len(arr)):
od[letters[i]] = np.count_nonzero(arr[:, i]==0)
pp(od)
Returning:
OrderedDict([('a', 3), ('b', 1), ('c', 1)])
Example usage:
print(f"First number of zeros: {od.get('a')}")
Will give you:
First number of zeros: 3

To count zeros you can count non-zeros along each column and subtract result from length of each column:
arr.shape[0] - np.count_nonzero(arr, axis=0)
produces [3,1,1].
This solution is very fast because no extra large objects are created.

How to replace one column by a value in a numpy array?

I have an array like this
import numpy as np
a = np.zeros((2,2), dtype=np.int)
I want to replace the first column by the value 1. I did the following:
a[:][0] = [1, 1] # not working
a[:][0] = [[1], [1]] # not working
Contrariwise, when I replace the rows it worked!
a[0][:] = [1, 1] # working
I have a big array, so I cannot replace value by value.

You can replace the first column as follows:
>>> a = np.zeros((2,2), dtype=np.int)
>>> a[:, 0] = 1
>>> a
array([[1, 0],
[1, 0]])
Here a[:, 0] means "select all rows from column 0". The value 1 is broadcast across this selected column, producing the desired array (it's not necessary to use a list [1, 1], although you can).
Your syntax a[:][0] means "select all the rows from the array a and then select the first row". Similarly, a[0][:] means "select the first row of a and then select this entire row again". This is why you could replace the rows successfully, but not the columns - it's necessary to make a selection for axis 1, not just axis 0.

You can do something like this:
import numpy as np
a = np.zeros((2,2), dtype=np.int)
a[:,0] = np.ones((1,2), dtype=np.int)
Please refer to Accessing np matrix columns

Select the intended column using a proper indexing and just assign the value to it using =. Numpy will take care of the rest for you.
>>> a[::,0] = 1
>>> a
array([[1, 0],
[1, 0]])
Read more about numpy indexing.

Python - how to find numbers in a list which are not the minimum

I have a list S = [a[n],b[n],c[n]] and for n=0 the minimum of list S is the value 'a'. How do I select the values b and c given that I know the minimum? The code I'm writing runs through many iterations of n, and I want to examine the elements which are not the minimum for a given iteration in the loop.
Python 2.7.3, 32-bit. Numpy 1.6.2. Scipy 0.11.0b1

If you can flatten the whole list into a numpy array, then use argsort, the first row of argsort will tell you which array contains the minimum value:
a = [1,2,3,4]
b = [3,-4,5,8]
c = [6,1,-7,12]
S = [a,b,c]
S2 = np.array(S)
S2.argsort(axis=0)
array([[0, 1, 2, 0],
[1, 2, 0, 1],
[2, 0, 1, 2]])

Maybe you can do something like
S.sort()
S[1:3]
This is what you want?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

efficient way to convert pandas series of strings to numpy frequency matrix - python

You can split the string to the characters, then crosstab: s = pd.Series(['ab','ac','aa']) chars=['a','b','c'] a = s.str.split('').str[1:-1].explode() pd.crosstab(a.index, a).reindex(chars, axis=1, fill_value=0).values Output: array([[1, 1, 0], [1, 0, 1], [2, 0, 0]])

Related

compare each row with each column in matrix using only pure python

Find the index of first non-zero element to the right of given elements in python

Python: Counting Zeros in multiple array columns and store them efficently

How to replace one column by a value in a numpy array?

Python - how to find numbers in a list which are not the minimum

Categories

Resources