I create an array:
import numpy as np
arr = [[0, 2, 3], [0, 1, 0], [0, 0, 1]]
arr = np.array(arr)
Now I count every zero per column and store it in a variable:
a = np.count_nonzero(arr[:,0]==0)
b = np.count_nonzero(arr[:,1]==0)
c = np.count_nonzero(arr[:,2]==0)
This code works fine. But in my case I have many more columns, each with over 70000 values. This would mean many more lines of code and a very messy Variable Explorer in Spyder.
My questions:
Is there a way to make this code more efficient and store the values in a single data structure, e.g. a dictionary, DataFrame or tuple?
Can I use a loop to create that dict, DataFrame or tuple?
Thank you
You can construct a boolean array arr == 0 and then sum it along axis 0 (down the rows), which gives the number of zeros in each column.
>>> (arr == 0).sum(0)
array([3, 1, 1])
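If you would rather keep the counts in one object, as asked, you can build a dict straight from that result (a minimal sketch; the keys here are just the column indices):
# number of zeros per column, keyed by column index
zero_counts = (arr == 0).sum(0)
counts = {i: int(n) for i, n in enumerate(zero_counts)}
# {0: 3, 1: 1, 2: 1}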
Use an ordered dict from the collections module:
from collections import OrderedDict
import numpy as np
from pprint import pprint as pp
import string
arr = np.array([[0, 2, 3], [0, 1, 0], [0, 0, 1]])
letters = string.ascii_letters
od = OrderedDict()
for i in range(arr.shape[1]):  # iterate over columns, not rows
    od[letters[i]] = np.count_nonzero(arr[:, i] == 0)
pp(od)
Returning:
OrderedDict([('a', 3), ('b', 1), ('c', 1)])
Example usage:
print(f"First number of zeros: {od.get('a')}")
Will give you:
First number of zeros: 3
To count zeros you can count the non-zeros along each column and subtract the result from the length of each column (the number of rows):
arr.shape[0] - np.count_nonzero(arr, axis=0)
which produces array([3, 1, 1]).
This solution is very fast because no extra large objects are created.
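If you also want the counts labelled and stored in a single object, one option is to wrap the result in a pandas Series (the column names below are made up for the example):
import pandas as pd
zero_counts = arr.shape[0] - np.count_nonzero(arr, axis=0)
counts = pd.Series(zero_counts, index=['col0', 'col1', 'col2'])
# col0    3
# col1    1
# col2    1
# dtype: int64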
I have 7 boolean series of equal length:
msk_valid_structure
msk_submission_context
msk_reference_substance_location
msk_neutral
msk_identifier_origin
msk_two_conversion_methods
msk_no_error_warnings
I need to combine them with a logical AND (&) so that:
we always include the first one (msk_valid_structure)
we include 5 out of 6 of the remaining ones (all combinations)
I tried two solutions and I am not happy with either.
The first one uses reduce and it is quite slow:
from itertools import combinations
from functools import reduce
for combination in combinations(msks, 5):
    res = reduce(lambda x, y: x & y, combination, msk_valid_structure)
The second one constructs DataFrames and is also slow:
from itertools import combinations
import pandas as pd
for combination in combinations(msks, 5):
    tmp = pd.DataFrame({i: col for i, col in enumerate(list(combination) + [msk_valid_structure])})
    res = tmp.all(axis='columns')
How would you handle this situation?
Many thanks for your help.
Generate a DataFrame from your series and then use apply. In the applied function you can write your own logic to add a column with the result (valid/invalid), or act directly whenever the condition is fulfilled.
This has the benefit of keeping everything in a single DataFrame and a single pass over the rows, instead of building a new DataFrame for every combination.
With:
import pandas as pd
# Creating dataframe
test = [[0, 0, 1, 0, 0, 1, 1],
        [1, 0, 1, 0, 1, 0, 1],
        [0, 1, 0, 1, 1, 1, 1],
        [1, 1, 0, 1, 1, 1, 1]]
cols = ['msk_valid_structure',
        'msk_submission_context',
        'msk_reference_substance_location',
        'msk_neutral',
        'msk_identifier_origin',
        'msk_two_conversion_methods',
        'msk_no_error_warnings']
df_test = pd.DataFrame(test, columns=cols)
# Defining a function matching your condition
def your_func(s: pd.Series):
    valid_structure = s.msk_valid_structure
    # Because any combination is allowed, we just sum the Trues;
    # if you only allow specific combinations, write the conditions here using .col or ['col'] notation
    rest = s.iloc[1:].sum()
    if valid_structure and (rest >= 5):
        return 1
    return 0
# Applying the function and saving the result in column 'foo'
df_test['foo'] = df_test.apply(your_func, axis=1)
# Resulting 'foo' column: 0, 0, 0, 1
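With the 'foo' column in place you can, for example, pull out only the rows that satisfy the condition:
# keep only the rows where the condition held
valid_rows = df_test[df_test['foo'] == 1]
print(valid_rows)  # only the last row (index 3) remains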
I have:
a list of English chars, a to z: char_list
a series of strings
I want to create a numpy matrix where each row corresponds to the string at the same position in the series, and each column to the character at the same index in the list.
example:
series: ['ab', 'ac', 'aa']
chars = ['a', 'b', 'c']
result = [[1, 1, 0], [1, 0, 1], [2, 0, 0]]
Here is how I do it now:
import numpy as np

def create_char_matrix(strings, symbol_list):
    mat = np.zeros((strings.shape[0], len(symbol_list)))
    for i, line in enumerate(strings):
        for c in line:
            mat[i, symbol_list.index(c)] += 1
    return mat
This is not very fast, and there is usually a better solution than nested for loops.
any ideas about how to accelerate this process?
You can split each string into its characters, then use crosstab:
import pandas as pd

s = pd.Series(['ab', 'ac', 'aa'])
chars = ['a', 'b', 'c']
a = s.str.split('').str[1:-1].explode()
pd.crosstab(a.index, a).reindex(chars, axis=1, fill_value=0).values
Output:
array([[1, 1, 0],
[1, 0, 1],
[2, 0, 0]])
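As a quick sanity check, this should match what the loop-based create_char_matrix from the question produces for the same inputs (assuming that function is defined as above):
import numpy as np
result = pd.crosstab(a.index, a).reindex(chars, axis=1, fill_value=0).values
assert np.array_equal(result, create_char_matrix(s, chars))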
In the following code the result in "countOf1" is 0 instead of 12. What is the reason and how can I solve it?
import numpy as np
import pandas as pd
x = np.matrix(np.arange(12).reshape((1, 12)))
x[:,:]=1
countOf1=(x.tolist()).count(1)
It's because when you convert the matrix into a list with tolist() you get a nested list (a list containing one inner list). This is your x:
x.tolist()
Out[221]: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
To get your countOf1 to work you'll need to count on x.tolist()[0]. This will give you:
x.tolist()[0].count(1)
Out[223]: 12
Remember that a numpy matrix is like a list of lists. Even though you only created one row vector, numpy writes it with two pairs of brackets ([[0, 1, 2, 3, 4, ..., 11]]). So when you converted it to a list with tolist(), you got a list within a list. Since the inner list itself is not equal to 1, count(1) returns 0.
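If all you actually need is the count, you can also skip tolist() entirely and count on the matrix itself, for example:
countOf1 = np.count_nonzero(x == 1)  # counts the elements equal to 1, here 12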
I have an array like this
import numpy as np
a = np.zeros((2,2), dtype=int)
I want to replace the first column by the value 1. I did the following:
a[:][0] = [1, 1] # not working
a[:][0] = [[1], [1]] # not working
In contrast, replacing a row works:
a[0][:] = [1, 1] # working
I have a big array, so I cannot replace the values one by one.
You can replace the first column as follows:
>>> a = np.zeros((2,2), dtype=int)
>>> a[:, 0] = 1
>>> a
array([[1, 0],
[1, 0]])
Here a[:, 0] means "select all rows from column 0". The value 1 is broadcast across this selected column, producing the desired array (it's not necessary to use a list [1, 1], although you can).
Your syntax a[:][0] means "select all the rows from the array a and then select the first row". Similarly, a[0][:] means "select the first row of a and then select this entire row again". This is why you could replace the rows successfully, but not the columns - it's necessary to make a selection for axis 1, not just axis 0.
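You can see this chaining directly (continuing from the array a above):
>>> np.array_equal(a[:], a)        # a[:] is just the whole array again
True
>>> np.array_equal(a[:][0], a[0])  # so a[:][0] picks the first row, not a column
True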
You can do something like this:
import numpy as np
a = np.zeros((2,2), dtype=int)
a[:,0] = np.ones((1,2), dtype=int)
Please refer to Accessing np matrix columns
Select the intended column using proper indexing and just assign the value to it with =. NumPy will take care of the rest (broadcasting) for you.
>>> a[::,0] = 1
>>> a
array([[1, 0],
[1, 0]])
Read more about numpy indexing.
I have a list S = [a[n],b[n],c[n]] and for n=0 the minimum of list S is the value 'a'. How do I select the values b and c given that I know the minimum? The code I'm writing runs through many iterations of n, and I want to examine the elements which are not the minimum for a given iteration in the loop.
Python 2.7.3, 32-bit. Numpy 1.6.2. Scipy 0.11.0b1
If you can stack the whole list into a 2D numpy array, you can use argsort: the first row of the argsort result tells you which array contains the minimum value at each position:
import numpy as np

a = [1, 2, 3, 4]
b = [3, -4, 5, 8]
c = [6, 1, -7, 12]
S = [a, b, c]
S2 = np.array(S)
S2.argsort(axis=0)
array([[0, 1, 2, 0],
[1, 2, 0, 1],
[2, 0, 1, 2]])
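To look at the values that are not the minimum at each position, one possibility (a sketch based on the arrays above) is to use the remaining rows of the argsort result as fancy indices:
order = S2.argsort(axis=0)                       # order[0] holds the index of the minimum per column
not_min = S2[order[1:], np.arange(S2.shape[1])]  # everything except the minimum, column by column
# array([[ 3,  1,  3,  8],
#        [ 6,  2,  5, 12]])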
Maybe you can do something like
S.sort()
S[1:3]
Is this what you want?
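A minimal illustration with made-up numbers for one iteration:
S = [4, 1, 7]  # a[n], b[n], c[n] for some n
S.sort()       # S is now [1, 4, 7]
S[1:3]         # [4, 7] - the two values that are not the minimum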