Python: Counting zeros in multiple array columns and storing them efficiently - python

I create an array:
import numpy as np
arr = [[0, 2, 3], [0, 1, 0], [0, 0, 1]]
arr = np.array(arr)
Now I count every zero per column and store it in a variable:
a = np.count_nonzero(arr[:,0]==0)
b = np.count_nonzero(arr[:,1]==0)
c = np.count_nonzero(arr[:,2]==0)
This code works fine. But in my case I have many more columns with over 70000 values in each. This would mean many more lines of code and a very messy variable explorer in Spyder.
My questions:
Is there a way to make this code more efficient and save the values in a single data structure, e.g. a dictionary, DataFrame or tuple?
Can I use a loop to create the dict, DataFrame or tuple?
Thank you

You can construct a boolean array arr == 0 and then sum it along axis 0, i.e. down each column:
>>> (arr == 0).sum(0)
array([3, 1, 1])
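If, as the question asks, you also want the counts in a single data structure, one option (a sketch; the integer keys are just one choice of labels) is to pack the result into a dict:

```python
import numpy as np

arr = np.array([[0, 2, 3], [0, 1, 0], [0, 0, 1]])

# One vectorized pass: True where the entry is zero, summed down each column
zero_counts = (arr == 0).sum(axis=0)

# Pack everything into a single dict keyed by column index
counts = {col: int(n) for col, n in enumerate(zero_counts)}
print(counts)  # {0: 3, 1: 1, 2: 1}
```

This scales to any number of columns without creating one variable per column.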

Use an OrderedDict from the collections module:
from collections import OrderedDict
import numpy as np
from pprint import pprint as pp
import string

arr = np.array([[0, 2, 3], [0, 1, 0], [0, 0, 1]])
letters = string.ascii_letters
od = OrderedDict()
for i in range(arr.shape[1]):  # iterate over columns, not rows
    od[letters[i]] = np.count_nonzero(arr[:, i] == 0)
pp(od)
Returning:
OrderedDict([('a', 3), ('b', 1), ('c', 1)])
Example usage:
print(f"First number of zeros: {od.get('a')}")
Will give you:
First number of zeros: 3

To count zeros you can count the non-zeros along each column and subtract the result from the length of each column:
arr.shape[0] - np.count_nonzero(arr, axis=0)
This produces [3, 1, 1].
This solution is fast and concise, since the comparison array (arr == 0) never needs to be created explicitly.
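A runnable sketch of this approach that also answers the storage question, using a pandas Series (the column labels here are invented for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([[0, 2, 3], [0, 1, 0], [0, 0, 1]])

# Zeros per column = number of rows minus non-zeros per column
zeros = arr.shape[0] - np.count_nonzero(arr, axis=0)

# One labelled object instead of one variable per column
s = pd.Series(zeros, index=[f"col{i}" for i in range(arr.shape[1])])
print(s.to_dict())  # {'col0': 3, 'col1': 1, 'col2': 1}
```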


pandas filter using combinations of boolean series

I have 7 boolean series of equal length:
msk_valid_structure
msk_submission_context
msk_reference_substance_location
msk_neutral
msk_identifier_origin
msk_two_conversion_methods
msk_no_error_warnings
I need to combine them with a logical AND (&) so that:
we always include the first one (msk_valid_structure)
we include 5 out of 6 of the remaining ones (all combinations)
I tried two solutions and I am not happy with either.
The first one uses reduce and is quite slow:
from itertools import combinations
from functools import reduce
for combination in combinations(msks, 5):
    res = reduce(lambda x, y: x & y, combination, msk_valid_structure)
The second one constructs a DataFrame per combination, which is also slow:
from itertools import combinations
for combination in combinations(msks, 5):
    tmp = pd.DataFrame({i: col for i, col in enumerate(list(combination) + [msk_valid_structure])})
    res = tmp.all(axis='columns')
How would you handle this situation?
Many thanks for your help.
Generate a DataFrame from your series and then use apply. There you can write a custom function that adds a column with the result valid/invalid, or act directly when the condition is fulfilled.
This has the benefit of keeping everything in one structure and a single pass over the rows, which is considerably faster than looping over every combination separately.
With:
import pandas as pd

# Creating the DataFrame
test = [[0, 0, 1, 0, 0, 1, 1],
        [1, 0, 1, 0, 1, 0, 1],
        [0, 1, 0, 1, 1, 1, 1],
        [1, 1, 0, 1, 1, 1, 1]]
cols = ['msk_valid_structure',
        'msk_submission_context',
        'msk_reference_substance_location',
        'msk_neutral',
        'msk_identifier_origin',
        'msk_two_conversion_methods',
        'msk_no_error_warnings']
df_test = pd.DataFrame(test, columns=cols)

# Defining a function matching your condition
def your_func(s: pd.Series):
    valid_structure = s.msk_valid_structure
    # Because any combination is allowed, the Trues are simply summed.
    # If you only allow specific combinations, write the conditions here
    # using .col or ['col'] notation.
    rest = s.iloc[1:].sum()
    if valid_structure and (rest >= 5):
        return 1
    return 0

# Applying the function and saving the results in foo
df_test['foo'] = df_test.apply(your_func, axis=1)
# Result of foo: 0, 0, 0, 1
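If speed matters, the same condition can be written fully vectorized, without the row-wise apply (a sketch assuming, as above, that "any 5 of the 6" reduces to a row sum of at least 5):

```python
import pandas as pd

test = [[0, 0, 1, 0, 0, 1, 1],
        [1, 0, 1, 0, 1, 0, 1],
        [0, 1, 0, 1, 1, 1, 1],
        [1, 1, 0, 1, 1, 1, 1]]
df = pd.DataFrame(test)

# Row is valid iff the first mask holds AND at least 5 of the other 6 do
valid = df.iloc[:, 0].astype(bool) & (df.iloc[:, 1:].sum(axis=1) >= 5)
print(valid.astype(int).tolist())  # [0, 0, 0, 1]
```

Column-wise operations like these stay inside NumPy and avoid calling a Python function per row.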

efficient way to convert pandas series of strings to numpy frequency matrix

I have:
a list of english chars, a to z: char_list
series of strings
I want to create a numpy matrix where each row corresponds to the string at the same position in the series, and each column to the character at the same index in the list.
example:
series: ['ab', 'ac', 'aa']
chars = ['a', 'b', 'c']
result = [[1, 1, 0], [1, 0, 1], [2, 0, 0]]
here is how I do it now:
def create_char_matrix(strings, symbol_list):
    mat = np.zeros((strings.shape[0], len(symbol_list)))
    for i, line in enumerate(strings):
        for c in line:
            mat[i, symbol_list.index(c)] += 1
    return mat
This is not very fast, and there is usually a better solution than nested for loops.
Any ideas about how to accelerate this process?
You can split the strings into characters, then use crosstab:
s = pd.Series(['ab', 'ac', 'aa'])
chars = ['a', 'b', 'c']
a = s.str.split('').str[1:-1].explode()
pd.crosstab(a.index, a).reindex(chars, axis=1, fill_value=0).values
Output:
array([[1, 1, 0],
       [1, 0, 1],
       [2, 0, 0]])
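For large inputs, a pure-NumPy sketch can avoid pandas indexing overhead entirely. It assumes all strings have the same length and contain only ASCII characters:

```python
import numpy as np
import pandas as pd

s = pd.Series(['ab', 'ac', 'aa'])
chars = ['a', 'b', 'c']

# View the concatenated strings as byte codes, one row per string
codes = np.frombuffer(''.join(s).encode('ascii'), dtype=np.uint8).reshape(len(s), -1)
targets = np.frombuffer(''.join(chars).encode('ascii'), dtype=np.uint8)

# Compare every position against every symbol, then sum over positions
mat = (codes[:, :, None] == targets[None, None, :]).sum(axis=1)
print(mat)  # [[1 1 0]
            #  [1 0 1]
            #  [2 0 0]]
```

For variable-length strings the reshape step would have to be replaced, e.g. by a per-string np.bincount.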

Why doesn't count work for this list? (in Python)

In the following code the result in "countOf1" is 0 instead of 12. What is the reason and how can I solve it?
import numpy as np
import pandas as pd
x = np.matrix(np.arange(12).reshape((1, 12)))
x[:,:]=1
countOf1=(x.tolist()).count(1)
It's because when you convert the matrix to a list with tolist() you get a nested list (a list containing one inner list). This is your x:
x.tolist()
Out[221]: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
To get your countOf1 to work you'll need to do it for x.tolist()[0]. This will give you:
x.tolist()[0].count(1)
Out[223]: 12
Remember that a numpy matrix is like a list of lists. Even though you only created one row vector, numpy writes it with two brackets ([[0, 1, 2, ..., 11]]). So when you converted it to a list with tolist(), you created a list within a list. Since the inner list itself is not equal to 1, the count is 0.
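An alternative that sidesteps the nesting entirely is to count on the array itself; a sketch using the same x as in the question:

```python
import numpy as np

x = np.matrix(np.arange(12).reshape((1, 12)))
x[:, :] = 1

# count_nonzero on the boolean comparison works regardless of nesting
count_of_1 = int(np.count_nonzero(x == 1))
print(count_of_1)  # 12
```

This also avoids the tolist() round trip, which matters for large arrays.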

How to replace one column by a value in a numpy array?

I have an array like this:
import numpy as np
a = np.zeros((2, 2), dtype=int)
I want to replace the first column with the value 1. I tried the following:
a[:][0] = [1, 1] # not working
a[:][0] = [[1], [1]] # not working
Conversely, replacing a row worked:
a[0][:] = [1, 1] # working
I have a big array, so I cannot replace value by value.
You can replace the first column as follows:
>>> a = np.zeros((2, 2), dtype=int)
>>> a[:, 0] = 1
>>> a
array([[1, 0],
       [1, 0]])
Here a[:, 0] means "select all rows from column 0". The value 1 is broadcast across this selected column, producing the desired array (it's not necessary to use a list [1, 1], although you can).
Your syntax a[:][0] means "select all the rows from the array a and then select the first row". Similarly, a[0][:] means "select the first row of a and then select this entire row again". This is why you could replace the rows successfully, but not the columns - it's necessary to make a selection for axis 1, not just axis 0.
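A quick sketch making the difference between the two indexing styles concrete:

```python
import numpy as np

a = np.zeros((2, 2), dtype=int)
# a[:] is a view of the whole array, so a[:][0] selects the FIRST ROW
a[:][0] = [1, 1]
print(a)  # row 0 changed, column 0 did not

b = np.zeros((2, 2), dtype=int)
# b[:, 0] addresses column 0 directly; the scalar broadcasts across it
b[:, 0] = 1
print(b)
```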
You can do something like this:
import numpy as np
a = np.zeros((2, 2), dtype=int)
a[:, 0] = np.ones((1, 2), dtype=int)
Please refer to Accessing np matrix columns
Select the intended column using proper indexing and just assign the value to it with =. NumPy will take care of the rest for you.
>>> a[::,0] = 1
>>> a
array([[1, 0],
[1, 0]])
Read more about numpy indexing.

Python - how to find numbers in a list which are not the minimum

I have a list S = [a[n], b[n], c[n]], and for n=0 the minimum of S is the value a. How do I select the values b and c, given that I know the minimum? The code I'm writing runs through many iterations of n, and I want to examine the elements which are not the minimum in a given iteration of the loop.
Python 2.7.3, 32-bit. Numpy 1.6.2. Scipy 0.11.0b1
If you can flatten the whole list into a numpy array, you can use argsort; the first row of the result tells you which array contains the minimum value at each position:
import numpy as np
a = [1, 2, 3, 4]
b = [3, -4, 5, 8]
c = [6, 1, -7, 12]
S = [a, b, c]
S2 = np.array(S)
S2.argsort(axis=0)
array([[0, 1, 2, 0],
       [1, 2, 0, 1],
       [2, 0, 1, 2]])
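Building on the argsort result, a sketch that drops each column's minimum and keeps the rest (assuming the same S2 as above):

```python
import numpy as np

S2 = np.array([[1, 2, 3, 4],
               [3, -4, 5, 8],
               [6, 1, -7, 12]])

# Reorder each column by its argsort, then drop the first (smallest) row
order = S2.argsort(axis=0)
sorted_cols = np.take_along_axis(S2, order, axis=0)
non_min = sorted_cols[1:]  # every value except each column's minimum
print(non_min)
```

np.take_along_axis requires NumPy 1.15+; on older versions the same reordering can be done with fancy indexing.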
Maybe you can do something like:
S.sort()
S[1:3]
Is this what you want? Note that sort() modifies S in place.
