Count mismatches in list of lists - python

I have two lists, arr_list1 and arr_list2, which are both lists of lists and are exactly the same size. I need to count how many elements differ between the two, per inner list. For example:
arr_list1 = [[0,1,1],[0,1,0],[1,0,1]]
arr_list2 = [[0,1,0],[1,1,1],[1,0,1]]
I would like to get result = (1,2,0).
Is there a 'simple' way of doing this that doesn't require loops?

import numpy as np
arr_list1 = [[0,1,1],[0,1,0],[1,0,1]]
arr_list2 = [[0,1,0],[1,1,1],[1,0,1]]
print(np.sum(np.asarray(arr_list1) != np.asarray(arr_list2), axis=1))

Just iterate over the two lists and use sum() to count the mismatches:
a1 = [[0,1,1],[0,1,0],[1,0,1]]
a2 = [[0,1,0],[1,1,1],[1,0,1]]
print([sum([a1[i][j]!=a2[i][j] for j in range(len(a1[i]))]) for i in range(len(a1))])
Output:
[1, 2, 0]

You can use zip to compare the differences:
arr_list1 = [[0,1,-1],[0,1,0],[1,0,1]]
arr_list2 = [[0,1,0],[1,1,1],[1,0,1]]
def get_differences(arr_list1, arr_list2):
    all_differences = []
    # pair up the inner lists, then the items inside each pair
    for a, b in zip(arr_list1, arr_list2):
        sum_differences = 0
        for a_item, b_item in zip(a, b):
            if a_item != b_item:
                sum_differences += 1
        all_differences.append(sum_differences)
    return all_differences
print(get_differences(arr_list1, arr_list2))
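The nested loops above can also be collapsed into a single comprehension. This is only a minimal sketch, assuming both arguments are equal-sized lists of lists; the name count_mismatches is hypothetical and not taken from the answers above.

```python
def count_mismatches(rows_a, rows_b):
    """Count, row by row, how many positions differ between two lists of lists."""
    # zip pairs up the rows, then the items inside each pair of rows;
    # each True from the comparison counts as 1 when summed.
    return [
        sum(x != y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(rows_a, rows_b)
    ]


arr_list1 = [[0, 1, 1], [0, 1, 0], [1, 0, 1]]
arr_list2 = [[0, 1, 0], [1, 1, 1], [1, 0, 1]]
print(count_mismatches(arr_list1, arr_list2))  # [1, 2, 0]
```

Note that zip stops at the shorter input, so rows of unequal length would be silently truncated rather than reported.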

Related

How to find steps in a vector (1d array, list) in Python?

I want to get the borders of the data in a list using Python.
For example, I have this list:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
I want code that returns the data borders. For example:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
     ^       ^     ^         ^
b = get_border_index(a)
print(b)
Output:
[0,4,7,12]
How can I implement a get_border_index(lst: list) -> list function?
The scalable answer that also works for very long lists or arrays is to use np.diff. In that case you should avoid a for loop at all costs.
import numpy as np
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
a = np.array(a)
# the difference is nonzero wherever there is a step
d = np.diff(a)
# boolean array marking where the steps are
is_step = d != 0
# get the indices of the steps (the trivial one at index 0 is not included)
ics = np.where(is_step)
# take the first dimension and shift by one, because you want
# the index of the element to the right of the step
ics_shift = ics[0] + 1
# and if you need a list
ics_list = ics_shift.tolist()
print(ics_list)
You can use a for loop with enumerate:
def get_border_index(a):
    last_value = None
    result = []
    for i, v in enumerate(a):
        if v != last_value:
            last_value = v
            result.append(i)
    return result
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
b = get_border_index(a)
print(b)
Output:
[0, 4, 7, 12]
This code checks whether each element of the list a differs from the element before it and, if so, appends that element's index to the result list.
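One caveat with using None as the initial value: a list that itself starts with None would not report index 0. A sketch that compares neighbours directly instead, assuming the input supports slicing (the name border_indices is hypothetical):

```python
def border_indices(lst):
    """Return the indices at which a new run of equal values starts."""
    if not lst:
        return []
    # Index 0 always starts a run; after that, an index is a border
    # whenever its value differs from the value just before it.
    return [0] + [
        i
        for i, (prev, cur) in enumerate(zip(lst, lst[1:]), start=1)
        if prev != cur
    ]


a = [1, 1, 1, 1, 4, 4, 4, 6, 6, 6, 6, 6, 1, 1, 1]
print(border_indices(a))  # [0, 4, 7, 12]
```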

Find values in list which differ from reference list by up to N characters

I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test if they are N or fewer characters different from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements fit this criterion and should be returned.
It should be noted that I am only looking for values of the same character length (ASDFGY -> ASDFG matching doesn't work for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in Ref and a couple hundred million in Test, so efficiency is key.
Using a generator expression with sum:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)
# Note: zip compares each Test value only with the Ref value at the same position.
res = [a for a, b in zip(Test, Ref) if comparer(a, b, 1)]
print(res)
['ASDFGH', 'QWERTYU']
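The question asks for a match against any item in Ref, not only the item at the same position. A rough sketch of that variant, assuming only same-length strings may match; the names hamming_within and filter_close are hypothetical. Whether it is fast enough for hundreds of millions of values is not something this sketch establishes.

```python
from collections import defaultdict


def hamming_within(x, y, n):
    """True if x and y have equal length and differ in at most n positions."""
    if len(x) != len(y):
        return False
    mismatches = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            mismatches += 1
            if mismatches > n:  # stop early once the budget is exceeded
                return False
    return True


def filter_close(test, ref, n):
    """Keep the values of test that are within n substitutions of any ref value."""
    # Group the reference strings by length so each test value is only
    # compared with candidates that could possibly match.
    refs_by_len = defaultdict(list)
    for r in ref:
        refs_by_len[len(r)].append(r)
    return [
        t for t in test
        if any(hamming_within(t, r, n) for r in refs_by_len.get(len(t), ()))
    ]


Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(filter_close(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
print(filter_close(Test, Ref, 2))  # ['ASDFGH', 'QWERTYU', 'ZXCVB']
```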
Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i, v in zip(Test, Ref):
    c = 0
    for j, s in enumerate(difflib.ndiff(i, v)):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append(i)
print(result)
Output:
['ASDFGH', 'QWERTYU']
The newer regex module offers a "fuzzy" match possibility:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To have pairs, you could write
pairs = [(origin, difference)
         for origin in Test
         for rx in [re.compile(rf"({origin}){{s<=3}}")]
         for difference in Ref
         if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
 ('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
 ('ZXCVB', 'ZXCAA')]

Python: how to find intersection of two lists (I need the length of the intersection, actually) if there are duplicates?

Suppose I have these two lists:
a = ['a','b','c','a','a']
b = ['a','b','d']
I need to calculate the Jaccard distance = (union - intersect) / union, but I know there are going to be duplicates in each list, and I want to count them. So the intersection length for this example would be 2, and the Jaccard distance = (8-2)/8.
How can I do that? My first thought is to join the lists and then remove elements one by one...
UPDATE:
I probably should have stressed more that I need to count duplicates.
Here is my working solution, but it is quite ugly:
a = [1,2,3,1,1]
b = [2,1,1, 6,5]
import collections
aX = collections.Counter(a)
bX = collections.Counter(b)
r1 = [x for x in aX if x in bX]
print(r1)
print(sum((min(aX[x], bX[x]) for x in r1)))
>>> 3
a = ['a','b','c','a','a']
b = ['a','b','d']
c = list(set(b).intersection(a))
['a','b']
Note that sets will discard duplicates!
To get the Jaccard distance between two lists a and b (using sets, so duplicates are ignored):
def jaccard_distance(a,b):
    a = set(a)
    b = set(b)
    c = a.intersection(b)
    return float(len(a) + len(b) - len(c)) /(len(a) + len(b))
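Since sets drop duplicates, a duplicate-aware intersection can instead be taken with collections.Counter, whose & operator keeps the minimum count of each element. The sketch below assumes the question's own definition of "union" (the combined length of both lists), which is not the usual set-based Jaccard definition; the function names are hypothetical.

```python
from collections import Counter


def multiset_intersection_size(a, b):
    """Number of shared elements, counting duplicates."""
    # Counter & Counter keeps min(count_a, count_b) for every element.
    return sum((Counter(a) & Counter(b)).values())


def multiset_jaccard_distance(a, b):
    """(union - intersect) / union, with union taken as len(a) + len(b)."""
    union = len(a) + len(b)  # assumption: the question's definition of union
    if union == 0:
        return 0.0
    return (union - multiset_intersection_size(a, b)) / union


a = ['a', 'b', 'c', 'a', 'a']
b = ['a', 'b', 'd']
print(multiset_intersection_size(a, b))  # 2
print(multiset_jaccard_distance(a, b))   # 0.75
```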

How to mathematically subtract items in two lists in Python and only output those items which meet a condition?

I have two lists that are already sorted from low to high:
A=['40','60','80']
B=['10','42','100']
I want to subtract every item in B from each item in A. Then, if the difference between two values meets a condition (specifically, if it is less than 5), both items should be deleted from their lists, so that the output is:
A=['60','80']
B=['10','100']
Note: sometimes the lists are unequal in length, and sometimes there is only one item in each list.
I have found many ways to subtract items in lists, but I do not know how to retrieve and delete the specific items from the specified lists; the approaches I found also only subtract each item from the corresponding item in the other list:
Using lambda:
if list(imap(lambda m, n: m-n < 5, A, B)) == True:
Using imap,sub
list(imap(sub, A, B)):
Using Numpy
M = np.array([A])
N = np.array([B])
c = abs(M-N)
Many thanks.
Without using numpy:
A = ["40", "60", "80"]
B = ["10", "42", "100"]
# Keep an item only if it is at least 5 away from every item in the other list.
newA = [a for a in A if all(abs(int(a) - int(b)) >= 5 for b in B)]
newB = [b for b in B if all(abs(int(a) - int(b)) >= 5 for a in A)]
print(newA)
print(newB)
# Assumes A and B hold integers; convert string items with int() first.
A_dict = {}
B_dict = {}
# Map each value to the list of indices where it occurs.
for i in range(len(A)):
    if A[i] not in A_dict:
        A_dict[A[i]] = []
    A_dict[A[i]].append(i)
for i in range(len(B)):
    if B[i] not in B_dict:
        B_dict[B[i]] = []
    B_dict[B[i]].append(i)
# Clear the index lists of any A/B values that differ by less than 5.
for x in B_dict:
    for d in range(-4, 5):
        if x + d in A_dict:
            B_dict[x] = []
            A_dict[x + d] = []
A_new_idx = []
B_new_idx = []
for x in A_dict:
    A_new_idx.extend(A_dict[x])
for x in B_dict:
    B_new_idx.extend(B_dict[x])
A_new = [A[i] for i in sorted(A_new_idx)]
B_new = [B[i] for i in sorted(B_new_idx)]
This runs in O(n log n) time: the maximum difference to check is a constant (5), so each value needs only a constant number of dictionary lookups, and the final sort dominates. It should be much faster than the naive O(n^2) approach.
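Because both lists are already sorted, a binary search is another option. The following is only a sketch, assuming the items are strings holding integers and that each list is sorted by numeric value; has_close and drop_close_pairs are hypothetical names.

```python
from bisect import bisect_right


def has_close(value, sorted_vals, limit):
    """True if some element x of sorted_vals satisfies abs(value - x) < limit."""
    # Index of the first element strictly greater than value - limit.
    i = bisect_right(sorted_vals, value - limit)
    return i < len(sorted_vals) and sorted_vals[i] < value + limit


def drop_close_pairs(A, B, limit=5):
    """Remove from A and B every item that is within limit of an item in the other list."""
    a_vals = [int(a) for a in A]
    b_vals = [int(b) for b in B]
    new_a = [a for a, v in zip(A, a_vals) if not has_close(v, b_vals, limit)]
    new_b = [b for b, v in zip(B, b_vals) if not has_close(v, a_vals, limit)]
    return new_a, new_b


A = ['40', '60', '80']
B = ['10', '42', '100']
print(drop_close_pairs(A, B))  # (['60', '80'], ['10', '100'])
```

Each lookup is a binary search, so the work is O((n + m) log(n + m)) for lists of lengths n and m, and lists of unequal length or with a single item need no special handling.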

Python: Check the occurrences in a list against a value

lst = [1,2,3,4,1]
I want to know whether 1 occurs twice in this list. Is there an efficient way to do this?
lst.count(1) would return the number of times it occurs. If you're going to be counting items in a list, O(n) is what you're going to get.
The general function on the list is list.count(x), and will return the number of times x occurs in a list.
Are you asking whether every item in the list is unique?
len(set(lst)) == len(lst)
Whether 1 occurs more than once?
lst.count(1) > 1
Note that the above is not maximally efficient, because it won't short-circuit -- even if 1 occurs twice, it will still count the rest of the occurrences. If you want it to short-circuit you will have to write something a little more complicated.
Whether the first element occurs more than once?
lst[0] in lst[1:]
How often each element occurs?
import collections
collections.Counter(lst)
Something else?
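The short-circuiting check mentioned above (stop counting as soon as enough occurrences have been seen) could look roughly like this. It is a sketch rather than a definitive implementation, and occurs_at_least is a hypothetical name.

```python
from itertools import islice


def occurs_at_least(lst, value, k=2):
    """True if value occurs at least k times; stops scanning after the k-th hit."""
    # The generator yields one 1 per match; islice stops pulling from it
    # (and therefore stops scanning lst) once k matches have been produced.
    matches = (1 for item in lst if item == value)
    return sum(islice(matches, k)) == k


lst = [1, 2, 3, 4, 1]
print(occurs_at_least(lst, 1))  # True
print(occurs_at_least(lst, 2))  # False
```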
For multiple occurrences, this gives you the index of each occurrence:
>>> lst=[1,2,3,4,5,1]
>>> tgt=1
>>> found=[]
>>> for index, suspect in enumerate(lst):
...     if(tgt==suspect):
...         found.append(index)
...
>>> print(len(found), "found at index:", ", ".join(map(str,found)))
2 found at index: 0, 5
If you want the count of each item in the list:
>>> lst=[1,2,3,4,5,2,2,1,5,5,5,5,6]
>>> count={}
>>> for item in lst:
...     count[item]=lst.count(item)
...
>>> count
{1: 2, 2: 3, 3: 1, 4: 1, 5: 5, 6: 1}
def valCount(lst):
    # count how many times each value occurs
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res
u = [ x for x,y in valCount(lst).items() if y > 1 ]
u is now a list of all values which appear more than once.
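The same list of repeated values can be built on top of collections.Counter from the standard library. A small sketch, with repeated_values as a hypothetical name:

```python
from collections import Counter


def repeated_values(lst):
    """Return the values that appear more than once in lst."""
    # Counter does the same tallying as the hand-written dictionary loop.
    return [value for value, count in Counter(lst).items() if count > 1]


lst = [1, 2, 3, 4, 5, 2, 2, 1, 5, 5, 5, 5, 6]
print(sorted(repeated_values(lst)))  # [1, 2, 5]
```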
Edit:
@katrielalex: thank you for pointing out collections.Counter, of which I was not previously aware. It can also be written more concisely using a collections.defaultdict, as demonstrated in the following tests. All three methods are roughly O(n) and reasonably close in run-time performance (using collections.defaultdict is in fact slightly faster than collections.Counter).
My intention was to give an easy-to-understand response to what seemed a relatively unsophisticated request. Given that, are there any other senses in which you consider it "bad code" or "done poorly"?
import collections
import random
import time
def test1(lst):
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res
def test2(lst):
    res = collections.defaultdict(lambda: 0)
    for v in lst:
        res[v] += 1
    return res
def test3(lst):
    return collections.Counter(lst)
def rndLst(lstLen):
    r = random.randint
    return [r(0,lstLen) for i in range(lstLen)]
def timeFn(fn, *args):
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    st = time.perf_counter()
    res = fn(*args)
    return time.perf_counter() - st
def main():
    reps = 5000
    res = []
    tests = [test1, test2, test3]
    for t in range(reps):
        lstLen = random.randint(10,50000)
        lst = rndLst(lstLen)
        res.append( [lstLen] + [timeFn(fn, lst) for fn in tests] )
    res.sort()
    return res
The results, for random lists containing up to 50,000 items, were shown in a plot (not reproduced here) with time in seconds on the vertical axis and the number of items in the list on the horizontal axis.
Another way to get all items that occur more than once:
lst = [1,2,3,4,1]
d = {}
for x in lst:
    # True once x has been seen before, False on its first appearance
    d[x] = x in d
print(d[1]) # True
print(d[2]) # False
print([x for x in d if d[x]]) # [1]
You could also sort the list, which is O(n*log(n)), then check adjacent elements for equality, which is O(n). The overall result is O(n*log(n)). This has the disadvantage of requiring the entire list to be sorted before it can bail out when a duplicate is found.
For a large list with relatively rare duplicates, this could be about the best you can do. The best approach really does depend on the size and nature of the data involved.
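The sort-then-compare idea described above can be sketched as follows, assuming the items are mutually comparable so that sorted() works; has_duplicate is a hypothetical name.

```python
def has_duplicate(lst):
    """True if any value occurs more than once in lst."""
    ordered = sorted(lst)  # O(n log n); equal values end up adjacent
    # One linear pass over neighbouring pairs; any() stops at the first match.
    return any(x == y for x, y in zip(ordered, ordered[1:]))


print(has_duplicate([1, 2, 3, 4, 1]))  # True
print(has_duplicate([1, 2, 3, 4]))     # False
```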
