Hamming distance (Simhash, Python) giving an unexpected value

I was checking out the Simhash module (https://github.com/leonsim/simhash).
I presume that Simhash("String").distance(Simhash("Another string")) is the Hamming distance between the two strings. However, I am not sure I completely understand the get_features(string) method shown in https://leons.im/posts/a-python-implementation-of-simhash-algorithm/:
import re

def get_features(s):
    width = 2
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
Now, when I compute the distance between "aaaa" and "aaas" using width 2, I get a distance of 0:
from simhash import Simhash
Simhash(get_features("aaas")).distance(Simhash(get_features("aaaa")))
I am not sure what I am missing here.

Dig into code
The width is the key parameter in get_features(): it determines how the string is split into features. In your case, get_features() returns the following for "aaaa" and "aaas" respectively:
['aa', 'aa', 'aa']
['aa', 'aa', 'as']
Simhash then treats these lists as unweighted features (that is, each feature gets the default weight of 1) and outputs:
86f24ba207a4912
86f24ba207a4912
They are the same!
The reason lies in the simhash algorithm itself. Let's look at the code:
def build_by_features(self, features):
    """
    `features` might be a list of unweighted tokens (a weight of 1
    will be assumed), a list of (token, weight) tuples or
    a token -> weight dict.
    """
    v = [0] * self.f
    masks = [1 << i for i in range(self.f)]
    if isinstance(features, dict):
        features = features.items()
    for f in features:
        if isinstance(f, basestring):
            h = self.hashfunc(f.encode('utf-8'))
            w = 1
        else:
            assert isinstance(f, collections.Iterable)
            h = self.hashfunc(f[0].encode('utf-8'))
            w = f[1]
        for i in range(self.f):
            v[i] += w if h & masks[i] else -w
    ans = 0
    for i in range(self.f):
        if v[i] >= 0:
            ans |= masks[i]
    self.value = ans
from: leonsim/simhash
The calculation can be divided into 4 steps:
1) hash each feature (split word) to turn the string into a binary number;
2) weight them;
3) sum the weighted bits position by position (+w where the hash bit is 1, -w where it is 0);
4) turn the summed vector back into a binary number, setting bit i to 1 where the sum is non-negative, and output it as the value.
In your case, step 3 produces the following vectors for "aaaa" and "aaas":
[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3]
[-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]
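Every position has the same sign in both vectors (the single 'as' can never outvote the two 'aa' features), so after step 4 the two inputs produce the same value.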
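The four steps can be sketched in a few lines. This is only an illustration, assuming MD5 truncated to 64 bits as the hash; the library's actual hash function may differ, and `toy_simhash` and `hamming` are hypothetical names, not part of the simhash package. Because 'aa' occurs twice and 'as' once, every per-bit sum for "aaas" is ±2 ± 1, which always keeps the sign of the 'aa' bit, so the result does not depend on the hash chosen.

```python
import hashlib

def toy_simhash(features, bits=64):
    """Unweighted sign-sum fingerprint (illustrative, not the library code)."""
    v = [0] * bits
    for feat in features:
        # Assumption: MD5 stands in for whatever hash the library uses.
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1   # steps 1-3: hash, weight 1, sum
    value = 0
    for i in range(bits):
        if v[i] >= 0:                            # step 4: sign -> bit
            value |= 1 << i
    return value

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

fp_aaaa = toy_simhash(["aa", "aa", "aa"])
fp_aaas = toy_simhash(["aa", "aa", "as"])
print(hamming(fp_aaaa, fp_aaas))  # 0
```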
Other parameter
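If you change the width from 2 to 1, 3, or 4, get_features() produces different features, and Simhash(get_features(...)) therefore produces a different value.
Your case illustrates a limitation of simhash on short texts: with so few features, a single differing feature is easily outvoted by the others.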
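To see what each width does to these two strings, here is a small sketch that exposes the width as a parameter. `get_features_w` is a hypothetical variant of the function from the question, not something the library provides.

```python
import re

def get_features_w(s, width):
    """Same as get_features() from the question, with width as an argument."""
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

for width in (1, 2, 3, 4):
    print(width, get_features_w("aaaa", width), get_features_w("aaas", width))
# 1 ['a', 'a', 'a', 'a'] ['a', 'a', 'a', 's']
# 2 ['aa', 'aa', 'aa'] ['aa', 'aa', 'as']
# 3 ['aaa', 'aaa'] ['aaa', 'aas']
# 4 ['aaaa'] ['aaas']
```

Under the sign-sum rule quoted above, width 1 still leaves the single 's' outvoted three to one by 'a', so the two fingerprints would coincide there as well. With widths 3 and 4 no feature holds such a majority, so the fingerprints are free to differ.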

Related

How to check if occurrences of identical consecutive numbers is below a threshold in pandas series

I need to check whether the number of identical consecutive values stays below a certain threshold, e.g. at most two identical consecutive numbers.
pd.Series(data=[-1, -1, 2, -2, 2, -2, 1, 1]) # True
pd.Series(data=[-1, -1, -1, 2, 2, -2, 1, 1]) # False
Further checks:
Only the numbers +1 and -1 may occur consecutively, and at most twice in a row.
pd.Series(data=[-1, 1, -2, 2, -2, 2, -1, 1]) # True
pd.Series(data=[1, 1, -2, 2, -2, 2, -1, 1]) # True
pd.Series(data=[-1, -1, 2, 2, -2, 1, 1, -2]) # False
pd.Series(data=[-1, 1, -2, -2, 1, -1, 2, -2]) # False
You can use the shift method along with Boolean indexing to achieve this. The idea is to compare each element with the previous one: if they are equal and the value is not +1 or -1, return False. A value that repeats twice in a row (three identical consecutive numbers) also returns False.
Here's an example implementation:
import pandas as pd

def check_consecutive(series):
    repeat = series == series.shift()    # same value as the previous element
    unit = series.abs() == 1             # value is +1 or -1
    bad_value = (repeat & ~unit).any()   # a repeated value other than +1/-1
    triple = (repeat & repeat.shift(fill_value=False)).any()  # three in a row
    return not (bad_value or triple)

print(check_consecutive(pd.Series(data=[-1, -1, 2, -2, 2, -2, 1, 1]))) # True
print(check_consecutive(pd.Series(data=[-1, -1, -1, 2, 2, -2, 1, 1]))) # False
print(check_consecutive(pd.Series(data=[-1, 1, -2, 2, -2, 2, -1, 1]))) # True
print(check_consecutive(pd.Series(data=[1, 1, -2, 2, -2, 2, -1, 1]))) # True
print(check_consecutive(pd.Series(data=[-1, -1, 2, 2, -2, 1, 1, -2]))) # False
print(check_consecutive(pd.Series(data=[-1, 1, -2, -2, 1, -1, 2, -2]))) # False
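The two rules from the question can also be checked without pandas by measuring run lengths directly. This is a minimal sketch using itertools.groupby; `check_runs` and its parameters `max_run` and `repeatable` are hypothetical names chosen for illustration, and the inputs are plain lists rather than Series.

```python
from itertools import groupby

def check_runs(values, max_run=2, repeatable=(1, -1)):
    """True if no run exceeds max_run and only `repeatable` values repeat."""
    for value, group in groupby(values):
        run = sum(1 for _ in group)          # length of this run of equal values
        if run > max_run or (run > 1 and value not in repeatable):
            return False
    return True

print(check_runs([-1, -1, 2, -2, 2, -2, 1, 1]))   # True
print(check_runs([-1, -1, -1, 2, 2, -2, 1, 1]))   # False
print(check_runs([-1, 1, -2, -2, 1, -1, 2, -2]))  # False
```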

is there a way to turn [0, -2, 1, -3, 2, -4, 3, -5, 4, -6] into [0, 1, -2, 2, -3, 3, -4, 4, -5, -6]

In short, I had an assignment whose final output should be a list of numbers. In some cases I have the right numbers in my list, just in the wrong order. I don't want to change my entire code since it is quite long, so I was wondering whether there is a trick or function I can use to sort the list in ascending order without putting the negative numbers first.
For example, turn [0, -2, 1, -3, 2, -4, 3, -5, 4, -6] into [0, 1, -2, 2, -3, 3, -4, 4, -5, -6]
Another example, turn [0, 1, 5, 9, -2, 2, -5, -9, -1] into [0, -1, 1, -2, 2, -5, 5, -9, 9]
Another example, turn [12,-12] into [-12,12]
Here you can see that the list is sorted in ascending order, but it starts from 0 rather than from a negative number.
You can use:
lst = [0, -2, 1, -3, 2, -4, 3, -5, 4, -6]
print(sorted(lst, key=lambda x: (abs(x), x)))
Output:
[0, 1, -2, 2, -3, 3, -4, 4, -5, -6]
Basically, you sort the items by two priorities. The first is the absolute value, so that 2 and -2 end up next to each other. You also mentioned that you want -2 to come before 2; the second item of the tuple takes care of that. You are actually comparing tuples: when the first items are equal, the comparison continues with the second items (in ascending order):
lst = [(2, 2), (2, -2)]
print(sorted(lst)) # [(2, -2), (2, 2)]
You can use the key parameter of sorted with a tuple of the absolute value and the value itself. This sorts by absolute value and, in case of a tie, puts the negative number first.
out = sorted(your_lst, key=lambda x: (abs(x), x))
Output on the first example:
[0, 1, -2, 2, -3, 3, -4, 4, -5, -6]
If you want to sort in place:
your_lst.sort(key=lambda x: (abs(x), x))
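As a quick check, the same key can be run against all three examples from the question. The helper name `abs_then_value` is hypothetical and only wraps the lambda used above.

```python
def abs_then_value(x):
    # Sort by magnitude first; on ties the smaller (negative) value wins.
    return (abs(x), x)

examples = [
    [0, -2, 1, -3, 2, -4, 3, -5, 4, -6],
    [0, 1, 5, 9, -2, 2, -5, -9, -1],
    [12, -12],
]
results = [sorted(lst, key=abs_then_value) for lst in examples]
for result in results:
    print(result)
# [0, 1, -2, 2, -3, 3, -4, 4, -5, -6]
# [0, -1, 1, -2, 2, -5, 5, -9, 9]
# [-12, 12]
```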

how to sum just 1 previous value in a list

I have a list and want to add the value at the previous index (index - 1) to the value at the current index, for the whole list.
list = [-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2]
Expected output:
new_list =[-2,-4,-3, 0, 0, 0, 4, 8, 11, 4, -3, -1, -2, -3, -3, 0]
new_list[0] = 0+ list[0] = 0+ (-2) = -2
new_list[1] = list[0] + list[1] = (-2) + (-2) = -4
new_list[2] = list[1] + list[2] = (-2)+ (-1) = -3
new_list[3] = list[2] + list[3] = (-1)+ (1) = 0
Basically new_list[index] = list[index -1] + list[index]
list1 = [-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2]
new_list = [list1[0]]
for i in range(len(list1) - 1):
    value = list1[i] + list1[i + 1]
    new_list.append(value)
print(new_list)
Output: [-2, -4, -3, 0, 0, 0, 4, 8, 11, 4, -3, -1, -2, -3, -3, 0]
You have to iterate over the list and add the numbers like so:
values = [-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2]  # named `values` to avoid shadowing the built-in list
new_list = [values[0]]  # the first element has no predecessor, so it is copied as-is
for number, element in enumerate(values[1:]):
    # number starts at 0, so values[number] is the element just before `element`
    new_list.append(element + values[number])
Or a more Pythonic way:
new_list = [values[0]] + [element + values[number] for number, element in enumerate(values[1:])]
If I understand your requirement correctly, you can do this quite easily with pandas. For example:
import pandas as pd
# Create a pandas Series of values
s = pd.Series([-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2])
# Add the current value in the series to the 'shifted' (previous) value.
output = s.add(s.shift(1), fill_value=0).tolist()
# Display the output.
print(output)
Output:
[-2.0, -4.0, -3.0, 0.0, 0.0, 0.0, 4.0, 8.0, 11.0, 4.0, -3.0, -1.0, -2.0, -3.0, -3.0, 0.0]
>>> list = [-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2]
>>> list_length = len(list)
>>> result_list = [list[0]]
>>> for i in range(list_length):
...     if not (i+1) == list_length:
...         result_list.append(list[i] + list[i+1])
...
>>> result_list
[-2, -4, -3, 0, 0, 0, 4, 8, 11, 4, -3, -1, -2, -3, -3, 0]
The above solves your problem.
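Another way to write the same pairwise sum is to zip the list with itself shifted by one position. This is a sketch rather than a replacement for the answers above; it assumes a non-empty list, and the name `values` is used to avoid shadowing the built-in list.

```python
values = [-2, -2, -1, 1, -1, 1, 3, 5, 6, -2, -1, 0, -2, -1, -2, 2]

# zip(values, values[1:]) yields (previous, current) pairs.
new_list = [values[0]] + [prev + cur for prev, cur in zip(values, values[1:])]
print(new_list)
# [-2, -4, -3, 0, 0, 0, 4, 8, 11, 4, -3, -1, -2, -3, -3, 0]
```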

Defining a multidimensional field with nonstandard domain

I have an array a in Python, let's say a=np.array([3, 4]), and would like to define an ndarray (or something like that) of type [-3:3, -4:4], in other words, a collection x of real numbers x[-3,-4], x[-3,-3],...,x[3,4], the i'th coordinate ranging over integers between -a[i] and a[i]. If the array length is given (2 in this example), I could use
np.mgrid[-a[0]:a[0]:1.0,-a[1]:a[1]:1.0][0].
But what should I do if the length of a is unknown?
You could generate a list of ranges with
[np.arange(-x,x+1) for x in a]
I'd have to play around with mgrid, or another function in index_tricks, to figure out how to use it here. I may need to make it a tuple or pass it with a *.
mgrid wants slices, so this would replicate your first call
In [60]: np.mgrid[[slice(-x,x+1) for x in [3,4]]]
Out[60]:
array([[[-3, -3, -3, -3, -3, -3, -3, -3, -3],
[-2, -2, -2, -2, -2, -2, -2, -2, -2],
[-1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 2, 2, 2, 2, 2, 2, 2, 2, 2],
[ 3, 3, 3, 3, 3, 3, 3, 3, 3]],
[[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4]]])
which of course can be generalized to use a.
My initial arange approach works with meshgrid (producing a list of arrays):
In [71]: np.meshgrid(*[np.arange(-x,x+1) for x in [3,4]],indexing='ij')
Out[71]:
[array([[-3, -3, -3, -3, -3, -3, -3, -3, -3],
[-2, -2, -2, -2, -2, -2, -2, -2, -2],
...
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4]])]
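Generalized to an arbitrary `a`, the slice approach might look like the sketch below. The storage array `x` and the helper `offset_index` are assumptions added for illustration: NumPy arrays are indexed from 0, so a coordinate such as (-3, -4) has to be shifted by `a` before it can address the array.

```python
import numpy as np

a = np.array([3, 4])  # half-widths per axis (length 2 here)

# One slice per axis, covering -a[i] .. a[i] inclusive.
slices = tuple(slice(-int(n), int(n) + 1) for n in a)
grids = np.mgrid[slices]            # grids[i] holds the i-th coordinate

# Storage for the field values, one entry per grid point.
x = np.zeros(grids.shape[1:])

def offset_index(coords):
    """Hypothetical helper: map coordinates in [-a, a] to array positions."""
    return tuple(int(c) + int(n) for c, n in zip(coords, a))

x[offset_index((-3, -4))] = 1.0     # stored at position (0, 0)
x[offset_index((3, 4))] = 2.0       # stored at position (6, 8)
print(grids.shape, x.shape)         # (2, 7, 9) (7, 9)
```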

Getting indices of both zero and nonzero elements in array

I need to find the indices of both the zero and nonzero elements of an array.
Put another way, I want to find the complementary indices from numpy.nonzero().
The way that I know to do this is as follows:
indices_zero = numpy.nonzero(array == 0)
indices_nonzero = numpy.nonzero(array != 0)
This however means searching the array twice, which for large arrays is not efficient. Is there an efficient way to do this using numpy?
Assuming you already have the index range numpy.arange(len(array)) available, just compute and store the logical (boolean) indices once:
bindices_zero = (array == 0)
then when you actually need the integer indices you can do
indices_zero = numpy.arange(len(array))[bindices_zero]
or
indices_nonzero = numpy.arange(len(array))[~bindices_zero]
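The same idea, comparing once and reusing the mask, can be sketched with numpy.flatnonzero, which returns the positions of the True entries of a flattened array. The sample data is made up for illustration, and a 1-D array is assumed.

```python
import numpy as np

array = np.array([0, 3, 0, -1, 5, 0])  # illustrative data

is_zero = array == 0                        # single comparison pass
indices_zero = np.flatnonzero(is_zero)      # positions where the mask is True
indices_nonzero = np.flatnonzero(~is_zero)  # positions of the complement

print(indices_zero)     # [0 2 5]
print(indices_nonzero)  # [1 3 4]
```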
You can use boolean indexing:
In [82]: a = np.random.randint(-5, 5, 100)
In [83]: a
Out[83]:
array([-2, -1, 4, -3, 1, -2, 2, -1, 2, -1, -3, 3, -3, -4, 1, 2, 1,
3, 3, 0, 1, -3, -4, 3, -5, -1, 3, 2, 3, 0, -5, 4, 3, -5,
-3, 1, -1, 0, -4, 0, 1, -5, -5, -1, 3, -2, -5, -5, 1, 0, -1,
1, 1, -1, -2, -2, 1, 1, -4, -4, 1, -3, -3, -5, 3, 0, -5, -2,
-2, 4, 1, -4, -5, -1, 3, -3, 2, 4, -4, 4, 2, -2, -4, 3, 4,
-2, -4, 2, -4, -1, 0, -3, -1, 2, 3, 1, 1, 2, 1, 4])
In [84]: mask = a != 0
In [85]: a[mask]
Out[85]:
array([-2, -1, 4, -3, 1, -2, 2, -1, 2, -1, -3, 3, -3, -4, 1, 2, 1,
3, 3, 1, -3, -4, 3, -5, -1, 3, 2, 3, -5, 4, 3, -5, -3, 1,
-1, -4, 1, -5, -5, -1, 3, -2, -5, -5, 1, -1, 1, 1, -1, -2, -2,
1, 1, -4, -4, 1, -3, -3, -5, 3, -5, -2, -2, 4, 1, -4, -5, -1,
3, -3, 2, 4, -4, 4, 2, -2, -4, 3, 4, -2, -4, 2, -4, -1, -3,
-1, 2, 3, 1, 1, 2, 1, 4])
In [86]: a[~mask]
Out[86]: array([0, 0, 0, 0, 0, 0, 0])
I'm not sure about a built-in numpy method for accomplishing this, but I believe you could use an old-fashioned for loop. Something like:
indices_zero = []
indices_nonzero = []
for index, value in enumerate(array):
    if value == 0:
        indices_zero.append(index)
    else:
        indices_nonzero.append(index)
This should accomplish what you want while looping only once.
