Python comparing two strings

Is there a function to compare how many characters two strings (of the same length) differ by? I mean only substitutions. For example, AAA would differ from AAT by 1 character.

This will work:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
1
>>> str1 = "AAABBBCCC"
>>> str2 = "ABCABCABC"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
6
>>>
The above solution uses sum, enumerate, and a generator expression.
Because True is equal to 1 (bool is a subclass of int), you could even do:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(str2[x] != y for x,y in enumerate(str1))
1
>>>
But I personally prefer the first solution because it is clearer.

This is a nice use case for the zip function!
def count_substitutions(s1, s2):
    return sum(x != y for (x, y) in zip(s1, s2))
Usage:
>>> count_substitutions('AAA', 'AAT')
1
From the docs:
zip(...)
zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element
from each of the argument sequences. The returned list is truncated
in length to the length of the shortest argument sequence.
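Because zip stops at the shorter argument, count_substitutions silently ignores trailing characters when the two strings differ in length. A minimal sketch that rejects unequal lengths instead; the name hamming_distance and the choice to raise ValueError are assumptions made for illustration, not part of the answer above:

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Assumption: unequal lengths are an error rather than being truncated.
        raise ValueError("strings must have the same length")
    # True counts as 1 when summed, so this adds one per mismatched position.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("AAA", "AAT"))              # 1
print(hamming_distance("AAABBBCCC", "ABCABCABC"))  # 6
```

Whether to raise or to count the extra characters as differences depends on the use case; raising simply makes the "same length" precondition from the question explicit.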

Building on what poke said, I would suggest the jellyfish package. It has several distance measures like what you are asking for. Example from the documentation:
In [1]: jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
Out[1]: 1
or using your example:
In [2]: jellyfish.damerau_levenshtein_distance('AAA','AAT')
Out[2]: 1
This works for strings of different lengths and should be able to handle most of what you throw at it. Note that the Damerau-Levenshtein distance also counts insertions, deletions, and transpositions, not only substitutions.

Similar to simon's answer, but you don't have to zip things just to call a function on the resulting tuples, because that's what map does anyway (and itertools.imap in Python 2). And there's a handy function for != in the operator module. Hence:
import operator
sum(map(operator.ne, s1, s2))

Related

Incorporate string with list entries - alternating

So SO, I am trying to "merge" a string (a) and a list of strings (b):
a = '1234'
b = ['+', '-', '']
to get the desired output (c):
c = '1+2-34'
The characters in the desired output string alternate in origin between the string and the list. Also, the list will always contain one element fewer than the string has characters. I was wondering what the fastest way to do this is.
What I have so far is the following:
c = a[0]
for i in range(len(b)):
    c += b[i] + a[1:][i]
print(c) # prints -> 1+2-34
But I kind of feel like there is a better way to do this.

You can use itertools.zip_longest to zip the two sequences, then keep iterating even after the shorter sequence ran out of characters. If you run out of characters, you'll start getting None back, so just consume the rest of the numerical characters.
>>> from itertools import chain
>>> from itertools import zip_longest
>>> ''.join(i+j if j else i for i,j in zip_longest(a, b))
'1+2-34'
As @deceze suggested in the comments, you can also pass a fillvalue argument to zip_longest which will insert empty strings. I'd suggest his method since it's a bit more readable.
>>> ''.join(i+j for i,j in zip_longest(a, b, fillvalue=''))
'1+2-34'
A further optimization suggested by @ShadowRanger is to remove the temporary string concatenations (i+j) and replace them with an itertools.chain.from_iterable call instead:
>>> ''.join(chain.from_iterable(zip_longest(a, b, fillvalue='')))
'1+2-34'
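The last variant can be wrapped in a small reusable function. This is only a sketch; the name interleave and its parameter names are hypothetical and do not come from the answer:

```python
from itertools import chain, zip_longest

def interleave(text, separators):
    """Alternate the characters of text with the items of separators."""
    # fillvalue='' pads the shorter sequence, so the final character is kept.
    pairs = zip_longest(text, separators, fillvalue='')
    # chain.from_iterable flattens the pairs without building temporary strings.
    return ''.join(chain.from_iterable(pairs))

print(interleave('1234', ['+', '-', '']))  # 1+2-34
```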

common elements in two lists where elements are the same

I have two lists like the following:
a=['not','not','not','not']
b=['not','not']
and I have to find the length of the list containing the intersection of the two lists above, so that the result is:
intersection=['not','not']
len(intersection)
2
Now the problem is that I have tried filter(lambda x: x in a, b) and filter(lambda x: x in b, a), but when one of the two lists is longer than the other I do not get an intersection, just a membership check. In the example above, since all the members of a are in b, I get 4 common elements; what I want instead is the intersection, which has length 2.
Using set().intersection(set()) would instead create a set, which is not what I want since all the elements are the same.
Can you suggest a compact solution to the problem?

If you don't mind using collections.Counter, then you could have a solution like
>>> import collections
>>> a=['not','not','not','not']
>>> b=['not','not']
>>> c1 = collections.Counter(a)
>>> c2 = collections.Counter(b)
and then index by 'not'
>>> c1['not'] + c2['not']
6
For the intersection, use the & operator, which keeps the minimum count of each element:
>>> (c1 & c2) ['not']
2
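To get the intersection as a list rather than a single count, the Counter result can be expanded again. A short sketch using the lists from the question (the variable names common and intersection are chosen here for illustration):

```python
from collections import Counter

a = ['not', 'not', 'not', 'not']
b = ['not', 'not']

# & keeps each element with the minimum of its two counts.
common = Counter(a) & Counter(b)

# elements() repeats each key as many times as its count.
intersection = list(common.elements())
print(intersection)          # ['not', 'not']
print(sum(common.values()))  # 2
```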

I don't see any particularly compact way to compute this. Let's just go for a solution first.
The intersection is some sublist of the shorter list (e.g. b). Now, for better performance when the shorter list is not extremely short, make the longer list a set (e.g. set(a)). The intersection can then be expressed as a list comprehension of those items in the shorter list which are also in the longer set:
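def common_elements(a, b):
    shorter, longer = (a, b) if len(a)<len(b) else (b, a)
    longer = set(longer)
    # Keeps every item of the shorter list that occurs in the longer one, so it
    # over-counts if an item is repeated more often in the shorter list.
    intersection = [item for item in shorter if item in longer]
    return intersection
a = ['not','not','not','not']
b = ['not','not']
print(common_elements(a,b))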

Have you considered the following approach?
a = ['not','not','not','not']
b = ['not','not']
min(len(a), len(b))
# 2
Since all the elements are the same, the number of common elements is just the minimum of the lengths of both lists.

Do it with sets. First convert the lists to sets and take their intersection. The real intersection may contain repetitions, so for each element of the set intersection, add it as many times as the minimum of its counts in a and b.
>>> a=['not','not','not','not']
>>> b=['not','not']
>>> def myIntersection(A,B):
...     setIn = set(A).intersection(set(B))
...     rt = []
...     for i in setIn:
...         for j in range(min(A.count(i),B.count(i))):
...             rt.append(i)
...     return rt
...
>>> myIntersection(a,b)
['not', 'not']

Split a string and add into `tuple`

I know of only two simple ways to split a string and put the pieces into a tuple:
import re
1. tuple(map(lambda i: i, re.findall(r'[\d]{2}', '012345'))) # ('01', '23', '45')
2. tuple(i for i in re.findall(r'[\d]{2}', '012345')) # ('01', '23', '45')
Are there other simple ways?

I'd go for
s = "012345"
[s[i:i + 2] for i in range(0, len(s), 2)]
or
tuple(s[i:i + 2] for i in range(0, len(s), 2))
if you really want a tuple.
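The slicing approach generalizes to any chunk size. A minimal sketch, where the function name chunks and its size parameter are hypothetical:

```python
def chunks(s, size):
    """Split s into consecutive pieces of the given size; the last may be shorter."""
    return tuple(s[i:i + size] for i in range(0, len(s), size))

print(chunks("012345", 2))   # ('01', '23', '45')
print(chunks("0123456", 3))  # ('012', '345', '6')
```

Unlike the regular expression in the question, slicing keeps a shorter trailing piece instead of dropping it.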

Usually one uses tuples when the dimensions/length is fixed (with possibly different types) and lists when there is an arbitrary number of values of the same type.
What is the reason to use a tuple instead of a list here?
Samples for tuples:
coordinates in a fixed dimensional space (e.g. 2d: (x, y) )
representation of dict key/value-pairs (e.g. ("John Smith", 38))
things where the number of tuple components is known before evaluating the expression
...
Samples for lists:
a split string ("foo|bar|buz" split on | gives ["foo", "bar", "buz"])
command line arguments (["-f", "/etc/fstab"])
things where the number of list elements is (usually) not known before evaluating the expression
...

Another alternative:
s = '012345'
map(''.join, zip(*[iter(s)]*2))  # a list in Python 2, a lazy iterator in Python 3
Or if you need a tuple:
tuple(map(''.join, zip(*[iter(s)]*2)))
This method of grouping items into n-length groups comes straight from the documentation for zip().
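The idiom works because [iter(s)]*2 holds the same iterator twice, so zip consumes two consecutive characters for each tuple. A short sketch that spells this out (the variable names are chosen for illustration):

```python
s = '012345'
it = iter(s)

# zip(it, it) is what zip(*[iter(s)]*2) expands to: one shared iterator,
# read twice per tuple, which yields consecutive pairs.
pairs = list(zip(it, it))
print(pairs)                       # [('0', '1'), ('2', '3'), ('4', '5')]
print(tuple(map(''.join, pairs)))  # ('01', '23', '45')

# With an odd-length string, zip drops the unpaired trailing character.
odd = iter('01234')
odd_pairs = list(zip(odd, odd))
print(odd_pairs)                   # [('0', '1'), ('2', '3')]
```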

Python 3
Here I make the assumption that the OP defines "simpler" as not using regular expressions.
Given strings in a list such as ["jim_bob", "slim_jim"] that share a common pattern:
fileNameToTupleByUnderscore = lambda x: tuple(x.split('_'))
print(list(map(fileNameToTupleByUnderscore, ["jim_bob", "slim_jim"])))
Returns
[('jim', 'bob'), ('slim', 'jim')]
Note that you can add a strip('_') before the split('_') if you want to exclude trailing underscores.

python sort strings with digits at the end

What is the easiest way to sort a list of strings with digits at the end, where some have 3 digits and some have 4?
>>> list = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> list.sort()
>>> print list
['asdf111', 'asdf123', 'asdf1234', 'asdf124']
It should put the 1234 one at the end. Is there an easy way to do this?

is there an easy way to do this?
Yes
You can use the natsort module.
>>> from natsort import natsorted
>>> natsorted(['asdf123', 'asdf1234', 'asdf111', 'asdf124'])
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
Full disclosure, I am the package's author.

is there an easy way to do this?
No
It's perfectly unclear what the real rules are. The "some have 3 digits and some have 4" isn't really a very precise or complete specification. All your examples show 4 letters in front of the digits. Is this always true?
import re
key_pat = re.compile(r"^(\D+)(\d+)$")
def key(item):
    m = key_pat.match(item)
    return m.group(1), int(m.group(2))
That key function might do what you want. Or it might be too complex. Or maybe the pattern is really r"^(.*)(\d{3,4})$" or maybe the rules are even more obscure.
>>> data= ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> data.sort( key=key )
>>> data
['asdf111', 'asdf123', 'asdf124', 'asdf1234']

What you're probably describing is called a Natural Sort, or a Human Sort. If you're using Python, you can borrow from Ned's implementation.
The algorithm for a natural sort is approximately as follows:
Split each value into alphabetical "chunks" and numerical "chunks"
Sort by the first chunk of each value
If the chunk is alphabetical, sort it as usual
If the chunk is numerical, sort by the numerical value represented
Take the values that have the same first chunk and sort them by the second chunk
And so on
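The chunking steps above can be sketched as a key function. This is an illustrative version written for this page, not Ned's implementation; the name natural_key is an assumption:

```python
import re

def natural_key(value):
    """Split value into text and digit chunks; digit chunks become ints."""
    # The capturing group makes re.split keep the digit runs in the result,
    # so text chunks land at even indices and digit chunks at odd indices.
    parts = re.split(r'(\d+)', value)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

data = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
print(sorted(data, key=natural_key))
# ['asdf111', 'asdf123', 'asdf124', 'asdf1234']
```

Because text and numbers always sit at alternating positions, two keys never compare a string with an int, which would raise TypeError in Python 3.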

l = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
l.sort(cmp=lambda x,y:cmp(int(x[4:]), int(y[4:])))  # Python 2 only
# Python 3 removed cmp; the equivalent is l.sort(key=lambda x: int(x[4:]))

You need a key function. You're willing to specify 3 or 4 digits at the end and I have a feeling that you want them to compare numerically.
sorted(list_, key=lambda s: (s[:-4], int(s[-4:])) if s[-4] in '0123456789' else (s[:-3], int(s[-3:])))
Without the lambda and conditional expression that's
def key(s):
    if s[-4] in '0123456789':
        return (s[:-4], int(s[-4:]))
    else:
        return (s[:-3], int(s[-3:]))
sorted(list_, key=key)
This just takes advantage of the fact that tuples sort by the first element, then the second. So because the key function is called to get a value to compare, the elements will now be compared like the tuples returned by the key function. For example, 'asdfbad123' will compare to 'asd7890' as ('asdfbad', 123) compares to ('asd', 7890). If the last 3 characters of a string aren't in fact digits, you'll get a ValueError which is perfectly appropriate given the fact that you passed it data that doesn't fit the specs it was designed for.

The issue is that the sorting here is alphabetical, since the items are strings. Strings are compared character by character, and the first differing character decides the order.
>>> 'a1234' < 'a124'  # at the first difference, '3' is less than '4'
True
>>>
You will need to do numeric sorting to get the desired output.
>>> x = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> y = [ int(t[4:]) for t in x]
>>> z = sorted(y)
>>> z
[111, 123, 124, 1234]
>>> l = ['asdf'+str(t) for t in z]
>>> l
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
>>>

L.sort(key=lambda s:int(''.join(filter(str.isdigit,s[-4:]))))

Rather than splitting each line myself, I ask Python to do it for me with re.findall():
import re
import sys
def SortKey(line):
    result = []
    for part in re.findall(r'\D+|\d+', line):
        try:
            result.append(int(part, 10))
        except (TypeError, ValueError) as _:
            result.append(part)
    return result
print(''.join(sorted(sys.stdin.readlines(), key=SortKey)), end='')

How to insert a character after every 2 characters in a string

Is there a pythonic way to insert an element into every 2nd element in a string?
I have a string: 'aabbccdd' and I want the end result to be 'aa-bb-cc-dd'.
I am not sure how I would go about doing that.

>>> s = 'aabbccdd'
>>> '-'.join(s[i:i+2] for i in range(0, len(s), 2))
'aa-bb-cc-dd'

Assume the string's length is always an even number,
>>> s = '12345678'
>>> t = iter(s)
>>> '-'.join(a+b for a,b in zip(t, t))
'12-34-56-78'
The t can also be eliminated with
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The algorithm is to group the string into pairs, then join them with the - character.
The code is written like this. Firstly, it is split into odd digits and even digits.
>>> s[::2], s[1::2]
('1357', '2468')
Then the zip function is used to combine them into an iterable of tuples.
>>> list( zip(s[::2], s[1::2]) )
[('1', '2'), ('3', '4'), ('5', '6'), ('7', '8')]
But tuples aren't what we want. This should be a list of strings. This is the purpose of the list comprehension
>>> [a+b for a,b in zip(s[::2], s[1::2])]
['12', '34', '56', '78']
Finally we use str.join() to combine the list.
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The first piece of code is the same idea, but consumes less memory if the string is long.

If you want to preserve the last character when the string has an odd length, you can modify KennyTM's answer to use itertools.izip_longest (itertools.zip_longest in Python 3):
>>> s = "aabbccd"
>>> from itertools import izip_longest
>>> '-'.join(a+b for a,b in izip_longest(s[::2], s[1::2], fillvalue=""))
'aa-bb-cc-d'
or
>>> t = iter(s)
>>> '-'.join(a+b for a,b in izip_longest(t, t, fillvalue=""))
'aa-bb-cc-d'
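In Python 3 the same idea uses zip_longest. A minimal sketch; the function name join_pairs and its sep parameter are hypothetical:

```python
from itertools import zip_longest

def join_pairs(s, sep='-'):
    """Join consecutive character pairs with sep, keeping an odd trailing character."""
    t = iter(s)
    # Passing the same iterator twice makes zip_longest read two characters per pair;
    # fillvalue='' pads the final pair when the length is odd.
    return sep.join(a + b for a, b in zip_longest(t, t, fillvalue=''))

print(join_pairs('aabbccd'))   # aa-bb-cc-d
print(join_pairs('aabbccdd'))  # aa-bb-cc-dd
```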

I tend to rely on a regular expression for this, as it seems less verbose and is usually faster than all the alternatives. Aside from having to face down the conventional wisdom regarding regular expressions, I'm not sure there's a drawback.
>>> import re
>>> s = 'aabbccdd'
>>> '-'.join(re.findall('..', s))
'aa-bb-cc-dd'
This version is strict about actual pairs though:
>>> t = s + 'e'
>>> '-'.join(re.findall('..', t))
'aa-bb-cc-dd'
... so with a tweak you can be tolerant of odd-length strings:
>>> '-'.join(re.findall('..?', t))
'aa-bb-cc-dd-e'
Usually you're doing this more than once, so maybe get a head start by creating a shortcut ahead of time:
PAIRS = re.compile('..').findall
out = '-'.join(PAIRS(src))  # src is the input string ('in' is a reserved word)
Or what I would use in real code:
def rejoined(src, sep='-', _split=re.compile('..').findall):
    return sep.join(_split(src))
>>> rejoined('aabbccdd', sep=':')
'aa:bb:cc:dd'
I use something like this from time to time to create MAC address representations from 6-byte binary input:
>>> addr = b'\xdc\xf7\x09\x11\xa0\x49'
>>> rejoined(addr[::-1].hex(), sep=':')
'49:a0:11:09:f7:dc'
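For the MAC address case specifically, bytes.hex() accepts a separator argument since Python 3.8, so no regular expression is needed there. A short sketch assuming Python 3.8 or later:

```python
addr = b'\xdc\xf7\x09\x11\xa0\x49'

# bytes.hex(sep) inserts the separator between every byte (Python 3.8+).
mac = addr[::-1].hex(':')
print(mac)  # 49:a0:11:09:f7:dc
```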

Here is a list comprehension approach where the value depends on the index modulo 2; with an odd length, the last character ends up in a group of its own:
for s in ['aabbccdd','aabbccdde']:
    print(''.join([char if not ind or ind % 2 else '-' + char
                   for ind, char in enumerate(s)]))
""" Output:
aa-bb-cc-dd
aa-bb-cc-dd-e
"""

This one-liner does the trick. It will drop the last character if your string has an odd number of characters.
"-".join([''.join(item) for item in zip(mystring1[::2],mystring1[1::2])])

As PEP8 states:
Do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting.
A pythonic way of doing this that avoids this kind of concatenation, and allows you to join iterables other than strings could be:
':'.join(f'{s[i:i+2]}' for i in range(0, len(s), 2))
And another more functional-like way could be:
':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
This second approach has a particular feature (or bug) of only joining pairs of letters. So:
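>>> s = 'abcdefghij'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh:ij'
and:
>>> s = 'abcdefghi'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh'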
