Is there a pythonic way to insert an element into every 2nd element in a string?
I have a string: 'aabbccdd' and I want the end result to be 'aa-bb-cc-dd'.
I am not sure how I would go about doing that.
>>> s = 'aabbccdd'
>>> '-'.join(s[i:i+2] for i in range(0, len(s), 2))
'aa-bb-cc-dd'
Assume the string's length is always an even number:
>>> s = '12345678'
>>> t = iter(s)
>>> '-'.join(a+b for a,b in zip(t, t))
'12-34-56-78'
The t can also be eliminated with
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The algorithm is to group the string into pairs, then join them with the - character.
Here is how the code works. First, the string is split into its odd-position and even-position characters (here, the odd digits and the even digits).
>>> s[::2], s[1::2]
('1357', '2468')
Then the zip function is used to combine them into an iterable of tuples.
>>> list( zip(s[::2], s[1::2]) )
[('1', '2'), ('3', '4'), ('5', '6'), ('7', '8')]
But tuples aren't what we want; we want a list of strings. That is the purpose of the list comprehension:
>>> [a+b for a,b in zip(s[::2], s[1::2])]
['12', '34', '56', '78']
Finally we use str.join() to combine the list.
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The first piece of code is the same idea, but consumes less memory if the string is long.
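The same grouping idea extends to chunks of any size. A minimal sketch, where insert_every and its parameters are names made up for illustration rather than anything from the answers above:

```python
def insert_every(s, n=2, sep='-'):
    """Join consecutive n-character chunks of s with sep.

    insert_every is a hypothetical helper name, not a standard function.
    A trailing chunk shorter than n is kept as-is.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slice the string at offsets 0, n, 2n, ... and join the slices.
    return sep.join(s[i:i + n] for i in range(0, len(s), n))


print(insert_every('aabbccdd'))          # aa-bb-cc-dd
print(insert_every('aabbccd'))           # aa-bb-cc-d
print(insert_every('12345678', 4, ' '))  # 1234 5678
```

Because it slices rather than zips, this version keeps an odd trailing character instead of dropping it.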
If you want to preserve the last character if the string has an odd length, then you can modify KennyTM's answer to use itertools.izip_longest (named itertools.zip_longest in Python 3):
>>> s = "aabbccd"
>>> from itertools import izip_longest
>>> '-'.join(a+b for a,b in izip_longest(s[::2], s[1::2], fillvalue=""))
'aa-bb-cc-d'
or
>>> t = iter(s)
>>> '-'.join(a+b for a,b in izip_longest(t, t, fillvalue=""))
'aa-bb-cc-d'
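For Python 3, a sketch of the iterator variant using the renamed function might look like this (the variable name result is only for illustration):

```python
from itertools import zip_longest

s = "aabbccd"
t = iter(s)
# Passing the same iterator twice makes zip_longest pull two characters
# per pair; fillvalue="" pads the final, incomplete pair.
result = '-'.join(a + b for a, b in zip_longest(t, t, fillvalue=""))
print(result)  # aa-bb-cc-d
```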
I tend to rely on a regular expression for this, as it seems less verbose and is usually faster than all the alternatives. Aside from having to face down the conventional wisdom regarding regular expressions, I'm not sure there's a drawback.
>>> import re
>>> s = 'aabbccdd'
>>> '-'.join(re.findall('..', s))
'aa-bb-cc-dd'
This version is strict about actual pairs though:
>>> t = s + 'e'
>>> '-'.join(re.findall('..', t))
'aa-bb-cc-dd'
... so with a tweak you can be tolerant of odd-length strings:
>>> '-'.join(re.findall('..?', t))
'aa-bb-cc-dd-e'
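One caveat worth checking before relying on the regex: by default '.' does not match a newline, so newlines in the input are silently skipped. A sketch of a tolerant splitter for any group size, assuming you want newlines treated like any other character (split_groups is a hypothetical name):

```python
import re

def split_groups(s, n=2):
    """Split s into chunks of up to n characters (hypothetical helper)."""
    # '.{1,n}' is greedy, so it takes n characters where possible and
    # leaves a shorter chunk only at the end.
    # re.DOTALL makes '.' match newlines as well.
    return re.findall('.{1,%d}' % n, s, flags=re.DOTALL)

print('-'.join(split_groups('aabbccdde')))  # aa-bb-cc-dd-e
print(split_groups('ab\ncd'))               # ['ab', '\nc', 'd']
print(re.findall('..', 'ab\ncd'))           # ['ab', 'cd'] (newline dropped)
```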
Usually you're doing this more than once, so maybe get a head start by creating a shortcut ahead of time:
PAIRS = re.compile('..').findall
out = '-'.join(PAIRS(src))  # src is the input string; "in" is a reserved word
Or what I would use in real code:
def rejoined(src, sep='-', _split=re.compile('..').findall):
    return sep.join(_split(src))
>>> rejoined('aabbccdd', sep=':')
'aa:bb:cc:dd'
I use something like this from time to time to create MAC address representations from 6-byte binary input:
>>> addr = b'\xdc\xf7\x09\x11\xa0\x49'
>>> rejoined(addr[::-1].hex(), sep=':')
'49:a0:11:09:f7:dc'
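For the MAC address case specifically, Python 3.8 and later let bytes.hex() take a separator argument, so the regex step may not be needed at all. A short sketch using the same input:

```python
addr = b'\xdc\xf7\x09\x11\xa0\x49'

# bytes.hex(sep) inserts sep between every byte (requires Python 3.8+).
mac = addr[::-1].hex(':')
print(mac)  # 49:a0:11:09:f7:dc
```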
Here is a list-comprehension approach that prepends the separator depending on the index modulo 2; a trailing odd character ends up in a group of its own:
for s in ['aabbccdd', 'aabbccdde']:
    print(''.join([char if not ind or ind % 2 else '-' + char
                   for ind, char in enumerate(s)]))
""" Output:
aa-bb-cc-dd
aa-bb-cc-dd-e
"""
This one-liner does the trick. It will drop the last character if your string has an odd number of characters.
"-".join([''.join(item) for item in zip(mystring1[::2],mystring1[1::2])])
As PEP8 states:
Do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting.
A pythonic way of doing this that avoids this kind of concatenation, and allows you to join iterables other than strings could be:
':'.join(f'{s[i:i+2]}' for i in range(0, len(s), 2))
And another more functional-like way could be:
':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
This second approach has a particular feature (or bug) of only joining pairs of letters. So:
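>>> s = 'abcdefghij'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh:ij'
and:
>>> s = 'abcdefghi'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh'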
So SO, I am trying to "merge" a string (a) and a list of strings (b):
a = '1234'
b = ['+', '-', '']
to get the desired output (c):
c = '1+2-34'
The characters in the desired output string alternate in terms of origin between string and list. Also, the list will always contain one element less than characters in the string. I was wondering what the fastest way to do this is.
What I have so far is the following:
c = a[0]
for i in range(len(b)):
    c += b[i] + a[1:][i]
print(c) # prints -> 1+2-34
But I kind of feel like there is a better way to do this.
You can use itertools.zip_longest to zip the two sequences; it keeps iterating after the shorter sequence runs out. Once b is exhausted you get None in its place, so just emit the remaining characters of a on their own.
>>> from itertools import chain
>>> from itertools import zip_longest
>>> ''.join(i+j if j else i for i,j in zip_longest(a, b))
'1+2-34'
As @deceze suggested in the comments, you can also pass a fillvalue argument to zip_longest which will insert empty strings. I'd suggest his method since it's a bit more readable.
>>> ''.join(i+j for i,j in zip_longest(a, b, fillvalue=''))
'1+2-34'
A further optimization suggested by @ShadowRanger is to remove the temporary string concatenations (i+j) and replace those with an itertools.chain.from_iterable call instead:
>>> ''.join(chain.from_iterable(zip_longest(a, b, fillvalue='')))
'1+2-34'
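Wrapped up as a reusable function, that last variant could look like the sketch below. The name interleave is made up for illustration, and the function assumes both inputs contain only strings:

```python
from itertools import chain, zip_longest

def interleave(chars, separators):
    """Alternate items from chars and separators, starting with chars.

    interleave is a hypothetical name; fillvalue='' pads whichever
    input runs out first, so no None ever reaches str.join.
    """
    pairs = zip_longest(chars, separators, fillvalue='')
    return ''.join(chain.from_iterable(pairs))

print(interleave('1234', ['+', '-', '']))  # 1+2-34
```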
Is there a function to compare how many characters two strings (of the same length) differ by? I mean only substitutions. For example, AAA would differ from AAT by 1 character.
This will work:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
1
>>> str1 = "AAABBBCCC"
>>> str2 = "ABCABCABC"
>>> sum(1 for x,y in enumerate(str1) if str2[x] != y)
6
>>>
The above solution uses sum, enumerate, and a generator expression.
Because True can evaluate to 1, you could even do:
>>> str1 = "AAA"
>>> str2 = "AAT"
>>> sum(str2[x] != y for x,y in enumerate(str1))
1
>>>
But I personally prefer the first solution because it is clearer.
This is a nice use case for the zip function!
def count_substitutions(s1, s2):
    return sum(x != y for (x, y) in zip(s1, s2))
Usage:
>>> count_substitutions('AAA', 'AAT')
1
From the docs:
zip(...)
zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element
from each of the argument sequences. The returned list is truncated
in length to the length of the shortest argument sequence.
Building on what poke said I would suggest the jellyfish package. It has several distance measures like what you are asking for. Example from the documentation:
IN [1]: jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
OUT[1]: 1
or using your example:
IN [2]: jellyfish.damerau_levenshtein_distance('AAA','AAT')
OUT[2]: 1
This will work for many different string lengths and should be able to handle most of what you throw at it.
Similar to simon's answer, but you don't have to zip things in order to just call a function on the resulting tuples because that's what map does anyway (and itertools.imap in Python 2). And there's a handy function for != in operator. Hence:
import operator
sum(map(operator.ne, s1, s2))
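Both zip and map stop silently at the end of the shorter input, so strings of different lengths would be compared only over their common prefix. Since the question is about strings of the same length, a sketch that enforces that assumption explicitly might be useful (the function name is borrowed from the zip answer; the length check is an addition):

```python
import operator

def count_substitutions(s1, s2):
    """Count positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Added guard: without it, extra characters are ignored.
        raise ValueError("strings must have the same length")
    # operator.ne(x, y) is x != y; True counts as 1 when summed.
    return sum(map(operator.ne, s1, s2))

print(count_substitutions('AAA', 'AAT'))              # 1
print(count_substitutions('AAABBBCCC', 'ABCABCABC'))  # 6
```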
Suppose I have this list:
lis = ['a','b','c','d']
If I do 'x'.join(lis) the result is:
'axbxcxd'
What would be a clean, simple way to get this output?
'xaxbxcxdx'
I could write a helper function:
def joiner(s, it):
    return s+s.join(it)+s
and call it like joiner('x',lis) which returns xaxbxcxdx, but it doesn't look as clean as it could be. Is there a better way to get this result?
>>> s = 'x'
>>> '{1}{0}{1}'.format(s.join(lis), s)
'xaxbxcxdx'
You can join a list that begins and ends with an empty string:
>>> 'x'.join(['', *lis, ''])
'xaxbxcxdx'
You can use an f-string:
s = 'x'
f'{s}{s.join(lis)}{s}'
In Python 3.8+ you can also use the walrus operator:
f"{(s:='x')}{s.join(lis)}{s}"
or
(s:='x') + s.join(lis) + s
You can use str.replace() to interleave the characters:
>>> lis = ['a','b','c','d']
>>> ''.join(lis).replace('', 'x')
'xaxbxcxdx'
On the other hand, your original solution (or a trivial modification with string formatting) is IMO actually pretty clean and readable.
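One thing to keep in mind with the str.replace() trick: it puts the separator between every character of the joined string, so it matches the join-based result only when every list element is a single character. A small sketch with a made-up list that contains a two-character element:

```python
lis = ['ab', 'c', 'd']
s = 'x'

via_replace = ''.join(lis).replace('', s)  # separator between every character
via_join = s + s.join(lis) + s             # separator between list elements

print(via_replace)  # xaxbxcxdx
print(via_join)     # xabxcxdx
```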
You may also do it as
'x'.join([''] + lis + [''])
But I'm not sure if it's cleaner.
Note that for an empty list it produces only one separator, whereas the helper function in the question produces two.
A generator can interleave the characters, and the result can be joined without having to create intermediate strings.
L = list('abcd')
def mixer(chars, insert):
    yield insert
    for char in chars:
        yield char
        yield insert
result = ''.join(mixer(L, 'x')) # -> 'xaxbxcxdx'
While it isn't a one-liner, I think it is clean and simple, unlike these itertools creations that I came up with initially:
from itertools import repeat, starmap, zip_longest
from operator import add
# L must have a len, so doesn't work with generators
''.join(a for b in zip_longest(repeat('x', len(L) + 1),
                               L, fillvalue='')
        for a in b)
# As above, and worse still creates lots of intermediate strings
''.join(starmap(add, zip_longest(repeat('x', len(L) + 1), L, fillvalue='')))
Arguably there is a much simpler approach to be found here - just add the character before and after your join function. Others have suggested f-strings, which is a fancy way of achieving the same thing. String concatenation is also fine:
lis = ['a','b','c','d']
lis_str = 'x' + 'x'.join(lis) + 'x'
If your string is long and you don't want to repeat it multiple times, you can just put this into a variable and do the same thing
lis = ['a','b','c','d']
join_str = 'x-marks-the-spot'
lis_str = join_str + join_str.join(lis) + join_str
Say we need a program which takes a list of strings and splits them, and appends the first two words, in a tuple, to a list and returns that list; in other words, a program which gives you the first two words of each string.
input: ["hello world how are you", "foo bar baz"]
output: [("hello", "world"), ("foo", "bar")]
It can be written like so (we assume valid input):
def firstTwoWords(strings):
    result = []
    for s in strings:
        splt = s.split()
        result.append((splt[0], splt[1]))
    return result
But a list comprehension would be much nicer.
def firstTwoWords(strings):
    return [(s.split()[0], s.split()[1]) for s in strings]
But this involves two calls to split(). Is there a way to perform the split only once from within the comprehension? I tried what came naturally and it was invalid syntax:
>>> [(splt[0],splt[1]) for s in strings with s.split() as splt]
File "<stdin>", line 1
[(splt[0],splt[1]) for s in strings with s.split() as splt]
^
SyntaxError: invalid syntax
Well, in this particular case:
def firstTwoWords(strings):
    return [s.split()[:2] for s in strings]
Otherwise, though, you can use one generator expression:
def firstTwoWords(strings):
    return [(s[0], s[1]) for s in (s.split() for s in strings)]
And if performance is actually critical, just use a function.
Writing what comes to mind naturally from English and hoping it's valid syntax rarely works, unfortunately.
The generalized form of what you're trying to do is to bind some expression to a name within a comprehension. There's no direct support for that, but since a for clause in a comprehension binds a name to each element from a sequence in turn, you can use for over a single-element container to achieve the same effect:
>>> strings = ["hello world how are you", "foo bar baz"]
>>> [(splt[0],splt[1]) for s in strings for splt in [s.split()]]
[('hello', 'world'), ('foo', 'bar')]
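On Python 3.8 and later, an assignment expression (the walrus operator) can bind the name directly inside the comprehension. This is a sketch of that alternative, not something the answers above propose; the names words and pairs are only for illustration:

```python
strings = ["hello world how are you", "foo bar baz"]

# (words := s.split()) performs the split once and binds the result to
# words; tuple elements are evaluated left to right, so words[1] sees it.
pairs = [((words := s.split())[0], words[1]) for s in strings]
print(pairs)  # [('hello', 'world'), ('foo', 'bar')]
```

Note that, unlike the loop variable s, the name words remains bound in the enclosing scope after the comprehension finishes.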
I think using a genexp is nicer, but here's how to do it with a lambda. There may be cases when this is a better fit:
>>> [(lambda splt:(splt[0], splt[1]))(s.split()) for s in strings]
[('hello', 'world'), ('foo', 'bar')]
minitech's answer is the right way to do it.
But note that you don't have to do it all in one line, and you don't really gain anything.
This:
splits = (s.split() for s in strings)
return [(s[0], s[1]) for s in splits]
Does exactly the same thing as this:
return [(s[0], s[1]) for s in (s.split() for s in strings)]
No extra intermediate values being built, no effect on the garbage collection, just more readability for free.
Also, there's a good chance your real code doesn't actually need a list in the end, just something iterable, in which case you're better off with this:
splits = (s.split() for s in strings)
return ((s[0], s[1]) for s in splits)
Or, in Python 3.3+:
splits = (s.split() for s in strings)
yield from ((s[0], s[1]) for s in splits)
In fact, an awful lot of programs can be written this way—a series of generator expressions followed by returning/yield froming one last genexpr/listcomp.
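Put together as a complete function, the yield from variant could look like the following sketch (first_two_words is a hypothetical name). Nothing is split until the caller starts iterating:

```python
def first_two_words(strings):
    """Lazily yield the first two words of each string."""
    # Stage 1: split each string on demand.
    splits = (s.split() for s in strings)
    # Stage 2: pick the first two words and hand them to the caller.
    yield from ((words[0], words[1]) for words in splits)

lazy = first_two_words(["hello world how are you", "foo bar baz"])
print(list(lazy))  # [('hello', 'world'), ('foo', 'bar')]
```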
Like this?
def firstTwoWords(strings):
    return [s.split()[:2] for s in strings]
It uses list slicing. Each item in the result is a list, of course, but if you want tuples, you can use:
def firstTwoWords(strings):
    return [tuple(s.split()[:2]) for s in strings]
itemgetter can be used here. It's a bit more general than s.split()[:2]. It allows you to pull arbitrary items out of s
>>> from operator import itemgetter
>>> strings = ["hello world how are you", "foo bar baz"]
>>> [itemgetter(0, 1)(s.split()) for s in strings]
[('hello', 'world'), ('foo', 'bar')]
more generally:
>>> [itemgetter(1, 2, 0)(s.split()) for s in strings]
[('world', 'how', 'hello'), ('bar', 'baz', 'foo')]
What is the easiest way to sort a list of strings with digits at the end, where some have 3 digits and some have 4?
>>> list = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> list.sort()
>>> print list
['asdf111', 'asdf123', 'asdf1234', 'asdf124']
It should put the 1234 one at the end. Is there an easy way to do this?
is there an easy way to do this?
Yes
You can use the natsort module.
>>> from natsort import natsorted
>>> natsorted(['asdf123', 'asdf1234', 'asdf111', 'asdf124'])
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
Full disclosure, I am the package's author.
is there an easy way to do this?
No
It's perfectly unclear what the real rules are. The "some have 3 digits and some have 4" isn't really a very precise or complete specification. All your examples show 4 letters in front of the digits. Is this always true?
import re
key_pat = re.compile(r"^(\D+)(\d+)$")
def key(item):
    m = key_pat.match(item)
    return m.group(1), int(m.group(2))
That key function might do what you want. Or it might be too complex. Or maybe the pattern is really r"^(.*)(\d{3,4})$" or maybe the rules are even more obscure.
>>> data= ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> data.sort( key=key )
>>> data
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
What you're probably describing is called a Natural Sort, or a Human Sort. If you're using Python, you can borrow from Ned's implementation.
The algorithm for a natural sort is approximately as follows:
Split each value into alphabetical "chunks" and numerical "chunks"
Sort by the first chunk of each value
If the chunk is alphabetical, sort it as usual
If the chunk is numerical, sort by the numerical value represented
Take the values that have the same first chunk and sort them by the second chunk
And so on
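The steps above can be sketched as a key function. This is a minimal illustration under the assumption that the numeric chunks are plain non-negative integers; it is not Ned's implementation, and natural_key is a made-up name:

```python
import re

def natural_key(text):
    """Split text into alternating text and number chunks for sorting."""
    # Splitting on a capturing group keeps the digit runs, so the result
    # always alternates: text, digits, text, digits, ..., text.
    chunks = re.split(r'(\d+)', text)
    # Odd positions hold the digit runs; compare those as integers.
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]

data = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
print(sorted(data, key=natural_key))
# ['asdf111', 'asdf123', 'asdf124', 'asdf1234']
```

Because text and number chunks always sit at the same positions in every key, Python 3 never ends up comparing a str with an int.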
l = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
l.sort(cmp=lambda x, y: cmp(int(x[4:]), int(y[4:])))  # Python 2 only
l.sort(key=lambda x: int(x[4:]))  # same ordering; works in Python 2 and 3
You need a key function. You're willing to specify 3 or 4 digits at the end and I have a feeling that you want them to compare numerically.
sorted(list_, key=lambda s: (s[:-4], int(s[-4:])) if s[-4] in '0123456789' else (s[:-3], int(s[-3:])))
Without the lambda and conditional expression that's
def key(s):
    if s[-4] in '0123456789':
        return (s[:-4], int(s[-4:]))
    else:
        return (s[:-3], int(s[-3:]))
sorted(list_, key=key)
This just takes advantage of the fact that tuples sort by the first element, then the second. So because the key function is called to get a value to compare, the elements will now be compared like the tuples returned by the key function. For example, 'asdfbad123' will compare to 'asd7890' as ('asdfbad', 123) compares to ('asd', 7890). If the last 3 characters of a string aren't in fact digits, you'll get a ValueError which is perfectly appropriate given the fact that you passed it data that doesn't fit the specs it was designed for.
The issue is that the sorting is alphabetical here, since the items are strings. Strings are compared character by character, and the first position that differs decides the order.
>>> 'a1234' < 'a124'  # at the fourth character, '3' is less than '4'
True
>>>
You will need to do numeric sorting to get the desired output.
>>> x = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> y = [ int(t[4:]) for t in x]
>>> z = sorted(y)
>>> z
[111, 123, 124, 1234]
>>> l = ['asdf'+str(t) for t in z]
>>> l
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
>>>
L.sort(key=lambda s:int(''.join(filter(str.isdigit,s[-4:]))))
Rather than splitting each line myself, I ask Python to do it for me with re.findall():
import re
import sys
def SortKey(line):
    result = []
    for part in re.findall(r'\D+|\d+', line):
        try:
            result.append(int(part, 10))
        except (TypeError, ValueError) as _:
            result.append(part)
    return result
print ''.join(sorted(sys.stdin.readlines(), key=SortKey)),