Eliminating redundant function calls in comprehensions from within the comprehension - python

Say we need a program which takes a list of strings and splits them, and appends the first two words, in a tuple, to a list and returns that list; in other words, a program which gives you the first two words of each string.
input: ["hello world how are you", "foo bar baz"]
output: [("hello", "world"), ("foo", "bar")]
It can be written like so (we assume valid input):
def firstTwoWords(strings):
    result = []
    for s in strings:
        splt = s.split()
        result.append((splt[0], splt[1]))
    return result
But a list comprehension would be much nicer.
def firstTwoWords(strings):
    return [(s.split()[0], s.split()[1]) for s in strings]
But this involves two calls to split(). Is there a way to perform the split only once from within the comprehension? I tried what came naturally and it was invalid syntax:
>>> [(splt[0],splt[1]) for s in strings with s.split() as splt]
File "<stdin>", line 1
[(splt[0],splt[1]) for s in strings with s.split() as splt]
^
SyntaxError: invalid syntax

Well, in this particular case:
def firstTwoWords(strings):
    return [s.split()[:2] for s in strings]
Otherwise, though, you can use one generator expression:
def firstTwoWords(strings):
    return [(s[0], s[1]) for s in (s.split() for s in strings)]
And if performance is actually critical, just use a function.

Writing what comes to mind naturally from English and hoping it's valid syntax rarely works, unfortunately.
The generalized form of what you're trying to do is to bind some expression to a name within a comprehension. There's no direct support for that, but since a for clause in a comprehension binds a name to each element of a sequence in turn, you can use for over a single-element container to achieve the same effect:
>>> strings = ["hello world how are you", "foo bar baz"]
>>> [(splt[0],splt[1]) for s in strings for splt in [s.split()]]
[('hello', 'world'), ('foo', 'bar')]
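On Python 3.8 or later, an assignment expression is another way to bind the split result inside the comprehension. The sketch below puts it next to the single-element for clause. The names via_for and via_walrus are illustrative only, and the walrus form relies on tuple elements being evaluated left to right.

```python
strings = ["hello world how are you", "foo bar baz"]

# Single-element "for" clause: splt is bound once per string.
via_for = [(splt[0], splt[1]) for s in strings for splt in [s.split()]]

# Assignment expression (Python 3.8+): the first tuple element performs
# the split and binds splt, the second element reuses it.
via_walrus = [((splt := s.split())[0], splt[1]) for s in strings]

print(via_for)
print(via_walrus)
```

Both forms call split() once per string. Note that the walrus form leaves splt bound in the enclosing scope afterwards.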

I think using a genexp is nicer, but here's how to do it with a lambda. There may be cases when this is a better fit
>>> [(lambda splt:(splt[0], splt[1]))(s.split()) for s in strings]
[('hello', 'world'), ('foo', 'bar')]

minitech's answer is the right way to do it.
But note that you don't have to do it all in one line, and you don't really gain anything.
This:
splits = (s.split() for s in strings)
return [(s[0], s[1]) for s in splits]
Does exactly the same thing as this:
return [(s[0], s[1]) for s in (s.split() for s in strings)]
No extra intermediate values being built, no effect on the garbage collection, just more readability for free.
Also, there's a good chance your real code doesn't actually need a list in the end, just something iterable, in which case you're better off with this:
splits = (s.split() for s in strings)
return ((s[0], s[1]) for s in splits)
Or, in Python 3.3+:
splits = (s.split() for s in strings)
yield from ((s[0], s[1]) for s in splits)
In fact, an awful lot of programs can be written this way—a series of generator expressions followed by returning/yield froming one last genexpr/listcomp.
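As a rough illustration of that pipeline style, here is a complete generator function built from the fragments above. The name first_two_words is hypothetical, and the input is assumed to be valid (at least two words per string).

```python
def first_two_words(strings):
    # Stage 1: lazily split each string; nothing runs until iteration starts.
    splits = (s.split() for s in strings)
    # Stage 2: lazily pick the first two words and delegate to that genexp.
    yield from ((parts[0], parts[1]) for parts in splits)


pairs = first_two_words(["hello world how are you", "foo bar baz"])
result = list(pairs)  # materialize only when a list is really needed
print(result)
```

The caller decides whether to consume the pairs one at a time or to build a list, so no intermediate list is created along the way.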

Like this?
def firstTwoWords(strings):
    return [s.split()[:2] for s in strings]
It uses list slicing. It will return a list of lists, of course, but if you want tuples, you can use:
def firstTwoWords(strings):
    return [tuple(s.split()[:2]) for s in strings]

itemgetter can be used here. It's a bit more general than s.split()[:2]. It allows you to pull arbitrary items out of s
>>> from operator import itemgetter
>>> strings = ["hello world how are you", "foo bar baz"]
>>> [itemgetter(0, 1)(s.split()) for s in strings]
[('hello', 'world'), ('foo', 'bar')]
more generally:
>>> [itemgetter(1, 2, 0)(s.split()) for s in strings]
[('world', 'how', 'hello'), ('bar', 'baz', 'foo')]

Related

sorting list of a sentence and number

I have checked several of the answers on how to sort lists in python, but I can't figure this one out.
Let's say I have a list like this:
['Today is a good day,1', 'yesterday was a strange day,2', 'feeling hopeful,3']
Is there a way to sort by the number after each sentence?
I am trying to learn this stuff on my own, so I tried stuff like:
def sortMyList(string):
    return len(string)-1
sortedList = sorted(MyList, key=sortMyList())
But of course it doesn't work because sortMyList expects one parameter.
Since no one has commented on your coding attempts so far:
def sortMyList(string):
    return len(string)-1
sortedList = sorted(MyList, key=sortMyList())
You are on your way, but there are a few issues. First, the key argument expects a function. That function should be sortMyList. sortMyList() would be the result of calling a function - and besides, your function has a parameter (as it should), so calling it with no arguments wouldn't work. Just refer to the function itself.
sortedList = sorted(MyList, key=sortMyList)
Next, you need to tell sorted what is actually being compared when you compare two strings. len(string)-1 gets the length of the string and subtracts one. This would have the effect of sorting the strings by their length, which isn't what you're looking for. You want the character in the string at that index, so sorted will look at all those characters to form a basis for comparison.
def sortMyList(string):
    return string[len(string)-1]
Next, you can use a negative index instead of calculating the length of the string, to directly get the last character:
def sortMyList(string):
    return string[-1]
Next, we'd like to handle multi-digit numbers. It looks like there's a comma right before the number, so we'll split on that, starting from the right (in case the sentence itself has a comma). We only need the first split, so we'll specify a maxsplit of 1:
def sortMyList(string):
    return string.rsplit(',', maxsplit=1)[1]
This will run into a problem: these "numbers" are actually still strings, so when you compare them, it will do so alphabetically, putting "10" before "2" and so on. To fix this, turn the number into an integer before returning it:
def sortMyList(string):
    return int(string.rsplit(',', maxsplit=1)[1])
Putting it all together:
def sortMyList(string):
    return int(string.rsplit(',', maxsplit=1)[1])
sortedList = sorted(MyList, key=sortMyList)
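To see why both the right-hand split and the int() conversion matter, here is a small check on made-up data. The list my_list and the name sort_key are illustrative. One entry contains an extra comma and one has a two-digit number.

```python
def sort_key(item):
    # Split once from the right so commas inside the sentence are ignored,
    # then compare the trailing number as an integer, not as text.
    return int(item.rsplit(',', maxsplit=1)[1])


my_list = [
    'feeling hopeful, mostly,10',
    'yesterday was a strange day,2',
    'Today is a good day,1',
]

by_number = sorted(my_list, key=sort_key)
# Without int(), the keys are strings and '10' sorts before '2'.
by_text = sorted(my_list, key=lambda item: item.rsplit(',', maxsplit=1)[1])

print(by_number)
print(by_text)
```

The second result shows the alphabetical ordering problem described above: the entry ending in 10 lands between 1 and 2.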
You can do this
>>> sorted(l, key=lambda x : int(x.split(',')[-1]))
['Today is a good day,1', 'yesterday was a strange day,2', 'feeling hopeful,3']
>>>
This would also work if you happen to have numbers in your string that have more than one digit
>>> l = ['Today is a good day,12', 'yesterday was a strange day,21', 'feeling hopeful,23']
>>> sorted(l, key=lambda x : int(x.split(',')[1]))
['Today is a good day,12', 'yesterday was a strange day,21', 'feeling hopeful,23'] # still works
>>> sorted(l, key=lambda x : x[-1])
['yesterday was a strange day,21', 'Today is a good day,12', 'feeling hopeful,23'] # doesn't work in this scenario
This worked for me:
sorted(myList, key=lambda x: x[-1])
If you need to go into double digits:
sorted(myList, key=lambda x: int(x.split(',')[1]))

Join string before, between, and after

Suppose I have this list:
lis = ['a','b','c','d']
If I do 'x'.join(lis) the result is:
'axbxcxd'
What would be a clean, simple way to get this output?
'xaxbxcxdx'
I could write a helper function:
def joiner(s, it):
    return s+s.join(it)+s
and call it like joiner('x',lis) which returns xaxbxcxdx, but it doesn't look as clean as it could be. Is there a better way to get this result?
>>> '{1}{0}{1}'.format(s.join(lis), s)
'xaxbxcxdx'
You can join a list that begins and ends with an empty string:
>>> 'x'.join(['', *lis, ''])
'xaxbxcxdx'
You can use f-string:
s = 'x'
f'{s}{s.join(lis)}{s}'
In Python 3.8+ you can also use the walrus operator:
f"{(s:='x')}{s.join(lis)}{s}"
or
(s:='x') + s.join(lis) + s
You can use str.replace() to interleave the characters:
>>> lis = ['a','b','c','d']
>>> ''.join(lis).replace('', 'x')
'xaxbxcxdx'
On the other hand, your original solution (or a trivial modification with string formatting) is IMO actually pretty clean and readable.
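One caveat about the str.replace() trick: it inserts the separator between every character, so it only matches the joined output when each list item is a single character. A quick sketch with a made-up list, multi, shows the difference.

```python
single = ['a', 'b', 'c', 'd']
multi = ['ab', 'c']  # hypothetical input with a two-character item

# replace('', 'x') puts an 'x' at every position between characters.
via_replace_single = ''.join(single).replace('', 'x')
via_replace_multi = ''.join(multi).replace('', 'x')

# Plain concatenation around join() keeps multi-character items intact.
via_concat_multi = 'x' + 'x'.join(multi) + 'x'

print(via_replace_single)
print(via_replace_multi)
print(via_concat_multi)
```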
You may also do it as
'x'.join([''] + lis + [''])
But I'm not sure if it's cleaner.
Note that for an empty list it produces only one separator ('x'), whereas the helper in the question produces two ('xx').
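The empty-list difference is easy to check. In this sketch, joiner is the helper from the question and padded_join is a hypothetical name for the variant that pads the list with empty strings.

```python
def joiner(s, it):
    # Helper from the question: separator before, between, and after.
    return s + s.join(it) + s


def padded_join(s, it):
    # Variant that joins a list padded with empty strings at both ends.
    return s.join(['', *it, ''])


lis = ['a', 'b', 'c', 'd']
print(joiner('x', lis), padded_join('x', lis))  # identical for non-empty input
print(repr(joiner('x', [])), repr(padded_join('x', [])))  # they differ here
```

Which behaviour is right for an empty input depends on the application, so it is worth choosing deliberately.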
A generator can interleave the characters, and the result can be joined without having to create intermediate strings.
L = list('abcd')
def mixer(chars, insert):
    yield insert
    for char in chars:
        yield char
        yield insert
result = ''.join(mixer(L, 'x')) # -> 'xaxbxcxdx'
While it isn't a one-liner, I think it is clean and simple, unlike these itertools creations that I came up with initially:
from itertools import repeat, starmap, zip_longest
from operator import add
# L must have a len, so doesn't work with generators
''.join(a for b in zip_longest(repeat('x', len(L) + 1), L, fillvalue='')
        for a in b)
# As above, and worse still creates lots of intermediate strings
''.join(starmap(add, zip_longest(repeat('x', len(L) + 1), L, fillvalue='')))
Arguably there is a much simpler approach to be found here - just add the character before and after your join function. Others have suggested f-strings, which is a fancy way of achieving the same thing. String concatenation is also fine:
lis = ['a','b','c','d']
lis_str = 'x' + 'x'.join(lis) + 'x'
If your string is long and you don't want to repeat it multiple times, you can just put this into a variable and do the same thing
lis = ['a','b','c','d']
join_str = 'x-marks-the-spot'
lis_str = join_str + join_str.join(lis) + join_str

Find array item in a string

I know I can use string.find() to find a substring in a string.
But what is the easiest way to find out if one of the array items has a substring match in a string without using a loop?
Pseudocode:
string = 'I would like an apple.'
search = ['apple','orange', 'banana']
string.find(search) # == True
You could use a generator expression (which is still a loop of sorts):
any(x in string for x in search)
The generator expression is the part inside the parentheses. It creates an iterable that yields the value of x in string for each x in the list search. x in string in turn returns whether string contains the substring x. Finally, the Python built-in any() iterates over the iterable it is passed and returns True as soon as any of its items evaluates to true, and False otherwise.
Alternatively, you could use a regular expression to avoid the loop:
import re
re.search("|".join(search), string)
I would go for the first solution, since regular expressions have pitfalls (escaping etc.).
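If the regular-expression route is used anyway, the escaping pitfall can be reduced by passing every search term through re.escape() before joining. This is a minimal sketch with made-up values for text and search. It assumes the list is non-empty, since an empty pattern matches everything.

```python
import re

text = 'Total cost (approx.) is $5.'
search = ['$5', 'c++']  # terms containing regex metacharacters

# Escape each term so '$', '+', '.', etc. are matched literally.
pattern = '|'.join(re.escape(term) for term in search)
found_regex = re.search(pattern, text) is not None

# The generator-expression version needs no escaping at all.
found_any = any(term in text for term in search)

print(found_regex, found_any)
```

Without re.escape(), the term 'c++' would make the joined pattern invalid and '$5' would not match the literal text.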
Strings in Python are sequences, and you can do a quick membership test by just asking if one string exists inside of another:
>>> mystr = "I'd like an apple"
>>> 'apple' in mystr
True
Sven got it right in his first answer above. To check if any of several strings exist in some other string, you'd do:
>>> ls = ['apple', 'orange']
>>> any(x in mystr for x in ls)
True
Worth noting for future reference is that the built-in 'all()' function would return true only if all items in 'ls' were members of 'mystr':
>>> ls = ['apple', 'orange']
>>> all(x in mystr for x in ls)
False
>>> ls = ['apple', 'like']
>>> all(x in mystr for x in ls)
True
A simple approach is:
import re
regx = re.compile('[ ,;:!?.:]')
string = 'I would like an apple.'
search = ['apple','orange', 'banana']
print(any(x in regx.split(string) for x in search))
EDIT
Correction, after having read Sven's answer: evidently the string does not need to be split. any(x in string for x in search) works perfectly well.
If you want no loop:
import re
regx = re.compile('[ ,;:!?.:]')
string = 'I would like an apple.'
search = ['apple','orange', 'banana']
print(regx.split(string))
print(set(regx.split(string)) & set(search))
result
{'apple'}

python sort strings with digits at the end

What is the easiest way to sort a list of strings with digits at the end, where some have 3 digits and some have 4?
>>> list = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> list.sort()
>>> print(list)
['asdf111', 'asdf123', 'asdf1234', 'asdf124']
It should put the 1234 one at the end. Is there an easy way to do this?
is there an easy way to do this?
Yes
You can use the natsort module.
>>> from natsort import natsorted
>>> natsorted(['asdf123', 'asdf1234', 'asdf111', 'asdf124'])
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
Full disclosure, I am the package's author.
is there an easy way to do this?
No
It's perfectly unclear what the real rules are. The "some have 3 digits and some have 4" isn't really a very precise or complete specification. All your examples show 4 letters in front of the digits. Is this always true?
import re
key_pat = re.compile(r"^(\D+)(\d+)$")
def key(item):
    m = key_pat.match(item)
    return m.group(1), int(m.group(2))
That key function might do what you want. Or it might be too complex. Or maybe the pattern is really r"^(.*)(\d{3,4})$" or maybe the rules are even more obscure.
>>> data= ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> data.sort( key=key )
>>> data
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
What you're probably describing is called a Natural Sort, or a Human Sort. If you're using Python, you can borrow from Ned's implementation.
The algorithm for a natural sort is approximately as follows:
Split each value into alphabetical "chunks" and numerical "chunks"
Sort by the first chunk of each value
If the chunk is alphabetical, sort it as usual
If the chunk is numerical, sort by the numerical value represented
Take the values that have the same first chunk and sort them by the second chunk
And so on
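The steps above can be sketched as a key function. This is an illustrative version and not Ned's implementation. The name natural_key is an assumption. It uses re.split() with a capturing group so that digit runs and text runs alternate in the result.

```python
import re


def natural_key(text):
    # re.split with a capturing group returns alternating chunks:
    # even positions hold non-digit text, odd positions hold digit runs.
    parts = re.split(r'(\d+)', text)
    # Convert only the digit runs so they compare by numeric value.
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


data = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
print(sorted(data, key=natural_key))
```

Because text and number chunks always sit at the same positions in every key, two keys never end up comparing a string with an integer.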
l = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
l.sort(key=lambda x: int(x[4:]))  # Python 2 equivalent: l.sort(cmp=lambda x, y: cmp(int(x[4:]), int(y[4:])))
You need a key function. You're willing to specify 3 or 4 digits at the end and I have a feeling that you want them to compare numerically.
sorted(list_, key=lambda s: (s[:-4], int(s[-4:])) if s[-4] in '0123456789' else (s[:-3], int(s[-3:])))
Without the lambda and conditional expression that's
def key(s):
    if s[-4] in '0123456789':
        return (s[:-4], int(s[-4:]))
    else:
        return (s[:-3], int(s[-3:]))
sorted(list_, key=key)
This just takes advantage of the fact that tuples sort by the first element, then the second. So because the key function is called to get a value to compare, the elements will now be compared like the tuples returned by the key function. For example, 'asdfbad123' will compare to 'asd7890' as ('asdfbad', 123) compares to ('asd', 7890). If the last 3 characters of a string aren't in fact digits, you'll get a ValueError which is perfectly appropriate given the fact that you passed it data that doesn't fit the specs it was designed for.
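The tuple comparison described here can be checked directly. The sketch below repeats the key function with the parameter name s used throughout and tries it on the two example strings, plus one made-up string ('asdfxyz') that breaks the three-or-four-digit assumption.

```python
def key(s):
    # Four trailing digits if the fourth-from-last character is a digit,
    # otherwise assume exactly three trailing digits.
    if s[-4] in '0123456789':
        return (s[:-4], int(s[-4:]))
    return (s[:-3], int(s[-3:]))


print(key('asdfbad123'))  # prefix and number as a tuple
print(key('asd7890'))
print(sorted(['asdfbad123', 'asd7890'], key=key))  # prefix decides first

try:
    key('asdfxyz')  # last three characters are not digits
except ValueError as exc:
    print('rejected:', exc)
```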
The issue is that the sorting is alphabetical here, since the items are strings. Strings are compared character by character, and the first position that differs decides the result.
>>> 'a1234' < 'a124'   # at the first differing position, '3' is less than '4'
True
>>>
You will need to do numeric sorting to get the desired output.
>>> x = ['asdf123', 'asdf1234', 'asdf111', 'asdf124']
>>> y = [ int(t[4:]) for t in x]
>>> z = sorted(y)
>>> z
[111, 123, 124, 1234]
>>> l = ['asdf'+str(t) for t in z]
>>> l
['asdf111', 'asdf123', 'asdf124', 'asdf1234']
>>>
L.sort(key=lambda s:int(''.join(filter(str.isdigit,s[-4:]))))
Rather than splitting each line myself, I ask Python to do it for me with re.findall():
import re
import sys
def SortKey(line):
    result = []
    for part in re.findall(r'\D+|\d+', line):
        try:
            result.append(int(part, 10))
        except (TypeError, ValueError):
            result.append(part)
    return result
print(''.join(sorted(sys.stdin.readlines(), key=SortKey)), end='')

How to insert a character after every 2 characters in a string

Is there a pythonic way to insert an element into every 2nd element in a string?
I have a string: 'aabbccdd' and I want the end result to be 'aa-bb-cc-dd'.
I am not sure how I would go about doing that.
>>> s = 'aabbccdd'
>>> '-'.join(s[i:i+2] for i in range(0, len(s), 2))
'aa-bb-cc-dd'
Assume the string's length is always an even number,
>>> s = '12345678'
>>> t = iter(s)
>>> '-'.join(a+b for a,b in zip(t, t))
'12-34-56-78'
The t can also be eliminated with
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The algorithm is to group the string into pairs, then join them with the - character.
The code is written like this. Firstly, it is split into odd digits and even digits.
>>> s[::2], s[1::2]
('1357', '2468')
Then the zip function is used to combine them into an iterable of tuples.
>>> list( zip(s[::2], s[1::2]) )
[('1', '2'), ('3', '4'), ('5', '6'), ('7', '8')]
But tuples aren't what we want. This should be a list of strings. This is the purpose of the list comprehension
>>> [a+b for a,b in zip(s[::2], s[1::2])]
['12', '34', '56', '78']
Finally we use str.join() to combine the list.
>>> '-'.join(a+b for a,b in zip(s[::2], s[1::2]))
'12-34-56-78'
The first piece of code is the same idea, but consumes less memory if the string is long.
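The zip(t, t) form works because both arguments are the same iterator, so each tuple consumes two consecutive characters. A short sketch follows. The variable names are illustrative.

```python
s = '12345678'
t = iter(s)

# zip asks t for one item per argument, so each tuple holds two
# consecutive characters taken from the single shared iterator.
pairs = list(zip(t, t))
joined = '-'.join(a + b for a, b in pairs)
print(pairs)
print(joined)

# With an odd-length string the final character is consumed but dropped,
# because zip stops when the second argument has nothing left to give.
odd = iter('1234567')
odd_pairs = list(zip(odd, odd))
print(odd_pairs)
```

No slices of the string are created, which is where the memory saving comes from.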
If you want to preserve the last character when the string has an odd length, you can modify KennyTM's answer to use itertools.izip_longest (Python 2; in Python 3 the function is named itertools.zip_longest):
>>> s = "aabbccd"
>>> from itertools import izip_longest
>>> '-'.join(a+b for a,b in izip_longest(s[::2], s[1::2], fillvalue=""))
'aa-bb-cc-d'
or
>>> t = iter(s)
>>> '-'.join(a+b for a,b in izip_longest(t, t, fillvalue=""))
'aa-bb-cc-d'
I tend to rely on a regular expression for this, as it seems less verbose and is usually faster than all the alternatives. Aside from having to face down the conventional wisdom regarding regular expressions, I'm not sure there's a drawback.
>>> import re
>>> s = 'aabbccdd'
>>> '-'.join(re.findall('..', s))
'aa-bb-cc-dd'
This version is strict about actual pairs though:
>>> t = s + 'e'
>>> '-'.join(re.findall('..', t))
'aa-bb-cc-dd'
... so with a tweak you can be tolerant of odd-length strings:
>>> '-'.join(re.findall('..?', t))
'aa-bb-cc-dd-e'
Usually you're doing this more than once, so maybe get a head start by creating a shortcut ahead of time:
PAIRS = re.compile('..').findall
out = '-'.join(PAIRS(s))
Or what I would use in real code:
def rejoined(src, sep='-', _split=re.compile('..').findall):
    return sep.join(_split(src))
>>> rejoined('aabbccdd', sep=':')
'aa:bb:cc:dd'
I use something like this from time to time to create MAC address representations from 6-byte binary input:
>>> addr = b'\xdc\xf7\x09\x11\xa0\x49'
>>> rejoined(addr[::-1].hex(), sep=':')
'49:a0:11:09:f7:dc'
Here is a list-comprehension approach that chooses each value based on the index modulo 2; an odd final character ends up in a group of its own:
for s in ['aabbccdd','aabbccdde']:
    print(''.join([char if not ind or ind % 2 else '-' + char
                   for ind, char in enumerate(s)]))
""" Output:
aa-bb-cc-dd
aa-bb-cc-dd-e
"""
This one-liner does the trick. It will drop the last character if your string has an odd number of characters.
"-".join([''.join(item) for item in zip(mystring1[::2],mystring1[1::2])])
As PEP8 states:
Do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting.
A pythonic way of doing this that avoids this kind of concatenation, and allows you to join iterables other than strings could be:
':'.join(f'{s[i:i+2]}' for i in range(0, len(s), 2))
And another more functional-like way could be:
':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
This second approach has a particular feature (or bug) of only joining pairs of letters. So:
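>>> s = 'abcdefghij'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh:ij'
and:
>>> s = 'abcdefghi'
>>> ':'.join(map('{}{}'.format, *(s[::2], s[1::2])))
'ab:cd:ef:gh'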
