I came across the following Python code and am having trouble understanding it:
''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for i in range(length))
The for loop tells me it's a comprehension, but of what type? It's not a list comprehension, because the [] are missing (unless there's a special syntax at work here). I tried to work it out by running
random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for i in range(length)
directly in the interpreter, but got a SyntaxError at the for.
I did some digging around and came to a not-so-sure conclusion that this is what's called a generator comprehension, but I didn't find any examples that look like this one; they all use the () notation for creating the generator object.
So, is it like join() works on iterators (and therefore generators) and we actually have a generator syntax here? If yes, can we omit the surrounding () when creating generator objects in function calls?
You need join() because you have a sequence of individual characters and you want a single string, hence join().
random.choice() selects a random character from its argument.
The argument contains the ASCII upper/lower case letters and the digits.
The length of the resulting string is length.
Summing it all up, this line of code generates a random string of length length that contains upper/lower case letters and digits.
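For context, a minimal sketch of how that line is typically wrapped (the function name random_string is my own, not from the original code):

import random
import string

def random_string(length):
    # pick `length` random characters from letters and digits
    chars = string.ascii_lowercase + string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for i in range(length))

print(random_string(8))  # e.g. 'aZ3kQ9bX'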
This is essentially a list comprehension with the [] left off: when the expression is the sole argument to join(), the brackets are not required and you get a generator expression instead.
It creates an iterator that produces the same values a list comprehension would. Take this example from pythonwiki:
# list comprehension
doubles = [2 * n for n in range(50)]
# same as the list comprehension above
doubles = list(2 * n for n in range(50))
Both build the same list, but only the first is a list comprehension; the second passes a generator expression to list(). I believe your example relies on the latter form, which the wiki I linked calls a generator expression.
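A quick illustration (my own example) of when the surrounding parentheses can be dropped: only when the generator expression is the sole argument of a call.

nums = [1, 2, 3]

total = sum(n * n for n in nums)        # OK: sole argument, no extra parens needed

total = sum((n * n for n in nums), 0)   # with a second argument the parens are required
# sum(n * n for n in nums, 0)           # SyntaxError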
I've recently been practicing using map() in Python 3.5.2, and when I tried to run the module it said the comma separating the function and the iterable was a SyntaxError. Here's the code:
eng_swe = {"merry":"god", "christmas":"jul", "and":"och", "happy":"gott",
"new":"nytt", "year":"år"}
def map_translate(l):
"""Translates English words into Swedish using the dictionary above."""
return list(map(lambda x: eng_swe[x] if x in eng_swe.keys(), l))
I noticed that if I eliminate the conditional statement like this:
return list(map(lambda x: eng_swe[x], l))
it works fine, but it sacrifices the ability to avoid attempting to add items to the list that aren't in the dictionary. Interestingly enough, there also weren't any problems when I tried using a conditional statement with reduce(), as shown here:
from functools import reduce

def reduce_max_in_list(l):
    """Returns maximum integer in list using the 'reduce' function."""
    return reduce(lambda x, y: x if x > y else y, l)
Yes, I know I could do the exact same thing more cleanly and easily with a list comprehension, but I consider it worth my time to at least learn how to use map() correctly, even if I end up never using it again.
You're getting the SyntaxError because you're using a conditional expression without supplying the else clause, which is mandatory.
The grammar for conditional expressions (i.e. if statements in expression form) always includes an else clause:
conditional_expression ::= or_test ["if" or_test "else" expression]
                                                  ^^^^
In your reduce example you do supply it and, as a result, no errors are being raised.
In your first example, you don't specify what should be returned if the condition isn't true. Since Python can't yield nothing from an expression, that is a syntax error, e.g.:
a if b # SyntaxError.
a if b else c # Ok.
You might argue that it could be useful to implicitly yield None in this case, but I doubt that a proposal of that sort would get any traction within the community... (I wouldn't vote for it ;-)
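A minimal sketch of one way to satisfy the grammar in the question's code (my own adjustment: untranslatable words simply become None, assuming that is acceptable):

eng_swe = {"merry": "god", "christmas": "jul", "and": "och", "happy": "gott",
           "new": "nytt", "year": "år"}

def map_translate(l):
    # the mandatory else branch maps missing words to None
    return list(map(lambda x: eng_swe[x] if x in eng_swe else None, l))

print(map_translate(["merry", "christmas", "everyone"]))
# ['god', 'jul', None]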
While the others' explanations of why your code is causing a SyntaxError are completely accurate, the goal of my answer is to aid you in your goal "to at least learn how to use map() correctly."
Your use of map in this context does not make much sense. As you noted in your question, it would be much cleaner to use a list comprehension:
[eng_swe[x] for x in l if x in eng_swe]
As you can see, this looks awfully similar to your map expression, minus some of the convolution. Generally, this is a sign that you're using map incorrectly. map(lambda... is pretty much a code smell. (Note that I am saying this as an ardent supporter of the use of map in Python. I know many people think it should never be used, but I am not one of those people, as long as it is used properly.)
So, you might be wondering, what is an example of a good time to use map? Well, one use case I can think of off the top of my head is converting a list of strs to ints. For example, if I am reading a table of data stored in a file, I might do:
with open('my_file.txt', 'r') as f:
    data = [list(map(int, line.split(' '))) for line in f]
Which would leave me with a 2d-array of ints, perfect for further manipulation or analysis. What makes this a better use of map than your code is that it uses a built-in function. I am not writing a lambda expressly to be used by map (as this is a sign that you should use a list comprehension).
Getting back to your code, however... if you want to write your code functionally, you should really be using filter, which is just as important to know as map.
map(lambda x: eng_swe[x], filter(lambda x: eng_swe.get(x), l))
Note that I was unable to get rid of the map(lambda... code smell in my version, but at least I broke it down into smaller parts. The filter finds the words that can be translated and the map performs the actual translation. (Still, in this case, a list comprehension is probably better.) I hope that this explanation helps you more than it confuses you in your quest to write Python code functionally.
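Putting that together as a runnable sketch (same dictionary as in the question, wrapped in list() so the result prints nicely in Python 3):

eng_swe = {"merry": "god", "christmas": "jul", "and": "och", "happy": "gott",
           "new": "nytt", "year": "år"}

words = ["merry", "christmas", "everyone"]

# filter keeps only the translatable words, map performs the translation
print(list(map(lambda x: eng_swe[x], filter(lambda x: eng_swe.get(x), words))))
# ['god', 'jul']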
To take the number of test cases and output all the input numbers, I can do the following in Python 2.5
exec"print input();"*input()
How to do it in Python 3, in shortest possible way?
Your obfuscated code works just fine in Python 3 too, once you have adapted for the changes, which can trivially be done by running the code through 2to3.
exec("print(input());"*eval(input()))
(Although eval should in this case be replaced with int() as that's what you want.)
Obviously, this is all ridiculous; why are you using exec and string multiplication instead of a loop?
for ignored in range(int(input())):
    print(input())
You can also do it with a list comprehension:
[print(input()) for _ in range(int(input()))]
Although most people would say (and I would agree) that using list comprehensions for their side effects or to loop is generally bad form. List comprehensions should be used to create lists.
This is a piece of clear and self-documenting code that does the same:
num_integers = int(input('How many integers do you want to input? '))
for x in range(num_integers):
    print(input('Integer {}: '.format(x)))
Is there a reason you can't use a loop?
for _ in xrange(input()):
    print input()
exec, like print, is a function in Python 3. Wrap the string in parentheses.
This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Anyone has better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.
Thanks
#gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' contents; joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character, as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
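For concreteness, here is a minimal Python 2.5 sketch (function names are my own, and the fixed '-' suffix is purely for illustration) of two of those O(N) approaches, operating on a decoded unicode object so the loop really goes over characters:

import cStringIO

def interleave_join(s):
    # decode so the loop goes over characters, not bytes; single O(N) join
    u = s.decode('utf-8')
    return u''.join(ch + u'-' for ch in u).encode('utf-8')

def interleave_cstringio(s):
    # repeated .write() calls on a cStringIO buffer are also amortized O(N)
    u = s.decode('utf-8')
    buf = cStringIO.StringIO()
    for ch in u:
        buf.write(ch.encode('utf-8'))
        buf.write('-')
    return buf.getvalue()

print interleave_join("Nissim")        # N-i-s-s-i-m-
print interleave_cstringio("Nissim")   # N-i-s-s-i-m-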
Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re
re.sub(r'(.)', r'\1-', u'Nissim')
count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s

re.sub(r'(.)', repl, u'Nissim')
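For reference, the expected results (my own check):

>>> re.sub(r'(.)', r'\1-', u'Nissim')
u'N-i-s-s-i-m-'
>>> re.sub(r'(.)', repl, u'Nissim')   # with count starting at 1
u'N1i2s3s4i5m6'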
This might be an effective Python solution:
s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, try to improve upon it; a C module should be the last option.
Edit: This is also the fastest
import timeit
s1="Nissim"
s2="------"
timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))
N=100000
print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)
The output is:
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Here are my timings. Note, it's py3.1:
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
Use reduce.
>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'
The same works with numbers too, as long as you know which character maps to which (a dict can be handy).
>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(9)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
string="™¡™©€"
unicode(string,"utf-8")
s2='-'*len(s1)
''.join(sum(map(list,zip(s1,s2)),[])).encode("utf-8")
What is the easiest way to compare strings in Python, ignoring case?
Of course one can do (str1.lower() <= str2.lower()), etc., but this creates two additional temporary strings (with the obvious alloc/g-c overheads).
I guess I'm looking for an equivalent to C's stricmp().
[Some more context requested, so I'll demonstrate with a trivial example:]
Suppose you want to sort a looong list of strings. You simply do theList.sort().
This is O(n * log(n)) string comparisons and no memory management (since all
strings and list elements are some sort of smart pointers). You are happy.
Now, you want to do the same, but ignore the case (let's simplify and say
all strings are ascii, so locale issues can be ignored).
You can do theList.sort(key=lambda s: s.lower()), but then you cause two new
allocations per comparison, plus burden the garbage-collector with the duplicated
(lowered) strings.
Each such memory-management noise is orders-of-magnitude slower than simple string comparison.
Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp)
and it is as fast and as memory-friendly as theList.sort(). You are happy again.
The problem is that any Python-based case-insensitive comparison involves implicit string
duplications, so I was expecting to find a C-based comparison (maybe in module string).
Could not find anything like that, hence the question here.
(Hope this clarifies the question).
Here is a benchmark showing that using str.lower is faster than the accepted answer's proposed method (libc.strcasecmp):
#!/usr/bin/env python2.7
import random
import timeit
from ctypes import *
libc = CDLL('libc.dylib') # change to 'libc.so.6' on linux
with open('/usr/share/dict/words', 'r') as wordlist:
    words = wordlist.read().splitlines()
random.shuffle(words)
print '%i words in list' % len(words)

setup = 'from __main__ import words, libc; gc.enable()'
stmts = [
    ('simple sort', 'sorted(words)'),
    ('sort with key=str.lower', 'sorted(words, key=str.lower)'),
    ('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
]
for (comment, stmt) in stmts:
    t = timeit.Timer(stmt=stmt, setup=setup)
    print '%s: %.2f msec/pass' % (comment, (1000*t.timeit(10)/10))
typical times on my machine:
235886 words in list
simple sort: 483.59 msec/pass
sort with key=str.lower: 1064.70 msec/pass
sort with cmp=libc.strcasecmp: 5487.86 msec/pass
So, among the case-insensitive approaches, the version with str.lower is not only the fastest by far, but also the most portable and Pythonic of all the proposed solutions here.
I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into the libc module doesn't duplicate any strings?
NB: The lower() string method also has the advantage of being locale-dependent. Something you will probably not be getting right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a unicode context.
Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:
Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']
Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, as it avoids the "duplicate" strings created by str.lower or locale.strxfrm.
Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.
The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:
from timeit import Timer
s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"
code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print time / 1000 # 0.00920499992371 on my machine
If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html
I can't find any other built-in way of doing case-insensitive comparison: the Python Cookbook recipe uses lower().
However, you have to be careful when using lower() for comparisons because of the Turkish I problem. Unfortunately, Python's handling of the Turkish I's is not good: ı is converted to I, but I is not converted to ı; İ is converted to i, but i is not converted to İ.
There's no built in equivalent to that function you want.
You can write your own function that lowercases one character at a time to avoid duplicating both strings, but I'm sure it will be very CPU-intensive and extremely inefficient.
Unless you are working with extremely long strings (so long that can cause a memory problem if duplicated) then I would keep it simple and use
str1.lower() == str2.lower()
You'll be ok
This question is asking 2 very different things:
What is the easiest way to compare strings in Python, ignoring case?
I guess I'm looking for an equivalent to C's stricmp().
Since #1 has been answered very well already (ie: str1.lower() < str2.lower()) I will answer #2.
def strincmp(str1, str2, numchars=None):
    result = 0
    len1 = len(str1)
    len2 = len(str2)
    if numchars is not None:
        minlen = min(len1, len2, numchars)
    else:
        minlen = min(len1, len2)
    #end if
    orda = ord('a')
    ordz = ord('z')
    i = 0
    while i < minlen and 0 == result:
        ord1 = ord(str1[i])
        ord2 = ord(str2[i])
        if ord1 >= orda and ord1 <= ordz:
            ord1 = ord1 - 32
        #end if
        if ord2 >= orda and ord2 <= ordz:
            ord2 = ord2 - 32
        #end if
        result = cmp(ord1, ord2)
        i += 1
    #end while
    if 0 == result and minlen != numchars:
        if len1 < len2:
            result = -1
        elif len2 < len1:
            result = 1
        #end if
    #end if
    return result
#end def
Only use this function when it makes sense to, as in many instances the lowercasing technique will be superior.
I only work with ascii strings, I'm not sure how this will behave with unicode.
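A hypothetical usage example (Python 2, where sort/sorted still accept a cmp argument):

# case-insensitive sort using the comparison function above
words = ['banana', 'Apple', 'cherry']
print sorted(words, cmp=strincmp)   # ['Apple', 'banana', 'cherry']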
When something isn't supported well in the standard library, I always look for a PyPI package. With virtualization and the ubiquity of modern Linux distributions, I no longer avoid Python extensions. PyICU seems to fit the bill: https://stackoverflow.com/a/1098160/3461
There is now also a pure-Python option that is well tested: https://github.com/jtauber/pyuca
Old answer:
I like the regular expression solution. Here's a function you can copy and paste into any function, thanks to Python's support for nested function definitions.
def equals_ignore_case(str1, str2):
    import re
    return re.match(re.escape(str1) + r'\Z', str2, re.I) is not None
Since I used match instead of search, I didn't need to add a caret (^) to the regular expression.
Note: This only checks equality, which is sometimes what is needed. I also wouldn't go so far as to say that I like it.
This is how you'd do it with re:
import re
p = re.compile('^hello$', re.I)
p.match('Hello')
p.match('hello')
p.match('HELLO')
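For what it's worth, a quick check of what those calls return (my own illustration; match objects are truthy, non-matches return None):

>>> bool(p.match('Hello')), bool(p.match('hello')), bool(p.match('HELLO'))
(True, True, True)
>>> p.match('hello world') is None   # the $ anchor rejects trailing text
True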
The recommended idiom to sort lists of values using expensive-to-compute keys is the so-called "decorate-sort-undecorate" pattern. It consists simply of building a list of (key, value) tuples from the original list and sorting that list. Then it is trivial to eliminate the keys and get the list of sorted values:
>>> original_list = ['a', 'b', 'A', 'B']
>>> decorated = [(s.lower(), s) for s in original_list]
>>> decorated.sort()
>>> sorted_list = [s[1] for s in decorated]
>>> sorted_list
['A', 'a', 'B', 'b']
Or if you like one-liners:
>>> sorted_list = [s[1] for s in sorted((s.lower(), s) for s in original_list)]
>>> sorted_list
['A', 'a', 'B', 'b']
If you really worry about the cost of calling lower(), you can just store tuples of (lowered string, original string) everywhere. Tuples are the cheapest kind of containers in Python, they are also hashable so they can be used as dictionary keys, set members, etc.
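For comparison, a short sketch of the same idea letting sort build the decoration internally via a key function (note the tie-breaking difference: key= keeps the original order of equal-keyed items, while the tuple version falls back to comparing the original strings):

original_list = ['a', 'b', 'A', 'B']

print(sorted(original_list, key=str.lower))
# ['a', 'A', 'b', 'B']   -- ties keep their original order (stable sort)

print([s for _, s in sorted((s.lower(), s) for s in original_list)])
# ['A', 'a', 'B', 'b']   -- ties fall back to comparing the original strings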
I'm pretty sure you either have to use .lower() or use a regular expression. I'm not aware of a built-in case-insensitive string comparison function.
For occasional or even repeated comparisons, a few extra string objects shouldn't matter as long as this doesn't happen in the innermost loop of your core code, or you don't have enough data to actually notice the performance impact. Check whether you do: doing things in a "stupid" way is much less stupid if you also do it less often.
If you seriously want to keep comparing lots and lots of text case-insensitively, you could somehow keep the lowercase versions of the strings at hand to avoid finalization and re-creation, or normalize the whole data set into lowercase. This of course depends on the size of the data set. If there are relatively few needles and a large haystack, replacing the needles with compiled regexp objects is one solution. It's hard to say without seeing a concrete example.
You could translate each string to lowercase once --- lazily only when you need it, or as a prepass to the sort if you know you'll be sorting the entire collection of strings. There are several ways to attach this comparison key to the actual data being sorted, but these techniques should be addressed in a separate issue.
Note that this technique can be used not only to handle upper/lower case issues, but for other types of sorting such as locale specific sorting, or "Library-style" title sorting that ignores leading articles and otherwise normalizes the data before sorting it.
Just use the str.lower() method, unless high performance is important, in which case write that sorting method as a C extension.
"How to write a Python Extension" seems like a decent intro.
More interestingly, this guide compares using the ctypes library vs. writing an external C module (ctypes is quite substantially slower than the C extension).
import re

if re.match('tEXT', 'text', re.IGNORECASE):
    pass  # is True: the pattern matches despite the case difference
You could subclass str and create your own case-insensitive string class, but IMHO that would be extremely unwise and create far more trouble than it's worth.
In response to your clarification...
You could use ctypes to execute the c function "strcasecmp". Ctypes is included in Python 2.5. It provides the ability to call out to dll and shared libraries such as libc. Here is a quick example (Python on Linux; see link for Win32 help):
from ctypes import *
libc = CDLL("libc.so.6") // see link above for Win32 help
libc.strcasecmp("THIS", "this") // returns 0
libc.strcasecmp("THIS", "THAT") // returns 8
You may also want to reference the strcasecmp documentation.
Not really sure this is any faster or slower (have not tested), but it's a way to use a C function to do case insensitive string comparisons.
ActiveState Code - Recipe 194371: Case Insensitive Strings
is a recipe for creating a case-insensitive string class. It might be a bit of overkill for something quick, but it could provide you with a common way of handling case-insensitive strings if you plan on using them often.