Python: Loop: formatting string - python

I don't know how to express this. I want to print:
_1__2__3__4_
With "_%s_" as a substring of that. How to get the main string when I format the substring? (as a shortcut of:
for x in range(1,5):
    print "_%s_" % (x)
(Even though this prints multiple lines))
Edit: just in one line

Did you mean something like this?
my_string = "".join(["_%d_" % i for i in xrange(1,5)])
That creates a list of the substrings as requested and then concatenates the items in the list using the empty string as separator (See str.join() documentation).
Alternatively, you can build up the string through a loop with the += operator, although it is much slower and less efficient:
s = ""
for x in range(1,5):
    s += "_%d_" % x
print s
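For readers on Python 3, where xrange and the print statement no longer exist, the two approaches above could be sketched roughly as follows. The function names join_wrapped and concat_wrapped are made up for this illustration and do not come from the answer.

```python
def join_wrapped(numbers):
    """Build the result in one pass with str.join."""
    return "".join("_%d_" % n for n in numbers)


def concat_wrapped(numbers):
    """Build the same result by repeated += concatenation."""
    s = ""
    for n in numbers:
        s += "_%d_" % n
    return s


print(join_wrapped(range(1, 5)))    # _1__2__3__4_
print(concat_wrapped(range(1, 5)))  # _1__2__3__4_
```

Both produce the same string; which one is faster for your input is best checked with timeit rather than assumed.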

print("_" + "__".join(map(str, xrange(1,5))) + "_")
_1__2__3__4_
In [9]: timeit ("_" + "__".join(map(str,xrange(1,5)))) +"_"
1000000 loops, best of 3: 1.38 µs per loop
In [10]: timeit "".join(["_%d_" % i for i in xrange(1,5)])
100000 loops, best of 3: 3.19 µs per loop

You can keep your loop style if you want to.
If you are using Python 2.7:
from __future__ import print_function
for x in range(1,5):
    print("_%s_" % (x), sep = '', end = '')
print()
For Python 3.x, the import is not required.
Python docs: https://docs.python.org/2.7/library/functions.html?highlight=print#print
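A small Python 3 sketch of the same idea, writing into an in-memory buffer so the result can be inspected. The helper name print_wrapped and the use of io.StringIO are assumptions added for illustration only.

```python
import io


def print_wrapped(numbers, out):
    """Print each number as _n_ on a single line, then end the line."""
    for x in numbers:
        # end="" suppresses the newline that print normally appends.
        print("_%s_" % x, end="", file=out)
    print(file=out)


buf = io.StringIO()
print_wrapped(range(1, 5), buf)
result = buf.getvalue()
print(repr(result))  # '_1__2__3__4_\n'
```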

Python 3:
print("_{}_".format("__".join(map(str,range(1,5)))))
_1__2__3__4_
Python 2:
print "_{0}_".format("__".join(map(str,range(1,5))))
_1__2__3__4_

Related

Matching list of regular expression to list of strings

I'm working on matching a list of regular expressions with a list of strings. The problem is, that the lists are very big (RegEx about 1 million, strings about 50T). What I've got so far is this:
import re
reg_list = [r"domain\.com\/picture\.png", r"entry{0,9}"]
y = ["test","string","entry4also_found","entry5"]
RESULT_LIST = []
for r in reg_list:
    for x in y:
        if re.findall(r, x):
            RESULT_LIST.append(x)
            print(x)
This works logically but is far too inefficient for that number of entries. Is there a better (more efficient) solution for this?
Use any() to test if any of the regular expressions match, rather than looping over the entire list.
Compile all the regular expressions first, so this doesn't have to be done repeatedly.
reg_list = [re.compile(rx) for rx in reg_list]
for word in y:
    if any(rx.search(word) for rx in reg_list):
        RESULT_LIST.append(word)
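Put together as a self-contained sketch (Python 3 assumed; the function name matching_words is hypothetical), this could look like the following. Note one behavioural difference from the nested loop in the question: a word that matches several patterns is collected once here, not once per matching pattern.

```python
import re

patterns = [r"domain\.com/picture\.png", r"entry{0,9}"]
words = ["test", "string", "entry4also_found", "entry5"]

# Compile once up front so the loop does not re-parse each pattern.
compiled = [re.compile(p) for p in patterns]


def matching_words(words, compiled):
    """Return the words matched by at least one compiled pattern."""
    # any() stops at the first pattern that matches a given word.
    return [w for w in words if any(rx.search(w) for rx in compiled)]


result = matching_words(words, compiled)
print(result)  # ['entry4also_found', 'entry5']
```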
The only enhancements that come to mind are:
Stopping at the first match: re.findall searches for every match in the string, which is not what you are after.
Pre-compiling your regexes.
import re
reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
reg_list = [re.compile(x) for x in reg_list] # Step 1
y = ["test","string","entry4also_found","entry5"]
RESULT_LIST = []
for r in reg_list:
    for x in y:
        if r.search(x): # Step 2
            RESULT_LIST.append(x)
            print(x)
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop
$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
So, if you are going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

How do I reference several scripts with similar names in Python with a "for" statement?

I want to execute several .py files that have similar names. The goal is call on a1.py, a2.py, and a3.py. I've tried what is below, but I can't get it to work. How do I insert each element in "n" into the filename to execute several scripts? Thanks for any help.
n = [1,2,3]
for x in n:
    execfile('C:/a$n.py')
Personally, I really prefer using string formatting to concatenation (more robust to various datatypes). Also, there's no reason to keep a list like that around; it can be replaced by range:
for x in range(1,4):
    execfile('C:/a%s.py' % x)
range(1, 4) == [1, 2, 3] # True
That said, the format command used by Marcin is technically more pythonic. I personally find it overly verbose though.
I believe that string concatenation is the most performant option, but performance really shouldn't be the deciding factor for you here.
To recap:
Pythonic Solution:
execfile('C:/a{}.py'.format(x))
C-type Solution (that I personally prefer):
execfile('C:/a%s.py' % x)
Performant Solution (Again, performance really shouldn't be your driving force here)
execfile('C:/a' + str(x) + '.py')
EDIT: Turns out I was wrong and the C-type solution is most performant. timeit results below:
$ python -m timeit '["C:/a{}.py".format(x) for x in range(1,4)]'
100000 loops, best of 3: 4.58 usec per loop
$ python -m timeit '["C:/a%s.py" % x for x in range(1,4)]'
100000 loops, best of 3: 2.92 usec per loop
$ python -m timeit '["C:/a" + str(x) + ".py" for x in range(1,4)]'
100000 loops, best of 3: 3.32 usec per loop
You can do this:
n = [1,2,3]
for x in n:
    execfile('C:/a{}.py'.format(x))
    print('C:/a{}.py'.format(x))
The names will be:
C:/a1.py
C:/a2.py
C:/a3.py
for x in n:
    execfile('C:/a{}.py'.format(x))
Better:
for x in range(1,4):
    execfile('C:/a{}.py'.format(x))
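execfile only exists in Python 2. As a hedged sketch of the same loop for Python 3, the standard-library runpy.run_path can execute a script by path and return its resulting globals. This is a swapped-in technique, not something the answers above use; the temporary directory and the variable name value inside the generated scripts are assumptions that just keep the example self-contained.

```python
import os
import runpy
import tempfile

# Create three throwaway scripts a1.py, a2.py, a3.py for the demo.
script_dir = tempfile.mkdtemp()
for i in range(1, 4):
    with open(os.path.join(script_dir, "a{}.py".format(i)), "w") as fh:
        fh.write("value = {}\n".format(i * 10))

results = []
for i in range(1, 4):
    path = os.path.join(script_dir, "a{}.py".format(i))
    # run_path executes the file and returns its module globals as a dict.
    namespace = runpy.run_path(path)
    results.append(namespace["value"])

print(results)  # [10, 20, 30]
```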

Efficiency: For x in EmptyList vs if length > 0

Sorry, kind of a hard question to title.
If I want to iterate over a potentially empty list, which is more efficient? I'm expecting the list to be empty the majority of the time.
for x in list:
    dostuff()
OR
if len(list)>0:
    for x in list:
        dostuff()
Based on timings from the timeit module:
>>> from timeit import timeit
>>> timeit('for x in lst:pass', 'lst=[]')
0.08301091194152832
>>> timeit('if len(lst)>0:\n  for x in lst:\n    pass', 'lst=[]')
0.09223318099975586
It looks like just doing the for loop is faster when the list is empty, which makes it the faster of these two options regardless of the state of the list.
However, there is a significantly faster option:
>>> timeit('if lst:\n  for x in lst:\n    pass', 'lst=[]')
0.03235578536987305
Using if lst is much faster than either checking the length of the list or always doing the for loop. However, all three methods are quite fast, so if you are trying to optimize your code I would suggest finding the real bottleneck first; take a look at "When is optimisation premature?".
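If you want to repeat the comparison on your own machine, a minimal sketch with the timeit module might look like this. The labels in the candidates dictionary and the iteration count are arbitrary choices made for the example, and the absolute numbers will differ between Python versions and machines, so no expected output is shown.

```python
from timeit import timeit

setup = "lst = []"
candidates = {
    "for only": "for x in lst:\n    pass",
    "len check": "if len(lst) > 0:\n    for x in lst:\n        pass",
    "truth check": "if lst:\n    for x in lst:\n        pass",
}

# Time each statement against the same empty list.
timings = {
    name: timeit(stmt, setup, number=100000)
    for name, stmt in candidates.items()
}

# Print the fastest variant first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-12s %.4f s" % (name, seconds))
```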
You can just use if list:
In [15]: if l:
   ....:     print "hello"
   ....:
In [16]: l1= [1]
In [17]: if l1:
   ....:     print "hello from l1"
   ....:
hello from l1
In [21]: %timeit for x in l:pass
10000000 loops, best of 3: 54.4 ns per loop
In [22]: %timeit if l:pass
10000000 loops, best of 3: 22.4 ns per loop
If a list is empty, if list will evaluate to False, so there is no need to check len(list).
Firstly, if len(list) > 0: should be if list: to improve readability. I personally would have thought having the if statement is redundant, but timeit seems to prove me wrong. It seems (unless I've made a silly mistake) that having the check for an empty list makes the code faster (for an empty list):
$ python -m timeit 'list = []' 'for x in list:' ' print x'
10000000 loops, best of 3: 0.157 usec per loop
$ python -m timeit 'list = []' 'if list:' '  for x in list:' '    print x'
10000000 loops, best of 3: 0.0766 usec per loop
The first variant is more efficient, because Python's for loop checks the length of the list anyway, making an additional explicit check just a waste of CPU cycles.

Put a symbol before and after each character in a string

I would like to add brackets to each character in a string. So
"HelloWorld"
should become:
"[H][e][l][l][o][W][o][r][l][d]"
I have used this code:
word = "HelloWorld"
newWord = ""
for letter in word:
    newWord += "[%s]" % letter
which is the most straightforward way to do it, but the string concatenations are pretty slow.
Any suggestions on speeding up this code?
>>> s = "HelloWorld"
>>> ''.join('[{}]'.format(x) for x in s)
'[H][e][l][l][o][W][o][r][l][d]'
If the string is huge, then using str.join with a list comprehension will be faster and more memory-efficient than using a generator expression (https://stackoverflow.com/a/9061024/846892):
>>> ''.join(['[{}]'.format(x) for x in s])
'[H][e][l][l][o][W][o][r][l][d]'
From Python performance tips:
Avoid this:
s = ""
for substring in list:
    s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
The most pythonic way would probably be with a generator expression:
>>> s = "HelloWorld"
>>> "".join("[%s]" % c for c in s)
'[H][e][l][l][o][W][o][r][l][d]'
Ashwini Chaudhary's answer is very similar, but uses the newer str.format method (available since Python 2.6). The old string interpolation with % still works fine and is a bit simpler.
A bit more creatively, you can insert ][ between each character and surround the whole thing with [ and ]. I guess this might be a bit faster, since it doesn't do as many string interpolations, but speed shouldn't be an issue here.
>>> "[" + "][".join(s) + "]"
'[H][e][l][l][o][W][o][r][l][d]'
If you are concerned about speed and need a fast implementation, try to find one that offloads the iteration to the underlying native code. This is true at least in CPython.
Suggested Implementation
"[{}]".format(']['.join(s))
Output
'[H][e][l][l][o][W][o][r][l][d]'
Comparing with a competing solution
In [12]: s = "a" * 10000
In [13]: %timeit "[{}]".format(']['.join(s))
1000 loops, best of 3: 215 us per loop
In [14]: %timeit ''.join(['[{}]'.format(x) for x in s])
100 loops, best of 3: 3.06 ms per loop
In [15]: %timeit ''.join('[{}]'.format(x) for x in s)
100 loops, best of 3: 3.26 ms per loop
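One detail worth checking before switching to the join-based shortcut: the two styles agree on non-empty input but differ on the empty string. A small sketch (Python 3 assumed; the function names bracket_each and bracket_joined are invented for this comparison):

```python
def bracket_each(s):
    """Wrap every character individually, then join the pieces."""
    return "".join(["[%s]" % c for c in s])


def bracket_joined(s):
    """Insert ][ between characters and add the outer brackets."""
    return "[" + "][".join(s) + "]"


word = "HelloWorld"
print(bracket_each(word))        # [H][e][l][l][o][W][o][r][l][d]
print(bracket_joined(word))      # [H][e][l][l][o][W][o][r][l][d]

# The empty string is where the two approaches diverge.
print(repr(bracket_each("")))    # ''
print(repr(bracket_joined("")))  # '[]'
```

If empty input is possible and should stay empty, the shortcut needs a guard such as returning "" when s is empty.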

Python text validation: a-z and comma (",")

I need to check that some text only contains lower-case letters a-z and a comma (",").
What is the best way to do this in Python?
import re
def matches(s):
    return re.match("^[a-z,]*$", s) is not None
Which gives you:
>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True
You can optimise a bit with re.compile:
import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
    return matcher.match(s) is not None
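One caveat with the pattern above: $ also matches just before a trailing newline, so a string such as "tea\n" passes the check. If that matters, a hedged alternative for Python 3.4 and later is fullmatch, which requires the whole string to match. The names LOWER_AND_COMMA and matches_strict are made up for this sketch.

```python
import re

# No anchors needed: fullmatch only succeeds if the entire string matches.
LOWER_AND_COMMA = re.compile(r"[a-z,]*")


def matches_strict(s):
    """True if s consists solely of a-z and commas (the empty string passes)."""
    return LOWER_AND_COMMA.fullmatch(s) is not None


print(matches_strict("twiddledee,twiddledum"))     # True
print(matches_strict("tea and cakes"))             # False
print(matches_strict("tea\n"))                     # False
print(re.match("^[a-z,]*$", "tea\n") is not None)  # True
```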
import string
allowed = set(string.ascii_lowercase + ',')
if set(text) - allowed:
    pass  # you know it has forbidden characters
else:
    pass  # it doesn't have forbidden characters
Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text) and is altogether cleaner than regexes for this situation.
An alternative that might be faster than two sets is:
allowed = string.ascii_lowercase + ','
if not all(letter in allowed for letter in text):
    pass  # you know it has forbidden characters
Here are some meaningless timeit results. one is the generator expression and two is the set-based solution.
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop
You can see that the set-based one is significantly faster than the generator expression with a small expected alphabet and success conditions. The generator expression is faster with failures because it can bail out early. This is pretty much what's to be expected, so it's interesting to see the numbers back it up.
Another possibility that I forgot about is the hybrid approach.
not all(letter in allowed for letter in set(text))
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop
It slows down the best case-ish but speeds up the worst case-ish. All in all, you'd have to test the different possibilities over a sample of your expected input. The broader the sample, the better.
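The scratch3 module used in the timings is not shown in the answer. A plausible reconstruction of its three functions, written for Python 3, might look like the sketch below. The function bodies are an assumption based on the snippets above, and ALLOWED is a set in all three here, whereas the answer's generator version tested membership in a string.

```python
import string

ALLOWED = set(string.ascii_lowercase + ",")


def one(text):
    """Generator expression: stops at the first forbidden character."""
    return all(ch in ALLOWED for ch in text)


def two(text):
    """Set difference: always builds a set from the whole text."""
    return not (set(text) - ALLOWED)


def three(text):
    """Hybrid: deduplicate first, then test each distinct character."""
    return all(ch in ALLOWED for ch in set(text))


for check in (one, two, three):
    print(check.__name__,
          check("asdfasasdfadsfasdfasdfdaf"),   # True for all three
          check("asdfas2423452345sdfadf34"))    # False for all three
```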
import re
only_allowed = not re.search('[^a-z,]', yourString)
# True: contains only a-z and comma
# False: contains also something else
Not sure what you mean by "contain", but this should point you in the right direction:
reobj = re.compile(r"[a-z,]+")
match = reobj.search(subject)
if match:
    result = match.group()
else:
    result = ""
Just:
def alllower(s):
    if ',' in s:
        s=s.replace(',','a')
    return s.isalpha() and s.islower()
which is efficient and simple.
Or in one line:
lambda s: s.replace(',','a').isalpha() and s.replace(',','a').islower()
#!/usr/bin/env python
import string
text = 'aasdfadf$oih,234'
for letter in text:
    if letter not in string.ascii_lowercase and letter != ',':
        print letter
The characters a-z are represented by bytes 97-122, and ord(char) returns the byte value of a character. Reading the file in binary mode and checking each byte should suffice.
f = open("myfile", "rb")
retVal = False  # becomes True once a byte outside a-z is found
lowerAlphabets = range(97, 123)
try:
    byte = f.read(1)
    while byte != "":
        # Check the current byte before reading the next one,
        # so that the first byte of the file is not skipped.
        if ord(byte) not in lowerAlphabets:
            retVal = True
            break
        byte = f.read(1)
finally:
    f.close()
if retVal:
    print "characters not from a - z"
else:
    print "characters from a - z"
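In Python 3, iterating over a bytes object yields integers directly, so the same byte-range idea can be sketched without ord. The function name has_only_lowercase, the chunk size, and the temporary demo files are assumptions for illustration. Like the code above, this sketch checks a-z only and does not allow the comma.

```python
import os
import tempfile

LOWERCASE_BYTES = frozenset(range(97, 123))  # byte values of 'a' .. 'z'


def has_only_lowercase(path, chunk_size=4096):
    """Return True if every byte in the file is in the range a-z."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return True  # reached end of file without a bad byte
            if any(b not in LOWERCASE_BYTES for b in chunk):
                return False


def _write_temp(data):
    """Write data to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


good = _write_temp(b"abcxyz")
bad = _write_temp(b"abc,XYZ")
good_result = has_only_lowercase(good)
bad_result = has_only_lowercase(bad)
print(good_result)  # True
print(bad_result)   # False
os.remove(good)
os.remove(bad)
```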
