Matching a list of regular expressions to a list of strings - Python

I'm working on matching a list of regular expressions against a list of strings. The problem is that the lists are very big (about 1 million regexes, about 50T strings). What I've got so far is this:
import re

reg_list = ["domain\.com\/picture\.png", "entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]
RESULT_LIST = []
for r in reg_list:
    for x in y:
        if re.findall(r, x):
            RESULT_LIST.append(x)
            print(x)
Which works very well logically but is way too inefficient for those numbers of entries. Is there a better (more efficient) solution for this?

Use any() to test if any of the regular expressions match, rather than looping over the entire list.
Compile all the regular expressions first, so this doesn't have to be done repeatedly.
reg_list = [re.compile(rx) for rx in reg_list]
for word in y:
    if any(rx.search(word) for rx in reg_list):
        RESULT_LIST.append(word)
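With this many patterns, another option worth benchmarking (a sketch, assuming every pattern is valid as an alternation branch) is to union them into one compiled regex, so each string is scanned once rather than once per pattern:

```python
import re

reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]

# Union all patterns into one alternation; the engine then scans
# each string once instead of once per pattern.
combined = re.compile("|".join("(?:%s)" % p for p in reg_list))

RESULT_LIST = [s for s in y if combined.search(s)]
```

Whether this beats the any() loop depends on the patterns, and with a million of them you may hit regex size limits, so try it on a sample first.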

The only enhancements that come to mind are:
Stopping at the first occurrence: re.findall attempts to find multiple matches, which is not what you are after.
Pre-compiling your regexes.
reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
reg_list = [re.compile(x) for x in reg_list]  # pre-compile (enhancement 2)
y = ["test", "string", "entry4also_found", "entry5"]
RESULT_LIST = []
for r in reg_list:
    for x in y:
        if r.search(x):  # stops at first match (enhancement 1)
            RESULT_LIST.append(x)
            print(x)

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop
$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
So, if you are going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

Related

How do I reference several scripts with similar names in Python with a "for" statement?

I want to execute several .py files that have similar names. The goal is to call a1.py, a2.py, and a3.py. I've tried what is below, but I can't get it to work. How do I insert each element of "n" into the filename to execute several scripts? Thanks for any help.
n = [1, 2, 3]
for x in n:
    execfile('C:/a$n.py')
Personally, I really prefer using string formatting to concatenation (more robust to various datatypes). Also, there's no reason to keep a list like that around; it can be replaced by range:
for x in range(1, 4):
    execfile('C:/a%s.py' % x)
range(1, 4) == [1, 2, 3]  # True
That said, the format command used by Marcin is technically more pythonic. I personally find it overly verbose though.
I believe that string concatenation is the most performant option, but performance really shouldn't be the deciding factor for you here.
To recap:
Pythonic Solution:
execfile('C:/a{}.py'.format(x))
C-type Solution (that I personally prefer):
execfile('C:/a%s.py' % x)
Performant Solution (again, performance really shouldn't be your driving force here):
execfile('C:/a' + str(x) + '.py')
EDIT: Turns out I was wrong and the C-type solution is most performant. timeit results below:
$ python -m timeit '["C:/a{}.py".format(x) for x in range(1,4)]'
100000 loops, best of 3: 4.58 usec per loop
$ python -m timeit '["C:/a%s.py" % x for x in range(1,4)]'
100000 loops, best of 3: 2.92 usec per loop
$ python -m timeit '["C:/a" + str(x) + ".py" for x in range(1,4)]'
100000 loops, best of 3: 3.32 usec per loop
You can do this:
n = [1, 2, 3]
for x in n:
    execfile('C:/a{}.py'.format(x))
    print('C:/a{}.py'.format(x))
The names will be:
C:/a1.py
C:/a2.py
C:/a3.py
for x in n:
    execfile('C:/a{}.py'.format(x))
Better:
for x in range(1, 4):
    execfile('C:/a{}.py'.format(x))
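A side note: execfile() was removed in Python 3. A rough stand-in (a sketch; run_script is a hypothetical helper, not a stdlib function) is to read the file and exec() it yourself:

```python
def run_script(path, namespace=None):
    """Read a file and execute it, roughly like Python 2's execfile()."""
    with open(path) as f:
        code = f.read()
    # Execute in the given namespace (a fresh dict by default).
    exec(code, namespace if namespace is not None else {})

# e.g., assuming a1.py .. a3.py exist:
# for x in range(1, 4):
#     run_script('C:/a{}.py'.format(x))
```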

Put a symbol before and after each character in a string

I would like to add brackets to each character in a string. So
"HelloWorld"
should become:
"[H][e][l][l][o][W][o][r][l][d]"
I have used this code:
word = "HelloWorld"
newWord = ""
for letter in word:
    newWord += "[%s]" % letter
which is the most straightforward way to do it, but the string concatenations are pretty slow.
Any suggestions on speeding up this code?
>>> s = "HelloWorld"
>>> ''.join('[{}]'.format(x) for x in s)
'[H][e][l][l][o][W][o][r][l][d]'
If the string is huge, then using str.join with a list comprehension will be faster and more memory-efficient than using a generator expression (https://stackoverflow.com/a/9061024/846892):
>>> ''.join(['[{}]'.format(x) for x in s])
'[H][e][l][l][o][W][o][r][l][d]'
From Python performance tips:
Avoid this:
s = ""
for substring in list:
    s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
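Applied to this question, the tip above amounts to collecting the pieces in a list and joining once at the end (a sketch of the same loop, restructured):

```python
word = "HelloWorld"
pieces = []
for letter in word:
    pieces.append("[%s]" % letter)   # appending to a list is cheap
newWord = "".join(pieces)            # one final concatenation
```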
The most pythonic way would probably be with a generator comprehension:
>>> s = "HelloWorld"
>>> "".join("[%s]" % c for c in s)
'[H][e][l][l][o][W][o][r][l][d]'
Ashwini Chaudhary's answer is very similar, but uses the newer str.format method. The old string interpolation with % still works fine and is a bit simpler.
A bit more creatively, inserting ][ between each character, and surrounding it all with []. I guess this might be a bit faster, since it doesn't do as many string interpolations, but speed shouldn't be an issue here.
>>> "[" + "][".join(s) + "]"
'[H][e][l][l][o][W][o][r][l][d]'
If you are concerned about speed and need a fast implementation, try to find one that offloads the iteration to the underlying native code. This holds at least for CPython.
Suggested Implementation
"[{}]".format(']['.join(s))
Output
'[H][e][l][l][o][W][o][r][l][d]'
Comparing with a competing solution
In [12]: s = "a" * 10000
In [13]: %timeit "[{}]".format(']['.join(s))
1000 loops, best of 3: 215 us per loop
In [14]: %timeit ''.join(['[{}]'.format(x) for x in s])
100 loops, best of 3: 3.06 ms per loop
In [15]: %timeit ''.join('[{}]'.format(x) for x in s)
100 loops, best of 3: 3.26 ms per loop

Populate list or tuple from callable or lambda in python

This is a problem I've come across a lot lately. Google doesn't seem to have an answer so I bring it to the good people of stack overflow.
I am looking for a simple way to populate a list with the output of a function. Something like this:
fill(random.random(), 3) #=> [0.04095623, 0.39761869, 0.46227642]
Here are other ways I've found to do this. But I'm not really happy with them, as they seem inefficient.
results = []
for x in xrange(3): results.append(random.random())
#results => [0.04095623, 0.39761869, 0.46227642]
and
map(lambda x: random.random(), [None] * 3)
#=> [0.04095623, 0.39761869, 0.46227642]
Suggestions?
Thanks for all the answers. I knew there was a more python-esque way.
And to the efficiency questions...
$ python --version
Python 2.7.1+
$ python -m timeit "import random" "map(lambda x: random.random(), [None] * 3)"
1000000 loops, best of 3: 1.65 usec per loop
$ python -m timeit "import random" "results = []" "for x in xrange(3): results.append(random.random())"
1000000 loops, best of 3: 1.41 usec per loop
$ python -m timeit "import random" "[random.random() for x in xrange(3)]"
1000000 loops, best of 3: 1.09 usec per loop
How about a list comprehension?
[random.random() for x in xrange(3)]
Also, in many cases, you need the values just once. In these cases, a generator expression which computes the values just-in-time and does not require a memory allocation is preferable:
results = (random.random() for x in xrange(3))
for r in results:
    ...
# results is "used up" now.
# We could have used results_list = list(results) to convert the generator.
By the way, in Python 3.x, xrange has been replaced by range. In Python 2.x, range allocates the memory and calculates all values beforehand (like a list comprehension), whereas xrange calculates the values just-in-time and does not allocate memory (it's a generator).
Why do you think they are inefficient?
There is another way to do it, a list comprehension:
lst = [random.random() for i in range(3)]
Something more generic...
from random import random
fill = lambda func, num: [func() for x in xrange(num)]
# for generating tuples:
fill = lambda func, num: (func() for x in xrange(num))
# then just call:
fill(random, 4)
# or...
fill(lambda : 1+2*random(), 4)
lst = [random.random() for i in xrange(3)]
lst = [random.random() for i in [0]*3]
lst = [i() for i in [random.random]*3]
Or:
fill = lambda f, n: [f() for i in xrange(n)]
fill(random.random, 3)  #=> [0.04095623, 0.39761869, 0.46227642]
List comprehension is probably clearest, but for the itertools afficionado:
>>> list(itertools.islice(iter(random.random, None), 3))
[0.42565379345946064, 0.41754360645917354, 0.797286438646947]
A quick check with timeit shows that the itertools version is ever so slightly faster for more than 10 items, but still go with whatever seems clearest to you:
C:\Python32>python lib\timeit.py -s "import random, itertools" "list(itertools.islice(iter(random.random, None), 10))"
100000 loops, best of 3: 2.93 usec per loop
C:\Python32>python lib\timeit.py -s "import random, itertools" "[random.random() for _ in range(10)]"
100000 loops, best of 3: 3.19 usec per loop
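On Python 3 (where xrange is gone), the fill() the question asks for can be sketched as a thin wrapper over a list comprehension (fill is the questioner's desired name, not a stdlib function; note the callable is passed uncalled):

```python
import random

def fill(func, n):
    """Call func n times and collect the results in a list."""
    return [func() for _ in range(n)]

values = fill(random.random, 3)   # three floats in [0.0, 1.0)
```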

In python, what is more efficient? Modifying lists or strings?

Regardless of ease of use, which is more computationally efficient? Constantly slicing lists and appending to them? Or taking substrings and doing the same?
As an example, let's say I have two binary strings "11011" and "01001". If I represent these as lists, I'll be choosing a random "slice" point. Let's say I get 3. I'll take the first 3 characters of the first string and the remaining characters of the second string (so I'd have to slice both) and create a new string out of it.
Would this be more efficiently done by cutting the substrings or by representing it as a list ( [1, 1, 0, 1, 1] ) rather than a string?
>>> a = "11011"
>>> b = "01001"
>>> import timeit
>>> def strslice():
...     return a[:3] + b[3:]
>>> def lstslice():
...     return list(a)[:3] + list(b)[3:]
>>> c = list(a)
>>> d = list(b)
>>> def lsts():
...     return c[:3] + d[3:]
>>> timeit.timeit(strslice)
0.5103488475836432
>>> timeit.timeit(lstslice)
2.4350100538824613
>>> timeit.timeit(lsts)
1.0648406858527295
timeit is a good tool for micro-benchmarking, but it needs to be used with the utmost care when the operations you want to compare may involve in-place alterations -- in this case, you need to include extra operations designed to make needed copies. Then, first time just the "extra" overhead:
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b)'
100000 loops, best of 3: 5.01 usec per loop
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b)'
100000 loops, best of 3: 5.06 usec per loop
So making the two brand-new lists we need (to avoid alteration) costs a tad more than 5 microseconds (when focused on small differences, run things at least 2-3 times to eyeball the uncertainty range). After which:
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b);x=a[:3]+b[3:]'
100000 loops, best of 3: 5.5 usec per loop
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b);x=a[:3]+b[3:]'
100000 loops, best of 3: 5.47 usec per loop
string slicing and concatenation in this case can be seen to cost another 410-490 nanoseconds. And:
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b);la[3:]=lb[3:]'
100000 loops, best of 3: 5.99 usec per loop
$ python -mtimeit -s'a="11011";b="01001"' 'la=list(a);lb=list(b);la[3:]=lb[3:]'
100000 loops, best of 3: 5.99 usec per loop
in-place list splicing can be seen to cost 930-980 nanoseconds. The difference is safely above the noise/uncertainty levels, so you can reliably state that for this use case working with strings is going to take roughly half as much time as working in-place with lists. Of course, it's also crucial to measure a range of use cases that are relevant and representative of your typical bottleneck tasks!
In general, modifying lists is more efficient than modifying strings, because strings are immutable.
It really depends on actual use cases, and as others have said, profile it, but in general, appending to lists will be better, because it can be done in place, whereas "appending to strings" actually creates a new string that concatenates the old strings. This can rapidly eat up memory. (Which is a different issue from computational efficiency, really).
Edit: If you want computational efficiency with binary values, don't use strings or lists. Use integers and bitwise operations. With recent versions of python, you can use binary representations when you need them:
>>> bin(42)
'0b101010'
>>> 0b101010
42
>>> int('101010')
101010
>>> int('101010', 2)
42
>>> int('0b101010')
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: '0b101010'
>>> int('0b101010', 2)
42
Edit 2:
def strslice(a, b):
    return a[:3] + b[3:]
might be better written something like:
def binsplice(a, b):
    mask = 0b11100
    return (a & mask) + (b & ~mask)

>>> a = 0b11011
>>> b = 0b1001
>>> bin(binsplice(a, b))
'0b11001'
>>>
It might need to be modified if your binary numbers are different sizes.
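For concreteness, here is a runnable sketch of that bit-mask splice, generalized with width/cut parameters (my additions, not from the answer) so it adapts to other sizes:

```python
def binsplice(a, b, width=5, cut=3):
    """Keep the top `cut` bits of a and the remaining low bits of b."""
    low = width - cut
    mask = ((1 << cut) - 1) << low            # 0b11100 for width=5, cut=3
    # Mask off b's low bits explicitly so Python's negative ~mask is safe.
    return (a & mask) | (b & ~mask & ((1 << width) - 1))

result = binsplice(0b11011, 0b01001)          # '110' from a + '01' from b
```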

Remove characters except digits from string using Python?

How can I remove all characters except numbers from string?
Use re.sub, like so:
>>> import re
>>> re.sub(r'\D', '', 'aas30dsa20')
'3020'
\D matches any non-digit character, so the code above is essentially replacing every non-digit character with the empty string.
Or you can use filter, like so (in Python 2):
>>> filter(str.isdigit, 'aas30dsa20')
'3020'
Since in Python 3, filter returns an iterator instead of a list, you can use the following instead:
>>> ''.join(filter(str.isdigit, 'aas30dsa20'))
'3020'
In Python 2.*, by far the fastest approach is the .translate method:
>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>
string.maketrans makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256)) (just faster to make;-). .translate applies the translation table (which here is irrelevant since all essentially means identity) AND deletes characters present in the second argument -- the key part.
.translate works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.
Back to 2.*, the performance difference is impressive...:
$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop
Speeding things up by 7-8 times is hardly peanuts, so the translate method is well worth knowing and using. The other popular non-RE approach...:
$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop
is 50% slower than RE, so the .translate approach beats it by over an order of magnitude.
In Python 3, or for Unicode, you need to pass .translate a mapping (with ordinals, not characters directly, as keys) that returns None for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:
import string

class Del:
    def __init__(self, keep=string.digits):
        self.comp = dict((ord(c), c) for c in keep)
    def __getitem__(self, k):
        return self.comp.get(k)

DD = Del()
x = 'aaa12333bb445bb54b5b52'
x.translate(DD)
also emits '1233344554552'. However, putting this in xx.py we have...:
$ python3.1 -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop
...which shows the performance advantage disappears, for this kind of "deletion" tasks, and becomes a performance decrease.
s=''.join(i for i in s if i.isdigit())
Another generator variant.
You can use filter:
filter(lambda x: x.isdigit(), "dasdasd2313dsa")
On Python 3 you have to join this (kinda ugly :( ):
''.join(filter(lambda x: x.isdigit(), "dasdasd2313dsa"))
You can easily do it using a regex:
>>> import re
>>> re.sub(r"\D", "", "£70,000")
'70000'
along the lines of bayer's answer:
''.join(i for i in s if i.isdigit())
The OP mentions in the comments that he wants to keep the decimal point. This can be done with the re.sub method (as per the second and IMHO best answer) by explicitly listing the characters to keep, e.g.
>>> re.sub(r"[^0123456789.]", "", "poo123.4and5fish")
'123.45'
x.translate(None, string.digits)
will delete all digits from the string (Python 2 only; str.translate no longer takes a deletechars argument in Python 3). To delete letters and keep the digits, do this:
x.translate(None, string.letters)
Use a generator expression:
>>> s = "foo200bar"
>>> new_s = "".join(i for i in s if i in "0123456789")
A fast version for Python 3:
# xx3.py
from collections import defaultdict
import string

_NoneType = type(None)

def keeper(keep):
    table = defaultdict(_NoneType)
    table.update({ord(c): c for c in keep})
    return table

digit_keeper = keeper(string.digits)
Here's a performance comparison vs. regex:
$ python3.3 -mtimeit -s'import xx3; x="aaa12333bb445bb54b5b52"' 'x.translate(xx3.digit_keeper)'
1000000 loops, best of 3: 1.02 usec per loop
$ python3.3 -mtimeit -s'import re; r = re.compile(r"\D"); x="aaa12333bb445bb54b5b52"' 'r.sub("", x)'
100000 loops, best of 3: 3.43 usec per loop
So it's a little bit more than 3 times faster than regex, for me. It's also faster than class Del above, because defaultdict does all its lookups in C, rather than (slow) Python. Here's that version on my same system, for comparison.
$ python3.3 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
100000 loops, best of 3: 13.6 usec per loop
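Putting that together as one self-contained snippet (the same technique as xx3.py above, inlined):

```python
from collections import defaultdict
import string

def keeper(keep):
    # Missing ordinals default to None, which str.translate deletes.
    table = defaultdict(type(None))
    table.update({ord(c): c for c in keep})
    return table

digit_keeper = keeper(string.digits)
result = "aaa12333bb445bb54b5b52".translate(digit_keeper)
```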
Try:
import re
s = '1abcd2XYZ3'
s_without_letters = re.sub(r'[a-z]', '', s.lower())
this should give:
'123'
Ugly but works:
>>> s
'aaa12333bb445bb54b5b52'
>>> a = ''.join(filter(lambda x : x.isdigit(), s))
>>> a
'1233344554552'
>>>
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 2.48 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'
100000 loops, best of 3: 2.02 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 2.37 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'
100000 loops, best of 3: 1.97 usec per loop
I had observed that join is faster than sub. (Note that the findall variant above collects the letters, i.e. the inverse task; to keep only the digits, use "".join(re.findall(r"\d+", x)).)
You can read each character; if it is a digit, include it in the answer. The str.isdigit() method is a way to know whether a character is a digit.
your_input = '12kjkh2nnk34l34'
your_output = ''.join(c for c in your_input if c.isdigit())
print(your_output) # '1223434'
You can use join + filter + lambda:
''.join(filter(lambda s: s.isdigit(), "20 years ago, 2 months ago, 2 days ago"))
Output: '2022'
Not a one-liner but very simple:
buffer = ""
some_str = "aas30dsa20"
for char in some_str:
    if char.isdigit():
        buffer += char
print(buffer)  # 3020
I used this. 'letters' should contain all the letters that you want to get rid of:
Output = Input.translate({ord(i): None for i in 'letters'})
Example:
Input = "I would like 20 dollars for that suit"
Output = Input.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxzy'})
print(Output)
Output (note that the uppercase 'I' and the spaces are not in the deletion set, so they survive):
I   20
my_string = "sdfsdfsdfsfsdf353dsg345435sdfs525436654.dgg("
my_string = ''.join((ch if ch in '0123456789' else '') for ch in my_string)
print('output: ' + my_string)
output: 353345435525436654
Another one:
import re
re.sub('[^0-9]', '', 'ABC123 456')
Result:
'123456'
