Log File: Extract data block using Regex findall [duplicate] - python

I want to get the first match of a regex.
In the following case, re.findall gives me a list:
text = 'aa33bbb44'
re.findall('\d+',text)
# ['33', '44']
I could extract the first element of the list:
text = 'aa33bbb44'
re.findall('\d+',text)[0]
# '33'
But that only works if there is at least one match, otherwise I'll get an IndexError:
text = 'aazzzbbb'
re.findall('\d+',text)[0]
# IndexError: list index out of range
In which case I could define a function:
def return_first_match(text):
    try:
        result = re.findall(r'\d+', text)[0]
    except IndexError:
        result = ''
    return result
Is there a way of obtaining that result without defining a new function?

You could embed the '' default in your regex by adding |$:
>>> re.findall('\d+|$', 'aa33bbb44')[0]
'33'
>>> re.findall('\d+|$', 'aazzzbbb')[0]
''
>>> re.findall('\d+|$', '')[0]
''
This also works with re.search, as pointed out in other answers:
>>> re.search('\d+|$', 'aa33bbb44').group()
'33'
>>> re.search('\d+|$', 'aazzzbbb').group()
''
>>> re.search('\d+|$', '').group()
''

If you only need the first match, then use re.search instead of re.findall:
>>> m = re.search('\d+', 'aa33bbb44')
>>> m.group()
'33'
>>> m = re.search('\d+', 'aazzzbbb')
>>> m.group()
Traceback (most recent call last):
File "<pyshell#281>", line 1, in <module>
m.group()
AttributeError: 'NoneType' object has no attribute 'group'
Then you can use m as a condition to check whether a match was found:
>>> m = re.search('\d+', 'aa33bbb44')
>>> if m:
...     print('First number found = {}'.format(m.group()))
... else:
...     print('Not Found')
First number found = 33

I'd go with:
r = re.search("\d+", ch)
result = r.group(0) if r else ""
re.search only looks for the first match in the string anyway, so I think it makes your intent slightly more clear than using findall.

You shouldn't be using .findall() at all - .search() is what you want. It finds the leftmost match, which is what you want (or returns None if no match exists).
m = re.search(pattern, text)
result = m.group(0) if m else ""
Whether you want to put that in a function is up to you. It's unusual to want to return an empty string if no match is found, which is why nothing like that is built in. It's impossible to get confused about whether .search() on its own finds a match (it returns None if it didn't, or an SRE_Match object if it did).
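If you do want a reusable helper, a minimal sketch might look like this (the name first_match and the default parameter are my own, not anything built into re):
import re

def first_match(pattern, text, default=''):
    # Return the first match of `pattern` in `text`, or `default` if there is none.
    m = re.search(pattern, text)
    return m.group(0) if m else default

first_match(r'\d+', 'aa33bbb44')  # '33'
first_match(r'\d+', 'aazzzbbb')   # ''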

You can do:
x = re.findall('\d+', text)
result = x[0] if len(x) > 0 else ''
Note that your question isn't really about regex; it's about how to safely take the first element of a list that may be empty.

This may perform a bit better when a large share of the input data does not contain the wanted piece, because raising an exception has a greater cost:
def return_first_match(text):
    result = re.findall(r'\d+', text)
    result = result[0] if result else ""
    return result

Just assign the results to a variable, then iterate over the variable:
text = 'aa33bbb44'
result = re.findall(r'\d+', text)
for item in result:
    print(item)

With Assignment expressions (PEP572):
text = 'aa33bbb44'
r = m.group() if (m := re.search(r'\d+', text)) is not None else ''

With re.findall, you can convert the output into an iterator with iter() and call next() on it to get the first result. next() is particularly useful for this task because a default value (e.g. '') can be passed to it; the default is returned if the iterator is empty, i.e. if there are no matches.
next(iter(re.findall('\d+', 'aa33bbb44')), '') # '33'
next(iter(re.findall('\d+', 'aazzzbbb')), '') # ''
At this point, next() can be used with re.finditer for the job as well.
next(re.finditer('\d+', 'aa33bbb44'), [''])[0] # '33'
next(re.finditer('\d+', 'aazzzbbb'), [''])[0] # ''
You can also use the walrus operator with re.search for a one-liner.
m[0] if (m:=re.search('\d+', 'aa33bbb44')) else '' # '33'
m[0] if (m:=re.search('\d+', 'aazzzbbb')) else '' # ''
For this specific task, the argument against re.findall is performance and, indeed, for large strings the gap is huge. If there are multiple matches, re.findall is much, much slower than re.search or re.finditer [1]. However, if there are no matches, re.search with the walrus and re.finditer are the fastest [2].
[1] Timings for strings with 1 million characters and 100k matches.
text = 'aabbbccc11'*100_000
%timeit m[0] if (m:=re.search('\d+', text)) else ''
# 1.94 µs ± 192 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit next(re.finditer('\d+', text), [''])[0]
# 2.38 µs ± 122 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit next(iter(re.findall('\d+', text)), '')
# 59 ms ± 8.65 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit re.search('\d+|$', text)[0]
# 2.32 µs ± 300 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit re.findall('\d+|$', text)[0]
# 82.7 ms ± 1.64 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
[2] Timings for strings with 1 million characters and no matches.
text = 'aabbbcccdd'*100000
%timeit m[0] if (m:=re.search('\d+', text)) else ''
# 26.3 ms ± 662 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit next(re.finditer('\d+', text), [''])[0]
# 26 ms ± 195 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit next(iter(re.findall('\d+', text)), '')
# 26.2 ms ± 615 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit re.search('\d+|$', text)[0]
# 72.9 ms ± 14.1 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit re.findall('\d+|$', text)[0]
# 67.8 ms ± 2.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

Related

With Python, what is the most efficient way to tokenize a string (SELFIES) into a list?

I am currently working with SELFIES (self-referencing embedded strings, GitHub: https://github.com/aspuru-guzik-group/selfies), which is basically a string representation of a molecule. It is a sequence of tokens delimited by brackets; e.g. propane would be written as "[C][C][C]". I would like to find the most efficient way to get a list of tokens like so:
selfies= "[C][C][C]"
tokens= some_function(selfies)
tokens
["[C]","[C]","[C]"]
I already found 3 ways to do it:
With the "native" function from the GitHub repo (https://github.com/aspuru-guzik-group/selfies/blob/master/selfies/utils/selfies_utils.py):
def split_selfies(selfies: str) -> Iterator[str]:
    """Tokenizes a SELFIES string into its individual symbols.

    :param selfies: a SELFIES string.
    :return: the symbols of the SELFIES string one-by-one with order preserved.

    :Example:

    >>> import selfies as sf
    >>> list(sf.split_selfies("[C][=C][F].[C]"))
    ['[C]', '[=C]', '[F]', '.', '[C]']
    """
    left_idx = selfies.find("[")

    while 0 <= left_idx < len(selfies):
        right_idx = selfies.find("]", left_idx + 1)
        if right_idx == -1:
            raise ValueError("malformed SELFIES string, hanging '[' bracket")

        next_symbol = selfies[left_idx: right_idx + 1]
        yield next_symbol

        left_idx = right_idx + 1
        if selfies[left_idx: left_idx + 1] == ".":
            yield "."
            left_idx += 1
%%timeit
tokens= list(sf.split_selfies(selfies))
3.41 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit: "." is never present in my case and it is not considered in solution 2 and 3 for speed's sake
This is kinda slow probably because of the conversion to a list
One from the creator of the library (https://github.com/aspuru-guzik-group/stoned-selfies/blob/main/GA_rediscover.py) :
def get_selfie_chars(selfie):
    '''Obtain a list of all selfie characters in string selfie

    Parameters:
    selfie (string) : A selfie string - representing a molecule

    Example:
    >>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
    ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']

    Returns:
    chars_selfie: list of selfie characters present in molecule selfie
    '''
    chars_selfie = []  # A list of all SELFIE symbols from string selfie
    while selfie != '':
        chars_selfie.append(selfie[selfie.find('['): selfie.find(']') + 1])
        selfie = selfie[selfie.find(']') + 1:]
    return chars_selfie
%%timeit
tokens= get_selfie_chars(selfies)
3.44 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This, surprisingly, takes roughly the same amount of time as the native function.
My implementation, a combination of a list comprehension, slicing, and .split():
def selfies_split(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]
%%timeit
tokens=selfies_split(selfies)
1.05 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My implementation is roughly 3-fold faster, but I reckon that the most efficient way to tokenize is probably to use regex with the re package. I have never used it and I am not particularly comfortable with regex, so I fail to see how to implement it in a way that yields the best results.
Edit:
Suggested from answers:
def stackoverflow_1_split(selfies):
    atoms = selfies[1:-1].replace('][', "$").split("$")
    return list(map('[{}]'.format, atoms))
%%timeit
tokens=stackoverflow_1_split(selfies)
1.75 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without the list conversion, it is actually faster than my implementation (575 ns ± 10 ns), but the list is a requirement.
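For reference, a sketch of that "without the list conversion" variant (hypothetical name; it returns a lazy map object rather than a list):
def stackoverflow_1_split_lazy(selfies):
    # Same splitting logic, but returns a lazy map object;
    # the caller must call list() on it if an actual list is needed.
    atoms = selfies[1:-1].replace('][', "$").split("$")
    return map('[{}]'.format, atoms)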
Second suggestion from answers:
import re
def stackoverflow_2_split(selfies):
    return re.findall(r".*?]", selfies)
%%timeit
tokens=stackoverflow_2_split(selfies)
1.81 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Surprisingly, re does not seem to outperform the other solutions.
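One thing worth trying, not benchmarked here: precompiling the pattern once with re.compile may shave off the per-call pattern-cache lookup that re.findall performs. A sketch under that assumption:
import re

TOKEN_RE = re.compile(r".*?]")  # compiled once, reused on every call

def stackoverflow_2_split_compiled(selfies):
    return TOKEN_RE.findall(selfies)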
Third suggestion from answers:
def stackoverflow_3_split(selfies):
    return selfies.replace(']', '] ').split()
%%timeit
tokens=stackoverflow_3_split(selfies)
485 ns ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is the fastest solution so far, roughly 2 times faster than my implementation. Well done Kelly!
With regex you can do it as follows:
import re
def get_selfie_chars(selfie):
    return re.findall(r".*?]", selfie)
If a point should be a separate match then:
return re.findall(r"\.|.*?]", selfie)
Another:
selfies.replace(']', '] ').split()
Benchmark with 50 tokens (since you said that's your mean):
7.29 us original
3.91 us Kelly <= mine
8.06 us keepAlive
8.87 us trincot
With your "[C][C][C]" instead:
0.87 us original
0.44 us Kelly
0.88 us keepAlive
1.45 us trincot
Code (Try it online!):
from timeit import repeat
import re
def original(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]

def Kelly(selfies):
    return selfies.replace(']', '] ').split()

def keepAlive(selfies):
    atoms = selfies[1:-1].split('][')
    return [f'[{a}]' for a in atoms]

def trincot(selfie):
    return re.findall(r".*?]", selfie)
fs = original, Kelly, keepAlive, trincot
selfies = ''.join(f'[{i}]' for i in range(50))
expect = original(selfies)
for f in fs:
    print(f(selfies) == expect, f.__name__)

for _ in range(3):
    print()
    for f in fs:
        number = 1000
        t = min(repeat(lambda: f(selfies), number=number)) / number
        print('%.2f us ' % (t * 1e6), f.__name__)
What about doing
>>> atoms = selfies[1:-1].split('][')
>>> atoms
["C","C","C"]
Assuming you do not need the square brackets anymore. Otherwise, you could ultimately do
>>> [f'[{a}]' for a in atoms]
["[C]","[C]","[C]"]

Efficiently test if an item is in a sorted list of strings

Suppose I have a list of short lowercase [a-z] strings (max length 8):
L = ['cat', 'cod', 'dog', 'cab', ...]
How to efficiently determine if a string s is in this list?
I know I can do if s in L: but I could presort L and binary-tree search.
I could even build my own tree, letter by letter. So, setting s='cat', T[ord(s[0]) - ord('a')] gives the subtree leading to 'cat' and 'cab', etc. But eek, messy!
I could also make my own hashfunc, as L is static.
def hash_(item):
    w = [127**i * (ord(j) - ord('0')) for i, j in enumerate(item)]
    return sum(w) % 123456
... and just fiddle the numbers until I don't get duplicates. Again, ugly.
Is there anything out-of-the-box I can use, or must I roll my own?
There are of course going to be solutions everywhere along the complexity/optimisation curve, so my apologies in advance if this question is too open ended.
I'm hunting for something that gives decent performance gain in exchange for a low LoC cost.
The builtin Python set is almost certainly going to be the most efficient device you can use. (Sure, you could roll out cute things such as a DAG of your "vocabulary", but this is going to be much, much slower).
So, convert your list into a set (preferably built once if multiple tests are to be made) and test for membership:
s in set(L)
Or, for multiple tests:
set_values = set(L)
# ...
if s in set_values:
    # ...
Here is a simple example to illustrate the performance:
from string import ascii_lowercase
import random
n = 1_000_000
L = [''.join(random.choices(ascii_lowercase, k=6)) for _ in range(n)]
Time to build the set:
%timeit set(L)
# 99.9 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Time to query against the set:
set_values = set(L)
# non-existent string
%timeit 'foo' in set_values
# 45.1 ns ± 0.0418 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# existing value
s = L[-1]
a = %timeit -o s in set_values
# 45 ns ± 0.0286 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Contrast that to testing directly against the list:
b = %timeit -o s in L
# 16.5 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
b.average / a.average
# 359141.74
When's the last time you made a 350,000x speedup ;-) ?
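For completeness, the "presort and binary search" idea from the question is only a few lines with the standard bisect module; it is typically slower than a set lookup but avoids building a second container. A sketch, not benchmarked here:
from bisect import bisect_left

L_sorted = sorted(L)

def contains(sorted_list, s):
    # Binary search: O(log n) per lookup on the sorted list.
    i = bisect_left(sorted_list, s)
    return i < len(sorted_list) and sorted_list[i] == s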

How do I use regular expressions to preserve only the first symbol in a string?

How can I use a regular expression to preserve only the first dot (.) symbol in a string?
for example, I want to take this string:
"1.0.0.4.55.34..3"
and to turn it into:
"1.00455343"
For this specific example, with numbers separated by dots and only the first dot kept, you can use a regex to split the numbers on the dots and then add the first dot back:
import re
s = "10.0.0.4.55.34..3"
first, *rest = re.split(r"\.+", s)
print(f"{first}.{''.join(rest)}")
Gives:
10.00455343
If you want to use regex and re.sub, you can use the below:
import re

str1 = "1.0.0.4.55.34..3"

def func(group1, group2):
    return group1 + group2.replace('.', '')

re.sub(r'(.*?\.)(.*)', lambda x: func(x.group(1), x.group(2)), str1)
'1.00455343'
It turns out it is slightly faster than the split approach:
%%timeit
re.sub(r'(.*?\.)(.*)', lambda x : func(x.group(1), x.group(2)), str1)
1.31 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
first, *rest = re.split(r"\.+", str1)
f"{first}.{''.join(rest)}"
1.99 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
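If a regex is not strictly required, the same result can be had with str.partition, which splits at the first dot only. A sketch, not taken from the answers above:
s = "1.0.0.4.55.34..3"
first, sep, rest = s.partition('.')       # split at the first dot only
result = first + sep + rest.replace('.', '')
print(result)  # 1.00455343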

SyntaxError with properly formatted list comprehension conditional

I have a set I want to update if there is a match in another set. Otherwise, I want to append strings of error messages to a list when there is no match. I referenced if/else in a list comprehension to write my code.
Here is what I wrote:
logstocrunch_set=dirlogs_set.difference(dblogs_set)
pattern = re.compile(r"\d*F[IR]P",re.IGNORECASE) #to find register values
logstocrunch_finset = set()
errorlist = []
logstocrunch_finset.update([x for x if pattern.search(x) else errorlist.append(f'{x} is not proper name') for x in logstocrunch_set])
However, when I run this, I get an invalid syntax error with the arrow pointing at my if statement.
So why is this happening?
The syntax of a list comprehension with a condition is:
[<value> for <variable> in <iterable> if <condition>]
if <condition> goes after the iterable, not before it.
Also, you can't have an else clause there. It's not a conditional expression that returns different values, it's just used to filter the values in the iterator, so else makes no sense.
You seem to be confusing it with a conditional expression in the <value> part, which allows you to specify different values to be returned in the resulting list depending on a condition. That's just an ordinary conditional expression, not specific to list comprehensions.
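A quick illustration of the difference:
# Filtering: `if` after the iterable drops non-matching items
evens = [x for x in range(10) if x % 2 == 0]                  # [0, 2, 4, 6, 8]

# Conditional expression: `a if cond else b` in the value part keeps every item
labels = ['even' if x % 2 == 0 else 'odd' for x in range(10)]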
You shouldn't use a list comprehension if you want to update multiple targets. Use an ordinary loop.
logstocrunch_finset = set()
errorlist = []
for x in logstocrunch_set:
    if pattern.search(x):
        logstocrunch_finset.add(x)
    else:
        errorlist.append(f'{x} is not proper name')
A list comprehension is a way of creating a single list. A basic conditional one must be in the format:
[ expression for item in iterable if condition ]
You can't (easily) update two objects with one comprehension. Also, there's not a lot of point declaring logstocrunch_finset and errorlist and then populating them. Instead, how about something like:
pattern = re.compile(r"\d*F[IR]P", re.IGNORECASE)
logstocrunch_finset = {x for x in logstocrunch_set if pattern.search(x)}
errorlist = [f'{x} is not proper name' for x in logstocrunch_set.difference(logstocrunch_finset)]
UPDATE BELOW - Performance comparison with for loop
As #Barmar suggested, I benchmarked our two solutions. There's not a lot in it. The two comprehensions seem to handle a larger input set better. Changing the ratio of valid to invalid data didn't seem to make much difference.
import re
range_limit = 10
logstocrunch_set = set(
    [f'{i}FRP' for i in range(range_limit)] +
    [f'longer_{i}frp_lower' for i in range(range_limit)] +
    ['not valid', 'something else']
)
pattern = re.compile(r"\d*F[IR]P",re.IGNORECASE)
%%timeit -n 100000 -r 20
logstocrunch_finset = set()
errorlist = []
for x in logstocrunch_set:
    if pattern.search(x):
        logstocrunch_finset.add(x)
    else:
        errorlist.append(f'{x} is not proper name')
range_limit = 10 | 9.53 µs ± 34.2 ns per loop (mean ± std. dev. of 20 runs, 100000 loops each)
range_limit = 50 | 45.5 µs ± 699 ns per loop (mean ± std. dev. of 20 runs, 100000 loops each)
range_limit = 100 | 89.4 µs ± 1.2 µs per loop (mean ± std. dev. of 10 runs, 100000 loops each)
%%timeit -n 100000 -r 20
logstocrunch_finset = {x for x in logstocrunch_set if pattern.search(x)}
errorlist = [f'{x} is not proper name' for x in logstocrunch_set.difference(logstocrunch_finset)]
range_limit = 10 | 9.58 µs ± 14.1 ns per loop (mean ± std. dev. of 20 runs, 100000 loops each)
range_limit = 50 | 42.2 µs ± 24.7 ns per loop (mean ± std. dev. of 20 runs, 100000 loops each)
range_limit = 100 | 82.2 µs ± 491 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)

Shuffle string data in python

I have a column with 10 million strings. The characters in the strings need to be rearranged in a certain way.
Original string: AAA01188P001
Shuffled string: 188A1A0AP001
Right now I have a for loop running that takes each string and repositions every letter, but this takes hours to complete. Is there a quicker way to achieve this result?
This is the for loop.
for i in range(0, len(OrderProduct)):
    s = list(OrderProduct['OrderProductId'][i])
    # swap positions 1 and 7
    a = s[1]
    s[1] = s[7]
    s[7] = a
    # swap positions 3 and 6
    a = s[3]
    s[3] = s[6]
    s[6] = a
    # swap positions 2 and 3
    a = s[2]
    s[2] = s[3]
    s[3] = a
    # swap positions 5 and 0
    a = s[5]
    s[5] = s[0]
    s[0] = a
    OrderProduct['OrderProductId'][i] = ''.join(s)
I made a few performance tests using different methods:
Here are the results I got for 1000000 shuffles:
188A1AA0P001 usefString 0.518183742
188A1AA0P001 useMap 1.415851829
188A1AA0P001 useConcat 0.5654986979999999
188A1AA0P001 useFormat 0.800639699
188A1AA0P001 useJoin 0.5488918539999998
Based on this, an f-string with hard-coded slices seems to be the fastest.
Here is the code I used to test:
def usefString(s): return f"{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}"
posMap = [5,6,7,0,4,1,2,3,8,9,10,11]
def useMap(s): return "".join(map(lambda i:s[i], posMap))
def useConcat(s): return s[5:8]+s[0]+s[4]+s[1:4]+s[8:]
def useFormat(s): return '{}{}{}{}{}'.format(s[5:8],s[0],s[4],s[1:4],s[8:])
def useJoin(s): return "".join([s[5:8],s[0],s[4],s[1:4],s[8:]])
from timeit import timeit
count = 1000000
s = "AAA01188P001"
t = timeit(lambda:usefString(s),number=count)
print(usefString(s),"usefString",t)
t = timeit(lambda:useMap(s),number=count)
print(useMap(s),"useMap",t)
t = timeit(lambda:useConcat(s),number=count)
print(useConcat(s),"useConcat",t)
t = timeit(lambda:useFormat(s),number=count)
print(useFormat(s),"useFormat",t)
t = timeit(lambda:useJoin(s),number=count)
print(useJoin(s),"useJoin",t)
Performance: (added by #jezrael)
N = 1000000
OrderProduct = pd.DataFrame({'OrderProductId':['AAA01188P001'] * N})
In [331]: %timeit [f'{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}' for s in OrderProduct['OrderProductId']]
527 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [332]: %timeit [s[5:8]+s[0]+s[4]+s[1:4]+s[8:] for s in OrderProduct['OrderProductId']]
610 ms ± 18.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [333]: %timeit ['{}{}{}{}{}'.format(s[5:8],s[0],s[4],s[1:4],s[8:]) for s in OrderProduct['OrderProductId']]
954 ms ± 76.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [334]: %timeit ["".join([s[5:8],s[0],s[4],s[1:4],s[8:]]) for s in OrderProduct['OrderProductId']]
594 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
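Putting that together, a sketch of replacing the original row-by-row loop with a single list comprehension over the column and assigning the result back (this assumes the column name from the question and the f-string reordering shown above):
OrderProduct['OrderProductId'] = [
    f"{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}"   # the usefString reordering from above
    for s in OrderProduct['OrderProductId']
]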
Can you just reconstruct the string with slices if that logic is consistent?
s = OrderProduct['OrderProductId'][i]
new_s = s[5]+s[7]+s[1:2]+s[6]+s[4]+s[0]+s[3]+s[1]
or as a format string:
new_s = '{}{}{}{}{}{}{}'.format(s[5],s[7]...)
Edit: +1 for Dave's suggestion of ''.join()-ing the list vs. concatenation.
If you just want to shuffle the strings (no particular logic), you can do that in a several ways:
Using string_utils:
import string_utils
print(string_utils.shuffle("random_string"))
Using built-in methods:
import random
str_var = list("shuffle_this_string")
random.shuffle(str_var)
print(''.join(str_var))
Using numpy:
import numpy
str_var = list("shuffle_this_string")
numpy.random.shuffle(str_var)
print(''.join(str_var))
But if you need to do so with a certain logic (e.g. put each element in a specific position), you can do this:
s = 'some_string'
s = ''.join([list(s)[i] for i in [1,6,2,7,9,4,0,8,5,10,3]])
print(s)
Output:
otmrn_sisge
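A small tweak to that last snippet (my own suggestion, not part of the original answer): strings are indexable directly, so there is no need to rebuild list(s) inside the comprehension.
order = [1, 6, 2, 7, 9, 4, 0, 8, 5, 10, 3]
s = 'some_string'
shuffled = ''.join(s[i] for i in order)   # index the string directly
print(shuffled)  # otmrn_sisge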
If this is still taking too long, you can use multiprocessing. Like this:
from multiprocessing import Pool

p = Pool(4)  # 4 is the number of workers; usually set to the number of CPU cores

def shuffle_str(s):
    # do shuffling here, and return
    ...

list_of_strings = [...]
list_of_results = p.map(shuffle_str, list_of_strings)
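A hedged usage sketch of the same idea (the reordering body is borrowed from the f-string answer above and is only an example): using the pool as a context manager and guarding with __main__ avoids problems on platforms that spawn worker processes instead of forking.
from multiprocessing import Pool

def shuffle_str(s):
    # example reordering; replace with whatever shuffle logic is needed
    return f"{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}"

if __name__ == '__main__':
    list_of_strings = ['AAA01188P001'] * 1_000_000
    with Pool(4) as p:  # 4 workers; usually the number of CPU cores
        list_of_results = p.map(shuffle_str, list_of_strings)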
