python re match string with integer - python

I need to match strings like: '2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} [eval_peers_peak] Evaluating batch 0 out of 2158',
I have tried different regular expressions such as: comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
and this is an example usage:
def get_batch_process_time(log):
    loglines = log.splitlines()
    comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
    times = []
    matches = []
    for i, line in enumerate(loglines):
        if comp.search(line):
            time = string2datetime(line.split(' ')[0])
            times.append(time)
            matches.append(line)
    return np.array(times), matches
Unfortunately none of the lines seems to match the given pattern. I assume that I'm using the wrong regular expression.
What is the right regular expression?
Am I using re correctly? (should I use match rather than search?)

^[-+]?[0-9]+$ alone would match a whole string consisting of an optional plus or minus sign followed by a non-empty sequence of digits.
I say "a whole string" because ^ and $ are "anchors" that match the start and the end of the string respectively. In the middle of your pattern they can never match, which is why your regex doesn't work: remove them.
I suppose you could also remove the optional sign part, i.e. [-+]?.
You could have found that out by yourself by testing your regex in regex101 (check the explanation panel on the top right) or a similar utility.
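For reference, here is a minimal sketch of the pattern with the anchors removed. The capture groups and the helper name parse_batch_line are illustrative assumptions, not part of the question's code. It also uses search rather than match, because match only tries the pattern at the very start of the line, where the timestamp is.

```python
import re

# No ^ or $ inside the pattern: the numbers sit in the middle of the line.
BATCH_RE = re.compile(r"Evaluating batch ([0-9]+) out of ([0-9]+)")

def parse_batch_line(line):
    """Return (batch, total) as ints, or None if the line does not match."""
    m = BATCH_RE.search(line)  # search scans the whole line for the pattern
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

line = ("2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} "
        "[eval_peers_peak] Evaluating batch 0 out of 2158")
print(parse_batch_line(line))  # (0, 2158)
```

In the question's function, only the re.compile line would need to change; comp.search(line) can stay as it is.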

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb"
for i in input:
    input = re.sub("(.)\\1+", "", input)
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb"
k = 3
for i in input:
    input = re.sub("(.)\\1+{"+(k-1)+"}", "", input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb"
k = 3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
See this Python demo.
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
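To make the brace doubling concrete, here is a small sketch that wraps the same pattern in a function. The name remove_runs is a hypothetical helper, and the sketch assumes a single pass is wanted: runs that only appear after a removal are left alone.

```python
import re

def remove_runs(text, k):
    """Remove each non-overlapping group of k identical consecutive characters."""
    # Doubled braces are literal braces; {k - 1} is interpolated by the f-string.
    pattern = fr"(.)\1{{{k - 1}}}"  # k=3 gives the pattern (.)\1{2}
    return re.sub(pattern, "", text)

print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

If repeated removal is needed, one option is to call the function in a loop until the string stops changing.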
If I were you, I would prefer to do it as suggested before. But since I've already spent time answering this question, here is my handmade solution.
The pattern below creates a named group called "letter". The group is updated at each position, so first it is a, then b, etc. It then looks ahead for a repetition of the group "letter" (which updates for each letter).
So it matches every letter that is immediately followed by the same letter and replaces it with an empty string, which collapses each run of repeated letters to a single letter.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only sequential duplicate characters? Is there a reason you need to use regex? You could likely do it with a simple loop. This is a really quick and dirty way to do it, but you could just put
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
and with a method like this, you could make it easier to read and faster with a bit of modification, I'd assume.
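One possible modification along those lines, sketched with itertools.groupby from the standard library (the function name collapse_runs is an illustrative choice):

```python
from itertools import groupby

def collapse_runs(text):
    """Keep one character from each run of identical consecutive characters."""
    # groupby yields one (key, group) pair per run of equal neighbours.
    return "".join(char for char, _run in groupby(text))

print(collapse_runs("abccbcbbb"))  # abcbcb
```

Unlike the loop above, this also works on an empty string, because nothing is popped from the input.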

searching for sequences in a FASTA format

I am trying to look for multiple specific sequences in a DNA sequence within a FASTA format and then print them out. For simplicity, I made a short string sequence to show my problem.
import re
seq = "QPPLSK"
find_in_seq = re.search(r"[^P](P|K|R|H|W)", seq)
print find_in_seq.string[find_in_seq.start():find_in_seq.end()]
I only get one output of a match "QP" when there are 2 matches "QP" and "SK". How do I get to show the 2 matches instead of just only showing the first match?
Thanks
Use re.findall and change the regex so that there is no more capturing group - [^P](?:P|K|R|H|W) or [^P][PKRHW]:
import re
seq = "QPPLSK"
find_in_seq = re.findall(r"[^P][PKRHW]", str(seq))
print(find_in_seq)
See the Python demo
Note that if you want to match any letter other than P, you'd better use [A-OQ-Z].
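A short sketch of why the capturing group matters, and of re.finditer as an option when the match positions are needed as well. The variable names are illustrative only.

```python
import re

seq = "QPPLSK"

# With one capturing group, findall returns only the group's text.
print(re.findall(r"[^P](P|K|R|H|W)", seq))  # ['P', 'K']

# Without a capturing group, findall returns the whole match.
print(re.findall(r"[A-OQ-Z][PKRHW]", seq))  # ['QP', 'SK']

# finditer yields match objects, so start offsets are available too.
hits = [(m.start(), m.group()) for m in re.finditer(r"[A-OQ-Z][PKRHW]", seq)]
print(hits)  # [(0, 'QP'), (4, 'SK')]
```

Note that both functions return non-overlapping matches only, scanning from left to right.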

Slice substrings from long string to a list in python

In Python I have a long string like this (from which I removed all line breaks):
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
What I want to do is to search this string for all occurrences of "key:", then extract the "values" following "key:".
One further complication for me is that I don't know how long these values belonging to key are (e.g. key:12/eas9 and key:43/e3). All I do know is that they do have to end with a digit whereas the rest of the string does not contain any digits.
This is why my idea was to slice from the indices of key plus the next say 10 characters (e.g. key:12/eas9g) and then work backward until isdigit() is false.
I tried to split my initial string (that did contain breaks):
stringA_split = re.split("\n", stringA)
for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    # then work backward
However, inserting line breaks does not help in any way as they are meaningless from a pdf-to-txt conversion.
How would I then solve this (e.g. as a start with getting all indices of '"key:"' and slice this to a list)?
>>> import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']
\d+                     # One or more digits.
[^\d]+                  # One or more characters that are not digits ([^\d] is equivalent to \D).
[\d]                    # The final digit.
(\d+[^\d]+[\d])         # The group of the expression above.
r'key:(\d+[^\d]+[\d])'  # 'key:' followed by the group expression.
If you want key: in your result:
>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']
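The same pattern can be written with re.VERBOSE so the explanation lives inside the regex. This is only a sketch: it assumes, as in the example string, that every value is digits, then non-digits, then exactly one final digit. The name VALUE_RE is an illustrative choice.

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

VALUE_RE = re.compile(r"""
    key:        # literal marker in front of each value
    (           # start of the captured value
      \d+       # one or more digits
      \D+       # one or more non-digit characters
      \d        # the final digit
    )
""", re.VERBOSE)

values = VALUE_RE.findall(stringA)
print(values)  # ['12/eas9', '43/e3']
```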
I'm not 100% sure I understand your definition of what defines a value, but I think this will get you what you described
import re
stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
for v in stringA.split('key:'):
    ma = re.match(r'(\d+\/.*\d+)', v)
    if ma:
        print(ma.group(1))
This returns:
12/eas9
43/e3
You can apply just one RE that gets all the keys into a list of tuples:
import re
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret = p.findall(stringA)
After the execution, you have:
ret
[('12', 'eas9'), ('43', 'e3')]
edit: a better answer was posted above. I misread the original question when proposing to reverse here, which really wasn't necessary. Good luck!
If you know that the format is always key:, what if you reversed the string and searched for :yek with a regex? You'd isolate all keys and can then reverse them back
import re
# \w is alphanumeric, you may want to add some symbols
rex = re.compile(r"\w*:yek")
word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]

Split a string using a list of strings as a pattern

Consider an input string :
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The output should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I wrote some code which implements the following approach. For each of the strings in splitters, I find its occurrences in the input string, and insert something which I know for sure would not be a part of my input string (for example, this '!!'). Then I split the string using the substring that I just inserted.
for s in splitters:
mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a nicer way of doing it?
Splitting with re.split normally removes the matched string from the output (NB, this is not quite true, see the edit below). Therefore, you must use positive lookahead expressions ((?=...)) to match without removing the match. However, in Python versions before 3.7, re.split ignores empty matches, so simply using a lookahead expression doesn't work there. Instead, you will lose at least one character at each split (even trying to trick re with "boundary" matches (\b) does not work). If you don't care about losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)", mystr)
which would give
["just", "some stupid string to", "illustrate my question"]
(note that the spaces after just and to are missing). You could then programmatically generate these regexes using str.join. Note that each of the split markers is escaped with re.escape so that special characters in the items of splitters do not affect the meaning of the regular expression in any undesired ways (imagine, e.g., a ) in one of the strings, which would otherwise lead to a regex syntax error).
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
Edit (HT to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters as separate items in the list. Joining every two subsequent items then produces the desired list as well. You can then also drop the requirement of having a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Because normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the auxiliary characters. Then, itertools.izip_longest is used to combine each text item with the corresponding removed character, and the last item (which has no counterpart among the removed characters) with fillvalue, i.e. ''. Each of these tuples is then joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
This leads to a further simplification of the regular expression, because instead of using auxiliary characters, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Here, the slice indices on the_raw_split are swapped, because the matched splitters (the_raw_split[1::2]) must now be prepended to the text item that follows them instead of being appended to the item in front. Also note the [""] + part, which is necessary to pair the first text item with "" so that the order comes out right.
(end of edit)
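On Python 3.7 and later, re.split does split on empty matches, so a bare lookahead is enough there and no character is lost. A minimal sketch, assuming Python 3.7+ and a non-empty splitters list:

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Zero-width lookahead: matches the position in front of each splitter.
lookahead = "(?={})".format("|".join(re.escape(s) for s in splitters))
result = re.split(lookahead, mystr)
print(result)
# ['just ', 'some stupid string to ', 'illustrate my question']
```

If the string itself starts with one of the splitters, the first item of the result is an empty string, which may need to be filtered out.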
Alternatively, you can (if you want) use str.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient)
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use str.split instead:
result = mystr.split("!!")
What you could also do (instead of relying on the replacement token not being anywhere else in the string, or on every split position being preceded by a non-word character) is find the split strings in the input using str.find and use string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions where the splitters can be found, for all splitters that are in the string (for this, i < 0 is excluded) and not right at the beginning (where we (possibly) just split, so i == 0 is excluded as well). If there are any left in the string, we yield (this is a generator function) everything up to (excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there are none left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to use list to turn it into an actual list.
Note that you could also replace yield whatever with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end. I do not consider that to be very good code style, though.
TL;DR
If you are OK with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
else, the same can also be achieved using string.find with the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Not especially elegant but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(list(set(indexes)))
print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']
I should acknowledge here that a little more work is needed if a word in splitters occurs more than once because str.index finds only the location of the first occurrence of the word...
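One way to cover repeated occurrences is to collect the start offset of every match with re.finditer instead of calling str.index once per splitter. This does bring regex back in, so it is a sketch of the idea rather than a regex-free fix. The function name split_before is hypothetical, and a non-empty splitters list is assumed.

```python
import re

def split_before(text, splitters):
    """Split text in front of every occurrence of any splitter."""
    pattern = re.compile("|".join(re.escape(s) for s in splitters))
    # Start offsets of all matches, plus both ends of the text.
    cuts = sorted({0, len(text), *(m.start() for m in pattern.finditer(text))})
    return [text[i:j] for i, j in zip(cuts[:-1], cuts[1:])]

print(split_before("just some stupid string to illustrate my question",
                   ["some", "illustrate"]))
# ['just ', 'some stupid string to ', 'illustrate my question']
print(split_before("a some b some c", ["some"]))
# ['a ', 'some b ', 'some c']
```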

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
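When a pattern contains a capturing group, findall returns the text of that group, and a repeated group keeps only its last repetition, which is why the original pattern returned ['8']. By wrapping the entire repeated part between 2, and ,X in (), you capture all of it. I then used (?: ) to make the inner group non-capturing.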
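Since the question also asks for the shortest match, here is a sketch that picks it with min. The line O,2,5,3,X is made-up sample data added for illustration, and "shortest" is assumed to mean the fewest numbers. The pattern uses [0-9]+ so that multi-digit numbers such as 11 are matched as one item, and * in place of the equivalent {0,}.

```python
import re

# Shortened sample; the third line is invented for this example.
s = """O,4,1,8,6,7,9,5,3,X
O,2,9,6,7,11,8,X
O,2,5,3,X
X,3,2,7,1,9,4,6,X"""

pattern = re.compile(r"O,2,((?:[0-9]+,?)*),X")
matches = pattern.findall(s)
print(matches)  # ['9,6,7,11,8', '5,3']

# Compare by number of items rather than by string length.
shortest = min(matches, key=lambda m: len(m.split(",")))
print(shortest)  # 5,3
```

min raises ValueError on an empty list, so real code would need to handle the case where nothing matches.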
You don't have to use regex:
split the string into a list on ","
check that item 0 == "O" and item 1 == "2"
check that the last item == "X"
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
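These steps could be sketched as follows. The function name moves_after and its parameters are hypothetical, chosen only for this example.

```python
def moves_after(line, first="O", second="2", last="X"):
    """Return the numbers between the prefix and the last item, or None."""
    items = line.strip().split(",")
    if len(items) < 3:
        return None
    if items[0] != first or items[1] != second or items[-1] != last:
        return None
    middle = items[2:-1]  # everything between the prefix and the last item
    if not all(item.isdigit() for item in middle):
        return None
    return middle

print(moves_after("O,2,9,6,7,11,8,X"))   # ['9', '6', '7', '11', '8']
print(moves_after("O,4,6,9,3,1,7,5,O"))  # None
```

Applied to each line of the data, the shortest list among the non-None results is the one the question is after.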
