Find first pattern and retrieve data up to next pattern find - python

I have a string as follows:
my_str = "808c000003a185c50cd9b00285e78220500ac56a1c5ca5a1004b2404aa412f058c0a1ba85820cc8208080813c7040a228e0133ca5aca03a2829012533208704411004010808c001003a1c5c50cd9b00285e7822"
I want to split the string into groups based on a pattern.
Whenever the sequence '0808' occurs, I want that sequence plus the text that follows it, up to the next occurrence of '0808', and so on.
result = re.findall(r'(0808)', my_str)
This just gives me a list of the pattern itself.
I want each result to include the pattern and the text that follows it. Something is fundamentally wrong with the regex I'm using.
Help appreciated.

0808.*?(?=0808)
This matches the literal 0808, then lazily matches zero or more characters up to (but not including) the next 0808, which the lookahead checks without consuming.
Used with re.findall, it returns one match for each 0808 segment that is followed by another 0808.
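One caveat worth checking: the lookahead requires another 0808 to follow, so the final segment is not returned. A minimal sketch, assuming the last segment should be kept too, adds an end-of-string alternative to the lookahead. The short sample string is made up for illustration and stands in for the long string in the question.

```python
import re

# Hypothetical short sample, standing in for the long hex string in the question.
sample = "aa0808bb0808cc"

# Pattern from the answer: each segment must be followed by another 0808.
followed_only = re.findall(r"0808.*?(?=0808)", sample)
print(followed_only)

# Variant (an assumption about the desired result): also accept end of string,
# so the last segment is kept as well.
segments = re.findall(r"0808.*?(?=0808|$)", sample)
print(segments)
```

If the data could contain newlines, passing re.DOTALL would probably be needed as well, since . does not match a newline by default.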

Related

How can I use regex to match sub-string start by words (not included) to the end of string, and keep non-greedy at same time?

I want to find the sub-string that follows the words (\d月|\d日) (which should not be included in the result) and runs to the end of the string, while keeping that sub-string as short as possible (non-greedy). For example:
str1 = "秋天9月9日长江工程完成"
res1 = re.search(r'(\d月|\d日).*', str1).group() #return 9月9日长江工程完成
I want to get 长江工程完成 as the result.
Another example:
str2 ="秋天9月9日9日长江工程完成"
It should give the same result as the previous one.
I tried the following approaches, but they all return unexpected results. Any suggestions?
res1 = re.search(r'(?:(?!\d月|\d日))(?:\d月|\d日)', str1).group() #return 9月
res1 = re.search(r'(?:\d月|\d日)((?:(?!\d月|\d日).)*?)', content).group() #return 9月
If you want to capture the rest of the string, surround .* with a group.
To match one or more repetitions of the same pattern, you can use the + quantifier.
import re
content = "9月9日9月长江工程完成"
match = re.match(r'(?:\d月|\d日)+(.*)', content)
print(match[1])
Output:
长江工程完成
(?:(?!\d月|\d日))(?:\d月|\d日)
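This pattern does not capture the rest of the string, because nothing after the initial words is inside a group. (Also, as written, the negative lookahead and the group after it test the same position, so their conditions contradict each other.)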
(?:\d月|\d日)((?:(?!\d月|\d日).)*?)
This pattern only matches strings that look like this:
9月4日a6日b0月x - probably not what you need
P.S. Make sure you pick the right function from the re module: match, search, or fullmatch (see What is the difference between re.search and re.match?). You said that the whole string needs to start with the given words, so use match or fullmatch.
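The sample strings in the question begin with 秋天 rather than with the date words, so re.match would not find anything there. A minimal sketch, assuming the goal is the text after the first run of date words anywhere in the string, uses re.search with the same capturing group:

```python
import re

# Same pattern as above: one or more date words, then capture the rest.
pattern = re.compile(r"(?:\d月|\d日)+(.*)")

str1 = "秋天9月9日长江工程完成"
str2 = "秋天9月9日9日长江工程完成"

# re.search scans for the first position where the pattern matches,
# so the leading "秋天" is skipped; group(1) is the text after the date words.
res1 = pattern.search(str1).group(1)
res2 = pattern.search(str2).group(1)
print(res1)
print(res2)
```

Both calls should print 长江工程完成, because the + quantifier absorbs every consecutive date word before the group starts.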

Python regex: excluding square brackets and the text inside

I am trying to write a regex that excludes square brackets and the text inside them.
My sample text looks like this: 'WordA, WordB, WordC, [WordD]'
I want to match each text item in the string except '[WordD]'. I've tried using a negative lookahead, something like... [A-Z][A-Za-z]+(?!\[[A-Z]+\]) but doing so is still matching the text inside the brackets.
Is negative lookahead the best approach? If so, where am I going wrong?
Rather than a regex, you might consider splitting by commas and then filtering by whether the word starts with [:
output = [word for word in str.split(', ') if word[0] != '[']
If you use a regex, you can match either the beginning of the string, or lookbehind for a space:
re.findall(r'(?:^|(?<= ))[A-Z][A-Za-z]+', str)
Or you could negative lookahead for ] at the end, after a word boundary:
output = re.findall(r'[A-Z][A-Za-z]+\b(?!\])', str)
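As a quick check, the three approaches above can be run side by side on the sample text from the question. This is only a sketch: the variable name text is my own choice (it avoids shadowing the built-in str), and startswith is used in place of the index test so that empty items would not raise an error.

```python
import re

text = "WordA, WordB, WordC, [WordD]"

# 1. Split on ", " and drop the items that start with "[".
by_split = [word for word in text.split(", ") if not word.startswith("[")]

# 2. Match a word at the start of the string or right after a space.
by_lookbehind = re.findall(r"(?:^|(?<= ))[A-Z][A-Za-z]+", text)

# 3. Match a whole word that is not immediately followed by "]".
by_lookahead = re.findall(r"[A-Z][A-Za-z]+\b(?!\])", text)

print(by_split)
print(by_lookbehind)
print(by_lookahead)
```

On this input all three give the same list of WordA, WordB and WordC; they may differ on text with other separators or nested brackets.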
This can be as simple as
(\w+),
Retrieve value of Group 1 for desired result.
I'm guessing that maybe you were trying to write some expression similar to:
[A-Z][a-z]*[A-Z](?=,|$)
or,
[A-Z][a-z]+[A-Z](?=,|$)
Test
import re
regex = r"[A-Z][a-z]*[A-Z](?=,|$)"
string = """
WordA, WordB, WordC, [WordD]
WordA, WordB, WordC, [WordD], WordE
"""
print(re.findall(regex, string))
Output
['WordA', 'WordB', 'WordC', 'WordA', 'WordB', 'WordC', 'WordE']
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.

Empty Regex response using finditer and lookahead

I'm having trouble understanding regex behaviour when using lookahead.
I have a given string in which I have two overlapping patterns (starting with M and ending with p). My expected output would be MGMTPRLGLESLLEp and MTPRLGLESLLEp. My python code below results in two empty strings which share a common start with the expected output.
Removal of the lookahead (?=) results in only ONE output string which is the larger one. Is there a way to modify my regex term to prevent empty strings so that I can get both results with one regex term?
import re
string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=M(.*?)p)')
sequences = pattern.finditer(string)
for results in sequences:
    print(results.group())
    print(results.start())
    print(results.end())
The overlapping matches trick with a look-ahead makes use of the fact that the (?=...) pattern matches at an empty location, then pulls out the captured group nested inside the look-ahead.
You need to print out group 1, explicitly:
for results in sequences:
    print(results.group(1))
This produces:
GMTPRLGLESLLE
TPRLGLESLLE
You probably want to include the M and p characters in the capturing group:
pattern = re.compile(r'(?=(M.*?p))')
at which point your output becomes:
MGMTPRLGLESLLEp
MTPRLGLESLLEp
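To see why the whole match is empty while group 1 carries the text, it may help to print both together with the span of group 1. A small sketch using the string and the adjusted pattern from this answer (the found list is just an illustrative name):

```python
import re

string = "GYMGMTPRLGLESLLEpApMIRVA"
pattern = re.compile(r"(?=(M.*?p))")

found = []
for m in pattern.finditer(string):
    # The lookahead itself consumes nothing, so m.group() is an empty string.
    assert m.group() == ""
    # Group 1 holds the captured text; start(1)/end(1) give its position.
    found.append((m.group(1), m.start(1), m.end(1)))

print(found)
```

Because each match is zero-width, the scan can start again at the very next character, which is what allows the second, overlapping sequence to be found.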

Python Regex behaviour with Square Brackets []

This is the text file abc.txt
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
path = r"C:\abc.txt"  # raw string so that "\a" is not treated as an escape
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1 and 2 from s2.
But why?
Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
Note that group 0 is the entire match by default:
If a groupN argument is zero, the corresponding return value is the
entire matching string.
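So skip group 0, and check group(3) if you want the last field.
Also, you can compile the regexp before the for loop. This can improve the parser's performance slightly (the re module already caches recently used patterns, so the gain is usually small).
And you can replace (\w)* with (\w*) if you want to match all the characters between the colons.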
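The difference between (\w)* and (\w*) can be checked directly on one line of the file. A minimal sketch, with the third group simplified to (.*) for brevity (that simplification is mine, not part of the original regex):

```python
import re

line = "aa:s0:education.gov.in"

# Quantifier outside the group: each group keeps only its last repetition.
last_repeat = re.search(r"(\w)*:(\w)*:(.*)", line).groups()

# Quantifier inside the group: each group keeps the whole field.
whole_field = re.search(r"(\w*):(\w*):(.*)", line).groups()

print(last_repeat)
print(whole_field)
```

The first tuple contains single characters for the first two fields, while the second contains the complete fields.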

How to specify the regex string in python

I have the following 2 strings of train station IDs (showing the direction of travel) separated by "-".
String A (strA):
NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27
String B (strB):
NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1
I want to find out which of String A or B contains stations "NS4" followed by "NS1" (answer should be String B).
My current code as follows:
searchStr = ".*NS4-.*NS1(-.*|)"
re.search(searchStr, strA)
re.search(searchStr, strB)
But the result keeps returning a match in String A.
May I know how to specify 'searchStr' in order to match only String B?
Two ways to do it: tokenizing and improving the regex.
Tokenizing
tokA = strA.split('-')
tokB = strB.split('-')
print('NS4' in tokA and tokA.index('NS1') > tokA.index('NS4'))
print('NS4' in tokB and tokB.index('NS1') > tokB.index('NS4'))
# False
# True
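The tokenizing check above raises ValueError if 'NS1' is missing from the list. A slightly more defensive sketch wraps the same idea in a helper; the function name comes_after and the shortened routes are hypothetical, chosen only for illustration.

```python
def comes_after(route, first, second):
    """Return True if station `second` appears after station `first` in `route`."""
    stops = route.split("-")
    # Exact token comparison, so "NS1" is never confused with "NS10".
    if first not in stops or second not in stops:
        return False
    return stops.index(second) > stops.index(first)

# Hypothetical shortened routes, standing in for strA and strB.
route_a = "NS1-NS2-NS3-NS4-NS5"
route_b = "NS5-NS4-NS3-NS2-NS1"

print(comes_after(route_a, "NS4", "NS1"))
print(comes_after(route_b, "NS4", "NS1"))
```

This assumes each station appears at most once per route, as in the question, since index returns only the first position.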
Regex
import re
pattern = '(^|-)NS4.+NS1(-|$)'
print(re.search(pattern, strA) is not None)
print(re.search(pattern, strB) is not None)
# False
# True
Performance
Tokenization: 2.3072939129997394
Regex: 11.138173280000046
But if you really need performance, I'm sure there are faster ways. Even the tokenization method does multiple passes.
As an alternative to tokenizing, you could use the following expression.
NS4(?=.*?NS1(?!\d))
It literally means:
The characters "NS4" literally.
Followed by any characters, until it finds NS1.
NS1 cannot be followed by a digit.
To educate readers as to what I've used:
(?=) is a Positive Lookahead.
Whatever you place inside this token must be found for the match to be True.
I placed .*? to match anything, as few times as possible using the ? quantifier, followed by NS1 since that is what we want to find.
(?!) is a Negative Lookahead
Whatever you place inside this token, as you might guess, must NOT be found for the match to be True.
I placed a digit in here, so that things like NS10 or NS11 or NS19 are never matched.
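The expression can be tried against the two strings from the question. The sketch below also runs the question's original pattern on String A, which shows the false match that the negative lookahead is meant to prevent; the variable names are illustrative.

```python
import re

strA = ("NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-"
        "NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27")
strB = ("NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-"
        "NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1")

# Pattern from the question: "NS1" can match the start of "NS10", "NS11", ...
original_hits_a = re.search(r".*NS4-.*NS1(-.*|)", strA) is not None

# Pattern from this answer: NS1 must not be followed by another digit.
pattern = re.compile(r"NS4(?=.*?NS1(?!\d))")
in_a = pattern.search(strA) is not None
in_b = pattern.search(strB) is not None

print(original_hits_a, in_a, in_b)
```

Note that the pattern as given does not guard NS4 itself against a trailing digit (an ID such as NS40 would also match); the station lists here contain no such ID, so that case does not arise.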
