Parsing based on pattern not at the beginning

Parsing based on pattern not at the beginning - python

I want to extract the number before "2022" in a set of strings possibly. I current do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX' to return a='202'. Therefore, I want the code to split the string on "2022" that is not at the beginning non-whitespace characters in the string.
Can you please guide with this?

Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.

You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2020 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.

You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
' 1020220519AX',
' 20220220519AX'
]
for s in strings:
parts = re.split(r"(?<!^)2022", s.strip())
if parts:
print(parts[0])
for s in strings:
m = re.match(r"\s*(\d+?)2022", s)
if m:
print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits, it is only splitted.
If the string consists of only word characters, splitting on \B2022 where \B means non a word boundary, will also prevent splitting at the start of the example string.

Related

Python -- Regex match pattern OR end of string

import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222<")
The above code works fine if my string ends with a "<" or space. But if it's the end of the string, it doesn't work. How do I get +1.222.222.2222 to return in this condition:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <$])", "+1.222.222.2222")
*I removed the "<" and just terminated the string. It returns none in this case. But I'd like it to return the full string -- +1.222.222.2222
POSSIBLE ANSWER:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:[ <]|$)", "+1.222.222.2222")

I think you've solved the end-of-string issue, but there are a couple of other potential issues with the pattern in your question:
the - in [ -.] either needs to be escaped as \- or placed in the first or last position within square brackets, e.g. [-. ] or [ .-]; if you search for [] in the docs here you'll find the relevant info:
Ranges of characters can be indicated by giving two characters and separating them
by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match
all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal
digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character
(e.g. [-a] or [a-]), it will match a literal '-'.
you may want to require that either matching parentheses or none are present around the first 3 of 10 digits using (?:\(\d{3}\) ?|\d{3}[-. ]?)
Here's a possible tweak incorporating the above
import re
pat = "^((?:\+1[-. ]?|1[-. ]?)?(?:\(\d{3}\) ?|\d{3}[-. ]?)\d{3}[-. ]?\d{4})(?:[ <]|$)"
print( re.findall(pat, "+1.222.222.2222") )
print( re.findall(pat, "+1(222)222.2222") )
print( re.findall(pat, "+1(222.222.2222") )
Output:
['+1.222.222.2222']
['+1(222)222.2222']
[]

Maybe try:
import re
re.findall("(\+?1?[ -.]?\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4})(?:| |<|$)", "+1.222.222.2222")
null matches any position, +1.222.222.2222
matches space character, +1.222.222.2222
< matches less-than sign character, +1.222.222.2222<
$ end of line, +1.222.222.2222
You can also use regex101 for easier debugging.

Split a split (regex) in python

I do have got the below string and I am looking for a way to split it in order to consistently end up with the following output
'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L1.060,70',
'2BE 129517720L2.639,40',
'3NL 134187650L4.024,23',
'4DE 165893440L8.111,00',
'5PL 65775644897L3.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L8.0221,30']
My current approach
re.split("([0-9][0-9][0-9][A-Z][A-Z])", input) however is also splitting my delimiter which gives and there is no other split possible than the one I am currently using in order to remain consistent. Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?

Use re.findall() instead of re.split().
You want to match
a number \d, followed by
two letters [A-Z]{2}, followed by
a space \s, followed by
a bunch of characters until you encounter a comma [^,]+, followed by
two digits \d{2}
Try it at regex101
So do:
input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)
Which gives
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex
[^,]+,\d{2} Try it at regex101
This will match as many of any character except a comma, then a single comma, then two digits.
re.findall(r"[^,]+,\d{2}", input_str)
# Output:
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']

Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
If you must use re.split AT ANY PRICE then you might exploit zero-length assertion for this task following way
import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)
output
['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']
Explanation: This particular one is positive lookbehind, it does find zero-length substring preceded by , digit digit. Note that parts has superfluous empty str at end.

How to match anything in a regular expression up to a character and not including it?

For example if I have a string abc%12341%%c%9876 I would like to substitute from the last % in the string to the end with an empty string, the final output that I'm trying to get is abc%12341%%c.
I created a regular expression '.*#' to search for the last % meaning abc%12341%%c% , and then getting the index of the the last % and then just replacing it with an empty string.
I was wondering if it can be done in one line using re.sub(..)

Use the following regex pattern, and then replace with empty string:
%[^%]*$
Sample script:
inp = "abc%12341%%c%9876"
output = re.sub(r'%[^%]*$', '', inp)
print(output) # abc%12341%%c
The regex pattern says to match the final % sign, followed by zero or more non % characters, up to the end of the string. We then replace with empty string, to effectively remove this content from the input.

I think it is called lookahead matching - I will look it up if I am not too slow :-)
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the current target string without any modification. How can I do this?

Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(r'\[\?]\s*(.*?)Reduced', s)
if m:
print(m.group(1))
# => I want this string.
See the Python demo.

Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'

The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.

You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.

Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved following with index of Reduced. Resulting substring is printed.
Every time this [?]...Reduced pattern is returned, the index is updated to the rest of the string. The search is continued from that index.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this
string.Reduced<insert_randomletters>'
idx = s.find('[?]')
while idx is not -1:
start = idx
end = s.find('Reduced',idx)
print(s[start+3:end].strip())
idx = s.find('[?]',end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.

Regex sub phone number format multiple times on same string

I'm trying to use reg expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]

Here's a more simple solution that works for all of those cases, though is a little naïve (and doesn't care about matching brackets).
\(?(\d{3})\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\1)\2-\3
Try it online
Explanation:
Works by first checking for 3 digits, and optionally surrounding brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character, and then another 3 digits, also stored in a capturing group: [ -.]?(\d{3}).
And lastly, it does the previous step again - but with 4 digits instead of 3: [ -.]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
Example Python code
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ -.]?(\d{3})[ -.]?(\d{4})
Replace with:
(\2)\3-\4
Try it online

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing based on pattern not at the beginning - python

Use a regular expression rather than split(). import re mystring = ' 20220220519AX' match = re.search(r'^\s(\d+?)2022', mystring) if match: print(match.group(1)) ^\s skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.

Related

Python -- Regex match pattern OR end of string

Split a split (regex) in python

How to match anything in a regular expression up to a character and not including it?

Find String Between Two Substrings in Python When There is A Space After First Substring

Regex sub phone number format multiple times on same string

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing based on pattern not at the beginning - python

Use a regular expression rather than split(). import re mystring = ' 20220220519AX' match = re.search(r'^\s*(\d+?)2022', mystring) if match: print(match.group(1)) ^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.

Related

Python -- Regex match pattern OR end of string

Split a split (regex) in python

How to match anything in a regular expression up to a character and not including it?

Find String Between Two Substrings in Python When There is A Space After First Substring

Regex sub phone number format multiple times on same string

Categories

Resources

Use a regular expression rather than split(). import re mystring = ' 20220220519AX' match = re.search(r'^\s(\d+?)2022', mystring) if match: print(match.group(1)) ^\s skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.