Python ReGex Pattern Finder

Python ReGex Pattern Finder - python

I am trying to get better with ReGex in Python, and I am trying to figure out how I would isolate a specific substring in some text. I have some text, and that text could look like any of the following:
possible_strings = [
"some text (and words) more text (and words)",
"textrighthere(with some more)",
"little trickier (this time) with (all of (the)(values))"
]
With each string, despite the fact that I don't know what's in them, I know it always ends with some information in parentheses. To include examples like #3, where the final pair of parentheses have parentheses in them.
How could I go about using re/ReGex to isolate the text only inside of the last pair of parentheses? So in the previous example, I would want the output to be:
output = [
"and words",
"with some more",
"all of (the)(values)"
]
Any tips or help would be much appreciated!

In python you can use the regex module as it is supports recurssion:
import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)
['(and words)', '(with some more)', '(all of (the)(values))']
The regex might be quite complicated for a beginner. Click here for the explanations and examples
Abit of explanation:
( # 1st Capturing Group
\( # matches the character (
(?:#Non-capturing group
[^()] # 1st Alternative Match a single character not present in the character class
| # or
(?1) #2nd Alternative matches the expression defined in the 1st capture group recursively
) # closes capturing group
* # matches zero or more times
\) #matches the character )
$ asserts position at the end of a line

For the first two, start matching an opening bracket, that could be either of these:
"some text (and words) more text (and words)"
^ ^
followed by anything which isn't an opening bracket:
"some text (and words) more text (and words)"
^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
|- starting at the first ( hit
another ( which isn't allowed.
followed by end of line. Only the last () fits "no more ( until end of line".
>>> import re
>>> re.findall('\([^(]+\)$', "some text (and words) more text (and words)")
['(and words)']
RegEx is not a good fit for your third example; there's no easy way to pair up the parens, you may have to install and use a different regex engine to get nested structure support. See also
Matching Nested Structures With Regular Expressions in Python
Python: How to match nested parentheses with regex?

Related

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code like so:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.

You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches an alphanumeric text with length one or two as you tried to match in your own regex.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
EDIT:
As someone pointed out, in case your text has strings like A.A.A.A.A.A. or A.A.A or even 1.2 and you don't want to match these strings and only want to match strings that has exactly three dots within it, you should use following regex which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots and those negative look ahead/behind ensures it doesn't match partially in large string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
Also for trying to use ^ and $, they are called start and end anchors respectively, and if you use them in your regex, then they will expect matching start of line and end of line which is not what you really intend to do hence you shouldn't be using them and like you already saw, using them won't work in this case.

If simple version is required, you can use this easy to understand and modify regex ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})

I think we should keep the regex expression simple and readable.
You can use the regex
**(?:[a-zA-Z]+\.){3}[a-zA-Z]+**
Explanation -
The expression (?:[a-zA-Z]+.){3} ensures that the group (?:[a-zA-Z]+.) is to be repeated 3 times within the word. The group contains an alphabetic character followed a dot.
The word would end with an alphabetic character.
Output:
['C.2.1a.5']

Python regex match pattern "X<string1>:X<string2>"

I'm parsing a file which has text "$string1:$string2"
How do I regex match this string and extract "string1" and "string2" from it, basically regex match this pattern : "$*:$*"

You were nearly there with your own pattern, it needs three alterations in order to work as you want it.
First, the star in regexes isn't a glob, as you might be expecting it from shell scripting, it's a kleene star. Meaning, it needs some character group it can apply it's "zero to n times" logic on. In your case, the alphanumeric character class \w should work. If that's too restrictive, use . instead, which matches any character except line breaks.
Secondly, you need to apply the regex in a way that you can easily extract the results you want. The usual way to go about it is to define groups, using parentheses.
Last but not least, the $ sign is a meta-character in regexes, so if you want to match it literally, you need to write a backslash in front of it.
In working code, it'll look like this:
import re
s = "$string1:$string2"
r = re.compile(r"\$(\w*):\$(\w*)")
match = r.match(s)
print(match.group(1)) # print the first group that was matched
print(match.group(2)) # print the second group that was matched
Output:
string1
string2

Detect abbreviations in the text in python

I want to find abbreviations in the text and remove it. What I am currently doing is identifying consecutive capital letters and remove them.
But I see that it does not remove abbreviations such as MOOCs, M.O.O.C, M.O.O.Cs. Is there an easy way of doing this in python? Or are there any libraries that I can use instead?

The re regex library is probably the tool for the job.
In order to remove every string of consecutive uppercase letters, the following code can be used:
import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)
Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
The convenient thing about regex is how easily it can be modified for more complex cases. For example:
mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)
Will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D. as well as ABCD. The \ before the . is necessary as . otherwise is used by regex as a kind of wildcard.
The ? specifier could also be used to remove acronyms that end in s, for example:
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)
This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
The re library also includes functions like findall, or the match function, which allow for programs to locate and process each acronym individually. This might come in handy if you want to, for example, look at a list of the acronyms being removed and check there are no legitimate words there.

An intuitive way would be the use of regex
This regular expression does the job :([A-Z]\.*){2,}s?
Which gives in python :
import re
re.sub("([A-Z]\.*){2,}s?","", your_text)
Please visit regex documentation in case of doubt
https://docs.python.org/2/library/re.html#re.sub

Python regex find all matches

I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'] while I'm expecting something like [('8.25e+07'),(8.26206e+07)]. I've been trying around but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'] so the pattern is matching the second number but I don't get it why it doesn't match both at the same time.

You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
After re.findall traverses your string from the beginning it finds your defined pattern, which then drops the { and the | because of your capturing group (..) and saves this as a match. It then continues but only has 8.26206e+07} left. That now does not satisfy your pattern, because it is missing one "any" character for your first ., and no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = ".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
Two things I would change in your pattern, (1) remove the .s and (2) remove your capturing group ( ), you have no need for it:
p = "[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall

Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.

i think you have extra dots.
try this below
import re
y = re.findall("([0-9]+\.[0-9]+[eE][-+]?[0-9]+)","{8.25e+07|8.26206e+07}")
print (y)

When you use regular expressions to match. The default mode will be to find all non-overlapping matches. Using the dots at both the end and the beginning, you make them overlap.
"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?

Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.

The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
with open(file) as f:
lines = f.readlines()
for line in lines:
line = line.lstrip()
if line.startswith(key + " ="):
return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.

This is a pretty simple regex, you just need a positive lookbehind, and optionally something to remove the comments. (do this by appending ?(//)? to the regex)
r"(?<=description = ).*"
Regex101 demo

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python ReGex Pattern Finder - python

Related

Python Regex: Match paragraph numbers

Python regex match pattern "X<string1>:X<string2>"

Detect abbreviations in the text in python

Python regex find all matches

Regex to match only part of certain line

Categories

Resources