Filter words in Regular expression

So, quite recently I was introduced to regular expressions in Python, and I came across some code online that filters a list of strings, keeping the words that are contained in other substrings.
def Filter(string, substr):
    return [str for str in string
            if re.match(r'[^\d]+|^', str).group(0) in substr]
It seems pretty straightforward and it works pretty well for the specific problem I'm facing, but I really can't wrap my head around what it means or how it works. It just seems very confusing. Can anyone explain it to me as if I were a baby or something? My coding skills are not that great, and I'm still a rookie.
Just to be clear, the code works, and I'm happy to move on, I just don't understand this bit.

[^\d] matches any character that isn't a numeric digit; this can also be written as \D.
+ after a pattern means to match one or more consecutive occurrences of that pattern, so [^\d]+ matches a sequence of one or more non-digits.
| separates alternative patterns to match.
The second alternative ^ matches the beginning of the string. Every string will match this. I think they use this just to avoid the match failing, so that you can always call .group(0) on the result. They could accomplish the same thing by changing + to * in the first alternative, since this means that the matched sequence can be 0 repetitions.
re.match() looks for a match of the regexp at the beginning of the argument string. And .group(0) returns what was matched by the entire regexp. So this whole thing returns the initial sequence of non-digits in str.
Finally, the list comprehension returns the items in string whose initial sequence of non-digits is in substr.
With the simplifications I mentioned above, this can be rewritten:
def Filter(string, substr):
    return [item for item in string
            if re.match(r'\D*', item).group(0) in substr]
Note that if any of the items begin with a digit, the result of the regexp will be an empty string. If substr is itself a string, the empty string is a substring of every string, so these items will be included in the filter result. I suspect this is not the intended result.
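To make the edge case above concrete, here is a minimal sketch. The name filter_by_prefix and the sample data are made up for illustration; the regex is the simplified one from this answer.

```python
import re

def filter_by_prefix(items, allowed):
    """Keep items whose leading run of non-digits is found in `allowed`."""
    return [item for item in items
            if re.match(r'\D*', item).group(0) in allowed]

items = ['abc12', 'xyz7', '99zz']  # hypothetical sample data

# `allowed` is a list: `in` is an exact membership test.
print(filter_by_prefix(items, ['abc', 'def']))  # ['abc12']

# `allowed` is a string: `in` is a substring test, and '' is always found,
# so the item that starts with a digit slips through.
print(filter_by_prefix(items, 'abcdef'))        # ['abc12', '99zz']
```

If items starting with a digit should be rejected, switching back to + and checking that the match is not None is one option.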

I will try to explain this for you.
So basically we are defining a function named "Filter" that takes two arguments: "string" (the list of strings to filter) and "substr" (the collection each extracted prefix is looked up in). The return statement uses re.match inside a list comprehension: the for clause walks through the items of string one by one, and the if clause keeps only the items that pass the test.
As for r'[^\d]+|^': this is a regular expression pattern in which \d matches a digit, [^\d] matches any character that is not a digit, and + means one or more. The | separates two alternatives, and ^ matches the start of the string. The parentheses around the pattern belong to the re.match(...) call; they are not a capturing group.
re.match:
re.match is a function that tries to match only at the beginning of the string and returns a match object if it succeeds. If the pattern only occurs somewhere in the middle of the string, it returns None.
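A small sketch of that difference, with re.search included for contrast (the sample text is made up):

```python
import re

text = "abc123"  # hypothetical input

# re.match only tries the pattern at position 0.
at_start = re.match(r"\d+", text)    # None: the string starts with letters

# re.search scans forward until the pattern matches somewhere.
anywhere = re.search(r"\d+", text)   # match object covering "123"

print(at_start)            # None
print(anywhere.group(0))   # 123
```

This is also why calling .group(0) directly on the result of re.match is risky unless the pattern can never fail, as with the ^ alternative in the question.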

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these types of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the : as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.

Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two sources:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are mixing up re.match and re.search. match will only return a match that begins at the start of the string, while search will return a match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
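The four strings listed above can be checked with a short sketch (search is used so the match may start anywhere in the string):

```python
import re

pattern = re.compile(r'(?<=test)foo')

results = {}
for s in ['foo', 'test-foo', 'test foo', 'testfoo']:
    m = pattern.search(s)
    # Only the text "foo" is part of the match; "test" is merely required before it.
    results[s] = m.group(0) if m else None

print(results)
# {'foo': None, 'test-foo': None, 'test foo': None, 'testfoo': 'foo'}
```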
Anchors
Another option is anchors. ^ and $ are special characters that match the start and end of the string (or of each line, when re.MULTILINE is set). If you know your regex pattern should match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending them with .* is likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character except a newline, and * means 0 or more times. This means that whenever your pattern matches, it will swallow the entire line.
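A rough illustration of anchors in Python, using made-up inputs. Note that without the re.MULTILINE flag, ^ and $ refer to the whole string rather than to each line:

```python
import re

whole = re.compile(r'^\d+$')

print(bool(whole.match('12345')))   # True: digits from start to end
print(bool(whole.match('123a45')))  # False: the 'a' breaks the run

text = "12\nab\n34"
# Default mode: the pattern must cover the whole string, so nothing matches.
no_flag = re.findall(r'^\d+$', text)
# MULTILINE: ^ and $ also match at the start and end of each line.
per_line = re.findall(r'^\d+$', text, flags=re.MULTILINE)

print(no_flag)   # []
print(per_line)  # ['12', '34']
```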
Greedy vs. Non-Greedy qualifiers
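The second issue is related to greediness. A greedy quantifier such as * or + consumes as many characters as it can and gives some back only if the rest of the pattern would otherwise fail; appending ? (as in .*?) makes it consume as few as possible. In addition, when your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching far more text than you intend as one match, hiding multiple records in a single .*.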
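A tiny sketch of greedy versus non-greedy matching, on a made-up string rather than your data:

```python
import re

text = '<a><b>'  # hypothetical input

greedy = re.match(r'<.*>', text).group(0)   # .* runs to the last '>'
lazy = re.match(r'<.*?>', text).group(0)    # .*? stops at the first '>'

print(greedy)  # <a><b>
print(lazy)    # <a>
```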
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
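As a sketch of how the pattern above could feed the comparison you describe: the helper names trailing_value and differing_values are hypothetical, and the two lists simply reuse lines from the assumed input. The idea is to compare only the captured number instead of the whole string.

```python
import re

pattern = re.compile(r'^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$')

def trailing_value(line):
    """Return the number after the colon, or None if the line does not match."""
    m = pattern.match(line)
    return m.group(1) if m else None

def differing_values(xs, ys):
    """Symmetric difference of the trailing values found in two lists."""
    a = {trailing_value(x) for x in xs}
    b = {trailing_value(y) for y in ys}
    return sorted(a ^ b)

file_one = [r'V-1\ZDS\R\EMBO-20-1:24', r'V-1\ZDS\R\EMBO-20-6:24']
file_two = [r'V-1\ZDS\R\EMBO-20-6:24', r'V-1\ZDS\R\EMBO-20-3:93']

print([trailing_value(line) for line in file_one])  # ['24', '24']
print(differing_values(file_one, file_two))         # ['93']
```

Here EMBO-20-1:24 and EMBO-20-6:24 count as the same record because only the value after the colon is compared.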

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.

re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
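A variation, assuming every '>' you care about is the first character of its line: with re.MULTILINE you can anchor on line boundaries instead of requiring a trailing \n, so a final line without a newline is still picked up. The sample text is made up.

```python
import re

# Note: no newline after the last line.
text = ">test1\n(something else here)\n>test2\n(something else here)"

names = re.findall(r'^>(.*)$', text, flags=re.MULTILINE)
joined = "\n".join(names)

print(names)         # ['test1', 'test2']
print(repr(joined))  # 'test1\ntest2'
```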

You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing group with an empty alternative: it tells the regex to consume a newline only when that newline is followed by a character that is not a newline or carriage return; otherwise it matches nothing.
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

How do I build a tokenizing regex based iterator in python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()]) # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings (these are actually extra matches to string start and string end) in the beginning and end of its list of token outputs so to remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work but it doesn't (and also seems to behave wildly different on other regex engines) for some reason it wouldn't build a single match from the string's start and the delimiters following it, the string start somehow also consumes the character following it! (this is also where I see divergence from other engines, is this a BUG? or does it have something to do with special non corporeal characters and the or (|) operator in python that I'm not aware of?), this solution also did nothing for the double match containing the string's end, once it matched the delimiters and then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems, the split at the beginning doesn't contain string start (but I don't care too much about that anyway since I'm interested in the tokens themselves), it also matches string start when there are no delimiters at the beginning of the string but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place) but try as I might I couldn't get rid of the delimiter + end and then delimiter match problem (which yields an additional empty string), still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice, once with the preceding delimiters and once by itself (2 group 2 matches).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I have a simple answer, but if you must know (and if you don't, read no further) it's part of a bigger framework I'm building where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns, which are used to extract fields from semi structured classifier driven messages.)

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]' # newline, hyphen, or space
search_regex = r'''^(?!{0}) # string start with no delimiter
| # or
{0}+ # sequence of delimiters (at least one)
| # or
(?<!{0})$ # string end with no delimiter
'''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
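Here is a rough, self-contained sketch of how that pattern could be plugged into the pairwise loop from the question. To avoid the more_itertools dependency, the pairing is done by hand; the generator name tokens and the sample strings are just for illustration.

```python
import re

DELIMITER = r'[\n\- ]'  # newline, hyphen, or space
GAP_RE = re.compile(r'^(?!{0})|{0}+|(?<!{0})$'.format(DELIMITER))

def tokens(text):
    """Yield the text lying between consecutive delimiter matches."""
    prev = None
    for curr in GAP_RE.finditer(text):
        if prev is not None:
            yield text[prev.end():curr.start()]
        prev = curr

padded = list(tokens(" dasdha hasud "))
bare = list(tokens("dasdha hasud"))

print(padded)  # ['dasdha', 'hasud']
print(bare)    # ['dasdha', 'hasud']
```

With delimiters at the edges, the zero-width alternatives are suppressed by the lookarounds, so no empty strings are produced at either end.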
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
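You can see that extra zero-width match by printing the spans, here on a short made-up string that ends with a delimiter:

```python
import re

spans = [m.span() for m in re.finditer(r'[ ]+|$', 'a b ')]

# The trailing space is matched as (3, 4), and then $ matches again,
# with zero width, at position 4.
print(spans)  # [(1, 2), (3, 4), (4, 4)]
```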
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.

It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting characters. You could instead just use a negated character class ([^...]) to achieve this:
# match any number of consecutive non-delim chars
import re
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '  # raw string avoids an invalid-escape warning for \-
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")

As a more Pythonic way to handle such tasks, you can use the str.startswith() method; you don't need a regex.
As for your regex "[>]{1}.*": you don't need {1} after your character class, and you can mark the start of the string with the anchor ^. So it can be "^>.*"

Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the site also notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list were provided inside the square brackets (as opposed to a single character), the regex would attempt to match a single character from that list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
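For comparison, a short sketch with made-up lines showing that str.startswith() and the simplified regex select the same lines. Since re.match already starts at the beginning of the string, the leading ^ is optional there.

```python
import re

lines = [">header one", "plain text", ">header two"]  # hypothetical input

with_startswith = [line for line in lines if line.startswith(">")]
with_regex = [line for line in lines if re.match(r">.*", line)]

print(with_startswith)                # ['>header one', '>header two']
print(with_startswith == with_regex)  # True
```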

Regular Expression Quick Query

Quick regular expressions question.
I want an expression that will find the first digit in a line and also a word at the end of that line. (this will exclude any digits in there)
I.e., if the string is "12345hello" then I want the regular expression to find "1hello"
Or even if it's "12345hel45667lo" to find the same thing.
I have the first digit down but my expression I thought would work is:
print(re.findall(r'^\d\D+', string))
This just gives me empty brackets, or the first digit if I take out the \D. What gives?
Edit: If I put in a | for or then I get what I want sort of. Returns the words in the string along with the first digit but in separate groupings. I want it all in one.

print(re.findall(r'^\d|\D+', string))
print(re.sub(r'(?<!^)\d', '', "12345hel45667lo9a"))  # -> '1helloa'

The only thing I can think of is to run a for loop that scans across the string for letters and combines them.
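No explicit loop is needed: one possible sketch is to join the pieces returned by the | pattern from the edit above. The function name first_digit_and_letters is made up for illustration.

```python
import re

def first_digit_and_letters(text):
    """Join the leading digit (if any) with every run of non-digits."""
    return "".join(re.findall(r"^\d|\D+", text))

print(first_digit_and_letters("12345hello"))       # 1hello
print(first_digit_and_letters("12345hel45667lo"))  # 1hello
```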
