I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.
>>> def findString(inputStr, targetStr):
... if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
... return True
... return False
...
>>> def convertToStringSoup(testStr):
... testStr = testStr.lower()
... testStr = testStr.replace(" ", "")
... testStr = testStr.replace("(", "")
... testStr = testStr.replace(")", "")
... return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
should work according to your specification. Obviously, could be optimized. You're asking about regex, which I'm thinking hard about, and will hopefully edit this question soon with something good. If this isn't too slow, though, regexps can be miserable, and readable is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex you do would either need to have many clauses or the text would have to be modified pre-regex like I did in this answer. I haven't benchmarked string.find() vs re.find(), but I imagine the former would be faster in this case.
I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').upper().find(needle.upper())
Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
Using a regex, a case-insensitive search matches upper/lower case invariants, '[]' matches any contained characters and '|' lets you do multiple compares at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
regex = '|'.join(r'%s[ -\)]%s' % tuple(p.split('-')) for p in pairs)
print regex
results = re.findall(regex, your_text_here, re.IGNORECASE)
Related
Using the python re.sub, is there a way I can extract the first alpha numeric characters and disregard the rest form a string that starts with a special character and might have special characters in the middle of the string? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words it replaces the whole string with just what was matched by the part of the regexp inside the brackets. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* will be greedy (match as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
print(matchobj.group(1))
else:
print("did not match")
But the question called for the use of re.sub.
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
... print (res.group())
...
my
Code Demo
This is not a complete answer. [A-Za-z]+ will give give you ['my','name']
Use this to further explore: https://regex101.com/
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "")
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9])", textBlock[indexVal]).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 12), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
You need to use capture groups in this case you described:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print re.search(p, test_str).group(1)
IDEONE demo
Unless you need overlapping matches, capturing groups usage instead of a look-behind is rather straightforward.
print re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)",test_str)
You can directly use findall which will return all the groups in the regex if present.
I'm trying to find ways to do it other than these two:
# match last occurence of \d+, 24242 in this case
>>> test = "123_4242_24242lj.:"
>>> obj = re.search(r"\d+(?!.*\d)", test)
>>> obj.group()
'24242'
>>> re.findall(r"\d+", test)[-1]
'24242'
I'm sure you can find more clever regular expressions that will do this, but I think you should stick with findall().
Regular expressions are hard to read. Not just by others: let 10 days go by since the time you wrote one, and you'll find it hard to read too. This makes them hard to maintain.
Unless performance is critical, it's always best to minimize the work done by regular expressions. This line...
re.findall(r"\d+", test)[-1]
... is clean, concise and immediately obvious.
This lookahead based regex matches last digits in a string:
\d+(?=\D*$)
I'm trying to find ways to do it other than these two:
A slight modification to your first approach. Capture the digits followed by anything that is not a digit at the end of the string.
>>> import re
>>> test = "123_4242_24242lj.:"
>>> print re.findall(r'(\d+)\D*$', test)
['24242']
>>>
Another alternate would be to substitute:
>>> re.sub(r'.*?(\d+)\D*$', "\\1", test)
'24242'
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
result = pattern.match(bit)
if(result != None):
emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^ which matches the start of a string and $ which matches the end of your string. If you remove it and then run it with your sample test case it will work
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>
I need something like grep in python
I have done research and found the re module to be suitable
I need to search variables for a specific string
To search for a specific string within a variable, you can just use in:
>>> 'foo' in 'foobar'
True
>>> s = 'foobar'
>>> 'foo' in s
True
>>> 'baz' in s
False
Using re.findall will be the easiest way. You can search for just a literal string if that's what you're looking for (although your purpose would be better served by the string in operator and you'll need to escape regex characters), or else any string you would pass to grep (although I don't know the syntax differences between the two off the top of my head, but I'm sure there are differences).
>>> re.findall("x", "xyz")
['x']
>>> re.findall("b.d", "abcde")
['bcd']
>>> re.findall("a?ba?c", "abacbc")
['abac', 'bc']
It sounds like what you really want is the ability to print a large substring in a way that lets you easily see where a particular substring is. There are a couple of ways to approach this.
def grep(large_string, substring):
for line, i in enumerate(large_string.split('\n')):
if substring in line:
print("{}: {}".format(i, line))
This would print only the lines that have your substring. However, you would lose a bunch of context. If you want true grep, replace if substring in line with something that uses the re module to do regular expression matching.
def highlight(large_string, substring):
from colorama import Fore
text_in_between = large_string.split(substring)
highlighted_substring = "{}{}{}".format(Fore.RED, substring, Fore.RESET)
print(highlighted_substring.join(text_in_between))
This will print the whole large string, but with the substring you are looking for in red. Note that you'll need to pip install colorama for it to work. You can of course combine the two approaches.