How to find all words followed by symbol using Python Regex? - python

I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall('\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!

'(\w+)\s*=\s*'
re.findall(r'(\w+)\s*=\s*', 'I think Python=amazing')    # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python =amazing')   # returns ['Python']

You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
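To illustrate the false positives mentioned above, here is a minimal sketch that compares the regex with Python's standard ast module. The sample source string is made up for the illustration, and the loop only handles plain "name = value" assignments (tuple targets, augmented and annotated assignments are ignored), so treat it as a starting point rather than a refactoring tool.

```python
import ast
import re

# Made-up sample: one real assignment, one comparison, one "=" inside a string.
source = "x = 1\nif x == 1:\n    y = 'a = b'\n"

# The regex reports every word followed by "=", including the comparison
# and the text inside the string literal.
regex_hits = re.findall(r"\w+(?=\s*=)", source)

# Walking the syntax tree only reports names that are actually assigned to.
assigned = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                assigned.append(target.id)

print(regex_hits)
print(sorted(assigned))
```

The regex list contains x twice and also a, while the ast-based list contains only x and y.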

Just add some optional whitespace before the =:
\w+(?=\s*=)

Use this instead
re.findall('^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»

You need to allow for whitespace between the word and the =:
re.findall('\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a lookahead around the equals sign:
re.findall('(\w+)\s*=', "I think Python = amazing")
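A quick way to check both variants against the three inputs from the question is sketched below. The variable names are only for illustration.

```python
import re

samples = [
    "I think Python=amazing",
    "I think Python = amazing",
    "I think Python =amazing",
]

lookahead = re.compile(r"\w+(?=\s*=)")  # "=" is checked but not consumed
group = re.compile(r"(\w+)\s*=")        # "=" is consumed; findall returns only the group

for text in samples:
    print(lookahead.findall(text), group.findall(text))
```

Both patterns give ['Python'] for each sample. The difference only matters if you later need the full match: with the lookahead the match is the word alone, with the group the match also includes the whitespace and the =.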

r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.

re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']

Related

how to find substring from a single line string

suppose, I have a string, s="panpanIpanAMpanJOEpan" . From this I want to find the word pan and replace it with spaces so that I can get the output string as "I AM JOE". How can I do it??
Actually I also don't know how to find certain substring from a long string without spaces such as mentioned above.
It will be great if someone helps me learning about this.
If you don't know pan, you can exploit the fact that the letters you want to find are all upper case.
fillword = min(set("".join(i if i.islower() else ' ' for i in s).split(' '))-set(['']),key=len)
This works by first replacing all upper case letters with space, then splitting on space and finding the minimal nonempty word.
Use replace to replace with space, and then strip to remove excess spacing.
s="panpanIpanAMpanJOEpan"
s.replace(fillword,' ').strip()
gives:
'I AM JOE'
Use replace:
s="panpanIpanAMpanJOEpan"
print(s.replace("pan"," ").strip())
Output:
I AM JOE
As DarrylG and others mentioned, .replace will do what you asked for, where you define what you want to replace ("pan") and what you want to replace it with (" ").
To find a certain string in a longer string you can use .find(), which takes a string you are looking for and optionally where to start and stop looking for it (as integers) as arguments.
If you want to find all of the occurrences of a string in a bigger string, there are two options:
Find the string with find(), then cut the string so it no longer contains your search term, and repeat this until .find() returns -1 (which means the search term is no longer found in the string),
or use the regex module and its .finditer method to find all occurrences of your string (link to someone explaining exactly that on Stack Overflow).
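Both options can be sketched as follows. find_all is a hypothetical helper name, and instead of cutting the string it uses the optional start argument of find(), which has the same effect for non-overlapping matches.

```python
import re

def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Continue searching right after the match that was just found.
        start = haystack.find(needle, start + len(needle))
    return positions

s = "panpanIpanAMpanJOEpan"

print(find_all(s, "pan"))
# re.escape keeps the search term literal even if it contains regex characters.
print([m.start() for m in re.finditer(re.escape("pan"), s)])
```

Both lines print [0, 3, 7, 12, 18].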
Edit: If you don't know what you are searching for, it becomes a bit more tricky, but you can write a regex expression that extracts this data as well, using the same regex module. This is easy if you know what the end result is supposed to be (I AM JOE in your case). If you don't, it becomes more complicated and we would need additional information to help with this.
You can use replace to replace all occurrences of a substring at once.
In case you want to find the substrings yourself, you can do it manually:
s = "panpanIpanAMpanJOEpan"
while True:
    panPosition = s.find('pan')  # -1 == 'pan' not found!
    if panPosition == -1:
        s = s.strip()
        break
    # Cut out pan from s and replace it with a blank.
    s = s[:panPosition] + ' ' + s[panPosition + 3:]
print(s)
Out:
I AM JOE
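If the filler word can appear several times in a row between two words, replacing each copy leaves a run of spaces. As a sketch of one alternative (not taken from the answers above), re.sub can collapse each run of "pan" into a single space:

```python
import re

s = "panpanIpanAMpanJOEpan"
collapsed = re.sub(r"(?:pan)+", " ", s).strip()
print(collapsed)

# Made-up input with a doubled filler between two words.
t = "IpanpanAM"
print(repr(t.replace("pan", " ")))        # two spaces between the words
print(repr(re.sub(r"(?:pan)+", " ", t)))  # one space
```

For the string from the question both approaches give the same result after strip(); they only differ when the filler repeats in the middle.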

Python multiple repeat Error

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (up to Python 3.10; from 3.11 on, ++ is accepted as a possessive quantifier, so the pattern compiles but still does not match a literal "++"). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the dot (\.).
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
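As a sketch of the character-class idea mentioned above: the single-character suffixes can go into one class, while the multi-character ones stay as alternatives. build_term_pattern and SUFFIX are hypothetical names for this illustration, and the suffix list is simply the one from the question.

```python
import re

# Single-character suffixes go into one character class; the multi-character
# ones ('s, runs of ! or ?) stay as alternatives. Inside [...] the dot and
# the parentheses are literal, so they need no backslash.
SUFFIX = r"""(?:'?s|!+|\?+|[,.;:()"])?"""

def build_term_pattern(term):
    """Whitespace, the literal term, an optional suffix, whitespace."""
    return re.compile(r"\s" + re.escape(term) + SUFFIX + r"\s", re.IGNORECASE)

print(bool(build_term_pattern("google").search("I love google!!! ")))
print(bool(build_term_pattern("dog").search("I love dogs ")))
print(bool(build_term_pattern("dog").search("I love doge ")))

# The term that triggered "multiple repeat" now compiles and matches itself.
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
print(bool(build_term_pattern(term).search(" " + term + " ")))
```

The first two searches succeed, the third does not (the e after dog is not an allowed suffix), and the last one succeeds because re.escape treats ++ as literal plus signs.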
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    pass  # do whatever
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
I had example_str = "i love you c++" and got the "multiple repeat" error when using it in a regex. The error occurs because the string contains "++", and + is a special character (a quantifier) in regex syntax. My fix was to use re.escape(); here is my code:
example_str = "i love you c++"
# word_en is the text being searched; a trailing \b cannot match after "+", so (?!\w) is used instead
regex_word = re.search(rf'\b{re.escape(example_str)}(?!\w)', word_en)
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Find words with capital letters not at start of a sentence with regex

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.
The best way I can think of is to check that the word is not preceded by a full stop then a space. I am pretty sure that I need to use negative lookbehind. This is what I have so far, it will run but always returns nothing:
(?<!\.\s)\b[A-Z][a-z]*\b
I think the problem might be with the use of [A-Z][a-z]* inside the word boundary \b but I am really not sure.
Thanks for the help.
Your regex appears to work:
In [6]: import re
In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']
Make sure you're using a raw string (r'...') when specifying the regex.
If you have some specific inputs on which the regex doesn't work, please add them to your question.
Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:
import string
S = "T'was brillig, and the slithy Toves were gyring and gimbling in the " + \
    "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe."
LS = S.split(' ')
# Pair each word with its predecessor; the leading '.' marks the first word as a sentence start.
words = [x for (pre, x) in zip(['.'] + LS, LS + [' '])
         if (x[0] in string.ascii_uppercase) and (pre[-1] != '.')]
Try and loop over your input with:
(?!^)\b([A-Z]\w+)
and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is everything but a beginning of line. A negative lookbehind would have the same effect.
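The two ideas above (not at the start of the text, not right after a full stop) can be combined in one pattern. The sketch below is an assumption about what "start of a sentence" means: it treats ., ! and ? followed by exactly one whitespace character as a sentence end, and the sample text is made up.

```python
import re

# (?<!^)        : not at the very start of the text
# (?<![.!?]\s)  : not right after sentence-ending punctuation plus one whitespace
MID_SENTENCE_CAPITAL = re.compile(r"(?<!^)(?<![.!?]\s)\b[A-Z][a-z]*\b")

text = "The cat met Alice. Then Bob left for Paris! Nobody saw Eve."
print(MID_SENTENCE_CAPITAL.findall(text))
```

This prints ['Alice', 'Bob', 'Paris', 'Eve']. Both lookbehinds have a fixed width, which Python's re requires.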

Match a string that does not end with a list of known strings

I want to match street names, which could come in forms of " St/Ave/Road". The postfix may not exist at all, so it may just be "1st". I also want to know what the postfix is. What is a suitable regex for it? I tried:
(.+)(\s+(St|Ave|Road))?
But it seems like the first group greedily matches the entire string. I tried a look back (?<!), but couldn't get it to work properly, as it kept on spewing errors like "look-behind requires fixed-width pattern".
If it matters at all, I'm using Python.
Any suggestions?
Just make your first group non-greedy by adding a question mark, and anchor the end with $ so the lazy group still has to consume the whole name:
(.+?)(\s+(St|Ave|Road))?$
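A minimal sketch of the non-greedy approach, assuming the pattern is anchored with $ (without an anchor, a lazy group followed by an optional group stops after one character). split_street is a hypothetical helper name, and the suffix group is made non-capturing on the outside so that group 2 is the suffix itself.

```python
import re

STREET = re.compile(r"(.+?)(?:\s+(St|Ave|Road))?$")

def split_street(name):
    """Return (street, suffix); suffix is None when no known suffix is present."""
    m = STREET.match(name)
    if m is None:  # only happens for an empty string
        return None
    return m.group(1), m.group(2)

for name in ["Main Road", "Martin Luther King Ave", "1st"]:
    print(split_street(name))
```

The first two names come back with 'Road' and 'Ave' as the suffix, and "1st" comes back as ('1st', None).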
As an alternative to regex-based solutions, how about:
suffix = s.split(' ')[-1]
if suffix in ('St', 'Ave', 'Road'):
    print('suffix is', suffix)
else:
    print('no suffix')
If you do have to use regular expressions, simply make the first match non-greedy, like so: r'.*?\s+(St|Ave|Road)$'
In [28]: print re.match(r'(.*?)\s+(St|Ave|Road)$', 'Main Road')
<_sre.SRE_Match object at 0x260ead0>
In [29]: print re.match(r'(.*?)\s+(St|Ave|Road)$', 'nothing here')
None
You wanted negative look ahead
(?!(St|Ave|Road))$
How about negative lookbehind (Python's re needs fixed-width lookbehinds, so use one per suffix):
(?<!St)(?<!Ave)(?<!Road)$
it seems to express the requirement closely

Find two of the same character in a string with regular expressions

This is in reference to a question I asked before here
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search a string for instances of two identical vowels in a row, such as the "oo" in "took" or the "ee" in "bees", and replace each pair with a single copy of that vowel followed by a colon (:).
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.
Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi
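In Python, re.sub already replaces every match, so no "global" flag is needed, and re.IGNORECASE plays the role of the /i modifier. A minimal sketch, with mark_double_vowels as a hypothetical helper name:

```python
import re

def mark_double_vowels(word):
    # ([aeiou]) captures one vowel, \1 requires the same vowel again;
    # the pair is replaced by the captured vowel plus a colon.
    return re.sub(r"([aeiou])\1", r"\1:", word, flags=re.IGNORECASE)

for word in ["took", "waaeek", "raaag", "Eel"]:
    print(word, "->", mark_double_vowels(word))
```

The first three words give the outputs expected in the question (to:k, wa:e:k, ra:ag). With the flag set, a mixed-case pair such as the "Ee" in "Eel" is matched too and keeps the case of its first letter.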
What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took", "waaeek", "raaag"]
for s in data:
    print(re.sub(r'([aeiou]){2}', r'\1:', s))
This matches exactly two occurrences ({2}) of any member of the set [aeiou] and replaces them with the last vowel captured by the parentheses (), which \1 places in the replacement string, followed by a ':'.
Output:
to:k
wa:e:k
ra:ag
You'll need to use a back reference in your search expression. Try something like: ([a-z])\1+ (or ([a-z])\1 for just a double).
