How to handle " in Regex Python - python

I am trying to grab fary_trigger_post in the code below using Regex. However, I don't understand why it always includes " in the end of the matched pattern, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.

The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
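To make the difference concrete, here is a minimal sketch that runs both patterns against the input from the question. The variable names are only illustrative.

```python
import re

text = '-instance "fary_trigger_post" '.strip()

# Greedy: .+ consumes everything up to the end of the input, and the
# trailing [ "']* is allowed to match nothing, so the quote stays in group 1.
greedy = re.match(r'-instance[ "\']*(.+)[ "\']*$', text, flags=re.S).group(1)

# Negated class: [^"']+ cannot cross a quote, so it stops right before it.
negated = re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', text, flags=re.S).group(1)

print(greedy)   # fary_trigger_post"
print(negated)  # fary_trigger_post
```

The negated class assumes the instance name itself never contains a quote character.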

Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'

Group 1, (.+), matches any character one or more times, and because it is greedy it takes as many characters as it can, so it runs to the end of the string. I would suggest the following pattern instead:
'-instance[ "\']*(.+)["\']+ *$'
This requires at least one closing quote, followed by any trailing spaces, to be matched outside the group, so the quote is not included in group 1.
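A short sketch of how this pattern behaves, assuming the same input as in the question. It also shows the trade-off: since a closing quote is now mandatory, unquoted input no longer matches.

```python
import re

pattern = re.compile(r'-instance[ "\']*(.+)["\']+ *$')

# The greedy group first grabs everything, then backtracks until
# ["']+ can match the closing quote and " *" the trailing space.
quoted = pattern.match('-instance "fary_trigger_post" ')
print(quoted.group(1))  # fary_trigger_post

# With no quote in the input, ["']+ has nothing to match.
unquoted = pattern.match('-instance fary_trigger_post')
print(unquoted)  # None
```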

Related

regex expression to get all digits before full stop

I want to do what the title says, but I can't get it to work.
string = "tex3591.45" # please be aware that my digits are half-width
text_temp = re.findall("(\d.)", string)
My current output is:
['35', '91', '45']
My expected output is:
['3591.'] # with the "." at the end, no matter how many digits come before the full stop
You need to escape the .:
text_temp = re.findall(r"\d+\.", string)
since . is a special character in regex, which matches any character. Added the + also to match 1 or more digits.
Or, if you are actually using 'FULLWIDTH FULL STOP' (U+FF0E), you can use that character in the regex directly without escaping it, since it is not a regex metacharacter:
text_temp = re.findall(r"\d+．", string)
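For comparison, a small sketch that runs the original attempt and both suggested patterns. The fullwidth sample string is an assumption made up for illustration; the question's own string uses an ASCII full stop.

```python
import re

string = "tex3591.45"

# \d. matches one digit followed by ANY character (the original attempt).
attempt = re.findall(r"(\d.)", string)
print(attempt)  # ['35', '91', '45']

# \d+\. matches one or more digits followed by a literal full stop.
ascii_matches = re.findall(r"\d+\.", string)
print(ascii_matches)  # ['3591.']

# U+FF0E is an ordinary character for the regex engine, so no escape is needed.
fullwidth_matches = re.findall("\\d+\uff0e", "tex3591\uff0e45")
print(fullwidth_matches)  # ['3591．']
```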
You can use this regex along with re.findall to get your desired result:
\d(?=.*?\.)
This returns the individual digits.
Demo in regex 101
\d+(?=.*?\.)
Demo2
This returns each run of digits as one string.
I used a positive lookahead with a lazy .*? to check that there is a full stop somewhere after a digit before returning it. Note that the lookahead does not consume the full stop, so it is not part of the match. Hope this helps :).
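A minimal sketch of the lookahead approach, assuming the lookahead is meant to test for a literal full stop (written as \. here).

```python
import re

string = "tex3591.45"

# Each digit that has a full stop somewhere to its right.
digits = re.findall(r"\d(?=.*?\.)", string)
print(digits)  # ['3', '5', '9', '1']

# Each run of digits that has a full stop somewhere to its right.
runs = re.findall(r"\d+(?=.*?\.)", string)
print(runs)  # ['3591']
```

Two things differ from the expected output in the question: the full stop is not included in the match, and the digits do not have to sit directly before the full stop, only somewhere ahead of it.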

Regular Expression to match a mandatory symbol in an optional part of a string?

What is the regular expression that matches for a mandatory symbol in an optional part of a string.
For example, abcd will be matched by the RE but, if I add :, the resulting string will not be matched unless I add letter(s) afterwards like this abcd:efg.
So, the optional part is the : onward, and the mandatory symbol in this optional part is the : itself.
abcd:efg:hijk needs to be matched as well.
UPDATE:
I tried this ^([a-z]|_)*(:[a-z]|_)*$ but it did not work as expected.
You should include more examples and counter-examples, but this should be close enough to your goal:
^[a-z_]+(:[a-z_]+)*$
Here's a test.
The problem with your ^([a-z]|_)*(:[a-z]|_)*$ regex is that it only matches one letter after each :. a:b:c:d matches but not a:b:c:de.
Finally, please note that (:[a-z]|_) means either:
a colon followed by a letter
or an underscore.
It doesn't match a colon followed by an underscore!
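As a quick check, this sketch runs both the suggested pattern and the pattern from the question over a few sample strings. The sample list is made up for illustration.

```python
import re

suggested = re.compile(r"^[a-z_]+(:[a-z_]+)*$")
original = re.compile(r"^([a-z]|_)*(:[a-z]|_)*$")

samples = ["abcd", "abcd:", "abcd:efg", "abcd:efg:hijk", "a:b:c:d", "a:b:c:de"]
for s in samples:
    # Columns: input, suggested pattern matches, original pattern matches
    print(s, bool(suggested.match(s)), bool(original.match(s)))
# abcd True True
# abcd: False False
# abcd:efg True False
# abcd:efg:hijk True False
# a:b:c:d True True
# a:b:c:de True False
```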
I would prefer a regex with a positive lookbehind. This also makes it easier to group the matching parts. It first matches the first string, and then matches all the following strings when preceded with a ":"
([a-z_]*)((?<=:):[a-z_])?
https://regex101.com/r/NkiZ3g/1
Your problem is that you need to know how to express optionality for a stretch longer than a single character. Try this:
^abcd(:efg)?$
For abcd and efg substitute whatever you're really looking for.

Python using re to match string in a specific pattern

I am trying to use python re to match a string with a specific pattern.
The problem is that I have this expected sentence:
"It is X. not X"
X can be anything: a word, several words, or a number.
The pattern I build is:
It is \w+. not \w+
just using
string.replace("X", "\w+")
It works if X is a word, or bunch of words, or int, but not for digits. How can I build my pattern in order to match everything in this pattern?
The . is a special character in a regular expression that will match any character. So .+ will match one or more characters.
r"It is .+\. not .+"
Note that the period is escaped as \. because in that position you want to match an actual period.
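Since the question builds the pattern from a template with str.replace, here is a minimal sketch of that step with the literal text escaped first. The helper name template_to_pattern and its placeholder parameter are hypothetical, not part of the original code.

```python
import re

def template_to_pattern(template, placeholder="X"):
    # Escape the literal parts (including the period), then swap each
    # placeholder for ".+" so it can stand for any non-empty text.
    # Assumes the placeholder is alphanumeric (so re.escape leaves it alone)
    # and does not also occur in the literal text.
    return re.compile(re.escape(template).replace(placeholder, ".+"))

pattern = template_to_pattern("It is X. not X")

print(bool(pattern.fullmatch("It is red. not blue")))   # True
print(bool(pattern.fullmatch("It is 3.14. not 2.71")))  # True
print(bool(pattern.fullmatch("It is red, not blue")))   # False
```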
Because .+ on its own won't work in some cases, for example
It is quote. not a double-quote
It is a dog. not a cat
I would use this one instead:
(?<=It is ).+(?=\.)|(?<=not ).+$
Explanation
(?<=It is ).+(?=\.) Any consecutive characters preceded by "It is " and followed by a period
| OR
(?<=not ).+$ Any consecutive characters preceded by "not " and followed by the end-of-line anchor
To also allow an empty X, use .* in place of .+:
(?<=It is ).*(?=\.)|(?<=not ).*$
Demo
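A short sketch of the lookaround pattern used with re.findall, which returns the two X values directly. The sample sentences are the ones quoted in this answer.

```python
import re

pattern = re.compile(r"(?<=It is ).+(?=\.)|(?<=not ).+$")

dog = pattern.findall("It is a dog. not a cat")
print(dog)  # ['a dog', 'a cat']

quote = pattern.findall("It is quote. not a double-quote")
print(quote)  # ['quote', 'a double-quote']
```

One caveat: the greedy .+ in the first alternative runs up to the last period in the sentence, so values that themselves contain a period, such as 3.14, are not separated cleanly.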
I have figured it out: str.replace("X", "(\w+|\d+\.\d+)") solves the problem. I hope this helps others with the same issue.

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when the text contains 798681505951?
How do I terminate my regex? (There must be no letters or digits immediately before or after the match, and the number must contain exactly 10 digits.)
Solution
If somebody needs to find US phone numbers in a string, use the link from the last comment by Wiktor Stribiżew.
You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
See another demo
Also, note that there is no need to escape a . inside a character class; it is already parsed as a literal dot in [.]. There is also no need for the lazy ?? quantifier: it does not make sense here, and the greedy version, ?, works equally well and looks "cleaner".
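A minimal sketch that runs the pattern with \b and \B over the sample numbers from the question plus the 12-digit string that should be rejected. The sample text is assembled here for illustration only.

```python
import re

phone = re.compile(
    r"(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}"      # 617.933.6444, 9999999999, ...
    r"|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})"  # (880)-567-4565, (880) 567-4565
    r"\b"                                    # no word character may follow
)

text = (
    "617.933.6444 (880)-567-4565 (880) 567-4565 "
    "222-333-8888 555 666 4444 9999999999 798681505951"
)

numbers = phone.findall(text)
print(numbers)
# ['617.933.6444', '(880)-567-4565', '(880) 567-4565',
#  '222-333-8888', '555 666 4444', '9999999999']
```

The 12-digit string is skipped because there is no word boundary after its tenth digit, and no later position inside it starts at a boundary.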

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re

def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
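If only the text between the two markers is wanted, as the question originally asked, a capture group can be added. The following is a sketch under that assumption; the name find_inner and the shortened sample text are hypothetical, and the lazy (.*?) is a deliberate change from the greedy .* above.

```python
import re

def find_inner(prefix, suffix, text):
    # (.*?) is lazy, so the match stops at the first suffix after the prefix.
    pattern = "{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"

inner = find_inner("_abc:", "xyz", sample)
print(repr(inner))  # '\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\n'

# Without re.DOTALL the dot cannot cross the newlines, so nothing is found.
no_flag = re.search(r"_abc:(.*?)xyz", sample)
print(no_flag)  # None
```

Whether the lazy or the greedy form is the right one depends on whether the suffix can appear more than once in the text.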
