Python regex match pattern "X<string1>:X<string2>" - python

I'm parsing a file which has text "$string1:$string2"
How do I regex match this string and extract "string1" and "string2" from it, basically regex match this pattern : "$*:$*"

You were nearly there with your own pattern, it needs three alterations in order to work as you want it.
First, the star in regexes isn't a glob, as you might be expecting it from shell scripting, it's a kleene star. Meaning, it needs some character group it can apply it's "zero to n times" logic on. In your case, the alphanumeric character class \w should work. If that's too restrictive, use . instead, which matches any character except line breaks.
Secondly, you need to apply the regex in a way that you can easily extract the results you want. The usual way to go about it is to define groups, using parentheses.
Last but not least, the $ sign is a meta-character in regexes, so if you want to match it literally, you need to write a backslash in front of it.
In working code, it'll look like this:
import re
s = "$string1:$string2"
r = re.compile(r"\$(\w*):\$(\w*)")
match = r.match(s)
print(match.group(1)) # print the first group that was matched
print(match.group(2)) # print the second group that was matched
Output:
string1
string2

Related

how to write regex to accept the string which end with string

I want to write a regex which accepts this:
Accept:
done
done1
done1,done2,done3
Do not accept:
done1,
done1,done2,
I tried to write this regex
([a-zA-Z]+)?(/d)?(,)([a-zA-Z]+)
but it is not working.
What's wrong? How can I fix it?
I would phrase the regex pattern as:
(?<!\S)\w+(?:,\w+)*(?!\S)
Sample script:
inp = "done done1 done1,done2,done3 done1, done1,done2,"
matches = re.findall(r'(?<!\S)\w+(?:,\w+)*(?!\S)', inp)
print(matches) # ['done', 'done1', 'done1,done2,done3']
Here is an explanation of the regex pattern:
(?<!\S) assert that what precedes is either whitespace or the start of the input
\w+ match a word
(?:,\w+)* followed by comma another word, both zero or more times
(?!\S) assert that what follows the final word is either whitespace
or the end of the input
It also depends on how you apply the regex. The regex alone (e.g. when used with re.search()) tells you whether the input contains any substring which matches your regex. In the trivial case, if you are examining one line at a time, add start and end of line anchors around your regex to force it to match the entire line.
Also, of course, notice that the regex to match a single digit is \d, not /d.
Your regex looks like you want both the alphabetics and the numbers to be optional, but the group of alphabetics and numbers to be non-empty; is that correct? One way to do that is to add a lookahead (?=[a-zA-Z\d]) before the phrase which matches both optionally.
import re
tests = """\
done
done1
done1,done2,done3
done1,
done1,done2,
"""
regex = re.compile(r'^(?=[a-zA-Z\d])[a-zA-Z]*\d?(?:,(?=[a-zA-Z\d])[a-zA-Z]*\d?)*$')
for line in tests.splitlines():
match = regex.search(line)
if match:
print(line)
The individual phrases here should be easy to understand. [a-zA-Z]* matches zero or more alphabetics, and \d? matches zero or one digits. We require one of those, followed by zero or more repetitions of a comma followed by a repeat of the first expression.
Perhaps also note that [a-zA-Z\d] is almost the same as \w (the latter also matches an underscore). If you don't care about this inexactness, the expression could be simplified. It would certainly be useful in the lookahead, where the regex after it will not match an underscore anyhow. But I've left in the more complex expression just to make the code easier to follow in relation to the original example.
Demo: https://ideone.com/4mVGDh

Match only final character of regex

I have some Python code that involves a lot of re.sub() commands. In some cases, I want to replace a character but only if it comes after certain other characters. The following is an example of how I currently am doing this in python:
secStress = "[aeiou],"[-1]
So my input for this would be a string like "a,s I walk, I hum." And I want to replace that first comma but not the "a" that comes before it.
The problem is that Python doesn't like when I give it a variable as input for re.sub(). Is there a way I can write a regex that specifies that only its final character is supposed to be matched?
You are looking for either a capturing group/backreference or a positive lookbehind solution:
s = "a,s I walk, I hum."
# Capturing group / backreference
print(re.sub(r"([aeiou]),", r"\1", s))
# Positive lookbehind
print(re.sub(r"(?<=[aeiou]),", "", s))
See the Python demo.
First approach details
The ([aeiou]) is a capturing group that matches a vowel and stores it in a special memory buffer that you can refer to from the replacement pattern using backreferences. Here, the Group ID is 1, so you can access that value using r"\1".
Second approach details
The (?<=[aeiou]) is a positive lookbehind that only checks (but does not add the text to the match value) if there is a vowel immediately before the current position. So, only those commas are matched that are preceded with a vowel and it is enough to replace with an empty string to get rid of the comma since it is the only symbol kept in the match.
If I understand you correctly,
>>> import re
>>> def doit(matchobj):
... return matchobj.group()[0]
...
>>> re.sub(r'[aeiou],', doit, "a,s I walk, I hum.")
'as I walk, I hum.'
If the regex matches then doit is called with the object that matched. Whatever string doit returns (and it must be a string) is put in place of the match.

Dynamically Removing string with regex python

I am currently having trouble removing the end of strings using regex. I have tried using .partition with unsuccessful results. I am now trying to use regex unsuccessfully. All the strings follow the format of some random words **X*.* Some more words. Where * is a digit and X is a literal X. For Example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!
Use following regex:
re.search("(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'
Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall('[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two integers, X matches the literal X, \d matches a single integer, \. matches the literal ., and \d matches the final integer.
This will match and return only 12X4.4.
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split('(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.
To remove everything after the pattern, i.e do exactly as you say...:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something else than what you said, something different will be needed! E.g if you want to also remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what, exactly, are your specs!-)

Regex: Complement a group of characters (Python)

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case, it will match any character, except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh and the | are actually interpreted as pipe characters which just appear multiple time in the character class (where duplicates are just ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print re.search(rx, 'bots') # ok
print re.search(rx, 'boxs') # None
Note that re doesn't support variable-width LBs, therefore you need two of them.
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried print
re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>>>re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . patter matches character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
result = re.search(pattern, text, re.DOTALL)
if result:
return result.group()
else:
return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.

Categories

Resources