Python: pattern matching for a string - python

Im trying to check a file line by line for any_string=any_string. It must be that format, no spaces or anything else. The line must contain a string then a "=" and then another string and nothing else. Could someone help me with the syntax in python to find this please? =]
pattern='*\S\=\S*'
I have this, but im pretty sure its wrong haha.

Don't know if you are looking for lines with the same value on both = sides. If so then use:
the_same_re = re.compile(r'^(\S+)=(\1)$')
if values can differ then use
the_same_re = re.compile(r'^(\S+)=(\S+)$')
In this regexpes:
^ is the beginning of line
$ is the end of line
\S+ is one or more non space character
\1 is first group
r before regex string means "raw" string so you need not escape backslashes in string.

pattern = r'\S+=\S+'
If you want to be able to grab the left and right-hand sides, you could add capture groups:
pattern = r'(\S+)=(\S+)'
If you don't want to allow multiple equals signs in the line (which would do weird things), you could use this:
pattern = r'[^\s=]+=[^\s=]+'

I don't know what the tasks you want make use this pattern. Maybe you want parse configuration file.
If it is true you may use module ConfigParser.

Ok, so you want to find anystring=anystring and nothing else. Then no need regex.
>>> s="anystring=anystring"
>>> sp=s.split("=")
>>> if len(sp)==2:
... print "ok"
...
ok

Since Python 2.5 I prefer this to split. If you don't like spaces, just check.
left, _, right = any_string.partition("=")
if right and " " not in any_string:
# proceed
Also it never hurts to learn regular expressions.

Related

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
with open(file) as f:
lines = f.readlines()
for line in lines:
line = line.lstrip()
if line.startswith(key + " ="):
return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
This is a pretty simple regex, you just need a positive lookbehind, and optionally something to remove the comments. (do this by appending ?(//)? to the regex)
r"(?<=description = ).*"
Regex101 demo

Python regex example

If I want to replace a pattern in the following statement structure:
cat&345;
bat &#hut;
I want to replace elements starting from & and ending before (not including ;). What is the best way to do so?
Including or not including the & in the replacement?
>>> re.sub(r'&.*?(?=;)','REPL','cat&345;') # including
'catREPL;'
>>> re.sub(r'(?<=&).*?(?=;)','REPL','bat &#hut;') # not including
'bat &REPL;'
Explanation:
Although not required here, use a r'raw string' to prevent having to escape backslashes which often occur in regular expressions.
.*? is a "non-greedy" match of anything, which makes the match stop at the first semicolon.
(?=;) the match must be followed by a semicolon, but it is not included in the match.
(?<=&) the match must be preceded by an ampersand, but it is not included in the match.
Here is a good regex
import re
result = re.sub("(?<=\\&).*(?=;)", replacementstr, searchText)
Basically this will put the replacement in between the & and the ;
Maybe go a different direction all together and use HTMLParser.unescape(). The unescape() method is undocumented, but it doesn't appear to be "internal" because it doesn't have a leading underscore.
You can use negated character classes to do this:
import re
st='''\
cat&345;
bat &#hut;'''
for line in st.splitlines():
print line
print re.sub(r'([^&]*)&[^;]*;',r'\1;',line)

Match any characters more than once, but stop at a given character

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match everything till the last value.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as less as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

Repeating a python regular expression until a certain char

I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
"+\n^.+\n^.+"
I am using re.MULTLINE, but should I be using re.DOTALL?
Thanks
Why does this need a regular expression?
index = str.find('!')
if index > -1:
str = str[index:] # or (index+1) to get rid of the '!', too
So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
try:
return s[:s.index(SEPARATOR)]
except ValueError:
return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character not in the string (try it in the command line: "Hola".index("1") (will raise ValueError: substring not found). The workflow then assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing if SEPARATOR is in the string); if you fail (the index method raises ValueError) then you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)
It should do the job.
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)
I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>
re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile("(.*)\!", re.S)
print rExp.search(text).groups()[0]
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg

Why doesn't the regex match when I add groups?

I have this regex code in python :
if re.search(r"\{\\fad|fade\(\d{1,4},\d{1,4}\)\}", text):
print(re.search(r"\{\\fad|fade\((\d{1,4}),(\d{1,4})\)\}", text).groups())
text is {\fad(200,200)}Épisode 101 : {\i1}The Ghost{\i0}\Nv. 1.03 and read from a file (don't know if that helps).
This returns the following:
(None, None)
When I change the regex in the print to r"\{\\fad\((\d{1,4}),(\d{1,4})\)\}", it returns the correct values:
(200, 200)
Can anyone see why the conditional fad|fade matches the regex in the re.search but doesn't return the correct values of the groups in the print?
Thanks.
Put extra parens around the choice: re.search(r"{(?:\\fad|fade)\((\d{1,4}),(\d{1,4})\)}", text).groups()
Also, escaping {} braces isn't necessary, it just needlessly clutters your regexp.
The bracket is part of the or branch starting with fade, so it's looking for either "{fad" or "fade(...". You need to group the fad|fade part together. Try:
r"\{\\(?:fad|fade)\(\d{1,4},\d{1,4}\)\}"
[Edit]
The reason you do get into the if block is because the regex is matching, but only because it detects it starts with "{\fad". However, that part of the match contains no groups. You need to match with the part that defines the groups if you want to capture them.
Try this:
r"\{\\fade?\(\d{1,4},\d{1,4}\)\}"
I think your conditional is looking for "\fad" or "fade", I think you need to move a \ outside the grouping if you want to look for "\fad" or "\fade".
Try this instead:
r"\{\\fade?\((\d{1,4}),(\d{1,4})\)\}"
The e? is an optional e.
The way you have it now matches {\fad or fade(0000,0000)}
I don't know the python dialect of regular expressions, but wouldn't you need to 'group' the "fad|fade" somehow to make sure it isn't trying to find "fad OR fade(etc..."?

Categories

Resources