Convert a TRIM on regex - python

I have a large string and one of the lines is in the form of
Description: something here....
I want to get everything after Description: (the something here.... part) without any leading or trailing whitespace. Currently I'm doing it with a mix of regex and strip(). How could this be done entirely in regex? Currently:
re.search(r'Description:\s+(.+)', body).group(1).strip()
Other thoughts:
re.search(r'Description:\s+\w(.+)\w', body).group(1) # works
Also, why doesn't putting an anchor work in the above context?
re.search(r'Description:\s+\w(.+)$', body).group(1) # fails

You can use
Description:\s+(.*\S)
See the regex demo.
The point is that you need to match up to the last non-whitespace character. .* matches zero or more characters other than line break chars, as many as possible, and then backtracks so that \S matches the last non-whitespace character on the line.
If you have a multiline string and you need to get to the last non-whitespace character, you may add re.S / re.DOTALL option when passing the pattern above to a regex method, or re-write it as
Description:\s+(\S+(?:\s+\S+)*)
where \S+ matches one or more non-whitespace chars and (?:\s+\S+)* matches zero or more occurrences of one or more whitespaces followed by one or more non-whitespace chars.
See this regex demo.
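A minimal sketch of both patterns in Python, assuming made-up sample strings (body and multiline_body are hypothetical). It compares the results with the search-plus-strip() approach from the question:

```python
import re

# Hypothetical input: the Description value has trailing spaces.
body = "Title: demo\nDescription:   something here....   \nAuthor: me"

# Baseline from the question: capture the rest of the line, then strip().
baseline = re.search(r'Description:\s+(.+)', body).group(1).strip()

# .* grabs the rest of the line, then backtracks until \S can match
# the last non-whitespace character on that line.
trimmed = re.search(r'Description:\s+(.*\S)', body).group(1)

# Hypothetical multiline value: the text to capture spans two lines.
multiline_body = "Description:  first line\nsecond line  \n"

# With re.S the dot also matches newlines, so the capture runs to the
# last non-whitespace character of the whole string.
with_dotall = re.search(r'Description:\s+(.*\S)', multiline_body, re.S).group(1)

# The rewritten pattern gives the same result without any flag,
# because \s already matches newlines.
without_flag = re.search(r'Description:\s+(\S+(?:\s+\S+)*)', multiline_body).group(1)

print(repr(baseline), repr(trimmed))
print(repr(with_dotall), repr(without_flag))
```

Note that the second pattern keeps going across line breaks, so it is only a drop-in replacement when the value really is the last thing in the text.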

Related

Getting a correct regex for word starting and ending with different letters

I am quite new to regex and am having trouble formulating a regex to match a string whose first and last letters are different. I looked on the internet and found a regex that does just the opposite, i.e. it matches words that have the same starting and ending letter. Can anyone please help me understand whether I can negate this regex in some way, or create a new regex to match my requirements? The regex that needs to be modified or changed is:
^\s|^[a-z]$|^([a-z]).*\1$
This matches these Strings :
aba,
a,
b,
c,
d,
" ",
cccbbbbbbac,
aaaaba
But I want it to match strings like:
aaabbcz,
zba,
ccb,
cbbbba
Can anyone please help me in this regard? Thank you.
Note: I will be using this with Python regex, so the regex should be compatible with Python.

You don't need a regex for this, just use
s[0] != s[-1]
where s is your string. If you must use a regex, you can use this:
^(.).*(?!\1).$
This looks for
^ : beginning of string
(.) : a character (captured in group 1)
.* : some number of characters
(?!\1). : a character which is not the character captured in group 1
$ : end of string
Regex demo on regex101
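As a quick check, here is a small sketch that runs both the plain comparison and the regex over the sample strings from the question (the variable names are made up for illustration):

```python
import re

pattern = re.compile(r'^(.).*(?!\1).$')

# Sample strings taken from the question.
should_match = ["aaabbcz", "zba", "ccb", "cbbbba"]
should_not_match = ["aba", "a", "b", " ", "cccbbbbbbac", "aaaaba"]

samples = should_match + should_not_match

# Regex approach: True when the first and last characters differ.
regex_results = {s: bool(pattern.search(s)) for s in samples}

# Plain Python approach (assumes non-empty strings).
plain_results = {s: s[0] != s[-1] for s in samples}

for s in samples:
    print(repr(s), regex_results[s], plain_results[s])
```

Both approaches agree on these inputs. The plain comparison raises IndexError on an empty string, so that case would need its own guard.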

This part of your pattern ^([a-z]).*\1$ only accounts for chars a-z, but you also want to exclude " "
You can rewrite that pattern by putting the part after the capture group inside a negative lookahead.
^(.)(?!.*\1$).+
^ Start of string
(.) Capture a single char (including spaces) in group 1
(?!.*\1$) Negative lookahead, assert that the string does not end with the same character
.+ Match 1+ characters so that the string has a minimum of 2 characters
See a regex demo.
If the string should start and end with a non-whitespace character to prevent leading / trailing spaces, you can start the match with a non-whitespace character \S and also end the match with a non-whitespace character.
^(\S)(?!.*\1$).*\S$
See another regex demo.
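A short sketch of how the two lookahead patterns behave in Python; the test strings beyond those in the question (" ab", "ab ", "aa") are hypothetical additions to show the whitespace handling:

```python
import re

any_char = re.compile(r'^(.)(?!.*\1$).+')
no_outer_space = re.compile(r'^(\S)(?!.*\1$).*\S$')

def matches(pattern, s):
    """Return True when the pattern matches the string."""
    return pattern.search(s) is not None

# First and last characters differ: both patterns match.
print(matches(any_char, "zba"), matches(no_outer_space, "zba"))

# Same first and last character, or a single character: no match.
print(matches(any_char, "aba"), matches(any_char, "a"), matches(any_char, "aa"))

# Leading or trailing space: only the first pattern matches.
print(matches(any_char, " ab"), matches(no_outer_space, " ab"))
print(matches(any_char, "ab "), matches(no_outer_space, "ab "))
```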

Regex pattern for string having spaces only at end

I have a requirement where I need to match string which satisfies all of the below requirements -
String must be of length 12
String can have only following characters - Alphabets, Digits and Spaces
Spaces if any must be at the end of the string. Spaces in between are not allowed.
I have tried with below regex -
"^[0-9a-zA-Z\s]{12}$"
Above regex is satisfying requirement #1 and #2 but not able to satisfy #3.
Please help me to achieve the requirements.
Thanks in advance !!

You can use
^(?=.{12}$)[0-9a-zA-Z]*\s*$
If at least one alphanumeric character must exist:
^(?=.{12}$)[0-9a-zA-Z]+\s*$
Details:
^ - start of string
(?=.{12}$) - the string must contain exactly 12 chars
[0-9a-zA-Z]* - zero or more alphanumerics
\s* - zero or more whitespaces
$ - end of string.
See the regex demo.
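The question does not name a language, so the following is only a sketch of how the two patterns behave in Python's re module. The sample values are hypothetical and are padded with ljust() to avoid miscounting spaces:

```python
import re

allow_blank = re.compile(r'^(?=.{12}$)[0-9a-zA-Z]*\s*$')
need_alnum = re.compile(r'^(?=.{12}$)[0-9a-zA-Z]+\s*$')

padded = "ABC123".ljust(12)        # 6 alphanumerics + 6 trailing spaces
full = "ABCDEF123456"              # 12 alphanumerics, no spaces
inner_space = "ABC 123".ljust(12)  # space in the middle
too_short = "ABC123"               # only 6 characters
blank = " " * 12                   # 12 spaces

for value in (padded, full, inner_space, too_short, blank):
    print(repr(value),
          bool(allow_blank.search(value)),
          bool(need_alnum.search(value)))
```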

Use a non-word boundary \B:
^(?:[a-zA-Z0-9]|\s\B){12}$
demo
With it, a space can't be followed by a letter or a digit, but only by a non-word character (a space here) or the end of the string.
To ensure at least one character that isn't blank:
^[a-zA-Z0-9](?:[a-zA-Z0-9]|\s\B){11}$
Note that with PCRE you have to use the D (DOLLAR END ONLY) modifier to be sure that $ matches the end of the string and not before the last newline sequence. Or better, replace $ with \z. Python's re module behaves the same way by default: $ also matches just before a newline at the end of the string, so use \Z there if that matters.
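A minimal sketch of the \B approach in Python, using hypothetical sample values padded to 12 characters with ljust():

```python
import re

spaces_at_end = re.compile(r'^(?:[a-zA-Z0-9]|\s\B){12}$')
starts_non_blank = re.compile(r'^[a-zA-Z0-9](?:[a-zA-Z0-9]|\s\B){11}$')

padded = "ABC123".ljust(12)        # spaces only at the end
inner_space = "ABC 123".ljust(12)  # a space followed by a digit
blank = " " * 12                   # nothing but spaces

# A space followed by another space or by the end of the string
# satisfies \B; a space followed by a letter or digit does not.
print(bool(spaces_at_end.search(padded)))
print(bool(spaces_at_end.search(inner_space)))

# The second pattern additionally rejects an all-blank string.
print(bool(spaces_at_end.search(blank)), bool(starts_non_blank.search(blank)))
```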

You may use this regex:
^(?!.*\h\S)[\da-zA-Z\h]{12}$
RegEx Demo
RegEx Details:
^: Start
(?!.*\h\S): Negative lookahead to fail the match if a whitespace is followed by a non-whitespace character
[\da-zA-Z\h]{12}: Match 12 characters of alphanumerics or white space
$: End
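The \h shorthand (horizontal whitespace) is a PCRE/Java feature and, as far as I know, Python's re module does not support it. A hedged Python adaptation, assuming that space and tab are the only horizontal whitespace that matters here, replaces \h with [ \t]:

```python
import re

# Assumption: [ \t] stands in for \h, which is not part of Python's re syntax.
pattern = re.compile(r'^(?!.*[ \t]\S)[\da-zA-Z \t]{12}$')

padded = "ABC123".ljust(12)        # spaces only at the end
inner_space = "ABC 123".ljust(12)  # a space followed by a digit
too_short = "ABC123"

for value in (padded, inner_space, too_short):
    print(repr(value), bool(pattern.search(value)))
```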

unexpected result for python re.sub() with non-capturing character

I cannot understand the following output :
import re
re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'
According to the documentation :
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
So why is the whitespace included in the captured occurrence, and then replaced, since I added a non-capturing tag before it?
I would like to have the following output :
' fast-forward'

The non-capturing group still matches and consumes the matched text. Note that consuming means adding the matched text to the match value (the memory buffer allotted for the whole matched substring) and advancing the regex index accordingly. So, (?:\s) puts the whitespace into the match value, and it is replaced together with the ff.
You want to use a look-behind to check for a pattern without consuming it:
re.sub(r'(?<=\s)ff','fast-forward',' ff')
See the regex demo.
An alternative to this approach is using a capturing group around the part of the pattern one needs to keep and a replacement backreference in the replacement pattern:
re.sub(r'(\s)ff',r'\1fast-forward',' ff')
         ^  ^      ^^
Here, (\s) saves the whitespace in Group 1 memory buffer and \1 in the replacement retrieves it and adds to the replacement string result.
See the Python demo:
import re
print('"{}"'.format(re.sub(r'(?<=\s)ff','fast-forward',' ff')))
# => " fast-forward"

A non-capturing group still matches the pattern it contains. What you wanted to express was a look-behind, which does not match its pattern but simply asserts it is present before your match.
Although, if you are to use a look-behind for whitespace, you might want to consider using a word boundary metacharacter \b instead. It matches the empty string between a \w and a \W character, asserting that your pattern is at the beginning of a word.
import re
re.sub(r'\bff\b', 'fast-forward', ' ff') # ' fast-forward'
Adding a trailing \b will also make sure that you only match 'ff' when it is a whole word (delimited by non-word characters or the string boundaries), not at the beginning of a word such as in 'ffoo'.
See the demo.
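To make the differences concrete, here is a small sketch comparing the approaches on a few inputs. The inputs other than ' ff' are hypothetical and were chosen to show where the look-behind and the word boundary disagree:

```python
import re

def non_capturing(s):
    # The whitespace is part of the match, so it gets replaced too.
    return re.sub(r'(?:\s)ff', 'fast-forward', s)

def lookbehind(s):
    # The whitespace is only asserted, not consumed.
    return re.sub(r'(?<=\s)ff', 'fast-forward', s)

def backreference(s):
    # The whitespace is consumed, then restored through \1.
    return re.sub(r'(\s)ff', r'\1fast-forward', s)

def word_boundary(s):
    # 'ff' must be a whole word; no whitespace is required.
    return re.sub(r'\bff\b', 'fast-forward', s)

for text in (' ff', ' ffoo', 'ff', '-ff-'):
    print(repr(text),
          repr(non_capturing(text)),
          repr(lookbehind(text)),
          repr(backreference(text)),
          repr(word_boundary(text)))
```

The look-behind needs a whitespace character before ff, so it skips a bare 'ff' at the start of the string but still fires inside ' ffoo'. The word boundary version does the opposite on both inputs.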

Find first ReGex pattern following a different pattern

Objective: find a second pattern and consider it a match only if it is the first time the pattern was seen following a different pattern.
Background:
I am using Python-2.7 Regex
I have a specific Regex match that I am having trouble with. I am trying to get the text between the square brackets in the following sample.
Sample comments:
[98 g/m2 Ctrl (No IP) 95 min 340oC ]
[ ]
I need the line:
98 g/m2 Ctrl (No IP) 95 min 340oC
The problem is that the undetermined number of whitespace characters, tabs, and newlines between the search pattern Sample comments: and the match I want is giving me trouble.
Best Attempt:
I am able to match the first part easily,
match = re.findall(r'Sample comments:[.+\n+]+', string)
But I can't get the match to the length I want to grab the portion between the square brackets,
match = re.findall(r'Sample comments:[.+\n+]+\[(.+)\]', string)
My Thinking:
Is there a way to use ReGex to find the first instance of the pattern \[(.+)\] after a match of the pattern Sample comments:? Or is there a more robust way to find the bit between the square braces in my example case.
Thanks,
Michael

I suggest using
r'Sample comments:\s*\[(.*?)\s*]'
See the regex and IDEONE demo
The point is that \s* matches zero or more whitespace characters, both vertical (linebreaks) and horizontal. See Python re reference:
\s
When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.
Pattern details:
Sample comments: - a sequence of literal chars
\s* - 0 or more whitespaces
\[ - a literal [
(.*?) - Group 1 (returned by re.findall) capturing 0+ any chars but a newline as few as possible up to the first...
\s* - 0+ whitespaces and
] - a literal ] (note it does not have to be escaped outside the character class).
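A minimal sketch of the suggested pattern, assuming a made-up text value that mimics the sample from the question with a mix of newlines, a tab and spaces before the opening bracket:

```python
import re

pattern = r'Sample comments:\s*\[(.*?)\s*]'

# Hypothetical input modelled on the question's sample.
text = "Sample comments:\n\n\t [98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]\n"

# re.findall returns the contents of Group 1 for every match.
found = re.findall(pattern, text)
print(found)

# The same pattern also works when there is no whitespace at all.
compact = re.findall(pattern, "Sample comments:[x]")
print(compact)
```

The second bracket pair, [ ], is ignored because it does not directly follow Sample comments: and optional whitespace.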

Not sure if I understand your problem correctly, but re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', string) seems to work.
Or maybe re.findall('Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', string) if you want to strip the final spaces from your line?
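For comparison, a short sketch running both of these negated character class patterns on a hypothetical text value modelled on the question's sample:

```python
import re

# Hypothetical input modelled on the question's sample.
text = "Sample comments:\n\n\t [98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]\n"

# First variant: captures everything between the brackets as-is.
keeps_spaces = re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', text)

# Second variant: spaces and tabs next to the brackets stay outside the group.
strips_spaces = re.findall(
    'Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', text)

print(keeps_spaces)
print(strips_spaces)
```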

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.

* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will match line break characters with that enabled. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
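The re.DOTALL case described earlier can be sketched the same way. The strings lines and continued are hypothetical examples, not taken from the question:

```python
import re

lines = "one\ntwo\nthree"

# With re.DOTALL the greedy version runs to the last qualifying newline,
# swallowing several lines in a single match.
greedy_dotall = re.findall(r'.*[^\\]\n', lines, re.DOTALL)

# The lazy version stops at the first qualifying newline each time.
lazy_dotall = re.findall(r'.*?[^\\]\n', lines, re.DOTALL)

print(greedy_dotall)
print(lazy_dotall)

# Without the flag, a line ending in a backslash is skipped:
# the text is a, backslash, newline, b, newline.
continued = "a\\\nb\n"
skipped = re.findall(r'.*?[^\\]\n', continued)
print(skipped)
```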

. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? indicates that the preceding quantifier is lazy: it matches as few repetitions as possible while still letting the rest of the pattern match.

Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
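The documentation's example is easy to reproduce; a minimal sketch:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last > in the string.
greedy = re.match(r'<.*>', html).group()

# Lazy: .*? stops at the first > that lets the pattern succeed.
lazy = re.match(r'<.*?>', html).group()

print(greedy)
print(lazy)
```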
