Match parenthesis surrounded by spaces in python with regex - python

Why doesn't the following code block match the parentheses?
In [27]: import re
In [28]: re.match('.*?([\(]*)', ' (((( ' ).groups()
Out[28]: ('',)

Demonstrating my comment:
>>> import re
>>> re.match('.*?([\(]*)', ' (((( ' ).groups()
('',)
>>> re.match('.*?([\(]+)', ' (((( ' ).groups()
('((((',)
>>>
Note that you don't even need the backslash inside the [], since ( loses its special meaning inside a character class. So
>>> re.match('.*?([(]+)', ' (((( ' ).groups()
('((((',)
>>>
works too...
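This is because your "non greedy" first quantifier (*?) matches as little as possible, and it doesn't need to give anything to the second quantifier, since the second quantifier is happy with zero matches. Both match the empty string at position 0, so the match succeeds immediately with an empty group.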

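In your case .*? matches nothing, because you used [\(]*, which means 0 or more and can match the empty string right at the start, before the leading space. Changing * into + will work for you, as + means 1 or more and forces .*? to expand until it reaches the parentheses.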
re.match('.*?([\(]+)', ' (((( ' ).groups()
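A minimal sketch of the difference, looking at the whole match (group 0) as well as the captured group. The variable names are only for illustration:

```python
import re

text = ' (((( '

# With *, both quantifiers can match the empty string at position 0,
# so the overall match is empty and so is the group.
star = re.match(r'.*?([(]*)', text)

# With +, the group needs at least one '(', so the lazy .*? has to
# expand over the leading space before the group can match.
plus = re.match(r'.*?([(]+)', text)

print(repr(star.group(0)), star.groups())
print(repr(plus.group(0)), plus.groups())

# If the prefix is not needed at all, re.search finds the run directly.
found = re.search(r'[(]+', text)
print(repr(found.group(0)))
```

Printing group(0) makes it visible that the first pattern did match, just at zero width.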

Related

Why do I need an extra space in the regex pattern to make it work properly?

When I write the following code:
m = re.findall('\sf.*?\s','a f fast and friendly dog');
I get output: [' f ', ' friendly ']
But when I add an extra space between f & fast, I get the following output, which is what I expected from the previous one.
The code is as follows:
m = re.findall('\sf.*?\s','a f  fast and friendly dog');
Output:
[' f ', ' fast ', ' friendly ']
Can anyone tell me why I am not getting the latter output in the first case (without inserting an extra space between f & fast)?
Because your pattern ends in \s. Regex matches are non-overlapping, so the first match ' f ' matches the trailing space, making the rest of the string begin with 'fast' instead of ' fast'. 'fast' does not match a pattern starting with \s
The space is consumed by ' f ' after it is matched. Now the next search starts from 'fast and friendly dog'. But now fast does not have a leading space and thus does not match.
If you don't want the space to be consumed, try a lookaround: a positive lookbehind for the leading space, or a positive lookahead for the trailing one.
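As a rough sketch of that suggestion, here are the original pattern and two lookaround variants side by side. The variable names are made up for the example, and which variant fits depends on whether you want the surrounding spaces in the results:

```python
import re

text = 'a f fast and friendly dog'

# Original pattern: the trailing \s is consumed, so the next search
# starts at 'fast', which has no leading whitespace left to match.
consuming = re.findall(r'\sf.*?\s', text)

# Lookahead: the trailing whitespace is checked but not consumed,
# so it is still available as the leading \s of the next match.
with_lookahead = re.findall(r'\sf.*?(?=\s)', text)

# Lookbehind: the leading whitespace is checked but not consumed,
# and a lookbehind may inspect characters a previous match used.
with_lookbehind = re.findall(r'(?<=\s)f.*?\s', text)

print(consuming)
print(with_lookahead)
print(with_lookbehind)
```

Note that the lookaround variants return the words without the whitespace that was only asserted.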

Is there a way to .replace() certain string snippets according to a criterion?

I'm importing from a .txt file containing some David Foster Wallace that I copy-pasted from a PDF. Some words ran off the page and so come in the form of
"interr- upted"
I was going to sanitize it by using something like:
with open(text, "r", 0) as bookFile:
    bookString = bookFile.read().replace("- ", "")
Except... the man also uses some weird constructions in his writing. Things like:
"R - - d©"
for the brand name bug spray Raid©. I'm left with "R d©" obviously, but is there a way to make it .replace() instances of "- " but not instances of " - "? Or do I need to turn everything into lists and do operations to everything that way? Thanks.
You could use a regular expression with a negative lookbehind assertion to check the previous character, and re.sub to replace matches with an empty string.
'(?<! )- ' is a regular expression matching all instances of '- ' that are not preceded by a space character; (?<!...) is the negative lookbehind syntax. re.sub('(?<! )- ', '', input_string) will replace all occurrences of the '(?<! )- ' pattern in input_string with '' (empty string) and return the result.
Examples:
In [1]: import re
In [2]: re.sub('(?<! )- ', '', 'interr- upted')
Out[2]: 'interrupted'
In [3]: re.sub('(?<! )- ', '', 'R - - d©')
Out[3]: 'R - - d©'
You can use lookbehinds and lookaheads to make sure you substitute only the occurrences that need to be substituted:
>>> import re
>>> regex_pattern = '(?<=[a-z])(- )(?=[a-z])'
>>> re.sub(regex_pattern, '', "interr- upted", flags=re.I)
'interrupted'
And,
>>> re.sub(regex_pattern, '', "R - - d©")
'R - - d©'
The latter is not affected.
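A small sketch that wraps the lookaround pattern in a reusable function. The names HYPHEN_BREAK and join_hyphenated are hypothetical, chosen for this example; the flag is given when compiling so it cannot be mistaken for re.sub's positional count argument:

```python
import re

# Hypothetical names; the pattern is the lookbehind/lookahead idea above:
# a '- ' is removed only when it sits directly between two letters.
HYPHEN_BREAK = re.compile(r'(?<=[a-z])- (?=[a-z])', re.IGNORECASE)

def join_hyphenated(text):
    """Rejoin words that were split as 'interr- upted'."""
    return HYPHEN_BREAK.sub('', text)

print(join_hyphenated('interr- upted'))
print(join_hyphenated('INTERR- UPTED'))
print(join_hyphenated('R - - d©'))
```

One caveat: a deliberate construction such as 'pre- and post-war' also has '- ' between two letters, so this pattern would join it as well. Whether that matters depends on the text.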
Is this what you need?
In [23]: import re
In [24]: re.sub(r'- ', '', '"R - - d"')
Out[24]: '"R d"'
HTH

How to replace all \W (non-letters) with the exception of '-' (dash) with a regular expression?

I want to replace every \W (non-letter) character, with the exception of the - dash, with a space, i.e.:
black-white will give black-white
black#white will give black white
I know regular expressions very well, but I have no idea how to deal with this.
Note that I want to use Unicode, so \w is not just [a-zA-Z] as in English.
I prefer Python re syntax, but I can read other suggestions.
Using negated character class: (\W is equivalent to [^\w]; [^-\w] => \W except -)
>>> re.sub(r'[^-\w]', ' ', 'black-white')
'black-white'
>>> re.sub(r'[^-\w]', ' ', 'black#white')
'black white'
If you use the third-party regex package, you can use nested sets and set operations:
>>> import regex
>>> print(regex.sub(r'(?V1)[\W--[-]]', ' ', 'black-white'))
black-white
>>> print(regex.sub(r'(?V1)[\W--[-]]', ' ', 'black#white'))
black white
I would use negative lookahead like below,
>>> re.sub(r'(?!-)\W', r' ', 'black-white')
'black-white'
>>> re.sub(r'(?!-)\W', r' ', 'black#white')
'black white'
(?!-)\W: the negative lookahead at the start asserts that the character we are about to match is not a hyphen, so the pattern matches any \W (non-word) character except -. It's a kind of subtraction: \W minus the character inside the negative lookahead (i.e. the hyphen).
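Since the question mentions Unicode, here is a short sketch with the negated character class on non-ASCII input. It assumes Python 3, where \w is Unicode-aware for str patterns by default; the function name is invented for the example:

```python
import re

def keep_words_and_dashes(text):
    # Hypothetical helper: anything that is neither a word character
    # nor '-' becomes a single space.
    return re.sub(r'[^-\w]', ' ', text)

print(keep_words_and_dashes('black-white'))
print(keep_words_and_dashes('black#white'))
print(keep_words_and_dashes('żółto-czarny#biały'))
print(keep_words_and_dashes('a_b1#c'))
```

Keep in mind that \w also covers digits and the underscore, so those are kept along with letters.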

Removing a character from beginning or end of a string in Python

I have a string that can, for example, contain - anywhere, along with some whitespace.
Using regex in Python, I want to remove - only if it comes before all other non-whitespace characters or after all non-whitespace characters. I also want to remove all whitespace at the beginning and the end.
For example:
string = ' - test '
it should return
string = 'test'
or:
string = ' -this - '
it should return
string = 'this'
or:
string = ' -this-is-nice - '
it should return
string = 'this-is-nice'
You don't need regex for this. str.strip removes every leading and trailing character that appears in the string passed to it, in any combination, so pass ' -' or '- ' to it.
>>> s = ' - test '
>>> s.strip('- ')
'test'
>>> s = ' -this - '
>>> s.strip('- ')
'this'
>>> s = ' -this-is-nice - '
>>> s.strip('- ')
'this-is-nice'
To remove any type of white-space character and '-' use string.whitespace + '-'.
>>> from string import whitespace
>>> s = '\t\r\n -this-is-nice - \n'
>>> s.strip(whitespace+'-')
'this-is-nice'
import re
out = re.sub(r'^\s*(-\s*)?|(\s*-)?\s*$', '', input)
This will remove at most one instance of - at the beginning of the string and at most one instance of - at the end of the string. For example, given input -  - text  - - , the output will be - text  -.
Note that \s matches Unicode whitespaces (in Python 3). You will need re.ASCII flag to revert it to matching only [ \t\n\r\f\v].
Since you are not very clear about cases such as   -text, -text-, -text -, the regex above will just output text for those 3 cases.
For strings such as   text  , the regex will just strip the spaces.
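To make the difference between the two approaches concrete, here is a small comparison sketch. EDGE_DASH and the sample list are names made up for this example; the pattern itself is the one from the regex answer:

```python
import re

EDGE_DASH = re.compile(r'^\s*(-\s*)?|(\s*-)?\s*$')

samples = [' - test ', ' -this - ', ' -this-is-nice - ', '-  - text  - - ']

results = []
for s in samples:
    stripped = s.strip('- ')            # drops every leading/trailing '-' and ' '
    substituted = EDGE_DASH.sub('', s)  # drops at most one '-' per end
    results.append((stripped, substituted))
    print(repr(s), '->', repr(stripped), '|', repr(substituted))
```

Both approaches agree on the three strings from the question and differ only when several dashes are stacked at one end, so the choice depends on how such input should be treated.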

Regex findall produces strange results in python3

I want to find all the docblocks of a string using python.
My first attempt was this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.search(string)
print(match.group(0))
And that worked, but as you'll notice yourself: it'll only print out 1 docblock, not all of them.
So I wanted to use the findall function, which says it would output all the matches, like this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.findall(string)
print(match)
But I never get anything useful, only these kinds of arrays:
[' ', ' ', ' ', '\t', ' ', ' ', ' ', ' ', ' ', '\t', ' ', ' ', ' ']
The documentation does say it'll return empty strings, but I don't know how this can be useful.
You need to move the quantifier inside the capture group:
b = re.compile('\/\*(.*?)\*/', re.M|re.S)
To expand a bit on Rohit Jain's (correct) answer, with the quantifier outside the parentheses you're saying "match (non-greedily) any number of single characters, capturing one character at a time". A repeated group only keeps its last repetition, so for each docblock findall returns just the final character before the closing */, which is why you get a list of spaces and tabs. By moving the quantifier inside the parens (that is, (.*?) instead of what you had before) you're now saying "match any number of characters, and capture all of them".
I hope this helps you understand what's going on a bit better.
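A quick sketch on a made-up source string shows the three behaviours side by side. The sample text and variable names are assumptions for illustration only:

```python
import re

source = "/* first */\nint a;\n/* second\n   block */\nint b;\n"

# Quantifier outside the group: the group is repeated, and findall
# returns only the last character each repetition captured.
last_chars = re.findall(r'/\*(.)*?\*/', source, re.S)

# Quantifier inside the group: the whole comment body is captured.
bodies = re.findall(r'/\*(.*?)\*/', source, re.S)

# No group at all: findall returns the full matches, delimiters included.
blocks = re.findall(r'/\*.*?\*/', source, re.S)

print(last_chars)
print(bodies)
print(blocks)
```

re.S is what lets . cross line breaks here; re.M only affects ^ and $, which this pattern does not use.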
