Clarifications on the re.findall() method in Python

I wanted to strip a string of punctuation marks and I ended up using
re.findall(r"[\w]+|[^\s\w]", text)
It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library and it says:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
I am not sure if I get this and the clarification sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?

To break it down, [] creates a character class. You'll often see things like [abc], which will match a, b or c. Conversely, you might also see [^abc], which will match anything that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges, and it will match any of a, b, c, A, B, C.
In this case, your character class contains the special tokens \w and \s. \w matches any "word" character. What counts as one depends on the pattern type and flags: for str patterns in Python 3 it covers Unicode letters and digits plus the underscore, and with the re.ASCII flag (or for bytes patterns) it is the same as [a-zA-Z0-9_], which matches anything in the ranges a-z, A-Z, 0-9, or _. \s is similar, but it matches anything that can be considered whitespace.
The + means that the previous item can be repeated one or more times, so [a]+ will match the entire string aaaaaaaaaaa. In your case, [\w]+ matches runs of word characters that are next to each other.
The | is basically like "or": match the stuff on the left, or match the stuff on the right if the left stuff doesn't match. Here the right side, [^\s\w], matches a single character that is neither whitespace nor a word character, which in practice means a punctuation mark or symbol.

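\w means alphanumeric characters plus "_", and \s means whitespace characters, including " \t\r\n\v\f". So [\w]+|[^\s\w] matches either a run of one or more word characters or a single character that is neither whitespace nor a word character (such as a punctuation mark). With findall(), that splits the text into words and individual punctuation marks and drops the whitespace.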
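As a rough illustration of how the two alternatives behave, here is a small sketch. The sample sentence and the variable names are made up for the example, and the filtering step at the end is just one possible way to drop the punctuation tokens, not something taken from the question.

```python
import re

# Hypothetical sample text, not from the question.
text = "Hello, world! It's 3.14"

# Left alternative: a run of word characters.
# Right alternative: one character that is neither whitespace nor a word character.
tokens = re.findall(r"[\w]+|[^\s\w]", text)
print(tokens)
# ['Hello', ',', 'world', '!', 'It', "'", 's', '3', '.', '14']

# Keeping only the tokens made of word characters drops the punctuation.
words = [t for t in tokens if re.fullmatch(r"\w+", t)]
print(" ".join(words))
# Hello world It s 3 14
```

Note that the pattern contains no parentheses, so findall() returns the whole matches as plain strings rather than groups.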

Related

How does this python code convert string to camelCase with regex sub() and group()? [duplicate]

I'm a total newbie, please go easy on me :)
This is a solution I found online of a kata from codewars
import re
def to_camel_case(text):
    return re.sub('[_-](.)', lambda x: x.group(1).upper(), text)
I looked up about re.sub() and group(), but I still couldn't put it together. I'm not sure how [_-](.) works, how come [_-](w+) doesn't work?
How did he get rid of the hyphen and underscore with sub? And how does it
successfully capitalize only the first char of each word except the first word?
I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char?
So to understand this block of code, you have to understand a bit of regular expressions and a bit of the re Python module. Let's first look at what re.sub does. From the docs, the signature of the function looks like
re.sub(pattern, repl, string, count=0, flags=0)
Of importance here are the pattern, repl, and string parameters.
pattern is a regular expression pattern to be replaced
repl is what you want to replace the matched pattern with, can be a string or function that takes a match object as an argument
string is the string you want the replacement to act on
The function is used to find portions of the string that match the regex pattern, and replace those portions with repl.
Now let's go into the regular expression used: [_-](.).
[_-] matches any of the characters within the square brackets (_ or -)
. matches any character
(.) captures any character in a capture group
Now let's put it all together. The full pattern will match two characters. The first character will be a _ or - and the second character can be anything. In effect, the following portions of these strings will be matched.
one_two: _t
test_3: _3
nomatchhere-: no match
thiswill_Match: _M
NoMatchHereEither_: no match
need_more_creative_examples-: _m, _c and _e
The important part here is that the (.) portion of the regex matches any character and stores it in a capture group, this allows us to reference that matched character in the repl part of the argument.
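To check which portions of the example strings are matched, a sketch along these lines can help. The names pattern, samples and results are only illustrative.

```python
import re

pattern = re.compile(r"[_-](.)")

# The example strings from the list above.
samples = [
    "one_two",
    "test_3",
    "nomatchhere-",
    "thiswill_Match",
    "NoMatchHereEither_",
    "need_more_creative_examples-",
]

results = {}
for s in samples:
    # group(0) is the whole two-character match, group(1) the captured character.
    results[s] = [(m.group(0), m.group(1)) for m in pattern.finditer(s)]
    print(s, results[s])
```

A trailing - or _ produces no match because the pattern requires one more character after it.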
Let's get into what repl is doing here. In this case, repl is a lambda function.
lambda x: x.group(1).upper()
A lambda is really not too much different than a normal Python function. You define arguments before the colon, and then you define the return expression after the colon. The lambda above takes x as an argument, and it assumes that x is a match object. Match objects have a group method that allows you to reference the groups matched by the regex pattern (remember (.) from before?). It grabs the first group matched, and uppercases it using the str object's builtin upper method. It then returns that uppercased string, and that is what replaces the matched pattern.
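If the lambda is the confusing part, the same thing can be written with a named function. This is only an equivalent sketch; the helper name uppercase_first_group is made up here and is not part of the original solution.

```python
import re

def uppercase_first_group(match):
    # match is the match object for one occurrence of the pattern;
    # group(1) is the single character captured by (.).
    return match.group(1).upper()

def to_camel_case(text):
    # re.sub calls the function once per match and inserts its return value.
    return re.sub(r"[_-](.)", uppercase_first_group, text)

result = to_camel_case("the_stealth-warrior")
print(result)  # theStealthWarrior
```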
All together now:
import re
def to_camel_case(text):
    return re.sub('[_-](.)', lambda x: x.group(1).upper(), text)
The pattern is [_-](.) which matches any underscore or dash followed by any character. That character is captured and uppercased using the repl lambda function. The portion of string that matched that pattern is then replaced with that uppercased character.
In conclusion, I think the above answers most of your questions, but to summarize:
I looked up about re.sub() and group(), but I still couldn't put it together. I'm not sure how [_-](.) works, how come [_-](w+) doesn't work?
I will assume that you meant to use the \w character set, instead of just w. The \w character set matches all alphanumeric characters and underscores. This pattern would work if the + operator was not used. The + matches characters greedily, so it will cause all characters that belong to the \w set that follow an underscore or hyphen to be captured. This causes two issues: it will capitalize all captured characters (which could be a whole word) and it will capture underscores, causing later underscores to not be properly replaced.
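To see both issues side by side, here is a small comparison, assuming the question meant \w+ as discussed. The function names camel_single and camel_greedy are invented for this sketch.

```python
import re

def camel_single(text):
    # Captures exactly one character after the separator.
    return re.sub(r"[_-](.)", lambda m: m.group(1).upper(), text)

def camel_greedy(text):
    # Captures every following word character, and \w includes the underscore.
    return re.sub(r"[_-](\w+)", lambda m: m.group(1).upper(), text)

print(camel_single("one_two_three"))  # oneTwoThree
print(camel_greedy("one_two_three"))  # oneTWO_THREE
print(camel_greedy("one-two-three"))  # oneTWOTHREE
```

In the second call the first match swallows "_two_three" in one go, so the later underscore survives inside the replacement.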
How did he get rid of the hyphen and underscore with sub?
The function given to repl returns only the uppercased version of the first capture group. In the pattern [-_](.), only the character following the hyphen or underscore is captured. In effect, the pattern [-_](.) is matched and replaced with the uppercased character matched by (.). This is why the hyphen/underscore is removed.
Successfully capitalize only the first char of each word except the first word?
I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char?
The capture group only matches the first character following the underscore or hyphen, so that is what is uppercased.
I'll try to walk through the solution in layman's terms.
So firstly, re.sub() searches for occurrences of the pattern specified '[_-](.)' which will match any substrings where a hyphen '-' or an underscore '_' is immediately before another character. The re.sub() function then runs these matches through the anonymous function (lambda function) individually.
Regex grouping in Python essentially involves parentheses () that collect a sub-expression for later use in the program. The lambda function takes the match object that re.sub() generates for each occurrence of the pattern in text and returns x.group(1).upper(). We can see from the regular expression that the grouped element is the single character that follows the hyphen or underscore, so its uppercased form is what the function returns and what gets substituted.
Now, to answer your dotpoints:
Why doesn't [_-](\w+) work? This is because, when it finds a hyphen, it will select all of the alphanumeric characters that follow it, so it will capitalise the entirety of the next word.
How did he get rid of the hyphen and underscore with sub? This is easily answered. The re.sub() function replaces the entire match, not just the grouped element, and in the lambda, he only returns the grouped element as uppercase, not the hyphen as well.
Successfully capitalise only the first char of each word except the first word? When the regex pattern is searched for, it is looking for characters that immediately follow a hyphen or an underscore, and the first word does not have either of those characters before it. If you were to feed the function something like '-hello-there' it would yield: 'HelloThere'
I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char? This is down to the pattern: because the pattern is '[_-](.)' and not '[_-](.+)', the group only matches a single character.
I hope this has helped you in some way

Removing special characters and symbols from a string in python

I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)
The first suggestion uses the \s and \w shorthand character classes.
\s means "match any whitespace".
\w means "match any letter, number or underscore".
These are used inside a negated character class, [^\s\w], which means "match anything which isn't whitespace, a letter, a number or an underscore". The surrounding parentheses form a capture group, in which the class is combined using an alternative | with _, which just matches an underscore, and the group is given a + quantifier, which matches one or more times.
So what this says is: "Match any sequence of one or more characters which are underscores or which aren't whitespace, letters or numbers, and remove it".
The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big negated character class (the stuff between the square brackets). Note that this also removes whitespace.
The third option says "Take anything which is not a letter, number or underscore and replace it by a space". It uses the \w shorthand, which I have explained.
All of the options use regular expressions to match character sequences with certain characteristics, and the re.sub function, which substitutes anything matched by the given regex with the second argument.
You can read more about Regular Expressions in Python here.
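To make the difference concrete, here is a sketch that runs all three substitutions on one made-up address. The sample string and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample address, not from the question.
address = "12 Main St., Apt #4-B_2"

# 1. Remove runs of punctuation and underscores, keep whitespace.
first = re.sub(r'([^\s\w]|_)+', '', address)

# 2. Remove everything except ASCII letters, digits, -, _, * and .
#    (this also removes the spaces).
second = re.sub('[^a-zA-Z0-9-_*.]', '', address)

# 3. Replace every non-word character with a single space
#    (underscores are word characters, so they stay).
third = re.sub(r'[^\w]', ' ', address)

print(first)   # 12 Main St Apt 4B2
print(second)  # 12MainSt.Apt4-B_2
print(third.split())  # ['12', 'Main', 'St', 'Apt', '4', 'B_2']
```

The third result has the same length as the input, because each removed character is replaced by a space rather than deleted, so it may contain runs of several spaces.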
The class [^a-zA-Z0-9-_*.] enumerates exactly the characters to keep, since the leading ^ negates it and everything else is removed (though the literal - should be at the beginning or end of the character class).
\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.
\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
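A quick way to see the effect of the ASCII flag described in the quotation is something like the following sketch for Python 3. The sample strings are made up, and the non-ASCII characters are written as escapes so the example does not depend on the file encoding.

```python
import re

# "cafe" with an accented e, two CJK ideographs, and "naive_1" with a diaeresis.
text = "caf\u00e9 \u6771\u4eac na\u00efve_1"

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(unicode_words)  # the three words are kept whole
print(ascii_words)    # ['caf', 'na', 've_1']

# A non-breaking space counts as whitespace only without the ASCII flag.
nbsp = "a\u00a0b"
unicode_split = re.split(r"\s", nbsp)
ascii_split = re.split(r"\s", nbsp, flags=re.ASCII)
print(unicode_split)  # ['a', 'b']
print(ascii_split)    # ['a\xa0b']
```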
How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, _, * or .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] is used for a character class, and will match any one character inside it unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't like the parentheses, to be caught too in the first group, you can change that to:
\w+\.\w+\s+\((on|off)\)
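The three variants can be compared with a short sketch like this one. The sample line and variable names are assumptions made for the example.

```python
import re

# Hypothetical input with several spaces before the state.
line = "word1.word2    (off)"

# Character class: matches just ONE of the characters ( ) o n f.
broken = re.search(r"\w+\.\w+\s+[(\(on\))(\(off\))]", line)
print(repr(broken.group(0)))  # 'word1.word2    ('

# Capturing group that includes the parentheses.
with_parens = re.search(r"\w+\.\w+\s+(\((?:on|off)\))", line)
print(with_parens.group(0))   # word1.word2    (off)
print(with_parens.group(1))   # (off)

# Capturing group that holds only on/off.
without_parens = re.search(r"\w+\.\w+\s+\((on|off)\)", line)
print(without_parens.group(1))  # off
```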
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).
I think your problem may be with the square brackets []:
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parentheses.

Regex: Complement a group of characters (Python)

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case, it will match any character except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh, and the | are actually interpreted as pipe characters which just appear multiple times in the character class (where duplicates are just ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
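As a sketch of the word-boundary variant, the two look behinds can be wrapped in \b ... \b and used with findall() on a longer text. The sample text is made up, and the space from the original character class is left out here on the assumption that it was not intended.

```python
import re

# Two fixed-width negative look behinds, anchored on word boundaries.
rx = re.compile(r"\b\w*(?<![sxyzaeiou])(?<!ch|sh)s\b")

# Hypothetical sample text.
text = "bots boxs cats catchs paths bus"

found = rx.findall(text)
print(found)  # ['bots', 'cats', 'paths']
```

There are no capture groups in the pattern, so findall() returns the whole matched words.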
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print(re.search(rx, 'bots')) # ok
print(re.search(rx, 'boxs')) # None
Note that re doesn't support variable-width LBs, therefore you need two of them.
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even though it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]h|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unnecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.
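For what it's worth, the alternation approach and the look behind approach can be checked against each other on a few words. This sketch assumes the second alternative is meant to be [^cs]h, and the word list is invented for the comparison.

```python
import re

# Alternation without look behinds (assumes the word is the whole string).
alt = re.compile(r"([^sxyzaeiouh]|[^cs]h|^h)s$")
# Two fixed-width negative look behinds.
look = re.compile(r"(?<![sxyzaeiou])(?<!ch|sh)s$")

# Hypothetical test words.
words = ["bots", "boxs", "cats", "catchs", "wishs", "paths", "bus", "hs"]

alt_hits = [w for w in words if alt.search(w)]
look_hits = [w for w in words if look.search(w)]
print(alt_hits)   # ['bots', 'cats', 'paths', 'hs']
print(look_hits)  # ['bots', 'cats', 'paths', 'hs']
```

The two are not identical in every case: the look behind version also matches the bare string "s", while the alternation version needs at least one character before the final s.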

Python: Regular expression to match alpha-numeric not working?

I am looking to match a string that is inputted from a website to check if is alpha-numeric and possibly contains an underscore.
My code:
if re.match('[a-zA-Z0-9_]',playerName):
    # do stuff
For some reason, this matches with crazy chars for example: nIg○▲ ☆ ★ ◇ ◆
I only want regular A-Z and 0-9 and _ matching, is there something I am missing here?
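Python has a special sequence \w for matching alphanumerics and the underscore. In Python 2 this holds when the LOCALE and UNICODE flags are not specified; in Python 3, \w in a str pattern also matches non-ASCII letters and digits unless the re.ASCII flag is set. So you can modify your pattern as,
pattern = r'^\w+$'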
Your regex only matches one character. Try this instead:
if re.match('^[a-zA-Z0-9_]+$',playerName):
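Here is a small sketch of why the unanchored pattern accepts the odd name. The variable names are made up, and re.fullmatch is shown as an alternative to adding ^ and $ by hand; it is not what the answers above use.

```python
import re

# "nIg" followed by two symbol characters (written as escapes).
name = "nIg\u25cb\u25b2"

# Only the first character is checked, so this succeeds.
loose = bool(re.match('[a-zA-Z0-9_]', name))
# Anchored with + so every character must be in the class.
strict = bool(re.match('^[a-zA-Z0-9_]+$', name))
# fullmatch requires the whole string to match without explicit anchors.
full = bool(re.fullmatch('[a-zA-Z0-9_]+', name))
print(loose, strict, full)  # True False False

# One subtle difference: $ also matches just before a trailing newline.
newline_anchored = bool(re.match('^[a-zA-Z0-9_]+$', 'abc\n'))
newline_full = bool(re.fullmatch('[a-zA-Z0-9_]+', 'abc\n'))
print(newline_anchored, newline_full)  # True False
```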
…check if is alpha-numeric and possibly contains an underscore.
Do you mean this literally, so that only one underscore is allowed, total? (Not unreasonable for player names; adjacent underscores in particular can be hard for other players to read.) Should "a_b_c" not match?
If so:
if playerName and re.match("^[a-zA-Z0-9]*_?[a-zA-Z0-9]*$", playerName):
The new first part of the condition checks for an empty value, which simplifies the regex.
This places no restrictions on where the underscore can occur, so all of "_a", "a_", and "_" will match. If you instead want to prevent both leading and trailing underscores, which is again reasonable for player names, change to:
if re.match("^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$", playerName):
    # this regex doesn't match an empty string, so that check is unneeded
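The two underscore rules can be compared on a handful of names with a sketch like this. The helper names and the sample names are invented for the example.

```python
import re

def at_most_one_underscore(name):
    # Empty names are rejected up front, as in the first snippet.
    return bool(name and re.match("^[a-zA-Z0-9]*_?[a-zA-Z0-9]*$", name))

def one_inner_underscore_at_most(name):
    # Underscore allowed only between alphanumerics.
    return bool(re.match("^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$", name))

for name in ["abc", "a_b", "a_b_c", "_a", "a_", "_", ""]:
    print(repr(name), at_most_one_underscore(name), one_inner_underscore_at_most(name))
```

Both functions accept "abc" and "a_b" and reject "a_b_c"; only the first accepts "_a", "a_" and "_".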
