I want something like [A-z] that matches all alphabetic characters, plus characters like ö, ä, ü etc.
If I use [A-ü] I probably get all the special characters used by Latin languages, but it also allows other characters like ¿¿]|{}[¢§ø欰µ©¥
Example: https://regex101.com/r/tN9gA5/2
Edit:
I need this in python2.
Depending on what regular expression engine you are using, you could use the ^\p{L}+$ regular expression. \p{L} denotes a Unicode letter:
In addition to complications, Unicode also brings new possibilities.
One is that each Unicode character belongs to a certain category. You
can match a single character belonging to the "letter" category with
\p{L}
Source
This example should illustrate what I am saying. It seems that the regex engine on Regex101 does support this; you just need to select PCRE (PHP) from the top left.
When you use [A-z], you are not only capturing letters from "A" to "z", you also capture some more non-letter characters: [ \ ] ^ _ `.
In Python, you can use [^\W\d_] with re.U option to match Unicode characters (see this post).
Here is a sample based on your input string.
Python example:
import re
r = re.search(
    r'(?P<unicode_word>[^\W\d_]*)',
    u'TestöäüéàèÉÀÈéàè',
    re.U
)
print r.group('unicode_word')
>>> TestöäüéàèÉÀÈéàè
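For reference, in Python 3 str patterns are Unicode-aware by default, so the re.U flag is no longer needed; a rough equivalent of the sample above would be:

```python
import re

# In Python 3, str patterns match Unicode by default; no re.U flag needed.
match = re.search(r'(?P<unicode_word>[^\W\d_]*)', 'TestöäüéàèÉÀÈéàè')
print(match.group('unicode_word'))  # TestöäüéàèÉÀÈéàè
```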
Related
This Python doc gives a complete list of the metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
Similarly, is there a page giving a complete list of character classes?
I assume "character classes" in that doc refers to a finite number of special constructs rather than all possible Unicode characters. Please correct me if necessary.
I did search and didn't find the canonical term.
If "character classes" indeed refers to all possible Unicode characters, I would like to change my question to "a convenient way to look up regex special characters in Python".
It seems that regular-expressions.info calls these "Shorthand Character Classes".
More positive examples (that I am looking for) are \d, \s, \S, \A etc; negative examples (that I am not looking for) are abcdefghijklmnopqrstuvwxyz0123456789
I've searched "character class" and "Shorthand Character Classes" on Python doc and stackoverflow and didn't find what I want.
Why do I need this? When I read a section of the doc, such as
Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.
I would like to know what \w stands for. Either searching the doc or Googling would take me some time. For example, using Chrome's search command on that doc, \w gets 41 results.
If there were a list of those constructs, I could look up everything with no more than 2 searches (lower case and capital).
Categories Visible from the Shell
The code shows all of the "CATEGORIES". The ones marked "IN" are character categories (the others mark specific slice points between characters):
>>> from pprint import pprint
>>> import sre_parse
>>> pprint(sre_parse.CATEGORIES)
{'\\A': (AT, AT_BEGINNING_STRING),
'\\B': (AT, AT_NON_BOUNDARY),
'\\D': (IN, [(CATEGORY, CATEGORY_NOT_DIGIT)]),
'\\S': (IN, [(CATEGORY, CATEGORY_NOT_SPACE)]),
'\\W': (IN, [(CATEGORY, CATEGORY_NOT_WORD)]),
'\\Z': (AT, AT_END_STRING),
'\\b': (AT, AT_BOUNDARY),
'\\d': (IN, [(CATEGORY, CATEGORY_DIGIT)]),
'\\s': (IN, [(CATEGORY, CATEGORY_SPACE)]),
'\\w': (IN, [(CATEGORY, CATEGORY_WORD)])}
The entries with "CATEGORY" are the character categories.
This also answers the question of what \w stands for. It is a "word character". See also: In regex, what does \w* mean?
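A quick shell check illustrates what "word character" covers (letters, digits, and the underscore):

```python
import re

# \w matches letters, digits, and the underscore; punctuation is skipped.
found = re.findall(r'\w', 'a_1 !?')
print(found)  # ['a', '_', '1']
```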
Categories Explained in the Docs
This is in the output of print(re.__doc__). It explains the intended meaning of each category:
The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
\number  Matches the contents of the group of the same number.
\A       Matches only at the start of the string.
\Z       Matches only at the end of the string.
\b       Matches the empty string, but only at the start or end of a word.
\B       Matches the empty string, but not at the start or end of a word.
\d       Matches any decimal digit; equivalent to the set [0-9] in
         bytes patterns or string patterns with the ASCII flag.
         In string patterns without the ASCII flag, it will match the whole
         range of Unicode digits.
\D       Matches any non-digit character; equivalent to [^\d].
\s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
         bytes patterns or string patterns with the ASCII flag.
         In string patterns without the ASCII flag, it will match the whole
         range of Unicode whitespace characters.
\S       Matches any non-whitespace character; equivalent to [^\s].
\w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
         in bytes patterns or string patterns with the ASCII flag.
         In string patterns without the ASCII flag, it will match the
         range of Unicode alphanumeric characters (letters plus digits
         plus underscore).
         With LOCALE, it will match the set [0-9_] plus characters defined
         as letters for the current locale.
\W       Matches the complement of \w.
\\       Matches a literal backslash.
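The str-vs-ASCII distinction described above is easy to verify; here is a small Python 3 sketch using Arabic-Indic digits as the non-ASCII example:

```python
import re

s = '42٤٢'  # ASCII digits followed by Arabic-Indic digits (U+0664, U+0662)
unicode_digits = re.findall(r'\d', s)           # str pattern: Unicode digits
ascii_digits = re.findall(r'\d', s, re.ASCII)   # ASCII flag: only [0-9]
print(unicode_digits)  # ['4', '2', '٤', '٢']
print(ascii_digits)    # ['4', '2']
```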
Other Special Character Groups
Besides the shorthand character classes, the sre_parse module details other interesting character groups as well:
SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
DIGITS = frozenset("0123456789")
OCTDIGITS = frozenset("01234567")
HEXDIGITS = frozenset("0123456789abcdefABCDEF")
ASCIILETTERS = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
WHITESPACE = frozenset(" \t\n\r\v\f")
ESCAPES = {
    r"\a": (LITERAL, ord("\a")),
    r"\b": (LITERAL, ord("\b")),
    r"\f": (LITERAL, ord("\f")),
    r"\n": (LITERAL, ord("\n")),
    r"\r": (LITERAL, ord("\r")),
    r"\t": (LITERAL, ord("\t")),
    r"\v": (LITERAL, ord("\v")),
    r"\\": (LITERAL, ord("\\"))
}
FLAGS = {
    # standard flags
    "i": SRE_FLAG_IGNORECASE,
    "L": SRE_FLAG_LOCALE,
    "m": SRE_FLAG_MULTILINE,
    "s": SRE_FLAG_DOTALL,
    "x": SRE_FLAG_VERBOSE,
    # extensions
    "a": SRE_FLAG_ASCII,
    "t": SRE_FLAG_TEMPLATE,
    "u": SRE_FLAG_UNICODE,
}
It looks like you're looking for all shorthand character classes Python's re module supports. Things like [abc] also fall under the name "character class", although this might not be obvious from the re docs, and it would be impossible and pointless to try to make a complete list of those.
A character class is regex syntax for matching a single character, usually by specifying that it belongs or doesn't belong to some set of characters. Syntax like [abc] lets you explicitly specify a set of characters to match, while shorthand character classes like \d are shorthand for large, predefined sets of characters.
Python's re module supports 6 shorthand character classes: \d, which matches digits; \s, which matches whitespace; \w, which matches "word" characters; and \D, \S, and \W, which match any character \d, \s, and \w don't match. Exactly which characters count or don't count depends on whether you're using Unicode strings or bytestrings and whether the ASCII or LOCALE flags are set; see the re docs for further details (and expect disappointment with the vague docs for \w).
There are plenty of other backslash-letter sequences with special meaning, but they're not character classes. For example, \b matches a word boundary (or if you forgot to use raw strings, it gets interpreted as a backspace character before the regex engine gets to see it), but that's not a character class.
Other regex implementations may support different shorthand character classes, and their shorthand character classes may match different characters. For example, Perl has way more of these, and Perl's \w matches more characters than Python's, like combining diacritics.
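To illustrate the str-vs-bytes difference in Python's own engine, here is a small sketch (Python 3; the sample string is just an example):

```python
import re

# str patterns: \w is Unicode-aware by default in Python 3.
str_words = re.findall(r'\w+', 'Grüße 123')
# bytes patterns: \w only covers [a-zA-Z0-9_], so the UTF-8 bytes of
# ü and ß split the word.
bytes_words = re.findall(rb'\w+', 'Grüße 123'.encode('utf-8'))
print(str_words)    # ['Grüße', '123']
print(bytes_words)  # [b'Gr', b'e', b'123']
```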
Are you looking for string.printable or perhaps filter(lambda x: not x.isalnum(), string.printable) which returns
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
?
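For reference, in Python 3 filter returns an iterator rather than a string, so you would join it (or use a comprehension); for example:

```python
import string

# Every printable character that is not alphanumeric:
# this is exactly punctuation followed by whitespace.
non_alnum = ''.join(c for c in string.printable if not c.isalnum())
print(repr(non_alnum))
```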
I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)
The first suggestion uses the \s and \w regex wildcards.
\s means "match any whitespace".
\w means "match any letter, digit or underscore" (a "word" character).
These are used inside a negated character class ([^\s\w]), which, all together, means "match anything which isn't whitespace or a word character". That class is then combined via an alternation | with _, which matches a literal underscore, and the whole group is given a + quantifier, which matches one or more times.
So what this says is: "Match any sequence of one or more characters which aren't whitespace, letters or digits (underscores are caught by the | _ branch) and remove it".
The second option says: "Match any character which isn't a letter, digit, hyphen, underscore, asterisk or dot and remove it". This is stated by the negated character class (the stuff between the brackets).
The third option says: "Take anything which is not a letter, digit or underscore and replace it with a space". It uses the \w wildcard, which I have explained.
All of the options use regular expressions to match character sequences with certain characteristics, and the re.sub function, which substitutes anything matched by the given regex with the second string argument.
You can read more about Regular Expressions in Python here.
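To see the three suggestions side by side, here is a small sketch on a made-up sample address (the address is hypothetical):

```python
import re

address = '123 Main St. #4, Suite B-7!'
# Option 1: strip punctuation and underscores, keep whitespace.
opt1 = re.sub(r'([^\s\w]|_)+', '', address)
# Option 2: strip everything except letters, digits, - _ * .  (spaces go too).
opt2 = re.sub(r'[^a-zA-Z0-9-_*.]', '', address)
# Option 3: turn every non-word character (spaces included) into a space.
opt3 = re.sub(r'[^\w]', ' ', address)
print(opt1)  # 123 Main St 4 Suite B7
print(opt2)  # 123MainSt.4SuiteB-7
print(opt3)  # 123 Main St   4  Suite B 7
```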
The enumeration [^a-zA-Z0-9-_*.] enumerates exactly the character ranges to remove (though the literal - should be at the beginning or end of the character class).
\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.
\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, _, * or .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!
I need a regular expression able to match everything but a string starting with a specific pattern (specifically index.php and what follows, like index.php?id=2342343).
Regex: match everything but:
a string starting with a specific pattern (e.g. any string, even an empty one, not starting with foo):
Lookahead-based solution for NFAs:
^(?!foo).*$
^(?!foo)
Negated character class based solution for regex engines not supporting lookarounds:
^(([^f].{2}|.[^o].|.{2}[^o]).*|.{0,2})$
^([^f].{2}|.[^o].|.{2}[^o])|^.{0,2}$
a string ending with a specific pattern (say, no world. at the end):
Lookbehind-based solution:
(?<!world\.)$
^.*(?<!world\.)$
Lookahead solution:
^(?!.*world\.$).*
^(?!.*world\.$)
POSIX workaround:
^(.*([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.])|.{0,5})$
([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.]$|^.{0,5})$
a string containing specific text (say, not match a string having foo):
Lookaround-based solution:
^(?!.*foo)
^(?!.*foo).*$
POSIX workaround:
Use the online regex generator at www.formauri.es/personal/pgimeno/misc/non-match-regex
a string containing specific character (say, avoid matching a string having a | symbol):
^[^|]*$
a string equal to some string (say, not equal to foo):
Lookaround-based:
^(?!foo$)
^(?!foo$).*$
POSIX:
^(.{0,2}|.{4,}|[^f]..|.[^o].|..[^o])$
a sequence of characters:
PCRE (match any text but cat): /cat(*SKIP)(*FAIL)|[^c]*(?:c(?!at)[^c]*)*/i or /cat(*SKIP)(*FAIL)|(?:(?!cat).)+/is
Other engines allowing lookarounds: (cat)|[^c]*(?:c(?!at)[^c]*)* (or (?s)(cat)|(?:(?!cat).)*, or (cat)|[^c]+(?:c(?!at)[^c]*)*|(?:c(?!at)[^c]*)+[^c]*), then check on the language side: if Group 1 matched, it is not what we need; otherwise, grab the match value if it is not empty
a certain single character or a set of characters:
Use a negated character class: [^a-z]+ (any char other than a lowercase ASCII letter)
Matching any char(s) but |: [^|]+
Demo note: the newline \n is used inside negated character classes in demos to avoid match overflow to the neighboring line(s). They are not necessary when testing individual strings.
Anchor note: In many languages, use \A to define the unambiguous start of string, and \z (in Python, it is \Z, in JavaScript, $ is OK) to define the very end of the string.
Dot note: In many flavors (but not POSIX, TRE, TCL), . matches any char but a newline char. Make sure you use a corresponding DOTALL modifier (/s in PCRE/Boost/.NET/Python/Java and /m in Ruby) for the . to match any char including a newline.
Backslash note: In languages where you have to declare patterns with C strings allowing escape sequences (like \n for a newline), you need to double the backslashes escaping special characters so that the engine treats them as literal characters (e.g. in Java, world\. will be declared as "world\\.", or use a character class: "world[.]"). Use raw string literals (Python r'\bworld\b'), C# verbatim string literals @"world\.", or slashy strings/regex literal notations like /world\./.
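A few of the recipes above, sanity-checked in Python (which supports lookarounds):

```python
import re

# Not starting with "foo"
assert re.match(r'^(?!foo).*$', 'bar')
assert not re.match(r'^(?!foo).*$', 'foobar')

# Not containing "foo"
assert re.match(r'^(?!.*foo).*$', 'bar baz')
assert not re.match(r'^(?!.*foo).*$', 'bar foo baz')

# Not equal to "foo"
assert re.match(r'^(?!foo$).*$', 'food')
assert not re.match(r'^(?!foo$).*$', 'foo')
print('all lookahead checks passed')
```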
You could use a negative lookahead from the start, e.g., ^(?!foo).*$ shouldn't match anything starting with foo.
You can put a ^ in the beginning of a character set to match anything but those characters.
[^=]*
will match everything but =
Just match /^index\.php/, and then reject whatever matches it.
In Python:
>>> import re
>>> p='^(?!index\.php\?[0-9]+).*$'
>>> s1='index.php?12345'
>>> re.match(p,s1)
>>> s2='index.html?12345'
>>> re.match(p,s2)
<_sre.SRE_Match object at 0xb7d65fa8>
Came across this thread after a long search. I had this problem with multiple search-and-replace operations, but the pattern I used was matching till the end. Example below:
import re
text = "start![image]xxx(xx.png) yyy xx![image]xxx(xxx.png) end"
replaced_text = re.sub(r'!\[image\](.*)\(.*\.png\)', '*', text)
print(replaced_text)
gave
start* end
Basically, the regex was matching from the first ![image] to the last .png, swallowing the yyy in the middle.
Used the method posted above (https://stackoverflow.com/a/17761124/429476 by Firish) to break the match between the occurrences. Here a space is not matched, as the words are separated by spaces.
replaced_text = re.sub(r'!\[image\]([^ ]*)\([^ ]*\.png\)', '*', text)
and got what I wanted
start* yyy xx* end
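Not part of the original fix, but worth noting: a lazy quantifier .*? is another common way to stop the match at the first closing delimiter instead of the last:

```python
import re

text = "start![image]xxx(xx.png) yyy xx![image]xxx(xxx.png) end"
# .*? is non-greedy: it stops at the first "(" and the first ".png)"
# instead of swallowing everything up to the last ones.
replaced = re.sub(r'!\[image\](.*?)\(.*?\.png\)', '*', text)
print(replaced)  # start* yyy xx* end
```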
Is it possible to define that a specific language's characters would be considered word characters?
I.e., re does not accept ä, ö as word characters if I search for them in the following way:
Ft=codecs.open('c:\\Python27\\Scripts\\finnish2\\textfields.txt','r','utf-8')
word=Ft.readlines()
word=smart_str(word, encoding='utf-8', strings_only=False, errors='replace')
word=re.sub('[^äÄöÖåÅA-Za-z0-9]',"""\[^A-Za-z0-9]*""", word) ; print 'word= ', word #works in skipping ö,ä,å characters
I would like these characters to be included in [A-Za-z].
How can I define this?
[A-Za-z0-9] will only match the characters listed here, but the docs also mention some other special constructs like:
\w, which stands for alphanumeric characters (namely [a-zA-Z0-9_] plus all Unicode characters which are declared to be alphanumeric)
\W, which stands for all non-alphanumeric characters (the complement of \w)
\d, which stands for digits
\b, which matches word boundaries (including all rules from the Unicode tables)
So, you will want to (a) use these constructs instead (they are shorter and maybe easier to read), and (b) tell re that you want those classes to be Unicode-aware by setting the UNICODE flag like:
re_word = re.compile(r'\w+', re.U)
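A quick sketch of the effect (Finnish sample text; in Python 3 the flag is implied for str patterns, while in Python 2 it is required):

```python
import re

re_word = re.compile(r'\w+', re.UNICODE)
# With the UNICODE flag, ä and other accented letters count as word characters.
words = re_word.findall(u'tämä on testi')
print(words)  # ['tämä', 'on', 'testi']
```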
For a start, you appear to be slightly confused about the args for re.sub.
The first arg is the pattern. You have '[^äÄöÖåÅA-Za-z0-9]' which matches each character which is NOT in the Finnish alphabet nor a digit.
The second arg is the replacement. You have """[^A-Za-z0-9]*""" ... so each of those non-Finnish-alphanumeric characters is going to be replaced by the literal string [^A-Za-z0-9]*. It's reasonable to assume that this is not what you want.
What do you want to do?
You need to explain your third line; after your first 2 lines, word will be a list of unicode objects, which is A Good Thing. However the encoding= and the errors= indicate that the unknown (to us) smart_str() is converting your lovely unicode back to UTF-8. Processing data in UTF-8 bytes instead of Unicode characters is EVIL, unless you know what you are doing.
What encoding directive do you have at the top of your source file?
Advice: Get your data into unicode. Work on it in unicode. All your string constants should have the u prefix; if you consider that too much wear and tear on your typing fingers, at least put it on the non-ASCII constants e.g. u'[^äÄöÖåÅA-Za-z0-9]'. When you have done all the processing, encode your results for display or storage using an appropriate encoding.
When working with re, consider \w which will match any alphanumeric (and also the underscore) instead of listing out what is alphabetic in one language. Do use the re.UNICODE flag; docs here.
Something like this might do the trick:
pattern = re.compile("(?u)pattern")
or
pattern = re.compile("pattern", re.UNICODE)
I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.
I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.
[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?
FYI I'm using Python, but I suspect that this problem is cross-language.
Python's builtin islower() method seems to do the right check:
lower = ''
for c in xrange(0, 2**16):
    if unichr(c).islower():
        lower += unichr(c)
print lower
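A Python 3 version of the same idea (chr/range replace unichr/xrange; like the original, this scans only the Basic Multilingual Plane):

```python
import re

# Build a character class of every BMP code point that str.islower() accepts,
# then compile it into a regex. re.escape keeps any oddball characters literal.
lower_chars = ''.join(chr(c) for c in range(0x10000) if chr(c).islower())
lower_re = re.compile('[%s]+' % re.escape(lower_chars))
print(lower_re.match('àbcé'))
```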
Python does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library which does support them.
Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.
Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
It looks as though this recipe, posted back in 2005,
import sys, re
uppers = [u'[']
for i in xrange(sys.maxunicode):
    c = unichr(i)
    if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)
print uppers_re.match('A')
is still relevant.
You might want to have a look at regular-expressions.info.
However, as far as I know there's no character class or modifier that expresses "lower case characters only" (and not every language has lower case characters), so I'd say you might have to use multiple ranges (possibly almost as many as there are Unicode blocks).
Edit:
Reading a bit more on this, there might be a way: [\p{Ll}\p{Lo}], which means lowercase characters (those with an upper case variant) or characters that have neither a lower case nor an upper case form (as with Chinese characters, for example).
Regex [\p{Ll}\p{Lo}]+ matches test string àÀhelloHello你好Прывітанне and replacing the matches with x results in xÀxHxПx whereas replacing the matches of [\p{Ll}]+ results in xÀxHx你好Пx (note the Chinese characters that were not matched).
If you use \p{L}, it will match any Unicode letter. Check the examples here. You can also combine it with \p{M} to match Hebrew-esque languages that include diacritic marks: (\p{L}|\p{M})+
EDIT:
I missed the part about only lowercase letters the first time around. \p{L} will match all letters; \p{Ll} will match lowercase only.
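If you are on stdlib re, which lacks \p{Ll}, the unicodedata module can test the category directly; a sketch, not a drop-in regex replacement:

```python
import unicodedata

def is_lowercase_letter(ch):
    # 'Ll' is the Unicode general category "Letter, lowercase".
    return unicodedata.category(ch) == 'Ll'

sample = 'àÀhH你'  # à/h are Ll, À/H are Lu, 你 is Lo
print([c for c in sample if is_lowercase_letter(c)])  # ['à', 'h']
```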