How to check unicode string to be letters, spaces and dashes only

How to check unicode string to be letters, spaces and dashes only - python

I've read a great solution for unicode strings here, but I need to check entire string to be letters or spaces or dashes and I can't think of any solution. The example is not working as I want.
name = u"Василий Соловьев-Седой"
r = re.compile(r'^([\s\-^\W\d_]+)$', re.U)
r.match(name) -> None

r = re.compile(r'^(?:[^\W\d_]|[\s-])+$', re.U)
[^\W\d_] matches any letter (by matching any alphanumeric character except for digits and underscore).
[\s-] of course matches whitespace and dashes.

if you ONLY want to check:
name = u"Василий Соловьев-Седой";
name = name.replace("-","").replace(" ","");
name.isalpha()

Related

Python regex match string of 8 characters that contain both alphabets and numbers

I am trying to match a string of length 8 containing both numbers and alphabets(cannot have just numbers or just alphabets)using re.findall. The string can start with either letter or alphabet followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)' however it is giving me strings with less than 8 characters as well.
Need to modify the regex to return strings with exactly 8 chars that contain both letters and alphabets.

You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary

First, let's find statement that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
Next condition is that the word must contain both letters and numbers:
[a-d]\d
Now for the challenging part, combining these into one statement. Easiest way might be to just spit them up but we can use some look-aheads to get this to work:
\b(?=.*[a-z]\d)[a-z\d]{8}\b
Im sure there a tidier way of doing this but this will work.

You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done with the \w{8} which means 8 letters/words (including capitals and underscore). \b means "word boundary"
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for existence of both digits AND letter, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
result is what you get from re.findall() .
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))

A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the found matches are 8 characters in length and that they can not be only numeric or only alphabets.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 numbers or 8 alphabets.
[0-9A-Za-z]{8} = Simple regex saying the input needs to be alphanumeric of 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']

Match charactes and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.

Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)`
$ Assert position at the end of the line

You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

How can I check if a string only contains uppercase or lowercase letters?

Return True if and only if there is at least one alphabetic character in s and the alphabetic characters in s are either all uppercase or all lowercase.
def upper_lower(s):
""" (str) -> bool
>>> upper_lower('abc')
True
>>> upper_lower('abcXYZ')
False
>>> upper_lower('XYZ')
True
"""

Use re.match
if re.match(r'(?:[A-Z]+|[a-z]+)$', s):
print("True")
else:
print("Nah")
We don't need to add start of the line anchor since re.match tries to match from the beginning of the string.
So it enters in to the if block only if the input string contains only lowercase letters or only uppercase letters.

Regex is by far the most efficient way of doing this.
But if you'd like some pythonic flavor in your code, you can try this:
'abc'.isupper()
'ABC'.isupper()
'abcABC'.isupper()
upper_lower = lambda s: s.isupper() or s.islower()

You can use the following regex:
if re.search('^([a-z]+|[A-Z]+)$', s):
# all upper or lower case
Explanation: ^ matches the beginning of the string, [a-z] and [A-Z] are trivial matches, + matches one or more. | is the "OR" regex, finally $ matches the end of the string.
Or, you can use all and check:
all(c.islower() for c in s)
# same thing for isupper()

You could collect all alphabetic characters and then check if all or upper or all are lower
a=[c for c in s if s.isalpha()]
if a:
if "".join(a).isupper() or "".join(a).islower():
print "yes"

Python 3 regex with diacritics and ligatures,

Names in the form: Ceasar, Julius are to be split into First_name Julius Surname Ceasar.
Names may contain diacritics (á à é ..), and ligatures (æ, ø)
This code seems to work OK in Python 3.3
import re
def doesmatch(pat, str):
try:
yup = re.search(pat, str)
print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))
except AttributeError:
print('no match for {0}'.format(str))
s = 'Révèrberë, Harry'
t = 'Åapö, Renée'
u = 'C3po, Robby'
v = 'Mærsk, Efraïm'
w = 'MacDønald, Ron'
x = 'Sträßle, Mpopo'
pat = r'^([^\d\s]+), ([^\d\s]+)'
# matches any letter, diacritic or ligature, but not digits or punctuation inside the ()
for i in s, t, u, v, w, x:
doesmatch(pat, i)
All except u match. (no match for numbers in names), but I wonder if there isn't a better way than the non-digit non-space approach.
More important though: I'd like to refine the pattern so it distinquishes capitals from lowercase letters, but including capital diacritics and ligatures, preferably using regex also. As if ([A-Z][a-z]+), would match accented and combined characters.
Is this possible?
(what I've looked at so far:
Dive into python 3 on UTF-8 vs Unicode; This Regex tutorial on Unicode (which I'm not using); I think I don't need new regex but I admit I haven't read all its documentation)

If you want to distinguish uppercase and lowercase letters using the standard library's re module, then I'm afraid you'll have to build a character class of all the relevant Unicode codepoints manually.
If you don't really need to do this, use
[^\W\d_]
to match any Unicode letter. This character class matches anything that's "not a non-alphanumeric character" (which is the same as "an alphanumeric character") that's also not a digit nor an underscore.

Python regex match literal asterisk

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.

How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.

\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)

re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * chatacter.

Try
re.match(r"^[a-z]*\*?$", s)
this means "a string consisting zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk).
Your regex means "exactly one lowercase character followed by one or more asterisks".

You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to check unicode string to be letters, spaces and dashes only - python

r = re.compile(r'^(?:[^\W\d_]|[\s-])+$', re.U) [^\W\d_] matches any letter (by matching any alphanumeric character except for digits and underscore). [\s-] of course matches whitespace and dashes.

if you ONLY want to check: name = u"Василий Соловьев-Седой"; name = name.replace("-","").replace(" ",""); name.isalpha()

Related

Python regex match string of 8 characters that contain both alphabets and numbers

Match charactes and whitespaces, but not numbers

How can I check if a string only contains uppercase or lowercase letters?

Python 3 regex with diacritics and ligatures,

Python regex match literal asterisk

Categories

Resources