Python regex match literal asterisk - python

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.

How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
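A quick way to check the breakdown above is to run the pattern against a few inputs. This is a minimal sketch; the helper name is_valid and the sample strings are made up for illustration.

```python
import re

# One or more lowercase letters, optionally followed by a single asterisk.
PATTERN = re.compile(r"^[a-z]+[*]?$")


def is_valid(s):
    """Return True if s is lowercase letters plus an optional trailing '*'."""
    return PATTERN.match(s) is not None


for sample in ["abcdefg*", "abcdefg", "a*", "abc**", "*", "Abc*"]:
    print(repr(sample), is_valid(sample))

# The pattern from the question fails on the sample string because
# [a-z] without a quantifier consumes exactly one letter.
print(re.match(r"^[a-z]\*+$", "abcdefg*"))
```

Only the first three samples are accepted; the last line prints None, as reported in the question.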

\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)

re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * character.

Try
re.match(r"^[a-z]*\*?$", s)
This means "a string consisting of zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk)".
Your regex means "exactly one lowercase character followed by one or more asterisks".

You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)
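The patterns suggested in these answers are not interchangeable on edge cases. The sketch below compares them on a few illustrative inputs (the dictionary keys are just labels chosen here): [a-z]* also accepts an empty letter part, and \*+ requires at least one asterisk instead of making it optional.

```python
import re

patterns = {
    "plus_optional": r"^[a-z]+\*?$",  # 1+ letters, 0 or 1 asterisk
    "star_optional": r"^[a-z]*\*?$",  # 0+ letters, 0 or 1 asterisk
    "plus_required": r"^[a-z]+\*+$",  # 1+ letters, 1+ asterisks
}

samples = ["abcdefg*", "abcdefg", "*", "", "abc**"]

# For each pattern, collect the samples it accepts.
results = {
    name: [s for s in samples if re.match(rx, s)]
    for name, rx in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```

For "lowercase letters, optionally ending with an asterisk", the first variant is the one that matches the requirement as stated in the question.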

Related

Python regex match string of 8 characters that contain both alphabets and numbers

I am trying to match a string of length 8 containing both digits and letters (it cannot have just digits or just letters) using re.findall. The string can start with either a letter or a digit, followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)' however it is giving me strings with less than 8 characters as well.
I need to modify the regex to return strings with exactly 8 chars that contain both letters and digits.
You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary
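Applied to the sentence from the question, the two patterns can be compared as follows. This is a sketch; the second input with accented letters is a made-up string (written with escapes) to show where the ASCII-only and Unicode-aware versions diverge.

```python
import re

ASCII_RX = re.compile(
    r"\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b"
)
UNICODE_RX = re.compile(
    r"\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b"
)

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
ascii_hits = ASCII_RX.findall(text)
unicode_hits = UNICODE_RX.findall(text)
print(ascii_hits)
print(unicode_hits)

# Hypothetical input: two accented letters followed by six digits.
accented = "code \u00e7\u00e0123456 here"
ascii_accented = ASCII_RX.findall(accented)
unicode_accented = UNICODE_RX.findall(accented)
print(ascii_accented)
print(unicode_accented)
```

Both patterns return only 896av6uf for the question's sentence; only the second one accepts the accented word.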
First, let's write a pattern that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
Next condition is that the word must contain both letters and numbers:
[a-z]\d
Now for the challenging part: combining these into one statement. The easiest way might be to just split them up, but we can use a lookahead to get this to work:
\b(?=.*[a-z]\d)[a-z\d]{8}\b
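I'm sure there is a tidier way of doing this. Note that because .* can scan past the current word, this lookahead also accepts all-digit or all-letter words whenever a letter followed by a digit occurs later in the text, so the lookahead should be restricted to the word itself.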
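To see how the .* lookahead behaves on the question's input, the sketch below runs it next to a variant whose lookaheads stay inside the current word. The names loose and tight are chosen here for illustration, and the tighter pattern is one possible alternative rather than part of the original answer.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# Lookahead with .* may be satisfied by a letter+digit pair later in the text.
loose = re.findall(r"\b(?=.*[a-z]\d)[a-z\d]{8}\b", text)

# Lookaheads limited to the word: a digit after leading letters,
# and a letter after leading digits.
tight = re.findall(r"\b(?=[a-z]*\d)(?=\d*[a-z])[a-z\d]{8}\b", text)

print(loose)
print(tight)
```

The first call also returns 87987647 and ahduhsjs, because the n0 near the end of the sentence satisfies the lookahead for them; the second returns only 896av6uf.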
You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done by \w{8}, which means eight word characters (letters of either case, digits, and underscore). \b means "word boundary".
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for the existence of both digits AND letters, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
Here, result is the list you get from re.findall().
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))
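The two-step approach can be sketched end to end on the question's sentence. The variable names text, candidates and mixed are chosen here for illustration, and a list comprehension stands in for filter.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# Step 1: every standalone run of exactly 8 lowercase letters or digits.
candidates = re.findall(r"\b[a-z0-9]{8}\b", text)

# Step 2: keep only candidates with at least one digit and one letter.
mixed = [
    c for c in candidates
    if re.search(r"[0-9]", c) and re.search(r"[a-z]", c)
]

print(candidates)
print(mixed)
```

Splitting the work this way keeps each regex simple, at the cost of a second pass over the candidates.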
A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the found matches are 8 characters in length and that they cannot consist only of digits or only of letters.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 digits or 8 letters.
[0-9A-Za-z]{8} = Simple regex saying the input needs to be alphanumeric of 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']
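As the test case shows, this pattern has no word boundaries, so it cuts longer alphanumeric runs into 8-character pieces. If only standalone 8-character words are wanted, as in the question, one possible variant adds \b anchors. This is a sketch under that assumption, not part of the original answer; the name BOUNDED is made up.

```python
import re

# Same negative-lookahead idea, with \b so only whole 8-character words match.
BOUNDED = re.compile(r"\b(?![A-Za-z]{8}\b|[0-9]{8}\b)[0-9A-Za-z]{8}\b")

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
found = BOUNDED.findall(text)
print(found)

# A longer alphanumeric run is no longer sliced into 8-character chunks.
long_run = BOUNDED.findall("i8D0jT5Yu6Ms1GNm")
print(long_run)
```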

python string split only by `/` but not `//` [duplicate]

I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but I got a wrong result.
In the meantime, I'm escaping "::" with string.replace("::", "$$").
You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).
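The effect of dropping either lookaround can be checked directly. The sketch below splits the question's string three ways; the variable names are chosen here for illustration.

```python
import re

s = "yJdz:jkj8h:jkhd::hjkjh"

both = re.split(r"(?<!:):(?!:)", s)
only_lookbehind = re.split(r"(?<!:):", s)  # still splits on the first ':' of '::'
only_lookahead = re.split(r":(?!:)", s)    # still splits on the second ':' of '::'

print(both)             # ['yJdz', 'jkj8h', 'jkhd::hjkjh']
print(only_lookbehind)  # ['yJdz', 'jkj8h', 'jkhd', ':hjkjh']
print(only_lookahead)   # ['yJdz', 'jkj8h', 'jkhd:', 'hjkjh']
```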
You can do this with lookahead and lookbehind, if you want:
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split("(?<!:):(?!:)", s)
>>> print(l)
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is not followed by a : or preceded by a :"

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
Whitespace is not considered alphanumeric: your first pattern matches - with [-.\:alnum:], and then (.*) captures all 0 or more chars other than a newline into Group 1, including the leading space. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after a - that is followed by 0+ whitespaces, you may use
re.findall(r'-\s*([^\W_]+)', s)
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in the comments, [....] is a character class, matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
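Both points can be verified in a few lines. The sketch below shows that the original class behaves exactly like the plain set of characters - . : a l n u m, and then applies the corrected pattern; the variable names are chosen here for illustration.

```python
import re

original = r"[-.\:alnum:](.*)"
literal_set = r"[-.:alnum](.*)"  # the same set of characters, written plainly

# With one capturing group, findall returns the group, not the whole match.
original_results = [re.findall(original, t) for t in ("12 - mystr", "qwertyuio")]
literal_results = [re.findall(literal_set, t) for t in ("12 - mystr", "qwertyuio")]
print(original_results)
print(literal_results)

# Hyphen, optional whitespace, then capture the alphanumerics.
fixed = r"-\s*([A-Za-z0-9]*)"
fixed_results = [re.findall(fixed, t) for t in ("12 - mystr", "qwertyuio")]
print(fixed_results)
```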
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get the alphanumeric chars that follow the first '-', skipping any whitespace in between, you can do something like this (Python's re only allows fixed-width lookbehinds, so the whitespace is consumed with \s* and the result is taken from group 1).
re.match(r'.*?-\s*([a-zA-Z\d]+)', inputString)
If you want to find each string of alphanumerics directly after each '-', then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)
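Python's re module only accepts fixed-width lookbehinds, so (?<=-) compiles while something like (?<=\s+) does not. The sketch below illustrates this on made-up inputs and shows a capturing-group alternative for skipping whitespace after the hyphen.

```python
import re

# Fixed-width lookbehind: the '-' must precede the match but is not part of it.
after_hyphen = re.findall(r"(?<=-)[a-zA-Z\d]+", "a-b1 c-d2")
print(after_hyphen)

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\s+)[a-zA-Z\d]+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = exc
print(lookbehind_error)

# Consuming the separator and capturing a group sidesteps the restriction.
captured = re.findall(r"-\s*([a-zA-Z\d]+)", "12 - mystr")
print(captured)
```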

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the m and u flags for multiline and Unicode (multiline is only necessary if the input is separated by newline characters).
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
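In Python 3 the same pattern can be tried as follows. This is a sketch: str patterns are already Unicode-aware by default, so only re.MULTILINE is passed, and the accented letters from the sample input are written as escapes.

```python
import re

# re.MULTILINE makes ^ and $ match at the start and end of every line.
rx = re.compile(r"^(?:(?!\d)[\w ])+$", re.MULTILINE)

# The two input lines from the example above.
text = "\u00c0\u00c7\u00c6 some words\n\u00c0\u00c7\u00c6 some words 123"
matches = rx.findall(text)
print(matches)

# The strings from the question.
print(bool(rx.match("hell o")), bool(rx.match("hello 123")))
```

Only the first line is returned, because the second contains digits.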
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Split a string using Python

My string is
"S001P001Q001"
I want to split the string into:
['S001', 'P001', 'Q001']
I tried these steps:
test_re = re.compile("(P?[^P]+)")
result_str = test_re.findall(str1)
As I said in my comment, you could use re.findall instead of re.split.
>>> s = "S001P001Q001"
>>> re.findall(r'[A-Za-z][^A-Za-z]*', s)
['S001', 'P001', 'Q001']
>>> re.findall(r'[A-Za-z]\d*', s)
['S001', 'P001', 'Q001']
[A-Za-z] - Matches a letter.
[^A-Za-z]* - Matches zero or more non-letter characters.
\d* - Matches zero or more digit characters.
So findall starts each match at a letter and then greedily matches zero or more non-letter characters until the next letter, where the match stops. The next match starts at that letter and runs up to the following one, and so on.
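The two patterns give the same result for the question's string but differ once something other than digits sits between the letters. The sketch below uses a made-up second input, S001-P001, to show the difference.

```python
import re

s = "S001P001Q001"
parts = re.findall(r"[A-Za-z]\d*", s)
print(parts)  # ['S001', 'P001', 'Q001']

# Hypothetical input with a separator between the codes.
mixed = "S001-P001"
non_letters = re.findall(r"[A-Za-z][^A-Za-z]*", mixed)
digits_only = re.findall(r"[A-Za-z]\d*", mixed)
print(non_letters)  # ['S001-', 'P001']
print(digits_only)  # ['S001', 'P001']
```

The \d* version is the stricter choice when each piece should be a letter followed only by digits.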
