Split a string using Python - python

My string is
"S001P001Q001"
I want to split the string into:
['S001', 'P001', 'Q001']
I tried these steps:
test_re = re.compile("(P?[^P]+)")
result_str = test_re.findall(str1)

As I said in my comment, you could use re.findall instead of the re.split method.
>>> s = "S001P001Q001"
>>> re.findall(r'[A-Za-z][^A-Za-z]*', s)
['S001', 'P001', 'Q001']
>>> re.findall(r'[A-Za-z]\d*', s)
['S001', 'P001', 'Q001']
[A-Za-z] - Matches a single letter.
[^A-Za-z]* - Matches zero or more non-letter characters.
\d* - Matches zero or more digit characters.
So findall starts each match at a letter, then greedily matches zero or more non-letter characters until it reaches the next letter, where the match stops. The next match starts at that letter and runs up to the following one, and so on.
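A short sketch of how the two patterns behave, and of why the pattern from the question falls short. The hyphenated input is an invented example, not taken from the question, and the variable names are illustrative.

```python
import re

s = "S001P001Q001"

# The pattern from the question only treats "P" as a delimiter,
# so "Q001" stays attached to the chunk that starts with "P".
attempt = re.findall(r"(P?[^P]+)", s)
print(attempt)  # ['S001', 'P001Q001']

# Both patterns from the answer start a new match at every letter.
by_non_letters = re.findall(r"[A-Za-z][^A-Za-z]*", s)
by_digits = re.findall(r"[A-Za-z]\d*", s)
print(by_non_letters)  # ['S001', 'P001', 'Q001']
print(by_digits)       # ['S001', 'P001', 'Q001']

# They differ once something other than letters and digits appears
# (hypothetical input): [^A-Za-z]* also swallows the hyphen.
mixed = "S001-P001"
mixed_non_letters = re.findall(r"[A-Za-z][^A-Za-z]*", mixed)
mixed_digits = re.findall(r"[A-Za-z]\d*", mixed)
print(mixed_non_letters)  # ['S001-', 'P001']
print(mixed_digits)       # ['S001', 'P001']
```

If the input can contain separators, the \d* variant is probably the safer choice, since it keeps only the letter and its digits.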

Related

Regex - Regular expression for counting no.of digits between alphabets

I need to construct a regular expression that counts the numbers that appear between letters.
schowalte3rguss77ie85 - 2
xyz1zyx - 1
x1y1z1 - 2
I have constructed this, but it doesn't work for case 3.
[[a-z]+[0-9]+[a-z]]*
Any help would be appreciated. Thanks in advance.
Use this regex:
(?<=[a-z])\d+(?=[a-z])
Demo: https://regex101.com/r/tpss6x/1
[Javascript]
If you want a count only, the last part should be a lookahead assertion.
If you want to also match uppercase chars, you can make the pattern case insensitive.
[a-z]\d+(?=[a-z])
Explanation
[a-z] Match a single char a-z
\d+ Match 1+ digits
(?=[a-z]) Positive lookahead, assert a char a-z to the right
Regex demo
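A minimal Python sketch of why the lookahead matters for counting, using the three inputs from the question. The "consuming" pattern is only there for contrast and is an assumption about what a pattern without a lookahead would look like.

```python
import re

# Inputs and expected counts, as listed in the question.
samples = {
    "schowalte3rguss77ie85": 2,
    "xyz1zyx": 1,
    "x1y1z1": 2,
}

# Lookahead version: the trailing letter is checked but not consumed,
# so it can still serve as the leading letter of the next match.
with_lookahead = re.compile(r"[a-z]\d+(?=[a-z])", re.IGNORECASE)

# Contrast: here the trailing letter is consumed by the match.
consuming = re.compile(r"[a-z]\d+[a-z]", re.IGNORECASE)

counts = {text: len(with_lookahead.findall(text)) for text in samples}
print(counts)

consumed_count = len(consuming.findall("x1y1z1"))
print(consumed_count)  # 1, because "x1y" uses up the "y"
```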
You can use
(?<=[^\W\d_])\d+(?=[^\W\d_])
See the regex demo. If you want to only support ASCII letters, replace [^\W\d_] (that matches any Unicode letter) with [a-zA-Z].
Details:
(?<=[^\W\d_]) - immediately before the current location, there must be any Unicode letter
\d+ - one or more digits
(?=[^\W\d_]) - immediately after the current location, there must be any Unicode letter.
Counting can be done with len(...), see this Python demo:
import re
text = "schowalte3rguss77ie85"
matches = re.findall(r'(?<=[^\W\d_])\d+(?=[^\W\d_])', text)
print(len(matches)) # => 2

Python regex match string of 8 characters that contain both alphabets and numbers

I am trying to match strings of length 8 that contain both digits and letters (not just digits or just letters) using re.findall. The string can start with either a letter or a digit, followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)' however it is giving me strings with less than 8 characters as well.
I need to modify the regex so that it returns only strings of exactly 8 characters that contain both letters and digits.
You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary
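As a quick check, the two patterns can be run against the input sentence from the question. This is only a sketch; the variable names are illustrative.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# ASCII-only version.
ascii_only = re.compile(
    r"\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b"
)
# Unicode-aware version (str patterns are Unicode by default in Python 3).
unicode_aware = re.compile(
    r"\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b"
)

# Neither pattern has a capturing group, so findall returns whole matches.
ascii_matches = ascii_only.findall(text)
unicode_matches = unicode_aware.findall(text)
print(ascii_matches)    # ['896av6uf']
print(unicode_matches)  # ['896av6uf']
```

The all-digit and all-letter words are rejected by one of the two lookaheads, and hn0 is rejected because it is shorter than eight characters.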
First, let's write a pattern that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
The next condition is that the word must contain both letters and numbers:
[a-z]\d
Now for the challenging part, combining these into one statement. The easiest way might be to just split them up, but we can use a lookahead to get this to work:
\b(?=.*[a-z]\d)[a-z\d]{8}\b
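I'm sure there is a tidier way of doing this. Note that the lookahead as written is not confined to the 8-character word, since .* can run past it, and it only accepts a letter that is immediately followed by a digit.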
You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done with \w{8}, which means 8 word characters (letters of either case, digits, and underscore). \b means "word boundary".
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for existence of both digits AND letter, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
result is what you get from re.findall() .
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))
A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the matches are 8 characters long and that they cannot consist only of digits or only of letters.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 digits or 8 letters.
[0-9A-Za-z]{8} = Simple regex saying the input needs to be alphanumeric of 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
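To see the character-class reading concretely, the sketch below lists which characters the bracket expression matches on its own, then shows the whole match next to the captured group. Variable names are illustrative.

```python
import re

# The bracket expression alone: one char out of - . : a l n u m
char_class = re.compile(r"[-.\:alnum:]")

in_str2 = char_class.findall("qwertyuio")
in_str1 = char_class.findall("12 - mystr")
print(in_str2)  # ['u']
print(in_str1)  # ['-', 'm']

# The full pattern from the question: group(0) is the whole match,
# group(1) is what findall returns because the pattern has one group.
m = re.search(r"[-.\:alnum:](.*)", "12 - mystr")
whole, captured = m.group(0), m.group(1)
print(repr(whole))     # '- mystr'
print(repr(captured))  # ' mystr'
```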
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured ' mystr' (including the leading space) in the first capture group. If the regex contains any groups, findall returns a list of the groups, not the whole matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in the comments, [....] is a character class, matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
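A small sketch of the hand-written ASCII class in use, next to the [^\W_] class mentioned earlier in this thread. The accented input is an invented example meant to show where the two differ.

```python
import re

ascii_alnum = re.compile(r"-\s*([A-Za-z0-9]*)")
unicode_alnum = re.compile(r"-\s*([^\W_]+)")

from_str1 = ascii_alnum.findall("12 - mystr")
from_str2 = ascii_alnum.findall("qwertyuio")
print(from_str1)  # ['mystr']
print(from_str2)  # [] since there is no hyphen to anchor on

# Hypothetical input with a non-ASCII letter (U+00E9).
accented = "12 - caf\u00e9"
ascii_result = ascii_alnum.findall(accented)
unicode_result = unicode_alnum.findall(accented)
print(ascii_result)    # ['caf']
print(unicode_result)  # ['caf\u00e9'] printed with the accented letter
```

Which class is appropriate depends on whether non-ASCII letters should count as alphanumeric for your data.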
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match(r'.*?-\s*((?:[a-zA-Z\d]+\s*)+)', inputString)
If you want to find each string of alphanumerics directly after each '-' then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)

Python Alphanumeric Regex

Below I have the following regex:
alphanumeric = compile('^[\w\d ]+$')
I'm running the current data against this regex:
Tomkiewicz Zigomalas Andrade Mcwalters
I have a separate regex to identify alpha characters only, yet the data above still matches the alphanumeric criteria.
Edit: How do I stop the only alpha data matching with the regex above?
Description: An acceptable string can take one of two forms:
Starts with digits, followed by at least one letter, then any number of alphanumeric chars.
Starts with letters, followed by at least one digit, then any number of alphanumeric chars.
Demo:
>>> an_re = r"(\d+[A-Z])|([A-Z]+\d)[\dA-Z]*"
>>> re.search(an_re, '12345', re.I) # not acceptable string
>>> re.search(an_re, 'abcd', re.I) # not acceptable string
>>> re.search(an_re, 'abc1', re.I) # acceptable string
<_sre.SRE_Match object at 0x14153e8>
>>> re.search(an_re, '1abc', re.I)
<_sre.SRE_Match object at 0x14153e8>
Use a lookahead to assert the condition that at least one alpha and at least one digit are present:
(?=.*[a-zA-Z])(?=.*[0-9])^[\w\d ]+$
The above RegEx utilizes two lookaheads to first check the entire string for each condition. The lookaheads search up until a single character in the specified range is found. If the assertion matches then it moves on to the next one. The last part I borrowed from the OP's original attempt and just ensures that the entire string is composed of one or more lower/upper alphas, underscores, digits, or spaces.
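A brief sketch comparing the original pattern with the lookahead version on the data from the question. The inputs "abc 123" and "12345" are invented for illustration.

```python
import re

original = re.compile(r"^[\w\d ]+$")
requires_both = re.compile(r"(?=.*[a-zA-Z])(?=.*[0-9])^[\w\d ]+$")

names = "Tomkiewicz Zigomalas Andrade Mcwalters"

# \w already includes letters, so letters-only data passes the original.
original_ok = bool(original.match(names))
print(original_ok)  # True

# The digit lookahead fails, so the letters-only data is rejected.
names_ok = bool(requires_both.match(names))
print(names_ok)  # False

# Hypothetical inputs: one with both kinds of characters, one digits-only.
mixed_ok = bool(requires_both.match("abc 123"))
digits_ok = bool(requires_both.match("12345"))
print(mixed_ok)   # True
print(digits_ok)  # False
```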

Python regex match literal asterisk

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
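The difference between the two patterns can be checked directly. This is a small sketch; inputs other than 'abcdefg*' are made up for illustration.

```python
import re

original = re.compile(r"^[a-z]\*+$")
fixed = re.compile(r"^[a-z]+[*]?$")

# The original wants exactly one letter, then one or more asterisks.
orig_on_s = original.match("abcdefg*")
orig_on_short = original.match("a**")
print(orig_on_s)               # None
print(orig_on_short.group(0))  # a**

# The fixed pattern wants one or more letters and at most one asterisk.
fixed_results = {
    text: bool(fixed.match(text))
    for text in ["abcdefg*", "abcdefg", "abc**"]
}
print(fixed_results)
```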
\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)
re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * character.
Try
re.match(r"^[a-z]*\*?$", s)
This means "a string consisting of zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk)".
Your regex means "exactly one lowercase character followed by one or more asterisks".
You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)
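The patterns suggested in this thread differ in their edge cases. The sketch below runs three of them over a handful of invented inputs so the differences are visible; the dictionary keys are just labels chosen for this example.

```python
import re

patterns = {
    "letters+ then optional *": r"^[a-z]+\*?$",
    "letters* then optional *": r"^[a-z]*\*?$",
    "letters+ then one or more *": r"^[a-z]+\*+$",
}

inputs = ["abcdefg*", "abcdefg", "*", "", "abc**"]

# For each pattern, record which inputs it accepts.
accepted = {
    label: [text for text in inputs if re.match(pattern, text)]
    for label, pattern in patterns.items()
}

for label, texts in accepted.items():
    print(label, "->", texts)
```

Only the first variant accepts both 'abcdefg*' and 'abcdefg' while rejecting the empty string and a lone asterisk. The last variant requires at least one asterisk, so it would not treat the asterisk as optional.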
