I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)`
$ Assert position at the end of the line
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
Related
I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. In the site regexr, the regex expression match the parts that I want but gives a warning, so... I'm not a expert in regex, I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
Python regex<¯\(ツ)/¯>Python code
This does not work, however, for string defined in the OP's code, which contains a question mark, which is contrary to the statement of the question in terms of the two examples.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.
I am trying to match a string of length 8 containing both numbers and alphabets(cannot have just numbers or just alphabets)using re.findall. The string can start with either letter or alphabet followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)' however it is giving me strings with less than 8 characters as well.
Need to modify the regex to return strings with exactly 8 chars that contain both letters and alphabets.
You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary
First, let's find statement that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
Next condition is that the word must contain both letters and numbers:
[a-d]\d
Now for the challenging part, combining these into one statement. Easiest way might be to just spit them up but we can use some look-aheads to get this to work:
\b(?=.*[a-z]\d)[a-z\d]{8}\b
Im sure there a tidier way of doing this but this will work.
You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done with the \w{8} which means 8 letters/words (including capitals and underscore). \b means "word boundary"
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for existence of both digits AND letter, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
result is what you get from re.findall() .
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))
A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the found matches are 8 characters in length and that they can not be only numeric or only alphabets.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 numbers or 8 alphabets.
[0-9A-Za-z]{8} = Simple regex saying the input needs to be alphanumeric of 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']
I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I just have that matched the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'
If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, you suggest the desired word can change. In that case, I would recommend using re.compile() to dynamically create a string which then defines the regular expression.
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In python, a string " or ' prepositioned with the character 'r' means the text is compiled as a regular expression. Strings with the character 'r' also do not require double-escape '\' characters.
I have the string.
st = "12345 hai how r u #3456? Awer12345 7890"
re.findall('([0-9]+)',st)
It should not come like :
['12345', '3456', '12345', '7890']
I should get
['12345','7890']
I should only take the numeric values
and
It should not contain any other chars like alphabets,special chars
No need to use a regular expression:
[i for i in st.split(" ") if i.isdigit()]
Which I think is much more readable than using a regex
Corey's solution is really the right way to go here, but since the question did ask for regex, here is a regex solution that I think is simpler than the others:
re.findall(r'(?<!\S)\d+(?!\S)', st)
And an explanation:
(?<!\S) # Fail if the previous character (if one exists) isn't whitespace
\d+ # Match one or more digits
(?!\S) # Fail if the next character (if one exists) isn't whitespace
Some examples:
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890')
['12345', '7890']
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890123ER%345 234 456 789')
['12345', '234', '456', '789']
In [21]: re.findall(r'(?:^|\s)(\d+)(?=$|\s)', st)
Out[21]: ['12345', '7890']
Here,
(?:^|\s) is a non-capture group that matches the start of the string, or a space.
(\d+) is a capture group that matches one or more digits.
(?=$|\s) is lookahead assertion that matches the end of the string, or a space, without consuming it.
use this: (^|\s)[0-9]+(\s|$) pattern. (^|\s) means that your number must be at the start of the string or there must be a whitespace character before the number. And (\s|$) means that there must be a whitespace after number or the number is at the end of the string.
As said Jan Pöschko, 456 won't be found in 123 456. If your "bad" parts (#, Awer) are always prefixes, you can use this (^|\s)[0-9]+ pattern and everything will be OK. It will match all numbers, which have only whitespaces before or are at the start of the string. Hope this helped...
Your expression finds all sequences of digits, regardless of what surrounds them. You need to include a specification of what comes before and after the sequence to get the behavior you want:
re.findall(r"[\D\b](\d+)[\D\b]", st)
will do what you want. In English, it says "match all sequences of one or more digits that are surrounded by a non-digit character.or a word boundary"
Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)
re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * chatacter.
Try
re.match(r"^[a-z]*\*?$", s)
this means "a string consisting zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk).
Your regex means "exactly one lowercase character followed by one or more asterisks".
You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)