Python regular expression to find letters and numbers - python

Entering a string
I used 'findall' to find words that are only letters and numbers (The number of words to be found is not specified).
I created:
words = re.findall ("\ w * \ s", x) # x is the input string
If i entered "asdf1234 cdef11dfe a = 1 b = 2"
these sentences seperated asdf1234, cdef11dfe, a =, 1, b =, 2
I would like to pick out only asdf1234, cdef11dfe
How do you write a regular expression?

Try /[a-zA-z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
The problem with \w is that it includes underscores.

This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_])+ Anything but a non-word character or an underscore at least one time (non capturing group)
(?<![\"=\w]) not precedeed by " or a word character
(?![\"=\w]) not followed by " or a word character
RegEx Demo
Sample code Run online
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
print (match.group())

Related

Extracting a complex substring using regex with data from a string in python

I have a string say
text = 'i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
i tried :
pattern = r"^[\d]{2}.*,[\d]{3}$"
data = re.findall(pattern, text)
for s in data:
print(s)
my desired output :
[31-Dec-08, USD, 5234765]
you can do it that way
import re
regex = r"(\w+-\w+-\w+)|([A-Z]{3})|(\d+)"
test_str = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
matches = re.findall(regex, test_str)
temp = [_ for tupl in matches for _ in tupl if _]
print(temp) #['31-Dec-08', 'USD', '5234765']
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
-matches the character - literally (case sensitive)
[A-Z]{3} matches the capital alphabet exactly 3 times.
\d matches a digit (equivalent to [0-9])

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1 + word chars
Regex demo
If there can not be anything after the the word characters, you can assert a whitespace boundary at the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Regex demo | Python demo
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
You may use
\b[mb](\w+)
See the regex demo.
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
(?<=[mb])\w+/
You can use this above regex. The regex means "Any word starts with m or b".
(?<=[mb]): positive lookbehind
\w+: matches any word character (equal to [a-zA-Z0-9]+)

regex to remove every hyphen except between two words

I am cleaning a text and I would like to remove all the hyphens and special characters. Except for the hyphens between two words such as: tic-tacs, popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured
You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
Regex demo <¯\_(ツ)_/¯> Python demo
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the previous character is not a word character
|
(?<!\w) the following character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Regex demo | Python demo
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs
You can use findall() to get that part that matches your criteria.
new_text = re.findall('[\w]+[-]?[\w]+', text)[0]
Play around with it with other inputs.
You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else " "), text)
When hyphen detected between two words m.group(1) exists, so we maintain things as they are
else "")
Occurs when the pattern was triggered by [-] then we substitute a "" for the hyphen removing it.

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing [^'] with
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).

How to find words with two vowels

How to find words with two vowels in the middle in a string using python regular expression
this is my code :
s= "reading a book is great"
print(re.findall(r'\b(\w+[aeiyou]+w)\b',s))
expected output: [book]
my output: [book],[grea]
Replace + in your regex with {2}, because + repeats the previous token one or more times where {2} repeats the previous token exactly the 2 times.
print(re.findall(r'\b\w[aeiou]{2}\w\b',s))
For both upper and lowercase vowels.
print(re.findall(r'\b\w[aeiouAEIOU]{2}\w\b',s))
You could use [A-Za-z] instead of \w, if you don't want a digit or _ exists before or after the vowels. Because \w also matches _ and digits.
print(re.findall(r'\b[A-Za-z][aeiouAEIOU]{2}[A-Za-z]\b',s))
Add case-insensitive modifier (?i) or re.IGNORECASE to do a case-insensitive match.
print(re.findall(r'(?i)\b[a-z][aeiou]{2}[a-z]\b',s))

Categories

Resources