python regex to find alphanumeric string with at least one letter - python

I am trying to figure out the syntax for regular expression that would match 4 alphanumeric characters, where there is at least one letter. Each should be wrapped by: > and < but I wouldn't like to return the angle brackets.
For example when using re.findall on string >ABCD<>1234<>ABC1<>ABC2 it should return ['ABCD', 'ABC1'].
1234 - doesn't have a letter
ABC2 - is not wrapped with angle brackets

You may use this lookahead based regex in python with findall:
(?i)>((?=\d*[a-z])[a-z\d]{4})<
RegEx Demo
Code:
>>> regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)
>>> s = ">ABCD<>1234<>ABC1<>ABC2"
>>> print (regex.findall(s))
['ABCD', 'ABC1']
RegEx Details:
re.I: Enable ignore case modifier
>: Match literal character >
(: Start capture group
(?=\d*[a-z]): Lookahead to assert we have at least one letter after 0 or more digits
[a-z\d]{4}: Match 4 alphanumeric characters
): End capture group
<: Match literal character <

import re
sentence = ">ABCD<>1234<>ABC1<>ABC2"
pattern = "\>((?=[a-zA-Z])(.){4})\<"
m = [m[0] for m in re.findall(pattern, sentence)]
#outputs ['ABCD', 'ABC1']

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Demo
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any string (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
See the Python demo online
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement pattern contains no backslashes, there is no need preprocessing (escaping) anything in it.
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookahead and lookback assertions, ?<= and ?=. A match only occurs if a string is preceded and followed by the assertions in the pattern, but does not consume them. Meaning that re.sub looks for a string with page1/ in front and _type behind it, but only replaces the part in between.

Extract a string between two set of patterns in Python

I am trying to extract a substring between two set of patterns using re.search().
On the left, there can be either 0x or 0X, and on the right there can be either U, , or \n. The result should not contain boundary patterns. For example, 0x1234U should result in 1234.
I tried with the following search pattern: (0x|0X)(.*)(U| |\n), but it includes the left and right patterns in the result.
What would be the correct search pattern?
You could use also use a single group using .group(1)
0[xX](.*?)[U\s]
The pattern matches:
0[xX] Match either 0x or 0X
(.*?) Capture in group 1 matching any character except a newline, as least as possible
[U\s] Match either U or a whitespace characters (which could also match a newline)
Regex demo | Python demo
import re
s = r"0x1234U"
pattern = r"0[xX](.*?)[U\s]"
m = re.search(pattern, s)
if m:
print(m.group(1))
Output
1234
You could use a combination of lookbehind and lookahead with a non-greedy match pattern in between:
import re
pattern = r"(?<=0[xX])(.*?)(?=[U\s\n])"
re.findall(pattern,"---0x1234U...0X456a ")
['1234', '456a']

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1 + word chars
Regex demo
If there can not be anything after the the word characters, you can assert a whitespace boundary at the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Regex demo | Python demo
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
You may use
\b[mb](\w+)
See the regex demo.
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
(?<=[mb])\w+/
You can use this above regex. The regex means "Any word starts with m or b".
(?<=[mb]): positive lookbehind
\w+: matches any word character (equal to [a-zA-Z0-9]+)

Python regular expression to find letters and numbers

Entering a string
I used 'findall' to find words that are only letters and numbers (The number of words to be found is not specified).
I created:
words = re.findall ("\ w * \ s", x) # x is the input string
If i entered "asdf1234 cdef11dfe a = 1 b = 2"
these sentences seperated asdf1234, cdef11dfe, a =, 1, b =, 2
I would like to pick out only asdf1234, cdef11dfe
How do you write a regular expression?
Try /[a-zA-z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_])+ Anything but a non-word character or an underscore at least one time (non capturing group)
(?<![\"=\w]) not precedeed by " or a word character
(?![\"=\w]) not followed by " or a word character
RegEx Demo
Sample code Run online
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
print (match.group())

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to the characters other than ( (so, it cannot grab the whole line up to the end) and get to the first pair of parentheses.
Pattern expalantion:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\)- closing ).
Debuggex Demo
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated class \(([^)]*)\) to match anything between ( and ):
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print m.group(1)
70849
>>> print m.group(2)
linux;u;android4.2.1;zh-cn

Categories

Resources