RegEx for matching capital letters and numbers - python

Hi, I have a large corpus that I parse to extract two kinds of patterns:
codes such as AP70, ML71, GR55, etc.,
and sequences of words that each start with a capital letter, such as: Hello Little Monkey, How Are You, etc.
For the first case I wrote this regexp, but it doesn't return all the matches:
>>> p = re.compile("[A-Z]+[0-9]+")
>>> res = p.search("aze azeaz GR55 AP1 PM89")
>>> res
<re.Match object; span=(10, 14), match='GR55'>
and for the second one:
>>> s = re.compile(r"[A-Z]+[a-z]+\s[A-Z]+[a-z]+\s[A-Z]+[a-z]+")
>>> resu = s.search("this is a test string, Hello Little Monkey, How Are You ?")
>>> resu
<re.Match object; span=(23, 42), match='Hello Little Monkey'>
>>> resu.group()
'Hello Little Monkey'
It seems to work, but I want to get all the matches when parsing a whole 'big' line.

Try these two regexes
(for safety, each is enclosed in whitespace/comma boundaries):
>>> import re
>>> teststr = "aze azeaz GR55 AP1 PM89"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])", teststr)
>>> print(res)
['GR55', 'AP1', 'PM89']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [0-9]+
(?! [^\s,] )
and
>>> import re
>>> teststr = "this is a test string, ,Hello Little Monkey, How Are You ?"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+){1,}(?![^\s,])", teststr)
>>> print(res)
['Hello Little Monkey', 'How Are You']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [a-z]+
(?: \s [A-Z]+ [a-z]+ ){1,}
(?! [^\s,] )
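If the positions of the matches are also needed, re.finditer can be used with the same two patterns. A minimal sketch, where the names CODE_RE, PHRASE_RE and find_spans are only illustrative and not part of the answer above:

```python
import re

# Same two patterns as above, compiled once so they can be reused.
CODE_RE = re.compile(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])")
PHRASE_RE = re.compile(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+){1,}(?![^\s,])")

def find_spans(regex, text):
    """Return (start, end, matched_text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in regex.finditer(text)]

print(find_spans(CODE_RE, "aze azeaz GR55 AP1 PM89"))
# [(10, 14, 'GR55'), (15, 18, 'AP1'), (19, 23, 'PM89')]
print(find_spans(PHRASE_RE, "this is a test string, ,Hello Little Monkey, How Are You ?"))
```

Compiling once and iterating lazily with finditer may also be convenient when the 'big' line is very long.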

This expression might help you to do so, or design one. It seems you wish that your expression would contain at least one [A-Z] and at least one [0-9]:
(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)
Example Code:
This code shows how the expression would work in Python:
# -*- coding: UTF-8 -*-
import re
string = "aze azeaz GR55 AP1 PM89"
expression = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Example Output
YAAAY! "GR55" is a match 💚💚💚
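Note that re.search only returns the first match, and that the (?=.+[0-9]) lookahead is satisfied by a digit anywhere later in the string, not necessarily in the same token. A possible variation, assuming tokens are delimited by word boundaries (that assumption and the name TOKEN_RE are mine, not the answer's), keeps both lookaheads inside the token and collects every match with re.findall:

```python
import re

# Original expression: the second lookahead can see a digit in a later token.
ORIGINAL = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
print(re.search(ORIGINAL, "AZE GR55").group(1))  # AZE

# Hypothetical variation: both lookaheads only scan characters of the current token.
TOKEN_RE = re.compile(r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*[0-9])[A-Z0-9]+\b")

sample = "aze AZE GR55 AP1 7UP 123"
print(TOKEN_RE.findall(sample))  # ['GR55', 'AP1', '7UP']
```

Unlike the question's [A-Z]+[0-9]+, this variation also accepts tokens where a digit comes first, such as 7UP.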
Performance
This JavaScript snippet measures the performance of the expression with a simple loop of one million iterations.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
    var string = 'aze azeaz GR55 AP1 PM89';
    var regex = /(.*?)(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)/g;
    var match = string.replace(regex, "$2 ");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Related

Match if string starts with n digits and no more

I have strings like 6202_52_55_1959.txt
I want to match those that start with exactly 3 digits and no more.
So 6202_52_55_1959.txt should not match but 620_52_55_1959.txt should.
import re
regexp = re.compile(r'^\d{3}')
file = r'6202_52_55_1959.txt'
print(regexp.search(file))
<re.Match object; span=(0, 3), match='620'> #I dont want this example to match
How can I get it to only match if there are three digits and no more following?
Use a negative lookahead:
regexp = re.compile(r'^\d{3}(?!\d)')
Use the pattern ^\d{3}(?=\D):
import re
inp = ["6202_52_55_1959.txt", "620_52_55_1959.txt"]
for i in inp:
    if re.search(r'^\d{3}(?=\D)', i):
        print("MATCH: " + i)
    else:
        print("NO MATCH: " + i)
This prints:
NO MATCH: 6202_52_55_1959.txt
MATCH: 620_52_55_1959.txt
The regex pattern used here says to match:
^ from the start of the string
\d{3} 3 digits
(?=\D) then assert that the next character is NOT a digit (a character must follow, so a string that ends right after the three digits does not match; use (?!\d) if it should)
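The two lookaheads suggested above differ in one edge case: (?!\d) also succeeds when nothing follows the three digits, whereas (?=\D) needs an actual non-digit character. A small check (the bare name "620" is a made-up input, not from the question):

```python
import re

negative = re.compile(r'^\d{3}(?!\d)')  # three digits NOT followed by a digit
positive = re.compile(r'^\d{3}(?=\D)')  # three digits followed by a non-digit character

for name in ["6202_52_55_1959.txt", "620_52_55_1959.txt", "620"]:
    print(name, bool(negative.search(name)), bool(positive.search(name)))
# 6202_52_55_1959.txt False False
# 620_52_55_1959.txt True True
# 620 True False
```

For file names that always carry a suffix, either pattern gives the same result.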

How to use "." as a wildcard inside "string" instead of pattern?

I have this:
incompleted_string1 = "Thom"
incompleted_string2 = "s Mueller naive"
entire_string = 'Thom.s Mueller naive' # <= dot means any char!!! I dont know which char is it
pattern = "mas M"
I would like to know whether "mas M" is present inside entire_string. I do not care if "." is equal to "a" or something else. I cannot change the pattern string!
re.findall("mas M", entire_string)
This returns []. I'd like to get "mas M", but True would be enough.
Thank you for your help
You can replace each char in the pattern with [ + this char + . + ]:
bool(re.search("".join([f"[{x}.]" for x in pattern]), entire_string))
The pattern will look like [m.][a.][s.][ .][M.] here, and each character class can match either the corresponding letter or a literal dot.
Full Python example:
import re
incompleted_string1 = "Thom"
incompleted_string2 = "s Mueller naive"
entire_string = 'Thom.s Mueller naive' # <= dot means any char!!! I dont know which char is it
pattern = "mas M"
print (bool(re.search("".join([f"[{x}.]" for x in pattern]), entire_string)) )
# => True
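One caveat with building a character class per character: characters such as ^, ] or \ have a special meaning inside [...], so a pattern containing them could produce a wrong or invalid regex. A hedged variant that escapes each character and uses an alternation instead (the function name matches_with_dots is made up for this sketch):

```python
import re

def matches_with_dots(pattern, text):
    """True if pattern occurs in text, treating every '.' in text as an unknown character."""
    # Each pattern character may match itself or a literal dot in the text.
    regex = "".join(f"(?:{re.escape(ch)}|\\.)" for ch in pattern)
    return re.search(regex, text) is not None

print(matches_with_dots("mas M", "Thom.s Mueller naive"))  # True
print(matches_with_dots("a^b", "xa.b"))                    # True
print(matches_with_dots("mas X", "Thom.s Mueller naive"))  # False
```

The pattern string itself is left unchanged, as the question requires; only the regex derived from it is different.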
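Another approach is to build an alternation of the pattern itself plus every variant with one character replaced by a literal dot (this handles at most one unknown character; the dot is escaped so that it does not match arbitrary characters):
bool(re.search("|".join([re.escape(pattern)] + [re.escape(pattern[:i]) + r"\." + re.escape(pattern[i+1:]) for i in range(len(pattern))]), entire_string))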

The problem of regex strings containing special characters in python

I have a string: "s = string.charAt (0) == 'd'"
I want to retrieve a tuple of ('0', 'd')
I have used: re.search(r "\ ((. ?) \) == '(.?)' && "," string.charAt (0) == 'd' ")
I checked the s variable when printed as "\\ ((.?) \\) == '(.?) '&& "
How do I fix it?
You may try:
\((\d+)\).*?'(\w+)'
Explanation of the above regex:
\( - Matches a ( literally.
(\d+) - Represents the first capturing group matching digits one or more times.
\) - Matches a ) literally.
.*? - Lazily matches everything except a new-line.
'(\w+)' - Represents second capturing group matching ' along with any word character([0-9a-zA-Z_]) one or more times.
import re
regex = r"\((\d+)\).*?'(\w+)'"
test_str = "s = string.charAt (0) == 'd'"
print(re.findall(regex, test_str))
# Output: [('0', 'd')]
Your regular expression should be ".*\((.?)\) .* '(.?)\'". This will get both the character inside the parenthesis and then the character inside the single quotes.
>>> import re
>>> s = " s = string.charAt (0) == 'd'"
>>> m = re.search(r".*\((.?)\) .* '(.?)'", s)
>>> m.groups()
('0', 'd')
Use
\((.*?)\)\s*==\s*'(.*?)'
The first variable is captured in Group 1 and the second in Group 2.
Python code:
import re
string = "s = string.charAt (0) == 'd'"
match_data = re.search(r"\((.*?)\)\s*==\s*'(.*?)'", string)
if match_data:
    print(f"Var#1 = {match_data.group(1)}\nVar#2 = {match_data.group(2)}")
Output:
Var#1 = 0
Var#2 = d
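Since the question asks for a tuple, Match.groups() already returns the captured values in that form. A small sketch built on the last expression above (PAIR_RE and extract_pair are illustrative names, not from any answer):

```python
import re

# "(x) == 'y'" with optional whitespace around the ==
PAIR_RE = re.compile(r"\((.*?)\)\s*==\s*'(.*?)'")

def extract_pair(source):
    """Return (inside_parentheses, inside_quotes), or None when nothing matches."""
    match = PAIR_RE.search(source)
    return match.groups() if match else None

print(extract_pair("s = string.charAt (0) == 'd'"))  # ('0', 'd')
print(extract_pair("no comparison here"))            # None
```

Returning None for a non-match avoids the AttributeError that calling .groups() on a failed search would raise.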
Thanks everyone for the very helpful answer. My problem has been solved ^^

python regex matching "ab" or "ba" words

I tried matching words that contain the letters "ab" or "ba", e.g. "ab"olition, f"ab"rics, pro"ba"ble. I came up with the following regular expression:
r"[Aa](?=[Bb])[Bb]|[Bb](?=[Aa])[Aa]"
But it also matches words that start or end with non-alphanumeric characters such as ", (, ), /. How can I exclude those characters? I just want a list of the matching words.
import sys
import re
word=[]
dict={}
f = open('C:/Python27/brown_half.txt', 'rU')
w = open('C:/Python27/brown_halfout.txt', 'w')
data = f.read()
word = data.split() # word is list
f.close()
for num2 in word:
    match2 = re.findall("\w*(ab|ba)\w*", num2)
    if match2:
        dict[num2] = (dict[num2] + 1) if num2 in dict.keys() else 1
for key2 in sorted(dict.iterkeys()):
    print "%s: %s" % (key2, dict[key2])
print len(dict.keys())
I don't know how to combine this with the re.compile approach that the first comment suggested...
To match all the words with ab or ba (case insensitive):
import re
text = 'fabh, obar! (Abtt) yybA, kk'
pattern = re.compile(r"(\w*(ab|ba)\w*)", re.IGNORECASE)
# print all the matches
for match in pattern.finditer(text):
    print(match.group(0))
# print the first match
print(pattern.search(text).group(0))
https://regex101.com/r/uH3xM9/1
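One detail worth knowing when switching between finditer and findall with this kind of pattern: if the regex contains a capturing group, re.findall returns only the group's text rather than the whole word. A short illustration using the same sample text (the non-capturing rewrite is a suggestion, not part of the answer above):

```python
import re

text = 'fabh, obar! (Abtt) yybA, kk'

# With a capturing group, findall returns only what the group captured.
print(re.findall(r"\w*(ab|ba)\w*", text, re.IGNORECASE))  # ['ab', 'ba', 'Ab', 'bA']

# With a non-capturing group, findall returns the whole words.
words = re.findall(r"\w*(?:ab|ba)\w*", text, re.IGNORECASE)
print(words)  # ['fabh', 'obar', 'Abtt', 'yybA']
```

Because \w does not match punctuation, the surrounding parentheses, commas and exclamation marks are left out of the results.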
Regular expressions are not the best tool for the job in this case. They'll complicate stuff way too much for such simple circumstances. You can instead use Python's builtin in operator (works for both Python 2 and 3)...
sentence = "There are no probable situations whereby that may happen, or so it seems since the Abolition."
words = [''.join(filter(lambda x: x.isalpha(), token)) for token in sentence.split()]
for word in words:
    word = word.lower()
    if 'ab' in word or 'ba' in word:
        print('Word "{}" matches pattern!'.format(word))
As you can see, 'ab' in word evaluates to True if the string 'ab' is found as-is (that is, exactly) in word, and to False otherwise. For example, 'ba' in 'probable' is True, while 'ab' in 'Abolition' is False because the test is case-sensitive. The second line takes care of dividing the sentence into words and stripping any non-letter characters such as punctuation. word = word.lower() makes word lowercase before the comparisons, so that for word = 'Abolition', 'ab' in word is True.
I would do it this way:
First, strip the unwanted characters from your string using either of the two techniques below:
a - By building a translation dictionary and using translate method:
>>> import string
>>> del_punc = dict.fromkeys(ord(c) for c in string.punctuation)
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = s.translate(del_punc)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
b - using re.sub method:
>>> import string
>>> import re
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = re.sub(r'[%s]' % re.escape(string.punctuation), '', s)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
Next, find the words containing 'ab' or 'ba':
a - Splitting over whitespaces and finding occurrences of your desired strings, which is the one I recommend to you:
>>> [x for x in s.split() if 'ab' in x.lower() or 'ba' in x.lower()]
['abolition', 'fabrics', 'probable', 'bank', 'halfback', '1ablution']
b -Using re.finditer method:
>>> pat = re.compile(r'\b.*?(ab|ba).*?\b', re.IGNORECASE)
>>> for m in pat.finditer(s):
...     print(m.group())
...
abolition
fabrics
probable
test case bank
halfback
1ablution
string = "your string here"
lowercase = string.lower()
if 'ab' in lowercase or 'ba' in lowercase:
    print(True)
else:
    print(False)
Try this one
[(),/]*([a-z]|(ba|ab))+[(),/]*

How to convert an iterative str.replace() with str.translate()? - python

The aim of my task is to add spaces before and after punctuation. Currently I've been using an iterative str.replace() to replace each punctuation character p with " "+p+" ". How do I achieve the same output with str.translate(), where I can just pass in two lists or a dictionary?
inlist = string.punctuation
outlist = [" "+p+" " for p in string.punctuation]
inoutdict = {p:" "+p+" " for p in string.punctuation}
Let's assume that all the punctuation characters I have are in string.punctuation. Currently, I'm doing it like this:
from string import punctuation as punct
def punct_tokenize(text):
    for ch in text:
        if ch in punct:
            text = text.replace(ch, " "+ch+" ")
    return " ".join(text.split())
sent = "This's a foo-bar sentences with many, many punctuation."
print(punct_tokenize(sent))
Also, this iterative str.replace() is taking too long. Will str.translate() be any faster?
In Python 2, the dict form of translate only works with unicode objects:
>>> import string
>>> inoutdict = {ord(p):unicode(" "+p+" ") for p in string.punctuation}
>>> unicode("foo,,,bar!!1").translate(inoutdict)
u'foo ,  ,  , bar !  ! 1'
Another option is with regular expressions:
>>> import re
>>> rx = '[%s]' % re.escape(string.punctuation)
>>> re.sub(rx, r" \g<0> ", "foo,,,bar!!1")
'foo ,  ,  , bar !  ! 1'
As usual, show us a bigger picture to get better answers, e.g. why are you doing that? where does the input come from?, etc...
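In Python 3, str.translate accepts such a mapping directly, and str.maketrans can build the table from a dict whose keys are single characters. A minimal sketch of the question's punct_tokenize on that basis, assuming (as the question does) that all punctuation is in string.punctuation; the name PUNCT_TABLE is illustrative:

```python
import string

# Map every punctuation character to itself surrounded by spaces.
PUNCT_TABLE = str.maketrans({p: " " + p + " " for p in string.punctuation})

def punct_tokenize(text):
    """Pad punctuation with spaces in one pass, then collapse repeated whitespace."""
    return " ".join(text.translate(PUNCT_TABLE).split())

sent = "This's a foo-bar sentences with many, many punctuation."
print(punct_tokenize(sent))
# This ' s a foo - bar sentences with many , many punctuation .
```

Whether this is faster than the iterative str.replace() for a given corpus is best checked by timing both, for example with the timeit module.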
