Match if string starts with n digits and no more - python

I have strings like 6202_52_55_1959.txt
I want to match those which start with exactly 3 digits and no more.
So 6202_52_55_1959.txt should not match but 620_52_55_1959.txt should.
import re
regexp = re.compile(r'^\d{3}')
file = r'6202_52_55_1959.txt'
print(regexp.search(file))
<re.Match object; span=(0, 3), match='620'> #I dont want this example to match
How can I get it to only match if there are three digits and no more following?

Use a negative lookahead:
regexp = re.compile(r'^\d{3}(?!\d)')
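Since the title asks about n digits, the same negative lookahead can be built for any count. This is only a minimal sketch; the helper name starts_with_exactly_n_digits is hypothetical and not part of the original answer:

```python
import re

def starts_with_exactly_n_digits(s, n):
    """Return True if s begins with exactly n digits (hypothetical helper)."""
    # (?!\d) rejects an (n+1)-th digit; it also succeeds at end of string.
    return re.match(r'\d{%d}(?!\d)' % n, s) is not None

print(starts_with_exactly_n_digits('620_52_55_1959.txt', 3))   # True
print(starts_with_exactly_n_digits('6202_52_55_1959.txt', 3))  # False
```

re.match already anchors at the start of the string, so no ^ is needed here.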

Use the pattern ^\d{3}(?=\D):
import re
inp = ["6202_52_55_1959.txt", "620_52_55_1959.txt"]
for i in inp:
    if re.search(r'^\d{3}(?=\D)', i):
        print("MATCH: " + i)
    else:
        print("NO MATCH: " + i)
This prints:
NO MATCH: 6202_52_55_1959.txt
MATCH: 620_52_55_1959.txt
The regex pattern used here says to match:
^ from the start of the string
\d{3} 3 digits
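(?=\D) then assert that the next character is a non-digit (this requires a character to follow, so a string that ends right after the 3 digits does not match; the negative lookahead (?!\d) accepts end of string as well)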
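One edge case worth checking is a string that ends right after the three digits: (?=\D) needs an actual non-digit character to follow, whereas (?!\d) only requires that no digit follows. A small sketch comparing the two (the sample strings are made up for illustration):

```python
import re

results = {}
for s in ['620', '620_52.txt', '6202_52.txt']:
    neg = bool(re.search(r'^\d{3}(?!\d)', s))   # no digit may follow
    pos = bool(re.search(r'^\d{3}(?=\D)', s))   # a non-digit must follow
    results[s] = (neg, pos)
    print(s, neg, pos)
```

For file names with an extension the two patterns behave the same; they differ only for a bare '620'.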

Related

Extract substring from a python string

I want to extract the string before the 9 digit number below:
tmp = "place1_128017000_gw_cl_mask.tif"
The output should be place1
I could do this:
tmp.split('_')[0] but I also want the solution to work for:
tmp = "place1_place2_128017000_gw_cl_mask.tif" where the result would be:
place1_place2
You can assume that the number will always be 9 digits long.
Using regular expressions and the lookahead feature of regex, this is a simple solution:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'.+(?=_\d{9}_)', tmp)
print(m.group())
Result:
place1_place2
Note that \d{9} matches 9 digits, and the underscores on either side ensure the number is exactly 9 digits long. The part of the regex inside (?= ... ) is a lookahead: it is not included in the match, but the match only succeeds if that text follows.
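To see the lookahead on both inputs from the question, and on a name without a 9-digit field, here is a short sketch. The third file name is invented for illustration; re.search returns None when nothing matches, so the result is checked before calling .group():

```python
import re

pattern = re.compile(r'.+(?=_\d{9}_)')

names = [
    'place1_128017000_gw_cl_mask.tif',
    'place1_place2_128017000_gw_cl_mask.tif',
    'no_number_here.tif',  # made-up name with no 9-digit field
]
# Guard against None before calling .group()
prefixes = [m.group() if (m := pattern.search(n)) else None for n in names]
print(prefixes)
```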
Assuming we can phrase your problem as wanting the substring up to, but not including, the underscore that is followed by an all-digit field, we can try:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'^([^_]+(?:_[^_]+)*)_\d+_', tmp)
print(m.group(1)) # place1_place2
Use a regular expression:
import re
places = (
    "place1_128017000_gw_cl_mask.tif",
    "place1_place2_128017000_gw_cl_mask.tif",
)
# Raw string so that \d is passed to the regex engine unchanged
pattern = re.compile(r"(place\d+(?:_place\d+)*)_\d{9}")
for p in places:
    matched = pattern.match(p)
    if matched:
        print(matched.group(1))
prints:
place1
place1_place2
The regex works like this (adjust as needed, e.g., for fewer than 9 digits or a variable number of digits):
( starts a capture
place\d+ matches "place" plus one or more digits
(?: starts a group, but does not capture it (no need to capture)
_place\d+ matches an underscore followed by another "place" plus digits
) closes the group
* means zero or more repetitions of the previous group
) closes the capture
_\d{9} matches an underscore followed by 9 digits
The result is in the first (and only) capture group.
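As one possible adjustment for a variable number of digits, the fixed {9} can be replaced by + and a trailing underscore. This is a sketch under the assumption that the numeric field is always delimited by underscores; the first file name is made up:

```python
import re

# Assumption: the numeric field is one or more digits between underscores.
pattern = re.compile(r"(place\d+(?:_place\d+)*)_\d+_")

results = []
for p in ("place1_1280_gw.tif", "place1_place2_128017000_gw_cl_mask.tif"):
    matched = pattern.match(p)
    if matched:
        results.append(matched.group(1))
print(results)
```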
Here's a possible solution without regex (unoptimized!):
def extract(s):
    parts = []
    for x in s.split('_'):
        # Stop at the first field that consists of exactly 9 digits
        if x.isdigit() and len(x) == 9:
            return '_'.join(parts)
        # Keep the field as a string (concatenating an int would raise TypeError)
        parts.append(x)
    return None  # no 9-digit field found
tmp = 'place1_128017000_gw_cl_mask.tif'
tmp2 = 'place1_place2_128017000_gw_cl_mask.tif'
print(extract(tmp)) # place1
print(extract(tmp2)) # place1_place2

The problem of regex strings containing special characters in python

I have a string: "s = string.charAt (0) == 'd'"
I want to retrieve a tuple of ('0', 'd')
I have used: re.search(r "\ ((. ?) \) == '(.?)' && "," string.charAt (0) == 'd' ")
I checked the s variable when printed as "\\ ((.?) \\) == '(.?) '&& "
How do I fix it?
You may try:
\((\d+)\).*?'(\w+)'
Explanation of the above regex:
\( - Matches a ( literally.
(\d+) - Represents the first capturing group matching digits one or more times.
\) - Matches a ) literally.
.*? - Lazily matches everything except a new-line.
'(\w+)' - Represents second capturing group matching ' along with any word character([0-9a-zA-Z_]) one or more times.
import re
regex = r"\((\d+)\).*?'(\w+)'"
test_str = "s = string.charAt (0) == 'd'"
print(re.findall(regex, test_str))
# Output: [('0', 'd')]
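findall returns a list of tuples. If a single tuple is wanted, as in the question, one option is re.search followed by groups(). A minimal sketch using the same pattern:

```python
import re

regex = r"\((\d+)\).*?'(\w+)'"
test_str = "s = string.charAt (0) == 'd'"

m = re.search(regex, test_str)
# groups() returns all capture groups of the first match as one tuple
result = m.groups() if m else None
print(result)  # ('0', 'd')
```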
Your regular expression should be ".*\((.?)\) .* '(.?)'". This captures the character inside the parentheses and then the character inside the single quotes.
>>> import re
>>> s = " s = string.charAt (0) == 'd'"
>>> m = re.search(r".*\((.?)\) .* '(.?)'", s)
>>> m.groups()
('0', 'd')
Use
\((.*?)\)\s*==\s*'(.*?)'
The first variable is captured in Group 1 and the second variable in Group 2.
Python code:
import re
string = "s = string.charAt (0) == 'd'"
match_data = re.search(r"\((.*?)\)\s*==\s*'(.*?)'", string)
if match_data:
    print(f"Var#1 = {match_data.group(1)}\nVar#2 = {match_data.group(2)}")
Output:
Var#1 = 0
Var#2 = d
Thanks everyone for the very helpful answer. My problem has been solved ^^

How to use regex to only keep first n repeated words

If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and what I want to do is keep only the first three copies of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with re or regex module in python?
One option could be to use a capturing group with a backreference and use that in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
( Capture group 1
(\w+) capture group 2, match 1+ word chars (The example data only uses word characters. To make sure they are no part of a larger word use a word boundary \b)
(?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs or use \s+ to match 1+ whitespace chars. (Note that that would also match a newline)
) Close group 1
(?: \2)* Match 0+ times a space and a backreference to group 2 to match the same words that you want to remove
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard
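The same idea can be generalized to any n and combined with the word boundaries mentioned in the explanation. This is a sketch only; the helper name keep_first_n is hypothetical, and it uses \s+ between words, so runs split across newlines are collapsed too:

```python
import re

def keep_first_n(text, n=3):
    """Collapse runs of a repeated word to at most n copies (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1 holds the first n copies; the trailing group matches the extras.
    pattern = r"\b((\w+)(?:\s+\2\b){%d})(?:\s+\2\b)+" % (n - 1)
    return re.sub(pattern, r"\1", text)

print(keep_first_n("ok ok, it is very very very very very hard"))
```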
You can group a word and use a backreference to it to ensure that it occurs more than 3 times in a row (here input is the string from the question):
import re
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', input))
This outputs:
ok ok, it is very very very hard
One solution with re.sub with custom function:
import re
s = 'ok ok, it is very very very very very hard'
def replace(n=3):
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
            last_word = current_word
        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word
replacer = replace()
next(replacer)  # prime the generator so it can receive words via send()
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard

RegEx for matching capital letters and numbers

Hi, I have a large corpus that I parse to extract patterns.
How can I extract all patterns like AP70, ML71, GR55, etc.,
and all sequences of words that each start with a capital letter, like: Hello Little Monkey, How Are You, etc.?
For the first case I wrote this regexp, but it doesn't return all matches:
>>> p = re.compile("[A-Z]+[0-9]+")
>>> res = p.search("aze azeaz GR55 AP1 PM89")
>>> res
<re.Match object; span=(10, 14), match='GR55'>
and for the second one:
>>> s = re.compile(r"[A-Z]+[a-z]+\s[A-Z]+[a-z]+\s[A-Z]+[a-z]+")
>>> resu = s.search("this is a test string, Hello Little Monkey, How Are You ?")
>>> resu
<re.Match object; span=(23, 42), match='Hello Little Monkey'>
>>> resu.group()
'Hello Little Monkey'
It seems to work, but I want to get all matches when parsing a whole 'big' line.
Try these two regexes
(for safety, they are enclosed by whitespace/comma boundaries):
>>> import re
>>> teststr = "aze azeaz GR55 AP1 PM89"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])", teststr)
>>> print(res)
['GR55', 'AP1', 'PM89']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [0-9]+
(?! [^\s,] )
and
>>> import re
>>> teststr = "this is a test string, ,Hello Little Monkey, How Are You ?"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+){1,}(?![^\s,])", teststr)
>>> print(res)
['Hello Little Monkey', 'How Are You']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [a-z]+
(?: \s [A-Z]+ [a-z]+ ){1,}
(?! [^\s,] )
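The main difference from the attempt in the question is search versus findall: search stops at the first match, while findall and finditer return every non-overlapping match. A short sketch using the first pattern above:

```python
import re

teststr = "aze azeaz GR55 AP1 PM89"
pattern = re.compile(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])")

first = pattern.search(teststr)          # only the first match
all_matches = pattern.findall(teststr)   # every non-overlapping match
# finditer yields match objects, which also carry positions
spans = [(m.group(), m.span()) for m in pattern.finditer(teststr)]

print(first.group())
print(all_matches)
print(spans)
```

finditer is the one to reach for when the position of each match in a long line matters.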
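This expression might help you to do so, or to design one. It seems you want each match to contain at least one [A-Z] and at least one [0-9]; the two lookaheads below check that the match starts with a capital letter and that a digit appears somewhere later in the string: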
(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)
Example Code:
This code shows how the expression would work in Python:
# -*- coding: UTF-8 -*-
import re
string = "aze azeaz GR55 AP1 PM89"
expression = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Example Output
YAAAY! "GR55" is a match 💚💚💚
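One caveat with this expression: (?=.+[0-9]) only checks that a digit occurs somewhere later in the string, so an all-capitals word before a code is matched as well. The sketch below compares it with a stricter alternative; the stricter pattern and the sample text are assumptions for illustration, not part of the answer:

```python
import re

loose = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
# Assumption: wanted tokens are capital letters followed by digits (AP70, GR55).
strict = r'\b[A-Z]+[0-9]+\b'

text = "HELLO GR55 AP1"
loose_matches = re.findall(loose, text)
strict_matches = re.findall(strict, text)
print(loose_matches)
print(strict_matches)
```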
Performance
This JavaScript snippet measures the performance of the expression with a simple loop of about 1 million iterations.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
    var string = 'aze azeaz GR55 AP1 PM89';
    var regex = /(.*?)(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)/g;
    var match = string.replace(regex, "$2 ");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Regular expressions in python

I have the following string
WA2ąą-02 -7+12,7. PP-.5P x0.6 words
and I need to count the words, count the numbers, and sum all the numbers using regular expressions.
Words:
WA2ąą-02
-7+12,7.
PP-.5P
x0.6
words
Numbers:
2
-2
-7
12
7
-0.5
0.6
Sum of numbers should be 12.1.
I wrote this code, and only word count works well:
import re
string = "WA2ąą-02 -7+12.7. PP-.5P x0.6 word"
# regular expressions
regex1 = r'\S+'
regex2 = r'-?\b\d+(?:[,\.]\d*)?\b'
count_words = len(re.findall(regex1, string))
count_numbers = len(re.findall(regex2, string))
sum_numbers = sum([float(i) for i in re.findall(regex2, string)])
print("\n")
print("String:", string)
print("\n")
print("Count words:", count_words)
print("Count numbers:", count_numbers)
print("Sum numbers:", sum_numbers)
print("\n")
input("Press enter to exit")
Output:
Count words: 5
Count numbers: 4
Sum numbers: 9.7
I think your regex1 is good to go, it's simple enough.
regex2 = r'[-+]?\d*\.?\d+'
This seems to do the trick (but it's easy to miss edge cases with regex). It matches an optional - or +, followed by any number of digits, followed by an optional ., then at least one digit.
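To check this against the expected values in the question, here is a small sketch that applies both patterns to the question's string (with the comma in 12,7). round() is used only to hide binary floating-point noise in the sum:

```python
import re

string = "WA2ąą-02 -7+12,7. PP-.5P x0.6 words"

words = re.findall(r'\S+', string)
numbers = re.findall(r'[-+]?\d*\.?\d+', string)
total = round(sum(float(n) for n in numbers), 10)

print("Count words:", len(words))
print("Numbers:", numbers)
print("Sum numbers:", total)
```

float() accepts forms such as '-02', '+12' and '-.5', so the matches can be converted directly.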
The following regex seems to work fine
([-+]?[\.]?(?=\d)(?:\d*)(?:\.\d+)?)
Python Code
import re
p = re.compile(r'([-+]?[\.]?(?=\d)(?:\d*)(?:\.\d+)?)')
test_str = u"WA2ąą-02 -7+12,7. PP-.5P x0.6 words"
print(sum([float(x) for x in re.findall(p, test_str)]))
UPDATE FOR HEX
The following regex seems to work (assuming hex numbers in the string do not contain a decimal point)
([-+]?)(?:0?x)([0-9A-Fa-f]+)
Python Code
import re
p = re.compile(r'([-+]?)(?:0?x)([0-9A-Fa-f]+)')
test_str = u"WA2ąą-02 -7+12,7. -0x1AEfPq PP-.5P 0x1AEf +0x1AEf x0.6 words"
for x in re.findall(p, test_str):
    # x is a (sign, hex digits) tuple; join them and parse as base 16
    tmp = x[0] + x[1]
    print(int(tmp, 16))
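Note that the optional 0 in 0?x also lets the x0 in x0.6 match, which yields an extra 0. If only 0x-prefixed tokens should count as hex (an assumption, since the original answer deliberately allows a bare x), a stricter variant could look like this:

```python
import re

test_str = "WA2ąą-02 -7+12,7. -0x1AEfPq PP-.5P 0x1AEf +0x1AEf x0.6 words"

# Assumption: only 0x-prefixed tokens are hex, so the "x0" in "x0.6" is skipped.
p = re.compile(r'([-+]?)0x([0-9A-Fa-f]+)')
# int() accepts a leading sign, so sign and digits can be joined before parsing
values = [int(sign + digits, 16) for sign, digits in p.findall(test_str)]
print(values)
```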
If there is any issue, feel free to comment
