Regular expressions in python

I have the following string:
WA2ąą-02 -7+12,7. PP-.5P x0.6 words
and I need to count the words, count the numbers, and compute the sum of all the numbers using regular expressions.
Words:
WA2ąą-02
-7+12,7.
PP-.5P
x0.6
words
Numbers:
2
-2
-7
12
7
-0.5
0.6
Sum of numbers should be 12.1.
I wrote this code, but only the word count works correctly:
import re
string = "WA2ąą-02 -7+12.7. PP-.5P x0.6 word"
# regular expressions
regex1 = r'\S+'
regex2 = r'-?\b\d+(?:[,\.]\d*)?\b'
count_words = len(re.findall(regex1, string))
count_numbers = len(re.findall(regex2, string))
sum_numbers = sum([float(i) for i in re.findall(regex2, string)])
print("\n")
print("String:", string)
print("\n")
print("Count words:", count_words)
print("Count numbers:", count_numbers)
print("Sum numbers:", sum_numbers)
print("\n")
input("Press enter to exit")
Output:
Count words: 5
Count numbers: 4
Sum numbers: 9.7

I think your regex1 is good to go; it's simple enough. For the numbers, try:
regex2 = r'[-+]?\d*\.?\d+'
This seems to do the trick (but it's easy to miss edge cases with regex). It matches an optional - or +, followed by any number of digits, followed by an optional ., and then at least one digit.
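As a rough check of that pattern, the sketch below runs it against the string from the question and sums the matches. The names NUMBER_RE and extract_numbers are made up for this illustration, and the approach relies on float() accepting strings such as "-.5", "+12" and "-02".

```python
import re

# Pattern from the answer: optional sign, optional integer part,
# optional decimal point, then at least one digit.
NUMBER_RE = re.compile(r'[-+]?\d*\.?\d+')

def extract_numbers(text):
    """Return every match of NUMBER_RE in text, converted to float."""
    return [float(token) for token in NUMBER_RE.findall(text)]

text = "WA2ąą-02 -7+12,7. PP-.5P x0.6 words"
numbers = extract_numbers(text)

print("Count words:", len(text.split()))
print("Count numbers:", len(numbers))
# Rounded only to hide binary floating-point noise in the printed sum.
print("Sum numbers:", round(sum(numbers), 10))
```

On this input the pattern finds the seven numbers listed in the question. Other inputs (thousands separators, exponents, and so on) may still need a different pattern.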

The following regex seems to work fine:
([-+]?[\.]?(?=\d)(?:\d*)(?:\.\d+)?)
Python code:
import re
p = re.compile(r'([-+]?[\.]?(?=\d)(?:\d*)(?:\.\d+)?)')
test_str = "WA2ąą-02 -7+12,7. PP-.5P x0.6 words"
# Convert every match to float and add them up
print(sum(float(x) for x in p.findall(test_str)))
UPDATE FOR HEX
The following regex seems to work (assuming the hex numbers in the string have no fractional part):
([-+]?)(?:0?x)([0-9A-Fa-f]+)
Python code:
import re
p = re.compile(r'([-+]?)(?:0?x)([0-9A-Fa-f]+)')
test_str = "WA2ąą-02 -7+12,7. -0x1AEfPq PP-.5P 0x1AEf +0x1AEf x0.6 words"
# Note: because the leading 0 is optional, the "x0" in "x0.6" also matches (as 0)
for sign, digits in p.findall(test_str):
    # Rejoin the optional sign with the hex digits before converting
    print(int(sign + digits, 16))
If there is any issue, feel free to comment

Related

Extract substring from a python string

I want to extract the string before the 9-digit number below:
tmp = "place1_128017000_gw_cl_mask.tif"
The output should be place1.
I could do this:
tmp.split('_')[0]
but I also want the solution to work for:
tmp = "place1_place2_128017000_gw_cl_mask.tif"
where the result would be:
place1_place2
You can assume that the number will always be 9 digits long.

Using regular expressions with a lookahead, this is a simple solution:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'.+(?=_\d{9}_)', tmp)
print(m.group())
Result:
place1_place2
Note that \d{9} matches exactly 9 digits. The part of the regex inside (?= ... ) is a lookahead: it is not included in the match, but the match only succeeds if the lookahead pattern follows it.
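To see the lookahead at work on both file names from the question, here is a minimal sketch. The helper name prefix_before_number and the third, non-matching file name are invented for the example.

```python
import re

# Everything before an underscore that is followed by nine digits and
# another underscore; the lookahead keeps "_<digits>_" out of the match.
PREFIX_RE = re.compile(r'.+(?=_\d{9}_)')

def prefix_before_number(filename):
    """Return the part before the 9-digit field, or None if there is none."""
    m = PREFIX_RE.search(filename)
    return m.group() if m else None

for name in ("place1_128017000_gw_cl_mask.tif",
             "place1_place2_128017000_gw_cl_mask.tif",
             "no_number_here.tif"):
    print(name, "->", prefix_before_number(name))
```

Returning None instead of calling m.group() directly avoids an AttributeError when a name has no 9-digit field.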

Assuming we can phrase your problem as wanting the substring up to, but not including, the underscore that is followed by a run of digits, we can try:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'^([^_]+(?:_[^_]+)*)_\d+_', tmp)
print(m.group(1)) # place1_place2

Use a regular expression:
import re
places = (
    "place1_128017000_gw_cl_mask.tif",
    "place1_place2_128017000_gw_cl_mask.tif",
)
pattern = re.compile(r"(place\d+(?:_place\d+)*)_\d{9}")
for p in places:
    matched = pattern.match(p)
    if matched:
        print(matched.group(1))
prints:
place1
place1_place2
The regex works like this (adjust as needed, e.g., for fewer than 9 digits or a variable number of digits):
( starts a capture group
place\d+ matches "place" followed by one or more digits
(?: starts a group but does not capture it (no need to capture)
_place\d+ matches an underscore and a further "place" part
) closes the group
* repeats the previous group zero or more times
) closes the capture group
_\d{9} matches an underscore followed by 9 digits
The result is in the first (and only) capture group.
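The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that option; the name PLACE_RE is made up here.

```python
import re

PLACE_RE = re.compile(r"""
    (                  # start of the capture group
      place\d+         # the word place followed by one or more digits
      (?:_place\d+)*   # zero or more further _placeN parts, not captured
    )                  # end of the capture group
    _\d{9}             # an underscore followed by nine digits
""", re.VERBOSE)

m = PLACE_RE.match("place1_place2_128017000_gw_cl_mask.tif")
print(m.group(1) if m else None)
```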

Here's a possible solution without regex (unoptimized!):
def extract(s):
    parts = []
    for x in s.split('_'):
        # Stop at the first field that consists of exactly 9 digits
        if x.isdigit() and len(x) == 9:
            return '_'.join(parts)
        parts.append(x)
    return None  # no 9-digit field found
tmp = 'place1_128017000_gw_cl_mask.tif'
tmp2 = 'place1_place2_128017000_gw_cl_mask.tif'
print(extract(tmp)) # place1
print(extract(tmp2)) # place1_place2

Finding position of number in string

I would like to separate the letters from the numbers like this:
inp = "AE123"
p = # position where the number starts, in this case 2
I've already tried to use str.find(), but it has a limit of 3.

Extracting the letters and the digits
If the goal is to extract both the letters and the digits, regular expressions can solve the problem directly, without the need for indices or slices:
>>> import re
>>> inp = "AE123"
>>> re.match(r'([A-Za-z]+)(\d+)', inp).groups()
('AE', '123')
Finding the position of the number
If needed, regular expressions can also locate the indices for the match.
>>> import re
>>> inp = "AE123"
>>> mo = re.search(r'\d+', inp)
>>> mo.span()
(2, 5)
>>> inp[2 : 5]
'123'
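If only the starting index is needed, the match object's start() method gives it directly. The sketch below wraps that in a small helper; the function name first_digit_index and the -1 return value for "no digit" are choices made for this example.

```python
import re

def first_digit_index(s):
    """Return the index of the first digit in s, or -1 if there is none."""
    mo = re.search(r'\d', s)
    return mo.start() if mo else -1

inp = "AE123"
p = first_digit_index(inp)
# Slice at p to separate the letters from the digits.
letters, digits = inp[:p], inp[p:]
print(p, letters, digits)
```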

You can run a loop that checks for digits:
for p, c in enumerate(inp):
    if c.isdigit():
        break
print(p)
Find out more about str.isdigit

This should work:
for i in range(len(inp)):
    if inp[i].isdigit():
        p = i
        break

# Assuming all letters come before the first numeral, as mentioned in the question
def findWhereNoStart(string):
    start_index = -1
    for char in string:
        start_index += 1
        if char.isdigit():
            return string[start_index:]
    return "NO NUMERALS IN THE GIVEN STRING"
# TEST
print(findWhereNoStart("ASDFG"))
print(findWhereNoStart("ASDFG13213"))
print(findWhereNoStart("ASDFG1"))
# OUTPUT
"""
NO NUMERALS IN THE GIVEN STRING
13213
1
"""

Match if string starts with n digits and no more

I have strings like 6202_52_55_1959.txt.
I want to match those which start with 3 digits and no more.
So 6202_52_55_1959.txt should not match, but 620_52_55_1959.txt should.
import re
regexp = re.compile(r'^\d{3}')
file = r'6202_52_55_1959.txt'
print(regexp.search(file))
<re.Match object; span=(0, 3), match='620'> # I don't want this example to match
How can I get it to match only if there are three digits and no more following?

Use a negative lookahead:
regexp = re.compile(r'^\d{3}(?!\d)')
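A short sketch of how that pattern could be used to filter a list of file names. The helper name and the sample list are invented for the illustration.

```python
import re

# Three digits at the start, not followed by a fourth digit.
regexp = re.compile(r'^\d{3}(?!\d)')

def starts_with_exactly_three_digits(name):
    """True if name begins with three digits and the next character is not a digit."""
    return regexp.search(name) is not None

files = ["6202_52_55_1959.txt", "620_52_55_1959.txt"]
matching = [f for f in files if starts_with_exactly_three_digits(f)]
print(matching)
```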

Use the pattern ^\d{3}(?=\D):
import re
inp = ["6202_52_55_1959.txt", "620_52_55_1959.txt"]
for i in inp:
    if re.search(r'^\d{3}(?=\D)', i):
        print("MATCH: " + i)
    else:
        print("NO MATCH: " + i)
This prints:
NO MATCH: 6202_52_55_1959.txt
MATCH: 620_52_55_1959.txt
The regex pattern used here says to match:
^ from the start of the string
\d{3} 3 digits
(?=\D) then assert that the next character is a non-digit (a character must be present, so a string that ends right after the three digits does not match; use (?!\d) if that case should match too)
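One difference between the two lookaheads is worth checking: (?=\D) needs an actual non-digit character to follow, whereas (?!\d) only requires that no digit follows, so it also succeeds at the end of the string. A small comparison, with variable names and sample strings chosen for this sketch:

```python
import re

lookahead_non_digit = re.compile(r'^\d{3}(?=\D)')  # needs a following non-digit
negative_lookahead = re.compile(r'^\d{3}(?!\d)')   # only forbids a following digit

for s in ("620_52.txt", "6202_52.txt", "620"):
    print(s,
          bool(lookahead_non_digit.search(s)),
          bool(negative_lookahead.search(s)))
```

The two patterns agree on the first two strings and differ only on the bare "620".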

How to extract words from a sentence and check if a word is not there in it

This is my code below. I need to take a sentence as input from the user and check whether it contains any numbers from 0 to 10. I know there are many ways to approach it, e.g. split(), isalnum(), etc., but I need help putting it all together:
sentence1 = input("Enter any sentence (it may include numbers): ")
numbers = ["1","2","3","4","5","6","7","8","9","10","0"]
ss1 = sentence1.split()
print(ss1)
if numbers in ss1:
    print("There are numbers between 0 to 10 in the sentece")
else:
    print("There are no numbers in the sentence")
Thanks :)
Edit: Expected Output should be like:
Enter any sentence (it may include numbers): I am 10 years old
There are numbers between 0 and 10 in this sentence

I would use a regex:
import re
def contains_nums(s):
    m = re.search(r'\d+', s)
    if m:
        return f'"{s}" contains {m.group()}'
    else:
        return f'"{s}" contains no numbers'
Examples:
>>> contains_nums('abc 10')
'"abc 10" contains 10'
>>> contains_nums('abc def')
'"abc def" contains no numbers'
NB. This only reports the first number; use re.findall if you need all of them. It also finds numbers inside words; if you want standalone numbers only, use a regex with word boundaries (\b\d+\b). Finally, if you want to restrict the match to the numbers 0-10, use (?<!\d)(?:10|\d)(?!\d) (or \b(?<!\d)(?:10|\d)(?!\d)\b for standalone numbers). A pattern such as 1?\d is not enough here, because it also accepts 11 to 19.
A more complete solution:
import re
def contains_nums(s, standalone_number=False, zero_to_ten=False):
    # (?:10|\d) accepts only 0-9 or 10; the lookarounds reject longer digit runs
    regex = r'(?<!\d)(?:10|\d)(?!\d)' if zero_to_ten else r'\d+'
    if standalone_number:
        regex = r'\b%s\b' % regex
    m = re.search(regex, s)
    if m:
        return f'"{s}" contains {m.group()}'
    else:
        a = "standalone " if standalone_number else ""
        b = "0-10 " if zero_to_ten else ""
        return f'"{s}" contains no {a}{b}numbers'
>>> contains_nums('abc100', standalone_number=False, zero_to_ten=False)
'"abc100" contains 100'
>>> contains_nums('abc100', standalone_number=True, zero_to_ten=False)
'"abc100" contains no standalone numbers'
>>> contains_nums('abc 100', standalone_number=True, zero_to_ten=True)
'"abc 100" contains no standalone 0-10 numbers'
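To compare the variants mentioned in this answer side by side, the sketch below uses re.findall so that every match is listed, not just the first. The sample sentence and variable names are invented, and the 0-10 pattern shown is the \b(?:10|\d)\b form, which is one possible way to express that range.

```python
import re

text = "abc100 has 7 items and 10 boxes, 19 total"

# Any run of digits, including digits inside words.
all_numbers = re.findall(r'\d+', text)
# Only digit runs that stand alone between word boundaries.
standalone = re.findall(r'\b\d+\b', text)
# Only standalone numbers in the range 0-10.
zero_to_ten = re.findall(r'\b(?:10|\d)\b', text)

print(all_numbers)
print(standalone)
print(zero_to_ten)
```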

Try this solution:
sentence1 = input("Enter any sentence (it may include numbers): ")
numbers = ["1","2","3","4","5","6","7","8","9","10","0"]
verif = False
for n in numbers:
    if n in sentence1:
        verif = True
        break
if verif:
    print("There are numbers between 0 to 10 in the sentence")
else:
    print('There are no numbers in the sentence')
output:
Enter any sentence (it may include numbers): gdrsgrg8
There are numbers between 0 to 10 in the sentence

To make your program work with minimal changes, you could use sets:
sentence1 = input("Enter any sentence (it may include numbers): ")
numbers = {"1","2","3","4","5","6","7","8","9","10","0"}
ss1 = sentence1.split()
print(ss1)
if numbers.intersection(set(ss1)):
    print("There are numbers between 0 to 10 in the sentece")
else:
    print("There are no numbers in the sentence")
Output (with number):
Enter any sentence (it may include numbers): There is 1 number
['There', 'is', '1', 'number']
There are numbers between 0 to 10 in the sentece
Output (without number):
Enter any sentence (it may include numbers): There are no numbers
['There', 'are', 'no', 'numbers']
There are no numbers in the sentence
However, this will only work if the numbers to be detected are separated by spaces. Then again, this is almost identical to your solution, except that it uses a set.
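One way around the space-separation limit, as a sketch rather than a drop-in fix, is to collect the standalone digit runs with a regex and intersect those with the allowed set. The names ALLOWED and numbers_in_range are hypothetical.

```python
import re

# The strings "0" through "10".
ALLOWED = {str(n) for n in range(11)}

def numbers_in_range(sentence):
    """Return the set of standalone 0-10 numbers found in the sentence."""
    # \b\d+\b picks up digit runs even when punctuation follows them.
    tokens = set(re.findall(r'\b\d+\b', sentence))
    return tokens & ALLOWED

print(numbers_in_range("I am 10, not 19."))
print(numbers_in_range("There are no numbers"))
```

With split() alone, the token "10," would not equal "10", so the intersection would come back empty for the first sentence.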

import re
def check_numbers(sentence):
    # True if the sentence contains at least one digit
    return bool(re.search(r'\d', sentence))
sentence = input("Enter a sentence: ")
print(check_numbers(sentence))
Input: Hello World ... Output: False
Input: Hello 1 World ... Output: True

One way is to check the intersection of the set of numbers from 0 to 10 and the set of words in the sentence. If it is not empty, the sentence contains such numbers:
sentence1 = input("Enter any sentence (it may include numbers): ")
numbers = {"1","2","3","4","5","6","7","8","9","10","0"}
ss1 = set(sentence1.split())
print(ss1)
if ss1.intersection(numbers):
    print("There are numbers between 0 to 10 in the sentece")
else:
    print("There are no numbers in the sentence")

You can use the pattern \b(?:10|\d)\b, which matches either 10 or a single digit 0-9 between word boundaries (\b) to prevent a partial match inside a longer number or word.
Note that the message in your question, print("There are no numbers in the sentence"), is not entirely correct. If the string contains only 19, there is a number, but it will not match because it is not in the range 0-10.
import re
sentence1 = input("Enter any sentence (it may include numbers): ")
if re.search(r"\b(?:10|\d)\b", sentence1):
    print("There are numbers between 0 to 10 in the sentece")
else:
    print("There are no numbers between 0 to 10 in the sentence")

sentence1 = input("Enter any sentence (it may include numbers): ")
numbers = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0"]
ss1 = sentence1.split()
for x in numbers:
    if x in ss1:
        print(x, "belongs to the sentence")
    else:
        print(x, "does not belong to the sentence")

How to extract a Double from a string in Python?

At the moment I have this but it only works for integers (whole numbers) not doubles:
S = "Weight is 3.5 KG"
weight = [int(i) for i in S.split() if i.isdigit()]
print(weight)
result: []

You can use a regular expression to extract the floating-point number:
import re
S = "Weight is 3.5 KG"
pattern = re.compile(r'\-?\d+\.\d+')
weights = list(map(float, re.findall(pattern, S)))
print(weights)
re.findall() returns the list of numbers found in the text, as strings.
The map function converts each result to a floating-point number. In Python 3 it returns a lazy iterator, so you need to convert it to a list.
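The pattern above only matches numbers that contain a decimal point. If whole numbers should be picked up as well, one possible variation is to make the fractional part optional. This is an assumption about what is wanted, not part of the original answer, and the names NUMBER_RE and extract_floats are made up.

```python
import re

# Assumed extension: the fractional part is optional,
# so both "3" and "3.5" match.
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')

def extract_floats(text):
    """Return all integers and decimals found in text as floats."""
    return [float(token) for token in NUMBER_RE.findall(text)]

print(extract_floats("Weight is 3.5 KG"))
print(extract_floats("from -2 to 10.25 kg"))
```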

The following code will do the job for the example you gave:
if __name__ == "__main__":
    S = "Weight is 3.5 KG"
    # search for the dot (.)
    t = S.find('.')
    # check that the dot (.) exists in the string and
    # make sure it is not the last character
    # of the string
    if t > 0 and t+1 != len(S):
        # check the characters before and after
        # the dot (.)
        t_before = S[t-1]
        t_after = S[t+1]
        # check that the characters before and after the
        # dot (.) are digits
        if t_before.isdigit() and t_after.isdigit():
            # split the string
            S_split = S.split()
            for x in S_split:
                if '.' in x:
                    print(float(x))
Output:
3.5

You can use re:
import re
S = "Weight is 3.5 KG"
print(re.findall(r'\d+\.\d+', S))
# ['3.5']
Using try-except:
new = []
for i in S.split():
    try:
        new.append(float(i))
    except ValueError:
        pass
print(new)
# [3.5]
