I would like to extract all the numbers contained in a string. Which is better suited for the purpose, regular expressions or the isdigit() method?
Example:
line = "hello 12 hi 89"
Result:
[12, 89]
I'd use a regexp :
>>> import re
>>> re.findall(r'\d+', "hello 42 I'm a 32 string 30")
['42', '32', '30']
This would also match 42 from bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b :
>>> re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")
['42', '32', '30']
To end up with a list of numbers instead of a list of strings:
>>> [int(s) for s in re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")]
[42, 32, 30]
NOTE: this does not work for negative integers
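One way to keep the sign is sketched below. It assumes that a minus sign directly in front of standalone digits always means a negative number, which is not true for text like "pages 10 -20"; the name extract_signed_ints is made up for this illustration.

```python
import re

# (?<!\w) rejects digits glued to a preceding letter, digit, or underscore;
# -? then picks up an optional minus sign directly before the digits.
SIGNED_INT = re.compile(r'(?<!\w)-?\d+\b')

def extract_signed_ints(text):
    """Return standalone integers in text, keeping a leading minus sign."""
    return [int(m) for m in SIGNED_INT.findall(text)]

print(extract_signed_ints("he33llo 42 I'm a 32 string -30"))
```

Like the \b version, this skips the 33 inside he33llo, and a string such as "5-3" yields 5 and 3 rather than 5 and -3 because the minus follows a word character.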
If you only want to extract non-negative integers, try the following:
>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]
I would argue that this is better than the regex example because you don't need another module and it's more readable because you don't need to parse (and learn) the regex mini-language.
This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, jmnas's answer below will do the trick.
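One more limitation worth knowing: str.isdigit() also returns True for characters such as the superscript two, which int() cannot convert. A minimal sketch of the same comprehension using str.isdecimal() instead, which only accepts characters that int() can parse (the function name is hypothetical):

```python
def extract_non_negative_ints(text):
    """Split on whitespace and keep tokens made only of decimal digits."""
    return [int(token) for token in text.split() if token.isdecimal()]

print(extract_non_negative_ints("h3110 23 cat 444.4 rabbit 11 2 dog"))

# "\u00b2" is the superscript two: isdigit() accepts it, isdecimal() does not,
# so the token is skipped here instead of raising ValueError inside int().
print(extract_non_negative_ints("area 5 m \u00b2"))
```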
This is more than a bit late, but you can extend the regular expression to account for scientific notation too.
import re
# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30',
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog',
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89',
       ['12', '89']),
      ('4',
       ['4']),
      ('I like 74,600 commas not,500',
       ['74,600', '500']),
      ('I like bad math 1+2=.001',
       ['1', '+2', '.001'])]
for s, r in ss:
    # raw string, so the backslashes reach the regex engine unchanged
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)
Gives all good!
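The pattern returns strings. A possible follow-up step, sketched here with a hypothetical helper name, strips the thousands separators and converts each match to float. The try/except is there because the pattern can match text that is not a valid float, for example ".5.3".

```python
import re

NUMBER = re.compile(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?")

def extract_floats(text):
    """Find number-like substrings and convert the ones float() accepts."""
    values = []
    for token in NUMBER.findall(text):
        try:
            values.append(float(token.replace(',', '')))
        except ValueError:
            # The pattern matched something like ".5.3"; skip it.
            pass
    return values

print(extract_floats("I like 74,600 commas not,500"))
print(extract_floats("I like bad math 1+2=.001"))
```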
Additionally, you can look at the AWS Glue built-in regex
If you know there will be only one number in the string, e.g. 'hello 12 hi', you can try filter.
For example:
In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23
But be careful, because separate numbers get glued together:
In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005
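If silently merging digits is not acceptable, one option is to check that the string really contains a single run of digits before converting. This is a sketch that uses re rather than filter, and extract_single_int is a name invented for the example:

```python
import re

def extract_single_int(text):
    """Return the only integer in text, or raise if there is not exactly one."""
    runs = re.findall(r'\d+', text)
    if len(runs) != 1:
        raise ValueError(f"expected exactly one number, found {len(runs)}: {runs}")
    return int(runs[0])

print(extract_single_int('200 grams'))
```

Calling it with '200 grams 5' raises ValueError instead of returning 2005.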
I'm assuming you want floats not just integers so I'd do something like this:
l = []
for t in s.split():
    try:
        l.append(float(t))
    except ValueError:
        pass
Note that some of the other solutions posted here don't work with negative numbers:
>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']
>>> '-3'.isdigit()
False
To catch different patterns it is helpful to query with different patterns.
Set up a pattern for each number format of interest:
To find commas, e.g. 12,300 or 12,300.00
r'[\d]+[.,\d]+'
To find floats, e.g. 0.123 or .123
r'[\d]*[.][\d]+'
To find integers, e.g. 123
r'[\d]+'
Combine them with a pipe ( | ) into one pattern with multiple alternatives.
(Note: put the complex patterns first; otherwise the simple patterns will match chunks of a number before the complex pattern gets a chance to match all of it.)
p = r'[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+'
Below, we confirm that the pattern is present with re.search(), then iterate over the matches with re.finditer(). Each item is a match object, and catch[0] returns the full matched text.
s = 'he33llo 42 I\'m a 32 string 30 444.4 12,001'
if re.search(p, s) is not None:
    for catch in re.finditer(p, s):
        print(catch[0])  # catch is a match object
Returns:
33
42
32
30
444.4
12,001
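The ordering note can be checked with a small experiment. The two patterns below are simplified versions written for this illustration, not the exact ones above; they differ only in which alternative comes first.

```python
import re

simple_first = r'\d+|\d*\.\d+'    # integer alternative tried first
complex_first = r'\d*\.\d+|\d+'   # float alternative tried first

text = '444.4 and 30'

# The integer alternative wins at position 0 and splits the float in two.
print(re.findall(simple_first, text))

# With the float alternative first, the whole number is returned.
print(re.findall(complex_first, text))
```

The regex engine takes the first alternative that matches at a given position, not the longest one, which is why the order matters.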
I was looking for a way to strip formatting masks from strings, specifically Brazilian phone numbers. This post didn't answer that directly, but it inspired me. This is my solution:
>>> phone_number = '+55(11)8715-9877'
>>> ''.join([n for n in phone_number if n.isdigit()])
'551187159877'
# extract numbers from garbage string:
s = '12//n,_##$%3.14kjlw0xdadfackvj1.6e-19&*ghn334'
newstr = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in s)
listOfNumbers = [float(i) for i in newstr.split()]
print(listOfNumbers)
[12.0, 3.14, 0.0, 1.6e-19, 334.0]
For phone numbers you can simply exclude all non-digit characters with \D in regex:
import re
phone_number = "(619) 459-3635"
phone_number = re.sub(r"\D", "", phone_number)
print(phone_number)
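The r in r"\D" marks a raw string literal. Use it for regex patterns. "\D" happens to work without it because \D is not a recognized string escape, but Python emits a DeprecationWarning (a SyntaxWarning since Python 3.12) for such sequences, and a pattern such as "\b" would silently turn into a backspace character.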
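A small sketch of where the r prefix changes the result. It uses \b, because that is an escape Python does recognize in ordinary strings; the variable names are only for illustration.

```python
import re

text = "hello 12 hi 89"

# Without the r prefix, "\b" is a single backspace character (U+0008).
plain = "\b\\d+\b"   # backspace, \d+, backspace
raw = r"\b\d+\b"     # word boundary, \d+, word boundary

print(len("\b"), len(r"\b"))      # one character versus two
print(re.findall(plain, text))    # looks for literal backspaces, finds nothing
print(re.findall(raw, text))      # finds the standalone numbers
```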
The regex approach below also works:
lines = "hello 12 hi 89"
import re
output = []
#repl_str = re.compile(r'\d+.?\d*')
repl_str = re.compile(r'^\d+$')  # the whole word must be digits
#t = r'\d+.?\d*'
line = lines.split()
for word in line:
    match = re.search(repl_str, word)
    if match:
        output.append(float(match.group()))
print(output)
with findall
re.findall(r'\d+', "hello 12 hi 89")
['12', '89']
re.findall(r'\b\d+\b', "hello 12 hi 89 33F AC 777")
['12', '89', '777']
line2 = "hello 12 hi 89"  # this is the given string
temp1 = re.findall(r'\d+', line2)  # find every run of digits with a regular expression
res2 = list(map(int, temp1))
print(res2)
findall returns every run of digits in the string as a list of strings.
In the second step, map(int, ...) converts them to integers and the result is stored in the list res2.
I am just adding this answer because no one added one using Exception handling and because this also works for floats
a = []
line = "abcd 1234 efgh 56.78 ij"
for word in line.split():
    try:
        a.append(float(word))
    except ValueError:
        pass
print(a)
Output :
[1234.0, 56.78]
This answer also contains the case when the number is float in the string
def get_first_nbr_from_str(input_str):
    '''
    :param input_str: string that contains digits and words
    :return: the first number extracted from input_str (0 if there is none)
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    # "or", not "and": reject anything that is empty or not a string
    if not input_str or not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            break
    # float('') and float('.') would raise ValueError
    if out_number in ('', '.'):
        return 0
    return float(out_number)
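For comparison, the same "first number, possibly a float" idea can be written with re.search. This is a sketch rather than a drop-in replacement: the function name is hypothetical, and it returns None instead of 0 when nothing is found.

```python
import re

# Optional integer part, optional dot, then at least one digit.
FIRST_NUMBER = re.compile(r'\d*\.?\d+')

def first_number(text):
    """Return the first integer or decimal number in text as a float, or None."""
    match = FIRST_NUMBER.search(text)
    return float(match.group()) if match else None

print(first_number('ab324.23.123xyz'))
print(first_number('.5abc44'))
```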
I am amazed to see that no one has yet mentioned the usage of itertools.groupby as an alternative to achieve this.
You may use itertools.groupby() along with str.isdigit() in order to extract numbers from string as:
from itertools import groupby
my_str = "hello 12 hi 89"
l = [int(''.join(i)) for is_digit, i in groupby(my_str, str.isdigit) if is_digit]
The value held by l will be:
[12, 89]
PS: This is just to illustrate that groupby can also be used for this. It is not a recommended solution. If you want to do this in practice, use the accepted answer by fmark, which is based on a list comprehension with str.isdigit as the filter.
The cleanest way I found:
>>> data = 'hs122 125 &55,58, 25'
>>> new_data = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in data)
>>> numbers = [i for i in new_data.split()]
>>> print(numbers)
['122', '125', '55', '58', '25']
or this:
>>> import re
>>> data = 'hs122 125 &55,58, 25'
>>> numbers = re.findall(r'\d+', data)
>>> print(numbers)
['122', '125', '55', '58', '25']
@jmnas, I liked your answer, but it didn't find floats. I'm working on a script to parse code going to a CNC mill and needed to find both X and Y dimensions, which can be integers or floats, so I adapted your code to the following. This finds ints and floats, both positive and negative. It still doesn't find hex-formatted values, but you could add "x" and "A" through "F" to the num_char tuple and I think it would parse things like '0x23AC'.
s = 'hello X42 I\'m a Y-32.35 string Z30'
xy = ("X", "Y")
num_char = (".", "+", "-")
l = []
tokens = s.split()
for token in tokens:
    if token.startswith(xy):
        num = ""
        for char in token:
            # print(char)
            if char.isdigit() or (char in num_char):
                num = num + char
        try:
            l.append(float(num))
        except ValueError:
            pass
print(l)
Since none of these dealt with the real-world financial numbers in Excel and Word docs that I needed to find, here is my variation. It handles ints, floats, negative numbers, currency numbers (because it doesn't rely on split), and has the option to drop the decimal part and just return ints, or return everything.
It also handles the Indian lakh numbering system, where commas appear irregularly rather than every 3 digits.
It does not handle scientific notation or negative numbers put inside parentheses in budgets -- will appear positive.
It also does not extract dates. There are better ways for finding dates in strings.
import re
def find_numbers(string, ints=True):
    # optional - in front, digits with commas, optional decimal part
    # (\d* instead of [\d{2}]*, which is a character class that also matches braces)
    numexp = re.compile(r'-?\d[\d,]*\.?\d*')
    numbers = numexp.findall(string)
    numbers = [x.replace(',', '') for x in numbers]
    if ints:
        # drop the decimal part and return ints
        return [int(x.split('.')[0]) for x in numbers]
    else:
        return numbers
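The parenthesized negatives mentioned above could be covered with a second alternative in the pattern. The following is only a sketch under the assumption that any amount wrapped in parentheses is negative, as in accounting notation; find_amounts is a hypothetical name and it always returns floats.

```python
import re

# Either a parenthesized amount (group 1) or a plain, optionally negative one (group 2).
AMOUNT = re.compile(r'\((\d[\d,]*\.?\d*)\)|(-?\d[\d,]*\.?\d*)')

def find_amounts(text):
    """Return amounts as floats, treating (1,234) as -1234.0."""
    amounts = []
    for match in AMOUNT.finditer(text):
        in_parens, plain = match.groups()
        if in_parens is not None:
            amounts.append(-float(in_parens.replace(',', '')))
        else:
            amounts.append(float(plain.replace(',', '')))
    return amounts

print(find_amounts("Revenue 1,200.50 and loss (3,400)"))
```

The parenthesized alternative comes first so that "(3,400)" is not picked up as a plain positive 3,400.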
str1 = "There are 2 apples for 4 persons"
# printing original string
print("The original string : " + str1) # The original string : There are 2 apples for 4 persons
# using List comprehension + isdigit() +split()
# getting numbers from string
res = [int(i) for i in str1.split() if i.isdigit()]
print("The numbers list is : " + str(res)) # The numbers list is : [2, 4]
The best option I found is below. It will extract a number and can eliminate any type of char.
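def extract_nbr(input_str):
    if input_str is None or input_str == '':
        return 0
    out_number = ''
    for ele in input_str:
        if ele.isdigit():
            out_number += ele
    # all digits in the string are joined into one number;
    # return 0 when there are none, since float('') would raise ValueError
    return float(out_number) if out_number else 0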
I would like to separate the letters from the numbers like this
inp= "AE123"
p= #position of where the number start in this case "2"
I've already tried to use str.find() but it has a limit of 3
Extracting the letters and the digits
If the goal is to extract both the letters and the digits, regular expressions can solve the problem directly without need for indices or slices:
>>> re.match(r'([A-Za-z]+)(\d+)', inp).groups()
('AE', '123')
Finding the position of the number
If needed, regular expressions can also locate the indices for the match.
>>> import re
>>> inp = "AE123"
>>> mo = re.search(r'\d+', inp)
>>> mo.span()
(2, 5)
>>> inp[2 : 5]
'123'
You can run a loop that checks for digits:
for p, c in enumerate(inp):
    if c.isdigit():
        break
print(p)
Find out more about str.isdigit
This should work:
for i in range(len(inp)):
    if inp[i].isdigit():
        p = i
        break
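Both loops leave the no-digit case open: the first prints the last index, and the second never assigns p. A compact variant that makes that case explicit is sketched below; the function name and the -1 sentinel are choices made for this example.

```python
def first_digit_index(text):
    """Return the index of the first digit in text, or -1 if there is none."""
    return next((i for i, ch in enumerate(text) if ch.isdigit()), -1)

inp = "AE123"
p = first_digit_index(inp)
print(p, inp[:p], inp[p:])
```

Check for -1 before slicing, since inp[:-1] would silently drop the last character.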
#Assuming all characters come before the first numeral as mentioned in the question
def findWhereNoStart(string):
    start_index=-1
    for char in string:
        start_index+=1
        if char.isdigit():
            return string[start_index:]
    return "NO NUMERALS IN THE GIVEN STRING"
#TEST
print(findWhereNoStart("ASDFG"))
print(findWhereNoStart("ASDFG13213"))
print(findWhereNoStart("ASDFG1"))
#OUTPUT
"""
NO NUMERALS IN THE GIVEN STRING
13213
1
"""
I am trying to extract numbers from a string, without any fancy imports like regex and without for or if statements.
Example
495 * 89
Output
495 89
Edit I have tried this:
num1 = int(''.join(filter(str.isdigit, num)))
It works, but doesn't space out the numbers
Actually, regex is a very simple and viable option here:
import re
inp = "495 * 89"
nums = re.findall(r'\d+(?:\.\d+)?', inp)
print(nums) # ['495', '89']
Assuming you always expect integers and you want to avoid regex, you could use a string split approach with a list comprehension:
inp = "495 * 89"
parts = inp.split()
nums = [x for x in parts if x.isdigit()]
print(nums) # ['495', '89']
You can do this without much fancy stuff
s = "495 * 89"
#replace non-digits with spaces, goes into a list of characters
li = [c if c.isdigit() else " " for c in s ]
#join characters back into a string
s_digit_spaces = "".join(li)
#split will separate on space boundaries, multiple spaces count as one
nums = s_digit_spaces.split()
print(nums)
#one-liner:
print ("".join([c if c.isdigit() else " " for c in s ]).split())
output:
['495', '89']
['495', '89']
#and with non-digit number stuff
s = "495.1 * -89"
print ("".join([c if (c.isdigit() or c in ('-',".")) else " " for c in s ]).split())
output:
['495.1', '-89']
Finally, this works too:
print ("".join([c if c in "0123456789+-." else " " for c in s ]).split())
You're close.
You don't want to call int() on a single joined value when there are multiple numbers in the string. The filter function is being applied over characters, since iterating over a string yields its characters.
Instead, you need to first split the string into its individual tokens, then filter for tokens that are entirely digits, then convert each element:
s = "123 * 54"
digits = list(map(int, filter(str.isdigit, s.split())))
Keep in mind, this only handles non-negative integers
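To also accept signed tokens while keeping the same map/filter shape, the predicate can try the conversion itself instead of calling str.isdigit. This is a sketch with a hypothetical helper name. Note that int() is more permissive than isdigit: it also accepts tokens such as '+5' or '1_000'.

```python
def is_int(token):
    """Return True if int() can parse the token."""
    try:
        int(token)
    except ValueError:
        return False
    return True

s = "495 * -89"
digits = list(map(int, filter(is_int, s.split())))
print(digits)
```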
I'm new to Python.
I have this string "[12:3]" and I want to calculate the difference between these two numbers.
Ex: 12 - 3 = 9
Of course I can do something (not very clear) like this:
num1 = []
num2 = []
s = '[12:3]'
dot = 0;
#find the ':' sign
for i in range(len(s)):
    if s[i] == ':' :
        dot = i
#left side
for i in range(dot):
    num1.append(s[i])
#right side
for i in range(len(s) - dot-1):
    num2.append(s[i+dot+1])
return str(int("".join(num1))-int("".join(num2))+1)
But I'm sure there is a clearer and more comprehensible way.
Thanks!
You could use regex to pick the numbers out of your string:
import re
s = '[12:3]'
numbers = [int(x) for x in re.findall(r'\d+', s)]
print(numbers[0] - numbers[1])
Or, without re:
numbers = [int(x) for x in s.strip('[]').split(':')]
print(numbers[0] - numbers[1])
prints
9
You should use regular expressions.
>>> import re
>>> match = re.match(r'\[(\d+):(\d+)\]', '[12:3]')
>>> match.groups()
('12', '3')
>>> a = int(match.groups()[0])
>>> b = int(match.groups()[1])
>>> a - b
9
The regular expression there says "match starting at the beginning of the string, find [, then one or more digits \d+ (and store them), then a :, then one or more digits \d+ (and store them), and finally ]". We then extract the stored digits using .groups() and do arithmetic on them.
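The session above can be wrapped in a small function that also handles input that does not have the expected shape. This is a sketch; the function name and the choice to raise ValueError are assumptions, and re.fullmatch is used so that trailing text after the ] is rejected as well.

```python
import re

BRACKET_PAIR = re.compile(r'\[(\d+):(\d+)\]')

def bracket_difference(text):
    """Return a - b for a string of the form '[a:b]'."""
    match = BRACKET_PAIR.fullmatch(text)
    if match is None:
        raise ValueError(f"not in [a:b] form: {text!r}")
    first, second = map(int, match.groups())
    return first - second

print(bracket_difference('[12:3]'))
```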