Regular expression search shows different result - python

I want to extract the number between > and < using a regular expression in Python 2.7,
i.e. get 1234 from 3213>1234<3213.
But the second print (print(data2)) shows nothing. What is the problem?
I tested the code below on Ubuntu and Windows pydev.
import re
a = "3213>1234<3213"
p = re.compile('>[0-9]*<')
data = p.search(a).group()
print(data)
p2 = re.compile('[0-9]*')
data2 = p2.search(data).group()
print(data2)

The problem is that you get the earliest possible match for [0-9]* in '>1234<', and that's in fact the empty string at the very start of it, before the >.
Besides direct regex solutions, you could also fix yours simply with data2 = data[1:-1].
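To see the empty match directly, here is a minimal check (a sketch; the variable names follow the question). It prints where the second search actually matched, then applies the slicing fix:

```python
import re

data = ">1234<"

# [0-9]* is allowed to match zero digits, so the search succeeds
# immediately at index 0 with an empty match, before the '>'.
m = re.search("[0-9]*", data)
print(m.span())
print(repr(m.group()))

# Slicing off the first and last characters removes '>' and '<'.
data2 = data[1:-1]
print(data2)
```

The span comes out as (0, 0), which is why the question's print(data2) shows an empty line.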

Because you're applying [0-9]* to >1234<, and * matches 0 or more digits.
So it returns an empty string when it tries to find a digit at the first character of the string, which is >.
You can replace re.search() with re.findall() and see what's happening:
import re
a = "3213>1234<3213"
p = re.compile('>[0-9]*<')
data = p.search(a).group()
print(data)
p2 = re.compile('[0-9]*')
data2 = p2.findall(data)
print(data2)
Output:
['', '1234', '', '']
You need to use [0-9]+ instead of [0-9]* here, which matches 1 or more digits, so it skips the > and <:
>>> p2 = re.compile('[0-9]+')
>>> data2 = p2.search(data).group()
>>> print(data2)
1234
You can also drop p2 entirely and capture the digits between > and < with p = re.compile('>([0-9]+)<') and data = p.search(a).group(1), like this:
>>> import re
>>> a = "3213>1234<3213"
>>> p = re.compile('>([0-9]+)<')
>>> data = p.search(a).group(1)
>>> print(data)
1234

>>> string='3213>1234<3213'
>>> re.search(r'(?<=>)[^<]+(?=<)', string).group()
'1234'
(?<=>) is the zero width positive lookbehind pattern ensuring > before the desired match
[^<]+ will select the desired portion i.e. the portion after > till next <, 1234 in this case
(?=<) is the zero width positive lookahead pattern ensuring < after the desired match
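For comparison, the lookaround pattern and a plain capturing group return the same text for this input. A small sketch, assuming the wanted part never contains < itself:

```python
import re

string = "3213>1234<3213"

# Lookbehind/lookahead: '>' and '<' are checked but are not part of the match.
with_lookaround = re.search(r"(?<=>)[^<]+(?=<)", string).group()

# Capturing group: '>' and '<' are consumed, but only group 1 is returned.
with_group = re.search(r">([^<]+)<", string).group(1)

print(with_lookaround)
print(with_group)
```

Which one to use is mostly a matter of taste; the capturing group is usually easier to read, while lookarounds keep the delimiters out of the overall match.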

You can group your search:
>>> import re
>>> a = "3213>1234<3213"
>>> re.findall(r">(\d+)<", a)
['1234']

The regular expression looks for >, then a number, then <, and findall returns a list of the captured matches. You can then iterate over the matches:
import re
a = "3213>1234<3213>5123<"
p = re.compile('>([0-9]+)<')
data = p.findall(a)
for item in data:
    print(item)
Output:
1234
5123

Related

Python regular expression?

import re
pattern = re.compile(r"(\d{3})+$")
print pattern.match("123567").groups()
output result:
('567',)
I need the result to be ('123','567').
The (\d{3}) group only keeps its last repetition, but I want to output every group.
Here is a somewhat Pythonic way to do it.
Solution 1
Python Code
import re
p = re.compile(r'(?<=\d)(?=(?:\d{3})+$)')
test_str = "2890191245"
tmp = [x.start() for x in re.finditer(p, test_str)]
res = [test_str[0: tmp[0]]] + [(test_str[tmp[i]: tmp[i] + 3]) for i in range(len(tmp))]
Solution 2 (one liner)
print(re.sub(r"(?<=\d)(?=(\d{3})+$)", ",", test_str).split(","))
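If the input is known to consist only of complete three-digit groups, as in the question's "123567", a simpler alternative (not the approach of the answer above) may be re.findall. A sketch under that assumption:

```python
import re

s = "123567"

# findall returns every non-overlapping match, so each 3-digit chunk is kept.
# A repeated group such as (\d{3})+ only remembers its last repetition.
groups = tuple(re.findall(r"\d{3}", s))
print(groups)
```

Unlike the lookaround solutions, this does not handle a leading group shorter than three digits; leftover digits at the end would simply be dropped.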

Extract a number from string in python

I want to extract a number from a string like this in Python:
string1 = '154787xs.txt'
I want to get 154787 from there. I am using this:
searchPattern = re.compile(r'\d\d\d\d\d\d(?=xs)')
m = searchPattern.search(string1)
number = m.group()
but I do not get the correct value. Also the number of digits could change...
What am I doing wrong?
You could simply use the pattern below:
searchPattern = re.compile(r'\d+(?=xs)')
Explanation:
\d+ matches one or more digits.
(?=xs) is a lookahead asserting that the characters following the digits are xs.
Code:
>>> import re
>>> searchPattern = re.compile(r'\d+(?=xs)')
>>> m = searchPattern.search(string1)
>>> m
<_sre.SRE_Match object at 0x7f6047f66370>
>>> number = m.group()
>>> number
'154787'
What do you mean when you say you do not get the right value?
Your code does successfully match the string '154787'.
Perhaps you want number to be an int? In that case use:
number = int(m.group())
By the way, the regex could be written as
searchPattern = re.compile(r'(\d+)xs')
m = searchPattern.search(string1)
if m:
    number = int(m.group(1))

Python Regular Expressions; Having 0 match either 0 or 1

I have a shorter string s I'm trying to match to a longer string s1. 1's match 1's, but 0's will match either a 0 or a 1.
For instance:
s = '11111' would match s1 = '11111'
s = '11010' would match s1 = '11111' or '11011' or '11110' or '11010'
I know regular expressions would make this much easier but am confused on where to start.
Replace each instance of 0 with [01] to enable it matching either 0 or 1:
import re
s = '11010'
pattern = s.replace('0', '[01]')
regex = re.compile(pattern)
regex.match('11111')
regex.match('11011')
It looks to me like you're actually looking for bit arithmetic:
s = '11010'
n = int(s, 2)
for r in ('11111', '11011', '11110', '11010'):
    if int(r, 2) & n == n:
        print('%s matches %s' % (r, s))
    else:
        print("%s doesn't match %s" % (r, s))
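The bitwise test can also be wrapped in a function. In this sketch, bit_matches is a hypothetical name and the length check is an added assumption: int(..., 2) ignores leading zeros, so strings of different lengths could otherwise compare as equal.

```python
def bit_matches(s, s1):
    """Return True if every '1' in s is also a '1' in s1.

    Assumes both arguments are strings of '0'/'1' characters.
    """
    # Added assumption: only strings of equal length can match.
    if len(s) != len(s1):
        return False
    n = int(s, 2)
    # All bits set in n must also be set in s1.
    return (int(s1, 2) & n) == n

print(bit_matches("11010", "11111"))
print(bit_matches("11111", "11011"))
```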
import re
def matches(pat, s):
    p = re.compile(pat.replace('0', '[01]') + '$')
    return p.match(s) is not None
print(matches('11111', '11111'))
print(matches('11111', '11011'))
print(matches('11010', '11111'))
print(matches('11010', '11011'))
You say "match to a longer string s1", but you don't say whether you'd like to match the start of the string, or the end etc. Until I better understand your requirements, this performs an exact match.
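If the shorter pattern should instead be found anywhere inside the longer string (one possible reading of the requirement), a lookahead lets re.finditer report every start position, including overlapping ones. This is only a sketch; find_positions is a hypothetical helper name.

```python
import re

def find_positions(pat, s1):
    """Return every index in s1 where pat matches, with '0' meaning 0 or 1."""
    # Wrapping the pattern in a lookahead makes each match zero-width,
    # so overlapping occurrences are all reported.
    regex = re.compile("(?=" + pat.replace("0", "[01]") + ")")
    return [m.start() for m in regex.finditer(s1)]

print(find_positions("110", "0111011"))
```

Here "110" becomes 11[01], which matches at indexes 1 ("111") and 2 ("110") of the longer string.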

Count overlapping regex matches once again

How can I obtain the number of overlapping regex matches using Python?
I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:
input example string: akka
search pattern: a.*k
A proper function should yield 2 as the number of matches, since there are two possible end positions (k letters).
The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).
I think that what you're looking for is probably better done with a parsing library like lepl:
>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]
I believe that the length of the output from parser.parse_all is what you're looking for.
Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.
Edit: Based on the comment from @Shamanu4, I see you want matches starting from any position. You can do that as follows:
>>> import itertools
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
Yes, it is ugly and unoptimized, but it seems to work. It simply tries all possible (but unique) variants:
def myregex(pattern,text,dir=0):
    import re
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[1:],m.group('end')),1):
                yield r
            if dir<1 :
                for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[:-1],m.group('end')),-1):
                    yield r
def myprocess(pattern, text):
    parts = pattern.split("*")
    for i in range(0, len(parts)-1 ):
        res=""
        for j in range(0, len(parts) ):
            if j==0:
                res+="(?P<prefix>"
            if j==i:
                res+=")(?P<suffix>"
            res+=parts[j]
            if j==i+1:
                res+=")(?P<end>"
            if j<len(parts)-1:
                if j==i:
                    res+=".*"
                else:
                    res+=".*?"
            else:
                res+=")"
        for r in myregex(res,text):
            yield r
def mycount(pattern, text):
    return set(myprocess(pattern, text))
test:
>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])
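For short inputs, a brute-force alternative using only the standard re module is to test every substring against the pattern. This is a sketch, not the method of either answer above: count_overlapping_spans is a hypothetical name, it relies on Pattern.fullmatch (Python 3.4+), and it counts distinct (start, end) spans only.

```python
import re

def count_overlapping_spans(pattern, text):
    """Count distinct (start, end) substrings of text that fully match pattern."""
    regex = re.compile(pattern)
    spans = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch with pos/endpos tests exactly text[start:end].
            if regex.fullmatch(text, start, end):
                spans.add((start, end))
    return len(spans)

print(count_overlapping_spans("a.*k", "akka"))
print(count_overlapping_spans("b.*o", "bboo"))
```

This gives 2 for a.*k in akka and 4 for b.*o in bboo. Because it counts spans rather than internal alignments, a.*k.*a in akka yields 1, not the 2 the question asks for, and the nested loops make it quadratic in the number of regex calls.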

Separate number from unit in a string in Python

I have strings containing numbers with their units, e.g. 2GB, 17ft, etc.
I would like to separate the number from the unit and create 2 different strings. Sometimes, there is a whitespace between them (e.g. 2 GB) and it's easy to do it using split(' ').
When they are together (e.g. 2GB), I would test every character until I find a letter, instead of a number.
s='17GB'
number=''
unit=''
for c in s:
    if c.isdigit():
        number+=c
    else:
        unit+=c
Is there a better way to do it?
Thanks
You can break out of the loop when you find the first non-digit character:
for i,c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no break: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()
If you have negative numbers and decimals:
numeric = '0123456789-.'
for i,c in enumerate(s):
    if c not in numeric:
        break
else:
    i = len(s)  # no break: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()
You could use a regular expression to divide the string into groups:
>>> import re
>>> p = re.compile(r'(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')
The tokenize module can help:
>>> import StringIO
>>> import tokenize
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...     print token
...
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')
s='17GB'
for i,c in enumerate(s):
    if not c.isdigit():
        break
number=int(s[:i])
unit=s[i:]
You should use regular expressions, grouping together what you want to find out:
import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
    print("Number: %d, unit: %s" % (int(match.group(1)), match.group(2)))
Change the regex according to what you want to parse. If you're unfamiliar with regular expressions, here's a great tutorial site.
>>> s="17GB"
>>> ind=map(str.isalpha,s).index(True)
>>> num,suffix=s[:ind],s[ind:]
>>> print num+":"+suffix
17:GB
This uses an approach which should be a bit more forgiving than regexes. Note: this is not as performant as the other solutions posted.
def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units(" 3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwhiches")
    (0, 'spam sandwhiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()
This kind of parser is already integrated into Pint:
Pint is a Python package to define, operate and manipulate physical
quantities: the product of a numerical value and a unit of
measurement. It allows arithmetic operations between them and
conversions from and to different units.
You can install it with pip install pint.
Then, you can parse a string, get the desired value ('magnitude') and its unit:
>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> size = ureg('2GB')
>>> size.m
2
>>> size.u
<Unit('gigabyte')>
>>> size.to('GiB')
<Quantity(1.86264515, 'gibibyte')>
>>> length = ureg('17ft')
>>> length.m
17
>>> length.u
<Unit('foot')>
>>> length.to('cm')
<Quantity(518.16, 'centimeter')>
How about using a regular expression
http://python.org/doc/1.6/lib/module-regsub.html
For this task, I would definitely use a regular expression:
import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
    number, unit = thematch.groups()
else:
    raise ValueError('String %r not in the expected format' % s)
In the RE pattern, \s means "whitespace", \d means "digit", \S means "non-whitespace"; * means "0 or more of the preceding", + means "1 or more of the preceding", and the parentheses enclose "capturing groups" which are then returned by the groups() call on the match object. (thematch is None if the given string doesn't correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters.)
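As a quick illustration of that pattern, here it is applied to a few inputs. NUMBER_UNIT and split_number_unit are hypothetical names introduced only for this sketch:

```python
import re

NUMBER_UNIT = re.compile(r"\s*(\d+)\s*(\S+)")

def split_number_unit(s):
    """Split a string like '17 ft' into ('17', 'ft') using the pattern above."""
    m = NUMBER_UNIT.match(s)
    if m is None:
        raise ValueError("String %r not in the expected format" % s)
    return m.groups()

for sample in ("2GB", "17 ft", "  5  km"):
    print(split_number_unit(sample))

# Caveat: (\S+) must consume at least one character, so for a bare number
# the regex backtracks and hands the last digit to the unit group.
print(split_number_unit("9001"))
```

The last call returns ('900', '1'), so if unit-less numbers can occur, changing (\S+) to (\S*) may be closer to what you want.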
A regular expression.
import re
m = re.match(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)', s)
if m is None:
    raise ValueError("not a number with units")
number = m.group("n")
unit = m.group("u")
This will give you a number (integer or fixed point; too hard to disambiguate scientific notation's "e" from a unit prefix) with an optional sign, followed by the units, with optional whitespace.
You can use re.compile() if you're going to be doing a lot of matches.
SCIENTIFIC NOTATION
This regex works well for me for parsing numbers that may be in scientific notation. It is based on the section of the Python documentation about simulating scanf:
https://docs.python.org/3/library/re.html#simulating-scanf
import re
units_pattern = re.compile(r"([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?|\s*[a-zA-Z]+\s*$)")
number_with_units = [match.group(0) for match in units_pattern.finditer("+2.0e-1 mm")]
print(number_with_units)
# ['+2.0e-1', ' mm']
n, u = number_with_units
print(float(n), u.strip())
# 0.2 mm
Try the regex pattern below. The first group (the scanf() tokens for a number in any form) is lifted directly from the Python docs for the re module.
import re
SCANF_MEASUREMENT = re.compile(
    r'''(                      # group match like scanf() token %e, %E, %f, %g
    [-+]?                      # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)        # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?            # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                      # separator: white space or nothing
    (                          # unit of measure: like GB. also works for no units
    \S*)''', re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement
**measurement** is the value of a quantity of something. most complicated example::
    -666.6e-100 units
'''
def parse_measurement(value_sep_units):
    measurement = SCANF_MEASUREMENT.match(value_sep_units)
    if measurement is None:
        raise ValueError("doesn't start with a number: %r" % value_sep_units)
    value = float(measurement.group(1))  # group 1 is the full number
    units = measurement.group(6)         # group 6 is the unit; group 5 is the separator
    return value, units
