Separate number from unit in a string in Python - python

I have strings containing numbers with their units, e.g. 2GB, 17ft, etc.
I would like to separate the number from the unit and create 2 different strings. Sometimes, there is a whitespace between them (e.g. 2 GB) and it's easy to do it using split(' ').
When they are together (e.g. 2GB), I would test every character until I find a letter, instead of a number.
s = '17GB'
number = ''
unit = ''
for c in s:
    if c.isdigit():
        number += c
    else:
        unit += c
Is there a better way to do it?
Thanks

You can break out of the loop when you find the first non-digit character:
for i, c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no unit found: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()
If you have negative numbers and decimals:
numeric = '0123456789-.'
for i, c in enumerate(s):
    if c not in numeric:
        break
else:
    i = len(s)  # no unit found: the whole string is the number
number = s[:i]
unit = s[i:].lstrip()

You could use a regular expression to divide the string into groups:
>>> import re
>>> p = re.compile(r'(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')

tokenize can help (this session is Python 2):
>>> import StringIO
>>> import tokenize
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...     print token
...
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')

s = '17GB'
for i, c in enumerate(s):
    if not c.isdigit():
        break
else:
    i = len(s)  # no unit found: the whole string is the number
number = int(s[:i])
unit = s[i:]

You should use regular expressions, grouping together what you want to find out:
import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
    print("Number: %d, unit: %s" % (int(match.group(1)), match.group(2)))
Change the regex according to what you want to parse. If you're unfamiliar with regular expressions, here's a great tutorial site.

>>> s = "17GB"
>>> ind = list(map(str.isalpha, s)).index(True)
>>> num, suffix = s[:ind], s[ind:]
>>> print(num + ":" + suffix)
17:GB

This uses an approach which should be a bit more forgiving than regexes. Note: this is not as performant as the other solutions posted.
def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units(" 3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwhiches")
    (0, 'spam sandwhiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()

This kind of parser is already integrated into Pint:
Pint is a Python package to define, operate and manipulate physical
quantities: the product of a numerical value and a unit of
measurement. It allows arithmetic operations between them and
conversions from and to different units.
You can install it with pip install pint.
Then, you can parse a string, get the desired value ('magnitude') and its unit:
>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> size = ureg('2GB')
>>> size.m
2
>>> size.u
<Unit('gigabyte')>
>>> size.to('GiB')
<Quantity(1.86264515, 'gibibyte')>
>>> length = ureg('17ft')
>>> length.m
17
>>> length.u
<Unit('foot')>
>>> length.to('cm')
<Quantity(518.16, 'centimeter')>

How about using a regular expression?
http://python.org/doc/1.6/lib/module-regsub.html

For this task, I would definitely use a regular expression:
import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
    number, unit = thematch.groups()
else:
    raise ValueError('String %r not in the expected format' % s)
In the RE pattern, \s means "whitespace", \d means "digit", \S means "non-whitespace"; * means "0 or more of the preceding", + means "1 or more of the preceding", and the parentheses enclose "capturing groups" which are then returned by the groups() call on the match object. (thematch is None if the given string doesn't correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters.)
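To make the behaviour of that pattern concrete, here is a small sketch that wraps it in a function. The function name split_number_unit is just an illustration, not part of the answer above:

```python
import re

# Same pattern as described above: optional whitespace, digits,
# optional whitespace, then one or more non-whitespace characters.
NUMBER_UNIT = re.compile(r'\s*(\d+)\s*(\S+)')

def split_number_unit(s):
    """Return (number, unit) as strings; raise ValueError if s does not match.

    The function name is hypothetical, chosen for this sketch.
    """
    m = NUMBER_UNIT.match(s)
    if m is None:
        raise ValueError('String %r not in the expected format' % s)
    return m.groups()

print(split_number_unit('2GB'))     # ('2', 'GB')
print(split_number_unit(' 17 ft'))  # ('17', 'ft')
print(split_number_unit('17'))      # ('1', '7')
```

Note the last call: because the unit group requires at least one character, a bare number such as '17' still matches by backtracking, and the last digit ends up in the unit. If bare numbers can occur in your input, you may want \S* for the unit instead.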

A regular expression.
import re
m = re.match(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)', s)
if m is None:
    raise ValueError("not a number with units")
number = m.group("n")
unit = m.group("u")
This will give you a number (integer or fixed point; too hard to disambiguate scientific notation's "e" from a unit prefix) with an optional sign, followed by the units, with optional whitespace.
You can use re.compile() if you're going to be doing a lot of matches.
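As a rough sketch of the compiled variant, assuming the number group carries a + quantifier so that multi-digit and decimal numbers are captured whole. The names SIGNED_NUMBER_UNIT and parse_quantity are made up for this example:

```python
import re

# Assumed pattern: optional sign, one or more digits or dots, then the unit.
SIGNED_NUMBER_UNIT = re.compile(r'\s*(?P<n>[-+]?[.0-9]+)\s*(?P<u>.*)')

def parse_quantity(s):
    """Return (float, unit). Hypothetical helper for illustration."""
    m = SIGNED_NUMBER_UNIT.match(s)
    if m is None:
        raise ValueError("not a number with units")
    # float() raises ValueError for strings such as '1.2.3',
    # which the character class [.0-9]+ lets through.
    return float(m.group('n')), m.group('u')

print(parse_quantity('-3.5 kg'))  # (-3.5, 'kg')
print(parse_quantity('17ft'))     # (17.0, 'ft')
```

Converting with float() right away acts as a second check on the number, since the character class alone cannot rule out malformed input like two decimal points.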

SCIENTIFIC NOTATION
This regex works well for me to parse numbers that may be in scientific notation. It is based on the Python documentation section about simulating scanf:
https://docs.python.org/3/library/re.html#simulating-scanf
import re
units_pattern = re.compile(r"([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?|\s*[a-zA-Z]+\s*$)")
number_with_units = list(match.group(0) for match in units_pattern.finditer("+2.0e-1 mm"))
print(number_with_units)
# ['+2.0e-1', ' mm']
n, u = number_with_units
print(float(n), u.strip())
# 0.2 mm

Try the regex pattern below. The first group (the scanf() tokens for a number, any which way) is lifted directly from the Python docs for the re module.
import re
SCANF_MEASUREMENT = re.compile(
    r'''(                    # group match like scanf() token %e, %E, %f, %g
    [-+]?                    # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)      # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?          # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                    # separator: white space or nothing
    (                        # unit of measure: like GB. also works for no units
    \S*)''', re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement
    **measurement** is the value of a quantity of something. most complicated example::
        -666.6e-100 units
'''
def parse_measurement(value_sep_units):
    measurement = SCANF_MEASUREMENT.match(value_sep_units)
    if measurement is None:
        raise ValueError("doesn't start with a number: %r" % value_sep_units)
    value = float(measurement.group(1))  # group 1 is the whole number
    units = measurement.group(6)  # group 6 is the unit; group 5 is the separator
    return value, units

Related

A generic version of python's `str.zfill` that supports arbitrary fillers?

str.zfill is a useful tool to pad strings with leading zeros:
In [1]: '123'.zfill(5)
Out[1]: '00123'
However, is there a more generic version that will take any filler character and pad a string with it? I'm looking for something like this:
In []: 'txt'.foo(' ', 5)
Out[]: '  txt'
In []: '12'.foo('#', 5)
Out[]: '###12'
Does such a function exist?
Yes, there are the string justification methods, ljust and rjust.
>>> '12'.rjust(5, '#')
'###12'
>>> 'txt'.rjust(5, ' ')
'  txt'
>>> '12'.ljust(5, '#')
'12###'
Implementing your own version should be straightforward enough:
def xfill(string, num, filler='0'):
    if len(filler) != 1:
        raise TypeError('xfill() expected a character, but string of length %d found' % len(filler))
    return filler * max(0, num - len(string)) + string
The length check ensures the filler is exactly one character.
xfill(string, num [, filler]):
string: the string to be padded
num: the total width of the field (similar to str.zfill)
filler: to pad with. Defaults to 0 to mimic str.zfill functionality
Examples:
In [321]: xfill('123', 5)
Out[321]: '00123'
In [322]: xfill('123', 5, '#')
Out[322]: '##123'
To add a little more flexibility for alternative patterns instead of just a character:
def filler(string, pattern, width):
    left_filler = (pattern * width)[:max(0, width - len(string))]
    return left_filler + string
>>> filler(string='some text', pattern='*.', width=15)
# Output:
# '*.*.*.some text'
>>> filler(string='some text', pattern='*.', width=14)
# Output:
# '*.*.*some text'
If all you need is simple padding, I'd go with @PM2Ring's answer, but there is another, more versatile way, using the str.format method (Python 2.6 onward). This method allows you to interpolate the format specifier by nesting the replacement fields:
'{string:{fill}>{num}}'.format(string=string, fill=fill, num=num)
Replace > with < if you need to left-align the string instead.
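The nested replacement fields can be wrapped in a small helper, sketched here. The name pad and its align parameter are assumptions made for this example, not an existing API:

```python
def pad(string, fill, num, align='>'):
    """Pad string to width num using the single character fill.

    Hypothetical helper: align is '>' (pad on the left),
    '<' (pad on the right) or '^' (center).
    """
    # The inner {fill}, {align} and {num} are substituted first,
    # producing a format spec such as '#>5'.
    return '{string:{fill}{align}{num}}'.format(
        string=string, fill=fill, align=align, num=num)

print(pad('12', '#', 5))        # ###12
print(pad('12', '#', 5, '<'))   # 12###
print(pad('txt', '*', 7, '^'))  # **txt**
```

Strings that are already at least num characters long come back unchanged, which mirrors how rjust and zfill behave.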

Regular expression search shows different result

I want to extract number between > and < using regular expression on Python 2.7
i.e. From 3213>1234<3213 to 1234.
But the result(print(data2)) shows nothing. What is the problem?
I tested the code below on Ubuntu and Windows pydev.
import re
a = "3213>1234<3213"
p = re.compile('>[0-9]*<')
data = p.search(a).group()
print(data)
p2 = re.compile('[0-9]*')
data2 = p2.search(data).group()
print(data2)
The problem is that you get the earliest possible match for [0-9]* in '>1234<', and that's in fact the empty string at the very start of it, before the >.
Besides direct regex solutions, you could also fix yours simply with data2 = data[1:-1].
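A short sketch that shows where each search actually matches, using the span of the match object:

```python
import re

data = '>1234<'

# [0-9]* may match zero digits, so the search succeeds right away at index 0.
m_star = re.search('[0-9]*', data)
print(repr(m_star.group()), m_star.span())  # '' (0, 0)

# With + at least one digit is required, so the match starts at the first digit.
m_plus = re.search('[0-9]+', data)
print(repr(m_plus.group()), m_plus.span())  # '1234' (1, 5)

# Slicing off the surrounding '>' and '<' gives the same digits.
sliced = data[1:-1]
print(sliced)  # 1234
```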
Because you're trying to use [0-9]* on >1234<, and * tries to match 0 or more digits.
So it gives an empty string when it tries to find a digit at the first character of the string, which is >.
You can replace re.search() with re.findall() and see what's happening:
import re
a = "3213>1234<3213"
p = re.compile('>[0-9]*<')
data = p.search(a).group()
print(data)
p2 = re.compile('[0-9]*')
data2 = p2.findall(data)
print(data2)
Output:
['', '1234', '', '']
You need to use [0-9]+ instead of [0-9]* here, which matches 1 or more digits, so it skips the > and <:
>>> p2 = re.compile('[0-9]+')
>>> data2 = p2.search(data).group()
>>> print(data2)
1234
You can also drop p2 entirely and capture the digits between > and < via p = re.compile('>([0-9]+)<') and data = p.search(a).group(1), like this:
>>> import re
>>> a = "3213>1234<3213"
>>> p = re.compile('>([0-9]+)<')
>>> data = p.search(a).group(1)
>>> print(data)
1234
>>> string='3213>1234<3213'
>>> re.search(r'(?<=>)[^<]+(?=<)', string).group()
'1234'
(?<=>) is the zero width positive lookbehind pattern ensuring > before the desired match
[^<]+ will select the desired portion i.e. the portion after > till next <, 1234 in this case
(?=<) is the zero width positive lookahead pattern ensuring < after the desired match
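To see that the lookarounds really are zero width, here is a small sketch on a string with two delimited numbers (the sample string is made up for this example):

```python
import re

text = '3213>1234<3213>5123<'
pattern = re.compile(r'(?<=>)[^<]+(?=<)')

# Lookarounds consume no characters, so the match holds only the digits.
first = pattern.search(text)
print(first.group(), first.span())  # 1234 (5, 9)

# findall returns every such piece; the > and < are never included.
found = pattern.findall(text)
print(found)  # ['1234', '5123']
```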
You can group your search:
>>> a = "3213>1234<3213"
>>> re.findall(r">(\d+)<", a)
['1234']
The regular expression looks for >, then digits, then <, and findall returns a list of the captured matches. You can then iterate over the matches:
a = "3213>1234<3213>5123<"
p = re.compile('>([0-9]+)<')
data = p.findall(a)
for item in data:
    print(item)
Output:
1234
5123

Splitting a string before the nth occurrence of a character [duplicate]

Is there a Pythonic way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is a regex.
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re
n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m:
    print(m.groups())
or have a nice function:
import re
def nthofchar(s, c, n):
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n-1, c, c)
    l = ()
    m = re.match(regex, s)
    if m:
        l = m.groups()
    return l
s = '20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print("%s : %s" % (s1, s2))
I like this solution because it works without any actual regex and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2  # on which occurrence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() allows you to limit the split after a certain number of strings:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
>>> import re
>>> text = '20_231_myString_234'
>>> occurrences = [m.start() for m in re.finditer('_', text)]  # positions of each '_'
>>> occurrences
[2, 6, 15]
>>> result = [text[:occurrences[1]], text[occurrences[1]+1:]]  # [text[:6], text[7:]]
>>> result
['20_231', 'myString_234']
It depends on what your pattern for this split is. If the first two elements are always numbers, for example, you can build a regular expression and use the re module, which is able to split your string as well.
I had a larger string to split at every nth occurrence of a character, and ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!
In function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
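One caveat with the rsplit line above: rsplit counts splits from the right, so it returns the first two fields only because this particular string has exactly four fields. A sketch of a general version that relies on maxsplit from the left only (the name split_after_nth is made up for this example):

```python
def split_after_nth(text, delim, n):
    """Split text at the nth occurrence of delim (hypothetical helper).

    Raises ValueError if n < 1 or delim occurs fewer than n times.
    """
    if n < 1:
        raise ValueError('n must be at least 1')
    parts = text.split(delim, n)  # at most n splits, counted from the left
    if len(parts) <= n:
        raise ValueError('%r occurs fewer than %d times' % (delim, n))
    tail = parts[n]
    # Everything before the nth delimiter, taken as a slice so no join is needed.
    head = text[:len(text) - len(tail) - len(delim)]
    return head, tail

print(split_after_nth('20_231_myString_234', '_', 2))  # ('20_231', 'myString_234')
print(split_after_nth('a_b_c_d_e', '_', 2))            # ('a_b', 'c_d_e')
print('a_b_c_d_e'.rsplit('_', 2)[0])                   # a_b_c
```

The last line shows the difference: with five fields, rsplit('_', 2)[0] returns the first three fields rather than the first two.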

Extract a number from string in python

I want to extract a number form a string like this in Python:
string1 = '154787xs.txt'
I want to get 154787 from there. I am using this:
searchPattern = re.compile('\d\d\d\d\d\d(?=xs)')
m = searchPattern.search(string1)
number = m.group()
but I do not get the correct value. Also the number of digits could change...
What am I doing wrong?
You could simply use the pattern below:
searchPattern = re.compile(r'\d+(?=xs)')
Explanation:
\d+ matches one or more digits.
(?=xs) Lookahead asserts that the characters which are following the numbers must be xs
Code:
>>> import re
>>> searchPattern = re.compile(r'\d+(?=xs)')
>>> m = searchPattern.search(string1)
>>> m
<_sre.SRE_Match object at 0x7f6047f66370>
>>> number = m.group()
>>> number
'154787'
What do you mean when you say you do not get the right value?
Your code does successfully match the string '154787'.
Perhaps you want number to be an int? In that case use:
number = int(m.group())
By the way, the regex could be written as
searchPattern = re.compile(r'(\d+)xs')
m = searchPattern.search(string1)
if m:
    number = int(m.group(1))

Python Regular Expressions; Having 0 match either 0 or 1

I have a shorter string s I'm trying to match to a longer string s1. 1's match 1's, but 0's will match either a 0 or a 1.
For instance:
s = '11111' would match s1 = '11111'
s = '11010' would match s1 = '11111' or '11011' or '11110' or '11010'
I know regular expressions would make this much easier but am confused on where to start.
Replace each instance of 0 with [01] to enable it matching either 0 or 1:
import re
s = '11010'
pattern = s.replace('0', '[01]')
regex = re.compile(pattern)
regex.match('11111')
regex.match('11011')
It looks to me like you're actually looking for bitwise arithmetic:
s = '11010'
n = int(s, 2)
for r in ('11111', '11011', '11110', '11010'):
    if int(r, 2) & n == n:
        print("%s matches %s" % (r, s))
    else:
        print("%s doesn't match %s" % (r, s))
import re
def matches(pat, s):
    p = re.compile(pat.replace('0', '[01]') + '$')
    return p.match(s) is not None
print(matches('11111', '11111'))
print(matches('11111', '11011'))
print(matches('11010', '11111'))
print(matches('11010', '11011'))
You say "match to a longer string s1", but you don't say whether you'd like to match the start of the string, or the end etc. Until I better understand your requirements, this performs an exact match.
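If the requirement turns out to be finding the pattern anywhere inside a longer string, a sketch along the same lines could use search instead of an anchored match. The helper name find_match and the sample strings are assumptions for illustration:

```python
import re

def find_match(pat, s1):
    """Return the index in s1 where pat first matches, or -1.

    Hypothetical helper: every 0 in pat matches 0 or 1, every 1 matches only 1.
    """
    regex = re.compile(pat.replace('0', '[01]'))
    m = regex.search(s1)  # search scans the whole string; match only tries index 0
    return m.start() if m else -1

print(find_match('11010', '0011111000'))  # 2
print(find_match('111', '0101010'))       # -1
```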
