pyparsing - Parse numbers with thousand separators - python

I am writing a parser and ran into a problem. To parse numbers, I have:
from pyparsing import Word, nums
n = Word(nums)
This works well with numbers without thousands separators. For example, n.parseString("1000", parseAll=True) returns (['1000'], {}) and therefore works.
However, it doesn't work when I add the thousands separator: n.parseString("1,000", parseAll=True) raises pyparsing.ParseException: Expected end of text, found ',' (at char 1), (line:1, col:2).
How can I parse numbers with thousands separators? I don't just want to ignore commas (for example, n.parseString("1,00", parseAll=True) should raise an error, as it is not a valid number).

A pure pyparsing approach would use Combine to wrap a series of pyparsing expressions representing the different fields that you are seeing in the regex:
import pyparsing as pp
int_with_thousands_separators = pp.Combine(pp.Optional("-")
                                           + pp.Word(pp.nums, max=3)
                                           + ("," + pp.Word(pp.nums, exact=3))[...])
I've found that building up numeric expressions like this results in much slower parse time, because all those separate parts are parsed independently, with multiple internal function and method calls (which are real performance killers in Python). So you can replace this with an expression using Regex:
# more efficient parsing with a Regex
int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")
You could also use the code as posted by Jan, and pass that compiled regex to the Regex constructor.
To do parse-time conversion to int, add a parse action that strips out the commas.
# add parse action to convert to int, after stripping ','s
int_with_thousands_separators.addParseAction(
    lambda t: int(t[0].replace(",", "")))
I like using runTests to check out little expressions like this - it's easy to write a series of test strings, and the output shows either the parsed result or an annotated input string with the parse failure location. ("1,00" is included as an intentional error to demonstrate error output by runTests.)
int_with_thousands_separators.runTests("""\
1
# invalid value
1,00
1,000
-3,000,100
""")
If you want to parse real numbers, add pieces to represent the trailing decimal point and following digits.
real_with_thousands_separators = pp.Combine(pp.Optional("-")
                                            + pp.Word(pp.nums, max=3)
                                            + ("," + pp.Word(pp.nums, exact=3))[...]
                                            + "." + pp.Word(pp.nums))
# more efficient parsing with a Regex
real_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*\.\d+")
# add parse action to convert to float, after stripping ','s
real_with_thousands_separators.addParseAction(
    lambda t: float(t[0].replace(",", "")))
real_with_thousands_separators.runTests("""\
# invalid values
1
1,00
1,000
-3,000,100
1.
# valid values
1.732
-273.15
""")

As you are dealing with strings in the first place, you could very well use a regular expression to ensure that the string is indeed a number (thousands separators included). If it is, remove every comma and feed the result to the parser:
import re
from pyparsing import Word, nums
n = Word(nums)
def is_number(number):
    rx = re.compile(r'^-?\d+(?:,\d{3})*$')
    if rx.match(number):
        return number.replace(",", "")
    raise ValueError
try:
    number = is_number("10,000,000")
    print(n.parseString(number, parseAll=True))
except ValueError:
    print("Not a number")
With this, e.g. 1,00 will result in Not a number, see a demo for the expression on regex101.com.
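The same idea can be sketched with the standard library alone, doing the validation and the conversion in one step. This is only an illustration: the function name to_int is hypothetical, and the pattern is stricter than the one above on the assumption that a number must be either plain digits or fully grouped (so "1000,000" is rejected).

```python
import re

# Assumption: either plain digits ("1000") or fully grouped digits ("1,000"),
# with an optional leading minus sign. Mixed forms such as "1000,000" are rejected.
NUMBER = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)")

def to_int(text):
    """Return text as an int, or raise ValueError if the grouping is wrong."""
    if NUMBER.fullmatch(text) is None:
        raise ValueError("Not a number: %r" % text)
    return int(text.replace(",", ""))

for sample in ["1000", "1,000", "10,000,000", "1,00", "1000,000"]:
    try:
        print(sample, "->", to_int(sample))
    except ValueError as exc:
        print(sample, "->", exc)
```

Using fullmatch avoids the explicit ^ and $ anchors; the conversion only runs once the whole string has been accepted.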

I don't quite understand what you mean by "numbers with thousands separators".
In any case, with pyparsing you have to define the pattern of what you want to parse.
The first example works only because you defined n as a plain run of digits, so:
n = Word(nums)
print(n.parseString("1000", parseAll=True))
['1000']
So, if you want to parse "1,000" or "1,00", you should define n as:
n = Word(nums) + ',' + Word(nums)
print(n.parseString("1,000", parseAll=True))
['1', ',', '000']
print(n.parseString("1,00", parseAll=True))
['1', ',', '00']

I also came up with a regex solution, kind of late:
from pyparsing import Word, nums
import re
n = Word(nums)
def parseNumber(x):
    parseable = re.sub('[,][0-9]{3}', lambda y: y.group()[1:], x)
    return n.parseString(parseable, parseAll=True)
print(parseNumber("1,000,123"))
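To see what the substitution does on its own, here is a small standard-library sketch of just the re.sub step (the helper name strip_thousands is hypothetical). It suggests that a comma followed by fewer than three digits is left in place, so the later parse fails as desired, but a comma followed by more than three digits is still removed.

```python
import re

def strip_thousands(text):
    # Same substitution as in parseNumber: drop a comma only when it is
    # directly followed by three digits.
    return re.sub(r",[0-9]{3}", lambda m: m.group()[1:], text)

print(strip_thousands("1,000,123"))  # 1000123
print(strip_thousands("1,00"))       # 1,00  (comma kept, so Word(nums) rejects it)
print(strip_thousands("1,0000"))     # 10000 (accepted although the grouping is wrong)
```

If inputs like "1,0000" must be rejected too, validating the whole string with a pattern such as -?\d{1,3}(,\d{3})* before stripping is one option.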

Related

how to split string between different separators in python

I want to extract substrings from a string such as personne01166+30-90; the output should be +30 and -90.
The strings can be like: 'personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75', etc.
I tried
<string.split('+')[len(string.split('+')) -1 ].split('+')[0]>
but the output must be the two corresponding numbers.
Here is how you can use a list comprehension and re.findall:
import re
s = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
print([re.findall(r'[+-]\d+', i) for i in s])
Output:
[['+0', '-30'], ['+0', '+0'], ['+60', '-75']]
re.findall(r'[+-]\d+', i) finds all matches of the pattern '[+-]\d+' in the string i.
[+-] means either + or -. \d+ means one or more digits in a row.
If you know the interesting part always comes after + then you can simply split twice.
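numbers = string.split('+', 1)[1]
if '+' in numbers:
    this, that = numbers.split('+')
    this, that = int(this), int(that)
elif '-' in numbers:
    this, that = numbers.split('-')
    # convert to int before negating; a str cannot be negated
    this, that = int(this), -int(that)
else:
    raise ValueError('Could not parse %s' % string)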
Perhaps a regex-based approach makes more sense, though:
import re
m = re.search(r'([-+]\d+)([-+]\d+)$', string)
if m:
    this, that = m.groups()
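A self-contained version of the regex approach, wrapped in a function and converting both groups to integers, might look like the following. The function name trailing_offsets is hypothetical, and the sketch assumes the two signed numbers always sit at the very end of the string.

```python
import re

# Two signed integers at the end of the string, e.g. "+30-90".
SIGNED_PAIR = re.compile(r"([-+]\d+)([-+]\d+)$")

def trailing_offsets(label):
    """Return the two trailing signed numbers of label as a tuple of ints."""
    m = SIGNED_PAIR.search(label)
    if m is None:
        raise ValueError("Could not parse %r" % label)
    # int() understands a leading "+" or "-" sign
    return int(m.group(1)), int(m.group(2))

for label in ["personne01144+0-30", "personne01146+0+0", "personne01180+60-75"]:
    print(label, trailing_offsets(label))
```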

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the 0, and so forth.
Is there a simple way to do this using regex?
The answer is rather simple. The .* character combination matches 0 or more characters. For your purpose, you would put it between each pair of characters, as in S.*x.*2.*0.*1.*6. If this pattern matches, then the string satisfies your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.
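A minimal sketch of that escaping step, using re.escape on each character before joining; the function names subsequence_pattern and contains_in_order are hypothetical.

```python
import re

def subsequence_pattern(chars):
    # Escape every character so that ".", "*", "+" etc. are matched literally,
    # then allow anything in between with ".*".
    return ".*".join(re.escape(c) for c in chars)

def contains_in_order(text, chars):
    return re.search(subsequence_pattern(chars), text) is not None

print(contains_in_order("StackExchange 2016", "Sx2016"))  # True
print(contains_in_order("axb", "a.b"))    # False: there is no literal dot in "axb"
print(contains_in_order("a-.-b", "a.b"))  # True
```

Without the escaping, the pattern built from "a.b" would be a.*..*b, which also matches "axb" because the unescaped dot matches any character.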
This function might fit your needs:
import re
def check_string(text, pattern):
    # re.search rather than re.match, so the first character may appear anywhere in text
    return re.search('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
...     result = [welcome.index(c) for c in string_to_find]
... except ValueError:
...     result = None
...
>>> print(result and result == sorted(result))
True
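One caveat with the index-based check: str.index always returns the first occurrence, so with repeated characters it can report a mismatch even though the characters do appear in order (for "ab" inside "bab" the positions come out as [1, 0]). A sketch of an alternative that consumes the text through a single iterator follows; the name is_subsequence is hypothetical.

```python
def is_subsequence(needle, haystack):
    remaining = iter(haystack)
    # "c in remaining" advances the iterator until it finds c, so each
    # character is only searched for after the previous one was found.
    return all(c in remaining for c in needle)

print(is_subsequence("Sx2016", "StackExchange 2016"))  # True
print(is_subsequence("ab", "bab"))                     # True
print(is_subsequence("ba", "ab"))                      # False
```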
Actually, for a sequence of characters like Sx2016, the pattern that best serves your purpose is a more specific one:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can build this kind of check by defining a function like this:
import re
def contains_sequence(text, seq):
    pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
    return re.search(pattern, text)
This approach adds a layer of complexity but brings a couple of advantages as well:
It is the fastest one, because the regex engine walks through the string only once, while the dot-star approach runs to the end of the string and backtracks each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It also works on multiline input strings.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> [x + ': yes' if contains_sequence(x, sequence) else x + ': no' for x in inputs]
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

python pyparsing word excludeChars

I am trying to make a parser for a number which can contain an '_'. I would like the underscore to be suppressed in the output. For example, a valid word would be 1000_000 which should return a number: 1000000.
I have tried the excludeChars keyword argument for this as my understanding is that this should do the following:
"If supplied, this argument specifies characters not to be considered to match, even if those characters are otherwise considered to match."
Taken from http://infohost.nmt.edu/tcc/help/pubs/pyparsing/pyparsing.pdf - page 33 section 5.35 (great pyparsing reference btw)
So below is my attempt:
import pyparsing as pp
num = pp.Word(pp.nums+'_', excludeChars='_')
num.parseString('123_4')
but I end up with the result '123' instead of '1234'
In [113]: num.parseString('123_4')
Out[113]: (['123'], {})
Any suggestions?
You are misinterpreting the purpose of excludeChars. It is not there to suppress those characters from the output, it is there as an override to characters given in the initial and body character strings. So this
Word(nums+'_', excludeChars='_')
is just the same as
Word(nums)
excludeChars was added because there were many times that users wanted to define words like:
all printables except for ':'
all printables except for ',' or '.'
all printables except for ...
Before excludeChars was added, the only way to do this was the clunky-looking:
Word(''.join(c for c in printables if c != ':'))
or
Word(printables.replace(',',''))
Instead you can now write
Word(printables, excludeChars=',.')
In your case, you want to parse the numeric value, allowing embedded '_'s, but return just the numerics. This would be a good case for a parse action:
integer = Word(nums+'_').setParseAction(lambda t: t[0].replace('_',''))
Parse actions are called at parse time to do filtering and conversions. You can even include the conversion to int as part of your parse action:
integer = Word(nums+'_').setParseAction(lambda t: int(t[0].replace('_','')))
integer.parseString('1_000')  # -> [1000]
How about simply replacing the underscore char?
"123_4".replace("_", "")
# "1234"
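If a plain Python value is all that is needed, it may be worth knowing that int() itself accepts single underscores between digits (Python 3.6 and later, per PEP 515). The sketch below contrasts this with the replace approach, which is more lenient; whether that leniency is wanted is an assumption you would have to check against your input.

```python
# int() accepts single underscores between digits (Python 3.6+)
print(int("1000_000"))  # 1000000
print(int("123_4"))     # 1234

# str.replace removes every underscore, wherever it appears
print("1__0_".replace("_", ""))  # 10

# int() is stricter: doubled, leading or trailing underscores are rejected
for text in ["1__0", "_1", "1_"]:
    try:
        int(text)
    except ValueError:
        print("rejected:", text)
```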

Parsing text file in python using pyparsing

I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print(list(enclosed.searchString(str1)))
The comma(,) within the quotation is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete
answer.
If you don't really have to deal with nested data items, then a single-level parenthesized
data group in each section will look like this:
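from pyparsing import (Suppress, Word, alphas, alphanums, nums, OneOrMore,
                       quotedString, Group, delimitedList, nestedExpr)
LPAR,RPAR = map(Suppress, "()")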
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)
# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
''.join(map(lambda s:s.strip('"\''),t)) +
'"' if len(t)>1 else t[0])
data_item = ident | integer | quoted_string
# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between
two consecutive quoted strings, so I chose to implement logic like Python's compiler,
in which two quoted strings are treated as just one longer string, that is "AB CD " "EF" is
the same as "AB CD EF". This was done with the definition of quoted_string, and adding
the parse action to quoted_string to concatenate the contents of the 2 or more component
quoted strings.
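The concatenation logic of that parse action can be tried on its own, without pyparsing, by calling an equivalent function on a plain list of tokens. This is only a sketch; the function name concat_quoted is hypothetical.

```python
def concat_quoted(tokens):
    # A single quoted string is returned unchanged.
    if len(tokens) == 1:
        return tokens[0]
    # Otherwise strip the surrounding quotes from each piece, join the
    # contents, and wrap the result in one pair of double quotes.
    return '"' + "".join(t.strip("\"'") for t in tokens) + '"'

print(concat_quoted(['"solo-100"']))
print(concat_quoted(['"hi here is the gift"', '"Maximum amount of money, goes"']))

# Python itself treats adjacent string literals the same way
print("AB CD " "EF" == "AB CD EF")  # True
```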
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
If you do have nested parenthetical groups, then your section definition can be
as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although as you have already found, this will retain the separate commas as if they
were significant tokens instead of just data separators.

Elegant way test in python if string contains nothing except 0-9,e,+,-,spaces,tabs

I would like to find the most efficient and simple way to test in python if a string passes the following criteria:
contains nothing except:
digits (the numbers 0-9)
decimal points: '.'
the letter 'e'
the sign '+' or '-'
spaces (any number of them)
tabs (any number of them)
I can do this easily with nested 'if' statements, etc., but I'm wondering if there's a more convenient way...
For example, I would want the string:
0.0009017041601 5.13623e-05 0.00137531 0.00124203
to be 'true' and all the following to be 'false':
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0
Time(seconds) Force_x Force_y Force_z
That's trivial for a regex, using a character class:
import re
if re.match(r"[0-9e \t+.-]*$", subject):
    print("Match!")
However, that will (according to the rules) also match eeeee or +-e-+ etc...
If what you actually want to do is check whether a given string is a valid number, you could simply use
try:
    num = float(subject)
except ValueError:
    print("Illegal value")
This will handle strings like "+34" or "-4e-50" or " 3.456e7 ".
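Since the example line in the question holds several numbers, float() would have to be applied to each whitespace-separated field rather than to the whole line. A minimal sketch, with the hypothetical name is_numeric_line, is below. Note that float() also accepts strings such as "nan" and "inf", which may or may not be acceptable for your data.

```python
def is_numeric_line(line):
    """True if line has at least one field and every field parses as a float."""
    fields = line.split()  # splits on spaces and tabs
    if not fields:
        return False
    try:
        for field in fields:
            float(field)
    except ValueError:
        return False
    return True

print(is_numeric_line("0.0009017041601 5.13623e-05 0.00137531 0.00124203"))  # True
print(is_numeric_line("# Velocity: 82.568"))                                 # False
print(is_numeric_line("Time(seconds) Force_x Force_y Force_z"))              # False
```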
import re
if re.match(r"^[0-9\te+. -]+$", x):
    print("yes")
else:
    print("no")
You can try this. If there is a match, the string passes; otherwise it fails. Here x is your string.
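The easiest way to check whether the string has only the required characters is to use the str.translate method to delete every allowed character and see whether anything is left.
num = "1234e+5"
# Python 3: str.maketrans('', '', chars) builds a table that deletes chars
allowed = str.maketrans('', '', "0123456789e+-. \t")
if not num.translate(allowed):
    print("pass")
else:
    print("Wrong character present!!!")
You can allow further characters by adding them to the third argument of str.maketrans.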
You don't need regular expressions; just use a test_list and all():
>>> from string import digits
>>> test_list=list(digits)+['+','-',' ','\t','e','.']
>>> all(i in test_list for i in s)
Demo:
>>> s ='+4534e '
>>> all(i in test_list for i in s)
True
>>> s='+9328a '
>>> all(i in test_list for i in s)
False
>>> s="0.0009017041601 5.13623e-05 0.00137531 0.00124203"
>>> all(i in test_list for i in s)
True
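A variation on the same idea uses a set instead of a list, so the whole test becomes one subset comparison. This is a sketch; the names ALLOWED and only_allowed are hypothetical.

```python
import string

# Digits plus the sign, exponent, decimal point, space and tab characters.
ALLOWED = frozenset(string.digits + "+-e. \t")

def only_allowed(s):
    # True when every distinct character of s is in the allowed set.
    return set(s) <= ALLOWED

print(only_allowed("+4534e "))  # True
print(only_allowed("+9328a "))  # False
```

Like the other character-based checks, this also accepts strings such as "eeeee" that are not valid numbers.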
Performance wise, running a regular expression check is costly, depending on the expression. Also running a regex check for each valid line (i.e. lines which the value should be "True") will be costly, especially because you'll end up parsing each line with a regex and parse the same line again to get the numbers.
You did not say what you wanted to do with the data, so I will make a few assumptions.
First off in a case like this I would make sure the data source is always formatted the same way. Using your example as a template I would then define the following convention:
any line whose first non-blank character is a hash sign is ignored
any blank line is ignored
any line that contains only spaces is ignored
This kind of convention makes parsing much easier since you only need one regular expression to fit rules 1 to 3: ^\s*(#|$), i.e. any number of spaces followed by either a hash sign or an end of line. On the performance side, this expression scans an entire line only when it consists solely of spaces, which should not happen very often. In other cases the expression stops at the first non-space character, which means comments are detected quickly: scanning stops as soon as the hash is encountered, at position 0 most of the time.
If you can also enforce the following convention:
the first non blank line of the remaining lines is the header with column names
there are no blank lines between samples
there are no comments in samples
Your code would then do the following:
read lines into line for as long as re.match(r'^\s*(#|$)', line) evaluates to True;
continue, reading headers from the next line into line: headers = line.split() and you have headers in a list.
You can use a namedtuple for your line layout — which I assume is constant throughout the same data table:
from collections import namedtuple
class WindSample(namedtuple('WindSample', 'time, force_x, force_y, force_z')):
    def __new__(cls, time, force_x, force_y, force_z):
        return super(WindSample, cls).__new__(
            cls,
            float(time),
            float(force_x),
            float(force_y),
            float(force_z)
        )
Parsing valid lines would then consist of the following, for each line:
try:
    data = WindSample(*line.split())
except ValueError as e:
    print(e)
Variable data would hold something such as:
>>> print(data)
WindSample(time=0.0009017041601, force_x=5.13623e-05, force_y=0.00137531, force_z=0.00124203)
The advantage is twofold:
you run costly regular expressions only for the smallest set of lines (i.e. blank lines and comments);
your code parses floats, raising an exception whenever parsing would yield something invalid.
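Putting the steps together, the whole procedure could be sketched roughly as follows. The function name read_samples is hypothetical, the sample text is assembled from the lines quoted in the question, and a plain namedtuple with an explicit float conversion is used instead of the subclass for brevity.

```python
import io
import re
from collections import namedtuple

COMMENT_OR_BLANK = re.compile(r"^\s*(#|$)")
WindSample = namedtuple("WindSample", "time force_x force_y force_z")

def read_samples(lines):
    """Skip leading comments and blank lines, read the header, then the samples."""
    lines = iter(lines)
    header = None
    for line in lines:
        if not COMMENT_OR_BLANK.match(line):
            header = line.split()  # first remaining line holds the column names
            break
    # Every following line is assumed to be a sample; float() raises
    # ValueError on anything that is not a number.
    samples = [WindSample(*map(float, line.split())) for line in lines]
    return header, samples

text = """\
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0

Time(seconds) Force_x Force_y Force_z
0.0009017041601 5.13623e-05 0.00137531 0.00124203
"""
header, samples = read_samples(io.StringIO(text))
print(header)
print(samples[0])
```

Only the comment and blank lines at the top go through the regular expression; the sample lines are handed straight to float().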
