Converting string with leading-zero integer to json - python

I convert a string to a json-object using the json-library:
a = '{"index":1}'
import json
json.loads(a)
{'index': 1}
However, if I instead change the string a to contain a leading 0, then it breaks down:
a = '{"index":01}'
import json
json.loads(a)
JSONDecodeError: Expecting ',' delimiter
I believe this is due to the fact that it is invalid JSON if an integer begins with a leading zero as described in this thread.
Is there a way to remedy this? If not, then I guess the best way is to remove any leading zeroes by a regex from the string first, then convert to json?

A leading 0 in a number literal in JSON is invalid unless the number is exactly 0 or starts with 0. (a fractional value less than one). The Python json module is strict in that it will not accept such number literals, in part because a leading 0 is sometimes used to denote octal (base-8) notation rather than decimal. Deserialising such numbers could lead to unintended programming errors: should 010 be parsed as the number 8 (octal notation) or as 10 (decimal notation)?
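You can see the same ambiguity directly in Python, where int() only resolves it when you pass the base explicitly:
>>> int('010')     # decimal by default
10
>>> int('010', 8)  # explicitly octal
8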
You can create a decoder that will do what you want, but you will need to heavily hack the json module or rewrite much of its internals. Either way, you will see a performance slowdown, as you will no longer be using the C implementation of the module.
Below is an implementation that can decode JSON which contains numbers with any number of leading zeros.
import json
import re
import threading

# a more lenient number regex that also accepts leading zeros
# (modified from json.scanner.NUMBER_RE, which uses (?:0|[1-9]\d*))
NUMBER_RE = re.compile(
    r'(-?\d+)(\.\d+)?([eE][-+]?\d+)?',
    (re.VERBOSE | re.MULTILINE | re.DOTALL))

# We are going to be messing with the internals of `json.scanner`. As such we
# want to return it to its initial state when we're done with it, but we need
# to do so in a thread-safe way.
_LOCK = threading.Lock()

def thread_safe_py_make_scanner(context, *, number_re=json.scanner.NUMBER_RE):
    with _LOCK:
        original_number_re = json.scanner.NUMBER_RE
        try:
            json.scanner.NUMBER_RE = number_re
            return json.scanner._original_py_make_scanner(context)
        finally:
            json.scanner.NUMBER_RE = original_number_re

json.scanner._original_py_make_scanner = json.scanner.py_make_scanner
json.scanner.py_make_scanner = thread_safe_py_make_scanner

class MyJsonDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # overwrite the stricter scan_once implementation
        self.scan_once = json.scanner.py_make_scanner(self, number_re=NUMBER_RE)

d = MyJsonDecoder()
n = d.decode('010')
assert n == 10
json.loads('010')  # check the normal route still raises an error
I would stress that you shouldn't rely on this as a proper solution. Rather, it's a quick hack to help you decode malformed JSON that is nearly, but not quite valid. It's useful if recreating the JSON in a valid form is not possible for some reason.
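If the data is simple and you just need a one-off fix, the regex pre-processing the question suggests can also work. A minimal sketch (fragile by design: it assumes leading-zero digit runs never appear inside string values):
import json
import re

a = '{"index":01}'
# strip leading zeros from integers that directly follow a delimiter;
# digits inside string values that follow whitespace would also be mangled
fixed = re.sub(r'(?<=[\[{:,\s])0+(\d)', r'\1', a)
print(json.loads(fixed))  # {'index': 1}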

First, using regex on JSON is evil, almost as bad as killing a kitten.
If you want to represent 01 as a valid JSON value, then consider using this structure:
a = '{"index" : "01"}'
import json
json.loads(a)
If you need the string literal 01 to behave like a number, then consider just casting it to an integer in your Python script.
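For example, a minimal sketch of that approach:
import json

a = '{"index": "01"}'
obj = json.loads(a)
obj["index"] = int(obj["index"])  # int() is happy with "01" -> 1
print(obj)  # {'index': 1}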

How to convert string int JSON into real int with json.loads
Please see the post above: you need to use your own version of the Decoder.
More information can be found in the simplejson documentation on GitHub:
https://github.com/simplejson/simplejson/blob/master/index.rst
c = '{"value": 02}'
value = json.loads(json.dumps(c))
print(value)
This seems to work, which is strange:
>>> c = '{"value": 02}'
>>> import json
>>> value = json.loads(json.dumps(c))
>>> print(value)
{"value": 02}
>>> c = '{"value": 0002}'
>>> value = json.loads(json.dumps(c))
>>> print(value)
{"value": 0002}
As Dunes pointed out, loads here just returns the original string, not a dict, so this is not a valid solution.
However,
DEMJSON seems to decode it properly.
https://pypi.org/project/demjson/ -- alternative way
>>> c = '{"value": 02}'
>>> import demjson
>>> demjson.decode(c)
{'value': 2}

Related

How do I extract simple numerical expressions from a string?

I want to code a unit converter, and I need to separate the given value from the unit in the input string.
To provide a user friendly experience while using the converter I want the user to be able to input the value and the unit in the same string. My problem is that I want to extract the numbers and the letters so that I can tell the program the unit and the value and store them in two different variables. For extracting the letters, I used the in operator, and that works properly. I also found a solution for getting the numbers from the input, but that doesn't work for values with exponents.
a = str(input("Type in your wavelength: "))
if "mm" in a:
    print("Unit = Millimeter")
    b = float(a.split()[0])
Storing simple inputs like 567 mm as a float in b works but I want to be able to extract inputs like 5*10**6 mm but it says
could not convert string to float: '5*10**6'.
So what can I use to extract more complex numbers like this into a float?
Traditionally, in Python, as in many other languages, exponents are prefixed by the letter e or E. While 5 * 10**6 is not a valid floating point literal, 5e6 most definitely is.
This is something to keep in mind for the future, but it won't solve your issue with the in operator. The problem is that in can only check if something you already know is there. What if your input was 5e-8 km instead?
You should start by coming up with an unambiguously clear definition of how you identify the boundary between number and units in a string. For example, units could be the last contiguous bit of non-digit characters in your string.
You could then split the string using regular expressions. Note that ast.literal_eval only accepts plain literals (like 5e6), so an arbitrary arithmetic expression such as 5 * 10**6 will make it raise a ValueError; the more complicated your expressions can be, the more complicated your parser and evaluator will have to be.
Here's an example to get you started:
from ast import literal_eval
import re

# number part: everything up to the last digit or dot; units: trailing non-digits
pattern = re.compile(r'(.*[\d\.])\s*(\D+)')

data = '5e6 mm'
match = pattern.fullmatch(data)
if not match:
    raise ValueError('Invalid Expression')
num, units = match.groups()
num = literal_eval(num)  # 5000000.0; an expression like '5 * 10**6' would
                         # need a real evaluator, e.g. the safe_eval below
It seems that you are looking for the eval function, as noted in Rasgel's answer (see the Python documentation for eval).
As some people have pointed out, it poses a big security risk.
To circumvent this, I can think of 2 ways:
1. Combine eval with regex
If you only want to do basic arithmetic operations like addition, subtraction and maybe 2**4 or something like that, then you can use regex to first remove any non-numerical, non-arithmetic-operator characters.
import re

a = str(input("Type in your wavelength: "))
if "mm" in a:
    print("Unit = Millimeter")
    # After parsing the units,
    # remove anything other than digits, +, -, *, /, . (floats), ! (factorial?) and ()
    # If you require any other symbols, add them in
    pruned_a = re.sub(r'[^0-9\*\+\-\/\!\.\(\)]', "", a)
    result = eval(pruned_a)
2. Make sure eval doesn't actually evaluate any of your local or global variables in your python code.
result = eval(expression, {'__builtins__': None}, {})
(the above code is from another Stackoverflow answer here: Math Expression Evaluation -- there might be other solutions there that you might be interested in)
Combined
import re

a = str(input("Type in your wavelength: "))
if "mm" in a:
    print("Unit = Millimeter")
    # After parsing the units,
    # remove anything other than digits, +, -, *, /, . (floats), ! (factorial?) and ()
    # If you require any other symbols, add them in
    pruned_a = re.sub(r'[^0-9\*\+\-\/\!\.\(\)]', "", a)
    result = eval(pruned_a, {'__builtins__': None}, {})  # to be extra safe :)
There are many ways to tackle this simple problem, using str.split, regular expressions, eval, ast.literal_eval... Here I propose writing your own safe routine that evaluates simple mathematical expressions, code below:
import ast
import operator

def safe_eval(s):
    bin_ops = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Mod: operator.mod,
        ast.Pow: operator.pow
    }

    node = ast.parse(s, mode='eval')

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        elif isinstance(node, ast.Str):
            return node.s
        elif isinstance(node, ast.Num):
            return node.n
        elif isinstance(node, ast.BinOp):
            return bin_ops[type(node.op)](_eval(node.left), _eval(node.right))
        else:
            raise Exception('Unsupported type {}'.format(node))

    return _eval(node.body)

if __name__ == '__main__':
    text = str(input("Type in your wavelength: "))
    tokens = [v.strip() for v in text.split()]
    if len(tokens) < 2:
        raise Exception("expected input: <wavelength expression> <unit>")
    wavelength = safe_eval("".join(tokens[:-1]))
    dtype = tokens[-1]
    print(f"You've typed {wavelength} in {dtype}")
I'll also recommend you read this post Why is using 'eval' a bad practice?
In case you have a string like 5*106 and want to convert this number into a float, you can use the eval() function.
>>> float(eval('5*106'))
530.0

Creating new conversion specifier in Python

In python we have conversion specifier like
'{0!s}'.format(10)
which prints
'10'
How can I make my own conversion specifiers like
'{0!d}'.format(4561321)
which print integers in following format
4,561,321
Or converts it into binary like
'{0!b}'.format(2)
which prints
10
What are the classes I need to inherit and which functions I need to modify? If possible please provide a small example.
Thanks!!
What you want to do is impossible: conversion specifiers are part of the format-string syntax itself, and built-in types cannot be modified, so you cannot register new ones.
There is a special method to handle the formatting of values, __format__; however, it only handles the format string, not the conversion specifier, i.e. you can customize how {0:d} is handled but not how {0!d} is. The only conversions that work with ! are s, r and, in Python 3, a.
Note that d and b already exist as format specifiers:
>>> '{0:b}'.format(2)
'10'
In any case you could implement your own class that handles formatting:
class MyInt:
    def __init__(self, value):
        self.value = value

    def __format__(self, fmt):
        if fmt == 'd':
            text = list(str(self.value))
        elif fmt == 'b':
            text = list(bin(self.value)[2:])
        else:
            # fall back to the normal behaviour for any other format spec
            return format(self.value, fmt)
        # insert a comma every three digits, counting from the right
        for i in range(len(text) - 3, 0, -3):
            text.insert(i, ',')
        return ''.join(text)
Used as:
>>> '{0:d}'.format(MyInt(5000000))
'5,000,000'
>>> '{0:b}'.format(MyInt(8))
'1,000'
Try not to make your own; use the format specifiers already present in Python. You can use:
'{0:b}'.format(2) # for binary
'{0:d}'.format(2) # for integer
'{0:x}'.format(2) # for hexadecimal
'{0:f}'.format(2) # for float
'{0:e}'.format(2) # for exponential
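For the comma grouping asked about in the question, note that Python 2.7+/3.1+ also has a built-in thousands separator in the format spec (PEP 378), so no custom class is needed:
'{0:,d}'.format(4561321) # '4,561,321'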
Please refer https://docs.python.org/2/library/string.html#formatspec for more.

How to eval an expression in other bases?

I was making a calculator in which the user inputs an expression such as 3*2+1/20, and I use eval to display the answer.
Is there a function that lets me do the same in other bases (bin, oct, hex)?
If they enter the values as hex, binary, etc., eval will work:
eval("0xa + 8 + 0b11")
# 21
Beware though, eval can be dangerous.
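For example, nothing stops a hostile input string from reaching the os module (the output shown is illustrative):
>>> eval("__import__('os').getcwd()")
'/home/user'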
No; eval is used to parse Python and the base of numbers in Python code is fixed.
You could use a regex replace to prefix numbers with 0x if you were insistent upon this method, but it would be better to build a parser utilizing, say, int(string, base) to generate the numbers.
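A minimal sketch of that regex idea (eval_in_base is a hypothetical helper; it naively prefixes every bare run of digits, so it breaks on floats like 1.5, and it still uses eval, so it must never see untrusted input):
import re

def eval_in_base(expr, prefix):
    # prefix every standalone digit run, e.g. '20' -> '0x20'
    return eval(re.sub(r'\b\d+\b', lambda m: prefix + m.group(), expr))

print(eval_in_base("3*2+1/20", "0x"))  # 6.03125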
If you really want to go down the Python route, here's a token-based transformation:
import tokenize
from io import BytesIO

def tokens_with_base(tokens, base):
    for token in tokens:
        if token.type == tokenize.NUMBER:
            try:
                value = int(token.string, base)
            except ValueError:
                # not transformable (e.g. a float literal)
                pass
            else:
                # transformable: rewrite the literal as its base-10 value
                token = tokenize.TokenInfo(
                    type=tokenize.NUMBER,
                    string=str(value),
                    start=token.start,
                    end=token.end,
                    line=token.line
                )
        yield token

def python_change_default_base(string, base):
    tokens = tokenize.tokenize(BytesIO(string.encode()).readline)
    transformed = tokens_with_base(tokens, base)
    return tokenize.untokenize(transformed)

eval(python_change_default_base("3*2+1/20", 16))
#>>> 6.03125

# the transformed source is equivalent to:
0x3*0x2+0x1/0x20
#>>> 6.03125
This is safer because it respects things like strings.

Is there a way to override python's json handler?

I'm having trouble encoding infinity in json.
json.dumps will convert this to Infinity, but I would like it to convert it to null or another value of my choosing.
Unfortunately, setting the default argument only seems to work if dumps doesn't already understand the object; otherwise the default handler appears to be bypassed.
Is there a way I can pre-encode the object, change the default way a type/class is encoded, or convert a certain type/class into a different object prior to normal encoding?
Look at the source here: http://hg.python.org/cpython/file/7ec9255d4189/Lib/json/encoder.py
If you subclass JSONEncoder, you can override just the iterencode(self, o, _one_shot=False) method, which has explicit special casing for Infinity (inside an inner function).
To make this reusable, you'll also want to alter the __init__ to take some new options, and store them in the class.
Alternatively, you could pick a json library from pypi which has the appropriate extensibility you are looking for: https://pypi.python.org/pypi?%3Aaction=search&term=json&submit=search
Here's an example:
import json

class FloatEncoder(json.JSONEncoder):
    def __init__(self, nan_str="null", **kwargs):
        super(FloatEncoder, self).__init__(**kwargs)
        self.nan_str = nan_str

    # uses code from official python json.encoder module.
    # Same licence applies.
    def iterencode(self, o, _one_shot=False):
        """Encode the given object and yield each string
        representation as available.

        For example::

            for chunk in JSONEncoder().iterencode(bigobject):
                mysocket.write(chunk)
        """
        if self.check_circular:
            markers = {}
        else:
            markers = None
        if self.ensure_ascii:
            _encoder = json.encoder.encode_basestring_ascii
        else:
            _encoder = json.encoder.encode_basestring
        if self.encoding != 'utf-8':
            def _encoder(o, _orig_encoder=_encoder,
                         _encoding=self.encoding):
                if isinstance(o, str):
                    o = o.decode(_encoding)
                return _orig_encoder(o)

        def floatstr(o, allow_nan=self.allow_nan,
                     _repr=json.encoder.FLOAT_REPR,
                     _inf=json.encoder.INFINITY,
                     _neginf=-json.encoder.INFINITY,
                     nan_str=self.nan_str):
            # Check for specials. Note that this type of test is
            # processor and/or platform-specific, so do tests which
            # don't depend on the internals.
            if o != o:
                text = nan_str
            elif o == _inf:
                text = 'Infinity'
            elif o == _neginf:
                text = '-Infinity'
            else:
                return _repr(o)
            if not allow_nan:
                raise ValueError(
                    "Out of range float values are not JSON compliant: " +
                    repr(o))
            return text

        _iterencode = json.encoder._make_iterencode(
            markers, self.default, _encoder, self.indent, floatstr,
            self.key_separator, self.item_separator, self.sort_keys,
            self.skipkeys, _one_shot)
        return _iterencode(o, 0)

example_obj = {
    'name': 'example',
    'body': [
        1.1,
        {"3.3": 5, "1.1": float('Nan')},
        [float('inf'), 2.2]
    ]}

print json.dumps(example_obj, cls=FloatEncoder)
ideone: http://ideone.com/dFWaNj
No, there is no simple way to achieve this. In fact, NaN and Infinity floating point values shouldn't be serialized with json at all, according to the standard.
Python uses an extension of the standard. You can make the python encoding standard-compliant by passing the allow_nan=False parameter to dumps, but this will raise a ValueError for infinity/nans even if you provide a default function.
You have two ways of doing what you want:
1. Subclass JSONEncoder and change how these values are encoded. Note that you will have to take into account cases where a sequence can contain an infinity value etc. AFAIK there is no API to redefine how objects of a specific class are encoded.
2. Make a copy of the object to encode and replace any occurrence of infinity/nan with None or some other object that is encoded as you want.
A less robust, yet much simpler solution, is to modify the encoded data, for example replacing all Infinity substrings with null:
>>> import re
>>> def replace_infinities(encoded):
...     regex = re.compile(r'\bInfinity\b')
...     return regex.sub('null', encoded)
...
>>> import json
>>> replace_infinities(json.dumps([1, 2, 3, float('inf'), 4]))
'[1, 2, 3, null, 4]'
Obviously you should take into account the text Infinity inside strings etc., so even here a robust solution is not immediate, nor elegant.
Context
I ran into this issue and didn't want to bring an extra dependency into the project just to handle this case. Additionally, my project supports Python 2.6, 2.7, 3.3, and 3.4, as well as users of simplejson. Unfortunately there are three different implementations of iterencode between these versions, so hard-coding a particular version was undesirable.
Hopefully this will help someone else with similar requirements!
Qualifiers
If the encoding time/processing-power surrounding your json.dumps call is small compared to other components of your project, you can un-encode/re-encode the JSON to get your desired result leveraging the parse_constant kwarg.
Benefits
It doesn't matter if the end-user has Python 2.x's json, Python 3.x's json, or is using simplejson (e.g., import simplejson as json)
It only uses public json interfaces which are unlikely to change.
Caveats
This will take ~3X as long to encode things
This implementation doesn't handle object_pairs_hook because then it wouldn't work for python 2.6
Invalid separators will fail
Code
class StrictJSONEncoder(json.JSONEncoder):

    def default(self, o):
        """Make sure we don't instantly fail"""
        return o

    def coerce_to_strict(self, const):
        """
        This is used to ultimately *encode* into strict JSON, see `encode`
        """
        # before python 2.7, 'true', 'false' and 'null' were also included here
        if const in ('Infinity', '-Infinity', 'NaN'):
            return None
        else:
            return const

    def encode(self, o):
        """
        Load and then dump the result using the parse_constant kwarg

        Note that setting invalid separators will cause a failure at this step.
        """
        # this will raise errors in a normal-expected way
        encoded_o = super(StrictJSONEncoder, self).encode(o)
        # now:
        # 1. `loads` to switch Infinity, -Infinity, NaN to None
        # 2. `dumps` again so you get 'null' instead of extended JSON
        try:
            new_o = json.loads(encoded_o, parse_constant=self.coerce_to_strict)
        except ValueError:
            # invalid separators will fail here; raise a helpful exception
            raise ValueError(
                "Encoding into strict JSON failed. Did you set "
                "valid JSON separators?"
            )
        else:
            return json.dumps(new_o, sort_keys=self.sort_keys,
                              indent=self.indent,
                              separators=(self.item_separator,
                                          self.key_separator))
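Usage would look something like this (a quick sketch; the exact output assumes the default separators):
>>> json.dumps({'value': float('inf'), 'ok': 1.5}, cls=StrictJSONEncoder)
'{"value": null, "ok": 1.5}'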
You could do something along these lines:
import json
import math

target = [1.1, 1, 2.2, float('inf'), float('nan'), 'a string', int(2)]

def ffloat(f):
    if not isinstance(f, float):
        return f
    if math.isnan(f):
        return 'custom NaN'
    if math.isinf(f):
        return 'custom inf'
    return f

print 'regular json:', json.dumps(target)
print 'customized:', json.dumps(map(ffloat, target))
Prints:
regular json: [1.1, 1, 2.2, Infinity, NaN, "a string", 2]
customized: [1.1, 1, 2.2, "custom inf", "custom NaN", "a string", 2]
If you want to handle nested data structures, this is also not that hard:
import json
import math
from collections import Mapping, Sequence

def nested_json(o):
    if isinstance(o, float):
        if math.isnan(o):
            return 'custom NaN'
        if math.isinf(o):
            return 'custom inf'
        return o
    elif isinstance(o, basestring):
        return o
    elif isinstance(o, Sequence):
        return [nested_json(item) for item in o]
    elif isinstance(o, Mapping):
        return dict((key, nested_json(value)) for key, value in o.iteritems())
    else:
        return o

nested_tgt = [1.1, {1.1: float('inf'), 3.3: 5}, (float('inf'), 2.2)]

print 'regular json:', json.dumps(nested_tgt)
print 'nested json', json.dumps(nested_json(nested_tgt))
Prints:
regular json: [1.1, {"3.3": 5, "1.1": Infinity}, [Infinity, 2.2]]
nested json [1.1, {"3.3": 5, "1.1": "custom inf"}, ["custom inf", 2.2]]

Method for guessing type of data currently represented as strings

I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:
integer
floating point
date
boolean
string
My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.
I am particularly concerned about the date data type - is there a python module for parsing common date idioms (obviously I will not be able to detect them all)?
What about integers and floats?
ast.literal_eval() can get the easy ones.
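For example, it handles plain literals and raises on anything else:
>>> from ast import literal_eval
>>> literal_eval("42"), literal_eval("2.5"), literal_eval("True")
(42, 2.5, True)
>>> literal_eval("hello")  # not a literal
Traceback (most recent call last):
  ...
ValueError: malformed node or string: ...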
Dateutil comes to mind for parsing dates.
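For instance, with the third-party python-dateutil package:
>>> from dateutil import parser
>>> parser.parse("January 3, 2010")
datetime.datetime(2010, 1, 3, 0, 0)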
For integers and floats you could always try a cast in a try/except block:
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): dsa
>>> try:
... cg = float(g)
... except:
... print "g is not a float"
...
g is not a float
>>>
The data types I am currently interested in are...
These do not exist in a CSV file. The data is only strings. Only. Nothing more.
test a sample of rows
Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.
The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.
If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything. A number, a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).
try:
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not valid
If column 2 is supposed to be a float, you can only do this.
try:
    y = float(row[1])
except ValueError:
    pass  # column is not valid
If column 3 is supposed to be an int, you can only do this.
try:
    z = int(row[2])
except ValueError:
    pass  # column is not valid
There is no way to "discover" if the CSV has floating-point digit strings except by doing float on each row. If a row fails, then someone prepared the file improperly.
Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.
You may be interested in this python library which does exactly this kind of type guessing on both general python data and CSVs and XLS files:
https://github.com/okfn/messytables
https://messytables.readthedocs.org/ - docs
It happily scales to very large files, to streaming data off the internet etc.
There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
We tested ast.literal_eval(), but rescuing from the error is pretty slow. If you want to cast data that you receive all as strings, I think a regex would be faster.
Something like the following worked very well for us.
import datetime
import re

def guess_type(s):
    """Helper function to detect the appropriate type for a given string."""
    if s == "":
        return None
    elif re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
            re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str
Tests:
assert guess_type("") == None
assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date
