Get wrong values by parsing YAML

Get wrong values by parsing YAML - python

I'm somewhat confused by yaml parsing results. I made a test.yaml and the results are the same.
val_1: 05334000
val_2: 2345784
val_3: 0537380
str_1: foobar
val_4: 05798
val_5: 051342123
Parsing that with:
import yaml
with open('test.yaml', 'r', encoding='utf8') as f:
a = yaml.load(f, Loader=yaml.FullLoader)
returns:
{'val_1': 1423360,
'val_2': 2345784,
'val_3': '0537380',
'str_1': 'foobar',
'val_4': '05798',
'val_5': 10863699}
Why these values for val_1 and val_5? Is there something special?
In my real data with many yaml files there are values like val_1. For some they parsed correct but for some they don't? All starts with 05, followed by more numbers. Caused by the leading 0 results should be strings. But yaml parses something completely different.
If I read the yaml as textfile f.readlines(), all is fine:
['val_1: 05334000\n',
'val_2: 2345784\n',
'val_3: 0537380\n',
'str_1: foobar\n',
'val_4: 05798\n',
'val_5: 051342123\n']

Integers with a leading 0 are parsed as octal; in python you'd need to write them with a leading 0o:
0o5334000 == 1423360
as for '0537380': as there is an 8 present as digit it can not be parsed as an octal number. therefore it remains a string.
if you want to get strings for all your entries you can use the BaseLoader
from io import StringIO
import yaml
file = StringIO("""
val_1: 05334000
val_2: 2345784
val_3: 0537380
str_1: foobar
val_4: 05798
val_5: 051342123
""")
dct = yaml.load(file, Loader=yaml.BaseLoader)
with that i get:
{'val_1': '05334000', 'val_2': '2345784', 'val_3': '0537380',
'str_1': 'foobar', 'val_4': '05798', 'val_5': '051342123'}

Related

Convert tuple with quotes to csv like string

How to convert tuple
text = ('John', '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
to csv format
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL\\r\\n""", """Johny\\nIs\\nHere"""'
or even omitting the special chars at the end
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL""", """Johny\\nIs\\nHere"""'
I came up with this monster
out1 = ','.join(f'""{t}""' if t.startswith('"') and t.endswith('"')
else f'"{t}"' for t in text)
out2 = out1.replace('\n', '\\n').replace('\r', '\\r')

You can get pretty close to what you want with the csv and io modules from the standard library:
use csv to correctly encode the delimiters and handle the quoting rules; it only writes to a file handle
use io.StringIO for that file handle to get the resulting CSV as a string
import csv
import io
f = io.StringIO()
text = ("John", '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
writer = csv.writer(f)
writer.writerow(text)
csv_str = f.getvalue()
csv_repr = repr(csv_str)
print("CSV_STR")
print("=======")
print(csv_str)
print("CSV_REPR")
print("========")
print(csv_repr)
and that prints:
CSV_STR
=======
John,"""n""","""ABC 123
DEF, 456GH
ijKl""
","""Johny
Is
Here"""
CSV_REPR
========
'John,"""n""","""ABC 123\nDEF, 456GH\nijKl""\r\n","""Johny\nIs\nHere"""\r\n'
csv_str is what you'd see in a file if you wrote directly to a file you opened for writing, it is true CSV
csv_repr is kinda what you asked for when you showed us out, but not quite. Your example included "doubly escaped" newlines \\n and carriage returns \\r\\n. CSV doesn't need to escape those characters any more because the entire field is quoted. If you need that, you'll need to do it yourself with something like:
csv_repr.replace(r"\r", r"\\r").replace(r"\n", r"\\n")
but again, that's not necessary for valid CSV.
Also, I don't know how to make the writer include an initial space before every field after the first field, like the spaces you show between "John" and "n" and then after "n" in:
out = 'John, """n""", ...'
The reader can be configured to expect and ignore an initial space, with Dialect.skipinitialspace, but I don't see any options for the writer.

ruamel.yaml dump lists without adding new line at the end

I trying to dump a dict object as YAML using the snippet below:
from ruamel.yaml import YAML
# YAML settings
yaml = YAML(typ="rt")
yaml.default_flow_style = False
yaml.explicit_start = False
yaml.indent(mapping=2, sequence=4, offset=2)
rip= {"rip_routes": ["23.24.10.0/15", "23.30.0.10/15", "50.73.11.0/16", "198.0.0.0/16"]}
file = 'test.yaml'
with open(file, "w") as f:
yaml.dump(rip, f)
It dumps correctly, but I am getting an new line appended to the end of the list
rip_routes:
- 23.24.10.0/15
- 23.30.0.10/15
- 198.0.11.0/16
I don't want the new line to be inserted at the end of file. How can I do it?

The newline is part of the representation code for block style sequence elements. And since that code
doesn't have much knowledge about context, and certainly not about representing the last element to be dumped
in a document, it is almost impossible for the final newline not to be output.
However, the .dump() method has an optional transform parameter that allows you to
run the output of the dumped text through some filter:
import sys
import pathlib
import string
import ruamel.yaml
# YAML settings
yaml = ruamel.yaml.YAML(typ="rt")
yaml.default_flow_style = False
yaml.explicit_start = False
yaml.indent(mapping=2, sequence=4, offset=2)
rip= {"rip_routes": ["23.24.10.0/15", "23.30.0.10/15", "50.73.11.0/16", "198.0.0.0/16"]}
def strip_final_newline(s):
if not s or s[-1] != '\n':
return s
return s[:-1]
file = pathlib.Path('test.yaml')
yaml.dump(rip, file, transform=strip_final_newline)
print(repr(file.read_text()))
which gives:
'rip_routes:\n - 23.24.10.0/15\n - 23.30.0.10/15\n - 50.73.11.0/16\n - 198.0.0.0/16'
It is better to use Path() instances as in the code above,
especially if your YAML document is going to contain non-ASCII characters.

Convert csv file to json with no quotes around float values

I have some csv files that I need to convert to json. Some of the float values in the csv are numeric strings (to maintain trailing zeros). When converting to json, all keys and values are wrapped in double quotes. I need the numeric string float values to not have quotes, but maintain the trailing zeros.
Here is a sample of the input csv file:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0000000000,0.0000000000,5.0000000000,1234567.0000000000,69.0000000000,1.0000000000,,4321987.0000000000,1,000-000-000-00,10012.0000000000,10002.0000000000,3.0000000000,,1.0000000000,0,,0,000-000-000-00,0,bc:1234346
The json output I am getting is:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":"2.0000000000","RETIRED":"0.0000000000","INVOICEDAYOFWEEK":"5.0000000000","ID":"1234567.0000000000","BEANVERSION":"69.0000000000","ACCOUNTTYPE":"1.0000000000","ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":"4321987.0000000000","NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":"12345.0000000000","INVOICEDELIVERYTYPE":"98765.0000000000","DISTRIBUTIONLIMITTYPE":"3.0000000000","CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":"1.0000000000","HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Here is the code I am using:
import csv
import json
csvfile = open('output2.csv', 'r')
jsonfile = open('output2.json', 'w')
readHeaders = csv.reader(csvfile)
fieldnames = next(readHeaders)
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
json.dump(row, jsonfile, separators=(',', ':'))
jsonfile.write('\n')
I would like the output to have no quotes around float values, similar to the following:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}

Now, from your comments, that I understand your question better, here's a completely different answer. Note that it doesn't use the json module and just does the processing needed "manually". Although it probably could be done using the module, getting it to format the Python data types it recognizes by default differently can be fairly involved — I know from experience — as compared to the relatively simple logic used below anyway.
Anther note: Like your code, this converts each row of the csv file into a valid JSON object and writes each one to a file on a separate line. However the contents of the resulting file technically won't be valid JSON because all of these individual objects need to be be comma-separated and enclosed in [ ] brackets (i.e. thereby becoming a valid JSON "Array" Object).
import csv
with open('output2.csv', 'r', newline='') as csvfile, \
open('output2.json', 'w') as jsonfile:
for row in csv.DictReader(csvfile):
newfmt = []
for field, value in row.items():
field = '"{}"'.format(field)
try:
float(value)
except ValueError:
value = 'null' if value == '' else '"{}"'.format(value)
else:
# Avoid changing integer values to float.
try:
int(value)
except ValueError:
pass
else:
value = '"{}"'.format(value)
newfmt.append((field, value))
json_repr = '{' + ','.join(':'.join(pair) for pair in newfmt) + '}'
jsonfile.write(json_repr + '\n')
This is the JSON written to the file:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"bc:1234346"}
Shown again below with added whitespace:
{"ACCOUNTNAMEDENORM": "John Smith",
"DELINQUENCYSTATUS": 2.0000000000,
"RETIRED": 0.0000000000,
"INVOICEDAYOFWEEK": 5.0000000000,
"ID": 1234567.0000000000,
"BEANVERSION": 69.0000000000,
"ACCOUNTTYPE": 1.0000000000,
"ORGANIZATIONTYPEDENORM": null,
"HIDDENTACCOUNTCONTAINERID": 4321987.0000000000,
"NEWPOLICYPAYMENTDISTRIBUTABLE": "1",
"ACCOUNTNUMBER": "000-000-000-00",
"PAYMENTMETHOD": 12345.0000000000,
"INVOICEDELIVERYTYPE": 98765.0000000000,
"DISTRIBUTIONLIMITTYPE": 3.0000000000,
"CLOSEDATE": null,
"FIRSTTWICEPERMTHINVOICEDOM": 1.0000000000,
"HELDFORINVOICESENDING": "0",
"FEINDENORM": null,
"COLLECTING": "0",
"ACCOUNTNUMBERDENORM": "000-000-000-00",
"CHARGEHELD": "0",
"PUBLICID": "bc:1234346"}

Might be a bit of overkill, but with pandas it would be pretty simple:
import pandas as pd
data = pd.read_csv('output2.csv')
data.to_json(''output2.json')

One solution is to use a regular expression to see if the string value looks like a float, and convert it to a float if it is.
import re
null = None
j = {"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":"2.0000000000",
"RETIRED":"0.0000000000","INVOICEDAYOFWEEK":"5.0000000000",
"ID":"1234567.0000000000","BEANVERSION":"69.0000000000",
"ACCOUNTTYPE":"1.0000000000","ORGANIZATIONTYPEDENORM":null,
"HIDDENTACCOUNTCONTAINERID":"4321987.0000000000",
"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00",
"PAYMENTMETHOD":"12345.0000000000","INVOICEDELIVERYTYPE":"98765.0000000000",
"DISTRIBUTIONLIMITTYPE":"3.0000000000","CLOSEDATE":null,
"FIRSTTWICEPERMTHINVOICEDOM":"1.0000000000","HELDFORINVOICESENDING":"0",
"FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00",
"CHARGEHELD":"0","PUBLICID":"xx:1234346"}
for key in j:
if j[key] is not None:
if re.match("^\d+?\.\d+?$", j[key]):
j[key] = float(j[key])
I used null = None here to deal with the "null"s that show up in the JSON. But you can replace 'j' here with each CSV row you're reading, then use this to update the row before writing it back with the floats replacing the strings.
If you're OK with converting any numerical string into a float, then you can skip the regular expression (re.match() command) and replace it with j[key].isnumeric(), if it's available for your Python version.
EDIT: I don't think floats in Python handle the "precision" in a way you might think. It may look like 2.0000000000 is being "truncated" to 2.0, but I think this is more of a formatting and display issue, rather than losing information. Consider the following examples:
>>> float(2.0000000000)
2.0
>>> float(2.00000000001)
2.00000000001
>>> float(1.00) == float(1.000000000)
True
>>> float(3.141) == float(3.140999999)
False
>>> float(3.141) == float(3.1409999999999999)
True
>>> print('%.10f' % 3.14)
3.1400000000
It's possible though to get the JSON to have those zeroes, but in that case it comes down to treating the number as a string, namely a formatted one.

Hah, it's really interesting, I want to find the opposite answer with you that is the results are with quotes.
Actually it's very easy to remove it automatically, just remove the param "separators=(',', ':')".
For me, just adding this param is Okay.

YAML: Dump Python List Without Quotes

I have a Python list, my_list that looks like this ["test1", "test2", "test3"]. I simply want to dump it to a YAML file without quotes. So the desired output is:
test_1
test_2
test_3
I've tried:
import yaml
with open("my_yaml.yaml", "w") as f:
yaml.safe_dump(my_list, f)
Unfortunately, this includes all 3 elements on a single line and they're quoted:
'test_1', 'test_2', 'test_3'
How can I modify to get the desired output?

Try using default_style=None to avoid quotes, and default_flow_style=False to output items on separate lines:
yaml.safe_dump(my_list, f, default_style=None, default_flow_style=False)

You want to output a Python list as a multi-line plain scalar and that
is going to be hard. Normally a list is output a YAML sequence, which
has either dashes (-, in block style, over multiple lines) or using
square brackets ([], in flow style, on one or more lines.
Block style with dashes:
import sys
from ruamel.yaml import YAML
data = ["test1", "test2", "test3"]
yaml = YAML()
yaml.dump(data, sys.stdout)
gives:
- test1
- test2
- test3
flow style, on a narrow line:
yaml = YAML()
yaml.default_flow_style = True
yaml.dump(data, sys.stdout)
output:
Flow style, made narrow:
[test1, test2, test3]
yaml = YAML()
yaml.default_flow_style = True
yaml.width = 5
yaml.dump(data, sys.stdout)
gets you:
[test1,
test2,
test3]
This is unlikely what you want as it affects the whole YAML document,
and you still got the square brackets.
One alternative is converting the string to a plain scalar. This is
actualy what your desired output would be loaded as.
yaml_str = """\
test_1
test_2
test_3
"""
yaml = YAML()
x = yaml.load(yaml_str)
assert type(x) == str
assert x == 'test_1 test_2 test_3'
Loading your expected output is often a good test to see what you
need to provide.
Therefore you would have to convert your list to a multi-word
string. Once more the problem is that you can only force the line
breaks in any YAML library known to me, by setting the width of the
document and there is a minimum width for most which is bigger than 4
(although that can be patched that doesn't solve the problem of that
this applies to the whole document).
yaml = YAML()
yaml.width = 5
s = ' '.join(data)
yaml.dump(s, sys.stdout)
result:
test1 test2
test3
...
This leaves what is IMO the best solution if you really don't want dashes: to
use a literal block style
scalar (string):
from ruamel.yaml.scalarstring import PreservedScalarString
yaml = YAML()
s = PreservedScalarString('\n'.join(data) + '\n')
yaml.dump(s, sys.stdout)
In that scalar style newlines are preserved:
|
test1
test2
test3

Preserve quotes and also add data with quotes in Ruamel

I am using Ruamel to preserve quote styles in human-edited YAML files.
I have example input data as:
---
a: '1'
b: "2"
c: 3
I read in data using:
def read_file(f):
with open(f, 'r') as _f:
return ruamel.yaml.round_trip_load(_f.read(), preserve_quotes=True)
I then edit that data:
data = read_file('in.yaml')
data['foo'] = 'bar'
I write back to disk using:
def write_file(f, data):
with open(f, 'w') as _f:
_f.write(ruamel.yaml.dump(data, Dumper=ruamel.yaml.RoundTripDumper, width=1024))
write_file('out.yaml', data)
And the output file is:
a: '1'
b: "2"
c: 3
foo: bar
Is there a way I can enforce hard quoting of the string 'bar' without also enforcing that quoting style throughout the rest of the file?
(Also, can I stop it from deleting the three dashes --- ?)

In order to preserve quotes (and literal block style) for string scalars, ruamel.yaml¹—in round-trip-mode—represents these scalars as SingleQuotedScalarString, DoubleQuotedScalarString and PreservedScalarString. The class definitions for these very thin wrappers can be found in scalarstring.py.
When serializing such instances are written "as they were read", although sometimes the representer falls back to double quotes when things get difficult, as that can represent any string.
To get this behaviour when adding new key-value pairs (or when updating an existing pair), you just have to create these instances yourself:
import sys
from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import SingleQuotedScalarString, DoubleQuotedScalarString
yaml_str = """\
---
a: '1'
b: "2"
c: 3
"""
yaml = YAML()
yaml.preserve_quotes = True
yaml.explicit_start = True
data = yaml.load(yaml_str)
data['foo'] = SingleQuotedScalarString('bar')
data.yaml_add_eol_comment('# <- single quotes added', 'foo', column=20)
yaml.dump(data, sys.stdout)
gives:
---
a: '1'
b: "2"
c: 3
foo: 'bar' # <- single quotes added
the yaml.explicit_start = True recreates the (superfluous) document start marker. Whether such a marker was in the original file or not is not "known" by the top-level dictionary object, so you have to re-add it by hand.
Please note that without preserve_quotes, there would be (single) quotes around the values 1 and 2 anyway to make sure they are seen as string scalars and not as integers.
¹ Of which I am the author.

Since Ruamel 0.15, set the preserve_quotes flag like this:
from ruamel.yaml import YAML
from pathlib import Path
yaml = YAML(typ='rt') # Round trip loading and dumping
yaml.preserve_quotes = True
data = yaml.load(Path("in.yaml"))
yaml.dump(data, Path("out.yaml"))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Get wrong values by parsing YAML - python

Related

Convert tuple with quotes to csv like string

ruamel.yaml dump lists without adding new line at the end

Convert csv file to json with no quotes around float values

YAML: Dump Python List Without Quotes

Preserve quotes and also add data with quotes in Ruamel

Categories

Resources