I have some csv files that I need to convert to json. Some of the float values in the csv are numeric strings (to maintain trailing zeros). When converting to json, all keys and values are wrapped in double quotes. I need the numeric string float values to not have quotes, but maintain the trailing zeros.
Here is a sample of the input csv file:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0000000000,0.0000000000,5.0000000000,1234567.0000000000,69.0000000000,1.0000000000,,4321987.0000000000,1,000-000-000-00,10012.0000000000,10002.0000000000,3.0000000000,,1.0000000000,0,,0,000-000-000-00,0,bc:1234346
The json output I am getting is:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":"2.0000000000","RETIRED":"0.0000000000","INVOICEDAYOFWEEK":"5.0000000000","ID":"1234567.0000000000","BEANVERSION":"69.0000000000","ACCOUNTTYPE":"1.0000000000","ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":"4321987.0000000000","NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":"12345.0000000000","INVOICEDELIVERYTYPE":"98765.0000000000","DISTRIBUTIONLIMITTYPE":"3.0000000000","CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":"1.0000000000","HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Here is the code I am using:
import csv
import json
csvfile = open('output2.csv', 'r')
jsonfile = open('output2.json', 'w')
readHeaders = csv.reader(csvfile)
fieldnames = next(readHeaders)
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
    json.dump(row, jsonfile, separators=(',', ':'))
    jsonfile.write('\n')
I would like the output to have no quotes around float values, similar to the following:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Now that I understand your question better from your comments, here's a completely different answer. Note that it doesn't use the json module and just does the needed processing "manually". Although it could probably be done using the module, getting it to format the Python data types it recognizes by default differently can be fairly involved (I know from experience), compared to the relatively simple logic used below.
Another note: Like your code, this converts each row of the csv file into a valid JSON object and writes each one to the file on a separate line. However, the contents of the resulting file technically won't be valid JSON, because all of these individual objects would need to be comma-separated and enclosed in [ ] brackets (thereby becoming a valid JSON array).
import csv
with open('output2.csv', 'r', newline='') as csvfile, \
     open('output2.json', 'w') as jsonfile:
    for row in csv.DictReader(csvfile):
        newfmt = []
        for field, value in row.items():
            field = '"{}"'.format(field)
            try:
                float(value)
            except ValueError:
                value = 'null' if value == '' else '"{}"'.format(value)
            else:
                # Avoid changing integer values to float.
                try:
                    int(value)
                except ValueError:
                    pass
                else:
                    value = '"{}"'.format(value)
            newfmt.append((field, value))
        json_repr = '{' + ','.join(':'.join(pair) for pair in newfmt) + '}'
        jsonfile.write(json_repr + '\n')
This is the JSON written to the file:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"bc:1234346"}
Shown again below with added whitespace:
{"ACCOUNTNAMEDENORM": "John Smith",
"DELINQUENCYSTATUS": 2.0000000000,
"RETIRED": 0.0000000000,
"INVOICEDAYOFWEEK": 5.0000000000,
"ID": 1234567.0000000000,
"BEANVERSION": 69.0000000000,
"ACCOUNTTYPE": 1.0000000000,
"ORGANIZATIONTYPEDENORM": null,
"HIDDENTACCOUNTCONTAINERID": 4321987.0000000000,
"NEWPOLICYPAYMENTDISTRIBUTABLE": "1",
"ACCOUNTNUMBER": "000-000-000-00",
"PAYMENTMETHOD": 12345.0000000000,
"INVOICEDELIVERYTYPE": 98765.0000000000,
"DISTRIBUTIONLIMITTYPE": 3.0000000000,
"CLOSEDATE": null,
"FIRSTTWICEPERMTHINVOICEDOM": 1.0000000000,
"HELDFORINVOICESENDING": "0",
"FEINDENORM": null,
"COLLECTING": "0",
"ACCOUNTNUMBERDENORM": "000-000-000-00",
"CHARGEHELD": "0",
"PUBLICID": "bc:1234346"}
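If the file as a whole has to be valid JSON, the same manual formatting can be wrapped in an array, as the note above describes. The following is only a minimal sketch of that idea: the helper names format_value and rows_to_json_array and the small in-memory sample are made up for illustration, and it uses json.dumps for keys and ordinary strings (a change from the code above) so that embedded quotes get escaped.

```python
import csv
import io
import json

def format_value(value):
    """Return the JSON text for one CSV cell (hypothetical helper)."""
    if not value:
        return 'null'                  # empty cell becomes null
    try:
        float(value)
    except ValueError:
        return json.dumps(value)       # ordinary string: quoted and escaped
    try:
        int(value)
    except ValueError:
        # Float-looking text is emitted verbatim, trailing zeros included.
        # Caveat: text such as "nan" or "inf" also passes float() and
        # would not be valid JSON; this sketch does not handle that.
        return value
    return json.dumps(value)           # integer-looking text stays a string

def rows_to_json_array(csvfile):
    """Build the text of one JSON array holding an object per CSV row."""
    objects = []
    for row in csv.DictReader(csvfile):
        pairs = (json.dumps(k) + ':' + format_value(v) for k, v in row.items())
        objects.append('{' + ','.join(pairs) + '}')
    return '[' + ',\n'.join(objects) + ']'

# Made-up sample data standing in for output2.csv.
sample = io.StringIO('NAME,STATUS,FLAG,CLOSEDATE\nJohn Smith,2.0000000000,1,\n')
text = rows_to_json_array(sample)
print(text)
# [{"NAME":"John Smith","STATUS":2.0000000000,"FLAG":"1","CLOSEDATE":null}]
```

The resulting text can be loaded back with json.loads, although the trailing zeros are of course gone again once it has been parsed into Python floats.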
Might be a bit of overkill, but with pandas it would be pretty simple:
import pandas as pd
data = pd.read_csv('output2.csv')
data.to_json('output2.json')
One solution is to use a regular expression to see if the string value looks like a float, and convert it to a float if it is.
import re
null = None
j = {"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":"2.0000000000",
"RETIRED":"0.0000000000","INVOICEDAYOFWEEK":"5.0000000000",
"ID":"1234567.0000000000","BEANVERSION":"69.0000000000",
"ACCOUNTTYPE":"1.0000000000","ORGANIZATIONTYPEDENORM":null,
"HIDDENTACCOUNTCONTAINERID":"4321987.0000000000",
"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00",
"PAYMENTMETHOD":"12345.0000000000","INVOICEDELIVERYTYPE":"98765.0000000000",
"DISTRIBUTIONLIMITTYPE":"3.0000000000","CLOSEDATE":null,
"FIRSTTWICEPERMTHINVOICEDOM":"1.0000000000","HELDFORINVOICESENDING":"0",
"FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00",
"CHARGEHELD":"0","PUBLICID":"xx:1234346"}
for key in j:
    if j[key] is not None:
        if re.match(r"^\d+?\.\d+?$", j[key]):
            j[key] = float(j[key])
I used null = None here to deal with the "null"s that show up in the JSON. But you can replace 'j' here with each CSV row you're reading, then use this to update the row before writing it back with the floats replacing the strings.
If you're OK with converting any numeric string into a float, you can skip the regular expression (the re.match() call) and instead try float(j[key]) inside a try/except ValueError. Note that j[key].isnumeric() won't work as the test here, because it returns False for strings that contain a decimal point.
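To make the "replace j with each CSV row" suggestion concrete, here is a rough sketch. The function name convert_row and the in-memory sample are made up for illustration, and mapping empty cells to None (so they come out as null) is my assumption about what is wanted.

```python
import csv
import io
import json
import re

FLOAT_RE = re.compile(r"^\d+?\.\d+?$")  # same pattern as in the answer

def convert_row(row):
    """Return a copy of row with float-looking strings turned into floats."""
    converted = {}
    for key, value in row.items():
        if value == "":
            converted[key] = None      # assumption: empty cell -> null
        elif FLOAT_RE.match(value):
            converted[key] = float(value)
        else:
            converted[key] = value
    return converted

# Made-up sample data standing in for the real CSV file.
sample = io.StringIO("NAME,STATUS,FLAG,CLOSEDATE\nJohn Smith,2.0000000000,1,\n")
lines = [json.dumps(convert_row(row), separators=(",", ":"))
         for row in csv.DictReader(sample)]
print(lines[0])
# {"NAME":"John Smith","STATUS":2.0,"FLAG":"1","CLOSEDATE":null}
```

Note that the value is written as 2.0: once the string has become a Python float, the trailing zeros are no longer part of it.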
EDIT: I don't think floats in Python handle the "precision" in a way you might think. It may look like 2.0000000000 is being "truncated" to 2.0, but I think this is more of a formatting and display issue, rather than losing information. Consider the following examples:
>>> float(2.0000000000)
2.0
>>> float(2.00000000001)
2.00000000001
>>> float(1.00) == float(1.000000000)
True
>>> float(3.141) == float(3.140999999)
False
>>> float(3.141) == float(3.1409999999999999)
True
>>> print('%.10f' % 3.14)
3.1400000000
It's possible though to get the JSON to have those zeroes, but in that case it comes down to treating the number as a string, namely a formatted one.
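A small illustration of that last point, as a sketch only: the zeros exist in the text form of the number, not in the float itself. decimal.Decimal is shown here as one standard-library type that does remember them; using it is my suggestion, not something from the question.

```python
from decimal import Decimal

text = "2.0000000000"
as_float = float(text)

print(as_float)                   # 2.0
print(format(as_float, ".10f"))   # 2.0000000000 (a formatted string)
print(Decimal(text))              # 2.0000000000 (Decimal keeps the zeros)
```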
Hah, this is really interesting: I was looking for the opposite answer, that is, results with quotes.
Actually, it's very easy to remove them automatically: just remove the param "separators=(',', ':')".
For me, just adding this param was OK.
How to convert tuple
text = ('John', '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
to csv format
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL\\r\\n""", """Johny\\nIs\\nHere"""'
or even omitting the special chars at the end
out = '"John", """n""", """ABC 123\\nDEF, 456\\nijKL""", """Johny\\nIs\\nHere"""'
I came up with this monster
out1 = ','.join(f'""{t}""' if t.startswith('"') and t.endswith('"')
                else f'"{t}"' for t in text)
out2 = out1.replace('\n', '\\n').replace('\r', '\\r')
You can get pretty close to what you want with the csv and io modules from the standard library:
use csv to correctly encode the delimiters and handle the quoting rules; it only writes to a file handle
use io.StringIO for that file handle to get the resulting CSV as a string
import csv
import io
f = io.StringIO()
text = ("John", '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
writer = csv.writer(f)
writer.writerow(text)
csv_str = f.getvalue()
csv_repr = repr(csv_str)
print("CSV_STR")
print("=======")
print(csv_str)
print("CSV_REPR")
print("========")
print(csv_repr)
and that prints:
CSV_STR
=======
John,"""n""","""ABC 123
DEF, 456GH
ijKl""
","""Johny
Is
Here"""
CSV_REPR
========
'John,"""n""","""ABC 123\nDEF, 456GH\nijKl""\r\n","""Johny\nIs\nHere"""\r\n'
csv_str is what you'd see in a file if you wrote directly to a file you opened for writing; it is true CSV.
csv_repr is kinda what you asked for when you showed us out, but not quite. Your example included "doubly escaped" newlines \\n and carriage returns \\r\\n. CSV doesn't need to escape those characters any more because the entire field is quoted. If you need that, you'll need to do it yourself with something like:
csv_repr.replace(r"\r", r"\\r").replace(r"\n", r"\\n")
but again, that's not necessary for valid CSV.
Also, I don't know how to make the writer include an initial space before every field after the first field, like the spaces you show between "John" and "n" and then after "n" in:
out = 'John, """n""", ...'
The reader can be configured to expect and ignore an initial space, with Dialect.skipinitialspace, but I don't see any options for the writer.
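If the space after each comma really is required, one possible workaround (a sketch, not a feature of the csv module) is to let csv.writer encode each field on its own and then join the pieces with ", " yourself. The helper names encode_field and join_with_space are made up for this example.

```python
import csv
import io

def encode_field(field):
    """Encode a single field using csv.writer's default quoting rules."""
    buf = io.StringIO()
    csv.writer(buf).writerow([field])
    # writerow() appends the default "\r\n" line terminator; drop it.
    return buf.getvalue()[:-2]

def join_with_space(fields):
    """Join individually encoded fields with a comma and a space."""
    return ", ".join(encode_field(field) for field in fields)

text = ("John", '"n"', '"ABC 123\nDEF, 456GH\nijKl"\r\n', '"Johny\nIs\nHere"')
line = join_with_space(text)
print(repr(line))

# Reading it back only works if the reader is told to skip the space.
parsed = next(csv.reader(io.StringIO(line, newline=""), skipinitialspace=True))
print(tuple(parsed) == text)
```

A reader that isn't configured with skipinitialspace would treat the space as part of the field, so this output is less portable than what the writer produces by default.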
I have just started to learn Python and I have a task of converting a JSON file to a CSV file with a semicolon as the delimiter and with three constraints.
My JSON is:
{"_id": "5cfffc2dd866fc32fcfe9fcc",
"tuple5": ["system1/folder", "system3/folder"],
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
"tuple3": ["system2/folder/text2.txt"],
"tuple2": ["system2/folder"],
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
"tupleSize": 3}
The output CSV should be in a form:
system1 ; system2 ; system3
system1/folder ; ~ ; system3/folder
system1/folder/text3.txt ; system2/folder/text3.txt ; ~
~ ; system2/folder/text2.txt ; ~
~ ; system2/folder ; ~
system1/folder/text1.txt ; system2/folder/text1.txt ; ~
So the three constraints are that the tupleSize will indicate the number of rows, the first part of the array elements i.e., sys1, sys2 and sys3 will be the array elements and finally only those elements belonging to a particular system will have the values in the CSV file (rest is ~).
I found a few posts regarding the conversion in Python like this and this. None of them had constraints in any way related to these, and I am unable to figure out how to approach this.
Can someone help?
EDIT: I should mention that the array elements are dynamic and thus the row headers may vary in the CSV file.
What you want to do is fairly substantial, so if it's just a Python learning exercise, I suggest you begin with more elementary tasks.
I also think you've got what most folks call rows and columns reversed — so be warned that everything below, including the code, is using them in the opposite sense to the way you used them in your question.
Anyway, the code below first preprocesses the data to determine what the columns or fieldnames of the CSV file are going to be and to make sure there are the right number of them as specified by the 'tupleSize' key.
Assuming that constraint is met, it then iterates through the data a second time and extracts the column/field values from each key value, putting them into a dictionary whose contents represents a row to be written to the output file — and then does that when finished.
Updated
Modified to remove all keys that start with "_id" in the JSON object dictionary.
import csv
import json
import re
SEP = '/' # Value sub-component separator.
id_regex = re.compile(r"_id\d*")
json_string = '''
{"_id1": "5cfffc2dd866fc32fcfe9fc1",
"_id2": "5cfffc2dd866fc32fcfe9fc2",
"_id3": "5cfffc2dd866fc32fcfe9fc3",
"tuple5": ["system1/folder", "system3/folder"],
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
"tuple3": ["system2/folder/text2.txt"],
"tuple2": ["system2/folder"],
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
"tupleSize": 3}
'''
data = json.loads(json_string) # Convert JSON string into a dictionary.
# Remove non-path items from dictionary.
tupleSize = data.pop('tupleSize')
_ids = {key: data.pop(key)
            for key in tuple(data.keys()) if id_regex.search(key)}
#print(f'_ids: {_ids}')
max_columns = int(tupleSize) # Use to check a constraint.
# Determine how many columns are present and what they are.
columns = set()
for key in data:
    paths = data[key]
    if not paths:
        raise RuntimeError('key with no paths')
    for path in paths:
        comps = path.split(SEP)
        if len(comps) < 2:
            raise RuntimeError('component with no subcomponents')
        columns.add(comps[0])
if len(columns) > max_columns:
    raise RuntimeError('too many columns - conversion aborted')
# Create CSV file.
with open('converted_json.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, delimiter=';', restval='~',
                            fieldnames=sorted(columns))
    writer.writeheader()
    for key in data:
        row = {}
        for path in data[key]:
            column, *_ = path.split(SEP, maxsplit=1)
            row[column] = path
        writer.writerow(row)
print('Conversion complete')
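The ~ placeholders come from the restval argument of csv.DictWriter, which is written for any fieldname missing from a row dictionary. Here is a tiny stand-alone sketch of just that part, writing to an in-memory buffer instead of a file, with the fieldnames hard-coded for illustration:

```python
import csv
import io

buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, delimiter=';', restval='~',
                        fieldnames=['system1', 'system2', 'system3'])
writer.writeheader()
# Keys missing from a row dictionary are filled in with restval.
writer.writerow({'system1': 'system1/folder', 'system3': 'system3/folder'})
writer.writerow({'system2': 'system2/folder/text2.txt'})

output = buf.getvalue()
print(output)
# system1;system2;system3
# system1/folder;~;system3/folder
# ~;system2/folder/text2.txt;~
```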
I am getting the following error. What does it mean?
AttributeError: 'bool' object has no attribute 'decode'
in code line : writer.writerow({k:v.decode('utf8') for k,v in dictionary.iteritems()})
My code looks like :
import json
import csv
def make_csv(data):
    fname = "try.csv"
    with open(fname,'wb') as outf:
        dic_list = data['bookmarks']
        dictionary = dic_list[0]
        writer = csv.DictWriter(outf,fieldnames = sorted(dictionary.keys()), restval = "None", extrasaction = 'ignore')
        writer.writeheader()
        for dictionary in dic_list:
            writer.writerow({k:v.decode('utf8') for k,v in dictionary.iteritems()})
    return
def main():
    fil = "readability.json"
    f = open(fil,'rb')
    data = json.loads(f.read())
    print type(data)
    make_csv(data)
The json file looks like :
{ "bookmarks" : [{..},{..} ..... {..}],
"recommendations" : [{..},{..}...{..}]
}
where [..] = list and {..} = dictionary
EDIT :
The above problem was solved, but when I ran the above code, the generated CSV file had some discrepancies: some rows ended up under different headers in the .csv file. Any suggestions?
Somewhere in your readability.json file you have an entry that's a boolean value, like true or false (in JSON), translated to the Python True and False objects.
You should not be using decode() in the first place, however, as json.loads() already produces Unicode values for strings.
Since this is Python 2, you want to encode your data to UTF-8 instead. Convert your objects to unicode first:
writer.writerow({
    k: unicode(v).encode('utf8')
    for k, v in dictionary.iteritems()
})
Converting existing Unicode strings to unicode is a no-op, but for integers, floating point values, None and boolean values you'll get a nice Unicode representation that can be encoded to UTF-8:
>>> unicode(True).encode('utf8')
'True'
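For comparison, on Python 3 the csv module works with text directly, so no explicit encode or decode step is needed; the encoding is chosen when the file is opened (for example open(fname, 'w', newline='', encoding='utf-8')). The following is only a sketch under that assumption: the function name write_bookmarks and the sample data are made up, and collecting the fieldnames from every dictionary is my own addition.

```python
import csv
import io

def write_bookmarks(bookmarks, outf):
    """Write a list of dictionaries as CSV rows (Python 3 sketch)."""
    # Collect the keys of every dictionary, not just the first one.
    fieldnames = sorted({key for bookmark in bookmarks for key in bookmark})
    writer = csv.DictWriter(outf, fieldnames=fieldnames,
                            restval='None', extrasaction='ignore')
    writer.writeheader()
    # Booleans and numbers are converted with str() by the writer itself.
    writer.writerows(bookmarks)

# Made-up sample standing in for data['bookmarks'].
sample = [{'title': 'Café', 'favorite': True},
          {'title': 'B', 'archive': False}]
buf = io.StringIO(newline='')
write_bookmarks(sample, buf)
print(buf.getvalue())
# archive,favorite,title
# None,True,Café
# False,None,B
```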
I am converting an XML file to a JSON file. I do this by opening the XML, parsing it with the xmltodict module, and then using the .get method to traverse the tree to the level I want. This level holds the parents of the leaves. I then check whether a certain condition on some of the leaves of each of these tasks is true, and if it is, I use json.dumps() and write the result to the file. The issue (I think this is where it is stemming from) is that when I append only one JSON object to the file, no comma is appended to the end of the object because it thinks it is the only object. I tried combating this by appending a ',' at the end of each JSON object, but then when I try to use the json.loads() method it gives me an error saying "No JSON object could be decoded". However, when I manually append the '[' and ']' to the file it doesn't give me an error. My code is below and I'd appreciate any help/suggestions you have.
def getTasks(filename):
    f = open(filename, 'r')
    a = open('tasksJSON', 'w')
    a.write('[')
    d = xmltodict.parse(f)
    l = d.get('Project').get('Tasks').get('Task')
    for task in l:
        if (task['Name'] == 'dinner'): #criteria for desirable tasks
            j = json.dumps(task)
            a.write (str(j))
            a.write(',')
    a.write(']')
    f.close()
    a.close()
This works and puts everything in tasksJSON but like I said, when I call
my_file = open('tasksJSON', 'r')
data = json.load(my_file) # LINE THAT GIVES ME ERROR
I get an error saying
ValueError: No JSON object could be decoded
and the output file contains:
[{"UID": "4", "ID": "14", "Name": "Design"},{"UID": "5", "ID": "15", "Name": "Basic Skeleton"}]
^
this is the comma I manually inserted
Make it this way:
def getTasks(filename):
    f = open(filename, 'r')
    a = open('tasksJSON', 'w')
    x = []
    d = xmltodict.parse(f)
    l = d.get('Project').get('Tasks').get('Task')
    for task in l:
        if (task['Name'] == 'dinner'): #criteria for desirable tasks
            #j = json.dumps(task)
            x.append(task)
            #a.write (str(j))
            #a.write(',')
    a.write(json.dumps(x))
    f.close()
    a.close()
JSON doesn't allow extra commas at the end of an array or object. But your code adds such an extra comma. If you look at the official grammar here, you can only have a , before another value. And Python's json library conforms to that grammar, so:
>>> json.loads('[1, 2, 3, ]')
ValueError: Expecting value: line 1 column 11 (char 10)
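As a quick self-contained check of that behaviour, here is a small sketch; the helper name parses is made up. It relies only on json.loads raising ValueError (json.JSONDecodeError is a subclass of it on newer Python versions) when the text is not valid JSON.

```python
import json

def parses(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # also catches json.JSONDecodeError
        return False
    return True

print(parses('[1, 2, 3, ]'))  # False: trailing comma
print(parses('[1, 2, 3]'))    # True
```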
To fix this, you could do something like this:
first = True
for task in l:
    if (task['Name'] == 'dinner'): #criteria for desirable tasks
        if first:
            first = False
        else:
            a.write(',')
        j = json.dumps(task)
        a.write(str(j))
On the other hand, if memory isn't an issue, it might be simpler—and certainly cleaner—to just add all of the objects to a list and then json.dumps that list:
output = []
for task in l:
    if (task['Name'] == 'dinner'): #criteria for desirable tasks
        output.append(task)
a.write(json.dumps(output))
Or, more simply:
json.dump([task for task in l if task['Name'] == 'dinner'], a)
(In fact, even if memory is an issue, you can extend JSONEncoder, as shown in the docs, to handle iterators by converting them lazily into JSON arrays, but this is a bit tricky, so I won't show the details unless someone needs them.)
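For what it's worth, a simpler alternative to extending JSONEncoder (and not the approach from the docs) is a small helper that writes the brackets and commas itself while consuming any iterable lazily. This is just a sketch; the name dump_array_lazily and the sample tasks are made up.

```python
import io
import json

def dump_array_lazily(items, fp):
    """Write items to fp as one JSON array without building a list first."""
    fp.write('[')
    for index, item in enumerate(items):
        if index:              # a comma goes before every item but the first
            fp.write(',')
        json.dump(item, fp)
    fp.write(']')

# Made-up tasks; the generator expression is consumed one item at a time.
tasks = [{'Name': 'dinner', 'ID': '14'},
         {'Name': 'lunch', 'ID': '15'},
         {'Name': 'dinner', 'ID': '16'}]
buf = io.StringIO()
dump_array_lazily((t for t in tasks if t['Name'] == 'dinner'), buf)
print(buf.getvalue())
```

The output loads back with json.loads, and an empty iterable simply produces [].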
It seems that you write several JSON objects into the file and add your own square brackets. Hence, it cannot be loaded as a single object.
This is the python script:
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write( ','.join(bits) )
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "**X_CORD2**"; "Invoice_2M";
"**Y_CORD42**"; "SIZE_ID37""
The data is oddly formatted, as you can see; in particular, it has two double quotes at the end of the line instead of the single one you would expect.
I need to extract the XCORD and YCORD information, like XCORD = 2 and YCORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re
f = open('csvdata.csv','rb')
fo = open('out6.csv','wb')
for line in f:
    bits = line.split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)
    fo.write( ','.join(bits) )
f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
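For reference, a rough Python 3 sketch of the csv-module variant might look like the following. It assumes the second column arrives as one properly quoted CSV field (which the sample data in the question may not be), and the function name rewrite_rows and the sample row are made up for illustration.

```python
import csv
import io
import re

# Same idea as the regular expression above, compiled once.
CORD_RE = re.compile(r'X_CORD(\d+).*?Y_CORD(\d+)', re.DOTALL)

def rewrite_rows(reader, writer):
    """Replace column 2 with 'input' and append an X_Y column to each row."""
    for row in reader:
        match = CORD_RE.search(row[1])
        if match is None:
            raise ValueError('Row had unexpected format: {0!r}'.format(row[1]))
        row[1] = 'input'
        row.append('({0}_{1})'.format(match.group(1), match.group(2)))
        writer.writerow(row)

# Made-up, well-formed sample row standing in for csvdata.csv.
source = io.StringIO(
    'failurelog_wl,"inputfile/source/XXXXXXXX; X_CORD2; Invoice_2M; '
    'Y_CORD42; SIZE_ID37",extra\r\n', newline='')
out = io.StringIO(newline='')
rewrite_rows(csv.reader(source), csv.writer(out))
print(out.getvalue())
# failurelog_wl,input,extra,(2_42)
```

Note that with the default quoting rules the writer emits input without surrounding quotes; passing quoting=csv.QUOTE_ALL to csv.writer would quote every field instead.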