Parsing JSON gives JSONDecodeError: Unterminated string - Python

I have a document of newline-delimited JSONs, to which I apply some functions. Everything works up until this line, which looks exactly like this:
{"_id": "5f114", "type": ["Type1", "Type2"], "company": ["5e84734"], "answers": [{"title": " answer 1", "value": false}, {"title": "answer 2
", "value": true}, {"title": "This is a title.", "value": true}, {"title": "This is another title", "value": true}], "audios": [null], "text": {}, "lastUpdate": "2020-07-17T06:24:50.562Z", "title": "This is a question?", "description": "1000 €.", "image": "image.jpg", "__v": 0}
The entire code:
import json

def unimportant_function(d):
    d.pop('audios', None)
    return {k: v for k, v in d.items() if v != {}}

def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    data = handle.read()

dicts = parse_ndjson(data)

for d in dicts:
    new_d = unimportant_function(d)
    json_string = json.dumps(new_d, ensure_ascii=False)
    print(json_string)
The error JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259) happens at dicts = parse_ndjson(data). Why? I also have no idea what that symbol after "answer 2" is; it didn't appear in the data, but it appeared when I copy-pasted it.
What is the problem with the data?

The unprintable character embedded in the "answer 2" string is a paragraph separator, which is treated as whitespace by .splitlines():
>>> 'foo\u2029bar'.splitlines()
['foo', 'bar']
(Speculation: the ndjson file might be exploiting this to represent "this string should have a newline in it", working around the file format. If so, it should probably be using a \n escape instead.)
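Note that the character itself is perfectly legal inside a JSON string (JSON only forbids unescaped control characters below U+0020), so json.loads handles it fine once the line is kept in one piece:
>>> import json
>>> json.loads('{"title": "answer 2\u2029"}')
{'title': 'answer 2\u2029'}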
The character is, however, not treated specially if you iterate over the lines of the file normally:
>>> # For demonstration purposes, I create a `StringIO`
>>> # from a hard-coded string. A file object reading
>>> # from disk will behave similarly.
>>> import io
>>> for line in io.StringIO('foo\u2029bar'):
...     print(repr(line))
...
'foo\u2029bar'
So, the simple fix is to make parse_ndjson expect a sequence of lines already - don't call .splitlines, and fix the calling code appropriately. You can either pass the open handle directly:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle)
or pass it to list to create a list explicitly:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(list(handle))
or create the list using the provided .readlines() method:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle.readlines())
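or, if you'd rather keep reading the whole file into a string first, a minimal sketch that splits only on real newlines (unlike .splitlines(), str.split('\n') leaves \u2028/\u2029 untouched):
def parse_ndjson(data):
    # split on '\n' only, so separator characters inside JSON strings survive
    return [json.loads(l) for l in data.split('\n') if l.strip()]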

Related

Adding a comma between JSON objects in a datafile with Python?

I have a large file (about 3 GB) which contains what looks like a JSON file but isn't, because it lacks commas (,) between "observations" or JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }
And this is what I want (the comma between "observations"):
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
This is what I tried but it doesn't give me a comma where I need it:
import re

with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' ' + line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution presupposes that none of the fields in the JSON contains { or }.
If we assume that there is at least one blank line between JSON dictionaries, here is an idea: maintain a count of unclosed curly brackets ({) as unclosed_count, and when we meet an empty line while everything is closed, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> x = '{"a": 1}\n{"b": 2}\n' # Hypothetical output of open("dataframe.txt").read()
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved: raw_decode can't start decoding at whitespace, so you'll have to skip any whitespace following the returned index to find where the next value starts.
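A minimal sketch of that loop, assuming the whole file fits in memory and the values are separated only by whitespace:
import json

def iter_json_values(s):
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        # skip whitespace before the next value
        while idx < len(s) and s[idx].isspace():
            idx += 1
        if idx == len(s):
            break
        value, idx = decoder.raw_decode(s, idx)
        yield value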
Another way to view your data is that you have multiple JSON records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat until done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all of it is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the JSON itself.
import json

def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result

print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
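If mmap feels like too much machinery, one possible compromise is to feed raw_decode from fixed-size chunks so memory stays bounded; this sketch assumes no single record is bigger than the buffer is allowed to grow:
import json

def iter_records(filename, chunk_size=1 << 20):
    decoder = json.JSONDecoder()
    buf = ""
    with open(filename, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            buf = (buf + chunk).lstrip()
            while buf:
                try:
                    record, pos = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # record incomplete; read more data
                yield record
                buf = buf[pos:].lstrip()
            if not chunk:  # end of file; any trailing non-JSON text is ignored
                break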
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead.
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a close-bracket, followed by an empty line (or multiple whitespace), followed by an open-bracket, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    output_file.write("]")  # close the top-level array
Alternatively, if you want to stick with regex, then you could do
import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)
with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches a run of whitespace characters and captures it in group 1, which we then use in the replacement string as \1).
Note that you will need to read and run re.sub on the entire file, which will be slow.

How can I delete an item from a JSON file using the below method?

I need to remove data from a JSON file; at the minute I am using the following code:
import json

with open('E:/file/timings.json', 'r+') as f:
    qe = json.load(f)
    for item in qe['times']:
        if item['Proc'] == 'APS':
            print(f'{item["Num"]}')
            del item
    json.dump(qe, f, indent=4, sort_keys=False, ensure_ascii=False)
This doesn't delete anything from the JSON, here is a small example of my JSON file
{
    "times": [
        {
            "Num": "12345678901234567",
            "Start_Time": "2016-12-14 15:54:35",
            "Proc": "UPD",
        },
        {
            "Num": "12345678901234567",
            "Start_Time": "2016-12-08 15:34:05",
            "Proc": "APS",
        },
        {
            "Num": "12345678901234567",
            "Start_Time": "2016-11-30 11:20:21",
            "Proc": "Dev,
I would like it to look like this:
{
    "times": [
        {
            "Num": "12345678901234567",
            "Start_Time": "2016-12-14 15:54:35",
            "Proc": "UPD",
        },
        {
            "Num": "12345678901234567",
            "Start_Time": "2016-11-30 11:20:21",
            "Proc": "Dev,
As you can see, the portion containing APS as the process has been removed.
You could load your initial JSON, create a new one (here new_json) that doesn't contain the items whose 'Proc' is equal to 'APS', and then overwrite your JSON file with that new_json.
import json

content = json.loads(open('timings.json', 'r').read())
new_json = {'times': []}
for item in content['times']:
    if item['Proc'] != 'APS':
        new_json['times'].append(item)

file = open('timings.json', 'w')
file.write(json.dumps(new_json, indent=4, sort_keys=False, ensure_ascii=False))
file.close()
It is not good practice to delete elements while iterating over the list.
Use:
import json

with open('E:/file/timings.json', 'r') as f:
    qe = json.load(f)

qe['times'] = [item for item in qe['times'] if item['Proc'] != 'APS']  # delete the required elements

with open('E:/file/timings.json', 'w') as f:
    json.dump(qe, f, indent=4, sort_keys=False, ensure_ascii=False)
del, as you're using it, removes the variable item from your session but leaves the actual item untouched in the data structure. You need to explicitly remove whatever item is pointing to from your data structure. Also, you want to avoid deleting items from a list while you are iterating over said list. You should recreate your entire list:
qe['times'] = [item for item in qe['times'] if item['Proc'] != 'APS']
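A quick demonstration of why the del in the question has no effect on the list:
>>> data = [{'Proc': 'APS'}, {'Proc': 'UPD'}]
>>> for item in data:
...     del item  # only unbinds the loop variable
...
>>> data
[{'Proc': 'APS'}, {'Proc': 'UPD'}]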
You can use a helper function if you need to print:
def keep_item(thing):
    if thing['Proc'] == 'APS':
        print(thing['Num'])
        return False
    else:
        return True

qe['times'] = [item for item in qe['times'] if keep_item(item)]
You can use the below method to remove elements from the list, iterating in reverse so that popping an element doesn't shift the ones still to be checked:
for i in reversed(range(len(qe['times']))):
    if qe['times'][i]['Proc'] == 'APS':
        qe['times'].pop(i)
and then write back to the JSON file.
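Putting that together with the write-back (same load/dump pattern as the answer above):
import json

with open('E:/file/timings.json', 'r') as f:
    qe = json.load(f)

# remove in reverse so popping doesn't shift unvisited indices
for i in reversed(range(len(qe['times']))):
    if qe['times'][i]['Proc'] == 'APS':
        qe['times'].pop(i)

with open('E:/file/timings.json', 'w') as f:
    json.dump(qe, f, indent=4, sort_keys=False, ensure_ascii=False)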

write into a js file without passing in string quotes (in Python)

I am writing a parser to grab data from the GitHub API, and I would like to output the file in the following .js format:
//Ideal output
var LIST_DATA = [
    {
        "name": "Python",
        "type": "category"
    }]
However, I am having trouble writing the variable var LIST_DATA into the output.js file: no matter what I do with the string, it ends up in the output wrapped in quotes, as "var LIST_DATA = ".
For example:
//Current Output
"var LIST_DATA = "[
    {
        "name": "Python",
        "type": "category"
    }]
My Python script:
var = "var LIST_DATA = "
with open('list_data.js', 'w') as outfile:
outfile.write(json.dumps(var, sort_keys = True))
I also tried using the strip method, according to this StackOverflow answer, and got the same result:
var = "var LIST_DATA = "
with open('list_data.js', 'w') as outfile:
outfile.write(json.dumps(var.strip('"'), sort_keys = True))
I am assuming that whenever I dump the text into the .js file, the string gets written along with its double quotes... Is there a way around this?
Thanks.
If you pass a string to json.dumps it will always be quoted. The first part (the name of the variable) is not JSON - so you want to just write it verbatim, and then write the object using json.dumps:
var = "var LIST_DATA = "
my_dict = [
{
"name": "Python",
"type": "category"
}
]
with open('list_data.js', 'w') as outfile:
outfile.write(var)
# Write the JSON value here
outfile.write(json.dumps(my_dict))
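Equivalently, as a single write (the trailing semicolon is optional in JavaScript but conventional):
with open('list_data.js', 'w') as outfile:
    outfile.write("var LIST_DATA = " + json.dumps(my_dict, indent=4) + ";\n")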
Try
var = '''var LIST_DATA = [
    {
        name: "Python",
        type: "category"
    }]'''

with open("list_data.js", "w") as f:
    f.write(var)
The json library is not what you are looking for here.
Also, object keys in JavaScript do not require quotes unless the key is not a valid identifier (for example, if it contains spaces).

The collection of some JSON data into a file

Can you give me some idea of how to do this collection? The problem is this: I receive JSON, let's assume the following:
[{
    "pk": 1,
    "model": "store.book",
    "fields": {
        "name": "Mostly Harmless",
        "author": ["Douglas", "Adams"]
    }
}]
Then I open a file, save the data, and close the file. The next time (this is a cycle) I again receive JSON, for example the following:
[{
    "pk": 2,
    "model": "store.book",
    "fields": {
        "name": "Henry",
        "author": ["Hans"]
    }
}]
The second JSON must go into the same file that already holds the first. Here comes the problem: how to do it. At this stage I do it in the following way: delete the brackets and put commas. Is there a smarter and better way to do this job?
I am creating the JSON by serializing Django objects. I would be very grateful if you would share your ideas.
PS: It is important to use minimal memory. Assume that the file is around 50-60 GB and that memory can hold about 1 GB maximum.
You would have to convert your data into JSON and store it in a file, then read from the file again, append your new data to the object, and save it into the file again. Here is some code that might be useful for you:
Use the json module. Documentation is available at http://docs.python.org/2/library/json.html
The first time you write into the file, you can use something like:
>>> import json
>>> fileW = open("filename.txt", "w")
>>> json1 = [{
...     "pk": 1,
...     "model": "store.book",
...     "fields": {
...         "name": "Mostly Harmless",
...         "author": ["Douglas", "Adams"]
...     }
... }]
>>> json.dump(json1, fileW)
>>> fileW.close()
The following code could be used in a loop to read from the file and add data to it.
>>> fileLoop = open("filename.txt", "r+")
>>> jsonFromFile = json.load(fileLoop)
>>> jsonFromFile
[{u'pk': 1, u'model': u'store.book', u'fields': {u'name': u'Mostly Harmless', u'author': [u'Douglas', u'Adams']}}]
>>> newJson = [{
...     "pk": 2,
...     "model": "store.book",
...     "fields": {
...         "name": "Henry",
...         "author": ["Hans"]
...     }
... }]
>>> jsonFromFile.append(newJson[0])
>>> jsonFromFile
[{u'pk': 1, u'model': u'store.book', u'fields': {u'name': u'Mostly Harmless', u'author': [u'Douglas', u'Adams']}}, {'pk': 2, 'model': 'store.book', 'fields': {'name': 'Henry', 'author': ['Hans']}}]
>>> fileLoop.seek(0)  # rewind, otherwise the dump would be appended after the old content
>>> json.dump(jsonFromFile, fileLoop)
>>> fileLoop.truncate()  # drop any leftover bytes from the old content
>>> fileLoop.close()
You do not need to parse the JSON because you are only storing it. The following (a) creates a file and (b) appends text to the file in each cycle.
from os.path import getsize

def init(filename):
    """
    Creates a new file and sets its content to "[]".
    """
    with open(filename, 'w') as f:
        f.write("[]")

def append(filename, text):
    """
    Appends a JSON to a file that has been initialised with `init`.
    """
    length = getsize(filename)  # figure out the number of bytes in the file
    with open(filename, 'r+') as f:
        f.seek(length - 1)  # go to the end of the file, just before the closing bracket
        if length > 2:  # insert a delimiter if this is not the first JSON
            f.write(",\n")
        f.write(text[1:-1])  # append the JSON without its outer brackets
        f.write("]")  # write a closing bracket
filename = "temp.txt"
init(filename)
while mycondition:
append(filename, getjson())
If you did not have to save the JSON after each cycle, you could do the following:
jsons = []
while mycondition:
    jsons.append(getjson()[1:-1])

with open("temp.txt", "w") as f:
    f.write("[")
    f.write(",".join(jsons))
    f.write("]")
To avoid creating multigigabyte objects, you could store each object on a separate line. It requires you to dump each object without newlines used for formatting (JSON strings themselves may contain \n (two chars) as usual):
import json

with open('output.txt', 'a') as file:  # open the file in append mode
    json.dump(obj, file, separators=(',', ':'))  # the most compact representation
    file.write("\n")

How to dump a dict to a JSON file?

I have a dict like this:
sample = {'ObjectInterpolator': 1629, 'PointInterpolator': 1675, 'RectangleInterpolator': 2042}
I can't figure out how to dump the dict to a JSON file as shown below:
{
    "name": "interpolator",
    "children": [
        {"name": "ObjectInterpolator", "size": 1629},
        {"name": "PointInterpolator", "size": 1675},
        {"name": "RectangleInterpolator", "size": 2042}
    ]
}
Is there a pythonic way to do this?
You may guess that I want to generate a d3 treemap.
import json

with open('result.json', 'w') as fp:
    json.dump(sample, fp)
This is an easier way to do it.
In the second line of code the file result.json gets created and opened as the variable fp.
In the third line, your dict sample gets written into result.json!
Combining the answers of @mgilson and @gnibbler, I found that what I needed was this:
d = {
    "name": "interpolator",
    "children": [{
        'name': key,
        "size": value
    } for key, value in sample.items()]
}
j = json.dumps(d, indent=4)
with open('sample.json', 'w') as f:
    print >> f, j
This way, I got a pretty-printed JSON file. The trick print >> f, j (Python 2) was found here:
http://www.anthonydebarros.com/2012/03/11/generate-json-from-sql-using-python/
d = {"name":"interpolator",
"children":[{'name':key,"size":value} for key,value in sample.items()]}
json_string = json.dumps(d)
Since Python 3.7 the ordering of dicts is retained: https://docs.python.org/3.8/library/stdtypes.html#mapping-types-dict
Dictionaries preserve insertion order. Note that updating a key does not affect the order. Keys added after deletion are inserted at the end
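In practice this means json.dump writes keys in insertion order unless you ask for sorting explicitly:
>>> import json
>>> json.dumps({'b': 1, 'a': 2})
'{"b": 1, "a": 2}'
>>> json.dumps({'b': 1, 'a': 2}, sort_keys=True)
'{"a": 2, "b": 1}'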
Also wanted to add this (Python 3.7)
import json

with open("dict_to_json_textfile.txt", 'w') as fout:
    json_dumps_str = json.dumps(a_dictionary, indent=4)
    print(json_dumps_str, file=fout)
Update (11-04-2021): So the reason I added this example is that sometimes you can use the print() function to write to files, and this also shows how to use the indentation (unindented stuff is evil!!). However, I have recently started learning about threading, and some of my research has shown that the print() function is not always thread-safe. So if you need threading you might want to be careful with this one.
This should give you a start
>>> import json
>>> print json.dumps([{'name': k, 'size': v} for k, v in sample.items()], indent=4)
[
    {
        "name": "PointInterpolator",
        "size": 1675
    },
    {
        "name": "ObjectInterpolator",
        "size": 1629
    },
    {
        "name": "RectangleInterpolator",
        "size": 2042
    }
]
with pretty-print format:
import json

with open(path_to_file, 'w') as file:
    json_string = json.dumps(sample, default=lambda o: o.__dict__, sort_keys=True, indent=2)
    file.write(json_string)
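The default hook only kicks in for values the encoder can't serialize natively; a small sketch with a hypothetical class to show the effect:
import json

class Interpolator:  # hypothetical class, for illustration only
    def __init__(self, name, size):
        self.name = name
        self.size = size

# __dict__ turns the instance into a plain dict the encoder can handle
print(json.dumps(Interpolator("ObjectInterpolator", 1629),
                 default=lambda o: o.__dict__, indent=2))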
If you're using Path:
from pathlib import Path
import json

example_path = Path('/tmp/test.json')
example_dict = {'x': 24, 'y': 25}
json_str = json.dumps(example_dict, indent=4) + '\n'
example_path.write_text(json_str, encoding='utf-8')
