json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 21641 (char 21640) - python

I am getting the error as described in the title when calling the function:
def read_file(file_name):
    """Return all the followers of a user."""
    f = open(file_name, 'r')
    data = []
    for line in f:
        data.append(json.loads(line.strip()))
    f.close()
    return data
Sample data looks like this:
"from": {"name": "Ronaldo Naz\u00e1rio", "id": "618977411494412"},
"message": "The king of football. Happy birthday, Pel\u00e9!",
"type": "photo", "shares": {"count": 2585}, "created_time": "2018-10-23T11:43:49+0000",
"link": "https://www.facebook.com/ronaldo/photos/a.661211307271022/2095750393817099/?type=3",
"status_type": "added_photos",
"reactions": {"data": [], "summary": {"total_count": 51779, "viewer_reaction": "NONE"}},
"comments": {"data": [{"created_time": "2018-10-23T11:51:57+0000", ... }]}

You are trying to parse each line of the file as JSON on its own, which is probably wrong. You should read the entire file and convert it to JSON at once, preferably using with so Python handles opening and closing the file even if an exception occurs.
The whole function can be condensed to two lines, because json.load accepts a file object and handles the reading itself.
import json

def read_file(file_name):
    with open(file_name) as f:
        return json.load(f)
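As a quick check, here is the two-line version in action against a tiny hypothetical file (the function is repeated so the snippet runs on its own; the sample contents are made up for the demo):

```python
import json
import os
import tempfile

def read_file(file_name):
    """Parse the entire file as one JSON document."""
    with open(file_name) as f:
        return json.load(f)

# Hypothetical sample file for the demo
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    json.dump({"type": "photo", "shares": {"count": 2585}}, f)
    path = f.name

data = read_file(path)
os.remove(path)
print(data["shares"]["count"])  # 2585
```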

Related

Separate large JSON object into many different files

I have a JSON file with 10000 data entries, like below.
{
    "1": {
        "name": "0",
        "description": "",
        "image": ""
    },
    "2": {
        "name": "1",
        "description": "",
        "image": ""
    },
    ...
}
I need to write each entry in this object into its own file.
For example, the output of each file looks like this:
1.json
{
    "name": "",
    "description": "",
    "image": ""
}
I have the following code, but I'm not sure how to proceed from here. Can anyone help with this?
import json

with open('sample.json', 'r') as openfile:
    # Reading from json file
    json_object = json.load(openfile)
You can use a for loop to iterate over all the fields in the outer object, and then create a new file for each inner object:
import json

with open('sample.json', 'r') as input_file:
    json_object = json.load(input_file)

for key, value in json_object.items():
    with open(f'{key}.json', 'w') as output_file:
        json.dump(value, output_file)
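As a quick sanity check, here is the same loop run against a tiny in-memory sample, writing into a temporary directory (the sample data and directory are made up for the demo):

```python
import json
import os
import tempfile

sample = {"1": {"name": "0", "description": "", "image": ""},
          "2": {"name": "1", "description": "", "image": ""}}
outdir = tempfile.mkdtemp()

# One output file per top-level key, named <key>.json
for key, value in sample.items():
    with open(os.path.join(outdir, f'{key}.json'), 'w') as output_file:
        json.dump(value, output_file)

with open(os.path.join(outdir, '1.json')) as f:
    print(json.load(f))  # {'name': '0', 'description': '', 'image': ''}
```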

Adding a comma between JSON objects in a datafile with Python?

I have a large file (about 3 GB) which contains what looks like a JSON file but isn't, because it lacks commas (,) between "observations", i.e. JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }
And this is what I want (the comma between "observations"):
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
This is what I tried but it doesn't give me a comma where I need it:
import re

with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' ' + line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution presupposes that none of the fields in the JSON contain { or }.
If we assume that there is at least one blank line between JSON dictionaries, here is an idea: maintain a count of unclosed curly brackets ({) as unclosed_count, and when we meet an empty line, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> import json
>>> x = '{"a": 1}\n{"b":2}\n'  # hypothetical output of open("dataframe.txt").read()
>>> decoder = json.JSONDecoder()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved; note that raw_decode can't start decoding on whitespace, so you'll have to use the returned index, skipping any whitespace after it, to find where the next value starts.
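A minimal sketch of that loop (the sample string stands in for the file contents):

```python
import json

decoder = json.JSONDecoder()
text = '{"a": 1}\n{"b": 2}\n'  # stand-in for open("dataframe.txt").read()

objects = []
pos = 0
while pos < len(text):
    # raw_decode cannot start on whitespace, so skip past it first
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos == len(text):
        break
    obj, pos = decoder.raw_decode(text, pos)
    objects.append(obj)

print(objects)  # [{'a': 1}, {'b': 2}]
```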
Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.
import json

def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result

print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
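If reading the whole 3 GB file into memory is not an option, one possible variant (a sketch, not tested against the real data) reads the file in chunks and applies raw_decode to a growing buffer, yielding one record at a time:

```python
import json

def iter_json_records(filename, chunk_size=1 << 20):
    """Yield top-level JSON values from a file of concatenated objects,
    reading it in chunks instead of all at once."""
    decoder = json.JSONDecoder()
    buf = ''
    with open(filename, encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # incomplete record; read more data first
                yield obj
                buf = buf[end:]
            if not chunk:  # EOF reached
                break

# Hypothetical demo data: two objects separated by a blank line
import os
import tempfile
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt') as f:
    f.write('{"a": 1}\n\n{"b": 2}\n')
    path = f.name

records = list(iter_json_records(path, chunk_size=4))
os.remove(path)
print(records)  # [{'a': 1}, {'b': 2}]
```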
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead.
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a closing brace, followed by an empty line (or multiple whitespace characters), followed by an opening brace, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    output_file.write("]")
Alternatively, if you want to stick with regex, then you could do
import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()

repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)

with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches multiple whitespace characters and captures them in group 1, which we then use in the replacement string as \1).
Note that you will need to read the entire file and run re.sub on all of it, which will be slow.

Delete specific content in a json file

I have this JSON file:
{
    "help": [
        {
            "title": "",
            "date": "",
            "link": ""
        },
        {
            "title": "",
            "date": "",
            "link": ""
        },
        {
            "title": "",
            "date": "",
            "link": ""
        }
    ]
}
I am currently struggling to delete each 'block' in the help list.
I eventually came up with this:
import json

with open('dest_file.json', 'w') as dest_file:
    with open('source.json', 'r') as source_file:
        for line in source_file:
            element = json.loads(line.strip())
            if 'help' in element:
                del element['help']
            dest_file.write(json.dumps(element))
So I was wondering how could I delete each thing in the help list, without deleting the help list.
ty
Yes, you can replace it with an empty list:
if 'help' in element:
    element['help'] = []
You have a further issue in the script, specifically with the line for line in source_file. If you read line by line, you get individual lines rather than the complete dictionary object, and that produces several other JSON errors.
Try this:
import json

with open('dest_file.json', 'w') as dest_file:
    with open('source.json', 'r') as source_file:
        element = json.load(source_file)
        if "help" in element:
            element['help'] = []
        dest_file.write(json.dumps(element))
This works for the example shown above, but if you have multiple nested items, then you need to iterate over each separately and fix them individually.

Text file to dictionary in python

I have a txt file of this type, grouping related messages, and I am looking for a way to make a dictionary out of it that I can use in a Python script.
Edit:
The script I want should loop through these messages and create a dictionary with the first word in each sentence as the key and the other part as the value.
My implementation right now does not raise any errors; the problem is that the output list is full of duplicated records.
from Doe
message Hello there
sent_timestamp 33333333334
message_id sklekelke3434
device_id 3434
from sjkjs
message Hesldksdllo there
sent_timestamp 3333sdsd3333334
message_id sklekelksde3434
device_id 34sd34
from Doe
message Hello there
sent_timestamp 33333333334
message_id sklekelke3434
device_id 3434
here is my code as of now
lines = []
records = {}
f = open('test1.txt', 'r+')
for line in f.readlines():
    if len(line.split()) != 0:
        key_name, key_value = line.strip().split(None, 1)
        records[key_name] = key_value.strip()
        lines.append(records)
f.close()
As @Michael Butschner said in their comment, it's kind of hard to tell what you're looking for, but here's an idea that you can use.
records = []

# used with, since your question tags python-3.x
with open('test1.txt', 'r+') as f:
    messages = f.read().split("\n\n\n")

for message in messages:
    message = message.split("\n")
    records.append({
        "from": message[0][5:],
        "message": message[1][8:],
        "sent_timestamp": message[2][15:],
        "message_id": message[3][11:],
        "device_id": message[4][10:]
    })
Here is what that records list looks like, using the json package to stringify it:
[
    {
        "from": "Doe",
        "message": "Hello there",
        "sent_timestamp": "33333333334",
        "message_id": "sklekelke3434",
        "device_id": "3434"
    },
    {
        "from": "sjkjs",
        "message": "Hesldksdllo there",
        "sent_timestamp": "3333sdsd3333334",
        "message_id": "sklekelksde3434",
        "device_id": "34sd34"
    },
    {
        "from": "Doe",
        "message": "Hello there",
        "sent_timestamp": "33333333334",
        "message_id": "sklekelke3434",
        "device_id": "3434"
    }
]
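The fixed slice offsets above ([5:], [8:], and so on) assume every field name has exactly that length. A slightly more defensive sketch, still assuming blank-line-separated messages (the sample text below is made up), splits each line at its first space instead; building a fresh dict per block is also what avoids the duplicated records mentioned in the question:

```python
records = []
raw = """from Doe
message Hello there
sent_timestamp 33333333334

from sjkjs
message Hi
sent_timestamp 35"""  # stand-in for f.read()

for block in raw.split("\n\n"):
    record = {}
    for line in block.splitlines():
        if line.strip():
            # first word becomes the key, the rest the value
            key, _, value = line.partition(" ")
            record[key] = value
    records.append(record)

print(records[0]["from"])  # Doe
```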
Without clarification as to what exactly you're expecting, I hope this helps you a bit with whatever you're working on :D.

The collection of some JSON data into a file

Can you give me some idea of how to do this collection? The problem is this: I receive JSON; let's assume it is the following
[{
    "pk": 1,
    "model": "store.book",
    "fields": {
        "name": "Mostly Harmless",
        "author": ["Douglas", "Adams"]
    }
}]
then I open a file, save the data, and close the file. The next time (this is a cycle) I again receive similar JSON, for example the following
[{
    "pk": 2,
    "model": "store.book",
    "fields": {
        "name": "Henry",
        "author": ["Hans"]
    }
}]
The second JSON must go into the same file as the first. Here is where the problem comes in. At this stage I do it in the following way: delete the brackets and put commas in between. Is there a smarter, better way to do this job?
I am creating the JSON by serializing Django objects. I would be very grateful if you would share your ideas.
PS: It is important to use minimal memory. Assume that the file is around 50-60 GB, and that memory can hold about 1 GB at most.
You would have to convert your data into JSON and store it in a file, then read from the file again, append your new data to the object, and save it into the file again. Here is some code that might be useful for you.
Use the json module. Documentation is available at http://docs.python.org/2/library/json.html
The first time you write into the file, you can use something like:
>>> import json
>>> fileW = open("filename.txt","w")
>>> json1 = [{
... "pk": 1,
... "model": "store.book",
... "fields": {
... "name": "Mostly Harmless",
... "author": ["Douglas", "Adams"]
... }
... }]
>>> json.dump(json1, fileW)
>>> fileW.close()
The following code could be used in a loop to read from the file and add data to it.
>>> fileLoop = open("filename.txt","r+")
>>> jsonFromFile = json.load(fileLoop)
>>> jsonFromFile
[{u'pk': 1, u'model': u'store.book', u'fields': {u'name': u'Mostly Harmless', u'author': [u'Douglas', u'Adams']}}]
>>> newJson = [{
... "pk": 2,
... "model": "store.book",
... "fields": {
... "name": "Henry",
... "author": ["Hans"]
... }
... }]
>>> jsonFromFile.append(newJson[0])
>>> jsonFromFile
[{u'pk': 1, u'model': u'store.book', u'fields': {u'name': u'Mostly Harmless', u'author': [u'Douglas', u'Adams']}}, {'pk': 2, 'model': 'store.book', 'fields': {'name': 'Henry', 'author': ['Hans']}}]
>>> fileLoop.seek(0)  # rewind so the updated list overwrites the old contents
>>> json.dump(jsonFromFile, fileLoop)
>>> fileLoop.close()
You do not need to parse the JSON because you are only storing it. The following (a) creates a file and (b) appends text to the file in each cycle.
from os.path import getsize

def init(filename):
    """
    Creates a new file and sets its content to "[]".
    """
    with open(filename, 'w') as f:
        f.write("[]")

def append(filename, text):
    """
    Appends a JSON to a file that has been initialised with `init`.
    """
    length = getsize(filename)  # figure out the number of characters in the file
    with open(filename, 'r+') as f:
        f.seek(length - 1)  # go to the end, just before the closing bracket
        if length > 2:  # insert a delimiter if this is not the first JSON
            f.write(",\n")
        f.write(text[1:-1])  # append the JSON without its surrounding brackets
        f.write("]")  # restore the closing bracket

filename = "temp.txt"
init(filename)
while mycondition:
    append(filename, getjson())
If you did not have to save the JSON after each cycle, you could do the following
jsons = []
while mycondition:
    jsons.append(getjson()[1:-1])

with open("temp.txt", "w") as f:
    f.write("[")
    f.write(",".join(jsons))
    f.write("]")
To avoid creating multigigabytes objects, you could store each object on a separate line. It requires you to dump each object without newlines used for formatting (json strings themselves may use \n (two chars) as usual):
import json
with open('output.txt', 'a') as file:  # open the file in append mode
    json.dump(obj, file,
              separators=(',', ':'))  # the most compact representation
    file.write("\n")
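Reading such a line-delimited file back is then one json.loads call per line, which keeps memory use bounded (the file name and sample objects below are made up for the demo):

```python
import json
import os
import tempfile

# Hypothetical: write two objects, one per line, in the compact form
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt') as f:
    for obj in [{"pk": 1}, {"pk": 2}]:
        json.dump(obj, f, separators=(',', ':'))
        f.write('\n')
    path = f.name

# Stream the objects back one line at a time
loaded = []
with open(path) as f:
    for line in f:
        loaded.append(json.loads(line))

os.remove(path)
print(loaded)  # [{'pk': 1}, {'pk': 2}]
```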
