json dump generating unnecessary curly braces - python

This question might have been asked many times, but I am still unable to understand how to work with a JSON file. I use json.dump(data, filename), and while dumping I get an unnecessary {} at the end of the file, so json.load(data) gives me the error below.
simplejson.scanner.JSONDecodeError: Extra data: line 1 column 1865 - line 1 column 1867 (char 1864 - 1866)
I read that there is no way to load only the first or second dictionary. I have also read that there is a separator argument that can be used with json.dump, but I see no use for it here. Should I be using encoding/decoding here?
My json.dump file:
{
"deployCI2": ["094fd196-20f0-4e8d-b946-f74a56d2f319", "6a1ce382-98c6-4058-a929-95a7d2415fd0"],
"deployCI3": ["c8fff661-4482-4908-b722-4fac0227a8b0", "929cf1fa-3fa6-4f95-8464-d58e5490f4cf"],
"deployCI4": ["9f8ffa3c-460d-43a9-8113-58e891340e1b", "6e535e92-4da2-4228-a6ab-c8fc8d31adcd", "8e26a35e-7fb9-43b3-8026-d1283f7b678c", "f40e5c29-b4df-4cfb-9d7f-3bcc9c4dcf9f"],
"HeenaStackABC": [], "HeenaStackABC-DISK_VM1-mm55lkkvccej": ["cc2a89a2-3b27-4f88-af09-b3b0b1301056"]
}{}
Edit: I think this code is what's doing it.
with open('stackList.json', 'a') as f:
    for stack in stacks:
        try:
            hlist = hc.resources.list(stack_id=stack.id)
            vlist = [o.physical_resource_id for o in hlist if o.resource_type == 'OS::Cinder::Volume']
            myDict[stack.stack_name] = vlist
        except heatclient.exc.HTTPBadRequest as e:
            pass
    json.dump(myDict, f)
I edited the code like below. I hope this is valid. It removed the trailing braces.
if len(myDict) != 0:
    json.dump(myDict, f)

Your problem is here:
with open('stackList.json', 'a') as f:
You're opening the file in 'append' mode, so each time the code is executed it appends the dump to your file. The result you complain about comes from this, combined with myDict being empty on the second run.
You either have to open the file in "w" ("write") mode, which will overwrite the existing content (you could also create a new dump file for each call), or switch to the "jsonlines" format (but your file will no longer be a valid JSON file, and any code reading it will have to know to parse it as JSON Lines).
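A minimal sketch of both options (the filename and myDict come from the question; everything else is an assumption about how you want to store the data):
import json

# Option 1: "w" mode, one valid JSON document, overwritten on every run
with open('stackList.json', 'w') as f:
    json.dump(myDict, f)

# Option 2: JSON Lines, append one compact JSON document per line,
# then read it back with one json.loads() per line
with open('stackList.jsonl', 'a') as f:
    f.write(json.dumps(myDict) + '\n')

with open('stackList.jsonl') as f:
    dicts = [json.loads(line) for line in f if line.strip()]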

Can't read Arabic CSV file for sentiment analysis in Arabic Jupyter notebook [duplicate]

I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following @JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
As @S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However, that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU'??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware of what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f = codecs.open(location, "rb", "utf-16")
csvread = csv.reader(f, delimiter='\t')
csvread.next()
for row in csvread:
    print row
Where location is the path to your csv file.
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader((line.replace('\0', '') for line in f))
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.
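For reference, a minimal xlrd sketch (the filename is an assumption; it simply prints every row of the first sheet):
import xlrd

book = xlrd.open_workbook('my.xls')    # the file that was really an Excel workbook
sheet = book.sheet_by_index(0)         # first worksheet
for rownum in range(sheet.nrows):
    print(sheet.row_values(rownum))    # each row as a list of cell values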
Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
Apparently it's an XLS file and not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms (the \xd0\xcf\x11\xe0 bytes at the start of your dump are the OLE2 compound document signature used by Excel's legacy formats).
Instead of the csv reader, I read the file directly and split each line as a string:
lines = open(input_file, 'rb')
for line_all in lines:
    line = line_all.replace('\x00', '').split(";")
I got the same error. Saved the file in UTF-8 and it worked.
This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.
I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count('\x00'):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            for line in data:
                of.write(line.replace('\x00', ''))
        shutil.move('my.csv.tmp', 'my.csv')

with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.
For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
I encountered this when using Scrapy and fetching a zipped CSV file without having the correct middleware to unzip the response body before handing it to the csv reader. Hence the file was not really a CSV file and threw the line contains NULL byte error accordingly.
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open
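A complete sketch of that idea, assuming Python 3 and a comma-delimited file that is really gzip-compressed despite its .csv extension:
import csv
import gzip

# 'rt' decompresses and decodes to text; newline='' is what the csv module expects
with gzip.open('my.csv', 'rt', newline='') as data_file:
    reader = csv.reader(data_file)
    for row in reader:
        print(row)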
One more case: if the CSV file contains empty rows, this error may show up. Checking the row before reading or writing is necessary.
for row in csvreader:
    if row:
        pass  # do something with the non-empty row
I solved my issue by adding this check in the code.

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am trying to import a file which was saved using json.dumps and contains tweet coordinates:
{
    "type": "Point",
    "coordinates": [
        -4.62352292,
        55.44787441
    ]
}
My code is:
>>> import json
>>> data = json.loads('/Users/JoshuaHawley/clean1.txt')
But each time I get the error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I want to end up extracting all the coordinates and saving them separately to a different file so they can then be mapped, but this seemingly simple problem is stopping me from doing so. I have looked at answers to similar errors but don't seem to be able to apply them here. Any help would be appreciated, as I am relatively new to Python.
json.loads() takes a JSON encoded string, not a filename. You want to use json.load() (no s) instead and pass in an open file object:
with open('/Users/JoshuaHawley/clean1.txt') as jsonfile:
    data = json.load(jsonfile)
The open() command produces a file object that json.load() can then read from, to produce the decoded Python object for you. The with statement ensures that the file is closed again when done.
The alternative is to read the data yourself and then pass it into json.loads().
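A sketch of that alternative, reusing the path from the question:
import json

# Read the file contents into a string yourself, then decode the string
with open('/Users/JoshuaHawley/clean1.txt') as jsonfile:
    data = json.loads(jsonfile.read())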
It helped me to add myfile.seek(0), which moves the pointer back to the 0th character:
with open(storage_path, 'r') as myfile:
    if len(myfile.readlines()) != 0:
        myfile.seek(0)
        Bank_0 = json.load(myfile)
I got the same type of error after reading in a JSON file created from Python.
The same error occurred whether I read it into a string and tried json.loads() or read straight from the file with json.load().
In my case, it turned out to be because I had written Python booleans (False/True) straight out to the file.
Trying to read them back in again caused this error.
When I modified the file to valid JSON (true/false), json.load worked fine.
I didn't see any SO questions with this as a possible cause for this error, so I'm adding it here for reference.
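A small illustration of the difference (the record dict is hypothetical):
import json

record = {"active": True}

print(str(record))            # {'active': True}  -- Python literal, not JSON
print(json.dumps(record))     # {"active": true}  -- valid JSON

json.loads(json.dumps(record))    # parses fine
# json.loads(str(record))         # raises json.decoder.JSONDecodeError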
You may use this function (it works with data):
def read_json_file(filename):
    with open(filename, 'r') as f:
        cache = f.read()
        data = eval(cache)
    return data
Or, you may put this in your code (it has the same effect):
def read_json_file(filename):
    data = []
    with open(filename, 'r') as f:
        data = [json.loads(_.replace('}]}"},', '}]}"}')) for _ in f.readlines()]
    return data
import json

file_path = "C:/Projects/Tryouts/books.json"
with open(file_path, 'r') as j:
    contents = json.loads(j.read())
print(contents)

Writing JSON in a for loop in Python

I am downloading JSON files from an API and use the following code to write the JSON. Each item in the loop gives me a JSON document. I need to save it and later extract entities from the appended JSON file using a loop.
for item in style_ls:
    dat = get_json(api, item)
    specs_dict[item] = dat
    with open("specs_append.txt", "a") as myfile:
        json.dump(dat, myfile)
    print item

with open("specs_data.txt", "w") as myfile:
    json.dump(specs_dict, myfile)
I know that I cannot get a valid JSON format from specs_append.txt, but I can get one from specs_data.txt. I am doing the first one just because my program needs at least 3-4 days to complete and there is a high chance that my system may shut down. So is there any way I can do this efficiently?
If not, is there any way I can extract it from the specs_append.txt <{JSON}{JSON}> format (which is not valid JSON)?
If not, should I write specs_dict to a text file every time in the loop, so that even if the program gets terminated I can restart from that point in the loop and still get valid JSON?
I suggest several possible solutions.
One solution is to write custom code to slurp in the input file. I would suggest putting a special line before each JSON object in the file, such as: ###
Then you could write code like this:
import json

SPECIAL_LINE = '###\n'  # the marker line, including its newline; define it to match whatever you write before each object

def json_get_objects(f):
    temp = ''
    line = next(f)  # pull first line
    assert line == SPECIAL_LINE
    for line in f:
        if line != SPECIAL_LINE:
            temp += line
        else:
            # found special marker, temp now contains a complete JSON object
            j = json.loads(temp)
            yield j
            temp = ''
    # after loop done, yield up last JSON object
    if temp:
        j = json.loads(temp)
        yield j

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
Two notes on this. First, I am simply appending to a string over and over; this used to be a very slow way to do this in Python, so if you are using a very old version of Python, don't do it this way unless your JSON objects are very small. Second, I wrote code to split the input and yield up JSON objects one at a time, but you could also use a guaranteed-unique string, slurp in all the data with a single call to f.read() and then split on your guaranteed-unique string using the str.split() method function.
Another solution would be to write the whole file as a valid JSON list of valid JSON objects. Write the file like this:
{"mylist":[
# first JSON object, followed by a comma
# second JSON object, followed by a comma
# third JSON object
]}
This would require your file appending code to open the file with writing permission, and seek to the last ] in the file before writing a comma plus newline, then the new JSON object on the end, and then finally writing ]} to close out the file. If you do it this way, you can use json.loads() to slurp the whole thing in and have a list of JSON objects.
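A rough sketch of that append step (a hypothetical helper; it assumes the file ends with exactly ]} and no trailing newline, and that at least one object is already in the list so the leading comma is valid):
import json
import os

def append_to_json_list(path, obj):
    # Overwrite the trailing "]}" with ",\n<new object>\n]}"
    with open(path, 'rb+') as f:
        f.seek(-2, os.SEEK_END)
        assert f.read(2) == b']}'
        f.seek(-2, os.SEEK_END)
        f.write((',\n' + json.dumps(obj) + '\n]}').encode('utf-8'))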
Finally, I suggest that maybe you should just use a database. Use SQLite or something and just throw the JSON strings in to a table. If you choose this, I suggest using an ORM to make your life simple, rather than writing SQL commands by hand.
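For the database route, even the stdlib sqlite3 module (no ORM, so a simpler variant of the suggestion above) is enough for a "one JSON string per row" table; a sketch:
import json
import sqlite3

conn = sqlite3.connect('specs.db')
conn.execute('CREATE TABLE IF NOT EXISTS specs (item TEXT, payload TEXT)')

# inside the download loop:
#   conn.execute('INSERT INTO specs VALUES (?, ?)', (item, json.dumps(dat)))
#   conn.commit()   # commit per item so a crash loses at most one record

# later, to read everything back:
for item, payload in conn.execute('SELECT item, payload FROM specs'):
    dat = json.loads(payload)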
Personally, I favor the first suggestion: write in a special line like ###, then have custom code to split the input on those marks and then get the JSON objects.
EDIT: Okay, the first suggestion was sort of assuming that the JSON was formatted for human readability, with a bunch of short lines:
{
    "foo": 0,
    "bar": 1,
    "baz": 2
}
But it's all run together as one big long line:
{"foo":0,"bar":1,"baz":2}
Here are three ways to fix this.
0) write a newline before the ### and after it, like so:
###
{"foo":0,"bar":1,"baz":2}
###
{"foo":0,"bar":1,"baz":2}
Then each input line will alternately be ### or a complete JSON object.
1) As long as SPECIAL_LINE is completely unique (never appears inside a string in the JSON) you can do this:
with open("specs_data.txt", "r") as f:
temp = f.read() # read entire file contents
lst = temp.split(SPECIAL_LINE)
json_objects = [json.loads(x) for x in lst]
for j in json_objects:
pass # do something with JSON object j
The .split() method function can split up the temp string into JSON objects for you.
2) If you are certain that each JSON object will never have a newline character inside it, you could simply write JSON objects to the file, one after another, putting a newline after each; then assume that each line is a JSON object:
import json

def json_get_objects(f):
    for line in f:
        if line.strip():
            yield json.loads(line)

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
I like the simplicity of option (2), but I like the reliability of option (0). If a newline ever got written in as part of a JSON object, option (0) would still work, but option (2) would error.
Again, you can also simply use an actual database (SQLite) with an ORM and let the database worry about the details.
Good luck.
Append the JSON data to a dict on every loop iteration.
At the end, dump this dict as JSON and write it to a file.
To give you an idea of appending data to a dict:
>>> d1 = {'suku':12}
>>> t1 = {'suku1':212}
>>> d1.update(t1)
>>> d1
{'suku1': 212, 'suku': 12}
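And the final dump step (a sketch; the filename is an assumption, d1 is the dict from above):
import json

# one valid JSON document containing everything accumulated in the loop
with open('specs_data.txt', 'w') as f:
    json.dump(d1, f)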

How to read line-delimited JSON from large file (line by line)

I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
{
    "key11": value11,
    "key12": value12,
}
{
    "key21": value21,
    "key22": value22,
}
…
The way I'm importing it now is:
content = open(file_path, "r").read()
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data, is there a way to construct each JSON object as I'm reading the file line by line? Thanks!
Just read each line and construct a JSON object as you go:
with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object (provided there is no \n in a JSON value somewhere or in the middle of your JSON object) and you avoid memory issues, as each object is created only when needed.
There is also this answer:
https://stackoverflow.com/a/7795029/671543
contents = open(file_path, "r").read()
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
{
    "key11": 11,
    "key12": 12
}
{
    "key21": 21,
    "key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
with open(args.infile, 'r') as infile:
    # Variable for building our JSON block
    json_block = []
    for line in infile:
        # Add the line to our JSON block
        json_block.append(line)
        # Check whether we closed our JSON block
        if line.startswith('}'):
            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)
            # Start a new block
            json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
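A small illustration of object_pairs_hook, reusing the keys from the question's example (shown with json.loads on an inline string for brevity; the same keyword works with json.load). It lets you transform or filter each decoded object as it is built:
import json

def keep_interesting_keys(pairs):
    # called with the key/value pairs of every JSON object as it is decoded
    return {k: v for k, v in pairs if k in ('key11', 'key21')}

obj = json.loads('{"key11": 11, "key12": 12}', object_pairs_hook=keep_interesting_keys)
print(obj)   # {'key11': 11}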
This expands Cohen's answer:
content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = io.StringIO()
file_buffer = content_object.get()['Body'].read().decode('utf-8')
json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)
df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
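If the object is too big to hold in memory, one option is to stream it instead (a sketch; it assumes a botocore version whose StreamingBody supports iter_lines()):
import json

body = content_object.get()['Body']      # botocore StreamingBody
records = []
for raw_line in body.iter_lines():        # reads the object in chunks
    if raw_line:                          # skip blank lines
        records.append(json.loads(raw_line.decode('utf-8')))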
I had to read some data from AWS S3 and parse a newline-delimited JSONL file. My solution was this, using splitlines.
The code:
for line in json_input.splitlines():
    one_json = json.loads(line)
The line-by-line reading approach is good, as mentioned in some of the above answers.
However, across multiple JSON tree structures I would recommend decomposing into two functions to have more robust error handling.
For example,
def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)
        return list(cases)
parse_case_line can encapsulate the key parsing logic required in your above example, for example with regex matching, or application-specific requirements. It also means that you can select which json key-values you want to parse out.
Another advantage of this approach is that filter handles multiple \n in the middle of your JSON object and parses the whole file :-).
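A hypothetical parse_case_line to go with the snippet above (the key name is made up; returning None lets filter(None, ...) drop records you cannot handle):
def parse_case_line(record):
    # keep only well-formed records carrying the field we care about
    if not isinstance(record, dict) or 'key11' not in record:
        return None
    return {'key11': record['key11']}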
Just read it line by line and parse it through a stream.
Your hacky trick (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list) isn't memory-friendly if the file is much more than 1 GB, as the whole content will land in RAM.

Adding brackets and commas to multiple JSON objects

I've created a very simple piece of code to read in tweets in JSON format in text files, determine if they contain an id and coordinates and if so, write these attributes to a csv file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
        if 'text' and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties but with the help of the excellent JSON Lint website have realised my mistake. I have multiple JSON objects and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.
You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
import json

def read_objects(filename):
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        line = next(inputfile).strip()
        while line:
            try:
                obj, index = decoder.raw_decode(line)
                yield obj
                line = line[index:]
            except ValueError:
                # Assume we didn't have a complete object yet
                line += next(inputfile).strip()
            if not line:
                line += next(inputfile).strip()
This should be able to read all your JSON objects in sequence:
for filename in all_files:
    for data in read_objects(filename):
        # test both keys explicitly; 'text' and 'coordinates' in data only tests the second
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.
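A sketch of that "one compact JSON document per line" approach (the output filename and the tweets iterable are hypothetical):
import json

# write: json.dumps never emits raw newlines unless you ask for indentation
with open('SampleTweets/all_tweets.jsonl', 'w') as out:
    for tweet in tweets:
        out.write(json.dumps(tweet) + '\n')

# read back: one json.loads() per line
with open('SampleTweets/all_tweets.jsonl') as inp:
    for line in inp:
        data = json.loads(line)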
