Exception when trying to parse large JSON file using ijson - python

I am trying to parse a large JSON file (16GB) using ijson but I always get the following error :
Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
venue" : { "type" : NumberInt(0) }, "yea
(right here) ------^
File "C:\pyth\dblp_parser.py", line 14, in <module>
for record in ijson.items(f, 'item', use_float=True):
My code is as follows:
with open("dblpv13.json", "rb") as f:
for record in ijson.items(f, 'records.item', use_float=True):
paper_id = record["_id"] #_id is only for test
paper_id_tab.append(paper_id)
A part of my json file is as follows:
{
"_id" : "53e99784b7602d9701f3f636",
"title" : "Flatlined",
"authors" : [
{
"_id" : "53f58b15dabfaece00f8046d",
"name" : "Peter J. Denning",
"org" : "ACM Education Board",
"gid" : "5b86c72de1cd8e14a3c2b772",
"oid" : "544bd99545ce266baef0668a",
"orgid" : "5f71b2811c455f439fe3c58a"
}
],
"venue" : {
"_id" : "555036f57cea80f954169e28",
"raw" : "Commun. ACM",
"raw_zh" : null,
"publisher" : null,
"type" : NumberInt(0)
},
"year" : NumberInt(2002),
"keywords" : [
"linear scale",
"false dichotomy"
],
"n_citation" : NumberInt(7),
"page_start" : "15",
"page_end" : "19",
"lang" : "en",
"volume" : "45",
"issue" : "6",
"issn" : "",
"isbn" : "",
"doi" : "10.1145/508448.508463",
"pdf" : "",
"url" : [
"http://doi.acm.org/10.1145/508448.508463"
],
"abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},
I tried to fill in records item by item but always the same error. I'm completely blocked.
Please, can any body help me?

The same problem happened to me with the said dataset. ijson can't handle it. I overcame the problem by creating another dataset and then parsing the new dataset with ijson. The approach is quite simple: read the orignal dataset with simple read; remove "NumberInt(" and ")", write the result to a new json file. the code is given below.
f=open('dblpv13_clean.json')
with open('dblpv13.json','r',errors='ignore') as myFile:
for line in myFile:
line=line.replace("NumberInt(","").replace(")","")
f.write(line)
f.close()
Now you can parse the new dataset with ijson as follows.
with open('dblpv13_clean.json', "r",errors='ignore') as f:
for i, element in enumerate(ijson.items(f, "item")):
do something....

Related

Correctly referencing a JSON in Python. Strings versus Integers and Nested items

Sample JSON file below
{
"destination_addresses" : [ "New York, NY, USA" ],
"origin_addresses" : [ "Washington, DC, USA" ],
"rows" : [
{
"elements" : [
{
"distance" : {
"text" : "225 mi",
"value" : 361715
},
"duration" : {
"text" : "3 hours 49 mins",
"value" : 13725
},
"status" : "OK"
}
]
}
],
"status" : "OK"
}
I'm looking to reference the text value for distance and duration. I've done research but i'm still not sure what i'm doing wrong...
I have a work around using several lines of code, but i'm looking for a clean one line solution..
thanks for your help!
If you're using the regular JSON module:
import json
And you're opening your JSON like this:
json_data = open("my_json.json").read()
data = json.loads(json_data)
# Equivalent to:
data = json.load(open("my_json.json"))
# Notice json.load vs. json.loads
Then this should do what you want:
distance_text, duration_text = [data['rows'][0]['elements'][0][key]['text'] for key in ['distance', 'duration']]
Hope this is what you wanted!

Python parsing json file to access values returning TypeError

I am using python to parse a json file full of url data to try and build a url reputation classifier. There are around 2,000 entries in the json file and not all of them have all of the fields present. A typical entry looks like this:
[
{
"host_len" : 12,
"fragment" : null,
"url_len" : 84,
"default_port" : 80,
"domain_age_days" : "5621",
"tld" : "com",
"num_domain_tokens" : 3,
"ips" : [
{
"geo" : "CN",
"ip" : "115.236.98.124",
"type" : "A"
}
],
"malicious_url" : 0,
"url" : "http://www.oppo.com/?utm_source=WeiBo&utm_medium=OPPO&utm_campaign=DailyFlow",
"alexa_rank" : "25523",
"query" : "utm_source=WeiBo&utm_medium=OPPO&utm_campaign=DailyFlow",
"file_extension" : null,
"registered_domain" : "oppo.com",
"scheme" : "http",
"path" : "/",
"path_len" : 1,
"port" : 80,
"host" : "www.oppo.com",
"domain_tokens" : [
"www",
"oppo",
"com"
],
"mxhosts" : [
{
"mxhost" : "mail1.oppo.com",
"ips" : [
{
"geo" : "CN",
"ip" : "121.12.164.123",
"type" : "A"
}
]
}
],
"path_tokens" : [
""
],
"num_path_tokens" : 1
}
]
I am trying to access the data stored in the fields "ips" and "mxhosts" to compare the "geo" location. To try and access the first "ips" field I'm using:
corpus = open(file)
urldata = json.load(corpus, encoding="latin1")
for record in urldata:
print record["ips"][0]["geo"]
But as I mentioned not all of the json entries have all of the fields. "ips" is always present but sometimes it's "null" and the same goes for "geo". I'm trying to check for the data before accessing it using:
if(record["ips"] is not None and record["ips"][0]["geo"] is not None):
But I this an error:
if(record["ips"] is not None and record["ips"][0]["geo"] is not None):
TypeError: string indices must be integers
When I try to check it using this:
if("ips" in record):
I get this error message:
print record["ips"][0]["geo"]
TypeError: 'NoneType' object has no attribute '__getitem__'
So I'm not sure how to check if the record I'm trying to access exists before I access it, or if I'm even accessing in the most correct way. Thanks.
You can simply check if record["ips"] is not None, or more simply if it's True, before proceeding to access it as a list; otherwise you would be calling a list method on a None object.
for record in urldata:
if record["ips"]:
print record["ips"][0]["geo"]
So it ended up being a little convoluted due to the inconsistent nature of the json file, but I had to end up first checking that "ips" was not null and then checking that "geo" was present in record["ips"][0]. This is what it looks like:
if(record["ips"] is not None and "geo" in record["ips"][0]):
print record["ips"][0]["geo"]
Thanks for the feedback everyone!

Jsonify data not returning to ajax call

I have an app where I am using flask, python, ajax, json, javascript, and leaflet. This app reads a csv file, puts it into json format, then returns it to an ajax call. My issue is that the geojson is not being returned. In the console, I am getting a 5000 NetworkError in the console log. The end result is to use the return geojson in a leaflet map layer. If I remove the jsonify, the return works fine, but it is a string of course, and this wont work for the layer.
As you can see, I have a simple alert("success") in the ajax success part. This is not being executed. Nor is the alert(data).
I do have jsonify in the from Flask import statement.
Thank you for the help
Ajax call
$.ajax({
type : "POST",
url : '/process',
data: {
chks: chks
}
})
.success(function(data){
alert("success"); // I am doing this just to get see if I get back here. I do not
alert(data);
python/flask
#app.route('/process', methods=['POST'])
def process():
data = request.form['chks']
rawData = csv.reader(open('static/csvfile.csv', 'r'), dialect='excel')
count = sum(1 for row in open('static/csvfile.csv))
template =\
''' \
{"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [%s, %s]},
"properties" : {"name" : "%s" }
}%s
'''
output = \
''' \
{"type" : "Feature Collection",
"features" : [
'''
iter = 0
separator = ","
lastrow = ""
for row in rawData:
iter += 1 // this is used to skip the first line of the csv file
if iter >=2:
id = row[0]
lat = row[1]
long = row[2]
if iter != count:
output += template % (row[2], row[1], row[0], separator)
else:
output += template % (row[2], row[1], row[0], lastrow)
output += \
''' \
]}
'''
return jsonify(output)
More Info - taking David Knipe's info into hand, If I remove the jsonify from my return statement, it returns what I expect, and I can output the return in an alert. It looks like this
{ "type" : "Feature Collection",
"features" : [
{"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -86.28, 32.36]},
"properties" : {"name" : "Montgomery"}
},
{ "type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -105.42, 40.30]},
"properties" : {"name" : "Boulder"}
},
]}
If I take that data and hard code it into the ajax success, then pass it to the leaflet layer code like this - it will work, and my points will be displayed on my map
...
.success(function(data){
var pointsHC= { "type" : "Feature Collection",
"features" : [
{"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -86.28, 32.36]},
"properties" : {"name" : "Montgomery"}
},
{ "type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -105.42, 40.30]},
"properties" : {"name" : "Boulder"}
},
]};
// leaflet part
var layer = L.geoJson(pointsHC, {
pointToLayer: function(feature, latlng){
return L.circleMarker( ...
If I do not hard code and pass the data via a variable, it does not work, and I get and invalid geoJson object. I have tried it with both the final semi-colon removed and not removed, and no love either way
...
.success(function(data){
// leaflet part
var layer = L.geoJson(data, {
pointToLayer: function(feature, latlng){
return L.circleMarker( ...
So it works if you don't try to parse the JSON, but if you do then it fails. Your JSON is invalid:
As loganbertram pointed out, you're missing a " on "Feature Collection".
You're missing a " on "properties".
output = template % ... should be output += template % ... - you're appending to output, not replacing it.
the features array will have a trailing comma (unless it is empty).
Although actually in your code features will always be empty anyway: you set iter = 0, never change its value, and then don't do the output = ... bit because iter < 2.
Are you sure you actually want to use jsonify? As I understand it, that turns any object into a JSON string. But output is already a JSON string - or should be, if you fix the various bugs loganbertram and I have spotted. In that case the client-side code will not fail trying to parse JSON. But if you jsonify something that's already JSON, you'll get something like this:
"{\"type\" : \"Feature\",
\"geometry\" : {
...
which the javascript will then convert back to the original JSON string, instead of a JSON object.
Actually, it would be better to rewrite the whole thing so it constructs an object instead of a string, and then calls jsonify on that object. But I don't know enough Python to give more details easily.

working with JSON with a bash script

I’m writing a bash script that does a few things. Right now it copies a few files into correct directories and runs a few commands. I need this bash script to edit a JSON file. Essentially this script would append a snippet of JSON to an existing JSON object that exists in a file.JSON. I cannot just append the data because the JSON snippet must be a part of an existing JSON object (should be added to the tracks array). So is this possible to do with a bash script? Should I just write another python or R script to handle this JSON Logic or is there a more elegant solution. Thanks for any help.
file.JSON looks like this...
{
"formatVersion" : 1,
"tracks" : [
{
"key" : "Reference sequence",
"chunkSize" : 20000,
"urlTemplate" : "seq/{refseq_dirpath}/{refseq}-",
"storeClass" : "JBrowse/Store/Sequence/StaticChunked",
"type" : "SequenceTrack",
"seqType" : "dna",
"category" : "Reference sequence",
"label" : "DNA"
},
{
"type" : "FeatureTrack",
"label" : "gff_track1",
"trackType" : null,
"key" : "gff_track1",
"compress" : 0,
"style" : {
"className" : "feature"
},
"storeClass" : "JBrowse/Store/SeqFeature/NCList",
"urlTemplate" : "tracks/gff_track1/{refseq}/trackData.json"
},
{
"storeClass" : "JBrowse/Store/SeqFeature/NCList",
"style" : {
"className" : "feature"
},
"urlTemplate" : "tracks/ITAG2.4_gene_models.gff3/{refseq}/trackData.json",
"key" : "ITAG2.4_gene_models.gff3",
"compress" : 0,
"trackType" : null,
"label" : "ITAG242.4_gene_models.gff3",
"type" : "FeatureTrack"
},
{
"urlTemplate" : "g-231FRL.bam",
"storeClass" : "JBrowse/Store/SeqFeature/BAM",
"label" : "g-1FRL.bam",
"type" : "JBrowse/View/Track/Alignments2",
"key" : "g-1FRL.bam"
}
]
}
the JSON snippet looks like this ...
{
"urlTemplate": "AX2_filtered.vcf.gz",
"label": "AX2_filtered.vcf.gz",
"storeClass": "JBrowse/Store/SeqFeature/VCFTabix",
"type": "CanvasVariants"
}
Do yourself a favor and install jq, then it's as simple as:
jq -n 'input | .tracks += [inputs]' file.json snippet.json > out.json
Trying to modify a structured data (like JSON) is a fool's errand without a proper parser and jq really makes it easy.
However, if you prefer doing it through Python (although it would be an overkill for this kind of a task) it's pretty much as straight forward as with jq:
import json
with open("file.json", "r") as f, open("snippet.json", "r") as s, open("out.json", "w") as u:
data = json.load(f) # parse `file.json`
data["tracks"].append(json.load(s)) # parse `snippet.json` and append it to `.tracks[]`
json.dump(data, u, indent=4) # encode the data back to JSON and write it to `out.json`

Python json - editing a .json file

I am trying to add something to a .json file.
This is what saves
"106569102398611456" : {
"currentlocation" : "Pallet Town",
"name" : "Anthony",
"party" : [
{
"hp" : "5",
"level" : "1",
"pokemonname" : "bulbasaur"
}
],
"pokedollars" : 0
}
}
What I'm trying to do is make a command to add something else to the "party". Here is an example of what I want.
"106569102398611456" : {
"currentlocation" : "Pallet Town",
"name" : "Anthony",
"party" : [
{
"hp" : "5",
"level" : "1",
"pokemonname" : "bulbasaur"
},
{
"hp" : "3",
"level" : "1",
"pokemonname" : "squirtle"
}
],
"pokedollars" : 0
}
}
edit:
This is what I've attempted but I have no idea
def addPokemon(pokemon):
pokemonName = convert(pokemon)
for pokemon in players['party']:
pokemon.append(pokemonName)
convert(pokemon) basically grabs the pokemon i type in and change gives it a level and health to be added to the .json file
To update a JSON file, write out the object to a temporary file and then replace the target file with the temporary file. Example:
import json
import os
import shutil
import tempfile
def rewriteJsonFile(sourceObj, targetFilePath, **kwargs):
temp = tempfile.mkstemp()
tempHandle = os.fdopen(temp[0], 'w')
tempFilePath = temp[1]
json.dump(sourceObj, tempHandle, **kwargs)
tempHandle.close()
shutil.move(tempFilePath, targetFilePath)
This assumes that updates are happening serially. If updates are potentially happening in parallel, you'd need some kind of locking to ensure only one update is happening at a time. Although at that point, you're better off using a database like sqlite and returning queries in JSON format.

Categories

Resources