Skipping broken JSONs in Python

I am reading JSON from the database and parsing it using Python.

import json

cur1.execute("Select JSON from t1")
dataJSON = cur1.fetchall()
for row in dataJSON:
    jsonparse = json.loads(row)

The problem is that some of the JSON documents I'm reading are broken.
I would like my program to skip a document if it is not valid JSON, and parse it if it is. Right now my program crashes as soon as it encounters a broken JSON document.
t1 contains several JSON documents that I'm reading one by one.

Update
You're getting "expected string or buffer" - you need to be using row[0], as the results will be 1-tuples and you want the first (and only) column.
If you did want to check for bad JSON, you can put a try/except around it:

for row in dataJSON:
    try:
        jsonparse = json.loads(row[0])
    except Exception as e:
        pass

Now, instead of using Exception as above, use the type of exception that's actually occurring, so that you don't capture errors unrelated to JSON loading. (It's probably ValueError.)
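Putting both fixes together - indexing into the 1-tuple and catching only JSON-related errors - a minimal sketch (assuming each row holds a single JSON string) could look like this:

import json

cur1.execute("Select JSON from t1")
parsed = []
for row in cur1.fetchall():
    try:
        # row is a 1-tuple; take its only column
        parsed.append(json.loads(row[0]))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        continue  # skip rows that are not valid JSON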

If you just want to silently ignore errors, you can wrap json.loads in a try..except block:
try: jsonparse = json.loads(row)
except: pass

Try this:

import json
import pandas as pd

def f(x):
    try:
        return json.loads(x)
    except ValueError:
        pass

json_df = df.join(df["error"].apply(f).apply(pd.Series))
After loading the JSON, I also wanted to expand each key-value pair into a new column (one per JSON key), so I chained apply(pd.Series) as well. If your goal is only to parse each row of a DataFrame column as JSON, remove that final apply(pd.Series).
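For concreteness, a small hedged example of what that expansion does (the DataFrame and its "error" column here are made up for illustration):

import json
import pandas as pd

df = pd.DataFrame({"error": ['{"code": 1, "msg": "bad"}', "not json"]})

def f(x):
    try:
        return json.loads(x)
    except ValueError:
        return {}  # empty dict, so apply(pd.Series) yields NaNs for broken rows

json_df = df.join(df["error"].apply(f).apply(pd.Series))
# "code" and "msg" become columns; the unparseable row is filled with NaN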

Related

How do I parse a faulty JSON file in Python using the json module?

I have a large JSON file to parse with Python, but it's incomplete (i.e., the closing brackets at the end are missing). The file consists of one big JSON object which contains JSON objects inside it. Every JSON object inside the outer object is complete; just the finishing brackets are missing.
For example, its structure is like this:

{bigger_json_head:value, another_key:[{small_complete_json1},{small_complete_json2}, ...,{small_complete_json_n},

So the final "]}" is missing. However, each small JSON object forms a single line, so when I print each line of the file I get each JSON object as a single string.
So I've tried to use:

line_arr = []
with open("file.json", "r", encoding="UTF-8") as f:
    for line in f.readlines():
        line_arr.append(line)
I expected to get a list with each line of the JSON file as an element, and then I tried the following after that step:

for json_line in line_arr:
    try:
        json_str = json.loads(json_line)
        print(json_str)
    except json.decoder.JSONDecodeError:
        continue

I expected this code block to print the JSON strings to the console (except for the first and last lines). However, it printed nothing and only raised decode errors.
Has anyone solved a similar problem? Please help. Thank you.
If the faulty JSON file is only missing the final "]}", then you can actually fix it before parsing it.
Here is an example to illustrate:

with open("file.json", "r", encoding="UTF-8") as f:
    faulty_json_str = f.read()
fixed_json_str = faulty_json_str + ']}'
json_obj = json.loads(fixed_json_str)
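If you do want the line-by-line route instead, note that each line in a file like this typically ends with a trailing comma, which on its own is not valid JSON - that would explain why every json.loads call failed. A hedged per-line variant that strips the comma first:

import json

recovered = []
with open("file.json", "r", encoding="UTF-8") as f:
    for line in f:
        candidate = line.strip().rstrip(",")  # drop the trailing comma that breaks per-line parsing
        try:
            recovered.append(json.loads(candidate))
        except json.JSONDecodeError:
            continue  # skip the outer object's head line and anything else unparseable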

How to continue a loop after catching exception in try ... except

I am reading a big file in chunks and doing some operations on each of the chunks. While reading one of them I got the following error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 26 fields in line 15929977, saw 118

which means that one of my file's lines doesn't follow the same format as the others. I thought I could just omit this chunk, but I couldn't find a way to do it. I tried a try/except block as follows:

data = pd.read_table('ny_data_file.txt', sep=',',
                     header=0, encoding='latin1', chunksize=5000)
try:
    for chunk in data:
        ...  # operations
except pandas.errors.ParserError:
    ...  # Here is my problem

My problem is that if a chunk is not parsed correctly, my code jumps straight to the except block without even finishing the for loop. What I would like is to skip that chunk, move on to the next one, and perform the operations inside the loop on it.
I have checked on Stack Overflow but couldn't find anything similar where the try was performed on the for loop itself. Any help would be appreciated.
UPDATE:
I have tried to do as suggested in the comments:

try:
    for chunk in data:
        ...  # operations
except pandas.errors.ParserError:
    ...  # continue/pass/handle error

But it is still not catching the exception because, as said, the exception is raised when getting the chunk out of my data, not when doing operations with it.
The way you use try/except makes it skip the entire for loop if an exception is caught. If you want to skip only one iteration, you need to write the try/except inside the loop, like so:

for chunk in data:
    try:
        ...  # operations
    except pandas.errors.ParserError as e:
        # inform the user of the error
        print("Error encountered while parsing chunk {}".format(chunk))
        print(e)
I understood that you get the exception in the operations part. If that is the case, you should just continue:

for chunk in data:
    try:
        ...  # operations
    except pandas.errors.ParserError:
        continue
I am not sure where the exception is thrown. Maybe adding a full stack trace would help. If the error is thrown by the read_table() call, maybe you could try this:

try:
    data = pd.read_table('ny_data_file.txt', sep=',',
                         header=0, encoding='latin1', chunksize=5000)
except pandas.errors.ParserError:
    pass

for chunk in data:
    ...  # operations
As suggested by @JonClements, what solved my problem was to use error_bad_lines=False in pd.read_csv, so it just skipped the lines causing trouble and let me execute the rest of the for loop.
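Since the exception is raised while fetching the next chunk rather than inside the loop body, another pattern is to advance the iterator manually so the try/except wraps the next() call itself. A minimal sketch under that assumption (note that whether the reader can keep going after a ParserError depends on the pandas version, and that newer pandas replaces error_bad_lines=False with on_bad_lines='skip'):

import pandas as pd

reader = pd.read_table('ny_data_file.txt', sep=',',
                       header=0, encoding='latin1', chunksize=5000)
while True:
    try:
        chunk = next(reader)  # the ParserError is raised here, not in the loop body
    except StopIteration:
        break  # no more chunks
    except pd.errors.ParserError:
        continue  # skip the malformed chunk and try the next one
    # operations on chunk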

Convert a raw tweet string to a JSON object in Python

I'm using Twitter's API to download raw tweets so I can play with them. The iterator loop they gave in the example looks something like this (I added an if condition to run the loop n times, not shown here):

iterator = twitter_stream.statuses.sample()
for tweet in iterator:
    print(json.dumps(tweet))
    break
These commands output the entire JSON object in the correct format.
To extract the "text" item from the raw tweet JSON object, I tried using .get("text") on the output of json.dumps(tweet):

txts = []
for tweet in iterator:
    txts.append((json.dumps(tweet)).get("text"))
    break
print(txts)
But I get an error saying "AttributeError: 'str' object has no attribute 'get'"
So I searched around and found a solution where all the output from json.dumps(tweet) is written to a file, loaded back with json.loads(), and then .get("text") is used on the result:

fl = open("ipjson.json", "a")
for tweet in iterator:
    fl.write(json.dumps(tweet))
    break
fl.flush()
decode = json.loads(fl)
for item in decode:
    txt = item.get("text")
    txts.append(txt)
print(txts)
But this gives me another error saying "TypeError: the JSON object must be str, not 'TextIOWrapper'".
What am I doing wrong? Is there a better/easier way to extract the text from a raw tweet JSON object?
For the first example you don't need JSON at all; you can just do:

txts = []
for status in statuses:
    txts.append(status.text)
For the second example you're handling the JSON incorrectly. You should instead do:

txts = []
for status in statuses:
    txts.append(json.dumps(status))

with open('ipjson.json', 'w') as fou:
    json.dump(txts, fou)

And to read it back in:

with open('ipjson.json', 'r') as fin:
    txts = json.load(fin)

for txt in txts:
    print(json.loads(txt)['text'])
Please note that when you're writing and reading the JSON file you use dump and load, but with the individual JSON objects you use dumps and loads.
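To make the distinction concrete, here is a small sketch (the tweet dict is made up for illustration):

import json

tweet = {"id": 1, "text": "hello"}  # hypothetical tweet dict

s = json.dumps(tweet)            # dict -> JSON string
assert json.loads(s) == tweet    # JSON string -> dict

with open("ipjson.json", "w") as f:
    json.dump(tweet, f)          # dict -> file
with open("ipjson.json") as f:
    assert json.load(f) == tweet  # file -> dict

# The original AttributeError came from calling .get on a string:
# json.dumps(tweet) returns a str, so call .get on the dict itself.
text = tweet.get("text")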
JSON files require recursive scanning (https://stackoverflow.com/a/42855667/3342050) or known locations within the structure. After you get your dicts, lists, and entries, you parse through them for specific values: https://stackoverflow.com/a/42860573/3342050. This is entirely dependent on what data is returned, because the keys will be unique to that structure.

What is the most Pythonic way to convert a valid json file to a string?

Below is what I'm doing currently; just wondering if there is a better way.

with open("sample.json", "r") as fp:
    json_dict = json.load(fp)
json_string = json.dumps(json_dict)
with open("sample.json") as f:
json_string = f.read()
No need to parse and unparse it.
If you need to raise an exception on invalid JSON, you can parse the string and skip the work of unparsing it:

with open("sample.json") as f:
    json_string = f.read()
json.loads(json_string)  # Raises an exception if the JSON is invalid.
A JSON file is just a regular file: you open() it and read() it, and that gives you a str. If you want to make sure it contains valid JSON, put the load part of the above code in a try/except block.
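For instance, a minimal sketch of that validation (keeping the raw string untouched):

import json

with open("sample.json") as f:
    json_string = f.read()  # a plain str

try:
    json.loads(json_string)  # parse only to validate; json_string itself is kept as-is
except ValueError as e:
    print("sample.json is not valid JSON:", e)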
I don't know if it's Pythonic or just pointless, but you could also do this if validation is part of your requirements:

import json
# I'm fully aware of the missing "with" or "close" in the line below
json_string = json.dumps(json.load(open('sample.json')))

Otherwise, user2357112 already said it: "No need to parse and unparse it."
You are doing it right. You can probably find libraries with different implementations optimized for performance or memory use, but the Python standard library is reliable, covers most cases, is compatible with other platforms, and is simple. It cannot get more Pythonic than that.
The way you're doing it is probably fine if you just need it to raise an exception for invalid JSON. However, if you want to make sure that you're not changing the file at all, you could try something like this:

import json

with open("sample.json") as fp:
    json_string = fp.read()
json.loads(json_string)

It will still raise a ValueError if the JSON is invalid, and you know that you haven't changed the data at all. If you're wondering what might change: off the top of my head, the order of dict items and whitespace, not to mention duplicate keys in the JSON.
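A quick illustration of how a parse/unparse round trip can alter the text (the input string is made up; exact output spacing depends on json.dumps defaults):

import json

raw = '{"a": 1,  "b": 2, "a": 3}'  # extra whitespace and a duplicate key
print(json.dumps(json.loads(raw)))
# -> {"a": 3, "b": 2}: whitespace is normalized and the duplicate "a" collapses to the last value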

Check if file is json loadable

I have two types of txt files: one is saved in some arbitrary format of the form

Header
key1 value1
key2 value2

and the other is a simple JSON dump stored as

with open(filename, "w") as outfile:
    json.dump(json_data, outfile)
From a dialog window, the user can load either of these two files, but my loader needs to be able to distinguish between type 1 and type 2 and send the file to the correct load routine.

# Pseudocode
def load(filename):
    if filename is json-loadable:
        json_loader(filename)
    else:
        other_loader(filename)
The easiest way I can think of is to use a try/except block:

def load(filename):
    try:
        data = json.load(open(filename))
        process_data(data)
    except:
        other_loader(filename)

but I do not like this approach, since there is roughly a 50/50 risk of failure in the try/except block, and as far as I know try/except is slow when it fails.
So is there a simpler and more convenient way of checking whether a file is JSON or not?
You can do something like this:

import json

def convert(tup):
    """
    Convert a JSON string to a Python dict.
    """
    try:
        tup_json = json.loads(tup)
        return tup_json
    except ValueError as error:  # includes JSONDecodeError
        logger.error(error)  # assumes a configured logger
        return None

converted = convert(<string_that_needs_to_be_converted_to_json>)
if converted:
    <do_your_logic>
else:
    <if_string_is_not_convertible>
If the top-level data you're dumping is an object, you could check whether the first character is {, or [ if it's an array; see the sketch below. That's only valid if the header of the other format will never start with those characters, and it's not foolproof, because it doesn't guarantee that the data is well-formed JSON.
On the other hand, your existing solution is fine: much clearer and more robust.
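A minimal sketch of that first-character heuristic, falling back to the full parse when the cheap check passes (other_loader and process_data are the names from the question):

import json

def load(filename):
    with open(filename) as f:
        text = f.read()
    # Cheap heuristic: a JSON dump of an object or array starts with '{' or '['
    if text.lstrip()[:1] in ("{", "["):
        try:
            process_data(json.loads(text))
            return
        except ValueError:
            pass  # looked like JSON but wasn't; fall through
    other_loader(filename)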
