Error occurs while parsing the json file - python

I'm trying to parse JSON data with the json.load() method, but it's giving me an error. I tried different approaches such as reading line by line and converting into a dictionary or a list, but none of them work. I also tried the solution mentioned in the following URL loading-and-parsing-a-json but it gives me the same error.
import json
data = []
with open('output.txt','r') as f:
    for line in f:
        data.append(json.loads(line))
Error:
ValueError: Extra data: line 1 column 71221 - line 1 column 6783824 (char 71220 - 6783823)
Please find the output.txt in the below URL
Content- output.txt

I wrote up the following which will break up your file into one JSON string per line and then go back through it and do what you originally intended. There's certainly room for optimization here, but at least it works as you expected now.
import json
import re

PATTERN = '{"statuses"'

with open('output.txt', 'r') as f:
    file_as_str = f.read()

# Insert a newline before every occurrence of the pattern except the first,
# so that each top-level JSON object ends up on its own line.
pieces = []
last = 0
for match in re.finditer(re.escape(PATTERN), file_as_str):
    if match.start() != 0:
        pieces.append(file_as_str[last:match.start()])
        pieces.append('\n')
        last = match.start()
pieces.append(file_as_str[last:])

with open('output.txt', 'w') as f:
    f.write(''.join(pieces))

data = []
with open('output.txt', 'r') as f:
    for line in f:
        data.append(json.loads(line))

Your alleged JSON file is not a properly formatted JSON file. A JSON file must contain exactly one value (a list, a mapping, a number, a string, etc.). Your file contains a number of JSON objects in sequence, but not in the correct format for a list.
Your program's JSON parser correctly reports an error when presented with this non-JSON data.
Here is a program that will interpret your file:
import json

# Idea and some code stolen from https://gist.github.com/sampsyo/920215

data = []
with open('output.txt') as f:
    s = f.read()

decoder = json.JSONDecoder()
while s.strip():
    # parse one JSON value and get the index where it ended
    datum, index = decoder.raw_decode(s)
    data.append(datum)
    # drop the decoded value (and any whitespace after it) and continue
    s = s[index:].lstrip()

print(len(data))
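raw_decode is doing the real work here: unlike plain json.loads, it does not insist on consuming the whole string, it just parses the first JSON value it finds and returns it along with the index where that value ended, so the loop can slice off the already-parsed part and repeat until the string is exhausted.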

Related

Combining json files in python

I have 3 json files as below:
test1.json:
{"item":"book1","price":"10.00","location":"library"}
test2.json:
{"item":"book2","price":"15.00","location":"store"}
test3.json:
{"item":"book3","price":"9.50","location":"store"}
I have this code:
import json
import glob

result = ''
for f in glob.glob("*.json"):
    with open(f, "r") as infile:
        result += infile.read()
with open("m1.json", "w") as outfile:
    outfile.writelines(result)
I get the following output:
{"item":"book1","price":"10.00","location":"library"}
{"item":"book2","price":"15.00","location":"store"}
{"item":"book3","price":"9.50","location":"store"}
Is it possible to get each file as a new line separated by a comma like below?
{"item":"book1","price":"10.00","location":"library"}, <-- line 1
{"item":"book2","price":"15.00","location":"store"}, <-- line 2
{"item":"book3","price":"9.50","location":"store"} <-- line 3
As others commented, your expected result is invalid JSON.
But if you really want that format, str.join() is more convenient:
import glob

jsons = []
for f in glob.glob("*.json"):
    with open(f, "r") as infile:
        jsons.append(infile.read())
result = ',\n'.join(jsons)
infile.read() gives you the raw string; d = json.loads(infile.read()) gives you a real JSON object (a dict).
To write a valid combined JSON file (of type list of dict), collect those dicts in a list and write json.dumps(the_list) to the output file.
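If you would rather produce valid JSON that way, here is a minimal sketch (assuming the same test1.json–test3.json inputs and the m1.json output; the glob pattern is narrowed to test*.json here so a second run does not re-read the combined file itself):

import glob
import json

# parse each file into a dict, collect the dicts in a list,
# then dump the whole list as one valid JSON document
records = []
for path in glob.glob("test*.json"):
    with open(path, "r") as infile:
        records.append(json.loads(infile.read()))

with open("m1.json", "w") as outfile:
    json.dump(records, outfile, indent=2)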

Compare Number from local file and overwrite if the number is higher than the local one

I am trying to compare a number from a local file with a number that I'm scraping.
My only problem is that I don't know how to read the file and compare it with the number I have. Everything else works fine.
My code:
all_div = driver.find_elements_by_xpath("//div[@class='im_message_outer_wrap hasselect']")
for x in range(0, len(all_div)):
    idMessage = all_div[x].get_attribute('data-msg-id')
    if idMessage > open('idMessage.js', 'r'):
        all_div[x].click()
        fp = open('idMessage.js', 'w')
        fp.writelines(idMessage)
The error:
TypeError: '>' not supported between instances of 'str' and '_io.TextIOWrapper'
The content inside the json file:
207
Note:
The number inside the json file changes whenever the for loop finds something bigger than it.
The file is a json one, but I can replace it with a .txt or any other extension without a problem if necessary.
You can try this for reading the js file.
# first read data from the js file
f = open('idMessage.js', 'r')
p = f.readline()
f.close()

all_div = driver.find_elements_by_xpath("//div[@class='im_message_outer_wrap hasselect']")
for x in range(0, len(all_div)):
    idMessage = all_div[x].get_attribute('data-msg-id')
    # use data from js file here
    if int(idMessage) > int(p.strip()):
        all_div[x].click()
        fp = open('idMessage.js', 'w')
        fp.writelines(idMessage)
        fp.close()
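A slightly tidier variant of the same idea, sketched here with with blocks so the file handles are closed automatically (it still assumes driver is the already-initialised Selenium WebDriver from the question):

# `driver` is assumed to be an already-initialised Selenium WebDriver,
# exactly as in the question.
with open('idMessage.js', 'r') as f:
    stored_id = int(f.readline().strip())

all_div = driver.find_elements_by_xpath("//div[@class='im_message_outer_wrap hasselect']")
for div in all_div:
    current_id = int(div.get_attribute('data-msg-id'))
    if current_id > stored_id:
        div.click()
        stored_id = current_id
        # overwrite the stored number with the new, larger one
        with open('idMessage.js', 'w') as fp:
            fp.write(str(current_id))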

How to preserve trailing zeros with Python CSV Writer

I am trying to convert a json file with individual json lines to csv. The json data has some elements with trailing zeros that I need to maintain (e.g. 1.000000). When writing to csv the value is changed to 1.0, removing all trailing zeros except the first zero following the decimal point. How can I keep all trailing zeros? The number of trailing zeros may not always be static.
Updated the formatting of the sample data.
Here is a sample of the json input:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Here is a sample of the output:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0,0.0,5.0,1234567.0,69.0,1.0,,4321987.0,1,000-000-000-00,10012.0,10002.0,3.0,,1.0,0,,0,000-000-000-00,0,bc:1234346
Here is the code:
import json
import csv

f = open('test2.json')  # open input file
outputFile = open('output.csv', 'w', newline='')  # load csv file
output = csv.writer(outputFile)  # create a csv.writer
i = 1
for line in f:
    try:
        data = json.loads(line)  # reads current line into tuple
    except:
        print("Can't load line {}".format(i))
    if i == 1:
        header = data.keys()
        output.writerow(header)  # Writes header row
    i += 1
    output.writerow(data.values())  # writes values row
f.close()  # close input file
The desired output would look like:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0000000000,0.0000000000,5.0000000000,1234567.0000000000,69.0000000000,1.0000000000,,4321987.0000000000,1,000-000-000-00,10012.0000000000,10002.0000000000,3.0000000000,,1.0000000000,0,,0,000-000-000-00,0,bc:1234346
I've been trying things and I think this may solve your problem:
Pass the str function to the parse_float argument of json.loads :)
data = json.loads(line, parse_float=str)
This way, when json.loads() tries to parse a float it will use str instead, so the value is kept as a string and the zeroes are maintained. I tried doing that and it worked:
i = 1
for line in f:
    try:
        data = json.loads(line, parse_float=str)  # reads current line into tuple
    except:
        print("Can't load line {}".format(i))
    if i == 1:
        header = data.keys()
        print(header)  # Writes header row
    i += 1
    print(data.values())  # writes values row
More information here: Json Documentation
PS: You could use a boolean instead of i += 1 to get the same behaviour.
The decoder of the json module parses real numbers with float by default, so trailing zeroes are not preserved (Python floats do not keep them). You can use the parse_float parameter of json.loads to override the constructor used for real numbers with str instead:
data = json.loads(line, parse_float=str)
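Putting that together with the CSV writing from the question, here is a minimal sketch of the whole conversion (assuming the same test2.json input and output.csv output):

import csv
import json

# parse_float=str hands back each number's original text,
# so "2.0000000000" stays "2.0000000000" instead of becoming 2.0
with open('test2.json') as f, open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for i, line in enumerate(f, start=1):
        try:
            data = json.loads(line, parse_float=str)
        except ValueError:
            print("Can't load line {}".format(i))
            continue
        if i == 1:
            writer.writerow(data.keys())  # header row
        writer.writerow(data.values())  # values row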
You can also use str.format, but then you need to give a static decimal precision:
>>> '{:.10f}'.format(10.0)
'10.0000000000'

process multiple json arrays from a .dat file in python

I am new to json data processing and am stuck with this issue. The data in my input file looks like this -
[{"key1":"value1"},{"key2":"value2"}] [{"key3":"value3"},{"key4":"value4"}]
I tried to read it using
json.load(file)
or by
with open(file) as f:
    json.loads(f)
and tried pandas.read_json(file, orient="records") as well.
Each of these attempts failed with an Extra data: line 1 column n (char n) error.
Can someone suggest how best to parse this file? I am not in favor of writing a manual parser which may fail to scale later.
P.S. There is no , between the two arrays.
TIA
Your JSON file content has an issue.
1. If there is a , between the arrays:
Code:
import json

with open("my.json") as fp:
    data = json.load(fp)  # or: data = json.loads(fp.read())
print(data)
Your file content can then be either of these.
Option 1:
Use an outermost square bracket around your json content:
[[ {"key1":"value1"}, {"key2":"value2"}], [{"key3":"value3"},
{"key4":"value4"}]]
Option 2:
Use only one square bracket:
[ {"key1":"value1"}, {"key2":"value2"}, {"key3":"value3"},
{"key4":"value4"}]
2. If there is no , between the arrays:
Code:
Just working with the given JSON format as it is:
import json

def valid_json_creator(given):
    replaced = given.replace("}] [{", "}],[{")
    return "[" + replaced + "]"

def read_json():
    with open("data.txt") as fp:
        data = fp.read()
    valid_json = valid_json_creator(data)
    jobj = json.loads(valid_json)
    print(jobj)

if __name__ == '__main__':
    read_json()
This code works for JSON if it is in the following format.
Note that there is no , between the arrays, but there is whitespace:
[{"key0":"value0"},{"key1":"value41"}]
[{"key1":"value1"},{"key2":"value42"}]
[{"key2":"value2"},{"key3":"value43"}]
[{"key3":"value3"},{"key4":"value44"}]
[{"key4":"value4"},{"key5":"value45"}]
[{"key5":"value5"},{"key6":"value46"}]
[{"key6":"value6"},{"key7":"value47"}]
[{"key7":"value7"},{"key8":"value48"}]
[{"key8":"value8"},{"key9":"value49"}]
[{"key9":"value9"},{"key10":"value410"}]
[{"key10":"value10"},{"key11":"value411"}]
[{"key11":"value11"},{"key12":"value412"}]
[{"key12":"value12"},{"key13":"value413"}]
[{"key13":"value13"},{"key14":"value414"}]
[{"key14":"value14"},{"key15":"value415"}]
[{"key15":"value15"},{"key16":"value416"}]
[{"key16":"value16"},{"key17":"value417"}]
[{"key17":"value17"},{"key18":"value418"}]
[{"key18":"value18"},{"key19":"value419"}]
[{"key19":"value19"},{"key20":"value420"}]
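Since the listing above separates the arrays with newlines rather than a single space, the replacement can be made a little more general with a regular expression; here is a sketch of that variation (same data.txt input assumed):

import json
import re

def valid_json_creator(given):
    # join the concatenated arrays with commas, whatever whitespace separates them
    replaced = re.sub(r'\}\]\s*\[\{', '}],[{', given)
    return "[" + replaced + "]"

with open("data.txt") as fp:
    jobj = json.loads(valid_json_creator(fp.read()))
print(jobj)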
What you can test with is a string that corresponds to the JSON file's content (a JSON document is, by definition, text, not a Python data structure).
Test:
file = '[{"key1":"value1"},{"key2":"value2"}],[{"key3":"value3"},{"key4":"value4"}]'
This should work better. But wait... you do not seem to provide a list or dict at the top level of your to-be JSON! Hence the error:
ValueError: Extra data: line 1 column 38 - line 1 column 76 (char 37 - 75)
Change it then to (note the additional list opening and closing brackets at the beginning and end):
file = '[[{"key1":"value1"},{"key2":"value2"}],[{"key3":"value3"},{"key4":"value4"}]]'
This will work with:
json.loads(file)
but not with:
with open(file) as f:
    json.loads(f)
as your text variable is not a file! You would want to store the contents of the variable named file to an actual file and pass the path of that file for the code to work properly:
with open(r'C:\temp\myfile.json') as f:
    json.load(f)
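Another option, already used for the first question in this thread, is json.JSONDecoder.raw_decode, which parses one value at a time and reports where it stopped, so the text never has to be edited at all. A minimal sketch, assuming the concatenated arrays live in a file called data.dat:

import json

decoder = json.JSONDecoder()
arrays = []
with open('data.dat') as f:
    s = f.read()

idx = 0
while idx < len(s):
    # skip the whitespace that separates the concatenated arrays
    while idx < len(s) and s[idx].isspace():
        idx += 1
    if idx >= len(s):
        break
    # parse one array and get the index just past it
    obj, idx = decoder.raw_decode(s, idx)
    arrays.append(obj)

print(arrays)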

Python using re module to parse an imported text file

def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern,text)
    for item in f:
        print(item)
        savefileagain.write(item)
    #savefileagain.close()
The above function, as written, parses the text and returns sets of seven numbers. I have three problems.
Firstly, the 'read' file, which contains exactly the same text as text='09,...', raises TypeError: expected string or buffer, which I cannot solve even after reading some of the posts.
Secondly, when I try to write the results to the 'write' file, nothing is written, and
thirdly, I am not sure how to get the same output that I get with the print statement, which is three lines of seven numbers each, which is the output that I want.
This should do the trick:
import re

filename = 'sliceeverfile3.txt'
pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []

# Make sure file gets closed after being iterated
with open(filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()
    # Iterate each line
    for line in lines:
        # Regex applied to each line
        match = re.search(pattern, line)
        if match:
            # Make sure to add \n to display correctly when we write it back
            new_line = match.group() + '\n'
            print(new_line)
            new_file.append(new_line)

with open(filename, 'w') as f:
    # go to start of file
    f.seek(0)
    # actually write the lines
    f.writelines(new_file)
You're sort of on the right track...
You'll iterate over the file:
How to iterate over the file in python
and apply the regex to each line. The link above should really answer all three of your questions once you realize you're trying to write item, which doesn't exist outside of that loop.
