How to correct leading zeroes in JSON with Python

I have a wrongly-formatted JSON file where I have numbers with leading zeroes.
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)
I get this error.
JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)
Here's what is on char 107:
print(p[107])
#0
The problem is: this is the data I have. I am only showing two examples here, but my file has millions of lines to parse, so I need a script. At the end of the day, I need this string:
"""[
{
"name": "Alice",
"RegisterNumber": "911100020001"
},
{
"name": "Bob",
"RegisterNumber": "000111110300"
}
]"""
How can I do it?

Read the file (ideally line by line) and replace all the affected values with their string representation. You can use regular expressions for that (the re module).
Then save and later parse the now-valid JSON.
If it fits into memory, you don't need to save the file, of course; just json.loads the then-valid JSON string.
Here is a simple version:
import json
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
from re import sub
p = sub(r'(\d{12})', r'"\1"', p)
arc = json.loads(p)
print(arc[1])

This won't be pretty, but you could fix this with a regex.
import re
import json

p = "..."
fixed = re.sub(r'"RegisterNumber":\s*([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(fixed)
This will match every case where RegisterNumber is followed by a bare number.
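A narrower variant quotes only the values that actually begin with a zero, so numbers without leading zeroes stay numbers. This is a sketch: it assumes the values are bare integers directly after a colon, and it could misfire on a colon-plus-digits sequence inside a string value.

```python
import json
import re

raw = '{"name": "Bob", "RegisterNumber": 000111110300}'
# Quote only bare integers that begin with 0, so the leading zeroes survive
fixed = re.sub(r':\s*(0\d+)', r': "\1"', raw)
print(json.loads(fixed))
```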

Since the problem is the leading zeroes, the easy way to fix the data would be to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but this seems to work.
import json

data = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
result = []
for line in data.splitlines():
    if ': 0' in line:
        # quote the value in place so the leading zeroes survive
        key, sep, value = line.partition(': ')
        result.append(key + sep + '"' + value + '"')
    else:
        result.append(line)
data = "\n".join(result)
arc = json.loads(data)
print(arc)


Adding a comma between JSON objects in a datafile with Python?

I have a large file (about 3 GB) which contains what looks like a JSON file but isn't, because it lacks commas (,) between the "observations", or JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"usernameid": "47284592942",
"username": "Alex",
"server": "475774810304151552",
"text": "Must watch",
"type": "462050823720009729",
"datetime": "2018-08-05T21:20:20.486000+00:00",
"type": {
"$numberLong": "0"
}
}
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "273342",
"usernameid": "Alice",
"server": "475774810304151552",
"text": "https://www.youtube.com/",
"type": "4620508237200097wd29",
"datetime": "2018-08-05T21:20:11.803000+00:00",
"type": {
"$numberLong": "0"
}
And this is what I want (the comma between "observations"):
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"username": "Alex",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
},
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "Alice",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
This is what I tried but it doesn't give me a comma where I need it:
import re
with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' ' + line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution presupposes that none of the fields in the JSON contains { or }.
If we assume that there is at least one blank line between JSON dictionaries, here is an idea: maintain a count of unclosed curly brackets ({) as unclosed_count, and when we meet an empty line at depth zero, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> import json
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b": 2}\n'  # hypothetical output of open("dataframe.txt").read()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 17)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved: raw_decode can't start on whitespace, so you'll have to use the returned index to find where the next value starts, skipping past any whitespace at that index.
Another way to view your data is that you have multiple JSON records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat until done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively until all the data is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the JSON itself.
import json

def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result

print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
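For a file that large, the record reader above can be adapted to stream the input in chunks instead of reading it all at once. The function below is a sketch under two assumptions: each record is a JSON object or array (not a bare number, which could be split ambiguously across chunks), and any text that fails to decode at end of file is silently dropped.

```python
import json

def iter_json_records(path, chunk_size=1 << 20):
    # Sketch: stream whitespace-separated JSON records without
    # loading the whole multi-GB file into memory at once.
    decoder = json.JSONDecoder()
    buf = ""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            buf = (buf + chunk).lstrip()
            while buf:
                try:
                    obj, pos = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # record continues in the next chunk
                yield obj
                buf = buf[pos:].lstrip()
            if not chunk:  # end of file
                break
```

Memory use is then bounded by the chunk size plus the largest single record, rather than by the file size.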
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check whether it's empty. If it isn't, write it as-is; if it is, write a comma instead:
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a closing brace, followed by an empty line (or other whitespace), followed by an opening brace, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    output_file.write("]")
Alternatively, if you want to stick with regex, then you could do
import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)
with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches one or more whitespace characters and captures them in group 1, which we then use in the replacement string as \1).
Note that you will need to read and run re.sub on the entire file, which will be slow.
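A quick check of that substitution on a minimal input shows the captured whitespace is preserved:

```python
import re

before = '}\n\n{'
# the captured blank line(s) are reinserted after the comma
after = re.sub(r'\}(\s+)\{', r'},\1{', before)
print(after)
```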

convert json to string with newline and escape in python

How can I convert the payload below using Python:
{
"abc": {
"i": "1212",
"j": "add"
}
}
to
"{\n\n \"abc\": {\n \"i\": \"1212\",\n \"j\": \"add\"\n }\n}"
You can use the optional indent parameter of json.dump and json.dumps to add newlines and indent to the generated string.
>>> import json
>>> payload = {
... "assessmentstatus": {
... "id": "D37002079003",
... "value": "In-Progress"
... }
... }
>>> json.dumps(payload)
'{"assessmentstatus": {"id": "D37002079003", "value": "In-Progress"}}'
>>> json.dumps(payload, indent=0)
'{\n"assessmentstatus": {\n"id": "D37002079003",\n"value": "In-Progress"\n}\n}'
>>> json.dumps(payload, indent=2)
'{\n "assessmentstatus": {\n "id": "D37002079003",\n "value": "In-Progress"\n }\n}'
With properly displayed whitespace:
>>> print(json.dumps(payload, indent=2))
{
  "assessmentstatus": {
    "id": "D37002079003",
    "value": "In-Progress"
  }
}
It seems you actually want the string including the enclosing " and with all the " within the string escaped. This is surprisingly tricky using Python's repr, as it always tries to use either ' or " as the outer quotes so that the quotes do not have to be escaped.
What seems to work, though, is to just json.dumps the JSON string again:
>>> json.dumps(json.dumps(payload, indent=2))
'"{\\n \\"assessmentstatus\\": {\\n \\"id\\": \\"D37002079003\\",\\n \\"value\\": \\"In-Progress\\"\\n }\\n}"'
>>> print(json.dumps(json.dumps(payload, indent=2)))
"{\n \"assessmentstatus\": {\n \"id\": \"D37002079003\",\n \"value\": \"In-Progress\"\n }\n}"
I'm assuming it has newlines since it's in a file:
import json
with open('file.json') as f:
    data = json.load(f)
Then do what you want with the data.
You can save data to a file in JSON format like this:
import json

data = {
    "assessmentstatus": {
        "id": "D37002079003",
        "value": "In-Progress"
    }
}
with open("file.txt", "w") as file:
    json.dump(data, file, indent=4)
To retrieve this data you can do:
import json

with open("file.txt") as file:
    data = json.load(file)
print(data)
JSON is a way to store data in a string format and to parse that string back into actual data.
So trying to build "{\n\n "assessmentstatus": {\n "id": "D37002079003",\n "value": "In-Progress"\n }\n}" on your own is the wrong way to go about this, since the json module is great at doing exactly that.
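To illustrate the round trip the library handles for you (using the payload from the question): dumps produces the escaped, newline-containing string, and loads parses it back losslessly.

```python
import json

payload = {"abc": {"i": "1212", "j": "add"}}
pretty = json.dumps(payload, indent=2)  # dict -> indented JSON string
parsed = json.loads(pretty)             # string -> dict again
print(parsed == payload)
```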

Reading json in python separated by newlines

I am trying to read some json with the following format. A simple pd.read_json() returns ValueError: Trailing data. Adding lines=True returns ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads() so far without success.
Any ideas how I could get this into a dataframe?
{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
}
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}
The sample you have above isn't valid JSON. To be valid JSON these objects need to be within a JSON array ([]) and be comma-separated, as follows:
[{
"content": "kdjfsfkjlffsdkj",
"source": {
"name": "jfkldsjf"
},
"title": "dsldkjfslj",
"url": "vkljfklgjkdlgj"
},
{
"content": "djlskgfdklgjkfgj",
"source": {
"name": "ldfjkdfjs"
},
"title": "lfsjdfklfldsjf",
"url": "lkjlfggdflkjgdlf"
}]
I just tried it on my machine. When formatted correctly, it works:
>>> pd.read_json('data.json')
content source title url
0 kdjfsfkjlffsdkj {'name': 'jfkldsjf'} dsldkjfslj vkljfklgjkdlgj
1 djlskgfdklgjkfgj {'name': 'ldfjkdfjs'} lfsjdfklfldsjf lkjlfggdflkjgdlf
Another solution if you do not want to reformat your files.
Assuming your JSON is in a string called my_json you could do:
import json
import pandas as pd
splitted = my_json.split('\n\n')
my_list = [json.loads(e) for e in splitted]
df = pd.DataFrame(my_list)
Thanks for the ideas, internet. None quite solved the problem in the way I needed (I had lots of newline characters inside the strings themselves, which meant I couldn't split on them), but they helped point the way. In case anyone has a similar problem, this is what worked for me:
import json
import pandas as pd

with open('path/to/original.json', 'r') as f:
    data = f.read()
data = data.split("}\n")
data = [d.strip() + "}" for d in data]
data = list(filter(("}").__ne__, data))
data = [json.loads(d) for d in data]

with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)

df = pd.read_json('path/to/reformatted.json')
If you can use jq, the solution is simpler:
jq -s '.' path/to/original.json > path/to/reformatted.json

converting text file to json in python

I have multiple documents that together are approximately 400 GB, and I want to convert them to JSON format in order to load them into Elasticsearch for analysis.
Each file is approximately 200 MB.
Original file looked like:
IUGJHHGF#BERLIN:lhfrjy
0t7yfudf#WARSAW:qweokm246
0t7yfudf#CRACOW:Er747474
0t7yfudf#cracow:kui666666
000t7yf#Vienna:1йй2ц2й2цй2цц3у
It contains characters that are not only English. key1 is always separated by #, while the city is separated by either ; or :.
I parsed it with this code:
#!/usr/bin/env python
# coding: utf8
import json

with open('2') as f:
    for line in f:
        s1 = line.find("#")
        rest = line[s1+1:]
        if rest.find(";") != -1:
            if rest.find(":") != -1:
                print "FOUND BOTH : ; "
                s2 = -0
            else:
                s2 = s1+1+rest.find(";")
        elif rest.find(":") != -1:
            s2 = s1+1+rest.find(":")
        else:
            print "FOUND NO : ; "
            s2 = -0
        key1 = line[:s1]
        city = line[s1+1:s2]
        description = line[s2+1:len(line)-1]
The whole file looks like:
RRS12345 Cracow Sunflowers
RRD12345 Berin Data
After that parsing I want to have the output:
{
"location_data":[
{
"key1":"RRS12345",
"city":"Cracow",
"description":"Sunflowers"
},
{
"key1":"RRD123dsd45",
"city":"Berlin",
"description":"Data"
},
{
"key1":"RRD123dsds45",
"city":"Berlin",
"description":"1йй2ц2й2цй2цц3у"
}
]
}
How can I convert it to the required JSON format quickly, given that the data contains more than just English characters?
import json

def process_text_to_json():
    location_data = []
    with open("file.txt") as f:
        for line in f:
            line = line.split()
            location_data.append({"key1": line[0], "city": line[1], "description": line[2]})
    location_data = {"location_data": location_data}
    return json.dumps(location_data)
Output sample:
{"location_data": [{"city": "Cracow", "key1": "RRS12345", "description": "Sunflowers"}, {"city": "Berin", "key1": "RRD12345", "description": "Data"}, {"city": "Cracow2", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin2", "key1": "RRD12346", "description": "Data"}, {"city": "Cracow3", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin3", "key1": "RRD12346", "description": "Data"}]}
Iterate over each line and form your dict.
Ex:
d = {"location_data": []}
with open(filename, "r") as infile:
    for line in infile:
        val = line.split()
        d["location_data"].append({"key1": val[0], "city": val[1], "description": val[2]})
print(d)
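Both answers above assume space-separated fields. For the original key1#CITY:description format (where the separator before the description is : or ;), a regex-based parser might look like the sketch below. The pattern and helper name are mine, not from the question; non-English characters work because Python 3 strings are Unicode.

```python
import re

# key1 up to '#', city up to the first ':' or ';', then the description
line_re = re.compile(r'^([^#]+)#([^:;]+)[:;](.*)$')

def parse_line(line):
    m = line_re.match(line.strip())
    if m is None:
        return None  # skip malformed lines
    key1, city, description = m.groups()
    return {"key1": key1, "city": city, "description": description}

print(parse_line('0t7yfudf#WARSAW:qweokm246'))
```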

Unable to format the output of a link in python

I am trying to access weather API data. It returns a long, barely readable single line. I am trying to replace every opening bracket ({) with "{\n", so that the bracket remains but is followed by a newline character, just for more readable JSON.
But it returns every character on a new line in the shell.
import urllib2

url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
s = data.read()
count = 0
s = s.replace('{', "{\n")
#s = ''.join(s)
for line in s:
    print line
    count = count + 1
print count
After join() the problem still persists: the output prints every character on its own line.
Why don't you use the built-in capabilities of the json library that's standard in Python?
import urllib2
import json
url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
# read the contents in and parse the JSON.
jsonData = json.loads(data.read())
# print it out nicely formatted:
print json.dumps(jsonData, sort_keys=True, indent=4, separators=(',', ': '))
output:
{
    "cod": "200",
    "count": 1,
    "list": [
        {
            "clouds": {
                "all": 20
            },
            "coord": {
                "lat": 38.7994,
                "lon": -89.9603
            },
            "dt": 1442072098,
            "id": 4237717,
            "main": {
                "humidity": 67,
                "pressure": 1020,
                "temp": 16.82,
                "temp_max": 18.89,
                "temp_min": 15
            },
            "name": "Edwardsville",
            "sys": {
                "country": "United States of America"
            },
            "weather": [
                {
                    "description": "few clouds",
                    "icon": "02d",
                    "id": 801,
                    "main": "Clouds"
                }
            ],
            "wind": {
                "deg": 350,
                "speed": 4.6
            }
        }
    ],
    "message": "accurate"
}
The issue is here:
for line in s:
    print line
At this point, it will print every character on a separate line - that's what print does (it adds a trailing newline to each print command), as shown by this code:
print 1
print
print 2
which outputs this:
1

2
You may be confused with the name line, but it's not a special variable name. You can change the word line to any valid variable name and it will work the same way.
A for loop iterates over an iterable. If it's a file, it yields each line; a list yields each element; and a string yields every character. Because you tell it to print each one, they are printed individually.
Are you expecting a non-string response from the API? If it gives a list like this:
["calls=10","message=hello"]
then your for loop will print each in turn. But if it's just a string, like "message=hello" it will print each character.
And the reason there is a blank newline after the {? Because the replace command is working fine.
s is just a string, so doing for x in s actually iterates over individual characters of s, not over its lines. I think you're confusing it with for line in f when f is a file object!
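A quick illustration of the difference between iterating a string and iterating its lines:

```python
s = "ab\ncd"
# iterating a string yields individual characters, including '\n'
print(list(s))
# splitlines() (or iterating a file object) yields whole lines
print(s.splitlines())
```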
