I have a large file (about 3GB) which contains what looks like JSON but isn't, because it lacks commas (,) between "observations", i.e. JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }
}
And this is what I want (the comma between "observations"):
{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "username": "Alex",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
},
{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "Alice",
    "server": "475774810304151552",
    "type": {
        "$numberLong": "0"
    }
}
This is what I tried but it doesn't give me a comma where I need it:
import re

with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' ' + line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution presupposes that none of the fields in the JSON contain { or }.
If we assume that there is at least one blank line between JSON dictionaries, here's an idea: maintain a count of unclosed curly brackets ({) as unclosed_count, and when we meet an empty line at depth zero, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> import json
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'  # hypothetical output of open("dataframe.txt").read()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved; decoding can't start on whitespace, so you'll have to use the returned index to find where the next value starts, skipping over any whitespace that follows it.
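For example, a minimal sketch of that loop (iter_json_objects is a name I made up, not part of the json module):

import json

def iter_json_objects(text):
    # Yield each top-level JSON value from a string of concatenated objects.
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode cannot start on whitespace, so skip past it first
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

records = list(iter_json_objects(open("dataframe.txt").read()))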
Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.
import json

def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
        decoder = json.JSONDecoder()
        result = []
        while txt:
            data, pos = decoder.raw_decode(txt)
            result.append(data)
            txt = txt[pos:].lstrip()
        return result

print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
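Along those lines, here is a sketch of a chunked-buffer variant (not memory-mapped per se; stream_json_records is a name I made up) that only ever holds the undecoded tail of the file plus one chunk in memory:

import json

def stream_json_records(filename, chunk_size=1 << 20):
    # Decode concatenated JSON objects without reading the whole
    # file into memory at once; a partial trailing object at EOF
    # is silently ignored in this sketch.
    decoder = json.JSONDecoder()
    buf = ""
    with open(filename, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, pos = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # need more data to complete the next object
                yield obj
                buf = buf[pos:]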
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead:
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a closing brace, followed by an empty line (or multiple whitespace characters), followed by an opening brace, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
    output_file.write("]")
Alternatively, if you want to stick with regex, then you could do
import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()

repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)

with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches one or more whitespace characters and captures them in group 1, which we then use in the replacement string as \1).
Note that you will need to read and run re.sub on the entire file, which will be slow.
Related
I have a document with newline-delimited JSONs, to which I apply some functions. Everything works up until this line, which looks exactly like this:
{"_id": "5f114", "type": ["Type1", "Type2"], "company": ["5e84734"], "answers": [{"title": " answer 1", "value": false}, {"title": "answer 2
", "value": true}, {"title": "This is a title.", "value": true}, {"title": "This is another title", "value": true}], "audios": [null], "text": {}, "lastUpdate": "2020-07-17T06:24:50.562Z", "title": "This is a question?", "description": "1000 €.", "image": "image.jpg", "__v": 0}
The entire code:
import json

def unimportant_function(d):
    d.pop('audios', None)
    return {k: v for k, v in d.items() if v != {}}

def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    data = handle.read()

dicts = parse_ndjson(data)

for d in dicts:
    new_d = unimportant_function(d)
    json_string = json.dumps(new_d, ensure_ascii=False)
    print(json_string)
The error JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259) happens at dicts = parse_ndjson(data). Why? I also have no idea what that symbol after "answer 2" is; it didn't appear in the data, but it appeared when I copy-pasted it.
What is the problem with the data?
The unprintable character embedded in the "answer 2" string is a paragraph separator, which is treated as whitespace by .splitlines():
>>> 'foo\u2029bar'.splitlines()
['foo', 'bar']
(Speculation: the ndjson file might be exploiting this to represent "this string should have a newline in it", working around the file format. If so, it should probably be using a \n escape instead.)
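For comparison, json.dumps would have written a literal newline inside the string as a \n escape, which .splitlines() leaves alone:

>>> import json
>>> json.dumps("answer 2\nmore")
'"answer 2\\nmore"'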
The character is, however, not treated specially if you iterate over the lines of the file normally:
>>> # For demonstration purposes, I create a `StringIO`
>>> # from a hard-coded string. A file object reading
>>> # from disk will behave similarly.
>>> import io
>>> for line in io.StringIO('foo\u2029bar'):
... print(repr(line))
...
'foo\u2029bar'
So, the simple fix is to make parse_ndjson expect a sequence of lines already - don't call .splitlines, and fix the calling code appropriately. You can either pass the open handle directly:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle)

or pass it to list to create a list explicitly:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(list(handle))

or create the list using the provided .readlines() method:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle.readlines())
I am working on a script which appends json encoded messages to a file:
{
"Status": "Non_Malicious",
"alert_level": "Low",
"alert_count": 1,
"alert": "",
"hosts_alert": ""
}
NOTE: one such message is appended to the file every time, by this function:
def jsonconvert(new_line, set_alert_level, set_status, hosts_alert, ip_address_alert, alert_count, title):
    data = {}
    data['alert_description'] = new_line
    data['alert_level'] = set_alert_level
    data['Status'] = set_status
    data['hosts_alert'] = hosts_alert
    data['alert'] = title
    json_data = json.dumps(data)
    file_write = open("json_file", 'a')
    file_write.write(json_data + ',' + '\n')
    file_write.close()
But in order to make the file valid JSON, I need it to end up like the structure below, and I'm not sure how to get that done:
{
    "isoc": [{
        "Status": "Non_Malicious",
        "alert_level": "Low",
        "alert_count": 1,
        "alert": "",
        "hosts_alert": ""
    }, {
        "Status": "Non_Malicious",
        "alert_level": "Low",
        "alert_count": 1,
        "alert": "",
        "hosts_alert": ""
    }]
}
You can't make valid JSON files by just appending. This is not just a problem with Python.
That's because, as you noted, every open bracket must be closed, and there should not be trailing commas.
Here are some possible solutions that come to my mind:
- Use another file format that does not suffer from this limitation, like CSV or YAML, or even plain text
- Don't append to a file, but use a database instead
- Keep all the contents in memory and dump them all at once into a single JSON file, overwriting it every time
- At every append operation, delete the closing brackets, append your data, then re-add them. This solution is an ugly hack, so I don't advise it, but it works; see the sketch below.
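A minimal sketch of that last hack, assuming the file always ends with exactly ]} and no trailing newline (append_record is a hypothetical helper, not part of any library):

import json
import os

def append_record(path, record):
    # "Delete the closing brackets, append your data, re-add them."
    entry = json.dumps(record)
    if not os.path.exists(path):
        # first record: write the full wrapper in one go
        with open(path, 'w') as f:
            f.write('{"isoc": [' + entry + ']}')
        return
    with open(path, 'rb+') as f:
        f.seek(-2, os.SEEK_END)  # step back over the final "]}"
        f.truncate()
        f.write((', ' + entry + ']}').encode('utf-8'))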
I am trying to parse a big json file (hundreds of gigs) to extract information from its keys. For simplicity, consider the following example:
import json
import random, string

# To create a random key
def random_string(length):
    return "".join(random.choice(string.lowercase) for i in range(length))

# Create the dictionary
dummy = {random_string(10): random.sample(range(1, 1000), 10) for times in range(15)}

# Dump the dictionary into a json file
with open("dummy.json", "w") as fp:
    json.dump(dummy, fp)
Then, I use ijson in python 2.7 to parse the file:
import ijson

file_name = "dummy.json"
with open(file_name, "r") as fp:
    for key in dummy.keys():
        print "key: ", key
        parser = ijson.items(fp, str(key) + ".item")
        for number in parser:
            print number,
I was expecting to retrieve all the numbers in the lists corresponding to the keys of the dict. However, I got:
IncompleteJSONError: Incomplete JSON data
I am aware of this post: Using python ijson to read a large json file with multiple json objects, but in my case I have a single json file that is well formed, with a relatively simple schema. Any ideas on how I can parse it? Thank you.
ijson has an iterator interface to deal with large JSON files, allowing you to read the file lazily. You can process the file in small chunks and save results somewhere else.
Calling ijson.parse() yields three values: prefix, event, and value.
Some JSON:
{
    "europe": [
        {"name": "Paris", "type": "city"},
        {"name": "Rhein", "type": "river"}
    ]
}
Code:
import ijson

data = ijson.parse(open(FILE_PATH, 'r'))
for prefix, event, value in data:
    if event == 'string':
        print(value)
Output:
Paris
city
Rhein
river
Reference: https://pypi.python.org/pypi/ijson
The sample json content file is given below: it has records of two people. It might as well have 2 million records.
[
    {
        "Name": "Joy",
        "Address": "123 Main St",
        "Schools": [
            "University of Chicago",
            "Purdue University"
        ],
        "Hobbies": [
            {
                "Instrument": "Guitar",
                "Level": "Expert"
            },
            {
                "percussion": "Drum",
                "Level": "Professional"
            }
        ],
        "Status": "Student",
        "id": 111,
        "AltID": "J111"
    },
    {
        "Name": "Mary",
        "Address": "452 Jubal St",
        "Schools": [
            "University of Pensylvania",
            "Washington University"
        ],
        "Hobbies": [
            {
                "Instrument": "Violin",
                "Level": "Expert"
            },
            {
                "percussion": "Piano",
                "Level": "Professional"
            }
        ],
        "Status": "Employed",
        "id": 112,
        "AltID": "M112"
    }
]
I created a generator which would return each person's record as a json object. The code looks like the below. This is not quite generator code yet; changing a couple of lines would make it a generator (see the sketch after the code).
import json

curly_idx = []
jstr = ""
first_curly_found = False

with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()
        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
                print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue
        if first_curly_found:
            jstr = f'{jstr}{line}'
        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break
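A sketch of that generator variant (person_records is my name for it; like the code above, it assumes braces never occur inside string values):

import json

def person_records(path):
    # same brace counting as above, but yielding each parsed object
    depth = 0
    started = False
    jstr = ""
    with open(path, 'r') as fp:
        for line in fp:
            depth += line.count('{') - line.count('}')
            if started or '{' in line:
                started = True
                jstr += line
            if started and depth == 0:
                candidate = jstr.rstrip().rstrip(',')
                if len(candidate) > 10:  # skip bare "[" / "]" and blank lines
                    yield json.loads(candidate)
                jstr = ""

for person in person_records("test.json"):
    print(person["Name"])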
You are starting more than one parsing iteration with the same file object without resetting it. The first call to ijson will work, but it moves the file object to the end of the file; the second time you pass the same object to ijson, it will complain because there is nothing left to read from the file.
Try opening the file each time you call ijson; alternatively, seek to the beginning of the file after each call so the file object can read the file data again.
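Applied to the question's loop, the seek fix might look like this (a sketch in the question's Python 2 style; dummy is the dict built earlier):

import ijson

file_name = "dummy.json"
with open(file_name, "r") as fp:
    for key in dummy.keys():
        fp.seek(0)  # rewind so each pass can re-read the whole file
        print "key: ", key
        parser = ijson.items(fp, str(key) + ".item")
        for number in parser:
            print number,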
If you are working with json in the following format, you can use ijson.items().
sample json:
[
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"},
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"}
]
import gzip
import ijson
from pathlib import Path

input = 'file.txt'
res = []
if Path(input).suffix[1:].lower() == 'gz':
    input_file_handle = gzip.open(input, mode='rb')
else:
    input_file_handle = open(input, 'rb')

for json_row in ijson.items(input_file_handle, 'item'):
    res.append(json_row)
I have a list of json objects that I would like to write to a json file. An example of my data is as follows:
{
    "_id": "abc",
    "resolved": false,
    "timestamp": "2017-04-18T04:57:41 366000",
    "timestamp_utc": {
        "$date": 1492509461366
    },
    "sessionID": "abc",
    "resHeight": 768,
    "time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
    "referrer": "Standalone",
    "g_event_id": "abc",
    "user_agent": "abc"
    "_id": "abc",
} {
    "_id": "abc",
    "resolved": false,
    "timestamp": "2017-04-18T04:57:41 366000",
    "timestamp_utc": {
        "$date": 1492509461366
    },
    "sessionID": "abc",
    "resHeight": 768,
    "time_bucket": ["2017-year", "2017-04-month", "2017-16-week", "2017-04-18-day", "2017-04-18 16-hour"],
    "referrer": "Standalone",
    "g_event_id": "abc",
    "user_agent": "abc"
}
I would like to write this to a json file. Here's the code that I am using for this purpose:
with open("filename", 'w') as outfile1:
    for row in data:
        outfile1.write(json.dumps(row))
But this gives me a file with only 1 long row of data. I would like to have a row for each json object in my original data. I know there are some other StackOverflow questions that try to address a somewhat similar situation (by externally inserting '\n' etc.), but it hasn't worked in my case for some reason. I believe there has to be a pythonic way to do this.
How do I achieve this?
The format of the file you are trying to create is called JSON Lines.
It seems you are asking why the JSON objects are not separated by newlines: because the write method does not append one.
If you want implicit newlines, you should use the print function instead:
with open("filename", 'w') as outfile1:
    for row in data:
        print(json.dumps(row), file=outfile1)
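To read such a JSON Lines file back, a minimal sketch is one json.loads call per line:

import json

with open("filename") as infile:
    rows = [json.loads(line) for line in infile]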
Use the indent argument to output json with extra whitespace. The default is to not output linebreaks or extra spaces.
with open('filename.json', 'w') as outfile1:
    json.dump(data, outfile1, indent=4)
https://docs.python.org/3/library/json.html#basic-usage
I am trying to access weather API data. It returns a long, barely human-readable single line. I am trying to replace every opening bracket ({) with '{\n', so that the bracket remains but is followed by a newline character, just for more readable JSON.
But it returns every character on a new line in the shell.
import urllib2

url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
s = data.read()
count = 0
s = s.replace('{', "{\n")
#s = ''.join(s)
for line in s:
    print line
    count = count + 1
print count
After join() the problem still persists.
The problematic output of this code prints every character of the response on its own line.
Why don't you use the built-in capabilities of the json library that's standard in Python?
import urllib2
import json
url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
# read the contents in and parse the JSON.
jsonData = json.loads(data.read())
# print it out nicely formatted:
print json.dumps(jsonData, sort_keys=True, indent=4, separators=(',', ': '))
output:
{
    "cod": "200",
    "count": 1,
    "list": [
        {
            "clouds": {
                "all": 20
            },
            "coord": {
                "lat": 38.7994,
                "lon": -89.9603
            },
            "dt": 1442072098,
            "id": 4237717,
            "main": {
                "humidity": 67,
                "pressure": 1020,
                "temp": 16.82,
                "temp_max": 18.89,
                "temp_min": 15
            },
            "name": "Edwardsville",
            "sys": {
                "country": "United States of America"
            },
            "weather": [
                {
                    "description": "few clouds",
                    "icon": "02d",
                    "id": 801,
                    "main": "Clouds"
                }
            ],
            "wind": {
                "deg": 350,
                "speed": 4.6
            }
        }
    ],
    "message": "accurate"
}
The issue is here:
for line in s:
    print line
At this point, it will print every character on a separate line - that's what print does (it adds a trailing newline to each print command), as shown by this code:
print 1
print
print 2
which outputs this:
1

2
You may be confused by the name line, but it's not a special variable name. You can change the word line to any valid variable name and it will work the same way.
A for loop will iterate over an iterable. If it's a file, it will do each line. A list will be each element, and a string goes over every character. Because you say to print it, it then prints them individually.
Are you expecting a non-string response from the API? If it gives a list like this:
["calls=10","message=hello"]
then your for loop will print each in turn. But if it's just a string, like "message=hello", it will print each character.
And the reason there is a blank newline after the {? Because the replace command is working fine.
s is just a string, so doing for x in s actually iterates over individual characters of s, not over its lines. I think you're confusing it with for line in f when f is a file object!
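A quick illustration of the difference:

s = "ab\ncd"
for ch in s:                 # a plain string yields one character at a time
    print(repr(ch))          # 'a', 'b', '\n', 'c', 'd'

import io
for line in io.StringIO(s):  # a file-like object yields whole lines
    print(repr(line))        # 'ab\n', 'cd'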