Adding brackets and commas to multiple JSON objects - python

I've created a very simple piece of code to read in tweets in JSON format in text files, determine if they contain an id and coordinates and if so, write these attributes to a csv file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
with open(filename, 'r') as file:
data = simplejson.load(file)
if 'text' and 'coordinates' in data:
f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties but with the help of the excellent JSON Lint website have realised my mistake. I have multiple JSON objects and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.

You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
import json
def read_objects(filename):
decoder = json.JSONDecoder()
with open(filename, 'r') as inputfile:
line = next(inputfile).strip()
while line:
try:
obj, index = decoder.raw_decode(line)
yield obj
line = line[index:]
except ValueError:
# Assume we didn't have a complete object yet
line += next(inputfile).strip()
if not line:
line += next(inputfile).strip()
This should be able to read all your JSON objects in sequence:
for filename in all_files:
for data in read_objects(filename):
if 'text' and 'coordinates' in data:
f.writerow([data['id'], data['geo']['coordinates']])
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.

Related

Python reading nothing from file [duplicate]

I am a beginner of Python. I am trying now figuring out why the second 'for' loop doesn't work in the following script. I mean that I could only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data csv in the below.
It will be helpful if you tell me why it goes in this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv
file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
print(e['a'])
for e in read:
print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
I have created small piece of function which doe take path of csv file read and return list of dict at once then you loop through list very easily,
def read_csv_data(path):
"""
Reads CSV from given path and Return list of dict with Mapping
"""
data = csv.reader(open(path))
# Read the column names from the first line of the file
fields = data.next()
data_lines = []
for row in data:
items = dict(zip(fields, row))
data_lines.append(items)
return data_lines
Regards

Can I replace part of a string in a JSON key in Python?

This is my first question here, I'm new to python and trying to figure some things out to set up an automatic 3D model processing chain that relies on data being stored in JSON files moving from one server to another.
The problem is that I need to store absolute paths to files that are being processed, but these absolute paths should be modified in the original JSON files upon the first time that they are processed.
Basically the JSON file comes in like this:
{
"normaldir": "D:\\Outgoing\\1621_1\\",
"projectdir": "D:\\Outgoing\\1622_2\\"
}
And I would like to rename the file paths to
{
"normaldir": "X:\\Incoming\\1621_1\\",
"projectdir": "X:\\Incoming\\1622_2\\",
}
What I've been trying to do is replace the first part of the path using this code, but it isn't working:
def processscan(scanfile):
configfile= MonitorDirectory + scanfile
with open(configfile, 'r+') as file:
content = file.read()
file.seek(0)
content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
file.write(content)
However this was not working at all, so I tried interpreting the JSON file properly and replacing the key code from here:
def processscan(scanfile):
configfile= MonitorDirectory + scanfile
with open(configfile, 'r+') as settingsData:
settings = json.load(settingsData)
settings['normaldir'] = 'X:\\Incoming\\1621_1\\'
settings['projectdir'] = 'X:\\Incoming\\1622_2\\'
settingsData.seek(0) # rewind to beginning of file
settingsData.write(json.dumps(settings,indent=2,sort_keys=True)) #write the updated version
settingsData.truncate() #truncate the remainder of the data in the file
This works perfectly, however I'm replacing the whole path so it won't really work for every JSON file that I need to process. What I would really like to do is to take a JSON key corresponding to a file path, keep the last 8 characters and replace the rest of the patch with a new string, but I can't figure out how to do this using json in python, as far as I can tell I can't edit part of a key.
Does anyone have a workaround for this?
Thanks!
Your replace logic failed as you need to reassign content to the new string,str.replace is not an inplace operation, it creates a new string:
content = content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
Using the json approach just do a replace too, using the current value:
settings['normaldir'] = settings['normaldir'].replace("D:\\Outgoing\\", "X:\\Incoming\\")
You also would want truncate() before you write or just reopen the file with w and dump/write the new value, if you really wanted to just keep the last 8 chars and prepend a string:
settings['normaldir'] = "X:\\Incoming\\" + settings['normaldir'][-8:]
Python come with a json library.
With this library, you can read and write JSON files (or JSON strings).
Parsed data is converted to Python objects and vice versa.
To use the json library, simply import it:
import json
Say your data is stored in input_data.json file.
input_data_path = "input_data.json"
You read the file like this:
import io
with io.open(input_data_path, mode="rb") as fd:
obj = json.load(fd)
or, alternatively:
with io.open(input_data_path, mode="rb") as fd:
content = fd.read()
obj = json.loads(content)
Your data is automatically converted into Python objects, here you get a dict:
print(repr(obj))
# {u'projectdir': u'D:\\Outgoing\\1622_2\\',
# u'normaldir': u'D:\\Outgoing\\1621_1\\'}
note: I'm using Python 2.7 so you get the unicode string prefixed by "u", like u'projectdir'.
It's now easy to change the values for normaldir and projectdir:
obj["normaldir"] = "X:\\Incoming\\1621_1\\"
obj["projectdir"] = "X:\\Incoming\\1622_2\\"
Since obj is a dict, you can also use the update method like this:
obj.update({'normaldir': "X:\\Incoming\\1621_1\\",
'projectdir': "X:\\Incoming\\1622_2\\"})
That way, you use a similar syntax like JSON.
Finally, you can write your Python object back to JSON file:
output_data_path = "output_data.json"
with io.open(output_data_path, mode="wb") as fd:
json.dump(obj, fd)
or, alternatively with indentation:
content = json.dumps(obj, indent=True)
with io.open(output_data_path, mode="wb") as fd:
fd.write(content)
Remarks: reading/writing JSON objects is faster with a buffer (the content variable).
.replace returns a new string, and don't change it. But you should not treat json-files as normal text files, so you can combine parsing json with replace:
def processscan(scanfile):
configfile= MonitorDirectory + scanfile
with open(configfile, 'rb') as settingsData:
settings = json.load(settingsData)
settings = {k: v.replace("D:\\Outgoing\\", "X:\\Incoming\\")
for k, v in settings.items()
}
with open(configfile, 'wb') as settingsData:
json.dump(settings, settingsData)

Using python ijson to read a large json file with multiple json objects

I'm trying to parse a large (~100MB) json file using ijson package which allows me to interact with the file in an efficient way. However, after writing some code like this,
with open(filename, 'r') as f:
parser = ijson.parse(f)
for prefix, event, value in parser:
if prefix == "name":
print(value)
I found that the code parses only the first line and not the rest of the lines from the file!!
Here is how a portion of my json file looks like:
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}
In my opinion, I think ijson parses only one json object.
Can someone please suggest how to work around this?
Since the provided chunk looks more like a set of lines each composing an independent JSON, it should be parsed accordingly:
# each JSON is small, there's no need in iterative processing
import json
with open(filename, 'r') as f:
for line in f:
data = json.loads(line)
# data[u'name'], data[u'engine_speed'], data[u'timestamp'] now
# contain correspoding values
Unfortunately the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle 1 overall object, and if you attempt to parse a second object, you will get an error: "ijson.common.JSONError: Additional data". See bug reports here:
https://github.com/isagalaev/ijson/issues/40
https://github.com/isagalaev/ijson/issues/42
https://github.com/isagalaev/ijson/issues/67
python: how do I parse a stream of json arrays with ijson library
It's a big limitation. However, as long as you have line breaks (new line character) after each JSON object, you can parse each one line-by-line independently, like this:
import io
import ijson
with open(filename, encoding="UTF-8") as json_file:
cursor = 0
for line_number, line in enumerate(json_file):
print ("Processing line", line_number + 1,"at cursor index:", cursor)
line_as_file = io.StringIO(line)
# Use a new parser for each line
json_parser = ijson.parse(line_as_file)
for prefix, type, value in json_parser:
print ("prefix=",prefix, "type=",type, "value=",value)
cursor += len(line)
You are still streaming the file, and not loading it entirely in memory, so it can work on large JSON files. It also uses the line streaming technique from: How to jump to a particular line in a huge text file? and uses enumerate() from: Accessing the index in 'for' loops?

Writing Json in for loop in Python

I am downloading Json files from an API, I use the following code to write the JSON. Each item the loop gives me a JSON file. I need to save it and extract entities from the appended JSON file using a loop.
for item in style_ls:
dat = get_json(api, item)
specs_dict[item] = dat
with open("specs_append.txt", "a") as myfile:
json.dump(dat, myfile)
myfile.close()
print item
with open ("specs_data.txt", "w") as my file:
json.dump(spec_dict, myfile)
myfile.close()
I know that I cannot get a valid JSON format from the specs_append.txt, but I can get one from the specs_data.txt. I am doing the first one just because my program needs atleast 3-4 days to complete and there are high chances that my system may shutdown. So is there anyway I can do this efficiently ?
If not is there anyway I can extract it from specs_append.txt <{JSON}{JSON}> format (which is not a valid JSON format)?
If not should I write specs_dict to a txt file every time in the loop, so that even if program gets terminated i can start if from that point in loop and still get a valid json format?
I suggest several possible solutions.
One solution is to write custom code to slurp in the input file. I would suggest putting a special line before each JSON object in the file, such as: ###
Then you could write code like this:
import json
def json_get_objects(f):
temp = ''
line = next(f) # pull first line
assert line == SPECIAL_LINE
for line in f:
if line != SPECIAL_LINE:
temp += line
else:
# found special marker, temp now contains a complete JSON object
j = json.loads(temp)
yield j
temp = ''
# after loop done, yield up last JSON object
if temp:
j = json.loads(temp)
yield j
with open("specs_data.txt", "r") as f:
for j in json_get_objects(f):
pass # do something with JSON object j
Two notes on this. First, I am simply appending to a string over and over; this used to be a very slow way to do this in Python, so if you are using a very old version of Python, don't do it this way unless your JSON objects are very small. Second, I wrote code to split the input and yield up JSON objects one at a time, but you could also use a guaranteed-unique string, slurp in all the data with a single call to f.read() and then split on your guaranteed-unique string using the str.split() method function.
Another solution would be to write the whole file as a valid JSON list of valid JSON objects. Write the file like this:
{"mylist":[
# first JSON object, followed by a comma
# second JSON object, followed by a comma
# third JSON object
]}
This would require your file appending code to open the file with writing permission, and seek to the last ] in the file before writing a comma plus newline, then the new JSON object on the end, and then finally writing ]} to close out the file. If you do it this way, you can use json.loads() to slurp the whole thing in and have a list of JSON objects.
Finally, I suggest that maybe you should just use a database. Use SQLite or something and just throw the JSON strings in to a table. If you choose this, I suggest using an ORM to make your life simple, rather than writing SQL commands by hand.
Personally, I favor the first suggestion: write in a special line like ###, then have custom code to split the input on those marks and then get the JSON objects.
EDIT: Okay, the first suggestion was sort of assuming that the JSON was formatted for human readability, with a bunch of short lines:
{
"foo": 0,
"bar": 1,
"baz": 2
}
But it's all run together as one big long line:
{"foo":0,"bar":1,"baz":2}
Here are three ways to fix this.
0) write a newline before the ### and after it, like so:
###
{"foo":0,"bar":1,"baz":2}
###
{"foo":0,"bar":1,"baz":2}
Then each input line will alternately be ### or a complete JSON object.
1) As long as SPECIAL_LINE is completely unique (never appears inside a string in the JSON) you can do this:
with open("specs_data.txt", "r") as f:
temp = f.read() # read entire file contents
lst = temp.split(SPECIAL_LINE)
json_objects = [json.loads(x) for x in lst]
for j in json_objects:
pass # do something with JSON object j
The .split() method function can split up the temp string into JSON objects for you.
2) If you are certain that each JSON object will never have a newline character inside it, you could simply write JSON objects to the file, one after another, putting a newline after each; then assume that each line is a JSON object:
import json
def json_get_objects(f):
for line in f:
if line.strip():
yield json.loads(line)
with open("specs_data.txt", "r") as f:
for j in json_get_objects(f):
pass # do something with JSON object j
I like the simplicity of option (2), but I like the reliability of option (0). If a newline ever got written in as part of a JSON object, option (0) would still work, but option (2) would error.
Again, you can also simply use an actual database (SQLite) with an ORM and let the database worry about the details.
Good luck.
Append json data to a dict on every loop.
In the end dump this dict as a json and write it to a file.
For getting you an idea for appending data to dict:
>>> d1 = {'suku':12}
>>> t1 = {'suku1':212}
>>> d1.update(t1)
>>> d1
{'suku1': 212, 'suku': 12}

How to read line-delimited JSON from large file (line by line)

I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
{
"key11": value11,
"key12": value12,
}
{
"key21": value21,
"key22": value22,
}
…
The way I'm importing it now is:
content = open(file_path, "r").read()
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data, is there a way to construct each JSON object as I'm reading the file line by line? Thanks!
Just read each line and construct a json object at this time:
with open(file_path) as f:
for line in f:
j_content = json.loads(line)
This way, you load proper complete json object (provided there is no \n in a json value somewhere or in the middle of your json object) and you avoid memory issue as each object is created when needed.
There is also this answer.:
https://stackoverflow.com/a/7795029/671543
contents = open(file_path, "r").read()
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
{
"key11": 11,
"key12": 12
}
{
"key21": 21,
"key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
with open(args.infile, 'r') as infile:
# Variable for building our JSON block
json_block = []
for line in infile:
# Add the line to our JSON block
json_block.append(line)
# Check whether we closed our JSON block
if line.startswith('}'):
# Do something with the JSON dictionary
json_dict = json.loads(''.join(json_block))
print(json_dict)
# Start a new block
json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
This expands Cohen's answer:
content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = io.StringIO()
file_buffer = content_object.get()['Body'].read().decode('utf-8')
json_lines = []
for line in file_buffer.splitlines():
j_content = json.loads(line)
json_lines.append(j_content)
df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
Had to read some data from AWS S3 and parse a newline delimited jsonl file. My solution was this using splitlines
The code:
for line in json_input.splitlines():
one_json = json.loads(line)
The line by line reading approach is good, as mentioned in some of the above answers.
However across multiple JSON tree structures I would recommend decomposition into 2 functions to have more robust error handling.
For example,
def load_cases(file_name):
with open(file_name) as file:
cases = (parse_case_line(json.loads(line)) for line in file)
cases = filter(None, cases)
return list(cases)
parse_case_line can encapsulate the key parsing logic required in your above example, for example with regex matching, or application-specific requirements. It also means that you can select which json key-values you want to parse out.
Another advantage of this approach is filter handles multiple \n in the middle of your json object, and parses the whole file :-).
Just read it line by line and parse e through a stream
while ur hacking trick (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list) isn't memory-friendly if the file is too more than 1GB as the whole content will land on the RAM.

Categories

Resources