Writing JSON in a for loop in Python

I am downloading JSON files from an API and use the following code to write them out. Each iteration of the loop gives me one JSON object. I need to save it and later extract entities from the appended JSON file using a loop.
for item in style_ls:
    dat = get_json(api, item)
    specs_dict[item] = dat
    # append each result immediately so it survives a crash
    with open("specs_append.txt", "a") as myfile:
        json.dump(dat, myfile)
    print(item)
# after the loop, write everything as one valid JSON document
with open("specs_data.txt", "w") as myfile:
    json.dump(specs_dict, myfile)
I know that I cannot get valid JSON from specs_append.txt, but I can get it from specs_data.txt. I am writing the first file only because my program needs at least 3-4 days to complete and there is a high chance that my system will shut down before then. So is there any way I can do this efficiently?
If not, is there any way I can extract the objects from the {JSON}{JSON} format of specs_append.txt (which is not valid JSON)?
If not, should I write specs_dict to a text file on every iteration of the loop, so that even if the program gets terminated I can restart from that point in the loop and still get valid JSON?

I suggest several possible solutions.
One solution is to write custom code to slurp in the input file. I would suggest putting a special line before each JSON object in the file, such as: ###
Then you could write code like this:
import json

SPECIAL_LINE = "###\n"  # marker line written before each JSON object

def json_get_objects(f):
    temp = ''
    line = next(f)  # pull first line
    assert line == SPECIAL_LINE
    for line in f:
        if line != SPECIAL_LINE:
            temp += line
        else:
            # found special marker, temp now contains a complete JSON object
            j = json.loads(temp)
            yield j
            temp = ''
    # after loop done, yield up last JSON object
    if temp:
        j = json.loads(temp)
        yield j

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
Two notes on this. First, I am simply appending to a string over and over; this used to be a very slow way to do this in Python, so if you are using a very old version of Python, don't do it this way unless your JSON objects are very small. Second, I wrote code to split the input and yield up JSON objects one at a time, but you could also use a guaranteed-unique string, slurp in all the data with a single call to f.read() and then split on your guaranteed-unique string using the str.split() method function.
Another solution would be to write the whole file as a valid JSON list of valid JSON objects. Write the file like this:
{"mylist":[
# first JSON object, followed by a comma
# second JSON object, followed by a comma
# third JSON object
]}
This would require your file appending code to open the file with writing permission, and seek to the last ] in the file before writing a comma plus newline, then the new JSON object on the end, and then finally writing ]} to close out the file. If you do it this way, you can use json.loads() to slurp the whole thing in and have a list of JSON objects.
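A minimal sketch of that append step, assuming the file is only ever written by this function, so it always ends with exactly the bytes "\n]}". The name append_to_json_list is made up for illustration; the "mylist" key comes from the layout above.

```python
import json
import os

TAIL = b"\n]}"  # closing bytes this sketch always leaves at the end of the file

def append_to_json_list(path, obj):
    """Add obj to a file shaped like {"mylist":[...]} without rewriting it."""
    encoded = json.dumps(obj).encode("utf-8")
    if not os.path.exists(path):
        # first object: write the opening, the object and the closing
        with open(path, "wb") as f:
            f.write(b'{"mylist":[\n' + encoded + TAIL)
        return
    with open(path, "r+b") as f:
        # position just before the closing "\n]}" and overwrite it
        f.seek(-len(TAIL), os.SEEK_END)
        f.write(b",\n" + encoded + TAIL)
```

After every call the file is a complete JSON document again, so json.load() can read it at any time. If the program dies in the middle of a write, the closing ]} may be missing and would have to be repaired by hand.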
Finally, I suggest that maybe you should just use a database. Use SQLite or something and just throw the JSON strings in to a table. If you choose this, I suggest using an ORM to make your life simple, rather than writing SQL commands by hand.
Personally, I favor the first suggestion: write in a special line like ###, then have custom code to split the input on those marks and then get the JSON objects.
EDIT: Okay, the first suggestion was sort of assuming that the JSON was formatted for human readability, with a bunch of short lines:
{
    "foo": 0,
    "bar": 1,
    "baz": 2
}
But it's all run together as one big long line:
{"foo":0,"bar":1,"baz":2}
Here are three ways to fix this.
0) Write a newline before the ### and after it, like so:
###
{"foo":0,"bar":1,"baz":2}
###
{"foo":0,"bar":1,"baz":2}
Then each input line will alternately be ### or a complete JSON object.
1) As long as SPECIAL_LINE is completely unique (never appears inside a string in the JSON) you can do this:
with open("specs_data.txt", "r") as f:
    temp = f.read()  # read entire file contents
lst = temp.split(SPECIAL_LINE)
# skip empty pieces, e.g. the one before a leading SPECIAL_LINE
json_objects = [json.loads(x) for x in lst if x.strip()]
for j in json_objects:
    pass  # do something with JSON object j
The .split() method function can split up the temp string into JSON objects for you.
2) If you are certain that each JSON object will never have a newline character inside it, you could simply write JSON objects to the file, one after another, putting a newline after each; then assume that each line is a JSON object:
import json

def json_get_objects(f):
    for line in f:
        if line.strip():
            yield json.loads(line)

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
I like the simplicity of option (2), but I like the reliability of option (0). If a newline ever got written in as part of a JSON object, option (0) would still work, but option (2) would error.
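For the writing side of option (2), here is a rough sketch of a loop that can be restarted after a shutdown. It assumes the items are strings and that each record is stored as {"item": ..., "data": ...}; the names load_done and fetch_all and that record layout are invented for this example. json.dumps() without indent never emits a raw newline, so each record stays on one line.

```python
import json
import os

def load_done(path):
    """Return {item: data} for every complete record already in the file."""
    done = {}
    if not os.path.exists(path):
        return done
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue  # blank line
            try:
                record = json.loads(line)
            except ValueError:
                continue  # half-written line left by an interrupted run
            done[record["item"]] = record["data"]
    return done

def fetch_all(items, fetch, path):
    """Fetch every item not yet in the file, appending one JSON line each."""
    done = load_done(path)
    with open(path, "a") as f:
        f.write("\n")  # terminate a possible half-written last line
        for item in items:
            if item in done:
                continue  # already saved by an earlier run
            data = fetch(item)
            f.write(json.dumps({"item": item, "data": data}) + "\n")
            f.flush()  # hand the line to the OS before the next request
            done[item] = data
    return done
```

With the question's code, the call would look something like fetch_all(style_ls, lambda item: get_json(api, item), "specs_append.txt"), and the returned dict plays the role of specs_dict.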
Again, you can also simply use an actual database (SQLite) with an ORM and let the database worry about the details.
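As a rough illustration of the database route, this sketch uses the standard-library sqlite3 module directly instead of an ORM. The table name specs, its two columns and the three helper functions are assumptions made for the example.

```python
import json
import sqlite3

def open_store(path):
    """Open (or create) a database with one table of JSON strings."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specs (item TEXT PRIMARY KEY, body TEXT)"
    )
    return conn

def save_spec(conn, item, data):
    """Store one object; committing after each row keeps it across a crash."""
    conn.execute(
        "INSERT OR REPLACE INTO specs (item, body) VALUES (?, ?)",
        (item, json.dumps(data)),
    )
    conn.commit()

def load_specs(conn):
    """Return {item: data} for every stored row."""
    rows = conn.execute("SELECT item, body FROM specs")
    return {item: json.loads(body) for item, body in rows}
```

Inside the download loop you would call save_spec(conn, item, dat) once per item, and after a restart load_specs(conn) tells you which items are already done.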
Good luck.

Append the JSON data to a dict on every iteration of the loop.
At the end, dump this dict as JSON and write it to a file.
To give you an idea of how to append data to a dict:
>>> d1 = {'suku':12}
>>> t1 = {'suku1':212}
>>> d1.update(t1)
>>> d1
{'suku1': 212, 'suku': 12}
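If the dict is also saved during the loop, as the last paragraph of the question considers, one way to keep the file valid at all times is to write to a temporary file and rename it over the old one. This is only a sketch: save_checkpoint and the ".tmp" suffix are hypothetical, and os.replace needs Python 3.3 or later.

```python
import json
import os

def save_checkpoint(path, data):
    """Write data as JSON to path, replacing the old file in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to put the bytes on disk
    os.replace(tmp_path, path)  # rename over the previous checkpoint
```

Each call rewrites the whole dict, so the cost grows with its size; saving every N items instead of every item is a reasonable compromise.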

Related

Can I replace part of a string in a JSON key in Python?

This is my first question here, I'm new to python and trying to figure some things out to set up an automatic 3D model processing chain that relies on data being stored in JSON files moving from one server to another.
The problem is that I need to store absolute paths to files that are being processed, but these absolute paths should be modified in the original JSON files upon the first time that they are processed.
Basically the JSON file comes in like this:
{
    "normaldir": "D:\\Outgoing\\1621_1\\",
    "projectdir": "D:\\Outgoing\\1622_2\\"
}
And I would like to change the file paths to:
{
    "normaldir": "X:\\Incoming\\1621_1\\",
    "projectdir": "X:\\Incoming\\1622_2\\"
}
What I've been trying to do is replace the first part of the path using this code, but it isn't working:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as file:
        content = file.read()
        file.seek(0)
        content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
        file.write(content)
However, this did not work at all, so I tried parsing the JSON file properly and replacing the values:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as settingsData:
        settings = json.load(settingsData)
        settings['normaldir'] = 'X:\\Incoming\\1621_1\\'
        settings['projectdir'] = 'X:\\Incoming\\1622_2\\'
        settingsData.seek(0)  # rewind to beginning of file
        settingsData.write(json.dumps(settings, indent=2, sort_keys=True))  # write the updated version
        settingsData.truncate()  # truncate the remainder of the data in the file
This works perfectly; however, I'm replacing the whole path, so it won't work for every JSON file that I need to process. What I would really like to do is take the value of a JSON key that holds a file path, keep the last 8 characters and replace the rest of the path with a new string, but I can't figure out how to do this using json in Python. As far as I can tell, I can't edit part of a key.
Does anyone have a workaround for this?
Thanks!
Your replace logic failed because you need to reassign content to the new string. str.replace is not an in-place operation; it creates a new string:
content = content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
Using the json approach just do a replace too, using the current value:
settings['normaldir'] = settings['normaldir'].replace("D:\\Outgoing\\", "X:\\Incoming\\")
You would also want to call truncate() before you write, or just reopen the file with w and dump/write the new value. If you really want to keep just the last 8 characters and prepend a string, note that those 8 characters already include the leading backslash:
settings['normaldir'] = "X:\\Incoming" + settings['normaldir'][-8:]
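Slicing by a fixed character count only works while every folder name has the same length. Here is a small sketch that keeps whatever the last path component is, using the standard-library ntpath module (it applies Windows path rules on any platform). The helper name move_path and its default new_root are made up for illustration.

```python
import ntpath

def move_path(path, new_root="X:\\Incoming\\"):
    """Keep the last folder name of a Windows path and put it under new_root."""
    # drop the trailing backslash so basename() returns the folder name
    name = ntpath.basename(path.rstrip("\\"))
    return new_root + name + "\\"
```

It could then be applied as settings['normaldir'] = move_path(settings['normaldir']), and likewise for projectdir.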
Python comes with a json library.
With this library, you can read and write JSON files (or JSON strings).
Parsed data is converted to Python objects and vice versa.
To use the json library, simply import it:
import json
Say your data is stored in input_data.json file.
input_data_path = "input_data.json"
You read the file like this:
import io
with io.open(input_data_path, mode="rb") as fd:
    obj = json.load(fd)
or, alternatively:
with io.open(input_data_path, mode="rb") as fd:
    content = fd.read()
obj = json.loads(content)
Your data is automatically converted into Python objects, here you get a dict:
print(repr(obj))
# {u'projectdir': u'D:\\Outgoing\\1622_2\\',
# u'normaldir': u'D:\\Outgoing\\1621_1\\'}
Note: I'm using Python 2.7, so you get unicode strings prefixed by "u", like u'projectdir'.
It's now easy to change the values for normaldir and projectdir:
obj["normaldir"] = "X:\\Incoming\\1621_1\\"
obj["projectdir"] = "X:\\Incoming\\1622_2\\"
Since obj is a dict, you can also use the update method like this:
obj.update({'normaldir': "X:\\Incoming\\1621_1\\",
            'projectdir': "X:\\Incoming\\1622_2\\"})
That way, you use a syntax similar to JSON.
Finally, you can write your Python object back to a JSON file:
output_data_path = "output_data.json"
with io.open(output_data_path, mode="wb") as fd:  # Python 2.7; on Python 3 use mode="w"
    json.dump(obj, fd)
or, alternatively with indentation:
content = json.dumps(obj, indent=True)
with io.open(output_data_path, mode="wb") as fd:  # Python 2.7; on Python 3 use mode="w"
    fd.write(content)
Remarks: reading/writing JSON objects is faster with a buffer (the content variable).
.replace returns a new string and doesn't change the original. But you should not treat JSON files as plain text files, so you can combine parsing the JSON with replace:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'rb') as settingsData:
        settings = json.load(settingsData)
    settings = {k: v.replace("D:\\Outgoing\\", "X:\\Incoming\\")
                for k, v in settings.items()}
    # 'wb' suits Python 2; on Python 3 open the file with 'w'
    with open(configfile, 'wb') as settingsData:
        json.dump(settings, settingsData)

How to read each line of a file to a separate list to process them individually

There are already several questions to similar topics, but none of them solves mine.
I've written multiple lists to a text file. There, every line represents a list. Looks like this:
1: ['4bf58dd8d48988d1ce941735', '4bf58dd8d48988d157941735', '4bf58dd8d48988d1f1931735', etc.]
2: ['4bf58dd8d48988d16a941735', '4bf58dd8d48988d1f6941735', '4bf58dd8d48988d143941735', etc.]
...
I created it with:
with open('user_interest.txt', 'w') as f:
    for x in range(1, 1084):
        temp = df.get_group(x)
        temp_list = temp['CategoryID'].tolist()
        f.write(str(temp_list) + "\n")
If I read the file I get the whole file as a list. If I then access the lines, I have them as class string! But I want them again as a list like before I stored them.
with open('user_interest.txt', 'r') as file:
    for line in file:
        #temp_list.append(line)
        print(similarity_score(user_1_list, temp_list))
line is of class str here, not a list like I wanted. The idea with temp_list doesn't really work either.
(user_1_list is a fixed value, while temp_list is not)
Here's the context of the question: I want every line to be processed in my similarity_score function. I don't need the lists "forever" just hand it over to my function. This function should be applied to every line.
The function calculates cosine similarity and I have to find top 10 most similar users to a given user. So I have to compare each other user with my given user (user_1_list).
Pseudocode:
read line
convert line to a list
give list to my function
read next line ...
It's probably just an easy fix, but I don't get it yet. I want neither each line integrated into a new list / nested list
[['foo', 'bar', ...]]
nor all of them in a single list.
Thanks for any help and just ask if you need more information!
You should use a proper serializer like JSON to write your lists. Then, you can use the same to deserialize them:
import json
# when writing the lists
f.write(json.dumps(temp_list) + "\n")
# when reading
lst = json.loads(line)
Use pickle or JSON to serialize/deserialize your data.
If you absolutely need to do it your way, you can use ast.literal_eval.
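A short sketch of the ast.literal_eval route for a file whose lines were written with str(temp_list). The helper name parse_list_line is made up; literal_eval only accepts Python literals, so it does not execute arbitrary code from the file.

```python
import ast

def parse_list_line(line):
    """Turn one line of text written with str(a_list) back into a list."""
    value = ast.literal_eval(line.strip())
    if not isinstance(value, list):
        raise ValueError("expected a list literal, got %r" % type(value))
    return value
```

In the reading loop this would be used as temp_list = parse_list_line(line), followed by similarity_score(user_1_list, temp_list).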

Writing an object to python file

I have the following code:
matrix_file = open("abc.txt", "rU")
matrix = matrix_file.readlines()
keys = matrix[0]
vals = [line[1:] for line in matrix[1:]]
ea=open("abc_format.txt",'w')
ea.seek(0)
ea.write(vals)
ea.close()
However I am getting the following error:
TypeError: expected a character buffer object
How do I buffer the output and what data type is the variable vals?
vals is a list. If you want to write a list of strings to a file, as opposed to an individual string, use writelines:
ea=open("abc_format.txt",'w')
ea.seek(0)
ea.writelines(vals)
ea.close()
Note that this will not insert newlines for you (although in your specific case your strings already end in newlines, as pointed out in the comments). If you need to add newlines you could do the following as an example:
ea=open("abc_format.txt",'w')
ea.seek(0)
ea.writelines([line+'\n' for line in vals])
ea.close()
The write function will only handle characters or bytes. To write arbitrary objects, use python's pickle library. Write with pickle.dump(), read them back with pickle.load().
But if what you're really after is writing something in the same format as your input, you'll have to write out the matrix values and newlines yourself.
for line in vals:
    ea.write(line)
ea.close()
You've now written a file that looks like abc.txt, except that the first row and first character from each line has been removed. (You dropped those when constructing vals.)
Somehow I doubt this is what you intended, since you chose to name it abc_format.txt, but anyway this is how you write out a list of lines of text.
You cannot "write" objects to files. Rather, use the pickle module:
matrix_file = open("abc.txt", "rU")
matrix = matrix_file.readlines()
keys = matrix[0]
vals = [line[1:] for line in matrix[1:]]
#pickling begins!
import pickle
f = open("abc_format.txt", "wb")  # pickle needs a file opened for binary writing
pickle.dump(vals, f) #call with (object, file)
f.close()
Then read it like this:
import pickle
f = open("abc_format.txt", "rb")  # read the pickle back in binary mode
vals = pickle.load(f) #exactly the same list
f.close()
You can do this with any kind of object, your own or built-in. You can only write strings and bytes to files; Python's open() function just opens the file, much as Notepad would.
To answer your first question, vals is a list, because anything in [operation(i) for i in iterated_over] is a list comprehension, and list comprehensions make lists. To see what the type of any object is, just use the type() function; e.g. type([1,4,3])
Examples: https://repl.it/qKI/3
Documentation here:
https://docs.python.org/2/library/pickle.html and https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
First of all, instead of opening and closing the file separately, you can use a with statement, which does the job automatically. As for the error: as it says, the write method only accepts a character buffer object, so you need to convert your list to a string.
For example, you can use the join function, which joins the items of an iterable with a specific delimiter and returns the concatenated string.
with open("abc.txt", "rU") as f, open("abc_format.txt", 'w') as out:
    matrix = f.readlines()
    keys = matrix[0]
    vals = [line[1:] for line in matrix[1:]]
    # each line still ends with its own newline, so join on ''
    out.write(''.join(vals))
Also, as a more Pythonic way, since file objects are iterators, you can get the first line by calling next on the file and pass the rest to join:
with open("abc.txt", "rU") as f, open("abc_format.txt", 'w') as out:
    matrix = next(f)
    # the remaining lines keep their own newlines, so join on ''
    out.write(''.join(f))

How to read line-delimited JSON from large file (line by line)

I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:
{
    "key11": value11,
    "key12": value12,
}
{
    "key21": value21,
    "key22": value22,
}
…
The way I'm importing it now is:
content = open(file_path, "r").read()
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")
Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).
Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?
Also, Python can't seem to properly allocate memory for an object built from 2GB of data, is there a way to construct each JSON object as I'm reading the file line by line? Thanks!
Just read each line and construct a JSON object as you go:
with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)
This way, you load a proper, complete JSON object each time (provided there is no \n inside a JSON value or in the middle of a JSON object), and you avoid memory issues because each object is created only when needed.
There is also this answer:
https://stackoverflow.com/a/7795029/671543
contents = open(file_path, "r").read()
data = [json.loads(str(item)) for item in contents.strip().split('\n')]
This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.
{
    "key11": 11,
    "key12": 12
}
{
    "key21": 21,
    "key22": 22
}
Just read line-by-line, and build the JSON blocks as you go:
with open(args.infile, 'r') as infile:
    # Variable for building our JSON block
    json_block = []
    for line in infile:
        # Add the line to our JSON block
        json_block.append(line)
        # Check whether we closed our JSON block
        if line.startswith('}'):
            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)
            # Start a new block
            json_block = []
If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
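To show what such a hook does, here is a small sketch in which object_hook throws away unwanted keys while decoding. The helper keep_only and the sample text are invented for the example. Note that json.load() still reads the complete file text before decoding, so a hook reduces the size of the resulting Python objects, not the size of the input that has to be read.

```python
import json

def keep_only(wanted):
    """Build an object_hook that drops every key not listed in wanted."""
    def hook(obj):
        # called once for each decoded JSON object, innermost objects first
        return {key: value for key, value in obj.items() if key in wanted}
    return hook

text = '{"id": 1, "text": "hi", "user": {"id": 7, "name": "x"}}'
small = json.loads(text, object_hook=keep_only({"id", "user"}))
print(small)
```

The same hook can be passed to json.load(fd, object_hook=...) when reading from a file.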
This expands Cohen's answer:
# assumes s3_resource, BucketName and KeyFileName are defined elsewhere and pandas is imported as pd
content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = content_object.get()['Body'].read().decode('utf-8')
json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)
df_readback = pd.DataFrame(json_lines)
This assumes that the entire file will fit in memory. If it is too big then this will have to be modified to read in chunks or use Dask.
I had to read some data from AWS S3 and parse a newline-delimited jsonl file. My solution was to use splitlines.
The code:
for line in json_input.splitlines():
    one_json = json.loads(line)
The line by line reading approach is good, as mentioned in some of the above answers.
However across multiple JSON tree structures I would recommend decomposition into 2 functions to have more robust error handling.
For example,
def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)
        return list(cases)
parse_case_line can encapsulate the key parsing logic required in your above example, for example with regex matching, or application-specific requirements. It also means that you can select which json key-values you want to parse out.
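Another advantage of this approach is that filter(None, ...) drops every case for which parse_case_line returns None, so the rest of the file is still parsed. Note that blank lines between objects still need to be skipped before they reach json.loads, which raises an error on an empty line.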
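One possible shape for that decomposition is sketched below. It differs from the snippet above in that parse_case_line receives the raw line, so blank or malformed lines can be handled in one place. The rule that every case needs an "id" key and the fields that are kept are purely hypothetical.

```python
import json

def parse_case_line(line):
    """Return the wanted fields of one line, or None if it is unusable."""
    line = line.strip()
    if not line:
        return None  # blank line
    try:
        case = json.loads(line)
    except ValueError:
        return None  # not valid JSON
    if not isinstance(case, dict) or "id" not in case:
        return None  # hypothetical rule: every case needs an "id"
    return {"id": case["id"], "text": case.get("text")}

def load_cases(file_name):
    """Parse a line-delimited JSON file, silently skipping unusable lines."""
    with open(file_name) as file:
        cases = (parse_case_line(line) for line in file)
        return [case for case in cases if case is not None]
```

Skipping bad lines silently is a design choice; for data you care about, counting or logging the rejected lines is probably safer.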
Just read it line by line and parse each line as it streams in.
Your hack (adding commas between the JSON strings plus a beginning and ending square bracket to make it a proper list) isn't memory-friendly if the file is larger than 1GB, as the whole content will land in RAM.

Adding brackets and commas to multiple JSON objects

I've created a very simple piece of code to read in tweets in JSON format in text files, determine if they contain an id and coordinates and if so, write these attributes to a csv file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
        if 'text' and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties but with the help of the excellent JSON Lint website have realised my mistake. I have multiple JSON objects and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.
You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
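import json

def read_objects(filename):
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        # next(..., default) avoids StopIteration, which Python 3.7+ turns into RuntimeError inside a generator
        line = next(inputfile, '').strip()
        while line:
            try:
                obj, index = decoder.raw_decode(line)
                yield obj
                line = line[index:].lstrip()
            except ValueError:
                # Assume we didn't have a complete object yet
                more = next(inputfile, None)
                if more is None:
                    raise  # input ended in the middle of an object
                line += more.strip()
            if not line:
                line = next(inputfile, '').strip()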
This should be able to read all your JSON objects in sequence:
for filename in all_files:
    for data in read_objects(filename):
        # test each key separately; 'text' and 'coordinates' in data only checks 'coordinates'
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.
