Python writing to JSON file object by object

I am looping through a bunch of URLs to medium-sized JSON files that I'm trying to combine into a single file. Currently I'm appending the JSON data for each file to a list, which I can then write out to a single JSON file:
data = []
for i in range(50):
    item = tree.getroot()[i][0]
    with urllib.request.urlopen(url + item.text) as f:
        for line in f:
            data.append(json.loads(line))

# Save to file
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)
However, I'm not sure how scalable this approach is, and I will eventually have to combine hundreds of files this way. If the data list becomes too large, I'm worried that trying to write it in one go would cause my system to crash due to memory issues. Is there a way to write continuously to a JSON file, so that instead of appending inside the loop like this:
data.append(json.loads(line))
I would instead write to the file inside each loop with something like this:
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)
That way I'd be building the file as I go and could clear the memory between each iteration?
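One way to sketch that idea, assuming the same tree, url, and urllib.request/json setup as above, is to open the output file once and write each object on its own line (JSON Lines) as it arrives, so the full list never has to be held in memory:

import json
import urllib.request

# Sketch: stream each object to disk as a line of JSON instead of
# collecting everything in a list first (assumes the same tree/url setup).
with open('data.jsonl', 'w') as outfile:
    for i in range(50):
        item = tree.getroot()[i][0]
        with urllib.request.urlopen(url + item.text) as f:
            for line in f:
                outfile.write(json.dumps(json.loads(line)) + '\n')

The result is a JSON Lines file rather than a single JSON array, so it can be read back one object at a time later.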

Related

saving large streaming data in python

I have a large amount of data coming in every second in the form of Python dictionaries. Right now I am saving it to a MySQL server as it comes in, but that creates a backlog of more than a few hours. What is the best way to save the data locally and move it to a MySQL server every hour or so as a chunk to save time? I have tried Redis, but it can't save a list of these dictionaries which I can later move to MySQL.
A little-known fact about Python's native pickle format is that you can happily concatenate pickled objects into a single file.
That is, simply open a file in append mode and pickle.dump() your dictionary into that file. If you want to be extra fancy, you could do something like timestamped files:
from datetime import datetime
import pickle

def ingest_data(data_dict):
    # One file per hour; pickles are simply appended one after another.
    filename = '%s.pickles' % datetime.now().strftime('%Y-%m-%d_%H')
    with open(filename, 'ab') as outf:
        pickle.dump(data_dict, outf, pickle.HIGHEST_PROTOCOL)

def read_data(filename):
    with open(filename, 'rb') as inf:
        while True:
            try:
                yield pickle.load(inf)
            except EOFError:  # end of the concatenated pickles
                break
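For example, a hypothetical usage of the two helpers above (the dictionaries and the filename are made up):

# Append a few dictionaries as they arrive, then stream them back later.
ingest_data({'sensor': 'a', 'value': 1})
ingest_data({'sensor': 'b', 'value': 2})

for record in read_data('2024-01-01_12.pickles'):  # filename is illustrative
    print(record)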

Paste file in new file in python

Is there a way to just open/create filehandle = open( "example.bin", "wb") and extend this file with an existing file?
I'm thinking of something like the .extend function for lists.
Like so:
filehandle = open( "example.bin", "wb")
filehandle.extend(existing_file.bin)
I know that I can read the existing file into a variable/list and "paste" it into the new file, but I'm curious whether there is an easier option like this...
with open('original', 'a') as out_file, open('other', 'r') as ins_file:
    out_file.write(ins_file.read())
This will append the contents of other onto original. If you're dealing with binary data you can change the mode on each to ab and rb.
If the contents of the file are large, you can do it in chunks too.
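For example, a minimal chunked version (the file names and chunk size are placeholders):

# Copy 64 KiB at a time so neither file needs to fit in memory.
with open('original', 'ab') as out_file, open('other', 'rb') as ins_file:
    while True:
        chunk = ins_file.read(64 * 1024)
        if not chunk:
            break
        out_file.write(chunk)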
You can't merge file objects, but you can make a list of each file's lines and extend one with the other:
files_combined = list(open("example.bin", "rb")) + list(open("file_2", "rb"))
This returns a new list with all the lines of file_2 appended to those of example.bin. You can then save it to a new file, or overwrite one of the files.

Store data as numbers in a file in Python

I wrote a program that opens a file, reads it line by line, and stores just the third element of each line. The problem is that when I write those outputs into a file, I have to write them as strings, which is not suitable for me because I want to do some mathematical operations on the written file later on. FYI, it is also not suitable to store it like this and use int() while reading it.
Can anybody help me with this issue?
with open("/home/test1_to_write", "w") as g:
with open("/home/test1_to_read", 'r') as f:
for line in f:
a=line.split()
number = int(a[3])
g.write(str(number)+'\n')
g.close()
There's no way to tell a text file that 1 is the number one and not the character "1". If you need that, consider storing the whole thing as a list using some sort of serial format, e.g. JSON:
import json

with open("/home/test1_to_write.json", 'w') as outfile:
    with open("/home/test1_to_read", 'r') as infile:
        data = [int(line.split()[3]) for line in infile]
        json.dump(data, outfile)
You can then load the data with:
with open("/home/test1_to_write.json", "r") as infile:
read_data = json.load(infile)

How to store something other than a string in a file

I'm trying to write some code to create a file that will write data about a "character". I've been able to write strings using:
f = open('player.txt','w')
f.write("Karatepig")
f.close()
f = open('player.txt','r')
f.read()
The issue is, how do I store something other than a string to a file? Can I convert it from a string to a value?
Files can only store strings, so you have to convert other values to strings when writing, and convert them back to their original values when reading.
The Python standard library has a whole section dedicated to data persistence that can help make this task easier.
However, for simple types, it is perhaps easiest to use the json module to serialize data to a file and read it back again with ease:
import json

def write_data(data, filename):
    with open(filename, 'w') as outfh:
        json.dump(data, outfh)

def read_data(filename):
    with open(filename, 'r') as infh:
        return json.load(infh)
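A quick usage sketch (the player dictionary and the filename are just illustrative):

# Write a small 'character' record and read it back as real Python values.
write_data({'name': 'Karatepig', 'level': 3, 'hp': 42.5}, 'player.json')
player = read_data('player.json')
print(player['level'] + 1)  # values come back as numbers, not strings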

Update json file

I have a JSON file with some data and would like to occasionally update this file.
I read the file:
with open('index.json', 'rb') as f:
    idx = json.load(f)
then check for the presence of a key from potentially new data, and if the key is not present, update the file:
with open('index.json', mode='a+') as f:
    json.dump(new_data, f, indent=4)
However, this procedure just creates a new JSON object (a Python dict) and appends it as a new object in the output JSON file, making the file invalid JSON.
Is there any simple way to append new data to a JSON file without overwriting the whole file, by updating the initial dict?
One way to do what you're after is to write one JSON object per line in the file. I'm using that approach and it works quite well.
A nice benefit is that you can read the file more efficiently (memory-wise) because you can read one line at a time. If you need all of them, there's no problem with assembling a list in Python, but if you don't you're operating much faster and you can also append.
So to initially write all your objects, you'd do something like this:
with open(json_file_path, "w") as json_file:
for data in data_iterable:
json_file.write("{}\n".format(json.dumps(data)))
Then to read efficiently (will consume little memory, no matter the file size):
with open(json_file_path, "r") as json_file:
for line in json_file:
data = json.loads(line)
process_data(data)
To update/append:
with open(json_file_path, "a") as json_file:
json_file.write("{}\n".format(json.dumps(new_data)))
Hope this helps :)
