Recently, I had to json.load() a file. A 13,045KB file from a network location. Over a slow corporate VPN. I did it in an interactive shell and decided to go grab a coffee in the meantime, because this would surely take ages to load. I didn't even manage to stand up before my code was done reading the JSON and translating its half a million lines into a beautiful dictionary.
How does it happen that Python is able to do it so efficiently? What optimizations are used? Why does it take a few minutes for Sublime to show this json as a text?
>>> import json
>>> import os
>>> f_path = r"Z:\some\network\location\data.json"
>>> os.stat(f_path).st_size
13357297
>>> with measurement_tools.SimpleBenchmark():
...     with open(f_path) as f:
...         t_dict = json.load(f)
...
Took 0.203125 seconds.
Edit: SimpleBenchmark() is just my own context manager measuring tool doing basically t1-t0.
Edit 2: since there were questions about why I consider it fast, I compared it to serializing the same dictionary to JSON in the same location, which took ca. 10 times as long (well over 2 secs).
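The SimpleBenchmark helper is not shown in the question. A minimal sketch of what such a t1-t0 context manager might look like follows; the class body here is an assumption for illustration, not the author's actual code, and the timed workload is just a placeholder.

```python
import time


class SimpleBenchmark:
    """Hypothetical stand-in for the author's helper: times the with-block."""

    def __enter__(self):
        self.t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # elapsed = t1 - t0, measured with a monotonic high-resolution clock
        self.elapsed = time.perf_counter() - self.t0
        print("Took {:.6f} seconds.".format(self.elapsed))
        return False  # do not swallow exceptions raised inside the block


# Placeholder workload; in the question this would be the json.load() call.
with SimpleBenchmark() as bench:
    total = sum(range(1000000))
```

Using time.perf_counter() rather than time.time() avoids jumps when the system clock is adjusted during the measurement.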
I am trying to write a big list of numpy nd_arrays to disk.
The list is ~50000 elements long
Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.
The method I am (currently) using is
@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle, which gives the same problem:
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, know if the write succeeded, and, if possible, know when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.
For the code
import numpy as np
import yaml
x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13, the run takes 61 minutes to finish. During the save the Python process occupied around 5 GBytes of memory, and the final file size is 2 GBytes. This also shows the overhead of the file format: the raw data is 50k * 2048 * 2 * 8 bytes (np.random.rand produces 64-bit floats) = 1562 MBytes, so the YAML file is around 1.3 times larger (and serialisation/deserialisation also takes time).
To answer your questions:
1. There is no correct or incorrect way. A progress update and an estimate of the finishing time are not easy to provide (e.g. other tasks might interfere with the estimate, resources like memory could be used up, etc.). You can rely on a library that supports that or implement something yourself (as the other answer suggested).
2. I'm not sure "debug" is the correct term, as in practice it might be that the process is just slow. Doing a performance analysis is not easy, especially when using multiple different libraries. I would start with clear requirements: what do you want from the saved file? Does it need to be YAML? Saving 50k arrays as YAML does not seem the best solution if you care about performance. You should first ask yourself "which is the best format for what I want?" (but you did not give details, so I can't say...).
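As a rough illustration of "implement something yourself": one option is to serialise the items one at a time, so the loop can report how far it has got and extrapolate a finishing time. This is only a sketch under assumptions: the function names dump_with_progress and load_all, the count header at the start of the file, and the temporary demo path are all made up for this example, and pickle is swapped in for YAML.

```python
import os
import pickle
import tempfile
import time


def dump_with_progress(items, path, report_every=1000):
    """Pickle items one by one into a single file, printing a rough ETA."""
    total = len(items)
    start = time.perf_counter()
    with open(path, "wb") as outfile:
        pickle.dump(total, outfile)  # header: number of records that follow
        for done, item in enumerate(items, start=1):
            pickle.dump(item, outfile, protocol=pickle.HIGHEST_PROTOCOL)
            if done % report_every == 0 or done == total:
                elapsed = time.perf_counter() - start
                # Linear extrapolation from the average time per item so far.
                remaining = elapsed / done * (total - done)
                print("{}/{} written, about {:.1f}s left".format(
                    done, total, remaining))


def load_all(path):
    """Read back the records written by dump_with_progress."""
    with open(path, "rb") as infile:
        total = pickle.load(infile)
        return [pickle.load(infile) for _ in range(total)]


# Small demo with plain lists standing in for the arrays.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
records = [[i, i * 2] for i in range(7)]
dump_with_progress(records, demo_path, report_every=3)
restored = load_all(demo_path)
```

If the process is interrupted, the header tells a reader how many records were expected, so a short file can be detected when loading fails partway through.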
Edit: if you just want something fast, use pickle. The code:
import pickle
import numpy as np
x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
# Binary pickle output, so the file gets a .pkl name rather than .yaml
with open("t.pkl", "wb") as outfile:
    pickle.dump(x, outfile)
finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...
I can't say this is the answer, but it may be.
When I was working on an app that required fast cycles, I found that something in the code was very slow: opening and closing YAML files.
It was solved by using JSON.
Don't use YAML for anything other than configuration that you don't open often.
Solution to your array saving:
np.save(path,array) # path = path+name+'.npy'
If you really need to save a list of arrays, I recommend saving a list of array paths (the arrays themselves are saved to disk with np.save). Saving Python objects to disk is not really what you want; what you want is to save NumPy arrays with np.save.
Complete solution (saving example):
for array_index, array in enumerate(list_of_arrays):
    # path = str(array_index) + '.npy'; the index must be converted to str first
    np.save(str(array_index) + '.npy', array)
Complete solution (loading example):
list_of_array_paths = ['0.npy', '1.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
    list_of_arrays.append(np.load(array_path))
Further advice:
Holding many large arrays in a Python list at once is costly in both speed and memory. Work with one or two arrays at a time and leave the rest on disk: instead of keeping an object reference, keep a path and load the array from disk when it is needed.
Also, you said you don't want to assemble the list manually.
A possible solution, which I don't advise, but which may be exactly what you are looking for:
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
I believe the code above needs no comment. However, after loading it back you lose the list, as it will have been converted into a NumPy array.
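If the goal is one file holding several differently shaped arrays, np.savez is another option. This is a different technique from the np.save call above, shown only as a sketch: arrays passed positionally are stored under the keys arr_0, arr_1, and so on, and can be read back into a plain list. The temporary directory and file name are placeholders.

```python
import os
import tempfile

import numpy as np

# Two arrays with different shapes, as in the session above.
arrays = [np.zeros((10, 5, 3)), np.ones((7, 7, 9))]

path = os.path.join(tempfile.mkdtemp(), "data.npz")

# Positional arrays are stored under the keys arr_0, arr_1, ...
np.savez(path, *arrays)

# Rebuild an ordinary Python list in the original order.
with np.load(path) as data:
    restored = [data["arr_{}".format(i)] for i in range(len(data.files))]
```

Each array keeps its own shape and dtype, and no object pickling is involved, so the result is still a list rather than an object array.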
This tutorial https://www.dataquest.io/blog/python-json-tutorial/ has a 600MB file that they work with, however when I run their code
import ijson
filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)
I'm running into 10+ minutes of waiting for the file to be read into ijson and I'm really confused how this is supposed to be reasonable. Shouldn't there be parsing? Am I missing something?
The main problem is not that you are creating a list after parsing (that only collects the individual results into a single structure), but that you are using the default pure-Python backend provided by ijson.
There are other backends that are much faster. ijson's homepage explains how to import them. The yajl2_cffi backend is the fastest currently available, but I've created a new yajl2_c backend (there's a pull request pending acceptance) that performs even better.
On my laptop (Intel(R) Core(TM) i7-5600U) your code runs in ~1.5 minutes using the yajl2_cffi backend. Using the yajl2_c backend it runs in ~10.5 seconds (Python 3) and ~15 seconds (Python 2.7.12).
Edit: @lex-scarisbrick is of course also right in that you can quickly break out of the loop if you are only interested in the column names.
This looks like a direct copy/paste of the tutorial found here:
https://www.dataquest.io/blog/python-json-tutorial/
The reason it's taking so long is the list() around the output of the ijson.items function. This effectively forces parsing of the entire file before returning any results. Taking advantage of the ijson.items being a generator, the first result can be returned almost immediately:
import ijson
filename = "md_traffic.json"
with open(filename, 'r') as f:
    for item in ijson.items(f, 'meta.view.columns.item'):
        print(item)
        break
EDIT: The very next step in the tutorial is print(columns[0]), which is why I included printing the first item in the answer. Also, it's not clear whether the question was for Python 2 or 3, so the answer uses syntax that works in both, albeit inelegantly.
I tried running your code and I killed the program after 25 minutes. So yes, 10 minutes is reasonably fast.
I've a json file data_large of size 150.1MB. The content inside the file is of type [{"score": 68},{"score": 78}]. I need to find the list of unique scores from each item.
This is what I'm doing:-
import ijson # since json file is large, hence making use of ijson
f = open ('data_large')
content = ijson.items(f, 'item') # json loads quickly here as compared to when json.load(f) is used.
print set(i['score'] for i in content) #this line is actually taking a long time to get processed.
Can I make the print set(i['score'] for i in content) line more efficient? It currently takes 201 seconds to execute.
This will give you the set of unique score values (only) as ints. You'll need 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).
import re
import time
t = time.time()
# Raw string so that \d reaches the regex engine unchanged
obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))
s = set(map(int, s))
print(time.time() - t)
Using re.findall() also seems to be about three times faster than the json parser, and it consumes about 260 MB:
import re
obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))
I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.
The advantage of using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but would be a very big deal if your data file grew to be gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high performance server or workstation with lots and lots of RAM available, there may not be any reason to hold back.
So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.
On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?
#!/usr/local/cpython-2.7/bin/python
from __future__ import print_function
import json  # plain json rather than ijson: the whole file is loaded at once
with open('data_large', 'r') as file_:
    content = json.load(file_)
print(set(element['score'] for element in content))
Try using a set
set([x['score'] for x in scores])
For example
>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
This is my first post here on Stack Overflow, and I am still just learning Python and programming in general. I'm working on some simple game logic, and I'm getting a little stuck on how Python handles file input/output.
What I'm trying to do is, while my game is running, store a series of variables (all numeric, integer data), and when the game is over, dump that information to txt file that can later be read (again, as numeric, integer data) so that it can be added to. A tracker, really.
Perhaps if you were playing some racing game, for example, every time you hit a pedestrian, pedestrians += 1. Then when your game is over, after hitting like 23 pedestrians, that number (along with any other variables I wished to track) is saved to a text file. When you start the game again, it loads the number 23 back into the pedestrians variable, so if you hit 30 more this time you end up with 53 total, and so on. Thanks in advance!
Does it have to be text? I'd use pickle if not
http://docs.python.org/library/pickle.html
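A minimal sketch of the pickle suggestion, assuming the counters live in a dictionary. The helper names load_stats and save_stats, the file name, and the temporary directory are made up for illustration; a real game would use a fixed save location.

```python
import os
import pickle
import tempfile


def load_stats(path):
    """Return the saved counters, or fresh ones if no save file exists yet."""
    if not os.path.exists(path):
        return {"pedestrians": 0}
    with open(path, "rb") as infile:
        return pickle.load(infile)


def save_stats(path, stats):
    """Write the counters to disk, replacing any previous save."""
    with open(path, "wb") as outfile:
        pickle.dump(stats, outfile)


save_path = os.path.join(tempfile.mkdtemp(), "game_stats.pkl")

# First game: 23 pedestrians.
stats = load_stats(save_path)
stats["pedestrians"] += 23
save_stats(save_path, stats)

# Second game: the old total is loaded and 30 more are added.
stats = load_stats(save_path)
stats["pedestrians"] += 30
save_stats(save_path, stats)
```

The values come back as integers, so no parsing of text is needed; the file itself is not human-readable.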
There are quite a few ways to do this. Do you want the file to be human-readable or human-writable? (Could encourage cheating if you do.)
The simplest thing that you could do which would work is to use the ConfigParser library, which stores simple data like what you described in a text file. Something like:
Reading:
import ConfigParser
config = ConfigParser.ConfigParser()
config.readfp(open('game_data.dat'))
dead_pedestrians = config.getint('JoeUser', 'dead_pedestrians')
Writing:
config = ConfigParser.RawConfigParser()
config.add_section('JoeUser')
config.set('JoeUser', 'dead_pedestrians', '15')
with open('game_data.dat', 'wb') as configfile:
    config.write(configfile)
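The snippets above use the Python 2 module name. In Python 3 the module is called configparser; a minimal sketch of the same read, update, and write cycle might look like this. The helper names read_count and write_count and the temporary file path are assumptions made for the example.

```python
import configparser
import os
import tempfile


def read_count(path, user, key):
    """Return the stored integer, or 0 if the file, section or key is missing."""
    config = configparser.ConfigParser()
    config.read(path)  # a missing file is silently skipped
    return config.getint(user, key, fallback=0)


def write_count(path, user, key, value):
    """Store the integer as text, keeping any other saved sections."""
    config = configparser.ConfigParser()
    config.read(path)
    if not config.has_section(user):
        config.add_section(user)
    config.set(user, key, str(value))  # values must be strings
    with open(path, "w") as configfile:  # text mode in Python 3
        config.write(configfile)


data_path = os.path.join(tempfile.mkdtemp(), "game_data.dat")

# First game adds 23, second game adds 30 on top of the saved total.
for hits in (23, 30):
    total = read_count(data_path, "JoeUser", "dead_pedestrians") + hits
    write_count(data_path, "JoeUser", "dead_pedestrians", total)

final_total = read_count(data_path, "JoeUser", "dead_pedestrians")
```

The resulting file is plain text, so it can be read and edited by hand.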
Other options: If you don't want it to be human-readable, you could use shelve (but a clever user who knows you're using Python would find it trivial to read).
Hope that helps!
I have some code that will need to write about 20 bytes of data every 10 seconds.
I'm on Windows 7 using python 2.7
You guys recommend any 'least strain to the os/hard drive' way to do this?
I was thinking about opening and closing the same file every 10 seconds:
f = open('log_file.txt', 'w')
f.write(information)
f.close()
Or should I keep it open and just flush() the data and not close it as often?
What about SQLite? Will it improve performance and be less intensive than the open and close file operations?
(Isn't it just a flat-file database, so equivalent to a text file anyway...?)
What about MySQL (this uses a local server/process; I'm not sure of the specifics of when/how it saves data to the HDD)?
I'm just worried about not frying my hard drive and about improving the performance of this logging procedure. I will be receiving new log information about every 10 seconds, and this will be going on 24/7.
Your advice?
I.e., think about programs like uTorrent that need to save large amounts of data constantly for long periods of time (my log file is significantly less data than what such "downloader type programs" write).
import random
import time
def get_data():
    letters = 'isn\'t the code obvious'
    data = ''
    for i in xrange(20):
        data += random.choice(letters)
    return data
while True:
    f = open('log_file.txt', 'w')
    f.write(get_data())
    f.close()
    time.sleep(10)
My CPU starts whining after about 15 seconds... (or is that my hdd? )
As expected, Python comes with a great tool for this: have a look at the logging module.
Use the logging framework. This is exactly what it is designed to do.
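A minimal sketch of what that could look like with the standard logging module. The logger name "tracker", the message text, and the temporary log path are placeholders chosen for this example; a long-running program would log once per reading instead of looping over a fixed tuple.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "log_file.txt")

logger = logging.getLogger("tracker")  # hypothetical logger name
logger.setLevel(logging.INFO)

# FileHandler opens the file in append mode and keeps it open between writes.
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)

# Stand-in for the readings that arrive every 10 seconds.
for reading in ("first sample", "second sample"):
    logger.info(reading)

# Close the handler when the program shuts down.
handler.close()
logger.removeHandler(handler)

with open(log_path) as f:
    lines = f.read().splitlines()
```

The handler appends each record and flushes it after writing, so earlier entries are kept rather than overwritten as they would be with open(..., 'w').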
Edit: Balls, beaten to it :).
Don't worry about "frying" your hard drive - 20 bytes every 10 seconds is a small fraction of the data written to the disk in the normal operation of the OS.