I have 1000+ JSON files that look like
{
"name": "Some name",
"part_num": "123456",
"other_config": {
// Large amount of objects
},
"some more": {
// Large amount of objects
}
// etc
}
When my program starts up, it has to scan the directory with all these JSON files, load each one, and extract the "name" and "part_num" values and populates a list view with these values. The user then selects one and that config is then re-parsed and the proper actions taken then.
The problem is reading that many files takes a while. I've mitigated it somewhat by using multiprocessing to throw the work on all available cores in the background, then populating the list view once done, but I'm still bounded by IO. Since I know this code will be running on computers with slower processors and hard drives than mine, this speed is not acceptable.
The average case scenario is that the values I need are at the beginning of the file, but I can't assume that in the worst case. Is there a way to iteratively parse a JSON file so that I can load what I need from these files faster?
I could resort to regex, but I'd really prefer not to.
YAJL is an event-driven parser with Python bindings. This way you can stop parsing once you have retrieved the info you need.
This reminded me of XML back in the day, when we had three styles of parsers: DOM-based, event based (e.g., SAX), and lesser known pull-based (e.g., StAX). The latter two were more efficient, because they didn't require loading the entire file into memory when you only need bits of it.
Fortunately, similar things exist for JSON. For Python, take a look at yajl-py, which is a Python wrapper for the C-library Yajl.
It takes a bit more code to parse via an event-based parser, but it can be much more efficient.
There is ijson which performs performs iterative json parsing in a pythonc way.
I guess that solution to your problem would be:
import ijson
f = open('...')
objects = ijson.items(f, 'other_config.item')
for part in objects:
do_something_with(part)
Dislaimer I did not use that library.
Related
I'm building a chatbot database atm. I uses data from pushshift.io. In order to deal with big datafile, (I understand that json loads everything into RAM, so if you only have 16GB RAM and working with 30GB of data, that is a nono), I wrote a bash script that split the big file into smaller chunk of 3GB of file so that I can run it through json.loads (or pd.read_json). The problem whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Thus I take a look into the temp json file that I just created and I see this happens in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
The sample correction of the data looks like this
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I notice that my bash script split the file without paying attention to the JSON objects. So my question is are there ways to write a function in python that can detect JSON objects that are not correctly formatted and deleted it?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental json parsers available in Python. A quick search shows ijson should allow you to traverse your very large data structure without exploding.
You also should consider another data format (or a real database), or you will easily find yourself spending time reimplementing much much slower versions of features that already exist with the right tools.
If you are using the json standard library, then calling json.loads on badly formatted data will return JSONDecodeError. You can put your code in a try-catch statement and check if this exception occurs to make sure you only process correctly formatted data.
I have first file (data.py):
database = {
'school': 2,
'class': 3
}
my second python file (app.py)
import data
del data.database['school']
print(data.database)
>>>{'class': 3}
But in data.py didn't change anything? Why?
And how can I change it from my app.py?
del data.database['school'] modifies the data in memory, but does not modify the source code.
Modifying a source code to manage the persistence of your data is not a good practice IMHO.
You could use a database, a csv file, a json file ...
To elaborate on Gelineau's answer: at runtime, your source code is turned into a machine-usable representation (known as "bytecode") which is loaded into the process memory, then executed. When the del data.database['school'] statement (in it's bytecode form) is executed, it only modifies the in-memory data.database object, not (hopefully!) the source code itself. Actually, your source code is not "the program", it's a blueprint for the runtime process.
What you're looking for is known as data persistance (data that "remembers" it's last known state between executions of the program). There are many solutions to this problem, ranging from the simple "write it to a text or binary file somewhere and re-read it at startup" to full-blown multi-servers database systems. Which solution is appropriate for you depends on your program's needs and constraints, whether you need to handle concurrent access (multiple users / processes editing the data at the same time), etc etc so there's really no one-size-fits-all answers. For the simplest use cases (single user, small datasets etc), json or csv files written to disk or a simple binary key:value file format like anydbm or shelve (both in Python's stdlib) can be enough. As soon as things gets a bit more complex, SQL databases are most often your best bet (no wonder why they are still the industry standard and will remain so for long years).
In all cases, data persistance is not "automagic", you will have to write quite some code to make sure your changes are saved in timely manner.
As what you are trying to achieve is basically related to file operation.
So when you are importing data , it just loads instanc of your file in memory and create a reference from your new file, ie. app.py. So, if you modify it in app.py its just modifiying the instance which is in RAM not in harddrive where your actual file is stored in harddrive.
If you want to change source code of another file "As its not good practice" then you can use file operations.
I have an interesting problem at hand. As someone who's a beginner when it comes to working with data at even a morderate scale, I'd love some tips from the veterans here.
I have around 6000 Json.gz files totalling around 5GB compressed and 20GB uncompressed.
I'm opening each file and reading them line by line using the gzip module; then using json.loads() loading each line and parsing the complicated JSON structure. Then I'm inserting lines from each file into a Pytable all at once before iterating to the next file.
All this is taking me around 3 hours. Bulk inserting into the Pytable didn't really help the speed at all. Much of the time is gone getting values from the parsed JSON line since they have a truly horrible structure. Some are straightforward like 'attrname':attrvalue, but some are complicated and time consuming structures like:
'attrarray':[{'name':abc, 'value':12},{'value':12},{'name':xyz, 'value':12}...]
...where I need to pick up the value of all those objects in the attr array which have some corresponding name, and ignore those that don't. So I need to iterate through the list and inspect each JSON object inside. (I'd be glad if you can point out any quicker clever way, if it exists)
So I suppose the actual parsing part of it doesn't have much scope of speedup. Where I think their might be scope of speedup is the actual reading the file part.
So I ran a few tests (I don't have the numbers with me right now) and even after removing the parsing part of my program; simply going through the files line by line itself was taking a considerable amount of time.
So I ask: Is there any part of this problem that you think I might be doing suboptimally?
for filename in filenamelist:
f = gzip.open(filename):
toInsert=[]
for line in f:
parsedline = json.loads(line)
attr1 = parsedline['attr1']
attr2 = parsedline['attr2']
.
.
.
attr10 = parsedline['attr10']
arr = parsedline['attrarray']
for el in arr:
try:
if el['name'] == 'abc':
attrABC = el['value']
elif el['name'] == 'xyz':
attrXYZ = el['value']
.
.
.
except KeyError:
pass
toInsert.append([attr1,attr2,...,attr10,attrABC,attrXYZ...])
table.append(toInsert)
One clear piece of "low-hanging fruit"
If you're going to be accessing the same compressed files over and over (it's not especially clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on-the-fly each time you read them.
Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.
Likely the fastest approach is to gunzip all these files, save the results somewhere, and then read from the uncompressed files in your script.
Other issues
The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:
What are you trying to do with all this data?
Do you really need to load all of it at once?
If you can segment the data into smaller pieces, then you can reduce the latency of the program if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Only load those specific lines.
If you do need to access the data in arbitrary and unpredictable ways, then you should load it into another system (RDBMS?) which stores it in a format that is more amenable to the kinds of analyses you're doing with it.
If the last bullet point is true, one option is to load each JSON "document" into a PostgreSQL 9.3 database (the JSON support is awesome and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as you load them.
I have a ridiculously simple python script that uses the arcpy module. I turned it into a script tool in arcmap and am running it that way. It works just fine, I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and currently it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second number after the comma in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "WITH"/"AS" keywords (PEP 343, Raymond Hettinger has a good video on youtube, too) or iterators in general (see PEPs 234, 255), which only load a record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing between i/o requests. iPython is an add-on for python that has a pretty easy to use, high-level pacakge, "parallel", if you want an easy place to start. Lots of pretty good videos on youtube from PyCon 2012. There's a 3 hour one where the parallel stuff starts at 2:13:00 or so.
trying to keep it stupidly simple. Is it a bad idea to move a txt file into and out of a python list? the txt files will probably get to about 2-5k entries. what is the preferred method to create a simple flat file databse?
It might or might not be a bad idea. It depends on what you are trying to achieve, how much memory you have and how big those lines are on average. It also depends on what you are doing with that data. Maybe it is worth it to read and process file line by line? In any case, database assumes indexes, what are you going to do with a list of strings without an index? You cannot search it efficiently, for example.
In any case, if you feel like you need a database, take a look at SQLite. It is a small embedded SQL server written in C with Python interface. It is stable and proven to work. For example, it is being used on iPhone in tons of applications.
If you're looking for a very simple file database, maybe you should look at the shelve module. Example usage:
import shelve
with shelve.open("myfile") as mydb:
mydb["0"] = "first value"
mydb["1"] = "second value"
# ...