I have an interesting problem at hand. As someone who's a beginner when it comes to working with data at even a moderate scale, I'd love some tips from the veterans here.
I have around 6,000 JSON.gz files totalling around 5 GB compressed and 20 GB uncompressed.
I'm opening each file and reading it line by line with the gzip module, then parsing each line with json.loads() and pulling values out of the complicated JSON structure. Finally I insert all the rows from a file into a PyTables table at once before moving on to the next file.
All of this takes around 3 hours. Bulk inserting into the PyTables table didn't really help the speed at all. Much of the time is spent getting values out of the parsed JSON line, since the structure is truly horrible. Some attributes are straightforward, like 'attrname': attrvalue, but some are complicated and time-consuming structures like:
'attrarray': [{'name': 'abc', 'value': 12}, {'value': 12}, {'name': 'xyz', 'value': 12}, ...]
...where I need to pick up the value of every object in the attr array that has a particular name, and ignore those that don't. So I need to iterate through the list and inspect each JSON object inside. (I'd be glad if you could point out a quicker, cleverer way, if one exists.)
So I suppose the actual parsing part doesn't have much scope for speedup. Where I think there might be scope for speedup is the actual reading of the files.
So I ran a few tests (I don't have the numbers with me right now), and even after removing the parsing from my program, simply going through the files line by line was itself taking a considerable amount of time.
So I ask: Is there any part of this problem that you think I might be doing suboptimally?
import gzip
import json

for filename in filenamelist:
    with gzip.open(filename, 'rt') as f:  # text mode so each line is a str
        toInsert = []
        for line in f:
            parsedline = json.loads(line)
            attr1 = parsedline['attr1']
            attr2 = parsedline['attr2']
            # ...
            attr10 = parsedline['attr10']
            arr = parsedline['attrarray']
            for el in arr:
                try:
                    if el['name'] == 'abc':
                        attrABC = el['value']
                    elif el['name'] == 'xyz':
                        attrXYZ = el['value']
                    # ...
                except KeyError:
                    pass
            toInsert.append([attr1, attr2, ..., attr10, attrABC, attrXYZ, ...])
        table.append(toInsert)
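For the attrarray lookup, one possibly quicker approach than a chain of if/elif tests is to build a name-to-value mapping once per line and then do plain dict lookups; a minimal sketch, where 'abc' and 'xyz' stand in for the names of interest as above:

lookup = {el['name']: el['value'] for el in parsedline['attrarray'] if 'name' in el}
attrABC = lookup.get('abc')  # None if no element carried that name
attrXYZ = lookup.get('xyz')

This still walks the list once per line, so the complexity is the same, but it avoids the repeated key tests and the KeyError handling.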
One clear piece of "low-hanging fruit"
If you're going to be accessing the same compressed files over and over (it's not especially clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on-the-fly each time you read them.
Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.
Likely the fastest approach is to gunzip all these files, save the results somewhere, and then read from the uncompressed files in your script.
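For example, a one-time decompression pass could look like the sketch below. It assumes the external gunzip utility is on your PATH and supports the -k/--keep flag, and that the files sit in a data/ directory; both are assumptions, so adjust to your layout.

import subprocess
from pathlib import Path

# One-time pass: decompress every .json.gz file alongside the original.
# Assumes `gunzip` is installed and supports -k (keep the .gz file).
for gz_path in Path('data').glob('*.json.gz'):
    subprocess.run(['gunzip', '-k', str(gz_path)], check=True)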
Other issues
The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:
What are you trying to do with all this data?
Do you really need to load all of it at once?
If you can segment the data into smaller pieces, then you can reduce the latency of the program if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Only load those specific lines.
If you do need to access the data in arbitrary and unpredictable ways, then you should load it into another system (RDBMS?) which stores it in a format that is more amenable to the kinds of analyses you're doing with it.
If that last point is true, one option is to load each JSON "document" into a PostgreSQL 9.3 database (the JSON support is awesome and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as you load them.
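As a rough illustration of that route, here is a minimal sketch of loading raw JSON lines into a Postgres table with psycopg2; the connection string, table name, and file name are placeholders rather than anything from the original post.

import gzip
import psycopg2

conn = psycopg2.connect('dbname=mydb')  # placeholder connection string
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS raw_docs (doc json)')

with gzip.open('somefile.json.gz', 'rt') as f:
    for line in f:
        cur.execute('INSERT INTO raw_docs (doc) VALUES (%s)', (line,))
conn.commit()

# Later you can pull fields straight out of the JSON in SQL, e.g.
# SELECT doc->>'attr1' FROM raw_docs;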
Related
I'm building a chatbot database at the moment, using data from pushshift.io. To deal with the big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that's a no-go), I wrote a bash script that splits the big file into smaller 3GB chunks that I can run through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I see this in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formed record looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script split the file without paying attention to the JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental json parsers available in Python. A quick search shows ijson should allow you to traverse your very large data structure without exploding.
You should also consider another data format (or a real database), or you will easily find yourself spending time reimplementing much, much slower versions of features that already exist in the right tools.
If you are using the json standard library, calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap the call in a try/except block and catch this exception to make sure you only process correctly formatted data.
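A minimal sketch of that idea, assuming one JSON object per line (as in the pushshift dumps) and a hypothetical chunk file name; on Python versions before 3.5, catch ValueError instead, which JSONDecodeError subclasses.

import json

good_records = []
with open('chunk_aa.json') as f:  # hypothetical chunk produced by the split
    for line in f:
        try:
            good_records.append(json.loads(line))
        except json.JSONDecodeError:
            # The split cut this line mid-object; skip it.
            pass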
I have 1000+ JSON files that look like
{
    "name": "Some name",
    "part_num": "123456",
    "other_config": {
        // Large amount of objects
    },
    "some more": {
        // Large amount of objects
    }
    // etc
}
When my program starts up, it has to scan the directory with all these JSON files, load each one, extract the "name" and "part_num" values, and populate a list view with them. The user then selects one, and that config is re-parsed and the proper actions are taken.
The problem is that reading that many files takes a while. I've mitigated it somewhat by using multiprocessing to throw the work onto all available cores in the background, then populating the list view once done, but I'm still I/O-bound. Since I know this code will run on computers with slower processors and hard drives than mine, this speed is not acceptable.
The average case scenario is that the values I need are at the beginning of the file, but I can't assume that in the worst case. Is there a way to iteratively parse a JSON file so that I can load what I need from these files faster?
I could resort to regex, but I'd really prefer not to.
YAJL is an event-driven parser with Python bindings. This way you can stop parsing once you have retrieved the info you need.
This reminded me of XML back in the day, when we had three styles of parsers: DOM-based, event-based (e.g., SAX), and the lesser-known pull-based (e.g., StAX). The latter two were more efficient because they didn't require loading the entire file into memory when you only need bits of it.
Fortunately, similar things exist for JSON. For Python, take a look at yajl-py, which is a Python wrapper for the C-library Yajl.
It takes a bit more code to parse via an event-based parser, but it can be much more efficient.
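To illustrate the stop-early idea, here is a minimal sketch using ijson's low-level event API instead of yajl-py (ijson is the library shown in the next answer, and it can use Yajl as a backend); it pulls just the two top-level keys and then stops parsing.

import ijson

def name_and_part_num(path):
    # Walk the parse events and bail out as soon as both values are seen.
    found = {}
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if prefix in ('name', 'part_num') and event == 'string':
                found[prefix] = value
                if len(found) == 2:
                    break
    return found.get('name'), found.get('part_num')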
There is ijson, which performs iterative JSON parsing in a Pythonic way.
I guess that a solution to your problem would look like:
import ijson

with open('...', 'rb') as f:  # ijson expects a byte stream; path left as a placeholder
    objects = ijson.items(f, 'other_config.item')
    for part in objects:
        do_something_with(part)
Disclaimer: I have not used that library.
Baseline - I have CSV data with 10,000 entries. I save it as one CSV file and load it all at once.
Alternative - I have CSV data with 10,000 entries. I save it as 10,000 CSV files and load each one individually.
Approximately how much more inefficient is this, computationally? I'm not hugely interested in memory concerns. The reason for the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (the whole database will be a single file), then add indexes as necessary.
To access your data, use the Python sqlite3 library. You can use this tutorial to learn how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally, certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
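A minimal sketch of that workflow with the standard library's sqlite3 and csv modules; the file names, table name, and column names are made up for illustration.

import csv
import sqlite3

conn = sqlite3.connect('data.db')  # the whole database lives in this one file
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS entries (key TEXT, value REAL)')

with open('data.csv', newline='') as f:  # hypothetical source CSV with two columns
    cur.executemany('INSERT INTO entries VALUES (?, ?)', csv.reader(f))
cur.execute('CREATE INDEX IF NOT EXISTS idx_key ON entries (key)')
conn.commit()

# Later runs read only the subset they need instead of the whole file.
rows = cur.execute('SELECT value FROM entries WHERE key = ?', ('some_key',)).fetchall()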
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length, say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files will be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing them will require many more seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there is still the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file the OS can speed things up for you by doing read-ahead buffering, as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon. With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file with spaces so that they all have the same length. For example, make sure when writing out the CSV file that every line is exactly 80 bytes long. Then reading the nth line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and currently it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second number after the comma in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the with/as keywords (PEP 343; Raymond Hettinger has a good video on YouTube, too) or iterators in general (see PEPs 234 and 255), which only load one record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competition between I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a 3-hour one where the parallel material starts at around 2:13:00.
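As a sketch of the record-at-a-time idea outside arcpy, something like the following streams the table through a plain Python generator. It assumes the third-party dbfread library, that back_pres_dist is stored as text, and it writes the split values to a CSV rather than back into the .dbf, so treat it as a starting point only.

import csv
from dbfread import DBF  # third-party reader for .dbf files

def split_rows(dbf_path):
    # Yield one (back, dist) pair at a time; nothing is held in memory.
    for record in DBF(dbf_path):
        parts = record['back_pres_dist'].split(',')
        yield parts[1], parts[3]

with open('back_dist.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['back', 'dist'])
    writer.writerows(split_rows('input_table.dbf'))  # hypothetical file name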
I have a 100MB file with roughly 10 million lines that I need to parse into a dictionary every time I run my code. This process is incredibly slow, and I am hunting for ways to speed it up. One thought that came to mind is to parse the file once and then use pickle to save the dictionary to disk. I'm not sure whether this would result in a speedup.
Any suggestions appreciated.
EDIT:
After doing some testing, I suspect that the slowdown happens when I create the dictionary. Pickling does seem significantly faster, though I wouldn't mind doing better.
Lalit
In my experience, MessagePack has been much faster than cPickle for dumping and loading data in Python, even when using the highest pickle protocol.
However, if you have a dictionary with 10 million entries in it, you might want to check that you're not hitting the upper limit of your computer's memory. The process will be much slower if you run out of memory and have to use swap.
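A minimal sketch of the MessagePack round trip with the msgpack-python package (the file name and dictionary are just examples); the pickle equivalent would use pickle.dump and pickle.load with protocol=pickle.HIGHEST_PROTOCOL.

import msgpack

data = {'key1': [1, 2, 3], 'key2': 'value'}  # stand-in for the parsed dictionary

# Dump once after the expensive parse...
with open('cache.msgpack', 'wb') as f:
    msgpack.pack(data, f)

# ...and load it cheaply on later runs.
with open('cache.msgpack', 'rb') as f:
    data = msgpack.unpack(f)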
Depending on how you use the data, you could
divide it into many smaller files and load only what's needed
create an index into the file and lazy-load (see the sketch after this list)
store it to a database and then query the database
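As a sketch of the index-and-lazy-load option, one common trick is to record the byte offset of each line once, save that small index, and then seek straight to the lines you need on later runs; the file names here are illustrative.

import pickle

def build_offset_index(path):
    # Record where each line starts so later runs can seek directly to it.
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, n):
    with open(path, 'rb') as f:
        f.seek(offsets[n])
        return f.readline()

offsets = build_offset_index('big_data.txt')  # do this once, then save the index
with open('offsets.pkl', 'wb') as f:
    pickle.dump(offsets, f)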
Can you give us a better idea of just what your data looks like (its structure)?
How are you using the data? Do you actually use every row on every execution? If you only use a subset on each run, could the data be pre-sorted?