Parsing N-Triples Via Streaming - Python

I was fairly confused about this for some time, but I finally learned how to parse a large N-Triples RDF store (.nt) using Raptor and the Redland Python Extensions.
A common example is the following:
import RDF

parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file:./mybigfile.nt")
for triple in model:
    print triple.subject, triple.predicate, triple.object
parse_into_model() by default loads everything into an in-memory model, so if you are parsing a big file you could consider using a HashStorage as the backing store for your model and persisting it that way.
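For example, a minimal sketch using Redland's Berkeley DB hash storage (the storage name, options string, and paths here are illustrative assumptions):
import RDF

# Back the model with an on-disk Berkeley DB hash instead of memory.
# Storage name, options, and directory are made up for illustration.
storage = RDF.HashStorage("mybigfile", options="new='yes',hash-type='bdb',dir='.'")
model = RDF.Model(storage)
parser = RDF.Parser(name="ntriples")
parser.parse_into_model(model, "file:./mybigfile.nt")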
But what if you want to just read the file and, say, add each triple to MongoDB, without loading it into a Model or anything complicated like that?

import RDF

parser = RDF.NTriplesParser()
for triple in parser.parse_as_stream("file:./mybigNTfile.nt"):
    print triple.subject, triple.predicate, triple.object
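And from there it is a short step to MongoDB. A minimal sketch, assuming a local MongoDB server and the pymongo client (the database and collection names are made up):
import RDF
import pymongo

client = pymongo.MongoClient()      # assumes MongoDB on localhost
triples = client["rdf"]["triples"]  # hypothetical database/collection names

parser = RDF.NTriplesParser()
for triple in parser.parse_as_stream("file:./mybigNTfile.nt"):
    # One document per triple; the nodes are stringified.
    triples.insert_one({
        "subject": str(triple.subject),
        "predicate": str(triple.predicate),
        "object": str(triple.object),
    })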

Related

How to use Python to efficiently parse a frequently updated JSON file?

I need to use Python to parse a JSON file that has this format:
{
"seldom_changed_key_0": "frequently_changed_value_0",
"seldom_changed_key_1": "frequently_changed_value_1",
"seldom_changed_key_2": "frequently_changed_value_2",
"seldom_changed_key_3": "frequently_changed_value_3",
....
"seldom_changed_key_n": "frequently_changed_value_n",
}
and this JSON file is huge (greater than 100 MB).
As the format indicates, the content might change frequently but the format (or structure) itself seldom changes.
Each time this JSON file is updated, I have to re-parse it from scratch.
Is there any optimization method that can reduce the parse time?
Any suggestions are appreciated.
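One possible direction (a sketch, not from the original thread): an incremental parser such as the third-party ijson library can stream the top-level key/value pairs instead of materialising the whole dict, so you can start processing values before the file is fully parsed. The filename and callback below are made up.
import ijson  # third-party incremental JSON parser

with open('huge.json', 'rb') as f:
    # kvitems yields the top-level (key, value) pairs one at a time.
    for key, value in ijson.kvitems(f, ''):
        handle(key, value)  # hypothetical per-pair callback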

Reading a large JSON file with Python Dask raises a delimiter error

I am trying to read a JSON file of size ~12 GB, downloaded from the AMiner Citation Network dataset V12. This is a sample, where I removed a couple of fields that make the JSON too long.
[{"id":1091,"authors":[{"name":"Makoto Satoh","org":"Shinshu University","id":2312688602},{"name":"Ryo Muramatsu","org":"Shinshu University","id":2482909946},{"name":"Mizue Kayama","org":"Shinshu University","id":2128134587},{"name":"Kazunori Itoh","org":"Shinshu University","id":2101782692},{"name":"Masami Hashimoto","org":"Shinshu University","id":2114054191},{"name":"Makoto Otani","org":"Shinshu University","id":1989208940},{"name":"Michio Shimizu","org":"Nagano Prefectural College","id":2134989941},{"name":"Masahiko Sugimoto","org":"Takushoku University, Hokkaido Junior College","id":2307479915}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2013,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"fos":[{"name":"Telecommunications network","w":0.45139},{"name":"Computer science","w":0.45245},{"name":"Mind map","w":0.5347},{"name":"Humanâcomputer interaction","w":0.47011},{"name":"Multimedia","w":0.46629},{"name":"Empirical research","w":0.49737},{"name":"Comprehension","w":0.47042},{"name":"Communications protocol","w":0.51907}],"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
,{"id":1388,"authors":[{"name":"Pranava K. Jha","id":2718958994}],"title":"Further Results on Independence in Direct-Product Graphs.","year":2000,"n_citation":1,"page_start":"","page_end":"","doc_type":"Journal","publisher":"","volume":"56","issue":"","doi":"","fos":[{"name":"Graph","w":0.0},{"name":"Discrete mathematics","w":0.45872},{"name":"Combinatorics","w":0.4515},{"name":"Direct product","w":0.59104},{"name":"Mathematics","w":0.42784}],"venue":{"raw":"Ars Combinatoria","id":73158690,"type":"J"}}
,{"id":1674,"authors":[{"name":"G. Beale","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2103626414},{"name":"G. Earl","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2117665592}],"title":"A methodology for the physically accurate visualisation of roman polychrome statuary","year":2011,"n_citation":1,"page_start":"137","page_end":"144","doc_type":"Conference","publisher":"Eurographics Association","volume":"","issue":"","doi":"10.2312/VAST/VAST11/137-144","references":[1535888970,1992876689,1993710814,2035653341,2043970887,2087778487,2094478046,2100468662,2110331535,2120448006,2138624212,2149970020,2150266006,2296384428,2403283736],"fos":[{"name":"Statue","w":0.40216},{"name":"Engineering drawing","w":0.43427},{"name":"Virtual reconstruction","w":0.0},{"name":"Computer science","w":0.42062},{"name":"Visualization","w":0.4595},{"name":"Polychrome","w":0.4474},{"name":"Artificial intelligence","w":0.40496}],"venue":{"raw":"International Conference on Virtual Reality","id":2754954274,"type":"C"}}]
When I try to read the file with Python Dask (I cannot open it like any other file, since it's too big and I get a memory limit error):
import json
import dask.bag as db

if __name__ == '__main__':
    b = db.read_text('dblp.v12.json').map(json.loads)
    print(b.take(4))
I get the following error:
JSONDecodeError: Expecting ',' delimiter
I checked the sample above in an online validator and it passes. So I guess it's not an error in the JSON, but something about Dask and how I should read the file.
It seems like dask is passing each line of your file separately to json.loads. I've removed the line breaks from your sample and was able to load the data.
Of course, doing that for your entire file and sending a single JSON object to json.loads would defeat the purpose of using dask.
One possible solution (though I'm not sure how scalable it is for very large files) is to use jq to convert your JSON file into JSON lines -- by writing each element of your root JSON array into a single line in a file:
jq -c '.[]' your.json > your.jsonl
Then you can load it with dask:
import dask.bag as db
import json
json_file = "your.jsonl"
b = db.read_text(json_file).map(json.loads)
print(b.take(4))

Saving and loading simple data in Python in a convenient way

I'm currently working on a simple game in Python 3.4.3 with Tkinter.
I struggle with saving/reading data, because I'm a beginner at coding.
What I do now is use .txt files to store my data, but I find this extremely counter-intuitive, as saving/reading more than one line of data requires extra code to deal with the newlines.
Skipping a line would be terrible too.
I've googled it, but I only find .txt save/file options or ones far too complex, meant for large-scale data.
I only need to save some strings right now and be able to access them (if possible) by key, like in a dictionary (key:value).
Do you know of any file format/method to help me accomplish that?
Also: if possible, it should work on Win/iOS/Linux.
It sounds like using json would be best for this; it has been part of the Python standard library since Python 2.6.
import json

data = {'username': 'John', 'health': 98, 'weapon': 'warhammer'}

# serialize the data to user-data.txt
with open('user-data.txt', 'w') as fobj:
    json.dump(data, fobj)

# read the data back in
with open('user-data.txt', 'r') as fobj:
    data = json.load(fobj)

print(data)
# outputs:
# {u'username': u'John', u'weapon': u'warhammer', u'health': 98}
A popular alternative is YAML, which is actually a superset of JSON and produces slightly more human-readable results.
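The equivalent with the third-party PyYAML package would look something like this (a sketch; the filename is made up):
import yaml  # PyYAML, third-party

data = {'username': 'John', 'health': 98, 'weapon': 'warhammer'}

# serialize the data to user-data.yaml
with open('user-data.yaml', 'w') as fobj:
    yaml.safe_dump(data, fobj)

# read the data back in
with open('user-data.yaml', 'r') as fobj:
    data = yaml.safe_load(fobj)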
You might want to try Redis.
http://redis.io/
I'm not totally sure it'll meet all your needs, but it would probably be better than a flat file.
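For a flavour of what that looks like from Python (a sketch using the third-party redis-py client; assumes a Redis server running locally):
import redis  # redis-py client, third-party

r = redis.Redis()  # assumes a server on localhost:6379
r.set('username', 'John')
print(r.get('username'))  # b'John' (values come back as bytes in recent clients)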

Python pickle to xml

How can I convert a pickle object to an XML document?
For example, I have a pickle like this:
cpyplusplus_test
Coordinate
p0
(I23
I-11
tp1
Rp2
.
I want to get something like:
<Coordinate>
<x>23</x>
<y>-11</y>
</Coordinate>
The Coordinate class has x and y attributes, of course. I can supply an XML schema for the conversion.
I tried the gnosis.xml module. It can objectify XML documents into Python objects, but it cannot serialize objects to XML documents like the above.
Any suggestions?
Thanks.
gnosis.xml does support pickling to XML:
import gnosis.xml.pickle
xml_str = gnosis.xml.pickle.dumps(obj)
To deserialize the XML, use loads:
o2 = gnosis.xml.pickle.loads(xml_str)
Of course, this will not directly convert existing pickles to XML; you have to first deserialize them into live objects and then dump those to XML.
Having said that, I must warn you that gnosis.xml is quite slow, somewhat fragile, and most likely unmaintained (the last release was over six years ago). It is also very bloated, containing a huge number of subpackages with lots and lots of features that you not only won't need, but that are untested and buggy. We tried to use it for our development and, after a lot of effort wasted on trying to debug and improve it, ended up writing a simple XML pickler of about 500 lines, and never looked back.
First you need to unpickle the data with pickle.load or pickle.loads, then generate the XML snippet. If you have a pickle in the tmpStr variable, simply do this:
import pickle

c = pickle.loads(tmpStr)
print '<Coordinate>\n<x>%d</x>\n<y>%d</y>\n</Coordinate>' % (c.x, c.y)
Writing to file is left as an exercise to the reader.
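If you'd rather not build the XML by string formatting, the standard library's ElementTree can produce the same snippet (a sketch, reusing tmpStr and the Coordinate attributes from the question):
import pickle
import xml.etree.ElementTree as ET

c = pickle.loads(tmpStr)  # tmpStr holds the pickle, as above

coord = ET.Element('Coordinate')
ET.SubElement(coord, 'x').text = str(c.x)
ET.SubElement(coord, 'y').text = str(c.y)
print ET.tostring(coord)  # <Coordinate><x>23</x><y>-11</y></Coordinate>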

Parameter with dictionary path

I am very new to Python and am not very familiar with its data structures.
I am writing an automatic JSON parser in Python; the JSON message is read into a dictionary using Ultra-JSON (ujson):
jsonObjs = ujson.loads(data)
Now, if I try something like:
jsonObjs[param1][0][param2]
it works fine.
However, I need to get the path from an external source (I read it from the DB). We initially thought we would just write this in the DB:
myPath = [param1][0][param2]
and then try to access:
jsonObjs[myPath]
But after a couple of failures I realized I was actually trying to access:
jsonObjs[[param1][0][param2]]
Is there a way to fix this without parsing myPath myself?
Many thanks for your help and advice.
Store the keys in a format that preserves type information, e.g. JSON, and then use reduce() to perform the nested accesses on the structure.
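A sketch of that idea (the sample document and the stored path are made up for illustration):
import json
from functools import reduce  # works on Python 2.6+ and 3

# Stand-in for ujson.loads(data); the shape matches the question.
jsonObjs = {"param1": [{"param2": 42}]}

# Store the path in the DB as a JSON array, so "param1" stays a string
# and 0 stays an int.
myPath = json.loads('["param1", 0, "param2"]')

value = reduce(lambda obj, key: obj[key], myPath, jsonObjs)
print(value)  # 42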
