What is the fastest way to import to Neo4j? - python

I have a list of JSON documents, in the format:
[{a:1, b:[2,5,6]}, {a:2, b:[1,3,5]}, ...]
What I need to do is make nodes with parameter a, and connect each node to all the nodes whose a value appears in its list b. So the first node will connect to nodes 2, 5 and 6. Right now I'm using Python's neo4jrestclient to populate the database, but it's taking a long time. Is there a faster way?
Currently this is my script:
break_list = []
for each in ans[1:]:
    ref = each[0]
    q = """MATCH n WHERE n.url = '%s' RETURN n;""" % (ref)
    n1 = gdb.query(q, returns=client.Node)[0][0]
    for link in each[6]:
        if len(link) > 4:
            text, link = link.split('!__!')
            q2 = """MATCH n WHERE n.url = '%s' RETURN n;""" % (link)
            try:
                n2 = gdb.query(q2, returns=client.Node)
                n1.relationships.create("Links", n2[0][0], anchor_text=text)
            except:
                break_list.append((ref, link))

You might want to consider converting your JSON to CSV (using something like jq), then you could use the LOAD CSV Cypher tool for the import. LOAD CSV is optimized for data import, so you will get much better performance with this method. With your example, the LOAD CSV script would look something like this:
Your JSON converted to CSV:
"a","b"
"1","2,5,6"
"2","1,3,5"
First, create a uniqueness constraint / index. This ensures that only one node is created for any "name" and creates an index for faster lookup performance.
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;
Given the above CSV file this Cypher script can be used to efficiently import data:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS row
MERGE (a:Person{name: row.a})
WITH a,row
UNWIND split(row.b,',') AS other
MERGE (b:Person {name:other})
CREATE UNIQUE (a)-[:CONNECTED_TO]->(b);
Other option
Another option is to use the JSON as a parameter in a Cypher query and then iterate through each element of the JSON array using UNWIND.
WITH {d} AS json
UNWIND json AS doc
MERGE (a:Person{name: doc.a})
WITH doc, a
UNWIND doc.b AS other
MERGE (b:Person{name:other})
CREATE UNIQUE (a)-[:CONNECTED_TO]->(b);
There might be some performance issues with a very large JSON array, though; see some examples of this here and here.
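For the parameter approach, a minimal sketch of passing the array from Python with the official Neo4j driver could look like the following (connection details are placeholders; recent Neo4j versions write parameters as $d rather than {d}, and MERGE is used for the relationship here since CREATE UNIQUE has been removed in newer releases):
from neo4j import GraphDatabase

# placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
UNWIND $d AS doc
MERGE (a:Person {name: doc.a})
WITH doc, a
UNWIND doc.b AS other
MERGE (b:Person {name: other})
MERGE (a)-[:CONNECTED_TO]->(b)
"""

docs = [{"a": 1, "b": [2, 5, 6]}, {"a": 2, "b": [1, 3, 5]}]

with driver.session() as session:
    session.run(query, d=docs)

driver.close()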

Related

Python: How do I efficiently nest 4 list of dictionaries into one?

I have an MSSQL stored procedure which returns 4 selections to me: Entities, Certificates, Contacts and Logs. I need to combine these 4 selections in Python, where I put all Entities, Contacts and Logs under their Certificate. Each of these selections has an EntityId I can use for the merge.
The inputs are lists of simple, basic dataclasses containing the information from SQL. We convert these dataclasses into dictionaries inside the merging function.
When I originally wrote the code, I had no idea that the selections could be very large (hundreds of thousands of Certificates with all their other records). Unfortunately this made the code below very inefficient, due to the many unnecessary iterations of the list comprehensions inside the loop. It can take up to 70 seconds. I am sure there is a way to make this much faster. How do I improve the performance to be as efficient as possible?
from dataclasses import asdict
from typing import List

def cert_and_details(entities: List[Entity],
                     certificates: List[Certificate],
                     req_logs: List[DocumentRequestHistory],
                     recipients: List[Recipient]) -> List[dict]:
    entities = [asdict(ent) for ent in entities]
    certificates = [asdict(cert) for cert in certificates]
    req_logs = [asdict(log) for log in req_logs]
    recipients = [asdict(rec) for rec in recipients]

    results = []
    for cert_dict in certificates:
        cert_entity_id = cert_dict["entityid"]

        logs_under_cert = [log for log in req_logs if log["entityid"] == cert_entity_id]
        cert_dict["logs"] = logs_under_cert

        entities_under_cert = [ent for ent in entities if ent["entityid"] == cert_entity_id]
        cert_dict["linkedentity"] = entities_under_cert

        recipients_under_cert = [rec for rec in recipients if rec["entityid"] == cert_entity_id]
        cert_dict["recipients"] = recipients_under_cert

        results.append(cert_dict)
    return results
The main issue with the provided code is its computational complexity: it runs in O(C * (L + E + R)) where C is the number of certificates, L
the number of logs, E the number of entities and R the number of recipients. This is fine if L+E+R is small, but if this is not the case, then the code will be slow.
You can write an implementation running in O(C + L + E + R) time. The idea is to build an index first to group logs/entities/recipients by entity ID. Here is a short example:
# Note: defaultdict should help to make this code smaller (and possibly faster)
logIndex = dict()
for log in req_logs:
    entityId = log["entityid"]
    if entityId in logIndex:
        logIndex[entityId].append(log)
    else:
        logIndex[entityId] = [log]
This code runs in (amortized) O(L). You can then retrieve all the items in req_log with a given entity ID using just logIndex[entityId].
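For reference, here is what that snippet looks like with collections.defaultdict, as the comment suggests (just a sketch; req_logs is assumed to already hold the asdict-converted dictionaries from the question):
from collections import defaultdict

log_index = defaultdict(list)
for log in req_logs:
    log_index[log["entityid"]].append(log)

# all logs for a given certificate are now simply log_index[cert_entity_id]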
There is another issue in the provided code: lists of dictionaries are inefficient. Dictionary indexing is slow, and dictionaries are not memory efficient either. A better way to store and process the data could be dataframes (e.g. with Pandas, which also provides a relatively optimized groupby function).
Below is another way to bring the complexity to O(2C + L + E + R).
Caveat: I haven't tried running this, and it's just mock-up code that isn't trying to be as efficient as possible. I also just mocked it up thinking conceptually how to make it linear complexity, and it might have some fundamental 'Ooops' that I missed.
But it's based on the concept of looping through each of C, L, E and R only once. This is done by first making certificates a dictionary instead of a list, keyed by entityid. The lists to store each certificate's logs, entities and recipients are also created at that time.
Then you can loop through L, E and R only once, and directly add their entries to the certificates dictionary by looking up the entityid.
The final step (hence the 2C in the complexity) is to loop back through the certificates dictionary and turn it into a list to match the desired output type.
from dataclasses import asdict
from typing import List

def cert_and_details(entities: List[Entity],
                     certificates: List[Certificate],
                     req_logs: List[DocumentRequestHistory],
                     recipients: List[Recipient]) -> List[dict]:
    certs = {}
    for cert in certificates:
        cert_dict = asdict(cert)
        cert_id = cert_dict['entityid']
        certs[cert_id] = cert_dict
        cert_dict['logs'] = []
        cert_dict['recipients'] = []
        cert_dict['linkedentity'] = []

    for log in req_logs:
        log_dict = asdict(log)
        log_id = log_dict['entityid']
        certs[log_id]['logs'].append(log_dict)

    for ent in entities:
        ent_dict = asdict(ent)
        ent_id = ent_dict['entityid']
        certs[ent_id]['linkedentity'].append(ent_dict)

    for rec in recipients:
        rec_dict = asdict(rec)
        rec_id = rec_dict['entityid']
        certs[rec_id]['recipients'].append(rec_dict)

    # turn certs back into a list, not a dictionary
    certs = list(certs.values())
    return certs

How to write dictionary comprehension in this complicated case?

The example is artificial, but I have had similar problems many times.
db_file_names = ['f1', 'f2']  # list of database files

def make_report(filename):
    # read the database and prepare some report object
    return report_object
Now I want to construct a dictionary: db_version -> number_of_tables. The report object contains all the information I need.
The dictionary comprehension could look like:
d = {
    make_report(filename).db_version: make_report(filename).num_tables
    for filename in db_file_names
}
This approach sometimes works, but is very inefficient: the report is prepared twice for each database.
To avoid this inefficiency I usually use one of the following approaches:
Use temporary storage:
reports = [make_report(filename) for filename in db_file_names]
d = {r.db_version: r.num_tables for r in reports}
Or use some adaptor-generator:
def gen_data():
    for filename in db_file_names:
        report = make_report(filename)
        yield report.db_version, report.num_tables

d = {dat[0]: dat[1] for dat in gen_data()}
But it's usually only after I write some wrong comprehension and think it over that I realize a clean and simple comprehension isn't possible in this case.
The question is: is there a better way to create the required dictionary in such situations?
Since yesterday (when I decided to post this question) I have invented one more approach, which I like more than all the others:
d = {
    report.db_version: report.num_tables
    for filename in db_file_names
    for report in [make_report(filename), ]
}
but even this one doesn't look very good.
You can use:
d = {
    r.db_version: r.num_tables
    for r in map(make_report, db_file_names)
}
Note that in Python 3, map gives an iterator, thus there is no unnecessary storage cost.
Here's a functional way:
from operator import attrgetter
res = dict(map(attrgetter('db_version', 'num_tables'),
               map(make_report, db_file_names)))
Unfortunately, functional composition is not part of the standard library, but the 3rd party toolz does offer this feature:
from toolz import compose
foo = compose(attrgetter('db_version', 'num_tables'), make_report)
res = dict(map(foo, db_file_names))
Conceptually, you can think of these functional solutions outputting an iterable of tuples, which can then be fed directly to dict.
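As a small self-contained illustration of the functional versions above (Report and this make_report are just stand-ins, not the real report builder):
from collections import namedtuple
from operator import attrgetter

Report = namedtuple("Report", ["db_version", "num_tables"])

def make_report(filename):
    # stand-in for the real report builder
    return Report(db_version="v-" + filename, num_tables=len(filename))

db_file_names = ["f1", "f22"]

res = dict(map(attrgetter("db_version", "num_tables"),
               map(make_report, db_file_names)))
print(res)  # {'v-f1': 2, 'v-f22': 3}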

Inserting documents in MongoDB in specific order using pymongo

I have to insert documents into MongoDB in a left-shift manner, i.e. if the collection contains 60 documents, I remove the first document and insert the new document at the rear of the collection. But when I insert the 61st document and onwards, the documents end up in random positions.
Is there any way I can insert the documents in the order that I specified above?
Or do I have to do this processing when I am retrieving the values from the database? If yes then how?
The data format is :
data = {"time": "10:14:23",  # timestamp
        "stats": [<list of dictionaries>]
        }
The code I am using is
from pymongo import MongoClient
db = MongoClient().test
db.timestamp.delete_one({"_id":db.timestamp.find()[0]["_id"]})
db.timestamp.insert_one(new_data)
timestamp is the name of the collection.
Edit: Changed the code. Is there any better way?
from pymongo.operations import InsertOne, DeleteOne

def save(collection, data, cap=60):
    if collection.count() == cap:
        top_doc_time = min(doc['time'] for doc in collection.find())
        collection.delete_one({'time': top_doc_time})
    collection.insert_one(data)
A bulk write operation guarantees query ordering by default.
This means that the queries are executed sequentially.
from pymongo.operations import DeleteOne, InsertOne

def left_shift_insert(collection, doc, cap=60):
    ops = []
    variance = max((collection.count() + 1) - cap, 0)
    delete_ops = [DeleteOne({})] * variance
    ops.extend(delete_ops)
    ops.append(InsertOne(doc))
    return collection.bulk_write(ops)
left_shift_insert(db.timestamp, new_data)
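As for the second part of the question (handling the ordering at read time instead), one option is to rely on the _id field: ObjectIds embed a creation timestamp, so sorting on _id approximates insertion order. A minimal sketch:
from pymongo import ASCENDING, MongoClient

db = MongoClient().test

# ObjectIds embed a creation timestamp, so sorting on _id
# roughly reproduces insertion order without extra bookkeeping
for doc in db.timestamp.find().sort("_id", ASCENDING):
    print(doc["time"])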

pymongo: Error Creating embedded array in an OrderedDict

While importing SQL data into MongoDB, I have merged a few tables into an embedded array, but when running it I get 'key error' exceptions.
Below is my code.
import pyodbc, json, collections, pymongo, datetime

arrayCol = []
mongoConStr = 'localhost:27017'
sqlConStr = 'DRIVER={MSSQL-NC1311};SERVER=tcp:172.16.1.75,1433;DATABASE=devdb;UID=qauser;PWD=devuser'
mongoConnect = pymongo.MongoClient(mongoConStr)
sqlConnect = pyodbc.connect(sqlConStr)
dbo = mongoConnect.eaedw.ctArrayData
sqlCur = sqlConnect.cursor()
sqlCur.execute('''SELECT M.fldUserId, TRU.intRuleGroupId, TGM.strGroupName FROM TBL_USER_MASTER M
                  JOIN TBL_RULEGROUP_USER TRU ON M.fldUserId = TRU.intUserId
                  JOIN tbl_Group_Master TGM ON TRU.intRuleGroupId = TGM.intGroupId
               ''')
tuples = sqlCur.fetchall()

for tuple in tuples:
    doc = collections.OrderedDict()
    doc['fldUserId'] = tuple.fldUserId
    doc['groups.gid'].append(tuple.intRuleGroupId)
    doc['groups.gname'].append(tuple.strGroupName)
    arrayCol.append(doc)

mongoImp = dbo.insert_many(arrayCol)

sqlCur.close()
mongoConnect.close()
sqlConnect.close()
Here, I was trying to create an embedded array named groups which will hold gid and groupname as sub-documents in the array.
I get an error for using append; it runs successfully without the embedded array.
Is there any error or mistake with the array definition?
You can't append to a list that doesn't exist: doc['groups.gid'] and doc['groups.gname'] have no value yet when you call append on them. Even once you fix that problem, PyMongo prohibits you from inserting a document with keys like "groups.gid" that contain dots. I think you intend to do this:
for tuple in tuples:
    doc = collections.OrderedDict()
    doc['fldUserId'] = tuple.fldUserId
    doc['groups'] = collections.OrderedDict([
        ('gid', tuple.intRuleGroupId),
        ('gname', tuple.strGroupName)
    ])
    arrayCol.append(doc)
I'm only guessing, based on your question, at the schema that you really want to create.
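If the intent was instead for each user document to carry an array of all of its groups (the SQL join returns one row per user/group pair), a hedged sketch of that grouping could look like this, reusing tuples, arrayCol and dbo from your script:
users = collections.OrderedDict()
for row in tuples:
    # one document per fldUserId, accumulating its groups as sub-documents
    doc = users.setdefault(row.fldUserId, {'fldUserId': row.fldUserId, 'groups': []})
    doc['groups'].append({'gid': row.intRuleGroupId, 'gname': row.strGroupName})

arrayCol = list(users.values())
mongoImp = dbo.insert_many(arrayCol)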

How to vectorize a json dictionary using R wrapped in python?

High level description of what I want: I want to be able to receive a JSON response detailing certain values of fields/features, say {a: 1, b: 2, c: 3}, as a Flask (JSON) request. Then I want to convert the resulting python_dict into an R dataframe with rpy2 (a single row one), and feed it into a model in R which expects to receive a set of inputs where each column is a factor in R. I usually use Python for this sort of thing and serialize a vectorizer object from sklearn, but this particular analysis needs to be done in R.
So here is what I'm doing so far.
import os

import rpy2.robjects as robjects
from rpy2.robjects.packages import STAP

model = os.path.join('model', 'rsource_file.R')
with open(model, 'r') as f:
    string = f.read()
model = STAP(string, "model")

data_r = robjects.DataFrame(data)
data_factored = model.prepdata(data_r)
result = model.predict(data_factored)
The relevant R functions from the R source file are:
prepdata = function(row){
  for(v in vars) if(typeof(row[,v])=="character") row[,v] = as.factor(row[,v], levs[0,v])
  modm2 = model.matrix(frm, data=tdz2, contrasts.arg = c1, xlev = levs)
}
where the contrasts and levels have been pre-extracted from an existing dataset like so:
#vars = vector of columns of interest
load(data.Rd)
for(v in vars) if(typeof(data[,v])=="character") data[,v] = as.factor(data[,v])
frm = ~ weightedsum_of_things #function mapped, causes no issue
modm= (model.matrix(frm,data=data))
levs = lapply(data, levels)
c1 = attributes(modm)$contrasts
Calling prepdata does not give me what I want, which is for the new dataframe (built from the JSON request, data_r) to be properly turned into a vector of "factors" with the same encoding by which the elements of the data.Rd dataset were transformed.
Thank you for your assistance, will upvote.
More detail: what my code is attempting to do is map the labels() method over the dataset to extract a list of lists of possible "levels" for a factor, and then, for matching values in the new input, call factor() with the new data row as well as the corresponding set of levels, levs[0,v].
This throws an error that you can't use factor if there isn't more than one level. I think this might have something to do with the labels/levels difference? I'm calling levs[,v] to get the element of the return value of lapply(data, levels) corresponding to the "title" v (a string). I extracted the levels from the data set, but referencing them in the body of prepdata this way doesn't seem to work. Do I need to extract labels instead? If so, how can I do that?
