Performance impact of or-ing Q objects in django query - python

I am performing a query that ORs a bunch of Q objects together and it seems to be taking a lot of time. Here is some pseudocode:
query_params = []
for i in range(80):  # there are about 80ish Q objects being created
    query_params.append(Q(filter_stuff))
Then I OR them all together:
query_params = reduce(or_, query_params)
And when I execute the query:
query = list(MyModel.objects.filter(query_params))
It hangs for a LONG time. I know this is a pretty general question and it's hard to give a diagnosis without an intimate understanding of the data structure (which would be difficult to share here). But I'm just curious whether there is an inherent performance impact of OR-ing Q objects together in a Django query.

I was able to shrink the length of the query significantly by reducing the number of Q objects. They were all of the form:
q1 = Q(field1=field1_val1, field2=field2_val1)
q2 = Q(field1=field1_val2, field2=field2_val2)
#...etc
I ended up grouping them by field1 values:
q_dict = {field1_val1: [all field_2 values paired with field1_val1], ...}
Then my q objects looked like this:
for field1_val, field2_vals in q_dict.items():
    query_params.append(Q(field1=field1_val, field2__in=field2_vals))
This ultimately shrank the number of Q objects significantly, and the query ran much faster.
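Putting the pieces together, here is a minimal sketch of the consolidation; MyModel, field1, field2 and the pairs list are placeholder names mirroring the pseudocode above, not real model fields:

from collections import defaultdict
from functools import reduce
from operator import or_

from django.db.models import Q

# `pairs` stands in for the original (field1_val, field2_val) combinations
# that previously produced one Q object each
pairs = [(field1_val1, field2_val1), (field1_val2, field2_val2)]  # ...etc

# group the field2 values by their field1 value
q_dict = defaultdict(list)
for field1_val, field2_val in pairs:
    q_dict[field1_val].append(field2_val)

# one Q object per field1 value instead of one per pair
query_params = [Q(field1=field1_val, field2__in=field2_vals)
                for field1_val, field2_vals in q_dict.items()]

query = list(MyModel.objects.filter(reduce(or_, query_params)))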

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including cell type (e.g. muscle cell, heart cell), subtype (e.g. a lung dataset can be split into normal lung and cancerous lung), cancer stage (e.g. stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
Only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or, should I consider another storage framework (briefly saw something called Redis Hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large HDF5 files takes similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write them to HDF5, and query:
import h5py
import numpy as np
import time

def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}': np.random.random() for i in range(num_agg_metrics)}
    level, num_groups = level_counts.popitem()
    return {f'{level}_{i+1}': create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path, d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f, d):
        for k, v in d.items():
            # check if the next level is also a dict
            sk, sv = v.popitem()
            v[sk] = sv
            if type(sv) == dict:
                # this is a 'node', move on to next level
                _recur_write(f.create_group(k), v)
            else:
                # this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk, sv in v.items():
                    leaf.attrs[sk] = sv
    with h5py.File(hdf5_path, 'w') as f:
        _recur_write(f, d)

def query_hdf5(hdf5_path, search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path, 'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}
        return dict(f.attrs)

################
#    start     #
################

# this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
# all possible nested dictionaries are made,
# so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene': 40,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

# "large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
# has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene': 400,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

# Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'

np.random.seed(1)

start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time() - start))

start = time.time()
write_dict_to_hdf5(hdf5_path, data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time() - start))

# Search terms in order of most broad to least
search_terms = ['Metadata_1', 'Unique_Group_3', 'Dataset_8', 'Cell_Type_15', 'Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path, search_terms)
print('queried in {:.2f} seconds'.format(time.time() - start))

direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of "Agg metric 1", "Agg metric 2", etc. for every gene in your dataset. These duplicate strings are probably taking up a significant amount of memory. They can be deduplicated with sys.intern so that, although you still have just as many references to the string in your dictionary, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code, simply changing the assignment to data[sys.intern('Agg metric 1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
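For illustration, here is a minimal sketch of the interning idea applied to one leaf dictionary; the metric names and values are made up:

import sys
import numpy as np

num_agg_metrics = 7
agg_values = np.random.random(num_agg_metrics)  # stand-in for real aggregate metrics

# sys.intern returns a canonical copy of each key string, so every leaf dict
# references the same string objects in memory instead of carrying duplicates
leaf = {sys.intern(f'Agg_metric_{i+1}'): v for i, v in enumerate(agg_values)}

Note that string literals written directly in source code are often interned automatically by CPython; the savings mainly show up when the keys are built dynamically (f-strings, concatenation, or strings read from files).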

Python: Data-structure and processing of GPS points and properties

I'm trying to read data from a CSV and then process it in different ways. (For starters, just the average.)
Data
(OneDrive) https://1drv.ms/u/s!ArLDiUd-U5dtg0teQoKGguBA1qt9?e=6wlpko
The data looks like this:
ID; Property1; Property2; Property3...
1; ....
1; ...
1; ...
2; ...
2; ...
3; ...
...
Every line is a GPS point. All points with the same ID (for example 1) together form one route. The routes are not all the same length, and some IDs are skipped, so the IDs don't increase seamlessly.
I should add that the points are ALWAYS the same number of meters apart from each other. And I don't need the XY information currently.
Wanted Result
In the end I want something like this:
[ID, AVG_Property1, AVG_Property2...] [1, 1.00595, 2.9595, ...] [2,1.50606, 1.5959, ...]
What I got so far
import os
import numpy
import pandas as pd

data = pd.read_csv(os.path.join('C:\\data', 'data.csv'), sep=';')

# [id, len, prop1, prop2, ...]
routes = numpy.zeros((data.size, 10))  # 10 properties
sums = numpy.zeros(8)
nr_of_entries = 0
current_id = 1

for index, row in data.iterrows():
    if int(row['id']) != current_id:  # after the last point of the route
        routes[current_id - 1][0] = current_id
        routes[current_id - 1][1] = nr_of_entries  # how many points are in this route?
        routes[current_id - 1][2] = sums[0] / nr_of_entries
        routes[current_id - 1][3] = sums[1] / nr_of_entries
        routes[current_id - 1][4] = sums[2] / nr_of_entries
        routes[current_id - 1][5] = sums[3] / nr_of_entries
        routes[current_id - 1][6] = sums[4] / nr_of_entries
        routes[current_id - 1][7] = sums[5] / nr_of_entries
        routes[current_id - 1][8] = sums[6] / nr_of_entries
        routes[current_id - 1][9] = sums[7] / nr_of_entries
        current_id = int(row['id'])
        sums = numpy.zeros(8)
        nr_of_entries = 0
    sums[0] += row[3]
    sums[1] += row[4]
    sums[2] += row[5]
    sums[3] += row[6]
    sums[4] += row[7]
    sums[5] += row[8]
    sums[6] += row[9]
    sums[7] += row[10]
    nr_of_entries = nr_of_entries + 1
routes
My problem
1.) The way I did it, I have to copy-paste the same code for every other processing approach, since, as stated, I need to do multiple different calculations. Average is just an example.
2.) The reading of the data is clumsy and fails when IDs are missing.
3.) I'm a C# developer, so my approach would be to create a class 'Route' which holds all the points and then provides methods like 'calculate average for prop 1'. That way I could also tweak the data if needed (extreme values, for example). But I have no idea how this would be done in Python and whether it is a reasonable approach in this language.
4.) Is there a more elegant way to iterate through the original CSV and get Route ID 1, then Route ID 2, and so on? Maybe something like LINQ queries in C#?
Thanks for any help.
Here is a solution and some ideas you can use. The example features multiple options for the same issue, so you have to choose which fits your purpose best. Also, it is Python 3.7; you didn't specify a version, so I hope this works.
import pandas as pd

class Route(object):
    """description of class"""
    def __init__(self, id, rawdata):  # on startup
        self.id = id
        self.rawdata = rawdata
        self.avg_Prop1 = self.calculate_average('Prop1')
        self.sum_Prop4 = None

    def calculate_average(self, Prop_Name):  # self reference for first argument in class method
        return self.rawdata[Prop_Name].mean()

    def give_Prop_data(self, Prop_Name):  # return the Prop data as a list
        return self.rawdata[Prop_Name].tolist()

    def any_function(self, my_function, Prop_Name):  # not sure what dataframes support, so turning it into a list first
        return my_function(self.rawdata[Prop_Name].tolist())
# end of class definition

data = pd.read_csv('testdata.csv', sep=';')
# [id, len, prop1, prop2, ...]

route_list = []  # list of all the objects created from the Route class
for i in data.id.unique():
    print('Current id:', i, ' with ', len(data[data['id'] == i]), 'entries')
    route_list.append(Route(i, data[data['id'] == i]))

# created the Prop1 average in the initialization of Route, so just access the attribute
print(route_list[1].avg_Prop1)

for current_route in route_list:
    print('Route ', current_route.id, ' Properties :')
    for i in current_route.rawdata.columns[1:]:  # for all except the first (id)
        print(i, ' has average ', current_route.calculate_average(i))  # i is the string of the column, not just an id

# or pass any function that you want
route_list[1].sum_Prop4 = route_list[1].any_function(sum, 'Prop4')
print(route_list[1].sum_Prop4)
# which is equivalent to
print(sum(route_list[1].rawdata['Prop4']))
To address your individual problems, out of order:
For 2. and 4.) Looping only over the existing IDs (data.id.unique()) solves the problem. I have no idea what LINQ queries are, but I assume they are similar. In general, Python has a great way of looping over objects (like for current_route in route_list), which is worth looking into if you want to use it a little more; a quick groupby sketch follows as well.
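If a per-route average is all that's needed, pandas can also produce it without an explicit loop. A minimal sketch, assuming the file and column names from the example above and that the property columns are numeric:

import pandas as pd

data = pd.read_csv('testdata.csv', sep=';')

# one row per existing route id; every numeric property column is averaged,
# and skipped ids simply don't appear in the result
averages = data.groupby('id').mean().reset_index()
print(averages)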
For 1. and 3.) Again, looping solves the issue. I created a class in the example mostly to show the syntax for classes. The benefits and drawbacks of using classes should be the same in Python as in C#.
As it is right now the class probably isn't great, but this depends on how you want to use it. If the class should just be a practical way of storing and accessing data, it shouldn't have the methods, because you don't need an individual average method for each route. Then you can just access its data and use it in a function, as in sum(route_list[1].rawdata['Prop4']). If, however, different calculations are necessary depending on the data (the number of rows, for example), it might come in handy to use the calculate_average method and differentiate in there.
Another example is the use of the attributes. If you need the average for Prop1 every time, creating it at initialization seems a good idea; otherwise I wouldn't bother always calculating it.
I hope this helps!

parallel processing - nearest neighbour search using pysal python?

I have this data frame df1,
id lat_long
400743 2504043 (175.0976323, -41.1141412)
43203 1533418 (173.976683, -35.2235338)
463952 3805508 (174.6947496, -36.7437555)
1054906 3144009 (168.0105269, -46.36193)
214474 3030933 (174.6311167, -36.867717)
1008802 2814248 (169.3183615, -45.1859095)
988706 3245376 (171.2338968, -44.3884099)
492345 3085310 (174.740957, -36.8893026)
416106 3794301 (174.0106383, -35.3876921)
937313 3114127 (174.8436185, -37.80499)
Here is how I have constructed the tree for searching:
def construct_geopoints(s):
    data_geopoints = [tuple(x) for x in s[['longitude', 'latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

tree = construct_geopoints(actualdata)
Now I am trying to find all the geopoints that are within 1 km of every geopoint in my data frame df1. Here is how I am doing it:
dfs = []
for name, group in df1.groupby(np.arange(len(df1)) // 10000):
    s = group.reset_index(drop=True).copy()
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    dfs.append(s)
output = pd.concat(dfs, axis=0)
Everything here works fine; however, I am trying to parallelise this task, since my df1 has 2M records and this process runs for more than 8 hours. Can anyone help me with this? Another thing: the result returned by query_ball_point is a list, so it throws a memory error when I process it for a huge number of records. Is there any way to handle this?
EDIT: Memory issue, look at the VIRT size.
It should be possible to parallelize your last segment of code with something like this:
from multiprocessing import Pool
...

def process_group(group):
    s = group[1].reset_index(drop=True)  # .copy() is implicit
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    return s

groups = df1.groupby(np.arange(len(df1)) // 10000)
p = Pool(5)
dfs = p.map(process_group, groups)
output = pd.concat(dfs, axis=0)
But watch out, because the multiprocessing library pickles all the data on its way to and from the workers, and that can add a lot of overhead for data-intensive tasks, possibly cancelling the savings due to parallel processing.
I can't see where you'd be getting out-of-memory errors from. 8 million records is not that much for pandas. Maybe if your searches are producing hundreds of matches per row that could be a problem. If you say more about that I might be able to give some more advice.
It also sounds like pysal may be taking longer than necessary to do this. You might be able to get better performance by using GeoPandas or "rolling your own" solution like this (a rough sketch follows the list):
assign each point to a surrounding 1-km grid cell (e.g., calculate UTM coordinates x and y, then create columns cx=x//1000 and cy=y//1000);
create an index on the grid cell coordinates cx and cy (e.g., df=df.set_index(['cx', 'cy']));
for each point, find the points in the 9 surrounding cells; you can select these directly from the index via df.loc[[(cx-1,cy-1),(cx-1,cy),(cx-1,cy+1),(cx,cy-1),...(cx+1,cy+1)], :];
filter the points you just selected to find the ones within 1 km.
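A minimal sketch of that grid-cell idea, assuming the points have already been projected to metre-based coordinates held in columns 'x' and 'y' (placeholder names, e.g. UTM easting/northing):

import numpy as np
import pandas as pd

def neighbours_within_1km(df):
    # df is assumed to have projected, metre-based coordinates in columns 'x' and 'y'
    x = df['x'].to_numpy()
    y = df['y'].to_numpy()
    cx = (x // 1000).astype(int)
    cy = (y // 1000).astype(int)

    # bucket point positions by their 1-km grid cell
    cells = {}
    for i, cell in enumerate(zip(cx, cy)):
        cells.setdefault(cell, []).append(i)

    out = []
    for i in range(len(df)):
        # candidates come from the 3x3 block of cells around point i
        cand = np.array([j for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         for j in cells.get((cx[i] + dx, cy[i] + dy), [])])
        d2 = (x[cand] - x[i]) ** 2 + (y[cand] - y[i]) ** 2
        out.append(cand[d2 <= 1000.0 ** 2].tolist())  # includes point i itself
    return out

# tiny usage example with made-up projected coordinates (metres)
pts = pd.DataFrame({'x': np.random.uniform(0, 5000, 100),
                    'y': np.random.uniform(0, 5000, 100)})
print(neighbours_within_1km(pts)[:3])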

Parallelizing loading data from MongoDB into python

All documents in my collection in MongoDB have the same fields. My goal is to load them into Python as a pandas.DataFrame or dask.DataFrame.
I'd like to speedup the loading procedure by parallelizing it. My plan is to spawn several processes or threads. Each process would load a chunk of a collection, then these chunks would be merged together.
How do I do it correctly with MongoDB?
I have tried a similar approach with PostgreSQL. My initial idea was to use SKIP and LIMIT in SQL queries. It failed, since each cursor, opened for each particular query, started reading the data table from the beginning and just skipped the specified number of rows. So I had to create an additional column containing record numbers and specify ranges of those numbers in the queries.
MongoDB, by contrast, assigns a unique ObjectID to each document. However, I've found that it is impossible to subtract one ObjectID from another; they can only be compared with ordering operations: less than, greater than, and equal.
Also, pymongo returns a cursor object that supports indexing and has some methods that seem useful for my task, like count and limit.
The MongoDB connector for Spark accomplishes this task somehow. Unfortunately, I'm not familiar with Scala, so it's hard for me to find out how they do it.
So, what is the correct way for parallel loading data from Mongo into python?
Up to now, I've come to the following solution:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# import other modules.

collection = get_mongo_collection()
cursor = collection.find({ })

def process_document(in_doc):
    out_doc = ...  # process doc keys and values
    return pd.DataFrame(out_doc)

df = dd.from_delayed((delayed(process_document)(d) for d in cursor))
However, it looks like dask.dataframe.from_delayed internally creates a list from the passed generator, effectively loading the whole collection in a single thread.
Update: I've found in the docs that the skip method of pymongo.Cursor also starts from the beginning of the collection, as in PostgreSQL. The same page suggests using pagination logic in the application. The solutions I've found so far use a sorted _id for this. However, they also store the last-seen _id, which implies that they also work in a single thread.
Update 2: I've found the code of the partitioner in the official MongoDB Spark connector: https://github.com/mongodb/mongo-spark/blob/7c76ed1821f70ef2259f8822d812b9c53b6f2b98/src/main/scala/com/mongodb/spark/rdd/partitioner/MongoPaginationPartitioner.scala#L32
It looks like this partitioner initially reads the key field from all documents in the collection and calculates ranges of values.
Update 3: My incomplete solution.
It doesn't work; it gets an exception from pymongo, because dask seems to treat the Collection object incorrectly:
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/dask/delayed.pyc in <genexpr>(***failed resolving arguments***)
81 return expr, {}
82 if isinstance(expr, (Iterator, list, tuple, set)):
---> 83 args, dasks = unzip((to_task_dask(e) for e in expr), 2)
84 args = list(args)
85 dsk = sharedict.merge(*dasks)
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/pymongo/collection.pyc in __next__(self)
2342
2343 def __next__(self):
-> 2344 raise TypeError("'Collection' object is not iterable")
2345
2346 next = __next__
TypeError: 'Collection' object is not iterable
What raises the exception:
def process_document(in_doc, other_arg):
    # custom processing of incoming records
    return out_doc

def compute_id_ranges(collection, query={}, partition_size=50):
    cur = collection.find(query, {'_id': 1}).sort('_id', pymongo.ASCENDING)
    id_ranges = [cur[0]['_id']]
    count = 1
    for r in cur:
        count += 1
        if count > partition_size:
            id_ranges.append(r['_id'])
            count = 0
    id_ranges.append(r['_id'])
    return zip(id_ranges[:len(id_ranges) - 1], id_ranges[1:])

def load_chunk(id_pair, collection, query={}, projection=None):
    q = query
    q.update({"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}})
    cur = collection.find(q, projection)
    return pd.DataFrame([process_document(d, other_arg) for d in cur])

def parallel_load(*args, **kwargs):
    collection = kwargs['collection']
    query = kwargs.get('query', {})
    projection = kwargs.get('projection', None)
    id_ranges = compute_id_ranges(collection, query)
    dfs = [delayed(load_chunk)(ir, collection, query, projection) for ir in id_ranges]
    df = dd.from_delayed(dfs)
    return df

collection = connect_to_mongo_and_return_collection_object(credentials)
# df = parallel_load(collection=collection)
id_ranges = compute_id_ranges(collection)
dedf = delayed(load_chunk)(id_ranges[0], collection)
load_chunk runs perfectly when called directly. However, the call delayed(load_chunk)( blah-blah-blah ) fails with the exception mentioned above.
I was looking into pymongo parallelisation and this is what worked for me. It took my humble gaming laptop nearly 100 minutes to process my MongoDB collection of 40 million documents. The CPU was 100% utilised; I had to turn on the AC :)
I used skip and limit functions to split the database, then assigned batches to processes. The code is written for Python 3:
import multiprocessing
from pymongo import MongoClient

def your_function(something):
    <...>
    return result

def process_cursor(skip_n, limit_n):
    print('Starting process', skip_n // limit_n, '...')
    collection = MongoClient().<db_name>.<collection_name>
    cursor = collection.find({}).skip(skip_n).limit(limit_n)
    for doc in cursor:
        <do your magic>
        # for example:
        result = your_function(doc['your_field'])  # do some processing on each document
        # update that document by adding the result into a new field
        collection.update_one({'_id': doc['_id']}, {'$set': {'<new_field_eg>': result}})
    print('Completed process', skip_n // limit_n, '...')

if __name__ == '__main__':
    n_cores = 7  # number of splits (logical cores of the CPU - 1)
    collection_size = 40126904  # your collection size
    batch_size = round(collection_size / n_cores + 0.5)
    skips = range(0, n_cores * batch_size, batch_size)

    processes = [multiprocessing.Process(target=process_cursor, args=(skip_n, batch_size))
                 for skip_n in skips]

    for process in processes:
        process.start()

    for process in processes:
        process.join()
The last split will have a larger limit than the remaining documents, but that won't raise an error
I think dask-mongo will do the work here. You can install it with pip or conda, and in the repo you can find some examples in a notebook.
dask-mongo reads the data you have in MongoDB as a Dask bag, and then you can go from a Dask bag to a Dask DataFrame with df = b.to_dataframe(), where b is the bag you read from Mongo using dask_mongo.read_mongo.
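A rough sketch of that flow; the parameter names and values below are assumptions based on the examples in the dask-mongo repository, so check its notebook for the exact signature:

from dask_mongo import read_mongo

# read the collection lazily as a Dask bag of documents
# (host/database/collection names are placeholders)
bag = read_mongo(
    connection_kwargs={"host": "localhost", "port": 27017},
    database="my_database",
    collection="my_collection",
    chunksize=1000,
)

ddf = bag.to_dataframe()  # Dask bag -> Dask DataFrame
result = ddf.compute()    # or keep working with the lazy Dask DataFrame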
"Read the mans, thery're rulez" :)
pymongo.Collection has method parallel_scan that returns a list of cursors.
UPDATE. This function can do the job, if the collection does not change too often, and queries are always the same (my case). One could just store query results in different collections and run parallel scans.
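For older deployments where parallel_scan is still available (it was deprecated in MongoDB 4.0 and later removed from PyMongo), a minimal sketch could look like this; the database and collection names are placeholders, and the cursors are consumed in threads because they are tied to the client that created them:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from pymongo import MongoClient

collection = MongoClient().my_db.my_collection  # placeholder names

def consume(cursor):
    # turn one cursor's share of the collection into a DataFrame chunk
    return pd.DataFrame(list(cursor))

# ask the server to split the collection across 4 cursors
cursors = collection.parallel_scan(4)
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(consume, cursors))

df = pd.concat(chunks, ignore_index=True)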

Fastest way to get a large number of nodes from Neo4j using py2neo

I'm trying to load nodes (about 400) and relationships (about 800) from a Neo4j DB to create a force directed graph using D3. This is my get function (I'm using Tornado):
def get(self):
    query_string = "START r=rel(*) RETURN r"
    query = neo4j.CypherQuery(graph_db, query_string)
    results = query.execute().data

    start = set([r[0].start_node for r in results])
    end = set([r[0].end_node for r in results])
    nodes_to_keep = list(start.union(end))

    nodes = []
    for n in nodes_to_keep:
        nodes.append({
            "name": n['name'].encode('utf-8'),
            "group": n['type'].encode('utf-8'),
            "description": n['description'].encode('utf-8'),
            "node": int(n['node_id'])})

    # links
    links = []
    for r in results:
        links.append({"source": int(r[0].start_node['node_id']),
                      "target": int(r[0].end_node['node_id'])})

    self.render(
        "index.html",
        page_title='My Page',
        page_heading='Sweet D3 Force Diagram',
        nodes=nodes,
        links=links,
    )
I'm thinking the expensive part is the for n in nodes_to_keep: and for r in results: loops, since every time I get a property, that's a trip to the server. Right?
What's the best way to accomplish this task?
The reason the above process was taking so long is that every time I asked for a node property, I took a trip to the server to fetch something from the database. I was able to drastically reduce the time this process takes by simply modifying the Cypher query.
For instance, to get all nodes with relationships I used this query:
query_string = """MATCH (n)-[r]-(m)
                  RETURN n, n.node_id, n.name, n.type, n.description,
                         m.node_id, m.name, m.type, m.description"""
query = neo4j.CypherQuery(graph_db, query_string)
results = query.execute().data
The results contain the information I need, so I just loop through the results to get the properties.
The takeaway is that you need to write your queries such that they get you the info you need the first time around.
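For completeness, here is a rough sketch of that loop; the column positions mirror the RETURN clause above (accessed by index, just like r[0] in the original code), though the exact row type depends on your py2neo version:

nodes_by_id = {}
links = []
for row in results:
    # column positions follow the RETURN clause:
    # 0: n, 1: n.node_id, 2: n.name, 3: n.type, 4: n.description,
    # 5: m.node_id, 6: m.name, 7: m.type, 8: m.description
    n_id, m_id = int(row[1]), int(row[5])
    for node_id, name, group, description in ((n_id, row[2], row[3], row[4]),
                                              (m_id, row[6], row[7], row[8])):
        if node_id not in nodes_by_id:
            nodes_by_id[node_id] = {"name": name, "group": group,
                                    "description": description, "node": node_id}
    links.append({"source": n_id, "target": m_id})
nodes = list(nodes_by_id.values())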
