TL;DR of Problem
The frontend is a form that requires a complex lookup, with ranges and exact values, across several models, given as a dict. What's the best way to do it?
Explanation
From the view, I receive a dict of the following form (after it has been processed elsewhere):
{
    'h_index': {'min': 10, 'max': 20},
    'rank': 'supreme_overlord',
    'total_citations': {'min': 10, 'max': 400},
    'year_began': {'min': 2000},
    'year_end': {'max': 3000},
}
The keys are column names from different models (currently two separate models, Researcher and ResearchMetrics), and the values are the range or exact value that I want to query.
In the example above:
Belonging to model Researcher:
rank
year_began
year_end
Belonging to model ResearchMetrics:
total_citations
h_index
Researcher has a one-to-many relationship with ResearchMetrics.
Researcher has a many-to-many relationship with Journals (not involved in this question).
Ideally, I want to show the researchers who fulfill all the criteria above in a list-of-lists format:
Researcher ID, name, rank, year_began, year_end, total_citations, h_index
[[123, "Thomas", "professor", 2000, 2012, 15, 20],
[ 343 ... ]]
What's the best way to go about solving this problem (including changes to the form, etc.)? I'm not very familiar with the whole form/query/model workflow.
Thank you for your help!
To perform a query dynamically, you pass a dict with 'fieldname__lookuptype': value items as **kwargs to Model.objects.filter.
So to filter for rank, year_began and year_end in your example above, you would do this:
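# i.e. what the dynamically built kwargs below will expand to:
Researcher.objects.filter(
    rank='supreme_overlord',
    year_began__gt=2000,
    year_end__lt=3000,
)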
How exactly you build that dict from the incoming one depends on how variable the input is. An example could look like this:
filter_in = {
    'h_index': {'min': 10, 'max': 20},
    'rank': 'supreme_overlord',
    'total_citations': {'min': 10, 'max': 400},
    'year_began': {'min': 2000},
    'year_end': {'max': 3000},
}

LOOKUP_MAPPING = {
    'min': 'gt',
    'max': 'lt',
}

# Whitelist of fields that live directly on Researcher.
RESEARCHER_FIELDS = ['rank', 'year_began', 'year_end']

filter_kwargs = {}
for field in RESEARCHER_FIELDS:
    if field not in filter_in:
        continue
    spec = filter_in[field]
    if isinstance(spec, dict):
        # Range filter: translate 'min'/'max' into '__gt'/'__lt' lookups.
        for filter_type, value in spec.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        # Exact-value filter.
        filter_kwargs[field] = spec
This results in a dictionary like this:
{
    'rank': 'supreme_overlord',
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
Use it like this:
qs = Researcher.objects.filter(**filter_kwargs)
Regarding the fields total_citations and h_index from ResearchMetrics, I assume you want to aggregate the values. So in your example above you want either a sum or an average.
The principle is the same:
from django.db.models import Sum

# Whitelist of fields that live on ResearchMetrics and must be aggregated.
METRICS_FIELDS = ['total_citations', 'h_index']

annotate_kwargs = {}
for field in METRICS_FIELDS:
    if field not in filter_in:
        continue
    annotated_field = '%s_sum' % field
    annotate_kwargs[annotated_field] = Sum('researchmetrics__%s' % field)
    spec = filter_in[field]
    if isinstance(spec, dict):
        for filter_type, value in spec.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (annotated_field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        filter_kwargs[annotated_field] = spec
Now your filter_kwargs look like this:
{
    'h_index_sum__gt': 10,
    'h_index_sum__lt': 20,
    'rank': 'supreme_overlord',
    'total_citations_sum__gt': 10,
    'total_citations_sum__lt': 400,
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
And your annotate_kwargs look like this:
{
    'h_index_sum': Sum('researchmetrics__h_index'),
    'total_citations_sum': Sum('researchmetrics__total_citations')
}
So your final call looks like this:
Researcher.objects.annotate(**annotate_kwargs).filter(**filter_kwargs)
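To get the list-of-lists output you asked for, you can chain values_list onto that queryset. A sketch, assuming Researcher has a name field and using the annotation names from above (values_list yields tuples, so wrap the rows in list() if you strictly need lists):

qs = Researcher.objects.annotate(**annotate_kwargs).filter(**filter_kwargs)
rows = [
    list(row)
    for row in qs.values_list(
        'id', 'name', 'rank', 'year_began', 'year_end',
        'total_citations_sum', 'h_index_sum',
    )
]
# e.g. [[123, 'Thomas', 'professor', 2000, 2012, 15, 20], ...]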
There are some assumptions in my answer, but I hope you get the general idea.
One important point: make sure you validate the input properly, so that only the fields you want the user to filter on can actually be filtered. In my approach this is ensured by hard-coding the field names in RESEARCHER_FIELDS and METRICS_FIELDS.
Related
Let's say I have the following dict:
schools_dict = {
    '1': {'points': 10},
    '2': {'points': 14},
    '3': {'points': 5},
}
How can I put these values into my queryset using annotate? I would like to do something like this, but it's not working:
schools = SchoolsExam.objects.all()
queryset = schools.annotate(
    total_point=schools_dict[F('school__school_id')]['points']
)
Models:
class SchoolsExam(models.Model):
    school = models.ForeignKey('School', on_delete=models.CASCADE)

class School(models.Model):
    school_id = models.CharField(max_length=255)

This code gives me an error: KeyError: F(school__school_id).
You cannot use F objects in a dictionary lookup: a plain Python dict is resolved by Python before the query runs, so it does not "understand" F objects.
You can translate this to a conditional expression [Django-doc]:
from django.db.models import Case, Value, When

schools = SchoolsExam.objects.annotate(
    total_point=Case(
        *[
            When(school__school_id=school_id, then=Value(v['points']))
            for school_id, v in schools_dict.items()
        ]
    )
)
This will thus "unwind" the dictionary into CASE WHEN school_id='1' THEN 10 WHEN school_id='2' THEN 14 WHEN school_id='3' THEN 5.
However using data in a dictionary often does not make much sense: usually you store this in a table and perform a JOIN.
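A minimal sketch of that table-plus-JOIN alternative, assuming a hypothetical SchoolPoints model that stores the points per school:

from django.db.models import F

# Hypothetical model replacing schools_dict:
# class SchoolPoints(models.Model):
#     school = models.OneToOneField(School, on_delete=models.CASCADE)
#     points = models.IntegerField()

queryset = SchoolsExam.objects.annotate(
    total_point=F('school__schoolpoints__points')
)

The database then joins the points in directly, with no need to unwind a Python dict into SQL.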
I'm trying to use Sentence Transformers and Haystack for document retrieval, focusing on searching documents on other metadata beside document text.
I'm using a dataset of academic publication titles, and I've appended a fake publication year (which I want to use as a search term). From reading around I've combined the columns and just added a separator between the title and publication year, and included the column titles since I thought maybe this could add context. An example input looks like:
title Sparsity-certifying Graph Decompositions [SEP] published year 1980
I have a document store and a retrieval method, set up like this:
document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat",
return_embedding=True,
similarity='cosine')
retriever_faiss = EmbeddingRetriever(document_store_faiss,
embedding_model='all-mpnet-base-v2',
model_format='sentence_transformers')
document_store_faiss.write_documents(df.rename(columns={'combined':'content'}).to_dict(orient='records'))
document_store_faiss.update_embeddings(retriever=retriever_faiss)
def get_results(query, retriever, n_docs=25):
    return [item.content for item in retriever.retrieve(query, top_k=n_docs)]
q = 'published year 1999'
print('Results: ')
res = get_results(q, retriever_faiss)
for r in res:
print(r)
I do a check to see whether any inputs actually have a publication year matching the search term, but when I look at my search results I get entries with seemingly random published years. I was hoping the results would at least all share the queried published year, since I want to move on to more complicated queries like "published year before 1980".
If anyone could either tell me what I'm doing wrong, or whether I have misunderstood this process / expected results it would be much appreciated.
It sounds like you need metadata filtering rather than placing the year within the query itself. The FAISSDocumentStore doesn't support filtering; I'd recommend switching to the PineconeDocumentStore, which Haystack introduced in the v1.3 release a few days ago. It supports the strongest filter functionality in the current set of document stores.
You will need to make sure you have the latest version of Haystack installed, and it needs an additional pinecone-client library too:
pip install -U farm-haystack pinecone-client
There's a guide here that may help, it will go something like:
document_store = PineconeDocumentStore(
api_key="<API_KEY>", # from https://app.pinecone.io
environment="us-west1-gcp"
)
retriever = EmbeddingRetriever(
document_store,
embedding_model='all-mpnet-base-v2',
model_format='sentence_transformers'
)
Before you write the documents, you need to convert the data so that your text is in content (as you did above, but there is no need to prepend the year), and include the year as a field in a meta dictionary. So you would create a list of dictionaries that looks like:
dicts = [
{'content': 'your text here', 'meta': {'year': 1999}},
{'content': 'another record text', 'meta': {'year': 1971}},
...
]
I don't know the exact format of your df, but assuming it is something like:

text                    year
"your text here"        1999
"another record here"   1971
We could write the following to reformat it:
df = df.rename(columns={'text': 'content'}) # you did this already
# create a new 'meta' column that contains {'year': <year>} data
df['meta'] = df['year'].apply(lambda x: {'year': x})
# we don't need the year column anymore so we drop it
df = df.drop(['year'], axis=1)
# now convert into the list of dictionaries format as you did before
dicts = df.to_dict(orient='records')
These dicts replace the ones you wrote before, so we continue as before:
document_store.write_documents(dicts)
document_store.update_embeddings(retriever=retriever)
Now you can query with filters. For example, to search for docs with a publish year of 1999 we use the condition "$eq" (equals):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$eq": 1999}}
)
For published before 1980 we can use "$lt" (less than):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$lt": 1980}}
)
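Operators can also be combined within one filter dict, which gets you range queries. A sketch, assuming the same MongoDB-style filter syntax (here: docs published in the 1970s):

docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$gte": 1970, "$lt": 1980}}
)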
I've been trying to perform this operation for hours, but I couldn't figure it out.
Let's say I have a Django project with two models like these:
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)
    address = models.ManyToManyField(to='Address')

class Address(models.Model):
    city = models.CharField(max_length=100)
    zip = models.IntegerField()
So it's just a simple Person having multiple addresses.
Then I create some objects:
addr1=Address.objects.create(city='first', zip=12345)
addr2=Address.objects.create(city='second', zip=34555)
addr3=Address.objects.create(city='third', zip=5435)
person1=Person.objects.create(name='person_one')
person1.address.set([addr1,addr2])
person2=Person.objects.create(name='person_two')
person2.address.set([addr1,addr2,addr3])
Now comes the hard part: I want to make a single query that returns something like this:
result = [
    {
        'name': 'person_one',
        'addresses': [
            {
                'city': 'first',
                'zip': 12345
            },
            {
                'city': 'second',
                'zip': 34555
            }
        ]
    },
    {
        'name': 'person_two',
        'addresses': [
            {
                'city': 'first',
                'zip': 12345
            },
            {
                'city': 'second',
                'zip': 34555
            },
            {
                'city': 'third',
                'zip': 5435
            }
        ]
    }
]
The best I could get was using the ArrayAgg and JSONBAgg aggregates for Django (I'm on PostgreSQL, by the way):
from django.contrib.postgres.aggregates import JSONBAgg, ArrayAgg

result = Person.objects.values(
    'name',
    addresses=JSONBAgg('city')
)
But that's not enough. I can't pull a list of dictionaries out of the query directly as I would like to; I just get a list of values, or something useless using:
addresses=JSONBAgg(('city', 'zip'))
which returns a dictionary with random keys and the strings I passed as input as values.
Can someone help me out?
Thanks
If you use Postgres (and Django 4.0+, which added ArraySubquery), you can do this:
from django.contrib.postgres.expressions import ArraySubquery
from django.db.models import F, OuterRef
from django.db.models.functions import JSONObject

subquery = Address.objects.filter(person=OuterRef("pk")).annotate(
    data=JSONObject(city=F("city"), zip=F("zip"))
).values_list("data")

persons = Person.objects.annotate(addresses=ArraySubquery(subquery))
Your requirement: to aggregate customized JSON objects after a group by (values) in Django.
Currently, to my knowledge, Django does not provide a function to aggregate manually created JSON objects. There are a couple of ways to solve this. The first is to write a customized aggregate function, which is quite laborious. The other approach is pretty easy: use an aggregate function (ArrayAgg or JSONBAgg) together with RawSQL.
from django.contrib.postgres.aggregates import JSONBAgg
from django.db.models.expressions import RawSQL

result = Person.objects.values('name').annotate(
    addresses=JSONBAgg(RawSQL("json_build_object('city', city, 'zip', zip)", ()))
)
I hope this helps.
person.address already holds a manager over the related addresses, so person.address.all() gives you the queryset. From there you can use a list comprehension / model_to_dict to get the values you want.
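A minimal sketch of that pure-Python approach, using prefetch_related so the addresses are fetched in one extra query instead of one per person:

result = [
    {
        'name': person.name,
        'addresses': [
            {'city': a.city, 'zip': a.zip}
            for a in person.address.all()
        ],
    }
    for person in Person.objects.prefetch_related('address')
]

This runs two queries rather than a single aggregated one, but it stays database-agnostic.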
I have a Collection with heavily nested docs in MongoDB, I want to flatten and import to Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be chunking bigger selections into smaller groups of documents and iterating the parsing.
At this point I thought, and that is my main question here: maybe there is a smarter way to do the unnesting without taking the detour through json, directly in MongoDB or in Pandas, or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1        2            1  'FirstStage'      124   'Short'          1
My code so far (which runs into the memory problems described above):
import json
import pandas as pd
from bson import json_util
from pandas import json_normalize

def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

df = pandas.DataFrame(list(cursor))

# Unpack the nested 'stage' dict into two flat columns.
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])

# Step 1: flatten each 'quality' list into (name, value) pairs,
# defaulting to 1 when an entry has no 'value' key.
df['q_'] = df['quality'].apply(
    lambda cell: [(m['type']['Name'], m.get('value', 1)) for m in cell]
)

# Step 2: turn the pairs into a dict per row.
df['q_'] = df['q_'].apply(dict)

# Step 3: collect all property names once, then give each its own column.
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell.get(key, numpy.nan))

df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of (name, value) pairs. In the second step, a dictionary is built from those pairs, taking keys from the first and values from the second position of each tuple. Then all existing property names are extracted once using a set, and each property gets its own column in a loop. Inside the loop, the value of each pair is mapped to the respective column cell.
I have a function that computes something like a sum of data per year in the database (it's not a simple sum; each amount is multiplied by an increasing factor). It is calculated in a view, and I need to pass the result to a template. I store it in a dictionary: portfolio_dict[year] += amount
{'2013': Decimal('92.96892879384746351465539182'), '2012': Decimal('71.48765907571338816005401399')}
But I need to send some extra data as well. Let's say:
date: date
amount: Decimal
year: string
I know it sounds kind of redundant to have both a year and a date; I use the year as an index. How do I pass this data to the template / add the date to my current dictionary?
Until now, I always had a Model and passed a list of its instances. But this time I don't need to store the data in the database, so I don't want to create a model.
Where do I create new class in django if I don't want it to be in database?
Or should I use collections or data structures?
Only django.db.models.Model instances are stored in the database (and only if you explicitly ask for it). Everything else is just plain old Python, and you can create and use your own classes as you see fit.
But anyway: if all you need is a year-indexed collection of (date, amount) items, then a dict of dicts is enough:
{
'2013': {
'amount': Decimal('92.96892879384746351465539182'),
'date': datetime.date(2013, 10, 25)
},
# etc
}
Or if you need more than one (amount, date) pair per year, a dict of lists of dicts:
{
'2013': [
{
'amount': Decimal('92.96892879384746351465539182'),
'date': datetime.date(2013, 10, 25)
},
{
'amount': Decimal('29.9689287'),
'date': datetime.date(2013, 10, 21)
},
],
# etc
}
In fact the proper structure depends on how you're going to use the data.
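For illustration, a minimal sketch of a view building that second structure and passing it to a template; compute_entries is a hypothetical helper yielding objects with date and amount attributes:

from collections import defaultdict
from django.shortcuts import render

def portfolio_view(request):
    portfolio = defaultdict(list)
    for entry in compute_entries():  # hypothetical helper, not from the question
        portfolio[str(entry.date.year)].append(
            {'amount': entry.amount, 'date': entry.date}
        )
    return render(request, 'portfolio.html', {'portfolio': dict(portfolio)})

In the template you can then iterate with {% for year, items in portfolio.items %}.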