I have models similar to the below:
class Tag(models.Model):
    text = models.CharField(max_length=30)

class Post(models.Model):
    title = models.CharField(max_length=30)
    tags = models.ManyToManyField(Tag)
A Post can have many Tags and Tags can be associated with many Posts.
What I need is to get a list of all posts along with all the tags associated with each post. I then create a Pandas DataFrame from that data. Here is how I am currently doing it:
qs = Post.objects.all().prefetch_related('tags')
tag_df = pd.DataFrame(columns=["post_id", "tags"])
for q in qs:
    tag_df = tag_df.append(
        {
            "post_id": q.pk,
            "tags": list(q.tags.all().values_list("text", flat=True)),
        },
        ignore_index=True,
    )
post_df = pd.DataFrame(qs.values("id", "title"))
final_df = post_df.merge(tag_df, left_on="id", right_on="post_id")
The result is correct in terms of the data I require. The problem is that it is incredibly inefficient: a large number of queries run even though I'm using prefetch_related. It appears that a query hits the database on each iteration of the loop.
Is there a better, more efficient way to do this (possibly without loops)? All I need in the end is a dataframe that contains all the posts along with a column which has a list of the tags for each post.
By using .values_list(..) inside the loop you make an extra query on each iteration, so that is not very efficient. You can simply use the already prefetched Tag objects and obtain their .text attributes:
qs = Post.objects.prefetch_related('tags')
tag_df = pd.DataFrame(columns=['post_id', 'tags'])
for q in qs:
    tag_df = tag_df.append(
        {
            'post_id': q.pk,
            'tags': [t.text for t in q.tags.all()],
        },
        ignore_index=True,
    )
post_df = pd.DataFrame(qs.values('id', 'title'))
final_df = post_df.merge(tag_df, left_on='id', right_on='post_id')
It might however be more efficient to first make a list of dictionaries, and then load these in a dataframe once:
qs = Post.objects.prefetch_related('tags')
data = [
    {'id': q.pk, 'title': q.title, 'tags': [t.text for t in q.tags.all()]}
    for q in qs
]
final_df = pd.DataFrame(data, columns=['id', 'title', 'tags'])
Note that using .values(..) or .values_list(..) is usually not a good idea; it only makes sense in certain cases, such as performing a GROUP BY on a certain value. Usually it is better to work with the model objects, since these add an extra layer of logic.
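For example, a minimal sketch of such a GROUP BY case with the models above, counting how many posts each tag is attached to (the n_posts name is just illustrative):
from django.db.models import Count

# .values('text') groups by the tag text; the annotation is computed per group
tag_counts = Tag.objects.values('text').annotate(n_posts=Count('post'))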
I'm trying to use Sentence Transformers and Haystack for document retrieval, focusing on searching documents by metadata other than the document text.
I'm using a dataset of academic publication titles, and I've appended a fake publication year (which I want to use as a search term). From reading around I've combined the columns and just added a separator between the title and publication year, and included the column titles since I thought maybe this could add context. An example input looks like:
title Sparsity-certifying Graph Decompositions [SEP] published year 1980
I have a document store and method of retrieving here, based on this:
document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat",
                                          return_embedding=True,
                                          similarity='cosine')

retriever_faiss = EmbeddingRetriever(document_store_faiss,
                                     embedding_model='all-mpnet-base-v2',
                                     model_format='sentence_transformers')

document_store_faiss.write_documents(df.rename(columns={'combined': 'content'}).to_dict(orient='records'))
document_store_faiss.update_embeddings(retriever=retriever_faiss)
def get_results(query, retriever, n_docs=25):
    return [item.content for item in retriever.retrieve(query, top_k=n_docs)]

q = 'published year 1999'

print('Results: ')
res = get_results(q, retriever_faiss)
for r in res:
    print(r)
I do a check to see if any inputs actually have a publication year matching the search term, but when I look at my search results I'm getting entries with seemingly random published years. I was hoping that at least the results would all have the same published year, since I hoped to do more complicated queries like "published year before 1980".
If anyone could either tell me what I'm doing wrong, or whether I have misunderstood this process / expected results it would be much appreciated.
It sounds like you need metadata filtering rather than placing the year within the query itself. The FAISSDocumentStore doesn't support filtering, so I'd recommend switching to the PineconeDocumentStore, which Haystack introduced in the v1.3 release a few days ago. It supports the strongest filter functionality in the current set of document stores.
You will need to make sure you have the latest version of Haystack installed, and it needs an additional pinecone-client library too:
pip install -U farm-haystack pinecone-client
There's a guide here that may help; it will go something like:
document_store = PineconeDocumentStore(
    api_key="<API_KEY>",  # from https://app.pinecone.io
    environment="us-west1-gcp"
)

retriever = EmbeddingRetriever(
    document_store,
    embedding_model='all-mpnet-base-v2',
    model_format='sentence_transformers'
)
Before you write the documents you need to convert the data so that your text is in content (as you have done above, but there is no need to append the year), and then include the year as a field in a meta dictionary. So you would create a list of dictionaries that look like:
dicts = [
    {'content': 'your text here', 'meta': {'year': 1999}},
    {'content': 'another record text', 'meta': {'year': 1971}},
    ...
]
I don't know the exact format of your df but assuming it is something like:
text                    year
"your text here"        1999
"another record here"   1971
We could write the following to reformat it:
df = df.rename(columns={'text': 'content'}) # you did this already
# create a new 'meta' column that contains {'year': <year>} data
df['meta'] = df['year'].apply(lambda x: {'year': x})
# we don't need the year column anymore so we drop it
df = df.drop(['year'], axis=1)
# now convert into the list of dictionaries format as you did before
dicts = df.to_dict(orient='records')
These dicts replace the ones you were writing from df before, so we would continue like so:
document_store.write_documents(dicts)
document_store.update_embeddings(retriever=retriever)
Now you can query with filters; for example, to search for docs with a published year of 1999 we use the condition "$eq" (equals):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$eq": 1999}}
)
For published before 1980 we can use "$lt" (less than):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$lt": 1980}}
)
I've been trying to perform this operation for hours, but I couldn't figure it out.
Let's say I have a Django project with two classes like these:
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)  # max_length is required by CharField
    address = models.ManyToManyField(to='Address')

class Address(models.Model):
    city = models.CharField(max_length=100)
    zip = models.IntegerField()
So it's just a simple Person having multiple addresses.
Then I create some objects:
addr1 = Address.objects.create(city='first', zip=12345)
addr2 = Address.objects.create(city='second', zip=34555)
addr3 = Address.objects.create(city='third', zip=5435)

person1 = Person.objects.create(name='person_one')
person1.address.set([addr1, addr2])

person2 = Person.objects.create(name='person_two')
person2.address.set([addr1, addr2, addr3])
Now it comes the hard part, I want to make a single query that will return something like that:
result = [
    {
        'name': 'person_one',
        'addresses': [
            {
                'city': 'first',
                'zip': 12345
            },
            {
                'city': 'second',
                'zip': 34555
            }
        ]
    },
    {
        'name': 'person_two',
        'addresses': [
            {
                'city': 'first',
                'zip': 12345
            },
            {
                'city': 'second',
                'zip': 34555
            },
            {
                'city': 'third',
                'zip': 5435
            }
        ]
    }
]
The best I could get was using the ArrayAgg and JSONBAgg aggregates for Django (I'm on PostgreSQL, by the way):
from django.contrib.postgres.aggregates import JSONBAgg, ArrayAgg
result = Person.objects.values(
    'name',
    addresses=JSONBAgg('city')
)
But that's not enough: I can't pull a list of dictionaries out of the query directly as I would like to; I just get a list of values, or something useless using:
addresses=JSONBAgg(('city','zip'))
which returns a dictionary with random keys and the strings I passed as input as values.
Can someone help me out?
Thanks
If you use Postgres, you can do this:
from django.contrib.postgres.expressions import ArraySubquery
from django.db.models import F, OuterRef
from django.db.models.functions import JSONObject

subquery = Address.objects.filter(person=OuterRef("pk")).annotate(
    data=JSONObject(city=F("city"), zip=F("zip"))
).values_list("data")

persons = Person.objects.annotate(addresses=ArraySubquery(subquery))
Your requirement: to aggregate customized JSON objects after a group by (values) in Django.
Currently, to my knowledge, Django does not provide a built-in function to aggregate manually created JSON objects. There are a couple of ways to solve this. The first is to write a custom function, which is quite laborious. The other approach is much easier: use the aggregate functions (ArrayAgg or JSONBAgg) together with RawSQL.
from django.contrib.postgres.aggregates import JSONBAgg, ArrayAgg
from django.db.models.expressions import RawSQL

result = Person.objects.values('name').annotate(
    addresses=JSONBAgg(RawSQL("json_build_object('city', city, 'zip', zip)", ()))
)
I hope this helps.
person.address.all() already gives you a queryset of addresses. From there you can use a list comprehension / model_to_dict to get the values you want, as sketched below.
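A minimal sketch of that idea, assuming the Person/Address models from the question (model_to_dict comes from django.forms.models; prefetch_related is only there to avoid one extra query per person):
from django.forms.models import model_to_dict

result = [
    {
        'name': person.name,
        # each Address instance becomes {'city': ..., 'zip': ...}
        'addresses': [model_to_dict(a, fields=['city', 'zip']) for a in person.address.all()],
    }
    for person in Person.objects.prefetch_related('address')
]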
I'm using the google sheets API to get data which I then pass to Pandas so I can easily work with the data.
Let's say I want to get a sheet with the following data (depicted as a JSON object as tables weren't presented here well)
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35', '12345', '8 Leafy Street']
]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a Gsuite Sheet that looks like the below table, depicted as a key:value data type
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '', '']
}
I will receive the following response
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35']
]
}
Note that the lengths of the two arrays are unequal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regards to point 1. I think I can append x None values to the list where x is equal to length_of_column_heading_array - length_of_data_array. This does however seem ugly and perhaps there is a more elegant way of doing it.
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that each row's list contains the same number of elements as the first row, which contains the column headings.
# response is the response from google sheets API,
# and from the code above. It contains column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new array with the column heading data
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        new_row = row
        # append None to the lists which have a shorter
        # length than the column heading list
        for count in range(1, difference + 1):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, maybe simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad each row in place while iterating (expected_length is the number of column headings):
expected_length = len(raw_values[0])
for row in raw_values:
    row += [''] * (expected_length - len(row))  # += extends each list in place
I have the following models:
class Event(models.Model):
    date = models.DateTimeField()
    event_type = models.ForeignKey('EventType', on_delete=models.CASCADE)

class EventType(models.Model):
    name = models.CharField(max_length=100, unique=True)
I am trying to get a list of all dates, and what event types are available on that date.
Each item in the list would be a dictionary with two fields: date and event_types which would be a list of distinct event types available on that date.
Currently I have come up with a query to get me a list of all distinct dates, but this is only half of what I want to do:
query = Event.objects.all().select_related('event_type')
results = query.distinct('date').order_by('date').values_list('date', flat=True)
Now I can change this slightly to get me a list of all distinct date + event_type combinations:
query = Event.objects.all().select_related('event_type')
results = query.order_by('date').distinct('date', 'event_type').values_list('date', 'event_type__name')
But this will have an entry for each event type within a given date. I need to aggregate a list within each date.
Is there a way I can construct a queryset to do this? If not, how would I do this some other way to get to the same result?
You can perform such an aggregate with the groupby function of itertools. It is a requirement that the elements appear in "chunks" with respect to the "grouper" criteria, but this is the case here, since you use order_by.
We can thus write it like:
from itertools import groupby
from operator import itemgetter
query = (Event.objects.all()
         .select_related('event_type')
         .order_by('date', 'event_type')
         .distinct('date', 'event_type')
         .values_list('date', 'event_type__name'))

result = [
    {'date': k, 'datetypes': [v[1] for v in vs]}
    for k, vs in groupby(query, itemgetter(0))
]
Note that it is better to include 'event_type' in the order_by criteria as well, so that the event types of the same date end up next to each other.
This will result in something like:
[{'date': datetime.date(2018, 5, 19), 'datetypes': ['Famous person died',
'Royal wedding']},
{'date': datetime.date(2018, 5, 24), 'datetypes': ['Famous person died']},
{'date': datetime.date(2011, 5, 25), 'datetypes': ['Important law enforced',
'Referendum']}]
(based on quick Wikipedia scan of the last days in May).
The groupby works in linear time with the number of rows returned.
Tldr of Problem
Frontend is a form that requires a complex lookup with ranges and stuff across several models, given in a dict. Best way to do it?
Explanation
From the view, I receive a dict of the following form (After being processed by something else):
{
    'h_index': {"min": 10, "max": 20},
    'rank': "supreme_overlord",
    'total_citations': {"min": 10, "max": 400},
    'year_began': {"min": 2000},
    'year_end': {"max": 3000},
}
The keys are column names from different models (Right now, 2 separate models, Researcher and ResearchMetrics), and the values are the range / exact value that I want to query.
Example (Above)
Belonging to model Researcher :
rank
year_began
year_end
Belonging to model ResearchMetrics
total_citations
h_index
Researcher has a One to Many relationship with ResearchMetrics
Researcher has a Many to Many relationship with Journals (not mentioned in question)
Ideally: I want to show the researchers who fulfill all the criteria above in a list of list format.
Researcher ID, name, rank, year_began, year_end, total_citations, h_index
[[123, "Thomas", "professor", 2000, 2012, 15, 20],
[ 343 ... ]]
What's the best way to go about solving this problem? (Including changes to form, etc?) I'm not very familiar with the whole form query model thing.
Thank you for your help!
To dynamically perform a query you pass a dict with items 'fieldname__lookuptype': value as **kwargs to Model.objects.filter.
So to filter for rank, year_began and year_end in your example above, you would do this:
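For example, a direct (non-dynamic) version of that call could look like the sketch below; the __gt/__lt lookups mirror the LOOKUP_MAPPING used further down:
qs = Researcher.objects.filter(
    rank='supreme_overlord',
    year_began__gt=2000,
    year_end__lt=3000,
)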
How exactly you do the transformation depends on how variable this incoming dictionary is. An example could be something like this:
filter_in = {
    'h_index': {"min": 10, "max": 20},
    'rank': "supreme_overlord",
    'total_citations': {"min": 10, "max": 400},
    'year_began': {"min": 2000},
    'year_end': {"max": 3000},
}

LOOKUP_MAPPING = {
    'min': 'gt',
    'max': 'lt'
}
# fields that live directly on the Researcher model
RESEARCHER_FIELDS = ['rank', 'year_began', 'year_end']

filter_kwargs = {}
for field in RESEARCHER_FIELDS:
    if field not in filter_in:
        continue
    filter = filter_in[field]
    if isinstance(filter, dict):
        for filter_type, value in filter.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        filter_kwargs[field] = filter
This results in a dictionary like this:
{
    'rank': 'supreme_overlord',
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
Use it like this:
qs = Researcher.objects.filter(**filter_kwargs)
Regarding the fields total_citations and h_index from ResearchMetrics, I assume you want to aggregate the values. So in your example above you want either a sum or an average.
The principle is the same:
from django.db.models import Sum
METRICS_FIELDS = ['total_citations', 'h_index']
annotate_kwargs = {}
for field in METRICS_FIELDS:
    if field not in filter_in:
        continue
    annotated_field = '%s_sum' % field
    annotate_kwargs[annotated_field] = Sum('researchmetric__%s' % field)
    filter = filter_in[field]
    if isinstance(filter, dict):
        for filter_type, value in filter.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (annotated_field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        filter_kwargs[annotated_field] = filter
Now your filter_kwargs look like this:
{
    'h_index_sum__gt': 10,
    'h_index_sum__lt': 20,
    'rank': 'supreme_overlord',
    'total_citations_sum__gt': 10,
    'total_citations_sum__lt': 400,
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
And your annotate_kwargs look like this:
{
    'h_index_sum': Sum('researchmetric__h_index'),
    'total_citations_sum': Sum('researchmetric__total_citations')
}
So your final call looks like this:
Researcher.objects.annotate(**annotate_kwargs).filter(**filter_kwargs)
There are some assumptions in my answer, but I hope you get the general idea.
There is one important point: make sure you properly validate the input, so that only the fields you want the user to filter on can actually be filtered. In my approach this is ensured by hard-coding the field names in RESEARCHER_FIELDS and METRICS_FIELDS.
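A small sketch of that validation step, assuming the RESEARCHER_FIELDS and METRICS_FIELDS lists defined above (the exception type is just illustrative):
ALLOWED_FIELDS = set(RESEARCHER_FIELDS) | set(METRICS_FIELDS)

unknown = set(filter_in) - ALLOWED_FIELDS
if unknown:
    # reject anything that is not an explicitly whitelisted field
    raise ValueError('Unsupported filter fields: %s' % ', '.join(sorted(unknown)))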