I'm trying to use Sentence Transformers and Haystack for document retrieval, focusing on searching documents by metadata besides the document text.
I'm using a dataset of academic publication titles, to which I've appended a fake publication year (which I want to use as a search term). From reading around, I've combined the columns with a separator between the title and publication year, and included the column titles since I thought this might add context. An example input looks like:
title Sparsity-certifying Graph Decompositions [SEP] published year 1980
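For reference, the combined column is built along these lines (a sketch; assuming the underlying columns are named title and year, which simplifies my actual data):
import pandas as pd

# combine title and year into a single searchable string
df['combined'] = (
    'title ' + df['title'] + ' [SEP] published year ' + df['year'].astype(str)
)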
I have a document store and a method of retrieving documents, based on this:
document_store_faiss = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    return_embedding=True,
    similarity='cosine'
)
retriever_faiss = EmbeddingRetriever(
    document_store_faiss,
    embedding_model='all-mpnet-base-v2',
    model_format='sentence_transformers'
)
document_store_faiss.write_documents(
    df.rename(columns={'combined': 'content'}).to_dict(orient='records')
)
document_store_faiss.update_embeddings(retriever=retriever_faiss)

def get_results(query, retriever, n_docs=25):
    return [item.content for item in retriever.retrieve(query, top_k=n_docs)]

q = 'published year 1999'
print('Results:')
res = get_results(q, retriever_faiss)
for r in res:
    print(r)
I do a check to see if any inputs actually have a publication year matching the search term, but when I look at my search results I'm getting entries with seemingly random published years. I was hoping that at the very least the results would all have the same published year, since I eventually want to do more complicated queries like "published year before 1980".
If anyone could either tell me what I'm doing wrong, or whether I have misunderstood this process / expected results it would be much appreciated.
It sounds like you need metadata filtering rather than placing the year within the query itself. The FAISSDocumentStore doesn't support filtering, so I'd recommend switching to the PineconeDocumentStore, which Haystack introduced in the v1.3 release a few days ago. It supports the strongest filter functionality in the current set of document stores.
You will need to make sure you have the latest version of Haystack installed, and it needs an additional pinecone-client library too:
pip install -U farm-haystack pinecone-client
There's a guide here that may help; it will go something like:
document_store = PineconeDocumentStore(
    api_key="<API_KEY>",  # from https://app.pinecone.io
    environment="us-west1-gcp"
)
retriever = EmbeddingRetriever(
    document_store,
    embedding_model='all-mpnet-base-v2',
    model_format='sentence_transformers'
)
Before you write the documents you need to convert the data so that your text goes in content (as you have done above, but there's no need to prepend the year), and then include the year as a field in a meta dictionary. So you would create a list of dictionaries that look like:
dicts = [
    {'content': 'your text here', 'meta': {'year': 1999}},
    {'content': 'another record text', 'meta': {'year': 1971}},
    ...
]
I don't know the exact format of your df, but assuming it is something like:

| text | year |
| --- | --- |
| "your text here" | 1999 |
| "another record here" | 1971 |
We could write the following to reformat it:
df = df.rename(columns={'text': 'content'}) # you did this already
# create a new 'meta' column that contains {'year': <year>} data
df['meta'] = df['year'].apply(lambda x: {'year': x})
# we don't need the year column anymore so we drop it
df = df.drop(['year'], axis=1)
# now convert into the list of dictionaries format as you did before
dicts = df.to_dict(orient='records')
This data replaces the df dictionaries you were writing before, so we continue like so:
document_store.write_documents(dicts)
document_store.update_embeddings(retriever=retriever)
Now you can query with filters. For example, to search for docs with a published year of 1999 we use the condition "$eq" (equals):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$eq": 1999}}
)
For published before 1980 we can use "$lt" (less than):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$lt": 1980}}
)
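A range query should also be possible by combining comparison operators on the same field (a sketch, assuming the standard Haystack comparison operators; "$gte" is greater-than-or-equal):
docs = retriever.retrieve(
    "some query here",
    top_k=25,
    filters={"year": {"$gte": 1970, "$lt": 1980}}
)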
I don't even know how to approach it as it feels too complex for my level.
Imagine courier tracking numbers, where I am receiving some duplicated updates from an upstream system in the following format:
See the attached image, or this small piece of code that creates such a table:
import pandas as pd
incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Previous': ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next': ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})
incoming_df
incoming_df
Obviously, tracking ID 24345-S2 (green arrow) is a duplicate of 24345 (red arrow); however, it is not a full duplicate but newer, updated location information (with history) for the parcel. How do I delete the old line 24345 and keep the new line 24345-S2 in the data set?
The length of tracking ID can be from 4 to 20 chars but '-S2' is always helpfully appended.
Thank you!
Edit: New solution:
# extract the IDs that have a '-S2' duplicate marker
duplicates = df['Tracking ID'].str.extract(r'(.+)-S2').dropna()
# remove the older entry if one exists
df = df[~df['Tracking ID'].isin(duplicates[0].unique())]
If the 1234-S2 entry is always lower in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
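Note that this approach rewrites the Tracking ID column, so the -S2 suffix is lost. If you want to keep the original IDs, one option (a sketch) is to dedupe on a temporary stripped key instead:
# dedupe on a stripped key, keeping the original Tracking ID values
incoming_df['key'] = incoming_df['Tracking ID'].str.split('-').str[0]
incoming_df = incoming_df.drop_duplicates(subset='key', keep='last').drop(columns='key')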
I'm using the Google Sheets API to get data, which I then pass to Pandas so I can work with it easily.
Let's say I want to get a sheet with the following data (depicted as a JSON object, since tables aren't rendered well here):
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35', '12345', '8 Leafy Street']
    ]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a GSuite Sheet that looks like the below table, depicted as key:value data:
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '', '']
}
I will receive the following response
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35']
    ]
}
Note that the lengths of the two arrays are not equal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this:
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regards to point 1, I think I can append x None values to each list, where x is equal to length_of_column_heading_array - length_of_data_array. This does however seem ugly, and perhaps there is a more elegant way of doing it.
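Something like this sketch is what I have in mind (row here is a hypothetical short data row, and values[0] is the column heading row from my code above):
# pad one short row with None until it matches the heading row's length
difference = len(values[0]) - len(row)
row += [None] * difference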
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that each row's list contains the same number of elements as the first row, which contains the column headings.
# response is the response from the google sheets API
# in the code above. It contains the column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data:
def pad_data(data: list):
    # build a new list starting with the column heading row
    return_data = [data[0]]
    expected_length = len(data[0])
    for row in data[1:]:
        # append None to rows that are shorter than the heading row
        difference = expected_length - len(row)
        return_data.append(row + [None] * difference)
    return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
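For example, using the tab names from the batchGet call above:
df = extract_case_data(response, 'tab1')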
Same idea, maybe a simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad while iterating. Note that rebinding row inside a for loop would not modify raw_values, so build a new list instead:
expected_length = len(raw_values[0])  # the header row defines the expected width
padded_values = [row + [''] * (expected_length - len(row)) for row in raw_values]
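The padded rows can then go straight into a DataFrame as before:
df = pd.DataFrame(padded_values[1:], columns=padded_values[0])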
I have models similar to the below:
class Tag(models.Model):
    text = models.CharField(max_length=30)

class Post(models.Model):
    title = models.CharField(max_length=30)
    tags = models.ManyToManyField(Tag)
A Post can have many Tags and Tags can be associated with many Posts.
What I need is to get a list of all posts along with all the tags associated with each post. I then create a Pandas DataFrame from that data. Here is how I am currently doing it:
qs = Post.objects.all().prefetch_related('tags')
tag_df = pd.DataFrame(columns=["post_id", "tags"])
for q in qs:
    tag_df = tag_df.append(
        {
            "post_id": q.pk,
            "tags": list(q.tags.all().values_list("text", flat=True)),
        },
        ignore_index=True,
    )
post_df = pd.DataFrame(qs.values("id", "title"))
final_df = post_df.merge(tag_df, left_on="id", right_on="post_id")
The result is correct in terms of the data I require. The problem is how incredibly inefficient it is and the number of queries that run even though I'm using prefetch_related. It appears that a query is hitting the database for each iteration of the loop.
Is there a better, more efficient way to do this (possibly without loops)? All I need in the end is a dataframe that contains all the posts along with a column which has a list of the tags for each post.
By using .values_list(..) you make an extra query on each iteration, so that is not very efficient. You can simply use the already prefetched Tag objects and obtain their .text attributes:
qs = Post.objects.prefetch_related('tags')
tag_df = pd.DataFrame(columns=['post_id', 'tags'])
for q in qs:
    tag_df = tag_df.append(
        {
            'post_id': q.pk,
            'tags': [t.text for t in q.tags.all()],
        },
        ignore_index=True,
    )
post_df = pd.DataFrame(qs.values('id', 'title'))
final_df = post_df.merge(tag_df, left_on='id', right_on='post_id')
It might however be more efficient to first make a list of dictionaries, and then load these into a dataframe once:
qs = Post.objects.prefetch_related('tags')
data = [
    {'id': q.pk, 'title': q.title, 'tags': [t.text for t in q.tags.all()]}
    for q in qs
]
final_df = pd.DataFrame(data, columns=['id', 'title', 'tags'])
Note that using .values(..) or .values_list(..) is usually not a good idea; it is only appropriate in certain cases, such as performing a GROUP BY on a certain value (see the sketch below). Usually it is better to make use of the model objects, since these add an extra layer of logic.
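For completeness, a minimal sketch of the GROUP BY case where .values(..) is appropriate (counting posts per title here is just an illustration):
from django.db.models import Count

# .values() followed by .annotate() performs a GROUP BY on title
posts_per_title = Post.objects.values('title').annotate(n_posts=Count('id'))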
Trying to output just the employee data (empFirst, empLast, empSalary, empRoles) to a Bottle project. I just want the values, not the keys. How would I go about this? It feels like I've tried everything but can't get at the data I need!
My query
emp_curs = connection.coll.find({}, {"_id": False, "employee.empFirst": True})
dept_list = list(emp_curs)
(just playing with the first name for now until it's working)
My loop
% for d in emp_list:
%   for i in d:
    <tr>
        <td>{{d[i]}}</td>
        <td>{{d[i]}}</td>
        <td>{{d[i]}}</td>
        <td>{{d[i]}}</td>
    </tr>
%   end
% end
That's the closest I've gotten. :\
I'm looking to take all the data and place it in a table.
Sorry, here's some sample data:
[
    {
        "deptCode": "ACCT",
        "deptName": "Accounting",
        "deptBudget": 200000,
        "employee": [
            {
                "empFirst": "Marsha",
                "empLast": "Bonavoochi",
                "empSalary": 59000
            },
            {
                "empFirst": "Roberto",
                "empLast": "Acostaletti",
                "empSalary": 85000,
                "empRoles": ["Manager"]
            },
            {
                "empFirst": "Dini",
                "empLast": "Cappelletti",
                "empSalary": 50500
            }
        ]
    }
]
It looks like you are stopping just one layer too early within your nested list of dictionaries. This should get you all the applicable values for the employee data:
for department in department_list:
    for employee in department["employee"]:
        for value in employee.values():
            print(value)  # or whatever operation you want; adding to the table in your case
Looks like you have adding to the table working as you want, so that should work for you. Based on the structure of your sample data, I'm assuming there will be multiple departments to pull this data from (hence me starting with department_list).
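If you prefer to keep the iteration in the Bottle template itself, here is a sketch along the same lines (assuming the view passes the department list in as dept_list; .get covers employees without the optional empRoles field):
% for dept in dept_list:
%   for emp in dept['employee']:
    <tr>
        <td>{{emp.get('empFirst', '')}}</td>
        <td>{{emp.get('empLast', '')}}</td>
        <td>{{emp.get('empSalary', '')}}</td>
        <td>{{emp.get('empRoles', [])}}</td>
    </tr>
%   end
% end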
TL;DR of problem
Frontend is a form that requires a complex lookup with ranges and stuff across several models, given in a dict. Best way to do it?
Explanation
From the view, I receive a dict of the following form (After being processed by something else):
{
    'h_index': {"min": 10, "max": 20},
    'rank': "supreme_overlord",
    'total_citations': {"min": 10, "max": 400},
    'year_began': {"min": 2000},
    'year_end': {"max": 3000},
}
The keys are column names from different models (Right now, 2 separate models, Researcher and ResearchMetrics), and the values are the range / exact value that I want to query.
Example (Above)
Belonging to model Researcher:
rank
year_began
year_end
Belonging to model ResearchMetrics
total_citations
h_index
Researcher has a One to Many relationship with ResearchMetrics
Researcher has a Many to Many relationship with Journals (not mentioned in question)
Ideally, I want to show the researchers who fulfill all the criteria above in a list of lists format:
# Researcher ID, name, rank, year_began, year_end, total_citations, h_index
[[123, "Thomas", "professor", 2000, 2012, 15, 20],
 [343, ...]]
What's the best way to go about solving this problem? (Including changes to form, etc?) I'm not very familiar with the whole form query model thing.
Thank you for your help!
To dynamically perform a query you pass a dict with items 'fieldname__lookuptype': value as **kwargs to Model.objects.filter.
So to filter for rank, year_began and year_end in your example above, you would do this:
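Researcher.objects.filter(
    rank='supreme_overlord',
    year_began__gt=2000,
    year_end__lt=3000
)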
How exactly you do the transformation depends on how variable this incoming dictionary is. An example could be something like this:
filter_in = {
    'h_index': {"min": 10, "max": 20},
    'rank': "supreme_overlord",
    'total_citations': {"min": 10, "max": 400},
    'year_began': {"min": 2000},
    'year_end': {"max": 3000},
}
LOOKUP_MAPPING = {
    'min': 'gt',
    'max': 'lt',
}
# hard-coded list of the fields that live on the Researcher model itself
RESEARCHER_FIELDS = ['rank', 'year_began', 'year_end']

filter_kwargs = {}
for field in RESEARCHER_FIELDS:
    if field not in filter_in:
        continue
    filter = filter_in[field]
    if isinstance(filter, dict):
        for filter_type, value in filter.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        filter_kwargs[field] = filter
This results in a dictionary like this:
{
    'rank': 'supreme_overlord',
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
Use it like this:
qs = Researcher.objects.filter(**filter_kwargs)
Regarding the fields total_citations and h_index from ResearchMetrics, I assume you want to aggregate the values. So in your example above you want either a sum or an average.
The principle is the same:
from django.db.models import Sum

METRICS_FIELDS = ['total_citations', 'h_index']

annotate_kwargs = {}
for field in METRICS_FIELDS:
    if field not in filter_in:
        continue
    annotated_field = '%s_sum' % field
    annotate_kwargs[annotated_field] = Sum('researchmetric__%s' % field)
    filter = filter_in[field]
    if isinstance(filter, dict):
        for filter_type, value in filter.items():
            lookup_type = LOOKUP_MAPPING[filter_type]
            lookup = '%s__%s' % (annotated_field, lookup_type)
            filter_kwargs[lookup] = value
    else:
        # an exact value must also be matched against the annotated field
        filter_kwargs[annotated_field] = filter
Now your filter_kwargs look like this:
{
    'h_index_sum__gt': 10,
    'h_index_sum__lt': 20,
    'rank': 'supreme_overlord',
    'total_citations_sum__gt': 10,
    'total_citations_sum__lt': 400,
    'year_began__gt': 2000,
    'year_end__lt': 3000
}
And your annotate_kwargs look like this:
{
    'h_index_sum': Sum('researchmetric__h_index'),
    'total_citations_sum': Sum('researchmetric__total_citations')
}
So your final call looks like this:
Researcher.objects.annotate(**annotate_kwargs).filter(**filter_kwargs)
There are some assumptions in my answer, but I hope you get the general idea.
There is one important point: make sure you properly validate the input, so that only the fields you want the user to filter on can actually be filtered. In my approach, this is ensured by hard-coding the field names in RESEARCHER_FIELDS and METRICS_FIELDS.
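A minimal sketch of such a guard, reusing the constants from above (the exception type is just an example):
ALLOWED_FIELDS = set(RESEARCHER_FIELDS) | set(METRICS_FIELDS)

unknown = set(filter_in) - ALLOWED_FIELDS
if unknown:
    # refuse to build a query from unexpected keys
    raise ValueError('Cannot filter on: %s' % ', '.join(sorted(unknown)))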