I've been using the standard Python Elasticsearch client to make single requests in the following format:
es.search(index='my_index', q=query, size=5, search_type='dfs_query_then_fetch')
I now want to make queries in batch for multiple strings q.
I've seen this question explaining how to use the msearch() functionality to do queries in batch. However, msearch requires the full JSON-formatted request body for each request. I'm not sure which parameters in the request body correspond to the q parameter from search(), or to size or search_type, which seem to be API shortcuts specific to the single-request search().
How can I use msearch but specify q, size, and search_type?
I read through the API and figured out how to batch simple search queries:
from typing import List
from elasticsearch import Elasticsearch
import json

def msearch(
    es: Elasticsearch,
    max_hits: int,
    query_strings: List[str],
    index: str
):
    search_arr = []
    for q in query_strings:
        # Header line: the index this search runs against.
        search_arr.append({'index': index})
        # Body line: the query itself plus how many hits to return.
        search_arr.append(
            {
                "query": {
                    "query_string": {
                        "query": q
                    }
                },
                'size': max_hits
            })
    request = ' \n'.join([json.dumps(x) for x in search_arr])
    resp = es.msearch(body=request)
    return resp

msearch(es, query_strings=['query 1', 'query 2'], max_hits=1, index='my_index')
EDIT: For my use case, I made one more improvement, because I didn't want to return the entire document in the result; for my purpose, I just needed the document ID and its score.
So the final search request object part looked like this, including the '_source': False bit:
search_arr.append(
    {
        # Queries `q` using Lucene syntax.
        "query": {
            "query_string": {
                "query": q
            }
        },
        # Don't return the full profile string, etc. with the result.
        # We just want the ID and the score.
        '_source': False,
        # Only return `max_hits` documents.
        'size': max_hits
    }
)
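To cover the search_type part of the original question: in the msearch body, per-request options such as search_type go on the header line (the same line that carries the index), not in the query body. A minimal sketch against the search_arr built above, reusing the dfs_query_then_fetch value from the single-request call:

search_arr.append({
    # Header line: index plus per-request options such as search_type.
    'index': index,
    'search_type': 'dfs_query_then_fetch'
})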
Related
I have some Python code that retrieves the rows in a Notion database using the notion-client library. The code does manage to retrieve all the rows, but the order is wrong. I looked at the Sort object from the API reference, but was unable to figure out how to use it to return the rows in the exact order in which they're displayed on notion.so. Here's the snippet in question:
from notion_client import Client

notion = Client(auth=NOTION_API_TOKEN)
result = notion.databases.query(database_id='...')
for row in result['results']:
    title = row['properties']['NAME_OF_PROPERTY']['title']
    if len(title) == 0:
        print('')
    else:
        print(title[0]['plain_text'])
What am I missing?
The Notion API does not support views in the current version, so the results will not necessarily match the order you see on notion.so unless you have applied a sort or filter that you can also apply via the API.
This works as well, following the example in their documentation:
const response = await notion.databases.query({
  database_id: databaseId,
  filter: {
    or: [
      {
        property: 'In stock',
        checkbox: {
          equals: true,
        },
      },
      {
        property: 'Cost of next trip',
        number: {
          greater_than_or_equal_to: 2,
        },
      },
    ],
  },
  sorts: [
    {
      property: 'Last ordered',
      direction: 'ascending',
    },
  ],
});
Use the sorts argument to notion.databases.query(). This argument is a list of sort specifications, which are dictionaries.
result = notion.databases.query(
    database_id='df4dfb3f-f36f-462d-ad6e-1ef29f1867eb',
    sorts=[{"property": "NAME_OF_PROPERTY", "direction": "ascending"}]
)
You can put multiple sort specifications in the list; the later ones are used to break ties between rows that are equal in the preceding properties.
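For example, a sketch with a hypothetical second property added only to break ties:

result = notion.databases.query(
    database_id='df4dfb3f-f36f-462d-ad6e-1ef29f1867eb',
    sorts=[
        {"property": "NAME_OF_PROPERTY", "direction": "ascending"},
        # Hypothetical tie-breaking property; replace with one of your own.
        {"property": "ANOTHER_PROPERTY", "direction": "descending"},
    ]
)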
I have a list of around 300,000 Wikidata IDs (e.g. Q1347065, Q731635, etc.) in an ndjson file, as
{"Q1347065": ""}
{"Q731635": ""}
{"Q191789": ""} ... etc
What I would like is to get the label of each ID and build a dictionary of key-value pairs, such as
{"Q1347065":"epiglottitis", "Q731635":"Mount Vernon", ...} etc.
What I've used before the list of ids got so large, was a Wikidata python library (https://pypi.org/project/Wikidata/)
from wikidata.client import Client
import ndjson
import json

client = Client()

with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)
    l = {}
    for d in claims:
        l.update(d)
    for key in l:
        v = client.get(key)
        l[key] = str(v.label)
    json.dumps(l, out)
But it is too slow (around 15 hours for 1000 ids). Is there another way to achieve this that is faster than what I have been doing?
Before answering: I'm not sure what you mean by json.dumps(l, out); I'm assuming you want json.dump(l, out) instead.
My answer consists in sending the following SPARQL query to the Wikidata Query Service:
SELECT ?item ?itemLabel WHERE {
  VALUES ?item { wd:Q1347065 wd:Q731635 wd:Q105492052 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
which asks for multiple labels in a single request.
This speeds up execution considerably, because the bottleneck is the number of connections, and with this method the id -> label mapping is done entirely on the server side.
import json
import ndjson
import re
import requests

def wikidata_query(query):
    url = 'https://query.wikidata.org/sparql'
    try:
        r = requests.get(url, params={'format': 'json', 'query': query})
        return r.json()['results']['bindings']
    except json.JSONDecodeError as e:
        raise Exception('Invalid query')

with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)
    l = {}
    for d in claims:
        l.update(d)

    item_ids = l.keys()
    sparql_values = list(map(lambda id: "wd:" + id, item_ids))
    item2label = wikidata_query('''
        SELECT ?item ?itemLabel WHERE {
            VALUES ?item { %s }
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }''' % " ".join(sparql_values))

    for result in item2label:
        item = re.sub(r".*[#/\\]", "", result['item']['value'])
        label = result['itemLabel']['value']
        l[item] = label

    json.dump(l, out)
I guess you cannot do a single query for all 300,000 items, but you can easily find the maximum number of IDs the service accepts per query and split your original ID list into chunks of that size.
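As a rough sketch of that splitting, assuming a batch size of 500 IDs per query (the real limit depends on the query service and on URL length), and reusing the wikidata_query helper and the l dictionary from above:

CHUNK_SIZE = 500  # assumed batch size; tune to what the endpoint actually accepts

item_ids = list(l.keys())
for i in range(0, len(item_ids), CHUNK_SIZE):
    chunk = item_ids[i:i + CHUNK_SIZE]
    sparql_values = " ".join("wd:" + item_id for item_id in chunk)
    item2label = wikidata_query('''
        SELECT ?item ?itemLabel WHERE {
            VALUES ?item { %s }
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }''' % sparql_values)
    for result in item2label:
        item = re.sub(r".*[#/\\]", "", result['item']['value'])
        l[item] = result['itemLabel']['value']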
I want to store Document Vectors in an Elasticsearch index in order to calculate document similarity. I'm using the Python client for Elasticsearch 7.8.0.
I have a (dummy) Elasticsearch index with the following mapping:
mapping = {
    "mappings": {
        "properties": {
            "title_vector": {
                "type": "dense_vector",
                "dims": 3
            }
        }
    }
}

es.indices.create(index="test_vector", body=mapping)
And I stored a bunch of vectors in the following way:
vectors = [[1,2,3],[2,2,2],[1,2,2],[2,2,2],[4,5,6],[1,1,1]]
for i, v in enumerate(vectors):
    doc = {"title_vector": v}
    es.create("test_vector", id=i, body=doc)
According to the documentation, my query to get the most similar documents, should be as follows:
doc = {
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.queryVector, 'title_vector') + 1.0",
                "params": {
                    "queryVector": [1,1,1]
                }
            }
        }
    }
}

es.search("test_vector", body=doc)
But I'm getting
TypeError: search() got multiple values for argument 'body'
It seems more like a Python error than an Elastic error. But I can't really find the cause of the error and how I should structure my query differently in order to solve it.
Thanks in advance!
Edit: added Elasticsearch version
You are correct, it is a Python error. Below is how es.search is defined, according to this link:
search(body=None, index=None, params=None, headers=None)
As you can see, the first parameter is body.
Notice that in your es.search call you haven't named the first argument (body, index, params, or headers), so Python interprets the positional "test_vector" as the value for body, per the method declaration above. Since you also pass body=doc explicitly, body ends up with two values.
Just pass index="test_vector" instead of the bare "test_vector" as the first argument, and that should do the trick.
es.search(index="test_vector", body=doc)
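For completeness, once the call goes through, the most similar documents can be read off the response; a minimal sketch using the query above:

resp = es.search(index="test_vector", body=doc)
for hit in resp["hits"]["hits"]:
    # _score is cosineSimilarity(...) + 1.0, so it ranges from 0 to 2.
    print(hit["_id"], hit["_score"])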
Hope it helps!
I have created a Flask app and wanted to test it. In a single endpoint, I would like to post a multipart request, which includes a file and a complex JSON object. I thought at first of using werkzeug EnvironBuilder for this task, as it seems to provide a quite automated approach, handling content types, etc. My snippet of code for preparing the request is the following:
import io
import json
import os

from werkzeug.test import EnvironBuilder

# client is an instance of FlaskClient produced using a pytest fixture and the test_client method
def _post(endpoint, file_path=None, serialized_message=None):
    with open(file_path, 'rb') as fin:
        fil = io.BytesIO(fin.read())
        file_name = file_path.split(os.sep)[-1]
        builder = EnvironBuilder(path='/' + endpoint,
                                 method='POST',
                                 data=json.loads(serialized_message),
                                 content_type="application/json")
        builder.files[file_name] = fil
        result = client.open(builder, buffered=True)
        return result
This failed with the following error:
    def _add_file_from_data(self, key, value):
        """Called in the EnvironBuilder to add files from the data dict."""
        if isinstance(value, tuple):
            self.files.add_file(key, *value)
        elif isinstance(value, dict):
            from warnings import warn
            warn(DeprecationWarning('it\'s no longer possible to pass dicts '
                                    'as `data`. Use tuples or FileStorage '
                                    'objects instead'), stacklevel=2)
            value = dict(value)
            mimetype = value.pop('mimetype', None)
            if mimetype is not None:
                value['content_type'] = mimetype
>           self.files.add_file(key, **value)
E           TypeError: add_file() got an unexpected keyword argument 'globalServiceOptionId'
Here globalServiceOptionId is a key of a nested dictionary inside the dictionary I am posting. I have some thoughts on bypassing this problem by converting the inner dictionaries to JSON strings, but I would like something more concrete as an answer, as I do not want the representation of the request to differ between testing and normal operation. Thank you.
Update 1
The form of the passed dictionary doesn't really matter, as long as it has nested dictionaries inside it. The JSON used in this example is:
{
    "attachments": [],
    "Ids": [],
    "globalServiceOptions": [{
        "globalServiceOptionId": {
            "id": 2,
            "agentServiceId": {
                "id": 2
            },
            "serviceOptionName": "Time",
            "value": "T_last",
            "required": false,
            "defaultValue": "T_last",
            "description": "UTC Timestamp",
            "serviceOptionType": "TIME"
        },
        "name": "Time",
        "value": null
    }]
}
Update 2
I tested another snippet:
def _post(endpoint, file_path=None, serialized_message=None):
    with open(file_path, 'rb') as fin:
        fil = io.BytesIO(fin.read())
        files = {
            'file': (file_path, fil, 'application/octet-stream')
        }
        for key, item in json.loads(serialized_message).items():
            files[key] = (None, json.dumps(item), 'application/json')
        builder = EnvironBuilder(path='/' + endpoint,
                                 method='POST', data=files,
                                 )
        result = client.open(builder, buffered=True)
        return result
Although this runs without errors, Flask recognizes (as expected) the incoming JSON parts as files, which again requires different handling during testing than during normal operation.
I ran into a similar issue, and what ended up working for me was changing the data approach to exclude nested dicts. Taking your sample JSON, doing the following should allow it to clear the EnvironBuilder:
data_json = {
    "attachments": [],
    "Ids": [],
    "globalServiceOptions": [json.dumps({  # Dump all nested JSON to a string representation
        "globalServiceOptionId": {
            "id": 2,
            "agentServiceId": {
                "id": 2
            },
            "serviceOptionName": "Time",
            "value": "T_last",
            "required": False,
            "defaultValue": "T_last",
            "description": "UTC Timestamp",
            "serviceOptionType": "TIME"
        },
        "name": "Time",
        "value": None
    })]
}
builder = EnvironBuilder(path='/' + endpoint,
                         method='POST',
                         data=data_json,
                         content_type="application/json")
Taking the approach above still allowed the nested dict/JSON to be passed appropriately while clearing the werkzeug limitation.
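One side effect of this approach is that the nested part now travels as a plain JSON string rather than as a dict, so whatever reads it on the server (or in the test assertions) has to decode it again; a minimal sketch of that step, with a hypothetical variable name and without assuming exactly where the field shows up in the request:

# raw_value is however the view obtains the 'globalServiceOptions' field,
# e.g. from the parsed form data or the request body (hypothetical).
global_service_options = [json.loads(raw_value)]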
I am trying to automate some queries to an API using Python. The problem is that the request needs to be created in a special way, and I just can't make it work. This is the part of the string I am having problems creating:
payload = "{\n \"filter\": {\n \"name\":[\"name1\", \"name2\"]\n }"
where name1 and name2 are variable and come from a list. The way I tried to do it was to first write a function to create the
[\"name1\", \"name2\"]
This is the function
def create_string(list_of_names):
    # Creates the string of line items we want data from
    start = '[\\"%s\\"' % list_of_names[0]
    for i in list_of_names[1:]:
        start += ', \\"%s\\"' % (i)
    start += "]"
    return start
list_of_names = ['name1', 'name2']
And then just using the %s part to add it into the string.
payload = "{\n \"filter\": {\n \"name\":%s\n }" % create_string(list_of_names)
This doesn't work, and I think it has something to do with how the \ is used in Python.
The create_string function produces different output depending on whether I print it or not.
a = create_string(list_of_names)
print(a)
Creates the string I need to pass in using %s
[\"name1\", \"name2\", \"name3\"]
while just evaluating a outputs
'[\\"name1\\", \\"name2\\", \\"name3\\"]'
So my problem is then how to pass the print(a) part into the payload string. Does anyone have some sort of solution to this?
Instead of creating your payload by hand, first create a Python dictionary and use the json module to convert it to a string:
import json

payload = {"filter": {"name": list_of_names}}
payload = json.dumps(payload)
or with your more complex dictionary:
payload = {
    "filter": {
        "date": "pastThirtyDays",
        "lineitem": {
            "buyType": "RTB",
            "name": list_of_names,
        }
    },
    "metrics": ["cost"],
    "dimensions": ["date", "lineItem"],
}
payload = json.dumps(payload)
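As a usage sketch (assuming the requests library and a placeholder URL), the serialized string can then be sent as the request body, or you can skip json.dumps entirely and let requests encode the dict for you:

import requests

url = "https://api.example.com/query"  # hypothetical endpoint

# Either send the pre-serialized string...
resp = requests.post(url, data=payload, headers={"Content-Type": "application/json"})

# ...or pass the dict and let requests serialize it as JSON.
resp = requests.post(url, json={"filter": {"name": list_of_names}})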