How to delete documents from Elasticsearch - python

I can't find any example of deleting documents from Elasticsearch in Python. What I've found so far are the definitions of the delete and delete_by_query functions, but for some reason the documentation does not provide even a microscopic example of using them. The bare list of parameters doesn't tell me much if I don't know how to feed them into the function call correctly. So, let's say I've just inserted one new doc like so:
doc = {'name':'Jacobian'}
db.index(index="reestr",doc_type="some_type",body=doc)
Does anyone know how I can now delete this document using delete and delete_by_query?

Since you are not giving a document id while indexing your document, you have to get the auto-generated id from the return value and delete by that id. Or you can define the id yourself; try the following:
db.index(index="reestr",doc_type="some_type",id=1919, body=doc)
db.delete(index="reestr",doc_type="some_type",id=1919)
Otherwise, you need to look at the return value:
r = db.index(index="reestr",doc_type="some_type", body=doc)
# r = {u'_type': u'some_type', u'_id': u'AU36zuFq-fzpr_HkJSkT', u'created': True, u'_version': 1, u'_index': u'reestr'}
db.delete(index="reestr",doc_type="some_type",id=r['_id'])
Here is another example using delete_by_query. Say you have added several documents with name='Jacobian'; the following deletes every document matching it (note that q expects the Lucene query-string syntax):
db.delete_by_query(index='reestr', doc_type='some_type', q='name:Jacobian')
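If you prefer the query DSL over the query-string syntax, the same delete can (depending on your client version) be expressed with a request body; a minimal sketch, assuming a client that still accepts doc_type:
db.delete_by_query(
    index="reestr",
    doc_type="some_type",
    body={"query": {"match": {"name": "Jacobian"}}})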

The Delete-By-Query API was removed from the ES core in version 2.0 for several reasons; the functionality became a plugin instead. You can find more details here:
Why Delete-By-Query is a plugin
Delete By Query Plugin
Because I didn't want to add another dependency (I need this to run in a docker image later), I wrote my own function to solve this problem. My solution is to search for all documents with the specified index and type. After that I remove them using the Bulk API:
import elasticsearch

def delete_es_type(es, index, type_):
    try:
        count = es.count(index, type_)['count']
        response = es.search(
            index=index,
            filter_path=["hits.hits._id"],
            body={"size": count, "query": {"filtered": {"filter": {
                "type": {"value": type_}}}}})
        ids = [x["_id"] for x in response["hits"]["hits"]]
        if len(ids) == 0:
            return
        bulk_body = [
            '{{"delete": {{"_index": "{}", "_type": "{}", "_id": "{}"}}}}'
            .format(index, type_, x) for x in ids]
        # The bulk body is newline-delimited and must end with a newline
        es.bulk('\n'.join(bulk_body) + '\n')
        # es.indices.flush_synced([index])
    except elasticsearch.exceptions.TransportError as ex:
        print("Elasticsearch error: " + ex.error)
        raise ex
I hope that helps future googlers ;)
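Not from the original answer, but if a helper from the same client library is acceptable, the id-by-id bulk body above can also be built with elasticsearch.helpers.bulk; a sketch under the same assumptions (an older cluster where doc types still exist):
from elasticsearch import helpers

def delete_ids_bulk(es, index, type_, ids):
    # One delete action per document id, sent in a single bulk request
    actions = ({"_op_type": "delete", "_index": index, "_type": type_, "_id": doc_id}
               for doc_id in ids)
    helpers.bulk(es, actions)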

One can also do something like this:
from pprint import pprint

# es is assumed to be an existing Elasticsearch client instance
def delete_by_ids(index, ids):
    query = {"query": {"terms": {"_id": ids}}}
    res = es.delete_by_query(index=index, body=query)
    pprint(res)

# Pass the index and the list of ids that you want to delete.
delete_by_ids('my_index', ['test1', 'test2', 'test3'])
This performs the delete operation on the whole batch of ids in a single request.

I came across this post while searching for a way to delete a document on ElasticSearch using their Python library, ElasticSearch-DSL.
In case it helps anyone, this part of their documentation describes the document lifecycle.
https://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html#document-life-cycle
And at the end of it, it details how to delete a document:
To delete a document just call its delete method:
first = Post.get(id=42)
first.delete()
Hope that helps 🤞
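For completeness, a minimal sketch of what such a Post document class might look like (the fields and index name here are made up, and this assumes a recent elasticsearch-dsl where Document replaced DocType):
from elasticsearch_dsl import Document, Text, Date, connections

connections.create_connection(hosts=["localhost"])

class Post(Document):
    title = Text()
    published = Date()

    class Index:
        name = "posts"  # hypothetical index name

post = Post.get(id=42)  # fetch by id, then delete it
post.delete()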

Related

How to add document for search in marqo

I recently started using the marqo library and I am trying to add a document so that marqo can search and return the relevant part of it, but I keep getting an error when I run the code.
I used the add_documents() method and passed the document as a string, but it returns an error. Here is what my code looks like:
import marqo
DOCUMENT = 'the document'
mq = marqo.Client(url='http://localhost:8882')
mq.index("my-first-index").add_documents(DOCUMENT)
and when I run it I get a MarqoWebError.
You are getting the error because the add_documents() method takes a list of Python dictionaries as an argument, not a string, so you should pass the document as the value of whatever key you assign to it. It is also advisable to add a title and an id for later referencing. Here is what I mean:
mq.index("my-first-index").add_documents([
{
"Title": the_title_of_your_document,
"Description": your_document,
"_id": your_id,
}]
)
The id can be any string of your choice. You can add as many dictionaries as you want to the list; each dictionary represents a document.
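As a concrete (hypothetical) version of the above, something like this should work; depending on your Marqo version you may also need to pass a tensor_fields argument to add_documents:
import marqo

mq = marqo.Client(url="http://localhost:8882")

mq.index("my-first-index").add_documents([
    {
        "Title": "My first document",   # made-up title
        "Description": "the document",  # the actual text to index
        "_id": "doc-1",                 # any string id of your choice
    }
])

# Later, search the index and get the relevant document back
results = mq.index("my-first-index").search("the document")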
I think the documents need to be a list of dicts. See here https://marqo.pages.dev/API-Reference/documents/

Is there a Python API for submitting batch get requests to AWS DynamoDB?

The package boto3 - Amazon's official AWS API wrapper for Python - has great support for uploading items to DynamoDB in bulk. It looks like this:
db = boto3.resource("dynamodb", region_name="my_region").Table("my_table")

with db.batch_writer() as batch:
    for item in my_items:
        batch.put_item(Item=item)
Here my_items is a list of Python dictionaries each of which must have the table's primary key(s). The situation isn't perfect - for instance, there is no safety mechanism to prevent you from exceeding your throughput limits - but it's still pretty good.
However, there does not appear to be any counterpart for reading from the database. The closest I can find is DynamoDB.Client.batch_get_item(), but here the API is extremely complicated. Here's what requesting two items looks like:
db_client = boto3.client("dynamodb", "my_region")

db_client.batch_get_item(
    RequestItems={
        "my_table": {
            "Keys": [
                {"my_primary_key": {"S": "my_key1"}},
                {"my_primary_key": {"S": "my_key2"}}
            ]
        }
    }
)
This might be tolerable, but the response has the same problem: all values are dictionaries whose keys are data types ("S" for string, "N" for number, "M" for mapping, etc.) and it is more than a little annoying to have to parse everything. So my questions are:
Is there any native boto3 support for batch reading from DynamoDb, similar to the batch_writer function above?
Failing that,
Does boto3 provide any built-in way to automatically deserialize the responses to the DynamoDB.Client.batch_get_item() function?
I'll also add that the function boto3.resource("dynamodb").Table().get_item() has what I would consider to be the "correct" API, in that no type-parsing is necessary for inputs or outputs. So it seems that this is some sort of oversight by the developers, and I suppose I'm looking for a workaround.
So thankfully there is something that you might find useful - much like the json module which has json.dumps and json.loads, boto3 has a types module that includes a serializer and deserializer. See TypeSerializer/TypeDeserializer. If you look at the source code, the serialization/deserialization is recursive and should be perfect for your use case.
Note: It's recommended that you use Binary/Decimal instead of a plain Python float/int for round-trip conversions.
from boto3.dynamodb.types import TypeSerializer, TypeDeserializer

serializer = TypeSerializer()
serializer.serialize('awesome')  # returns {'S': 'awesome'}

deser = TypeDeserializer()
deser.deserialize({'S': 'awesome'})  # returns u'awesome'
Hopefully this helps!
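Putting the two together, a hedged sketch of deserializing a batch_get_item response into plain Python dicts (the table and key names are the hypothetical ones from the question):
import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
db_client = boto3.client("dynamodb", "my_region")

response = db_client.batch_get_item(
    RequestItems={
        "my_table": {
            "Keys": [
                {"my_primary_key": {"S": "my_key1"}},
                {"my_primary_key": {"S": "my_key2"}},
            ]
        }
    }
)

# Convert each item from the {"S": ...}/{"N": ...} wire format to plain Python values
items = [
    {k: deserializer.deserialize(v) for k, v in item.items()}
    for item in response["Responses"]["my_table"]
]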
There's the service resource level batch_get_item. Maybe you could do something like this:
def batch_query_wrapper(table, key, values):
    # dynamo is assumed to be a boto3 DynamoDB service resource
    results = []

    response = dynamo.batch_get_item(
        RequestItems={table: {'Keys': [{key: val} for val in values]}})
    results.extend(response['Responses'][table])

    while response.get('UnprocessedKeys'):
        # Implement some kind of exponential back off here
        response = dynamo.batch_get_item(RequestItems=response['UnprocessedKeys'])
        results.extend(response['Responses'][table])

    return results
It will return your result as python objects.
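Usage might look roughly like this, assuming dynamo is a DynamoDB service resource and using the hypothetical table/key names from the question:
import boto3

dynamo = boto3.resource("dynamodb", region_name="my_region")

# The resource-level batch_get_item accepts plain Python values as keys
items = batch_query_wrapper("my_table", "my_primary_key", ["my_key1", "my_key2"])
print(items)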
I find this to be an effective way to convert a Boto 3 DynamoDB item to a Python dict.
https://github.com/Alonreznik/dynamodb-json
Somewhat related: the docs advertise the query method; their example:
from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    KeyConditionExpression=Key('username').eq('johndoe')
)
items = response['Items']
print(items)

Pymongo.find() only return answer

I am working on a code which will fetch data from the database using pymongo. After that I'll show it in a GUI using Tkinter.
I am using .find() to find specific documents. However, I don't want anything other than 'name' to show up, so I used {"name": 1}; now it returns:
{u'name':u'**returned_name**'}
How do I remove the u'name': so it will only return returned_name?
Thanks in advance,
Max
P.S. I have searched around the web a lot but couldn't find anything that helps with this.
What you get back from the find() call is a cursor. Just iterate over the cursor and get the value by the name key for every document found:
result = db.col.find({"some": "condition"}, {"name": 1})
print([document["name"] for document in result])
As a result, you'll get a list of names.
Or, if you want and expect a single document to be matched, use find_one():
document = db.col.find_one({"some": "condition"}, {"name": 1})
print(document["name"])
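Not part of the original answer, but if you only ever need the values, pymongo's distinct() can return them directly (note that it also collapses duplicate names):
# Returns a plain list of the distinct name values matching the condition
names = db.col.distinct("name", {"some": "condition"})
print(names)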
Mongo will return the data with keys; as a workaround you can use something like this:
var result = []
db.Resellers_accounts.find({}, {"name": 1, "_id": 0}).forEach(function(u) { result.push(u.name) })
This example is for the NodeJS driver; the same can be done in Python.
Edit (Python code) -
res = db.Resellers_accounts.find({}, {"name": 1, "_id": 0})
result = []
for each in res:
    result.append(each['name'])
Edit 2 -
No, pymongo doesn't support returning only values; everything is key-value paired in MongoDB.

How to append a value to list attribute on AWS DynamoDB?

I'm using DynamoDB as a K-V db (since there's not much data, I think that's fine), and part of 'V' is a list type (about 10 elements). In some sessions I need to append a new value to that list, and I cannot find a way to do this in one request. What I did is like this:
item = self.list_table.get_item(**{'k': 'some_key'})
item['v'].append('some_value')
item.partial_save()
I request the server first and save the item after modifying the value. That's not atomic and looks ugly. Is there any way to do this in one request?
The following code should work with boto3:
table = get_dynamodb_resource().Table("table_name")
result = table.update_item(
    Key={
        'hash_key': hash_key,
        'range_key': range_key
    },
    UpdateExpression="SET some_attr = list_append(some_attr, :i)",
    ExpressionAttributeValues={
        ':i': [some_value],
    },
    ReturnValues="UPDATED_NEW"
)
if result['ResponseMetadata']['HTTPStatusCode'] == 200 and 'Attributes' in result:
    return result['Attributes']['some_attr']
The get_dynamodb_resource method here is just:
def get_dynamodb_resource():
    return boto3.resource(
        'dynamodb',
        region_name=os.environ['AWS_DYNAMO_REGION'],
        endpoint_url=os.environ['AWS_DYNAMO_ENDPOINT'],
        aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
        aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'])
You can do this in 1 request by using the UpdateItem API in conjunction with an UpdateExpression. Since you want to append to a list, you would use the SET action with the list_append function:
SET supports the following functions:
...
list_append (operand, operand) - evaluates to a list with a new
element added to it. You can append the new element to the start or
the end of the list by reversing the order of the operands.
You can see a couple examples of this on the Modifying Items and Attributes with Update Expressions documentation:
The following example adds a new element to the FiveStar review list.
The expression attribute name #pr is ProductReviews; the attribute
value :r is a one-element list. If the list previously had two
elements, [0] and [1], then the new element will be [2].
SET #pr.FiveStar = list_append(#pr.FiveStar, :r)
The following example adds another element to the FiveStar review
list, but this time the element will be appended to the start of the
list at [0]. All of the other elements in the list will be shifted by
one.
SET #pr.FiveStar = list_append(:r, #pr.FiveStar)
The #pr and :r are using placeholders for the attribute names and values. You can see more information on those on the Using Placeholders for Attribute Names and Values documentation.
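For illustration, a hedged sketch of what the first docs example might look like as an actual boto3 update_item call; the table name, key, and review text here are made up:
import boto3

table = boto3.resource("dynamodb").Table("ProductCatalog")  # hypothetical table

table.update_item(
    Key={"Id": 789},  # hypothetical primary key
    UpdateExpression="SET #pr.FiveStar = list_append(#pr.FiveStar, :r)",
    ExpressionAttributeNames={"#pr": "ProductReviews"},
    ExpressionAttributeValues={":r": ["Excellent product!"]},
    ReturnValues="UPDATED_NEW")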
I would look at update expressions:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.Modifying.html#Expressions.Modifying.UpdateExpressions.ADD
Should be doable with an ADD, although not sure what the support in boto is for this.
#LaserJesus's answer is correct. However, using boto3 directly is kind of a pain, hard to maintain, and not very reusable. dynamof abstracts that junk away. Using dynamof, appending an item to a list attribute would look like:
from functools import partial
from boto3 import client
from dynamof.executor import execute
from dynamof.operations import update
from dynamof.attribute import attr

client = client('dynamodb', endpoint_url='http://localstack:4569')
db = partial(execute, client)

db(update(
    table_name='users',
    key={ 'id': user_id },
    attributes={
        'roles': attr.append('admin')
    }))
disclaimer: I wrote dynamof

How to build "Tagging" support using CouchDB?

I'm using the following view function to iterate over all items in the database (in order to find a tag), but I think the performance is very poor if the dataset is large.
Any other approach?
def by_tag(tag):
    return '''
        function(doc) {
            if (doc.tags.length > 0) {
                for (var tag in doc.tags) {
                    if (doc.tags[tag] == "%s") {
                        emit(doc.published, doc)
                    }
                }
            }
        };
    ''' % tag
Disclaimer: I didn't test this and don't know if it can perform better.
Create a single permanent view:
function(doc) {
    for (var tag in doc.tags) {
        emit([tag, doc.published], doc)
    }
};
And query with
_view/your_view/all?startkey=['your_tag_here']&endkey=['your_tag_here', {}]
Resulting JSON structure will be slightly different but you will still get the publish date sorting.
You can define a single permanent view, as Bahadir suggests. When doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying "include_docs=True" in the query string and CouchDB will automatically merge the docs into your view for you, without the space overhead.
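A hedged sketch of that advice with couchdb-python (the design document name, database name, and tag are made up):
from couchdb import Server

server = Server("http://localhost:5984")  # hypothetical CouchDB URL
db = server["my_database"]                # hypothetical database name

# The map function would emit null instead of the whole doc:
#   function(doc) {
#     for (var tag in doc.tags) {
#       emit([tag, doc.published], null);
#     }
#   }

# Query one tag's key range and let CouchDB merge the docs in via include_docs
rows = db.view(
    "tags/by_tag",                  # hypothetical design doc / view name
    startkey=["your_tag_here"],
    endkey=["your_tag_here", {}],
    include_docs=True)
for row in rows:
    print(row.doc["published"])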
You are very much on the right track with the view. A list of thoughts though:
View generation is incremental. If your read traffic is greater than your write traffic, then your views won't cause an issue at all. People who are concerned about this generally shouldn't be. As a frame of reference, you should only be worried if you're dumping hundreds of records into the view without an update.
Emitting an entire document will slow things down. You should only emit what is necessary for use of the view.
Not sure what the val == "%s" performance would be, but you shouldn't overthink things. If there's a tag array you should emit the tags. Granted, if you expect a tags array that will contain non-strings, then ignore this.
# Works on CouchDB 0.8.0
from couchdb import Server  # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
    if (doc.type == 'post' && doc.tags) {
        doc.tags.forEach(function(tag) {
            emit(tag, doc);
        });
    }
}
"""

def findPostsByTag(self, tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key=tag)]
The byTag map function returns the data with each unique tag in the "key", then each post with that tag in value, so when you grab key = "mytag", it will retrieve all posts with the tag "mytag".
I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query; not sure how efficient it is with large data sets.
