How to add documents for search in marqo - Python

I recently started using the marqo library and I am trying to add a document so that marqo can search it and return the relevant part, but I keep getting an error when I run the code.
I used the
add_documents()
method and passed the document as a string, but it returns an error. Here is what my code looks like:
import marqo
DOCUMENT = 'the document'
mq = marqo.Client(url='http://localhost:8882')
mq.index("my-first-index").add_documents(DOCUMENT)
and when I run it I get a
MarqoWebError

You are getting the error because the add_documents() method takes a list of Python dictionaries as an argument, not a string, so you should pass the document as the value of whatever key you assign to it. It is also advisable to add a title and an _id for later referencing. Here is what I mean:
mq.index("my-first-index").add_documents([
    {
        "Title": the_title_of_your_document,
        "Description": your_document,
        "_id": your_id,
    }
])
The _id can be any string of your choice. You can add as many dictionaries as you want to the list; each dictionary represents a document.
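For illustration, here is a minimal end-to-end sketch with made-up titles, text, and ids (depending on the Marqo version, add_documents may also expect a tensor_fields argument naming the fields to embed):

import marqo

mq = marqo.Client(url="http://localhost:8882")

# Each document is a dictionary; its keys become searchable fields.
mq.index("my-first-index").add_documents([
    {
        "Title": "Getting started",
        "Description": "The text you want Marqo to search over.",
        "_id": "doc-1",
    },
    {
        "Title": "Second document",
        "Description": "Another piece of text, stored as its own document.",
        "_id": "doc-2",
    },
])

# Searching returns the most relevant documents (with highlights) for the query.
results = mq.index("my-first-index").search("text to search over")
print(results["hits"])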

I think the documents need to be a list of dicts. See here https://marqo.pages.dev/API-Reference/documents/

Related

Work with nested objects using couchdb-python

Disclaimer: Both Python and CouchDB are new to me. So far my "programming" has mostly consisted of Bash scripts.
I'm trying to create a small script that updates objects in a CouchDB database. The objects, however, aren't created by my script but by an app called Tap Forms that uses CouchDB for sync. Basically, I'm trying to automatically update the content of the app. That also means I can't really influence the structure or names of the objects in CouchDB.
The Database is mostly filled with objects of this structure:
{
    "_id": "rec-3b17...",
    "_rev": "21-cdf6...",
    "values": {
        "fld-c3d4...": 4,
        "fld-1def...": 1000000000000,
        "fld-bb44...": 760000000000,
        "fld-a44f...": "admin,name",
        "fld-5fc0...": "SSD",
        "fld-642c...": true
    },
    "deviceName": "MacBook Air",
    "dateModified": "2019-02-08T14:47:06.051Z",
    "dateCreated": "2019-02-08T11:33:00.018Z",
    "type": "frm-7ff3...",
    "dbID": "db-1435...",
    "form": "frm-7ff3..."
}
I shortened the numbers a bit and removed some entries to increase readability.
Now the actual values I'm trying to update are within the "values": {...} object (or array, or list; I guess I don't have much experience with JSON either).
As I know some of these values, I managed to create a view that finds the _id of an object on the server. I then use the python-couchdb module as described in the documentation:
for item in db.view('CustomViews/test2', key="GENERIC"):
    doc = db[item.id]
This gives me the object. However, I want to update one of the values within values, let's say fld-c3d4.... But how? Using doc['values'] = 'new_value' overwrites the whole thing. I tried other (seemingly logical) ways along the lines of doc['values['fld-c3d4']'] = 'new_value' but couldn't wrap my head around it. I couldn't find an example in any documentation.
So here's an example of how to update fld-c3d4.
Your document is represented as a dictionary that contains a nested dictionary.
If you want to get the values, you will do something like this:
values = doc['values']
Now the variable values points to the values in your document.
From there, you can access a sub value:
values['fld-c3d4'] = 'new value'
If you want to directly update the value from the doc, you just have to chain those operations:
doc['values']['fld-c3d4'] = 'new value'
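Note that this only changes the dictionary in memory; to persist the change you still have to write the document back to CouchDB. A minimal sketch using couchdb-python's save() (the server URL and database name are assumptions for illustration):

import couchdb

couch = couchdb.Server('http://localhost:5984/')  # assumed server URL
db = couch['tap_forms']                           # assumed database name

for item in db.view('CustomViews/test2', key="GENERIC"):
    doc = db[item.id]
    doc['values']['fld-c3d4'] = 'new value'  # update the nested field
    db.save(doc)                             # writes the change and bumps _rev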

Clean API results to get the headlines of news articles?

I have been having trouble finding a way to pull specific text out of the Guardian API results for my dissertation. I have managed to get all my text into Python, but how do you then clean it to get, say, just the headlines of the news articles?
This is a snippet of the API result that I want to pull out info from:
{
    "response": {
        "status": "ok",
        "userTier": "developer",
        "total": 1869990,
        "startIndex": 1,
        "pageSize": 10,
        "currentPage": 1,
        "pages": 186999,
        "orderBy": "newest",
        "results": [
            {
                "id": "sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
                "type": "liveblog",
                "sectionId": "sport",
                "sectionName": "Sport",
                "webPublicationDate": "2016-07-09T13:21:36Z",
                "webTitle": "Tour de France 2016: stage eight – live!",
                "webUrl": "https://www.theguardian.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
                "apiUrl": "https://content.guardianapis.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
                "isHosted": false
            },
            {
                "id": "sport/live/2016/jul/09/serena-williams-v-angelique-kerber-wimbledon-womens-final-live",
                "type": "liveblog",
                "sectionId": "sport",
                "sectionName": "Sport",
                "webPublicationDate": "2016-07-09T13:21:02Z",
                "webTitle": "Serena Williams v Angelique Kerber: Wimbledon women's final –
...
Hoping the OP adds the code they used to the question.
One solution in Python: whatever you get back (from the methods offered by the requests module?) will either already be a deeply nested structure you can index into directly, or you can easily map it to such a structure via json.loads(the_string_you_displayed).
Sample:
import json

d = json.loads(the_string_you_displayed)
head_line = d['response']['results'][0]['webTitle']
This would store in head_line the value of webTitle from the first dict (index 0) in the results "array" of the response entry. (The question was updated, so the full path is now visible.)
That is, in case I read the given sample snippet correctly and it was only cut off during copy and paste here, since as shown the sample is invalid JSON.
If the text does not represent valid JSON, you will have to sift through it via substring or pattern matching, which may well be very brittle ...
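If the data is fetched live rather than pasted as text, requests can hand back the parsed structure directly. A minimal sketch, assuming the Guardian content search endpoint and an api-key parameter (the key and query values are placeholders, not from the question):

import requests

resp = requests.get(
    "https://content.guardianapis.com/search",
    params={"api-key": "YOUR_API_KEY", "q": "tour de france"},
)
resp.raise_for_status()   # fail loudly on HTTP errors
data = resp.json()        # already a nested dict/list, no json.loads needed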
Update: So assuming the full response structure is stored inside a variable named data:
result_seq = data['response']['results'] # yields a list here
headlines = [result['webTitle'] for result in result_seq]
The last line works like so: it is a list comprehension that compactly builds a list from all entries result in result_seq by picking the value of the key webTitle from each dict.
An explicit for-loop solution picking them all would be:
result_seq = data['response']['results']
headlines = []
for result in result_seq:
    headlines.append(result['webTitle'])
This does not check for errors like result dicts without a webTitle key, but Python will raise a matching exception, and one can decide whether to wrap the processing in a try/except block or hope for the best ...
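A defensive variant might look like the sketch below, which simply skips entries without a webTitle instead of raising (one possible choice, not from the original answer):

headlines = []
for result in data['response']['results']:
    title = result.get('webTitle')   # None if the key is missing
    if title is not None:
        headlines.append(title)
    # alternatively, wrap result['webTitle'] in try/except KeyError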

Pymongo .find(): return only the values

I am working on code that fetches data from the database using pymongo. After that I'll show it in a GUI using Tkinter.
I am using
.find()
to find specific documents. However, I don't want anything other than 'name' to show up, so I used {"name": 1}. Now it returns:
{u'name':u'**returned_name**'}
How do I remove the u'name': part so that it only returns returned_name?
Thanks in advance,
Max
P.S. I have searched around the web a lot but couldn't find anything that would point me to an argument to help with this.
What you see returned by the find() call is a cursor. Just iterate over the cursor and get the value by the name key for every document found:
result = db.col.find({"some": "condition"}, {"name": 1})
print([document["name"] for document in result])
As a result, you'll get a list of names.
Or, if you want and expect a single document to be matched, use find_one():
document = db.col.find_one({"some": "condition"}, {"name": 1})
print(document["name"])
Mongo will return the data with keys, but as a workaround you can use something like this
var result = []
db.Resellers_accounts.find({}, {"name": 1, "_id": 0}).forEach(function(u) { result.push(u.name) })
This example is for NodeJS driver, similar can be done for Python
Edit (Python Code) -
res = db.Resellers_accounts.find({}, {"name": 1, "_id": 0})
result = []
for each in res:
    result.append(each['name'])
Edit 2 -
No, pymongo doesn't support returning only values; everything is key-value paired in MongoDB.
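As an aside (not part of the original answers): if only the distinct values of a field are needed, pymongo's distinct() returns them as a plain list, though it deduplicates, so it is not an exact substitute for the projection above:

# Returns a plain list of unique values for the field, e.g. ['returned_name', ...]
names = db.Resellers_accounts.distinct("name")
print(names)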

How to delete documents from Elasticsearch

I can't find any example of deleting documents from Elasticsearch in Python. What I've seen so far are the definitions of the delete and delete_by_query functions. But for some reason the documentation does not provide even a microscopic example of using these functions. The bare list of parameters does not tell me much if I do not know how to correctly feed them into the function call. So, let's say I've just inserted one new doc like so:
doc = {'name':'Jacobian'}
db.index(index="reestr",doc_type="some_type",body=doc)
Who in the world knows how I can now delete this document using delete and delete_by_query?
Since you are not giving a document id while indexing your document, you have to get the auto-generated document id from the return value and delete by that id. Or you can define the id yourself; try the following:
db.index(index="reestr",doc_type="some_type",id=1919, body=doc)
db.delete(index="reestr",doc_type="some_type",id=1919)
In the other case, you need to look at the return value:
r = db.index(index="reestr",doc_type="some_type", body=doc)
# r = {u'_type': u'some_type', u'_id': u'AU36zuFq-fzpr_HkJSkT', u'created': True, u'_version': 1, u'_index': u'reestr'}
db.delete(index="reestr",doc_type="some_type",id=r['_id'])
Another example, for delete_by_query. Let's say that after adding several documents with name='Jacobian', you run the following to delete all documents with name='Jacobian':
db.delete_by_query(index='reestr',doc_type='some_type', q={'name': 'Jacobian'})
The Delete-By-Query API was removed from the ES core in version 2 for several reasons. This function became a plugin. You can look for more details here:
Why Delete-By-Query is a plugin
Delete By Query Plugin
Because I didn't want to add another dependency (I need this to run later in a Docker image), I wrote my own function to solve this problem. My solution is to search for all documents with the specified index and type, and then remove them using the Bulk API:
import elasticsearch

def delete_es_type(es, index, type_):
    try:
        count = es.count(index, type_)['count']
        response = es.search(
            index=index,
            filter_path=["hits.hits._id"],
            body={"size": count, "query": {"filtered": {"filter": {
                "type": {"value": type_}}}}})
        ids = [x["_id"] for x in response["hits"]["hits"]]
        if len(ids) == 0:
            # nothing to delete
            return
        bulk_body = [
            '{{"delete": {{"_index": "{}", "_type": "{}", "_id": "{}"}}}}'
            .format(index, type_, x) for x in ids]
        es.bulk('\n'.join(bulk_body))
        # es.indices.flush_synced([index])
    except elasticsearch.exceptions.TransportError as ex:
        print("Elasticsearch error: " + ex.error)
        raise ex
I hope that helps future googlers ;)
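For context, a call might look like this (the client setup and names are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200
delete_es_type(es, "reestr", "some_type")  # removes every document of that type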
One can also do something like this:
from pprint import pprint

def delete_by_ids(index, ids):
    query = {"query": {"terms": {"_id": ids}}}
    res = es.delete_by_query(index=index, body=query)
    pprint(res)

# Pass the index and the list of ids that you want to delete.
delete_by_ids('my_index', ['test1', 'test2', 'test3'])
This will perform the delete operation on the matching documents in bulk.
I came across this post while searching for a way to delete a document on ElasticSearch using their Python library, ElasticSearch-DSL.
In case it helps anyone, this part of their documentation describes the document lifecycle.
https://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html#document-life-cycle
And at the end of it, it details how to delete a document:
To delete a document just call its delete method:
first = Post.get(id=42)
first.delete()
Hope that helps 🤞

Google App Engine IN filter

I am using Python 2.7, Jinja2 and Google App Engine.
I have a datastore where one of the entity properties is a string in this form:
"test1,test2,test3,test4,test5"
I am trying to figure out a query that retrieves all the entities where, for example, test1 is in the string.
I tried the following where tags is the string property:
category = "test1"
mymodel.gql("WHERE tags in :1", category)
I get this error: BadArgumentError: List expected for "IN" filter
I can imagine that this is logical, since my property is a string and not a list, but how can I change the query to make it work?
You would have to change your property to a StringListProperty rather than a string for this to work.
There are not a lot of other choices. Full Text Search, or something like Whoosh (https://github.com/tallstreet/Whoosh-AppEngine), which also does full-text search, are your other alternatives.
I would personally change it to a StringListProperty.
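A minimal sketch of that approach, assuming the old db API and made-up model and property names (with a StringListProperty, a plain equality filter is a membership test, so no IN filter is needed):

from google.appengine.ext import db

class MyModel(db.Model):
    # Tags stored as a list, e.g. ["test1", "test2", "test3"],
    # instead of one comma-separated string.
    tags = db.StringListProperty()

category = "test1"
# Equality on a list property matches entities whose list contains the value.
results = db.GqlQuery("SELECT * FROM MyModel WHERE tags = :1", category)
for entity in results:
    print(entity.tags)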
As the error indicates, GQL expects a list for an IN filter. Try making category a list:
category = ["test1"]
