What I need to do is load some information from XML files into Elasticsearch and then search those files with tf-idf weights applied, returning the top 20 results. I want to do this with Python.
So far I have been able to parse the XML data and create an index successfully through Python, by building arrays and indexing them in a JSON-like format. I am aware that this means most other indexing options available through Elasticsearch get a default value, but I was unable to find another way to do it. Since all the data is now in the index, what remains is to search it. I am given 10 documents that contain a title and a small summary of the text, and I need to return the top 20 results ranked by tf-idf through Elasticsearch. This is how I gather the 10 text files that need to be searched for in my index, and this is how I try to search for them:
# es (an Elasticsearch client) and INDEX_NAME are defined earlier in the script
queries = []
with open("testingQueries.txt") as file:
    queries = [i.strip() for i in file]

for query_text in queries:
    query = {
        'query': {
            'more_like_this': {
                'fields': ['document.text'],
                'like': query_text
            }
        }
    }
    results = es.search(index=INDEX_NAME, body=query)
    print(str(results) + "\n")
As you can see, I haven't added an analyzer to this query, and I have no idea how to apply tf-idf weights when searching my data for these queries. I've been searching for an answer everywhere, but most answers are either not Python-related or do not really solve my problem. The search results I am getting are also not the top 20 results; in fact, I am not getting any results at all. The output looks like this:
{'took': 14, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
When I try the same with 'match' instead of 'more_like_this' I get a lot more hits, but I would still need tf-idf scores and the top 20 documents that are similar to my queries.
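For the top-20 part specifically, the search body's size parameter controls how many hits come back (the default is 10). Below is a minimal sketch built on the same client calls as above; note that Elasticsearch scores hits with BM25 by default, and classic tf-idf similarity, where still supported, is configured in the index mapping rather than in the query itself:

# a sketch assuming es and INDEX_NAME from the snippet above
query = {
    'size': 20,  # return the top 20 hits instead of the default 10
    'query': {
        'match': {
            'document.text': query_text
        }
    }
}
results = es.search(index=INDEX_NAME, body=query)
# hits arrive sorted by relevance score, highest first
for hit in results['hits']['hits']:
    print(hit['_score'], hit['_source'])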
I have a script set up to pull JSON from an API, and I need to flatten nested objects into separate columns so that each survey lands as a single row on a SQL server. See below for the raw body layout of an example object:
"answers": {
"agent_star_rating": {
"question_id": 145,
"question_text": "How satisfied are you with the service you received from {{ employee.first_name }} today?",
"comment": "John was exceptionally friendly and knowledgeable.",
"selected_options": {
"1072": {
"option_id": 1072,
"option_text": "5",
"integer_value": 5
}
}
},
In this example, I need every part of agent_star_rating to become an individual column, so that all the data comes out as one row for the entire survey on our SQL server. I have tried mapping several keys like so:
agent_star_rating = [list(response['answers']['agent_star_rating']['selected_options'].values())[0]['integer_value']]
agent_question = response['answers']['agent_star_rating']['question_text']
agent_comment = response['answers']['agent_star_rating']['comment']

response['agent_question'] = agent_question
response['agent_comment'] = agent_comment
response['agent_star_rating'] = agent_star_rating
I get the expected result until we reach a survey where some field like ['question_text'] was skipped, at which point we get a missing key error. This happens across other objects as well, and I am failing to come up with a solution for these missing keys. If there is a better way to format the output than the key-mapping approach I've used, I'd also love to hear ideas! I'm fresh to learning Python/pandas, so pardon any improper terminology!
I would do something like this:
# values that you always capture
row = ['value1', 'value2', ...]

# defaults, so a skipped field just stays an empty string
gottem_attrs = {'question_id': '',
                'question_text': '',
                'comment': '',
                'selected_options': ''}

# find and save the values that the response does have
for attr in response['answers']['agent_star_rating']:
    gottem_attrs[attr] = response['answers']['agent_star_rating'][attr]

# then you have your final row
final_row = row + list(gottem_attrs.values())
If the response has a value for an attribute, this code will save it; otherwise it will keep an empty string for that value.
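A more compact alternative sketch uses dict.get(), which returns a fallback instead of raising KeyError (the path assumes the structure shown in the question):

rating = response['answers']['agent_star_rating']
# .get() falls back to '' whenever a survey skipped the field
agent_question = rating.get('question_text', '')
agent_comment = rating.get('comment', '')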
Django annotations are really awesome. However I can't figure out how to deal with annotations where several values() are required.
Question:
I would like to annotate an author_queryset with counts of items in a related m2m. I don't know whether I need to use a Subquery or not, but this:
annotated_queryset = author_queryset.annotate(genre_counts=Subquery(genre_counts))
Returns:
SyntaxError: subquery must return only one column
I've tried casting the values to a JSONField to get everything back in one field, hoping that I could use JSONBAgg on it, since I'm using Postgres and need to filter the result:
subquery = Author.objects.filter(id=OuterRef('pk')).values('id','main_books__genre').annotate(genre_counts=Count('main_books__genre'))
qss = qs.annotate(genre_counts=Subquery(Cast(subquery,JSONField()), output_field=JSONField()))
Yields:
AttributeError: 'Cast' object has no attribute 'clone'
I'm not sure what I need to do to get the dict cast to a JSONField(). There's some great info here about filtering on these. There's also something for Postgres coming in the development version called ArraySubquery(), which looks to address this. However, I can't use that feature until it's in a stable release.
Desired result
I would like to annotate so I could filter based on the annotations, like this:
annotated_queryset.filter(genre_counts__scifi__gte=5)
Detail
I can use dunders to get at a related field, then count like so:
# get all the authors with Virginia in their name
author_queryset = Author.objects.filter(name__icontains='Virginia')
author_queryset.count()
# returns: 21
# aggregate the book counts by genre in the Book m2m model
genre_counts = author_queryset.values('id','main_books__genre').annotate(genre_counts=Count('main_books__genre'))
genre_counts.count()
# returns: 25
This is because there can be several genre counts returned for each Author object in the queryset. In this particular example, there is an Author who has books in 4 different genres.
To illustrate:
...
{'id': 'authorid:0054f04', 'main_books__genre': 'scifi', 'genre_counts': 1}
{'id': 'authorid:c245457', 'main_books__genre': 'fantasy', 'genre_counts': 4}
{'id': 'authorid:a129a73', 'main_books__genre': None, 'genre_counts': 0}
{'id': 'authorid:f41f14b', 'main_books__genre': 'mystery', 'genre_counts': 16}
{'id': 'authorid:f41f14b', 'main_books__genre': 'romance', 'genre_counts': 1}
{'id': 'authorid:f41f14b', 'main_books__genre': 'scifi', 'genre_counts': 9}
{'id': 'authorid:f41f14b', 'main_books__genre': 'fantasy', 'genre_counts': 3}
...
And there is another Author with 2; the rest have one genre each, which accounts for the 25 values.
Hoping this makes sense to someone! I'm sure there is a way to handle this properly without waiting for the feature described above.
You want to use .annotate() without a Subquery because, as you found, a subquery must return only one column. You should be able to span all the relationships in the annotate's Count expression.
Unfortunately, Django doesn't currently support what you're looking for with genre_counts__scifi__gte=5. You can, however, structure it so that the Count is computed with a filter passed to it:
from django.db.models import Count, Q

selected_genre = 'scifi'
annotated_queryset = author_queryset.annotate(
    genre_count=Count('main_books__genre',
                      filter=Q(main_books__genre=selected_genre))
).filter(genre_count__gte=5)
To have the full breakdown, you'll be better off returning the breakdown and doing the final aggregation in the application as you showed in your question.
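For instance, a rough sketch of that application-side aggregation, reusing the genre_counts values() queryset from the question (all names are taken from there):

from collections import defaultdict

# build {author_id: {genre: count}} from the flat values() rows
genre_breakdown = defaultdict(dict)
for row in genre_counts:
    genre = row['main_books__genre']
    if genre is not None:
        genre_breakdown[row['id']][genre] = row['genre_counts']

# then filter in Python, e.g. authors with at least 5 scifi books
scifi_authors = [author_id
                 for author_id, counts in genre_breakdown.items()
                 if counts.get('scifi', 0) >= 5]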
I am trying to call a paginated API in a loop; it looks like below. The API doc is new and does not say anything about next, offset, or similar parameters. I tried adding {page: page} to the params, but it still returns the first 100 rows. Any help?
headers = {'key':'key'}
data = requests.get(url,headers=headers).json()
Sample response:
{'results': [],
 'page': 1,
 'results_per_page': 100,
 'total_results': 25000}
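Since the sample response does include page, results_per_page, and total_results, one pattern worth trying is to walk the pages until those totals line up. This is only a sketch: the query parameter name 'page' is an assumption, and since it reportedly didn't change the result, the API may expect a different name or casing:

import requests

headers = {'key': 'key'}
all_results = []
page = 1
while True:
    # 'page' as a query parameter is an assumption about this API
    data = requests.get(url, headers=headers, params={'page': page}).json()
    all_results.extend(data['results'])
    # stop once every page implied by total_results has been fetched
    if page * data['results_per_page'] >= data['total_results']:
        break
    page += 1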
I am hoping someone can help me solve this problem I am having with a nested JSON response. I have been trying to crack this for a few weeks now with no success.
Using a site's API, I am trying to create a dictionary which holds three pieces of information for each user, extracted from the JSON responses. The first JSON response holds the user's uid and crmid that I require.
The API comes back with a large JSON response, with an object for each account. An extract of this for a single account can be seen below:
{
    'uid': 10,
    'key': '[N#839374',
    'customerUid': 11,
    'selfPaid': True,
    'billCycleAllocationMethodUid': 1,
    'stmtPaidForAccount': False,
    'accountInvoiceDeliveryMethodUid': 1,
    'payerAccountUid': 0,
    'countryCode': None,
    'currencyCode': 'GBP',
    'languageCode': 'en',
    'customFields': {
        'field': [{
            'name': 'CRMID',
            'value': '11001'
        }]
    },
    'consentDetails': [],
    'href': '/accounts/10'
}
I have made a loop which extracts each UID for each account:
get_accounts = requests.get('https://website.com/api/v1/accounts?access_token=' + auth_key)
all_account_info = get_accounts.json()
account_info = all_account_info['resource']

account_information = {}
for account in account_info:
    account_uid = account['uid']
I am now trying to extract the CRMID value, in this case '11001': {'name': 'CRMID', 'value': '11001'}.
I have been struggling all week to make this work, and I have two problems:
I would like to extract the UID (which I have done) and the CRMID from the deeply nested 'customFields' dictionary in the JSON response. I manage to get as far as ['key'][0], but I am not sure how to access the next dictionary that is nested in the list.
I would like to store this information in a dictionary in the format below:
{'accounts': [{'uid': 10, 'crmid': 11001, 'amount': ['bill': 4027]}{'uid': 11, 'crmid': 11002, 'amount': ['bill': 1054]}]}
(The 'bill' information is going to come from a separate JSON response.)
My problem is that with every loop I design, the dictionary seems to hold only one account, the last one it looped over. I can't figure out a way to append to the dictionary instead of overwriting it whilst using a loop. If anyone has a useful link on how to do this, it would be much appreciated.
My end goal is to have a single dictionary which holds the three pieces of information for each account (uid, crmid, bill). I'm then going to export this into a CSV document.
Any help, guidance, useful links etc would be much appreciated.
In regards to question 1, it may be helpful to print each level as you go down, then work out how to access the object you are returned at that level. If it is an array, it will use number notation like [0], and if it is a dictionary, it will use key notation like ['key'].
Regarding question 2, your dictionary needs unique keys. You are probably looping over and replacing the whole thing each time.
The final structure you suggest is a bit off; imagine it as:
accounts: {
    '10': {
        'uid': '10',
        'crmid': 11001,
        'amount': {
            'bill': 4027
        }
    },
    '11': {
        'uid': '11',
        'crmid': 11011,
        'amount': {
            'bill': 4028
        }
    }
}
etc.
So you can access accounts['10']['crmid'] or accounts['10']['amount']['bill'] for example.
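Applying that to the loop from the question, a minimal sketch (assuming, per the sample JSON, that the CRMID is the first entry in the customFields 'field' list, and that the 'bill' amount gets filled in later from the separate response):

accounts = {}
for account in account_info:
    uid = account['uid']
    # customFields -> field is a list of {'name': ..., 'value': ...} dicts
    crmid = account['customFields']['field'][0]['value']
    # keying by uid means each iteration adds a new entry
    # instead of overwriting the previous one
    accounts[str(uid)] = {'uid': uid, 'crmid': crmid, 'amount': {}}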
I'm using MongoDB as my datastore and wish to store a "clustered" configuration of my documents in a separate collection.
So in one collection I'd have my original set of objects, and in the second I'd have:
kMeansCollection: {
    1: [mongoObjectCopy1], [mongoObjectCopy2]...
    2: [mongoObjectCopy3], [mongoObjectCopy4]...
}
I'm following the implementation of K-means for text clustering here, http://tech.swamps.io/recipe-text-clustering-using-nltk-and-scikit-learn/, but I'm having a hard time thinking about how I'd tie the outputs back into MongoDB.
An example (taken from the link):
if __name__ == "__main__":
    tags = collection.find({}, {'tag_data': 1, '_id': 0})
    clusters = cluster_texts(tags, 5)  # algo runs here with 5 clusters
    pprint(dict(clusters))
The var "tags" is the required input for the algo to run.
It must be in the form of an array, but currently the query returns an array of objects (I must therefore extract the text values from it).
However, after magically clustering my collection 5 ways, how can I reunite the clusters with their respective object entries from Mongo?
I am only feeding in specific text content from one property of each object.
Thanks a lot!
You would need to have some identifier for the documents. It is probably a good idea to include the _id field in your query so that you do have a unique document identifier. Then you can create parallel lists of ids and tag_data.
docs = collection.find({}, {'tag_data': 1, '_id': 1})
ids = [doc['_id'] for doc in docs]
tags = [doc['tag_data'] for doc in docs]
Then call the cluster function on the tag data.
clusters = cluster_text(tags)
And zip the results back with the ids.
doc_clusters = zip(ids, clusters)
From here you have built tuples of (_id, cluster) so you can update the cluster labels on your mongo documents.
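For instance, a minimal sketch of writing the labels back with PyMongo (assuming cluster_text returns one label per input document, in order):

for _id, label in doc_clusters:
    collection.update_one({'_id': _id}, {'$set': {'cluster': label}})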
The efficient way to do this is to use the aggregation framework to build the lists of _id and tag_data with a server-side operation. This also reduces both the amount of data sent over the wire and the time and memory used to decode documents on the client side.
You need to $group your documents and use the $push accumulator operator to return the list of _id values and the list of tag_data values. The aggregate() method gives access to the aggregation pipeline.
cursor = collection.aggregate([{
    '$group': {
        '_id': None,
        'ids': {'$push': '$_id'},
        'tags': {'$push': '$tag_data'}
    }
}])
You then retrieve your data using the .next() method on the CommandCursor; because we group by None, the cursor holds a single element.
data = cursor.next()
After that, simply call your function and zip the result.
clusters = cluster_text(data['tags'])
doc_clusters = zip(data['ids'], clusters)