How to generate a word cloud using elasticsearch? - python

I have an Elasticsearch DB with data of the form:
record = {  # all fields are strings except age
    'diagnosis': self.diagnosis,
    'vignette': self.vignette,
    'symptoms': self.symptoms_list,
    'care': self.care_level_string,
    'age': self.age,  # float
    'gender': self.gender
}
I want to create a word cloud of the data in vignette.
I tried all sorts of queries and keep getting error 400, which means I don't understand how to query the database.
I am using Python.
This is the only successful query I was able to come up with:
def search_phrase_in_vignettes(self, phrase):
    body = {
        "_source": ["vignette"],
        "query": {
            "match_phrase": {
                "vignette": {
                    "query": phrase,
                }
            }
        }
    }
    res = self.es.search(index=self.index_name, doc_type=self.doc_type, body=body)
This finds any record with phrase contained in the field 'vignette'.
I am thinking some aggregation should do the trick, but I can't seem to write a correct query with aggs.
I would love some help on how to correctly write even the simplest aggregation query in Python.

Use a terms aggregation to get the word counts. Your query would be:
{
    "query": {
        "match_phrase": {
            "vignette": {
                "query": phrase
            }
        }
    },
    "aggs": {
        "cloud": {
            "terms": { "field": "vignette" }
        }
    }
}
When you receive the results, take the buckets from the aggregations key:
res = self.es.search(index=self.index_name, doc_type=self.doc_type, body=body)
for bucket in res['aggregations']['cloud']['buckets']:
    ...  # rest of building the cloud
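
For completeness, a minimal sketch of feeding those buckets into a word cloud, assuming a placeholder index name, the third-party wordcloud package, and that vignette is aggregatable (e.g. fielddata enabled or a vignette.keyword sub-field):

from elasticsearch import Elasticsearch
from wordcloud import WordCloud  # third-party package: pip install wordcloud

es = Elasticsearch()

body = {
    "size": 0,  # only the aggregation is needed, not the hits themselves
    "aggs": {
        "cloud": {
            # a text field may need "vignette.keyword" or fielddata=true to be aggregatable
            "terms": {"field": "vignette", "size": 100}
        }
    }
}

res = es.search(index="my_index", body=body)  # placeholder index name

# map each term to its document count and render the cloud from those frequencies
frequencies = {b["key"]: b["doc_count"] for b in res["aggregations"]["cloud"]["buckets"]}
WordCloud().generate_from_frequencies(frequencies).to_file("vignette_cloud.png")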

Related

How to use conditionals on Uniswap GraphQL API pool method?

So I'm new to GraphQL and I've been figuring out the Uniswap API through the sandbox browser. I'm running a program which just gets metadata on the top 100 tokens and their relative pools, but the pool query isn't working at all. I'm trying to put two conditions together: if token0's hash is this and token1's hash is that, it should output the pool of those two. However, it only outputs pools matching the token0 hash and just ignores the second one. I've tried using and, _and, or two where's separated by {} or ",", and so on. This is an example I have (Python btw):
class ExchangePools:
    def QueryPoolDB(self, hash1, hash2):
        query = """
        {
          pools(where: {token0: "%s"}, where: {token1: "%s"}, first: 1, orderBy: volumeUSD, orderDirection: desc) {
            id
            token0 {
              id
              symbol
            }
            token1 {
              id
              symbol
            }
            token1Price
          }
        }""" % (hash1, hash2)
        return query
or in the sandbox explorer this:
{
  pools(where: {token0: "0x2260fac5e5542a773aa44fbcfedf7c193bc2c599"} and: {token1: "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"}, first: 1, orderBy: volumeUSD, orderDirection: desc) {
    id
    token0 {
      id
      symbol
    }
    token1 {
      id
      symbol
    }
    token1Price
  }
}
with this output:
{
  "data": {
    "pools": [
      {
        "id": "0x4585fe77225b41b697c938b018e2ac67ac5a20c0",
        "token0": {
          "id": "0x2260fac5e5542a773aa44fbcfedf7c193bc2c599",
          "symbol": "WBTC"
        },
        "token1": {
          "id": "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
          "symbol": "WETH"
        },
        "token1Price": "14.8094450357546760737720184457113"
      }
    ]
  }
}
How can I get the API to register both statements?
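
For what it's worth, a hedged sketch of one direction to try: put both token hashes inside a single where object, on the assumption that the subgraph ANDs together multiple fields given in the same filter (rather than accepting two separate where arguments):

class ExchangePools:
    def QueryPoolDB(self, hash1, hash2):
        # both conditions live in one where object; multiple fields in a single
        # filter are assumed to be combined with AND by the subgraph
        query = """
        {
          pools(where: {token0: "%s", token1: "%s"},
                first: 1, orderBy: volumeUSD, orderDirection: desc) {
            id
            token0 { id symbol }
            token1 { id symbol }
            token1Price
          }
        }""" % (hash1, hash2)
        return query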

Bulk Update for elasticsearch documents using Python

I have Elasticsearch documents like the ones below, where I need to rectify the age value based on creationtime and currentdate
(age = creationtime - currentdate):
hits = [
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2018-05-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2013-07-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2014-08-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2015-09-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    }
]
I want to do a bulk update based on each document's ID, but the problem is that I need to correct 6 months of data, and the per-day data size (doc count of the index) is almost 535329. I want to efficiently bulk-update age based on _id for each day, on all documents, using Python.
Is there a way to do this without looping through everything? All the examples I came across use Pandas dataframes and update based on a known value, but here I only get the _id as and when the code runs.
The logic I had written was to fetch all docs, store their _id, and then update the age for each _id. But that is not an efficient way if I want to bulk-update all documents for each day of the 6 months.
Can anyone give me some ideas for this or point me in the right direction?
As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!
A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:
POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}
The official Python client exposes this call as update_by_query as well.
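A minimal sketch of the same call through elasticsearch-py (the index name and connection are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()

script_source = """
def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, currentdate);
ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
"""

resp = es.update_by_query(
    index="your-index-name",  # placeholder index name
    body={
        "query": {"match_all": {}},
        "script": {"source": script_source, "lang": "painless"},
    },
)
print(resp["updated"])  # number of documents that were updated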
🔑 Before letting this update script loose on your whole index, try it on a small subset of your documents by using a more restrictive query than the match_all above.
💡 It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.
You see, if your index mapping's dates are properly defined like so:
{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}
the age can be calculated as a script field:
POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
            doc['creationtime'].value,
            doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}
The only caveat is that the value won't be inside _source but rather in its own group called fields (which implies that multiple script fields are possible at once!).
"hits" : [
{
...
"_id" : "FFfPuncBly0XYOUcdIs5",
"fields" : {
"age_calculated" : [ "32 months" ] <--
}
},
...
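
From the Python client, the calculated value is then read out of each hit's fields section rather than _source (reusing an es client as above; the index name is a placeholder):

body = {
    "query": {"match_all": {}},
    "script_fields": {
        "age_calculated": {
            "script": {
                "source": """
                    def months = ChronoUnit.MONTHS.between(
                        doc['creationtime'].value,
                        doc['currentdate'].value);
                    return months + ' month' + (months > 1 ? 's' : '');
                """
            }
        }
    },
}

res = es.search(index="your-index-name", body=body)
for hit in res["hits"]["hits"]:
    # script fields come back as lists under "fields", not under "_source"
    print(hit["_id"], hit["fields"]["age_calculated"][0])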

PyMongo: include only the fields which start with a given name

For example, if this is my record:
{
  "_id": "123",
  "name": "google",
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1",
  "description": ""
}
I want to get only those fields starting with 'ip_'. Consider that I have 500 fields and only 15 of them start with 'ip_'.
Can we do something like this to get the output below?
db.collection.find({id:"123"}, {'ip*':1})
Output:
{
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1"
}
The following aggregate query, using PyMongo, returns documents with the field names starting with "ip_".
Note the various aggregation operators used: $filter, $regexMatch, $objectToArray, $arrayToObject. The aggregation pipeline has two stages, $project and $replaceWith.
import pprint

# `collection` is assumed to be a pymongo Collection object
pipeline = [
    {
        "$project": {
            "ipFields": {
                "$filter": {
                    "input": { "$objectToArray": "$$ROOT" },
                    "cond": { "$regexMatch": { "input": "$$this.k", "regex": "^ip" } }
                }
            }
        }
    },
    {
        "$replaceWith": { "$arrayToObject": "$ipFields" }
    }
]

pprint.pprint(list(collection.aggregate(pipeline)))
I am unaware of a way to specify an expression that would decide which hash keys would be projected. MongoDB has projection operators but they deal with arrays and text search.
If you have a fixed possible set of ip fields, you can simply request all of them regardless of which fields are present in a particular document, e.g. project with
{ip_1: true, ip_2: true, ...}
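
A quick sketch of that fixed projection with PyMongo (the connection string and database/collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["mydb"]["mycollection"]  # placeholder db / collection names

projection = {"_id": False, "ip_1": True, "ip_2": True, "ip_3": True, "ip_4": True}
doc = collection.find_one({"_id": "123"}, projection)
print(doc)  # only the requested ip_* fields that actually exist in the document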

elasticsearch query - two criteria

I have a list of JSON files in elasticsearch.
I have a list of strings, matching, which I want to use as the criteria for a search,
where matching = ["223232_ds", "dnjsnsd_22", "2ee2i33", "mkddsj2220", "23e3efdjn"].
I now need to find those records in Elasticsearch where two keys contain values from this list, matching.
Without elasticsearch and simply loading the JSON as a python object I can do this like:
results = []
for record in JSON_list:
    if record['key_1'] in matching and record['key_2'] in matching:
        results.append(record)
Where the JSON_list looks like this:
[{'key_1': "blahaksds",
  'key_2': "njasdnjkns"},
 {'key_1': "bladfgfdf",
  'key_2': "njasdsfsdrr"}]
How do I search for multiple criteria in es? Previously, I've used this setup to search for a record_id directly.
es = elasticsearch.Elasticsearch()
name = "so_sample"
# Formulate query
query = str("_id:" + '"' + record_id + '"')
# Query
result = es.search(name, q=query)
You can use a bool query with two terms queries in the must clause, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "key_1": ["223232_ds", "dnjsnsd_22", "2ee2i33", "mkddsj2220", "23e3efdjn"]
          }
        },
        {
          "terms": {
            "key_2": ["223232_ds", "dnjsnsd_22", "2ee2i33", "mkddsj2220", "23e3efdjn"]
          }
        }
      ]
    }
  }
}
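
Run through the Python client, this could look roughly like the following (the index name is taken from the question; the terms queries assume key_1/key_2 are keyword-type or otherwise not analyzed):

import elasticsearch

es = elasticsearch.Elasticsearch()
matching = ["223232_ds", "dnjsnsd_22", "2ee2i33", "mkddsj2220", "23e3efdjn"]

body = {
    "query": {
        "bool": {
            "must": [
                {"terms": {"key_1": matching}},
                {"terms": {"key_2": matching}},
            ]
        }
    }
}

result = es.search(index="so_sample", body=body)
for hit in result["hits"]["hits"]:
    print(hit["_source"])  # records where both keys match something in `matching`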

elasticsearch dsl python unpack q queries

How can I dynamically create a combined Q query in elasticsearch dsl python?
I have gone through docs and this SO post. I constructed a q_query dictionary with all required info.
ipdb> q_queries
{'queries': [{'status': 'PASS'}, {'status': 'FAIL'}, {'status': 'ABSE'}, {'status': 'WITH'}], 'operation': 'or'}
I want to perform the following q_query
qq=Q("match", status='PASS') | Q("match", status="FAIL") | Q("match", status="ABSE") | Q("match", status="WITH")
For a list of dicts, the following works out:
ipdb> [Q('match', **z) for z in q_queries['queries']]
[Match(status='PASS'), Match(status='FAIL'), Match(status='ABSE'), Match(status='WITH')]
But how do I combine multiple Qs with an or operator or an and operator? Also, what is the corresponding Elasticsearch raw query for the above? I tried the following, since I have to filter based on test_id:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "test_id": "7" }},
        {
          "range": {
            "created": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        }
      ],
      "should": [
        { "match": { "status": "PASS" }},
        { "match": { "status": "FAIL" }}
      ]
    }
  }
}
But the results are not as expected. I ran the same query without the should filter and the results obtained were the same, so the should clauses were not being applied by Elasticsearch in my case.
Any help is much appreciated.
TIA
After exploring elasticsearch dsl python for some more time, this piece of documentation helped me solve the above issue. Below is the function that I wrote to resolve it.
def create_q_queries(self, q_queries, search_query):
    """
    Create Q queries and chain them if there are multiple Q queries.
    :param q_queries: Q queries with operation and query params as a dict.
    :param search_query: Search() object.
    :return: search_query updated with q queries.
    """
    if q_queries:
        logical_operator_mappings = {'or': 'should', 'and': 'must'}
        for query_group in q_queries:
            queries = [Q('match', **params) for params in query_group['queries']]
            # map the group's 'or'/'and' to the bool query's 'should'/'must'
            search_query = search_query.query(Q('bool', **{
                logical_operator_mappings.get(query_group.get('operation')): queries
            }))
    return search_query
I changed the format of q_queries to perform chaining based on multiple operators like and, or, etc.:
q_queries = [
    {
        "operation": "or",
        "queries": [
            {"status": "PASS"}, {"status": "FAIL"}, {"status": "ABSE"}, {"status": "WITH"}
        ]
    }
]
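
As a side note on the original question of combining multiple Q objects, they can also be folded together with reduce, which is equivalent to chaining them with |. A minimal sketch (the index name is a placeholder):

import operator
from functools import reduce

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Q, Search

es = Elasticsearch()

queries = [Q("match", **params) for params in q_queries[0]["queries"]]
combined = reduce(operator.or_, queries)  # same as q1 | q2 | q3 | q4

s = (
    Search(using=es, index="my-test-index")  # placeholder index name
    .query("match", test_id="7")
    .query(combined)
)
response = s.execute()
for hit in response:
    print(hit.status)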
