I am trying to write a Django app that uses Elasticsearch through the elasticsearch-dsl Python library. I don't want to write a pile of switch-case statements and then pass search queries and filters accordingly.
I want a function that does the parsing by itself.
For example, if I pass "some text url:github.com tags:es,es-dsl,django",
the function should output the corresponding query.
I searched the elasticsearch-dsl documentation and found something that does the parsing:
https://github.com/elastic/elasticsearch-dsl-py/search?utf8=%E2%9C%93&q=simplequerystring&type=
However, I don't know how to use it.
I tried s = Search(using=client).query.SimpleQueryString("1st|ldnkjsdb"), but it gives me a parsing error.
Can anyone help me out?
You can just plug a SimpleQueryString into the Search object; instead of a dictionary, send the elements as parameters of the object.
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import SimpleQueryString

client = Elasticsearch()
_search = Search(using=client, index='INDEX_NAME')
_search = _search.filter(SimpleQueryString(
    query="this + (that | thus) -those",
    fields=["field_to_search"],
    default_operator="and"
))
Much of elasticsearch_dsl simply replaces the dictionary representation with classes and functions that make the code look Pythonic and avoid hard-to-read Elasticsearch JSON.
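If you want to check the JSON a Search object will actually send (for instance, to confirm the SimpleQueryString was built the way you expect), to_dict() shows it. A minimal sketch, assuming the _search object built above:

# Inspect the request body elasticsearch-dsl will generate.
print(_search.to_dict())
# Roughly: {'query': {'bool': {'filter': [{'simple_query_string': {...}}]}}}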
I'm guessing you are asking about using a query string the way you would when making a request with JSON data to the Elasticsearch API. If that's the case, this is how you can do it.
Assume you have the query in a query variable like this:
query = {
    "query": {
        "query_string": {
            "default_field": "content",
            "query": "this AND that OR thus"
        }
    }
}
And now do this:
es = Elasticsearch(
    host=settings.ELASTICSEARCH_HOST_IP,    # Put your ES host IP
    port=settings.ELASTICSEARCH_HOST_PORT,  # Put your ES host port
)
index = settings.MY_INDEX  # Put your index name here
result = es.search(index=index, body=query)
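For completeness, the same query_string search can also be written with elasticsearch-dsl instead of a raw body. This is only a sketch, reusing the es client and index from above and assuming documents have a content field:

from elasticsearch_dsl import Search

# Same query_string query, built with the DSL shortcut syntax.
s = Search(using=es, index=index).query(
    "query_string",
    query="this AND that OR thus",
    default_field="content",
)
response = s.execute()
for hit in response:
    print(hit.meta.id, hit.content)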
Related
I've been scouring the web for some good python documentation for Elasticsearch. I've got a query term that I know returns the information I need, but I'm struggling to convert the raw string into something Python can interpret.
This will return a list of all unique 'VALUE's in the dataset.
{"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}
I have taken this from a dashboarding tool that accesses the data, but I don't seem to be able to convert it into correct Python.
I've tried this:
body_test = {"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}

es = Elasticsearch(SETUP CONNECTION)
es.search(
    index="INDEX_NAME",
    body=body_test
)
but it doesn't like the find value. I can't find anything in the documentation about find.
RequestError: RequestError(400, 'parsing_exception', 'Unknown key for a VALUE_STRING in [find].')
The only way I've got it to slightly work is with
es_search = (
    Search(
        using=es,
        index=db_index
    ).source(['hierarchy1.hierarchy2.VALUE'])
)
But I think this is pulling the entire dataset and then filtering (which I obviously don't want to be doing each time I run this code). This needs to be done through Python, so I cannot simply POST the query I know works.
I am completely new to ES and so this is all a little confusing. Thanks in advance!
So it turns out that find in this case was specific to Grafana (the dashboarding tool I took the query from).
In the end I used the code from this site. It's a LOT more complicated than I thought it was going to be, but it works very quickly and doesn't put a strain on the database (which my alternative method was doing).
In case the link dies in future years, here's the code I used:
from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
            fieldname: {
                "terms": {
                    "field": fieldname
                }
            }
        }]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else:  # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result)  # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}
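If the field only has a modest number of distinct values, a plain terms aggregation without composite pagination may already be enough. This is just a sketch, reusing the pattern.keyword field and strings index from the example above and assuming the number of distinct values fits in one response:

body = {
    "size": 0,  # we only want the aggregation, not the matching documents
    "aggs": {
        "values": {
            "terms": {
                "field": "pattern.keyword",
                "size": 1000  # raise this if you expect more distinct values
            }
        }
    }
}
result = es.search(index="strings", body=body)
for bucket in result["aggregations"]["values"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])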
I am trying to retrieve all stories where the images.path has the text "images123456.jpg" for example.
In MongoDB Compass, I am able to retrieve it using this:
{$and: [{"images.path": {$exists: true}}, {"images.path": /.*images[1-9].*/}] }
In my Python script, I tried to paste the query into the following:
client = MongoClient(HOST, PORT)
dbStuff = client['myDatabase']
myCollection = dbStuff.story.with_options(codec_options=CodecOptions(tz_aware=True, tzinfo=pytz.timezone('Asia/Singapore')))
retrieved = myCollection.find({"$and": [{"images.path": {"$exists": True}}, {"images.path": '/.*images[1-9].*/'}]})
print retrieved.count()  # Prints out 0
There is something wrong in the Python script with the
{"images.path": '/.*images[1-9].*/'}
part. How can I make the necessary changes?
/ is used as a regular expression delimiter in some languages like JS. In Python, you just write the contents of the regular expression as a string.
JS: /foo.*bar/
Python: r'foo.*bar'
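In PyMongo the pattern therefore goes in either as a compiled re pattern or through MongoDB's $regex operator; a minimal sketch, assuming the myCollection object from the question:

import re

# Option 1: pass a compiled Python regex as the value
retrieved = myCollection.find({
    "$and": [
        {"images.path": {"$exists": True}},
        {"images.path": re.compile(r".*images[1-9].*")},
    ]
})

# Option 2: use the $regex operator with a plain string pattern
retrieved = myCollection.find({
    "$and": [
        {"images.path": {"$exists": True}},
        {"images.path": {"$regex": r".*images[1-9].*"}},
    ]
})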
I am trying to add a 43 MB document to an index in Elasticsearch. I use the bulk API in Python. Here is a snippet of my code:
from elasticsearch import helpers
from elasticsearch import Elasticsearch

document = <read a 43 MB JSON file, with two fields>
action = [
    {
        "_index": "test_index",
        "_type": "test_type",
        "_id": 1
    }
]
action[0]["_source"] = document

es = Elasticsearch(hosts="<HOST>:9200", timeout=30)
helpers.bulk(es, action)
This code always times out. I have also tried with different timeout values. Am I missing something here?
I am using the pyArango driver (https://github.com/tariqdaouda/pyArango) for ArangoDB, but I cannot understand how the field validation works. I have set the fields of a collection as in the GitHub example:
import pyArango.Collection as COL
import pyArango.Validator as VAL
from pyArango.theExceptions import ValidationError
import types

class String_val(VAL.Validator):
    def validate(self, value):
        if type(value) is not types.StringType:
            raise ValidationError("Field value must be a string")
        return True

class Humans(COL.Collection):
    _validation = {
        'on_save': True,
        'on_set': True,
        'allow_foreign_fields': True  # allow fields that are not part of the schema
    }
    _fields = {
        'name': Field(validators=[VAL.NotNull(), String_val()]),
        'anything': Field(),
        'species': Field(validators=[VAL.NotNull(), VAL.Length(5, 15), String_val()])
    }
So I was expecting that when I try to add a document to the "Humans" collection and the 'name' field is not a string, an error would be raised. But it didn't seem to work that easily.
This is how I add documents to the collection:
myjson = json.loads(open('file.json').read())
collection_name = "Humans"
bindVars = {"doc": myjson, '@collection': collection_name}
aql = "FOR d IN @doc INSERT d INTO @@collection LET newDoc = NEW RETURN newDoc"
queryResult = db.AQLQuery(aql, bindVars=bindVars, batchSize=100)
So if 'name' is not a string, I don't actually get any error and the document is uploaded into the collection.
Does someone know how I can check whether a document contains the proper fields for that collection, using the built-in validation of pyArango?
I don't see anything wrong with your validator; it's just that if you're using AQL queries to insert your documents, pyArango has no way of knowing the contents prior to insertion.
Validators only work on pyArango documents if you do:
humans = db["Humans"]
doc = humans.createDocument()
doc["name"] = 101
That should trigger the exception because you've defined:
'on_set': True
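Similarly, because 'on_save' is also True, inserting through pyArango's document API instead of AQL should validate each document when it is saved. A minimal sketch of that approach (assuming the Humans collection from the question exists on the db connection):

humans = db["Humans"]
for entry in myjson:                    # insert each record through pyArango
    doc = humans.createDocument(entry)  # create a document from the record's fields
    doc.save()                          # should raise ValidationError if 'name' is not a string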
ArangoDB as a document store doesn't enforce schemas itself, and neither do the drivers.
If you need schema validation, this can be done on top of the driver or inside of ArangoDB using a Foxx service (via the joi validation library).
One possible solution is to use JSON Schema, via its Python implementation, on top of the driver in your application:
from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "species": {"type": "string"},
    },
}
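A short usage sketch of how validate could then be called before handing a document to the driver (the sample document below is made up for illustration):

from jsonschema import ValidationError

doc = {"name": 101, "species": "human"}
try:
    validate(instance=doc, schema=schema)  # fails because "name" is not a string
except ValidationError as e:
    print(e.message)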
Another real-life example using JSON Schema is swagger.io, which is also used to document the ArangoDB REST API and ArangoDB Foxx services.
I don't know yet what was wrong with the code I posted, but it now seems to work. However, I had to convert Unicode to UTF-8 when reading the JSON file; otherwise it was not able to identify strings. I know ArangoDB itself does not enforce schemas, but the driver I am using has built-in validation.
For those interested in built-in validation for ArangoDB in Python, visit the pyArango GitHub page.
I have the following Python code. It basically returns some elements of RDF from an online resource using SPARQL.
I want to query and return something from one of my local files. I tried to edit it but couldn't return anything.
What should I change in order to query one of my local files instead of http://dbpedia.org/resource?
from SPARQLWrapper import SPARQLWrapper, JSON

# wrap the dbpedia SPARQL end-point
endpoint = SPARQLWrapper("http://dbpedia.org/sparql")

# set the query string
endpoint.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpr: <http://dbpedia.org/resource/>
SELECT ?label
WHERE { dbpr:Asturias rdfs:label ?label }
""")

# select the return format (e.g. XML, JSON etc...)
endpoint.setReturnFormat(JSON)

# execute the query and convert into Python objects
# Note: The JSON returned by the SPARQL endpoint is converted to nested Python dictionaries, so additional parsing is not required.
results = endpoint.query().convert()

# interpret the results:
for res in results["results"]["bindings"]:
    print res['label']['value']
Thanks!
SPARQLWrapper is meant to be used only with remote or local SPARQL endpoints. You have two options:
(a) Put your local RDF file in a local triple store and point your code to localhost.
(b) Or use rdflib and its in-memory storage:
import rdflib.graph as g

graph = g.Graph()
graph.parse('filename.rdf', format='xml')  # use the parser matching your file, e.g. 'xml' for RDF/XML
print graph.serialize(format='pretty-xml')
You can query the rdflib.graph.Graph() with:
filename = "path/to/filename"  # replace with something interesting
uri = "uri_of_interest"  # replace with something interesting

import rdflib
import rdfextras
rdfextras.registerplugins()  # so we can Graph.query()

g = rdflib.Graph()
g.parse(filename)
results = g.query("""
SELECT ?p ?o
WHERE {
   <%s> ?p ?o.
}
ORDER BY (?p)
""" % uri)  # get every predicate and object about the uri