Need help formulating a basic elasticutils search query
I am trying out elasticutils, mainly because I was not able to get optimum performance in searches per second with pyes (more details: here).
So far here is what I have done.
es=get_es(hosts=['localhost:9200'],timeout=30,default_indexes=['ncbi_taxa_names'],dump_curl=CurlDumper())
es.get_indices()
# [2012-08-22T15:36:10.639102]
curl -XGET http://localhost:9200/ncbi_taxa_names/_status
Out[26]: {u'ncbi_taxa_names': {'num_docs': 1316005}}
S().indexes('ncbi_taxa_names').values_dict()
Out[27]: [{u'tax_name': u'Conyza sp.', u'tax_id': u'41553'}, ...
So what I want to do is formulate a query where I can search for { "tax_name": "cellvibrio" } and then compare how many search results per second I can retrieve with elasticutils versus pyes.
Maybe it has something to do with the way ES is running locally and not with the APIs.
Update 1
I tried the following and the search performance is still very similar to what I am getting from pyes. Now I am beginning to wonder whether it has something to do with how the local ES instance is running. Still need help figuring that out.
es=get_es(hosts=['localhost:9200'],timeout=30,default_indexes=['ncbi_taxa_names'],dump_curl=CurlDumper())
es.get_indices()
# [2012-08-22T15:36:10.639102]
curl -XGET http://localhost:9200/ncbi_taxa_names/_status
Out[26]: {u'ncbi_taxa_names': {'num_docs': 1316005}}
s=S().indexes('ncbi_taxa_names').values_dict()
Out[27]: [{u'tax_name': u'Conyza sp.', u'tax_id': u'41553'}, ...
results = s.query(tax_name='aurantiacus') # using elasticutils
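For reference, this is roughly how I plan to time the elasticutils side so it mirrors my pyes loop (just a sketch; it reuses the same S pattern as above, and iterating the lazy S object is what actually fires the search):

import time
from elasticutils import S

n_queries = 1000
start = time.time()
for _ in range(n_queries):
    # same pattern as above: build the lazy S object, then force execution
    # by materializing the results
    results = list(S().indexes('ncbi_taxa_names').query(tax_name='cellvibrio'))
print('%d searches in %.2f s' % (n_queries, time.time() - start))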
Appreciate your help.
Thanks!
Related
I am trying to connect the Knack online database with my Python data handling scripts in order to update/refresh objects/tables directly in my Knack app builder. I discovered the pyknackhq Python API for KnackHQ, which can fetch objects and return JSON for the object's records. So far so good.
However, following the documentation (http://www.wbh-doc.com.s3.amazonaws.com/pyknackhq/quick%20start.html) I have tried to fetch all rows (records in Knack) for my object-table (which has 344 records in total).
My code was:
i = 0
for rec in undec_obj.find():
    print(rec)
    i = i + 1
print(i)
>> 25
The first 25 records were indeed returned, but the rest, up to the 344th, never were. The documentation of the pyknackhq library is fairly sparse, so I couldn't find a way around the problem there. Is there a way to get all my records/rows? (I have also changed the setting in Knack so that all my records appear on the same page, page 1.)
The ultimate goal is to take all records and make them a pandas dataframe.
thank you!
I haven't worked with that library, but I've written another python Knack API wrapper that should help:
https://github.com/cityofaustin/knackpy
The docs should get you where you want to go. Here's an example:
>>> from knackpy import Knack
# download data from knack object
# will fetch records in chunks of 1000 until all records have been downloaded
# optionally pass a rows_per_page and/or page_limit parameter to limit record count
>>> kn = Knack(
        obj='object_3',
        app_id='someappid',
        api_key='topsecretapikey',
        page_limit=10,       # not needed; this is the default
        rows_per_page=1000   # not needed; this is the default
    )
>>> for row in kn.data:
        print(row)
{'store_id': 30424, 'inspection_date': 1479448800000, 'id': '58598262bcb3437b51194040'},...
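Since your end goal is a pandas DataFrame and kn.data appears to be a plain list of dicts (as the output above suggests), it should drop straight into pandas:

import pandas as pd

# kn.data is a list of dicts, one per record, so DataFrame() accepts it directly
df = pd.DataFrame(kn.data)
print(df.shape)   # expect (344, <number of fields>) for the object described above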
Hope that helps. Open a GitHub issue if you have any questions using the package.
I have been using the elasticsearch-dsl Python package to query my Elasticsearch database. The querying method is very intuitive, but I'm having issues retrieving the documents. This is what I have tried:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch(hosts=[{"host": 'xyz', "port": 9200}], timeout=400)
s = Search(using=es, index="xyz-*").query("match_all")
response = s.execute()
for hit in response:
    print hit.title
The error I get :
AttributeError: 'Hit' object has no attribute 'title'
I googled the error and found another SO : How to access the response object using elasticsearch DSL for python
The solution mentions:
for hit in response:
    print hit.doc.firstColumnName
Unfortunately, I had the same issue again with 'doc'. I was wondering what the correct way to access my document was?
Any help would really be appreciated!
I'm running into the same issue; I've found different versions of this, and it seems to depend on the version of the elasticsearch-dsl library you're using. You might explore the response object and its sub-objects. For instance, using version 5.3.0, I see the expected data using the loops below.
for hit in RESPONSE.hits._l_:
    print(hit)
or
for hit in RESPONSE.hits.hits:
    print(hit)
NOTE these are limited to 10 data elements for some strange reason.
print(len(RESPONSE.hits.hits))
# 10
print(len(RESPONSE.hits._l_))
# 10
This doesn't match the amount of overall hits if I print the number of hits using print('Total {} hits found.\n'.format(RESPONSE.hits.total))
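If the 10-element cap is the issue, that is most likely just Elasticsearch's default page size. With elasticsearch-dsl you can usually ask for more by slicing the Search object (which sets from/size) or by iterating with scan(), which streams every match through the scroll API. A rough sketch, reusing the connection settings from the question:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=[{"host": "xyz", "port": 9200}], timeout=400)

# Option 1: request a bigger page explicitly; slicing the Search sets from/size
s = Search(using=es, index="xyz-*").query("match_all")[0:100]
response = s.execute()
print(len(response.hits.hits))   # up to 100 now instead of 10

# Option 2: stream every matching document via the scroll API
for hit in Search(using=es, index="xyz-*").query("match_all").scan():
    print(hit.to_dict())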
Good luck!
From version 6 onwards, the response does not return your populated Document class anymore; your fields are just an AttrDict, which is basically a dictionary.
To solve this you need to have a Document class representing the document you want to parse. Then you need to parse the hit dictionary with your document class using the .from_es() method.
Like I answered here.
https://stackoverflow.com/a/64169419/5144029
Also have a look at the Document class here
https://elasticsearch-dsl.readthedocs.io/en/7.3.0/persistence.html
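As a minimal sketch of that pattern (the class and field names here are illustrative, not the asker's actual mapping):

from elasticsearch_dsl import Document, Text

class MyDoc(Document):
    # field names are illustrative; they must match your actual mapping
    title = Text()

response = s.execute()
for raw_hit in response.to_dict()["hits"]["hits"]:
    # from_es() parses the raw hit dict (_source, _id, ...) into the Document class
    doc = MyDoc.from_es(raw_hit)
    print(doc.title)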
I am having weird issues with Neo4j's legacy indexing, and got stuck today. I need full-text support, as I wish to run a performance comparison against Solr (which uses Lucene full text) to see how the different data models compare.
I have been following a few guides online, as well as various posts here on Stack Overflow.
I had success up until yesterday, when all of a sudden I had corrupted index files, as range queries were returning invalid and inconsistent results. So I am trying to set in stone exactly the steps I need to take.
I use the CSV bulk import tool to populate my database with about 4 million nodes with the label "record", and various nodes with labels like "data:SPD", "data:DIR", "data:TS", etc. (using two labels to indicate that they are ocean data nodes for different types of measurements).
The data model is simple. I have:
(r:record {meta:M, time:T, lat:L1, lon:L2})-[:measures]-(d:data {value:V})
M is an ID-like string which I use to keep track of my data internally for testing purposes. T is an epoch-time integer. L1/L2 are geospatial coordinate floats. My data nodes represent various kinds of collected data, and not all records have the same data nodes (some have temperatures, wind speeds, wind directions, sea temperatures, etc.). These values are all represented as floats. Each data node has a second label that says what kind of data it contains.
After I complete the import, I open up the shell and execute the following sequence:
index --create node_auto_index -t Node
index --set-config node_auto_index fulltext
I have the following configuration added to the default neo4j.conf file (this is there even before the CSV bulk import happens):
dbms.auto_index.nodes.enabled=true
dbms.auto_index.nodes.keys=meta,lat,lon,time
Before today, I would see that the fulltext command indeed worked by querying the shell:
index --get-config node_auto_index
returned something like:
{
"provider": "lucene",
"type": "fulltext"
}
I recently ran a series of tests on my data using the MATCH clause. I understand that this uses the more modern schema indexing. My results were fine and returned the expected data.
I read somewhere that since my data was imported prior to legacy index creation, I needed to manually index the relevant properties by doing something like this:
START n=node(*)
WITH n SKIP {curr_skip} LIMIT {fixed_lim}
WHERE EXISTS(n.meta)
SET n.time=n.time, n.lat=n.lat, n.lon=n.lon, n.meta=n.meta
RETURN n
Since I have 4 million records, my python handler does this as a series of batch operations by upping the {curr_skip} each time by {fixed_lim} and executing the query until I get 0 results.
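Concretely, the handler does something like the following (a simplified sketch, assuming the official Bolt driver; my real handler and connection details differ, and the Cypher is the same query shown above with Neo4j 3.x {param} placeholders):

from neo4j import GraphDatabase

QUERY = """
START n=node(*)
WITH n SKIP {curr_skip} LIMIT {fixed_lim}
WHERE EXISTS(n.meta)
SET n.time=n.time, n.lat=n.lat, n.lon=n.lon, n.meta=n.meta
RETURN n
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
fixed_lim = 10000
curr_skip = 0
with driver.session() as session:
    while True:
        batch = list(session.run(QUERY, curr_skip=curr_skip, fixed_lim=fixed_lim))
        if not batch:   # stop once a batch comes back empty
            break
        curr_skip += fixed_lim
driver.close()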
When I moved on to my tests involving the START clause yesterday, I found that using a Lucene query like:
START r=node:node_auto_index(lon:[{} TO {}]) RETURN count(r)
(with a filled-in range) was giving me bad results. Data that I expected to be returned was not. Furthermore, different ranges were yielding strange results: range (a,b) might yield 1000 results, but (a-e,b+e), a superset of the previous range, would yield 0 results!! However, the exact same style of queries on time and lat seemed to be working perfectly! Even more so, I could do a multi-faceted query like:
START r=node:node_auto_index(time:[{} TO {}] AND lat:[{} TO {}]) RETURN count(r)
My best guess was that I had somehow corrupted the index files for lon.
The recommendations I have found online are to stop the database, go to /path/to/graph.db, remove everything matching index*, and restart the database. Upon following these instructions today, I have discovered more weird behavior. I re-executed the same index creation/configuration statements from above, but after querying the configuration, I find that the index type remains "type": "exact". Even stranger, the index files are not actually being created! There is no index directory being created under /path/to/graph.db.
I am certain I have started the shell correctly by using:
neo4j-shell -path /path/to/graph.db/
If I try to use index --create node_auto_index -t Node, I get an "already exists" notification, even though it clearly does not exist.
For now, I think I am just going to start from scratch again and see if I can either reproduce these errors, or somehow bypass them.
Otherwise, if anyone with experience here has any idea of what might be going wrong, I would greatly appreciate some input!
UPDATE:
So I went ahead and started from scratch.
# ran my bulk import code
python3
>>> from mylib.module import load_data
>>> load_data()
>>> # ... lots of printed stuff ...
IMPORT DONE in 3m 37s 950ms.
Imported:
15394183 nodes
15394171 relationships
27651625 properties
Peak memory usage: 361.94 MB
>>> exit()
# switched out my new database
cd /path/to/neo4j-community-3.1.0
mv data/databases/graph.db data/databases/oldgraph.db
mv data/databases/newgraph.db data/databases/graph.db
# check neo4j is off
ps aux | grep neo
# neo4j shell commands
bin/neo4j-shell -path data/databases/graph.db/
... some warning about GraphAware Runtime being disabled.
... the welcome message
neo4j-sh (?)$ index --create node_auto_index -t Node
neo4j-sh (?)$ index --set-config node_auto_index fulltext
INDEX CONFIGURATION CHANGED, INDEX DATA MAY BE INVALID
neo4j-sh (?)$ index --get-config node_auto_index -t Node
{
"provider": "lucene",
"type": "exact"
}
neo4j-sh (?)$ exit # thought maybe I just had to restart
# try again
bin/neo4j-shell -path data/databases/graph.db/
neo4j-sh (?)$ index --get-config node_auto_index -t Node
{
"provider": "lucene",
"type": "exact"
}
neo4j-sh (?)$ index --set-config node_auto_index fulltext
INDEX CONFIGURATION CHANGED, INDEX DATA MAY BE INVALID
neo4j-sh (?)$ index --get-config node_auto_index -t Node
{
"provider": "lucene",
"type": "exact"
}
# hmmmmm
neo4j-sh (?)$ index --create node_auto_index -t Node
Class index 'node_auto_index' alredy exists
# sanity check
neo4j-sh (?)$ MATCH (r:record) RETURN count(r);
+----------+
| count(r) |
+----------+
| 4085814 |
+----------+
1 row
470 ms
neo4j-sh (?)$ exit
As you can see, even after recreating a fresh database, I am not able to activate a fulltext index now. I have no idea why it worked a few days prior and not now, as I am the only one working on this server! Perhaps I will even have to reinstall Neo4j entirely.
UPDATE / IDEA:
Ok I have a potential idea as to my problem, and I think it may be permissions related. I have a dashboard.py module which I have been using to orchestrate turning on/off solr and neo4j. The other day, I had some weird issues with not being able to execute the start/stop sequences from within my shell, so I messed with a lot of permissions.
Let's call me userA. I belong to groups groupA and groupB.
I remember running the following yesterday:
sudo chown -R $USER:groupB neo4j-community-3.1.0
I have noticed that all of the new database files my python scripts are producing belong to group groupA. Could this be the culprit?
I am having the weird error again where I can't recreate the index because it thinks it still exists after I deleted it. I am going to rerun the bulk import once again and fix these permissions prior to trying to set the fulltext index. Will update tonight.
EDIT:
This did not seem to have an effect :(
I even tried chowning everything to root, both user and group to no avail. My lucene index will not change from exact to fulltext.
I am going to go ahead and do a full reinstall of everything now.
UPDATE:
Not even a full reinstall has worked.
I removed my entire neo4j-community-3.1.0 folder, and unpacked the tarball I had.
I set ownership of the entire folder to my own (because it was nfsnobody previously):
chown -R $USER:mygroup neo4j-community-3.1.0
I added the two lines to neo4j.conf:
dbms.auto_index.nodes.enabled=true
dbms.auto_index.nodes.keys=meta,lat,lon,time
I imported the data via bulk import tool, then did the same index creation / configuration commands as before. The index is still reporting that it is using an exact lucene index after it tells me the configuration changed.
I am at an utter loss here. Maybe I will just go ahead and try the START clause tests I have anyway and see if they work.
UPDATE:
WOOOOW. I figured out my exact->fulltext issue!!!
The command:
index --set-config node_auto_index fulltext
needed to be:
index --set-config node_auto_index type fulltext
Incredible. What a doozy. The output message about the index configuration being changed is really what threw me off; it made me think the command was being run correctly and that some other problem was at hand. Should I file an issue on GitHub for this? Is this command actually changing the index at all if I don't include type?
As for the invalid range queries, I am going to test this further soon. I believe that when I ran my code the first time around, I had a bug in my Python handler that didn't loop over all the results, effectively missing some nodes during manual indexing. Once I finish this process again, I will run my tests to check the results.
Sorry for cross-posting. The following question is also posted on Elasticsearch's Google group.
In short, I am trying to find out why I am not able to get optimal performance while doing searches on an ES index which contains about 1.5 million records.
Currently I am able to get about 500-1000 searches in 2 seconds. I would think that this should be orders of magnitude faster. Also, I am currently not using Thrift.
Here is how I am checking the performance.
Using 0.19.1 version of pyes (tried both stable and dev version from github)
Using 0.13.8 version of requests
conn = ES(['localhost:9201'], timeout=20, bulk_size=1000)
loop_start = time.clock()
q1 = TermQuery("tax_name", "cellvibrio")
for x in xrange(1000000):
    if x % 1000 == 0 and x > 0:
        loop_check_point = time.clock()
        print 'took %s secs to search %d records' % (loop_check_point - loop_start, x)
    results = conn.search(query=q1)
    if results:
        for r in results:
            pass
        # print len(results)
    else:
        pass
Appreciate any help that you can give to help me scale up the searches.
Thanks!
Isn't it just a matter of concurrency?
You're doing all your queries in sequence. So a query has to finish before the next one can come in to play. If you have a 1ms RTT to the server, this will limit you to 1000 requests per second.
Try to run a few instances of your script in parallel and see what kind of performance you get.
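As a rough illustration, something like this (a sketch only; it reuses just the pyes calls already shown in the question, and the TermQuery import path may differ between pyes versions):

import time
from multiprocessing import Pool

from pyes import ES
from pyes.query import TermQuery  # import path may differ in your pyes version

def run_batch(n_queries):
    conn = ES(['localhost:9201'], timeout=20)  # one connection per worker process
    query = TermQuery("tax_name", "cellvibrio")
    for _ in range(n_queries):
        conn.search(query=query)
    return n_queries

if __name__ == '__main__':
    workers, per_worker = 4, 1000
    start = time.time()
    pool = Pool(workers)
    total = sum(pool.map(run_batch, [per_worker] * workers))
    pool.close()
    pool.join()
    print('%d searches in %.2f seconds' % (total, time.time() - start))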
There are several ways to improve that when using pyes.
First of all, try to get rid of the DottedDict class/object, which is used to turn every JSON dict into an object for every result you get.
Second, switch the JSON encoder to ujson.
These two changes improved performance a lot.
This has the disadvantage that you have to access results as plain dicts instead of the dotted version ("result.facets.attribute.term" becomes something like "result.facets['attribute']['term']" or "result.facets.get('attribute', {}).get('term', None)").
I did this by extending the ES class and replacing the "_send_request" function.
I need to create a tool that will check a domain's live MX records against what should be expected (we have had issues with some of our staff fiddling with them and causing all incoming mail to be redirected into the void).
Now I won't lie, I'm not a competent programmer in the slightest! I'm about 40 pages into "dive into python" and can read and understand the most basic code. But I'm willing to learn rather than just being given an answer.
So would anyone be able to suggest which language I should be using?
I was thinking of using Python and starting with something along the lines of os.system() running (dig +nocmd domain.com mx +noall +answer) to pull up the records, but then I get a bit confused about how to compare this to an existing set of records.
Sorry if that all sounds like nonsense!
Thanks
Chris
With the dnspython module (not built-in; you must pip install it):
>>> import dns.resolver
>>> domain = 'hotmail.com'
>>> for x in dns.resolver.resolve(domain, 'MX'):
...     print(x.to_text())
...
5 mx3.hotmail.com.
5 mx4.hotmail.com.
5 mx1.hotmail.com.
5 mx2.hotmail.com.
Take a look at dnspython, a module that should do the lookups for you just fine without needing to resort to system calls.
The above solutions are correct. There are some things I would like to add and update.
dnspython has been updated to work with Python 3 and has superseded the dnspython3 library, so use of dnspython is recommended.
The resolver strictly takes in the bare domain and nothing else.
For example: dnspython.org is a valid domain, not www.dnspython.org.
Here's a function if you want to get the mail servers for a domain.
from dns import resolver

def get_mx_server(domain: str = "dnspython.org") -> str:
    mail_servers = resolver.resolve(domain, 'MX')
    mail_servers = list(set([data.exchange.to_text()
                             for data in mail_servers]))
    return ",".join(mail_servers)