I am using Python 3 with the Couchbase client. I have 461,378 records in a Couchbase bucket, and RAM used/quota is 3.77GB / 5.78GB. I am trying to retrieve the documents using the following code:
list_of_rows = []
for idx, product_details in enumerate(CouchRepo.get_product_details_iterator()):
    list_of_rows.append(get_required_dict_for_df(product_details["data"]))
But I am getting the following error:
in __iter__
raw_rows = self.raw.fetch(self._mres)
couchbase.exceptions._TimeoutError_0x17 (generated, catch TimeoutError): <RC=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], HTTP Request failed. Examine 'objextra' for full result, Results=1, C Source=(src/http.c,144), OBJ=ViewResult<rc=0x17[Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout], value=None, http_status=0, tracing_context=0, tracing_output=None>, Tracing Output={":nokey:0": null}>
Basically, the iterator internally does:

while self._do_iter:
    raw_rows = self.raw.fetch(self._mres)
    for row in self._process_payload(raw_rows):
        yield row
I have tried setting a different operation_timeout, but I get the same error. I also looked into how to allocate more RAM to the bucket or node, but did not find a solution. I have gone through the following links but didn't find any implementation details:
https://docs.couchbase.com/python-sdk/current/client-settings.html
https://docs.couchbase.com/server/current/install/sizing-general.html
How can I retrieve all the records? Note that the number of records will increase in the future.
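One workaround, independent of the timeout setting, is to fetch the data in pages so that no single request has to stream all 461k rows before the client-side timeout fires. The sketch below shows only the paging logic; `fetch_page` is a hypothetical callable you would implement with your actual view or N1QL query (for example, by passing `limit` and `skip` parameters to it), and here it is faked in memory so the loop can be shown on its own:

```python
# A minimal paging sketch. `fetch_page(limit, skip)` is a stand-in for a real
# Couchbase query that returns at most `limit` rows starting at offset `skip`.

def iter_all_records(fetch_page, page_size=1000):
    """Yield every record by repeatedly fetching `page_size` rows at a time."""
    skip = 0
    while True:
        page = fetch_page(limit=page_size, skip=skip)
        if not page:
            break
        for record in page:
            yield record
        skip += page_size

# In-memory stand-in for the real query, just to exercise the loop:
data = list(range(2500))

def fake_fetch_page(limit, skip):
    return data[skip:skip + limit]

rows = list(iter_all_records(fake_fetch_page, page_size=1000))
```

Note that offset-based paging gets slower at deep offsets; a range filter on an indexed key ("keyset" paging) scales better. Separately, the client-settings page you linked describes per-service timeouts; in the 2.x Python SDK these are (to my recollection, so verify against your version) attributes on the Bucket object such as `views_timeout`, which is the one that applies to view requests like this one.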
Related
I am using the elasticsearch-py Python package to interact with Elasticsearch through code. I have a script that is meant to take each document from one index, generate a field + value, then re-index it into a new index.
The issue is that there are 1,216 documents in the first index, but only about 1,000 make it to the second one. Typically it is exactly 1,000 documents, occasionally higher (around 1,100), but it never reaches the full 1,216.
I usually keep the batch_size at 200, but changing it seems to have some effect on the number of documents that make it to the second index: changing it to 400 typically results in about 800 documents being transferred. Using parallel_bulk seems to give the same results as using bulk.
I believe the issue is with the generating process I am performing. For each document I am generating its ancestry (they are organized in a tree structure) by recursively getting its parent from the first index. This involves rapid document GET requests interwoven with Bulk API calls to index the documents and Scroll API calls to get the documents from the index in the first place.
Would activity like this cause the documents to not go through? If I remove (comment out) the recursive GET requests, all documents seem to go through every time. I have tried creating multiple Elasticsearch clients, but that wouldn't even help if ES itself is the bottleneck.
Here is the code if you're curious:
def complete_resources():
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        resource = result["_source"]
        ancestors = []
        parent = resource.get("parent")
        while parent is not None:
            ancestors.append(parent)
            parent = es.get(
                index=TEMP_INDEX_NAME,
                doc_type=TEMPORARY_DOCUMENT_TYPE,
                id=parent["uid"]
            ).get("_source").get("parent")
        resource["ancestors"] = ancestors
        resource["_id"] = resource["uid"]
        yield resource
This generator is consumed by helpers.parallel_bulk()
for success, info in helpers.parallel_bulk(
    client=es,
    actions=complete_resources(),
    thread_count=10,
    queue_size=12,
    raise_on_error=False,
    chunk_size=INDEX_BATCH_SIZE,
    index=new_primary_index_name,
    doc_type=PRIMARY_DOCUMENT_TYPE,
):
    if success:
        successful += 1
    else:
        failed += 1
        print('A document failed:', info)
This gives me the following result:
Time: 7 seconds
Successful: 1000
Failed: 0
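One way to reduce the burst of GET requests interleaved with the scroll (which is a plausible cause of the lost documents) is to memoize parent lookups: documents in a tree share ancestors, so most parents only need to be fetched once. This is a sketch rather than the poster's code; `get_parent(uid)` is a stand-in for the `es.get(...)` call above, and here it is faked so the caching logic can be checked in isolation:

```python
# Sketch: cache each document's parent so shared ancestors are fetched once.
# `get_parent(uid)` stands in for the es.get(...) call in the question; it
# should return the parent descriptor (a dict with a "uid" key) or None.

def build_ancestry(first_parent, get_parent, cache):
    """Return the full ancestor chain, consulting `cache` before `get_parent`."""
    ancestors = []
    parent = first_parent
    while parent is not None:
        ancestors.append(parent)
        uid = parent["uid"]
        if uid not in cache:
            cache[uid] = get_parent(uid)
        parent = cache[uid]
    return ancestors

# In-memory example of a three-level tree: c -> b -> a -> (root)
parents = {"c": {"uid": "b"}, "b": {"uid": "a"}, "a": None}
calls = []

def fake_get_parent(uid):
    calls.append(uid)
    return parents[uid]

cache = {}
chain1 = build_ancestry({"uid": "c"}, fake_get_parent, cache)
chain2 = build_ancestry({"uid": "c"}, fake_get_parent, cache)  # served from cache
```

With the cache shared across the whole `complete_resources()` generator, each unique ancestor costs one GET instead of one GET per descendant, which should also make the 7-second run less likely to starve the scroll context.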
The problem
I iterate over an entire vertex collection, e.g. journals, and use it to create author edges from a person to the given journal.
I use python-arango and the code is something like:
for journal in journals.all():
    create_author_edge(journal)
I have a relatively small dataset, and the journals-collection has only ca. 1300 documents. However: this is more than 1000, which is the batch size in the Web Interface - but I don't know if this is of relevance.
The problem is that it raises a CursorNextError, and returns HTTP 404 and ERR 1600 from the database, which is the ERROR_CURSOR_NOT_FOUND error:
Will be raised when a cursor is requested via its id but a cursor with that id cannot be found.
Insights to the cause
From ArangoDB Cursor Timeout, and from this issue, I suspect that it's because the cursor's TTL has expired in the database, and in the python stacktrace something like this is seen:
# Part of the stacktrace in the error:
(...)
if not cursor.has_more():
    raise StopIteration
cursor.fetch()  # <---- error raised here
(...)
If I iterate over the entire collection quickly, i.e. if I do print(len(journals.all())), it outputs "1361" with no errors.
When I replace the journals.all() with AQL, and increase the TTL parameter, it works without errors:
for journal in db.aql.execute("FOR j IN journals RETURN j", ttl=3600):
    create_author_edge(journal)
However, without the ttl parameter, the AQL approach gives the same error as using journals.all().
More information
A last piece of information: I'm running this on my personal laptop when the error is raised. On my work computer, the same code was used to create the graph and populate it with the same data, and no errors were raised there. Because I'm on holiday I don't have access to my work computer to compare versions, but both systems were installed during the summer, so there's a good chance the versions are the same.
The question
I don't know if this is an issue with python-arango or with ArangoDB. Because there is no problem when the TTL is increased, I believe it could indicate an issue with ArangoDB rather than the Python driver, but I cannot be sure.
(I've added a feature request to add ttl-param to the .all()-method here.)
Any insights into why this is happening?
I don't have the rep to create the tag "python-arango", so it would be great if someone would create it and tag my question.
Inside the server, the simple queries are translated to all().
As discussed in the referenced GitHub issue, simple queries don't support the TTL parameter and won't get it.
The preferred solution here is to use an AQL query on the client, so that you can specify the TTL parameter.
In general, you should refrain from pulling all documents from the database at once, since this may introduce other scaling issues. You should use proper AQL with FILTER statements backed by indices (use explain() to validate this) to fetch only the documents you require.
If you need to iterate over all documents in the database, use paging. This is usually best implemented by combining a range FILTER with a LIMIT clause:
FOR x IN docs
    FILTER x.offsetteableAttribute > @lastDocumentWithThisID
    SORT x.offsetteableAttribute
    LIMIT 200
    RETURN x
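The paging pattern above can be driven from Python. The sketch below keeps the loop logic separate from the database: `run_query(last)` is a stand-in for something like `db.aql.execute("FOR x IN docs FILTER x.key > @last SORT x.key LIMIT 200 RETURN x", bind_vars={"last": last})` in python-arango, and here it is faked in memory so the loop can be verified:

```python
# Sketch of keyset-style paging: keep asking for the next 200 documents whose
# sort key is greater than the last one we saw, until a page comes back empty.

def iter_paged(run_query, key=lambda doc: doc["key"]):
    last = ""  # lower than any real key; adjust for your key type
    while True:
        page = list(run_query(last))
        if not page:
            break
        for doc in page:
            yield doc
        last = key(page[-1])

# In-memory stand-in for the AQL query, already sorted by key:
docs = [{"key": "k%03d" % i} for i in range(450)]

def fake_run_query(last, page_size=200):
    matching = [d for d in docs if d["key"] > last]
    return matching[:page_size]

out = list(iter_paged(fake_run_query))
```

Because each request is small, the cursor never has to stay alive for the whole iteration, which sidesteps the TTL problem entirely.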
So here is how I did it with pyArango. The **moreArgs parameter makes it easy to do.
Looking at the source, you can see the docstring says what to do:
def AQLQuery(self, query, batchSize = 100, rawResults = False, bindVars = None,
             options = None, count = False, fullCount = False,
             json_encoder = None, **moreArgs):
    """Set rawResults = True if you want the query to return dictionnaries instead of Document objects.
    You can use **moreArgs to pass more arguments supported by the api, such as ttl=60 (time to live)"""
from pyArango.connection import *

conn = Connection(username=usr, password=pwd, arangoURL=url)  # set these for your setup
db = conn['dbName']  # set this to the name of your database
aql = "FOR journal IN journals RETURN journal"
doc = db.AQLQuery(aql, ttl=300)
That's all you need to do!
I am trying to connect the Knack online database with my Python data-handling scripts in order to renew objects/tables directly in my Knack app builder. I discovered pyknackhq, a Python API for KnackHQ that can fetch objects and return JSON for the object's records. So far so good.
However, following the documentation (http://www.wbh-doc.com.s3.amazonaws.com/pyknackhq/quick%20start.html) I have tried to fetch all rows (records, in Knack terms) for my object-table (344 records in total).
My code was:
i = 0
for rec in undec_obj.find():
    print(rec)
    i = i + 1
print(i)
>> 25
The first 25 records were indeed returned, but the rest (up to the 344th) never were. The documentation of the pyknackhq library is relatively sparse, so I couldn't find a way around my problem there. Is there a solution to get all my records/rows? (I have also changed the settings in Knack to have all my records appear on the same page - page 1.)
The ultimate goal is to take all records and make them a pandas dataframe.
thank you!
I haven't worked with that library, but I've written another python Knack API wrapper that should help:
https://github.com/cityofaustin/knackpy
The docs should get you where you want to go. Here's an example:
>>> from knackpy import Knack

# download data from knack object
# will fetch records in chunks of 1000 until all records have been downloaded
# optionally pass a rows_per_page and/or page_limit parameter to limit record count
>>> kn = Knack(
...     obj='object_3',
...     app_id='someappid',
...     api_key='topsecretapikey',
...     page_limit=10,       # not needed; this is the default
...     rows_per_page=1000   # not needed; this is the default
... )
>>> for row in kn.data:
...     print(row)
{'store_id': 30424, 'inspection_date': 1479448800000, 'id': '58598262bcb3437b51194040'},...
Hope that helps. Open a GitHub issue if you have any questions using the package.
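Since the stated goal is a pandas DataFrame: a list of record dicts (the shape `kn.data` has above) converts directly with the DataFrame constructor. A minimal sketch, using a hand-written sample in place of real Knack data:

```python
import pandas as pd

# Stand-in for kn.data: a list of dicts, one per record.
records = [
    {'store_id': 30424, 'inspection_date': 1479448800000, 'id': '58598262bcb3437b51194040'},
    {'store_id': 30425, 'inspection_date': 1479448900000, 'id': '58598262bcb3437b51194041'},
]

# Each dict becomes a row; keys become columns.
df = pd.DataFrame(records)
```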
When I search a query that has more than 10,000 matches I get the following error:
{u'message': u'Request depth (10100) exceeded, limit=10000', u'__type': u'#SearchException', u'error': {u'rid': u'zpXDxukp4bEFCiGqeQ==', u'message': u'[*Deprecated*: Use the outer message field] Request depth (10100) exceeded, limit=10000'}}
When I search for more narrowed down keywords and queries with less results, everything works fine and no error is returned.
I guess I have to limit the search somehow, but I'm unable to figure out how. My search function looks like this:
def execute_query_string(self, query_string):
    amazon_query = self.search_connection.build_query(q=query_string, start=0, size=100)
    json_search_results = []
    for json_blog in self.search_connection.get_all_hits(amazon_query):
        json_search_results.append(json_blog)
    results = []
    for json_blog in json_search_results:
        results.append(json_blog['fields'])
    return results
And it's being called like this:
results = searcher.execute_query_string(request.GET.get('q', ''))[:100]
As you can see, I've tried to limit the results with the start and size attributes of build_query(). I still get the error though.
I must have misunderstood how to avoid getting more than 10,000 matches on a search result. Can someone tell me how to do it?
All I can find on this topic is Amazon's Limits where it says that you can only request 10,000 results. It does not say how to limit it.
You're calling get_all_hits, which gets ALL results for your query. That is why your size param is being ignored.
From the docs:
get_all_hits(query) Get a generator to iterate over all search results
Transparently handles the results paging from Cloudsearch search
results so even if you have many thousands of results you can iterate
over all results in a reasonably efficient manner.
http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.get_all_hits
You should be calling search instead -- http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.search
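To make the shape of the fix concrete: fetch results page by page yourself and stop at your own cap, never paging past the service's 10,000-result depth limit. This is a sketch of the bounded loop only; `run_search(start, size)` is a stand-in for `self.search_connection.search(...)` with the same start/size parameters as `build_query`, faked in memory here so the capping logic can be checked:

```python
# Sketch: collect at most `max_results` hits, never requesting past the
# CloudSearch depth limit of 10,000.

DEPTH_LIMIT = 10000

def fetch_limited(run_search, max_results=100, page_size=100):
    results = []
    start = 0
    while len(results) < max_results and start < DEPTH_LIMIT:
        size = min(page_size, max_results - len(results), DEPTH_LIMIT - start)
        page = run_search(start=start, size=size)
        if not page:
            break
        results.extend(page)
        start += len(page)
    return results

# In-memory example: a broad query matching 25,000 documents.
all_hits = [{'fields': {'n': i}} for i in range(25000)]

def fake_run_search(start, size):
    return all_hits[start:start + size]

hits = fetch_limited(fake_run_search, max_results=100)
```

Because `start + size` never exceeds 10,000, the "Request depth exceeded" error cannot occur no matter how many documents match.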
I am using elasticsearch-py to connect to my ES database, which contains over 3 million documents. I want to return all the documents so I can extract data and write it to a CSV. I was able to accomplish this easily for 10 documents (the default return) using the following code.
es = Elasticsearch("glycerin")
query = {"query": {"match_all": {}}}
response = es.search(index="_all", doc_type="patent", body=query)
for hit in response["hits"]["hits"]:
    print hit
Unfortunately, when I attempted to implement the scan & scroll so I could get all the documents I ran into issues. I tried it two different ways with no success.
Method 1:
scanResp = es.search(index="_all", doc_type="patent", body=query, search_type="scan", scroll="10m")
scrollId = scanResp['_scroll_id']
response = es.scroll(scroll_id=scrollId, scroll="10m")
print response
After scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))
Method 2:
query = {"query": {"match_all": {}}}
scanResp = helpers.scan(client=es, query=query, scroll="10m", index="", doc_type="patent", timeout="10m")
for resp in scanResp:
    print "Hiya"
If I print out scanResp before the for loop I get <generator object scan at 0x108723dc0>. Because of this I'm relatively certain that I'm messing up my scroll somehow, but I'm not sure where or how to fix it.
Results:
Again, after scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))
I tried increasing the max retries for the transport class, but that didn't make a difference. I would very much appreciate any insight into how to fix this.
Note: My ES is located on a remote desktop on the same network.
The Python scan method generates a GET call to the REST API and tries to send your scroll_id over HTTP. The most likely cause here is that your scroll_id is too large to be sent over HTTP this way, so you are seeing this error because no response comes back.
Because the scroll_id grows with the number of shards you have, it is better to use a POST and send the scroll_id as JSON in the request body. This way you get around the limitation of it being too large for an HTTP call.
Did your issue get resolved?
I have one simple solution: you must update the scroll_id every time after you call the scroll method, like below:
response_tmp = es.scroll(scroll_id=scrollId, scroll="1m")
scrollId = response_tmp['_scroll_id']
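Putting that advice together, a full scroll loop that refreshes the scroll_id on every call might look like the sketch below. `es` is any client exposing `search`/`scroll` with the same response shapes as elasticsearch-py; here it is a small fake so the loop logic can be verified without a cluster:

```python
# Sketch: iterate every hit, updating the scroll_id after each scroll call.

def scroll_all(es, index, query, scroll="1m"):
    resp = es.search(index=index, body=query, scroll=scroll)
    scroll_id = resp['_scroll_id']
    hits = resp['hits']['hits']
    while hits:
        for hit in hits:
            yield hit
        resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
        scroll_id = resp['_scroll_id']  # refresh the id every time
        hits = resp['hits']['hits']

class FakeES:
    """In-memory stand-in returning documents in pages of `page`."""
    def __init__(self, docs, page=2):
        self.docs, self.page, self.pos = docs, page, 0
    def _next(self):
        batch = self.docs[self.pos:self.pos + self.page]
        self.pos += len(batch)
        return {'_scroll_id': 'sid-%d' % self.pos, 'hits': {'hits': batch}}
    def search(self, **kwargs):
        return self._next()
    def scroll(self, **kwargs):
        return self._next()

docs = [{'_id': str(i)} for i in range(5)]
out = list(scroll_all(FakeES(docs), index="_all", query={}))
```

The loop terminates when a scroll call returns an empty hits list, which is how the real API signals the end of the result set.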