How to upload a large dataset to AWS Elasticsearch cluster? - python

I have an AWS ElasticSearch cluster and I have created an index on it.
I want to upload 1 million documents in that index.
I am using Python package elasticsearch version 6.0.0 for doing so.
My payload structure is similar to this -
{
  "a": 1,
  "b": 2,
  "a_info": {
    "id": 1,
    "name": "Test_a"
  },
  "b_info": {
    "id": 1,
    "name": "Test_b"
  }
}
After discussion in the comment section, I realised that the total number of fields in a document also includes its subfields. So in my case, each document has around 60 fields in total.
I have tried the following methods -
Using the bulk() interface as described in the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.bulk).
The error that I received using this method is -
Timeout response after waiting for ~10-20 min.
In this method, I have also tried uploading documents in batches of 100, but I still get a timeout.
I have also tried adding documents one by one as per the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.create).
This method takes a lot of time to upload even one document.
Also, I am getting this error for a few of the documents -
TransportError(500, u'timeout_exception', u'Failed to acknowledge mapping update within [30s]')
My index settings are these -
{"Test":{"settings":{"index":{"mapping":{"total_fields":{"limit":"200000000"}},"number_of_shards":"5","provided_name":"Test","creation_date":"1557835068058","number_of_replicas":"1","uuid":"LiaKPAAoRFO6zWu5pc7WDQ","version":{"created":"6050499"}}}}}
I am new to the Elasticsearch domain. How can I upload my documents to the AWS ES cluster quickly?
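For reference, my batched bulk attempt looks roughly like this (a simplified sketch, not my exact code; the endpoint, timeout values and batching helper are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=["https://<my-aws-es-endpoint>:443"],  # placeholder endpoint
    timeout=120,
)

def upload_batch(batch):
    # build the alternating action/document lines expected by the _bulk endpoint
    body = []
    for doc in batch:
        body.append({"index": {"_index": "Test", "_type": "_doc"}})
        body.append(doc)
    es.bulk(body=body, request_timeout=120)

for batch in batches_of_100(documents):  # hypothetical batching helper
    upload_batch(batch)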

Related

Lambda.FunctionError with Kinesis Delivery Stream

I have a data processing pipeline that consists of API Gateway endpoint > Lambda handler > Kinesis delivery stream > Lambda transform function > Datadog.
A request to my endpoint triggers around 160k records to be generated for processing (these are spread across 11 different delivery streams, with exponential backoff on the Direct PUT into the delivery stream).
My pipeline is consistently losing around ~20k records (140k of the 160k show up in Datadog). I have confirmed through the metric aws.firehose.incoming_records that all 160k records are being submitted to the delivery stream.
Looking at the transform function's metrics, it shows no errors. I have error logging in the function itself which is not revealing any obvious issues.
In the Destination error logs details for the firehose, I do see the following:
{
  "deliveryStreamARN": "arn:aws:firehose:us-east-1:837515578404:deliverystream/PUT-DOG-6bnC7",
  "destination": "aws-kinesis-http-intake.logs.datadoghq.com...",
  "deliveryStreamVersionId": 9,
  "message": "The Lambda function was successfully invoked but it returned an error result.",
  "errorCode": "Lambda.FunctionError",
  "processor": "arn:aws:lambda:us-east-1:837515578404:function:prob_weighted_calculations-dev:$LATEST"
}
In addition, there are records in my S3 bucket for the failed deliveries. I re-ran the failed records through the transform function (I created a custom test event based on the data in the S3 bucket for a failed delivery), and the Lambda executed without an issue.
I read that a mismatch between the number of records sent to the transform function and the number it outputs can cause the above error log. So I put explicit error checking in the transform that raises an error within the function if the output record count does not match the input. Even with this, there are no errors in the Lambda.
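That check follows the standard Firehose transformation response contract, roughly like this (a simplified sketch; the actual Datadog-specific transformation logic is omitted):

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # ... Datadog-specific transformation omitted ...
        output.append({
            "recordId": record["recordId"],  # must match the incoming recordId
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    # Firehose expects exactly one output record per input record
    if len(output) != len(event["records"]):
        raise RuntimeError("record count mismatch")
    return {"records": output}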
I am at a loss here as to what could be causing this and do not feel confident in my pipeline given ~20k records are "leaking" without explanation.
Any suggestion on where to look to continue troubleshooting this issue would be greatly appreciated!

Python ElasticSearch updating field on multiple documents without script compilation rate limit

I'm using
from elasticsearch import Elasticsearch
from elasticsearch_dsl import UpdateByQuery
UpdateByQuery(index=index) \
    .using(es_client) \
    .query("match", id=<my_obj_id>) \
    .script(source=f"ctx._source.view_count=12345") \
    .execute()
to update the view_count field on one of my ElasticSearch documents.
The problem is that in production there are a lot of documents that need updating, and I get
TransportError(500, 'general_script_exception', '[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.update.max_compilations_rate] setting')
I'm not sure that increasing the limit is a long-term solution. However, I don't know how I'd do a bulk update over multiple documents at once to avoid so many calls.
You can use the bulk API; it allows you to update multiple documents at once (see the example there).
Generally, you should try to use "doc" updates instead of "script" whenever possible.
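For example, sending many partial-document ("doc") updates in one call with the bulk helper could look roughly like this (a sketch; view_counts is a hypothetical mapping of document _id to the new count, and it assumes you know each document's _id):

from elasticsearch import Elasticsearch, helpers

es_client = Elasticsearch()  # build the client however you already do

actions = (
    {
        "_op_type": "update",
        "_index": index,
        "_id": doc_id,
        "doc": {"view_count": new_count},  # partial update, no script compilation
        # on Elasticsearch 6.x you may also need "_type"
    }
    for doc_id, new_count in view_counts.items()
)

helpers.bulk(es_client, actions)

If you have to keep update_by_query, passing the value through script parameters (source="ctx._source.view_count = params.count", params={"count": new_count}) keeps the script text constant, so it should be compiled only once instead of once per distinct value.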

parsing exception: Expected [START_OBJECT] but found [START_ARRAY]

I'm trying to convert my Python ETLs to Airflow.
I have an ETL written in Python to copy data from Elastic to MSSQL.
I've built a DAG with 3 tasks.
task 1- get the latest date from the table in MSSQL
task 2- generate an Elastic query based on the date retrieved from the previous task, plus some filters (must_not and should) taken from a different table in MSSQL (less relevant).
eventually generating a body like so:
{ "query": {
"bool": {
"filter": {
"range": {
"#timestamp": { "gt": latest_timestamp }
}
},
"must_not": [],
"should": [],
"minimum_should_match": 1
}
}
}
task 3- scroll the Elastic index using the body generated in the previous task and write the data to MSSQL.
My DAG fails on the 3rd task with the error:
parsing exception: Expected [START_OBJECT] but found [START_ARRAY]
I've taken the generated body and run it against Elastic in Dev Tools, and it works fine.
So I have no idea what is the problem and how to debug it.
Any ideas?
I've found the problem.
I use XCom to pass the body between task 2 and task 3.
Apparently something in XCom is altering the body (I don't see it in the UI, at least).
When I put the body-building logic (task 2) and the search call in the same task, so the body is not passed via XCom, everything works as expected.
So beware of using XCom, because it apparently has side effects.
I'm not a big ELK guy, but I would assume a different format is required there.
When you are doing "scroll the elastic index", you most probably use some API that expects one query format, while the dev console expects another.
E.g. this thread:
https://discuss.elastic.co/t/unable-to-send-json-data-to-elastic-search/143506/3
Kibana Post Search - Expected [START_OBJECT] but found [VALUE_STRING]
So, check what format is expected by the API call you use to scroll through the data. Or, if it is still unclear, please share the function you use for scrolling in task 3.
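If you do want to keep passing the body through XCom, one defensive pattern is to serialize it to a JSON string yourself and re-parse it in task 3 before calling the search (a sketch; the task ids, index name, host and timestamp are illustrative, and the operator wiring is omitted):

import json
from elasticsearch import Elasticsearch

def build_query(**context):
    body = {"query": {"bool": {"filter": {"range": {"#timestamp": {"gt": "2021-01-01T00:00:00"}}}}}}
    # push an explicit JSON string so XCom serialization cannot change the structure
    return json.dumps(body)

def scroll_index(**context):
    raw = context["ti"].xcom_pull(task_ids="build_query")
    body = json.loads(raw) if isinstance(raw, str) else raw
    assert isinstance(body, dict), f"expected a dict query body, got {type(body)}"
    es = Elasticsearch(hosts=["http://<elastic-host>:9200"])  # placeholder host
    resp = es.search(index="<my-index>", body=body, scroll="2m", size=1000)
    # ... continue with es.scroll(scroll_id=resp["_scroll_id"], scroll="2m") and write to MSSQL ...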

Unable to fetch complete records from Salesforce using Python

I am trying to fetch data from Salesforce using the simple_salesforce library in Python.
I am able to get the correct count of records while running the count query.
But when I try to put the results (in the form of a list) into S3 as a JSON object, fewer records get persisted than I captured from Salesforce.
Here is the piece of code:
result = sf.query("SELECT ID FROM Opportunity")['records']
object.put(Body=(bytes(json.dumps(result, indent=2).encode('UTF-8'))))
Is the problem on the Salesforce side or am I running into an issue using AWS's SDK to put the objects into S3?
The Salesforce API returns results in chunks; the default is 2000 records at a time. If it returned 1M records to you at once, it could kill your memory usage. Retrieve a chunk, process it (save it to a file?), then request the next chunk.
It's right there on the project's homepage:
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
As a convenience, to retrieve all of the results in a single local method call use
sf.query_all("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
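Putting that together with the S3 upload, a paginated version could look roughly like this (a sketch; the bucket, key and Salesforce credentials are placeholders):

import json
import boto3
from simple_salesforce import Salesforce

sf = Salesforce(username="<user>", password="<password>", security_token="<token>")  # placeholders
s3 = boto3.resource("s3")

records = []
result = sf.query("SELECT Id FROM Opportunity")
records.extend(result["records"])

# follow nextRecordsUrl until Salesforce reports the result set is complete
while not result["done"]:
    result = sf.query_more(result["nextRecordsUrl"], True)
    records.extend(result["records"])

s3.Object("<my-bucket>", "opportunities.json").put(
    Body=json.dumps(records, indent=2).encode("utf-8")
)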

How to fix query problem on Azure CosmosDB that occurs only on collections with large data?

I'm trying to read from a CosmosDB collection (MachineCollection) with a large amount of data (58 GB of data; index size 9 GB). Throughput is set to 1000 RU/s. The collection is partitioned by serial number, with read locations West Europe and North Europe and write location West Europe. Simultaneously with my reading attempts, MachineCollection is fed with data every 20 seconds.
The problem is that I can not query any data via Python. If I execute the query on CosmosDB Data Explorer I get results in no time. (e.g. querying for a certain serial number).
For troubleshooting purposes, I have created a new database (TestDB) and a TestCollection. The TestCollection contains 10 datasets from MachineCollection. If I try to read from this TestCollection via Python, it succeeds and I am able to save the data to CSV.
This makes me wonder why I am not able to query data from MachineCollection when configuring TestDB and TestCollection with the exact same properties.
What I have already tried for the querying via Python:
options['enableCrossPartitionQuery'] = True
Querying using PartitionKey: options['partitionKey'] = 'certainSerialnumber'
Same as always. Works with TestCollection, but not with MachineCollection.
Any ideas on how to resolve this issue are highly appreciated!
Firstly, what you need to know is that DocumentDB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large data from DocumentDB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
By looking at the DocumentDB REST API, you can observe several important parameters which have a significant impact on query operations: x-ms-max-item-count, x-ms-continuation.
As far as I know, the Azure portal doesn't automatically help you optimize your SQL, so you need to handle this in the SDK or REST API.
You could set the value of Max Item Count and paginate your data using the continuation token. The DocumentDB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount':10})
results_1 = q._fetch_function({'maxItemCount':10})
#this is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount':10,'continuation':token})
Another case you could refer to: How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?
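For completeness, if you just need all matching documents, iterating the query result directly also pages through continuation tokens under the hood (a sketch using the same pydocumentdb-style client as above; the query text, field name and options are illustrative):

options = {'enableCrossPartitionQuery': True, 'maxItemCount': 10}
query = {'query': "SELECT * FROM c WHERE c.SerialNumber = 'certainSerialnumber'"}

machines = []
# the returned iterable fetches one page at a time and follows
# the x-ms-continuation header internally
for doc in client.QueryDocuments(collection_link, query, options):
    machines.append(doc)

If the iterator behaves differently in your SDK version, the explicit continuation-token pattern shown above still applies.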
