I have a data processing pipeline that consists of: API Gateway endpoint > Lambda handler > Kinesis Firehose delivery stream > Lambda transform function > Datadog.
A request to my endpoint triggers around 160k records to be generated for processing (these are spread across 11 different delivery streams, with exponential backoff on the Direct Put into each delivery stream).
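For context, the Direct Put with backoff follows roughly this pattern (a simplified sketch, not the exact production code; the stream names and batch handling here are placeholders):

import time
import boto3

firehose = boto3.client("firehose")

def put_with_backoff(stream_name, payloads, max_attempts=5):
    """Direct Put a batch and retry only the entries Firehose reports as failed."""
    # Firehose expects {"Data": <bytes>} entries, at most 500 per PutRecordBatch call.
    pending = [{"Data": p} for p in payloads]
    for attempt in range(max_attempts):
        resp = firehose.put_record_batch(DeliveryStreamName=stream_name, Records=pending)
        if resp["FailedPutCount"] == 0:
            return
        # Keep only the entries whose response carries an ErrorCode, then retry those.
        pending = [entry for entry, result in zip(pending, resp["RequestResponses"])
                   if result.get("ErrorCode")]
        time.sleep((2 ** attempt) * 0.1)  # exponential backoff
    raise RuntimeError(f"{len(pending)} records still failing after {max_attempts} attempts")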
My pipeline is consistently losing around 20k records (140k of the 160k show up in Datadog). I have confirmed through the aws.firehose.incoming_records metric that all 160k records are being submitted to the delivery streams.
Looking at the transform function's metrics, it shows no errors. I have error logging in the function itself which is not revealing any obvious issues.
In the destination error log details for the Firehose, I do see the following:
{
"deliveryStreamARN": "arn:aws:firehose:us-east-1:837515578404:deliverystream/PUT-DOG-6bnC7",
"destination": "aws-kinesis-http-intake.logs.datadoghq.com...",
"deliveryStreamVersionId": 9,
"message": "The Lambda function was successfully invoked but it returned an error result.",
"errorCode": "Lambda.FunctionError",
"processor": "arn:aws:lambda:us-east-1:837515578404:function:prob_weighted_calculations-dev:$LATEST"
}
In addition, there are records in my S3 bucket for failed deliveries. I re-ran the failed records through the transform function (I created a custom test event based on the data in the S3 bucket for the failed delivery), and the Lambda executed without an issue.
I read that a mismatch between the number of records sent to the transform function and the number of records it outputs can cause the above error log. So I put explicit error checking in the transform that raises an error within the function if the output record count does not match the input count (roughly as sketched below). Even with this, no errors in the Lambda.
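For what it's worth, the check I added looks roughly like this (a trimmed-down sketch; the actual transformation logic is omitted):

import base64
import json

def handler(event, context):
    """Firehose transform: return exactly one output record per input recordId."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        transformed = payload  # real transformation omitted
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(transformed).encode("utf-8")).decode("utf-8"),
        })

    # Explicit guard: a count mismatch is one reported cause of Lambda.FunctionError.
    if len(output) != len(event["records"]):
        raise RuntimeError(f"record count mismatch: {len(event['records'])} in, {len(output)} out")

    return {"records": output}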
I am at a loss here as to what could be causing this and do not feel confident in my pipeline given ~20k records are "leaking" without explanation.
Any suggestions on where to look to continue troubleshooting this issue would be greatly appreciated!
Related
While batch loading data to BigQuery with max bad records set to 5000, the BigQuery error stream provides 5 error records.
When I change max bad records to 100 and load the same file, the load fails.
If my understanding is correct, this means there are more bad records than the 5 I got previously, but BigQuery is not logging them on the error stream.
Can anyone explain why this is so?
BigQuery stream error:
BigQuery's job error stream only provides the initial errors it encounters; it makes no guarantees that it will provide an exhaustive list of all errors.
See the REST reference documentation for more information. The error stream lives inside the JobStatus submessage:
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobstatus
If you want to do more extensive validation of input files, I'd recommend some kind of preprocessing (perhaps something in Dataflow/Beam), or switching to a self-describing format like Avro or Parquet. CSV is a somewhat notorious format due to the many idiosyncrasies and differences among various readers and writers.
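If it helps, here is a minimal load-job sketch with the google-cloud-bigquery client (the URI and table id are placeholders) that surfaces whatever errors the job does report; keep in mind job.errors is only a sample, per the above:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=100,  # the load fails once bad rows exceed this threshold
)

# Placeholder URI and table id for illustration.
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)

try:
    load_job.result()  # wait for the job to finish
except Exception:
    # Mirrors the JobStatus error stream: the first errors encountered, not all of them.
    for err in load_job.errors or []:
        print(err)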
I am trying to fetch the data from salesforce using the simple_salesforce library in python.
I am able to get the correct count of records while running the count query.
But when I put those results (in the form of a list) into S3 as a JSON object, not as many records are getting persisted as I captured from Salesforce.
Here is the piece of code:
result = sf.query("SELECT ID FROM Opportunity")['records']
object.put(Body=(bytes(json.dumps(result, indent=2).encode('UTF-8'))))
Is the problem on the Salesforce side or am I running into an issue using AWS's SDK to put the objects into S3?
The Salesforce API returns results in chunks; the default is 2000 records at a time. If it returned 1M records at once, it could kill your memory usage. Retrieve a chunk, process it (save to a file?), then request the next chunk.
It's straight on the project's homepage:
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
As a convenience, to retrieve all of the results in a single local
method call use
sf.query_all("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
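Applied to your S3 upload, here is a rough sketch that walks nextRecordsUrl chunk by chunk before writing (the bucket, key, and credentials are placeholders):

import json
import boto3
from simple_salesforce import Salesforce

sf = Salesforce(username="user", password="pw", security_token="token")  # placeholders
s3 = boto3.resource("s3")

result = sf.query("SELECT Id FROM Opportunity")
records = list(result["records"])

# Follow nextRecordsUrl until Salesforce reports the result set is complete.
while not result["done"]:
    result = sf.query_more(result["nextRecordsUrl"], True)  # True = identifier is a URL
    records.extend(result["records"])

s3.Object("my-bucket", "opportunities.json").put(
    Body=json.dumps(records, indent=2).encode("utf-8")
)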
I have an AWS ElasticSearch cluster and I have created an index on it.
I want to upload 1 million documents in that index.
I am using Python package elasticsearch version 6.0.0 for doing so.
My payload structure is similar to this -
{
"a":1,
"b":2,
"a_info":{
"id":1,
"name":"Test_a"
},
"b_info":{
"id":1,
"name":"Test_b"
}
}
After discussion in the comment section, I realise that the total number of fields in a document also includes its subfields. So in my case, the total number of fields in each document comes to 60.
I have tried the following methods -
Using the bulk() interface as described in the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.bulk).
The error that I received using this method is a timeout response after waiting for ~10-20 minutes.
With this method, I have also tried uploading documents in batches of 100, but I still get timeouts (a trimmed sketch of this attempt is at the end of this question).
I have also tried adding documents one by one as per the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.create).
This method takes a lot of time to upload even one document.
Also, I am getting this error for a few of the documents -
TransportError(500, u'timeout_exception', u'Failed to acknowledge mapping update within [30s]')
My index settings are these -
{"Test":{"settings":{"index":{"mapping":{"total_fields":{"limit":"200000000"}},"number_of_shards":"5","provided_name":"Test","creation_date":"1557835068058","number_of_replicas":"1","uuid":"LiaKPAAoRFO6zWu5pc7WDQ","version":{"created":"6050499"}}}}}
I am new to the Elasticsearch domain. How can I upload my documents to the AWS ES cluster quickly?
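For reference, a trimmed sketch of the batched bulk attempt mentioned above (shown here with the helpers.bulk convenience wrapper rather than the raw Elasticsearch.bulk call; the endpoint is a placeholder and the document source is simplified):

from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint; the real cluster is the AWS-managed domain.
es = Elasticsearch(["https://my-es-domain.us-east-1.es.amazonaws.com:443"], timeout=120)

def actions(docs):
    """Wrap each payload dict in a bulk index action for the 'Test' index."""
    for doc in docs:
        yield {"_index": "Test", "_type": "_doc", "_source": doc}

def upload(docs, batch_size=100):
    """Send documents in batches of batch_size using the bulk helper."""
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        helpers.bulk(es, actions(batch), request_timeout=120)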
Is it possible to get lifetime data using the facebookads API in Python? I tried using date_preset: lifetime and time_increment: 1, but got a server error instead. Then I found this on their website:
"We use data-per-call limits to prevent a query from retrieving too much data beyond what the system can handle. There are 2 types of data limits:
By number of rows in response, and
By number of data points required to compute the total, such as summary row."
Is there any way I can do this? And another question: is there any way to pull raw data from a Facebook ad account, like a dump of all the data that resides on Facebook for an ad account?
The first thing to try is to add the limit parameter, which limits the number of results returned per page.
However, if the account has a large amount of history, the likelihood is that the total amount of data is too great, and in this case, you'll have to query ranges of data.
As you're looking for data by individual day, I'd start trying to query for month blocks, and if this is still too much data, query for each date individually.
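A rough sketch of that approach with the Python SDK (shown with facebook_business, the successor to facebookads; module paths may differ in older versions, and the account id, dates, and fields are placeholders):

from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adaccount import AdAccount

FacebookAdsApi.init(app_id="APP_ID", app_secret="APP_SECRET", access_token="TOKEN")  # placeholders
account = AdAccount("act_1234567890")

# Query one month block at a time, broken down by day, with a per-page limit.
insights = account.get_insights(
    fields=["spend", "impressions", "clicks"],
    params={
        "time_range": {"since": "2019-01-01", "until": "2019-01-31"},
        "time_increment": 1,
        "limit": 500,
    },
)

for row in insights:  # the cursor pages through results automatically
    print(dict(row))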
I am trying to interact with a DynamoDB table from python using boto. I want all reads/writes to be quorum consistency to ensure that reads sent out immediately after writes always reflect the correct data.
NOTE: my table is set up with "phone_number" as the hash key and first_name+last_name as a secondary index. And for the purposes of this question one (and only one) item exists in the db (first_name="Paranoid", last_name="Android", phone_number="42")
The following code works as expected:
customer = customers.get_item(phone_number="42")
While this statement:
customer = customers.get_item(phone_number="42", consistent_read=True)
fails with the following error:
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{u'message': u'The provided key element does not match the schema', u'__type': u'com.amazon.coral.validate#ValidationException'}
Could this be the result of some hidden data corruption due to failed requests in the past? (for example two concurrent and different writes executed at eventual consistency)
Thanks in advance.
It looks like you are calling the get_item method, so the issue is with how you are passing parameters.
get_item(hash_key, range_key=None, attributes_to_get=None, consistent_read=False, item_class=<class 'boto.dynamodb.item.Item'>)
Which would mean you should be calling the API like:
customer = customers.get_item(hash_key="42", consistent_read=True)
I'm not sure why the original call you were making was working.
To address your concerns about data corruption and eventual consistency: it is highly unlikely that any API call you could make to DynamoDB would leave it in a bad state, short of you sending it bad data for an item. DynamoDB is a highly tested service that provides exceptional availability and goes to extraordinary lengths to take care of the data you send it.
Eventual consistency is something to be aware of with DynamoDB, but generally speaking it does not cause many issues, depending on the specifics of the use case. While AWS does not publish specific numbers on what "eventually consistent" looks like, in day-to-day use it is normal to be able to read back records that were just written or modified within a second, even with eventually consistent reads.
As for performing multiple writes simultaneously on the same item: DynamoDB writes are always strongly consistent. If you are worried about an individual item being modified concurrently and producing unexpected behavior, you can use conditional writes, which let the write fail so that your application logic can deal with any conflict that arises (see the sketch below).
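To illustrate the conditional-write idea (a sketch shown with boto3 for brevity; the table name is assumed):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("customers")  # assumed table name

try:
    # Only write if no item with this phone_number exists yet;
    # otherwise DynamoDB rejects the write instead of silently overwriting.
    table.put_item(
        Item={"phone_number": "42", "first_name": "Paranoid", "last_name": "Android"},
        ConditionExpression="attribute_not_exists(phone_number)",
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # item already exists; reconcile in application logic
    else:
        raise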