parsing exception: Expected [START_OBJECT] but found [START_ARRAY] - python

I'm trying to convert my python ETLs to airflow.
I have an ETL written in python to copy data from Elastic to MSSQL.
I've built a DAG with 3 tasks.
task 1 - get the latest date from the table in MSSQL
task 2 - generate an Elastic query based on the date retrieved from the previous task plus some filters (must_not and should) taken from a different table in MSSQL (less relevant),
eventually generating a body like so:
{ "query": {
"bool": {
"filter": {
"range": {
"#timestamp": { "gt": latest_timestamp }
}
},
"must_not": [],
"should": [],
"minimum_should_match": 1
}
}
}
task 3 - scroll the Elastic index using the body generated in the previous task and write the data to MSSQL.
My DAG fails on the 3rd task with the error:
parsing exception: Expected [START_OBJECT] but found [START_ARRAY]
I've taken the generated body and run it in Elastic dev tools, and it works fine.
So I have no idea what the problem is or how to debug it.
Any ideas?

I've found the problem.
I use XCOM to pass the body between Task2 and Task3.
Apparently something in the XCom is mangling the body (I can't see it in the UI anyway).
When I keep the body-building logic (task 2) in the same task and call the search with that body, without passing it via XCom, everything works as expected.
So beware of using XCom, because it apparently has side effects.
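For anyone hitting the same thing, below is a minimal sketch of keeping the XCom route working (task ids, index name, and connection details are hypothetical): push the body as a JSON string and decode it back into a dict right before the search, so nothing re-serializes it on the way.

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

def build_body(**context):
    latest_timestamp = context["ti"].xcom_pull(task_ids="get_latest_date")
    body = {
        "query": {
            "bool": {
                "filter": {"range": {"#timestamp": {"gt": latest_timestamp}}},
                "must_not": [],
                "should": [],
                "minimum_should_match": 1,
            }
        }
    }
    return json.dumps(body)  # pushed to XCom as a plain JSON string

def scroll_and_copy(**context):
    body = json.loads(context["ti"].xcom_pull(task_ids="build_body"))  # back to a dict
    es = Elasticsearch("http://elastic:9200")  # assumed connection details
    for hit in scan(es, index="my-index", query=body):
        pass  # write hit["_source"] to MSSQL here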

I'm not a big ELK guy, but I would assume a different format is required there.
When you are doing "scroll the elastic index", you most probably use some API that expects one query format, while the dev console expects another.
E.g. this thread:
https://discuss.elastic.co/t/unable-to-send-json-data-to-elastic-search/143506/3
Kibana Post Search - Expected [START_OBJECT] but found [VALUE_STRING]
So, check what format is expected by the API call you use to scroll through the data. Or, if still unclear, please share the function you use for scrolling in task 3.
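For instance, a quick sanity check inside task 3's callable, just to see what actually arrives (the task id below is hypothetical):

def scroll_task(**context):
    body = context["ti"].xcom_pull(task_ids="generate_query")
    print(type(body), repr(body)[:500])  # a str or a list here would explain START_ARRAY
    # ... existing scroll logic ...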

Related

Call a BigQuery stored procedure in a Dataflow pipeline

I have written a stored procedure in BigQuery and am trying to call it within a Dataflow pipeline. This works for SELECT queries but not for the stored procedure:
pipeLine = beam.Pipeline(options=options)
rawdata = (
    pipeLine
    | beam.io.ReadFromBigQuery(
        query="CALL my_dataset.create_customer()", use_standard_sql=True)
)
pipeLine.run().wait_until_finish()
Stored procedure:
CREATE OR REPLACE PROCEDURE my_dataset.create_customer()
BEGIN
SELECT *
FROM `project_name.my_dataset.my_table`
WHERE customer_name LIKE "%John%"
ORDER BY created_time
LIMIT 5;
END;
I am able to create the stored procedure and call it within the Bigquery console. But, in the dataflow pipeline, it throws an error while calling it:
"code": 400,
"message": "configuration.query.destinationEncryptionConfiguration cannot be set for scripts",
"message": "configuration.query.destinationEncryptionConfiguration cannot be set for scripts",
"domain": "global",
"reason": "invalid"
"status": "INVALID_ARGUMENT"
Edit:
Is there any other method in Beam that I can use to call the stored procedure in BigQuery?
I see multiple threads raised on the same issue, but did not find an answer, so I thought to post this question. Thank you for any help.
The principle of a procedure is to perform a job and to return nothing. The principle of a function is to perform a job and to return something.
You can't use a stored procedure as the source of a read in Dataflow, so your error is expected. Parameterized views are in the works to achieve what you want. The solution for now is to use a UDF or to write the query directly in your code, as in the sketch below.
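For example, a minimal sketch of the "write the query in your code" option, with the procedure's SELECT inlined into ReadFromBigQuery instead of a CALL (table and project names taken from the question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()
with beam.Pipeline(options=options) as pipeLine:
    rawdata = (
        pipeLine
        | beam.io.ReadFromBigQuery(
            query="""
                SELECT *
                FROM `project_name.my_dataset.my_table`
                WHERE customer_name LIKE "%John%"
                ORDER BY created_time
                LIMIT 5
            """,
            use_standard_sql=True,
        )
    )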
EDIT 1
What do you want to do?
Do you want to get data? If so, it's not a procedure that you have to use.
Do you simply want to call a stored procedure? If so, perform an API call with the BigQuery client library to run a CALL query, that's all. But you will have to update your stored procedure, because for now it only does a projection, which is pointless for a procedure.
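If the goal really is just to run the procedure, here is a minimal sketch with the google-cloud-bigquery client library (project and procedure names are taken from the question, everything else is assumed):

from google.cloud import bigquery

client = bigquery.Client(project="project_name")
job = client.query("CALL my_dataset.create_customer()")  # runs as a script job
job.result()  # block until the procedure call finishes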

How to handle an update of many docs in a single query in Elasticsearch and Python script

I'm working on a Python script which, in simple terms, will update the stock field of every document whose _id matches an id that I get from
a DB2 query; that query returns two columns: catentry_id and stock. So the idea is to find every single id from DB2 in every single document of ES and update the stock in ES with the value from DB2.
I'm new to the world of ES, and I did many searches, read many sites and also the documentation, looking for a way to handle this.
I first tried to get all the docs of the index using the query below; I put the results into an object in Python for the next iteration against the result set from DB2.
GET /_search
{
  "_source": {
    "includes": ["_id", "stock"],
    "excludes": ["_index", "_score", "_type", "boost", "brand", "cat_1", "cat_1_id",
                 "cat_1_url", "cat_2", "cat_2_id", "cat_2_url", "cat_3", "cat_3_id",
                 "cat_3_url", "cat_4", "cat_4_id", "cat_4_url", "cat_5", "cat_5_id",
                 "cat_5_url", "category", "category_breadcrumbs", "children",
                 "children_tmp", "delivery", "discount", "fullImage", "id", "keyword",
                 "longDescription", "name", "partNumber", "pickupinstore", "price",
                 "price_internet", "price_m2", "price_tc", "product_can", "published",
                 "ribbon_ads", "shipping_normal", "shortDescription", "specs",
                 "specs_open", "thumb", "ts", "url"]
  },
  "query": {
    "range": {
      "stock": {
        "gte": 0
      }
    }
  }
}
But I don't know how to create the proper query to update all the docs. I was thinking of trying to do it with a Painless script or _bulk, but I didn't find any example of anyone doing a similar task.
Update:
I was able to solve the task following the guidelines in the linked thread, but in my case it takes approx. 20 min to update all the docs in Elastic.
First I tried to solve the update with bulk or parallel_bulk, but then I figured out that those bulk updates replace the whole source and don't work with a Painless script; if I'm wrong about that, maybe I just couldn't make it work.
Second, I tried to compare only the stock values that differ between DB2 and ES, and that saved me a lot of time, but for some reason I'm not 100% sure it is updating the correct number of docs.
And the last crazy thing I tried was to pass the final dictionary into a Painless script as a param and iterate inside the script, but that didn't work; since I'm new to this and read that Painless syntax is similar to Groovy, I tried to iterate over the dictionary as a map, and again it didn't work, the API throws a syntax error.
I would like to optimize this, but my sprint finished last week and now I have other tasks.
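In case it helps someone, here is a minimal sketch of the partial-update route with the elasticsearch-py bulk helpers (index name, connection details, and the sample db2_rows are hypothetical); each action only touches the stock field instead of reindexing the whole source:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # assumed connection details
db2_rows = [(12345, 8), (67890, 0)]           # (catentry_id, stock) pairs from the DB2 query

def stock_updates(rows, index="my_index"):
    for catentry_id, stock in rows:
        yield {
            "_op_type": "update",         # partial update, not a full document replacement
            "_index": index,              # clusters with mapping types may also need "_type"
            "_id": catentry_id,
            "doc": {"stock": stock},      # only the stock field is touched
        }

success, errors = helpers.bulk(es, stock_updates(db2_rows), raise_on_error=False)
print(success, errors)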

How to upload a large dataset to AWS Elasticsearch cluster?

I have an AWS ElasticSearch cluster and I have created an index on it.
I want to upload 1 million documents in that index.
I am using Python package elasticsearch version 6.0.0 for doing so.
My payload structure is similar to this -
{
  "a": 1,
  "b": 2,
  "a_info": {
    "id": 1,
    "name": "Test_a"
  },
  "b_info": {
    "id": 1,
    "name": "Test_b"
  }
}
After discussion in the comment section, I realised that the total number of fields in a document also includes its subfields. So in my case, the total number of fields in each document comes to 60.
I have tried the following methods -
Using the bulk() interface as described in the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.bulk).
The errors that I received using this method are:
Timeout response after waiting for ~10-20 min.
In this method, I have also tried uploading documents in batches of 100 but still get timeouts.
I have also tried adding documents one by one as per the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.create).
This method takes a lot of time to upload even one document.
Also, I am getting this error for a few of the documents:
TransportError(500, u'timeout_exception', u'Failed to acknowledge mapping update within [30s]')
My index settings are these -
{"Test":{"settings":{"index":{"mapping":{"total_fields":{"limit":"200000000"}},"number_of_shards":"5","provided_name":"Test","creation_date":"1557835068058","number_of_replicas":"1","uuid":"LiaKPAAoRFO6zWu5pc7WDQ","version":{"created":"6050499"}}}}}
I am new to the Elasticsearch domain. How can I upload my documents to the AWS ES cluster quickly?
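Not a full answer, but a minimal sketch of one thing worth trying with the same elasticsearch-py package: stream the documents in modest chunks with a longer request timeout, rather than sending one huge bulk body (the endpoint, chunk size, and sample documents below are assumptions):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("https://my-aws-es-endpoint:443", timeout=120)  # assumed endpoint/timeout

def actions(documents):
    for doc in documents:
        yield {
            "_index": "Test",   # index name from the question
            "_type": "_doc",    # still required on 6.x clusters
            "_source": doc,
        }

my_documents = [{"a": 1, "b": 2, "a_info": {"id": 1, "name": "Test_a"}}]  # stand-in data

for ok, item in streaming_bulk(es, actions(my_documents), chunk_size=500):
    if not ok:
        print("failed:", item)

The mapping-update timeout also suggests dynamic mapping is creating fields on the fly; defining the mapping up front, instead of relying on a huge total_fields limit, usually avoids that, though that's an assumption about your data.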

Elasticsearch: update multiple docs with python script?

I have a scenario where we have an array field on documents and we're trying to update a key/value in that array on each document, so for instance a doc would look like:
_source: {
  "type": 1,
  "items": [
    {"item1": "value1"},
    {"item2": "value2"}
  ]
}
We're trying to efficiently update "value1" for instance on every doc of "type": 1. We'd like to avoid conflicts, and we're hoping we can do this all by using a script, preferably in Python, but I can't find any examples of how to update fields in Python, let alone across multiple documents.
So, is it possible to do this with a script and if so, does anyone have a good example?
Thanks
I know this is a little late, but I came across this in a search so I figured I would answer for anyone else who comes by. You can definitely do this utilizing the elasticsearch python library.
You can find all of the info and examples you need via the Elasticsearch RTD.
More specifically, I would look into the "ingest" operations as you can update specific pieces of documents within an index using elasticsearch.
So your script should do a few things:
Gather list of documents using the "search" operation
Depending on size, or if you want to thread, you can store in a queue
Loop through list of docs / pop doc off queue
Create updated field list
Call "update" operations EX: self.elasticsearch.update(index=MY_INDEX, doc_type=1, id=123456, body={"item1": "updatedvalue1"})

Is it possible to use web2py language in javascript file (.js)?

I'm trying to use AJAX with the web2py language, but I have a problem.
My code is:
javascript
$(document).ready(function(){
    $(".className").click(function(){
        jQuery.ajax({
            type: 'POST',
            url: 'getName',
            data: {
                itemName: 'a'
            },
            timeout: 20000,
            success: function(msg) {
                alert(msg);
            },
            error: function(objAJAXRequest, strError){
                alert("Error:" + strError);
            }
        });
    });
});
default.py
def getName():
    itemName = request.vars.itemName
    return "Name: " + itemName
The thing is I want to use the data from the database, but is it possible to use
{{for item in tableName:}}
var name={{=item.name}}
like this?
I'm not sure how to extract data from DB in javascript.
Can you help me a bit?
Cheers
The short answer is that you can't directly extract data from the db in javascript using web2py. You have to query the db with web2py, and then use web2py to send your query data to the javascript (or more accurately since you're using ajax, use jquery/javascript to pull your query data down from web2py). Make sure that you actually need to perform logic on the client (javascript) side here, because the easiest thing to do would be to perform all your logic in python in the web2py controller.
However, if you do for some reason need to perform logic on your data on the client side, then you are on the right track. The easiest thing for you to do would be to fetch the records you need from the db in the web2py controller, then package the records up as a JSON object in that controller (import simplejson, or use the standard library json module in recent versions of Python), then return that JSON object to make it available for your js to fetch using the ajax request you've included above. Only at that point should you loop through the JSON object in the javascript to get the data you need.
That said, if you're just trying to get a field from a single record from the database, the easiest thing to do would be just to query the database correctly to get that record, and return it in a web2py controller for your ajax script to get.
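For completeness, a rough sketch of that approach on the web2py side, assuming a hypothetical table db.tableName with a name field; the controller returns JSON, and the ajax success callback can then parse it (e.g. with JSON.parse) instead of putting template code inside the .js file:

# in default.py
import json  # or simplejson on older Python versions

def getNames():
    rows = db().select(db.tableName.name)          # query the db in the controller
    return json.dumps([row.name for row in rows])  # hand the data to the ajax caller as JSON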
