Python JSON transformation from explicit to generic by configuration - python

I have an explicit JSON input that I want to transform into metadata driven generic objects within an array. I have successfully done this on an individual basis however, I want to be able to do it using a configuration file instead.
What I have below is an example of the input data, the configuration I want to apply and the output data.
Since it is outputing into a generic schema, no matter what the input value data type is, I want it to always output as a string.
In addition, the origin data may not always exist in the origin payload so, when I did an individual one of these, I used try which worked really well however, when doing it via a second file of configuration, I am not sure if it will still be the same method, I guess it would, I expect a loop through the configuration file and it create whatever it can else skips to the next one until completed.
INPUT ORIGIN DATA
{
"activities_acceptance" : {
"contractors_sub_contractors" : {
"contractors_subcontractors_engaged" : "yes"
},
"cooking_deep_frying" : {
"deep_frying_engaged" : "yes",
"deep_fryer_vat_limit" : 10
}
},
"situation_acceptance" : {
"building_construction" : {
"wall_materials" : "CONCRETE"
}
}
}
CONFIGURATION PARAMETERS
{
"processiong_configuration" : [
{
"origin_path" : "activities_acceptance.contractors_sub_contractors",
"set_category" : "business-activity",
"set_type" : "contractors-subcontractors",
"set_value" : [
{
"use_value" : "activities_acceptance.contractors_sub_contractors.contractors_subcontractors_engaged",
"set_value" : "value"
}
]
},
{
"origin_path" : "activities_acceptance.cooking_deep_frying",
"set_category" : "business-activity",
"set_type" : "cooking-deep-frying",
"set_value" : [
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_frying_engaged",
"set_value" : "value"
},
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_fryer_vat_limit",
"set_value" : "details"
}
]
},
{
"origin_path" : "situation_acceptance.building_construction",
"set_category" : "situation-materials",
"set_type" : "wall-materials",
"set_value" : [
{
"use_value" : "situation_acceptance.building_construction.wall_materials",
"set_value" : "CONCRETE"
}
]
}
]
}
EXPECTED OUTPUT
{
"characteristics" : [
{
"category" : "business-activity",
"type" : "contractors-subcontractors",
"value" : "yes"
},
{
"category" : "business-activity",
"type" : "deep-frying",
"value" : "yes",
"details" : "10"
},
{
"category" : "situation-materials",
"type" : "wall-materials",
"value" : "CONCRETE"
}
]
}
What I currently have for a single transform without configuration is the following:
# Create Business Characteristics
business_characteristics = {
"characteristics" : []
}
# Create Characteristics - Business - Liability
# if liability section exists logic to go in here
try:
acc_liability = {
"category" : "business-activities",
"type" : "contractors-sub-contractors-engaged",
"description" : "",
"value" : "",
"details" : ""
}
acc_liability['value'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['contractors_and_subcontractors_engaged']
acc_liability['details'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['types_of_work_contractors_performed']
business_characteristics['characteristics'].append(acc_liability)
except:
acc_liability = {}
CURRENT OUTPUT in Jupyter
{
"characteristics": [
{
"category": "business-activities",
"type": "contractors-sub-scontractors-engaged",
"description": "",
"value": "YES",
"details": ""
}
]
}

Related

Not getting output when using terms to filter on multiple values

I have a table in opensearch in which the format of every field is "text".
This is how my table looks like
Now the query(q1) which I am running in opensearch looks like this. i am not getting any output. But when I run query q2 then I get the output.
q1 = {"size":10,"query":{"bool":{"must":[{"multi_match":{"query":"cen","fields":["name","alias"],"fuzziness":"AUTO"}}],"filter":[{"match_phrase":{"category":"Specialty"}},{"match_phrase":{"prov_type":"A"}},{"match_phrase":{"prov_type":"C"}}]}}}
q2 = {"size":10,"query":{"bool":{"must":[{"multi_match":{"query":"cen","fields":["name","alias"],"fuzziness":"AUTO"}}],"filter":[{"match_phrase":{"category":"Specialty"}},{"match_phrase":{"prov_type":"A"}}]}}}
Now I want to apply multiple filtering on prov_type. I have tried using terms also with prov_type in list like ['A','B'].
Can anyone please answer this on how to apply multiple filters on value for single column in opensearch/elasticsearch. Datatype for every field is text.
Have already tried this - How to filter with multiple fields and values in elasticsearch?
Mapping for the index
GET index/_mapping
{
"spec_proc_comb_exp" : {
"mappings" : {
"properties" : {
"alias" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"category" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"prov_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"specialty_code" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Please let me know in case you need anymore information
You can use the should query to filter your data with the OR condition.
Should: The clause (query) should appear in the matching document.
GET test_allergy/_search
{
"query": {
"bool": {
"should": [
{
"term": {
"prov_type": "A"
}
},
{
"term": {
"prov_type": "C"
}
}
],
"minimum_should_match": 1
}
}
}
Note: You can set minimum_should_match as a number or percentage.

Mongodb join two collections giving duplicate results

def get(self):
res = json.loads(dumps(
self.devices_col.aggregate([
{"$lookup": {
"from": "participants",
"localField": "_id.docgroupid",
"foreignField": "device_id",
"as": "participants"
}
},
{
"$unwind": "$participants"
}
])
))
return res
participants document
{
"_id" : ObjectId("5f7230502930714468ed892c"),
"hash" : "83a84e8bf170114cffcc3b1e178d6468",
"name" : "BOMW0000029529",
"persona_id" : "i123",
"command" : "start",
"va_info" : [
{
"device_id" : "5f722a742930714468ed8929",
"automation_config" : "",
"status" : "false",
"remote_path" : "/datadrive/gatewayfolder",
"version" : "1.3.0.9",
"latest_va_version" : "1.3.1.2",
"version_updated_on" : "",
"latest_va_build_number" : "20200525",
"last_connected_on" : "02/08/2020 11:25:55",
"last_seen_on" : "02/08/2020 11:25:55",
"last_activity_processed_on" : "02/07/2020 11:25:55"
}
],
"inclusions" : [
"myfinancewnscom",
"OUTLOOK",
"jp2launcher",
"EXCEL"
],
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ""
}
devices document
{
"_id" : ObjectId("5f722a742930714468ed8929"),
"name" : "",
"unique_id" : "u168381",
"os" : {
"version" : "6.2.9200.0",
"name" : "Microsoft Windows 10 Home",
"locale" : {
"geo_location" : null,
"time_zone" : "IST",
"day_light_saving_support" : false
},
"culture" : {
"name" : "en-US",
"LCID" : "1032",
"language" : "English (United States)"
},
"browser" : [
{
"name" : "IE",
"value" : "9.11.17763.0"
},
{
"name" : "Chrome",
"value" : "84.0.4147.105"
},
{
"name" : "Firefox",
"value" : "Not Found"
}
]
},
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ISODate("2020-07-21T06:08:50.876Z")
}
Here is my data.
Here is my piece of python code. i am using pymongo client to make query from mongodb
In above code i am trying to join two collection (devices and participants) with device_id (Which is inside participants)
collections.
I have only two records in each collections.
But, output giving me 4 result.
Two duplicate records it is giving.
Please have a look where i am making wrong.
It doesn't doubles, it multiplies: number of devices * number of participants.
In your pipeline you join the collections as:
{"$lookup": {
"from": "participants",
"localField": "_id.docgroupid",
"foreignField": "device_id",
"as": "participants"
}
}
There is no _id.docgroupid field in devices and there are no device_id field in participants so it makes a perfect match of each participant to each device.
After the lookup stage the participants field hold whole participants collection. When you unwind it you see the same parent document with each single participant. Even tho the _id values of the documents are the same they are not identical duplicates and differ by participants field.

Avro IDL for Rest Api input validation in python

We've been using Avro IDL to define message sets used on our Kafka back end and are quite happy with it. We've also been interested in tying to validate JSON to a REST api on a Python Flask app with the Avro Schema as well and have been running into some difficulty.
There are a variety of packages out there but I have yet to find something that clearly works the way I need it to. I'm hoping for some guidance.
I can take my avdl file and generate a set of avsc files with:
avro-tools idl2schemata message.avdl output_dir
or
avro-tools idl message.avdl > output_dir/schema.avsc
And I'm able to read these in python but I've found nothing "easy" that can just tell me if my JSON input matches the schema.
Has anybody done something similar? Am I going down the wrong path? Any advice would be appreciated thanks.
I know if I was playing in the SpringBoot land this would likely be VERY simple.
Thanks
IDL
#namespace("org.jeeftor.avro")
protocol TacoRequest {
enum MeatType{
CHICKEN,
BEEF,
TURKEY,
FISH
}
enum CheeseType {
GROSS_VEGAN,
ACTUAL_COW_CHEESE,
GOAT_CHEESE
}
enum Toppings {
LECHUGA,
TOMATO,
SAUCE
}
record Taco {
MeatType meat;
CheeseType cheese;
array<Toppings> toppings;
}
record Order {
union { string, int } order_id;
array<Taco> tacos;
}
}
Schema
I generate the schema with: avro-tools idl order.avdl protocol.avpr
{
"protocol" : "TacoRequest",
"namespace" : "org.jeeftor.avro",
"types" : [ {
"type" : "enum",
"name" : "MeatType",
"symbols" : [ "CHICKEN", "BEEF", "TURKEY", "FISH" ]
}, {
"type" : "enum",
"name" : "CheeseType",
"symbols" : [ "GROSS_VEGAN", "ACTUAL_COW_CHEESE", "GOAT_CHEESE" ]
}, {
"type" : "enum",
"name" : "Toppings",
"symbols" : [ "LECHUGA", "TOMATO", "SAUCE" ]
}, {
"type" : "record",
"name" : "Taco",
"fields" : [ {
"name" : "meat",
"type" : "MeatType"
}, {
"name" : "cheese",
"type" : "CheeseType"
}, {
"name" : "toppings",
"type" : {
"type" : "array",
"items" : "Toppings"
}
} ]
}, {
"type" : "record",
"name" : "Order",
"fields" : [ {
"name" : "order_id",
"type" : [ "string", "int" ]
}, {
"name" : "tacos",
"type" : {
"type" : "array",
"items" : "Taco"
}
} ]
} ],
"messages" : { }
}
My question is how to easily use this "schema" to validate input.

Query DSL not working in pyes search

I am trying to use a custom query DSL to get results using the pyes library. I have query DSL that works when I use the command line
curl -XGET localhost:9200/test_index/_search -d '{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"field_value_factor": {
"field": "starred",
"modifier": "none",
"factor": 2
}
}
},
"aggs" : {
"types" : {
"filters" : {
"filters" : {
"category1" : { "type" : { "value" : "category1"}},
"category2" : { "type" : { "value" : "category2"}},
"category3" : { "type" : { "value" : "category3"}},
"category4": { "type" : { "value" : "category4"}},
"category5" : { "type" : { "value" : "category5"}}
}
},
"aggs": {
"topFoundHits": {
"top_hits": {
"size": 5
}
}
}
}
}
}'
The idea here is to search across many categorized documents for all documents matching a particular string query. Then using aggregations I want to find the top five resulting documents by category. Starred items are boosted so that they show up above other search results.
This works great when I enter the command as listed above directly in terminal but it doesn't work when I try to put it in pyes. I'm not sure what the best way is to do it. The documentation for the pyes library is really confusing for me to translate this totally into pyes objects.
I'm trying to do the following:
query_dsl = self.get_text_index_query_dsl()
resulting_docs = conn.search(query=query_dsl)
(where self.get_test_index_query_dsl returns the query dsl dict above)
Searching as is gives me a:
ElasticSearchException: QueryParsingException[[test_index] No query registered for [query]]; }]
If I remove the parent "query" mapping and try:
query_dsl = {
"function_score": {
"query": {
"match_all": {}
},
"field_value_factor": {
"field": "starred",
"modifier": "none",
"factor": 2
}
},
"aggs" : {
"types" : {
"filters" : {
"filters" : {
"category1" : { "type" : { "value" : "category1"}},
"category2" : { "type" : { "value" : "category2"}},
"category3" : { "type" : { "value" : "category3"}},
"category4": { "type" : { "value" : "category4"}},
"category5" : { "type" : { "value" : "category5"}}
}
},
"aggs": {
"topFoundHits": {
"top_hits": {
"size": 5
}
}
}
}
}
}
This also errors out with: ElasticSearchException: ElasticsearchParseException[Expected field name but got START_OBJECT "aggs"]; }]
These errors in addition to the fact that pyes doesn't seem to have a 'topFoundHits' functionality yet (I think) are holding me up.
Any ideas why this is happening and how to fix it?
Thank you so much!
I got this working using this library where you can just use your regular query dsl JSON syntax : http://elasticsearch-dsl.readthedocs.org/en/latest/.

Python and Elasticsearch API changes and Autcomplete

So to begin. I am trying to add around 7.2k documents. No problem there. The issue is after I am not able to get any suggestions returned to me. So this is how the information is added:
def addVariantToElasticSearch(self,docId, companyId, companyName, parent, companyIndustry, variants, count,conn):
body = { "company":{
"company_name": companyName,
"parent": parent,
"suggest": { "input": variants,
"output": companyName,
"weight": count,
"payload": {"industry_id": companyIndustry,
"no_of_jobseekers":count,
"company_id": companyId
}
}
}
}
res = conn.index(body=body, index="companies", doc_type="company", id=docId)
The mapping and settings is defined as:
def setting():
return { "settings" : {
"index": {
"number_of_replicas" : 0,
"number_of_shards": 1
},
"analysis" : {
"analyzer" : {
"my_edge_ngram_analyzer" : {
"tokenizer" : "my_edge_ngram_tokenizer",
"filter":["standard", "lowercase"]
}
},
"tokenizer" : {
"my_edge_ngram_tokenizer" : {
"type" : "edgeNGram",
"min_gram" : "1",
"max_gram" : "5",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"company" : {
"properties" : {
"name" : { "type" : "string" },
"industy": {"type": "integer"},
"count" : {"type": "long" },
"parent": {"type": "string"},
"suggest" : {
"type" : "completion",
"index_analyzer": "my_edge_ngram_analyzer",
"search_analyzer": "my_edge_ngram_analyzer",
"payloads": True
}
}
}
}
}
Index creation:
def createMapping(es):
settings = setting()
es.indices.create(index="companies", body=settings)
I call createMapping which uses setting(), then add each variant - surrounded by a try,except -> causes no issue. I can see all my documents added in the browser as well as looking at the status, settings and mappings.
But when I use a curl request as below, I get no results. (See curl and output beneath)
curl -X POST localhost:9200/companies/_suggest -d '
{
"company-suggest" : {
"text" : "1800",
"completion" : {
"field" : "suggest"
}
}
}'
{
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"suggest" : [ {
"text" : "ruby",
"offset" : 0,
"length" : 4,
"options" : [ ]
} ]
I am currently using ES 1.1.0. I have tried both Python API 0.4 and 1.1.0 with no luck (I tried 0.4 as a result of 1.1.0 not working although I know it isn't best to due to compatibility issues with version of ES). I have also been able to add the same settings with mappings via curl and added a company which I have been able to retrieve by this curl above.
I'm not sure exactly where the issue lies. I have looked at the Data folder in ES to ensure it has been created, as well as the browser. I have also ensured only a single ES instance is running.
Any help greatly appreciated,

Categories

Resources