I am trying to extract the value of 'login' from a dump of JSON that I have as plain text (response.text).
Here's the string:
{
"name":"master",
"commit":{
"sha":"adc3208a9ac76262250a",
"commit":{
"author":{
"name":"root",
"email":"dan.ja#foo.ca",
"date":"2018-02-26T20:14:41Z"
},
"committer":{
"name":"GitHub Enterprise",
"date":"2018-02-26T20:14:41Z"
},
"message":"Update README.md",
"tree":{
"sha":"3e4710d0e021a0a7",
"comment_count":0,
"verification":{
"verified":false,
"reason":"unsigned",
"signature":null,
"payload":null
}
},
"author":{
"login":"kyle",
"id":5
}
I am just trying to pull the value 'kyle' from the login field in the last line. The value 'kyle' can change, since it can be a different login each time, so I need the string in "login":"string".
Here's what I have right now, but that only gets me "login" :
/"login"[^\a]*"/g
Never parse JSON with regex; use a JSON parser instead.
With jq :
Input file :
{
"commit" : {
"commit" : {
"tree" : {
"verification" : {
"payload" : null,
"verified" : false,
"signature" : null,
"reason" : "unsigned"
},
"sha" : "3e4710d0e021a0a7",
"comment_count" : 0
},
"author" : {
"id" : 5,
"login" : "kyle"
},
"committer" : {
"name" : "GitHub Enterprise",
"date" : "2018-02-26T20:14:41Z"
},
"message" : "Update README.md"
},
"sha" : "adc3208a9ac76262250a"
},
"name" : "master"
}
Command :
$ jq '.commit.commit.author.login' file.json
Or via a python script :
#!/usr/bin/env python3
import json
string = """
{
"commit" : {
"commit" : {
"tree" : {
"verification" : {
"payload" : null,
"verified" : false,
"signature" : null,
"reason" : "unsigned"
},
"sha" : "3e4710d0e021a0a7",
"comment_count" : 0
},
"author" : {
"id" : 5,
"login" : "kyle"
},
"committer" : {
"name" : "GitHub Enterprise",
"date" : "2018-02-26T20:14:41Z"
},
"message" : "Update README.md"
},
"sha" : "adc3208a9ac76262250a"
},
"name" : "master"
}
"""
j = json.loads(string)
print(j['commit']['commit']['author']['login'])
Output :
"kyle"
I have an explicit JSON input that I want to transform into metadata-driven generic objects within an array. I have successfully done this on an individual basis; however, I now want to drive it from a configuration file instead.
Below is an example of the input data, the configuration I want to apply, and the expected output data.
Since it is outputting into a generic schema, I want the value to always be output as a string, no matter what the input value's data type is.
In addition, the origin data may not always exist in the origin payload. When I did an individual one of these, I used try, which worked really well. I expect the same method still applies when driving it from a second configuration file: loop through the configuration file, create whatever it can, and skip to the next entry otherwise, until completed.
INPUT ORIGIN DATA
{
"activities_acceptance" : {
"contractors_sub_contractors" : {
"contractors_subcontractors_engaged" : "yes"
},
"cooking_deep_frying" : {
"deep_frying_engaged" : "yes",
"deep_fryer_vat_limit" : 10
}
},
"situation_acceptance" : {
"building_construction" : {
"wall_materials" : "CONCRETE"
}
}
}
CONFIGURATION PARAMETERS
{
"processiong_configuration" : [
{
"origin_path" : "activities_acceptance.contractors_sub_contractors",
"set_category" : "business-activity",
"set_type" : "contractors-subcontractors",
"set_value" : [
{
"use_value" : "activities_acceptance.contractors_sub_contractors.contractors_subcontractors_engaged",
"set_value" : "value"
}
]
},
{
"origin_path" : "activities_acceptance.cooking_deep_frying",
"set_category" : "business-activity",
"set_type" : "cooking-deep-frying",
"set_value" : [
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_frying_engaged",
"set_value" : "value"
},
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_fryer_vat_limit",
"set_value" : "details"
}
]
},
{
"origin_path" : "situation_acceptance.building_construction",
"set_category" : "situation-materials",
"set_type" : "wall-materials",
"set_value" : [
{
"use_value" : "situation_acceptance.building_construction.wall_materials",
"set_value" : "CONCRETE"
}
]
}
]
}
EXPECTED OUTPUT
{
"characteristics" : [
{
"category" : "business-activity",
"type" : "contractors-subcontractors",
"value" : "yes"
},
{
"category" : "business-activity",
"type" : "deep-frying",
"value" : "yes",
"details" : "10"
},
{
"category" : "situation-materials",
"type" : "wall-materials",
"value" : "CONCRETE"
}
]
}
What I currently have for a single transform without configuration is the following:
# Create Business Characteristics
business_characteristics = {
    "characteristics" : []
}

# Create Characteristics - Business - Liability
# if liability section exists logic to go in here
try:
    acc_liability = {
        "category" : "business-activities",
        "type" : "contractors-sub-contractors-engaged",
        "description" : "",
        "value" : "",
        "details" : ""
    }
    acc_liability['value'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['contractors_and_subcontractors_engaged']
    acc_liability['details'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['types_of_work_contractors_performed']
    business_characteristics['characteristics'].append(acc_liability)
except:
    acc_liability = {}
CURRENT OUTPUT in Jupyter
{
"characteristics": [
{
"category": "business-activities",
"type": "contractors-sub-scontractors-engaged",
"description": "",
"value": "YES",
"details": ""
}
]
}
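One way to drive this from the configuration file is a loop with a small helper that resolves the dotted origin_path / use_value strings against the input dict and skips anything that is missing. This is only a sketch: the helper name and error handling are illustrative, not taken from the original code, and it assumes the inner set_value in each mapping names the output field.
def resolve(data, dotted_path):
    # Walk a dotted path like "a.b.c" through nested dicts; raises KeyError if a key is missing
    node = data
    for key in dotted_path.split("."):
        node = node[key]
    return node

def build_characteristics(origin, config):
    characteristics = []
    for entry in config["processiong_configuration"]:    # key spelled as in the configuration above
        try:
            resolve(origin, entry["origin_path"])         # skip this entry if its section is absent
        except (KeyError, TypeError):
            continue
        characteristic = {
            "category": entry["set_category"],
            "type": entry["set_type"],
        }
        for mapping in entry["set_value"]:
            try:
                value = resolve(origin, mapping["use_value"])
            except (KeyError, TypeError):
                continue
            characteristic[mapping["set_value"]] = str(value)   # always output as a string
        characteristics.append(characteristic)
    return {"characteristics": characteristics}
Calling build_characteristics(input_data, configuration) with the two documents above (parsed with json.loads) would produce a structure like the expected output, except for the third configuration entry, whose inner set_value is "CONCRETE" rather than a field name such as "value".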
My English is poor.
I have a piece of JSON data with the structure shown below, and I want to use Python to aggregate it.
If the service.name is the same, the entries need to be grouped together, and the duplicate "url.path" values need to be removed.
I don't know which structure to use to store the data: a list? A dict?
Can anyone help, please? Thanks.
```
{
[
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_AAA_Web_Host"
}, "url" : {
"path" : "/product/getAAA" }
}
},
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_BBB_Web_Host"
}, "url" : {
"path" : "/product/getBBB" }
}
},
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_AAA_Web_Host"
}, "url" : {
"path" : "/product/getAAA" }
}
} ] }
```
This should get you started. It builds a cache keyed on all the service names seen so far, and drops anything whose name has already been seen.
import pprint
import json
json_data = """[
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_AAA_Web_Host"
}, "url" : {
"path" : "/product/getAAA" }
}
},
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_BBB_Web_Host"
}, "url" : {
"path" : "/product/getBBB" }
}
},
{
"_source" : {
"error" : {
"exception" : [
{
"handled" : false,
"type" : "Abp.UI.UserFriendlyException",
"message" : "未发现该用户的WeiXinUserRelation,粉丝编号447519"
}
]
},
"trace" : {
"id" : "a3e3796ca145b448829d0d0f96661e67"
},
"#timestamp" : "2021-06-21T06:57:52.603Z",
"service" : {
"name" : "Lonsid_AAA_Web_Host"
}, "url" : {
"path" : "/product/getAAA" }
}
}
]"""
data = json.loads(json_data)
cache = {}
for item in data:
    if item['_source']['service']['name'] not in cache:
        cache[item['_source']['service']['name']] = item
pprint.pprint(list(cache.values()))
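If duplicates should instead be detected on the combination of service name and URL path (so that different paths under the same service are all kept), the cache can be keyed on a tuple; a small variation on the loop above:
cache = {}
for item in data:
    key = (item['_source']['service']['name'], item['_source']['url']['path'])
    if key not in cache:
        cache[key] = item
pprint.pprint(list(cache.values()))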
def get(self):
    res = json.loads(dumps(
        self.devices_col.aggregate([
            {"$lookup": {
                "from": "participants",
                "localField": "_id.docgroupid",
                "foreignField": "device_id",
                "as": "participants"
            }},
            {
                "$unwind": "$participants"
            }
        ])
    ))
    return res
participants document
{
"_id" : ObjectId("5f7230502930714468ed892c"),
"hash" : "83a84e8bf170114cffcc3b1e178d6468",
"name" : "BOMW0000029529",
"persona_id" : "i123",
"command" : "start",
"va_info" : [
{
"device_id" : "5f722a742930714468ed8929",
"automation_config" : "",
"status" : "false",
"remote_path" : "/datadrive/gatewayfolder",
"version" : "1.3.0.9",
"latest_va_version" : "1.3.1.2",
"version_updated_on" : "",
"latest_va_build_number" : "20200525",
"last_connected_on" : "02/08/2020 11:25:55",
"last_seen_on" : "02/08/2020 11:25:55",
"last_activity_processed_on" : "02/07/2020 11:25:55"
}
],
"inclusions" : [
"myfinancewnscom",
"OUTLOOK",
"jp2launcher",
"EXCEL"
],
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ""
}
devices document
{
"_id" : ObjectId("5f722a742930714468ed8929"),
"name" : "",
"unique_id" : "u168381",
"os" : {
"version" : "6.2.9200.0",
"name" : "Microsoft Windows 10 Home",
"locale" : {
"geo_location" : null,
"time_zone" : "IST",
"day_light_saving_support" : false
},
"culture" : {
"name" : "en-US",
"LCID" : "1032",
"language" : "English (United States)"
},
"browser" : [
{
"name" : "IE",
"value" : "9.11.17763.0"
},
{
"name" : "Chrome",
"value" : "84.0.4147.105"
},
{
"name" : "Firefox",
"value" : "Not Found"
}
]
},
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ISODate("2020-07-21T06:08:50.876Z")
}
Here is my data.
Here is my piece of Python code. I am using the pymongo client to query MongoDB.
In the code above I am trying to join the two collections (devices and participants) on device_id (which is inside participants).
I have only two records in each collection.
But the output is giving me 4 results.
It is giving two duplicate records.
Please have a look at where I am going wrong.
It doesn't double, it multiplies: number of devices * number of participants.
In your pipeline you join the collections as:
{"$lookup": {
"from": "participants",
"localField": "_id.docgroupid",
"foreignField": "device_id",
"as": "participants"
}
}
There is no _id.docgroupid field in devices and no top-level device_id field in participants, so every participant makes a perfect match with every device.
After the lookup stage the participants field holds the whole participants collection. When you unwind it, you see the same parent document repeated once per participant. Even though the _id values of the documents are the same, they are not identical duplicates; they differ in the participants field.
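For reference, here is a sketch of a pipeline that joins on fields which do exist in both collections, assuming the intent is to match devices._id against the device_id stored inside each participant's va_info array. The connection details are hypothetical, and because devices._id is an ObjectId while va_info.device_id is a string, the _id is converted to a string before matching ($toString requires MongoDB 4.0+):
from pymongo import MongoClient
from bson.json_util import dumps
import json

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
db = client["mydb"]                                  # hypothetical database name

pipeline = [
    # Expose the device _id as a string so it can be compared with the string va_info.device_id
    {"$addFields": {"id_str": {"$toString": "$_id"}}},
    {"$lookup": {
        "from": "participants",
        "localField": "id_str",
        "foreignField": "va_info.device_id",
        "as": "participants"
    }},
    {"$unwind": "$participants"},
]

res = json.loads(dumps(db["devices"].aggregate(pipeline)))
print(res)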
I have a DB with my users:
{
"_id": {
"$oid": "5a0decadefcb09087c08a868"
},
"user_id": "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email": "email#email.com",
"password": "***",
"registration_date": {
"$date": "2017-11-16T19:53:17.946Z"
},
"type": "user"
},
{
"_id": {
"$oid": "5a0ded3aefcb090887d7f4fb"
},
"user_id": "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email": "second#gmail.com",
"password": "***",
"registration_date": {
"$date": "2017-11-16T19:55:38.194Z"
},
"type": "user"
}
I want to count users by each date (registration_date) and get something like this:
01.01.2017 – 10
01.02.2017 – 20
01.03.2017 – 15
...
I'm trying this code, but it doesn't work:
def registrations_by_date(self):
    users = self.users_db.aggregate([
        {'$group': {
            '_id': {'registration_date': '$date'},
            'count': {'$sum': 1}
        }},
    ])
    return users
What am I doing wrong? How do I get this data?
If the date in your schema is an ISODate, then the aggregate query below will work; the date is formatted before grouping so that the time portion of the timestamp is not part of the grouping key.
{
"_id" : "5a0decadefcb09087c08a868",
"user_id" : "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email" : "email#email.com",
"password" : "***",
"registration_date" : ISODate("2017-11-16T19:53:17.946Z"),
"type" : "user"
}
{
"_id" : "5a0ded3aefcb090887d7f4fb",
"user_id" : "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email" : "second#gmail.com",
"password" : "***",
"registration_date" : ISODate("2017-11-16T19:55:38.194Z"),
"type" : "user"
}
The aggregation query to get the result is
db.userReg.aggregate([
{$project:
{ formattedRegDate:
{ "$dateToString": {format:"%Y-%m-%d", date:"$registration_date"}}
}
},
{$group:{_id:"$formattedRegDate", count:{$sum:1}}}]);
and the result is
{ "_id" : "2017-11-16", "count" : 2 }
If the date in your schema is a String, then use the approach below.
Sample Data
{
"_id" : "5a0decadefcb09087c08a868",
"user_id" : "5b232a5a-b333-4320-ba63-722b9e167ef3",
"email" : "email#email.com",
"password" : "***",
"registration_date" : "2017-11-16T19:53:17.946Z",
"type" : "user"
}
{
"_id" : "5a0ded3aefcb090887d7f4fb",
"user_id" : "0054bbde-3ba0-490f-8d54-ffaf72958888",
"email" : "second#gmail.com",
"password" : "***",
"registration_date" : "2017-11-16T19:55:38.194Z",
"type" : "user"
}
Query
db.userReg.aggregate([{
$group:{ _id: { date: {"$substr":["$registration_date", 0, 10]}},
count:{$sum:1}
}
}]);
and the result is
{ "_id" : { "date" : "2017-11-16" }, "count" : 2 }
It seems you have an extra comma.
db.userReg.aggregate([
{$group: {_id: "$registration_date", count: {$sum:1}}}
])
This gives the correct result (based on the records on my machine):
{
    "_id" : ISODate("2017-11-15T19:55:38.194Z"),
    "count" : 1.0
}
{
    "_id" : ISODate("2017-11-16T19:55:38.194Z"),
    "count" : 2.0
}
I'm using imply-2.2.3. Here is my Tranquility server configuration:
{
"dataSources" : [
{
"spec" : {
"dataSchema" : {
"dataSource" : "tutorial-tranquility-server",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : [],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none"
},
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
{
"name" : "value_sum",
"type" : "doubleSum",
"fieldName" : "value"
},
{
"fieldName" : "value",
"name" : "value_min",
"type" : "doubleMin"
},
{
"type" : "doubleMax",
"name" : "value_max",
"fieldName" : "value"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "50000",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
}
},
{
"spec": {
"dataSchema" : {
"dataSource" : "test",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : [
"a"
],
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none"
},
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
{
"type": "doubleSum",
"name": "b",
"fieldName": "b"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "50000",
"windowPeriod" : "P1Y"
}
},
"properties": {
"task.partitions" : "1",
"task.replicants" : "1"
}
}
],
"properties" : {
"zookeeper.connect" : "localhost",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"http.port" : "8200",
"http.threads" : "40",
"serialization.format" : "smile",
"druidBeam.taskLocator": "overlord"
}
}
I have trouble sending data to the second datasource, test, specifically. I tried to send the below data to Druid with python requests:
{'b': 7, 'timestamp': '2017-01-20T03:32:54.586415', 'a': 't'}
The response I receive:
b'{"result":{"received":1,"sent":0}}'
If you read my config file you will notice that I set the window period to one year. I would like to send data with a large time span to Druid using the Tranquility server. Is there something wrong with my config or data?
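For reference, a minimal sketch of how the event above might be POSTed with python requests; the /v1/post/<dataSource> endpoint path and localhost host are assumptions based on the http.port in the config:
import json
import requests

event = {"b": 7, "timestamp": "2017-01-20T03:32:54.586415", "a": "t"}

resp = requests.post(
    "http://localhost:8200/v1/post/test",            # http.port 8200, datasource "test"
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
print(resp.content)   # e.g. b'{"result":{"received":1,"sent":0}}' when the event is received but not handed to Druid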