InfluxDB JSON data ingestion into a measurement - python

I am trying to ingest data from one measurement (vulnerability) into another measurement (test1) using the InfluxDB Python client. Since I only want to ingest server, ID, and route from the vulnerability measurement into the test1 measurement, I select those three columns. Any help on how to ingest the data from one measurement to another would be appreciated.
code:
from influxdb import InfluxDBClient
from datetime import datetime

client = InfluxDBClient('hostname', 8089, 'user', 'pwd', 'database')
results = client.query("SELECT server, ID, route from vulnerability")
for row in results:
    influxJson = [
        {
            "measurement": "test1",
            "time": datetime.utcnow().isoformat() + "Z",
            "tags": {
                'ResiliencyTier': 'targetResiliencyTier',
                'lob': 'technologyDivision'
            },
            "fields": {
                columns[0][0]: str(row[1][0]),
                columns[1][0]: str(row[1][1]),
                columns[2][0]: str(row[1][2])
            }
        }
    ]
    client.write_points(influxJson)
Sample Data of vulnerability measurement:
{'time': '2022-02-10T17:51:52.638000Z', 'server': '123123123', 'id': '351335', 'route': '37875'}, {'time': '2022-02-10T17:51:52.638000Z', 'server': '234', 'qid': '351343', 'route': '0037875'}
ERROR:
File "Vul_SUmmary_UTEP_data_PROD.py", line 29, in startprocess
    columns[0][0] : str(row[1][0]),
NameError: name 'columns' is not defined
Thanks

Regarding the error (I know it's a little late): it clearly states that 'columns' is not defined, and that's because it never is. 'row' is defined because it's the loop variable of your for-loop, but 'columns' appears nowhere. If you want to fill another measurement with the data from row, you need to define 'columns' beforehand. You could also just write:
'server' : str(row[1][0]),
'ID' : str(row[1][1]),
'route' : str(row[1][2])
After all, you're just trying to create another dictionary for fields.
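Putting it together, here is a minimal sketch of the whole copy loop using the client's get_points(), which yields each row as a dict keyed by column name, assuming the columns come back under the selected names (server, ID, route):
from influxdb import InfluxDBClient
from datetime import datetime

client = InfluxDBClient('hostname', 8089, 'user', 'pwd', 'database')
results = client.query("SELECT server, ID, route FROM vulnerability")

points = []
# get_points() yields each result row as a dict keyed by column name,
# so fields can be looked up by name instead of by position
for row in results.get_points():
    points.append({
        "measurement": "test1",
        "time": datetime.utcnow().isoformat() + "Z",
        "tags": {
            'ResiliencyTier': 'targetResiliencyTier',
            'lob': 'technologyDivision'
        },
        "fields": {
            'server': str(row.get('server')),
            'ID': str(row.get('ID')),
            'route': str(row.get('route'))
        }
    })

# one write for the whole batch instead of one HTTP request per row
client.write_points(points)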
I hope you figured that out yourself in the last 5 months. I just stumbled across your question regarding another problem.

Related

Python: JSON File Format not printing out correctly

I'm trying to develop a parser that extracts data from a JSON-formatted file, but when I tested reading the file and outputting its contents, it didn't print the data the way I expected. Disclaimer: this is my first time working with JSON, so please go easy. Here are the contents of the file (it's quite dense, so I'm only including part of it, and some of the values are made up):
{
  "jobs" : [
    {
      "jobname" : "workload",
      "groupid" : 0,
      "eta" : 0,
      "elapsed" : 69,
      "job options" : {
        "bs" : "4k",
        "rw" : "randread"
      },
      "read" : {
        "io_bytes" : 2000,
        "bw" : 560,
        "slat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0
        }
      }
    }
  ]
}
So now I have python code that opens the json file and returns it as a dictionary. Then it's supposed to iterate through the list:
import json

# Opening JSON file
f = open('workload.log')

# returns JSON object as a dictionary
data = json.load(f)

# Iterating through the json list
for i in data['jobs']:
    print(i)

# Closing file
f.close()
Here's the link to the code I found online: https://www.geeksforgeeks.org/read-json-file-using-python/
Now from my understanding on how the json format works I assume when I print the file contents, the output should be:
{'jobname': 'workload', 'groupid': 0, 'eta': 0, 'elapsed': 69, 'job options', 'read'}
I think 'job options' would be their own category or at least be printed separately from 'jobname', 'groupid', and etc. However, this is what I get instead:
{'jobname': 'workload', 'groupid': 0, 'eta': 0, 'elapsed': 69, 'job options': {'bs': '4k', 'rw': 'randread'}, 'read': {'io_bytes': 2000, 'bw': 560, 'slat_ns': {'min': 0, 'max': 0, 'mean': 0}}}
There's a lot more data than that but that's the gist of it. They are all printed on one line. Is the formatting wrong? I've used this code on other sample JSON formats and it works just fine. I feel like at least the "job options" and "read" sections in the file should be accessible through the data label like "data['jobs']['job options']" or something. I want to figure out how to print out these sections separately.
Your code iterates over each job, and then prints that job, one per line. There's only one job in your example file, so you get one line of output.
Why would you expect the print statement to drop some of the data?
Those subkeys are available, exactly as you expect:
for job in data['jobs']:
    # job is now the entire job object you were printing
    print(job['job options'])
    # or make it pretty
    print(json.dumps(job['job options'], sort_keys=True, indent=2))

{
  "bs": "4k",
  "rw": "randread"
}

How to get dictionary data through a python module?

def cd():
    person = {
        'name' : 'bhavya',
        'age' : 32,
        'birth_date' : '18/10/1996'
    }
    print(person)

cd()
This is a dictionary I've saved in ey.py.
Now I want to import all of this data into another .py file with the help of a module, so I can fetch the whole of the data saved in ey.py.
Can anyone guide me?
If your goal is to separate the data from your main Python code, you have to put it in a new Python file (e.g. data.py):
data.py
def get_data():
    data = { 'name' : 'bhavya', 'age' : 32, 'birth_date' : '18/10/1996' }
    return data
main.py
from data import get_data
person = get_data()
print(person)
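If the data never changes at runtime, you could also skip the function and expose a module-level constant instead. A minimal sketch (PERSON is an assumed name):
# data.py
PERSON = { 'name' : 'bhavya', 'age' : 32, 'birth_date' : '18/10/1996' }

# main.py
from data import PERSON
print(PERSON['name'])  # bhavya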

How to use my original key for identifying a doc in Cloudant db from Python client?

I'm working on a Python client program for Cloudant.
I'd like to retrieve a doc based not on "_id" but on my own field.
However, it does not work; it raises a KeyError. Any help solving this error is highly appreciated!
Here is my code:
from cloudant.client import Cloudant
from cloudant.error import CloudantException
from cloudant.result import Result,ResultByKey
...
client.connect()
databaseName = "mydata1"
myDatabase = client[databaseName]
# As direct access like 'doc = myDatabase[<_id>]' cannot work for my key,
# let's check one by one ...
for document in myDatabase:
    # if document['_id'] == "20170928chibikasmall": <= if I use _id it's ok
    if document['gokigenField'] == 111:
This causes:
KeyError: 'gokigenField'
Beforehand, I created a gokigenField index using the dashboard, then confirmed the result via Postman with the REST API:
GET https://....bluemix.cloudant.com/mydata1/_index
the result is as follows:
{"total_rows":2,"indexes":[{"ddoc":null,"name":"_all_docs","type":"special","def":{"fields":[{"_id":"asc"}]}},{"ddoc":"_design/f7fb53912eb005771b736422f41c24cd26c7f06a","name":"gokigen-index","type":"text","def":{"default_analyzer":"keyword","default_field":{},"selector":{},"fields":[{"gokigenField":"number"}],"index_array_lengths":true}}]}
Also, I've confirmed I can use this gokigenField nicely as a query index on the Cloudant dashboard as well as via a POST query.
My newly created "gokigenField" is not included in every document in the DB, as there are automatically created docs ("_design/xxx") without that field.
I guess this might be what causes the KeyError when I call this from my Python client.
I cannot find a Cloudant API in the reference for checking whether a specific key exists in a document, so I have no idea how to bypass such docs.
This is how to index and query data from the Python client. Let's assume we have already imported the library and have a database client in myDatabase.
First of all I created some data:
#create some data
data = { 'name': 'Julia', 'age': 30, 'pets': ['cat', 'dog', 'frog'], 'gokigenField': 'a' }
myDatabase.create_document(data)
data = { 'name': 'Fred', 'age': 30, 'pets': ['dog'], 'gokigenField': 'b' }
myDatabase.create_document(data)
data = { 'name': 'Laura', 'age': 31, 'pets': ['cat'], 'gokigenField': 'c' }
myDatabase.create_document(data)
data = { 'name': 'Emma', 'age': 32, 'pets': ['cat', 'parrot', 'hamster'], 'gokigenField': 'c' }
myDatabase.create_document(data)
We can check the data is there in the Cloudant dashboard or by doing:
# check the data is there
for document in myDatabase:
    print(document)
Next we can opt to index the field gokigenField like so:
# create an index on the field 'gokigenField'
myDatabase.create_query_index(fields=['gokigenField'])
Then we can query the database:
# do a query
selector = {'gokigenField': {'$eq': 'c'}}
docs = myDatabase.get_query_result(selector)
for doc in docs:
    print(doc)
which outputs the two matching documents.
The python-cloudant documentation is here.
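As for the KeyError in the original loop: the auto-created design documents simply lack gokigenField, so guarding the lookup skips them. A minimal sketch:
for document in myDatabase:
    # .get() returns None instead of raising KeyError when the field is absent
    if document.get('gokigenField') == 111:
        print(document)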

Dynamically create list/dict during for loop for conversion to JSON

I am trying to build a list/dict that will be converted to JSON later on, writing code that builds and populates the multiple levels of the JSON format I ultimately need. I am having trouble wrapping my head around this. Thank you for the help.
What I ultimately need -> Populate this list/dict:
dataset_permission_json = []
with this format:
{
  "projects": [
    {
      "project": "test-project-1",
      "datasets": [
        {
          "dataset": "testing1",
          "permissions": [
            { "role": "READER", "google_group": "testing1#test.com" }
          ]
        },
        {
          "dataset": "testing2",
          "permissions": [
            { "role": "OWNER", "google_group": "testing2#test.com" }
          ]
        },
        {
          "dataset": "testing3",
          "permissions": [
            { "role": "READER", "google_group": "testing3#test.com" }
          ]
        },
        {
          "dataset": "testing4",
          "permissions": [
            { "role": "WRITER", "google_group": "testing4#test.com" }
          ]
        }
      ]
    }
  ]
}
I have multiple for loops that successfully print out the information I am pulling from an external API, but I need to be able to enter that data into the list/dict. The dynamic values I am trying to input are:
'project' i.e. test-project-1
'dataset' i.e. testing1
'role' i.e. READER
'google_group' i.e. testing1#test.com
I have tried things like:
dataset_permission_json.update({'project': project})
but cannot figure out how not to overwrite the data during the multiple for loops.
for project in projects:
    print(project) ## Need to add this variable to 'projects'
    for bq_group in bq_groups:
        delegated_credentials = credentials.create_delegated(bq_group)
        http_auth = delegated_credentials.authorize(Http())
        list_datasets_in_project = bigquery_service.datasets().list(projectId=project).execute()
        datasets = list_datasets_in_project.get('datasets',[])
        for dataset in datasets:
            print(dataset['datasetReference']['datasetId']) ## Add the dataset to 'datasets' under the project
            get_dataset_permissions_result = bigquery_service.datasets().get(projectId=project, datasetId=dataset['datasetReference']['datasetId']).execute()
            dataset_permissions = get_dataset_permissions_result.get('access',[])
            ### ADD THE NEXT LEVEL 'permissions' level here?
            for dataset_permission in dataset_permissions:
                if 'groupByEmail' in dataset_permission:
                    if bq_group in dataset_permission['groupByEmail']:
                        print(dataset['datasetReference']['datasetId'], dataset_permission['groupByEmail']) ## Add to each dataset
I appreciate the help.
EDIT: Updated progress
OK, I have created the nested structure I was looking for, using StackOverflow.
Things are great except for the last part: I am trying to append the role & group to each 'permissions' nest, but after everything runs, the data has only been appended to the last 'permissions' nest in the JSON structure. It seems like it is overwriting itself during the for loop. Thoughts?
Updated for loop:
for project in projects:
    for bq_group in bq_groups:
        delegated_credentials = credentials.create_delegated(bq_group)
        http_auth = delegated_credentials.authorize(Http())
        list_datasets_in_project = bigquery_service.datasets().list(projectId=project).execute()
        datasets = list_datasets_in_project.get('datasets',[])
        for dataset in datasets:
            get_dataset_permissions_result = bigquery_service.datasets().get(projectId=project, datasetId=dataset['datasetReference']['datasetId']).execute()
            dataset_permissions = get_dataset_permissions_result.get('access',[])
            for dataset_permission in dataset_permissions:
                if 'groupByEmail' in dataset_permission:
                    if bq_group in dataset_permission['groupByEmail']:
                        dataset_permission_json['projects'][project]['datasets'][dataset['datasetReference']['datasetId']]['permissions']
                        permission = {'group': dataset_permission['groupByEmail'],'role': dataset_permission['role']}
                        dataset_permission_json['permissions'] = permission
UPDATE: Solved.
permission = {'group': dataset_permission['groupByEmail'], 'role': dataset_permission['role']}
dataset_permission_json['projects'][project]['datasets'][dataset['datasetReference']['datasetId']]['permissions'] = permission
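For anyone landing here later: the overwrite happens when a single shared key is reassigned on every pass of the loop. Appending a fresh dict per iteration avoids it. A minimal sketch, with hypothetical in-memory rows standing in for the BigQuery API calls:
import json

# hypothetical stand-in for the values the API loops would yield
rows = [
    ('test-project-1', 'testing1', 'READER', 'testing1#test.com'),
    ('test-project-1', 'testing2', 'OWNER', 'testing2#test.com'),
]

dataset_permission_json = {'projects': []}
projects_by_name = {}

for project, dataset, role, group in rows:
    proj = projects_by_name.get(project)
    if proj is None:
        proj = {'project': project, 'datasets': []}
        projects_by_name[project] = proj
        dataset_permission_json['projects'].append(proj)
    # append a new dataset entry each pass instead of assigning to one shared key
    proj['datasets'].append({
        'dataset': dataset,
        'permissions': [{'role': role, 'google_group': group}],
    })

print(json.dumps(dataset_permission_json, indent=2))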

Import list of dicts or JSON file to elastic search with python

I have a .json.gz file that I wish to load into elastic search.
My first attempt involved using the json module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I wish to be able to find the tax for a given point query, in case this has an impact on how I should index the data with Elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = Elasticsearch()
for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)
In bulk, using the helpers module (from elasticsearch import helpers):
actions = [
    {
        "_index" : "nodes_bulk",
        "_type" : "external",
        "_id" : str(node['index']),
        "_source" : node
    }
    for node in nodes
]
helpers.bulk(es, actions)
Bulk was around 22 times faster for a list of 343724 dicts.
Here is my working code using the bulk API:
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch([{'host':'localhost', 'port': 9200}])
doc = [{'_id': 1,'price': 10, 'productID' : 'XHDK-A-1293-#fJ3'},
{'_id':2, "price" : 20, "productID" : "KDKE-B-9947-#kL5"},
{'_id':3, "price" : 30, "productID" : "JODL-X-1937-#pV7"},
{'_id':4, "price" : 30, "productID" : "QQPX-R-3956-#aD8"}]
helpers.bulk(es, doc, index='products',doc_type='_doc', request_timeout=200)
The ES bulk helper gave us several problems, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:
import json
import requests

headers = { 'Content-type': 'application/json',
            'Accept': 'text/plain'}
jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of the dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))
data = ''.join(jsons)
# url should point at the target index's _bulk endpoint
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id but I guess you can skip this part in case you want a random _id set by ES automatically.
Hope that helps.
