MongoDB bulk insert using Python fails

I implemented a simple batch upload with the following code, assuming I could aggregate a pre-set number of JSON docs (each one a dict called ecpDocSorted) into a bulk list and flush it after collecting e.g. 5 docs. ecpDocSorted contains a flat JSON structure - all key/value pairs, no ids.
The code snippet looks like this:
#
# Sorting the ecpDoc by keys
#
for k in sorted(ecpDoc.keys()):
    ecpDocSorted[k] = ecpDoc[k]
ecpDoc = dict(ecpDocSorted)
#
# Insert into MongoDB
#
bulk.append(ecpDocSorted)
if len(bulk) == 5:
    # insert into Mongo
    result = mycol.insert_many(bulk)
    print(result)
    bulk = []
Uploading an individual doc (using len(bulk)==1) works fine and the document ends up in Mongo.
For any other number (e.g. len(bulk)==5) it fails with the following error:
raise BulkWriteError(full_result)
pymongo.errors.BulkWriteError: batch op errors occurred
What am I missing?
Added based on comment:
ecpDocSorted example:
{'address1': 'SOME ADDRESS', 'city': 'Arecibo', 'country': 'US', 'languages': 'English', 'name': 'SOME NAME', 'phone': '123-123-1234', 'postalcode': '00612'}

The issue turned out to be with ecpDocSorted.
When using
bulk.append(ecpDoc)
instead of
bulk.append(ecpDocSorted)
it works fine.
Most likely this is because ecpDocSorted is the same dict object reused on every iteration, so the bulk list ends up holding several references to one document; once insert_many() adds an _id to that dict, the repeated entries in the batch collide and trigger the BulkWriteError. ecpDoc, on the other hand, is rebound to a fresh dict(ecpDocSorted) each time, so every entry in bulk is a distinct object.
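A minimal sketch of the fix by building a fresh sorted dict per document before appending; the surrounding loop, the ecpDocs iterable and the connection details are assumptions, not from the question:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # connection details assumed
mycol = client["mydb"]["mycol"]

bulk = []
for ecpDoc in ecpDocs:  # ecpDocs: iterable of raw dicts (hypothetical name)
    # Build a *new* sorted dict per document so the bulk list never
    # holds several references to the same object.
    ecpDocSorted = {k: ecpDoc[k] for k in sorted(ecpDoc)}
    bulk.append(ecpDocSorted)
    if len(bulk) == 5:
        result = mycol.insert_many(bulk)
        print(result.inserted_ids)
        bulk = []

if bulk:  # flush any remaining documents
    mycol.insert_many(bulk)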

Related

python websocket fetch returns empty object

I'm trying to use the pybit package (https://github.com/verata-veritatis/pybit) on a crypto exchange, but when I try to fetch data from the websocket, all I get is an empty object as a response.
from pybit import WebSocket

endpoint_public = 'wss://stream.bybit.com/realtime_public'
subs = [
    'orderBookL2_25.BTCUSD',
    'instrument_info.100ms.BTCUSD',
    'last_price.BTCUSD'
]
ws_unauth = WebSocket(endpoint_public, subscriptions=subs)
ws_unauth.fetch('last_price.BTCUSD')
the output is this
{}
EDIT: 2022.09.19
It seems they have changed the code in the module, and the examples in the documentation are different now. They no longer use fetch(); instead you assign subscriptions to handler functions, and the websocket runs its own (hidden) loop that fetches the data and executes the assigned function.
I found three problems:
First: the code works for me if I use the endpoint realtime instead of realtime_public - I found it somewhere in the ByBit API documentation (not in the documentation for the Python module).
Second: there is no 'last_price.BTCUSD' in the documentation - it generates errors when I try it with the endpoint realtime, and then the other subscriptions don't work either.
Third: the first fetch may not give a result, so it may need a short sleep() before the first fetch. Normally the code should run in some loop and get data every few (milli)seconds, and then this is not a problem. You could also use an if to run some code only when you actually get data.
import time

import pybit

endpoint_public = 'wss://stream.bybit.com/realtime'
subs = [
    'orderBookL2_25.BTCUSD',
    'instrument_info.100ms.BTCUSD',
    # 'last_price.BTCUSD'
]
ws_unauth = pybit.WebSocket(endpoint_public, subscriptions=subs)
time.sleep(1)
#print(ws_unauth.fetch('last_price.BTCUSD'))     # doesn't work with `realtime_public`; generates an error with `realtime`
print(ws_unauth.fetch('orderBookL2_25.BTCUSD'))  # doesn't work with `realtime_public`; works with `realtime`
Result:
[
    {'price': '40702.50', 'symbol': 'BTCUSD', 'id': 407025000, 'side': 'Buy', 'size': 350009},
    {'price': '40703.00', 'symbol': 'BTCUSD', 'id': 407030000, 'side': 'Buy', 'size': 10069},
    {'price': '40705.00', 'symbol': 'BTCUSD', 'id': 407050000, 'side': 'Buy', 'size': 28},
    # ...
]
BTW:
The ByBit API Documentation also shows examples for Public Topics.
They use:
realtime instead of realtime_public,
a loop to fetch data periodically,
if data to skip empty responses.
from pybit import WebSocket

subs = [
    "orderBookL2_25.BTCUSD"
]
ws = WebSocket(
    "wss://stream-testnet.bybit.com/realtime",
    subscriptions=subs
)
while True:
    data = ws.fetch(subs[0])
    if data:
        print(data)
The ByBit API Documentation also shows examples for Private Topics.
They also use:
realtime instead of realtime_public (plus api_key and api_secret),
a loop to fetch data periodically,
if data to skip empty responses.
For testing they use stream-testnet, but in real code it should be stream.
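A minimal sketch of that private-topic pattern under the same assumptions; the api_key/api_secret keyword arguments follow the documentation examples described above, and the "position" topic name is a hypothetical private topic:
from pybit import WebSocket

subs = [
    "position"  # hypothetical private topic name
]
ws_auth = WebSocket(
    "wss://stream-testnet.bybit.com/realtime",  # use stream.bybit.com in real code
    subscriptions=subs,
    api_key="...",      # your API key
    api_secret="..."    # your API secret
)
while True:
    data = ws_auth.fetch(subs[0])
    if data:
        print(data)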

JSON Data Masking using PARANOID

I have just started with Python programming and I am working with https://pypi.org/project/PARANOID/ to mask PII details such as first_name, last_name and email_address.
{
    "id": 324324,
    "first_name": "John",
    "last_name": "Smith",
    "email": "john.smith#abc.com"
}
When I execute paranoid -i my.json -o output, all the fields of my JSON (id, first_name, last_name and email_address) get masked. But I don't want to mask id. For that, -l with an XPath into the JSON has to be provided.
I have tried various combinations of XPath for the JSON, but it still masks all the fields in the file.
Please guide me.
This library doesn't seem very attractive to me: there is no documentation at all, the code is very messy, and they seem to be overcomplicating some things. The only reference to what it can do was here, and it only refers to using XPath to process XML, not JSON.
I then installed the library locally and verified that XPath doesn't apply to JSON. It turns out the library is just a single module (one file) with a bunch of functions, so I investigated which functions are used when you process a JSON file. Only two functions are involved: jsonParse2 and maskGenerator. So it was doable to reuse them.
jsonParse2 doesn't make sense to me at all, as it parses a JSON file manually when there are so many easy tools to use, such as the json library. I will discard jsonParse2, since it was the main obstacle to filtering which keys should be processed.
I will simply reuse the maskGenerator function in my solution and just pass the keys/values we are interested in to maskGenerator.
CODE Solution
Create an input.json file with your JSON in the same folder as the solution:
import json

import paranoid

# Keys that should NOT be masked
list_not_to_mask = ["id"]

with open("input.json") as input_file:
    input_dict = json.loads(input_file.read())
    print(input_dict)

output_dict = input_dict
for key in input_dict.keys():
    if key in list_not_to_mask:
        pass
    else:
        # Reuse PARANOID's maskGenerator on the value only
        output_dict[key] = paranoid.maskGenerator(str(input_dict[key]), is_json=True)

print(output_dict)

with open('output.json', 'w') as output_file:
    json.dump(output_dict, output_file, ensure_ascii=False, indent=4)
OUTPUT
It will also create an output.json.
The input we have is: {'id': 324324, 'first_name': 'John', 'last_name': 'Smith', 'email': 'john.smith#abc.com'}
The output we have is: {'id': 324324, 'first_name': 'Vxqt', 'last_name': 'Yiphq', 'email': 'vuxr.kcicc#muj.jbj'}

Extract value from incorrectly parsed dict from JSON output

I am processing Kafka streams in a Python Flask server. I read the responses with json and need to extract the udid values from the stream. I read each response with request.json and save it in a dictionary. When I try to parse it, it fails. The dict contains the following values:
dict_items([('data', {'SDKVersion': '7.1.2', 'appVersion': '6.5.5', 'dateTime': '2019-08-05 15:01:28+0200', 'device': 'iPhone', 'id': '3971',....})])
Parsing the dict the normal way doesn't work, i.e. event_data['status'] gives an error. Perhaps it is because it's not a pure dict?
@app.route('/data/idApp/5710/event/start', methods=['POST'])
def give_greeting():
    print("Hola")
    event_data = request.json
    print(event_data.items())
    print(event_data['status'])
    #print(event_data['udid'])
    #print(event_data['Additional'])
    return 'Hello, {0}!'.format(event_data)
The values contained in event_data are the following:
dict_items([('data', {'SDKVersion': '7.1.2', 'appVersion': '6.5.5', 'dateTime': '2019-08-05 15:01:28+0200', 'device': 'iPhone', 'id': '3971',....})])
The expected result would be:
print(event_data['status'])->start
print(event_data['udid'])->BAEB347B-9110-4CC8-BF99-FA4039C4599B
print(event_data['SDKVersion'])->7.1.2
etc
The output of print(event_data.keys()) is dict_keys(['data']).
The data you are expecting is wrapped in an additional data property.
You only need to do one extra step to access this data.
data_dict = request.json
event_data = data_dict['data']
Now you should be able to access the information you want with
event_data['SDKVersion']
...
as you have described above.
As @jonrsharpe stated, this is not an issue with the parsing. The parsing either fails or succeeds, but you will never get a "broken" object (be it dict, list, ...) from parsing JSON.
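Putting that together, a minimal sketch of the route from the question with the extra unwrapping step; the field names udid and SDKVersion come from the question, everything else is the question's own handler:
from flask import Flask, request

app = Flask(__name__)

@app.route('/data/idApp/5710/event/start', methods=['POST'])
def give_greeting():
    data_dict = request.json        # whole payload: {'data': {...}}
    event_data = data_dict['data']  # unwrap the extra 'data' property first
    print(event_data['SDKVersion'])
    print(event_data.get('udid'))   # .get() avoids a KeyError if the field is absent
    return 'Hello, {0}!'.format(event_data)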

Dataflow Streaming using Python SDK: Transform for PubSub Messages to BigQuery Output

I am attempting to use dataflow to read a pubsub message and write it to big query. I was given alpha access by the Google team and have gotten the provided examples working but now I need to apply it to my scenario.
Pubsub payload:
Message {
    data: {'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}
    attributes: {}
}
Big Query Schema:
schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP',
My goal is simply to read the message payload and insert it into BigQuery. I am struggling to get my head around the transformations and how I should map the key/values to the BigQuery schema.
I am very new to this so any help is appreciated.
Current code:https://codeshare.io/ayqX8w
Thanks!
I was able to successfully parse the pubsub string by defining a function that loads it into a json object (see parse_pubsub()). One weird issue I encountered was that I was not able to import json at the global scope. I was receiving "NameError: global name 'json' is not defined" errors. I had to import json within the function.
See my working code below:
from __future__ import absolute_import

import argparse
import logging

import apache_beam as beam
import apache_beam.transforms.window as window


def parse_pubsub(line):
    """Normalize a pubsub string to a tuple of values.

    Lines look like this:
    {'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}
    """
    import json
    record = json.loads(line)
    return (record['mac'], record['status'], record['datetime'])


def run(argv=None):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input_topic', required=True,
        help='Input PubSub topic of the form "/topics/<PROJECT>/<TOPIC>".')
    parser.add_argument(
        '--output_table', required=True,
        help=('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
              'or DATASET.TABLE.'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        # Read the pubsub topic into a PCollection, parse each line,
        # build the row dict and write it to BigQuery.
        lines = (p | beam.io.ReadStringsFromPubSub(known_args.input_topic)
                   | beam.Map(parse_pubsub)
                   | beam.Map(lambda row: {'mac': row[0], 'status': row[1], 'datetime': row[2]})
                   | beam.io.WriteToBigQuery(
                       known_args.output_table,
                       schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP',
                       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                 )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Data written to the Python SDK's BigQuery sink should be in the form of a dictionary where each key of the dictionary names a field of the BigQuery table and the corresponding value gives the value to be written to that field. For a BigQuery RECORD type, the value itself should be a dictionary with corresponding key/value pairs.
I filed a JIRA to improve the documentation of the corresponding Python module in Beam: https://issues.apache.org/jira/browse/BEAM-3090
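For illustration, a minimal sketch of row dictionaries in that shape; the flat row matches the question's schema, while the nested RECORD field and its sub-fields are hypothetical:
# Flat row matching schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP'
flat_row = {
    'mac': 'FC:FC:48:AE:F6:94',
    'status': 1,
    'datetime': '2017-07-13T21:15:02Z',
}

# Row for a schema with a RECORD field, e.g. 'mac:STRING, meta:RECORD'
# where 'meta' has hypothetical sub-fields 'firmware:STRING' and 'rssi:INTEGER'
record_row = {
    'mac': 'FC:FC:48:AE:F6:94',
    'meta': {
        'firmware': '1.2.3',
        'rssi': -67,
    },
}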
I have a similar use case (reading rows as strings from PubSub, converting them to dicts and then processing them).
I am using ast.literal_eval(), which seems to work for me. This command will evaluate the string, but in a safer way than eval() (see here). It should return a dict whose keys are strings and whose values are evaluated to the most likely type (int, str, float, ...). You may want to make sure the values take the correct type, though.
This would give you a pipeline like this:
import ast

lines = (p | beam.io.ReadStringsFromPubSub(known_args.input_topic)
           | "JSON row to dict" >> beam.Map(lambda s: ast.literal_eval(s))
           | beam.io.WriteToBigQuery( ... )
         )
I have not used BigQuery (yet), so I cannot help you on the last line, but what you wrote seems correct at first glance.
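For reference, a quick standalone check of what ast.literal_eval() gives back for a line shaped like the sample above:
import ast

line = "{'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}"
row = ast.literal_eval(line)

print(type(row))             # <class 'dict'>
print(row['mac'])            # FC:FC:48:AE:F6:94
print(type(row['status']))   # <class 'int'> -- values get their most likely Python type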

Accessing Elasticsearch using python

I'm currently trying to write a script to enrich some data. I've already coded some things that work fine with a demo-data txt file, but now I'd like to try and directly request the latest data from the server in the script.
The data I'm working with is stored on Elasticsearch. I've received a URL, including the port number. I also have a cluster ID, a username, and a password.
I can access the data directly using Kibana, where I enter the following into the console (under Dev Tools):
GET /*projectname*/appevents/_search?pretty=true&size=10000
I can copy the output into a TXT file (well, it's actually JSON data), which currently gets parsed by my script. I'd prefer to just collect the data directly without this intermediate step. Also, I'm currently limited to 10000 records/events, but I'd like to get all of them.
This works:
import requests
from requests.auth import HTTPBasicAuth

res = requests.get('*url*:*port*',
                   auth=HTTPBasicAuth('*username*', '*password*'))
print(res.content)
I'm struggling with the elasticsearch package. How do I mimic the 'get' command listed above in my script, collecting everything in a JSON format?
Fixed - I got some help from a programmer. Everything is stored in a list, so I can work with it from there. Code below; identifying info is removed.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=[{'host': '***', 'port': ***}],
    http_auth=('***', '***'),
    use_ssl=True
)

count = es.count(index="***", doc_type="***")
print(count)  # {u'count': 244532, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}}

# Use scroll to ease strain on the cluster (don't pull in all results at once)
results = es.search(index="***", doc_type="***", size=1000, scroll="30s")
scroll_id = results['_scroll_id']
total_size = results['hits']['total']
print(total_size)

# Save all results in a list (don't forget the first batch returned by search itself)
dump = results['hits']['hits']
ct = 1
while total_size > 0:
    results = es.scroll(scroll_id=scroll_id, scroll='30s')
    dump += results['hits']['hits']
    scroll_id = results['_scroll_id']
    total_size = len(results['hits']['hits'])  # As long as there are results, keep going ...
    print("Chunk #", ct, ": ", total_size, "\tList size: ", len(dump))
    ct += 1

es.clear_scroll(body={'scroll_id': [scroll_id]})  # Cleanup (otherwise the scroll id remains in ES memory)
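As an aside, the elasticsearch Python package also provides a helpers.scan() generator that handles the scroll bookkeeping for you; a minimal sketch, assuming the same es client object and the index/doc_type placeholders from above:
from elasticsearch import helpers

# helpers.scan yields every hit and manages scroll ids and cleanup internally
dump = list(helpers.scan(
    es,
    index="***",
    doc_type="***",
    query={"query": {"match_all": {}}},
    size=1000,
    scroll="30s",
))
print(len(dump))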
