bson.errors.InvalidBSON out of range - python

I have a query from Python to MongoDB through PyMongo.
Some records are processed fine, but on one record it stops working. It looks like that record has a different date format, but then how did it get matched by find()?
from bson import ObjectId
import config_auth
from operator import itemgetter
from datetime import datetime
import pyodbc
import pymongo

mydb = config_auth.mydb

def load():
    temp_arr = []
    for item in mydb.questionaries.find({'created_at': {"$gt": datetime(2019,10,30), "$lt": datetime(2019,10,31)}}):
        temp = []
        temp.append(str(item['_id']))
        print(item['created_at'])
        temp_arr.append(tuple(temp))
After this I get the following output and error:
2019-10-30 15:36:09.920000
2019-10-30 15:36:02.344000
2019-10-30 15:36:02.344000
2019-10-30 15:33:47.360000
2019-10-30 15:33:47.360000
Traceback (most recent call last):
File "c:/Users/d.konoplya/Desktop/python/etl_finservice/questionaries.py", line 115, in <module>
print(load())
File "c:/Users/d.konoplya/Desktop/python/etl_finservice/questionaries.py", line 16, in load
for item in mydb.questionaries.find({ 'created_at' : {"$gt": datetime(2019,10,30), "$lt": datetime(2019,10,31)}}):
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py", line 1156, in next
if len(self.__data) or self._refresh():
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py", line 1093, in _refresh
self.__send_message(g)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py", line 955, in __send_message
address=self.__address)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py", line 1346, in _run_operation_with_response
exhaust=exhaust)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\mongo_client.py", line 1340, in _cmd
unpack_res)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\server.py", line 131, in run_operation_with_response
user_fields=user_fields)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\cursor.py", line 1030, in _unpack_response
legacy_response)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\pymongo\message.py", line 1538, in unpack_response
self.documents, codec_options, user_fields)
File "C:\Users\d.konoplya\AppData\Local\Programs\Python\Python37\lib\site-packages\bson\__init__.py", line 1098, in _decode_all_selective
bson.errors.InvalidBSON: year 0 is out of range
A sample document from the collection looks like this:
{
"_id" : ObjectId("5db9849eb491a900016f913b"),
"updated_at" : ISODate("2019-10-30T13:10:29.320Z"),
"created_at" : ISODate("2019-10-30T12:39:58.277Z"),
"state" : "credit_issued",
"registred_in_1c_at" : ISODate("2019-10-30T13:33:19.504Z"),
"signer_id" : ObjectId("5d584ab05aeafd000191518a")
}
But each record has a date with year > 0. What could be the problem?
Thanks

If you did:
import datetime
This could cause problems; try
from datetime import datetime
It's a shame they couldn't have named them something different to prevent the bear traps that are so easy to fall into.
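A quick illustration of the trap (an aside, not from the original answer):
import datetime

# With the module import above, calling datetime(...) fails with
#   TypeError: 'module' object is not callable
# You have to go through the class instead:
datetime.datetime(2019, 10, 30)

# Importing the class directly (as the question's code does) lets the short call work:
from datetime import datetime
datetime(2019, 10, 30)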

I actually found it: this happens because MongoDB/BSON can represent dates outside the range of Python's datetime.datetime.
PyMongo decodes BSON datetime values to instances of Python’s
datetime.datetime. Instances of datetime.datetime are limited to years
between datetime.MINYEAR (usually 1) and datetime.MAXYEAR (usually
9999). Some MongoDB drivers (e.g. the PHP driver) can store BSON
datetimes with year values far outside those supported by
datetime.datetime.
Read more about this in the pymongo FAQ: https://pymongo.readthedocs.io/en/stable/faq.html#why-do-i-get-overflowerror-decoding-dates-stored-by-another-language-s-driver
The FAQ suggests the following workarounds.
One is to query only for valid date ranges:
>>> from datetime import datetime
>>> coll = client.test.dates
>>> cur = coll.find({'dt': {'$gte': datetime.min, '$lte': datetime.max}})
The other is to exclude the offending field from the results:
>>> cur = coll.find({}, projection={'dt': False})
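Applied to the collection in the question, the two workarounds might look like the sketch below; bad_date_field is a placeholder for whichever date field actually holds the out-of-range value (the post does not identify it), and mydb is the same handle as in the question:
from datetime import datetime

# Workaround 1: only match documents whose date field lies inside
# the range that Python's datetime.datetime can represent.
cur = mydb.questionaries.find(
    {'bad_date_field': {'$gte': datetime.min, '$lte': datetime.max}})

# Workaround 2: drop the offending field from the results entirely,
# so PyMongo never has to decode it.
cur = mydb.questionaries.find(
    {'created_at': {'$gt': datetime(2019, 10, 30),
                    '$lt': datetime(2019, 10, 31)}},
    projection={'bad_date_field': False})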

Related

BulkIndexError while using elasticsearch with Python

I am using the following code to load some data into Elasticsearch from a Python module:
def loadNewsDataToElasticsearch():
    if not es:
        print('Elasticsearch server not accessible.')

    # Delete all previous data
    print('Clearing old data from Elasticsearch server.')
    indices = ['articles']
    if es.indices.exists(index='articles'):
        es.delete_by_query(index=indices, body={"query": {"match_all": {}}})

    print('Loading new data from csv file to dataframe.')
    df = pd.read_csv(directory+"elasticsearch/dataset/newsDataSmall.csv", index_col=0)
    dataset = {'title': df['title'], 'link': df['link'], 'date': df.index}

    # Elasticsearch bulk buffer
    buffer = []
    rows = 0

    # Testing
    # print(dataset['title'][0])

    print('Inserting new data to Elasticsearch.')
    for title, link, date in zip(dataset['title'], dataset['link'], dataset['date']):
        # Article record
        article = {"_id": rows, "_index": "articles", "title": title, "link": link, "date": date}
        # Buffer article
        buffer.append(article)
        # Increment number of articles processed
        rows += 1
        # Bulk load every 1000 records
        if rows % 1000 == 0:
            helpers.bulk(es, buffer)
            buffer = []
            print("Inserted {} articles".format(rows), end="\r")

    if buffer:
        helpers.bulk(es, buffer)

    print("Total articles inserted: {}".format(rows))
    #return(es)
The above code runs perfectly fine when I use it on Windows in Anaconda. However, when I run the same code on a Linux server, I get the following error:
loadNewsDataToElasticsearch()
Clearing old data from Elasticsearch server.
Loading new data from csv file to dataframe.
Inserting new data to Elasticsearch.
Traceback (most recent call last):
File "", line 1, in
File "/home/priyanka/final5-refined-pyc3.9.15/nlpfnews/semanticSearch.py", line 145, in loadNewsDataToElasticsearch
helpers.bulk(es, buffer)
File "/home/priyanka/miniconda3/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
for ok, item in streaming_bulk(
File "/home/priyanka/miniconda3/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
for data, (ok, info) in zip(
File "/home/priyanka/miniconda3/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
yield from gen
File "/home/priyanka/miniconda3/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)
elasticsearch.helpers.BulkIndexError: 500 document(s) failed to index.
I have been stuck on this for the past few days and have not been able to resolve it. Any help or suggestions would be appreciated.
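One way to see why the documents are rejected (a sketch, assuming the same es client and a populated buffer as in the code above): BulkIndexError only reports the failure count in its message, but it carries the per-document error details, and helpers.bulk can also be told to return them instead of raising.
from elasticsearch import helpers

try:
    helpers.bulk(es, buffer)
except helpers.BulkIndexError as exc:
    # Each entry pairs a rejected document with the server's reason
    # (e.g. a mapping conflict or a field-parsing failure).
    for err in exc.errors[:5]:
        print(err)

# Alternatively, collect the failures instead of raising:
ok, errors = helpers.bulk(es, buffer, raise_on_error=False, stats_only=False)
print(ok, errors[:5])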

UnicodeDecodeError for md5 id bulk importing data into elasticsearch

I have written a simple Python script to import data into Elasticsearch using the bulk API.
# -*- encoding: utf-8 -*-
import csv
import datetime
import hashlib

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from dateutil.relativedelta import relativedelta

ORIGINAL_FORMAT = '%y-%m-%d %H:%M:%S'
INDEX_PREFIX = 'my-log'
INDEX_DATE_FORMAT = '%Y-%m-%d'
FILE_ADDR = '/media/zeinab/ZiZi/Elastic/python/elastic-test/elasticsearch-import-data/sample_data/sample.csv'


def set_data(input_file):
    with open(input_file) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            sendtime = datetime.datetime.strptime(row['sendTime'].split('.')[0], ORIGINAL_FORMAT)
            yield {
                "_index": '{0}-{1}_{2}'.format(
                    INDEX_PREFIX,
                    sendtime.replace(day=1).strftime(INDEX_DATE_FORMAT),
                    (sendtime.replace(day=1) + relativedelta(months=1)).strftime(INDEX_DATE_FORMAT)),
                "_type": 'data',
                '_id': hashlib.md5("{0}{1}{2}{3}{4}".format(sendtime, row['IMSI'], row['MSISDN'], int(row['ruleRef']), int(row['sponsorRef']))).digest(),
                "_source": {
                    'body': {
                        'status': int(row['status']),
                        'sendTime': sendtime
                    }
                }
            }


if __name__ == "__main__":
    es = Elasticsearch(['http://{0}:{1}'.format('my.host.ip.addr', 9200)])
    es.indices.delete(index='*')
    success, _ = bulk(es, set_data(FILE_ADDR))
This comment helped me with writing/using the set_data method.
Unfortunately I get this exception:
/usr/bin/python2.7 /media/zeinab/ZiZi/Elastic/python/elastic-test/elasticsearch-import-data/import_bulk_data.py
Traceback (most recent call last):
File "/media/zeinab/ZiZi/Elastic/python/elastic-test/elasticsearch-import-data/import_bulk_data.py", line 59, in <module>
success, _ = bulk(es, set_data(source_file))
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 257, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 180, in streaming_bulk
client.transport.serializer):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 60, in _chunk_actions
action = serializer.dumps(action)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/serializer.py", line 50, in dumps
raise SerializationError(data, e)
elasticsearch.exceptions.SerializationError: ({u'index': {u'_type': 'data', u'_id': '8\x1dI\xa2\xe9\xa2H-\xa6\x0f\xbd=\xa7CY\xa3', u'_index': 'my-log-2017-04-01_2017-05-01'}}, UnicodeDecodeError('utf8', '8\x1dI\xa2\xe9\xa2H-\xa6\x0f\xbd=\xa7CY\xa3', 3, 4, 'invalid start byte'))
Process finished with exit code 1
I can insert this data into Elasticsearch successfully using the index API:
es.index(index='{0}-{1}_{2}'.format(
             INDEX_PREFIX,
             sendtime.replace(day=1).strftime(INDEX_DATE_FORMAT),
             (sendtime.replace(day=1) + relativedelta(months=1)).strftime(INDEX_DATE_FORMAT)
         ),
         doc_type='data',
         id=hashlib.md5("{0}{1}{2}{3}{4}".format(sendtime, row['IMSI'], row['MSISDN'], int(row['ruleRef']), int(row['sponsorRef']))).digest(),
         body={
             'status': int(row['status']),
             'sendTime': sendtime
         })
But the issue with the index API is that it's very slow; it needs about 2 seconds to import just 50 records. I hoped the bulk API would help me with speed.
According to the hashlib documentation, the digest method will
Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255.
So the resulting bytes may not be decodable as Unicode.
>>> id_ = hashlib.md5('abc'.encode('utf-8')).digest()
>>> id_
b'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
>>> id_.decode('utf-8')
Traceback (most recent call last):
File "<console>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte
The hexdigest method will produce a string as output; from the docs:
Like digest() except the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
>>> id_ = hashlib.md5('abc'.encode('utf-8')).hexdigest()
>>> id_
'900150983cd24fb0d6963f7d28e17f72'
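Applied to set_data above, only the _id line needs to change so the id becomes an ASCII hex string (shown for Python 2 as in the post; under Python 3 the formatted string would also need .encode('utf-8') before hashing):
# Only the _id line changes; hexdigest() yields a plain hex string
# that the bulk serializer can handle.
'_id': hashlib.md5("{0}{1}{2}{3}{4}".format(
    sendtime, row['IMSI'], row['MSISDN'],
    int(row['ruleRef']), int(row['sponsorRef']))).hexdigest(),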

Filter items newer than 1 hour with RethinkDB and Python

I have a Python script gathering some metrics and saving them to RethinkDB. I have also written a small Flask application to display the data on a dashboard.
Now I need to run a query to find all rows in a table newer than 1 hour. This is what I have so far:
tzinfo = pytz.timezone('Europe/Oslo')
start_time = tzinfo.localize(datetime.now() - timedelta(hours=1))
r.table('metrics').filter(lambda m:
    m.during(start_time, r.now())
).run(connection)
When I try to visit the page I get this error message:
ReqlRuntimeError: Not a TIME pseudotype: `{
"listeners": "6469",
"time": {
"$reql_type$": "TIME",
"epoch_time": 1447581600,
"timezone": "+01:00"
}
}` in:
r.table('metrics').filter(lambda var_1:
var_1.during(r.iso8601('2015-11-18T12:06:20.252415+01:00'), r.now()))
I googled a bit and found this thread, which seems to describe a similar problem: https://github.com/rethinkdb/rethinkdb/issues/4827, so I also revisited how I add new rows to the database to see if that was the issue:
def _fix_tz(timestamp):
    tzinfo = pytz.timezone('Europe/Oslo')
    dt = datetime.strptime(timestamp[:-10], '%Y-%m-%dT%H:%M:%S')
    return tzinfo.localize(dt)
...

for row in res:
    # ... remove some data, manipulate some other data ...
    r.table('metrics').insert(
        {'time': _fix_tz(row['_time']),
         ...
        }).run(connection)
The '_time' value retrieved by my data collection script contains some garbage that I remove before creating a datetime object. As far as I can understand from the RethinkDB documentation, I should be able to insert these directly, and if I use the Data Explorer in RethinkDB's admin panel my rows look like this:
{
...
"time": Sun Oct 25 2015 00:00:00 GMT+02:00
}
Update:
I did another test and created a small script to insert data and then retrieve it:
import rethinkdb as r

conn = r.connect(host='localhost', port=28015, db='test')

r.table('timetests').insert({
    'time': r.now(),
    'message': 'foo!'
}).run(conn)

r.table('timetests').insert({
    'time': r.now(),
    'message': 'bar!'
}).run(conn)

cursor = r.table('timetests').filter(
    lambda t: t.during(r.now() - 3600, r.now())
).run(conn)
I still get the same error message:
$ python timestamps.py
Traceback (most recent call last):
File "timestamps.py", line 21, in <module>
).run(conn)
File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/ast.py", line 118, in run
return c._start(self, **global_optargs)
File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/net.py", line 595, in _start
return self._instance.run_query(q, global_optargs.get('noreply', False))
File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/net.py", line 457, in run_query
raise res.make_error(query)
rethinkdb.errors.ReqlQueryLogicError: Not a TIME pseudotype: `{
"id": "5440a912-c80a-42dd-9d27-7ecd6f7187ad",
"message": "bar!",
"time": {
"$reql_type$": "TIME",
"epoch_time": 1447929586.899,
"timezone": "+00:00"
}
}` in:
r.table('timetests').filter(lambda var_1: var_1.during((r.now() - r.expr(3600)), r.now()))
I finally figured it out. The error is in the lambda expression: you need to call .during() on a specific field. If not, the query will try to wrestle the whole row/document into a timestamp.
This code works:
cursor = r.table('timetests').filter(
    lambda t: t['time'].during(r.now() - 3600, r.now())
).run(conn)
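Applied to the original dashboard query at the top of the question, the same fix would look like this (a sketch, reusing start_time as defined there and the 'time' field shown in the error message):
r.table('metrics').filter(lambda m:
    m['time'].during(start_time, r.now())
).run(connection)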

How to push into array nested in dictionary?

I want to create a MongoDB database to store homework results. I create a "homework" dictionary that stores an array of results for each subject.
import pymongo

DBCONN = pymongo.Connection("127.0.0.1", 27017)
TASKSINFO = DBCONN.tasksinfo

_name = "john"
taskid = TASKSINFO.tasksinfo.insert(
    {"name": _name,
     "homework": {"bio": [], "math": []}
     })

TASKSINFO.tasksinfo.update({"_id": taskid},
                           {"$push": {"homework.bio", 92}})
When I tried to push some information to the db, I got this error:
Traceback (most recent call last):
File "mongo_push_demo.py", line 13, in <module>
{"$push": {"homework.bio", 92}})
File "/usr/local/lib/python2.7/dist-packages/pymongo-2.5-py2.7-linux-i686.egg/pymongo/collection.py", line 479, in update
check_keys, self.__uuid_subtype), safe)
File "/usr/local/lib/python2.7/dist-packages/pymongo-2.5-py2.7-linux-i686.egg/pymongo/message.py", line 110, in update
encoded = bson.BSON.encode(doc, check_keys, uuid_subtype)
File "/usr/local/lib/python2.7/dist-packages/pymongo-2.5-py2.7-linux-i686.egg/bson/__init__.py", line 567, in encode
return cls(_dict_to_bson(document, check_keys, uuid_subtype))
File "/usr/local/lib/python2.7/dist-packages/pymongo-2.5-py2.7-linux-i686.egg/bson/__init__.py", line 476, in _dict_to_bson
elements.append(_element_to_bson(key, value, check_keys, uuid_subtype))
File "/usr/local/lib/python2.7/dist-packages/pymongo-2.5-py2.7-linux-i686.egg/bson/__init__.py", line 466, in _element_to_bson
type(value))
bson.errors.InvalidDocument: cannot convert value of type <type 'set'> to bson
{"$push": {"homework.bio", 92}})
It should be :, not ,.
{'a', 1} is a set of two elements in Python; that's why you get the error.
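For reference, the corrected update call passes a dict (key: value) instead of a set, using the same names as in the question:
TASKSINFO.tasksinfo.update({"_id": taskid},
                           {"$push": {"homework.bio": 92}})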

python json serialize datetime

First, a simple question on terms:
encoding (json.dumps) means converting something to a JSON string,
decoding (json.loads) means converting a JSON string to a JSON type(?)
I have a list of objects which I got from
>>> album_image_list = AlbumImage.objects.all().values(*fields)[offset:count]
>>> json.dumps(album_image_list[0], cls=DjangoJSONEncoder)
'{"album": 4, "album__title": "g jfd", "created_at": "2012-08-18T02:23:49Z", "height": 1024.0, "width": 512.0, "url_image": "http://--:8000/media/101ac908-df50-42cc-af6f-b172c8829a31.jpg"}'
but when I try the same on the whole list (album_image_list), it fails...
>>> json.dumps(album_image_list, cls=DjangoJSONEncoder)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/lib/python2.6/json/__init__.py", line 237, in dumps
**kw).encode(obj)
File "/usr/lib/python2.6/json/encoder.py", line 367, in encode
chunks = list(self.iterencode(o))
File "/usr/lib/python2.6/json/encoder.py", line 317, in _iterencode
for chunk in self._iterencode_default(o, markers):
File "/usr/lib/python2.6/json/encoder.py", line 323, in _iterencode_default
newobj = self.default(o)
File "/home/--/virtualenvs/aLittleArtist/lib/python2.6/site-packages/django/core/serializers/json.py", line 75, in default
return super(DjangoJSONEncoder, self).default(o)
File "/usr/lib/python2.6/json/encoder.py", line 344, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: [{'album': 4L, 'album__title': u'g jfd', 'created_at': datetime.datetime(2012, 8, 18, 2, 23, 49, tzinfo=<UTC>), 'height': 1024.0, 'width': 512.0, 'url_image': u'http://--:8000/media/101ac908-df50-42cc-af6f-b172c8829a31.jpg'}, {'album': 4L, 'album__title': u'g jfd', 'created_at': datetime.datetime(2012, 8, 18, 1, 54, 51, tzinfo=<UTC>), 'height': 512.0, 'width': 512.0, 'url_image': u'http://--:8000/media/e85d1cf7-bfd8-4e77-b90f-d1ee01c67392.jpg'}] is not JSON serializable
>>>
Why does it succeed on one element but fail on the list?
If you just want to dump a dictionary to JSON, use json.dumps. It can easily be made to serialize objects by passing in a custom serializer class; there's one included with Django that already deals with datetimes:
from django.core.serializers.json import DjangoJSONEncoder
json.dumps(mydictionary, cls=DjangoJSONEncoder)
.values() doesn't actually return a list. It returns a ValuesQuerySet, which is not serializable by the json module. Try converting album_image_list to a list:
json.dumps(list(album_image_list), cls=DjangoJSONEncoder)
Which DjangoJSONEncoder are you using?
It looks like DjangoJSONEncoder may not support encoding a list of results.
Try this:
JSON Serializing Django Models with simplejson
class DateTimeJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        else:
            return super(DateTimeJSONEncoder, self).default(obj)

updated_at = DateTimeJSONEncoder().encode(p.updated_at)
This will help you serialize datetime objects.
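Combining this encoder with the list() conversion from the other answer, the original queryset could be dumped in one call (a sketch using the names from the question):
json.dumps(list(album_image_list), cls=DateTimeJSONEncoder)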
