Invalid Document: Cannot encode object - python

Sample documents in the collection are as follows:
[
{ "_id" : ObjectId("57690ce4a89aa8aa92ed1896"), "total_enters" : 308974, "Segment" : "7", "Chain" : "11625", "Geography" : "303"},
{ "_id" : ObjectId("57690ce4a89aa8aa92ed1897"), "total_enters" : 311076, "Segment" : "7", "Chain" : "4624", "Geography" : "303"}
]
I have a collection in the above format.
The following query is returning an error:
InvalidDocument: Cannot encode object: set(['$total_enters'])
cursor_v2 = db.Chain_share.aggregate([{
    "$group": {
        "_id": {"Geography": "$Geography", "Segment": "$Segment"},
        "enters": {"$sum": "$total_enters"},
        "max_chain": {"$max": {"$total_enters"}}
    }
}])
The field total_enters is itself an aggregate, obtained from an earlier $sum operation. I can't figure out why it cannot be aggregated again.
What is causing the encoding error?
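The error message itself hints at the cause: in Python, braces around a single value with no key, as in {"$total_enters"}, form a set literal, and BSON has no encoding for a set. Below is a minimal sketch of the corrected stage, assuming the intent was the largest total_enters per group (this yields the value, not the Chain it belongs to). The helper name build_pipeline is made up for illustration.

```python
def build_pipeline():
    """Return the $group stage with "$max" given a field path string."""
    return [{
        "$group": {
            "_id": {"Geography": "$Geography", "Segment": "$Segment"},
            "enters": {"$sum": "$total_enters"},
            # A plain string, not {"$total_enters"}, which Python reads as a set.
            "max_chain": {"$max": "$total_enters"},
        }
    }]


pipeline = build_pipeline()
# Braces around a single value build a set, which is what could not be encoded.
print(type({"$total_enters"}).__name__)
```

With pymongo, this would be passed as db.Chain_share.aggregate(build_pipeline()).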

Related

Exception when trying to parse large JSON file using ijson

I am trying to parse a large JSON file (16GB) using ijson but I always get the following error :
Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
venue" : { "type" : NumberInt(0) }, "yea
(right here) ------^
File "C:\pyth\dblp_parser.py", line 14, in <module>
for record in ijson.items(f, 'item', use_float=True):
My code is as follows:
with open("dblpv13.json", "rb") as f:
    for record in ijson.items(f, 'records.item', use_float=True):
        paper_id = record["_id"]  # _id is only for test
        paper_id_tab.append(paper_id)
A part of my json file is as follows:
{
"_id" : "53e99784b7602d9701f3f636",
"title" : "Flatlined",
"authors" : [
{
"_id" : "53f58b15dabfaece00f8046d",
"name" : "Peter J. Denning",
"org" : "ACM Education Board",
"gid" : "5b86c72de1cd8e14a3c2b772",
"oid" : "544bd99545ce266baef0668a",
"orgid" : "5f71b2811c455f439fe3c58a"
}
],
"venue" : {
"_id" : "555036f57cea80f954169e28",
"raw" : "Commun. ACM",
"raw_zh" : null,
"publisher" : null,
"type" : NumberInt(0)
},
"year" : NumberInt(2002),
"keywords" : [
"linear scale",
"false dichotomy"
],
"n_citation" : NumberInt(7),
"page_start" : "15",
"page_end" : "19",
"lang" : "en",
"volume" : "45",
"issue" : "6",
"issn" : "",
"isbn" : "",
"doi" : "10.1145/508448.508463",
"pdf" : "",
"url" : [
"http://doi.acm.org/10.1145/508448.508463"
],
"abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},
I tried to fill in the records item by item, but I always get the same error. I'm completely blocked.
Can anybody help me?
I ran into the same problem with this dataset. ijson can't handle it, because NumberInt(...) is not valid JSON. I worked around it by creating a cleaned copy of the dataset and parsing that with ijson instead. The approach is simple: read the original dataset line by line, remove "NumberInt(" and ")", and write the result to a new JSON file. The code is given below.
# open the output file for writing ('w'); the default mode is read-only
f = open('dblpv13_clean.json', 'w')
with open('dblpv13.json', 'r', errors='ignore') as myFile:
    for line in myFile:
        line = line.replace("NumberInt(", "").replace(")", "")
        f.write(line)
f.close()
Now you can parse the new dataset with ijson as follows.
with open('dblpv13_clean.json', "r", errors='ignore') as f:
    for i, element in enumerate(ijson.items(f, "item")):
        pass  # do something....
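One caveat with the replace-based cleanup: it removes every closing parenthesis in the file, including those inside titles and abstracts. A more conservative variant, sketched here as an assumption rather than as part of the original answer, uses a regular expression that only unwraps NumberInt(...). The names NUMBER_INT and clean_line are hypothetical.

```python
import re

# Matches NumberInt(123) or NumberInt(-5) and captures only the digits.
NUMBER_INT = re.compile(r"NumberInt\((-?\d+)\)")


def clean_line(line):
    """Replace every NumberInt(n) wrapper with the bare integer n."""
    return NUMBER_INT.sub(r"\1", line)


sample = '"year" : NumberInt(2002), "title" : "Flatlined (reprint)"'
print(clean_line(sample))
```

Calling clean_line on each line before writing it would leave parentheses in ordinary text untouched.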

Extracting and updating a dictionary from an array of dictionaries in MongoDB

I have a structure like this:
{
"id" : 1,
"user" : "somebody",
"players" : [
{
"name" : "lala",
"surname" : "baba",
"player_place" : "1",
"start_num" : "123",
"results" : {
"1" : { ... }
"2" : { ... },
...
}
},
...
]
}
I am pretty new to MongoDB and I just cannot figure out how to extract results for a specific user (in this case "somebody", but there are many other users and each has an array of players and each player has many results) for a specific player with start_num.
I am using pymongo and this is the code I came up with:
record = collection.find(
{'user' : name}, {'players' : {'$elemMatch' : {'start_num' : start_num}}, '_id' : False}
)
This returns the given user's document with only the matching player. That is good, but now I need to get a specific result from results, something like this:
{ 'results' : { '2' : { ... } } }.
I tried:
record = collection.find(
{'user' : name}, {'players' : {'$elemMatch' : {'start_num' : start_num}}, 'results' : result_num, '_id' : False}
)
but that, of course, doesn't work. I could just turn that to list in Python and extract what I need, but I would like to do that with query in Mongo.
Also, what would I need to do to replace specific result in results for specific player for specific user? Let's say I have a new result with key 2 and I want to replace existing result that has key 2. Can I do it with same query as for find() (just replacing method find with method replace or find_and_replace)?
You can replace a specific result with the positional $ operator. Since you are using pymongo, the call should look something like this,
assuming you want to replace the result with key 1:
collection.update_one(
    {"user": name, "players.start_num": start_num},
    {"$set": {"players.$.results.1": new_result}}
)
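The result key does not have to be hard-coded. As a minimal sketch, the filter and update documents can be built from variables. The helper name build_result_update and the sample result value are made up for illustration; only the field names come from the question.

```python
def build_result_update(name, start_num, result_num, new_result):
    """Return (filter, update) documents for replacing one player's result."""
    query = {"user": name, "players.start_num": start_num}
    # "$" stands for the first players element matched by the filter.
    update = {"$set": {f"players.$.results.{result_num}": new_result}}
    return query, update


# The result payload here is a placeholder, not real data.
query, update = build_result_update("somebody", "123", "2", {"time": "00:59"})
print(update)
```

With pymongo, the two documents would then be passed as collection.update_one(query, update).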

Print only specific parts of json file

I am wondering what I am doing wrong when trying to print the name field with the following Python code.
import urllib.request, json
with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())
print(data['Departure']['Product']['name'])
print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API response I am fetching the data from:
{
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}]}}]}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
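Because "Departure" is a list, the response may well contain more than one departure. A small sketch that loops over all of them instead of hard-coding index 0 follows; the function name departure_summaries is hypothetical, and the sample dictionary is a trimmed copy of the one above.

```python
data = {
    "Departure": [{
        "Product": {"name": "Länstrafik - Buss 201"},
        "Stops": {"Stop": [{"name": "Gislaved Lundåkerskolan", "depTime": "20:55:00"}]},
    }]
}


def departure_summaries(payload):
    """Return (product name, first stop's depTime) for every departure."""
    summaries = []
    for departure in payload["Departure"]:
        first_stop = departure["Stops"]["Stop"][0]
        summaries.append((departure["Product"]["name"], first_stop["depTime"]))
    return summaries


for name, dep_time in departure_summaries(data):
    print(name, dep_time)
```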
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are two possible solutions:
1. Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
2. Change the JSON so that Departure is a dictionary instead of a list, which would let you keep your original lookup for the name. This is only an option if you control the data source. It would make the final JSON snippet look like this (note that "Stops" is flattened here as well, so depTime would be read as data["Departure"]["Stops"]["depTime"]):
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
}

Python parsing json file to access values returning TypeError

I am using python to parse a json file full of url data to try and build a url reputation classifier. There are around 2,000 entries in the json file and not all of them have all of the fields present. A typical entry looks like this:
[
{
"host_len" : 12,
"fragment" : null,
"url_len" : 84,
"default_port" : 80,
"domain_age_days" : "5621",
"tld" : "com",
"num_domain_tokens" : 3,
"ips" : [
{
"geo" : "CN",
"ip" : "115.236.98.124",
"type" : "A"
}
],
"malicious_url" : 0,
"url" : "http://www.oppo.com/?utm_source=WeiBo&utm_medium=OPPO&utm_campaign=DailyFlow",
"alexa_rank" : "25523",
"query" : "utm_source=WeiBo&utm_medium=OPPO&utm_campaign=DailyFlow",
"file_extension" : null,
"registered_domain" : "oppo.com",
"scheme" : "http",
"path" : "/",
"path_len" : 1,
"port" : 80,
"host" : "www.oppo.com",
"domain_tokens" : [
"www",
"oppo",
"com"
],
"mxhosts" : [
{
"mxhost" : "mail1.oppo.com",
"ips" : [
{
"geo" : "CN",
"ip" : "121.12.164.123",
"type" : "A"
}
]
}
],
"path_tokens" : [
""
],
"num_path_tokens" : 1
}
]
I am trying to access the data stored in the fields "ips" and "mxhosts" to compare the "geo" location. To try and access the first "ips" field I'm using:
corpus = open(file)
urldata = json.load(corpus, encoding="latin1")
for record in urldata:
    print record["ips"][0]["geo"]
But as I mentioned not all of the json entries have all of the fields. "ips" is always present but sometimes it's "null" and the same goes for "geo". I'm trying to check for the data before accessing it using:
if(record["ips"] is not None and record["ips"][0]["geo"] is not None):
But I get this error:
if(record["ips"] is not None and record["ips"][0]["geo"] is not None):
TypeError: string indices must be integers
When I try to check it using this:
if("ips" in record):
I get this error message:
print record["ips"][0]["geo"]
TypeError: 'NoneType' object has no attribute '__getitem__'
So I'm not sure how to check if the record I'm trying to access exists before I access it, or if I'm even accessing in the most correct way. Thanks.
You can simply check that record["ips"] is not None, or more simply that it is truthy, before indexing it as a list; otherwise you would be indexing a None object, which is what raises the TypeError.
for record in urldata:
    if record["ips"]:
        print record["ips"][0]["geo"]
It ended up being a little convoluted because of the inconsistent nature of the JSON file: I first had to check that "ips" was not null and then check that "geo" was present in record["ips"][0]. This is what it looks like:
if(record["ips"] is not None and "geo" in record["ips"][0]):
    print record["ips"][0]["geo"]
Thanks for the feedback everyone!
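The checks above can be folded into one small helper. This is a Python 3 sketch, not the code from the thread: the name first_geo is hypothetical, and it assumes a missing key, a null value and an empty list should all be treated the same way.

```python
def first_geo(record):
    """Return the "geo" of the first entry in record["ips"], or None."""
    ips = record.get("ips")
    if not ips:  # covers a missing key, null, and an empty list
        return None
    return ips[0].get("geo")  # None when "geo" is absent or null


records = [
    {"ips": [{"geo": "CN", "ip": "115.236.98.124", "type": "A"}]},
    {"ips": None},
    {"ips": [{"ip": "121.12.164.123", "type": "A"}]},
]
for record in records:
    print(first_geo(record))
```

The same pattern could be reused for the "ips" lists nested under "mxhosts".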

Jsonify data not returning to ajax call

I have an app that uses Flask, Python, Ajax, JSON, JavaScript, and Leaflet. The app reads a CSV file, converts it to JSON, and returns it to an Ajax call. My issue is that the GeoJSON is not being returned; the console log shows a 5000 NetworkError. The end goal is to use the returned GeoJSON in a Leaflet map layer. If I remove the jsonify, the return works fine, but it is a string of course, and this won't work for the layer.
As you can see, I have a simple alert("success") in the Ajax success part. This is not being executed. Nor is the alert(data).
I do import jsonify in my from flask import statement.
Thank you for the help
Ajax call
$.ajax({
type : "POST",
url : '/process',
data: {
chks: chks
}
})
.success(function(data){
alert("success"); // I am doing this just to get see if I get back here. I do not
alert(data);
python/flask
@app.route('/process', methods=['POST'])
def process():
    data = request.form['chks']
    rawData = csv.reader(open('static/csvfile.csv', 'r'), dialect='excel')
    count = sum(1 for row in open('static/csvfile.csv'))
    template =\
    ''' \
    {"type" : "Feature",
    "geometry" : {
    "type" : "Point",
    "coordinates" : [%s, %s]},
    "properties" : {"name" : "%s" }
    }%s
    '''
    output = \
    ''' \
    {"type" : "Feature Collection",
    "features" : [
    '''
    iter = 0
    separator = ","
    lastrow = ""
    for row in rawData:
        iter += 1  # this is used to skip the first line of the csv file
        if iter >= 2:
            id = row[0]
            lat = row[1]
            long = row[2]
            if iter != count:
                output += template % (row[2], row[1], row[0], separator)
            else:
                output += template % (row[2], row[1], row[0], lastrow)
    output += \
    ''' \
    ]}
    '''
    return jsonify(output)
More Info - taking David Knipe's info into hand, If I remove the jsonify from my return statement, it returns what I expect, and I can output the return in an alert. It looks like this
{ "type" : "Feature Collection",
"features" : [
{"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -86.28, 32.36]},
"properties" : {"name" : "Montgomery"}
},
{ "type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -105.42, 40.30]},
"properties" : {"name" : "Boulder"}
},
]}
If I take that data and hard code it into the ajax success, then pass it to the leaflet layer code like this - it will work, and my points will be displayed on my map
...
.success(function(data){
var pointsHC= { "type" : "Feature Collection",
"features" : [
{"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -86.28, 32.36]},
"properties" : {"name" : "Montgomery"}
},
{ "type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ -105.42, 40.30]},
"properties" : {"name" : "Boulder"}
},
]};
// leaflet part
var layer = L.geoJson(pointsHC, {
pointToLayer: function(feature, latlng){
return L.circleMarker( ...
If I do not hard code it and instead pass the data via the variable, it does not work, and I get an invalid GeoJSON object error. I have tried it both with and without the final semicolon, and it fails either way.
...
.success(function(data){
// leaflet part
var layer = L.geoJson(data, {
pointToLayer: function(feature, latlng){
return L.circleMarker( ...
So it works if you don't try to parse the JSON, but if you do then it fails. Your JSON is invalid:
As loganbertram pointed out, you're missing a " on "Feature Collection".
You're missing a " on "properties".
output = template % ... should be output += template % ... - you're appending to output, not replacing it.
the features array will have a trailing comma (unless it is empty).
Although actually in your code features will always be empty anyway: you set iter = 0, never change its value, and then don't do the output = ... bit because iter < 2.
Are you sure you actually want to use jsonify? As I understand it, that turns any object into a JSON string. But output is already a JSON string - or should be, if you fix the various bugs loganbertram and I have spotted. In that case the client-side code will not fail trying to parse JSON. But if you jsonify something that's already JSON, you'll get something like this:
"{\"type\" : \"Feature\",
\"geometry\" : {
...
which the javascript will then convert back to the original JSON string, instead of a JSON object.
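The double-encoding effect can be reproduced with Python's standard json module. It is used here only as a stand-in, on the assumption that jsonify serializes a plain string in the same way.

```python
import json

already_json = '{"type": "Feature", "properties": {"name": "Boulder"}}'

# Encoding a string that is already JSON wraps it in quotes and escapes it.
double_encoded = json.dumps(already_json)

# One round of decoding only gets back to the original string.
decoded_once = json.loads(double_encoded)
print(type(decoded_once).__name__)

# A second round of decoding is needed to reach the actual object.
decoded_twice = json.loads(decoded_once)
print(decoded_twice["properties"]["name"])
```

On the client this corresponds to data arriving as a string, which would explain why the layer code reports an invalid GeoJSON object.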
Actually, it would be better to rewrite the whole thing so it constructs an object instead of a string, and then calls jsonify on that object. But I don't know enough Python to give more details easily.
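A rough sketch of that object-based rewrite is given below, under stated assumptions: the CSV has a header line followed by name, lat, long columns (the order used in the question's code), and the function name rows_to_feature_collection and the sample header names are hypothetical. It uses "FeatureCollection" without a space, which is how the GeoJSON specification spells the type. Flask is left out so the example runs on its own.

```python
import csv
import io


def rows_to_feature_collection(csv_file):
    """Build a GeoJSON-style dict from rows of name, lat, long."""
    reader = csv.reader(csv_file, dialect="excel")
    next(reader, None)  # skip the header line
    features = []
    for row in reader:
        name, lat, lon = row[0], row[1], row[2]
        features.append({
            "type": "Feature",
            # GeoJSON coordinates are ordered longitude, latitude.
            "geometry": {"type": "Point", "coordinates": [float(lon), float(lat)]},
            "properties": {"name": name},
        })
    return {"type": "FeatureCollection", "features": features}


sample = io.StringIO("name,lat,long\nMontgomery,32.36,-86.28\nBoulder,40.30,-105.42\n")
collection = rows_to_feature_collection(sample)
print(len(collection["features"]))
```

In the Flask view, the returned dictionary could be handed to jsonify directly, which avoids the separators, trailing commas and hand-written quotes altogether.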
