Python - Key not found in JSON data, how to parse other elements?

I have data from a Facebook group feed (24,000-odd records in total). E.g.:
{
    "data": [
        {
            "message": "MoneyWise its time to vote for the 2017 winners https://www.moneywise.co.uk/home-finances-survey?",
            "updated_time": "2017-07-27T21:15:52+0000",
            "permalink_url": "https://www.facebook.com/groups/uwpartnersforum/permalink/1745120025791166/",
            "from": {
                "name": "John Oliver",
                "id": "10152744793754666"
            },
            "id": "1452979881671850_1745120025791166"
        },
        {
            "message": "We often think of communicating as figuring out a really good message and leaving it that. But the annoying fact is that unless we pay close attention to how that message is landing on the other person, not much communication will take place - Alan Alda",
            "updated_time": "2017-07-27T21:15:26+0000",
            "permalink_url": "https://www.facebook.com/groups/uwpartnersforum/permalink/1744867295816439/",
            "from": {
                "name": "Adrian Watts",
                "id": "10152461880942242"
            },
            "id": "1452979881671850_1744867295816439"
        }
    ]
}
and I am trying to extract, to the command prompt and to a file, the "message", "permalink_url", "updated_time", "name" and "id" (the one inside "from") of posts by a particular person, say "John Oliver". The following Python script works... mostly:
import json

fhand = open('try1.json')
urlData = fhand.read()
jsonData = json.loads(urlData)

fout = open('output1.txt', 'w')
for i in jsonData["data"]:
    if i["from"]["name"] == "John Oliver":
        print(i["message"], end="|")
        print(i["permalink_url"], end="|")
        print(i["updated_time"], end="|")
        print(i["from"]["name"], end="|")
        print(i["from"]["id"], end="\n")
        print()
        fout.write(str(i["message"]) + "|")
        fout.write(str(i["permalink_url"]) + "|")
        fout.write(str(i["updated_time"]) + "|")
        fout.write(str(i["from"]["name"]) + "|")
        fout.write(str(i["from"]["id"]) + "\n")
fout.close()
But I am facing two issues.
Issue 1. If there is no "message" in a record, I get a traceback:
Traceback (most recent call last):
  File "facebook_feed.py", line 36, in <module>
    main()
  File "facebook_feed.py", line 25, in main
    print (i["message"], end = "|")
KeyError: 'message'
So, I need some help going through the complete file, extracting all the other details from an object even when it has no message.
Issue 2, and this is a strange one... I have two files, "try1.json" with 500-odd records and "trial1.json" with 24,000-odd records, with exactly the same structure. When I open them in the Atom text editor, the smaller "try1.json" is colour-highlighted but the bigger "trial1.json" is not. On running the above script with try1.json, I get the KeyError for "message" (as shown above), but for "trial1.json" I get this:
Traceback (most recent call last):
  File "facebook_feed.py", line 36, in <module>
    main()
  File "facebook_feed.py", line 20, in main
    if i["from"]["name"] == "John Oliver":
KeyError: 'from'
trial1.json is a 17 MB file... is that an issue?

If you're not sure if i["message"] exists, don't just blindly access it. Either use dict.get, e.g. i.get('message', 'No message found'), or check if it's there first:
if "message" in i:
print (i["message"], end = "|")
You can do the same kind of thing with i["from"].
Atom isn't highlighting the big file because it's big. But if you can successfully json.loads something, it's valid JSON.
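Putting both together, here is a minimal sketch of the loop with the missing-key handling applied. It reuses jsonData and fout from the question's script, and the "N/A" placeholder is just an assumption; substitute whatever marker you prefer:

for i in jsonData["data"]:
    frm = i.get("from", {})  # empty dict when a record has no "from" block
    if frm.get("name") == "John Oliver":
        fields = [
            i.get("message", "N/A"),  # placeholder instead of a KeyError
            i.get("permalink_url", "N/A"),
            i.get("updated_time", "N/A"),
            frm.get("name", "N/A"),
            frm.get("id", "N/A"),
        ]
        line = "|".join(str(f) for f in fields)
        print(line)
        fout.write(line + "\n")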

Related

Retrieve specific data from JSON file in Python

Hello everyone and happy new year! I'm posting for the first time here since I can't find an answer to my problem. I've been searching for two days now, can't figure out what to do, and it's driving me crazy... Here's my problem:
I'm using the Steamspy API to download data for the 100 most played games on Steam in the last two weeks. I'm able to dump the downloaded data into a .json file (dumped_data.json in this case) and to import it back into my Python code (I can print it). But while I can print the whole thing, it's not really usable as it's 1900 lines of more or less useful data, so I want to print only specific items of the dumped data. In this project, I would like to print only the name, developer, owners and price of the first 10 games on the list. It's my first Python project, and I can't figure out how to do this...
Here's the Python code:
import steamspypi
import json

# gets data using the steamspy api
data_request = dict()
data_request["request"] = "top100in2weeks"

# dumps the data into json file
steam_data = steamspypi.download(data_request)
with open("dumped_data.json", "w") as jsonfile:
    json.dump(steam_data, jsonfile, indent=4)

# print values from dumped file
f = open("dumped_data.json")
data = json.load(f)
print(data)
And here's an example of the first item in the JSON file:
{
    "570": {
        "appid": 570,
        "name": "Dota 2",
        "developer": "Valve",
        "publisher": "Valve",
        "score_rank": "",
        "positive": 1378093,
        "negative": 267290,
        "userscore": 0,
        "owners": "100,000,000 .. 200,000,000",
        "average_forever": 35763,
        "average_2weeks": 1831,
        "median_forever": 1079,
        "median_2weeks": 1020,
        "price": "0",
        "initialprice": "0",
        "discount": "0",
        "ccu": 553462
    },
Thanks in advance to everyone that's willing to help me, it would mean a lot.
The following prints the values you desire from the first 10 games:
for game in list(data.values())[:10]:
    print(game["name"], game["developer"], game["owners"], game["price"])
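This relies on dict insertion order, which Python has guaranteed since 3.7, so the slice really is the first 10 games in the order the API returned them. If you'd rather not build the intermediate list, itertools.islice does the same thing lazily; a sketch using the same field names:

from itertools import islice

for game in islice(data.values(), 10):
    print(game["name"], game["developer"], game["owners"], game["price"])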

Add JSON to a file

I have this Python code:
import json

def write_json(new_data, filename='test.json'):
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file, indent=4)

e = {
    "1": [
        {
            "id": "df8ec0tdrhseragerse4-a3e0-8aa2da5119d3",
            "name": "Deezomiro"
        }
    ]
}

write_json(e)
This should add the JSON to the end of the current test.json file. However, when I run it I get this error:
Traceback (most recent call last):
  File "C:\Users\ArtyF\AppData\Roaming\Microsoft\Windows\Network Shortcuts\HWMonitor\copenheimer\test.py", line 16, in <module>
    write_json(e)
  File "C:\Users\ArtyF\AppData\Roaming\Microsoft\Windows\Network Shortcuts\HWMonitor\copenheimer\test.py", line 5, in write_json
    file_data.append(new_data)
AttributeError: 'dict' object has no attribute 'append'
How can I make this code add this json to the end of the file?
As the error says, dictionaries don't have append. They do have update:
file_data.update(new_data)
# TODO: dump file_data back to file
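A minimal sketch of the whole function with that change applied; the file.truncate() call is my addition, to guard against the new JSON being shorter than the old file contents:

import json

def write_json(new_data, filename='test.json'):
    with open(filename, 'r+') as file:
        file_data = json.load(file)   # existing top-level dict
        file_data.update(new_data)    # merge in the new keys
        file.seek(0)
        json.dump(file_data, file, indent=4)
        file.truncate()               # drop leftover bytes from the old contents

Note that update() overwrites the value of any key that already exists, so adding a second entry under "1" would need list handling instead.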

How to write line by line request output

I am trying to write the JSON output from my Python request line by line. I already checked a similar question on Stack Overflow (write to file line by line python), without success.
Here is the code:
myfile = open("data.txt", "a")
for item in pretty_json["geonames"]:
    print(item["geonameId"], item["name"])
    myfile.write("%s\n" % item["geonameId"] + "https://www.geonames.org/" + item["name"])
myfile.close()
Here is one item from my pretty_json["geonames"]:
{
    "adminCode1": "FR",
    "lng": "7.2612",
    "geonameId": 2661847,
    "toponymName": "Aeschlenberg",
    "countryId": "2658434",
    "fcl": "P",
    "population": 0,
    "countryCode": "CH",
    "name": "Aeschlenberg",
    "fclName": "city, village,...",
    "adminCodes1": {
        "ISO3166_2": "FR"
    },
    "countryName": "Switzerland",
    "fcodeName": "populated place",
    "adminName1": "Fribourg",
    "lat": "46.78663",
    "fcode": "PPL"
}
Then, as output saved in my data.txt, I get:
11048419
https://www.geonames.org/Aïre2661847
https://www.geonames.org/Aeschlenberg2661880
https://www.geonames.org/Aarberg6295535
The expected result should be something like:
Aïre , https://www.geonames.org/11048419
Aeschlenberg , https://www.geonames.org/2661847
Aarberg , https://www.geonames.org/2661880
Could writing the output as CSV be a solution?
Regards.
Using the csv module.
Ex:
import csv

with open("data.txt", "a") as myfile:
    writer = csv.writer(myfile)  # Create Writer Object
    for item in pretty_json["geonames"]:  # Iterate list
        writer.writerow([item["name"], "https://www.geonames.org/{}".format(item["geonameId"])])  # Write row.
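One detail from the csv module docs worth adding: on Python 3, open the file with newline="" (i.e. open("data.txt", "a", newline="")) so the writer doesn't emit spurious blank lines on Windows.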
If I understand correctly, you want the same screen output in your file. That's easy. If you are on Python 3, just add the file argument to your print call:
print(item["geonameId"], item["name"], file=myfile)
Just compose a proper printing format for the needed items:
...
for item in pretty_json["geonames"]:
    print("{}, https://www.geonames.org/{}".format(item["name"], item["geonameId"]))
Sample output:
Aeschlenberg, https://www.geonames.org/2661847

Filter items newer than 1 hour with RethinkDB and Python

I have a Python script gathering some metrics and saving them to RethinkDB. I have also written a small Flask application to display the data on a dashboard.
Now I need to run a query to find all rows in a table newer than 1 hour. This is what I got so far:
tzinfo = pytz.timezone('Europe/Oslo')
start_time = tzinfo.localize(datetime.now() - timedelta(hours=1))

r.table('metrics').filter(lambda m:
    m.during(start_time, r.now())
).run(connection)
When I try to visit the page I get this error message:
ReqlRuntimeError: Not a TIME pseudotype: `{
    "listeners": "6469",
    "time": {
        "$reql_type$": "TIME",
        "epoch_time": 1447581600,
        "timezone": "+01:00"
    }
}` in:
r.table('metrics').filter(lambda var_1:
    var_1.during(r.iso8601('2015-11-18T12:06:20.252415+01:00'), r.now()))
I googled a bit and found this issue, which seems to describe a similar problem: https://github.com/rethinkdb/rethinkdb/issues/4827. So I also revisited how I add new rows to the database, to see if that was the issue:
def _fix_tz(timestamp):
    tzinfo = pytz.timezone('Europe/Oslo')
    dt = datetime.strptime(timestamp[:-10], '%Y-%m-%dT%H:%M:%S')
    return tzinfo.localize(dt)

...
for row in res:
    ... remove some data, manipulate some other data ...
    r.db('metrics',
         {'time': _fix_tz(row['_time']),
          ...
    ).run(connection)
The '_time' value retrieved by my data collection script contains some garbage, which I remove before creating a datetime object. As far as I can understand from the RethinkDB documentation, I should be able to insert these directly, and if I use the "data explorer" in RethinkDB's admin panel my rows look like this:
{
    ...
    "time": Sun Oct 25 2015 00:00:00 GMT+02:00
}
Update:
I did another test and created a small script to insert data and then retrieve it:
import rethinkdb as r

conn = r.connect(host='localhost', port=28015, db='test')

r.table('timetests').insert({
    'time': r.now(),
    'message': 'foo!'
}).run(conn)

r.table('timetests').insert({
    'time': r.now(),
    'message': 'bar!'
}).run(conn)

cursor = r.table('timetests').filter(
    lambda t: t.during(r.now() - 3600, r.now())
).run(conn)
I still get the same error message:
$ python timestamps.py
Traceback (most recent call last):
  File "timestamps.py", line 21, in <module>
    ).run(conn)
  File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/ast.py", line 118, in run
    return c._start(self, **global_optargs)
  File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/net.py", line 595, in _start
    return self._instance.run_query(q, global_optargs.get('noreply', False))
  File "/Users/tsg/.virtualenv/p4-datacollector/lib/python2.7/site-packages/rethinkdb/net.py", line 457, in run_query
    raise res.make_error(query)
rethinkdb.errors.ReqlQueryLogicError: Not a TIME pseudotype: `{
    "id": "5440a912-c80a-42dd-9d27-7ecd6f7187ad",
    "message": "bar!",
    "time": {
        "$reql_type$": "TIME",
        "epoch_time": 1447929586.899,
        "timezone": "+00:00"
    }
}` in:
r.table('timetests').filter(lambda var_1: var_1.during((r.now() - r.expr(3600)), r.now()))
I finally figured it out. The error is in the lambda expression: you need to call .during() on a specific field. If you don't, the query tries to wrestle the whole row/document into a timestamp.
This code works:
cursor = r.table('timetests').filter(
    lambda t: t['time'].during(r.now() - 3600, r.now())
).run(conn)

py2neo.error.InvalidSemanticsException

I am stuck on an InvalidSemanticsException error; the following is my code:
import json
from py2neo import neo4j, Node, Relationship, Graph

graph = Graph()
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")

with open("example.json") as f:
    for line in f:
        while True:
            try:
                file = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)
        # Now creating the node and relationships
        news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]), "title": unicode(file["title"])})
        authors = graph.merge_one("Authors", {"auth_name": unicode(file["auth_name"]), "auth_url": unicode(file["auth_url"]), "auth_eml": unicode(file["auth_eml"])})
        graph.create_unique(Relationship(news, "hasAuthor", authors))
I am trying to connect the news node with the authors node. My JSON file looks like this:
{
    "_id": {
        "$oid": "54933912bf4620870115a2e3"
    },
    "auth_eml": "",
    "auth_url": "",
    "cat": [],
    "auth_name": "Max Bond",
    "out_link": [],
    "entry_url": [
        "http://www.usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/?navtype=SU&navid=AGRICULTURE"
    ],
    "out_link_norm": [],
    "title": "United States Department of Agriculture - Agriculture",
    "entry_url_norm": [
        "usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/"
    ],
    "ts": 1290945374000,
    "source_url": "",
    "content": "\n<a\nhref=\"/wps/portal/usda/!ut/p/c4/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_2CbEdFAEUOjoE!/?navid=AVIAN_INFLUENZA\">\n<b>Avian Influenza, Bird Flu</b></a> <br />\nThe official U.S. government web site for information on pandemic flu and avian influenza\n\n<strong>Pest Management</strong> <br />\nPest management policy, pesticide screening tool, evaluate pesticide risk, conservation\nbuffers, training modules.\n\n<strong>Weather and Climate</strong> <br />\nU.S. agricultural weather highlights, weekly weather and crop bulletin, major world crop areas\nand climatic profiles.\n"
}
The full exception looks like this:
File "/home/mohan/workspace/test.py", line 20, in <module>
news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]),"title":unicode(file["title"])})
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 958, in merge_one
for node in self.merge(label, property_key, property_value, limit=1):
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 946, in merge
response = self.cypher.post(statement, parameters)
File "/usr/local/lib/python2.7/dist-packages/py2neo/cypher/core.py", line 86, in post
return self.resource.post(payload)
File "/usr/local/lib/python2.7/dist-packages/py2neo/core.py", line 331, in post
raise_from(self.error_class(message, **content), error)
File "/usr/local/lib/python2.7/dist-packages/py2neo/util.py", line 235, in raise_from
raise exception
py2neo.error.InvalidSemanticsException: Cannot merge node using null property value for {'title': u'United States Department of Agriculture - Agriculture', 'id': u'54933912bf4620870115a2e3', 'entry_url': u"[u'http://www.usda.gov/wps/portal/usda/!ut/p/c5/04_SB8K8xLLM9MSSzPy8xBz9CP0os_gAC9-wMJ8QY0MDpxBDA09nXw9DFxcXQ-cAA_1wkA5kFaGuQBXeASbmnu4uBgbe5hB5AxzA0UDfzyM_N1W_IDs7zdFRUREAZXAypA!!/dl3/d3/L2dJQSEvUUt3QS9ZQnZ3LzZfUDhNVlZMVDMxMEJUMTBJQ01IMURERDFDUDA!/?navtype=SU&navid=AGRICULTURE']"}
Any suggestions to fix this?
Yeah, I see what's going on here. If you look at the py2neo API and look for the merge_one function, it's defined this way:
merge_one(label, property_key=None, property_value=None)
    Match or create a node by label and optional property and
    return a single matching node. This method is intended to be
    used with a unique constraint and does not fail if more than
    one matching node is found.
The way that you're calling it is with a string first (label) and then a dictionary:
news = graph.merge_one("Mainstream_News", {"id": unicode(file["_id"]["$oid"]), "entry_url": unicode(file["entry_url"]),"title":unicode(file["title"])})
Your error message says that py2neo is treating the entire dictionary like a property name, and you haven't provided a property value.
So you're calling this function incorrectly. What you should probably be doing is merge_one only on the basis of the id property, then later adding the extra properties you need to the node that comes back.
You need to convert those merge_one calls into something like this:
news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]["$oid"]))
Note this doesn't give you the extra properties; those you'd add later.
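A sketch of that two-step approach, assuming a py2neo 2.x setup where nodes expose a properties dict and a push() method:

news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]["$oid"]))
# add the remaining properties locally, then write them to the server
news.properties["entry_url"] = unicode(file["entry_url"])
news.properties["title"] = unicode(file["title"])
news.push()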
