Parsing an API with Python - how to handle JSON with a BOM

I'm using Python 2.7.11 on Windows to get JSON data from an API (data on trees in Warsaw, Poland, but never mind that). I want to generate an output CSV file with all the data provided by the API, for further analysis. I started with a script I used for another project (also discussed here on Stack Overflow and corrected for me by @Martin Taylor). That script didn't work, so I tried to modify it using my very basic understanding, googling around and applying the pdb debugger. At the moment, the result looks like this:
import pdb
import json
import urllib2
import csv

pdb.set_trace()
url = "https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba"
myfile = 'C:/dane/drzewa.csv'
csv_myfile = csv.writer(open(myfile, 'wb'))
cols = ['numer_adres', 'stan_zdrowia', 'y_wgs84', 'dzielnica', 'adres', 'lokalizacja', 'wiek_w_dni', 'srednica_k', 'pnie_obwod', 'miasto', 'jednostka', 'x_pl2000', 'wysokosc', 'y_pl2000', 'numer_inw', 'x_wgs84', '_id', 'gatunek_1', 'gatunek', 'data_wyk_pom']
csv_myfile.writerow(cols)

def api_iterate(myfile):
    while True:
        global url
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()
        for data_object in data['result']['records']:
            csv_myfile.writerow([data_object[col] for col in cols])
        try:
            url = data['_links']['next']
        except KeyError as e:
            break

with open(myfile, 'wb'):
    api_iterate(myfile)
I'm a very fresh Python user, so I get confused all the time. Now I've got to the point where, while reading the objects in the JSON dictionary, I get a KeyError associated with the 'x_wgs84' element. I suppose it has something to do with the fact that in the source URL this element is preceded by a U+FEFF Unicode character. I tried to get around this but got stuck and would appreciate assistance.
I suspect the code may be broken in several other ways as well - as I mentioned, I'm a very unskilled programmer (yet).

You need to look the key up including the Unicode character:
One easy way to see how to do this is to print the keys:
>>> import requests
>>> res = requests.get('https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba')
>>> data = res.json()
>>> records = data['result']['records']
>>> records[0]
{u'numer_adres': u'', u'stan_zdrowia': u'dobry', u'y_wgs84': u'52.21865', u'y_pl2000': u'5787241.04475524', u'adres': u'ul. ALPEJSKA', u'x_pl2000': u'7511793.96937063', u'lokalizacja': u'Ulica ALPEJSKA', u'wiek_w_dni': u'60', u'miasto': u'Warszawa', u'jednostka': u'Dzielnica Wawer', u'pnie_obwod': u'73', u'wysokosc': u'14', u'data_wyk_pom': u'20130709', u'dzielnica': u'Wawer', u'\ufeffx_wgs84': u'21.172584', u'numer_inw': u'D386200', u'_id': 125435, u'gatunek_1': u'Quercus robur', u'gatunek': u'd\u0105b szypu\u0142kowy', u'srednica_k': u'7'}
>>> records[0].keys()
[u'numer_adres', u'stan_zdrowia', u'y_wgs84', u'y_pl2000', u'adres', u'x_pl2000', u'lokalizacja', u'wiek_w_dni', u'miasto', u'jednostka', u'pnie_obwod', u'wysokosc', u'data_wyk_pom', u'dzielnica', u'\ufeffx_wgs84', u'numer_inw', u'_id', u'gatunek_1', u'gatunek', u'srednica_k']
>>> records[0][u'\ufeffx_wgs84']
u'21.172584'
As you can see, to get your value you need to write the key as '\ufeffx_wgs84', including the Unicode character that is causing trouble.
Note: I don't know whether you are using Python 2 or 3, but in Python 2 you may need to put a u before your string literal to declare it as a Unicode string.
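Alternatively (a sketch of my own, not part of the original answer), you could strip the BOM from the keys as you read each record, so your original cols list keeps working:
# Strip a leading BOM (U+FEFF) from every key so that lookups like
# record['x_wgs84'] work regardless of how the API spells the key.
record = {u'\ufeffx_wgs84': u'21.172584', u'y_wgs84': u'52.21865'}
clean = dict((key.lstrip(u'\ufeff'), value) for key, value in record.items())
print clean['x_wgs84']  # prints 21.172584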

Related

Python writing to file and json returns None/null instead of value

I'm trying to write data to a file with the following code:
#!/usr/bin/python37all
print('Content-type: text/html\n\n')
import cgi
from Alarm import *
import json

htmldata = cgi.FieldStorage()
alarm_time = htmldata.getvalue('alarm_time')
alarm_date = htmldata.getvalue('alarm_date')
print(alarm_time, alarm_date)
data = {'time': alarm_time, 'date': alarm_date}
# print(data['time'], data['date'])
with open('alarm_data.txt', 'w') as f:
    json.dump(data, f)
...
but when opening the file, I get the following output:
{'time':null,'date':null}
The print statement returns what I expect it to: 14:26 2020-12-12.
I've tried the same method with f.write(), but it returns both values as None. This is being run on a Raspberry Pi. Why aren't the correct values being written?
--EDIT--
The JSON string I expect to see is the following: {'time':'14:26','date':'2020-12-12'}
Perhaps you meant:
data = {'time':str(alarm_time), 'date':str(alarm_date)}
I would expect to see your file contents like this:
{"time":"14:26","date":"2020-12-12"}
Note the double quotes: ". JSON is very strict about these things, so don't fool yourself by putting single quotes ' in a file and expecting JSON to parse it.
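For what it's worth, the null values most likely mean getvalue() returned None because the form fields were missing; json serializes Python's None as null. A small sketch of the symptom and one possible guard (the empty-string default is my assumption, not from the question):
import json

# json.dumps writes Python None as JSON null, matching the file contents:
print(json.dumps({'time': None, 'date': None}))
# -> {"time": null, "date": null}

# Guarding with a default avoids writing null when a field is missing:
alarm_time = None  # stands in for htmldata.getvalue('alarm_time')
data = {'time': alarm_time or '', 'date': ''}
print(json.dumps(data))
# -> {"time": "", "date": ""}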

Trying to implement jsonschema with just filepaths

I've uploaded a JSON file from the user and now I'm trying to compare it to a schema using the jsonschema validator. I'm getting an error:
ValidationError: is not of type u'object'
Failed validating u'type' in schema
This is my code so far:
from __future__ import unicode_literals
from django.shortcuts import render, redirect
import jsonschema
import json
import os
from django.conf import settings

# File to store all the parsers
def jsonVsSchemaParser(project, file):
    baseProjectURL = 'src\media\json\schema'
    projectSchema = project.lower() + '.schema'
    projectPath = os.path.join(baseProjectURL, projectSchema)
    filePath = os.path.join(settings.BASE_DIR, 'src\media\json', file)
    actProjectPath = os.path.join(settings.BASE_DIR, projectPath)
    print filePath, actProjectPath
    schemaResponse = open(actProjectPath)
    schema = json.load(schemaResponse)
    response = open(filePath)
    jsonFile = json.load(response)
    jsonschema.validate(jsonFile, schema)
I'm trying to do something similar to this question, except instead of using a URL I'm using my file path.
Also, I'm using Python 2.7 and Django 1.11, if that helps at all.
Also, I'm pretty sure my file paths aren't the problem, because I printed them and they output what I expected. I also know that my schema and JSON can be read by jsonschema, since I've used it on the command line as well.
EDIT: That validation error seems to have been a fluke. The actual validation error I'm consistently getting is "-1 is not of type u'string'". The annoying thing is that's supposed to happen: it is wrong that sessionid isn't a string, and I want jsonschema to catch that, but I don't want my validation errors presented in that raw format. What I want to do is collect all validation errors in an array and then show them to the user on the next page.
I just ended up putting a try/except around my validate call. Here's what it looks like:
validationErrors = []
try:
    jsonschema.validate(jsonFile, schema)
except jsonschema.exceptions.ValidationError as error:
    validationErrors.append(error)
EDIT: This solution only works if you have one error, because validate() raises on the first failure and stops there. In order to collect every error, you need to use lazy validation. This is how it looks in my code, if you need another example:
v = jsonschema.Draft4Validator(schema)
for error in v.iter_errors(jsonFile):
    validationErrors.append(error)
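A small follow-up sketch (my addition, not part of the original answer): once the errors are collected, you can pull out just the human-readable strings to pass to a template:
# Each jsonschema ValidationError carries a .message string; collecting
# those is convenient for rendering on the next page.
errorMessages = [error.message for error in validationErrors]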
✓ The try-except-else-finally statement is a great way to catch and handle exceptions (runtime errors) in Python.
✓ So if you want to catch and store exceptions in an array, a good solution is a try-except statement. That way you can catch each exception, store it in any data structure you like (a list, etc.), and your program will continue executing instead of terminating.
✓ Below is modified code where I have used a for loop that catches the error 5 times and stores it in a list.
validationErrors = []
for i in range(5):
    try:
        jsonschema.validate(jsonFile, schema)
    except jsonschema.exceptions.ValidationError as error:
        validationErrors.append(error)
✓ Finally, you can have a look at the code sample below, where I store a ZeroDivisionError and its related string message in 2 different lists by iterating over a for loop 5 times.
You can pass the 2nd list, ZeroDivisionErrorMessagesList, to a template if you want to print the messages on a web page. You can use the 1st one as well.
ZeroDivisionErrorsList = []
ZeroDivisionErrorMessagesList = list()  # list() is the same as []
for i in range(5):
    try:
        a = 10 / 0  # raises ZeroDivisionError
        print(a)    # never executes
    except ZeroDivisionError as error:
        ZeroDivisionErrorsList.append(error)
        ZeroDivisionErrorMessagesList.append(str(error))
print(ZeroDivisionErrorsList)
print()  # new line
print(ZeroDivisionErrorMessagesList)
» Output:
[ZeroDivisionError('division by zero',),
ZeroDivisionError('division by zero',),
ZeroDivisionError('division by zero',),
ZeroDivisionError('division by zero',),
ZeroDivisionError('division by zero',)]
['division by zero', 'division by zero', 'division by zero', 'division by zero', 'division by zero']

How do I avoid getting a sporadic KeyError: 'data' when using the Reddit API in python?

I have the following Python code that works OK to hit Reddit's API and look up the front pages of different subreddits and their rising submissions.
from pprint import pprint
import requests
import json
import datetime
import csv
import time

subredditsToScan = ["Arts", "AskReddit", "askscience", "aww", "books", "creepy", "dataisbeautiful", "DIY", "Documentaries", "EarthPorn", "explainlikeimfive", "food", "funny", "gaming", "gifs", "history", "jokes", "LifeProTips", "movies", "music", "pics", "science", "ShowerThoughts", "space", "sports", "tifu", "todayilearned", "videos", "worldnews"]
ofilePosts = open('posts.csv', 'wb')
writerPosts = csv.writer(ofilePosts, delimiter=',')
ofileUrls = open('urls.csv', 'wb')
writerUrls = csv.writer(ofileUrls, delimiter=',')
for subreddit in subredditsToScan:
    front = requests.get(r'http://www.reddit.com/r/' + subreddit + '/.json')
    rising = requests.get(r'http://www.reddit.com/r/' + subreddit + '/rising/.json')
    front.text
    rising.text
    risingData = rising.json()
    frontData = front.json()
    print(len(risingData['data']['children']))
    print(len(frontData['data']['children']))
    for i in range(0, len(risingData['data']['children'])):
        author = risingData['data']['children'][i]['data']['author']
        score = risingData['data']['children'][i]['data']['score']
        subreddit = risingData['data']['children'][i]['data']['subreddit']
        gilded = risingData['data']['children'][i]['data']['gilded']
        numOfComments = risingData['data']['children'][i]['data']['num_comments']
        linkUrl = risingData['data']['children'][i]['data']['permalink']
        timeCreated = risingData['data']['children'][i]['data']['created_utc']
        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])
    for j in range(0, len(frontData['data']['children'])):
        author = frontData['data']['children'][j]['data']['author'].encode('utf-8').strip()
        score = frontData['data']['children'][j]['data']['score']
        subreddit = frontData['data']['children'][j]['data']['subreddit'].encode('utf-8').strip()
        gilded = frontData['data']['children'][j]['data']['gilded']
        numOfComments = frontData['data']['children'][j]['data']['num_comments']
        linkUrl = frontData['data']['children'][j]['data']['permalink'].encode('utf-8').strip()
        timeCreated = frontData['data']['children'][j]['data']['created_utc']
        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])
It works well and scrapes the data accurately, but it constantly gets interrupted, seemingly at random, with a runtime crash saying:
Traceback (most recent call last):
File "dataGather1.py", line 27, in <module>
for i in range(0, len(risingData['data']['children'])):
KeyError: 'data'
I have no idea why this error occurs on and off rather than consistently. I thought maybe I was calling the API too much, so that it stopped me from accessing it, so I threw a sleep into my code, but that did not help. Any ideas?
When there is no data in the response from the API, there is no 'data' key in the dictionary, so you get a KeyError on some subreddits. You need to use a try/except.
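A minimal sketch of that guard (the stand-in response dict is mine, just to make it runnable):
# When Reddit rejects a request it returns something like {"error": 429}
# instead of a listing, so the 'data' key is absent. Falling back to an
# empty list avoids the KeyError.
risingData = {'error': 429}  # stands in for rising.json() on a bad response
children = risingData.get('data', {}).get('children', [])
for child in children:  # simply skipped when the response had no data
    print(child['data']['author'])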
The JSON you are parsing doesn't contain the 'data' element, thus you get an error. I think your hunch is correct, though: it is probably rate limiting, or you're asking for hidden/deleted entries.
Reddit is very strict about accessing its API without playing nice. Meaning you should register your app, use a meaningful user-agent in your requests, and probably use the Python library for this kind of thing: https://praw.readthedocs.io/en/latest/
Without registering, in my experience the direct REST Reddit API is even stricter than the 1 request per 2 seconds rule they have (had?).
Python raises a KeyError whenever a dict object is indexed (using the form a = adict[key]) and the key is not in the dictionary.
It seems that when you get this error, your data value is empty.
You might just get the length of the dictionary before you execute the for loop. If it's empty, the loop will simply not run. Some error checking here might help.
size = len(risingData)
if size:
    for i in range(0, size):
        …

I keep getting [u'error']

Why am I getting this error when trying to print the keys from the dictionary?
import urllib
import urllib2
import json
ret = urllib2.urlopen(urllib2.Request('https://poloniex.com/public?command=returnTicker'))
a = json.loads(ret.read())
print a.keys()
ret = urllib2.urlopen(urllib2.Request('https://poloniex.com/public?command=return24Volume'))
b = json.loads(ret.read())
print b.keys()
The error is produced by the website - your code as such is OK; it produces a JSON object which apparently has the structure '{ "error" : "" }'. Try printing it out and figuring out what's wrong; you probably need some authentication tokens or similar.
There seems to be an API wrapper available; you should consider using it, or at least understanding it: http://pastebin.com/8fBVpjaj
It's featured directly on the Poloniex website, and it clearly shows the need for an API secret and key.
The error is originating from the site. The dict is loaded, and expresses the error as its only key.
Try this:
ret = urllib2.urlopen(urllib2.Request('https://poloniex.com/public?command=return24hVolume'))
b = json.loads(ret.read())
print b.keys()
Notice the 'h' in return24hVolume.
The second URL returns:
{"error":"Invalid command."}
So as Reut Sharabani points out, you have to use 'h' in the URL:
https://poloniex.com/public?command=return24hVolume
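A small guard along those lines (my sketch; checking for the 'error' key is an assumption based on the response shown above):
import json
import urllib2

ret = urllib2.urlopen(urllib2.Request('https://poloniex.com/public?command=return24hVolume'))
b = json.loads(ret.read())
# Poloniex reports failures as {"error": "..."} in the body rather than
# an HTTP error, so check for that key before treating it as real data.
if 'error' in b:
    raise ValueError(b['error'])  # e.g. "Invalid command."
print b.keys()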

expected string or buffer when processing text file with python

I'm processing a large (120 MB) text file from my Thunderbird IMAP directory and attempting to extract to/from info from the headers using mbox and regex. The process runs for a while until I eventually get an exception: "TypeError: expected string or buffer".
The exception references the fifth line of this code:
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
    from_address = PAT_EMAIL.findall(email["from"])
    to_address = PAT_EMAIL.findall(email["to"])
    for item in from_address:
        temp_list.append(item)  # items are added to a temporary list where they are sorted, then written to file
I've run the code on other (smaller) files, so I'm guessing the issue is my file. The file appears to be just a bunch of text. Can someone point me in the right direction for debugging this?
There can only be one from address (I think!). In the following:
from_address = PAT_EMAIL.findall(email["from"])
I have a feeling you're trying to duplicate the work of email.message_from_file and email.utils.parseaddr:
>>> s = "Jon Clements <jon@example.com>"
>>> from email.utils import parseaddr
>>> parseaddr(s)
('Jon Clements', 'jon@example.com')
So you can use parseaddr(email['from'])[1] to get the email address and use that.
Similarly, you may wish to look at email.utils.getaddresses to handle to and cc addresses...
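For reference, a quick sketch of getaddresses (my example values, not from the answer); it takes a list of address-field strings and returns (realname, address) pairs:
>>> from email.utils import getaddresses
>>> tos = ['Alice <alice@example.com>, Bob <bob@example.com>']
>>> getaddresses(tos)
[('Alice', 'alice@example.com'), ('Bob', 'bob@example.com')]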
Well, I didn't solve the issue, but I've worked around it for my own purposes. I inserted a try statement so that the iteration just continues past any TypeError. For every thousand email addresses I'm getting about 8 failures, which will suffice. Thanks for your input!
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
    try:
        from_address = PAT_EMAIL.findall(email["from"])
    except TypeError:
        print "TypeError!"
    try:
        to_address = PAT_EMAIL.findall(email["to"])
    except TypeError:
        print "TypeError!"
    for item in from_address:
        temp_list.append(item)  # items are added to a temporary list where they are sorted, then written to file
