Load JSON data from Google GitHub repo - python

I am trying to load the following JSON file (from the Google Github repo) in Python as follows:
import json
import requests
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
data = r.text.splitlines(True)
# remove the first n lines, which are not JSON (a commented license header)
data = ''.join(data[14:])
When I use json.loads(data) I get the following error:
JSONDecodeError: Expecting ',' delimiter: line 725 column 543 (char 54975)
As this has been saved as a .json file by the GitHub repo owner (Google), I'm wondering what I'm doing wrong here.

I found that the text obtained from the call is plain text, not valid JSON (I checked at https://jsonformatter.curiousconcept.com/).
Here is the code I used to extract the valid JSON part from the response, using the re module.
import json
import requests
import re
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
text = r.text.strip()
m = re.search(r'\{(.|\s)*\}', text)  # find the JSON object in the obtained text
# Python uses False/True/None instead of JSON's false/true/null
s = m.group(0).replace('false', 'False').replace('true', 'True').replace('null', 'None')
d = eval(s)  # note: eval is unsafe on untrusted input; ast.literal_eval is a safer choice
print(d) # {...}
print(type(d)) # <class 'dict'>
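This won't help when the body itself isn't strict JSON (which appears to be the case for this particular file), but when a file's only problem is a commented header, a stdlib-only alternative to eval is to slice out the object between the first { and the last } and pass it to json.loads. A minimal sketch on a made-up inline sample:

```python
import json

# Made-up sample mimicking a file with a commented license header
raw = """// Copyright notice line 1
// Copyright notice line 2
{"name": "webapp", "questions": [{"id": 1, "text": "Q?"}]}
"""

# Slice from the first '{' to the last '}' to drop the header
start = raw.find('{')
end = raw.rfind('}')
data = json.loads(raw[start:end + 1])

print(data["name"])  # webapp
```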
References »
https://docs.python.org/3.6/library/re.html
https://jsonformatter.curiousconcept.com/


TypeError: expected str, bytes or os.PathLike object, not dict

This is my code:
from os import rename, write
import requests
import json
url = "https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D"
data = requests.get(url).json()
print(data)
outfile = open("C:/Users/vladi/Desktop/json files Vlad/file structure first attemp.json", "r")
json_object = json.load(outfile)
with open(data, 'w') as endfile:
    endfile.write(json_object)
print(endfile)
I want to call an API request: take the data from this URL, https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D, overwrite it with my own data from my file called file structure first attemp.json, and update the URL with my own data.
import requests
url = "https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D"
data = requests.get(url)
with open("file structure first attemp.json", 'w') as endfile:
    endfile.write(data.text)
json.loads() returns a Python dictionary, which cannot be written to a file directly. Simply write the string returned from the URL.
Also, response.json() is a built-in feature of requests that loads the JSON returned from the URL, so you were loading the JSON twice.
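Putting the pieces together, here is a minimal stdlib-only sketch of the intended flow: take a dict (as requests.get(url).json() would return), serialize it to a file with json.dump, and read it back. The sample dict and the temp-file path are made up for illustration:

```python
import json
import os
import tempfile

# Pretend this dict came from requests.get(url).json()
data = {"total_count": 2, "items": [{"login": "alice"}, {"login": "bob"}]}

path = os.path.join(tempfile.gettempdir(), "users.json")

# json.dump serializes the dict straight to the open file handle
with open(path, "w") as f:
    json.dump(data, f)

# Read it back to confirm the round trip
with open(path) as f:
    restored = json.load(f)

print(restored["total_count"])  # 2
```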

Loading multiple JSON files

So I am trying to load multiple JSON files with Python HTTP requests, but I can't figure out how to do it correctly.
Loading one JSON file with python is pretty simple:
response = requests.get(url)
te = response.content.decode()
da = json.loads(te[te.find("{"):te.rfind("}")+1])
But how can I load multiple JSON files?
I have a list of URLs and I tried to request every URL with a loop and then load every line of the result, but it seems this does not work.
This is the code I am using:
t = []
for url in urls:
    resp = requests.get(url)
    te = resp.content.decode()
    daten = json.loads(te[te.find("{"):te.rfind("}")+1])
    t.append(daten)
But I am getting this error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0).
I am pretty new to JSON, but I do understand that I can't read it line by line with a loop, because that breaks the JSON structure(?).
So how can I read multiple JSON files?
EDIT: Found the error.
Some links are not in correct JSON.
With the requests library, if the endpoint you are requesting returns a well-formed JSON response, all you need to do is call the .json() method on the response object:
t = []
for url in urls:
    resp = requests.get(url)
    t.append(resp.json())
Then, if you want to handle bad responses, wrap the code above in a try:...except block
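A minimal sketch of that error handling, using json.loads on inline strings in place of the network call (the sample data is made up, and the except clause would catch bad responses from requests in the same way):

```python
import json

# Stand-ins for the bodies that requests.get(url) would return;
# the second one deliberately is not valid JSON
raw_responses = [
    '{"id": 1}',
    'not json at all',
    '{"id": 3}',
]

t = []
for raw in raw_responses:
    try:
        t.append(json.loads(raw))
    except json.JSONDecodeError:
        # Skip endpoints that did not return valid JSON
        continue

print(t)  # [{'id': 1}, {'id': 3}]
```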
Assuming each site returns valid JSON, the remaining problem is that you never combined the individual responses into one result JSON.
You might write something like
t = []
for url in urls:
    t.append(requests.get(url).content.decode('utf-8'))
result = json.loads('{{"data": [{}]}}'.format(','.join(t)))

how to get an XML file from a website using python?

Using the 'bottle' library, I have to create my own API based on the website http://dblp.uni-trier.de, so I have to get the data for each author. For this reason I am using the following link format: http://dblp.uni-trier.de/pers/xx/'first letter of the last name'/'lastnamefirstname'.xml
Could you help me get the XML format so that I can parse it and extract the information I need?
Thank you.
import bottle
import requests
import re
r = requests.get("https://dblp.uni-trier.de/")
# the format of my request is
# http://localhost:8080/lastname firstname
@bottle.route('/info/<name>')
def info(name):
    first_letter = name[:1]
    # format as Lastname:Firstname
    ...
    data = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")
    return data

bottle.run(host='localhost', port=8080)
from xml.etree import ElementTree
import requests
url = 'some url'
response = requests.get(url)
xml_root = ElementTree.fromstring(response.content)
fromstring() parses an XML section from a string constant. This function can be used to embed "XML literals" in Python code. text is a string containing XML data. parser is an optional parser instance; if not given, the standard XMLParser is used. Returns an Element instance.
How to load XML from a string into an ElementTree:
from xml.etree import ElementTree
root = ElementTree.fromstring("<root><a>1</a></root>")
ElementTree.dump(root)
OUTPUT
<root><a>1</a></root>
The object returned from requests.get is not the raw data. You need to use the .text property to get the contents.
Response Content Documentation
Note that:
response.text returns content as unicode
response.content returns content as bytes
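Once the XML is parsed, extracting fields is straightforward with ElementTree. A minimal sketch on a made-up snippet that only mimics the shape of a dblp author page (the real dblp schema has many more fields):

```python
from xml.etree import ElementTree

# Made-up XML snippet; element and attribute names are illustrative only
xml_text = """<dblpperson name="Jane Doe">
  <r><article key="journals/x/Doe20"><title>Some Paper</title></article></r>
</dblpperson>"""

root = ElementTree.fromstring(xml_text)
print(root.get("name"))  # Jane Doe

# iter() walks the whole tree, so nesting depth does not matter
titles = [t.text for t in root.iter("title")]
print(titles)  # ['Some Paper']
```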

Reading a github file using python returns HTML tags

I am trying to read a text file saved in github using requests package.
Here is the python code I am using:
import requests
url = 'https://github.com/...../filename'
page = requests.get(url)
print(page.text)
Instead of getting the text, I am reading HTML tags.
How can I read the text from the file instead of HTML tags?
There are some good solutions already, but if you use requests, just follow GitHub's API.
The endpoint for all content is
GET /repos/:owner/:repo/contents/:path
But keep in mind that the default behavior of GitHub's API is to encode the content using base64.
In your case you would do the following:
#!/usr/bin/env python3
import base64
import requests
url = 'https://api.github.com/repos/{user}/{repo_name}/contents/{path_to_file}'
req = requests.get(url)
if req.status_code == requests.codes.ok:
    req = req.json()  # the response is a JSON
    # req is now a dict with keys: name, encoding, url, size ...
    # and content, but content is encoded with base64.
    content = base64.decodestring(req['content'])
else:
    print('Content was not found.')
You can access a text version by changing the beginning of your link to
https://raw.githubusercontent.com/
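That rewrite can be done mechanically: swap the host and drop the "/blob" path segment. A sketch with a hypothetical blob URL (the pattern is github.com/&lt;user&gt;/&lt;repo&gt;/blob/&lt;branch&gt;/&lt;path&gt;):

```python
# Hypothetical URL for illustration; any github.com blob URL has this shape
blob_url = "https://github.com/someuser/somerepo/blob/master/README.md"

# Rewrite to the raw host and remove the first "/blob/" segment
raw_url = blob_url.replace(
    "https://github.com/", "https://raw.githubusercontent.com/"
).replace("/blob/", "/", 1)

print(raw_url)  # https://raw.githubusercontent.com/someuser/somerepo/master/README.md
```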
Thank you @dasdachs for your answer. However, I was getting an error when executing the following line:
content = base64.decodestring(req['content'])
The error I got was:
/usr/lib/python3.6/base64.py in _input_type_check(s)
511 except TypeError as err:
512 msg = "expected bytes-like object, not %s" % s.__class__.__name__
--> 513 raise TypeError(msg) from err
514 if m.format not in ('c', 'b', 'B'):
515 msg = ("expected single byte elements, not %r from %s" %
TypeError: expected bytes-like object, not str
Hence I replaced it with the snippet below:
content = base64.b64decode(req['content'])
Sharing my working snippet below (executing in Python 3):
import requests
import base64
import json

def constructURL(user="404", repo_name="404", path_to_file="404", url="404"):
    url = url.replace("{user}", user)
    url = url.replace("{repo_name}", repo_name)
    url = url.replace("{path_to_file}", path_to_file)
    return url

user = '<provide value>'
repo_name = '<provide value>'
path_to_file = '<provide value>'
json_url = 'https://api.github.com/repos/{user}/{repo_name}/contents/{path_to_file}'
json_url = constructURL(user, repo_name, path_to_file, json_url)  # forms the correct URL

response = requests.get(json_url)  # get data from the JSON file located at the URL
if response.status_code == requests.codes.ok:
    jsonResponse = response.json()  # the response is a JSON
    # the content is encoded in base64, hence decode it
    content = base64.b64decode(jsonResponse['content'])
    # convert the byte stream to a string
    jsonString = content.decode('utf-8')
    finalJson = json.loads(jsonString)
    # iterate inside the if-branch, so finalJson is guaranteed to exist
    for key, value in finalJson.items():
        print("The key and value are ({}) = ({})".format(key, value))
else:
    print('Content was not found.')
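The base64-then-JSON decode step can be tested offline by building the payload yourself. A sketch with a made-up payload standing in for the 'content' field of a GitHub contents-API response:

```python
import base64
import json

# Made-up payload: JSON text, base64-encoded the way the contents API
# delivers file bodies under the 'content' key
original = json.dumps({"hello": "world"}).encode("utf-8")
encoded = base64.b64encode(original).decode("ascii")

# The decode chain from the snippet above: base64 -> bytes -> str -> dict
content = base64.b64decode(encoded)
data = json.loads(content.decode("utf-8"))

print(data)  # {'hello': 'world'}
```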
Expanding on @Patrick's answer, I'm going to show you my code for how to do that.
import requests
url = 'https://raw.githubusercontent.com/...'
page = requests.get(url)
print(page.text)
You could first clone the repo, either via bash, or using a python library like GitPython. Then just open and read the file locally.

Python Pandas Not Recognizing JSON File

I'm trying to load JSON data in Pandas in order to do some analysis.
Here is an example of the data I'm analyzing.
http://data.ncaa.com/jsonp/game/football/fbs/2013/08/31/wyoming-nebraska/field.json
I have tried the following:
import json
import pandas as pd
from pandas import DataFrame
json_data = pd.read_json('jsonv3.json')
and also
import json
import pandas
from pprint import pprint
json_data=open('jsonv3.json')
data = json.load(json_data)
pprint(data)
json_data.close()
The resulting errors are as follows:
1) ValueError: Expected object or value
2) ValueError: No JSON object could be decoded
I don't really know why the JSON file is not being recognized.
I've confirmed on http://jsonformatter.curiousconcept.com/ that it is valid JSON. I don't really know how to debug the issue and haven't been able to find anything. Could the error be caused by the JSON spacing format?
That's not JSON; it is JSONP. Note that the JSON "content" is wrapped in a "function call", callbackWrapper(...). From the Wikipedia article: "The response to a JSONP request is not JSON and is not parsed as JSON".
If you've saved the JSONP response in the file jsonv3.json, you could strip off the function call wrapper and process the content with something like this:
import json
with open('jsonv3.json', 'r') as f:
    response = f.read()
start = response.find('(')
end = response.rfind(')')
json_content = response[start+1:end]
data = json.loads(json_content)
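The same unwrapping can be demonstrated without a file, on an inline string. A sketch with a made-up JSONP response (field names are illustrative, not taken from the real feed):

```python
import json

# Made-up JSONP response: JSON wrapped in a callback function call
jsonp = 'callbackWrapper({"game": "wyoming-nebraska", "year": 2013});'

# Keep only what sits between the outermost parentheses
start = jsonp.find('(')
end = jsonp.rfind(')')
data = json.loads(jsonp[start + 1:end])

print(data["year"])  # 2013
```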
