Convert html source code to json object - python

I am fetching the HTML source of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful.
My code:
url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)
ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'
The above code gives me a unicode type; however, when I put the output into JSONLint it reports invalid JSON. Can somebody help me understand how I can convert the complete HTML into a JSON object?

jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation.
jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
Try to use json.dumps to generate your final JSON instead of building the JSON by hand:
ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
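To see why the hand-built string fails validation while json.dumps does not, here is a minimal sketch. The HTML snippet, URL, uid and date are made-up stand-ins for the fetched page and the question's variables, not real data.

```python
import json

# Stand-in for htmlContent.text: real pages contain quotes and newlines.
html = '<div class="main">\n  <p>Hello "world"</p>\n</div>'

# Hand-built JSON: the quotes and newlines inside the HTML are not escaped.
hand_built = '{ "url" : "http://example.com", "page_content" : "' + html + '" }'
try:
    json.loads(hand_built)
    hand_built_is_valid = True
except ValueError:
    hand_built_is_valid = False

# json.dumps escapes everything that needs escaping.
document = json.dumps({
    'url': 'http://example.com',   # placeholder values
    'uid': '42',
    'page_content': html,
    'date': '2017-01-01',
})
round_tripped = json.loads(document)

print(hand_built_is_valid)                    # False
print(round_tripped['page_content'] == html)  # True
```

The unescaped double quotes in the HTML end the JSON string early, which is most likely what JSONLint is complaining about.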

The correct way to convert HTML source code to a JSON file on the local system is as follows:
import json
import codecs
# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
    json_file = json.loads(jsonf.read())
# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
    html_file = htmlf.read()
# Choose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file
# Dump the dictionary to a JSON string and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json_object)
Next, open the JSON file in your editor.
Press CTRL + H, and replace \n or \t characters by '' (nothing!).
Now you can parse your HTML file with codecs.open() function and do the operations.
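A side note on the \n and \t sequences seen in the saved file: they are JSON escape sequences, so they decode back to the original whitespace when the file is loaded. The sketch below illustrates this with a made-up HTML string and a temporary file (Python 3, using the built-in open() instead of codecs.open(), which is an assumption about your environment). If the original whitespace matters to you, the find-and-replace step can probably be skipped.

```python
import json
import os
import tempfile

html = "<html>\n\t<body>Hi</body>\n</html>"  # made-up sample page
record = {"Key1": {"Key2": html}}

# Write the dictionary to a temporary JSON file.
path = os.path.join(tempfile.mkdtemp(), "final_json_file.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, indent=4)

# Read the raw text and parse it back.
with open(path, "r", encoding="utf-8") as f:
    raw_text = f.read()
restored = json.loads(raw_text)["Key1"]["Key2"]

print("\\n" in raw_text)  # True: the newline is stored as a two-character escape
print(restored == html)   # True: loading restores the original whitespace
```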

Related

TypeError: expected str, bytes or os.PathLike object, not dict

This is my code:
from os import rename, write
import requests
import json
url = "https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D"
data = requests.get(url).json()
print(data)
outfile = open("C:/Users/vladi/Desktop/json files Vlad/file structure first attemp.json", "r")
json_object = json.load(outfile)
with open(data,'w') as endfile:
endfile.write(json_object)
print(endfile)
I want to make an API request.
I want to take the data from this URL: https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D,
replace it with my own data, which is in my file called file structure first attemp.json,
and update this URL with my own data.
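import requests
url = "https://api.github.com/search/users?q=%7Bquery%7D%7B&page,per_page,sort,order%7D"
response = requests.get(url)
# open() needs a file path, not the response; "output.json" is a placeholder name
with open("output.json", "w") as endfile:
    endfile.write(response.text)
json.load() returns a Python dictionary, which cannot be passed to write() directly, and open() expects a file path rather than a dictionary, which is what raises the TypeError. Simply write the string returned from the URL.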
response.json() is a built in feature that requests uses to load the JSON returned from the URL. So you are loading the JSON twice.
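Here is a small sketch that reproduces the error without touching the network and then shows one way around it. The dictionary stands in for what requests.get(url).json() would return, and the file name result.json is a hypothetical placeholder.

```python
import json
import os
import tempfile

# Stand-in for requests.get(url).json(): already a Python dict.
data = {"total_count": 0, "items": []}

# Reproduce the error: open() wants a path, not the dict.
try:
    open(data, "w")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# One possible fix: open a real path and let json.dump serialize the dict.
path = os.path.join(tempfile.mkdtemp(), "result.json")  # hypothetical file name
with open(path, "w") as endfile:
    json.dump(data, endfile)

# Read the file back to check what was written.
with open(path) as f:
    saved = json.load(f)

print(error_message)
print(saved == data)  # True
```

If you already have the parsed dictionary, json.dump(data, file) avoids the extra step of turning it into a string yourself.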

Opening a json file from computer as a dictionary

I wrote the following function that I want to apply to a json file:
import json
def myfunction(dictionary):
    #doing things
    return new_dictionary
data = """{
#a json file as a dictionary
}"""
info = json.loads(data)
refined = myfunction(info)
new_data = json.dumps(refined)
print(new_data)
It works fine, but how do I do it when I want to import a file from my computer? json.loads takes a string as input and returns a dictionary as output, and json.dumps takes a dictionary as input and returns a string as output. I tried:
with open('C:\\Users\\SONY\\Desktop\\test.json', 'r', encoding="utf8") as data:
    info = json.loads(data)
But I get TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper.
You are passing a file object instead of a string. To fix that, read the file first: json.loads(data.read())
However, you can load JSON directly from a file using json.load(open('myFile','r')) or, in your case, json.load(data)
loads and dumps work on strings. If you want to work on files, you should use load and dump instead.
Here is an example:
from json import dump, load
with open('myfile.json','r') as my_file:
    content = load(my_file)
#do stuff on content
with open('myooutput.json','w') as my_output:
    dump(content, my_output)
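To make the difference concrete, here is a self-contained sketch that reproduces the TypeError and shows both working variants. It writes a tiny made-up JSON file to a temporary folder, so the path and contents are assumptions for illustration only.

```python
import json
import os
import tempfile

# Create a small sample file (made-up contents).
path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w", encoding="utf8") as f:
    f.write('{"name": "test"}')

# Passing the file object to loads() raises TypeError.
with open(path, "r", encoding="utf8") as f:
    try:
        json.loads(f)
        raised = False
    except TypeError:
        raised = True

# Variant 1: load() accepts the file object directly.
with open(path, "r", encoding="utf8") as f:
    from_load = json.load(f)

# Variant 2: read the text first, then parse it with loads().
with open(path, "r", encoding="utf8") as f:
    from_loads = json.loads(f.read())

print(raised)                   # True
print(from_load == from_loads)  # True
```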

How to return data pulled from python requests call into json

I am trying to make a GHE API call and convert the returned data into JSON. I am sure this is fairly simple (my current code writes the data into a .txt file) but I am incredibly new to python.
I am having a hard time understanding how to use json.dumps.
import requests
import json
GITHUB_ENTERPRISE_TOKEN = 'token xxx'
SEARCH_QUERY = "Evidence+locker+Seed+in:readme"
headers = {
    'Authorization': GITHUB_ENTERPRISE_TOKEN,
}
url = "https://github.ibm.com/api/v3/search/repositories?q=" + SEARCH_QUERY
#Setup url to include GHE api endpoint and the search query
response = requests.get(url, headers=headers)
with open('./evidencelockerevidence.txt', 'w') as file:
    file.write(response.text)
#writes to a .txt file the evidence fetched from GHE
Rather than the last two lines of functional code writing the data into a .txt file I would like to return it as JSON object in the same directory.
json.dumps simply stringifies (serializes) your JSON object so that you can store it as plain text. Its counterpart is json.loads.
with open('a.jsonl', 'wt') as f:
    f.write(json.dumps(jobj) + '\n')
People usually write one JSON object per line this way, a.k.a. the JSONL format.
json.dump stores your JSON object directly in a file. Its counterpart is json.load.
with open('a.json', 'wt') as f:
    json.dump(jobj, f)
A JSON file contains only one JSON object, on a single line or spread over multiple lines.
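A rough sketch of the two file layouts side by side, using a made-up list of records in place of the API response (in the question, response.json() would supply the real data). The file names and the temporary folder are placeholders.

```python
import json
import os
import tempfile

# Made-up records standing in for data returned by the API.
records = [{"id": 1, "name": "repo-a"}, {"id": 2, "name": "repo-b"}]
folder = tempfile.mkdtemp()

# .json: the whole structure is one JSON value in the file.
json_path = os.path.join(folder, "a.json")
with open(json_path, "wt") as f:
    json.dump(records, f)

# .jsonl: one JSON object per line.
jsonl_path = os.path.join(folder, "a.jsonl")
with open(jsonl_path, "wt") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading back: json.load for the .json file, json.loads per line for .jsonl.
with open(json_path) as f:
    from_json = json.load(f)
with open(jsonl_path) as f:
    from_jsonl = [json.loads(line) for line in f if line.strip()]

print(from_json == records)   # True
print(from_jsonl == records)  # True
```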

Valid JSON in text file but python json.loads gives "No JSON object could be decoded"

I have a valid JSON (checked using Lint) in a text file.
I am loading the json as follows
test_data_file = open('jsonfile', "r")
json_str = test_data_file.read()
the_json = json.loads(json_str)
I have verified the json data in file on Lint and it shows it as valid. However the json.loads throws
ValueError: No JSON object could be decoded
I am a newbie to Python, so I am not sure how to do it the right way. Please help.
(I assume it has something to do with encoding the string from UTF-8 to unicode, as the data in the file is retrieved as a string.)
I tried with open('jsonfile', 'r') and it works now.
Also I did the following on the file
json_newfile = open('json_newfile', 'w')
json_oldfile = open('json_oldfile', 'r')
old_data = json_oldfile.read()
json.dump(old_data, json_newfile)
and now I am reading the new file.
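One thing to be aware of with json.dump on text that was read from a file: the text is a string, so it is written out as a JSON string, and loading the new file gives back a string rather than a dictionary. A minimal sketch, using a made-up file content and an in-memory buffer in place of the real files:

```python
import io
import json

old_data = '{"key": "value"}'  # made-up text, as returned by json_oldfile.read()

# json.dump on a str writes a quoted, escaped JSON string.
buffer = io.StringIO()
json.dump(old_data, buffer)

first = json.loads(buffer.getvalue())  # still a str
second = json.loads(first)             # now a dict

print(type(first).__name__)  # str
print(second)                # {'key': 'value'}
```

So the new file needs a second json.loads call before the data can be used as a dictionary.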

Json decoding error - strange characters appearing on json.loads() instead of text

I am trying to make an API call to http://api.stackoverflow.com/1.1/badges/name
My code snippet -
url = 'http://api.stackoverflow.com/1.1/badges/name'
f = urllib2.urlopen(url)
content = f.read()
jsonobj = json.loads(content)
print jsonobj
This gives me the error -
ValueError: No JSON object could be decoded
When I tried http://jsonviewer.stack.hu to load the json object from the above URL, it showed me garbled characters. You can see the output here - http://jsonviewer.stack.hu/#http://api.stackoverflow.com/1.1/badges/name
The text is displayed normally in the browser window if you go to http://api.stackoverflow.com/1.1/badges/name
I tried adding UTF-8 encoding -
jsonobj = json.loads(content, encoding = 'UTF-8')
but it still gives the same error.
According to http://api.stackoverflow.com/1.0/usage the returned information is gzipped. You will have to unzip the binary data to get the actual JSON. You can do this with the gzip and StringIO modules (Python 2):
import gzip
import StringIO
import urllib2
url = urllib2.urlopen('http://api.stackoverflow.com/1.1/badges/name')
zippedContents = url.read()
sio = StringIO.StringIO(zippedContents)
gz = gzip.GzipFile(fileobj=sio)
print gz.read()
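For readers on Python 3, the same idea can be sketched with gzip.decompress. The example below does not call the API; it compresses a made-up payload locally to stand in for the gzipped response body, so the payload and variable names are assumptions.

```python
import gzip
import json

# Made-up payload, compressed locally to imitate a gzipped HTTP body.
payload = {"badges": [{"name": "Teacher"}]}
zipped = gzip.compress(json.dumps(payload).encode("utf-8"))

# Parsing the compressed bytes directly fails.
try:
    json.loads(zipped)
    parsed_raw = True
except ValueError:
    parsed_raw = False

# Decompress first, then parse.
decoded = json.loads(gzip.decompress(zipped).decode("utf-8"))

print(parsed_raw)          # False
print(decoded == payload)  # True
```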
