Parsing XML from websites and save the code?


I would like to parse the xml code from a website like
http://ops.epo.org/3.1/rest-services/published-data/publication/docdb/EP1000000/biblio
and save it in another xml or csv file.
I tried it with this:
import urllib.request
web_data = urllib.request.urlopen("http://ops.epo.org/3.1/rest-services/published-data/publication/docdb/EP1000000/biblio")
str_data = web_data.read()
try:
    f = open("file.xml", "w")
    f.write(str(str_data))
    print("SUCCESS")
except:
    print("ERROR")
But in the saved file there is a literal '\n' between every element and a b' at the beginning.
How can I save the XML data without all the '\n' sequences and the b' prefix?

If you write the XML file in binary mode, you don't need to convert the data into a string first, and it is that str() conversion that produces the b' prefix and the literal '\n' sequences. You can also write the response a line at a time. The logic of your code could also be structured a little better IMO, as shown below:
import urllib.request
web_data = urllib.request.urlopen("http://ops.epo.org/3.1/rest-services"
                                  "/published-data/publication"
                                  "/docdb/EP1000000/biblio")
with open("file.xml", "wb") as f:
    # Iterating over the response yields one line of bytes at a time.
    for line in web_data:
        try:
            f.write(line)
        except Exception as exc:
            print('ERROR')
            print(str(exc))
            break
    else:
        # Runs only if the loop finished without hitting break.
        print('SUCCESS')

read() returns the data as bytes, and you can save it without converting it with str(). Open the file in binary mode ("wb") and write the data as-is.
import urllib.request
web_data = urllib.request.urlopen("http://ops.epo.org/3.1/rest-services/published-data/publication/docdb/EP1000000/biblio")
data = web_data.read()  # bytes
try:
    # Binary mode writes the bytes unchanged; with closes the file afterwards.
    with open("file.xml", "wb") as f:
        f.write(data)
    print("SUCCESS")
except OSError as exc:
    print("ERROR", exc)
BTW: To convert bytes to a string you have to use decode(), e.g. decode('utf-8').
If you use str() instead, Python returns the printable representation of the bytes object, which is why you get the b' prefix and escaped '\n' sequences in your data.
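To see the difference, here is a minimal sketch. The sample bytes are invented for illustration and only stand in for what read() returns:

```python
# Sample bytes standing in for web_data.read(); the content is made up.
raw = b"<a>\n<b>text</b>\n</a>"

as_repr = str(raw)             # representation of the bytes object, not the text
as_text = raw.decode("utf-8")  # the actual text, with real newlines

print(as_repr)
print(as_text)
```

The first print shows the b' prefix and a backslash followed by n, while the second shows the XML spread over three lines.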

Related

HTML Diff File is getting malformed

I am using the difflib library to generate a diff file in HTML format. It works most of the time, but occasionally the generated HTML is malformed. Sometimes the HTML is missing part of the content, and sometimes the lines are not in the proper place.
Below is the code I am using for it:
import difflib
import os
try:
    print("Reading file from first file")
    firstfile = open(firstFilePath, "r")
    contentsFirst = firstfile.readlines()
    print("Reading file from second file")
    secondfile = open(secondFilePath, "r")
    contentsSecond = secondfile.readlines()
    print("Creating diff file:")
    config_diff = difflib.HtmlDiff(wrapcolumn=70).make_file(contentsSecond, contentsFirst)
    if not os.path.exists(diff_file_path):
        os.makedirs(diff_file_path)
    final_path = diff_file_path + "/" + diff_file_name + '.html'
    diff_file = open(final_path, 'w')
    diff_file.write(config_diff)
    print("Diff file is generated :")
except Exception as error:
    print("Exception occurred in create_diff_file " + str(error))
    raise Exception(str(error))
This piece of code is called in a threaded program. With a retry I do get the desired result, but I don't know why the diff file comes out malformed and inconsistent. If someone can help me find the actual reason and propose a solution, that would be helpful.

Try to read a json file with python

I have a JSON file that is a dictionary of French synonyms (I mention French because I got an error message about ASCII encoding, due to accents such as 'é'). I want to read this file with Python to get a synonym when I input a word.
Well, I can't even read my file...
This is my code:
data=[]
with open('sortieDES.json', encoding='utf-8') as data_file:
    data = json.loads(data_file.read())
print(data)
That gives me a rather ugly list, but my question is: how can I use the file like a dictionary? I want to input data['Académie'] and get the list of synonyms. Here is an example of the JSON file:
{"Académie française":{
"synonymes":["Institut","Quai Conti","les Quarante"]
}

You only need to call json.load on the file object (you gave it the name data_file):
import json
with open('sortieDES.json', encoding='utf-8') as data_file:
    data = json.load(data_file)
print(data)

Instead of
json.load(line)
you have to use
json.loads(line)
The s is missing: loads(...) parses a string, whereas load(...) expects a file object.
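The difference between the two functions, and the dictionary lookup the question asks about, can be sketched as follows. The sample text is invented to mirror the structure shown in the question, and io.StringIO stands in for the opened file:

```python
import io
import json

# Made-up sample mirroring the structure shown in the question.
text = '{"Académie française": {"synonymes": ["Institut", "Quai Conti", "les Quarante"]}}'

from_string = json.loads(text)            # loads: parse a str
from_file = json.load(io.StringIO(text))  # load: parse a file-like object

# Both calls return an ordinary dict, so a lookup needs the exact key.
synonyms = from_file["Académie française"]["synonymes"]
print(synonyms)
```

Note that the key has to match exactly: in this sample, data['Académie'] would raise a KeyError because the key is 'Académie française'.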

Unable to decode HTML page with urllib.request

I've written the following piece of code, which fetches a URL and saves the HTML to a text file. However, I have two issues:
Most importantly, it does not save € and £ in the HTML as such. This is likely a decoding issue, which I've tried to fix, but so far without success.
The following code does not replace the "\n" in the HTML with "". This isn't as important to me, but I am curious as to why it is not working.
Any ideas?
import urllib.request
while True: # this is an infinite loop
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')
        data = str(fDecoded .read()).replace('\n', '') # does not seem to work?
        myfile = open("TestFile.txt", "r+")
        myfile.write(data)
        print ('----------------')

When you do this:
fDecoded = f.read().decode('utf-8')
fDecoded is already a str: you read the bytes from the response and decode them into a str using UTF-8.
After that, you cannot call
str(fDecoded .read()).replace('\n', '')
because str has no read() method, and you do not need to convert it to str again. Just do:
data = fDecoded.replace('\n', '')
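As a minimal sketch of the whole round trip: the bytes below are invented and stand in for f.read(), and the temporary file path is only for illustration. Passing encoding='utf-8' to open() is an assumption about the € and £ problem, which this answer does not otherwise address; without it, open() falls back to the platform's default encoding.

```python
import os
import tempfile

# Invented bytes standing in for f.read(); UTF-8 encoded HTML.
raw = "<p>Price: €5\nor £4</p>".encode("utf-8")

text = raw.decode("utf-8")     # bytes -> str
data = text.replace("\n", "")  # replace works directly on the decoded str

# Assumption: an explicit encoding keeps € and £ intact when writing.
path = os.path.join(tempfile.mkdtemp(), "TestFile.txt")
with open(path, "w", encoding="utf-8") as myfile:
    myfile.write(data)

# Read the file back to check what was actually saved.
with open(path, encoding="utf-8") as myfile:
    saved = myfile.read()
print(saved)
```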

Python JSON encoding Verification

I am trying to scan a text document, find certain sections, and output them to a file in JSON format.
Unfortunately, I am not too sure how to use json and would appreciate it if someone could tell me how to encode the data as JSON properly.
Thank you everyone!
import json
# save word and type to database
word = [{'WORD': strWrd, 'TYPE': strWrdtyp}]
with open(input_lang + '.dic', 'a') as outfile:
    try:
        json.dump(word, outfile)
        outfile.write('\n')
    except (TypeError, ValueError) as err:
        print('Error:', err)

Replacing commas with blank spaces from a read in text file

import os
os.chdir('my directory')
data = open('text.txt', 'r')
data = data.replace(",", " ")
print(data)
I get the error:
AttributeError: '_io.TextIOWrapper' object has no attribute 'replace'

You should open files in a with statement:
with open('text.txt', 'r') as data:
    plaintext = data.read()
plaintext = plaintext.replace(',', ' ')
The with statement ensures that resources are released properly, so you don't have to remember to close them.
The more substantial thing you were missing is that data is a file object, and replace works on strings. data.read() returns the text in the file as a string.
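A self-contained sketch of the same idea follows. The temporary file and its contents are made up so that the example can run anywhere:

```python
import os
import tempfile

# Create a small throwaway file; its contents are invented for the example.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with open(path, "w") as f:
    f.write("a,b,c\n1,2,3\n")

with open(path, "r") as data:
    plaintext = data.read()  # a str, so replace() is available

plaintext = plaintext.replace(",", " ")
print(plaintext)
```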
