How to open html file that contains Unicode characters? - python

I have an HTML file called test.html that contains one word: בדיקה.
I open test.html and print its content using this block of code:
file = open("test.html", "r")
print file.read()
but it prints ??????. Why does this happen, and how can I fix it?
By the way, when I open a text file it works fine.
Edit: I tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.

I encountered this problem today as well. I am using Windows and the default system language is Chinese, so others in a similar setup may hit the same Unicode error. Simply add encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
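A minimal sketch of why the explicit encoding matters, assuming the file was saved as UTF-8 (the question does not say which encoding test.html uses). It writes the Hebrew word to a temporary file and reads it back. The temporary path is made up for the demonstration, and io.open is used only because it accepts the encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for test.html so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.html")
word = u"\u05d1\u05d3\u05d9\u05e7\u05d4"  # the word from the question

# Write the file as UTF-8 (assumption: this matches how test.html was saved).
with io.open(path, "w", encoding="utf-8") as f:
    f.write(word)

# Reading with the same explicit encoding returns the original text,
# independent of the system's default code page.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Reading in binary mode shows the raw bytes stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()
```

If the file was saved in a different encoding (for example a Windows code page), the same pattern applies with that encoding's name in place of 'utf-8'.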

You can use the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document= BeautifulSoup(f.read()).get_text()
print(document)
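If you want to delete all the blank lines in between and get all the words as a single string (also dropping special characters and numbers), then also include:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Assumption: the original snippet never defines `stop`; NLTK's English stop-word list is used here.
stop = set(stopwords.words('english'))
st = ""  # accumulates the kept words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # keep purely alphabetic tokens only
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)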

You can read HTML page using 'urllib'.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page

Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')

CODE:
import codecs
path="D:\\Users\\html\\abc.html"
file=codecs.open(path,"rb")
file1=file.read()
file1=str(file1)

You can simply use this
import requests
requests.get(url)

You can use urllib in Python 3 in the same way as
https://stackoverflow.com/a/27243244/4815313 with a few changes.
#python3
import urllib.request  # urllib.request must be imported explicitly
page = urllib.request.urlopen("/path/").read()
print(page)
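Note that urlopen expects a URL rather than a bare file-system path. A hedged Python 3 sketch of reading a local HTML file this way, assuming the file is UTF-8 encoded; the temporary file is a made-up stand-in for test.html, and converting the path with Path.as_uri() is one possible approach rather than the only one.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical local page so the example needs no network access.
path = Path(tempfile.mkdtemp()) / "test.html"
expected = "<p>\u05d1\u05d3\u05d9\u05e7\u05d4</p>"
path.write_text(expected, encoding="utf-8")

# Convert the absolute path to a file:// URL before handing it to urlopen.
with urlopen(path.as_uri()) as response:
    raw = response.read()  # bytes, not str

text = raw.decode("utf-8")  # assumption: the page is UTF-8 encoded
```

For a file that is already on disk, opening it directly with open(..., encoding='utf-8') is usually simpler; urlopen is mainly useful when the page comes from a web server.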

Related

How to open a LaTeX file in Python that starts with a comment

Code:
import os
import re
import time
import csv
from TexSoup import TexSoup
path = os.getcwd()
texFile = path + '\\Paper16.tex'
print(texFile)
soup = TexSoup(open(texFile, 'r'))
This returns no output when I try to print(soup), and I believe it is because the first line starts with %.
I think this is a bug in TexSoup.
Namely, if you remove the first line, or comment out the second line instead, TexSoup is able to parse the file and print(soup) gives some output.
In addition, if you terminate the first line by adding braces:
%{\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}}
TexSoup is also able to parse the file.
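The "remove the first line" workaround can be sketched without editing the file by hand. This is only an illustration: strip_leading_comments is a hypothetical helper, the sample text is invented (modelled on the \documentstyle line above), and the TexSoup call itself is left as a comment so the sketch runs without TexSoup installed.

```python
def strip_leading_comments(tex_source):
    """Drop comment-only lines (starting with %) before the first real line."""
    lines = tex_source.splitlines(True)  # keep line endings
    first_real = 0
    while first_real < len(lines) and lines[first_real].lstrip().startswith("%"):
        first_real += 1
    return "".join(lines[first_real:])


# Hypothetical file contents; only the first line resembles the real file.
sample = (
    "%\\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}\n"
    "\\documentstyle[aps,epsf,rotate,epsfig]{revtex}\n"
    "\\begin{document}\n"
    "Hello\n"
    "\\end{document}\n"
)
cleaned = strip_leading_comments(sample)
# cleaned could then be passed to the parser, e.g. TexSoup(cleaned).
```

Only comments at the very top of the file are removed; comments further down are left untouched.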

How to display non-English language gotten by Facebook API

I'm getting some Facebook posts that contain a mixture of English and a non-English language (Khmer, to be exact).
Here's how the non-English is displayed when I print the data to screen or save it to file: \u178a\u17c2\u179b\u1787\u17b6\u17a2\u17d2. I would rather have it display as ឈឹម បញ្ចពណ៌ (Note: this is not a translation of the previous unicode.)
Try this if you want to save the info in a file:
import codecs
string = 'ឈឹម បញ្ចពណ៌'
with codecs.open('yourfile', 'w', encoding='utf-8') as f:
    f.write(string)
This should be it:
print(u'\u1787\u17b6\u17a2\u17d2') #python3
print u'\u1787\u17b6\u17a2\u17d2' #python2.7
Output: ជាអ្
In PyCharm I added this at the top of the file:
# -*- coding: utf-8 -*-
import json
import sys
reload(sys)  # Python 2 only
sys.setdefaultencoding('utf8')
s = json.dumps(posts['data'], ensure_ascii=False)
json_file.write(s.decode('utf-8'))
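On Python 3 the reload(sys) and setdefaultencoding steps are not available and should not be needed. A minimal sketch of the difference ensure_ascii makes, using an invented posts dictionary in place of the real API response and a temporary file in place of the real output file:

```python
import io
import json
import os
import tempfile

# Hypothetical stand-in for the posts['data'] returned by the API.
posts = {"data": [{"message": u"\u1787\u17b6\u17a2\u17d2 hello"}]}

escaped = json.dumps(posts["data"])                       # \uXXXX escapes
readable = json.dumps(posts["data"], ensure_ascii=False)  # real characters

# Write the readable form as UTF-8 text.
path = os.path.join(tempfile.mkdtemp(), "posts.json")
with io.open(path, "w", encoding="utf-8") as json_file:
    json_file.write(readable)

# Both forms decode back to the same data.
with io.open(path, "r", encoding="utf-8") as json_file:
    loaded = json.load(json_file)
```

The \uXXXX form is still valid JSON and decodes to the same text; ensure_ascii=False only changes how the file looks when opened in an editor.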

Decode HTML Entity on Python

I have a file that contain some lines like this:
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4
Corresponding to these lines, I have some files on disk, but they are saved in decoded form:
StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4
I need to get the file name from the first list and the correct file name from the second, and change the file name to the second name. For this, I need to decode the HTML entities in the file name, so I do something like this:
import os
from html.parser import HTMLParser
fpListDwn = open('listDwn', 'r')
for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(HTMLParser().unescape(fileName))
but this has no effect when run. Part of the output is:
meysampg#freedom:~/Downloads/Practical Machine Learning$ python3 changeName.py
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4
StatsLearning_Lect1_2b_111213_v2_%5BLvaTokhYnDw%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4a_110613_%5BWjyuiK5taS8%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4b_110613_%5BUvxHOkYQl8g%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4c_110613_%5BVusKAosxxyk%5D_%5Btag22%5D.mp4
How can I fix this?
I guess you should use urllib.parse instead of html.parser:
>>> f="StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"
>>> import urllib.parse as parse
>>> f
'StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4'
>>> parse.unquote(f)
'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
So your script should look like:
import os
import urllib.parse as parse
fpListDwn = open('listDwn', 'r')
for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(parse.unquote(fileName))
This is actually "percent encoding", not HTML encoding, see this question:
How to percent-encode URL parameters in Python?
Basically you want to use urllib.parse.unquote instead:
from urllib.parse import unquote
unquote('StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4')
Out[192]: 'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
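To make the distinction concrete, here is a small Python 3 sketch. decoded_names is a hypothetical helper (not part of any library) that strips the trailing newline from each line read from the list file and pairs the listed name with its percent-decoded form; the last line shows that html.unescape only touches HTML entities such as &amp;.

```python
import html
from urllib.parse import unquote


def decoded_names(lines):
    """Return (listed_name, decoded_name) pairs for each non-empty line."""
    pairs = []
    for line in lines:
        listed = line.strip()  # drop the trailing newline read from the file
        if listed:
            pairs.append((listed, unquote(listed)))
    return pairs


sample_lines = [
    "StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4\n",
    "\n",
]
pairs = decoded_names(sample_lines)

# html.unescape handles entities such as &amp; and leaves %5B untouched.
entity_result = html.unescape("%5Btag22%5D &amp; more")
```

Each pair could then be used to match a listed name against the file on disk, for example with os.path.exists and os.rename.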

how to convert UTF-8 code to symbol characters in python

I crawled some webpages using python's urllib.request API and saved the read lines into a new file.
f = open(docId + ".html", "w+")
with urllib.request.urlopen('http://stackoverflow.com') as u:
    s = u.read()
    f.write(str(s))
But when I open the saved files, I see many strings such as \xe2\x86\x90, which was originally an arrow symbol in the original page. It seems to be a UTF-8 code of the symbol, but how do I convert the code to the symbol back?
Your code is broken: u.read() returns a bytes object. str(bytes_object) returns a string representation of the object (what the bytes literal would look like), which is not what you want here:
>>> str(b'\xe2\x86\x90')
"b'\\xe2\\x86\\x90'"
Either save the bytes on disk as is:
import urllib.request
urllib.request.urlretrieve('http://stackoverflow.com', 'so.html')
or open the file in binary mode: 'wb' and save it manually:
import shutil
from urllib.request import urlopen
with urlopen('http://stackoverflow.com') as u, open('so.html', 'wb') as file:
    shutil.copyfileobj(u, file)
or convert bytes to Unicode and save them to disk using any encoding you like.
import io
import shutil
from urllib.request import urlopen
with urlopen('http://stackoverflow.com') as u, \
        open('so.html', 'w', encoding='utf-8', newline='') as file, \
        io.TextIOWrapper(u, encoding=u.headers.get_content_charset('utf-8'), newline='') as t:
    shutil.copyfileobj(t, file)
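The difference between str() on bytes and decoding them can be checked offline. A minimal Python 3 sketch, with a literal bytes value standing in for what u.read() returns and a temporary file standing in for so.html:

```python
import io
import os
import tempfile

raw = b"\xe2\x86\x90 back"  # UTF-8 bytes, as a network read would return them

as_repr = str(raw)             # the text of the bytes literal, b'...' included
as_text = raw.decode("utf-8")  # the arrow character followed by " back"

# Writing the decoded text with an explicit encoding round-trips the bytes.
path = os.path.join(tempfile.mkdtemp(), "so.html")
with io.open(path, "w", encoding="utf-8", newline="") as f:
    f.write(as_text)
with io.open(path, "rb") as f:
    on_disk = f.read()
```

Here as_repr is what ended up in the saved file in the question, while as_text is what was wanted.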
Try:
import urllib2, io
with io.open("test.html", "w", encoding='utf8') as fout:
    s = urllib2.urlopen('http://stackoverflow.com').read()
    s = s.decode('utf8', 'ignore') # or s.decode('utf8', 'replace')
    fout.write(s)
See https://docs.python.org/2/howto/unicode.html
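The 'ignore' and 'replace' handlers mentioned in the comment behave differently on malformed input. A small sketch with a deliberately damaged byte string (the data is invented for illustration):

```python
# Bytes for "a", a left arrow, and "b", with one invalid byte (0xff)
# inserted after the arrow.
data = b"a\xe2\x86\x90\xffb"

ignored = data.decode("utf-8", "ignore")    # drops the invalid byte
replaced = data.decode("utf-8", "replace")  # substitutes U+FFFD for it

try:
    data.decode("utf-8")  # the default handler, "strict", raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

'ignore' silently loses data, so 'replace' is often the safer choice when it matters that something was damaged.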

With regards to urllib AttributeError: 'module' object has no attribute 'urlopen'

import re
import string
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
filehandle = urllib.urlopen('http://www.google.com/') #open webpage
s = filehandle.read() #read
print s #display
#what i plan to do with it once i get the first part working
#results = re.findall('[<td style="font-weight:bold;" nowrap>$][0-9][0-9][0-9][.][0-9][0-9][</td></tr></tfoot></table>]',s)
#earnings = '$ '
#for money in results:
#earnings = earnings + money[1]+money[2]+money[3]+'.'+money[5]+money[6]
#print earnings
#raw_input()
This is the code that I have so far. I have looked at the other forums, which suggest solutions such as checking the name of the script (mine is parse_Money.py). I have also tried urllib.request.urlopen, and I have tried running it on Python 2.5, 2.6, and 2.7. If anybody has any suggestions, they would be really welcome. Thanks, everyone!
--Matt
---EDIT---
I also tried this code and it worked, so I'm thinking it's some kind of syntax error. If anybody with a sharp eye can point it out, I would be very appreciative.
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
b = 3
#find URL
URL = raw_input('Type the URL you would like to read from[Example: http://www.google.com/] :')
while b == 3:
    #get file name
    file1 = raw_input('Enter a file name for the downloaded code:')
    filepath = file1 + '.txt'
    if os.path.isfile(filepath):
        print 'File already exists'
        b = 3
    else:
        print 'Filename accepted'
        b = 4
file_path = filepath
#open file
FileWrite = open(file_path, 'a')
#access URL
filehandle = urllib.urlopen(URL)
#display source code
for lines in filehandle.readlines():
    FileWrite.write(lines)
    print lines
print 'The above has been saved in both a text and html file'
#close files
filehandle.close()
FileWrite.close()
It appears that the urlopen function is available in the urllib.request module, not in the urllib module as you are expecting.
Rule of thumb: if you are getting an AttributeError, that attribute is not present in the particular module.
EDIT: Thanks to AndiDog for pointing out that this solution is valid for Python 3.x and not applicable to Python 2.x!
The urlopen function is actually in the urllib2 module. Try import urllib2 and use urllib2.urlopen.
I see that you are using Python 2, or at least intend to.
The urlopen helper function is available in both urllib and urllib2 in Python 2.
What you need to do is run the script with the correct version of Python:
C:\Python26\python.exe yourscript.py
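If the script has to run under either interpreter, one common pattern is to try the Python 3 location first and fall back to the Python 2 one. The sketch below also records where the urllib module was loaded from, since a local file named urllib.py would shadow the standard library module; the variable names are illustrative only.

```python
import sys

try:
    # Python 3: urlopen lives in urllib.request.
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen is available from urllib2 (and urllib).
    from urllib2 import urlopen

import urllib

# If this path points into your own project rather than the standard
# library, a local file named urllib.py is shadowing the real module.
module_location = getattr(urllib, "__file__", None)
interpreter = sys.version_info[:2]  # e.g. (2, 7) or (3, x)
```

Printing interpreter and module_location at the top of the script is a quick way to confirm which Python is actually running it.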
