I use PyPDF2 to read a PDF file and get back a unicode string.
I don't know what the encoding is, so I dumped the first 8 characters to hex:
0000 005b 00d7 00c1 00e8 00d4 00c5 00d5 [......
What do these values mean? Is it UTF-16BE/LE?
I tried the code below, but the output is wrong:
print outStr.encode('utf-16be').decode('utf-16')
嬀휀섀퐀씀픀
If I print it directly, Python reports an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-7: ordinal not in range(128)
I am following the instructions from How To Extract Text From Pdf In Python.
The code section is below:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

FILTER = ''.join([(len(repr(chr(x))) == 3) and chr(x) or '.' for x in range(256)])

def dumpUnicodeString(src, length=8):
    result = []
    for i in xrange(0, len(src), length):
        unichars = src[i:i+length]
        hex = ' '.join(["%04x" % ord(x) for x in unichars])
        printable = ''.join(["%s" % ((ord(x) <= 127 and FILTER[ord(x)]) or '.') for x in unichars])
        result.append("%04x %-*s %s\n" % (i*2, length*5, hex, printable))
    return ''.join(result)

def extractPdfText(filePath=''):
    fileObject = open(filePath, 'rb')
    pdfFileReader = PyPDF2.PdfFileReader(fileObject)
    totalPageNumber = pdfFileReader.numPages
    currentPageNumber = 0
    text = ''
    while currentPageNumber < totalPageNumber:
        pdfPage = pdfFileReader.getPage(currentPageNumber)
        text = text + pdfPage.extractText()
        currentPageNumber += 1
    if text == '':
        text = textract.process(filePath, method='tesseract', encoding='utf-8')
    return text

if __name__ == '__main__':
    pdfFilePath = 'a.pdf'
    pdfText = extractPdfText(pdfFilePath)
    #pdfText = pdfText[:7]
    print dumpUnicodeString(pdfText)
    print pdfText
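For what it's worth, the hex values in the dump are Unicode code points (dumpUnicodeString calls ord() on each character of an already-decoded unicode object), not raw bytes, so there is nothing left to decode; re-encoding to UTF-16 is what produces the garbage shown above. A minimal Python 2 sketch of how to inspect and print such a string without the ascii error (the sample string simply mirrors the code points from the dump):

text = u'\u0000[\u00d7\u00c1\u00e8\u00d4\u00c5\u00d5'  # same code points as in the dump

print repr(text)             # shows the code points, independent of the console encoding
print text.encode('utf-8')   # explicit encoding, so no implicit ascii conversion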
The following script outputs files that are unreadable as .txt. Please advise.
I took inspiration from: https://area.autodesk.com/m/drew.avis/tutorials/writing-and-reading-3ds-max-scene-sidecar-data-in-python
The goal is to turn a mako shark model into a mechanical robot.
import olefile

# set this to your file
f = r'C:\MRP\Shortfin_Mako_Shark_Rigged_scanline.max'

def cleanString(data, isArray=False):
    # remove first 6 bytes + last byte
    data = data[6:]
    if isArray:
        data = data[:-1]
    return data

with olefile.OleFileIO(f) as ole:
    ole.listdir()
    print(ole.listdir())
    i = 0
    for entry in ole.listdir():
        i = i + 1
        print(entry)
        if i > 2:
            fin = ole.openstream(entry)
            # myString = fin.read().decode("utf-16")
            # myString = cleanString(myString, isArray=True)
            fout = open(entry[0], "wb")
            print(fout)
            while True:
                s = fin.read(8192)
                if not s:
                    break
                fout.write(s)
Please advise.
The model is this one: https://www.turbosquid.com/fr/3d-models/max-shortfin-mako-shark-rigged/991102#
I also tried this:
with olefile.OleFileIO(f) as ole:
    ole.listdir()
    print(ole.listdir())
    i = 0
    for entry in ole.listdir():
        i = i + 1
        print(entry)
        if i > 2:
            fin = ole.openstream(entry)
            # myString = fin.read().decode("utf-16")
            # myString = cleanString(myString, isArray=True)
            fout = open(entry[0], "w")
            print(fout)
            while True:
                s = fin.read(8192)
                if not s:
                    break
                fout.write(cleanString(s, isArray=True).decode("utf-8"))

# stream = ole.openstream('CustomFileStreamDataStorage/MyString')
# myString = stream.read().decode('utf-16')
# myString = cleanString(myString)
# stream = ole.openstream('CustomFileStreamDataStorage/MyGeometry')
# myGeometry = stream.read().decode('utf-16')
# myGeometry = cleanString(myGeometry, isArray=True)
# myGeometry = myGeometry.split('\x00')
# stream = ole.openstream('CustomFileStreamDataStorage/MyLayers')
# myLayers = stream.read().decode('utf-16')
# myLayers = cleanString(myLayers, isArray=True)
# myLayers = myLayers.split('\x00')
# print ("My String: {}\nMy Geometry: {}\nMy Layers: {}".format (myString, myGeometry, myLayers))
What is the right encoding to decode from?
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode bytes in position 4-5: invalid continuation byte
  File "C:\MRP\ALG_LIN.py", line 59, in <module>
    fout.write(cleanString(s, isArray = True).decode("utf-8"))
Exception has occurred: UnicodeEncodeError
'charmap' codec can't encode characters in position 2-5: character maps to <undefined>
  File "C:\MRP\ALG_LIN.py", line 59, in <module>
    fout.write(cleanString(s, isArray = True).decode("utf-16"))
KR,
Ludo
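A hedged sketch of one way around both errors, assuming the streams really are UTF-16 (that is what the commented-out lines and the Autodesk tutorial use; it is an assumption, not something the format guarantees for every stream): decode the whole stream in one go rather than per 8 KiB chunk (a chunk boundary can split a 2-byte UTF-16 code unit), and open the output file with an explicit encoding so Windows' default 'charmap' codec is never involved:

import olefile

f = r'C:\MRP\Shortfin_Mako_Shark_Rigged_scanline.max'

with olefile.OleFileIO(f) as ole:
    for i, entry in enumerate(ole.listdir()):
        if i < 2:                                        # skip the first two entries, as in the loop above
            continue
        data = ole.openstream(entry).read()
        text = data.decode('utf-16', errors='replace')   # assumption: UTF-16 payload
        with open(entry[0] + '.txt', 'w', encoding='utf-8') as fout:
            fout.write(text)

The cleanString() trimming from the tutorial can still be applied to text afterwards if the leading/trailing characters get in the way.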
When I use CountVectorizer in sklearn, it needs the input as Unicode, but my data files are encoded in ANSI.
I tried converting the files to Unicode with Notepad++, but then readlines could not read all the lines; it only read the last one. After that, I tried to read each line from the data files and write them to a new file as Unicode, but I failed.
import codecs
import os

def merge_file():
    root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname = 'resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname, 'w', 'utf-8')
    num = 1
    for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num, ":", str(filename)
            num = num + 1
            path = current_dir + "\\" + str(filename)
            source = open(path, 'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')
            while line != "":
                line = unicode(line, "gbk")
                line = line.replace('\n', ' ')
                line = line.replace('\r', ' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :' + str(filename)
                result.write('\n')
                source.close()
    print 'End All.'
    result.close()
The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
Oh, I found the way.
First, use chardet to detect the string's encoding.
Second, use codecs to read or write the file in that specific encoding.
Here is the code.
import chardet
import codecs
import os

root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
    current_dir = root_dir + str(back_name)
    for filename in os.listdir(current_dir):
        print num, ":", str(filename)
        num = num + 1
        path = current_dir + "\\" + str(filename)
        content = open(path, 'r').read()
        source_encoding = chardet.detect(content)['encoding']
        if source_encoding == None:
            print '??', filename
            failed.append(filename)
        elif source_encoding != 'utf-8':
            content = content.decode(source_encoding, 'ignore')
            codecs.open(path, 'w', encoding='utf-8').write(content)
print failed
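For reference, this is roughly what chardet.detect reports for a raw byte string (the file name below is hypothetical and the confidence value will vary):

import chardet

raw = open('some_file.txt', 'rb').read()    # hypothetical file; read raw bytes, not text
guess = chardet.detect(raw)
print guess                                 # e.g. {'encoding': 'GB2312', 'confidence': 0.99}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'], 'ignore')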
Thanks for all your help.
When I try to run the code below, it gives me an error.
I have installed every required Python module, including nltk. I have also added lxml and numpy, but it still won't work. I am using Python 3, so in this case I have changed urllib2 to urllib.request.
Please help me to find a solution.
I am running this as
python index.py
My index file is given below.
This is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import ssl
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import codecs
def checkChar(token):
    for char in token:
        if (0 <= ord(char) and ord(char) <= 64) or (91 <= ord(char) and ord(char) <= 96) or (123 <= ord(char)):
            return False
        else:
            continue
    return True

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

path = 'crawled_html_pages/'
index = {}
docNum = 0
stop_words = set(stopwords.words('english'))

for filename in os.listdir(path):
    collection = {}
    docNum += 1
    file = codecs.open('crawled_html_pages/' + filename, 'r', 'utf-8')
    page_text = cleanMe(file)
    tokens = nltk.word_tokenize(page_text)
    filtered_sentence = [w for w in tokens if not w in stop_words]
    filtered_sentence = []
    breakWord = ''
    for w in tokens:
        if w not in stop_words:
            filtered_sentence.append(w.lower())
    for token in filtered_sentence:
        if len(token) == 1 or token == 'and':
            continue
        if checkChar(token) == False:
            continue
        if token == 'giants':
            breakWord = token
            continue
        if token == 'brady' and breakWord == 'giants':
            break
        if token not in collection:
            collection[token] = 0
        collection[token] += 1
    for token in collection:
        if token not in index:
            index[token] = ''
        index[token] = index[token] + '(' + str(docNum) + ', ' + str(collection[token]) + ")"
    if docNum == 500:
        print(index)
        break
    else:
        continue

f = open('index.txt', 'w')
vocab = open('uniqueWords.txt', 'w')
for term in index:
    f.write(term + ' =>' + index[term])
    vocab.write(term + '\n')
    f.write('\n')
f.close()
vocab.close()
print('Finished...')
These are the errors I get:
C:\Users\myworld>python index.py
Traceback (most recent call last):
  File "index.py", line 49, in <module>
    page_text = cleanMe(file)
  File "index.py", line 22, in cleanMe
    soup = BeautifulSoup(html, "html.parser")
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 191, in __init__
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 700, in read
    return self.reader.read(size)
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
You can change the encoding BeautifulSoup uses by setting the from_encoding parameter:
soup = BeautifulSoup(html, from_encoding="iso-8859-8")
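A minimal sketch of that idea (the charset is just the one from this answer; use whatever your crawled pages actually are, and the file name is hypothetical). Reading the file as raw bytes also keeps the codecs reader in your traceback from raising before BeautifulSoup is ever reached:

from bs4 import BeautifulSoup

with open('crawled_html_pages/page1.html', 'rb') as fh:   # bytes, not text
    raw = fh.read()

soup = BeautifulSoup(raw, 'html.parser', from_encoding='iso-8859-8')
text = soup.get_text()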
I am using the code below to parse the XML-format Wikipedia training data into a plain text file:
from __future__ import print_function
import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
            # ###another method###
            # output.write(
            #     space.join(map(lambda x: x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
When I run this code, it gives me the following error message:
  File "wiki_parser.py", line 42, in <module>
    output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 1537: illegal multibyte sequence
When I searched for this error online, most answers told me to add 'utf-8' as the encoding, which is already there. What could be the issue with the code?
Minimal example
The problem is that your file is opened with an implicit encoding (inferred from your system). I can recreate your issue as follows:
a = '\u1f00'
with open('f.txt', 'w', encoding='cp949') as f:
    f.write(a)
Error message: UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 0: illegal multibyte sequence
You have two options. Either open the file using an encoding which can encode the character you are using:
with open('f.txt', 'w', encoding='utf-8') as f:
    f.write(a)
Or open the file as binary and write encoded bytes:
with open('f.txt', 'wb') as f:
    f.write(a.encode('utf-8'))
Applied to your code:
I would replace this part:
output = open(outp, 'w')
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
for text in wiki.get_texts():
    if six.PY3:
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        # ###another method###
        # output.write(
        #     space.join(map(lambda x: x.decode("utf-8"), text)) + '\n')
    else:
        output.write(space.join(text) + "\n")
with this:
from io import open

wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
with open(outp, 'w', encoding='utf-8') as output:
    for text in wiki.get_texts():
        output.write(u' '.join(text) + u'\n')
which should work in both Python 2 and Python 3.
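As a quick, self-contained check of that change (the file name is hypothetical; the character is the one from the traceback, which would fail under cp949):

from io import open

with open('check.txt', 'w', encoding='utf-8') as f:
    f.write(u'\u1f00\n')   # written fine regardless of the system's default code page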
I am trying to use the Bing api in python with the following code:
#!/usr/bin/python
from bingapi import bingapi
import re
import json
import urllib
import cgi
import cgitb
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def strip_tags2(data):
    p = re.compile(r'<[^<]*?>')
    q = re.compile(r'[&;!##$%^*()]*')
    data = p.sub('', data)
    return q.sub('', data)

def getUrl(item):
    return item['Url']

def getContent(item):
    return item['Description']

def getTitle(item):
    return item['Title']

def getInfo(qry, siteStr):
    qryStr = qry + "+" + siteStr
    #qryStr = u"%s" % qryStr.encode('UTF-8')
    query = urllib.urlencode({'q' : qryStr})
    url = 'http://api.bing.net/json.aspx?Appid=<myappid>&Version=2.2&Market=en-US&Query=%s&Sources=web&Web.Count=10&JsonType=raw' % (query)
    search_results = urllib.urlopen(url)
    j = json.loads(search_results.read())
    results = j['SearchResponse']['Web']['Results']
    return results

def updateRecent(qry):
    f = open("recent.txt", "r")
    lines = f.readlines()
    f.close()
    lines = lines[1:]
    if len(qry) > 50: #truncate if string too long
        qry = (qry[:50] + '...')
    qry = strip_tags2(qry) #strip out the html if injection try
    lines.append("\n%s" % qry)
    f = open("recent.txt", "w")
    f.writelines(lines)
    f.close()

if __name__ == '__main__':
    form = cgi.FieldStorage()
    qry = form["qry"].value
    qry = r'%s' % qry
    updateRecent(qry)
    siteStr = "(site:answers.yahoo.com OR site:chacha.com OR site:blurtit.com OR site:answers.com OR site:question.com OR site:answerbag.com OR site:stackexchange.com)"
    print "Content-type: text/html"
    print
    header = open("header.html", "r")
    contents = header.readlines()
    header.close()
    for item in contents:
        print item
    print """
    <div id="results">
    <center><h1>Results:</h1></center>
    """
    for item in getInfo(siteStr, qry):
        print "<h3>%s</h3>" % getTitle(item)
        print "<br />"
        print "%s" % getUrl(item)
        print "<br />"
        print "<p style=\"color:gray\">%s</p>" % getContent(item)
        print "<br />"
    print "</div>"
    footer = open("footer.html", "r")
    contents = footer.readlines()
    footer.close()
    for thing in contents:
        print thing
It prints a few results and then gives me the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 72: ordinal not in range(128)
Can someone explain why this is happening? It clearly has something to do with how the URL is getting encoded, but what exactly is wrong? Thanks in advance!
That particular Unicode character is "HORIZONTAL ELLIPSIS". One or more of your getXXXXX() functions are returning Unicode strings, one of which contains a non-ASCII character. I suggest declaring the encoding of your output, for example:
Content-Type: text/html; charset=utf-8
and explicitly encoding your output in that encoding.
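For instance, a minimal Python 2 sketch of both steps, with u'\u2026' (the character from your traceback) as stand-in data rather than a real Bing result:

title = u'Some result title\u2026'            # stand-in for a value returned by getTitle()

print "Content-type: text/html; charset=utf-8"
print
print "<h3>%s</h3>" % title.encode('utf-8')   # explicit encode, so no implicit ascii conversion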
We need to know the line number where the exception was thrown; it will be in the backtrace. Anyway, the problem is that you are reading Unicode from the files/URLs and then implicitly converting it to US-ASCII, probably in one of the concatenation operations. You should prefix all constant strings with u to indicate that they are Unicode strings, as in
u"\n%s" % qry