I built a very straightforward Python app: a file selection box opens, the user chooses a PDF file, and the text from the PDF is exported to a CSV.
I packaged this as a .exe from within a virtualenv. I only installed the libraries I'm importing (plus PyMuPDF), yet the package is still 1.4GB.
The script:
import textract
import csv
import codecs
import fitz
import re
import easygui
from io import open
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
filename = easygui.fileopenbox()
pdfFileObj = fitz.open(filename)
text = ""
for page in pdfFileObj:
    text += page.getText()
re.sub(r'\W+', '', text)  # note: the result is not assigned, so text is unchanged
if text == "":
    # no text layer found: fall back to OCR (textract returns bytes, so decode them)
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
tokens = word_tokenize(text)
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    write = csv.writer(f)
    for i in keywords:
        write.writerow([i])
Some context:
Within my venv, the entire lib folder is about 400MB. So how do I find out what is being added to the .exe that's making it 1.4GB?
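One rough way to see where the space goes is to measure folder sizes, both in the venv's lib folder and in the build output. The sketch below uses only the standard library; the function names and the example paths in the note afterwards are hypothetical, and it assumes a one-folder build whose contents can be inspected on disk (a single-file .exe cannot be walked this way).

```python
import os

def folder_sizes(root):
    """Return (name, total_bytes) for each entry directly under root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            # Sum every regular file below this sub-folder
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
        else:
            total = entry.stat(follow_symlinks=False).st_size
        sizes.append((entry.name, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)

def print_largest(root, top=15):
    """Print the biggest entries under root in megabytes."""
    for name, size in folder_sizes(root)[:top]:
        print("{:10.1f} MB  {}".format(size / 1e6, name))
```

Calling print_largest on something like venv/Lib/site-packages and then on the build folder (for example dist/myapp) and comparing the two lists should show which packages were pulled in beyond the 400MB you expected.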
Related
I open a file in python.
I read it in and process it (separate it into single words).
I write it to an output file.
The below picture shows my code, the shell (where I'm printing out each word before appending it to the file), and the output.
Why does it become Chinese characters? The encoding of the file is ANSI.
Edit: I should add that the output file seems to be encoded with UCS-2 LE BOM.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
with open('ALLSentences.txt', 'r') as myfile:
    text = myfile.read()
tokenized = word_tokenize(text)
file = open("output.txt", "a")
for word in tokenized:
    print(word)
    file.write(word + "\n")
file.close()
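One plausible explanation, offered here only as an assumption, is that output.txt already existed with a UCS-2/UTF-16 LE byte order mark and the script appended 8-bit text to it, so the editor pairs up the new bytes as 16-bit characters. The sketch below reproduces that misreading on a short string and shows a hypothetical helper (write_words) that overwrites the file with an explicit encoding instead of appending.

```python
# Bytes as an 8-bit ("ANSI", here cp1252) write would produce them
ascii_bytes = "this".encode("cp1252")

# A UTF-16 LE reader pairs the bytes: 0x74 0x68 -> U+6874, 0x69 0x73 -> U+7369
misread = ascii_bytes.decode("utf-16-le")
code_points = [hex(ord(ch)) for ch in misread]
print(code_points)  # ['0x6874', '0x7369'], both inside the CJK ideograph range

def write_words(words, path, encoding="utf-8"):
    """Write one word per line, replacing any existing file, with an explicit encoding."""
    with open(path, "w", encoding=encoding, newline="\n") as out:
        for word in words:
            out.write(word + "\n")
```

If this is the cause, deleting the old output.txt (or opening it with mode "w") and passing the same encoding to both open() calls should make the output readable again.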
I'm trying very hard to find a way to convert a PDF file to a .docx file with Python.
I have seen other posts related to this, but none of them seem to work correctly in my case.
Specifically, I'm using:
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!
I am not aware of a way to convert a PDF file directly into a Word file using LibreOffice.
However, you can convert the PDF to HTML and then convert the HTML to docx.
First, get the commands running on the command line. (The following is on Linux, so on your OS you may have to fill in the path to the soffice binary and use a full path for the input file.)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code.
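A minimal sketch of that wrapping, assuming soffice is on the PATH, might look like this. The function names are hypothetical, and the --headless and --outdir options are additions that do not appear in the two commands above; drop them if your LibreOffice version behaves differently.

```python
import subprocess
from pathlib import Path

def build_commands(pdf_path, outdir):
    """Return the two soffice command lines (PDF -> HTML, then HTML -> DOCX)."""
    pdf_path = Path(pdf_path)
    outdir = Path(outdir)
    html_path = outdir / (pdf_path.stem + ".html")
    to_html = ["soffice", "--headless", "--convert-to", "html",
               "--outdir", str(outdir), str(pdf_path)]
    # Passing a list avoids shell quoting, so the filter name needs no extra quotes
    to_docx = ["soffice", "--headless", "--convert-to", "docx:MS Word 2007 XML",
               "--outdir", str(outdir), str(html_path)]
    return [to_html, to_docx]

def pdf_to_docx(pdf_path, outdir="."):
    """Run both conversions in order; raises CalledProcessError if soffice fails."""
    for command in build_commands(pdf_path, outdir):
        subprocess.run(command, check=True)
```

Inside the os.walk loop from the question, each abspath could then be passed to pdf_to_docx(abspath, top) so the output lands next to the source PDF.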
I use this for multiple files
####
from pdf2docx import Converter
import os
# Input and output directories, then a for loop over the input files
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
    cv = Converter(path_input + file)
    cv.convert(path_output + file + '.docx', start=0, end=None)
    cv.close()
    print(file)
The code below worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()
My approach does not follow the same methodology of using subsystems. However this one does the job of reading through all the pages of a PDF document and moving them to a docx file. Note: It only works with text; images and other objects are usually ignored.
# Description: this Python script fetches the text from a PDF file and writes it to a .docx file
# Import libraries
import PyPDF2
import os
import docx
mydoc = docx.Document()  # the output Word document
pdfFileObj = open('pdf/filename.pdf', 'rb')  # PDF file location
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # define the PDF reader object
# Loop through all the pages (page indices start at 0, so start the range at 0)
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfContent = pageObj.extractText()  # extract the text content of the page
    print(pdfContent)  # optional: print to check the output in the terminal
    mydoc.add_paragraph(pdfContent)  # add the content to the Word document
mydoc.save("pdf/filename.docx")  # name of the output file
I have successfully done this with pdf2docx:
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)
I'm having a hard time trying to figure this out. New to coding. I'm trying to read a .txt file, tokenize it, pos tag the words in it.
Here's what I've got so far:
import nltk
from nltk import word_tokenize
import re
file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)
My problem is, it keeps giving me the TypeError: expected string or bytes-like object error.
word_tokenize is expecting a string, but readlines() gives you a list.
Converting the list to a single string will solve the issue.
import nltk
from nltk import word_tokenize
import re
file = open('test.txt', 'r').readlines()
text = ''
for line in file:
    text += line
text = word_tokenize(text)
# Pass the token list directly to pos_tag; wrapping it in str() would tag single characters instead of words
nltk.pos_tag(text)
I suggest you do the following:
import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re
with open('1865-Lincoln.txt') as f:  # with-open is recommended for file reading
    lines = f.readlines()  # first get all the lines from the file and store them
for i in range(0, len(lines)):  # for each line, do the following
    token_text = word_tokenize(lines[i])  # tokenize each line, store in token_text
    print(token_text)  # for debug purposes
    pos_tagged_token = nltk.pos_tag(token_text)  # pass the token_text to pos_tag()
    print(pos_tagged_token)
For a text file containing:
user is here
pass is there
The output was:
['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]
It worked for me, I'm on Python 3.6, if that should matter. Hope this helps!
EDIT 1:
So your issue was that you were passing a list of strings (the result of readlines()) to word_tokenize(), which expects a single string, and then a plain string to pos_tag(), whereas the doc says
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word
Hence you needed to tokenize line by line, i.e. string by string, and pass each resulting list of words to pos_tag(). That is why you were getting the TypeError: expected string or bytes-like object error.
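The error message itself comes from Python's re module, which regex-based tokenizers rely on. The sketch below reproduces it without NLTK; simple_tokenize and its pattern are stand-ins for illustration, not NLTK's actual implementation.

```python
import re

def simple_tokenize(text):
    # Stand-in for a regex-based tokenizer; this is NOT NLTK's real pattern
    return re.findall(r"\w+|[^\w\s]", text)

lines = ["user is here\n", "pass is there\n"]  # the kind of list readlines() returns

try:
    simple_tokenize(lines)  # passing the whole list, not a string
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Tokenizing one string at a time works
tokens_per_line = [simple_tokenize(line) for line in lines]
print(tokens_per_line)
```

The same reasoning applies to word_tokenize(): hand it one string per call, then give pos_tag() the list of tokens it returns.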
Most probably the 1865-Lincoln.txt refers to the inaugural speech of president Lincoln. It's available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip
The original source of the document comes from the Inaugural Address Corpus
If we check how NLTK is reading the file using LazyCorpusLoader, we see that the files are Latin-1 encoded.
inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')
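If you have the default encoding set to utf8, that mismatch is a likely source of trouble when reading the file, although it would normally surface as a UnicodeDecodeError or garbled characters rather than as the TypeError: expected string or bytes-like object.
You should open the file with an explicit encoding and decode the string properly, i.e.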
import nltk
from nltk import word_tokenize, pos_tag
tagged_lines = []
with open('test.txt', encoding='latin1') as fin:
    for line in fin:
        tagged_lines.append(pos_tag(word_tokenize(line)))
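To see why the explicit encoding matters, here is a small sketch that decodes the same Latin-1 bytes both ways. The sample word is an arbitrary choice for illustration and is not taken from the Lincoln file.

```python
# Bytes as they would sit in a Latin-1 encoded file
raw = "r\u00e9sum\u00e9".encode("latin1")

# Decoding with the matching codec restores the text
as_latin1 = raw.decode("latin1")

# Decoding the same bytes as UTF-8 fails, because 0xE9 starts an incomplete sequence
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(as_latin1, utf8_failed)
```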
But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.
>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
I've been trying to make my Python code fill in a form in Word with data that I scraped off the Internet. I wrote the data to a txt file and am now trying to fill the Word file with this code:
import zipfile
import os
import tempfile
import shutil
import codecs
def getXml(docxFilename,ReplaceText):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= zip.read("word/document.xml")
    for key in ReplaceText.keys():
        xmlString = xmlString.replace(str(key), str(ReplaceText.get(key)))
    return xmlString
def createNewDocx(originalDocx,xmlString,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    #3tmpDir=tmpDir.decode("utf-8")
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)
f=open('test.txt', 'r',)
text=f.read().split("\n")
print text[1]
Pavarde = text[1]
Replace = {"PAVARDE1":Pavarde}
createNewDocx("test.docx",getXml("test.docx",Replace),"test2.docx")
The file is created but I can't open it.
I get the following error:
Illegal xml character
My guess would be that there's something wrong with the encoding, but I can't find a solution.
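If the encoding guess is right, one thing worth trying is to treat document.xml explicitly as UTF-8 and to XML-escape the inserted text, since a raw & or < in the scraped data would also make the XML invalid. This is only a sketch under those assumptions, written for Python 3 (the code above is Python 2), and fill_placeholders is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

def fill_placeholders(xml_bytes, replacements):
    """Replace placeholder keys in document.xml bytes with XML-escaped values."""
    xml_text = xml_bytes.decode("utf-8")  # assumes document.xml is UTF-8
    for key, value in replacements.items():
        # escape() turns &, < and > into entities so the XML stays well-formed
        xml_text = xml_text.replace(key, escape(value))
    return xml_text.encode("utf-8")

sample = b"<w:t>PAVARDE1</w:t>"
filled = fill_placeholders(sample, {"PAVARDE1": "Jonas & Petras <test>"})
print(filled)
```

The returned bytes would then have to be written back with the file opened in binary mode ("wb"), so that no second, implicit encoding step is applied.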
I want to tokenize an input file in Python. Please suggest how to do it; I am a new Python user.
I have read something about regular expressions but I am still confused, so please suggest a link or a code overview for this.
Try something like this:
import nltk
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
The NLTK tutorial is also full of easy to follow examples: https://www.nltk.org/book/ch03.html
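Since the question mentions regular expressions, here is a rough regex-only tokenizer for comparison, using just the standard library. The pattern and the name regex_tokenize are illustrative assumptions; the output will differ from word_tokenize(), which for instance splits contractions differently.

```python
import re

# A word (optionally with an internal apostrophe part) or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Don't panic: it's only a test."))
# ["Don't", 'panic', ':', "it's", 'only', 'a', 'test', '.']
```

For real text, a trained tokenizer such as the NLTK one usually handles abbreviations and edge cases better than a single pattern.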
Using NLTK
If your file is small:
Open the file with the context manager with open(...) as x,
then do a .read() and tokenize it with word_tokenize()
[code]:
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
If your file is larger:
Open the file with the context manager with open(...) as x,
read the file line by line with a for-loop
tokenize the line with word_tokenize()
output to your desired format (with the write flag set)
[code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Using SpaCy
from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
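tokenizer = Tokenizer(nlp.vocab)  # created with only the vocab, it splits on whitespace
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:  # output opened for writing
    for line in fin:
        tokens = tokenizer(line.strip())  # the Tokenizer object is called directly and returns a Doc
        print(' '.join(token.text for token in tokens), end='\n', file=fout)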