I'm looking for the easiest way to convert a PDF to plain text in Python.
PyPDF2 seemed very easy; here is what I have:
import PyPDF2

def test_pdf(filename):
    # open the PDF in binary mode and hand it to PyPDF2
    pdf = PyPDF2.PdfFileReader(open(filename, "rb"))
    for page in pdf.pages:
        print(page.extractText())
But it gives me:
InChapter5wepresentandevaluateourresults,togetherwiththetestenvironment.
How can I extract words from that PDF with PyPDF? Is there a different way (another library that works well for this)?
Well, I used PDFMiner with success; you can parse and extract text from PDF documents with it.
More specifically, there is the pdf2txt.py tool, which you can use to extract text. Installation is easy: pdfminer-xxx#python setup.py install, and from bash or cmd a simple pdf2txt.py -o Application.txt Reference/Application.pdf command will do the trick.
In the above-mentioned one-liner, Application.pdf is your target PDF, the one you are going to process, and Application.txt is the file that will be generated.
Furthermore, for more complex tasks you can take a look at the API and modify it to your needs.
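For reference, the maintained fork pdfminer.six exposes a high-level API that does the same thing from Python directly. A minimal sketch, assuming pdfminer.six is installed:

import from the maintained fork:

from pdfminer.high_level import extract_text

# extract_text runs the full layout analysis and returns one string
text = extract_text("Reference/Application.pdf")
print(text)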
Edit: I answered based on my personal experience, and that's that. I have no reason to "promote" the proposed tool. I hope that helps.
Edit 2: something like this worked for me.
# -*- coding: utf-8 -*-
import os

dirpath = 'path\\to\\dir'
filenames = os.listdir(dirpath)
nb = 0
# concatenate every file in the directory into one output file
with open('path\\to\\dir\\file.txt', 'w') as outfile:
    for fname in filenames:
        nb = nb + 1
        print(fname)
        print(nb)
        currentfile = os.path.join(dirpath, fname)
        with open(currentfile) as infile:
            for line in infile:
                outfile.write(line)
I am trying to create a directory crawler to search for specific keywords in all files inside a folder and all its subfolders. This is what I have so far (in this case I am looking for keyword 'olofx'):
import os

rootDir = os.getcwd()

def scan_file(filename, dirname):
    print(os.path.join(dirname, filename))
    contains = False
    if "olofx" in filename:
        contains = True
    else:
        with open(os.path.join(dirname, filename)) as f:
            lines = f.readlines()
            for l in lines:
                #print(l)
                if "olofx" in l:
                    contains = True
                    break
    if contains:
        print("yes")

for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        scan_file(fname, dirName)
The problem is that when I reach one of my sample Excel files, the characters seem to be unreadable.
Here is some of the output for the Excel file:
;���+͋�۳�L���P!�/��KdocProps/core.xml �(���_K�0���C�{�v�9Cہʞ
n(���v
6H�ݾ�i���|Lι��sI���:��VJ' �#1ͅ�h�^�s9O��VP�8�(//r���6`��r���7c�v ���
I have worked with openpyxl and I know I can use it to read Excel files, but I want one script that reads all kinds of files: Word, Excel, PDF, etc. Is there any way to represent files' contents regardless of file type?
Thank you
Your code assumes that the content of your files is available as plain text.
Unfortunately, for many file types this is not the case. Office documents (.docx, .xlsx) are basically XML documents inside a ZIP archive. That means the text content is stored in compressed form, so when you parse the file bytes as plain text, the content is not recognisable.
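You can see this for yourself with the standard zipfile module; a quick sketch (the file name is a placeholder):

import zipfile

# a .docx is really a ZIP archive; the visible text lives in word/document.xml
with zipfile.ZipFile("report.docx") as archive:
    print(archive.namelist())                  # the XML parts inside the archive
    xml_bytes = archive.read("word/document.xml")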
You will need the necessary tools to interpret each of your file types correctly. There are libraries for this. One that I found is https://textract.readthedocs.io/en/stable/ but I have no experience with it.
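Going by its documentation (I have not verified this myself), usage would look roughly like:

import textract

# textract picks a parser based on the file extension and returns bytes
text = textract.process("some_document.xlsx")  # placeholder file name
print(text.decode("utf-8"))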
It seems that your script is saved with a different encoding than your files, which are likely UTF-8 encoded.
Try adding the following lines at the very beginning of your file:
#!/usr/bin/env python
#-*- coding: utf-8 -*-
You may also check the following answer: Character Encoding, XML, Excel, python
I tried to modify and save an XML file using minidom in Python.
Everything works fine except for one specific file, which I can read but cannot write back.
The code that I use to save the XML file:
from xml.dom import minidom

domXMLFile = minidom.parse(dom_document_filename)
# some modification
F = open(dom_document_filename, "w")
domXMLFile.writexml(F)
F.close()
My questions are:
Is it true that minidom cannot handle a file this large (714 KB)?
How do I solve my problem?
In my opinion, lxml is way better than minidom for handling XML. If you have it, here is how to use it:
from lxml import etree

root = etree.parse('path/file.xml')
# some changes to root
# etree.tostring returns bytes, so open the file in binary mode
with open('path/file.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
If not, you could use pdb to debug your code. Just write import pdb; pdb.set_trace() in your code where you want a breakpoint, and when you run your function in a shell, it will stop at that line. That may give you a better view of what is not working.
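For instance, dropping the breakpoint just before the write, using the code from the question:

from xml.dom import minidom

domXMLFile = minidom.parse(dom_document_filename)
# some modification
import pdb; pdb.set_trace()  # execution pauses here; inspect domXMLFile interactively
with open(dom_document_filename, "w") as F:
    domXMLFile.writexml(F)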
I'm trying to learn Python in my free time, and my textbook does not cover anything about my error, so I must have messed up badly somewhere. When I try to open and read a text file (written with Notepad on Windows) with my code, it produces FileNotFoundError: [Errno 2] No such file or directory. My code is:
def getText():
    infile = open("C:/Users/****/AppData/Local/Programs/Python/Python35-32/lib/book.txt", "r")
    allText = infile.read()
    return allText
If it's necessary, here is the rest of my code so far:
def inspectWord(theWord, wList, fList):
    tempWord = theWord.rstrip("\"\'.,`;:-!")
    tempWord = tempWord.lstrip("\"\'.,`;:-!")
    tempWord = tempWord.lower()
    if tempWord in wList:
        tIndex = wList.index(tempWord)
        fList[tIndex] += 1
    else:
        wList.append(tempWord)
        fList.append(1)

def main():
    myText = getText()
    print(myText)

main()
I'd greatly appreciate any advice, etc.; I cannot find any help for this. Thanks to anyone who responds.
To open a unicode file, you can do the following
import codecs

def getText():
    with codecs.open("C:/Users/****/AppData/Local/Programs/Python/Python35-32/lib/book.txt", "r", encoding='utf8') as infile:
        allText = infile.read()
    return allText
See also: Character reading from file in Python
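Note that since you are on Python 3.5, the built-in open also accepts an encoding argument, so codecs is not strictly needed:

def getText():
    # Python 3's open takes the encoding directly
    with open("C:/Users/****/AppData/Local/Programs/Python/Python35-32/lib/book.txt", "r", encoding='utf8') as infile:
        return infile.read()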
First of all, I recommend using a relative path, not an absolute path. It is simpler and will make your life easier, especially now that you have just started to learn Python. If you know how to use the command line, open one and move to the directory where your source code is located. Make a new text file there, and now you can do something like
f = open("myfile.txt")
Your error indicates that there is something wrong with the path you passed to the built-in function open. Try this in an interactive session:
>>> import os
>>> os.path.exists("C:/Users/****/AppData/Local/Programs/Python/Python35-32/lib/book.txt")
If this returns False, there is nothing wrong with your getText function itself; just pass a correct path to open.
This is a pretty general question, and I don't even know whether this is the correct community for it; if not, just tell me.
I recently had an HTML file from which I was extracting ~90 lines of HTML code (out of ~8000 total lines). I did this with a simple Python script and stored my output (the shortened HTML code) in a text file. Now I am confused, because the file size has increased. What could cause the file to get bigger after I extracted part of it?
File size before: 319.374 Bytes
File size after: 321.516 Bytes
Is this because of the different file formats html and txt?
Any help or suggestions appreciated!
Code:
import glob
import os
import re
def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            extract = re.compile(r'StartTag.*?EndTag', re.S)
            cut = extract.sub('', contents)
            if re.search(extract, contents) is not None:
                out.write(cut)

extractor()
EDIT: I also tried using ".html" instead of ".txt" as the file format for my output file. However, the difference remains.
This code does not write to the original HTML file. Something else must be causing the increased file size.
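To confirm which file actually grew, you can compare sizes directly; a quick sketch (the file names are placeholders):

import os

for name in ("page.html", "page.txt"):  # hypothetical input and output names
    path = os.path.join(r"F:\Test", name)
    print(name, os.path.getsize(path), "bytes")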
I have a directory full of files and a set of strings (about 40) that need to be identified. I want to go through all of the files in the directory and print out the names of the files that contain any one of my strings. I found code that works perfectly (Search directory for specific string), but it only works for one string. Whenever I try to add more, it prints out the name of every single file in the directory. Can someone help me tweak the code? I just started programming a few days ago and don't know what to do.
import glob

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
        if 'string' in contents:
            print(file)
That code was taken from the question I mentioned above. Any help would be appreciated, and any tips on asking the question better would be as well! Thank you!
You can try:
import glob

strings = ['string1', 'string2']

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
        # stop at the first string that matches in this file
        for string in strings:
            if string in contents:
                print(file)
                break
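A more compact variant expresses the same check with any(), which also stops at the first match:

import glob

strings = ['string1', 'string2']

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
    if any(s in contents for s in strings):
        print(file)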
About asking better questions: link
I would just use grep; the -f flag reads one search pattern per line from strings.txt, and -l prints only the names of the files that match:
$ grep -l -f strings.txt *.csv