reading in multiple text file extensions .pdf, .txt and .htm - python

I have a folder of text files that I want to read and combine into a corpus, but I can only get it working with .txt files. How can I expand the code below to read .pdf, .htm and .txt files?
import codecs

corpus_raw = u""
for file_name in file_names:
    # read each file as UTF-8 text and append it to the corpus
    with codecs.open(file_name, "r", "utf-8") as f:
        corpus_raw += f.read()
    print("Document is {0} characters long".format(len(corpus_raw)))
    print()
For example:
with open('/data/text_file.txt', "r", encoding="utf-8") as f:
    print(f.read())
This reads the data, and I can view it in a notebook.
with open('/data/text_file.pdf', "r", encoding="utf-8") as f:
    print(f.read())
This reads nothing.

Broadly, there are two kinds of files: binary files and plain-text files. Some formats mix the two.
HTML files are plain text: human-readable files that you can edit by hand. PDF files mix binary data with text, so you need special programs or libraries to read and edit them.
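A quick way to see the difference from Python is to open a file in binary mode and look at its first bytes. The sketch below uses only the standard library; the helper names looks_like_pdf and read_as_text are made up for illustration.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'."""
    with open(path, "rb") as f:  # binary mode: no decoding is attempted
        return f.read(5) == b"%PDF-"


def read_as_text(path, encoding="utf-8"):
    """Return the decoded text, or None if the bytes are not valid text."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None
```

A plain .txt or .htm file normally decodes cleanly, while the compressed streams inside a typical PDF do not.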
Reading from PDF or HTML is possible. I wasn't sure whether you meant to extract the text or the source code, so I'll explain both.
Extracting Text
Extracting text from HTML files is easy: read the file as text, then strip the markup with an HTML parser, for example the standard library's html.parser module or a third-party package such as BeautifulSoup. For more info, refer to the answers here: Extracting text from HTML file using Python
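One way to pull the visible text out of HTML with nothing but the standard library is sketched below. The class name TextExtractor and the function html_to_text are illustrative, and skipping script and style content is a choice made here rather than something the question asked for.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a file on disk, read it first (for example with open(path, encoding="utf-8")) and pass the resulting string to html_to_text.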
For PDF files, you can use a Python module called PyPDF2 (development has since moved to pypdf, which is essentially the same). Install it using pip:
$ pip install PyPDF2
and get started.
Here is a simple example program, adapted from one I found on the internet:
import PyPDF2

# open the PDF in binary mode; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # create a PDF reader object
    # (PdfReader replaces the deprecated PdfFileReader)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # print the number of pages in the PDF
    print(len(pdf_reader.pages))
    # get the first page
    page = pdf_reader.pages[0]
    # extract and print the text of that page
    print(page.extract_text())
Extracting Source Code
Extracting source code is best done using Python's open function, as you did above.
For HTML files, you can do exactly what you did with text files:
with open("c:\\path\\to\\file") as file:
    print(file.read())
For PDF files, you do much the same, but pass a binary mode as the second argument to open, because a PDF contains bytes that cannot be decoded as text. For more info, visit the sites in the More Info section.
with open("c:\\path\\to\\file.extension", "rb") as file:  # "rb" = read binary
    data = file.read()  # raw bytes, not text
print(data[:20])  # a PDF starts with b'%PDF-'
# to save a copy, write the bytes back out in "wb" (write binary) mode
with open("c:\\path\\to\\save\\in.extension", "wb") as out:
    out.write(data)
More Info
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
https://www.programiz.com/python-programming/methods/built-in/open

Reading .htm/.html files this way should work with no problem: they are basically just text files. Above, I only see that reading the .pdf failed. Was there a problem with .htm?
Also, reading a .pdf may be much more involved than you think. A PDF contains a lot more than plain text and cannot be meaningfully edited in, say, Notepad. As an example of what I mean, here's a small sample of what I got when I opened a .pdf in Notepad:
%PDF-1.7
%âãÏÓ
1758 0 obj
<</Filter/FlateDecode/First 401/Length 908/N 51/Type/ObjStm>>stream
hÞ”ØQk\7à¿2ÍK,i4
Cã(Á”¾•–öâ.Ýn‚w]òó3rm˜Ÿ =ÄÜÝèΑ®?ÉÍ…e¦ê?Å/2e¥ÂJÙˆ+SÉT«ù7$"T„ZËT”´ù2£®L~©¯fÊ©±É–iÌ(¦ÄF¹&OðÑ’Œ|hnžU}Žñ¾®ûDOÉæCÄç'¿IF¸/±Å¿”±/ÿ!¾›Ú˜Æ>¤ùeiêóuÚ3õ®äU̘Է’Ìhì´$!_Êœ3©o­úaÇÖÅÏç·rGòuê‡Gé¾é>Žà›ì¾õä›ò£Õì›ðѵx¨ùQXÇ3ð'åC=ªJÃ6óç:¯Öý—ZòóúI¹ù…Ÿ3—ñ$<Éw‘èÍ›«›/dz/¸z¿¿?Ço'ÑoW¿îÆõX矮¯}Ý»ítþ#?~ö¥ç_ü”×éÓÕÇíÛyü6Ç÷·»û͇åòøé÷ýù°ýôöá´?n§}8ž·Ãa·ÿÜ>ßÞo‡ý¿§Wat£õ…Ñ~ûÏ[ýQÌÍß»¯çížRŽI
$L’ù¤“úËI%Ã$OâTHb˜dóI5&$(éé´SI“€ˆE”-&Š("4&E”=$1ÁPDYa1 ˆ`(‚çEä“€†"x^DŽÁ#C</"ÇŽ` ¢B</"ÇŽ¨#D…"x^DŽQˆ
EÔ±#*Q¡ˆº "vD"*QDÄŽ¨#„#uADì"Š¨"bG!P„Ì‹(±#ˆ(BæE”ØD!ó"Jì"!ó"JìˆD4(BæE”Ø
ˆhPD[;¢
Šh"bG4 ¢AmADìˆD(ÑDÄŽP B¡ˆ¶ "v„
E輎¡#„B:/‚cG(¡P„΋àØ
Dt(BçEpìˆDt(BçEpìˆDt(¢/ˆˆÑˆEô±#:Ñ¡ˆ¾ "vD"Šè"bGaPD_;€ƒ"l^Da#„A6/¢ÆŽ0  ›QcG1Þ¡¨y5–DN eA6¢Ö‹¬‚² ‹ç#O…ÉEzQ•ð›ª´#£]„¡wU ¿¬J:ô"ñPüŸÑçSÿ(íÃñ¯íÛÿA?û°§7¿8ìBÀawü‡nww›ßû]€ %“xw
endstream
endobj
1759 0 obj
<</Filter/FlateDecode/First 1907/Length 3450/N 200/Type/ObjStm>>stream
There are, however, options. I would suggest reading the page at https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ as a starting point.
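Putting the pieces together, one option is to pick a reader function based on the file extension. This is only a sketch: the names READERS, read_txt, read_html, read_pdf and build_corpus are invented here, the HTML handling is deliberately crude, and the PDF branch assumes pypdf is installed.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Very small HTML-to-text helper: keeps every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def read_txt(path):
    return Path(path).read_text(encoding="utf-8")


def read_html(path):
    parser = _TextOnly()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    parser.close()
    return " ".join(parser.parts)


def read_pdf(path):
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# map each supported extension to the function that reads it
READERS = {".txt": read_txt, ".htm": read_html, ".html": read_html, ".pdf": read_pdf}


def build_corpus(folder):
    """Concatenate the text of every supported file in folder, sorted by name."""
    texts = []
    for path in sorted(Path(folder).iterdir()):
        reader = READERS.get(path.suffix.lower())
        if reader is not None:  # silently skip unsupported extensions
            texts.append(reader(path))
    return "\n".join(texts)
```

Calling build_corpus on the folder of documents would then play the role of corpus_raw in the question.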

Related

How to merge pdf files using python without storing them into the local directory

I have some PDF files uploaded on a remote server. I have a URL for each file, and the files can be downloaded by visiting those URLs.
My question is: how can I merge all the PDF files into a single file without storing them in a local directory, using the Python module 'PyPDF2'?
Please move to pypdf. It's essentially the same as PyPDF2, but the development will continue there (I'm the maintainer of both projects).
Your question is answered in the docs:
https://pypdf.readthedocs.io/en/latest/user/streaming-data.html
Instead of writing to a file, you write to an io.BytesIO stream:
from io import BytesIO
# e.g. writer = PdfWriter()
# ... do what you want to do with the PDFs
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    data = bytes_stream.read()  # that is now the "bytes" representation
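Applied to the merge itself, the same idea might look like the sketch below. It assumes pypdf is installed and that each PDF has already been downloaded into a bytes object; the function name merge_pdf_bytes and the URLs in the usage comment are placeholders.

```python
from io import BytesIO


def merge_pdf_bytes(pdf_blobs):
    """Merge PDFs given as bytes objects and return the merged PDF as bytes."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for blob in pdf_blobs:
        reader = PdfReader(BytesIO(blob))  # read from memory, not from disk
        for page in reader.pages:
            writer.add_page(page)
    with BytesIO() as out:
        writer.write(out)
        return out.getvalue()


# Hypothetical usage (placeholder URLs):
# from urllib.request import urlopen
# urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
# merged = merge_pdf_bytes([urlopen(u).read() for u in urls])
```

Nothing touches the local file system here; the downloads and the merged result live only in memory.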

Extract First Page from Multiple PDF Documents in Python

I have a document library which consists of several thousand PDF documents. I am trying to extract the first page from each document. The extracted page should then be stored individually in a folder called "First Page".
I have written the script below as a means of printing the first page of each document. I have been able to extract the first page from some of the documents in my library, but the vast majority have not been exported. Examining the terminal output, I note that a lot of errors are thrown with the comment "Superfluous whitespace found in object header b'21' b'0'". I have searched online but am unable to locate anything of relevance.
I have three questions:
Would anyone have any idea how I can address the Superfluous whitespace issue?
My documents seem to be exporting as unreadable or damaged files. Is there something missing from my code?
My documents are also not exporting to my required output directory. I am unsure how I point the extracts to this directory. Would anyone be able to help with this also?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
# get the file names in the directory
input_directory = 'Fund Docs'
entries = os.listdir(input_directory)
output_directory = 'First Pages'
outputs = os.listdir(output_directory)
for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(input_directory + '/' + entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()
    outputFileName = 'First_Page' + entry + '.pdf'
    with open(outputFileName, 'wb') as out:
        pdf_writer = PyPDF2.PdfFileWriter(out)
    print('created ', outputFileName)
Several points:
Use pypdf (PyPDF2 is deprecated).
Use PdfReader (PdfFileReader is deprecated). PdfReader now has strict=False by default.
The "Superfluous whitespace found" message is only a warning with strict=False. You see that message because the PDF is not completely standards-compliant. You can silence the warning: https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html
When you write a question, you should also mention which version of the critical libraries you're using.
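On the other two questions, note that the posted script creates a writer but never adds a page to it or calls write, and it opens the output file by bare name, so the file is empty and lands in the current working directory. A minimal sketch of one way to address both follows. It assumes pypdf is installed; the helper names first_page_path and extract_first_page are made up for illustration.

```python
from pathlib import Path


def first_page_path(source, output_dir):
    """Build the output path, e.g. 'First Pages/First_Page_report.pdf'."""
    return Path(output_dir) / f"First_Page_{Path(source).stem}.pdf"


def extract_first_page(source, output_dir):
    """Write the first page of the PDF at source into output_dir."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    writer = PdfWriter()
    writer.add_page(reader.pages[0])  # copy the first page into the new document
    target = first_page_path(source, output_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as out:
        writer.write(out)  # without this call the output file stays empty
    return target
```

Looping over Path('Fund Docs').glob('*.pdf') and calling extract_first_page(path, 'First Pages') would cover the whole library.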

Converting pdf files to txt files but only getting last page of pdf file

I'm trying to convert a list of PDF files in a directory to txt. At the moment, however, I'm only getting the last page of each PDF file in the newly created .txt files.
The code:
import os, PyPDF2
import re
for file in os.listdir("Documents/Python/"):
    if file.endswith(".pdf"):
        fpath=os.path.join("Documents/Python/", file)
        pdffileobj=open(fpath,'rb')
        pdfreader=PyPDF2.PdfFileReader(pdffileobj)
        x=pdfreader.numPages
        pageobj=pdfreader.getPage(x-1)
        text=pageobj.extractText()
        newfpath=re.sub(".pdf","txt",fpath)
        file1=open(newfpath,"a")
        file1.writelines(text)
Expected result: .txt files containing all the pages.
You are only getting the text of the last page because that is the only page you ever read from each PDF: pageobj=pdfreader.getPage(x-1)
Although it works, pdfreader.numPages appears to be deprecated now. If you want the number of pages, use len(pdfreader.pages). You can also loop over the page objects directly, without getting the page count first: for page in pdfreader.pages:
The main thing you are missing is a second loop that goes through each page of the PDF and extracts the text.
import os
from PyPDF2 import PdfReader

for file in os.listdir("/your/path"):
    if file.endswith(".pdf"):
        fpath = os.path.join("/your/path", file)
        # swap the extension without touching the rest of the path
        newfpath = os.path.splitext(fpath)[0] + ".txt"
        reader = PdfReader(fpath)
        with open(newfpath, "w", encoding="utf-8") as out:
            # loop through each page and write its text to the file
            for page in reader.pages:
                out.write(page.extract_text())
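A side note on the output name: the question's code builds it with re.sub(".pdf","txt",fpath). In a regular expression the dot matches any character, the pattern is not anchored to the end of the string, and the replacement drops the dot. The sketch below shows the effect on a made-up file name and one alternative using pathlib; txt_path_for is an illustrative helper, not part of any library.

```python
import re
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the same path with its final extension replaced by .txt."""
    return Path(pdf_path).with_suffix(".txt")


example = "Documents/Python/my_pdf_notes.pdf"  # made-up file name

# "_pdf" in the middle matches ".pdf" too, and the dot is lost at the end
fragile = re.sub(".pdf", "txt", example)
print(fragile)  # Documents/Python/mytxt_notestxt

robust = txt_path_for(example)
print(robust.as_posix())  # Documents/Python/my_pdf_notes.txt
```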

Python PdfReader: Getting error when sequentially reading PDFs in a folder: Errno 2 (No such file or directory): 'filename.pdf'

I'm trying to put together code that will read through a folder of PDFs and scrape relevant information such as part names, numbers, materials, and final treatments. The (presumably) problematic part of the code is:
for fp in os.listdir(path):
    pdfFileObj = open(fp, 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()
    Title, part_number, material, f_treatments = extractText(text)
    printAll(Title, part_number, material, f_treatments)
    pdfFileObj.close()
where path = r'C:\Users\myname\Documents\TargetFile'
It reads the first file (1.pdf) in TargetFile successfully, but returns this upon reading the second file (2.pdf):
[Errno 2] No such file or directory: '2.pdf'
This is peculiar, given that it needs to know that 2.pdf is in the folder in order to report this error message. I suspect that fp in os.listdir() is detecting the file, but that the pdfFileObj = open(fp, 'rb') command isn't finding it, as the error is reported from that line.
Do you know what the issue might be, based on the information I've provided?
I thought that closing the document at the end of the loop body would help, but this doesn't seem to make a difference. I've never worked with 'rb' (read-binary) mode before, but since it works for the first file I don't expect it to be the issue.
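One plausible cause, going only by the code shown: os.listdir returns bare file names such as '2.pdf', not full paths, so open(fp, 'rb') looks in the current working directory instead of in path. If 1.pdf opened, a file with that name may also exist in the working directory, though that is a guess. A minimal sketch of joining each name onto the folder follows; the helper name pdf_paths is invented for illustration.

```python
import os


def pdf_paths(folder):
    """Return the full paths of the .pdf files in folder, sorted by name."""
    return [
        os.path.join(folder, name)  # listdir yields names, so prepend the folder
        for name in sorted(os.listdir(folder))
        if name.lower().endswith(".pdf")
    ]
```

The loop would then read for fp in pdf_paths(path): with the rest of the body unchanged.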

Converting Binary Data to PDF in Python/Django

I'm working on upgrading a legacy system and have come across a table full of .pdf files saved as binary data. I have dumped the table into a csv file and am trying to write a script which will take each row and recreate the files that were uploaded in the first place so that I can upload the files to S3.
I have tried this:
new_file = open(file_name, "wb")
doc = doc.encode('utf-8')
new_file.write(doc)
new_file.close()
where file_name is the saved file name and doc is the binary data stored as a string in the database.
But all it gives me is a bunk PDF file with the binary data in it.
Here is what the stored data looks like. It's just the first bit, as the whole thing is way too big to copy and paste.
0x255044462D312E340A25E2E3CFD30D0A312030206F626A0A3C3C200A2F43726561746F72202843616E6F6E2069522D4144562043353034352020504446290A2F4372656174696F6E446174652028443A32303133303432393133303830342D303527303027290A2F50726F647563657220285C3337365C3337375C303030415C303030645C3030306F5C303030625C303030655C303030205C303030505C303030445C303030465C303030205C303030535C303030635C3030305C0A615C3030306E5C303030205C3030304C5C303030695C303030625C303030725C303030615C303030725C303030795C303030205C303030315C3030302E5C303030305C303030655C3030305C0A205C303030665C3030306F5C303030725C303030205C303030435C303030615C3030306E5C3030306F5C3030306E5C303030205C303030695C3030306D5C303030615C303030675C3030305C0A655C303030525C303030555C3030304E5C3030304E5C303030455C303030525C3030305C303030290A3E3E200A656E646F626A0A322030206F626A0A3C3C200A2F5061676573203320302052200A2F54797065202F436174616C6F67200A2F4F7574707574496E74656E747320313120302052200A2F4D6574616461746120313220302052200A3E3E200A656E646F626A0A342030206F626A0A3C3C202F54797065202F
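The sample starts with 0x followed by hexadecimal digits, and 25 50 44 46 are the ASCII codes for %PDF, so the column appears to hold a hex dump of the file rather than the raw bytes. Encoding that string as UTF-8 writes the hex digits themselves into the file. A minimal sketch of decoding it instead, assuming every row has this 0x-prefixed hex form; the function name hex_dump_to_bytes is made up for illustration.

```python
def hex_dump_to_bytes(doc):
    """Convert a '0x...' hex string, as dumped from the database, to raw bytes."""
    text = doc.strip()
    if text[:2].lower() == "0x":
        text = text[2:]  # drop the prefix; fromhex expects hex digits only
    return bytes.fromhex(text)


# the first few bytes of the sample above
sample = "0x255044462D312E340A"
pdf_bytes = hex_dump_to_bytes(sample)
print(pdf_bytes)  # b'%PDF-1.4\n'

# Writing a full row out would then be:
# with open(file_name, "wb") as new_file:
#     new_file.write(hex_dump_to_bytes(doc))
```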
