I created a modified pdf using the python library PyPDF2 but when I try to print it using my printer all I get is a set of blank pages. The Adobe Reader is also not able to read the newly created PDF file which makes me think that most probably the pdf is being read by the printer but the format is inappropriate for printing. And the problem is I don't have much knowledge about PDFs and their formats.
For example a pdf that does print has a format:
And for the pdf that I created and which doesn't print, the format is:
Hence, I would like to know what is the problem with the PDF I created using PyPDF2 and how can I fix it for printing it.
Edit:
The code used by me can be seen here on my previous post.
Related
I wanted to create a pdf using Python 3x.
The pdf should have some text data which is stored in a .xlsx file i.e.., it should read data from .xlsx file and write into the .pdf file.
Along with that, the pdf should have a png image of passport size.
I have come up with two basic ideas which are:-
First one is by writing a program which create a text file in which all required data from the pdf will be written along with the png image. After that the program will convert it into a pdf file.
Second one is by writing a program which will create the pdf file and write the data from .xlsx file as well as insert the image too into the pdf file.
I don't know whether these ideas can be used or not and how it can be used but after going through some researches on GFG, Stack overflow..., I have got totally confused and ended up asking this problem on this platform.
I have tried some modules like PIL, FPDF, reportlab,.. and am successfully able to create a pdf file with either texts or images but unable to combine both in the same text file.
Also I am confused in deciding which idea I should implement.
What I need from you guys is the answer of few of my questions which are:-
Are the ideas I mentioned above(second one specially) practically possible?
Can I make a program which imports data from file as well as png image into the same pdf. What modules and functions will be used there and how.
Please provide the code with comments or defining/elaborating the work of function used.
I hope I will get the desired result soon. Meanwhile I will try to solve it out by myself.
Trying to extract text from pdf file/s using python(v 3.8.2) module pypdf2(v 1.26.0). All good except with particular pdf file/s(generated from chrome print option.)
I have these files over the period that I have generated/downloaded using chrome's print option, where there is an option to save page/document as pdf. I am not able to extract text from these pdf files as code only returns ' '(empty), no problem with other pdf files. If you would like to test yourself you can save any web page as pdf using chrome print option and use that pdf to test. Chrome(v 81.0.4044.138)
Found that chrome uses Skia to save pages as pdf but didn't help to solve the problem. (PDF Producer: Skia/PDF m80)
Found following similar question on Stack Overflow but no body has answered yet and as I am new user I can't comment or add anything hence this new question.
Extract text from pdf converted from webpage using Pypdf2
Following is the code
import PyPDF2
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
I am a new user and this is my first time posting question please correct me if I have done anything incorrect(not sure if I have). I assure you I have done my search on google found no solution or lacking knowledge to understand problem/solution. Thank you
PyPDF2 is highly unreliable for extracting text from pdf . as pointed out here too.
which says:
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
Look at my answer for similar question here
I am trying to get the text from a PDF file I downloaded with PyPDF.
Here is my code:
if not PyPDF2.PdfFileReader('download.pdf').isEncrypted:
PyPDF2.PdfFileReader('download.pdf').getPage(0).extractText()
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file its content is fine. Also when I use another program to transform pdf into txt it works fine. It is a javascript rendered pdf on a webpage, don't know if it makes any difference.
Under Win 7, Python 3.6, I had the problem that PyPDF2 did not properly encode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
Worked perfect for me...
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText() Locate all text drawing commands, in the order they are
provided in the content stream, and extract the text. This works well
for some PDF files, but poorly for others, depending on the generator
used. This will be refined in the future. Do not rely on the order of
text coming out of this function, as it will change if this function
is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the pdf itself.
I'm currently making a program in python that creates data and then gets stored into a text file. The data is in a column like formation and when i change the file format to csv, it opens LibreOffice Calc (raspberry pi's version of excel) which is exactly how i wanted the data to be formatted.
But i want to take it one step further and convert my CSV file data into a PDF. I've looked on the web and it says how to convert a pdf into a csv which isn't what i want. I also saw something called pyPDF but im not sure about if that would be of any use.
This is the string of data that is being looped 10 times,
resultStr = 'Test,{},InNum,{},stats,{},Duration(ms),{} \n'.format("OFF",inPin, result, round(duration*1000))
Once the loop finishes, a text file gets opened and the 'resultStr' is the string is getting stored.
Thanks everyone for your help,
~Neamus
Using ReportLab, you can programatically generate PDF documents with your data. There are plenty of examples available to demonstrate the framework and how to use it. In your case, you should simply append to your document story in a loop for each of your CSV result strings.
I am trying to read a pdf through url. I followed many stackoverflow suggestions and used PyPdf2 FileReader to extract text from the pdf.
My code looks like this :
url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)
print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()
I am able to successfully extract text for first link. But if I use the same program for the second pdf. I do not get any text. The page numbers and document info seem to show up.
I tried extracting text from Pdfminer through terminal and was able to extract text from the second pdf.
Any idea what is wrong with the pdf or is there a drawback with the libraries I am using ?
If you read the comments in the pyPDF documentation you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a restriction of the library.
Looking at the two PDF files, I can't see anything wrong with the files themselves. But...
The first file contains fully embedded fonts
The second file contains subsetted fonts
This means that the second file is more difficult to extract text from and the library probably doesn't support that properly. Just for reference I did a text extraction with callas pdfToolbox (caution, I'm affiliated with this tool) which uses the Acrobat text extraction and the text is properly extracted for both files (confirming that it's not the PDF files that are the problem).