Problem loading text from searchable pdfs ("PSKeyword" error) - python

I have a problem extracting text with pdfplumber. The PDF is searchable, and other files of the same kind work fine, but there is one invoice that cannot be loaded correctly. I get this error:
cannot convert 'PSKeyword' object to bytearray
From what I've noticed, the problem goes away if I open the file in any PDF program and save it again.
Has anyone had a similar problem? I would like to do that re-save step in Python somehow. Does anyone have any ideas?
Thanks,
Norbert.
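The re-save workaround described in the question could be scripted roughly as follows. This is only a sketch: it assumes the third-party packages pikepdf and pdfplumber are installed, the function names (resave_pdf, extract_text, extract_with_fallback) are made up for illustration, and whether rewriting the file actually gets rid of the 'PSKeyword' error for a particular invoice is an assumption that has to be checked against the real file.

```python
from pathlib import Path


def resave_pdf(src, dst):
    """Rewrite the PDF, similar to 'open and save again' in a viewer."""
    import pikepdf  # third-party; assumed to be installed

    with pikepdf.open(src) as pdf:
        pdf.save(dst)
    return dst


def extract_text(path):
    """Join the text of all pages as extracted by pdfplumber."""
    import pdfplumber  # third-party; assumed to be installed

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_fallback(path, extract=extract_text, resave=resave_pdf):
    """Try the file as-is; if extraction fails, re-save it and retry once."""
    try:
        return extract(path)
    except Exception:
        # Broad on purpose for this sketch; narrow it to the error you
        # actually see (a TypeError in the question) in real code.
        repaired = str(Path(path).with_suffix(".resaved.pdf"))
        resave(path, repaired)
        return extract(repaired)
```

Calling extract_with_fallback("invoice.pdf") would then write invoice.resaved.pdf next to the original only when the first attempt fails.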

Related

Unable to print pdf created by PyPDF2

I created a modified PDF using the Python library PyPDF2, but when I try to print it, all I get is a set of blank pages. Adobe Reader is also unable to read the newly created PDF file, which makes me think that the printer is most probably reading the PDF but that its format is not suitable for printing. The problem is that I don't know much about PDFs and their formats.
For example a pdf that does print has a format:
And for the pdf that I created and which doesn't print, the format is:
Hence, I would like to know what is the problem with the PDF I created using PyPDF2 and how can I fix it for printing it.
Edit:
The code used by me can be seen here on my previous post.

Jupyter Notebook issue

I ran some commands in Jupyter Notebook and expected the data from a .csv file to be printed in tabulated form, but I get incomplete output instead.
This is the result I get from the .csv file:
I ran this command:
df1=pandas.read_csv("supermarkets.csv", on_bad_lines='skip')
df1
I expected the output to be printed as a table, like in the attached image.
The data gets printed in well-tabulated form here
Here is a link to the online version of the file
[pythonhow.com/supermarkets.csv]
Getting good, clean data where the file extension correctly matches the actual content is often a challenge. Assessing the state of the input data is almost always a very important first step.
It appears the data you are trying to get is also online here. GitHub will render that as a table in the browser because it has a viewer mode. To look at the 'raw' file content, click here. You'll see it is a nice comma-delimited file with columns separated by commas and each row on its own line. The header with the column names is on the first line.
Now open the file you are actually working with in a good text editor and compare it to the content I pointed you at. That should show you what the issue is.
At this point you may just wish to switch to using the version of the file that I pointed you at.
Use the link below to obtain it as proper csv file:
https://raw.githubusercontent.com/kenvilar/data-analysis-using-python/master/supermarkets.csv
You should be able to paste that link in your browser and then right-click on the page and choose 'Save as..' to download it to your local machine. The obtained file should open just fine using the code you showed in the screenshot in your post here.
Please work on writing better questions with specific titles, see here for guidance. The title at present is overly broad and is actually not accurate. This code would not work with the data you apparently have even if you were running it inside a Python code-based script. And so it is not a Jupyter notebook issue. For how to think about making it specific, a good thing to keep in mind is to write for your future self. If you continue to use notebooks you'll have hundreds that would be considered a 'Jupyter Notebook issue', but what makes this issue different from those?
I believe there is an issue with your csv file, not the code.
To me it looks like the data in your csv file are written in json format.
Have you opened the supermarkets.csv file using Excel? It should look like a table, not a JSON-formatted file.
Did you try df1.head() to see if the csv got read in the first place?
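The advice in these answers (look at what the file really contains before blaming the code) can be sketched in a few lines. This is a minimal illustration, not a general format detector: looks_like_json and load_table are hypothetical helper names, and load_table assumes pandas is installed and that the file is UTF-8 text.

```python
import io
import json


def looks_like_json(text):
    """Return True if the whole text parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def load_table(path):
    """Pick the pandas reader based on the content, not the extension."""
    import pandas  # third-party; assumed to be installed

    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    if looks_like_json(text):
        return pandas.read_json(io.StringIO(text))
    return pandas.read_csv(io.StringIO(text))
```

If looks_like_json returns True for supermarkets.csv, the file is JSON that happens to carry a .csv extension, which would explain why read_csv does not produce a proper table.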

pdf in python which consist data from .xlsx file and png image

I want to create a PDF using Python 3.x.
The PDF should contain some text data that is stored in a .xlsx file, i.e., the program should read the data from the .xlsx file and write it into the .pdf file.
Along with that, the pdf should have a png image of passport size.
I have come up with two basic ideas which are:-
The first is to write a program that creates a text file containing all the data required for the PDF along with the png image, and then converts that file into a PDF.
The second is to write a program that creates the PDF file directly, writes the data from the .xlsx file into it, and inserts the image as well.
I don't know whether these ideas are workable or how to implement them. After doing some research on GFG, Stack Overflow and similar sites, I got thoroughly confused and ended up asking the question here.
I have tried some modules such as PIL, FPDF and reportlab, and I can successfully create a PDF file with either text or images, but I am unable to combine both in the same PDF file.
Also I am confused in deciding which idea I should implement.
What I need from you are answers to a few questions:
Are the ideas I mentioned above (especially the second one) practically possible?
Can I write a program that puts data from a file as well as a png image into the same PDF? Which modules and functions would be used, and how?
Please provide the code with comments, or explain what the functions used do.
I hope to get the desired result soon. Meanwhile, I will try to solve it myself.
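The second idea (read the .xlsx file and write text plus image straight into one PDF) might look something like the sketch below. It assumes the third-party packages openpyxl and fpdf are installed; rows_to_lines and build_pdf are hypothetical names, and the image position, the 35 mm by 45 mm passport size and the " | " separator are arbitrary choices made for the example. The built-in Helvetica font only covers Latin-1 text.

```python
def rows_to_lines(rows):
    """Turn worksheet rows (tuples of cell values) into printable lines."""
    lines = []
    for row in rows:
        cells = [str(c) for c in row if c is not None]
        if cells:  # skip completely empty rows
            lines.append(" | ".join(cells))
    return lines


def build_pdf(xlsx_path, png_path, pdf_path):
    """Write the first worksheet and a passport-size photo into one PDF."""
    from openpyxl import load_workbook  # third-party; assumed installed
    from fpdf import FPDF               # third-party; assumed installed

    sheet = load_workbook(xlsx_path).active
    lines = rows_to_lines(sheet.iter_rows(values_only=True))

    pdf = FPDF()  # defaults: A4 portrait, millimetres
    pdf.add_page()
    # Photo in the top-right corner, 35 mm wide and 45 mm high.
    pdf.image(png_path, x=160, y=10, w=35, h=45)
    pdf.set_font("Helvetica", size=12)
    pdf.set_y(60)  # start the text below the photo
    for line in lines:
        pdf.cell(0, 8, line)
        pdf.ln(8)  # move to the start of the next line
    pdf.output(pdf_path)
```

A call such as build_pdf("data.xlsx", "photo.png", "result.pdf") would produce a single file with both the image and the text, with no intermediate text file needed.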

Can't get text out of PDF file with PyPDF2

I downloaded a PDF file and am trying to get the text out of it with PyPDF2.
Here is my code:
import PyPDF2
reader = PyPDF2.PdfFileReader('download.pdf')  # open the file once and reuse the reader
if not reader.isEncrypted:
    print(reader.getPage(0).extractText())  # text of the first page
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file, its content looks fine. Another program that converts PDF to txt also works fine on it. The PDF is rendered by JavaScript on a web page; I don't know whether that makes any difference.
Under Win 7, Python 3.6, I had the problem that PyPDF2 did not properly encode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
Worked perfect for me...
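For reference, a minimal sketch of the pdfminer.six route could look like this. It assumes pdfminer.six is installed and uses its high-level extract_text function; extract_first_page and looks_garbled are hypothetical helper names, and the garbled-text check is a crude heuristic added for illustration only (it counts ASCII letters and digits, so it would misjudge legitimate non-English text).

```python
def extract_first_page(path):
    """Extract the text of the first page with pdfminer.six."""
    from pdfminer.high_level import extract_text  # third-party; assumed installed

    return extract_text(path, page_numbers=[0])


def looks_garbled(text, threshold=0.5):
    """Guess whether extracted text is mostly symbols instead of words.

    Crude heuristic: the share of ASCII letters and digits among all
    non-whitespace characters is below the threshold.
    """
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return False
    readable = sum(1 for c in visible if c.isascii() and c.isalnum())
    return readable / len(visible) < threshold
```

A check such as looks_garbled(extract_first_page('download.pdf')) gives a quick hint whether the extractor produced usable text for that particular file.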
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText() Locate all text drawing commands, in the order they are
provided in the content stream, and extract the text. This works well
for some PDF files, but poorly for others, depending on the generator
used. This will be refined in the future. Do not rely on the order of
text coming out of this function, as it will change if this function
is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the pdf itself.

What does preview app of OS X do to help extracting from pdf?

When I extracted content from a 12-page PDF file using my pdfminer-based program, I got a wrong result with only 11 pages. I tested the program with other files and got the right result in most cases.
By accident, I opened the file with the Preview app in OS X Yosemite (v10.10.4) and saved it without making any other changes. After that, my program produced the right result. I noticed that Preview had reduced the file size from 2 MB to 300 KB, but I have no idea what it did.
I tried searching for an answer, but most topics are about using Preview's export function to compress PDF files, and no one seems to have come across the same problem with pdfminer.
1. What does the Preview app do to a PDF file on "save"?
2. How can I deal with this problem?
Thanks in advance!
PDF is a complex file format which supports many different features and ways of doing things. Your pdfminer app apparently has problems with some of those features, which causes it to misinterpret certain files. Preview on the other hand seems to correctly support everything and was able to correctly read the file into its internal presentation format. When you then re-saved the file, Preview wrote it in the way that it would write the same information. Again, lots of different ways to do the same thing means different programs will do things differently.
Preview apparently has a better, more compatible, more streamlined way to express the same content; and your pdfminer can handle it better.
