I will explain my dilemma first: I have several thousand PowerPoint files (.ppt) from which I need to extract the text. The problem is that the text is disorganized in the file, and when read as a complete page it makes no sense for what I need (in the example it would read: line 1, line 3, line 2, line 4, line 5).
I was initially using tika to read the files. I then thought that converting to PDF using glob and win32com.client would give me better luck, but the result is basically the same. The picture here is an example of what the text is like.
So now my idea is: if I can section the PDF or PPT by pixel location (saving to separate temp files if needed, to be opened and read that way), I can keep things in order and get what I need. Although the text moves around within each box, the black outlined boxes are always in roughly the same location.
I cannot find anything that splits an individual PDF page into regions, though; only tools that split a multi-page PDF into single pages. Does anyone have an idea how to go about this?
I need to read the text in box one together (line 1 and line 2) and load it into a dictionary or some other container, and the same for the second box. For reference, there is only one slide in the PowerPoint.
Allow me to provide the answer as a general guideline:
A .pptx file is a glorified .zip archive; the older binary .ppt format is not, so convert your .ppt files to .pptx first.
Use 7-Zip or WinZip to open a .pptx and inspect its structure.
Each slide then has an .xml file full of tags you can parse.
For example you will find tags for each text box with tags for that box's text nested inside.
Also: python-pptx
Mass convert by tweaking this VBA code: Link for VBA
Or using PowerShell: Link for PowerShell
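As a sketch of the parsing step with python-pptx (the file name is a placeholder, and grouping by the shape's top/left position is an assumption based on the question's fixed box layout): shapes report their position in EMUs via `shape.top` and `shape.left`, so each text box's lines stay together and the boxes can be sorted top-to-bottom.

```python
def read_boxes(shapes):
    """Collect (top, left, text) for every shape that has a text frame,
    sorted top-to-bottom, then left-to-right."""
    boxes = []
    for shape in shapes:
        if getattr(shape, "has_text_frame", False):
            boxes.append((shape.top, shape.left, shape.text_frame.text))
    return sorted(boxes)

if __name__ == "__main__":
    from pptx import Presentation          # pip install python-pptx
    prs = Presentation("slide1.pptx")      # hypothetical file name
    slide = next(iter(prs.slides))         # the question says one slide per file
    for top, left, text in read_boxes(slide.shapes):
        print(top, left, repr(text))
```

The `(top, left)` tuples can serve directly as dictionary keys if a dict per box is wanted instead of a sorted list.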
I want to be able to change the text in a PDF file (for example: abc@gmail.com to xxx@xxxxx.com), keeping the other contents and format of the PDF file unchanged. I tried using pdfrw (something based on https://github.com/pmaupin/pdfrw/blob/master/examples/fancy_watermark.py ) but am only able to add a watermark to the page. This watermark does not completely hide the objects under it.
So I'm looking for some other methods to do the same. Is there any way to do it?
I'm working on a project for a friend of mine.
I want to find one specific keyword that appears on multiple pages of a large PDF file (40-60 pages and above), and that also has duplicates in other places, then record in memory which pages the keyword was found on, split those pages from the original PDF file, and lastly merge them together.
I'm thinking about using PDFMiner or PyPDF2 (i'm open to other suggestions as well)
I have already written most of the code, but I can't figure out a good and efficient way to search the file for that keyword, since the keyword also appears elsewhere in the same file, and to make sure the data I extract from the original file isn't duplicated and that all of it was extracted.
Thanks in Advance.
Did you try splitting the PDF file into a couple of blocks and searching for the keyword in each block with multithreading? That should be faster.
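A single-threaded sketch of the search-and-split step, assuming the newer PyPDF2/pypdf API (file names and the keyword are placeholders): collecting each matching page index exactly once guarantees the output contains no duplicated pages, and the multithreaded variant from the answer would simply run `find_keyword_pages` over slices of `reader.pages`.

```python
def find_keyword_pages(pages, keyword):
    """Return 0-based indices of pages whose extracted text contains
    the keyword; each index appears at most once, so no page is duplicated."""
    hits = []
    for i, page in enumerate(pages):
        text = page.extract_text() or ""   # extract_text() can return None
        if keyword in text:
            hits.append(i)
    return hits

if __name__ == "__main__":
    from PyPDF2 import PdfReader, PdfWriter
    reader = PdfReader("big.pdf")          # hypothetical input file
    writer = PdfWriter()
    for i in find_keyword_pages(reader.pages, "KEYWORD"):
        writer.add_page(reader.pages[i])
    with open("matches.pdf", "wb") as out: # hypothetical output file
        writer.write(out)
```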
I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF.
My code looks like this:
import PyPDF2
from io import BytesIO
from urllib.request import urlopen, Request

url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = BytesIO(f)  # a PDF is binary data, so BytesIO rather than StringIO
pdf = PyPDF2.PdfFileReader(fileInput)
print(pdf.getNumPages())
print(pdf.getDocumentInfo())
print(pdf.getPage(1).extractText())
I am able to successfully extract text from the first link, but if I use the same program on the second PDF, I do not get any text. The page count and document info do show up.
I tried extracting text with pdfminer from the terminal and was able to extract text from the second PDF.
Any idea what is wrong with the PDF, or is there a drawback to the libraries I am using?
If you read the comments in the pyPDF documentation you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a restriction of the library.
Looking at the two PDF files, I can't see anything wrong with the files themselves. But...
The first file contains fully embedded fonts
The second file contains subsetted fonts
This means that the second file is more difficult to extract text from, and the library probably doesn't support that properly. Just for reference, I did a text extraction with callas pdfToolbox (caution: I'm affiliated with this tool), which uses the Acrobat text extraction, and the text is properly extracted from both files (confirming that the PDF files are not the problem).
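One way to check for the subsetted-font case programmatically: the PDF specification requires a subsetted font's BaseFont name to carry a six-uppercase-letter tag and a plus sign before the real name (e.g. ABCDEF+Arial). A small detector for that naming convention, with a PyPDF2 resource walk shown purely as an assumed illustration (the file name is hypothetical):

```python
import re

_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def is_subset_font(base_font):
    """True if the BaseFont name carries the six-letter subset tag
    that the PDF spec mandates for embedded font subsets."""
    return bool(_SUBSET_TAG.match(base_font.lstrip("/")))

if __name__ == "__main__":
    from PyPDF2 import PdfReader
    reader = PdfReader("second.pdf")       # hypothetical file name
    for name, font in reader.pages[0]["/Resources"]["/Font"].items():
        font = font.get_object()           # resolve a possible indirect reference
        print(name, font["/BaseFont"], is_subset_font(str(font["/BaseFont"])))
```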
I have two pdf files, which are almost the same, except that the first one has OCRed text and the other doesn't, and they have different compressions.
The reason I want to do that is because there is some error in the first file's OCRed text, and the file uses the OCRed text to cover the corresponding image, which makes me unable to know what the correct text is. This is how the second file can help me.
I would like to
make the first file show the image, with the OCRed text hidden and not covering the image.
Alternatively, move the OCRed text from the first file to the second.
Alternatively, remove the OCRed text from the first file and then re-OCR it, since Adobe Acrobat can't re-OCR a PDF file that already contains OCRed text.
So I wonder if there is a Python module that can move the OCRed text layer from the first file to the second, while removing the OCRed text layer away from the first file?
If there is none, what languages might have such libraries?
Thanks!
Check out pdfminer; it's not exactly a user-friendly API, but you should be able to navigate the PDF structure and drop the obstructing text. You can come back with specific questions.
But if it's just a question of hiding the OCR, you may be able to hide it if you open the file in Acrobat; IIRC it has options for showing just the OCR, just the background, or both.
I am working on a piece of software that analyses PDF files and generates HTML based on them. There are a number of things out there that already do this so I know it is possible, I have to write my own for business reasons.
I have managed to get all the text information, positions, and fonts out of the PDF, but I am struggling to read out the colour of the text. I am currently using PDFMiner to analyse the PDF but am beginning to think I will need to write my own PDF reader; even so, I can't figure out where in the document the colour information for text is even kept! I have even read the PDF spec but cannot find the information I need.
I have scoured Google with no joy.
Thanks in advance!
The colour for text and other filled graphics is set using one of the g, rg or k operators in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual.
The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
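For illustration, a decompressed content stream using these operators might look like the fragment below (/F1 stands for whatever font name the page's resources define; % starts a comment in a content stream):

```
q
1 0 0 rg          % rg: set fill colour to red in DeviceRGB
BT
/F1 12 Tf         % select font F1 at 12 points
72 720 Td         % position the text
(Red text) Tj     % this string is filled with the current colour
ET
Q
```

A parser looking for text colour therefore has to track the most recent g, rg, or k operator seen before each text-showing operator.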
When parsing a PDF file yourself you start by reading the trailer at the end of the file, which contains the file offset of the cross-reference table. This table contains the file offset of each object in the PDF file. The objects are in a tree structure with references to other objects; one of the objects will be the content stream. This is described in sections 3.4 File Structure and 3.6 Document Structure in the PDF reference manual.
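The first step above, reading the cross-reference table offset from the trailer, can be sketched in a few lines (a simplification: linearized, encrypted, or damaged files need more care):

```python
import re

def find_xref_offset(pdf_bytes):
    """Read the trailer at the end of the file: the last lines are
    'startxref', the byte offset of the cross-reference table, '%%EOF'."""
    tail = pdf_bytes[-2048:]               # the trailer sits at the very end
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail))
    if not matches:
        raise ValueError("no startxref found; damaged or non-standard PDF?")
    return int(matches[-1].group(1))       # last one wins after incremental updates
```

Usage would be `find_xref_offset(open("file.pdf", "rb").read())`, then seeking to the returned offset to parse the cross-reference table itself.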
It is possible to parse the PDF file yourself, but this is quite a lot of work. The content stream may be compressed, contain references to other objects, contain comments, etc., and you must handle all of these cases. The PDFMiner software is already reading the content stream. Perhaps it would be easier to extend PDFMiner to report the colour of the text too?