Extracting text from multiple powerpoint files using python - python

I am trying to find a way to look in a folder and search the contents of all of the powerpoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report out the text after that string as well as what document it was found in. I would like to compile the information and report it in a CSV file.
So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.

Actually working
If you want to extract text:
import Presentation from pptx (pip install python-pptx)
for each file in the directory (using the glob module)
look in every slide and in every shape on each slide
if a shape has a text attribute, print shape.text
from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)

tika-python
A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.
Note: It also works charmingly with pyinstaller
Install with pip :
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # to get the metadata of the file
print(parsed["content"])   # to get the content of the file
Link to official GitHub

python-pptx can be used to do what you propose. Just at a high level, you would do something like this (not working code, just an idea of the overall approach):
from pptx import Presentation

for pptx_filename in directory:
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            print(shape.text)
You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
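To sketch those missing bits, here is a minimal standard-library example of the searching and CSV-reporting side. Everything in it is a hypothetical placeholder: `SEARCH_STRINGS`, the sample `extracted` pairs (standing in for the `(filename, shape.text)` pairs the pptx loop above would yield), and the `report.csv` output name.

```python
import csv

SEARCH_STRINGS = ["Project:", "Owner:"]  # hypothetical key strings


def find_matches(text, keys):
    """Return (key, text-after-key) pairs for each key found in text."""
    hits = []
    for key in keys:
        idx = text.find(key)
        if idx != -1:
            hits.append((key, text[idx + len(key):].strip()))
    return hits


# Stand-in for the (filename, shape.text) pairs a pptx loop would produce.
extracted = [
    ("deck1.pptx", "Project: Apollo"),
    ("deck2.pptx", "Nothing relevant here"),
]

rows = [(fname, key, after)
        for fname, text in extracted
        for key, after in find_matches(text, SEARCH_STRINGS)]

with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "search_string", "text_after"])
    writer.writerows(rows)
```

Swapping the `extracted` list for the real `glob`/`Presentation` loop gives the full pipeline the question asks for.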

Textract-Plus
Use textract-plus, which can extract text from most document extensions, including pptx and pptm.
Refer to the docs.
Install-
pip install textract-plus
Sample-
import textractplus as tp

text = tp.process('path/to/yourfile.pptx')
For your case:
import os
import pandas as pd
import textractplus as tp

files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = tp.process(os.path.join(your_dir, f))
        files_csv.append([f, text])
pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv')
This code will fetch all the pptx and pptm files from the directory and create a CSV with the first column as the filename and the second as the text extracted from that file.

import os
import textract

your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = textract.process(os.path.join(your_dir, f))
        print(text)

Related

How to check the type of file to be downloaded from a link

I was trying to download a file from a link using the following line of code:
urllib.request.urlretrieve('http://ipfs.io/ipfs/QmcgBRy',f'{UNKNOWN_FACES_DIR}\\sample2.mp4')
But the thing is I don't know what type of file is stored at the link and hence can't give an appropriate file extension before downloading it.
Is there any way to know the type of file, i.e. .jpg, .jpeg, .mp4, etc., before downloading it?
Using pure urllib, you can get the content type from the following:
import urllib
url = 'https://i.imgur.com/Woi6pwf.jpg'
urllib.request.urlopen(url).info()['content-type']
which returns:
'image/jpeg'
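If you then need an extension to save the file with, the standard-library mimetypes module can map that header value back to one. A small sketch, with the content_type value hard-coded as a stand-in for the header lookup above:

```python
import mimetypes

# Stand-in for urllib.request.urlopen(url).info()['content-type']
content_type = 'image/png'
ext = mimetypes.guess_extension(content_type)
print(ext)  # '.png'
```

Note that some MIME types map to several extensions (e.g. image/jpeg), in which case guess_extension picks one of them.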
You can use python-magic to identify the type of the file. I guess this is the best library to use for this purpose. You can do it like this:
import magic
magic.from_file("testdata/test.pdf")
# OUTPUT
# >>> 'PDF document, version 1.2'
Recommended Version
import magic
magic.from_buffer(open("testdata/test.pdf", "rb").read(2048))
# OUTPUT
# >>> 'PDF document, version 1.2'

Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

I am trying to allow the user to upload an MS Word file, and then I run a certain function that takes a string as an input argument. I am uploading the Word file through FileUpload; however, I am getting an encoded object. I am unable to decode it as UTF-8, and using upload.value or upload.data just returns the encoded text.
Any ideas how I can extract content from uploaded Word File?
> upload = widgets.FileUpload()
> upload
#I select the file I want to upload
> upload.value #Returns coded text
> upload.data #Returns coded text
> #Previously upload['content'] worked, but I read this no longer works in IPYWidgets 8.0
Modern MS Word files (.docx) are actually zip files.
The text (but not the page headers) is inside an XML document called word/document.xml in the zip file.
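You can see that structure with nothing but the standard library. A sketch: the in-memory zip below is a minimal stand-in for a real .docx, which carries the same word/document.xml entry with the text in w:t elements.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Build a minimal stand-in .docx in memory; a real .docx has the same layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "word/document.xml",
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>Hello docx</w:t></w:r></w:p></w:body></w:document>",
    )

# Reading it back: the document text lives in the w:t elements.
with zipfile.ZipFile(buf) as z:
    root = ET.fromstring(z.read("word/document.xml"))
texts = [t.text for t in root.iter(W + "t")]
print(texts)  # ['Hello docx']
```

python-docx (below) does this unzipping and XML parsing for you.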
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs. Not e.g. the text from tables.
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a dictionary, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed over different version of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
Write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
Tweaking @Roland Smith's great suggestions, the following code finally worked:
import io
import docx
from docx import Document
upload = widgets.FileUpload()
upload
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)

how to extract fields from pdf in python using pdfminer

I have a PDF form from which I need to extract the email ID, the person's name, and other information like skills, city, etc. How can I do that using pdfminer3?
please find attached sample of pdf
First, use tika to convert the PDF to text.
import re
import sys
!{sys.executable} -m pip install tika
from tika import parser
from io import StringIO
from itertools import islice
file = 'filename with directory'
parsedPDF = parser.from_file(file) # Parse data from file
text = parsedPDF['content'] # Get files text content
Now extract desired fields using regex.
You can find extensive regex tutorials online. If you have any problem implementing the same, please ask here.
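For example, to pull an email address and a couple of labeled fields out of the extracted text. This is a sketch on hypothetical résumé-style text standing in for parsedPDF['content']; the labels and patterns will need adjusting to your actual form:

```python
import re

# Hypothetical text, standing in for parsedPDF['content']
text = """John Doe
Email: john.doe@example.com
City: Mumbai
Skills: Python, SQL"""

email = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text).group()
city = re.search(r"City:\s*(.+)", text).group(1)
skills = re.search(r"Skills:\s*(.+)", text).group(1).split(", ")
print(email, city, skills)
```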
Try to use tika package:
from tika import parser
raw = parser.from_file('sample.pdf')
print(raw['content'])

Python - PyPDF2 misses large chunk of text. Any alternative on Windows?

I tried to parse a pdf file with the PyPDF2 but I only retrieve about 10% of the text. For the remaining 90%, pyPDF2 brings back only newlines... a bit frustrating.
Would you know any alternatives on Python running on Windows? I've heard of pdftotext but it seems that I can't install it because my computer does not run on Linux.
Any idea?
import PyPDF2
filename = 'Doc.pdf'
pdf_file = PyPDF2.PdfFileReader(open(filename, 'rb'))
print(pdf_file.getPage(0).extractText())
Try PyMuPDF. The following example simply prints out the text it finds. The library also allows you to get the position of the text if that would help you.
#!python3.6
import json
import fitz  # http://pymupdf.readthedocs.io/en/latest/

pdf = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')
for page_index in range(pdf.pageCount):
    text = json.loads(pdf.getPageText(page_index, output='json'))
    for block in text['blocks']:
        if 'lines' not in block:
            # Skip blocks without text
            continue
        for line in block['lines']:
            for span in line['spans']:
                print(span['text'].encode('utf-8'))
pdf.close()

How to read and print the contents of a ttf file?

Is there any way that I can open, read and write a ttf file?
Example:
with open('xyz.ttf') as f:
    content = f.readline()
    print(content)
A bit more:
If I open a .ttf (font) file with the Windows font viewer, we see the following image.
From this I would like to extract the following lines as text, with the proper style.
What exactly is inside this file with the *.ttf extension? I think you need to add more details of the input and output. If you are referring to a font-type database, you must first find a module/package to open and read it, since *.ttf isn't a normal text file.
Read the given links and install the required packages first:
https://pypi.python.org/pypi/FontTools
Then, as suggested:
from fontTools.ttLib import TTFont
font = TTFont('/path/to/font.ttf')
print(font)
<fontTools.ttLib.TTFont object at 0x10c34ed50>
If you need help with something else, try putting the input and expected output.
Other links:
http://www.starrhorne.com/2012/01/18/how-to-extract-font-names-from-ttf-files-using-python-and-our-old-friend-the-command-line.html
Here is a another useful python script:
https://gist.github.com/pklaus/dce37521579513c574d0
