I have a PDF form from which I need to extract the email ID, the person's name, and other information such as skills, city, etc. How can I do that using pdfminer3?
Please find attached a sample of the PDF.
First, use tika to convert the PDF to text.
import re
import sys

# In a Jupyter notebook you can install tika in place:
# !{sys.executable} -m pip install tika
from tika import parser

file = 'filename with directory'
parsedPDF = parser.from_file(file)  # parse data from the file
text = parsedPDF['content']         # get the file's text content
Now extract the desired fields using regex.
You can find extensive regex tutorials online. If you have any problems implementing this, ask here.
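For example, a minimal sketch of the regex step, assuming the fields appear in the extracted text as labeled lines like "Name: ..." and that email addresses follow the usual form (adjust the patterns to your PDF's actual layout):

import re

# Hypothetical labels -- replace with whatever your PDF actually uses.
email_match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
email = email_match.group(0) if email_match else None

name_match = re.search(r'Name\s*:\s*(.+)', text)
name = name_match.group(1).strip() if name_match else None

print(email, name)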
Try the tika package:
from tika import parser
raw = parser.from_file('sample.pdf')
print(raw['content'])
I'm trying to extract specific information from a PDF using Tika in Python. I tried to incorporate a regex into the code, but it returns an error. Here is my code:
from tika import parser
import re
parsed = parser.from_file("PDF/File.pdf")
desc = re.findall(r'((?:[A-Z][a-z]+\s*)+)\b\s*:\s*(.*?)\s*(?=(?:[A-Z][a-z]+\s*)+:|$)', parsed)
print(desc["content"])
The error returned is as follows:
TypeError: expected string or bytes-like object, got 'dict'
Is there a way to fix the error so that the regex can be applied to the parsed output?
As a maintainer of PyMuPDF, I just have to demonstrate how this works with this library:
import fitz  # PyMuPDF
import re

doc = fitz.open("PDF/File.pdf")
text = " ".join(page.get_text() for page in doc)  # all pages joined into one string
desc = re.findall(r'...', text)  # your pattern goes here
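For reference, the TypeError in the original code comes from passing the whole dict returned by parser.from_file to re.findall; applying the pattern to the dict's "content" string fixes it without switching libraries. A sketch:

from tika import parser
import re

parsed = parser.from_file("PDF/File.pdf")
text = parsed["content"]  # the extracted text lives under the "content" key

# re.findall returns a list of (label, value) tuples, not a dict,
# so iterate over it instead of indexing with "content":
desc = re.findall(r'((?:[A-Z][a-z]+\s*)+)\b\s*:\s*(.*?)\s*(?=(?:[A-Z][a-z]+\s*)+:|$)', text)
for label, value in desc:
    print(label.strip(), "->", value)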
I am trying to read the data from a draw.io drawing using Python.
Apparently the format is XML with some portions in "mxfile" encoding.
(That is, a section of the XML is deflated, then base64-encoded.)
Here's the official TFM:
https://drawio-app.com/extracting-the-xml-from-mxfiles/
And their online decoder tool:
https://jgraph.github.io/drawio-tools/tools/convert.html
So I try to decode the mxfile portion using the standard Python tools:
import zlib
import base64
s="7VvbcuI4FPwaHpOybG55BHKZmc1kmSGb7KvAArTIFiuLEObr58jINxTATvA4IVSlKtaxLFvq1lGrbWpOz3u+EXg+/c5dwmq25T7XnMuabSOr3YR/KrJaR9pIByaCurpSEhjQXyS6UkcX1CVBpqLknEk6zwZH3PfJSGZiWAi+zFYbc5a96xxPiBEYjDAzo4/UlVPdC7uVxL8QOplGd0bNi/UZD0eVdU+CKXb5MhVyrmpOT3Au10fec48wNXjRuDx+XT2y21nz5tuP4H/8T/ev+7uHs3Vj10UuibsgiC9f3fSv2fj6y0P9v3/n/esfS+umM/x2pi+xnjBb6PHqExFwX/dYrqJhDJbUY9iHUnfMfTnQZ2AQupjRiQ/HI3g6IiDwRISkgEBHn5B8DtHRlDL3Fq/4QvUhkHg0i0rdKRf0FzSLGZxCEIDTQmoy2c1MjYG6EsIWRAUJoE4/GhgUh25xIHWdEWcMzwM6DB9YVfGwmFC/y6XkXtQQX/gucXUpRjosSMFnMXfU9Tnh0LCp0SDPKTJqeG4I94gUK6iiz8ZM01MNReVlQlzU1LFpmrROW08YPVkmcdvx7X7C5ML+BAYhuZ+zcb96zvvZzeztMAPgfSxJVw1jkKYhHKS6moRCchYgKjKIeoc9YtAURlqmKMnIWG4lZDDHI+pPbsM6l/Uk8lP3VIU4XDtmIRmm1HWJH5JFYonXfFIMmXPqy3AoGl34gwHrWeeNWgMeqAdllJThT1UXssd94BWmIYEIkHVJFGFfoNbOabufWqssYkWRTRMpA2lR/Gwz0Uy5r8h4t/CGkDaODckdGWUqPaYPy8K7YVeMt2PgfeVhqi7ruC7k6OAE+EEBb7UrBrxuAG4gzGioH/RooBfX1j3wewCkai7C+17R4fIMGZxwTE44L+DP8JCwPg+opFy1L9Z1N3hRVdZGVj0fqjuW/zeB2jCz9kKMpjhQiRtk1wyGNzw6wvlcGqio6tzcNFAdyIWruplT9Vsn1X841Y82VL/TLFf1ow3V77Tfr+pvbWfqserGnGmnmZtm72UH0Daw7MDTK/fGtr7DUnJ0SB5UEBbGu/IdwMVJEB4c1Lwqvyw9iEy/8CskfusK0AiXWtu656rsC65aO7IZndZA9bIwbledqJHptd0QteIOiEd9LBTg93hGTJP4o+NbFqTVS/7oAXZlY+K7HfXCBUpDxpXa7kJIy3FkrYvXlEUr1x69nF3+iDsh0dQhbMiXV0mgGwbgRMSUwmo74LAtJfshg/3FhOTYzamn3QnsS0AKwrCkT9n3Tju0eV8RN9HltpXV5bblZJtYd1JflX7RU7Sh9SgYDR3Mqje9v77gYxIE3JTrpx1m+TtMZ3PHl3eH2bL2kviFDaZTz7HBbL2PDSYybcsBZlhn3E+4tsWT9+NsLJHpUhroffadRnFY8+4fS9tqmC7lp1IsEWLvWrKgjUzfeqVkcTYaslsbz1K2ZDGNxm2vKU+CpXzB0rDaGTrk/hDGRjsWme2KpdH4QB/CmD7qQApCzJc3n0WxtHLT690oFtMb7VF5fJrzoA54cZwrt8Bt0y6FpC2P77O1ioGu/OMX27RMQdmrVdy2etw9AX5gwHN/GFMe4qah2oMxkUfoHFSNtfNKMXY4rE1D0wD50xsMxXFt5JRhZTkMtun9PQBE7jEu0OWh2Kw8E5v2398LOV8oe6Gj3lXeqnlwQjQ3oheV59ti1h+fh2NdzNyLfUFUvdWnx3av0xdhudfq0zgrKqVtjbp+oDe6fvH7nJgwdraJvK5fo76noS2un9HQ2eYbp412+HgckFKMQ9s0Dq3z8wj4hK6hGZdKBHvSzlBbcus1vItHs0nI3x5nXMB5nycGpHa77fw5IZpf+ieX+rFq8c/P8ht1Z29kVETMPwaXaZ7lxyrSTx8VrMPM/uib3D8OnemZMeiFWuDxVu8zJcc3UTVVcB4HP9bou7Eu5KK/kRgGAbZxJf86cXEYpjhZFz9K0m/hChSTH1yvqyc/W3eufgM="
result = zlib.decompress(base64.b64decode(s))
Throws the exception:
zlib.error: Error -3 while decompressing data: incorrect header check
Meanwhile, their tool above returns XML just fine when given the exact same data.
What am I missing?
Try this:
import zlib
import base64
import xml.etree.ElementTree as ET
from urllib.parse import unquote

tree = ET.parse(filename)  # the .drawio/.xml file
data = base64.b64decode(tree.find('diagram').text)
xml = zlib.decompress(data, wbits=-15)   # wbits=-15: raw deflate stream, no zlib header
xml = unquote(xml.decode('utf-8'))       # the decompressed XML is also URL-encoded
If you read the source of their HTML tool, you will see this:
data = String.fromCharCode.apply(null, new Uint8Array(pako.deflateRaw(data)));
They are using a JS library called pako in 'raw' mode. From the GitHub source you can get the required setting.
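Putting that together for the raw payload from the question (skipping ElementTree, since the base64 string s is already at hand), a sketch:

import zlib
import base64
from urllib.parse import unquote

data = base64.b64decode(s)
# pako.deflateRaw produces a raw deflate stream with no zlib header,
# hence the "incorrect header check" error with the default settings.
xml = zlib.decompress(data, wbits=-15).decode('utf-8')
xml = unquote(xml)  # draw.io URL-encodes the XML before deflating
print(xml)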
I tried to parse a PDF file with PyPDF2, but I only retrieve about 10% of the text; for the remaining 90%, PyPDF2 brings back only newlines... a bit frustrating.
Do you know of any alternatives for Python running on Windows? I've heard of pdftotext, but it seems I can't install it because my computer doesn't run Linux.
Any idea?
import PyPDF2
filename = 'Doc.pdf'
pdf_file = PyPDF2.PdfFileReader(open(filename, 'rb'))
print(pdf_file.getPage(0).extractText())
Try PyMuPDF. The following example simply prints out the text it finds. The library also allows you to get the position of the text if that would help you.
#!/usr/bin/env python3
import json
import fitz  # http://pymupdf.readthedocs.io/en/latest/

# Note: pageCount and getPageText are the older PyMuPDF API names.
pdf = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')
for page_index in range(pdf.pageCount):
    text = json.loads(pdf.getPageText(page_index, output='json'))
    for block in text['blocks']:
        if 'lines' not in block:
            # skip blocks without text (e.g. images)
            continue
        for line in block['lines']:
            for span in line['spans']:
                print(span['text'].encode('utf-8'))
pdf.close()
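If you do want the text positions, here is a short sketch using the current PyMuPDF API names (which differ from the older ones above): page.get_text("dict") returns nested blocks/lines/spans, each carrying a bbox:

import fitz  # PyMuPDF

doc = fitz.open('2018-04-17-CP-Chiffre-d-affaires-T1-2018.pdf')
for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                print(span["bbox"], span["text"])
doc.close()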
I am trying to find a way to look in a folder and search the contents of all of the PowerPoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report the text after that string as well as the document it was found in. I would like to compile the information and report it in a CSV file.
So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.
Actually working:
If you want to extract the text:
- import Presentation from pptx (pip install python-pptx)
- for each file in the directory (using the glob module)
- look in every slide and in every shape on each slide
- if a shape has a text attribute, print shape.text
from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
tika-python
A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.
Note: it also works charmingly with PyInstaller.
Install with pip:
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file
Link to official GitHub
python-pptx can be used to do what you propose. Just at a high level, you would do something like this (not working code, just an idea of the overall approach):
from pptx import Presentation

for pptx_filename in directory:  # directory: any iterable of .pptx paths
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            print(shape.text)
You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
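To make that concrete, here is a sketch of the searching-and-CSV part; the key string "Status:" and the output filename are placeholders:

import csv
import glob
from pptx import Presentation

SEARCH = "Status:"  # hypothetical key string -- use your own

with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "text_after_match"])
    for pptx_filename in glob.glob("*.pptx"):
        prs = Presentation(pptx_filename)
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                text = shape.text_frame.text
                if SEARCH in text:
                    # report the text that follows the key string
                    writer.writerow([pptx_filename, text.split(SEARCH, 1)[1].strip()])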
Textract-Plus
Use textract-plus, which can extract text from most document extensions, including .pptx and .pptm (refer to the docs).
Install:
pip install textract-plus
Sample-
import textractplus as tp
text = tp.process('path/to/yourfile.pptx')
For your case:
import os
import pandas as pd
import textractplus as tp

files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = tp.process(os.path.join(your_dir, f))  # os.path.join, not os.join
        files_csv.append([f, text])
pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv')
This code will fetch all the .pptx and .pptm files from the directory and create a CSV with the filename in the first column and the text extracted from that file in the second.
import os
import textract

your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        text = textract.process(os.path.join(your_dir, f))  # textract.process, joined with the directory
        print(text)
I wrote a Python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get perfect output, only some HTML code.
I want to save all of the webpage's text content in a text file.
I also tried urllib2 and bs4, but I didn't get results.
I don't want the output as an HTML structure; I want all the text data from the webpage.
What do you mean by "webpage text"?
It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not so easily solvable, as parsing an HTML document can be very complex, especially with JavaScript-rich pages.
It starts with assessing whether a string between "<" and ">" is a regular tag, and extends to analyzing the CSS properties changed by JavaScript behavior.
That is why people write very big and complex rendering engines for web browsers.
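That said, for pages whose text is present in the static HTML (i.e. not rendered by JavaScript), BeautifulSoup's get_text() gets you most of the way. A minimal sketch, assuming requests and bs4 are installed:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com").text
soup = BeautifulSoup(html, "html.parser")

# drop script/style elements, which contain code rather than visible text
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
with open("page.txt", "w", encoding="utf-8") as f:
    f.write(text)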
You don't need to write any complicated algorithms to extract data from search results; Google has an API for this.
Here is an example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, you must first register with Google for an API key.
You can find all the information here: https://developers.google.com/api-client-library/python/start/get_started
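A minimal sketch along the lines of that sample; YOUR_API_KEY and YOUR_CSE_ID are placeholders you obtain when registering a Custom Search Engine:

from googleapiclient.discovery import build  # pip install google-api-python-client

def search(query, api_key, cse_id):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=cse_id).execute()
    return res.get("items", [])

for item in search("samsung j7", "YOUR_API_KEY", "YOUR_CSE_ID"):
    print(item["title"], item["link"])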
import urllib.request

# saves the raw HTML of the page (not just the visible text) to test.txt
urllib.request.urlretrieve("http://www.example.com/test.html", "test.txt")