Using Tika in Python and Regular Expression To Extract Text From PDF - python

I'm trying to extract specific information from the PDF using Tika in Python. I tried to incorporate regex into the code, but it returns an error. Here is my code:
from tika import parser
import re
parsed = parser.from_file("PDF/File.pdf")
desc = re.findall(r'((?:[A-Z][a-z]+\s*)+)\b\s*:\s*(.*?)\s*(?=(?:[A-Z][a-z]+\s*)+:|$)', parsed)
print(desc["content"])
The error returned is as follows:
TypeError: expected string or bytes-like object, got 'dict'
Is there a solution to fix the error and a way so that the regex can be passed into the code?

As a maintainer of PyMuPDF I just have to demonstrate how this works with this library:
import fitz # import pymupdf
import re
doc = fitz.open("PDF/File.pdf")
text = " ".join([page.get_text() for page in doc])
desc = re.findall(r'...', text)

Related

parsing xml with namespace from request with lxml in python

I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) e.g. instead of //table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[#width="2700"]/foo:p[#id="4"][not(*)]/text().

Opening a draw.io file using python

I am trying to read the data from a draw.io drawing using python.
Apparently the format is an xml with some portions in "mxfile" encoding.
(That is, a section of the xml is deflated, then base64 encoded.)
Here's the official TFM:
https://drawio-app.com/extracting-the-xml-from-mxfiles/
And their online decoder tool:
https://jgraph.github.io/drawio-tools/tools/convert.html
So i try to decode the mxfile portion using the standard python tools:
import base64
s="7VvbcuI4FPwaHpOybG55BHKZmc1kmSGb7KvAArTIFiuLEObr58jINxTATvA4IVSlKtaxLFvq1lGrbWpOz3u+EXg+/c5dwmq25T7XnMuabSOr3YR/KrJaR9pIByaCurpSEhjQXyS6UkcX1CVBpqLknEk6zwZH3PfJSGZiWAi+zFYbc5a96xxPiBEYjDAzo4/UlVPdC7uVxL8QOplGd0bNi/UZD0eVdU+CKXb5MhVyrmpOT3Au10fec48wNXjRuDx+XT2y21nz5tuP4H/8T/ev+7uHs3Vj10UuibsgiC9f3fSv2fj6y0P9v3/n/esfS+umM/x2pi+xnjBb6PHqExFwX/dYrqJhDJbUY9iHUnfMfTnQZ2AQupjRiQ/HI3g6IiDwRISkgEBHn5B8DtHRlDL3Fq/4QvUhkHg0i0rdKRf0FzSLGZxCEIDTQmoy2c1MjYG6EsIWRAUJoE4/GhgUh25xIHWdEWcMzwM6DB9YVfGwmFC/y6XkXtQQX/gucXUpRjosSMFnMXfU9Tnh0LCp0SDPKTJqeG4I94gUK6iiz8ZM01MNReVlQlzU1LFpmrROW08YPVkmcdvx7X7C5ML+BAYhuZ+zcb96zvvZzeztMAPgfSxJVw1jkKYhHKS6moRCchYgKjKIeoc9YtAURlqmKMnIWG4lZDDHI+pPbsM6l/Uk8lP3VIU4XDtmIRmm1HWJH5JFYonXfFIMmXPqy3AoGl34gwHrWeeNWgMeqAdllJThT1UXssd94BWmIYEIkHVJFGFfoNbOabufWqssYkWRTRMpA2lR/Gwz0Uy5r8h4t/CGkDaODckdGWUqPaYPy8K7YVeMt2PgfeVhqi7ruC7k6OAE+EEBb7UrBrxuAG4gzGioH/RooBfX1j3wewCkai7C+17R4fIMGZxwTE44L+DP8JCwPg+opFy1L9Z1N3hRVdZGVj0fqjuW/zeB2jCz9kKMpjhQiRtk1wyGNzw6wvlcGqio6tzcNFAdyIWruplT9Vsn1X841Y82VL/TLFf1ow3V77Tfr+pvbWfqserGnGmnmZtm72UH0Daw7MDTK/fGtr7DUnJ0SB5UEBbGu/IdwMVJEB4c1Lwqvyw9iEy/8CskfusK0AiXWtu656rsC65aO7IZndZA9bIwbledqJHptd0QteIOiEd9LBTg93hGTJP4o+NbFqTVS/7oAXZlY+K7HfXCBUpDxpXa7kJIy3FkrYvXlEUr1x69nF3+iDsh0dQhbMiXV0mgGwbgRMSUwmo74LAtJfshg/3FhOTYzamn3QnsS0AKwrCkT9n3Tju0eV8RN9HltpXV5bblZJtYd1JflX7RU7Sh9SgYDR3Mqje9v77gYxIE3JTrpx1m+TtMZ3PHl3eH2bL2kviFDaZTz7HBbL2PDSYybcsBZlhn3E+4tsWT9+NsLJHpUhroffadRnFY8+4fS9tqmC7lp1IsEWLvWrKgjUzfeqVkcTYaslsbz1K2ZDGNxm2vKU+CpXzB0rDaGTrk/hDGRjsWme2KpdH4QB/CmD7qQApCzJc3n0WxtHLT690oFtMb7VF5fJrzoA54cZwrt8Bt0y6FpC2P77O1ioGu/OMX27RMQdmrVdy2etw9AX5gwHN/GFMe4qah2oMxkUfoHFSNtfNKMXY4rE1D0wD50xsMxXFt5JRhZTkMtun9PQBE7jEu0OWh2Kw8E5v2398LOV8oe6Gj3lXeqnlwQjQ3oheV59ti1h+fh2NdzNyLfUFUvdWnx3av0xdhudfq0zgrKqVtjbp+oDe6fvH7nJgwdraJvK5fo76noS2un9HQ2eYbp412+HgckFKMQ9s0Dq3z8wj4hK6hGZdKBHvSzlBbcus1vItHs0nI3x5nXMB5nycGpHa77fw5IZpf+ieX+rFq8c/P8ht1Z29kVETMPwaXaZ7lxyrSTx8VrMPM/uib3D8OnemZMeiFWuDxVu8zJcc3UTVVcB4HP9bou7Eu5KK/kRgGAbZxJf86cXEYpjhZFz9K0m/hChSTH1yvqyc/W3eufgM="
result=zlib.decompress(base64.b64decode(s))
Throws the exception:
zlib.error: Error -3 while decompressing data: incorrect header check
Meanwhile their tool above returns xml just fine when given the exact same data.
What am I missing?
Try this:
import zlib
import base64
import xml.etree.ElementTree as ET
from urllib.parse import unquote
tree = ET.parse(filename)
data = base64.b64decode(tree.find('diagram').text)
xml = zlib.decompress(data, wbits=-15)
xml = unquote(xml)
If you read the source of their html tool, you will see this:
data = String.fromCharCode.apply(null, new Uint8Array(pako.deflateRaw(data)));
They are using a JS library called pako and 'raw' mode. From github source you can get the required setting.

how to extract fields from pdf in python using pdfminer

I have a pdf form that I need to extract email id, name of the person and other information like skills, city, etc..how can I do that using pdfminer3.
please find attached sample of pdf
First, use tika to to convert PDF to text.
import re
import sys
!{sys.executable} -m pip install tika
from tika import parser
from io import StringIO
from itertools import islice
file = 'filename with directory'
parsedPDF = parser.from_file(file) # Parse data from file
text = parsedPDF['content'] # Get files text content
Now extract desired fields using regex.
You can find extensive regex tutorials online. If you have any problem implementing the same, please ask here.
Try to use tika package:
from tika import parser
raw = parser.from_file('sample.pdf')
print(raw['content'])

Search and replace text odfpy

I'm trying to make reports for a program using odfpy. My idea is to search for each keywords like [[[e_mail_address]]] and replace it by a word from the database.
I found the function text in odfpy api, but converted into string looses the formating.
There is an document in the odfpy installation files: api-for-odfpy.odt. In point 6.2 Teletype module, there is written how to get all the texts from the document and put them into a list:
from odf import text, teletype
from odf.opendocument import load
textdoc = load("my document.odt")
allparas = textdoc.getElementsByType(text.P)
print teletype.extractText(allparas[0])
and now I'm looking for the method to replace the current text to another. Maybe:
text.Change()
but there is always an error while using it. If you have any experience in using odfpy, please help.
I already found an answer:
textdoc = load("myfile.odt")
texts = textdoc.getElementsByType(text.P)
s = len(texts)
for i in range(s):
old_text = teletype.extractText(texts[i])
new_text = old_text.replace('something','something else')
new_S = text.P()
new_S.setAttribute("stylename",texts[i].getAttribute("stylename"))
new_S.addText(new_text)
texts[i].parentNode.insertBefore(new_S,texts[i])
texts[i].parentNode.removeChild(texts[i])
textdoc.save('myfile.odt')

Python : How to convert markdown formatted text to text

I need to convert markdown text to plain text format to display summary in my website. I want the code in python.
Despite the fact that this is a very old question, I'd like to suggest a solution I came up with recently. This one neither uses BeautifulSoup nor has an overhead of converting to html and back.
The markdown module core class Markdown has a property output_formats which is not configurable but otherwise patchable like almost anything in python is. This property is a dict mapping output format name to a rendering function. By default it has two output formats, 'html' and 'xhtml' correspondingly. With a little help it may have a plaintext rendering function which is easy to write:
from markdown import Markdown
from io import StringIO
def unmark_element(element, stream=None):
if stream is None:
stream = StringIO()
if element.text:
stream.write(element.text)
for sub in element:
unmark_element(sub, stream)
if element.tail:
stream.write(element.tail)
return stream.getvalue()
# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False
def unmark(text):
return __md.convert(text)
unmark function takes markdown text as an input and returns all the markdown characters stripped out.
The Markdown and BeautifulSoup (now called beautifulsoup4) modules will help do what you describe.
Once you have converted the markdown to HTML, you can use a HTML parser to strip out the plain text.
Your code might look something like this:
from bs4 import BeautifulSoup
from markdown import markdown
html = markdown(some_html_string)
text = ''.join(BeautifulSoup(html).findAll(text=True))
This is similar to Jason's answer, but handles comments correctly.
import markdown # pip install markdown
from bs4 import BeautifulSoup # pip install beautifulsoup4
def md_to_text(md):
html = markdown.markdown(md)
soup = BeautifulSoup(html, features='html.parser')
return soup.get_text()
def example():
md = '**A** [B](http://example.com) <!-- C -->'
text = md_to_text(md)
print(text)
# Output: A B
Commented and removed it because I finally think I see the rub here: It may be easier to convert your markdown text to HTML and remove HTML from the text. I'm not aware of anything to remove markdown from text effectively but there are many HTML to plain text solutions.
I came here while searching for a way to perform s.c. GitLab Releases via API call. I hope this matches the use case of the original questioner.
I decoded markdown to plain text (including whitespaces in the form of \n etc.) in that way:
with open("release_note.md", 'r') as file:
release_note = file.read()
description = bytes(release_note, 'utf-8')
return description.decode("utf-8")

Categories

Resources