Python - Apache Tika Single Page Parser

I was wondering if there is any way, using Tika with Python, to parse only the first page of a PDF, or to extract the metadata from the first page only? Right now, when I pass the PDF, it parses every single page.
I looked at this link: Is it possible to extract text by page for word/pdf files using Apache Tika?
However, that link explains things mostly in Java, which I am not familiar with. I was hoping there could be a Python solution for it? Thanks!
from tika import parser
# run: java -jar tika-server-1.18.jar before executing the code below
parsedPDF = parser.from_file('C:\\path\\to\\dir\\sample.pdf')
fulltext = parsedPDF['content']
metadata_dict = parsedPDF['metadata']
title = metadata_dict['title']
author = metadata_dict['Author']  # captures author names from all pages (say 15); I only want the first page
pages = metadata_dict['xmpTPg:NPages']

Thanks for this info, really helpful. Here is my code to retrieve the content page by page (a bit dirty, but it works):
from tika import parser

def get_text_pages(file):
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):  # sanity check: one chunk per page
        return text_pages
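With that in hand, the original question (first page only) reduces to taking the first element; a quick usage sketch (the wrapper function name above is mine):
first_page = get_text_pages('C:\\path\\to\\dir\\sample.pdf')[0]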

Following @Gagravarr's comments regarding XHTML, I found that Tika supports XML output via the xmlContent flag when reading the file. I used it to capture the XML format and then used a regex to extract what I needed.
This worked out for me:
parsed_data_full = parser.from_file(file_name,xmlContent=True)
parsed_data_full = parsed_data_full['content']
Each page has a divider that starts with <div class="page"> and ends at the first following </div>. I basically wrote a small piece of code to capture the sub-string between those two markers and stored it in a variable to suit my specific requirement.
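For illustration, a minimal sketch of that sub-string capture using a regex over Tika's XHTML output (variable names reused from above; the non-greedy match stops at the first </div>):
import re
from tika import parser

parsed_data_full = parser.from_file(file_name, xmlContent=True)['content']
# one match per page divider emitted by Tika
pages = re.findall(r'<div class="page">(.*?)</div>', parsed_data_full, re.DOTALL)
first_page_text = re.sub(r'<[^>]+>', '', pages[0])  # strip the remaining inline tags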

Related

How to use Python Fitz to detect hyphens when using search_for?

I'm new to the Fitz library and am working on a project where I need to find a string on a PDF page. I'm running into a case where the text on the page I'm searching is hyphenated. I am aware of the TEXT_DEHYPHENATE flag that I can pass to the search_for function, but that doesn't work for me (as shown in the image here https://postimg.cc/zHZPdd6v ). I get no matches when I search for the hyphenated string.
Python Script
import fitz  # PyMuPDF

LOC = "./test.pdf"
doc = fitz.open(LOC)
page = doc[1]
print(page.get_text())
found = page.search_for("lowcost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low-cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
for rect in found:
    print(rect)
Output
Abstract
The objective of “XXXXXXXXXXXXXXXXXX” was design and assemble a low-
cost and efficient tool.
DONE
0
DONE
0
DONE
0
Can someone please point me to how I might be able to detect the hyphen in my file? Thank you!
Your first approach should work, look here:
# insert some hyphenated text
page.insert_textbox((100, 100, 300, 300),
                    "The objective of 'xxx' was design and assemble a low-\ncost and efficient tool.")
# -> 157.94699853658676

# now search for it again
page.search_for("lowcost")  # 2 rectangles!
# [Rect(159.3009796142578, 116.24800109863281, 175.8009796142578, 131.36199951171875),
#  Rect(100.0, 132.49501037597656, 120.17399597167969, 147.6090087890625)]

# each rectangle contains a text portion with the hyphen removed
for rect in page.search_for("lowcost"):
    print(page.get_textbox(rect))
# low
# cost
Without the original file there is no way to tell the reason for your failure.
Are you sure there really is text there, and not e.g. an image or some other hiccup?
Edit: per the comment of user @KJ below, PyMuPDF's C base library MuPDF treats all of the Unicode code points '-', 0xAD, 0x2010 and 0x2011 as hyphens in this context. They should all behave the same; I just reconfirmed this in an example.
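If it helps, here is a tiny diagnostic sketch (file and page reused from the question) that prints which of those hyphen-like code points the extracted text actually contains:
import fitz  # PyMuPDF

doc = fitz.open("./test.pdf")
page = doc[1]
hyphens = "\u002D\u00AD\u2010\u2011"  # the code points MuPDF treats as hyphens
for ch in sorted(set(page.get_text())):
    if ch in hyphens:
        print(hex(ord(ch)))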

Python's ElementTree, how to create links in a paragraph

I have a website I'm building that runs on Python 2.7 and uses ElementTree to build the HTML on the fly. I have no problem creating elements and appending them to the tree. Where I'm stumped is inserting links in the middle of a large paragraph. This is easy when done as text, but here it has to be done via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then use ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        # the original anchor markup was eaten by the page renderer; this href
        # scheme is a stand-in, but like the original it contains a raw '&'
        thisMatchLinked = """<a href="/codes?type={0}&section={1}">{2}</a>""".format(
            thisLawType, thisMatch.replace('Section ', ''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well-formed, because of the presence of & in thisMatchLinked. It's one of the special characters that need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
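A minimal sketch of that fix using the standard library's escape helper (the href scheme is the hypothetical one from above):
from xml.sax.saxutils import escape

section_num = thisMatch.replace('Section ', '')
href = "/codes?type={0}&section={1}".format(thisLawType, section_num)
# escape() turns '&' into '&amp;' (and '<'/'>' likewise) so ET.fromstring can parse it
thisMatchLinked = '<a href="{0}">{1}</a>'.format(escape(href), thisMatch)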

Cannot access first page header (python-docx)

I recently wrote Python code to replace a word in MS Office documents via python-docx. The code worked fine for a couple of weeks. I can access all the headers on every page with the code below:
sections = doc.sections
for z in range(0, len(sections)):
    header_section = doc.sections[z]
    header = header_section.header
    header_text = header.paragraphs[0]
However, a few days ago I found an issue. The code no longer works on the first page of the document (but still works on the later pages). I tried to figure out why it stopped working, and it seems to be related to the document's "different first page" setting (I created a new document without a different first page and it worked just fine). Can anyone kindly suggest what caused my code to stop working, so I can try to find a way to rewrite it (any MS Office update that may affect the code?)? And if you have any idea how to access the headers in a document with the different-first-page setting, please kindly share. Thanks in advance.
A section can have up to three headers (and three footers): the first-page header, the even-page header, and the default header. Most sections have only the default header, but if the first-page header is defined, it is used for the first page only. This is because the first page of a chapter, say, would typically have no header, or a different header from the "running" header used through the rest of the chapter or other section.
So look to see whether you have a first-page header or odd-page header defined; that could explain the behavior you're seeing. See these pages in the documentation for more:
https://python-docx.readthedocs.io/en/latest/api/section.html#id1
https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
This mystified me for a while, thanks for the tip to read the docs more closely. For those looking for a copy/paste, here's what I did:
from docx import Document

def get_headers_and_footers(doc: Document):
    header_txt = set()
    for section in doc.sections:
        for paragraph in section.header.paragraphs:
            header_txt.add(paragraph.text)
        for paragraph in section.first_page_header.paragraphs:
            header_txt.add(paragraph.text)
        for paragraph in section.even_page_header.paragraphs:
            header_txt.add(paragraph.text)

    footer_txt = set()
    for section in doc.sections:
        # footers use the parallel footer properties
        for paragraph in section.footer.paragraphs:
            footer_txt.add(paragraph.text)
        for paragraph in section.first_page_footer.paragraphs:
            footer_txt.add(paragraph.text)
        for paragraph in section.even_page_footer.paragraphs:
            footer_txt.add(paragraph.text)
    return list(header_txt), list(footer_txt)
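A quick usage example (the file name here is hypothetical):
doc = Document("report.docx")
headers, footers = get_headers_and_footers(doc)
print(headers)
print(footers)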

Interpreting nested HTML <blockquote>s in Python?

I have a web app that reads from the Tumblr API and reformats the way that "reblog chains" are formatted.
With Tumblr, commentary for a post is stored as HTML blockquotes. As users respond to the commentary above, another level gets added to the blockquote chain, eventually resulting in many nested reblog chains.
Here is an example of how a "reblog chain" looks in plain HTML:
<p><a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:</p>
<blockquote>
  <p><a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:</p>
  <blockquote>
    <p>Here is an example of a Tumblr post.</p>
    <p>It can have multiple &lt;p&gt; elements sometimes. It may only have one, though, at other times.</p>
  </blockquote>
  <p>This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a &lt;blockquote&gt;.</p>
</blockquote>
<p>This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones residing deeper in the nest of blockquotes.</p>
And this is what it looks like when rendered.
I want to be able to reformat the reblog chain so that it looks more like this:
example-blog-domain:
Here is an example of a Tumblr post.
It can have multiple <p> elements sometimes. It may only have one, though, at other times.
chainsaw-police:
This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.
example-blog-domain:
This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones residing deeper in the nest of blockquotes.
I know, it's an incredibly confusing structure, which is why I'm trying to write something to make it more readable.
Is there any way to interpret the HTML and split the reblogs up into individual "comments"? For example, having an array or dict that has the username and the commentary would be more than enough. However, after messing with lxml and BeautifulSoup for months, I'm at my wits' end.
If there was even a way to do it in CSS, which I highly doubt, that would be fine.
Thanks in advance, everyone!
I don't think CSS has such functionality.
You need to parse it into a structure with lxml or similar and render it yourself; that is the easier way. You could also build a regexp-based filter that drops the unwanted pieces of the HTML.
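Along those lines, here is a minimal sketch using BeautifulSoup, assuming the markup shape shown in the question (an attribution link, a nested blockquote, then the commenter's own paragraphs); the function name and "Latest" label are mine:
from bs4 import BeautifulSoup

def split_reblogs(html):
    """Return [(username, commentary), ...], oldest comment first."""
    soup = BeautifulSoup(html, "html.parser")
    comments = []

    def walk(container, owner):
        quote = container.find("blockquote", recursive=False)
        if quote is None:
            # innermost level: every direct <p> is the owner's commentary
            text = " ".join(p.get_text(" ", strip=True)
                            for p in container.find_all("p", recursive=False))
            comments.append((owner, text))
            return
        # the attribution <p><a class="tumblr_blog">...</a></p> sits just before the quote
        link = quote.find_previous_sibling("p").find("a", class_="tumblr_blog")
        walk(quote, link.get_text(strip=True))
        # paragraphs after the nested quote are this owner's own commentary
        text = " ".join(p.get_text(" ", strip=True)
                        for p in quote.find_next_siblings("p"))
        comments.append((owner, text))

    walk(soup, "Latest")
    return comments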
reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython using a tonne of convoluted regexes that end up looking more like a voodoo magic spell than a text manipulation script.
Lots of regexes later, I think I've solved this for an
arbitrarily-deep comment chain.
import re

with open("tcomment.txt", "r") as tf:
    text = ""
    for line in tf:
        text += line

text = text.replace("\n", "")
text = text.replace(">", ">\n")
text = text.replace("<", "\n<")
text = re.sub(r"</p>\s*<p>", "<br><br>", text)
text = text.replace("<p>\n", "")
text = text.replace("</p>\n", "\n")
text = re.sub(r"<[/]{0,1}blockquote>", "<chunk>", text)
text = re.sub(r'<a class="tumblr_blog"[^>]+?>', "<chunk>", text)
text = text.replace("</a>", "")
text = re.sub(r"\n+", "", text)
text = re.sub(r"\s{2,}", " ", text)
text = re.sub(r"<chunk>\s*<chunk>", "<chunk>", text)
bits = text.split("<chunk>")
bits[0] = "Latest:"
comments = []
for i in range(len(bits)):
    j = 0 - (i + 1)
    if (len(bits) - i) > i:
        temp = "<b>" + bits[i] + "</b> " + bits[j]
        comments.append(temp)
comments.reverse()
for comment in comments:
    print("<p>%s</p>" % (comment))
    print()
The line bits[0] = "Latest:" can be changed to whatever you want the
most recent comment to display, and you'll probably want to change how
the text comes into the script.
For the text you provided, this gives me:
<p><b>example-blog-domain:</b> Here is an example of a Tumblr post.<br><br>It can have multiple <p> elements sometimes. It may
only have one, though, at other times.
<p><b>chainsaw-police:</b> This is an example of a user "reblogging" a post. As you can see, the previous comment is stored
above as a <blockquote>.
<p><b>Latest:</b> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones
residing deeper in the nest of blockquotes.
e: Some thoughts: this is in Python 3, but everything but the print
statements should work in Python 2, I think. I used text.split()
whenever possible because direct string manipulation is typically
faster than regular expressions are, but that may not be appropriate
here. And finally, it's possible that I'm making more work for myself
than I need to in the substitutions section, but at this point I've
looked at the code too long to figure out if it could be slimmed down.

Exporting Wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version

link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"

def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data, instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should be in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier to use the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
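For illustration, a minimal sketch of the API route with plain requests (the endpoints are the standard MediaWiki ones; the Turkish category prefix "Kategori:" and the two helper names are my assumptions):
import requests

API = "https://tr.wikipedia.org/w/api.php"

def category_members(category):
    # list the pages in the category via the MediaWiki API
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Kategori:" + category,  # Turkish namespace prefix assumed
        "cmlimit": "max",
        "format": "json",
    }
    r = requests.get(API, params=params)
    return [m["title"] for m in r.json()["query"]["categorymembers"]]

def export_pages(titles):
    # Special:Export accepts explicit page titles POSTed to it
    r = requests.post("https://tr.wikipedia.org/wiki/Special:Export",
                      data={"pages": "\n".join(titles), "curonly": 1})
    return r.text  # XML dump of the requested pages

print(export_pages(category_members("Matematik")))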
Sorry, my original answer was horribly flawed; I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the category's pages to the output, but in fact it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires the pages you wish to download to be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages for other pages.
As an aside, it could be faster to use the actual corpus of Wikipedia data, which is available for download.
Good luck!
