Finding word on page(s) in document - python

I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.
I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format.
If I parse the document I could look for <w:br w:type="page"/> occurrences in the xml tree but unfortunately these do not show non-forced page breaks.
I even considered converting the files to PDF first and using something like PDFminer to parse the document page-wise.
Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like
[('foo', [1, 4, 7]), ('bar', [2]), ('baz', [2, 5, 8, 9])]

Parse the xml files composing the docx
It seems that the biggest challenge in your question is how to parse a document page by page. The pagination of a Word document is not always the same: it depends on the margins, the paper size settings, the application you use to open it, etc. A good discussion of the accuracy of any script for this purpose can be found in this Google group.
However, if you can be satisfied with almost 100% accuracy, you can start from the solution suggested in this Google group:
I found that I can unzip the .docx file and extract docProps/app.xml, then parse the XML with ElementTree to get the <Pages></Pages> element. I found that most of the time that number is accurate, but I've seen a few instances where the number in that element is not correct.  
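A minimal sketch of the approach quoted above, using only the standard library. The function name stored_page_count is made up for illustration, and the value it returns is whatever the last application to save the file wrote there, so it can be stale or missing, as the quote warns.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml) of a .docx package.
APP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def stored_page_count(docx_path):
    """Return the page count cached in docProps/app.xml, or None if it is absent.

    docx_path may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("docProps/app.xml"))
    pages = root.find(APP_NS + "Pages")
    if pages is None or not (pages.text or "").strip().isdigit():
        return None
    return int(pages.text)
```

Note that this only gives the total number of pages, not the page on which a given word appears.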
Use Win32com.Client
Another approach could be to use win32com.client to open the file, paginate it, make your search and then return the results in the format you want it.
You can find an example of the syntax in this answer:
from win32com.client import Dispatch
# open Word
word = Dispatch('Word.Application')
word.Visible = False
# doc_path: absolute path to the document; keep the application handle in `word`
doc = word.Documents.Open(doc_path)
# get the number of pages
doc.Repaginate()
num_of_sheets = doc.ComputeStatistics(2)  # 2 = wdStatisticPages
You can also have a look at this answer regarding find and replace in a Word document using win32com.client.
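Whichever way the document is paginated, once you have the text of each page as a string, building the result in the format asked for is straightforward. A minimal sketch, assuming a list page_texts with one string per page (both that name and pages_for_words are hypothetical) and whole-word, case-insensitive matching:

```python
import re


def pages_for_words(page_texts, words):
    """Return [(word, [page numbers])] with 1-based page numbers."""
    result = []
    for word in words:
        # \b keeps 'bar' from matching inside 'barrel'; re.escape guards special characters.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        hits = [n for n, text in enumerate(page_texts, start=1) if pattern.search(text)]
        result.append((word, hits))
    return result
```

A word that never occurs is still listed, with an empty page list, so missing terms are easy to spot.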

Related

How do I find text after a given key word?

I am practicing my programming skills (in Python) and I realized that I don't know what to do when I need to find a value that is unknown but introduced by a key word. I am taking the information for this off a website where in the page source it says, '"size":"10","stockKeepingUnitId":"(random number)"'
How can I figure out what that number is?
This is what I have so far --
def stock():
    global session
    endpoint = '(website)'
    response = session.get(endpoint)
    soup = bs(response.text, "html.parser")
    sizes = soup.find('"size":"10","stockKeepingUnitId":')
Off the top of my head there are two ways to do this. Say you have the string mystr = 'some text...content:"67588978"'. The first way is just to search for "content:" in the string and use string slicing to take everything after it:
num = mystr[mystr.index('content:"') + len('content:"'):-1]
Alternatively, and probably better, you could use regular expressions:
import re
nums = re.findall(r'content:"(\d+)"', mystr)
As you haven't provided an example of the dataset you're trying to analyze, there could also be a number of other solutions. If you're trying to parse a JSON or YAML file, there are simple libraries to turn them into python dicts (json is part of the standard library, and PyYaml handles YAML files easily).
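Applied to the string from the question, the regular-expression approach might look like the sketch below. The sample page_source and the function name sku_for_size are made up for illustration, and it assumes the id consists of digits only.

```python
import re

# Made-up sample standing in for response.text from the question.
page_source = '{"size":"9","stockKeepingUnitId":"111"},{"size":"10","stockKeepingUnitId":"12345678"}'


def sku_for_size(source, size):
    """Return the stockKeepingUnitId that follows the given size, or None."""
    pattern = r'"size":"' + re.escape(str(size)) + r'","stockKeepingUnitId":"(\d+)"'
    match = re.search(pattern, source)
    return match.group(1) if match else None


print(sku_for_size(page_source, 10))
```

The search runs on the raw page text rather than on the BeautifulSoup object, because the value sits inside embedded data and not in an HTML tag.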

How to robustly extract author names from pdf papers?

I'd like to extract author names from pdf papers. Does anybody know a robust way to do so?
For example, I'd like to extract the name Archana Shukla from this pdf https://arxiv.org/pdf/1111.1648
PDF documents contain Metadata. It includes information about the document and its contents such as the author’s name, keywords, copyright information. See Adobe doc.
You can use PyPDF2 to extract PDF Metadata. See the documentation about the DocumentInformation class.
This information may not be filled in and can appear blank. So, one possibility is to parse the beginning or the end of the text and extract what you think is the author name. Of course, this is not reliable. But if you have a bibliographic database, you can try to match against it.
Nowadays, editors like Microsoft Word or Libre Office Writer always fill the author name in the Metadata. And it is copied in the PDF when you export your documents. So, this should work for you. Give it a try and tell us!
I am going to pre-suppose that you have a way to extract text from a PDF document, so the question is really "how can I figure out the author from this text". I think one straightforward solution is to use the correspondence email. Here is an example implementation:
import difflib
# Some sample text
pdf_text="""SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n
Archana Shukla\nDepartment of Computer Science and Engineering,
Motilal Nehru National Institute of Technology,
Allahabad\narchana@mnnit.ac.in\nABSTRACT\nI present a tool which
tells the quality of document or its usefulness based on annotations."""
def find_author(some_text):
    words = some_text.split(" ")
    emails = []
    for word in words:
        if "@" in word:
            emails.append(word)
    emails_clean = emails[0].split("\n")
    actual_email = [a for a in emails_clean if "@" in a]
    actual_email = actual_email[0]
    maybe_name = actual_email.split("@")[0]
    all_words_lists = [a.split("\n") for a in words]
    words = [a for sublist in all_words_lists for a in sublist]
    words.remove(actual_email)
    return difflib.get_close_matches(maybe_name, words)
In this case, find_author(pdf_text) returns ['Archana']. It's not perfect, but it's not incorrect. I think you could likely extend this in some clever ways, perhaps by getting the next word after the result or by combining this guess with metadata, or even by finding the DOI in the document if/when it exists and looking it up through some API, but nonetheless I think this should be a good starting point.
First things first: there are some PDFs out there whose pages are images, and I don't know if you can easily extract text from those. But for the PDF you linked, I think it can be done. There is a package called PyPDF2 which, as far as I know, can extract the text from a PDF. All that is left is to scan the last few pages and parse the author names.
An example of how to use the package is described here. Some of the code listed there is as follows:
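import PyPDF2
# Legacy PyPDF2 (< 3.0) API; newer releases use PdfReader, len(reader.pages) and page.extract_text()
pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())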

Extracting Fasta Moonlight Protein Sequences with Python

I want to extract the FASTA files that have the amino acid sequence from the Moonlighting Protein Database ( www.moonlightingproteins.org/results.php?search_text= ) via Python, since it's an iterative process, which I'd rather learn how to program than manually do it, b/c come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :( . The basic pseudocode would be:
for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the
amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics@moonlightingproteins.org if you are
interested in using MoonProt database for analysis of sequences and/or
structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
That's a good introduction to scraping
But back to your original question.
import requests
from lxml import html
#let's download one protein at a time, change 3 to any other number
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
#convert the html document to something we can parse in Python
tree = html.fromstring(page.content)
#get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    if cell.text:
        #if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
        #if we find a table cell which has UniProt in it
        #let's print the link from the next cell
        if 'UniProt' in cell.text_content():
            if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
                print(cells[i + 1].find('a').attrib['href'])
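The script above only prints what it finds, while the question asks to store each sequence in a .txt file inside a given folder. A minimal sketch of that last step, assuming you already hold the FASTA text as a string; the helper name save_fasta and the file naming scheme are hypothetical:

```python
from pathlib import Path


def save_fasta(fasta_text, folder, name):
    """Write one FASTA record to <folder>/<name>.txt and return the file path."""
    if not fasta_text.lstrip().startswith(">"):
        raise ValueError("text does not look like a FASTA record")
    target_dir = Path(folder)
    # Create the output folder (and any parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.txt"
    target.write_text(fasta_text.strip() + "\n", encoding="utf-8")
    return target
```

Inside the loop you could call it as save_fasta(cell.text, 'sequences', 'protein_3') instead of print(cell.text).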

Python: Copy content from one word document to another word document and keeping format?

As the title says I would like to know if there is any module that will allow me to parse content from one Microsoft word document to another via python and keeping the format.
I want to read table data and transfer it to another table in another document.
Both doc A and B exist. I just want to be able to walk through the cells in both docs (not necessarily at the same time) and copy content without having to worry about if the text is formatted (font, italic, bold) or contains bullets.
I'm asking for python since it's my favorite language...
Following Kasra's advice to use python-docx:
Rough example code.
Query document for table:
from docx import *
document = opendocx('xxxzzz.docx')
table = document.xpath('/w:document/w:body/w:tbl', namespaces=nsprefixes)[0]
Writing to another document:
output = opendocx('yyywwww.docx')
body = output.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
body.append(table)
output.save('new-file-name.docx')

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
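If the cell strings end up in Python (for example when driving Word through win32com), the same clean-up can be done there. A minimal sketch, assuming the marker arrives as a carriage return followed by the control character \x07, which is how Word usually reports the end of a cell; the helper name clean_cell_text is made up:

```python
def clean_cell_text(raw):
    """Strip the trailing end-of-cell characters from a Word table cell string."""
    # rstrip removes any run of '\r' and '\x07' at the end, and nothing else.
    return raw.rstrip("\r\x07")


print(repr(clean_cell_text("Send revised quote\r\x07")))
```

Stripping by character rather than by a fixed count leaves strings untouched if the marker is not there.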
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
You could use OpenOffice. It can open word files, and also can run python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML, then using Python or something else to pull the data out of it and into the database?
It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
