How to find textual differences between revisions on Wikipedia pages with mwclient?

I'm trying to find the textual differences between two revisions of a given Wikipedia page using mwclient. I have the following code:
import mwclient
import difflib
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bowdoin College']
texts = [rev for rev in page.revisions(prop='content')]
if not (texts[-1][u'*'] == texts[0][u'*']):
    # show me the differences between the pages
Thank you!

It's not clear whether you want a difflib-generated diff or a MediaWiki-generated diff obtained through mwclient.
In the first case, you have two strings (the text of two revisions) and you want to get the diff using difflib:
...
t1 = texts[-1][u'*']
t2 = texts[0][u'*']
print('\n'.join(difflib.unified_diff(t1.splitlines(), t2.splitlines())))
(difflib can also generate an HTML diff, refer to the documentation for more info.)
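For instance, here's a minimal sketch using difflib.HtmlDiff (the output filename is arbitrary):
html = difflib.HtmlDiff().make_file(t1.splitlines(), t2.splitlines())
with open('difflib_diff.html', 'w', encoding='utf-8') as f:
    f.write(html)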
But if you want the MediaWiki-generated HTML diff using mwclient you'll need revision ids:
# TODO: Loading all revisions is slow,
# try to load only as many as required.
revisions = list(page.revisions(prop='ids'))
last_revision_id = revisions[-1]['revid']
first_revision_id = revisions[0]['revid']
Then use the compare action to compare the revision ids:
compare_result = site.get('compare', fromrev=last_revision_id, torev=first_revision_id)
html_diff = compare_result['compare']['*']
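The compare API returns the diff body as HTML table rows, so a quick way to eyeball it (a sketch; the bare <table> wrapper and the filename are my own) is:
with open('mediawiki_diff.html', 'w', encoding='utf-8') as f:
    f.write('<table>{}</table>'.format(html_diff))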

Related

Python PDF Parsing with Camelot and Extract the Table Title

Camelot is a fantastic Python library for extracting tables from a PDF file as data frames. However, I'm looking for a solution that also returns the table description text written right above the table.
The code I'm using for extracting tables from pdf is this:
import camelot
tables = camelot.read_pdf('test.pdf', pages='all', lattice=True, suppress_stdout=True)
I'd like to extract the text written above the table, i.e. THE PARTICULARS.
What would be the best approach to do this? I'd appreciate any help, thank you.
You can create the Lattice parser directly (the import below is the piece the snippet needs):
from camelot.parsers import Lattice

parser = Lattice(**kwargs)
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)
Then you have access to parser.layout, which contains all the components on the page. These components all have a bbox (x0, y0, x1, y1), and the extracted tables also have a bbox object. You can find the component closest to (and above) the table and extract its text.
Here's my hilariously bad implementation just so that someone can laugh and get inspired to do a better one and contribute to the great camelot package :)
Caveats:
Will only work for non-rotated tables
It's a heuristic
The code is bad
import math

# Helper methods for _bbox
def top_mid(bbox):
    return ((bbox[0] + bbox[2]) / 2, bbox[3])

def bottom_mid(bbox):
    return ((bbox[0] + bbox[2]) / 2, bbox[1])

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

def get_closest_text(table, htext_objs):
    min_distance = 999  # Cause 9's are big :)
    best_guess = None
    table_mid = top_mid(table._bbox)  # Middle of the TOP of the table
    for obj in htext_objs:
        text_mid = bottom_mid(obj.bbox)  # Middle of the BOTTOM of the text
        d = distance(text_mid, table_mid)
        if d < min_distance:
            best_guess = obj.get_text().strip()
            min_distance = d
    return best_guess
import os
import camelot
from camelot.handlers import PDFHandler

def get_tables_and_titles(pdf_filename):
    """Here's my hacky code for grabbing tables and guessing at their titles"""
    my_handler = PDFHandler(pdf_filename)
    tables = camelot.read_pdf(pdf_filename, pages='2,3,4')
    print('Extracting {:d} tables...'.format(tables.n))
    titles = []
    with camelot.utils.TemporaryDirectory() as tempdir:
        for table in tables:
            my_handler._save_page(pdf_filename, table.page, tempdir)
            tmp_file_path = os.path.join(tempdir, f'page-{table.page}.pdf')
            layout, dim = camelot.utils.get_page_layout(tmp_file_path)
            htext_objs = camelot.utils.get_text_objects(layout, ltype="horizontal_text")
            titles.append(get_closest_text(table, htext_objs))  # Might be None
    return titles, tables
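A usage sketch (assuming a PDF that, matching the hard-coded pages='2,3,4' above, has tables on those pages):
titles, tables = get_tables_and_titles('test.pdf')
for title, table in zip(titles, tables):
    print(title)
    print(table.df.head())  # each camelot table exposes a pandas DataFrame as .df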
See: https://github.com/atlanhq/camelot/issues/395

How do you keep table rows together in python-docx?

As an example, I have a generic script that outputs the default table styles using python-docx (this code runs fine):
import docx

d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
for stylenum, style in enumerate(styles, start=1):
    label = d.add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = d.add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
d.save('tablestyles.docx')
Next, I opened the document, highlighted a split table, and under paragraph format selected "Keep with next," which successfully prevented the table from being split across a page.
Inspecting the XML of the non-broken table shows the paragraph property (w:keepNext) that should be keeping the table together. So I wrote this function and stuck it in the code above the d.save('tablestyles.docx') line:
def no_table_break(document):
    tags = document.element.xpath('//w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True

no_table_break(d)
When I inspect the XML code the paragraph property tag is set properly and when I open the Word document, the "Keep with next" box is checked for all tables, yet the table is still split across pages. Am I missing an XML tag or something that's preventing this from working properly?
Ok, I also needed this. I think we were all making the incorrect assumption that the setting in Word's table properties (or the equivalent ways to achieve this in python-docx) was about keeping the table from being split across pages. It's not; it's simply about whether or not an individual table row can be split across pages.
Given that we know how to do this successfully in python-docx, we can prevent tables from being split across pages by putting each table within a row of a larger master table. The code below does this successfully. I'm using Python 3.6 and python-docx 0.8.6.
import docx
from docx.oxml.shared import OxmlElement
import os
import sys

def prevent_document_break(document):
    """https://github.com/python-openxml/python-docx/issues/245#event-621236139
    Globally prevent table cells from splitting across pages.
    """
    tags = document.element.xpath('//w:tr')
    for tag in tags:
        child = OxmlElement('w:cantSplit')  # Create the <w:cantSplit/> element
        tag.append(child)                   # Append it to the <w:tr> tag

d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)

big_table = d.add_table(1, 1)
big_table.autofit = True

for stylenum, style in enumerate(styles, start=1):
    cells = big_table.add_row().cells
    label = cells[0].add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = cells[0].add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell

prevent_document_break(d)
d.save('tablestyles.docx')

# because I'm lazy...
openers = {'linux': 'libreoffice tablestyles.docx',
           'linux2': 'libreoffice tablestyles.docx',
           'darwin': 'open tablestyles.docx',
           'win32': 'start tablestyles.docx'}
os.system(openers[sys.platform])
I had been struggling with the problem for some hours and finally found a solution that worked fine for me. I just changed the XPath in the topic starter's code, so now it looks like this:
def keep_table_on_one_page(doc):
    tags = doc.element.xpath('//w:tr[position() < last()]/w:tc/w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
The key part is this selector:
[position() < last()]
We want all but the last row in each table to keep with the next one.
I would have left this as a comment under @DeadAd's answer, but I had too little rep.
In case anyone is looking to stop a specific table from breaking, rather than all tables in a doc, change the XPath to the following:
tags = table._element.xpath('./w:tr[position() < last()]/w:tc/w:p')
where table refers to the instance of <class 'docx.table.Table'> which you want to keep together.
"//" selects all nodes that match the XPath (regardless of relative location); "./" starts the selection from the current node.

Add Header based on Condition

I'm using reportlab to generate a PDF document that has two types of reports.
Please assume the reports are r1 and r2. There may be more than 2-3 pages in each report, so I want to add header-like text starting from the second page of each report.
For example, on the pages of the r1 report add "r1 report continued...", and on the pages of the r2 report add "r2 report continued...". How can I do that?
Currently I'm creating a list of the elements and passing it to the template's build function, so I cannot identify which report is being processed.
For example...
elements = []
elements.append(r1)
...
.....
elements.append(r2)
doc.build(elements)
Finally I managed to resolve it, but I'm not sure if it's a proper method.
A big thanks to grc, who provided this answer, from which I created my solution.
As in grc's answer, I have created an afterFlowable callback function:
def afterFlowable(self, flowable):
    if hasattr(flowable, 'cReport'):
        cReport = getattr(flowable, 'cReport')
        self.cReport = cReport
Then, while adding data for the r1 report, a custom attribute is created:
elements.append(PageBreak())
elements[-1].cReport = 'r1'
The same code while adding data for the r2 report:
elements.append(PageBreak())
elements[-1].cReport = 'r2'
Then in the onPage function of the template (afterFlowable stores cReport on the doc template, so it's read here as doc.cReport):
template = PageTemplate(id='test', frames=frame, onPage=headerAndFooter)

def headerAndFooter(canvas, doc):
    canvas.saveState()
    if doc.cReport == 'r1':
        Ph = Paragraph("""<para>r1 Report (continued)</para>""", styleH5)
        w, h = Ph.wrap(doc.width, doc.topMargin)
        Ph.drawOn(canvas, doc.leftMargin, doc.height + doc.topMargin)
Note that I'm just copying and pasting parts of my code here...
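For reference, here's a minimal end-to-end sketch of the same pattern (the class and variable names, frame geometry, and sample content are my own, not the original poster's):
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import (BaseDocTemplate, Frame, PageBreak,
                                PageTemplate, Paragraph)

styles = getSampleStyleSheet()

def headerAndFooter(canvas, doc):
    # onPage callback: draws the continuation header once cReport is set
    canvas.saveState()
    if getattr(doc, 'cReport', None) == 'r1':
        ph = Paragraph('<para>r1 Report (continued)</para>', styles['Heading5'])
        w, h = ph.wrap(doc.width, doc.topMargin)
        ph.drawOn(canvas, doc.leftMargin, doc.height + doc.topMargin)
    canvas.restoreState()

class ReportDoc(BaseDocTemplate):
    def afterFlowable(self, flowable):
        # Remember which report the last flagged flowable belonged to
        if hasattr(flowable, 'cReport'):
            self.cReport = getattr(flowable, 'cReport')

doc = ReportDoc('report.pdf', pagesize=A4)
frame = Frame(doc.leftMargin, doc.bottomMargin, doc.width, doc.height, id='f')
doc.addPageTemplates([PageTemplate(id='test', frames=[frame],
                                   onPage=headerAndFooter)])

elements = [Paragraph('r1 body', styles['Normal']), PageBreak()]
elements[-1].cReport = 'r1'  # Tag the page break with the report name
elements.append(Paragraph('r1 page 2', styles['Normal']))
doc.build(elements)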

Return values from a Python Entrez dictionary of dictionaries

I want to scrape the Interactions table from the Entrez Gene page.
The Interactions table is populated from a web server and when I tried to use the XML package in R, I could get the Entrez gene page, but the Interactions table body was empty (it had not been populated by the web server).
Dealing with the web server issue in R may be solvable (and I'd love to see how), but it seemed Biopython was an easier path.
I put together the following, which gives me what I want for an example gene:
# Pull the Entrez gene page for MAP1B using Biopython
from Bio import Entrez
Entrez.email = "jamayfie@vasci.umass.edu"
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()
PPI_Entrez = []
PPI_Sym = []
# Find the Dictionary that contains the Interaction table
for x in range(1, len(record[0]["Entrezgene_comments"])):
    if ('Gene-commentary_heading', 'Interactions') in record[0]["Entrezgene_comments"][x].items():
        for y in range(0, len(record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'])):
            EntrezID = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
            PPI_Entrez.append(EntrezID)
            Sym = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
            PPI_Sym.append(Sym)
# Return the desired values: I want the Entrez ID and Gene symbol for each interacting protein
PPI_Entrez # Returns the EntrezID
PPI_Sym # Returns the gene symbol
This code works, giving me what I want. But I think it's ugly, and I am concerned that if the Entrez gene page changes slightly in format it will break the code. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:
record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.
Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
I'm not sure about XPath in Python, but if the code works, then I would not worry about removing the full paths or about whether the Entrez Gene XML will change. Since you first tried R, you could get the XML using a system call to Entrez Direct as below, or a package like rentrez.
doc <- xmlParse( system("efetch -db=gene -id=4131 -format xml", intern=TRUE) )
Next, get the nodes corresponding to rows in the table at http://www.ncbi.nlm.nih.gov/gene/4131#interactions
x <- getNodeSet(doc, "//Gene-commentary_heading[.='Interactions']/../Gene-commentary_comment/Gene-commentary" )
length(x)
[1] 64
x[1]
x[50]
Try the easy stuff first
xmlToDataFrame(x[1:4])
Gene-commentary_type Gene-commentary_text Gene-commentary_refs Gene-commentary_source Gene-commentary_comment
1 18 Affinity Capture-MS 24457600 BioGRID110304BioGRID 255BioGRID110304255GeneID8726EEDBioGRID114265
2 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID2353FOSBioGRID108636
3 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID1936EEF1DBioGRID108256
4 18 Affinity Capture-MS 2345592220562859 BioGRID110304BioGRID 255BioGRID110304255GeneID6789STK4BioGRID112665
Gene-commentary_create-date Gene-commentary_update-date
1 2014461120 201410513330
2 201312810490 201410513330
3 201312810490 201410513330
4 20137710360 201410513330
Some tags like text, refs, source, and dates should be easy to parse
sapply(x, function(x) paste( xpathSApply(x, ".//PubMedId", xmlValue), collapse=", "))
I'm not sure about the comments or how Products, Interactants and Other Genes listed in the table are stored in the XML, but I get one or three symbols and three ids for each node here.
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Other-source_anchor", xmlValue), collapse=" + "))
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Object-id_id", xmlValue), collapse=" + "))
Finally, since I think Entrez Gene just copies IntAct and BioGRID, you could try those sites too. BioGRID has a really simple REST service, but you have to register for an access key.
url <- "http://webservice.thebiogrid.org/interactions?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]"
biogrid <- read.delim(url)
dim(biogrid)
[1] 58 24
head(biogrid[, c(8:9,12)])
Official.Symbol.Interactor.A Official.Symbol.Interactor.B Experimental.System
1 ANP32A MAP1B Two-hybrid
2 MAP1B ANP32A Two-hybrid
3 RASSF1 MAP1B Affinity Capture-Western
4 RASSF1 MAP1B Two-hybrid
5 ANP32A MAP1B Affinity Capture-Western
6 GAN MAP1B Affinity Capture-Western
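Back in Python: if you still want to avoid spelling out the full path, a small recursive helper (my own sketch, not part of Biopython) can collect every value stored under a given key anywhere in the nested dicts and lists that Entrez.read() returns:
def find_key(obj, key):
    # Yield all values stored under `key`, at any depth
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)

# For example, every Object-id_id anywhere under the record:
ids = list(find_key(record[0], 'Object-id_id'))
Note this is less precise than the full path, since it matches the key wherever it occurs; that's the usual trade-off with a wildcard-style search.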

How to count no of rows in table from web application using selenium python webdriver

How do I count the rows in a table from a web application using the Selenium Python WebDriver? We can retrieve all the data in the table from the web application, but we couldn't count the rows and columns. Please give me an idea of how to do this.
Try something like this:
int rowCount = driver.findElements(By.xpath("//table[@id='DataTable']/tbody/tr")).size();
int columnCount = driver.findElements(By.xpath("//table[@id='DataTable']/tbody/tr/td")).size();
FYI: this is an implementation in Java.
Santoshsarma's answer is close to correct for Java, but it will count the number of cells rather than the number of columns. Here's a Python version that counts the number of columns in the first row:
row_count = len(driver.find_elements_by_xpath("//table[@id='DataTable']/tbody/tr"))
column_count = len(driver.find_elements_by_xpath("//table[@id='DataTable']/tbody/tr[1]/td"))
If your table has a header you could count the columns there instead.
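For instance (a sketch, assuming the same hypothetical table id and a <thead> with <th> cells):
header_column_count = len(driver.find_elements_by_xpath(
    "//table[@id='DataTable']/thead/tr/th"))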
I haven't tried using storeXpathCount.
You can do it by using find_elements_by_* or execute_script.
Retrieving the count by applying len() to the result of find_elements_by_* is a correct way; however, getting the count of rows and columns via execute_script is quite a bit faster than taking len() of a find_elements_by_* result:
rows_count = driver.execute_script("return document.getElementsByTagName('tr').length")
columns_count = driver.execute_script("return document.getElementsByTagName('tr')[0].getElementsByTagName('th').length")
Below are performance metrics for the find_elements_by_* vs. execute_script approaches:
from timeit import timeit
# Time metrics of retrieving 111 table rows
timeit(lambda:driver.execute_script("return document.getElementsByTagName('tr').length"), number=1000)
# 5.525
timeit(lambda:len(driver.find_elements_by_tag_name("tr")), number=1000)
# 9.799
timeit(lambda:len(driver.find_elements_by_xpath("//table/tbody/tr")), number=1000)
# 10.656
# Time metrics of retrieving 5 table headers
timeit(lambda:driver.execute_script("return document.getElementsByTagName('tr')[0].getElementsByTagName('th').length"), number=1000)
# 5.441
timeit(lambda:len(driver.find_elements_by_tag_name("th")), number=1000)
# 8.604
timeit(lambda:len(driver.find_elements_by_xpath("//table/tbody/th")), number=1000)
# 9.336
Selenium has a function for counting XPath results, see http://release.seleniumhq.org/selenium-core/1.0/reference.html. According to this you can use storeXpathCount.
I am using the DefaultSelenium class from the selenium-java-client-driver, where I can use (also in Java) the following:
int rowCount = selenium.getXpathCount("//table[@id='Datatable']/tbody/tr").intValue();
For Python users the simple way of counting a tag is as follows:
# use find_elements_by_xpath
test_elem = self.driver.find_elements_by_xpath("//div[@id='host-history']/table[1]/tbody[1]/tr")
# then check the len of the returned list
print(len(test_elem))
As of 2022
The best way I found was as follows, for a table with many tr elements. Get the full XPath of the tbody, then count its tr children:
path1 = "/html/body/div[5]/div[3]/div/div[1]/div/div/div[4]/form/div[3]/div[2]/div/table/tbody"
element1 = driver.find_element(By.XPATH, path1).find_elements(By.CSS_SELECTOR, 'tr')
number_of_rows = len(element1)
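Equivalently (a sketch), the chained lookup can be collapsed into a single locator by appending /tr to the XPath:
rows = driver.find_elements(By.XPATH, path1 + "/tr")
number_of_rows = len(rows)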
