python lxml.html: pull preceding text in html docstring - python

I'm trying to identify a given <table> element based on the text that precedes it in the html document.
My current method is to stringify each html table element and search for its text index within the file text:
from urllib import request
import lxml.html

filing_text = request.urlopen(url).read()
# some text cleanup here to make lxml's output match the .read() content
# (self.filing_tree is the document parsed elsewhere, e.g. lxml.html.fromstring(filing_text))
ref_text = lxml.html.tostring(self.filing_tree).upper().\
    replace(b" ", b"&NBSP;")
tbl_count = 0
for tbl in self.filing_tree.iterfind('.//table'):
    text_ind = ref_text.find(lxml.html.tostring(tbl).
                             upper().replace(b" ", b"&NBSP;"))
    start_text = lxml.html.tostring(tbl)[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))
Given the starting index of the table element, I can then search the x characters preceding it for text that may help to identify the table's content.
Two concerns with this approach:
Since the tag density (i.e., how much of the filing text is markup versus content) varies from url to url, it's hard to standardize my search range in the preceding text: 2,500 characters of HTML may encompass 300 characters of actual content, or 2,000.
Serializing and searching once per table element also seems rather inefficient; it adds more overhead to a web-scraping workflow than I'd like.
Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.

Use Beautiful Soup. Just a snippet to get you started:
>>> from bs4 import BeautifulSoup
>>> stupid_html = "<html><p> Hello </p><table> </table></html>"
>>> soup = BeautifulSoup(stupid_html)
>>> list_of_tables = soup.find_all("table")
>>> print( list_of_tables[0].previous )
Hello
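If you want to stay in lxml rather than switch parsers, the XPath preceding axis gets at the same idea: it selects every text node that occurs before a given element in document order, so the tail of that list is the content immediately in front of the table. A rough sketch along those lines (variable names follow the question; treat it as a starting point rather than a drop-in fix):
import lxml.html

tree = lxml.html.fromstring(filing_text)          # filing_text as read in the question
for tbl_count, tbl in enumerate(tree.iterfind('.//table'), 1):
    preceding = tbl.xpath('preceding::text()')    # all text nodes before this table, in document order
    context = ' '.join(t.strip() for t in preceding[-10:] if t.strip())
    print('tbl: %s; context: %s' % (tbl_count, context[-300:]))
Because this works on text nodes rather than raw markup, the "how much of the last N characters is actually content" problem largely goes away.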

Related

How to find distance between 2 elements in html using beautifulsoup

The goal is to find the distance between 2 tags, e.g. the first external a href attribute and the title tag, using BeautifulSoup.
html = '<title>stackoverflow</title>test'
soup = BeautifulSoup(html)
ext_link = soup.find('a',href=re.compile("^https?:",re.IGNORECASE))
title = soup.title
dist = abs_distance_between_tags(ext_link,title)
print dist
30
How would I do this without using regex?
Note that the order of the tags may be different, and there may be more than one match (although we are only taking the first using find()).
I could not find a method in BeautifulSoup that returns the locations/positions in the html of the matches.
As you noted, it does not seem like you can get the exact character position of an element in BeautifulSoup.
Maybe this answer can help you along:
AFAIK, lxml only offers sourceline, which is insufficient. Cf. the API: "Original line number as found by the parser or None if unknown."
But expat provides the exact offset in the file: CurrentByteIndex.
Fetched from a start_element handler, it returns the tag's start (i.e. '<') offset.
Fetched from a char_data handler, it returns the data's start (i.e. 'B' in your example) offset.
Beautiful Soup 4 now supports Tag.sourceline and Tag.sourcepos.
Reference: https://beautiful-soup-4.readthedocs.io/en/latest/#line-numbers
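With those attributes, one way to approximate a character distance is to rebuild an absolute offset from the line number and column. A minimal sketch, assuming a recent bs4 (4.8+) with the html.parser builder; the sample markup is made up:
from bs4 import BeautifulSoup

html = '<title>stackoverflow</title>\n<a href="https://example.com">test</a>'
soup = BeautifulSoup(html, 'html.parser')
lines = html.splitlines(True)                     # keep line endings so lengths add up

def char_offset(tag):
    # sourceline is 1-based; sourcepos is the column of the tag on that line
    return sum(len(l) for l in lines[:tag.sourceline - 1]) + tag.sourcepos

title = soup.title
link = soup.find('a')
print(abs(char_offset(link) - char_offset(title)))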

Python BeautifulSoup - Add Tags around found keyword

I am currently working on a project in which I want to allow regex search in/on a huge set of HTML files.
After first pinpointing the files of my interest I now want to highlight the found keyword!
Using BeautifulSoup I can determine the node in which my keyword is found. One thing I do is change the color of the whole parent.
However, I would also like to add my own <span> tags around just the keyword(s) I found.
Determining the position and such is no big deal using the find() functions provided by BeautifulSoup. But adding my tags around regular text seems to be impossible?
# match = keyword found by another regex
# node = the node I found using the soup.find(text=myRE)
node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
This way I only add mere text and not a proper Tag, as the document is not freshly parsed, which I hope to avoid!
I hope my problem became a little clear :)
Here's a simple example showing one way to do it:
import re
from bs4 import BeautifulSoup as Soup
html = '''
<html><body><p>This is a paragraph</p></body></html>
'''
(1) store the text and empty the tag
soup = Soup(html)
text = soup.p.string
soup.p.clear()
print soup
(2) get start and end positions of the words to be boldened (apologies for my English)
match = re.search(r'\ba\b', text)
start, end = match.start(), match.end()
(3) split the text and add the first part
soup.p.append(text[:start])
print soup
(4) create a tag, add the relevant text to it and append it to the parent
b = soup.new_tag('b')
b.append(text[start:end])
soup.p.append(b)
print soup
(5) append the rest of the text
soup.p.append(text[end:])
print soup
Here is the output from above:
<html><body><p></p></body></html>
<html><body><p>This is </p></body></html>
<html><body><p>This is <b>a</b></p></body></html>
<html><body><p>This is <b>a</b> paragraph</p></body></html>
If you add the text...
my_tag = node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
...and pass it through BeautifulSoup once more
new_soup = BeautifulSoup(my_tag)
it should be classified as a BS tag object and available for parsing.
You could apply these changes to the original mass of text and run it through as a whole, to avoid repetition.
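A minimal sketch of that re-parse idea with current bs4 (the <myspan> tag and the sample markup are just placeholders):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>some text with keyword inside</p>", "html.parser")
node = soup.find(text=lambda t: t and "keyword" in t)      # the NavigableString holding the match
wrapped = str(node).replace("keyword", "<myspan>keyword</myspan>")
fragment = BeautifulSoup(wrapped, "html.parser")           # re-parse so <myspan> becomes a real Tag
for piece in reversed(list(fragment.contents)):            # splice the parsed pieces in place of the old string
    node.insert_after(piece.extract())
node.extract()
print(soup)                                                # <p>some text with <myspan>keyword</myspan> inside</p>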
EDIT:
From the docs:
# Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>

Unscriptable Int Error for String Slice

I'm writing a webscraper and I have a table full of links to .pdf files that I want to download, save, and later analyze. I was using beautiful soup and I had soup find all the links. They are normally beautiful soup tag objects, but I've turned them into strings. The string is actually a bunch of junk with the link text buried in the middle. I want to cut out that junk and just leave the link. Then I will turn these into a list and have python download them later. (My plan is for python to keep a list of the pdf link names to keep track of what it's downloaded and then it can name the files according to those link names or a portion thereof).
But the .pdfs come in variable name-lengths, e.g.:
I_am_the_first_file.pdf
And_I_am_the_seond_file.pdf
and as they exist in the table, they have a bunch of junk text:
a href = ://blah/blah/blah/I_am_the_first_file.pdf[plus other annotation stuff that gets into my string accidentally]
a href = ://blah/blah/blah/And_I_am_the_seond_file.pdf[plus other annotation stuff that gets into my string accidentally]
So I want to cut ("slice") the front part and the last part off of the string and just leave the string that points to my url (so what follows is the desired output for my program):
://blah/blah/blah/I_am_the_first_file.pdf
://blah/blah/blah/And_I_am_the_seond_file.pdf
As you can see, though, the second file has more characters in the string than the first. So I can't do:
string[9:40]
or whatever because that would work for the first file but not for the second.
So I'm trying to come up with a variable for the end of the string slice, like so:
string[9:x]
wherein x is the location in the string that ends in '.pdf' (my thought was to use the string.index('.pdf') function to do this).
But that fails, because I get an error when trying to use a variable to do this:
("TypeError: 'int' object is unsubscriptable")
Probably there's an easy answer and a better way to do this other than messing with strings, but you guys are way smarter than me and I figured you'd know straight off.
Here's my full code so far:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id = 'searchResults')
#"search results" is just what the table i was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
#this says find all the links and looop over them
pdf_link_string = str(pdf_link)
#turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
if 'pdf' in pdf_link_string:
#some links in the table are .html and I don't want those, I just want the pdfs.
end_of_link = pdf_link_string.index('.pdf')
#I want to know where the .pdf file extension ends because that's the end of the link, so I'll slice backward from there
just_the_link = end_of_link[9:end_of_link]
#here, the first 9 characters are junk "a href = yadda yadda yadda". So I'm setting a variable that starts just after that junk and goes to the .pdf (I realize that I will actualy have to do .pdf + 3 or something to actually get to the end of string, but this makes it easier for now).
print just_the_link
#I debug by print statement because I'm an amatuer
The line (second from the bottom) that reads:
just_the_link = end_of_link[9:end_of_link]
returns an error (TypeError: 'int' object is unsubscriptable)
also, the ":" should be hyper text transfer protocol colon, but it won't let me post that b/c newbs can't post more than 2 links so I took them out.
just_the_link = end_of_link[9:end_of_link]
This is your problem, just like the error message says. end_of_link is an integer -- the index of ".pdf" in pdf_link_string, which you calculated in the preceding line. So naturally you can't slice it. You want to slice pdf_link_string.
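In other words, the fix is roughly this (the 9 and the +4 follow the question's own comments; +4 keeps the ".pdf" itself):
end_of_link = pdf_link_string.index('.pdf')
just_the_link = pdf_link_string[9:end_of_link + 4]    # slice the string, not the integer index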
Sounds like a job for regular expressions:
import urllib, urllib2, re
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id='searchResults')
#"search results" is just what the table I was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
    #this says find all the links and loop over them
    pdf_link_string = str(pdf_link)
    #turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
    if 'pdf' in pdf_link_string:
        pdfURLPattern = re.compile(r"""://(\w+/)+\S+\.pdf""")
        pdfURLMatch = pdfURLPattern.search(pdf_link_string)
        #If there is no match then search() returns None; otherwise group(0) returns the whole matched URL of interest.
        if pdfURLMatch:
            print pdfURLMatch.group(0)
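As an aside, if all you need is the URL, you can skip the string handling entirely and read the href attribute straight off the Tag object. A sketch reusing the names from the loop above:
for pdf_link in table_with_my_pdf_links.findAll('a'):
    href = pdf_link.get('href')        # the attribute value by itself, no surrounding markup
    if href and href.endswith('.pdf'):
        print href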

Using Beautiful Soup Python module to replace tags with plain text

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it.
I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let's take the html code below as an example:
<div id="abc">
some long text goes here and hopefully it
will get picked up by the parser as content
</div>
results = soup.findAll(text=lambda(x): len(x) > 20)
When I use the above code to get at the long text, it breaks (the identified text will start from 'and hopefully..') at the tags. So I tried to replace the tag with plain text as follows:
anchors = soup.findAll('a')
for a in anchors:
a.replaceWith('plain text')
The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.
So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup? If not, what would be the best way to do so?
Thanks for your suggestions!
Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website and I ran into issues that puzzle me.
import urllib
from BeautifulSoup import BeautifulSoup
page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)
anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
for a in anchors:
    if (a.string is None): a.string = ''
    if (a.previousSibling is None and a.nextSibling is None):
        a.previousSibling = a.string
    elif (a.previousSibling is None and a.nextSibling is not None):
        a.nextSibling.replaceWith(a.string + a.nextSibling)
    elif (a.previousSibling is not None and a.nextSibling is None):
        a.previousSibling.replaceWith(a.previousSibling + a.string)
    else:
        a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
        a.nextSibling.extract()
    i = i+1
When I run the above code, I get the following error:
0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
File "parselink.py", line 44, in <module>
a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'
When I look at the HTML code, 'Stay up to date..' does not have any previous sibling (I did not know how previousSibling worked until I saw Alex's code, and based on my testing it looks like it is looking for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.
Could you please let me know what I am doing wrong?
-ecognium
An approach that works for your specific example is:
from BeautifulSoup import BeautifulSoup
ht = '''
<div id="abc">
 some long text goes <a>here </a> and hopefully it
will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)
anchors = soup.findAll('a')
for a in anchors:
    a.previousSibling.replaceWith(a.previousSibling + a.string)
results = soup.findAll(text=lambda(x): len(x) > 20)
print results
which emits
$ python bs.py
[u'\n some long text goes here ', u' and hopefully it \n will get picked up by the parser as content\n']
Of course, you'll probably need to take a bit more care, i.e., what if there's no a.string, or if a.previousSibling is None -- you'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact you may want to also merge the next sibling if it's a string -- not sure how that plays with your heuristic len(x) > 20, but say for example that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you'd want to pick up the lot as a "23-character string"? I can't tell because I don't understand the motivation for your heuristic.)
I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!
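A sketch of that corner-case handling in bs4 syntax (the helper name and the sample markup are made up; adapt the method names back to the old BeautifulSoup API if needed):
from bs4 import BeautifulSoup, NavigableString

def merge_anchor_text(a):
    # Fold an <a> tag's text into the surrounding text nodes, then drop the tag.
    text = a.get_text()                          # works even when a.string is None
    nxt = a.next_sibling
    if isinstance(nxt, NavigableString):         # also merge the text that follows the tag
        text = text + nxt
        nxt.extract()
    prev = a.previous_sibling
    if isinstance(prev, NavigableString):        # merge into the text that precedes the tag
        prev.replace_with(prev + text)
        a.extract()
    else:
        a.replace_with(text)                     # no preceding text node: just swap the tag for its text

soup = BeautifulSoup('<div id="abc"> some long text goes <a href="/">here</a> and'
                     ' hopefully it will get picked up by the parser as content</div>', 'html.parser')
for a in soup.find_all('a'):
    merge_anchor_text(a)
print(soup.find_all(text=lambda t: len(t) > 20))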
When I tried to flatten tags in the document, so that the tags' entire content would be pulled up to their parent node in place (I wanted to reduce the content of a p tag, with all sub-paragraphs, lists, divs, spans, etc. inside, but get rid of the style and font tags and some horrible word-to-html generator remnants), I found it rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following method:
import re

def flatten_tags(s, tags):
    pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>" % (isinstance(tags, basestring) and tags or "|".join(tags)))
    return pattern.sub("", s)
The tags argument is either a single tag or a list of tags to be flattened.
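For example (made-up input):
dirty = '<p><font face="Arial">some <span style="color:red">styled</span> text</font></p>'
print flatten_tags(dirty, ["font", "span"])
# prints: <p>some styled text</p>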

Extract data from a website's list, without superfluous tags

Working code: Google dictionary lookup via python and beautiful soup -> simply execute and enter a word.
I've quite simply extracted the first definition from a specific list item. However to get plain data, I've had to split my data at the line break, and then strip it to remove the superfluous list tag.
My question is, is there a method to extract the data contained within a specific list without doing my above string manipulation - perhaps a function in beautiful soup that I have yet to see?
This is the relevant section of code:
# Retrieve HTML and parse with BeautifulSoup.
doc = userAgentSwitcher().open(queryURL).read()
soup = BeautifulSoup(doc)
# Extract the first list item -> and encode it.
definition = soup('li', limit=2)[0].encode('utf-8')
# Format the return as word:definition removing superfluous data.
print word + " : " + definition.split("<br />")[0].strip("<li>")
I think you are looking for findAll(text=True); this will extract the text from the tags:
definitions = soup('ul')[0].findAll(text=True)
This will return a list of all the text contents, broken at the tag boundaries.
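In terms of the snippet in the question, that would look something like this (a sketch, assuming the definition is the first text node inside the <li>):
texts = soup('li', limit=2)[0].findAll(text=True)
print word + " : " + texts[0].strip().encode('utf-8')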
