How to find distance between 2 elements in html using beautifulsoup - python

The goal is to find the distance between 2 tags, e.g. the first external a href attribute and the title tag, using BeautifulSoup.
html = '<title>stackoverflow</title><a href="http://www.example.com">test</a>'
soup = BeautifulSoup(html)
ext_link = soup.find('a',href=re.compile("^https?:",re.IGNORECASE))
title = soup.title
dist = abs_distance_between_tags(ext_link,title)
print dist
30
How would I do this without using regex?
Note that the order of the tags may be different, and there may be more than one match (although we are only taking the first using find()).
I could not find a method in BeautifulSoup that returns the locations/positions in the html of the matches.

As you noted, it does not seem like you can get the exact character position of an element in BeautifulSoup.
Maybe this answer can help you along:
AFAIK, lxml only offers sourceline, which is insufficient. Cf. the API docs: "Original line number as found by the parser or None if unknown."
But expat provides the exact offset in the file: CurrentByteIndex.
Fetched from the start_element handler, it returns the tag's start offset (i.e. the position of its opening '<').
Fetched from the char_data handler, it returns the start offset of the character data.
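For example, here is a minimal sketch using the standard library's expat bindings. Note that expat is a strict XML parser, so the markup has to be well formed; the sample markup and tag names below are placeholders rather than the asker's actual document:
import xml.parsers.expat

html = '<html><title>stackoverflow</title><a href="http://example.com">test</a></html>'
offsets = {}

def start_element(name, attrs):
    # CurrentByteIndex is the byte offset of the '<' that opens the current tag
    offsets.setdefault(name, parser.CurrentByteIndex)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(html, True)

print(abs(offsets['a'] - offsets['title']))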

Beautiful Soup 4 now supports Tag.sourceline and Tag.sourcepos.
Reference: https://beautiful-soup-4.readthedocs.io/en/latest/#line-numbers
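For example, a minimal sketch with the html.parser builder (assuming a Beautiful Soup release new enough to expose these attributes); the markup and URL are placeholders:
import re
from bs4 import BeautifulSoup

html = '<title>stackoverflow</title>\n<a href="http://example.com">test</a>'
soup = BeautifulSoup(html, 'html.parser')

ext_link = soup.find('a', href=re.compile(r'^https?:', re.IGNORECASE))
title = soup.title

# line number and position of each tag's start within its line
print(title.sourceline, title.sourcepos)
print(ext_link.sourceline, ext_link.sourcepos)
Note that this gives a line/column pair rather than an absolute character offset, so you would still have to map it back onto the original document (e.g. using the line lengths) to compute an exact distance.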

Related

BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div", {"class", "accordions"}) to get just the fixtures list (and nothing else); however, now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag: <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that the second positional argument needs to be a dictionary of attribute filters; a set like {"role", "article"} is not interpreted that way. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, equivalently:
matches = soup.find_all('article', role='article')
But, both these methods give some extra article tags that don't have the Arsenal fixtures. So, if you want to find them using /fixture/arsenal you can use CSS selectors. (Using find_all with a plain string won't work, as you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
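For instance, keyword arguments also accept compiled regular expressions, which gives another way to get the partial match without CSS selectors (a sketch, reusing the soup object from above):
import re

matches = soup.find_all('article', about=re.compile(r'^/fixture/arsenal'))
print(len(matches))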

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently learning the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract the link at a position given by user input, open that link, and repeat the process for some number of iterations, with the links identified through their anchor tags.
While I am able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries?
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList = list()
loc = ''
count = 0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList just to locate the link at the right position. As this is inefficient, I would like to know if I could locate the URL directly without using lists. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring

# assuming the page content has been fetched with requests as `r`
tree = fromstring(r.content)
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]

xpath how to format path

I would like to get the src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from this webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0]
but I can't. No matter how I format the xpath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/@data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/@data-src)[14]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request is not running JS, and is doing no such updates. Different content, different address needed--if the address is static and fragile, at any rate.
Part of the updates here seem to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of requests and beautiful soup libraries. They both are wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

python lxml.html: pull preceding text in html docstring

I'm trying to identify a given <table> element based on the text that precedes it in the html document.
My current method is to stringify each html table element and search for its text index within the file text:
filing_text = request.urlopen(url).read()
# some text cleanup here to make lxml's output match the .read() content
ref_text = lxml.html.tostring(filing_text).upper().replace(b" ", b"&NBSP;")
tbl_count = 0
for tbl in self.filing_tree.iterfind('.//table'):
    text_ind = ref_text.find(lxml.html.tostring(tbl).upper().replace(b" ", b"&NBSP;"))
    start_text = lxml.html.tostring(tbl)[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))
Given the starting index of the table element, I can then search some number of preceding characters for text that may help to identify the table's content.
Two concerns with this approach:
Since the tag density (i.e., how much of the filing text is markup versus content) varies from url to url, it's hard to standardize my search range in the preceding text: 2500 characters of html may encompass 300 characters of actual content, or 2,000.
Serializing and searching once per table element seems rather inefficient. It adds more overhead to a webscraping workflow than I'd like
Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.
Use Beautiful Soup. Just a snippet to get you started:
>>> from bs4 import BeautifulSoup
>>> stupid_html = "<html><p> Hello </p><table> </table></html>"
>>> soup = BeautifulSoup(stupid_html)
>>> list_of_tables = soup.find_all("table")
>>> print( list_of_tables[0].previous )
Hello
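If you would rather stay in lxml, a minimal sketch of the same idea uses the XPath preceding axis, which collects everything that occurs before a given element in document order (the markup here is just a placeholder):
import lxml.html

doc = lxml.html.fromstring('<html><p>Some identifying caption</p><table><tr><td>x</td></tr></table></html>')

for tbl in doc.iterfind('.//table'):
    # all text nodes that occur before this table, in document order
    preceding = ''.join(tbl.xpath('preceding::text()'))
    print(preceding[-50:])  # the last 50 characters of content before the table
This avoids serializing each table and searching the raw file text, at the cost of walking the tree once per table.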

Unscriptable Int Error for String Slice

I'm writing a webscraper and I have a table full of links to .pdf files that I want to download, save, and later analyze. I was using beautiful soup and I had soup find all the links. They are normally beautiful soup tag objects, but I've turned them into strings. The string is actually a bunch of junk with the link text buried in the middle. I want to cut out that junk and just leave the link. Then I will turn these into a list and have python download them later. (My plan is for python to keep a list of the pdf link names to keep track of what it's downloaded and then it can name the files according to those link names or a portion thereof).
But the .pdfs come in variable name-lengths, e.g.:
I_am_the_first_file.pdf
And_I_am_the_seond_file.pdf
and as they exist in the table, they have a bunch of junk text:
a href = ://blah/blah/blah/I_am_the_first_file.pdf[plus other annotation stuff that gets into my string accidentally]
a href = ://blah/blah/blah/And_I_am_the_seond_file.pdf[plus other annotation stuff that gets into my string accidentally]
So I want to cut ("slice") the front part and the last part off of the string and just leave the string that points to my url (so what follows is the desired output for my program):
://blah/blah/blah/I_am_the_first_file.pdf
://blah/blah/blah/And_I_am_the_seond_file.pdf
As you can see, though, the second file has more characters in the string than the first. So I can't do:
string[9:40]
or whatever because that would work for the first file but not for the second.
So I'm trying to come up with a variable for the end of the string slice, like so:
string[9:x]
wherein x is the location in the string where '.pdf' ends (my thought was to use the string.index('.pdf') function to find it).
But that fails, because I get an error when I try to use a variable to do this
("TypeError: 'int' object is unsubscriptable")
Probably there's an easy answer and a better way to do this other than messing with strings, but you guys are way smarter than me and I figured you'd know straight off.
Here's my full code so far:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id='searchResults')
# "searchResults" is just what the table I was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
    # this says find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
    if 'pdf' in pdf_link_string:
        # some links in the table are .html and I don't want those, I just want the pdfs.
        end_of_link = pdf_link_string.index('.pdf')
        # I want to know where the .pdf file extension ends because that's the end of the link, so I'll slice backward from there
        just_the_link = end_of_link[9:end_of_link]
        # here, the first 9 characters are junk "a href = yadda yadda yadda". So I'm setting a variable that starts just after that junk and goes to the .pdf (I realize that I will actually have to do .pdf + 3 or something to actually get to the end of the string, but this makes it easier for now).
        print just_the_link
        # I debug by print statement because I'm an amateur
The line (second from the bottom) that reads:
just_the_link = end_of_link[9:end_of_link]
returns an error (TypeError: 'int' object is unsubscriptable)
also, the ":" should be hyper text transfer protocol colon, but it won't let me post that b/c newbs can't post more than 2 links so I took them out.
just_the_link = end_of_link[9:end_of_link]
This is your problem, just like the error message says. end_of_link is an integer -- the index of ".pdf" in pdf_link_string, which you calculated in the preceding line. So naturally you can't slice it. You want to slice pdf_link_string.
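For example, a minimal fix along those lines (slicing the string rather than the integer; the + 4 just keeps the '.pdf' extension, and the 9 is the offset from your own code):
end_of_link = pdf_link_string.index('.pdf')
just_the_link = pdf_link_string[9:end_of_link + 4]
In practice you can skip the string mangling entirely and read the attribute straight off the tag object, e.g. pdf_link['href'].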
Sounds like a job for regular expressions:
import urllib, urllib2, re
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id='searchResults')
# "searchResults" is just what the table happened to be called
for pdf_link in table_with_my_pdf_links.findAll('a'):
    # find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects)
    if 'pdf' in pdf_link_string:
        pdfURLPattern = re.compile(r"://(\w+/)+\S+\.pdf")
        # If there is no match then search() returns None; otherwise group(0) is the whole matched URL of interest.
        pdfURLMatch = pdfURLPattern.search(pdf_link_string)
        if pdfURLMatch:
            print pdfURLMatch.group(0)
