BS4 How to get text without using .text? - python

This is the source code layout from the website:
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
I would like to get the street number, route, and city for Google Geocoding. If I do this
>>>article.find('div', {'class': 'address'}).text
'59 Some StreetCity, Zone 1'
It strips out the <br /> and leaves me with no way to split the route from the city. If I do str().replace('<br />', ', '), I then have to somehow convert the result back to whatever type it was before so I can call .text to get the actual text inside the <a href>, which is inefficient. I'd like the functionality that .text uses to extract the text, but without the part where it drops the <br> tags. I couldn't find a file called BeautifulSoup.py in my env, so I'm looking at the BeautifulSoup source code on GitHub, but I can't find a def text in there, and I don't know where else to look.
Update:
articles = page_soup.find('h2', text='Ads').find_next_siblings('article')
for article in articles:
    link = article.find('a')
    br = link.find('br')
    ad_address = br.previous_sibling.strip() + ', ' + br.next_sibling.strip().partition(', Zone ')[0]
    #ad_address = link.br.replace_with(', ').get_text().strip().partition(', Zone ')

You can locate the br delimiter tag and get the siblings around it:
In [4]: br = soup.select_one("div.address > a > br")
In [5]: br.previous_sibling.strip()
Out[5]: u'59 Some Street'
In [6]: br.next_sibling.strip()
Out[6]: u'City, Zone 1'
You may also locate the br element and replace it with a space using replace_with():
In [4]: a = soup.select_one("div.address > a")
In [5]: a.br.replace_with(" ")
In [6]: a.get_text().strip()
Out[6]: u'59 Some Street City, Zone 1'
Or, you can join all text nodes inside the a tag:
In [7]: a = soup.select_one("div.address > a")
In [8]: " ".join(a.find_all(text=True)).strip()
Out[8]: u'59 Some Street City, Zone 1'

Try:
soup.find('div', {'class':'address'}).get_text(separator=u"<br/>").split(u'<br/>')
The separator keyword argument tells get_text() which string to insert between the text nodes it concatenates, so you can split on that string afterwards.
http://omz-software.com/pythonista/docs/ios/beautifulsoup_ref.html
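For the markup in the question, that yields the street and the city as separate pieces once the whitespace-only fragments are dropped; a minimal, self-contained sketch:
from bs4 import BeautifulSoup

html = '''<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
parts = soup.find('div', {'class': 'address'}).get_text(separator='<br/>').split('<br/>')
# get_text() also returns the whitespace-only text nodes around the <a>, so filter them out
parts = [p.strip() for p in parts if p.strip()]
print(parts)  # ['59 Some Street', 'City, Zone 1']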

Try:
links = soup.find_all('a')  # assuming `links` is the list of anchor tags
for link_to_text in links:
    print(link_to_text.get_text())


Python BeautifulSoup and HTML with unusual spaces

I am trying to update product prices by scraping them from a website. However, I have run into some unusual HTML formatting that is giving me trouble. I am trying to return the price without the spaces, but currently my code brings in all the spaces.
<p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p>
I am trying the following:
soup = BeautifulSoup(web_page, "html.parser")
for product in soup.find_all('div', class_="product-wrapper"):
    # Get product name
    product_title = product.find('p', class_='h4 product__title').text
    # Get product price
    product_price = product.find('p', class_='product__price').text
    product_price.strip()
But unfortunately the .strip() method does not work, and the script returns the prices with a bunch of spaces and "Regular price".
Any ideas on how I can get exactly "£9.99" ?
The reason this does not work is because the p element contains two children:
A span element
A text node
When you call .text on the parent p element, you get the text of both children concatenated, which is why "Regular price" shows up. In addition, the text node is wrapped in quote characters, and strip() with no arguments stops at those quotes and leaves the inner spaces alone.
To solve the problem you must first isolate the text node from the span element, which you can do by iterating over the p tag's .children.
Finally, you can tell .strip() which characters to remove.
So, assuming the structure inside the p element is always like this, we can do the following:
from bs4 import BeautifulSoup

data = """
<div>
<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price')
    raw_data = list(product_price.children)[-1]
    # Remove spaces, newlines and quotes
    cleaned = raw_data.strip(' \n"')
    print(repr(cleaned))
You can use .contents, take the last element, and then split the string on the quote character ("):
from bs4 import BeautifulSoup
data='''<p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p>'''
soup=BeautifulSoup(data,'html.parser')
items=soup.select_one('.product__price').contents
print(items[-1].split('"')[1].strip())
You should try this:
product_price = product_price.strip().replace(" ", "")
An alternative approach: how about regex?
from bs4 import BeautifulSoup
import re
html = """<div><p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p></div>"""
soup = BeautifulSoup(html, "html.parser")
for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price').text
    # Regex
    price = re.search(r"(£\d*\.?\d*)", product_price)
    # Print only when there is a match
    if price:
        print(price[0])

How can I parse in the onclick() text in Python3 BeautifulSoup?

I've got the following HTML:
<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>
I want to get the text passed to populAddr and to GetInfoAndRoundsFor, i.e. the strings "14 PLACE NAME TOWN POSTCODE" and "123456789123" respectively.
So far I have tried:
button_click_text = address.find('button').get('onclick')
Which gets me the full onClick string which is great. Is the only way to get the specific sub strings doing a bit of slicing?
I've tried this:
string = """changeText('uprnButton1','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');getobject('divAddress').innerHTML = '';GetInfoAndRoundsFor('123456789123','SWN');"""
string_before = "populAddr('"
string_after = "');getobject"
print(string[string.index(string_before)+len(string_before):string.index(string_after)])
Which does work but looks like an effing mess. Is there a best practice here?
Actually just thought this might be better:
string_split = string.split("'")
print(string_split[5])
print(string_split[11])
You should be able to use the following two lazy regex patterns:
import re
html ='''<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>'''
p1 =re.compile(r"populAddr\('(.*?)'")
p2 = re.compile(r"GetInfoAndRoundsFor\('(.*?)'")
print(p1.findall(html)[0])
print(p2.findall(html)[0])
Explanation for the first pattern (the same principle applies to both): populAddr\(' matches the literal call up to its opening quote, and (.*?) lazily captures everything up to the next single quote.
You can replace the html variable with response.text or button_click_text, where response.text is the .text of the requests response.
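If you already have the onclick value, you can run the same patterns against it directly; a minimal sketch, assuming address is the bs4 Tag wrapping the <td> as in the question:
import re

# `address` is assumed to be the bs4 Tag for the <td>, as in the question
button_click_text = address.find('button').get('onclick')
addr = re.search(r"populAddr\('(.*?)'", button_click_text).group(1)
uprn = re.search(r"GetInfoAndRoundsFor\('(.*?)'", button_click_text).group(1)
print(addr)  # 14 PLACE NAME TOWN POSTCODE
print(uprn)  # 123456789123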
I found this to be the quickest way of doing it, and because the HTML could be switched around, I put in a couple of checks to make sure the house number was what I searched for and that the uprn was actually a number. If either of these is false, I know the code on the site has probably been tweaked:
string_split = string.split("'")
address = string_split[5]
uprn = string_split[11]
# validate the address starts with the correct house number
print(address.startswith('15 '))
# validate the uprn contains a number
print(uprn[0:12].isdigit())
Here is my try:
In [1]: d = """
...: <td id="uprnButton0">
...: <button type="button"
...: onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
...: getobject('divAddress').innerHTML = '';
...: GetInfoAndRoundsFor('123456789123','SWN');"
...: title="Get Calendar for this address"
...: >Show
...: </button>
...: </td>
...: """
In [2]: from bs4 import BeautifulSoup as bs
In [3]: soup = bs(d,"lxml")
In [4]: button_click_text = soup.find('button').get('onclick')
In [5]: button_click_text
Out[5]: "changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');\n getobject('divAddress').innerHTML = '';\n GetInfoAndRoundsFor('123456789123','SWN');"
In [6]: import re
...: regex = re.compile(r"'.*?'")
...: out = regex.findall(button_click_text)
...: s1 = out[2][1:-1]
...: s2 = out[-2][1:-1]
In [7]: s1
Out[7]: '14 PLACE NAME TOWN POSTCODE'
In [8]: s2
Out[8]: '123456789123'
soup.find('button') returns an object representing the first button element, and soup.find('button')['onclick'] returns the string value of the onclick attribute.
Because of this, there isn't a convenient way of fetching the value of populAddr, other than using split.
I would recommend splitting by the following:
address = address.find('button').get('onclick').split('populAddr(')[1].split(')')[0]
If you split by populAddr(, you know exactly where the address is located: it always sits at the start of the second piece, and the follow-up split on ) always leaves it at index 0.
If you split by ', you will have to manually review every page you scrape in order to verify that the address will end up in index 5.
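The same idea extends to the UPRN; a minimal sketch, assuming the onclick string from the question (address is the bs4 Tag wrapping the <td>):
onclick = address.find('button').get('onclick')
# split on the literal call names so the result doesn't depend on counting quotes
addr = onclick.split("populAddr('")[1].split("')")[0]           # 14 PLACE NAME TOWN POSTCODE
uprn = onclick.split("GetInfoAndRoundsFor('")[1].split("'")[0]  # 123456789123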

Using Selenium to find indexed element within a div

I'm scraping the front-end of a webpage and having difficulty getting the HTML text of a div within a div.
Basically, I'm simulating clicks - one for each event listed on the page. From there, I want to scrape the date and time of the event, as well as the location of the event.
Here's an example of one of the pages I'm trying to scrape:
https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event
<div class="eventInfoContainer-54d5deb3">
<div class="lineupContainer-570750d2">
<div class="eventInfoContainer-9e539994">
<img src="assets.bandsintown.com/images.clock.svg">
<div>Sunday, April 21st, 2019</div> <!––***––>
<div class="eventInfoContainer-50768f6d">5:00PM</div><!––***––>
</div>
<div class="eventInfoContainer-1a68a0e1">
<img src="assets.bandsintown.com/images.clock.svg">
<div class="eventInfoContainer-2d9f07df">
<div>Aura Nightclub</div> <!––***––>
<div>283 1st St., San Jose, CA 95113</div> <!––***––>
</div>
I've marked the elements I want to extract with asterisks - the date, time, venue, and address. Here's my code:
base_url = 'https://www.bandsintown.com/?came_from=257&page='
events = []
eventContainerBucket = []
for i in range(1, 2):
    driver.get(base_url + str(i))
    # get events links
    event_list = driver.find_elements_by_css_selector('div[class^=eventList-] a[class^=event-]')
    # collect href attribute of events in even_list
    events.extend(list(event.get_attribute("href") for event in event_list))
# iterate through all events and open them.
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0]
    print "Event information: " + uniqueEventContainer.text
This prints:
Event information: Sunday, April 21st, 2019
3:00 PM
San Francisco Brewing Co.
3150 Polk St, Sf, CA 94109
View All The Fourth Son Tour Dates
My issue is that I can't access the nested eventInfoContainer divs individually. For example, the 'date' div is at position [1], as it is the second element (after img) in its parent div "eventInfoContainer-9e539994". The parent div "eventInfoContainer-9e539994" is likewise at position [1], as it is the second element in its parent div "eventInfoContainer-54d5deb3" (after "lineupContainer").
By this logic, shouldn't I be able to access the date text with this code (accessing the element at position 1, whose parent is at position 1, within the container at position 0)?
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1][1]
I get the following error:
TypeError: 'WebElement' object does not support indexing
When you index into the list of WebElements (which is what find_elements_by_css_selector('div[class^=eventInfoContainer-]') returns) you get a WebElement; you cannot index into that any further. You can, however, split the text of a WebElement to generate a list for further indexing.
If there is a regular structure across pages, you could load the HTML for the div into BeautifulSoup. Example URL:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
soup = bs(d.find_element_by_css_selector('[class^=eventInfoContainer-]').get_attribute('outerHTML'), 'lxml')
date = soup.select_one('img + div').text
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
address = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div + div').text
print(date, time, venue, address)
If line breaks were consistent:
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
array = containers[0].text.split('\n')
date = array[3]
time = array[4]
venue = array[5]
address = array[6]
print(date, time, venue, address)
With index and split:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
date_time = containers[1].text.split('\n')
i_date = date_time[0]
i_time = date_time[1]
venue_address = containers[3].text.split('\n')
venue = venue_address[0]
address = venue_address[1]
print(i_date, i_time, venue, address)
As the error suggests, a WebElement does not support indexing. What you are confusing it with is a list.
Here
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
This code returns a list of WebElements. That is why you can access an individual WebElement using a list index. But a WebElement doesn't support indexing into another WebElement; you are not getting a list of lists.
That is why
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0] works, but driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1] doesn't.
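If you need to drill down from a WebElement, call a find method on the element itself instead of indexing into it; a minimal sketch using the same (old-style) selenium API as the question:
containers = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
# query *within* the first container rather than indexing into it
inner_divs = containers[0].find_elements_by_css_selector('div')
print(inner_divs[0].text)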
Edit (answer for the question in the comment):
It is not selenium code. The code posted in the answer by QHarr uses BeautifulSoup, which is a Python package for parsing HTML and XML documents.
BeautifulSoup has a .select() method which uses CSS selector against a parsed document and return all the matching elements.
There’s also a method called select_one(), which finds only the first tag that matches a selector.
In the code,
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
It gets the first element found by the given CSS selector and returns the text inside the tag. The first line finds an img tag, then finds its immediate sibling div tag, then the next sibling div tag after that.
In the second line, it finds the third sibling tag whose class starts with eventInfoContainer-, then finds its child div, and then the child of that div.
Check out CSS selectors
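To see how the combinators used above behave, here is a tiny standalone example (a minimal sketch, not the actual Bandsintown markup):
from bs4 import BeautifulSoup

html = '<div><img src="clock.svg"/><div>date</div><div>time</div></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('img + div').text)        # 'date' (div immediately after the img)
print(soup.select_one('img + div + div').text)  # 'time' (the div after that one)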
This could be done directly using selenium:
date = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='clock.svg'] + div")
time = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'] + div + div")
venue = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div")
address = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div:nth-of-type(2)")
I've used different CSS selectors, but they still select the same elements.
I'm not sure about BeautifulSoup, but in QHarr's answer the date selector would return a different value than the intended one if used with selenium.

How to use beautifulsoup to get node text and children tag separately

My html is like:
<a class="title" href="">
<b>name
<span class="c-gray">position</span>
</b>
</a>
I want to get the name and position strings separately, so my script is like:
lia = soup.find('a',attrs={'class':'title'})
pos = lia.find('span').get_text()
lia.find('span').replace_with('')
name = lia.get_text()
print name.strip()+','+pos
Although it can do the job, I don't think it is a beautiful way. Any brighter ideas?
You can use the .contents method this way:
person = lia.find('b').contents
name = person[0].strip()
position = person[1].text
The idea is to locate the a element, then, for the name - get the first text node from an inner b element and, for the position - get the span element's text:
>>> a = soup.find("a", class_="title")
>>> name, position = a.b.find(text=True).strip(), a.b.span.get_text(strip=True)
>>> name, position
(u'name', u'position')

Pulling specific (text) spaced between HTML tag during BeautifulSoup

I'm trying to pull something that is categorized as (text) when I look at it in "Inspect Element" mode:
<div class="sammy"
<div class = "sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
"
Old Oak Tap" <---**THIS IS THE TEXT I WANT**
<br>
<em>Read more</em>
</a>
</div>
</div>
This is my code thus far, with the line in question being the list comprehension at the end:
STEM_URL = 'http://www.chicagomag.com'
BASE_URL = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
soup = BeautifulSoup(urlopen(BASE_URL).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = []
for div in sammies:
    if div.a["href"].startswith("http"):
        sammy_urls.append(div.a["href"])
    else:
        sammy_urls.append(STEM_URL + div.a["href"])
    restaurant_names = [x for x in div.a.content]
I've tried div.a.br.content, div.br, but can't seem to get it right.
If suggesting a RegEx way, I'd also really appreciate a non-RegEx way if possible.
Locate the b element for every listing using a CSS selector and find the next text sibling:
for b in soup.select("div.sammy > div.sammyListing > a > b"):
    print b.find_next_sibling(text=True).strip()
Demo:
In [1]: from urllib2 import urlopen
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(urlopen('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'))
In [4]: for b in soup.select("div.sammy > div.sammyListing > a > b"):
...: print b.find_next_sibling(text=True).strip()
...:
Old Oak Tap
Au Cheval
...
The Goddess and Grocer
Zenwich
Toni Patisserie
Phoebe’s Bakery
