I've got the following HTML:
<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>
I want to get the text in populAddr and in GetInfoAndRoundsFor i.e. the strings "14 PLACE NAME TOWN POSTCODE" and "123456789123" respectively.
So far I have tried:
button_click_text = address.find('button').get('onclick')
This gets me the full onclick string, which is great. Is the only way to get the specific substrings to do a bit of slicing?
I've tried this:
string = """changeText('uprnButton1','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');getobject('divAddress').innerHTML = '';GetInfoAndRoundsFor('123456789123','SWN');"""
string_before = "populAddr('"
string_after = "');getobject"
print(string[string.index(string_before)+len(string_before):string.index(string_after)])
This does work, but it looks like a mess. Is there a best practice here?
Actually just thought this might be better:
string_split = string.split("'")
print(string_split[5])
print(string_split[11])
You should be able to use the following two lazy (non-greedy) regex patterns:
import re
html = '''<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>'''
p1 = re.compile(r"populAddr\('(.*?)'")
p2 = re.compile(r"GetInfoAndRoundsFor\('(.*?)'")
print(p1.findall(html)[0])
print(p2.findall(html)[0])
Explanation for the first pattern (the same principle applies to both):
You can replace the html variable with response.text or button_click_text, where response.text is the .text of the requests response.
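Put together as a self-contained sketch (using html.parser and the snippet from the question; re.search with group(1) is an equivalent alternative to findall):

```python
import re
from bs4 import BeautifulSoup

html = '''<td id="uprnButton0">
<button type="button"
onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
getobject('divAddress').innerHTML = '';
GetInfoAndRoundsFor('123456789123','SWN');"
title="Get Calendar for this address"
>Show
</button>
</td>'''

soup = BeautifulSoup(html, 'html.parser')
onclick = soup.find('button')['onclick']

# The lazy quantifier stops at the first closing quote
address = re.search(r"populAddr\('(.*?)'", onclick).group(1)
uprn = re.search(r"GetInfoAndRoundsFor\('(.*?)'", onclick).group(1)
print(address)  # 14 PLACE NAME TOWN POSTCODE
print(uprn)     # 123456789123
```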
I found this to be the quickest way of doing it. Because the site's HTML could change, I put in a couple of checks: that the house number is the one I searched for, and that the uprn is actually a number. If either check fails, I know the code on the site has probably been tweaked:
string_split = string.split("'")
address = string_split[5]
uprn = string_split[11]
# validate address starts with the correct house number
print(address.startswith('14 '))
# validate uprn is a number
print(uprn[0:12].isdigit())
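A runnable sketch of the split-and-validate approach; the onclick string and the expected house number 14 come from the question's example, so adjust them to whatever you actually searched for:

```python
# The onclick attribute string from the question's button
onclick = ("changeText('uprnButton0','Loading');"
           "populAddr('14 PLACE NAME TOWN POSTCODE');"
           "getobject('divAddress').innerHTML = '';"
           "GetInfoAndRoundsFor('123456789123','SWN');")

parts = onclick.split("'")
address = parts[5]   # 14 PLACE NAME TOWN POSTCODE
uprn = parts[11]     # 123456789123

# Sanity checks: fail loudly if the site's markup has been tweaked
assert address.startswith('14 '), "unexpected address format"
assert uprn.isdigit(), "UPRN is not numeric"
```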
This is my try:
In [1]: d = """
...: <td id="uprnButton0">
...: <button type="button"
...: onclick="changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');
...: getobject('divAddress').innerHTML = '';
...: GetInfoAndRoundsFor('123456789123','SWN');"
...: title="Get Calendar for this address"
...: >Show
...: </button>
...: </td>
...: """
In [2]: from bs4 import BeautifulSoup as bs
In [3]: soup = bs(d,"lxml")
In [4]: button_click_text = soup.find('button').get('onclick')
In [5]: button_click_text
Out[5]: "changeText('uprnButton0','Loading');populAddr('14 PLACE NAME TOWN POSTCODE');\n getobject('divAddress').innerHTML = '';\n GetInfoAndRoundsFor('123456789123','SWN');"
In [6]: import re
...: regex = re.compile(r"'.*?'")
...: out = regex.findall(button_click_text)
...: s1 = out[2][1:-1]
...: s2 = out[-2][1:-1]
In [7]: s1
Out[7]: '14 PLACE NAME TOWN POSTCODE'
In [8]: s2
Out[8]: '123456789123'
soup.find('button') returns an object representing the first button element, and soup.find('button')['onclick'] returns the string value of the onclick attribute.
Because of this, there isn't a convenient way of fetching the value of populAddr, other than using split.
I would recommend splitting by the following:
address = address.find('button').get('onclick').split("populAddr('")[1].split("')")[0]
If you split by populAddr(', you know exactly where the address is located (it is always the piece immediately after the delimiter).
If you split by ', you have to manually review every page you scrape to verify that the address ends up at index 5.
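As a runnable sketch, with the quotes included in the delimiters so they don't end up in the extracted values (onclick here stands in for the string pulled from the attribute):

```python
# The onclick attribute string from the question's button
onclick = ("changeText('uprnButton0','Loading');"
           "populAddr('14 PLACE NAME TOWN POSTCODE');"
           "getobject('divAddress').innerHTML = '';"
           "GetInfoAndRoundsFor('123456789123','SWN');")

# Split on the function name plus opening quote, then cut at the closing quote
address = onclick.split("populAddr('")[1].split("')")[0]
uprn = onclick.split("GetInfoAndRoundsFor('")[1].split("'")[0]
print(address)  # 14 PLACE NAME TOWN POSTCODE
print(uprn)     # 123456789123
```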
Related
My html is like:
<a class="title" href="">
<b>name
<span class="c-gray">position</span>
</b>
</a>
I want to get name and position string separately. So my script is like:
lia = soup.find('a',attrs={'class':'title'})
pos = lia.find('span').get_text()
lia.find('span').replace_with('')
name = lia.get_text()
print name.strip() + ',' + pos
Although it can do the job, I don't think is a beautiful way. Any brighter idea?
You can use .contents method this way:
person = lia.find('b').contents
name = person[0].strip()
position = person[1].text
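Put together with the markup from the question, a minimal runnable version:

```python
from bs4 import BeautifulSoup

html = '''<a class="title" href="">
<b>name
<span class="c-gray">position</span>
</b>
</a>'''

soup = BeautifulSoup(html, 'html.parser')
lia = soup.find('a', attrs={'class': 'title'})

person = lia.find('b').contents
name = person[0].strip()      # first child is the bare text node
position = person[1].text     # second child is the span element
print(name + ',' + position)  # name,position
```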
The idea is to locate the a element, then, for the name - get the first text node from an inner b element and, for the position - get the span element's text:
>>> a = soup.find("a", class_="title")
>>> name, position = a.b.find(text=True).strip(), a.b.span.get_text(strip=True)
>>> name, position
(u'name', u'position')
This is the source code layout from the website:
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
I would like to get the street number, route, and city for Google Geocoding. If I do this
>>> article.find('div', {'class': 'address'}).text
'59 Some StreetCity, Zone 1'
It takes away the <br /> and I'm left with no way to split the route from the city. If I do str().replace('<br />', ', '), I then have to somehow convert the result back to whatever type it was before so I can call .text to get the actual text between the <a href> tags, which is inefficient.
I'd like to use the functionality that .text uses to get the actual text, but without the part where it removes the <br> tags. I couldn't find a file called BeautifulSoup.py in my env, so I'm looking at the BeautifulSoup source code on GitHub, and I can't find a def text in there; I don't know where else to look.
Update:
articles = page_soup.find('h2', text='Ads').find_next_siblings('article')
for article in articles:
    link = article.find('a')
    br = link.find('br')
    ad_address = br.previous_sibling.strip() + ', ' + br.next_sibling.strip().partition(', Zone ')[0]
    # ad_address = link.br.replace_with(', ').get_text().strip().partition(', Zone ')
You can locate the br delimiter tag and get the siblings around it:
In [4]: br = soup.select_one("div.address > a > br")
In [5]: br.previous_sibling.strip()
Out[5]: u'59 Some Street'
In [6]: br.next_sibling.strip()
Out[6]: u'City, Zone 1'
You may also locate the br element and replace it with a space using replace_with():
In [4]: a = soup.select_one("div.address > a")
In [5]: a.br.replace_with(" ")
In [6]: a.get_text().strip()
Out[6]: u'59 Some Street City, Zone 1'
Or, you can join all text nodes inside the a tag:
In [7]: a = soup.select_one("div.address > a")
In [8]: " ".join(a.find_all(text=True)).strip()
Out[8]: u'59 Some Street City, Zone 1'
Try:
soup.find('div', {'class':'address'}).get_text(separator=u"<br/>").split(u'<br/>')
The separator keyword argument defines the string inserted between the text nodes when they are concatenated, so you can split on it afterwards.
http://omz-software.com/pythonista/docs/ios/beautifulsoup_ref.html
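A minimal sketch against the question's markup; the pipe character is an arbitrary separator, and anything guaranteed not to occur in the text works just as well:

```python
from bs4 import BeautifulSoup

html = '''<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Join text nodes with a marker, then split the marker back out
parts = soup.find('div', {'class': 'address'}).get_text(separator='|').split('|')
parts = [p.strip() for p in parts if p.strip()]
print(parts)  # ['59 Some Street', 'City, Zone 1']
```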
Try:
for link_to_text in links:
    print(link_to_text.get_text())
So I'm practicing my scraping and I came across something like this:
<div class="profileDetail">
<div class="profileLabel">Mobile : </div>
021 427 399
</div>
and I need the number outside of the <div> tag:
My code is:
num = soup.find("div",{"class":"profileLabel"}).text
but the output of that is only Mobile :, which is the text inside the <div> tag, not the text after it.
so how do we extract the text outside of the <div> tag?
I would make a reusable function to get the value by label, finding the label by text and getting the next sibling:
import re
def find_by_label(soup, label):
return soup.find("div", text=re.compile(label)).next_sibling
Usage:
find_by_label(soup, "Mobile").strip()  # returns "021 427 399"
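Self-contained against the snippet from the question:

```python
import re
from bs4 import BeautifulSoup

html = '''<div class="profileDetail">
<div class="profileLabel">Mobile : </div>
021 427 399
</div>'''

soup = BeautifulSoup(html, 'html.parser')

def find_by_label(soup, label):
    # The label div's next sibling is the bare text node that follows it
    return soup.find('div', text=re.compile(label)).next_sibling

print(find_by_label(soup, 'Mobile').strip())  # 021 427 399
```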
Try using soup.find("div", {"class": "profileLabel"}).next_sibling; this grabs the next node, which can be either a bs4.Tag or a bs4.NavigableString.
A bs4.NavigableString is what you're trying to get in this case.
elem = soup.find("div",{"class":"profileLabel"}).next_sibling
print type(elem)
# bs4.element.NavigableString
Example:
In [4]: s = bs4.BeautifulSoup('<div> Hello </div>HiThere<p>next_items</p>', 'html5lib')
In [5]: s
Out[5]: <html><head></head><body><div> Hello </div>HiThere<p>next_items</p></body></html>
In [6]: s.div
Out[6]: <div> Hello </div>
In [7]: s.div.next_sibling
Out[7]: u'HiThere'
In [8]: type(s.div.next_sibling)
Out[8]: bs4.element.NavigableString
For future readers who find that this wasn't what they wanted, this may be your answer:
for tags in soup.find_all('div'):
    if "profileLabel" in tags.get('class', []):
        print(tags.contents[0])
I'm trying to pull something that is categorized as (text) when I look at it in "Inspect Element" mode:
<div class="sammy">
<div class = "sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
"
Old Oak Tap" <---**THIS IS THE TEXT I WANT**
<br>
<em>Read more</em>
</a>
</div>
</div>
This is my code thus far, with the line in question being the bottom list comprehension at the end:
STEM_URL = 'http://www.chicagomag.com'
BASE_URL = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
soup = BeautifulSoup(urlopen(BASE_URL).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = []
for div in sammies:
if div.a["href"].startswith("http"):
sammy_urls.append(div.a["href"])
else:
sammy_urls.append(STEM_URL + div.a["href"])
restaurant_names = [x for x in div.a.content]
I've tried div.a.br.content and div.br, but can't seem to get it right.
If you suggest a regex way, I'd also really appreciate a non-regex way if possible.
Locate the b element for every listing using a CSS selector and find the next text sibling:
for b in soup.select("div.sammy > div.sammyListing > a > b"):
print b.find_next_sibling(text=True).strip()
Demo:
In [1]: from urllib2 import urlopen
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(urlopen('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'))
In [4]: for b in soup.select("div.sammy > div.sammyListing > a > b"):
...: print b.find_next_sibling(text=True).strip()
...:
Old Oak Tap
Au Cheval
...
The Goddess and Grocer
Zenwich
Toni Patisserie
Phoebe’s Bakery
Hi, I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
'source2': '\'span\', class_=\'findme2\'',
'source3': '\'div\', class_=\'findme3\'',}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The debugger in PyCharm gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these apostrophes. Is it possible to remove them?
The first argument is a tag name, and your string doesn't contain just a tag name. BeautifulSoup (and Python generally) won't parse a string like that; it cannot guess that you put arbitrary Python syntax in the value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
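As a runnable variant, you could also keep each source as a (tag, attrs) tuple so nothing has to be parsed at lookup time; the dict layout here is an assumption, not the question's exact data:

```python
from bs4 import BeautifulSoup

html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''

# Store real Python objects, not quoted Python syntax
sources = {
    'source1': ('p', {'class': 'findme'}),
    'source2': ('span', {'class': 'findme2'}),
    'source3': ('div', {'class': 'findme3'}),
}

soup = BeautifulSoup(html, 'html.parser')
tag, attrs = sources['source1']
result = soup.find(tag, attrs)
print(result)  # <p class="findme"> p-tag content</p>
```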
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme', other sources ...}
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')  # getting the attribute
    att = {att[0]: att[1]}   # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)
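For what it's worth, that string parsing can be tidied into a small helper; this is a sketch under the same 'tag, attr:value' convention, with find_from_pattern being a made-up name:

```python
from bs4 import BeautifulSoup

html = "<div><p class='findme'> p-tag content</p></div>"
soup = BeautifulSoup(html, 'html.parser')

def find_from_pattern(soup, pattern):
    # pattern looks like 'p, class:findme' or just 'p'
    tag, _, attr_part = pattern.partition(', ')
    if attr_part:
        key, _, value = attr_part.partition(':')
        return soup.find(tag, {key: value})
    return soup.find(tag)

print(find_from_pattern(soup, 'p, class:findme'))
```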