Edit text from html with BeautifulSoup - python

I'm currently trying to find the HTML elements which have text of their own and wrap that text in a special tag.
For example, my HTML looks like this:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
I'm trying to wrap the markers only around the text nodes, so I can parse them further at a later time. The result should look like this:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
Here is my script so far. It iterates just fine, but the placement of the edit placeholders is wrong, and I currently have no idea how to fix it:
def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
The script generates the following:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
How can I check for this and fix it? I'm grateful for every possible answer.

There is a built-in replace_with() method that fits the use case nicely:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
for node in soup.find_all(text=lambda x: x.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))
print(soup.prettify())
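A note: in BeautifulSoup 4.4+ the text= argument is also spelled string=. A minimal runnable sketch against a fragment of the question's markup (the data value here is just for illustration):

from bs4 import BeautifulSoup

data = '<li class="fields">This text still has children<b>Simple Text</b></li>'
soup = BeautifulSoup(data, "html.parser")
# string= is the newer spelling of the text= argument (BeautifulSoup 4.4+)
for node in soup.find_all(string=lambda x: x.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))
print(soup.prettify())

This should print each text node wrapped in the [Editable] markers, with the surrounding tags untouched.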

Related

Python BeautifulSoup Loop through divs and multiple elements

I have a website containing film listings; I've put together a simplified version of its HTML below. Please note that in the real-world example the <ul> tags are not direct children of the class film_listing or showtime; they are found under several <div> or <ul> elements.
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>
I have created a Python script to scrape the website which currently lists all film titles with the first showtime and first attribute of each. However, I am trying to list all showtimes. The final aim is to only list film titles with open captions and the showtime of those open captions performances.
Here is the Python script, with a nested for loop that doesn't work: it prints all showtimes for all films rather than the showtimes for a specific film. It is also not yet set up to only list captioned films. I suspect the logic may be wrong and would appreciate any advice. Thanks!
for i in soup.findAll('li', {'class': 'film_listing'}):
    film_title = i.find('h3', {'class': 'film_title'}).text
    print(film_title)
    for j in soup.findAll('li', {'class': 'showtime'}):
        print(j['showtime.text'])
    # For the time listings, find ones with Open Captioned
    i = filmlisting.find('li', {'class': 'open_cap'})
    print(film_access)
Edit: small correction to the HTML snippet.
There are many ways you could extract the information. One way is to "search backwards": search for the <li> with class="open_cap" and then find the previous start time and film title:
from bs4 import BeautifulSoup
txt = '''
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>'''
soup = BeautifulSoup(txt, 'html.parser')
for open_cap in soup.select('.open_cap'):
    print('Name :', open_cap.find_previous(class_='film_title').text)
    print('Start time :', open_cap.find_previous(class_='start_time').text)
    print('-' * 80)
Prints:
Name : James Bond
Start time : 19:00
--------------------------------------------------------------------------------
Content of read.html
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15: 00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc"></li>
<li class="open_cap"></li>
</ul>
</li>
</ul>
</li>
As you said, the <ul> tags are not direct children of the class film_listing or showtime, so you can use find() to get the first element with a specified tag name, or find_all() to get a list of all such elements.
You can try this
from bs4 import BeautifulSoup as bs

text = open("read.html", "r")
soup = bs(text.read(), 'html.parser')
for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    print("Start time: ", listing.find("p", class_="start_time").text)
Output:
Film name: James Bond
Start time: 15: 00
Instead of find(), you can use the find_all() method, which will return all tags with the name <p> and class start_time; see the sketch below.
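For instance, a minimal sketch building on the snippet above (same read.html assumption), printing every showtime in each listing instead of only the first:

for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    # find_all() returns every matching <p class="start_time"> inside this listing
    for start in listing.find_all("p", class_="start_time"):
        print("Start time: ", start.text)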

Extracting full URL from href tag in scrapy

I'm trying to use Scrapy to scrape offer URLs from this site.
This is the code I tried:
url = response.css('a[data-tracking="click_body"]::attr(href)').extract()
But my code returns something very different from a URL.
Here is the HTML code of the div I'm interested in.
<div class="offer-item-details">
<header class="offer-item-header">
<h3>
<a href="https://www.otodom.pl/oferta/gdansk-pod-inwestycje-cicha-lokalizacja-ID46DXu.html#ab04badaa0" data-tracking="click_body" data-tracking-data="{"touch_point_button":"title"}" data-featured-name="promo_top_ads">
<strong class="visible-xs-block">42 m²</strong>
<span class="text-nowrap">
<span class="offer-item-title">Gdańsk/ Pod Inwestycje/ Cicha Lokalizacja</span>
</span>
</a>
</h3>
<p class="text-nowrap"><span class="hidden-xs">Mieszkanie na sprzedaż: </span>Gdańsk, Ujeścisko-Łostowice, Łostowice</p>
<div class="vas-list-no-offer">
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="60688916" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">
<span class="icon observed-60688916"></span>
<i class="icon-heart-filled"></i>
<div class="observed-label">Dodaj do ulubionych</div>
</div>
</a>
</div>
</header>
<ul class="params
" data-tracking="click_body" data-tracking-data="{"touch_point_button":"body"}">
<li class="offer-item-rooms hidden-xs">2 pokoje</li>
<li class="offer-item-price">
346 000 zł </li>
<li class="hidden-xs offer-item-area">42 m²</li>
<li class="hidden-xs offer-item-price-per-m">8 238 zł/m²</li>
</ul>
</div>
Copied selector of that tag:
#offer-item-ad_id45Wog > div.offer-item-details > header > h3 > a
Copied XPath:
//*[@id="offer-item-ad_id45Wog"]/div[1]/header/h3/a
Copied full XPath:
/html/body/div[3]/main/section[2]/div/div/div[1]/div/article[1]/div[1]/header/h3/a
Your code gives you a list of the URLs: extract() returns a list. To have Scrapy export the data, you need to loop over the list and yield each item.
urls = response.css('a[data-tracking="click_body"]::attr(href)').extract()
for a in urls:
    yield {'url': a}
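If some of the hrefs turn out to be relative rather than absolute, Scrapy's response.urljoin() resolves them against the page URL; a small variation of the loop above:

for a in response.css('a[data-tracking="click_body"]::attr(href)').extract():
    # urljoin() leaves absolute URLs untouched and resolves relative ones
    yield {'url': response.urljoin(a)}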

Selenium : How to wait then click?

I'm using Selenium for automation, and I want to click each one of the <ul>'s elements, then wait before clicking the next one. This is my code, but it doesn't seem to be the solution:
def navBar():
    driver = setup()
    navBar_List = driver.find_element_by_class_name("nav")
    listItem = navBar_List.find_elements_by_tag_name("li")
    for item in listItem:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.TAG_NAME, "li")))
        item.click()
Here is the HTML code:
<ul class="nav navbar-nav">
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
</ul>
Is Thread.sleep(100) an option?
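Thread.sleep is Java; in Python the equivalent would be time.sleep. A fixed pause is cruder than an explicit wait, but as a minimal sketch, reusing listItem from the question:

import time

for item in listItem:
    item.click()
    time.sleep(0.1)  # fixed 100 ms pause between clicks; an explicit wait is usually more robust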
Define your li elements with .find_elements.
Use XPath to locate them: //ul[@class='nav navbar-nav']//li.
With a loop you can use an increment to wait for each li. I imagine it will produce locators like below:
(xpath)[1]
(xpath)[2]
etc...
And try the following code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

listItem = WebDriverWait(driver, 30).until(
    EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='nav navbar-nav']//li")))
for x in range(1, len(listItem) + 1):
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "(//ul[@class='nav navbar-nav']//li)[" + str(x) + "]"))).click()
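Note the design choice: each <li> is re-located by its XPath index just before the click, rather than reusing the elements from the first lookup. This also sidesteps StaleElementReferenceException if the page re-renders after a click.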

Can I access the subchild of a parent in XPath?

As the title states, I have some HTML code from http://chem.sis.nlm.nih.gov/chemidplus/name/acetone that I am parsing, and I want to extract some data, like the "Acetone" under MeSH Heading, following on from my similar post How to set up XPath query for HTML parsing?
<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds">
<button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
<h3>Name of Substance</h3>
<div class="yui3-g-r">
<div class="yui3-u-1-4">
<ul>
<li id="ds2">
<div>2-Propanone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds3">
<div>Acetone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds4">
<div>Acetone [NF]</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds5">
<div>Dimethyl ketone</div>
</li>
</ul>
</div>
</div>
<h3>MeSH Heading</h3>
<ul>
<li id="ds6">
<div>Acetone</div>
</li>
</ul>
</div>
</div>
Previously, on other pages, I would do mesh_name = tree.xpath('//*[text()="MeSH Heading"]/..//div')[1].text_content() to extract the data, because those pages had similar structures; now I see that is not the case, as I didn't account for inconsistency. So, is there a way, after going to the node that I want, to obtain its subchild, allowing for consistency across different pages?
Would doing tree.xpath('//*[text()="MeSH Heading"]//preceding-sibling::text()[1]') work?
From what I understand, you need to get the list of items by a heading title.
How about making a reusable function that would work for every heading in the "Names and Synonyms" container:
from lxml.html import parse

tree = parse("http://chem.sis.nlm.nih.gov/chemidplus/name/acetone")

def get_contents_by_title(tree, title):
    return tree.xpath("//h3[. = '%s']/following-sibling::*[1]//div/text()" % title)

print(get_contents_by_title(tree, "Name of Substance"))
print(get_contents_by_title(tree, "MeSH Heading"))
Prints:
['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
['Acetone']

How to extract href, alt and imgsrc using beautiful soup python

Can someone help me extract some data from the sample HTML below using Beautiful Soup (Python)?
This is what I'm trying to extract:
The href link, for example:
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text, which has the movie name:
Buddy 2013 Malayalam Movie
The thumbnail, for example: http://i44.tinypic.com/2lo14b8.jpg
(There are multiple occurrences of these.)
Full source available at: http://olangal.com
Sample html :
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update: finally cracked it with help from @kroolik. Thank you.
Here's what worked for me:
for eachItem in soup.findAll("div", {"class": "item"}):
    eachItem.ul.decompose()
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)
You can get both <img width="110"> and <p class="readmore"> using the following:
for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is a direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)
    # Will print the `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])
    # Will match `<img width="110">` that is a direct child of the
    # div (the width value must be a string to match the attribute).
    img = div.find('img', width='110', recursive=False)
    print(img['src'], img['alt'])
Note that this is for the most recent Beautiful Soup version.
I usually use PyQuery for this kind of scraping; it's clean and easy, and you can use jQuery selectors directly with it. E.g., to see your name and reputation on Stack Overflow, I would just write something like:
from pyquery import PyQuery as pq

d = pq(url='http://stackoverflow.com/users/1234402/gbzygil')
p = d('#user-displayname')
t = d('#user-panel-reputation div h1 a span')
print(p.html())
So unless you really can't switch away from BeautifulSoup, I strongly recommend PyQuery or another library with good XPath support.
