Is it possible to change the parent of an HTML element with Python BeautifulSoup?

Let's assume I have HTML like the following:
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
I want to move all divs with the class answer-div into the previous question-div. Can I handle it with beautifulsoup?

You can also use insert
from bs4 import BeautifulSoup
html="""
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', {"class": "answer-div"}):
    # Move each answer-div into the div that immediately precedes it
    div.find_previous_sibling('div').insert(0, div)
print(soup)
Output
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>

No hands-on experience with beautifulsoup but I will give this one a shot!
The way I look at it, you find all the question divs and all the answer divs separately:
div_ques_Blocks = soup.find_all('div', class_="question-div")
div_ans_Blocks = soup.find_all('div', class_="answer-div")
and then loop through the answer-div blocks, looking up the div that precedes each one:
for divtag in div_ans_Blocks:
    print(divtag.find_previous_sibling('div'))
If the above print statement gives you the question-div preceding each answer-div, you can then try appending the answer-div to it instead of printing.
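The appending step described above might look like the following. This is only a sketch, assuming bs4 is installed and that each answer-div directly follows its question-div as a sibling; the shortened html string and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: same sibling layout as in the question)
html = """
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list up front, so moving tags while looping is safe
for answer in soup.find_all("div", class_="answer-div"):
    question = answer.find_previous_sibling("div", class_="question-div")
    if question is not None:
        # append() detaches the tag from its old position and re-parents it
        question.append(answer)

print(soup)
```

Because append() moves the tag rather than copying it, no answer-div should remain at the top level afterwards.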

Related

How to get the text from certain class name if other sibling class exists?

I'm trying to get the text of class="eventAwayMinute" (here 57) from every matchEvent element (the parent tag),
but only if that matchEvent contains class="eventIcon eventIcon_1":
<div class="matchEvent">
<div class="eventHomePlayer">
</div>
<div class="eventHomeMinute"></div>
<div class="eventIcon eventIcon_1"></div>
<div class="eventAwayMinute">57'</div>
<div class="eventAwayPlayer">
George
<span>(Irakli)</span> </div>
</div>
I tried
Minutes = [(gm.get_text()).strip() for gm in soup.select('matchEvent , div[class$="eventIcon_1"]')]
and it does not work.
I tried also
Minutes = [(gm.get_text()).strip() for gm in soup.select('matchEvent')]
But it returns all minutes that exist in every matchEvent (there are several matchEvent elements in the HTML).
You can use the :has() CSS selector to check whether a matchEvent contains an element with the eventIcon eventIcon_1 classes, and then print the text of its eventAwayMinute element:
from bs4 import BeautifulSoup
html = """<div class="matchEvent">
<div class="eventHomePlayer">
</div>
<div class="eventHomeMinute"></div>
<div class="eventIcon eventIcon_1"></div>
<div class="eventAwayMinute">57'</div>
<div class="eventAwayPlayer">
George
<span>(Irakli)</span> </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".matchEvent:has(.eventIcon.eventIcon_1)"):
    print(tag.select_one(".eventAwayMinute").text.strip("'"))
Output:
57

Any way to only extract specific div from beautiful soup

I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally I want only the value of data-rarity, so just the 102 part. This is what the site's inspect element view shows:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    # r is already the profileCards__card div, so read the attribute from it directly
    print(r["data-rarity"])
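As a small illustration of reading the attribute, here is a sketch that uses Tag.get() so a card without data-rarity yields None instead of raising KeyError. The trimmed html string is an assumption for the example, and the second card (with no data-rarity) is hypothetical, added only to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the second card is hypothetical and lacks data-rarity
html = """
<div class="profileCards__cards">
  <div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
    <div class="ui__tooltip">Giant Snowball</div>
  </div>
  <div class="profileCards__card">
    <div class="ui__tooltip">No rarity here</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None when the attribute is absent; tag["data-rarity"] would raise KeyError
rarities = [
    card.get("data-rarity")
    for card in soup.find_all("div", class_="profileCards__card")
]
print(rarities)
```

Attribute values come back as strings, so convert with int() if a number is needed.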

Scraping from Same-named Tags with Python / BeautifulSoup

I'm getting my toes wet with BeautifulSoup and am hung up on scraping some particular info. The HTML looks like the following, for example:
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe</dd>
<dt>Institution:</dt>
<dd>American University</dd>
</dl>
</div>
...
<div class="row">
::before
<div class="course-short clearfix">
::before
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe, Jr.</dd>
<dt>Institution:</dt>
<dd>Another University</dd>
</dl>
</div>
...
Each page has about 10 <div class="row"> tags, each with the same <dt> and <dd> pattern (e.g., Language, Author, Institution).
I am trying to scrape the <dd>American University</dd> info, ultimately to create a loop so that I can get that info specific to each <div class="row"> tag.
I've managed the following so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.oeconsortium.org/courses/search/?search=statistics")
bsObj = BeautifulSoup(html.read(), "html.parser")
institutions = [x.text.strip() for x in bsObj.find_all('div', 'course-meta col-sm-12', 'dd')]
But that only gives me the following mess for each respective <div class="row"> :
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe\nInstitution:\nAmerican University\n
Language:\n\n\t\t\t\t\t\t\t\t\t\t\tEnglish\t\t\t\t\t\t\t\t\t\nAuthor:\nJohn Doe, Jr.\nInstitution:\nAnother University\n
...
(I know how to .strip(); that's not the issue.)
I can't figure out how to target that third <dd></dd> for each respective <div class="row">. I feel like it may be possible by also targeting the <dt>Institution:</dt> tag (which is "Institution" in every respective case), but I can't figure it out.
Any help is appreciated. I am trying to make a LOOP so that I can loop over, say, ten <div class="row"> instances and just pull out the info specific to the "Institution" <dd> tag.
Thank you!
I can't figure out how to target that third <dd></dd> for each respective <div class="row">
find_all will return a list of all occurrences, so you could just take the third element of the result. You may want to wrap the lookup in try/except to guard against an IndexError when a block has fewer than three <dd> tags.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
I'd make a function out of this code to reuse it for different pages:
soup = BeautifulSoup(html.read(), "html.parser")
result = []
for meta in soup.find_all('div', {'class': 'course-meta'}):
    try:
        institution = meta.find_all('dd')[2].text.strip()
        result.append(institution)  # or whatever you like to store it
    except IndexError:
        pass
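The question also raises the idea of targeting the <dt>Institution:</dt> label instead of relying on position. A minimal sketch of that alternative, assuming the label text is exactly "Institution:" and that the matching <dd> is the next <dd> sibling; the inline html is copied from the question rather than fetched from the site.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question (the live page is not fetched here)
html = """
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe</dd>
<dt>Institution:</dt>
<dd>American University</dd>
</dl>
</div>
<div class="course-meta col-sm-12">
<dl>
<dt>Language:</dt>
<dd>English</dd>
<dt>Author:</dt>
<dd>John Doe, Jr.</dd>
<dt>Institution:</dt>
<dd>Another University</dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

institutions = []
for meta in soup.find_all("div", class_="course-meta"):
    # Locate the label by its text, then take the <dd> that follows it
    label = meta.find("dt", string="Institution:")
    if label is None:
        continue
    value = label.find_next_sibling("dd")
    if value is not None:
        institutions.append(value.get_text(strip=True))

print(institutions)
```

This keeps working if a block omits Language or Author, at the cost of depending on the exact label text.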

Python Beautifulsoup Find_all except

I'm struggling to find a simple way to solve this problem and hope you might be able to help.
I've been using BeautifulSoup's find_all and trying some regex to find all the items except the 'emptyItem' line in the HTML below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except the one that includes 'emptyItem'?
Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
    if "emptyItem" in elm["class"]:  # skip elements having emptyItem class
        continue
    print(elm.get_text())
Prints:
test0
test1
test2
Note that the div[class^=product_item] is a CSS selector that would match all div elements with a class starting with product_item.
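The skip can also be folded into the selector itself with :not(). This is a sketch rather than part of the answer above, and it assumes a bs4 version whose select() is backed by soupsieve (4.7 or later); the sample data is shortened.

```python
from bs4 import BeautifulSoup

data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1 last">test1</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Match divs whose class attribute starts with product_item,
# excluding any that also carry the emptyItem class
texts = [
    elm.get_text()
    for elm in soup.select('div[class^="product_item"]:not(.emptyItem)')
]
print(texts)
```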

Extract outer div using BeautifulSoup

If the HTML code looks like this:
<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>
How do I extract just
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.
You actually want to extract() the nested div from the document and then get the first div. Here is an example (where html is the HTML you provided in the question):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>
</div>
Why not just find() the nested div and then remove it from the tree using extract()?
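A sketch of that find()-then-extract() approach, assuming bs4 and the markup from the question; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="div1">
<p>hello</p>
<p>hi</p>
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div", class_="div1")
nested = outer.find("div", class_="nesteddiv")
if nested is not None:
    # extract() removes the tag from the tree and returns it
    nested.extract()

# Only the paragraphs that sat directly in div1 remain
paragraphs = [p.get_text() for p in outer.find_all("p")]
print(paragraphs)
```

Note that extract() modifies the parsed tree in place, so parse the HTML again if the nested div is needed later.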
