How can I 'cycle' through similar blocks of HTML using BeautifulSoup?

I want to scrape info from each "item watching" block inside the "items" class. I'm stuck: find only returns the HTML for the first "item watching", but I don't want to use find_all because it gives me a massive blob that I can't prettify, which would make it harder to cycle through the information.
soup = BeautifulSoup(res.text, "html.parser") # SOUP
class_items = soup.find("div", attrs={"data-name":"watching"}).find("div", class_="items") # Narrowed Down
actual_items = class_items.find("div", class_="item watching") # Was thinking [x] so I can cycle?
The whole shebang:
import requests
from bs4 import BeautifulSoup
payload = {"username":"?????", "password":"?????"}
url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"
with requests.Session() as s:
    res = s.post(loginurl, data=payload)
    res = s.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    class_items = soup.find("div", attrs={"data-name":"watching"}).find("div", class_="items")
    actual_items = class_items.find_next("div", class_="item watching")
    print(actual_items.prettify())
site url: https://9anime.to/
login url: https://9anime.to/user/ajax/login
Expected output for each "item watching" (Similar format for each):
<div class="item watching">
<a class="thumb" href="/watch/kaguya-sama-love-is-war-season-2.omkj?ep=7">
<img alt="Kaguya-sama: Love is War Season 2" src="https://static.akacdn.ru/files/images/2019/10/f53e6536aa7b3b95e6fe4c6d7b8e1a9b.jpg"/>
</a>
<a class="link" data-jtitle="Kaguya-sama wa Kokurasetai?: Tensai-tachi no Renai Zunousen" data-tip="/ajax/film/tooltip/omkj?v=5dab1c5b" href="/watch/kaguya-sama-love-is-war-season-2.omkj?ep=7">
Kaguya-sama: Love is War Season 2
</a>
<span class="current">
7
</span>
<div class="info">
<span class="state old tip" data-id="omkj" data-unwatched="Unwatched" data-value="0" data-watched="Watched" title="Click to change">
Watched
</span>
<span class="status">
7/12
</span>
<span class="dropdown userbookmark" data-id="omkj">
<i class="icon icon-pencil-square" data-toggle="dropdown">
</i>
<ul class="dropdown-menu bookmark choices pull-right" data-id="omkj">
<li data-value="watching">
<a>
<i class="fa fa-eye">
</i>
Watching
</a>
</li>
<li data-value="watched">
<a>
<i class="fa fa-check">
</i>
Completed
</a>
</li>
<li data-value="onhold">
<a>
<i class="fa fa-hand-grab-o">
</i>
On-Hold
</a>
</li>
<li data-value="dropped">
<a>
<i class="fa fa-eye-slash">
</i>
Drop
</a>
</li>
<li data-value="planned">
<a>
<i class="fa fa-bookmark">
</i>
Plan to watch
</a>
</li>
<li class="divider" role="separator">
</li>
<li data-value="remove">
<a>
<i class="fa fa-remove">
</i>
Remove entry
</a>
</li>
</ul>
</span>
</div>
<div class="clearfix">
</div>
</div>

One way would be to use CSS selectors and the select function:
actual_items = soup.select('div.content > div.items > div.item.watching')
for item in actual_items:
    print(item.prettify())
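Because select returns a list of Tag objects, each block can be queried on its own inside the loop. The following is a minimal sketch, assuming markup shaped like the expected output in the question. The inline HTML string stands in for the logged-in page, the second item is made up for illustration, and the `rows` name is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the watchlist page (an assumption; the real page needs a login).
# The second item is hypothetical and only there to show the loop running twice.
html = """
<div class="items">
  <div class="item watching">
    <a class="link" href="/watch/kaguya-sama-love-is-war-season-2.omkj?ep=7">
      Kaguya-sama: Love is War Season 2
    </a>
    <span class="current">7</span>
    <div class="info"><span class="status">7/12</span></div>
  </div>
  <div class="item watching">
    <a class="link" href="/watch/example-show.abcd?ep=3">Example Show</a>
    <span class="current">3</span>
    <div class="info"><span class="status">3/24</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# One Tag per "item watching" block; query each block separately.
for item in soup.select("div.items > div.item.watching"):
    link = item.select_one("a.link")
    rows.append({
        "title": link.get_text(strip=True),
        "href": link["href"],
        "current": item.select_one("span.current").get_text(strip=True),
        "status": item.select_one("span.status").get_text(strip=True),
    })

for row in rows:
    print(row)
```

Collecting plain dictionaries first keeps the scraping step separate from whatever you do with the data afterwards.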

I am not a Beautiful Soup expert, but I had a similar problem where using find_all and then building a smaller variable helped me visualize the information.
import pandas as pd
# Collect the text of every matching block, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0).
class_items = soup.find_all("div", class_="item watching")
rows = [{'Actual Items': x.text.strip()} for x in class_items]
df = pd.DataFrame(rows)

Related

How to get the href link from nested HTML tags using python and beautifulsoup?

I used the code below to extract the href content:
soup = bs(html, 'html.parser')
pagesize_content = soup.find_all('div',attrs={'class':"pull-right"})
print(pagesize_content)
When I search for the class "pull-right", the output is as below:
[<div class="col-md-4 pull-right">
<button class="btn btn-default closebutton" data-dismiss="modal" type="button">Close</button>
<button class="btn btn-success" id="Name Properties" type="submit">
Submit
<em class="fa fa-check"></em>
</button>
</div>, <div class="pull-right">
<ul class="pagination">
<li class="disabled">
<a class="disabled" href="javascript:void(0)">
First
<em class="fa fa-angle-double-left"></em>
</a>
</li>
<li class="active">
1
</li>
<li class="">
2
</li>
<li class="">
3
</li>
<li class="">
4
</li>
<li class="">
<a href="/Test/country?page=4">
Last
<em class="fa fa-angle-double-right"></em>
</a>
</li>
</ul>
</div>]
How can I extract the href link from that?
When I try
print(pagesize_content.ul.li.a['href'])
an error is displayed, and
pagesize_content = soup.find_all('a',attrs={'class':"fa fa-angle-double-right"},href=True)
print(pagesize_content)
gives empty output.
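Two things are likely going wrong in the attempts above: find_all returns a list-like ResultSet, so `.ul` cannot be called on it directly, and the class fa-angle-double-right sits on the em tag rather than on the a tag. The sketch below assumes markup shaped like the printed output (trimmed here); the variable names `hrefs`, `icon`, and `last_href` are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the pagination markup shown in the question.
html = """
<div class="pull-right">
  <ul class="pagination">
    <li class="disabled">
      <a class="disabled" href="javascript:void(0)">First
        <em class="fa fa-angle-double-left"></em></a>
    </li>
    <li class="active">1</li>
    <li class=""><a href="/Test/country?page=4">Last
        <em class="fa fa-angle-double-right"></em></a>
    </li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: every pagination link that carries an href attribute.
hrefs = [a["href"] for a in soup.select("ul.pagination a[href]")]

# Option 2: start from the <em> icon and walk up to its enclosing <a>.
icon = soup.find("em", class_="fa-angle-double-right")
last_href = icon.find_parent("a")["href"]

print(hrefs)
print(last_href)
```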

find a specific child element in html with beautifulsoup python

The example I'm stuck on looks like this:
<div class="nav-links">
<div class="nav-previous">
<a href="prevlink" rel="prev">
<span class="meta-nav" aria-hidden="true">Previous </span>
<span class="screen-reader-text">Previous post:</span> <br>
<span class="post-title">
Title
</span>
</a>
</div>
<div class="nav-next">
<a href="nextlink" rel="next">
<span class="meta-nav" aria-hidden="true">Next </span> <span class="screen-reader-text">Next post:</span>
<br>
<span class="post-title">
Title
</span>
</a>
</div>
My ultimate goal is to get the value of href, but all I could do is get the whole <div class ... element. I'm using BeautifulSoup with Python.

You can print all the values for href by finding all the links in the page
links = soup.find_all("a")
for link in links:
    print(link.attrs['href'])
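If only one specific link is wanted rather than every link on the page, the search can be narrowed to the enclosing div first. This is a minimal sketch, assuming the nav-links markup from the question (trimmed here); `next_href` and `prev_href` are illustrative names.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup in the question.
html = """
<div class="nav-links">
  <div class="nav-previous">
    <a href="prevlink" rel="prev"><span class="post-title">Title</span></a>
  </div>
  <div class="nav-next">
    <a href="nextlink" rel="next"><span class="post-title">Title</span></a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: the <a> that is a direct child of div.nav-next.
next_href = soup.select_one("div.nav-next > a")["href"]

# Equivalent chained find() calls for the previous link.
prev_href = soup.find("div", class_="nav-previous").find("a")["href"]

print(prev_href, next_href)
```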

need to find a value with beautiful soup

This is part of the HTML code from the following page:
<div>
<div class="sidebar-labeled-information">
<span>
Economic skill:
</span>
<span>
10.646
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
<div class="xflagsSmall xflagsSmall-Argentina">
</div>
</a>
</div>
</div>
<div class="sidebar-labeled-information">
<span>
Citizenship:
</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland">
</div>
<small>
<a href="pendingCitizenshipApplications.html">
change
</a>
</small>
</div>
</div>
</div>
I want to extract region.html?id=454 from it. I don't know how to narrow the search down to <a href="region.html?id=454">, since there are a lot of <a href=> tags.
Here is the python code:
session=session()
r = session.get('https://orange.e-sim.org/battle.html?id=5377',headers=headers,verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find_all('div',attrs={'class':'sidebar-labeled-information'})
And the output of this code is:
[<div class="sidebar-labeled-information" id="levelMission">
<span>Level:</span> <span>15</span>
</div>, <div class="sidebar-labeled-information" id="currRankText">
<span>Rank:</span>
<span>Colonel</span>
</div>, <div class="sidebar-labeled-information">
<span>Economic skill:</span>
<span>10.646</span>
</div>, <div class="sidebar-labeled-information">
<span>Strength:</span>
<span>2336</span>
</div>, <div class="sidebar-labeled-information">
<span>Location:</span>
<div>
<a href="region.html?id=454">Little Karoo<div class="xflagsSmall xflagsSmall-Argentina"></div>
</a>
</div>
</div>, <div class="sidebar-labeled-information">
<span>Citizenship:</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland"></div>
<small>change
</small>
</div>
</div>]
But my desired output is region.html?id=454.
The page which I'm trying to search in is located here, but you need to have an account to view the page.

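from urllib.parse import urlparse
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=True)
for link in links:
    href = link['href']
    url = urlparse(href)
    # Keep only links whose path is region.html
    if url.path == "region.html":
        print(url.path + "?" + url.query)
This prints region.html?id=454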

You can try using this class:
xflagsSmall
and find the parent of that element:
element=soup.find("div",{"class": "xflagsSmall"})
parent_element=element.find_parent()
link=parent_element.attrs["href"]

You can query on the basis of the href value:
element=soup.find("a",{"href": "region.html?id=454"})
element.attrs["href"]
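Matching the full href only works when the id is known in advance. A hedged alternative, assuming the region link always starts with region.html, is a CSS attribute-prefix selector. The HTML below is a trimmed copy of the question's markup and `region_link` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sidebar markup from the question.
html = """
<div class="sidebar-labeled-information">
  <span>Strength:</span> <span>2336</span>
</div>
<div class="sidebar-labeled-information">
  <span>Location:</span>
  <div><a href="region.html?id=454">Little Karoo</a></div>
</div>
<div class="sidebar-labeled-information">
  <span>Citizenship:</span>
  <div><small><a href="pendingCitizenshipApplications.html">change</a></small></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# [href^="..."] matches anchors whose href starts with the given prefix.
region_link = soup.select_one(
    'div.sidebar-labeled-information a[href^="region.html"]'
)
print(region_link["href"])
```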

How can I extract only the link out of an <a href, which includes li elements, using beautifulsoup? [duplicate]

This question already has answers here:
BeautifulSoup - extracting attribute values
(2 answers)
Closed 4 years ago.
I am new to Python and BeautifulSoup. I want to get the link from the href. Unfortunately, the anchor also includes other, irrelevant data.
Help is much appreciated.
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>

An attribute value can be extracted using [] brackets.
For instance, to extract the alt value of an img tag, use:
image_example = soup.find('img') and then print(image_example['alt'])
Updated code:
from bs4 import BeautifulSoup
data = '''
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a> <a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>
'''
soup = BeautifulSoup(data, 'html.parser')
url_address = soup.find('a')['href']
print (url_address) # Output: /link-i-want/to-get.html
The format is as follows:
soup.find('<tag>')['<attribute-name>']
You can also use .get(attr), as mentioned: soup.find('<tag>').get('<attr>')
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
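The two forms differ when the attribute is missing: the bracket form raises KeyError, while .get returns None or a default you supply. A small sketch follows; the second anchor without an href is a made-up example for illustration.

```python
from bs4 import BeautifulSoup

# The second anchor deliberately has no href (hypothetical example).
html = '<a href="/link-i-want/to-get.html">one</a><a name="anchor-only">two</a>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

print(first["href"])             # attribute present: both forms work
print(second.get("href"))        # attribute missing: .get returns None
print(second.get("href", "#"))   # or the default you pass in

try:
    second["href"]               # bracket form raises KeyError when missing
except KeyError:
    print("no href on the second anchor")
```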

How can I find sibling in Beautifulsoup?

The following is simplified HTML code.
<html>
...
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
...
</html>
This pattern repeats many times, and I want to get data like
2017.01.16 and 2017.01.26
So I used Beautiful Soup in Python.
for item in soup.find_all("span", {"class" : "time"}):
    source=source+str(item.find_all(text=True))
This code finds the date data, but it also finds data I don't need:
2016.12.28 and 2017.01.30
For a more precise result, I tried find_next_siblings:
for item in soup.find_next_siblings("span", {"class" : "time"}):
    source=source+str(item.find_next_siblings())
As you might expect, it doesn't work.
Of course, I searched for references and read them,
but I couldn't understand them well enough because my English is limited.
If you don't mind, could you help me with the code?

Try this:
from bs4 import BeautifulSoup
html=""" <html>
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
</html>"""
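soup = BeautifulSoup(html, "html.parser")
s = soup.find_all('div', class_=['info', 'related_group'])
s = iter(s)
# Each "info" div is followed by its "related_group" div, so consume them in pairs.
for a in s:
    print(a.text.strip(), '---', next(s).text.strip())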
Output:
2017.01.16 --- 2016.12.28
2017.01.26 --- 2017.01.30

What about this:
times = []
items = soup.find_all('div', {"class" : "info"})
for item in items:
    tmp = item.select(".time")
    t = tmp[0].text
    times.append(t)

soup.find_all('div', class_='info')
out:
[<div class="info">
<span class="time">2017.01.16</span>
</div>, <div class="info">
<span class="time">2017.01.26</span>
</div>]
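The span tags you want sit directly under the div tags with class "info"; the unwanted dates are under p tags, so restricting the search to div excludes them.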
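The same restriction can be written as a single CSS selector with the child combinator. This is a minimal sketch, assuming the simplified HTML from the question; `times` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Simplified HTML from the question, with the repeated pattern shown twice.
html = """
<div class="info"><span class="time">2017.01.16</span></div>
<div class="related_group">
  <ul class="related_list">
    <li><p class="info"><span class="time">2016.12.28</span></p></li>
  </ul>
</div>
<div class="info"><span class="time">2017.01.26</span></div>
<div class="related_group">
  <ul class="related_list">
    <li><p class="info"><span class="time">2017.01.30</span></p></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.info > span.time" matches only spans whose direct parent is a
# <div class="info">, so the spans under <p class="info"> are skipped.
times = [span.get_text(strip=True) for span in soup.select("div.info > span.time")]
print(times)
```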
