The following is simplified HTML code:
<html>
...
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
...
</html>
This pattern is repeated many times, and I want to get only data like
2017.01.16 and 2017.01.26
So I used Beautiful Soup in Python:
for item in soup.find_all("span", {"class": "time"}):
    source = source + str(item.find_all(text=True))
This code finds the date data, but it also picks up the unwanted dates
2016.12.28 and 2017.01.30
For a more precise result, I tried find_next_siblings:
for item in soup.find_next_siblings("span", {"class": "time"}):
    source = source + str(item.find_next_siblings())
As you may expect, it doesn't work.
Of course I searched the reference documentation and read it, but I can't understand it well enough because of my limited English.
If you don't mind, could you help me with the code?
Try this:
from bs4 import BeautifulSoup
html=""" <html>
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info>
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
</html>"""
soup = BeautifulSoup(html)
s = soup.find_all('div', class_=['info', 'related_group'])
s = iter(s)
for a in s:
    print a.text.strip(), '---', next(s).text.strip()
Output:
2017.01.16 --- 2016.12.28
2017.01.26 --- 2017.01.30
What about this:
times = []
items = soup.find_all('div', {"class" : "info"})
for item in items:
    tmp = item.select(".time")
    t = tmp[0].text
    times.append(t)
soup.find_all('div', class_='info')
Output:
[<div class="info">
<span class="time">2017.01.16</span>
</div>, <div class="info">
<span class="time">2017.01.26</span>
</div>]
The span tag you want is nested inside the div tag.
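If you prefer a one-liner, a CSS child selector should do the same job; this is just a sketch, assuming the wanted spans are always direct children of the top-level div.info elements:
# select only span.time elements that are direct children of a div.info
times = [span.text for span in soup.select("div.info > span.time")]
print(times)  # expected: ['2017.01.16', '2017.01.26']
The child combinator (>) skips the spans nested inside p.info, which is exactly the filtering the question asks for.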
Related
The example I'm stuck with is like this:
<div class="nav-links">
<div class="nav-previous">
<a href="prevlink" rel="prev">
<span class="meta-nav" aria-hidden="true">Previous </span>
<span class="screen-reader-text">Previous post:</span> <br>
<span class="post-title">
Title
</span>
</a>
</div>
<div class="nav-next">
<a href="nextlink" rel="next">
<span class="meta-nav" aria-hidden="true">Next </span> <span class="screen-reader-text">Next post:</span>
<br>
<span class="post-title">
Title
</span>
</a>
</div>
My ultimate goal is to get the value of href, but all I could get is the whole <div class=...> element. I'm using BeautifulSoup in Python.
You can print all the href values by finding all the links on the page:
links = soup.find_all("a")
for link in links:
    print(link.attrs['href'])
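If you only want the previous/next navigation links rather than every link on the page, one option is to scope the search to the nav-links block; a minimal sketch, assuming the structure shown in the question:
# restrict the search to the navigation block from the question
nav = soup.find("div", class_="nav-links")
prev_href = nav.find("a", rel="prev")["href"]  # "prevlink"
next_href = nav.find("a", rel="next")["href"]  # "nextlink"
print(prev_href, next_href)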
I have a webpage following this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
And I want to scrape the href and the datetime attribute in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30].
How can I achieve this?
Thank you.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup=BeautifulSoup(t,"lxml")
aTags=soup.select('a')
data=[]
for aTag in aTags:
    timeTag=aTag.select_one('time')
    data.append([aTag.get('href'),timeTag['datetime']])
print(data)
Instead of t, you can use the page source returned by Selenium.
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
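For example, a minimal sketch of feeding the Selenium page source into the same parsing code, assuming driver is an already-initialised WebDriver that has loaded the page:
# parse the live page instead of the hard-coded string t
soup = BeautifulSoup(driver.page_source, "lxml")
The loop over the a tags stays exactly the same.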
You can use Selenium's find_elements_by_xpath() and get_attribute() functions in Python, as follows:
# for the hrefs
urls = [a.get_attribute('href') for a in driver.find_elements_by_xpath('//a[contains(@class, "card cardlisting0")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements_by_xpath('//a//time')]
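To get the pairs the question asks for, the two lists can then be combined, for instance with zip; this assumes both XPath queries return the cards in the same order and in equal numbers:
# pair each URL with its datetime (pairing of the two lists above)
pairs = [[url, date] for url, date in zip(urls, dates)]
print(pairs)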
I want to scrape info from each of the "item watching" elements in the "items" class. I'm stuck: when I use find, it only returns the HTML for the first "item watching", but I don't want to use find_all because it gives a massive blob that I can't prettify, and that would make it more difficult to cycle through the information.
soup = BeautifulSoup(res.text, "html.parser") # SOUP
class_items = soup.find("div", attrs={"data-name":"watching"}).find("div", class_="items") # Narrowed Down
actual_items = class_items.find("div", class_="item watching") # Was thinking [x] so I can cycle?
The whole shebang:
import requests
from bs4 import BeautifulSoup
payload = {"username":"?????", "password":"?????"}
url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"
with requests.Session() as s:
    res = s.post(loginurl, data=payload)
    res = s.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    class_items = soup.find("div", attrs={"data-name":"watching"}).find("div", class_="items")
    actual_items = class_items.find_next("div", class_="item watching")
    print(actual_items.prettify())
site url: https://9anime.to/
login url: https://9anime.to/user/ajax/login
Expected output for each "item watching" (Similar format for each):
<div class="item watching">
<a class="thumb" href="/watch/kaguya-sama-love-is-war-season-2.omkj?ep=7">
<img alt="Kaguya-sama: Love is War Season 2" src="https://static.akacdn.ru/files/images/2019/10/f53e6536aa7b3b95e6fe4c6d7b8e1a9b.jpg"/>
</a>
<a class="link" data-jtitle="Kaguya-sama wa Kokurasetai?: Tensai-tachi no Renai Zunousen" data-tip="/ajax/film/tooltip/omkj?v=5dab1c5b" href="/watch/kaguya-sama-love-is-war-season-2.omkj?ep=7">
Kaguya-sama: Love is War Season 2
</a>
<span class="current">
7
</span>
<div class="info">
<span class="state old tip" data-id="omkj" data-unwatched="Unwatched" data-value="0" data-watched="Watched" title="Click to change">
Watched
</span>
<span class="status">
7/12
</span>
<span class="dropdown userbookmark" data-id="omkj">
<i class="icon icon-pencil-square" data-toggle="dropdown">
</i>
<ul class="dropdown-menu bookmark choices pull-right" data-id="omkj">
<li data-value="watching">
<a>
<i class="fa fa-eye">
</i>
Watching
</a>
</li>
<li data-value="watched">
<a>
<i class="fa fa-check">
</i>
Completed
</a>
</li>
<li data-value="onhold">
<a>
<i class="fa fa-hand-grab-o">
</i>
On-Hold
</a>
</li>
<li data-value="dropped">
<a>
<i class="fa fa-eye-slash">
</i>
Drop
</a>
</li>
<li data-value="planned">
<a>
<i class="fa fa-bookmark">
</i>
Plan to watch
</a>
</li>
<li class="divider" role="separator">
</li>
<li data-value="remove">
<a>
<i class="fa fa-remove">
</i>
Remove entry
</a>
</li>
</ul>
</span>
</div>
<div class="clearfix">
</div>
</div>
One way would be to use CSS selectors and the select function:
actual_items = soup.select('div.content > div.items > div.item.watching')
for item in actual_items:
    print(item.prettify())
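If the page has no div.content wrapper, a variant pinned to the data-name attribute from your own snippet may be safer; a sketch under that assumption:
# scope the selector to the data-name="watching" block from the question
actual_items = soup.select('div[data-name="watching"] div.items > div.item.watching')
for item in actual_items:
    print(item.prettify())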
I am not a Beautiful Soup expert, but I had a similar problem, and using find_all and then creating a smaller variable helped me visualize the information.
import pandas as pd

df = pd.DataFrame()
# collect the text of every "item watching" element into a DataFrame row
class_items = soup.find_all("div", class_="item watching")
for x in class_items:
    df = df.append({'Actual Items': x.text.strip()}, ignore_index=True)
This is a part of the HTML code from the following page:
<div>
<div class="sidebar-labeled-information">
<span>
Economic skill:
</span>
<span>
10.646
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
<div class="xflagsSmall xflagsSmall-Argentina">
</div>
</a>
</div>
</div>
<div class="sidebar-labeled-information">
<span>
Citizenship:
</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland">
</div>
<small>
<a href="pendingCitizenshipApplications.html">
change
</a>
</small>
</div>
</div>
</div>
I want to extract region.html?id=454 from it. I don't know how to narrow the search down to <a href="region.html?id=454">, since there are a lot of <a href=> tags.
Here is the Python code:
session=session()
r = session.get('https://orange.e-sim.org/battle.html?id=5377',headers=headers,verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find_all('div',attrs={'class':'sidebar-labeled-information'})
And the output of this code is:
[<div class="sidebar-labeled-information" id="levelMission">
<span>Level:</span> <span>15</span>
</div>, <div class="sidebar-labeled-information" id="currRankText">
<span>Rank:</span>
<span>Colonel</span>
</div>, <div class="sidebar-labeled-information">
<span>Economic skill:</span>
<span>10.646</span>
</div>, <div class="sidebar-labeled-information">
<span>Strength:</span>
<span>2336</span>
</div>, <div class="sidebar-labeled-information">
<span>Location:</span>
<div>
<a href="region.html?id=454">Little Karoo<div class="xflagsSmall xflagsSmall-Argentina"></div>
</a>
</div>
</div>, <div class="sidebar-labeled-information">
<span>Citizenship:</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland"></div>
<small>change
</small>
</div>
</div>]
But my desired output is region.html?id=454.
The page which I'm trying to search in is located here, but you need to have an account to view the page.
from urllib.parse import urlparse

soup = BeautifulSoup(html)
links = soup.find_all('a', href=True)
for link in links:
    href = link['href']
    url = urlparse(href)
    if url.path == "region.html":
        print(url.path + "?" + url.query)
This prints region.html?id=454
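If you also need the absolute URL rather than the relative path, urljoin from the standard library can be combined with this; a sketch, using the battle page URL from your own code as the base:
from urllib.parse import urljoin

base = 'https://orange.e-sim.org/battle.html?id=5377'  # the page you requested
print(urljoin(base, "region.html?id=454"))  # https://orange.e-sim.org/region.html?id=454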
You can try using the xflagsSmall class and then find the parent of that element:
element=soup.find("div",{"class": "xflagsSmall"})
parent_element=element.find_parent()
link=parent_element.attrs["href"]
You can query based on the href value:
element=soup.find("a",{"href": "region.html?id=454"})
element.attrs["href"]
I have the following website (here is a piece of the HTML):
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">
sometext
</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
I would like to extract sometext and somelink. For this purpose, I have written the following Python code:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        if not("video" in (link['href'])):
            print "Name: "+link.text
            #sibling_page=urllib2.urlopen("major_link"+link['href'])
            print " Link extracted: "+link['href']
However, this code prints nothing. Could you suggest where my mistake is?
Your div does not have an href attribute. You have to look one level down, at the <a> element.
from bs4 import BeautifulSoup
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
sometext
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html)
for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        for a in links.find_all("a"):
            if not "video" in a['href']:
                print("Name: " + a.text)
                print("Link extracted: " + a['href'])
Prints:
Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink
It finds it twice, as your html is broken. BeautifulSoup fixes it as follows:
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">
sometext
</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">
22 Mar 2014
</span>
</span>
</div>
</div>
</div>
</div>
Inside your second for loop, your link variable holds a reference to <div class="feature">...</div>, which does not have an href attribute.
It highly depends on your structure, but if the <div class="feature"> tag always starts with an <h2> tag which contains only an <a> tag, then what you can do is get the anchor tag <a> first:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        anchor_tag = link.h2.a
        if not 'video' in anchor_tag['href']:
            print 'Name: %s' % anchor_tag.text
            print 'Link extracted: %s' % anchor_tag['href']
By the way, your HTML is not well-formed; the first <div class="feature"> tag should be closed:
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">
sometext
</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>