This is part of the HTML code from the following page:
<div>
<div class="sidebar-labeled-information">
<span>
Economic skill:
</span>
<span>
10.646
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
<div class="xflagsSmall xflagsSmall-Argentina">
</div>
</a>
</div>
</div>
<div class="sidebar-labeled-information">
<span>
Citizenship:
</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland">
</div>
<small>
<a href="pendingCitizenshipApplications.html">
change
</a>
</small>
</div>
</div>
</div>
I want to extract region.html?id=454 from it. I don't know how to narrow the search down to <a href="region.html?id=454">, since there are a lot of <a href=> tags.
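Here is the Python code:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# headers is defined elsewhere; verify=False disables TLS certificate checks
r = session.get('https://orange.e-sim.org/battle.html?id=5377', headers=headers, verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find_all('div', attrs={'class': 'sidebar-labeled-information'})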
And the output of this code is:
[<div class="sidebar-labeled-information" id="levelMission">
<span>Level:</span> <span>15</span>
</div>, <div class="sidebar-labeled-information" id="currRankText">
<span>Rank:</span>
<span>Colonel</span>
</div>, <div class="sidebar-labeled-information">
<span>Economic skill:</span>
<span>10.646</span>
</div>, <div class="sidebar-labeled-information">
<span>Strength:</span>
<span>2336</span>
</div>, <div class="sidebar-labeled-information">
<span>Location:</span>
<div>
<a href="region.html?id=454">Little Karoo<div class="xflagsSmall xflagsSmall-Argentina"></div>
</a>
</div>
</div>, <div class="sidebar-labeled-information">
<span>Citizenship:</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland"></div>
<small>change
</small>
</div>
</div>]
But my desired output is region.html?id=454.
The page which I'm trying to search in is located here, but you need to have an account to view the page.
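from urllib.parse import urlparse
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Collect every anchor that actually has an href attribute
links = soup.find_all('a', href=True)
for link in links:
    href = link['href']
    url = urlparse(href)
    # Keep only links whose path is the region page
    if url.path == "region.html":
        print(url.path + "?" + url.query)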
This prints region.html?id=454
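To make the filtering step above more concrete, here is a minimal sketch of what urlparse returns for this relative link, plus one way to pull out the numeric id with parse_qs. The variable names (href, parts, region_id) are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

href = "region.html?id=454"
parts = urlparse(href)

# A relative URL has no scheme or host, so everything before "?" is the path
print(parts.path)   # region.html
print(parts.query)  # id=454

# parse_qs maps each query key to a list of values
region_id = parse_qs(parts.query)["id"][0]
print(region_id)    # 454
```

Comparing parts.path instead of the whole string means the check keeps working if the id changes.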
You can try using the xflagsSmall class: find the first element with that class, then take its parent.
element = soup.find("div", {"class": "xflagsSmall"})
parent_element = element.find_parent()
link = parent_element.attrs["href"]
You can also search by the href value directly:
element=soup.find("a",{"href": "region.html?id=454"})
element.attrs["href"]
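Another way to narrow the search, sketched here under the assumption that the label text is always exactly "Location:", is to find the sidebar block by its label and read the link inside it. The helper name find_labeled_href is hypothetical, and the HTML is a trimmed copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <div class="sidebar-labeled-information">
    <span>Strength:</span>
    <span>2336</span>
  </div>
  <div class="sidebar-labeled-information">
    <span>Location:</span>
    <div>
      <a href="region.html?id=454">Little Karoo
        <div class="xflagsSmall xflagsSmall-Argentina"></div>
      </a>
    </div>
  </div>
  <div class="sidebar-labeled-information">
    <span>Citizenship:</span>
    <div>
      <div class="xflagsSmall xflagsSmall-Poland"></div>
      <small><a href="pendingCitizenshipApplications.html">change</a></small>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_labeled_href(soup, label):
    """Return the href of the first link in the block whose first span equals label."""
    for block in soup.find_all("div", class_="sidebar-labeled-information"):
        span = block.find("span")
        if span and span.get_text(strip=True) == label:
            link = block.find("a", href=True)
            return link["href"] if link else None
    return None


location_href = find_labeled_href(soup, "Location:")
print(location_href)
```

This does not depend on the region id or on the flag class, so it should keep working when the location changes.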
Related
The example I'm stuck with looks like this:
<div class="nav-links">
<div class="nav-previous">
<a href="prevlink" rel="prev">
<span class="meta-nav" aria-hidden="true">Previous </span>
<span class="screen-reader-text">Previous post:</span> <br>
<span class="post-title">
Title
</span>
</a>
</div>
<div class="nav-next">
<a href="nextlink" rel="next">
<span class="meta-nav" aria-hidden="true">Next </span> <span class="screen-reader-text">Next post:</span>
<br>
<span class="post-title">
Title
</span>
</a>
</div>
My ultimate goal is to get the value of href, but all I could get is the whole <div class ... element. I'm using BeautifulSoup in Python.
You can print all the href values by finding every link in the page:
links = soup.find_all("a")
for link in links:
    print(link.attrs['href'])
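If only the previous and next links are wanted rather than every link on the page, a CSS selector can target each one by its wrapper div. This is a minimal sketch using a shortened copy of the HTML from the question; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

nav_html = """
<div class="nav-links">
  <div class="nav-previous">
    <a href="prevlink" rel="prev">
      <span class="meta-nav" aria-hidden="true">Previous </span>
      <span class="post-title">Title</span>
    </a>
  </div>
  <div class="nav-next">
    <a href="nextlink" rel="next">
      <span class="meta-nav" aria-hidden="true">Next </span>
      <span class="post-title">Title</span>
    </a>
  </div>
</div>
"""

nav_soup = BeautifulSoup(nav_html, "html.parser")

# select_one returns the first match or None, so guard before indexing
prev_link = nav_soup.select_one("div.nav-previous a[href]")
next_link = nav_soup.select_one("div.nav-next a[href]")

prev_href = prev_link["href"] if prev_link else None
next_href = next_link["href"] if next_link else None
print(prev_href, next_href)
```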
I have a webpage that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I do this?
Thank you.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup=BeautifulSoup(t,"lxml")
aTags=soup.select('a')
data=[]
for aTag in aTags:
    timeTag = aTag.select_one('time')
    data.append([aTag.get('href'), timeTag['datetime']])
print(data)
Instead of t, you can use the page source returned by Selenium.
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
You can use find_elements() with an XPath locator together with get_attribute(), as follows:
from selenium.webdriver.common.by import By

# for the hrefs (matches every card whose class contains "cardlisting")
urls = [a.get_attribute('href') for a in driver.find_elements(By.XPATH, '//a[contains(@class, "cardlisting")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements(By.XPATH, '//a//time')]
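One caveat with building the hrefs and the datetimes as two separate lists is that they only line up if every card contains exactly one time element. A hedged sketch of pairing per card instead, using BeautifulSoup on a static string (the middle card without a time element is a made-up case added for illustration):

```python
from bs4 import BeautifulSoup

cards_html = """
<a class="card cardlisting0" href="abc/def/gh.com">
  <div class="card-content"><time datetime="2020-05-31">3 hours ago</time></div>
</a>
<a class="card cardlisting1" href="no/time/here.com">
  <div class="card-content"></div>
</a>
<a class="card cardlisting2" href="ijk/lmn/op.com">
  <div class="card-content"><time datetime="2020-04-30">20200430</time></div>
</a>
"""

cards_soup = BeautifulSoup(cards_html, "html.parser")

pairs = []
for card in cards_soup.select("a.card"):
    # Look for the time element inside this card only
    time_tag = card.select_one("time[datetime]")
    pairs.append([card.get("href"), time_tag["datetime"] if time_tag else None])

print(pairs)
```

A card with no time element gets None instead of shifting every later date up by one position.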
I have HTML as shown below:
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Setting</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Home</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >products</span>
</span>
</div>
I want to click the img icon based on the text in the last span tag.
For example, I want to select the first img tag if the last span contains "Setting". Can you please help me write an XPath for this UI element to use with Selenium WebDriver in Python?
I think this XPath will help you. It finds the span whose text contains the label, goes up to the enclosing xtree div, and then selects the img inside it:
//span[contains(text(), "Setting")]/ancestor::div[@class="xtree"]/img[@class="dojoimg"]
I hope this helps.
Here is my solution, using find_element_by_link_text:
driver.find_element_by_link_text("Reveal").click()
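To check the selection logic offline before wiring it into Selenium, the traversal (find the span by its text, go up to the enclosing xtree div, take the img inside it) can be sketched with BeautifulSoup. This is an illustration only: the helper name img_for_label is hypothetical, the HTML is a shortened copy of the question's markup, and it returns a parsed tag rather than a clickable WebElement.

```python
from bs4 import BeautifulSoup

tree_html = """
<div class="xtree">
  <img class="dojoimg">
  <span class="presentation">+</span>
  <span class="treenode">
    <div class="ctreefolder">....</div>
    <span>Setting</span>
  </span>
</div>
<div class="xtree">
  <img class="dojoimg">
  <span class="presentation">+</span>
  <span class="treenode">
    <div class="ctreefolder">....</div>
    <span>Home</span>
  </span>
</div>
"""

tree_soup = BeautifulSoup(tree_html, "html.parser")


def img_for_label(soup, label):
    """Return the img in the xtree node whose inner span text equals label."""
    label_span = soup.find("span", string=label)
    if label_span is None:
        return None
    node = label_span.find_parent("div", class_="xtree")
    return node.find("img", class_="dojoimg") if node else None


all_imgs = tree_soup.find_all("img")
print(img_for_label(tree_soup, "Setting") is all_imgs[0])
print(img_for_label(tree_soup, "Home") is all_imgs[1])
```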
The following is simplified HTML code.
<html>
...
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
...
</html>
This pattern is repeated many times, and I want to get data like
2017.01.16 and 2017.01.26
So I used Beautiful Soup in Python.
for item in soup.find_all("span", {"class" : "time"}):
    source=source+str(item.find_all(text=True))
This code finds the date data, but it also finds unwanted data:
2016.12.28 and 2017.01.30
For a more precise result, I tried find_next_siblings:
for item in soup.find_next_siblings("span", {"class" : "time"}):
    source=source+str(item.find_next_siblings())
As you may guess, it doesn't work.
Of course, I searched the reference and read it,
but I couldn't fully understand it because my English is limited.
If you don't mind, could you help me with the code?
Try this:
from bs4 import BeautifulSoup
html=""" <html>
<div class="info">
<span class="time">2017.01.16</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2016.12.28</span>
</p>
</li>
</ul>
</div>
<div class="info">
<span class="time">2017.01.26</span>
</div>
<div class="related_group">
<ul class="related_list">
<li>
<p class="info">
<span class="time">2017.01.30</span>
</p>
</li>
</ul>
</div>
</html>"""
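soup = BeautifulSoup(html, 'html.parser')
# Both kinds of div come back in document order: info, related_group, info, ...
s = soup.find_all('div', class_=['info', 'related_group'])
s = iter(s)
for a in s:
    # Pair each "info" div with the "related_group" div that follows it
    print(a.text.strip(), '---', next(s).text.strip())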
Output:
2017.01.16 --- 2016.12.28
2017.01.26 --- 2017.01.30
What about this:
times = []
items = soup.find_all('div', {"class" : "info"})
for item in items:
    tmp = item.select(".time")
    t = tmp[0].text
    times.append(t)
soup.find_all('div', class_='info')
out:
[<div class="info">
<span class="time">2017.01.16</span>
</div>, <div class="info">
<span class="time">2017.01.26</span>
</div>]
The tag you want is under a div tag, so searching for div (not p) with class info returns only the dates you need.
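The same idea can also be written as a single CSS selector. In this sketch the child combinator (>) keeps only span.time elements that sit directly inside a div.info, which leaves out the ones nested in p.info. The HTML is a compacted copy of the question's markup and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

dates_html = """
<div class="info"><span class="time">2017.01.16</span></div>
<div class="related_group"><ul class="related_list"><li>
  <p class="info"><span class="time">2016.12.28</span></p>
</li></ul></div>
<div class="info"><span class="time">2017.01.26</span></div>
<div class="related_group"><ul class="related_list"><li>
  <p class="info"><span class="time">2017.01.30</span></p>
</li></ul></div>
"""

dates_soup = BeautifulSoup(dates_html, "html.parser")

# "div.info > span.time" requires the parent to be a div, so p.info is skipped
top_level_times = [
    span.get_text(strip=True)
    for span in dates_soup.select("div.info > span.time")
]
print(top_level_times)
```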
I have the following web page (here is a piece of the HTML):
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
I would like to extract sometext and somelink. For this purpose, I have written the following Python code:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        if not("video" in (link['href'])):
            print "Name: "+link.text
            #sibling_page=urllib2.urlopen("major_link"+link['href'])
            print " Link extracted: "+link['href']
However, this code prints nothing. Could you point out where my mistake is?
Your div does not have href attribute. You have to look one level down at the <a> element.
from bs4 import BeautifulSoup
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
sometext
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        # Search inside the current feature div for anchors
        for a in link.find_all("a"):
            if "video" not in a['href']:
                print("Name: " + a.text)
                print("Link extracted: " + a['href'])
Prints:
Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink
It finds it twice, as your html is broken. BeautifulSoup fixes it as follows:
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">
sometext
</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">
22 Mar 2014
</span>
</span>
</div>
</div>
</div>
</div>
Inside your second for loop, your link variable holds a reference to <div class="feature">...</div>, which does not have an href attribute.
This depends heavily on your structure, but if the <div class="feature"> tag always starts with an <h2> tag that contains only an <a> tag, then you can get the anchor tag <a> first:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])
By the way, your HTML is not well-formed; the first <div class="feature"> tag should be closed:
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
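If the nested feature divs cannot be fixed at the source, one way to avoid visiting the same link twice is to select the anchors directly with a CSS selector, since select returns each matching element only once. This is a sketch under assumptions: the second feature block with a "video" link is invented here to show the filter, and the names page_html and results are illustrative.

```python
from bs4 import BeautifulSoup

page_html = """
<div class="moduleBody">
  <div class="feature">
    <div class="feature">
      <h2><a href="somelink">sometext</a></h2>
      <div class="relatedInfo">
        <span class="timestamp">22 Mar 2014</span>
      </div>
    </div>
  </div>
  <div class="feature">
    <h2><a href="video/clip">a video</a></h2>
  </div>
</div>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

results = []
# Each anchor is matched once, even though two feature divs enclose the first one
for a in page_soup.select("div.moduleBody div.feature h2 > a[href]"):
    if "video" in a["href"]:
        continue  # skip video links, as in the original loop
    results.append((a.get_text(strip=True), a["href"]))

for name, href in results:
    print("Name: " + name)
    print("Link extracted: " + href)
```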