selenium scrape multiple attributes within a block at the same time

I have a webpage that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I do this?
Thank you.

You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
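soup = BeautifulSoup(t, "lxml")
aTags = soup.select('a')
data = []
for aTag in aTags:
    # search for <time> inside this card only, so each href stays paired with its own date
    timeTag = aTag.select_one('time')
    data.append([aTag.get('href'), timeTag['datetime']])
print(data)
Instead of t, you can pass the page source from Selenium (driver.page_source).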
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]

You can use the find_elements_by_xpath() and get_attribute() functions in Python, as follows:
# for the hrefs (match every card, not only cardlisting0)
urls = [a.get_attribute('href') for a in driver.find_elements_by_xpath('//a[contains(@class, "cardlisting")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements_by_xpath('//a//time')]
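
The two lists above only line up if every card contains exactly one time element. A minimal sketch of a per-card lookup that keeps each pair together is shown below. The function name scrape_pairs is hypothetical, and the sketch assumes a Selenium 4 style driver, where find_elements(by, value) is used in place of the find_elements_by_xpath() helper, which recent Selenium 4 releases no longer provide.

```python
def scrape_pairs(driver):
    """Return [href, datetime] pairs, one per card anchor (hypothetical helper)."""
    pairs = []
    # "xpath" is the string value of Selenium's By.XPATH constant.
    cards = driver.find_elements("xpath", '//a[contains(@class, "cardlisting")]')
    for card in cards:
        # The leading "." makes the search relative to this card,
        # so the date always belongs to the same link.
        time_element = card.find_element("xpath", ".//time")
        pairs.append([
            card.get_attribute("href"),
            time_element.get_attribute("datetime"),
        ])
    return pairs
```

With a driver that has already loaded the page, scrape_pairs(driver) should return one pair per card. Note that find_element raises an exception if a card has no time element, so a page with incomplete cards would need extra handling.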

Related

How to find inside many <div> tags? (Beautiful Soup)

I want to find the content inside the following tag:
<h4 id="rfq-info-header-id" class="pr-3 mb-3">
RFQ1526090
</h4>
Full code:
<rfq-display-header-seller>
<div class="card-body pb-0">
<div class="row">
<div id="rfq-info-header-col-1" class="col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="small text-muted">RFQ ID</div>
<h4 id="rfq-info-header-id" class="pr-3 mb-3">
RFQ1526090
</h4>
I tried:
rfq_id = [tag.text.strip() for tag in soup.find_all(name='h4', attrs={'id': 'rfq-info-header-id','class': 'pr-3 mb-3'})]
print(rfq_id)
But this returned an empty list [].
Is this because the h4 tag is nested inside many other tags? How can I simplify the code to extract the text inside the h4 tag above?

I'm getting output as follows:
from bs4 import BeautifulSoup
html_doc="""
<rfq-display-header-seller>
<div class="card-body pb-0">
<div class="row">
<div id="rfq-info-header-col-1" class="col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="small text-muted">RFQ ID</div>
<h4 id="rfq-info-header-id" class="pr-3 mb-3">
RFQ1526090
</h4>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# rfq_id = soup.find('h4').text
# print(rfq_id)
rfq_id = [t.get_text(strip=True) for t in soup.find_all('h4')]
print(rfq_id)
Output:
['RFQ1526090']
Output using only find method:
RFQ1526090
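
Nesting depth does not affect find() or find_all(), since both search all descendants. Because the h4 has an id, a lookup by id alone may be the simplest form. The sketch below uses a trimmed copy of the HTML from the question; if the same lookup still returns nothing on the live page, one possible cause (an assumption, not confirmed by the question) is that the element is missing from the HTML that was actually fetched.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="row">
<div id="rfq-info-header-col-1" class="col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="small text-muted">RFQ ID</div>
<h4 id="rfq-info-header-id" class="pr-3 mb-3">
RFQ1526090
</h4>
</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# An id should be unique in a document, so no class filter is needed.
tag = soup.find(id="rfq-info-header-id")
rfq_id = tag.get_text(strip=True) if tag is not None else None
print(rfq_id)
```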

need to find a value with beautiful soup

This is part of the HTML code from the following page:
<div>
<div class="sidebar-labeled-information">
<span>
Economic skill:
</span>
<span>
10.646
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
<div class="xflagsSmall xflagsSmall-Argentina">
</div>
</a>
</div>
</div>
<div class="sidebar-labeled-information">
<span>
Citizenship:
</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland">
</div>
<small>
<a href="pendingCitizenshipApplications.html">
change
</a>
</small>
</div>
</div>
</div>
I want to extract region.html?id=454 from it. I don't know how to narrow the search down to <a href="region.html?id=454">, since there are a lot of <a href=> tags.
Here is the python code:
session=session()
r = session.get('https://orange.e-sim.org/battle.html?id=5377',headers=headers,verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find_all('div',attrs={'class':'sidebar-labeled-information'})
And the output of this code is:
[<div class="sidebar-labeled-information" id="levelMission">
<span>Level:</span> <span>15</span>
</div>, <div class="sidebar-labeled-information" id="currRankText">
<span>Rank:</span>
<span>Colonel</span>
</div>, <div class="sidebar-labeled-information">
<span>Economic skill:</span>
<span>10.646</span>
</div>, <div class="sidebar-labeled-information">
<span>Strength:</span>
<span>2336</span>
</div>, <div class="sidebar-labeled-information">
<span>Location:</span>
<div>
<a href="region.html?id=454">Little Karoo<div class="xflagsSmall xflagsSmall-Argentina"></div>
</a>
</div>
</div>, <div class="sidebar-labeled-information">
<span>Citizenship:</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland"></div>
<small>change
</small>
</div>
</div>]
But my desired output is region.html?id=454.
The page I'm trying to search is located here, but you need an account to view it.

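from urllib.parse import urlparse
from bs4 import BeautifulSoup

# html holds the page source, e.g. r.text from the question
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=True)
for link in links:
    href = link['href']
    url = urlparse(href)
    # keep only links whose path is region.html
    if url.path == "region.html":
        print(url.path + "?" + url.query)
This prints region.html?id=454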

You can try using this class:
xflagsSmall
and find the parent of that element:
element=soup.find("div",{"class": "xflagsSmall"})
parent_element=element.find_parent()
link=parent_element.attrs["href"]

You can query based on the href value:
element=soup.find("a",{"href": "region.html?id=454"})
element.attrs["href"]
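
If the id in the href is not known in advance, another option is to locate the block by its label text. The sketch below is one possible approach, using a trimmed copy of the HTML from the question. The helper name href_for_label is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
</a>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def href_for_label(soup, label):
    """Return the first href inside the block whose first <span> equals label."""
    for block in soup.find_all("div", class_="sidebar-labeled-information"):
        span = block.find("span")
        if span is not None and span.get_text(strip=True) == label:
            link = block.find("a", href=True)
            return link["href"] if link is not None else None
    return None


location_href = href_for_label(soup, "Location:")
print(location_href)
```

Matching on the label keeps the lookup independent of the flag class and of the region id, both of which may change between pages.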

Python Find text at end of html element

I need to pull the movie title and year out of the HTML below using the BeautifulSoup find() method.
The code below returns the name of the movie, but I'm unable to return only the year:
find('p').find('a').text
<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>

my_element.contents[-1]
This will give you the last element contained inside my_element: in this case, if my_element is the <p>, this will give the text "2019" as a NavigableString. (The first child is the <strong> tag, which contains <a> and all the rest.)
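
A minimal sketch of this idea, using only the <p> line as it appears in the question's HTML:

```python
from bs4 import BeautifulSoup

html = "<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
title = p.find("strong").get_text(strip=True)

# The last child of <p> is the text node that follows <br/>.
year = p.contents[-1].strip()

# stripped_strings yields every non-empty text fragment in document order.
title_and_year = list(p.stripped_strings)
print(title, year)
```

The stripped_strings variant does not depend on the year being the very last child, which may help if the markup varies slightly between entries.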

Use the following code. Find the <a> tag and then use next_element:
from bs4 import BeautifulSoup
html='''<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>'''
soup=BeautifulSoup(html,'html.parser')
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p')
print(item.text)
Output:
A Tale of Two Kitchens2019
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').text
print(item)
output:
A Tale of Two Kitchens
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').next_element.next_element.next_element
print(item)
output:
2019

beautifulsoup: finding specific class name in nested div

I am trying to get reviews from the Agoda site for analysis using BeautifulSoup.
I inspected the page and saw that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">>
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this class is nested inside 10+ other elements.
I have tried:
for div in soup.findAll('div', attrs={"class":"comment-detail"}):
    print(div)
but it returns nothing.
Is there a way to match exactly class="comment-detail" data-selenium="reviews-comments"? Any suggestion is welcome.
Thank you.
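
On the static snippet above, matching on both the class and the data-selenium attribute can be sketched as follows. Nesting depth should not matter, because find_all searches all descendants. If the same call returns nothing on the live site, one possible explanation (an assumption, not verified here) is that the reviews are added by JavaScript after the initial HTML is delivered, in which case they would not be present in the text passed to BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Both attributes must match for a div to be returned.
matches = soup.find_all(
    "div",
    attrs={"class": "comment-detail", "data-selenium": "reviews-comments"},
)
comments = [div.get_text(strip=True) for div in matches]
print(comments)
```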

BeautifulSoup and find

I have a html code:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
How do I get everything inside the div with the id div1?
soup.find('div',{'id':"div1"}) returns:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
I need to get only:
<div id='d'> </div>
<p></p>

See the documentation, specifically .find() and .contents.

You want the content between the start and end of the tag including all child tags.
soup.find('div', id="div1").contents
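
A short sketch of both forms, using the HTML from the question: .contents returns the children as a list of nodes (including whitespace text nodes), while decode_contents() returns the same inner markup as a single string.

```python
from bs4 import BeautifulSoup

html = """<div id='div1'>
<div id='d'> </div>
<p></p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="div1")

# List of child nodes: tags plus the newline text nodes between them.
children = outer.contents

# Inner HTML as one string, without the enclosing <div id="div1"> tag.
inner_html = outer.decode_contents()
print(inner_html)
```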
