How do I extract text from a button using Beautiful Soup?


I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the HTML I am trying to navigate. I am attempting to retrieve the value 11.1K:
<ul class="list-unstyled m-meta-list m-meta-list--default">
<li class="m-meta-list-item">
<button class="text-stat disp-inline text-left a-button a-button--inline" data-element-id="btn_donors" type="button" data-analytic-event-listener="true">
<span class="text-stat-value text-underline">11.1K</span>
<span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_='m-meta-list-item')
for donor in donors:
    print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.

These fundraiser pages all have similar HTML, and that value is retrieved dynamically, so it may be missing from the raw page source. I would suggest using Selenium with a CSS class selector:
from selenium import webdriver
from selenium.webdriver.common.by import By
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
# find_element_by_css_selector was removed in Selenium 4.3; use find_element with By instead
num = d.find_element(By.CSS_SELECTOR, '.text-stat-value').text
print(num)
d.quit()
Learn more about Selenium:
https://sqa.stackexchange.com/a/27856
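If the rendered HTML is already available (for example from the driver's page_source), the value can also be pulled out with Beautiful Soup itself. The following is only a sketch: the markup is copied from the question, the closing tags are assumed because the original snippet is truncated, and the selector assumes that the data-element-id attribute stays stable.

```python
from bs4 import BeautifulSoup

# Markup from the question; the closing tags are an assumption.
html = """
<ul class="list-unstyled m-meta-list m-meta-list--default">
  <li class="m-meta-list-item">
    <button class="text-stat disp-inline text-left a-button a-button--inline"
            data-element-id="btn_donors" type="button">
      <span class="text-stat-value text-underline">11.1K</span>
      <span class="m-social-stat-item-title text-stat-title">donors</span>
    </button>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Target the value span inside the donors button rather than the whole <li>.
value = soup.select_one('button[data-element-id="btn_donors"] .text-stat-value')
donors = value.get_text(strip=True) if value is not None else None
print(donors)
```

If select_one returns None on the live page, the value was most likely not in the HTML that was parsed, which is consistent with it being loaded dynamically.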

Get the campaign ID from the page URL (gofundme.com/f/{THEID}) and call the API:
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
Then process the data:
for people in apiResponse['references']['donations']:
    print(people['name'])
Use the browser console to find the API host.
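The steps in this answer could be sketched roughly as follows. This is a sketch under assumptions, not a documented client: the host name is a placeholder (the answer leaves it to be found in the browser console), the helper names are hypothetical, and the sample payload only mirrors the ['references']['donations'] shape that the answer mentions.

```python
import json
from urllib.parse import urlencode


def donations_url(host, campaign_id, limit=20, offset=0):
    """Build the donations feed URL; `host` is a placeholder to fill in."""
    query = urlencode({"sort": "recent", "limit": limit, "offset": offset})
    return f"https://{host}/web-gateway/v1/feed/{campaign_id}/donations?{query}"


def donor_names(api_response):
    """Collect donor names, tolerating missing keys in the response."""
    donations = api_response.get("references", {}).get("donations", [])
    return [d["name"] for d in donations if "name" in d]


# Hypothetical payload standing in for the decoded JSON response.
sample = json.loads(
    '{"references": {"donations": [{"name": "A. Donor"}, {"name": "B. Donor"}]}}'
)

url = donations_url("api.example.invalid", "THEID", limit=20, offset=20)
names = donor_names(sample)
print(url)
print(names)
```

Increasing offset by limit on each request would page through the feed, assuming the endpoint behaves the way its query parameters suggest.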

Related

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome(r'C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
Edit: here is a closer look at the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>

To find a single element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use a CSS selector with select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate=[]
for item in soup.find_all('span',attrs={"id": "LargeHeader_dateText","data-date": True}):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Or with a CSS selector:
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both give the same output; however, the latter one executes faster.
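For reference, here is a self-contained sketch that reads both dates from the span in the question: the data-date attribute and the visible "Last Processed" text. The HTML is copied from the question; the variable names and the way the visible text is split are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# The span from the question, inlined so the example runs without a browser.
html = (
    '<span class="import-popover">'
    '<span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span>'
    '<b><span id="LargeHeader_statusText">Processing Complete</span></b>'
    '<span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" '
    'data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one("#LargeHeader_dateText[data-date]")

# Attributes are read with dictionary-style access, not with .text.
last_data_date = span["data-date"]

# The visible text is ", Last Processed 2/5/2020"; keep only the trailing date.
last_processed_date = span.get_text(strip=True).rsplit(" ", 1)[-1]

row = {"Last Data Date": last_data_date, "Last Processed Date": last_processed_date}
print(row)
```

A dictionary like row can be appended to a list and passed to pd.DataFrame, as the question's script already does with its two lists.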

Having trouble saving links to a list variable with selenium

Practicing web scraping through selenium by opening user's dating profiles through a dating site. I need selenium to save a href link for every profile on the page but unfortunately it only saves the first profile on the list, rather than creating a list variable with all the links saved. All of the profiles start with the same two div class/style which is "member-thumbnail" and "position: absolute". Thank you for any help that you can offer.
Here is the website code:
<div class="member-thumbnail">
<div style="position: absolute;">
<a href="/Member/Details/LvL-Up">
<img src="//storage.com/imgcdn/m/t/502b24cb-3f75-49a1-a61a-ae80e18d86a0" class="presenceLine online">
</a>
</div>
</div>
Here is my code:
link_list = []
link_list = browser.find_element_by_css_selector('.member-thumbnail a').get_attribute('href')
length_link_list = len(link_list)
for i in range(0, length_link_list):
    browser.get(link_list[i])

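Use find_elements_by_css_selector instead of find_element_by_css_selector; the singular form returns only the first match.
If you're going to loop through the whole list returned by find_elements_by_css_selector, consider this instead, which is a bit more Pythonic. Collect the href values before navigating, because elements found on the current page go stale once browser.get() loads another page:
link_list = [element.get_attribute('href') for element in browser.find_elements_by_css_selector('.member-thumbnail a')]
for link in link_list:
    browser.get(link)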

Download HTML of a webpage that's already loaded

I am writing a program using Python and selenium to automate logging into a website. The website asks a security question for additional verification. Clearly the answer I would send using "send_keys" would depend on the question asked so I need to figure out what is being asked based on the text. BeautifulSoup can be used to parse through the HTML but in all the examples I have seen you have to give a URL to then read the page content. How do I read the content of a page that's already open? The code I am using is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
loginElem = browser.find_element_by_id('bbbb')
loginElem.send_keys('cccc')
passwordElem = browser.find_element_by_id('dddd')
passwordElem.send_keys('eeee')
passwordElem.send_keys(Keys.RETURN)
The page with the security questions loads after this and that's the page I want the URL of.
I also tried finding the element directly, but for some reason it wasn't working, which is why I am trying a workaround. Below is the HTML for the entire div where the question is. Alternatively, maybe you can help me search for the right element.
<div class="answer-section">
<p> Please answer your challenge question so we can help
verify your identity.
</p> <label for="tlpvt-challenge-answer"> What is the name of your dog?
</label>
<input type="text" id="tlpvt-challenge-answer" class="tl-private gis- mask"
name="challengeQuestionAnswer" value=""/>
</div>

If you want to use BeautifulSoup, you can retrieve the source code from the webdriver and then parse it:
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
# page_source is an attribute of the webdriver instance that
# holds the HTML of the page currently loaded in the browser
html = browser.page_source
# parse it with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('label', {'for': 'tlpvt-challenge-answer'})
print(label.get_text(strip=True))
Output:
What is the name of your dog?
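To then choose the reply based on the question text, one option is a simple lookup table keyed by the normalized label text. The sketch below uses the HTML from the question in place of browser.page_source; the ANSWERS dictionary and its value are hypothetical placeholders.

```python
from bs4 import BeautifulSoup

# Stand-in for browser.page_source, copied from the question.
html = """
<div class="answer-section">
<p> Please answer your challenge question so we can help
verify your identity.
</p> <label for="tlpvt-challenge-answer"> What is the name of your dog?
</label>
<input type="text" id="tlpvt-challenge-answer" class="tl-private gis- mask"
name="challengeQuestionAnswer" value=""/>
</div>
"""

# Hypothetical mapping from question text to the reply to type in.
ANSWERS = {"What is the name of your dog?": "hypothetical-answer"}

soup = BeautifulSoup(html, "html.parser")
label = soup.find("label", {"for": "tlpvt-challenge-answer"})

# Collapse newlines and repeated spaces so the text matches the dictionary key.
question = " ".join(label.get_text().split())
answer = ANSWERS.get(question)  # None when the question is not recognized
print(question, "->", answer)
```

In the Selenium script, answer would then be passed to send_keys on the input whose id is tlpvt-challenge-answer, after checking that it is not None.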

Fetching name and email from a web page [duplicate]

I'm trying to fetch data from a link. I want to fetch the name, email, location, etc. from the web page and paste it into the webpage. I have written the code for it, but whenever I run it, it just stores a blank list.
Please help me copy this data from the web page.
I want to fetch the company name, email, and phone number from this link and put them in an Excel file. I want to do the same for all pages of the website. I already have the logic to fetch the links in the browser and switch between them, but I'm unable to fetch the data from the website. Can anybody suggest an improvement to the code I have written?
Below is the code I have written:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
import requests
import xlwt
browser = webdriver.Firefox() # Get local session of firefox
# 0 wait until the pages are loaded
browser.implicitly_wait(3) # 3 secs should be enough. if not, increase it
browser.get("http://ae.bizdirlib.com/taxonomy/term/1493") # Load page
links = browser.find_elements_by_css_selector("h2 > a")
#print link
for link in links:
    link.send_keys(Keys.CONTROL + Keys.RETURN)
    link.send_keys(Keys.CONTROL + Keys.PAGE_UP)
    #tree = html.fromstring(link.text)
    time.sleep(5)
    companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
    companyName = companyNameElement
    print companyNameElement
The Html code is given below
<div class="content">
<div id="node-946273" class="node node-country node-promoted node-full clearfix">
<div class="content clearfix">
<div itemtype="http://schema.org/Corporation" itemscope="">
<fieldset>
<legend>Company Information</legend>
<div style="width:100%;">
<div style="float:right; width:340px; vertical-align:top;">
<br/>
<ul>
<li>
<strong>Company Name</strong>
:
<span itemprop="name">Sabbro - F.Z.C</span>
</li>
</ul>
When I run it, I get the error: 'list' object has no attribute 'text'. Can somebody help me improve the code and make it work? I have been stuck on this issue for a long time.

companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
The find_elements_by... methods return a list, and a list has no text attribute. You can either access the first element of that list (for example, with [0]) or use the equivalent find_element_by... method, which returns just the first matching element.

Select hyperlink in html document using Python and Selenium

I am trying to select a hyperlink in a document from a website, but not sure how to select it using Selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
names = 'Catostomus discobolus yarrowi'
driver = webdriver.Firefox()
driver.get("http://ecos.fws.gov/ecos/home.action")
SciName = driver.find_element_by_id('searchbox')
SciName.send_keys(names)
SciName.send_keys(Keys.RETURN)
The above code gets to the page that I am interested in working on, but not sure how to select the hyperlink. I am interested in selecting the first hyperlink. The html of interest is
Zuni Bluehead Sucker (<strong>Catostomus discobolus</strong> yarrowi)
</h4>
<div class='url'>ecos.fws.gov/speciesProfile/profile/speciesProfile.action?spcode=E063</div>
<span class='description'>
States/US Territories in which the Zuni Bluehead Sucker is known to or is believed to occur: Arizona, New Mexico; US Counties in which the Zuni ...
</span>
<ul class='sitelinks'></ul>
</div>
I am guessing I could use find_element_by_xpath, but have been unable to do so successfully. I will want to always select the first hyperlink. Also, the hyperlink name will change based on the species name entered.

I added the following code:
SciName = driver.find_element_by_css_selector("a[href*='http://ecos.fws.gov/speciesProfile/profile/']")
SciName.click()
I should have read the selenium documentation more thoroughly.

try this:
SciName = driver.find_element_by_link_text("Zuni Bluehead Sucker")
SciName.click()
