Split text by "whitespace" (as seen in Firefox inspect) in Python Selenium - python

There are several questions out there about how to split text by whitespace, but I couldn't find one that answers my question. I'm using Python & Selenium to collect some text from a website. The text I want to collect looks like this when I view it using Firefox's "Inspect Element"
I don't see the same thing when I look at the HTML in Google Chrome (the image below is for a different item/car than the original Firefox image):
I want to capture each of the lines separately (e.g. ['2012', 'HONDA', 'ACCORD 4C', 'LX']). If I use something like elem.text.split(' ') then I'll end up with ['2012', 'HONDA', 'ACCORD', '4C', 'LX'] which is NOT what I want/need.
When I print(elem.text) I get this regardless of browser:
2012 HONDA ACCORD 4C LX
elem.get_attribute('innerHTML') gives the following regardless of the browser:
2012 HONDA ACCORD 4C LX
elem.get_attribute('outerHTML') gives the following regardless of the browser:
<div class="class_name">2012 HONDA ACCORD 4C LX</div>
Edit/Update
I went to the website in Firefox then performed a "Save Page As..." with the Format equal to "Web Page, complete". The HTML in that region of the page looks like this:
<div class="class1" id="id1">
<div class="class2">
<div class="class3">
<div class="class4">2020 CHEVROLET SUBURBAN 4X2 V8 PREMIER</div>
</div>
</div>
</div>
Is there some way for Selenium to recognize what Firefox is seeing here and split the text based on the "whitespace" indicator?

Try the code below. Change the URL as needed; I have assumed the class name is 'happy' in my code.
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("xyz")
time.sleep(5)
mytxt = driver.find_element_by_class_name('happy')
Split_text = mytxt.text.split()
print("Year :-", Split_text[0])
print("Make :-", Split_text[1])
print("Model :-", Split_text[2])
print("Type :-", Split_text[3])

Looks like the whitespace is a result of the <pre> tag and the newlines outside of the tags inside it (probably more than that too; I'm not a web developer, so I just had a poke around). If you can share a link to your page we can take a look, or if you update your question with the source for your page you'll hopefully see the inner workings yourself.
If you render this HTML:
<pre>
<div>
hello again
</div>
<div>
world
</div>
</pre>
You'll get this in devtools:
With that in mind, you have a couple of options.
You can try Chrome. This doesn't seem to render that awkward whitespace and might be more useful for scripting against your site.
If you must use Firefox, or if Chrome doesn't cut it, try running this code snippet; obviously modify the bits you need to get your page and element:
from selenium import webdriver
#create this or set your URL
url = "C:\Git\PythonSelenium\StackWhitepsace.html"
browser = webdriver.Chrome()
browser.get(url)
#set this to how you identify your element
elem = browser.find_element_by_tag_name("pre")
print("text::")
print(elem.text)
print("") #line break
print("inner::")
print(elem.get_attribute('innerHTML'))
print("") #line break
print("outer::")
print(elem.get_attribute('outerHTML'))
It's a bit verbose, but this is the output for my simple page:
text::
hello again
world
inner::
<div>
hello again
</div>
<div>
world
</div>
outer::
<pre> <div>
hello again
</div>
<div>
world
</div>
</pre>
When you see the HTML options that Selenium sees, you'll be able to use @Pythonologist's split in the other answer to split the outcome into the parts you need.
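If it helps, once the text actually renders with line breaks (as it does under the <pre> in Firefox), splitting on newlines instead of all whitespace keeps multi-word fields such as 'ACCORD 4C' together. A minimal sketch, reusing the elem variable from the question:
# split only on line breaks so multi-word fields stay intact
lines = elem.text.splitlines()
print(lines)  # e.g. ['2012', 'HONDA', 'ACCORD 4C', 'LX'] if each value is on its own line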

Related

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome('C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
edit: here is a closer look of the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use the CSS selector method select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate=[]
for item in soup.find_all('span',attrs={"id": "LargeHeader_dateText","data-date": True}):
    processeddate=item['data-date']
    lastdatadate.append(processeddate)
Or with a CSS selector:
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both will give the same output; however, the latter executes faster.
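As a rough sketch of how this plugs back into the pandas flow from the question (assuming soup has already been built from driver.page_source as shown above):
lastdatadate=[]
lastprocesseddate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    # the data-date attribute holds the last data date, e.g. '2/4/2020'
    lastdatadate.append(item['data-date'])
    # the element text holds ', Last Processed 2/5/2020'
    lastprocesseddate.append(item.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')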

How to check checkbox using Selenium in Python

I have a problem with this checkbox. I tried to click it, locating the element by id, name, XPath, CSS selector, and text content, but I still could not click on this checkbox. Additionally, I've tried another site with similar HTML code, and on that site it was enough to look up the id and click. Any ideas?
<div class="agree-box-term">
<input tabindex="75" id="agree" name="agree" type="checkbox" value="1">
<label for="agree" class="checkbox-special">* Zapoznałam/em się z Regulaminem sklepu internetowego i akceptuję jego postanowienia.<br></label>
</div>
Here is my Python code https://codeshare.io/5zo0Jj
I have used the JavaScript executor and it clicks on the element. However, I have also verified that the plain WebDriver click is not working.
driver.execute_script("arguments[0].click();", driver.find_element_by_id("agree"))
I don't know why this is, but in my experience some boxes don't accept click but do accept a 'mousedown' trigger.
Try:
driver.execute_script('$("div.agree-box-term input#agree").trigger("mousedown")')
This solution does rely on jQuery being on the page; if it's not, we can write it in plain JavaScript, as sketched below.
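If jQuery is not available, roughly the same thing can be done in plain JavaScript via Selenium's execute_script (a sketch, not tested against this particular site):
checkbox = driver.find_element_by_id("agree")
# dispatch a synthetic mousedown event on the checkbox without relying on jQuery
driver.execute_script(
    "arguments[0].dispatchEvent(new MouseEvent('mousedown', {bubbles: true}));",
    checkbox)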
r = driver.find_element_by_xpath('//*[@id="form-order"]/div[2]/div[4]/label')
r.click()
Does this work for you? Sometimes it's just a question of selecting the right XPath, or adding the parentheses after click.
Does your code contain nested html tags? For example:
<html>
<div>
<p> Some text </p>
<html>
That block can't be traversed!
</html>
</div>
</html>
Anything inside the second <html> tag can't be traversed/accessed. Try to see if that's the case.
In any other case, the following code ran perfectly fine against your snippet:
driver.find_element_by_css_selector('#agree').click()

Fetching name and email from a web page [duplicate]

This question already has an answer here:
How to get data off from a web page in selenium webdriver [closed]
(1 answer)
Closed 7 years ago.
I'm trying to fetch data from a Link. I want to fetch the name/email/location/etc. content from the web page and save it. I have written the code for it, but whenever I run this code it just stores a blank list.
Please help me copy this data from the web page.
I want to fetch the company name, email, and phone number from this Link and put these contents in an Excel file. I want to do the same for all pages of the website. I have got the logic to fetch the links in the browser and switch between them, but I'm unable to fetch the data from the website. Can anybody provide an enhancement to the code I have written?
Below is the code I have written:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
import requests
import xlwt
browser = webdriver.Firefox() # Get local session of firefox
# 0 wait until the pages are loaded
browser.implicitly_wait(3) # 3 secs should be enough. if not, increase it
browser.get("http://ae.bizdirlib.com/taxonomy/term/1493") # Load page
links = browser.find_elements_by_css_selector("h2 > a")
#print link
for link in links:
    link.send_keys(Keys.CONTROL + Keys.RETURN)
    link.send_keys(Keys.CONTROL + Keys.PAGE_UP)
    #tree = html.fromstring(link.text)
    time.sleep(5)
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
The HTML code is given below:
<div class="content">
<div id="node-946273" class="node node-country node-promoted node-full clearfix">
<div class="content clearfix">
<div itemtype="http://schema.org/Corporation" itemscope="">
<fieldset>
<legend>Company Information</legend>
<div style="width:100%;">
<div style="float:right; width:340px; vertical-align:top;">
<br/>
<ul>
<li>
<strong>Company Name</strong>
:
<span itemprop="name">Sabbro - F.Z.C</span>
</li>
</ul>
When I use it, it gives me an error that 'list' object has no attribute 'text'. Can somebody help me enhance the code and make it work? I'm kind of stuck forever on this issue.
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
find_elements_by_... returns a list; you can either access the first element of that list or use the equivalent find_element_by_... method, which gets just the first match. Both options are sketched below.
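For example, either of these should work (a sketch reusing the selector from the question):
# option 1: take the first element of the list returned by find_elements_by_...
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li")[0]
print(companyNameElement.text)
# option 2: the singular find_element_by_... returns just the first match
companyNameElement = browser.find_element_by_css_selector(".content.clearfix>div>fieldset>div>ul>li")
print(companyNameElement.text)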

How to locate text input by name using Selenium WebDriver?

I'm a selenium newbie and just trying to learn the basics. I have a simple CherryPy webapp that takes a first name and last name as input:
My Webapp:
<p>
<label></label>
<input name="first_name"></input>
<br></br>
</p>
<p>
<label></label>
<input name="last_name"></input>
<br></br>
</p>
In my python console I have:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://localhost:8080')
The page loads fine in FF but I'm a little lost on how to get text into the 'first_name' and 'last_name' text boxes. I see examples where you do something like inputElement = driver.find_element_by_id("n") and then inputElement.send_keys('my_first_name') but I don't have an id...just a name. Do I need to add stuff to my web page? Thanks!
You can use find_element_by_name:
driver.find_element_by_name('first_name').send_keys("my_first_name")
driver.find_element_by_name('last_name').send_keys("my_last_name")

Select hyperlink in html document using Python and Selenium

I am trying to select a hyperlink in a document from a website, but I am not sure how to select it using Selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
names = 'Catostomus discobolus yarrowi'
driver = webdriver.Firefox()
driver.get("http://ecos.fws.gov/ecos/home.action")
SciName = driver.find_element_by_id('searchbox')
SciName.send_keys(names)
SciName.send_keys(Keys.RETURN)
The above code gets to the page that I am interested in working on, but I am not sure how to select the hyperlink. I am interested in selecting the first hyperlink. The HTML of interest is:
Zuni Bluehead Sucker (<strong>Catostomus discobolus</strong> yarrowi)
</h4>
<div class='url'>ecos.fws.gov/speciesProfile/profile/speciesProfile.action?spcode=E063</div>
<span class='description'>
States/US Territories in which the Zuni Bluehead Sucker is known to or is believed to occur: Arizona, New Mexico; US Counties in which the Zuni ...
</span>
<ul class='sitelinks'></ul>
</div>
I am guessing I could use find_element_by_xpath, but have been unable to do so successfully. I will want to always select the first hyperlink. Also, the hyperlink name will change based on the species name entered.
I added the following code:
SciName = driver.find_element_by_css_selector("a[href*='http://ecos.fws.gov/speciesProfile/profile/']")
SciName.click()
I should have read the selenium documentation more thoroughly.
Try this:
SciName = driver.find_element_by_link_text("Zuni Bluehead Sucker")
SciName.click()
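One caveat: find_element_by_link_text needs the exact, full link text, so it only matches this particular species. Since the question says the hyperlink name changes with the species entered, a hedged alternative is partial link text, assuming the searched name appears inside the result link's text as it does here:
# match the first link whose text contains the searched species name
SciName = driver.find_element_by_partial_link_text(names)
SciName.click()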
