Say that I have a piece of HTML code that looks like this:
<html>
<body>
<thspan class="sentence">He</thspan>
<thspan class="sentence">llo</thspan>
</body>
</html>
I want to get the text content of both and concatenate them into a single string in Python Selenium.
My current code looks like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
thspans = browser.find_elements(By.CLASS_NAME, "sentence")
context = ""
for thspan in thspans:
    context.join(thspan.text)
The code runs without any problem, but the context variable ends up empty. How can I get the content of both elements and join them into one string in Python Selenium?
Use context += thspan.text instead of context.join(thspan.text), just like @Rajagopalan said.
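To see why the original loop leaves context empty: str.join returns a new string (built by placing context between the characters of thspan.text), and that return value is discarded, so context itself never changes:

>>> context = ""
>>> context.join("llo")  # returns a new string; context is untouched
'llo'
>>> context
''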
Hi! You were not directing the browser to the page you actually want to scrape the data from, and you were misusing the function join. Here is code that will work for you:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
# Put the absolute path to your HTML file if you are working locally, or
# the URL of the domain you want to scrape
browser.get('file:///your/absolute/path/to/the/html/code/index.html')
thspans = browser.find_elements(By.CLASS_NAME, "sentence")
context = ''
print('thspans', thspans, end='\n\n')
for thspan in thspans:
    context += thspan.text
print(context)
Good luck!
Use this line without the loop:
context = "".join([thspan.text for thspan in thspans])
Related
How to load multiple URLs in driver.get()?
I am trying to load 3 URLs in the code below, but how do I load the other 2 URLs?
The next challenge after that is to pass authentication for all the URLs, which is the same for each.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=r"C:/Users/RYadav/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Python 3.8/chromedriver.exe")
driver.get("https://fleet.my.salesforce.com/reportbuilder/reportType.apexp")#put here the adress of your page
elem = driver.find_elements_by_xpath('//*[#id="ext-gen63"]')#put here the content you have put in Notepad, ie the XPath
button = driver.find_element_by_id('id="ext-gen63"')
print(elem.get_attribute("class"))
driver.close
submit_button.click()
Try the code below:
from selenium import webdriver

def getUrls(targeturl):
    driver = webdriver.Chrome(executable_path=r" path for chromedriver.exe")
    driver.get("http://www." + targeturl + ".com")
    # perform your tasks here
    driver.quit()

webPage = ['google', 'facebook', 'gmail']
for i in webPage:
    print(i)
    getUrls(i)
You can't load more than one URL at a time per WebDriver instance. If you want to do that, you probably need a multiprocessing module. For an iterative solution, just create a list with every URL you need and loop through it, which also avoids the credential problem.
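If you do want the three pages open at the same time, here is a minimal sketch using the standard multiprocessing module; the URLs are taken from the answer above, and chromedriver is assumed to be on PATH:

from multiprocessing import Pool
from selenium import webdriver

def visit(url):
    # each process gets its own driver, since one driver holds one page at a time
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # perform your tasks here
    finally:
        driver.quit()

if __name__ == '__main__':
    urls = ['http://www.google.com', 'http://www.facebook.com', 'http://www.gmail.com']
    with Pool(3) as pool:
        pool.map(visit, urls)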
I have this problem: I want to extract the URL of each single project from this page, but I don't know how to do that. I tried to extract it through
projects = main_page.find_all_next('div', attrs={'class':'relative self-start'})
but I don't get the links. How can I get them? Thank you in advance for helping me.
This website loads its content dynamically, so you need something that can run JavaScript. Here is a simple example of accessing the site with Selenium.
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.kickstarter.com/discover/categories/music"
dr = webdriver.Chrome()  # or PhantomJS, Firefox
try:
    dr.get(url)
    main_page = BeautifulSoup(dr.page_source, "lxml")
    projects = main_page.find_all('div', {'class': 'relative self-start'})
    project_showed = main_page.find_all("div", class_="bg-white black relative border-grey-500 border")
    print(len(projects))
except Exception as e:
    raise e
finally:
    dr.close()
But if the data does not load in time, you should use WebDriverWait or an implicit wait to wait for loading to finish.
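For illustration, a minimal sketch of an explicit wait, reusing the class names from the code above; the 10-second timeout is an arbitrary choice:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

dr = webdriver.Chrome()
dr.get("https://www.kickstarter.com/discover/categories/music")
# block until at least one project card is present, or raise after 10 seconds
WebDriverWait(dr, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.relative.self-start")))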
The links are generated by JavaScript, so you can't get them with BeautifulSoup; use a regex to capture the URLs from the JavaScript variable instead:
import requests
import re
html = requests.get('https://www.kickstarter.com/discover/categories/music').text
listURL = re.findall(r'"project":"([^"]+)', html)
for url in listURL:
    print(url)
Set-up
I use selenium for a variety of things and found myself defining the same functions over and over again.
I decided to define the functions in a separate file and import them into my work files.
Simple Example
If I define the functions and execute everything in one file, things work fine. See the simple full_script.py below:
# import webdriver
from selenium import webdriver
# create browser
browser = webdriver.Firefox(
    executable_path='/mypath/geckodriver')

# define short xpath function
def el_xp(x):
    return browser.find_element_by_xpath(x)
# navigate to url
browser.get('https://nos.nl')
# obtain title first article
el_xp('/html/body/main/section[1]/div/ul/li[1]/a/div[2]/h3').text
This successfully returns the title of the first article on this news website.
Problem
Now, when I split the script into an xpath_function.py and a run_test.py and save them in a test folder on my desktop, things don't work.
xpath_function.py
# import webdriver
from selenium import webdriver
# create browser
browser = webdriver.Firefox(
    executable_path='/mypath/geckodriver')

# define short xpath function
def el_xp(x):
    return browser.find_element_by_xpath(x)
run_test.py
import os
os.chdir('/my/Desktop/test')
import xpath_function as xf
# import webdriver
from selenium import webdriver
# create browser
browser = webdriver.Firefox(
    executable_path='/Users/lucaspanjaard/Documents/RentIndicator/geckodriver')
browser.get('https://nos.nl')
xf.el_xp('/html/body/main/section[1]/div/ul/li[1]/a/div[2]/h3').text
Executing run_test.py opens two browsers, one of which navigates to the news website, and then raises the following error:
NoSuchElementException: Unable to locate element:
/html/body/main/section[1]/div/ul/li[1]/a/div[2]/h3
I suppose the issue is that in both xpath_function.py and run_test.py I'm defining a browser.
However, if I don't define a browser in xpath_function.py, I get an error in that file that no browser is defined.
How do I solve this?
You can easily fix it by changing the definition of el_xp to include the browser as an extra parameter:
def el_xp(browser, x):
    return browser.find_element_by_xpath(x)
Now in run_test.py you call it like this:
xf.el_xp(browser, '/html/body/main/section[1]/div/ul/li[1]/a/div[2]/h3').text
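With that change, xpath_function.py no longer needs to open a browser of its own. A minimal sketch of the revised module (same file names as above, nothing else assumed):

# xpath_function.py
# no webdriver import or browser instance needed here anymore
def el_xp(browser, x):
    return browser.find_element_by_xpath(x)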
I was trying to save a PDF from a link via PhantomJS (Selenium). I referred to this code that turns webpages into PDFs, and it worked just fine when I ran exactly the same code.
So, I have this PDF I wanted to save from a direct URL, and I tried that script... it didn't work. It just saves a PDF with one white page. That's all...
My code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args': args})

driver = webdriver.PhantomJS('phantomjs')
# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')

try:
    WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except TimeoutException:
    print("I waited for far too long and still couldn't find the view.")
# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])
# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])
I'm not sure what's happening or why. I need some assistance.
EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get, and that website checks god-knows-what to decide whether the visitor is a human or a bot, so Selenium is the only way.
EDIT 2 : So, here's the website I was practicing on : http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web
Select "Cr Rev - Criminal Revision" from "Case Type" drop down and input any number in case number and year. Click on "Go".
This will show a little table, click on "view" and it should show a table on full page.
Scroll down to the "orders" table and you should see "Copy of order". That's the pdf I'm trying to get.I have tried requests as well and it did not work.
Currently, PhantomJS and headless Chrome don't support downloading files. If you are OK with the Chrome browser, please see my example below. It finds the a elements, adds a download attribute to each, and finally clicks each link to download the file to the default Downloads folder.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")
for pdfLink in pdfLinks:
    # force a download instead of opening the PDF in the viewer
    script = "arguments[0].setAttribute('download', arguments[1]);"
    driver.execute_script(script, pdfLink, pdfLink.text)
    time.sleep(1)
    pdfLink.click()
    time.sleep(3)
driver.quit()
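As a side note (my addition, not part of the answer above): recent Chrome versions can also be configured through profile preferences to download PDFs instead of previewing them. A sketch, with the download directory as a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option("prefs", {
    # open PDFs as downloads rather than in the built-in viewer
    "plugins.always_open_pdf_externally": True,
    "download.default_directory": "/tmp/pdfs",  # placeholder path
})
driver = webdriver.Chrome(options=options)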
If you're just looking to download PDFs that aren't protected behind some JavaScript or similar (essentially the straightforward cases), I suggest using the requests library instead.
import requests

url = 'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)
with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
    f.write(r.content)

# If it's a large file
with requests.get(url, stream=True) as r:
    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
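A small addition on top of that answer: it's worth calling r.raise_for_status() before writing, so a failed request doesn't silently produce a broken file.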
I recommend you look at the pdfkit library.
import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')
It makes downloading PDFs very simple with Python. You will also need to install wkhtmltopdf for the library to work.
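If wkhtmltopdf isn't on your PATH, pdfkit lets you point at the binary explicitly. A minimal sketch, where the binary path is a placeholder, not a known location:

import pdfkit

# placeholder path to the wkhtmltopdf binary
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf')
pdfkit.from_url(
    'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf',
    'out.pdf',
    configuration=config)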
You could also try the code from this link, shown below,
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
    button = browser.find_element_by_name('button')
    button.click()
    # wait for the page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
    # store it to a string variable
    page_source = browser.page_source
    print(page_source)
which you will need to edit to make work for your pdf.
I am trying to run Selenium on a local HTML string but can't seem to find any documentation on how to do so. I retrieve the HTML source from an e-mail API, so Selenium won't be able to fetch it from a URL directly. Is there any way to alter the following so that it reads the HTML string below?
Python Code for remote access:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_class_name("q")
Local HTML Code:
s = "<body>
<p>This is a test</p>
<p class="q">This is a second test</p>
</body>"
If you don't want to create a file or load a URL before being able to replace the content of the page, you can always leverage the Data URLs feature, which supports HTML, CSS and JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
html_content = """
<html>
<head></head>
<body>
<div>
Hello World =)
</div>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
If I understand the question correctly, I can imagine 2 ways to do this:
Save the HTML code as a file and load it with a file:///file/location URL. The problem with that is that the file's location, and how the file is loaded by a browser, may differ across OSs and browsers; but the implementation is very simple.
Another option is to inject your code into some page, and then work with it as regular dynamic HTML. I think this is more reliable, but also more work. This question has a good example.
Here was my solution for doing basic generated tests without having to make lots of temporary local files.
import json
from selenium import webdriver
driver = webdriver.PhantomJS() # or your browser of choice
html = '''<div>Some HTML</div>'''
# json.dumps yields a quoted, escaped JS string literal, so no extra quotes are needed
driver.execute_script("document.write({})".format(json.dumps(html)))
# your tests
If I am reading correctly you are simply trying to get text from an element. If that is the case then the following bit should fit your needs:
elem = driver.find_element_by_class_name("q").text
print(elem)
Assuming "q" is the element you need.