Scraping image, url, description from page

I am trying to get the image and video URLs from https://www.google.com/trends/home/all/IN
Here is the code:
driver = webdriver.PhantomJS('/usr/local/bin/phantomjs')
driver.set_window_size(1124, 850)
driver.get("https://www.google.com/trends/home/all/IN")
trend = {}

def getGooglerends():
    try:
        # Does this line make any sense?
        # element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('md-list-block ng-scope'))
        for s in driver.find_elements_by_class_name('md-list-block ng-scope'):
            print s.find_element_by_tag_name('img').get_attribute('src')
            print s.find_element_by_tag_name('img').get_attribute('alt')
            print s.find_elements_by_class_name('image-wrapper ng-scope').get_attribute('href')
    except:
        getNDTVTrends()

getGooglerends()
which gives
WebDriverException: Message: {"errorMessage":"Compound class names not permitted","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"111","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:57213","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"class name\", \"sessionId\": \"648251c0-1cc7-11e5-bf1c-4ff79ddbdce4\", \"value\": \"md-list-block ng-scope\"}","url":"/elements","urlParsed":{"anchor":"","query":"","file":"elements","directory":"/","path":"/elements","relative":"/elements","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/elements","queryKey":{},"chunks":["elements"]},"urlOriginal":"/session/648251c0-1cc7-11e5-bf1c-4ff79ddbdce4/elements"}}
Screenshot: available via screen
Any suggestion for this error?

Compound class names not permitted
It means that the class name locator does not accept a value containing spaces, because a space separates several class names. You need to switch to another locator strategy, such as a CSS selector or XPath.
I'm not really sure what you are trying to select, but as an example, the following XPath selects the list items that have that class:
//div[@class="homepage-trending-stories generic-container ng-scope"]/md-list[@class="md-list-block ng-scope"]
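If you would rather use a CSS selector than XPath, a compound class value can be turned into one by joining the class names with dots. This is only a minimal Python 3 sketch: the helper name compound_class_to_css is made up for illustration and is not part of Selenium.

```python
def compound_class_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute value into a CSS selector.

    Hypothetical helper for illustration; Selenium does not provide it.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    # ".a.b" matches elements that carry both class a and class b.
    return tag + "".join("." + name for name in classes)


selector = compound_class_to_css("md-list-block ng-scope")
print(selector)

# With a live driver (not run here) the selector would be passed to the
# CSS locator instead of the class name locator, for example:
# items = driver.find_elements_by_css_selector(selector)
```

In Selenium 4 the same lookup is written as driver.find_elements(By.CSS_SELECTOR, selector).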

Related

Selenium cant locate element inside ::before ::after

Element to be located
I am trying to locate a span element on a web page. I tried XPath, but it raises a timeout error. I want to locate the title span element inside a Facebook Marketplace product page.
Here is my code:
def title_detector():
    title = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, 'path'))).text
    list_data = title.split("ISBN", 1)

Try this XPath: //span[contains(text(),'ISBN')]
XPath text matching is case-sensitive, so the string has to match the case used on the page.

You can't locate pseudo-elements with XPath, only through a CSS selector.
I see it's Facebook, with its ugly class names...
I'm not sure this will work for you, since these class names may be dynamic, but it worked for me this time.
Anyway, the css_locator for that span element is .dati1w0a.qt6c0cv9.hv4rvrfc.discj3wi .d2edcug0.hpfvmrgz.qv66sw1b.c1et5uql.lr9zc1uh.a8c37x1j.keod5gw0.nxhoafnm.aigsh9s9.qg6bub1s.fe6kdd0r.mau55g9w.c8b282yb.iv3no6db.o0t2es00.f530mmz5.hnhda86s.oo9gr5id
Since we are trying to get its ::before content, we can do it with the following JavaScript:
span_locator = '.dati1w0a.qt6c0cv9.hv4rvrfc.discj3wi .d2edcug0.hpfvmrgz.qv66sw1b.c1et5uql.lr9zc1uh.a8c37x1j.keod5gw0.nxhoafnm.aigsh9s9.qg6bub1s.fe6kdd0r.mau55g9w.c8b282yb.iv3no6db.o0t2es00.f530mmz5.hnhda86s.oo9gr5id'
script = "return window.getComputedStyle(document.querySelector('{}'),':before').getPropertyValue('content')".format(span_locator)
print(driver.execute_script(script).strip())
If the CSS selector above does not work because the class names are dynamic, try to locate that span with a more stable CSS locator. It is possible; you just have to try several times until you see which class names are stable and which are not.
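The same lookup can also be wrapped in a small function that passes the selector to the script as an argument instead of formatting it into the string, which avoids quoting problems. This is a Python 3 sketch under assumptions: pseudo_content and FakeDriver are made-up names, the sample value is invented, and the only Selenium call relied on is execute_script(script, *args).

```python
def pseudo_content(driver, css_selector, pseudo="::before"):
    """Return the CSS `content` of a pseudo-element, without surrounding quotes.

    Hypothetical helper; `driver` is anything with execute_script(script, *args).
    """
    script = (
        "var el = document.querySelector(arguments[0]);"
        "if (!el) { return null; }"
        "return window.getComputedStyle(el, arguments[1])"
        ".getPropertyValue('content');"
    )
    raw = driver.execute_script(script, css_selector, pseudo)
    if raw is None:
        return None
    raw = raw.strip()
    # String content usually comes back wrapped in quotes; drop one matching pair.
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
        raw = raw[1:-1]
    return raw


class FakeDriver:
    """Stand-in for a real driver so the sketch runs without a browser."""

    def __init__(self, value):
        self.value = value
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append(args)
        return self.value


fake = FakeDriver('"ISBN 9780000000000"')  # invented sample value
print(pseudo_content(fake, "span.title"))
```

With a real browser you would pass your driver and your own selector in place of the fake ones.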
UPD:
You don't need to locate the pseudo-elements there; it is enough to get the span itself. Something like this will do:
span_locator = '.dati1w0a.qt6c0cv9.hv4rvrfc.discj3wi .d2edcug0.hpfvmrgz.qv66sw1b.c1et5uql.lr9zc1uh.a8c37x1j.keod5gw0.nxhoafnm.aigsh9s9.qg6bub1s.fe6kdd0r.mau55g9w.c8b282yb.iv3no6db.o0t2es00.f530mmz5.hnhda86s.oo9gr5id'

def title_detector():
    # Pass the variable itself, not the string 'span_locator'.
    title = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, span_locator))).text
    title = title.strip()
    list_data = title.split("ISBN", 1)

Unable to find by class name

I want to extract data in <div class="user-profile_list __relatives"> ... (see image)
Source code of the page https://gist.github.com/mascai/59e3bf779c2ba7cecb973ab9653ed419
My code
def get_relatives(driver):
    relatives = []
    relatives_container = driver.find_element_by_class_name("user-profile_list __relatives")
    return relatives

driver = webdriver.Chrome(executable_path='chromedriver')
get_relatives(driver)
Error text
Message: no such element: Unable to locate element: {"method":"css selector","selector":".user-profile_list __relatives"}

This happens a lot. It is better to use XPath and match on the class attribute:
relatives_container = driver.find_element_by_xpath('//*[@class="user-profile_list __relatives"]')
You can also use contains() in XPath. It also works when the element has multiple classes and you write only one of them:
relatives_container = driver.find_element_by_xpath("//*[contains(@class, 'user-profile_list __relatives')]")
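One caveat with contains(@class, ...) is that it matches substrings, so it depends on the order of the classes and can also hit longer class names. A common workaround is to pad the attribute with spaces and test each class as a whole token. The sketch below (Python 3) only builds the XPath string; xpath_has_classes is a made-up helper name, not a Selenium function.

```python
def xpath_has_classes(class_names, tag="*"):
    """Build an XPath matching elements that carry every given class token.

    Hypothetical helper for illustration only.
    """
    if not class_names:
        raise ValueError("at least one class name is required")
    conditions = []
    for name in class_names:
        if not name or "'" in name or any(ch.isspace() for ch in name):
            raise ValueError("invalid class token: {!r}".format(name))
        # Padding with spaces makes ' name ' match whole tokens only.
        conditions.append(
            "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)
        )
    return "//{}[{}]".format(tag, " and ".join(conditions))


xpath = xpath_has_classes(["user-profile_list", "__relatives"], tag="div")
print(xpath)

# With a live driver (not run here):
# relatives_container = driver.find_element_by_xpath(xpath)
```

For this particular element the CSS selector .user-profile_list.__relatives expresses the same condition more briefly.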

Python Selenium send_keys to email input field

I am trying to make an auto-checkout script on https://www.footish.se/sneakers/fila-wmns-disruptor-run-1010866-60m
I have made it to the checkout page, but I am unable to enter my email in the "email" input field.
The code looks like this:
email = driver.find_element_by_xpath("/html/body/div/span/div/div/div/div[1]/div/div/div[1]/div/form/div[2]/div[1]/div/label/div/div/input")
email.send_keys("test@email.com")
I have implemented a loop that waits for the desired elements until they are loaded. One example:
while not find:
    try:
        find = driver.find_element_by_xpath("/html/body/form/div[5]/div/div[4]/div[1]/div[1]/div[2]/div[1]/div[10]/div[1]/h2")
        print("Loaded info")
    except:
        continue
The error I am getting is this:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div/span/div/div/div/div[1]/div/div/div[1]/div/form/div[2]/div[1]/div/label/div/div/input"}
How would I resolve this? Thanks in advance.

It seems that the email field is a dynamically generated element, so it is not yet present when the first lookup runs. You can use the until method of WebDriverWait to wait up to a given time and see whether the element does get added to the DOM.
Sample code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("url")
try:
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "myDynamicElement"))  # set your id here
    )
finally:
    driver.quit()
Let me know if it doesn't work. In that case the cause is likely a frame or a wrong path.
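To make the idea of an explicit wait concrete, here is a rough Python 3 sketch of what such a wait does: it polls a condition with a pause between attempts and gives up after a timeout, unlike a bare while loop with try/except that spins forever. It is not Selenium's implementation; wait_until and find_email_field are made-up names, and the lookup is simulated so the code runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, ignored=(Exception,)):
    """Call condition() until it returns a truthy value or the timeout expires.

    Illustrative stand-in for an explicit wait, not Selenium code.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # treat the listed errors as "not ready yet"
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)


# Simulated lookup: fails twice, then "finds" the element.
attempts = {"count": 0}


def find_email_field():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not in the DOM yet")
    return "email-input"


print(wait_until(find_email_field, timeout=2.0, poll=0.01, ignored=(LookupError,)))
```

In real code WebDriverWait already does this for you; by default it polls every 0.5 seconds and ignores NoSuchElementException between attempts.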

How to get iframe source from page_source when the id isn't on the iframe

Hello. I want to ask how to get a link from the page source when the iframe has no id. I asked before how to get the link using an id, and I understand that now, but I tried the same method with another link and was not successful. Here is my code:
from selenium import webdriver
# Create a new instance of the Firefox driver
driver_path = r"C:\Users\666\Desktop\New folder (8)\chromedriver.exe"
driver = webdriver.Chrome(driver_path)
# go to the google home page
driver.get("https://www.gledalica.com/sa-prevodom/iron-fist-s02e01-video_02cb355f8.html")
# find the element that's name attribute is q (the google search box)
element = driver.find_element_by_id("Playerholder")
frame = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame("iframe")
link = frame.get_attribute("src")
driver.quit()
Like this here (see image):

There are multiple ways to get it. In this case one of the easiest is to use a CSS selector:
frame = driver.find_element_by_css_selector('#Playerholder iframe')
This looks for the element with id "Playerholder" in the HTML and then for an iframe that is a descendant of it. You can then read the link with frame.get_attribute("src"); there is no need to switch into the frame for that.
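If you only have the page source as a string (for example driver.page_source), the same "iframe inside #Playerholder" idea can be sketched with the standard library. This is an alternative technique rather than the selector approach above; the class and function names, the sample HTML and the example.com URLs are all made up, and the depth counting assumes every non-void tag in the container is properly closed.

```python
from html.parser import HTMLParser


class IframeSrcFinder(HTMLParser):
    """Collect the src of iframes nested inside the element with a given id."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # greater than 0 while inside the container
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.depth > 0:
            if tag == "iframe" and attributes.get("src"):
                self.sources.append(attributes["src"])
            if tag not in self.VOID_TAGS:
                self.depth += 1
        elif attributes.get("id") == self.container_id and tag not in self.VOID_TAGS:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID_TAGS:
            self.depth -= 1


def iframe_sources(page_source, container_id):
    finder = IframeSrcFinder(container_id)
    finder.feed(page_source)
    finder.close()
    return finder.sources


# Invented sample standing in for driver.page_source.
sample_html = (
    '<div id="Playerholder"><p>player</p>'
    '<iframe src="https://example.com/embed/1"></iframe></div>'
    '<iframe src="https://example.com/other"></iframe>'
)
print(iframe_sources(sample_html, "Playerholder"))
```

Note that page_source only contains iframes that are already in the DOM when you read it, so a wait may still be needed first.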

Can't access some content on page using selenium and python

I am using selenium in a python script to login into a website where I can get an authorization key to access their API. I am able to login and navigate to the page where the authorization key is provided, I am using chrome driver for testing so I can see what's going on. When I get to the final page where the key is displayed, I can't find a way to access it. I can't see it in the page source, and when I try to access via the page element outer html, it doesn't print the value shown on the page. Here is a screenshot of what I see in the browser (I'm interested in accessing the content shown in response body):
This is the code snippet I am using to try to access the content:
auth_key = WebDriverWait(sel_browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="responseBodyContent"]')))
print auth_key.get_attribute("outerHTML")
and this is what the print statement returns:
<pre id="responseBodyContent"></pre>
I've also tried:
print auth_key.text
which returns nothing. Is there way I can extract this key from the page?

It looks like you need a custom wait condition that waits for the element and then for its text.
First define a class that finds the element, reads its innerHTML, and checks that the string is not empty. Then pass an instance of it to WebDriverWait.
See my example below.
class element_text_not_empty(object):
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            element = driver.find_element(*self.locator)
            if len(element.get_attribute('innerHTML').strip()) > 0:
                return element.get_attribute('innerHTML')
            else:
                return False
        except Exception as ex:
            print("Error while waiting: " + str(ex))
            return False

driver = webdriver.Chrome(chrome_path)
...
...
try:
    print("Start wait")
    result = WebDriverWait(driver, 20).until(element_text_not_empty((By.XPATH, '//*[@id="responseBodyContent"]')))
    print(result)
except Exception as ex:
    print("Error: " + str(ex))

Since the content of responseBodyContent is in JSON format, try parsing it, for example from the element's innerHTML:
import json
authkey_text = json.loads(auth_key.get_attribute('innerHTML'))
print str(authkey_text)
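Once the text is no longer empty, pulling a single value out of the JSON is a small step. The Python 3 sketch below assumes the response body is a JSON object; the function name extract_field, the sample string and the field name access_token are invented for illustration, since the real response is not shown in the question.

```python
import json


def extract_field(raw_text, field):
    """Parse JSON text and return one top-level field, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data.get(field)


# Invented sample standing in for the text of <pre id="responseBodyContent">.
sample = '{"access_token": "abc123", "expires_in": 3600}'
print(extract_field(sample, "access_token"))
```

json.loads raises ValueError on an empty string, so this only makes sense after the wait for non-empty text has succeeded.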
