I noticed that Facebook has some strange class names that look computer-generated. What I don't know is whether these classes are at least constant over time, or whether they change at some interval. Maybe someone who has experience with that can answer. The only thing I can see is that when I exit Chrome and open it again they are still the same, so at least they don't change every browser session.
So I'd guess the best way to go about scraping Facebook would be to rely on elements in the user interface and assume the structure stays the same, for example to get the address from the About section with something like this:
from selenium import webdriver

driver = webdriver.Chrome("C:/chromedriver.exe")
driver.get("https://www.facebook.com/pg/Burma-Superstar-620442791345784/about/?ref=page_internal")
# wait some time
address_elements = driver.find_elements_by_xpath("//span[text()='FIND US']/../following-sibling::div//button[text()='Get Directions']/../../preceding-sibling::div[1]/div/span")
for item in address_elements:
    print(item.text)
You were pretty much correct. Facebook is built with ReactJS, which is evident from the presence of the following keywords and tags within the HTML DOM:
{"react_render":true,"reflow":true}
<!-- react-mount-point-unstable -->
["React-prod"]
["ReactDOM-prod"]
ReactComposerTaggerType:{r:["t5r69"],be:1}
So the dynamically generated class names are bound to change after certain intervals.
Solution
The solution would be to use the static attributes to construct a dynamic Locator Strategy.
To retrieve the first line of the address just below the text FIND US, you need to induce WebDriverWait in conjunction with expected_conditions as visibility_of_element_located(), and you can use the following optimized solution (note the trailing .text, so the element's text is printed rather than the element object itself):
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]"))).text)
References
You can find some relevant discussions in:
Logging into Facebook using Selenium
Why Selenium driver fail to recognize ID element of Facebook login page?
Outro
Note: Scraping Facebook violates section 3.2.3 of their Terms of Service; you are liable to be questioned and may even end up in Facebook Jail. Use the Facebook Graph API instead.
Related
I have a list of domains that I would like to loop over and screenshot using Selenium. However, the cookie consent pop-up means the full page is not viewable. Most of them have different consent buttons - what is the best way of accepting these? Or is there another method that could achieve the same results?
urls for reference: docjournals.com, elcomercio.com, maxim.com, wattpad.com, history10.com
You'll need to click accept individually for every website.
You can do that using:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, "your_XPATH_locator").click()
To get around the XPath selectors varying from page to page, you can check driver.current_url and use the URL to figure out which selector you need.
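A minimal sketch of that URL-to-selector dispatch, assuming a hand-maintained mapping (the hostnames are from the question, but the XPath values here are illustrative placeholders, not the real locators for those sites):

```python
from urllib.parse import urlparse

# Hypothetical hostname -> consent-button XPath mapping; the selectors below
# are placeholders, not verified against the actual sites.
CONSENT_SELECTORS = {
    "docjournals.com": "//button[normalize-space()='Accept']",
    "elcomercio.com": "//button[normalize-space()='Aceptar']",
}

def selector_for(url):
    """Return the consent selector for the page's hostname, if known."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return CONSENT_SELECTORS.get(host)
```

You would then call something like `driver.find_element(By.XPATH, selector_for(driver.current_url)).click()`, guarding against `None` for domains you haven't mapped yet.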
Or alternatively if you iterate over them anyways you can do it like this:
page_1 = {
    'url': 'https://docjournals.com',
    'selector': 'example_selector_1'
}
page_2 = {
    'url': 'https://elcomercio.com',
    'selector': 'example_selector_2'
}
pages = [page_1, page_2]
for page in pages:
    driver.get(page['url'])
    driver.find_element(By.XPATH, page['selector']).click()
From the snapshot, you can observe that different URLs have different consent buttons; they may vary with respect to:
innerText
tag
attributes
implementation (iframe / shadowRoot)
Conclusion
There can't be a generic solution to accept/deny the cookie consent, as at times:
You may need to induce WebDriverWait for the element_to_be_clickable() and click on the consent.
You may need to switch to an iframe. See: Unable to locate cookie acceptance window within iframe using Python Selenium
You may need to traverse within a shadowRoot. See: How to get past a cookie agreement page using Python and Selenium?
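When the buttons only differ in text or attributes (cases not involving an iframe or shadowRoot), one pragmatic approach is a fallback loop over several candidate locators. A sketch, where `find` stands in for something like `lambda xp: driver.find_elements(By.XPATH, xp)` and the XPaths are illustrative guesses, not verified against the listed sites:

```python
# Candidate consent-button XPaths, tried in order; illustrative only.
CANDIDATE_XPATHS = [
    "//button[normalize-space()='Accept all']",
    "//button[normalize-space()='I agree']",
    "//button[contains(@id, 'accept')]",
]

def click_first_match(find, xpaths=CANDIDATE_XPATHS):
    """Click the first element matched by any candidate XPath.

    Returns the XPath that matched, or None if nothing did.
    """
    for xpath in xpaths:
        matches = find(xpath)
        if matches:
            matches[0].click()
            return xpath
    return None
```

This keeps per-site knowledge out of the screenshot loop, at the cost of possibly clicking the wrong button on pages with unusual markup.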
I am trying to get the link descriptions of DuckDuckGo search results using the following:
results = browser.find_elements_by_xpath("//div[@id='links']/div/div/div[2]")
description = []
for result in results:
    description.append(result.text)
I am getting the error 'list' object has no attribute 'text'. I was able to use a similar method to get the search result titles, but for some reason I am unable to extract the text from this particular xpath.
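That error usually means `.text` was accessed on the list returned by `find_elements_*` rather than on each element inside it. A minimal illustration with a stand-in class (FakeElement is a stub, not a real Selenium object):

```python
# Stand-in for Selenium's WebElement: only the .text attribute matters here.
class FakeElement:
    def __init__(self, text):
        self.text = text

results = [FakeElement("first description"), FakeElement("second description")]

# results.text  # would raise AttributeError: 'list' object has no attribute 'text'
descriptions = [result.text for result in results]  # iterate over the elements instead
```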
To extract the link descriptions of the search results from DuckDuckGo, you have to induce WebDriverWait for the visibility of all elements located, and you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://duckduckgo.com/')
search_box = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q")))
search_box.send_keys("Selenium")
search_box.submit()
elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='links']/div/div/div[2]")))
for ele in elements:
    print(ele.text)
driver.quit()
Console Output:
What is Selenium? Selenium automates browsers.That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that.
Selenium is a mineral found in the soil. Selenium naturally appears in water and some foods. While people only need a very small amount, selenium plays a key role in the metabolism.. Why do people ...
Selenium is a chemical element with symbol Se and atomic number 34. It is a nonmetal (more rarely considered a metalloid) with properties that are intermediate between the elements above and below in the periodic table, sulfur and tellurium, and also has similarities to arsenic.
Selenium is a trace mineral found naturally in the soil that also appears in certain high-selenium foods, and there are even small amounts in water.. Selenium is an extremely vital mineral for the human body as it increases immunity, takes part in antioxidant activity that defends against free radical damage and inflammation, and plays a key role in maintaining a healthy metabolism.
Introduction. Selenium is a trace element that is naturally present in many foods, added to others, and available as a dietary supplement. Selenium, which is nutritionally essential for humans, is a constituent of more than two dozen selenoproteins that play critical roles in reproduction, thyroid hormone metabolism, DNA synthesis, and protection from oxidative damage and infection [].
Selenium is an essential trace mineral that is important for many bodily processes, including cognitive function, a healthy immune system, and fertility in both men and women.
Your body relies on selenium, an important mineral, for many of its basic functions, from reproduction to fighting infection. The amount of selenium in different foods depends on the amount of ...
Overview Information Selenium is a mineral. It is taken into the body in water and foods. People use it for medicine. Most of the selenium in the body comes from the diet. The amount of selenium ...
Selenium WebDriver. The biggest change in Selenium recently has been the inclusion of the WebDriver API. Driving a browser natively as a user would either locally or on a remote machine using the Selenium Server it marks a leap forward in terms of browser automation.
Downloads. Below is where you can find the latest releases of all the Selenium components. You can also find a list of previous releases, source code, and additional information for Maven users (Maven is a popular Java build tool).
You don't have to create a for loop for the empty list... try using this code:
results = driver.find_elements_by_xpath("//div[@id='links']/div/div/div[2]")
description = []
for result in results:
    description.append(result.text)
Example:
To test this, I simply typed 'hmm' into DuckDuckGo, so the URL is https://duckduckgo.com/?q=hmm&t=h_&ia=web
from selenium import webdriver
driver=webdriver.Chrome()
driver.get('https://duckduckgo.com/?q=hmm&t=h_&ia=web')
results = driver.find_elements_by_xpath("//div[@id='links']/div/div/div[2]")
description = []
for result in results:
    description.append(result.text)
print(description[0])
print(' ')
print(description[1])
print(' ')
print(description[2])
Output:
HMM to Develop "New-GAUS 2020"... HMM Holds 'PSA-Hyundai Pusan N... HMM Names New VLCC, 'Universal... 2019 New Year's Message; The HMM's Future Plan; HMM Blueprint for the Year 202... HMM signed the formal contract...
Hmm definition, (used typically to express thoughtful absorption, hesitation, doubt, or perplexity.) See more.
2 - used to emphasize that one has asked a question and is awaiting an answer But tell Santa the truth now, what's the most important part to a little boy or girl? The box
The search results:
I am trying to understand Python in general, as I just switched over from VBA. I am interested in the possible ways you could approach this single issue. I already worked around it by just going to the link directly, but I need to understand and apply this here.
from selenium import webdriver
chromedriver = r'C:\Users\dd\Desktop\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
url = 'https://www.fake.com/'
browser.get(url)
browser.find_element_by_id('txtLoginUserName').send_keys("Hello")
browser.find_element_by_id('txtLoginPassword').send_keys("There")
browser.find_element_by_id('btnLogin').click()
At this point, I am trying to navigate to a particular button/link.
Here is the info from the page/element
T-Mobile
Here are some of the things I tried:
for elem in browser.find_elements_by_xpath("//*[contains(text(), 'T-Mobile')]"):
    elem.click
browser.execute_script("InitiateCallBack(187, True, T-Mobile, https://www.fake.com/, TMobile)")
I also attempted to look for tags and use css selector all of which I deleted out of frustration!
Specific questions
How do I utilize the innertext,"T-Mobile", to click the button?
How would I execute the onclick event?
I've tried to read the following links, but still have not succeeded in coming up with a different way. Part of it is probably because I don't understand the specific syntax yet. This is just some of what I looked at; I spent about 3 hours trying various things before I came here!
selenium python onclick() gives StaleElementReferenceException
http://selenium-python.readthedocs.io/locating-elements.html
Python: Selenium to simulate onclick
https://stackoverflow.com/questions/43531654/simulate-a-onclick-with-selenium
https://stackoverflow.com/questions/45360707/python-selenium-using-onclick
Running javascript in Selenium using Python
How do I utilize the innertext,"T-Mobile", to click the button?
find_elements_by_link_text would be appropriate for this case.
elements = driver.find_elements_by_link_text('T-Mobile')
for elem in elements:
    elem.click()
There's also a by_partial_link_text locator if you don't have the full exact text.
How would I execute the onclick event?
The simplest way would be to simply call .click() on the element as shown above and the event should, naturally, execute at that time.
Alternatively, you can retrieve the onclick attribute and use driver.execute_script to run the js.
for elem in elements:
    script = elem.get_attribute('onclick')
    driver.execute_script(script)
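To make the attribute-then-execute flow concrete, here is the same pattern against stub objects (FakeElement and FakeDriver are stand-ins written for this sketch, not Selenium classes):

```python
# Stubs standing in for Selenium's WebElement and WebDriver.
class FakeElement:
    def __init__(self, attrs):
        self._attrs = attrs
    def get_attribute(self, name):
        return self._attrs.get(name)

class FakeDriver:
    def __init__(self):
        self.executed = []
    def execute_script(self, script):
        self.executed.append(script)

driver = FakeDriver()
elem = FakeElement({'onclick': "InitiateCallBack(187, true)"})

script = elem.get_attribute('onclick')  # read the inline handler source
if script:                              # attribute may be absent -> None
    driver.execute_script(script)       # hand it to the browser as raw JS
```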
Edit:
note that in your code you did element.click -- this does nothing. element.click() (note the parens) calls the click method.
is there a way to utilize browser.execute_script() for the onclick event
execute_script can fire the equivalent event, but there may be more listeners that you miss by doing this. Using the element click method is the most sound. There may very well be many implementation details of the site that may hinder your automation efforts, but those possibilities are endless. Without seeing the actual context, it's hard to say.
You can use JS methods to click an element or otherwise interact with the page, but you may miss certain event listeners that occur when using the site 'normally'; you want to emulate, more or less, the normal use as closely as possible.
As per the HTML you have shared, it's pretty clear the website uses JavaScript. So to click() on the link with text as T-Mobile, you have to induce WebDriverWait with the expected_conditions clause element_to_be_clickable, and you can use the following code block:
WebDriverWait(driver, 20).until(expected_conditions.element_to_be_clickable((By.XPATH, "//a[contains(.,'T-Mobile')]"))).click()
You can use the onclick handler directly, e.g. for HTML like:
<div class="button c_button s_button" onclick="submitForm('rMTF')" style="margin-bottom: 30px;">
    <input class="v_small" type="button">
    <span>
        Reset
    </span>
</div>
I am trying to scrape information from a website. So far, I've been able to access the webpage, log in with a username and password, and then print that landing page's page source into a separate .html/.txt file as needed.
Here's where the problems arise: on that "landing page," there's a table I want to scrape the data from. If I manually right-click on any integer in that table and select "inspect," I find the integer with no problem. However, when looking at the page source as a whole, I don't see the integers, just variable/parameter names. This leads me to believe it is a dynamic website.
How can I scrape the data?
I've been to hell and back trying to scrape this website, and so far, here's how the available technology has worked for me:
Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work if I log in manually.
Selenium's Chromium package has been failing on me repeatedly (on my Windows 7 laptop) and I have even posted a question about the matter here. For now I'll assume it's just a lost cause, but I'm willing to graciously accept anyone's benevolent help.
Spynner's description looked promising, but that setup has frustrated me for quite some time- and the lack of a clear introduction only compounds its cumbersome nature to a novice like myself.
I prefer to code in Python, as it is the language I am most comfortable with. I have a pending company request to have the company install Visual Studio on my computer (to try doing this in C#), but I'm not holding my breath...
If my code can be of any use, so far, here's how I'm using PhantomJS with Selenium:
# Headless Browsing Using PhantomJS and Selenium
#
# PhantomJS is installed in current directory
#
from selenium import webdriver
import time
browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550) # need a fake browser size to fetch elements
def login_entry(username, password):
    login_email = browser.find_element_by_id('UserName')
    login_email.send_keys(username)
    login_password = browser.find_element_by_id('Password')
    login_password.send_keys(password)
    submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
    submit_elem.click()
browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')
time.sleep(10)
test_output = open('phantomjs_test_source_output.html', 'w')
test_output.write(repr(browser.page_source))
test_output.close()
browser.quit()
p.s.- if anyone thinks I should be tagging javascript to this question, let me know. I personally don't know javascript but I'm sensing that it might be part of the problem/solution.
Try something like this. Sometimes with dynamic pages you need to wait for the data to load.
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(my_driver, my_time).until(EC.presence_of_all_elements_located(my_expected_element))
http://selenium-python.readthedocs.io/waits.html
https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html
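Under the hood, WebDriverWait's until() is essentially a polling loop; a rough pure-Python sketch of the idea (not the actual Selenium implementation, which also lets you configure which exceptions to ignore):

```python
import time

def until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns something truthy or `timeout` expires."""
    deadline = time.monotonic() + timeout
    last_exc = None
    while True:
        try:
            value = condition()
            if value:
                return value
        except Exception as exc:  # WebDriverWait ignores a configurable exception list
            last_exc = exc
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time") from last_exc
        time.sleep(poll)
```

This is why expected_conditions callables return the element (or False) rather than raising: the loop just keeps retrying until the return value is truthy.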
I'm trying to select a textarea wrapped in Angular 1 using Selenium, but it can't be seen in the DOM. There's a module called Pytractor; I've been trying to solve this with it but am unable to use it correctly.
Can anyone help me with this?
You can also use regular selenium bindings to test AngularJS applications. You would need to use Explicit Waits to wait for elements to appear, disappear, title/url to change etc - for any actions that would let you continue with testing the page.
Example (waiting for textarea element to appear):
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.TAG_NAME, "myaccount")))
There is one important thing that pytractor (like protractor itself) provides - it knows when AngularJS is settled and ready: models are updated, there are no outstanding async requests, etc. It doesn't mean you have to use it to test AngularJS applications, but it gives you an advantage.
Additionally, pytractor provides you with new locators, e.g. you can find an element by model or binding. That also doesn't mean you cannot find the same element using the other location techniques regular Selenium Python provides out of the box.
Note that pytractor is not actively developed and maintained at the moment.
You may just mine protractor for useful code snippets. I personally use this function that blocks until Angular is done rendering the page.
def wait_for_angular(self, selenium_driver):
    selenium_driver.set_script_timeout(10)
    selenium_driver.execute_async_script("""
        callback = arguments[arguments.length - 1];
        angular.element('html').injector().get('$browser').notifyWhenNoOutstandingRequests(callback);""")
Replace 'html' with whatever element is your 'ng-app'.
The solution for Angular 1 comes from protractor/lib/clientsidescripts.js#L51. It should also be possible to adapt it to Angular 2 using this updated code.