I have a list of domains that I would like to loop over and screenshot using selenium. However, the cookie consent column means the full page is not viewable. Most of them have different consent buttons - what is the best way of accepting these? Or is there another method that could achieve the same results?
urls for reference: docjournals.com, elcomercio.com, maxim.com, wattpad.com, history10.com
You'll need to click accept individually for every website. You can do that using:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, "your_XPATH_locator").click()
To get around the XPath selectors varying from page to page, you can check
driver.current_url and use the URL to figure out which selector you need.
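For example, a minimal sketch of that idea (the XPath values below are placeholders, not the real selectors for these sites):
consent_selectors = {
    'docjournals.com': "//button[text()='Accept']",
    'elcomercio.com': "//button[contains(., 'Aceptar')]",
}
# Pick the selector whose domain matches the page currently loaded.
for domain, selector in consent_selectors.items():
    if domain in driver.current_url:
        driver.find_element(By.XPATH, selector).click()
        break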
Alternatively, if you iterate over the pages anyway, you can do it like this:
page_1 = {
    'url': 'https://docjournals.com',
    'selector': 'example_selector_1'
}
page_2 = {
    'url': 'https://elcomercio.com',
    'selector': 'example_selector_2'
}
pages = [page_1, page_2]
for page in pages:
    driver.get(page['url'])
    driver.find_element(By.XPATH, page['selector']).click()
From the snapshots of the listed sites, you can observe that different URLs have different consent buttons; they may vary with respect to:
innerText
tag
attributes
implementation (iframe / shadowRoot)
Conclusion
There can't be a generic solution to accept/deny the cookie consent, as at times (a rough sketch covering all three cases follows this list):
You may need to induce WebDriverWait for the element_to_be_clickable() and click on the consent.
You may need to switch to an iframe. See: Unable to locate cookie acceptance window within iframe using Python Selenium
You may need to traverse within a shadowRoot. See: How to get past a cookie agreement page using Python and Selenium?
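The sketch below shows the three situations side by side; the locators are illustrative placeholders only, and in practice each case would be handled on its own rather than run sequentially:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# Case 1: the consent button sits in the main document.
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Accept')]"))).click()

# Case 2: the consent dialog lives inside an iframe.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[id*='consent']")))
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Accept')]"))).click()
driver.switch_to.default_content()

# Case 3: the consent banner is inside a shadowRoot (Selenium 4+).
host = driver.find_element(By.CSS_SELECTOR, "cmp-banner")  # hypothetical host element
host.shadow_root.find_element(By.CSS_SELECTOR, "button.accept").click()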
Related
I am trying to scrape a website with product listings that, when clicked, redirect the user to a new tab with further information / contact-the-seller details. I am trying to retrieve said URL without actually having to click on each listing in the catalog and wait for the page to load, as this would take a lot of time.
I have searched in the web inspector for the "href", but the only link available is to the image source of each listing. However, I noticed that after clicking each element a GET request gets sent to this URL (https://api.wallapop.com/api/v3/items/v6g2v4y045ze?language=es); it contains pretty much all the information I need. I'm not sure if it's of any use, but it's the furthest I've gotten.
UPDATE: I tried the code that was suggested (with modifications to specifically find the 'href' attributes in the clickable elements), but it returns None. I have been looking into finding an 'onclick' element or something similar that might have what I'm looking for, but so far it looks like the solution will end up being clicking each element and extracting all the information from there.
elements123 = driver.find_elements(By.XPATH, '//a[contains(@class, "ItemCardList__item")]')
for e in elements123:
    print(e.get_attribute('href'))
I appreciate any insights, thank you in advance.
You need something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://google.com")
# Get all the elements available with tag name 'a'
elements = driver.find_elements(By.TAG_NAME, 'a')
for e in elements:
    print(e.get_attribute('href'))
I am trying to scrape a website that requires me to first fill out certain dropdowns.
However, most of the dropdown selections are hidden and only appear in the DOM tree when I scroll down WITHIN the dropdown. Is there a solution I can use to somehow mimic a scroll wheel, or are there other libraries that could complement Selenium?
There are several ways to scroll an element into view, but the most reliable one in Selenium is evaluating JavaScript's scrollIntoView() function.
For example, here is the snippet I use for scraping twitch.tv in my blog:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")
# find last item and scroll to it
driver.execute_script("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView();
""")
The JavaScript finds all "items" on the pagination page and scrolls the last one into view. In your case you should use:
driver.execute_script("""
let item = document.querySelector('DROPDOWN CSS SELECTOR');
item.scrollIntoView();
""")
You can read more about it here: https://scrapfly.io/blog/web-scraping-with-selenium-and-python/#advanced-selenium-functions
requests and BeautifulSoup are two Python libraries that can assist with scraping data. requests lets you fetch the URL, and BeautifulSoup lets you parse and navigate the returned HTML.
To inspect a specific part of a website, right-click the item you want to scrape and choose Inspect. This reveals the DOM path to that specific tag, including the hidden parts you mention.
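A minimal sketch of that approach (the URL is a placeholder, and this only works when the data is present in the static HTML rather than rendered by JavaScript):
import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in the page you actually want to scrape.
response = requests.get("https://example.com/listing-page")
soup = BeautifulSoup(response.text, "html.parser")

# Print every link found in the static HTML.
for link in soup.select("a"):
    print(link.get("href"))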
I noticed that Facebook has some weird class names that look computer generated. What I don't know is whether these classes are at least constant over time, or whether they change at some interval. Maybe someone who has experience with that can answer. The only thing I can see is that when I exit Chrome and open it again they are still the same, so at least they don't change every browser session.
So I'd guess the best way to go about scraping Facebook would be to rely on elements in the user interface and assume the structure stays the same; for example, to get the address from the About section, something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("C:/chromedriver.exe")
driver.get("https://www.facebook.com/pg/Burma-Superstar-620442791345784/about/?ref=page_internal")
# wait some time
address_elements = driver.find_elements(By.XPATH, "//span[text()='FIND US']/../following-sibling::div//button[text()='Get Directions']/../../preceding-sibling::div[1]/div/span")
for item in address_elements:
    print(item.text)
You were pretty correct. Facebook is built through ReactJS which is pretty much evident from the presence of the following keywords and tags within the HTML DOM:
{"react_render":true,"reflow":true}
<!-- react-mount-point-unstable -->
["React-prod"]
["ReactDOM-prod"]
ReactComposerTaggerType:{r:["t5r69"],be:1}
So, the dynamically generated class names are bound to change after certain time gaps.
Solution
The solution would be to use the static attributes to construct a dynamic Locator Strategy.
To retrieve the first line of the address, just below the text FIND US, you need to induce WebDriverWait in conjunction with expected_conditions as visibility_of_element_located(), and you can use the following optimized solution:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]"))).text)
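The one-liner above assumes the usual Selenium imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC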
References
You can find some relevant discussions in:
Logging Facebook using selenium
Why Selenium driver fail to recognize ID element of Facebook login page?
Outro
Note: Scraping Facebook violates section 3.2.3 of their Terms of Service, and you are liable to be questioned and may even land in Facebook Jail. Use the Facebook Graph API instead.
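As a rough illustration of the Graph API route (the page ID, API version, field names and access token below are placeholders; check the Graph API documentation for the exact fields you need):
import requests

# Hypothetical values; a real call needs a valid page ID and access token.
page_id = "PAGE_ID"
token = "ACCESS_TOKEN"
resp = requests.get(
    f"https://graph.facebook.com/v19.0/{page_id}",
    params={"fields": "name,location", "access_token": token},
)
print(resp.json())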
I am trying to script some tests on an internal website with Selenium in Python.
This site contains a lot of iframes.
I have no issue switching between iframes, but I want to set the URL of the current iframe, and I cannot find any method to do this.
iframe1 = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'iframe1')))
driver.switch_to.frame(iframe1)
iframe2 = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'iframe2')))
driver.switch_to.frame(iframe2)
# I was expecting something like this
driver.get("/new_url_inside_my_frame.html")
But it does not work: at first because get() does not work with relative URLs, but even if I use a complete URL, it navigates the whole page and not just the iframe.
I am pretty sure this is possible, I just cannot find how anywhere.
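One possible approach (a hedged sketch, not a confirmed solution): execute_script runs in the context of the currently selected frame, so after switching into the frame you can change window.location there, which should navigate only that frame; a relative URL resolves against the frame's current URL.
# Sketch: navigate just the selected iframe via JavaScript.
driver.switch_to.frame(iframe2)
driver.execute_script("window.location.href = arguments[0];", "/new_url_inside_my_frame.html")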
I am using Selenium WebDriver to do automated testing of a website. I have been successful in clicking through numerous menus and links to a point.
At one point the website I am working with generates links that look like this:
<U onclick="HourglassSubmitItem(document.all('PageName').value, '00000001Responsibility Code')">Responsibility Code</U>
I am trying to use the .click functionality of the webdriver to click this link with no success.
Using this:
page.find_element_by_xpath("//u[contains(text(),'Responsibility Code')]")
successfully finds the U tag above, but when I add .click() to the end of this XPath, the click is not performed. It also does not generate an error. So, my question is: can Selenium be used to simulate clicks on an HTML tag that is NOT an anchor (<a>) tag? If so, how?
I will also say that I do not have control over the page I am working with, so changing the <u> to an <a> is not possible.
I would appreciate any guidance the Community could provide.
Thank you for your help,
Chris
Sometimes using JavaScript could solve the "clicking" issue:
element = page.find_element_by_xpath("//u[contains(text(),'Responsibility Code')]")
page.execute_script('arguments[0].click();', element)
You can prefer JavaScript in this case (the same approach in Java):
WebElement element = driver.findElement(By.xpath("//u[contains(text(),'Responsibility Code')]"));
JavascriptExecutor executor = (JavascriptExecutor) driver;
executor.executeScript("arguments[0].click();", element);