I am trying to scrape a website that requires me to first fill out certain dropdowns.
However, most of the dropdown selections are hidden and only appear in the DOM tree when I scroll down WITHIN the dropdown. Is there a solution I can use to somehow mimic a scroll wheel, or are there other libraries that could complement Selenium?
There are several ways to scroll an element into view, but the most reliable one in Selenium is to call JavaScript's scrollIntoView() function through execute_script().
For example, I use this snippet for scraping twitch.tv on my blog:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")
# find last item and scroll to it
driver.execute_script("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView();
""")
The JavaScript finds all the items on the paginated page and scrolls the last one into view. In your case, you should use:
driver.execute_script("""
let item = document.querySelector('DROPDOWN CSS SELECTOR');
item.scrollIntoView();
""")
You can read more about it here: https://scrapfly.io/blog/web-scraping-with-selenium-and-python/#advanced-selenium-functions
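If the dropdown only adds options to the DOM as you scroll inside it, a rough sketch along these lines may help. It assumes the dropdown has its own scrollable container; the helper name collect_dropdown_options and both CSS selectors are made up for illustration, and the pause may need tuning for a slow page.

```python
import time

# JavaScript run in the page: scroll the dropdown's own container to its
# bottom, then return the text of the options currently present in the DOM.
SCROLL_AND_READ_JS = """
const box = document.querySelector(arguments[0]);
if (!box) { return null; }
box.scrollTop = box.scrollHeight;
return Array.from(box.querySelectorAll(arguments[1]), el => el.textContent.trim());
"""


def collect_dropdown_options(driver, container_css, option_css, pause=0.5, max_rounds=50):
    """Scroll inside a dropdown until no new options appear; return them in the order seen."""
    seen = {}  # a dict keeps insertion order and drops duplicates
    for _ in range(max_rounds):
        texts = driver.execute_script(SCROLL_AND_READ_JS, container_css, option_css)
        if texts is None:
            raise LookupError("no element matches %r" % container_css)
        before = len(seen)
        for text in texts:
            seen.setdefault(text, None)
        if len(seen) == before:
            break  # a full scroll produced nothing new, so assume the list is complete
        time.sleep(pause)  # give the page time to render the next batch
    return list(seen)
```

Collecting into a dict rather than reading once at the end matters when the dropdown is virtualized, since such lists often remove earlier options from the DOM as later ones are rendered.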
requests and BeautifulSoup are two Python libraries that can assist with scraping data. requests downloads the HTML at a URL, and BeautifulSoup parses that HTML so you can search it by tag and attribute. Neither one runs JavaScript, so they only see what is in the HTML the server sends.
To inspect a specific part of a website, right-click the item you want to scrape and choose Inspect. This opens the developer tools at that tag and expands the path of parent elements leading to it.
Related
I know there are plenty of ways to get the HTML source given a page URL.
But is there a way to get the current HTML of a page if it displays data after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks
I believe you're looking for a class of tools known as "headless browsers". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you search for headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then look up the innerHTML property of the targeted element.
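Those three steps could look roughly like the sketch below. The helper name html_after_click and the selectors in the usage comment are hypothetical; find_element, click, and get_attribute are ordinary Selenium calls, and the locator is passed as the plain string "css selector", which is the value of Selenium's By.CSS_SELECTOR.

```python
def html_after_click(driver, button_css, target_css):
    """Click a button, then return the rendered innerHTML of another element."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    driver.find_element("css selector", button_css).click()
    target = driver.find_element("css selector", target_css)
    return target.get_attribute("innerHTML")


# Hypothetical usage with a real browser (URL and selectors are placeholders):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com/random-data")
# print(html_after_click(driver, "button#generate", "div#output"))
# driver.quit()
```

If the data arrives asynchronously after the click, the target may still be empty at the moment it is read, so a short wait between the click and the read may be needed.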
I am using Selenium along with Python to scrape some pages. I have many web pages that represent the same type of object (football player information), but each of them has a slightly different HTML layout. In particular, my main issue is that the div class identifiers change unpredictably when I refresh or move to another page.
In this specific case, I would like to get the data in the div whose class identifier is "jss176", but when I go to another player this changes to "jss450", for example, with no meaningful pattern to be found.
Is there a way I can get around this? I was thinking of navigating through the children starting from the div with id = "root", but I can't seem to find a good piece of code to achieve this.
Thank you very much!
If only the identifiers change, but not the page structure, you could locate the data by XPath.
https://www.tutorialspoint.com/what-is-xpath-in-selenium-with-python
In Chrome's developer tools, you can right-click the div you want in the Elements panel and choose Copy > Copy XPath to get its path.
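To show why a structural path survives the class names changing, here is a small self-contained sketch. The two page snippets are invented (the real layout is not shown in the question), and it uses the standard library's ElementTree, which supports a limited subset of XPath, so that it runs without a browser. With Selenium, the corresponding path for this made-up layout would be something like //div[@id='root']/div[2]/span[2].

```python
import xml.etree.ElementTree as ET

# Two hypothetical player pages: same structure, different generated class names.
PAGE_A = """<div id="root"><div class="jss12"><h1>Player A</h1></div>
<div class="jss176"><span>Goals</span><span>10</span></div></div>"""
PAGE_B = """<div id="root"><div class="jss301"><h1>Player B</h1></div>
<div class="jss450"><span>Goals</span><span>7</span></div></div>"""

# Positional path from the stable root element: second child div, second span.
STRUCTURAL_PATH = "./div[2]/span[2]"


def goals(page_source):
    """Return the text found at STRUCTURAL_PATH, ignoring class names entirely."""
    root = ET.fromstring(page_source)
    return root.find(STRUCTURAL_PATH).text


print(goals(PAGE_A))  # 10
print(goals(PAGE_B))  # 7
```

A positional path like this breaks if the page gains or loses a sibling element, so anchoring on something stable such as visible label text, where the page has one, is usually more robust.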
So the title says it all. I am trying to scrape the connections based upon a search term I supply. Once the page renders, all of the connections aren't in the html as if they are hidden until I scroll down to see them. Is there a way to use Selenium to show all of the connections at once? I have no code to post since this is only a question.
You can use Selenium to scroll down the page, which loads the data you intend to grab.
The code below scrolls to the bottom of the page:
...
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
...
I had a use case where I had to hit the bottom of the page several times in a row to load all the content I needed, and I used the approach above.
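Hitting the bottom repeatedly can be wrapped in a loop that stops once the page height no longer grows. This is only a sketch: the function name scroll_until_stable and the default pause and round limit are my own choices, and a slow site may need a longer pause.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # let the page fetch and render the next batch
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds
```

After it returns, driver.page_source should contain everything that was loaded along the way, as long as the page does not remove earlier items while scrolling.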
Hope this helps...
I am using Selenium WebDriver to do automated testing of a website. I have been successful in clicking through numerous menus and links to a point.
At one point the website I am working with generates links that look like this:
<U onclick="HourglassSubmitItem(document.all('PageName').value, '00000001Responsibility Code')">Responsibility Code</U>
I am trying to use the .click functionality of the webdriver to click this link with no success.
Using this:
page.find_element_by_xpath("//u[contains(text(),'Responsibility Code')]")
successfully finds the U tag above, but when I add .click() to the end of this call, the click is not performed, and no error is raised either. So, my question is: can Selenium be used to simulate clicks on an HTML tag that is NOT an anchor (<a>) tag? If so, how?
I will also say that I do not have control over the page I am working with, so changing the <U> to an <a> is not possible.
I would appreciate any guidance the Community could provide.
Thank you for your help,
Chris
Sometimes using JavaScript could solve the "clicking" issue:
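from selenium.webdriver.common.by import By  # locator strategies for find_element()
element = page.find_element(By.XPATH, "//u[contains(text(),'Responsibility Code')]")
# Run the click in the page's own JavaScript instead of through WebDriver's native click
page.execute_script('arguments[0].click();', element)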
You can use a JavaScript click in this case. In Java:
WebElement element = driver.findElement(By.xpath("//u[contains(text(),'Responsibility Code')]"));
JavascriptExecutor executor = (JavascriptExecutor) driver;
executor.executeScript("arguments[0].click();", element);
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only html that corresponds to the table of interest is a tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds (for example, with an implicit wait) before executing the script.
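Instead of a fixed pause, the script can be retried until it returns something. The sketch below is one way to do that by hand; the helper name wait_for_script and the selector in the usage comment are hypothetical. Selenium also ships its own waiting tools (implicitly_wait and WebDriverWait), which are probably the better choice in real code.

```python
import time


def wait_for_script(driver, script, timeout=10.0, poll=0.5):
    """Re-run `script` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(script)
        if result:  # an empty list, None, or False means "not rendered yet"
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("script returned nothing truthy within %.1f s" % timeout)
        time.sleep(poll)


# Hypothetical usage: collect link targets once the search results have rendered.
# hrefs = wait_for_script(
#     driver,
#     "return Array.from(document.querySelectorAll('div#results a'), a => a.href);",
# )
```

Returning the href values rather than clicking inside the script keeps the navigation step in Python, where each link can be opened with driver.get().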
I've run into a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find it after navigating to the website, since doing that runs the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
from selenium.webdriver.common.by import By  # locator strategies for find_element()
driver.find_element(By.XPATH, "//*[@title='Link Title']").click()
where the title is whatever appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element(By.PARTIAL_LINK_TEXT, 'Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element(By.ID, 'Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title for every element on the page and save that to a file that you can search through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than the source code for the site would:
from selenium.webdriver.common.by import By  # locator strategies for find_elements()
all_elements = driver.find_elements(By.XPATH, '//*')
for element in all_elements:
    # one line per element: its id, text, and title attributes
    print('ID = %s TEXT = %s Title = %s' % (element.get_attribute("id"), element.get_attribute("text"), element.get_attribute("title")))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.
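To actually save that listing to a file, something like the following sketch could work. The function name dump_element_attributes and the CSV layout are my own choices, not anything Selenium provides; the locator is passed as the plain string "xpath", which is the value of Selenium's By.XPATH.

```python
import csv


def dump_element_attributes(driver, path):
    """Write id, text, and title of every element to a CSV file; return the row count."""
    # "xpath" is the string value of selenium's By.XPATH
    elements = driver.find_elements("xpath", "//*")
    rows = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text", "title"])
        for element in elements:
            row = [
                element.get_attribute("id") or "",
                element.get_attribute("text") or "",
                element.get_attribute("title") or "",
            ]
            if any(row):  # skip elements with nothing to identify them by
                writer.writerow(row)
                rows += 1
    return rows
```

The resulting file can be opened in a spreadsheet or searched with grep for the link text you see in the browser.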