I am using Selenium with Python to scrape some pages. I have many web pages that represent the same type of object (football player information), but each of them has a slightly different HTML layout. In particular, my main issue is that the div class identifiers change when refreshing or navigating to another page, in an unpredictable way.
In this specific case I would like to get the data in the div whose class identifier is "jss176", but when I move to another player this changes to, for example, "jss450", with no meaningful pattern to be found.
Is there a way around this? I was thinking of navigating through the children starting from the div with id="root", but I can't find a good piece of code to achieve this.
Thank you very much!
If only the identifiers change, but not the page structure, you can scrape the info by XPath.
https://www.tutorialspoint.com/what-is-xpath-in-selenium-with-python
You can directly access the div you want: right-click it in Chrome's developer tools and choose the "Copy XPath" option.
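For example, a minimal sketch (the URL and both XPaths are placeholders; the second shows one hypothetical way to descend from the div with id "root" instead of relying on the generated jss* class names):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/player")  # placeholder URL

# Option 1: paste whatever Chrome's "Copy XPath" gives you (placeholder shown)
player_div = driver.find_element_by_xpath("//*[@id='root']/div/div[2]")

# Option 2: walk down from the root div by position, ignoring class names
player_div = driver.find_element_by_xpath("//div[@id='root']/div[1]/div[2]")

print(player_div.text)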
I am trying to scrape a website that requires me to first fill out certain dropdowns.
However, most of the dropdown selections are hidden and only appear in the DOM tree when I scroll down WITHIN the dropdown. Is there a solution I can use to somehow mimic a scroll wheel, or are there other libraries that could complement Selenium?
There are several ways to scroll an element into view, but the most reliable one in Selenium is evaluating JavaScript's scrollIntoView() function.
For example, I use this snippet for scraping twitch.tv in my blog:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")
# find last item and scroll to it
driver.execute_script("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView();
""")
The JavaScript finds all "items" on the pagination page and scrolls the last one into view. In your case you should use:
driver.execute_script("""
let item = document.querySelector('DROPDOWN CSS SELECTOR');
item.scrollIntoView();
""")
You can read more about it here: https://scrapfly.io/blog/web-scraping-with-selenium-and-python/#advanced-selenium-functions
requests and BeautifulSoup are two Python libraries that can assist with scraping data: requests fetches the page at a given URL, and BeautifulSoup parses the returned HTML so you can navigate its tags.
To inspect a specific part of a website, just right-click the item you want to scrape and choose Inspect. This expands the DOM tree down to that specific tag, revealing the hidden paths you mention.
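A minimal sketch of that workflow (the URL and the tag being selected are placeholders):
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML for the page (placeholder URL)
response = requests.get("https://example.com/players")
soup = BeautifulSoup(response.text, "html.parser")

# Navigate to the tag you identified via right-click > Inspect (placeholder selector)
target = soup.find("div", attrs={"id": "root"})
print(target.get_text(strip=True))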
I am trying to find the names of all iframes in a web page. When I run driver.find_element_by_xpath("//iframe") I get session="f139d552bcf5b17598ba7b5af3987c8", element="04036644-d6cf-40a1-9434-5ce5d951e9a". How do I correlate this back to a name that is useful in HTML, so that I can switch by different locators like tag, CSS, id, etc.? Preferably not in Java, but if that is the only solution available, that's fine. What kind of id/attribute is element="04036644-d6cf-40a1-9434-5ce5d951e9a"? Can someone provide an example of what the code would look like?
This particular website has a 'show more' button that loads more data into a table. But this data seems to be loaded at the start, because I can click the button and expand the table even in offline mode.
Is there a way to scrape the whole source in one go, without clicking this button many times over in Selenium, given that the entire table seems to be loaded when the page is first loaded?
driver.page_source does not show the whole thing in this case, only what is visible when the browser opens.
Using Python and Selenium with Google Chrome.
If all the data is indeed loaded at the start, then it can surely be found by looking at the DOM (at the table's tag, or possibly any other tag containing the data). An easy way to do that is to open the developer console (F12) and use your browser's inspect-element tool.
Now, to answer your question, I'm going to scrape the data using BeautifulSoup at the location (tag) we find. Scraping with Selenium is pretty similar to scraping with BeautifulSoup, so the concept should carry over.
For example, say your table resides in a div (having random attributes, let's say a class called 'randomclass'). The table tag is 'ul', and each entry is stored in an 'li'; the value itself is available via li.get_text().
To select the div:
selected_div = soup.find('div', attrs={'class': 'randomclass'})
To select the table inside the div:
table = selected_div.find('ul')
To iterate through the table rows and manage data:
mylist = []
for li in table.find_all('li'):
    mylist.append(li.get_text())
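Putting those steps together, a minimal runnable sketch (the HTML is inlined so the example is self-contained; 'randomclass' is the placeholder class from above):
from bs4 import BeautifulSoup

html = """
<div class="randomclass">
  <ul>
    <li>Row one</li>
    <li>Row two</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

mylist = []
selected_div = soup.find('div', attrs={'class': 'randomclass'})
table = selected_div.find('ul')
for li in table.find_all('li'):
    mylist.append(li.get_text())
print(mylist)  # ['Row one', 'Row two']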
I am trying to scrape information off websites like this:
https://www.glassdoor.com/Overview/Working-at-7-Eleven-EI_IE3581.11,19.htm
using Python + BeautifulSoup + mechanize.
Accessing anything on the main site is no problem. However, I also need the information that appears in an overlay window that opens when one clicks the "Rating Trends" button next to the bar with stars.
This overlay-window can also be accessed directly by using the url:
https://www.glassdoor.com/Reviews/7-Eleven-Reviews-E3581.htm#trends-overallRating
The HTML associated with this page is a modification of the original site's HTML.
However, regardless of what element I try to find (via findAll) on that overlay window, BeautifulSoup returns zero hits.
How can I fix this? I tried adding a sleep between accessing the website and reading anything in, to no avail.
Thanks!
If you're using the Chrome browser, right-click the background of that page (without the additional information displayed) and choose 'Inspect' from the context menu (on Windows, anyway), then open the 'Network' tab so that you can see network traffic. Now click on 'Rating trends'. The entry marked 'xhr' will be https://www.glassdoor.ca/api/employer/3581-rating.htm?locationStr=&jobTitleStr=&filterCurrentEmployee=false&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME (I much hope!) and its contents will be the following.
{"employerId":3581,"ratings":[{"hasRating":true,"type":"overallRating","value":2.9},{"hasRating":true,"type":"ceoRating","value":0.54},{"hasRating":true,"type":"bizOutlook","value":0.35},{"hasRating":true,"type":"recommend","value":0.4},{"hasRating":true,"type":"compAndBenefits","value":2.4},{"hasRating":true,"type":"cultureAndValues","value":2.5},{"hasRating":true,"type":"careerOpportunities","value":2.5},{"hasRating":true,"type":"workLife","value":2.4},{"hasRating":true,"type":"seniorManagement","value":2.3}],"week":0,"year":0}
Whether this URL can be altered for use in obtaining information for other employers, I regret, I cannot tell you.
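If you just need the numbers, a minimal sketch of fetching that endpoint directly with requests (assuming the URL still responds to a plain HTTP client; the browser-like User-Agent header is a guess at what the site expects):
import requests

# The endpoint captured from the Network tab, parameters copied verbatim
url = ("https://www.glassdoor.ca/api/employer/3581-rating.htm"
       "?locationStr=&jobTitleStr=&filterCurrentEmployee=false"
       "&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME")

response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
for rating in response.json()["ratings"]:
    print(rating["type"], rating["value"])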
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a div indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and getting the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to (implicitly) wait a few seconds before executing the script.
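For instance, a sketch using an explicit wait (the selector is the placeholder from above, and the URL is hypothetical):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds for the dynamically generated div to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")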
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov, to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium can usually still find them after navigating to the website, since doing that engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the website and save them to a file you can look through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, a situation where multiple links share the same title, text, etc., you may want to use the find_elements (plural) methods to get lists of all the elements satisfying your criteria, specify the XPath more explicitly, and so on.
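For example, with a hypothetical shared title:
# Collect every element sharing the title, then inspect or pick from the list
links = driver.find_elements_by_xpath("//*[@title='Link Title']")
for link in links:
    print(link.get_attribute("href"))
links[0].click()  # e.g. follow the first match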