I am trying to scrape a website using Selenium in Python in order to extract a few links.
But for some of the tags, I am not able to find the links. When I inspect those elements, the browser points me to ::before and ::after pseudo-elements. One workaround is to click on the element, which opens a new window, and grab the link from that window, but this solution is quite slow. Can someone help me understand how I can fetch these links directly from this page?
It looks like the links you are trying to extract are not statically stored inside the i elements you see there. These links are generated dynamically by JavaScript running on that page.
So the answer is "no": you cannot extract those links from that page without iterating over its elements the way a human would.
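If you do fall back on the click-based workaround you described, you can at least keep it contained. A minimal sketch, assuming each link opens in a new window (the URL and the i.link-icon locator are hypothetical):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/page-with-links")  # placeholder URL

original = driver.current_window_handle
# hypothetical locator for one of the i elements carrying a link
driver.find_element_by_css_selector("i.link-icon").click()

# switch to the newly opened window and read its URL
for handle in driver.window_handles:
    if handle != original:
        driver.switch_to.window(handle)
        break
print(driver.current_url)

driver.close()                     # close the new window
driver.switch_to.window(original)  # return to the original page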
Using Selenium in Python.
I have a scraping tool that opens a site, grabs text elements using XPath, cleans those elements, and closes the page. The tool is getting too bulky to clean the elements while the driver is still connected. So instead I want to open the page, grab the entire HTML source, close the page, and then pull what I want out of that source using XPath. But since the page is now just text, I'm unable to use Selenium's XPath methods. Any recommendations?
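One option here is lxml, which accepts raw HTML text and supports XPath directly. A minimal sketch of that flow (the URL and XPath are placeholders):

from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
source = driver.page_source
driver.quit()  # the browser is closed before any parsing happens

# parse the saved source and run XPath queries against it offline
tree = html.fromstring(source)
texts = tree.xpath("//div[@class='target']//text()")  # placeholder XPath
cleaned = [t.strip() for t in texts if t.strip()]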
There is a real estate website with infinite scroll, and I have tried to extract the companies' names and other details, but I have a problem with writing the selectors. I need some insights, as a new learner of Scrapy.
HTML Snippet:
This is after handling whether a "more" button is available on the website.
In most browsers you can copy selectors directly from the inspector: right-click the element and, depending on the function you are using, copy the "xpath" or another selector format for the scraping process.
If that does not help, please give the link to the webpage and point out which values you want to scrape.
As I understand it, you want to get the href from the tag and you don't know how to do it in Scrapy.
You just need to append ::attr(ng-href) to the end of your CSS selector:
link = response.css('your_selector::attr(ng-href)').get()
To make it easier for you, your CSS selector should be
link = response.css('.companyNameSpecs a::attr(ng-href)').get()
But since it looks like href and ng-href are the same, you can also do the same with href:
link = response.css('your_selector::attr(href)').get()
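For context, a minimal spider using that selector might look like this; a sketch where the start URL is a placeholder and .companyNameSpecs is taken on faith from your snippet:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "companies"
    start_urls = ["https://example.com/listings"]  # placeholder URL

    def parse(self, response):
        # .getall() returns every matching link, .get() only the first
        for link in response.css(".companyNameSpecs a::attr(ng-href)").getall():
            yield {"link": response.urljoin(link)}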
This particular website has a 'show more' button to load more data into a table. But the data seems to be loaded at the start, because I can click the button and expand the table even in offline mode.
Is there a way to scrape the whole source code in one go, without clicking this button many times over in Selenium? It seems the entire table is loaded when the page first opens.
driver.page_source does not show the whole thing in this case, only what is visible when the browser opens.
Using Python, Selenium with Google Chrome.
If all the data is indeed loaded at the start, then it can surely be found by looking at the DOM (at the table's tag, or possibly any other tag containing the data). An easy way to do that is to open the developer tools (F12) and use the inspect-element tool provided by your browser.
Now to answer your question: I'm going to scrape the data using BeautifulSoup at the location found above. I've seen that scraping with Selenium is pretty similar to BeautifulSoup, so you should get the concept either way.
For example, say your table resides in a div (with random attributes, let's say a class called 'randomclass'). The table tag is 'ul', each entry is stored in a 'li', and its text is read with li.get_text().
To select the div:
selected_div = soup.find('div', attrs={'class': 'randomclass'})
To select the table inside the div:
table = selected_div.find('ul')
To iterate through the table rows and collect the data:

mylist = []
for li in table.find_all('li'):
    mylist.append(li.get_text())
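Putting those pieces together with Selenium supplying the source, a runnable sketch (the URL and the 'randomclass' attributes are placeholders):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/table-page")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

selected_div = soup.find('div', attrs={'class': 'randomclass'})
table = selected_div.find('ul')

mylist = []
for li in table.find_all('li'):
    mylist.append(li.get_text())
print(mylist)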
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a div indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script, for example with an explicit wait as sketched below.
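A sketch of that wait-then-click pattern using an explicit wait (the URL is a placeholder; div#your-link-to-follow is carried over from above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL

# wait up to 10 seconds for the dynamically generated div to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")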
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov, to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them once it navigates to the website, since doing that engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the site and save them to a file you can search to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the site's source code would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, multiple links with the same title/text, etc., then you may want to use the find_elements (plural) methods to get lists of everything satisfying your criteria, specify the XPath more explicitly, and so on, as in the sketch below.
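For instance, to click through several links that share the same title, a sketch (the title value is a placeholder):

# collect every element whose title matches, then visit each in turn
links = driver.find_elements_by_xpath("//*[@title='Link Title']")
for index in range(len(links)):
    # re-find the elements on each pass: clicking may reload the page
    # and leave the previously found references stale
    driver.find_elements_by_xpath("//*[@title='Link Title']")[index].click()
    driver.back()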
I'm using Python 2.7 with BeautifulSoup and urllib2, and I'm trying to scrape this page: angel.co/companies
As you can see, it shows a list of companies and ends with a "More" button to show the rest. When you click the button, more companies appear, and a new tag is created with the new list of results. The button is in this div: <div class="more" data-page="2">More</div>, and each time you click it the data-page increases.
I'd like to know if it's possible to scrape this page completely (so that it clicks the "More" button each time it reaches the end). I suppose it involves scraping the CSS and changing it, but I have never done so and haven't found information about this anywhere.
Depending on what you want to do, you could use their API for this. If you are not sure what an API is or how to use one, try googling around for an answer. Here's one for starters.
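If the API route doesn't pan out, a blunt Selenium fallback is to click "More" in a loop until it is gone. A rough sketch, assuming the div.more selector from your snippet:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("https://angel.co/companies")

while True:
    try:
        more = driver.find_element_by_css_selector("div.more")
    except NoSuchElementException:
        break  # no "More" button left, so the full list has loaded
    more.click()
    time.sleep(2)  # crude wait for the next page of results to render

page_source = driver.page_source  # now contains every loaded company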