Finding downloadable content in a webpage without knowing what the page looks like - Python

I want to find downloadable content in a webpage but I don't know what the webpage looks like. Right now, I am looking at all links
links = driver.find_elements(By.XPATH, "//a[@href]")
and buttons
buttons = driver.find_elements(By.TAG_NAME, "button")
For each link (which has an href attribute, as the XPath requires), I check whether it points to one of the machine-readable formats I am looking for (either .csv or .json); if the link's URL does not end with one of these extensions, I assume it does not reference a machine-readable file.
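A minimal sketch of that suffix check, reusing the links list from the snippet above (urlparse is my own addition, an assumption so that query strings and fragments do not hide the extension):
from urllib.parse import urlparse

MACHINE_READABLE_SUFFIXES = (".csv", ".json")

candidate_urls = []
for link in links:
    href = link.get_attribute("href") or ""
    # Look only at the path component, so "?download=1" or "#section" cannot hide the suffix
    path = urlparse(href).path.lower()
    if path.endswith(MACHINE_READABLE_SUFFIXES):
        candidate_urls.append(href)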
As for the buttons, I know of no way to check what they may contain other than naively clicking them (button.click()). This is clearly dangerous, especially because the function will be applied to thousands of websites, but I don't know how else to do it.
Is there any other way I could check for downloadable content? Additionally, are there any other page elements I should be looking for, besides links and buttons, and are there any more efficient methods of doing what I want?
Any help is greatly appreciated. Thanks!

Related

Python/Selenium: Any way to wildcard the end of an xpath? Or search for a specifically formatted piece of an xpath?

I am using python / selenium to archive some posts. They are simple text + images. As the site requires a login, I'm using selenium to access it.
The problem is, the page shows all the posts, and they are only fully readable on clicking a text labeled "read more", which brings up a popup with the full text / images.
So I'm writing a script to scroll the page, click read more, scrape the post, close it, and move on to the next one.
The problem I'm running into is that each "read more" button is an identical element, an <a> whose href starts with javascript:, something like:
<a href="javascript:...">read more</a>
If I try to loop through them using XPaths, I run into the problem of them being formatted differently as well, for example:
//*[#id="page"]/div[2]/article[10]/div[2]/ul/li/a
//*[#id="page"]/div[2]/article[14]/div[2]/p[3]/a
I tried formatting my loop to just loop through the article numbers, but of course the XPaths terminate differently. Is there a way I can add a wildcard to the back half of my XPaths, or search just by the article numbers?
/ selects a direct child; use // instead to go from the <article> down to the <a>:
//*[#id="page"]/div[2]/article//a[.="read more"]
This will give you a list of elements you can iterate over. You might be able to remove the [.="read more"] part, but it might then catch unrelated <a> tags, depending on the rest of the HTML structure.
You can also try looking for the read more elements directly by text
//a[.="read more"]
I recommend using CSS selectors over XPaths. CSS selectors provide a faster, cleaner, and simpler way to deal with these queries.
('a[href^="javascript"]')
This will select every <a> element whose href attribute value begins with "javascript", which is what you are looking for...
You can learn more about Locating Elements by CSS Selectors in selenium here.
readMore = driver.find_element(By.CSS_SELECTOR, 'a[href^="javascript"]')
And about Locating Hyperlinks by Link Text (the visible link text here is "read more", so that is what you match on):
readMore_link = driver.find_elements(By.LINK_TEXT, 'read more')

Scraping a website that has certain problems

I want to scrape this website and scrape all articles by this author, with Python (the Requests or Selenium libraries), and put them in a PDF file.
However, after clicking the "Show More" button at the bottom 8 times, it no longer displays more articles, so I can't access them all (the idea was to automate Selenium to click it until all articles are shown, and then scrape them all). Is there a workaround? Are there alternative ways I can access all articles chronologically and scrape them?
My idea was to somehow check whether the links come from an alternative source, but I'm clueless. I did, however, successfully scrape the articles that are displayed.
Thanks in advance!
Use findElements and search for <h2 class="css-1j9dxys e1xfvim30">...</h2>, which will give you a list of all titles. Each time you click Show More, the size of the list grows by 10 or so. So the idea is to simply click the button until the size of the list stops changing. Use a while loop. Something like:
List<WebElement> oldList = driver.findElements(By.cssSelector("h2.css-1j9dxys.e1xfvim30"));
WebElement button = driver.findElement(By.xpath("//button[text()='Show More']"));
button.click();
List<WebElement> newList = driver.findElements(By.cssSelector("h2.css-1j9dxys.e1xfvim30"));
// keep clicking until a click no longer adds new titles
while (newList.size() != oldList.size()) {
    oldList = newList;
    button.click();
    newList = driver.findElements(By.cssSelector("h2.css-1j9dxys.e1xfvim30"));
}
I might have some mistakes in the code but the idea is there. Good luck!
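Since the question is about Python, here is the same idea translated roughly into Python/Selenium (the CSS class and button text are copied from the answer above; the short sleep after each click is an assumption to give new articles time to load):
import time
from selenium.webdriver.common.by import By

TITLE_SELECTOR = "h2.css-1j9dxys.e1xfvim30"

button = driver.find_element(By.XPATH, "//button[text()='Show More']")
old_count = -1
titles = driver.find_elements(By.CSS_SELECTOR, TITLE_SELECTOR)
# Keep clicking until a click no longer adds new titles
while len(titles) != old_count:
    old_count = len(titles)
    button.click()
    time.sleep(2)  # crude wait; a WebDriverWait would be more robust
    titles = driver.find_elements(By.CSS_SELECTOR, TITLE_SELECTOR)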

How to find an element's position using XPath?

I've been trying to write a simple script to upload 200+ links to a website I'm working on (my knowledge of Python is poor and my HTML knowledge is even poorer; I'm not a web developer, I just need to upload these links).
Well, the situation I'm in is the following: I am using Splinter (therefore, Python) to navigate the website. Certain section titles on this website will be compared with values I have in a .csv table.
For instance, in this screenshot, I am looking for the link /admin/pages/5, and I would like to compare the link's title (Explorar subpáginas de 'MA111 - Cálculo I') with my .csv table. The problem is that the link's title doesn't appear on the page.
To find the link I would guess that I should use find_by_xpath(), but I don't know how to do it. I would guess it's something like this link.
I would appreciate any help! I hope I have made myself clear.
You first need to define how you are detecting that URL, for example "it is always to the right of a certain button" or "it is the second row in a table"; that way you can build the respective XPath (which is a path to follow inside the DOM).
I am not entirely sure, but this could give you the solution:
url = browser.find_by_xpath('//td[@class="children"]/a')[0]['href']
If you are finding an <a> tag by its title, for example, try this:
url = browser.find_by_xpath('//a[contains(@title, "MA111 - Cálculo I")]')[0]['href']
If you read it out, that XPath says: search the entire DOM (//) for an <a> tag that contains "MA111 - Cálculo I" in its title attribute.
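Combining that with the .csv comparison mentioned in the question might look roughly like this in Splinter (the CSV filename and its layout, one title per row in the first column, are assumptions):
import csv

with open("sections.csv", newline="", encoding="utf-8") as f:
    expected_titles = [row[0] for row in csv.reader(f)]

for title in expected_titles:
    # Find the <a> whose title attribute contains the value from the spreadsheet
    matches = browser.find_by_xpath(f'//a[contains(@title, "{title}")]')
    if matches:
        print(title, "->", matches[0]["href"])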

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
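In Python, an explicit wait is one way to do that (div#your-link-to-follow is the same placeholder selector used above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dynamically generated element to show up
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")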
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov, to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them after navigating to the website, since doing that engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the site and save them to a file you can search to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    # dump the id, text, and title of every element so you can search for candidates
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, a situation where multiple links share the same title/text, etc., then you may want to use the find_elements (plural) methods to get lists of all elements satisfying your criteria, specify the XPath more explicitly, and so on.

Is a Web Crawler more suitable?

TL;DR Version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has a lot of links) --> go to the specified links --> go to
links (specified, yes, again) --> go to a certain link --> reach the final page
and download the source.
I have googled a bit and came across Scrapy, but I am not sure I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access the URLs, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all these pages and copy-paste the URLs into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this page. Then select only a few organisms (I have a list of those). Clicking one of them takes you to this page. If you look under the table "MTases active in the genome", there are enzymes which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page which has a small table on the lower right with yellow headers; under it there is an entry DNA (FASTA STYLE). Clicking on view leads to the page I'm interested in and want to download the page source from.
I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at the Norconex HTTP Collector, which I know can let you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether based on crawl depth, URL pattern, content pattern, etc.).
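As for the Scrapy part of the question: chaining callbacks is how a Scrapy spider follows a fixed path like the one described. Here is a minimal sketch, where the start URL and every XPath are placeholders rather than the real pages from the question (the final FASTA "view" hop would be one more callback of the same shape):
import scrapy

class PathSpider(scrapy.Spider):
    name = "path_spider"
    start_urls = ["https://example.org/organism-list"]  # placeholder start page

    def parse(self, response):
        # Step 1: follow only the organisms on your list (placeholder XPath)
        for href in response.xpath('//a[contains(@class, "organism")]/@href').getall():
            yield response.follow(href, callback=self.parse_organism)

    def parse_organism(self, response):
        # Step 2: follow the enzyme hyperlinks under the MTases table (placeholder XPath)
        for href in response.xpath('//table//a/@href').getall():
            yield response.follow(href, callback=self.parse_enzyme)

    def parse_enzyme(self, response):
        # Step 3: follow the "Sequence Data" link on the right-hand side
        href = response.xpath('//a[contains(text(), "Sequence Data")]/@href').get()
        if href:
            yield response.follow(href, callback=self.parse_final)

    def parse_final(self, response):
        # Final page: keep the raw source for later processing
        yield {"url": response.url, "body": response.text}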
