I've been trying to write a simple script to upload 200+ links to a website I'm working on (I have poor knowledge of Python and even poorer knowledge of HTML; I'm not a web developer, I just need to upload these links).
Well, the situation I'm in is the following: I am using Splinter (and therefore Python) to navigate the website. Certain section titles on this website will be compared with values I have in a .csv table.
For instance, in this screenshot, I am looking for the link /admin/pages/5, and I would like to compare the link's title (Explorar subpáginas de 'MA111 - Cálculo I') with my .csv table. The problem is that the link's title doesn't appear visibly on the page.
To find the link I would guess that I should use find_by_xpath(), but I don't know how to build the expression. I would guess it's something like this link.
I would appreciate any help! I hope I have made myself clear.
You first need to define how you are detecting that URL, for example "it is always to the right of a certain button" or "it is in the second row of a table"; that way you can build the respective xpath (which is a path to follow inside the DOM).
I am not entirely sure, but this could give you the solution
url = browser.find_by_xpath('//td[@class="children"]/a')[0]['href']
If you are finding the tag by the link's title, for example, try this:
url = browser.find_by_xpath('//a[contains(@title, "MA111 - Cálculo I")]')[0]['href']
If you read that xpath, it says: search the entire DOM (//) for an a tag that contains "MA111 - Cálculo I" in its title attribute.
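Putting that together with the CSV comparison, a rough sketch could look like the following (the CSV file name, the column name, and the admin URL are just guesses):

import csv
from splinter import Browser

browser = Browser()  # assumes a default webdriver is installed
browser.visit("https://example.org/admin/pages")  # placeholder URL for your admin page

with open("titles.csv", newline="", encoding="utf-8") as f:  # file and column names are guesses
    for row in csv.DictReader(f):
        title = row["title"]
        # find the first link whose title attribute contains the CSV value
        matches = browser.find_by_xpath('//a[contains(@title, "%s")]' % title)
        if not matches.is_empty():
            print(title, "->", matches[0]["href"])
        else:
            print(title, "-> no matching link found")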
Related
I want to find downloadable content in a webpage but I don't know what the webpage looks like. Right now, I am looking at all links
links = driver.find_elements(By.XPATH, "//a[@href]")
and buttons
buttons = driver.find_elements(By.TAG_NAME, "button")
For each link (which has an href attribute, as the XPath above requires), I check whether it points to some form of machine-readable file I am looking for (either .csv or .json); if the href does not end with one of these extensions, I assume it does not reference a machine-readable file.
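Concretely, my current filtering looks roughly like this (the variable names are just mine):

from urllib.parse import urlparse
from selenium.webdriver.common.by import By

WANTED_EXTENSIONS = (".csv", ".json")

links = driver.find_elements(By.XPATH, "//a[@href]")
downloadable = []
for link in links:
    href = link.get_attribute("href") or ""
    # ignore any query string or fragment before testing the suffix
    path = urlparse(href).path.lower()
    if path.endswith(WANTED_EXTENSIONS):
        downloadable.append(href)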
As for the buttons, I know of no way to check what they may lead to other than naively clicking on them (button.click()). While this is clearly dangerous, especially because this function will be applied to thousands of websites, I don't know how else to do it.
Is there any other way I could check for downloadable content? Additionally, are there any other page elements I should be looking for, besides links and buttons, and are there any more efficient methods of doing what I want?
Any help is greatly appreciated. Thanks!
So I've been trying to search for specific keywords on a webpage using Selenium in Python, but can't figure out how to search for specific text in a specific area. As shown in the picture, when I search for the word "Sim" in Chrome, several spots are highlighted. The red region is the only place I want to look in. I've been using xpaths to identify the text, as nothing else is available for these elements.
This is the code so far:
else:
    print("Nothing here yet 1")
if driver.find_elements_by_xpath("//*[contains(text(), 'Sim')]"):
    login_to_reply = driver.find_element_by_xpath("//body/div[@id='app']/main/div[@id='content']/div/div/div/div/div/div/article/header[1]")
    login_to_reply.click()
    time.sleep(5)
If anyone could help and let me know what I'm not understanding, I would really appreciate it, thank you.
Based on the additional information in the comments, to search for keywords in the FIRST POST on the page you can use the following xpath:
(//article)[1]//div[@class='Post-body'][contains(normalize-space(),'point')]
The key bit is (//article)[1] - it locks the rest of the identifiers to within the first ([1]) article tag found. The rest just reduces repetition within the DOM and finds your text however the nested tags are structured.
Based on the link provided, that matches the top post only. You can see searching for keyword "point" only gets 1 hit even though it's in multiple posts...
The text 'point' can be swapped out for 'sim' or whatever you want to filter.
Looking at your code, I have some more suggestions:
1/
Your second xpath for "login to reply" is not great. Long copied xpaths like that are typically flaky and troublesome. You can use the same technique as above to click the reply button within that article box:
(//article)[1]//button[span[text()='Reply']]
2/
You also need to be aware that the element lookups won't behave quite as you might expect.
if driver.find_elements_by_xpath("//*[contains(text(), 'Sim')]"):
find_elements (plural) returns an empty list when nothing matches, but find_element (singular, which you use for the reply header) does not return False when the element is missing - it raises a NoSuchElementException and fails (and stops) the script.
You need to wrap that part in a try/except block.
It would need to look like this:
try:
    # find_element (singular) raises NoSuchElementException if the keyword isn't in the first post
    driver.find_element_by_xpath("(//article)[1]//div[@class='Post-body'][contains(normalize-space(),'point')]")
    login_to_reply = driver.find_element_by_xpath("(//article)[1]//button[span[text()='Reply']]")
    login_to_reply.click()
    time.sleep(5)
except:
    print("Text was not found")
I've not run this, but if it doesn't work let me know and I'll look again.
I am trying to write Python code to extract links from a web page. The logic is that I look for the sequence <a href="">. The code extracts the link address from a normal anchor tag like <a href="https://www.google.com">, but I see that there are other ways of specifying hyperlinks, for example anchors whose href is a bare path, such as <a href="/news/">News</a>, and similar links for Documentation, Downloads and Support.
On clicking the '/news/' link, the address it resolves to is "https://www.reviewboard.org/news/". How does this happen, and where is this information stored? '/news/' is useless by itself unless it is converted to the complete string https://www.reviewboard.org/news/.
Thanks
These are relative links: the browser resolves them against the URL of the page on which the link is found. A link that starts with /, like /news/, is resolved against the root of the site, which is why it becomes https://www.reviewboard.org/news/ no matter which page it appears on.
A link without a leading slash is resolved against the current directory. So if I am on www.somewebsite.com/somepage/ and I encounter this link:
<a href="someotherpage">Some other page</a>
it will take me to www.somewebsite.com/somepage/someotherpage.
These work the same way a relative file path works, including the ../ syntax to point back up through the directory structure.
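If you are resolving these links in your own Python code rather than in a browser, urllib.parse.urljoin applies the same rules:

from urllib.parse import urljoin

base = "https://www.reviewboard.org/some/page/"  # example: the page the link was found on
print(urljoin(base, "/news/"))         # https://www.reviewboard.org/news/ (root-relative)
print(urljoin(base, "someotherpage"))  # https://www.reviewboard.org/some/page/someotherpage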
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
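For example, an explicit wait before running the script might look like this (the selector is the same placeholder as above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the dynamically generated div to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")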
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them by navigating to the website, since doing that will engage the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[#title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the website and save that output to a file that you can look through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.
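For instance, to list every link that shares a given title ('Link Title' is the same placeholder as above):

# collect all links sharing the title, then inspect or click the one you want
matches = driver.find_elements_by_xpath("//*[@title='Link Title']")
for match in matches:
    print(match.get_attribute("href"))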
TL;DR Version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lots of links) --> go to the specified links --> go to the links specified there (yes, again) --> go to a certain link --> reach the final page and download its source.
I have googled a bit and came across Scrapy. But I am not sure I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access the URL, extract the text and work with it. Pretty soon I realized that I would basically have to visit all these pages and copy-paste the URLs into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this
page. Then select only a few organisms (I have a list of those). On clicking one of them you can see this page. If you look under the table "MTases active in the genome", there are enzymes which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page which has a small table on the lower right with yellow headers; under it there is an entry DNA (FASTA STYLE). Clicking on view will lead to the page I'm interested in and want to download the page source from.
I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector, which I know can let you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted for following. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether that is based on crawl depth, URL pattern, content pattern, etc.).
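If you would rather stay in Python, Scrapy (which you mentioned) can follow that kind of fixed path as well. Below is a rough, untested sketch; every URL, selector and organism name in it is a placeholder, since I don't have the real page structure in front of me:

import scrapy

class FastaSpider(scrapy.Spider):
    name = "fasta"
    start_urls = ["http://example.org/organism-list"]  # hypothetical start page
    wanted = {"Organism A", "Organism B"}               # the organisms on your list

    def parse(self, response):
        # first page: follow only the organism links you care about
        for link in response.css("a"):
            text = (link.css("::text").get() or "").strip()
            if text in self.wanted:
                yield response.follow(link, callback=self.parse_organism)

    def parse_organism(self, response):
        # organism page: follow each enzyme hyperlink in the MTases table
        for href in response.css("table a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_enzyme)

    def parse_enzyme(self, response):
        # enzyme page: follow the "Sequence Data" link
        href = response.xpath("//a[contains(., 'Sequence Data')]/@href").get()
        if href:
            yield response.follow(href, callback=self.parse_sequence)

    def parse_sequence(self, response):
        # sequence page: follow the "view" link next to the DNA (FASTA) entry
        href = response.xpath("//a[contains(., 'view')]/@href").get()
        if href:
            yield response.follow(href, callback=self.save_page)

    def save_page(self, response):
        # final page: save the raw source
        filename = response.url.split("/")[-1] or "page.html"
        with open(filename, "wb") as f:
            f.write(response.body)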