I am trying to write Python code to extract links from a web page. My logic looks for the
sequence <a href="...">. The code extracts the link address from a normal anchor tag like
<a href="https://www.google.com">, but I see that there are other ways of specifying
hyperlinks, as below:
<a href="/news/">News</a>
<a href="…">Documentation</a>
<a href="…">Downloads</a>
<a href="…">Support</a>
On clicking '/news/' the address that it resolves to is "https://www.reviewboard.org/news/".
How does this happen, and where is this information stored? '/news/' is useless by itself
unless it is converted to the complete URL
https://www.reviewboard.org/news/.
Thanks
These are relative links: each one is resolved relative to the page where the link is found.
So if I am on www.somewebsite.com/somepage/, and I encounter this link:
<a href="someotherpage">Some other page</a>
it will take me to www.somewebsite.com/somepage/someotherpage.
These work the same way a relative path works, including ../ syntax to point back up through the file structure.
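In Python you don't need to do this conversion by hand: urljoin from the standard library resolves a relative link against the URL of the page it was found on. A minimal sketch (the base URL here is a made-up example):

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = 'https://www.reviewboard.org/some/page/'  # URL of the page being scraped
print(urljoin(base, '/news/'))         # https://www.reviewboard.org/news/
print(urljoin(base, 'someotherpage'))  # https://www.reviewboard.org/some/page/someotherpage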
Related
I've been trying to write a simple script to upload 200+ links to a website I'm working on (my Python knowledge is poor and my HTML knowledge even poorer; I'm not a web developer, I just need to upload these links).
Well, the situation I'm in is the following: I am using Splinter (and therefore Python) to navigate the website. Certain section titles on this website will be compared with values I have in a .csv table.
For instance, in this screenshot, I am looking for the link /admin/pages/5, and I would like to compare the link's title (Explorar subpáginas de 'MA111 - Cálculo I') with my .csv table. The problem is that the link's title doesn't appear on the page itself.
To find the link I would guess that I should use find_by_xpath(), but I don't know how to write the expression.
I would appreciate any help! I hope I have made myself clear.
You first need to define how you are detecting that URL: for example, "it is always to the right of a certain button", or "it is in the second row of a table". That way you can build the corresponding XPath (which is a path to follow inside the DOM).
I am not entirely sure, but this could give you the solution
url = browser.find_by_xpath('//td[@class="children"]/a')[0]['href']
If you are finding a tag by the link title, for example, try this:
url = browser.find_by_xpath('//a[contains(@title, "MA111 - Cálculo I")]')[0]['href']
If you look at that XPath, it says: anywhere in the entire DOM (//), find an a tag whose title attribute contains "MA111 - Cálculo I".
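Putting it together with the CSV comparison from the question, a minimal sketch (the file name, URL, and CSV layout are assumptions; I'm assuming one title per row):

import csv
from splinter import Browser

# Hypothetical CSV with one section title per row, e.g. "MA111 - Cálculo I"
with open('titles.csv', newline='') as f:
    titles = [row[0] for row in csv.reader(f)]

browser = Browser('firefox')
browser.visit('http://example.com/admin/pages')  # placeholder URL

for title in titles:
    # find_by_xpath returns a list-like result; check it is non-empty first
    matches = browser.find_by_xpath('//a[contains(@title, "%s")]' % title)
    if matches:
        print(title, '->', matches[0]['href'])

browser.quit()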
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
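An explicit wait tends to be more reliable than a fixed sleep. A minimal sketch (the URL and the element id are placeholders carried over from the snippet above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://example.com')  # placeholder URL

# Wait up to 10 seconds for the dynamically generated div to show up,
# then click it via JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'your-link-to-follow')))
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")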
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them by navigating to the website, since doing that engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the site and save them to a file you can search to identify likely candidates for the links you want. That will show you a lot more (in some respects) than the site's source code alone:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, a situation where multiple links share the same title/text, then you may want to use the find_elements (plural) methods to get lists of everything satisfying your criteria, specify the XPath more explicitly, etc., as in the sketch below.
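For instance (the link text and the href filter here are hypothetical):

# Gather every link with the matching text, then pick one by its href
links = driver.find_elements_by_partial_link_text('Link Text')
for link in links:
    if '/target/' in link.get_attribute('href'):  # hypothetical filter
        link.click()
        break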
I am having some unknown trouble using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for the purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[#id='chapterMenu']/option[1]/text()")
But name contains nothing. I even tried other XPaths, like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
These were also verified using FirePath. I am unable to figure out what the problem could be, and would appreciate some assistance.
It's not that no XPath works, though. One that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the HTML source of that page, and the element with the id chapterMenu is empty.
I think your problem is that it is filled using JavaScript, and JavaScript is not evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you can work around it, though. In the end, the JavaScript also has to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That returns JSON, which can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. As for which, it has to be 0; although I couldn't find it in any source, when it is 0 you will be redirected to the proper URL.
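A minimal sketch of that idea, assuming mangaid really is assigned in an inline script and that which=0 works as described (the regex is a guess at the page's JavaScript):

import json
import re
import requests

page = requests.get('http://www.mangapanda.com/one-piece/1/1').text

# Hypothetical pattern: the page assigns document['mangaid'] = <number> in inline JS
manga_id = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page).group(1)

resp = requests.get('http://www.mangapanda.com/actions/selector/',
                    params={'id': manga_id, 'which': 0})
chapters = json.loads(resp.text)  # the chapter list as Python objects
print(chapters)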
So there you go ;)
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath('//*[@id="chapterMenu"]/xhtml:option[1]/text()',
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
I want to get torrent links from a page. With Chrome's source viewer I see the link is:
href="browse.php?search=Brooklyn+Nine-Nine&page=1"
But when I scrape this link with Scrapy I only get:
href="browse.php?page=1"
this "search=Brooklyn+Nine-Nine&" part is not in the link.
Into page's torrents search form I enter "Brooklyn Nine-Nine", and it will show all search results.
So my question will be is it chromes automatic links formatting feature? and how I could get link with Scrapy as Chromes shows.
I think i could enter missing part by my self. Such like replacing spaces with plus sign in text that is used for search.
Or maybe were there some more elegant solution...
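If you do end up building the query string yourself, the standard library already handles the escaping; a small sketch:

from urllib.parse import urlencode  # on Python 2: from urllib import urlencode

params = {'search': 'Brooklyn Nine-Nine', 'page': 1}
url = 'browse.php?' + urlencode(params)
print(url)  # browse.php?search=Brooklyn+Nine-Nine&page=1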
It's all okay... I made a mistake in my script. My search text was empty, so the links were also missing the additional text.
I'm new to python and trying to figure this out, so sorry if this has been asked. I couldn't find it and don't know what this may be called.
So the short of it. I want to take a link like:
http://www.somedomainhere.com/embed-somekeyhere-650x370.html
and turn it into this:
http://www.somedomainhere.com/somekeyhere
The long of it: I have been working on an addon for XBMC that goes to a website, grabs a URL, then goes to that URL to find another URL. Basically, a URL resolver.
So the program searches the site and comes up with somekeyhere-650x370.html. But that page is generated by JavaScript and is unusable to me; when I go to /somekeyhere instead, the code is usable. So I need to grab the first URL, change it to the usable page, and then scrape that page.
So far the code I have is
if 'somename' in name:
    try:
        n = re.compile('<iframe title="somename" type="text/html" frameborder="0" scrolling="no" width=".+?" height=".+?" src="(.+?)">" frameborder="0"', re.DOTALL).findall(net().http_GET(url).content)[0]
        # CONVERT URL TO .com/somekeyhere SO THE LINE BELOW CAN READ IT
        na = re.compile("'file=(.+?)&.+?'", re.DOTALL).findall(net().http_GET(na).content)[0]
Any suggestions on how I can accomplish converting the url?
I really didn't get the long of your question.
However, answering the short
Assumptions:
somekey is alphanumeric
import re

a = 'http://www.domain.com/embed-somekey-650x370.html'
p = re.match(r'^http://www\.domain\.com/embed-(?P<key>[0-9A-Za-z]+)-650x370\.html$', a)
somekey = p.group('key')
requiredString = "http://www.domain.com/" + somekey  # comment1
I have provided a very specific answer here, for just this domain name. You should modify the regex as required; since the code in your question already uses regex, I assume you can frame one to match your requirements better.
EDIT 1: also see urlparse:
https://docs.python.org/2/library/urlparse.html?highlight=urlparse#module-urlparse
It provides an easy way to parse your URL.
Also, in line with "#comment1", you can save the domain name to a variable and reuse it rather than hard-coding it.
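A sketch of that urlparse approach (the dimension pattern -\d+x\d+ is an assumption about the embed file names):

from urllib.parse import urlsplit  # on Python 2: from urlparse import urlsplit
import re

a = 'http://www.somedomainhere.com/embed-somekeyhere-650x370.html'
parts = urlsplit(a)

# Pull the key out of the path, then rebuild the URL from the parsed
# scheme and domain instead of hard-coding them (see "#comment1")
m = re.match(r'^/embed-(?P<key>[0-9A-Za-z]+)-\d+x\d+\.html$', parts.path)
required = '%s://%s/%s' % (parts.scheme, parts.netloc, m.group('key'))
print(required)  # http://www.somedomainhere.com/somekeyhere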