Get full link from page with Scrapy - python

I want to get torrent links from a page. In Chrome's source viewer I see the link is:
href="browse.php?search=Brooklyn+Nine-Nine&page=1"
But when I scrape this link with Scrapy, I only get:
href="browse.php?page=1"
this "search=Brooklyn+Nine-Nine&" part is not in the link.
Into page's torrents search form I enter "Brooklyn Nine-Nine", and it will show all search results.
So my question will be is it chromes automatic links formatting feature? and how I could get link with Scrapy as Chromes shows.
I think i could enter missing part by my self. Such like replacing spaces with plus sign in text that is used for search.
Or maybe were there some more elegant solution...
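A minimal sketch of that manual fix, assuming the search text is available in the spider (names and values are illustrative):
from urllib.parse import urlencode

search_text = "Brooklyn Nine-Nine"                       # whatever was typed into the search form
query = urlencode({"search": search_text, "page": 1})    # spaces become plus signs
url = "browse.php?" + query                              # browse.php?search=Brooklyn+Nine-Nine&page=1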

It's all okay... I made a mistake in my script. My search text was empty, so the links were also missing the extra query text.

Related

Script cannot fetch data from a web page

I am trying to write a program in Python that takes the name of a stock and its price and prints them. However, when I run it, nothing is printed. It seems the data cannot be fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
Here is the website I am trying to get the data from.
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page you see when you visit the website manually. It seems the website was smart enough to see that your request came from a script and not from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this captcha page. I get the same result when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
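If you want to try the Selenium route, a minimal sketch might look like this (the URL and class name are taken from the question, using the same pre-Selenium-4 method names as the rest of this page; the site may still serve a captcha to an automated browser):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')

# the class name comes from the question's XPath and may be dynamically generated
prices = driver.find_elements_by_xpath('//span[@class="priceText__1853e8a5"]')
print('Prices:', [p.text for p in prices])

driver.quit()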

Getting error when I try to click on the link text using Selenium Python

It is an href. I am trying to click on it using the code below; however, Selenium is not able to find the link text. The page has no frames and it is in the same window. Not sure what is going on.
self.driver.find_element_by_link_text("UNITED WAY OF EASTERN UTAH").click()
This is the screenshot of the element code:
Wild guess here, but has the page fully loaded (including any content created by dynamic code, e.g. JavaScript) before you try to click on the link? If the link is created after you try to find it, then obviously it will be missing. Try putting a time.sleep() call before you try to find it.
I took a closer look at the screenshot, and this should help. You are using find_element_by_link_text, which, if I am not mistaken, looks for a complete match between the provided text and the text in the link. However, the text in your link is not an exact match. You should use find_element_by_partial_link_text instead.
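For example, a sketch combining both suggestions, using an explicit wait instead of a bare time.sleep and a partial match on the text from the question (the timeout is arbitrary):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the link to become clickable, then click it;
# a partial match tolerates extra text around the visible label
link = WebDriverWait(self.driver, 10).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "UNITED WAY OF EASTERN UTAH"))
)
link.click()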

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and clicking the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
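A fuller sketch of that idea, with a short pause so the dynamically generated divs have time to appear (the URL and selector are placeholders):
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/search-results")   # placeholder URL

time.sleep(5)   # crude pause so the page's scripts can render the result divs

# click the link via JavaScript; the selector is illustrative
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")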
I've run into a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them after navigating to the website, since doing that will engage the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the website and save them to a file that you can look through to identify likely candidates for the links you want. That will show you a lot more (in some respects) than the site's source code alone would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.
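For instance, a small sketch of the plural approach (the text fragment is a placeholder):
# gather all candidate links, inspect them, then click the one you want
candidates = driver.find_elements_by_partial_link_text("Report")
for link in candidates:
    print(link.get_attribute("href"), link.get_attribute("title"))
if candidates:
    candidates[0].click()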

Parsing HTML code to find a link

I need to get a link that is buried in the HTML code (it does not show up on the website). I've tried parsing the page with BeautifulSoup, but it only gets the links shown on the webpage. Is there a way to parse the HTML code to find the link?
You can find everything that is in the source code of an HTML page; in your case it probably looks like http://someurl. PHP can do the job for you. If you give me more details about the link, which website you are trying to find it in, and what you want to do with it, I might dig out some of my own code made for extracting URLs (links) from any given website.
I found a nice code example for you (it does more than extract URLs). This will hopefully help you on your way: http://www.web-max.ca/PHP/misc_23.php
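If you would rather stay in Python, a rough sketch of the same idea is to pull URLs out of the raw source rather than only out of the rendered anchors (the URL and regex are illustrative):
import re
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/page").text   # placeholder URL

# links that appear as normal <a> tags
soup = BeautifulSoup(html, "html.parser")
anchor_links = [a["href"] for a in soup.find_all("a", href=True)]

# links buried anywhere else in the source (scripts, comments, attributes)
buried_links = re.findall(r'https?://[^\s"\'<>]+', html)

print(anchor_links)
print(buried_links)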

Ensure a page has downloaded correctly in Python

I am writing a basic screen scraping script using Mechanize and BeautifulSoup (BS) in Python. However, the problem I am running into is that for some reason the requested page does not download correctly every time. I am concluding this because when I search the downloaded page with BS for tags that should be present, I get an error. If I download the page again, it works.
Hence, I would like to write a small function that checks to see if the page has correctly downloaded and re-download if necessary (I could also solve it by figuring out what goes wrong, but that is probably too advanced for me). My question is how would I go about checking to see if the page has been downloaded correctly?
You can just check for a tag you expect to be there, and if it fails, repeat the download.
page = BeautifulSoup(page)
while page.body is None:
    # re-download the page here (e.g. with Mechanize), then re-parse it
    page = BeautifulSoup(page)
# now you can use the data
I think you may simply search for the HTML closing tag; if that tag is present, it is a valid page.
The most generic solution is to check that the </html> closing tag exists. That will allow you to detect truncation of the page.
Anything else, and you will have to describe your failure mode more clearly.
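A minimal sketch of that check, assuming a hypothetical fetch_page() helper that returns the raw HTML as a string:
def download_complete_page(url, max_retries=3):
    for _ in range(max_retries):
        html = fetch_page(url)        # however you fetch the page (Mechanize, urllib, ...)
        if "</html>" in html:
            return html               # closing tag present, page looks complete
    raise RuntimeError("page kept coming back truncated: " + url)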
