How can I find a dynamic url on a webpage? - python

I want to know how to find a dynamic URL on a website. Primarily, I am looking for the search-query URL of a site. For example, how would I find the link https://www.abbeywhisky.com/pages/search-results-page?q= starting from the front page, https://www.abbeywhisky.com?
I am unsure whether this can be done using only the landing page of a site. I tried scraping the front page, but filtering for "?=" does not show any results.
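One possible approach, sketched here under assumptions: search URLs are often defined by a `<form>` on the landing page, where the form's `action` gives the path and the text input's `name` gives the query parameter. The sketch below uses only the standard library. The `sample_html` markup is made up for illustration and is not taken from the real site; a site that builds its search box with JavaScript would not expose the form in the raw HTML at all. The names `FormFinder` and `search_url_templates` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFinder(HTMLParser):
    """Collect each <form>'s action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["inputs"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def search_url_templates(base_url, html):
    """Return candidate 'URL?param=' prefixes for every GET form in html."""
    finder = FormFinder()
    finder.feed(html)
    templates = []
    for form in finder.forms:
        if form["method"] != "get":
            continue  # POST forms do not put the query in the URL
        for name in form["inputs"]:
            templates.append(urljoin(base_url, form["action"]) + "?" + name + "=")
    return templates


# Made-up markup standing in for a downloaded front page (an assumption).
sample_html = """
<form action="/pages/search-results-page" method="get">
  <input type="text" name="q">
</form>
"""
print(search_url_templates("https://www.abbeywhisky.com", sample_html))
```

If this returns nothing for a real page, the search form is probably injected by a script, and the browser's network tab is the more reliable place to look for the request URL.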

Related

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
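As a rough sketch of "selecting the table and its child elements", the code below pulls the cell text out of one table chosen by its `id`. It uses only the standard library so it runs without extra installs; with BeautifulSoup the equivalent starting point would be `soup.find("table", id="contracts")`. The `sample_html` rows are invented placeholder data, not real contract figures, and `TableById` and `table_rows` are hypothetical names.

```python
from html.parser import HTMLParser


class TableById(HTMLParser):
    """Collect the cell text of one <table>, chosen by its id attribute."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []
        self._depth = 0        # greater than 0 while inside the target table
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self._depth or dict(attrs).get("id") == self.table_id:
                self._depth += 1
        elif self._depth and tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif self._depth and tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # tolerate a cell with no <tr>
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._depth and self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_rows(html, table_id):
    """Return the rows of the table with the given id as lists of strings."""
    parser = TableById(table_id)
    parser.feed(html)
    return parser.rows


# Invented placeholder markup; the real page is far larger.
sample_html = """
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="contracts">
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1,000,000</td></tr>
</table>
"""
for row in table_rows(sample_html, "contracts"):
    print(row)
```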
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request on the terminal won't be very helpful, as the HTML page content returned is long - your terminal will truncate the printed response. I'm assuming in your case maybe the website has parts of the homepage reused in other pages as well, so it might get confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
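A minimal sketch of that suggestion, assuming the `requests` library from the question is installed. The helper names `save_html` and `fetch_and_open` and the file name `contracts.html` are made up for illustration.

```python
from pathlib import Path
import webbrowser


def save_html(html, path="page.html"):
    """Write HTML text to disk and return the absolute path of the file."""
    target = Path(path).resolve()
    target.write_text(html, encoding="utf-8")
    return target


def fetch_and_open(url, path="contracts.html"):
    """Download url, save the response body, and open it in the browser."""
    import requests  # third-party, as used in the question

    page = requests.get(url)
    saved = save_html(page.text, path)
    webbrowser.open(saved.as_uri())
    return saved


# Example call (performs a real network request, so it is left commented out):
# fetch_and_open('https://www.basketball-reference.com/contracts/IND.html')
```

Opening the saved file shows the full response rather than whatever the console kept, which makes it easier to confirm which page was actually returned.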

I am trying to scrape this website with Scrapy in Python. I scraped most of the information, but for some reason XPath does not scrape a div

Page I am trying to scrape
This is my code:
Download_links = response.xpath('//div[@class = "download-block"]').extract()
This returns an empty list. Why can't I scrape only this div?
This is the part of the page I am trying to scrape:
photo of the part I am trying to scrape
Please provide some help.
You are getting an empty list because the div is not in the page source. Always check whether the data exists in the page source before writing XPath expressions.
The data may be in some other part of the page; search the page source (Ctrl+U) and write the XPath against what you find there.
On this page, the download links are present in the page source.
See the image of the page source.
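The "check the page source first" step can also be done in code. A minimal sketch: test whether the class name appears in the raw HTML before writing a selector. In Scrapy the raw body is available as `response.text`; the two HTML strings below are made-up stand-ins, and `in_page_source` is a hypothetical helper name.

```python
def in_page_source(html, marker):
    """Return True if marker appears in the raw HTML the server sent."""
    return marker in html


# Made-up responses: one rendered by the server, one filled in later by JavaScript.
static_html = '<div class="download-block"><a href="/file.zip">Download</a></div>'
dynamic_html = '<div id="app"></div><script src="/bundle.js"></script>'

print(in_page_source(static_html, "download-block"))   # True
print(in_page_source(dynamic_html, "download-block"))  # False
```

A `False` here suggests the element is added by a script after the page loads, so no XPath against the raw response will find it.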

getting links from table in web page

I am trying to go to a website, use their search tool to query a database, and grab all of the links from the table of search results displayed below the search tool. The problem is, the source for the website only shows the html for the search tool. Can anyone help me figure out how to get the links from the table? The address of the search tool is:
https://wagyu.digitalbeef.com/
I was hoping to use BeautifulSoup and Python 3.6 on a Windows 10 machine to read the pages associated with those links and grab the names of the cows and their parents to create a more advanced pedigree chart than what is available on the site. Thanks for the help.
Just to clarify, I can manually grab a single link, use bs to grab the html for that page, and pull out the pedigree info. I just don't know how to grab the links from the search results page.

How do I use Scrapy with no specific start URL?

I'm attempting to use Scrapy to collect data from http://www.guidestar.org search results, but the way the website is set up, when I make a specific search for an organization on the website, the URL for the results is just
http://www.guidestar.org/SearchResults.aspx
which I can't plug in as the start URL since it doesn't link to any actual search results. Any ideas on how to get around this?

Is a Web Crawler more suitable?

TL;DR Version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lot of links) -->go to links specified-->go to
links(specified, yes again)-->go to certain link-->reach final page
and download source.
I have googled a bit and came across Scrapy. But I am not sure if I fully understand web crawlers to begin with, or whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These web pages are very simple, with just basic HTML. I used Python and urllib to access the URL, extract the text, and work with it. Pretty soon I realized that I would have to visit all these pages and copy-paste each URL into my program, which is tiresome. I wanted to know if this is more suitable for a web crawler. I want to access this page, then select only a few organisms (I have a list of those). Clicking on one of them shows this page. Under the table "MTases active in the genome" there are Enzymes, which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named Sequence Data. Once clicked, it leads to a page that has a small table on the lower right with yellow headers. Under it, there is an entry DNA (FASTA STYLE). Clicking on view leads to the page I'm interested in and want to download the page source from.
I think you are definitely on the right track in looking at a web crawler to help you do this. You can also look at Norconex HTTP Collector, which I know lets you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted to be followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether based on crawl depth, URL pattern, content pattern, etc.).
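The idea of following only links that match a pattern at each depth can be sketched in plain Python as well. This is not Norconex's or Scrapy's API, just a small standard-library illustration of the technique. The `fake_site` pages and the URL patterns are invented, and `fetch` is any callable that returns the HTML for a URL, so the example runs without network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def follow_path(start_url, patterns, fetch):
    """Follow links level by level; return the URLs reached after the last pattern.

    patterns[i] is a regex that a link must match to be followed at depth i.
    fetch is any callable mapping a URL to that page's HTML text.
    """
    frontier = [start_url]
    for pattern in patterns:
        next_frontier = []
        for url in frontier:
            collector = LinkCollector()
            collector.feed(fetch(url))
            for href in collector.links:
                absolute = urljoin(url, href)
                if re.search(pattern, absolute) and absolute not in next_frontier:
                    next_frontier.append(absolute)
        frontier = next_frontier
    return frontier


# Invented three-level site standing in for the real pages.
fake_site = {
    "http://example.test/": '<a href="/org/1">Org 1</a><a href="/about">About</a>',
    "http://example.test/org/1": '<a href="/enzyme/7">Enzyme</a>',
    "http://example.test/enzyme/7": '<a href="/seq/7.fasta">view</a><a href="/help">Help</a>',
}
final_pages = follow_path(
    "http://example.test/",
    [r"/org/", r"/enzyme/", r"\.fasta$"],
    fake_site.__getitem__,
)
print(final_pages)
```

For real pages, `fetch` could wrap `urllib.request.urlopen`, and only the URLs in the returned list would need to be downloaded and stored.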
