Scraping a paginated website with a fixed URL for every page (Python)

I am not familiar with HTML, but I will do my best to explain what I need. I am scraping a paginated website. At the bottom of each page there is a 'Siguiente' (Spanish for 'next') button that leads to the next page, but when clicked the URL remains unchanged. Is there any way to tell Python to open the next page?
I want to do this:

1. open the website (already done),
2. do something with the info there,
3. go to the next page (this is where I'm stuck, because there is no URL for the next page),
4. repeat...

Thanks for your help.
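One way to handle this is to drive a real browser with Selenium and click the button itself. A minimal sketch, assuming the button can be located by its 'Siguiente' link text (the URL and selector are placeholders, not from the original site):

# Sketch: click through 'Siguiente' pages with Selenium.
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/listado")  # placeholder for the real site

while True:
    # ... do something with the info on the current page ...
    try:
        driver.find_element_by_link_text("Siguiente").click()
        time.sleep(5)  # give the next page time to load
    except NoSuchElementException:
        break  # no 'Siguiente' button left: last page reached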

Related

Why did I get an empty list using Scrapy shell?

The website is "https://www.jbhifi.com.au/collections/laptops". I'm trying to crawl the href for the "next page".
But why does Scrapy shell return an empty list? I'm using the statement:
response.css("li.ais-pagination--item.ais-pagination--item__next a").xpath("@href")
Please show me how to scrape this using Scrapy. I suspect this is because the class starts with "ais" (though I don't know why that would cause a problem). This has happened to me in the past. Any solutions? Cheers!
You need to understand that extracting selectors purely from the browser's inspect-element view does not work that way. You have to check the page source, i.e. what is actually delivered when the page first loads. The inspector shows all the content added by every request the page makes after loading. In your case, the class ais-pagination--item__next does not exist in the page source at all. You have to watch the network tab, check which call is fired when the next-page button is clicked, and work out the logic behind it.
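A quick way to verify this is to fetch the raw HTML with requests and search it for the class the inspector shows; a minimal sketch:

# Sketch: check whether the inspector's class exists in the raw page source.
import requests

html = requests.get("https://www.jbhifi.com.au/collections/laptops").text
# Expect False: the pagination markup is injected by JavaScript after load.
print("ais-pagination--item__next" in html)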

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data and their info, and also the page's advertisements, the contact info, the social media buttons and links, the adblock-detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page; it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in, for example by selecting the table and its child elements. The table's HTML id is contracts; that should be a good place to start.
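A minimal sketch of that selection with BeautifulSoup (the row-parsing details are assumptions, not verified against the live page):

# Sketch: pull only the contracts table out of the full page HTML.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find("table", id="contracts")  # the table's HTML id, as noted above

# Print each row's cell texts as a list.
for row in table.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])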
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the GET request in the terminal won't be very helpful, as the returned HTML content is long and your terminal will truncate it. I'm also assuming the website reuses parts of the homepage on its other pages, so the output can get confusing.
I recommend writing the response to a file and then opening that file in a browser. You will see that your code is pulling the right page.
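For example, reusing the page variable from the question:

# Sketch: dump the response to a file, then open page.html in a browser.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(page.text)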

How to loop over a dynamic href link with Selenium in Python?

I would like to loop over a dynamic href. On each page I download 100 text files, but I have to download 200,000 files in total, so I have to click the next button 2,000 times. To do this, I got the href of the next button, but unfortunately two things change in that link: the page number (1, 2, 3, etc.) and a string of characters. Please see the attached samples of the next-button link as it changes:
https://search.proquest.com/something/E6981FD6D11F45E8PQ/2?accountid=12543#scrollTo
https://search.proquest.com/something/E6981FD6D11F45E8PQ/3?accountid=12543#scrollTo
https://search.proquest.com/something/61C27022597C4092PQ/4?accountid=12543#scrollTo
https://search.proquest.com/something/E431552DC6554BF7PQ/5?accountid=12543#scrollTo
I'm a new Python user and my level is basic.
# Before this, I add the Selenium setup for scraping.
n = 2000
for i in range(1, n):
    href = "https://search.proquest.com/something/715376F5A5AF44BBPQ/" + str(i) + "?accountid=12543#scrollTo"
    driver.get(href)
    # Here, I add the code that downloads the files on each page.
The sample links are unavailable to me (I cannot sign up).
First: what is the "string of characters"? A book number? A category number? If it is just a random string, I think you should find another way.
How about using ActionChains, or driver.execute_script()?
First of all, in my opinion, finding the meaning of that string (from the .js or the .html) is the most important step.
@나민오 I need help identifying the XPath of my next-page button. My goal is to loop through the pages with Python Selenium. Please find below the code for the next-page button, after inspecting the URL page shown in this picture.
[picture: the next-page button element, as shown in the browser inspector]
I tried writing the following code in Python with Selenium to download the files page by page:
import time
from selenium.common.exceptions import NoSuchElementException

while True:
    scraping()  # my function that downloads the files on the current page
    try:
        # Check whether there is a next page to follow
        next_link = driver.find_element_by_xpath("//*[@title='Page suivante']")
        driver.execute_script("arguments[0].scrollIntoView();", next_link)
        next_link.click()
        time.sleep(20)  # wait for the next page to load
    except NoSuchElementException:
        break  # no next-page button left: stop the loop

How to use Python to scrape all the table contents on this website, which is rendered with AJAX?

https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be built with jQuery (AJAX). I would like to scrape the tables on all pages. When I inspect the page tags 1, 2, 3, 4, they have no specific href, and clicking them does not produce a clear pattern of GET requests, so I find it hard to use Python's urllib to send a GET request for each page.
You can use Selenium with Python (http://selenium-python.readthedocs.io/) to navigate through the pages. I would find the Next button, .click() it, then time.sleep(seconds) and scrape the page. I can't navigate to the last page on this site, unfortunately (it seems broken, which you should also be aware of), but I'm assuming the Next button disappears or similar when you reach the last page. If not, you might want to save what you've scraped every time you go to a new page; that way you don't lose your data in the event of an error.
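A minimal sketch of that approach; the 'Next' link text and the table position are assumptions about the page, and each page is saved as it is scraped, as suggested above:

# Sketch: click through the AJAX pages with Selenium, saving each table as we go.
import time
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=")
time.sleep(5)  # let the first table render

page_num = 1
while True:
    # read_html parses every <table> in the rendered HTML; assume the first one is the data.
    pd.read_html(driver.page_source)[0].to_csv(f"page_{page_num}.csv", index=False)
    try:
        driver.find_element_by_link_text("Next").click()  # the link text is an assumption
        time.sleep(3)
        page_num += 1
    except NoSuchElementException:
        break  # assume the Next button disappears on the last page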

Is a Web Crawler more suitable?

TL;DR version:
I have only heard about web crawlers in intellectual conversations I'm not part of. All I want to know is whether they can follow a specific path like:
first page (has lots of links) --> go to the links specified --> go to
links (specified, yes, again) --> go to a certain link --> reach the final page
and download the source.
I have googled a bit and came across Scrapy. But I am not sure I fully understand web crawlers to begin with, and whether Scrapy can help me follow the specific path I want.
Long Version
I wanted to extract some text from a group of static web pages. These pages are very simple, just basic HTML. I used Python and urllib to access each URL, extract the text, and work with it. Pretty soon I realized that I would basically have to visit all of these pages and copy-paste their URLs into my program, which is tiresome. I wanted to know whether this is more suitable for a web crawler. I want to access this page, then select only a few organisms (I have a list of those). Clicking on one of them leads to this page. If you look under the table 'MTases active in the genome' there are enzymes which are hyperlinks. Clicking on those leads to this page. On the right-hand side there is a link named 'Sequence Data'. Once clicked, it leads to a page which has a small table on the lower right with yellow headers; under it there is an entry 'DNA (FASTA STYLE)'. Clicking on 'view' leads to the page I'm interested in and want to download the source of.
I think you are definitely on the right track in looking at a web crawler for this. You could also look at Norconex HTTP Collector, which I know lets you follow links on a page without storing that page if it is just a listing page to you. That crawler lets you filter out pages after their links have been extracted and followed. Ultimately, you can configure the right filters so that only the pages matching the pattern you want get downloaded for you to process (whether based on crawl depth, URL pattern, content pattern, etc.).
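Since you mention Scrapy: a fixed link path like yours maps naturally onto a chain of callbacks. A sketch, in which every URL, selector, and link text is a hypothetical stand-in for the real pages:

# Sketch of a Scrapy spider that follows a fixed chain of links, depth by depth.
# Every URL and selector below is hypothetical; swap in the real ones.
import scrapy

class FastaSpider(scrapy.Spider):
    name = "fasta"
    start_urls = ["https://example.org/organisms"]  # the first page with lots of links

    def parse(self, response):
        # Step 1: follow only the organism links you care about.
        for href in response.css("a.organism::attr(href)").getall():
            yield response.follow(href, callback=self.parse_organism)

    def parse_organism(self, response):
        # Step 2: follow each enzyme hyperlink under the MTases table.
        for href in response.css("table.mtases a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_enzyme)

    def parse_enzyme(self, response):
        # Step 3: follow the 'Sequence Data' link.
        yield response.follow(
            response.xpath("//a[contains(., 'Sequence Data')]/@href").get(),
            callback=self.parse_sequence)

    def parse_sequence(self, response):
        # Step 4: follow the DNA (FASTA) 'view' link and save the final page.
        yield response.follow(
            response.xpath("//a[contains(., 'view')]/@href").get(),
            callback=self.save_page)

    def save_page(self, response):
        filename = response.url.split("/")[-1] or "page"
        with open(filename + ".html", "wb") as f:
            f.write(response.body)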
