Python Web Scraper - Problem With Accessing Web Data [PHP Error]

I am trying to scrape all sites listed on THIS website.
I will use www.site.com instead of the real domain just to simplify my problem.
Basically, there is a list of around 300,000 sites, each page shows 30 results, so there should be around 10,000 pages.
This is an example:
www.site.com/1 -> sites from 1-30
www.site.com/2 -> sites from 31-60
www.site.com/3 -> sites from 61-90
www.site.com/4 -> sites from 91-120
The problem is, when I reach page 167, no more results are shown after that. This way, I can only see the list of the first 5,000 sites.
When I write this:
www.site.com/168
I get this error: PHP Warning – yii\base\ErrorException
Click HERE to see full error.
I was able to create a script in Python that scrapes the first 5,000 sites, but I have no idea how to access the full list.
For example, it is possible to search for certain keywords on that page, but again, if there are more than 5,000 results, only the first 5,000 sites are shown.
Any ideas on how to solve this problem?
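For reference, a minimal sketch (not from the original post) of the pagination loop described above, assuming the list pages live at www.site.com/<page> and that requests past the limit come back with a non-200 status; the CSS selector is a placeholder to adjust to the real markup:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.site.com/{}"   # placeholder domain from the question

sites = []
for page in range(1, 168):        # pages 1-167, roughly the first 5,000 results
    resp = requests.get(BASE.format(page), timeout=10)
    if resp.status_code != 200:   # the Yii error page past the limit
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    # placeholder selector: replace with the real markup of the result list
    sites.extend(a["href"] for a in soup.select("a.site-link"))

print(len(sites), "sites collected")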

Related

In Python Splinter/Selenium, how to load all contents in a lazy-load web page

What I want to do - I want to crawl contents (similar to stock prices of companies) on a website. The value of each element (i.e. stock price) is updated every second. However, the page is lazy-loaded, so only 5 elements are visible at a time, while I need to collect data from all ~200 elements.
What I tried - I use Python Splinter to get the data from the elements' divs; however, only the 5-10 elements surrounding the current view appear in the HTML. If I scroll the browser down, I can get the next elements (stock prices of the next companies), but the information about the prior elements is no longer available. This process (scrolling down and getting new data) is too slow, and by the time I finish collecting all 200 elements, the first element's value has already changed several times.
So, can you suggest some approaches to handle this issue? Is there any way to force the browser to load all the content instead of lazy-loading it?
There is not one single right way; it depends on how the website works in the background. Normally there are two options if it's a lazy-loaded page.
Selenium. It executes all JS scripts and "merges" all requests from the background into a complete page, like a normal web browser.
Access the API. In this case you don't have to care about the UI and dynamically hidden elements. The API gives you access to all the data on the web page, often more than is displayed.
In your case, if there is an update every second, it sounds like a stream connection (maybe a web stream). So try to figure out how the website gets its data and then scrape the API endpoint directly.
What page is it?
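As a rough illustration of the second option above (not from the original answer): once the backend endpoint is found in the Network panel, it can be polled directly with requests. The URL and JSON field names below are placeholders:

import requests

API_URL = "https://example.com/api/quotes"        # placeholder endpoint

resp = requests.get(API_URL, params={"limit": 200}, timeout=10)
resp.raise_for_status()

for item in resp.json():                          # assumed: a list of JSON objects
    print(item.get("symbol"), item.get("price"))  # placeholder field names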

Trying to parse all results from z-lib to build a database of book titles

I am trying to scrape a list of all available books on z-library, but results are only provided through a search term, and I want the titles of all books.
Also, queries only return 10 pages of 50 results per page, 500 in total. Doing an empty search using only a space character returns the 500 most popular books.
I intend to use Selenium and Python, but I can't figure out how to access the entire list of books.
https://book4you.org/
Any ideas?
Thanks
You cannot get all the books' data with Selenium or a web scraper on such a site, since these tools will only present you the results displayed by the GUI for a specific search query.
This might be possible via some API GET request that returns ALL the data from the DB; however, we cannot know whether that is possible until we know all the API requests available for that specific site.

scrape data from URL into pandas

I am trying to scrape data from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is:
https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table of gender, age, and time for the past 5k races (name is not really important). The data is presented on the web page 50 rows at a time across around 25 pages.
It uses various JavaScript frameworks for the UI (Node.js, React). I found this out using the "What Runs" add-on in the Chrome browser.
Here's the real reason I'd like to get this data: I'm a new runner and will be participating in this 5k next weekend, and I would like to explore some of the distribution statistics for past races (it's an annual race, and data goes back to the 1980's).
Thanks in advance!
The data comes from Socket.IO, and there are Python packages for it. How did I find it?
If you open the Network panel in your browser and choose the XHR filter, you'll find something like
https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look at its content; it is what we need.
Luckily, this site has source maps.
Now you can go to More tools -> Search and find this domain.
Then find resultsHubUrl in the settings.
This property is used inside setUpSocket.
And setUpSocket is used inside IndividualResultsStream.js and RaseStreams.js.
Now you can press CMD + P and drill down into these files.
So... I spent around five minutes finding it. You can go ahead! Now you have all the necessary tools. Feel free to use breakpoints and read more about the Chrome developer tools.
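As a rough sketch of this approach (not from the original answer): the endpoint can be consumed with the python-socketio client package. The event name below is hypothetical and must be discovered from the WebSocket frames; also, the EIO=3 in the captured URL suggests the older Engine.IO 3 protocol, so an older (4.x) python-socketio client may be needed:

import socketio

sio = socketio.Client()

@sio.event
def connect():
    print("connected to results hub")

@sio.on("race results")            # hypothetical event name: check the
def on_results(data):              # WebSocket frames for the real one
    print(data)

sio.connect("https://results-hub.athlinks.com", socketio_path="socket.io")
sio.wait()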
You actually need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler ? You can also try to inspect the page (F12 -> Networking) while it is loading the data relevant to you (from a RESTful API, I suppose), and then make the same calls from the command line using curl or the requests Python library.

Scrapy - Build URLs Dynamically Based on HTTP Status Code?

I'm just getting started with Scrapy and I went through the tutorial, but I'm running into an issue: either I can't find the answer in the tutorial and/or docs, or I've read the answer multiple times now and I'm just not understanding it properly...
Scenario:
Let's say I have exactly 1 website that I would like to crawl. Content is rendered dynamically based on query params passed in the URL. I will need to scrape 3 "sets" of data based on the URL param "category".
All the information I need can be grabbed from common base URLs like this:
"http://shop.somesite.com/browse/?product_type=instruments"
And the URLs for each category like so:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=keyboards"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=guitars"
The one caveat here is that the site only loads 30 results per initial request. If the user wants to view more, they have to click the "Load More Results..." button at the bottom. After investigating this a bit: during the initial load of the page, only the request for the top 30 is made (which makes sense); after clicking the "Load More..." button, the URL is updated with "pagex=2" appended and the container refreshes with 30 more results. After this, the button goes away and, as the user continues to scroll down the page, subsequent requests are made to the server for the next 30 results, the "pagex" value is incremented by one, the container is refreshed with the results appended, rinse and repeat.
I'm not exactly sure how to handle pagination on sites, but the simplest solution I came up with is finding out the max "pagex" number for each category and just setting the start URLs to that number for starters.
For example, if you pass this URL in a browser:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex=22"
an HTTP 200 response code is received and all results are rendered to the page. Great! That gives me what I need!
But say next week or so 50 more items are added, so now the max is "...pagex=24"; I wouldn't get all the latest items.
Or if 50 items are removed and the new max is "...pagex=20", I will get a 404 response when requesting "22".
I would like to send a test request with the last known "good" max page number and, based on the HTTP response, decide what the URLs will be.
So, before I start any crawling, I would like to add 1 to "pagex" and check for a 404. If I get a 404, I know I'm still good; if I get a 200, I need to keep adding 1 until I get a 404, so I know where the max is (or decrease it if needed).
I can't seem to figure out whether I can do this using Scrapy, or whether I have to use a different module to run this check first. I tried adding simple checks for testing purposes in the "parse" and "start_requests" methods, with no luck. start_requests doesn't seem to be able to handle responses, and parse can check the response code but will not update the URLs as instructed.
I'm sure it's my poor coding skills (still new to this all), but I can't seem to find a viable solution....
Any thoughts or ideas are very much appreciated!
You can configure in Scrapy which HTTP statuses to handle; that way you can make decisions, for example in the parse method, according to response.status. Check how to handle statuses in the documentation. Example:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]
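Building on that, here is a minimal sketch (the spider name, base URL, and starting page are assumptions for illustration, not the asker's actual spider) of probing increasing "pagex" values and stopping once a 404 comes back:

import scrapy

class MaxPageSpider(scrapy.Spider):
    name = "maxpage"
    handle_httpstatus_list = [404]   # let 404 responses reach parse()
    last_known_max = 22              # hypothetical last known good page

    base = "http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex={}"

    def start_requests(self):
        yield scrapy.Request(self.base.format(self.last_known_max),
                             cb_kwargs={"page": self.last_known_max})

    def parse(self, response, page):
        if response.status == 404:
            # one past the end: the real max is page - 1
            self.logger.info("Max page is %d", page - 1)
            return
        # the page exists: extract items here, then probe the next page
        yield scrapy.Request(self.base.format(page + 1),
                             cb_kwargs={"page": page + 1})

Note that cb_kwargs requires Scrapy 1.7 or newer; on older versions the same value can be passed through request.meta.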

Is there any way to scrape Facebook account links from a list of several hundred URLs

I have a list of over 1500 URLs of news media sites in India. I was interested in conducting some stats as part of my college project.
Long story short, I was interested in knowing which of these websites link to their Facebook accounts on their main web page. Doing this manually would be a tedious task (I have done 25% of them so far), therefore I have been researching possibilities of scraping these websites with a program. I have seen scrapers on ScraperWiki as well as the importXML function, primarily in Google Docs; however, thus far I have not been able to achieve much success with either.
I have tried the following function in Google Docs for a given site:
=ImportXML(A1, "//a[contains(@href, 'www.facebook.com')]")
Overall, I would like to ask whether it's even possible (and how) to scan a given website (or list of websites) just for a specific href link when the structure of each website differs significantly.
Thanks in advance for any help regarding this matter.
Mark
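As a minimal sketch of one way this could be done in Python (not from the original thread; it assumes the URLs live in a sites.txt file, one per line, and that a "Facebook link" means any <a href> containing facebook.com):

import requests
from bs4 import BeautifulSoup

def has_facebook_link(url):
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None                  # unreachable site
    soup = BeautifulSoup(resp.text, "html.parser")
    return any("facebook.com" in (a.get("href") or "")
               for a in soup.find_all("a", href=True))

with open("sites.txt") as f:
    for site in (line.strip() for line in f if line.strip()):
        print(site, has_facebook_link(site))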
