Scrape many pages with different "?id=" parameters without excessive requests? - python

I'm trying to scrape all pages with different IDs from a site whose URLs are formatted url.com/page?id=1, but there are millions of IDs, so even at one request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering whether there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the requests per second to whatever I can get away with.
I am currently using requests and BeautifulSoup in Python to scrape the pages.

The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not the same order as event_list).
import grequests

# Build the (unsent) requests, one per id
event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]

# Send them concurrently, at most 5 at a time, yielding responses as they complete
for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
Note: You are likely to be blocked if you attempt this on most sites. A better approach would be to see if the website has a suitable API you could use to obtain the same information; you can often find one by watching your browser's network tools whilst using the site.
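If you would rather stay with plain requests, a thread pool with a small delay between submissions gives you similar concurrency while keeping the request rate polite. This is only a rough sketch built around the question's url.com/page?id= pattern; the pool size, delay, and id range are arbitrary values you would tune.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(req_id):
    # One GET per id; returns (id, status, first 100 chars) for illustration
    r = requests.get(f"http://url.com/page?id={req_id}", timeout=10)
    return req_id, r.status_code, r.text[:100]

results = []
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = []
    for req_id in range(1, 100):
        futures.append(pool.submit(fetch, req_id))
        time.sleep(0.2)  # crude throttle: roughly 5 submissions per second
    for f in futures:
        results.append(f.result())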

Related

Scrape website with Python with javascript format

I don't have much experience scraping data from websites. I normally use Python "requests" and "BeautifulSoup".
I need to download the table from here https://publons.com/awards/highly-cited/2019/
I do the usual with right click and Inspect, but the format is not the one I'm used to working with. I did a bit of reading and it seems to be JavaScript, where I could potentially extract the data from https://publons.com/static/cache/js/app-59ff4a.js. I read other posts that recommend Selenium and PhantomJS. However, I can't modify the paths as I'm not an admin on this computer (I'm using Windows). Any ideas on how to tackle this? Happy to go with R if Python isn't an option.
Thanks!
If you monitor the web traffic via the browser's dev tools, you will see the API calls the page makes to update content. The info returned is in JSON format.
For example: page 1
import requests
r = requests.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
You can alter the page param in a loop to get all results.
The total number of results is already indicated in the first response via r['count'], so it's easy enough to calculate the number of pages to loop over at 10 results per page. Just be sure to be polite in how you make your requests.
Outline:
import math, requests

with requests.Session() as s:
    r = s.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
    # do something with the JSON: parse items of interest into a list and add to a final list? Convert to a dataframe at the end?
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        # perhaps have a delay after X requests
        r = s.get(f'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10').json()
        # do something with the JSON, as above
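A slightly fleshed-out version of that outline might look like the sketch below. The 'results' key used to pull each page's records is an assumption for illustration; check the actual JSON in your browser's network tab for the real field names.
import math
import time

import requests

base = 'https://publons.com/awards/api/2019/hcr/?page={}&per_page=10'
items = []

with requests.Session() as s:
    r = s.get(base.format(1)).json()
    items.extend(r['results'])            # assumed key holding the page's records
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        time.sleep(1)                     # be polite between requests
        r = s.get(base.format(page)).json()
        items.extend(r['results'])

# items can now be parsed further or loaded into a dataframe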

Python tool to check broken links on a big urls list

I have a search engine in production serving around 700,000 URLs. The crawling is done using Scrapy, and all spiders are scheduled using DeltaFetch in order to pick up new links daily.
The difficulty I'm facing is handling broken links.
I have a hard time finding a good way to periodically scan and remove broken links. I was thinking about a few solutions:
Developing a Python script using requests.get to check every single URL and delete anything that returns a 404 status.
Using a third-party tool like https://github.com/linkchecker/linkchecker, but I'm not sure it's the best solution since I only need to check a list of URLs, not a whole website.
Using a Scrapy spider to scrape this URL list and return any URLs that are erroring out. I'm not really confident in that one, since I know Scrapy tends to time out when scanning a lot of URLs on different domains; this is why I rely so much on DeltaFetch.
Do you have any recommendations / best practice to solve this problem?
Thanks a lot.
Edit: I forgot to add one clarification: I'm looking to "validate" those 700k URLs, not to crawl them. Actually, those 700k URLs are the crawl result of around 2,500k domains.
You could write a small script that just checks the returned HTTP status, like so:
import urllib2

for url in urls:
    try:
        urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        # Do something when the request fails
        print(e.code)
This would be the same as your first point. You could also run this async in order to optimize the time it takes to run through your 700k links.
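A minimal sketch of running those checks concurrently with a thread pool (using requests here instead of urllib2 purely for brevity; the worker count is arbitrary and urls is your list of links):
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def status_of(url):
    try:
        return url, requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return url, None  # connection error, timeout, etc.

broken = []
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(status_of, url) for url in urls]
    for f in as_completed(futures):
        url, status = f.result()
        if status == 404:
            broken.append(url)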
I would suggest using Scrapy, since you're already looking up each URL with this tool and thus know which URLs error out. This means you don't have to check the URLs a second time.
I'd go about it like this:
Save every URL that errors out in a separate list/map with a counter (which is stored between runs).
Every time a URL errors out, increment the counter; if it doesn't, decrement it.
After running the Scrapy script, check this list/map for URLs with a high enough counter, say more than 10 faults, and remove them, or store them in a separate list of links to check at a later time (as a safeguard in case you accidentally removed a working URL because a server was down for too long); a sketch of persisting such a counter follows below.
Since your third bullet is concerned about Scrapy being shaky with URL results, the same could be said for websites in general: if a site errors out on one try, it doesn't necessarily mean the link is broken.
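One simple way to persist that counter between runs is a small JSON file keyed by URL. The file name and the threshold of 10 below are arbitrary choices, and record() would be called from wherever your spider reports success or failure.
import json
import os

COUNTS_FILE = "failure_counts.json"   # arbitrary file name
THRESHOLD = 10                        # faults before a link is considered dead

def load_counts():
    if os.path.exists(COUNTS_FILE):
        with open(COUNTS_FILE) as f:
            return json.load(f)
    return {}

def save_counts(counts):
    with open(COUNTS_FILE, "w") as f:
        json.dump(counts, f)

def record(counts, url, failed):
    # Increment on failure, decrement (not below zero) on success
    if failed:
        counts[url] = counts.get(url, 0) + 1
    else:
        counts[url] = max(0, counts.get(url, 0) - 1)

def dead_links(counts):
    return [url for url, n in counts.items() if n >= THRESHOLD]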
If you go for creating a script of your own, check this solution.
In addition, an optimization I suggest is to build a hierarchy in your URL repository: if you get a 404 from a parent URL, you can avoid checking all of its child URLs.
My first thought is to request the URLs with HEAD instead of any other method.
Spawn multiple spiders at once, assigning them batches like LIMIT 0,10000 and LIMIT 10000,10000.
In your data pipeline, instead of running a MySQL DELETE query each time the scraper finds a 404 status, run a bulk DELETE FROM table WHERE link IN(link1,link2) query.
I am sure you have an INDEX on the link column; if not, add it. A sketch combining the first and third points is below.
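Not an exact pipeline, just a rough sketch of those two ideas together: HEAD each URL, collect the dead ones, then delete them in a single bulk query. The urls list, the links table and link column, and the use of sqlite3 here (MySQL drivers use %s placeholders instead of ?) are all placeholders for illustration.
import sqlite3

import requests

def is_dead(url):
    try:
        # HEAD avoids downloading the body; some servers mishandle HEAD,
        # so you may want to fall back to GET on unexpected status codes
        return requests.head(url, allow_redirects=True, timeout=10).status_code == 404
    except requests.RequestException:
        return False  # don't delete on transient network errors

dead = [url for url in urls if is_dead(url)]  # 'urls' assumed to be your link list

if dead:
    conn = sqlite3.connect("links.db")  # placeholder database
    placeholders = ",".join("?" for _ in dead)
    # One bulk delete instead of one DELETE per broken link
    conn.execute(f"DELETE FROM links WHERE link IN ({placeholders})", dead)
    conn.commit()
    conn.close()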

email scraper using python beautiful soup or html module

Currently, I am trying to gather data from my realtor from the listings she sends me. It always comes through a link from the main site "http://v3.torontomls.net". I think only realtors can go into this site and filter on houses, but when she sends it to me I can see a list of houses.
I am wondering if it is possible to create a Python script that:
1) opens Gmail
2) filters on her emails
3) opens one of her emails
4) clicks on the link
5) scrapes the house data into a CSV format
I am not sure about the feasibility of this; I have never used Python to scrape web pages. I can see step 5 is doable, but how do I go about steps 1 to 4?
Yes, this is possible, but you need to do some requirements gathering beforehand to determine which parts of the process can be eliminated. For instance, if your realtor is sending you the same link each time, you can just target that web address directly. If the link changes but is parameterized by month, for instance, you can just adjust the web address each month when you want to process the results.
To make the requests, I would suggest using the requests package along with bs4 (BeautifulSoup 4) to target elements. For creating CSV files, you may choose to use csv, but there are many alternatives if you require something that's more specific to your use case.
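For step 5, the pattern described above might look roughly like this. The listing URL, CSS selectors, and column names are made up for illustration, since the actual TorontoMLS markup isn't shown here, and the sketch assumes each listing block contains the selected elements.
import csv

import requests
from bs4 import BeautifulSoup

url = "http://v3.torontomls.net/..."          # the link from the email (hypothetical)
soup = BeautifulSoup(requests.get(url).text, "html.parser")

rows = []
for listing in soup.select("div.listing"):    # hypothetical selector
    rows.append({
        "address": listing.select_one(".address").get_text(strip=True),  # hypothetical
        "price": listing.select_one(".price").get_text(strip=True),      # hypothetical
    })

with open("listings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["address", "price"])
    writer.writeheader()
    writer.writerows(rows)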

Speed up the number of page I can scrape via threading

I'm currently using BeautifulSoup to scrape sourceforge.net for various project information. I'm using the solution in this thread. It works well, but I wish to do it faster. Right now I'm creating a list of 15 URLs and feeding them into run_parallel_in_threads. All the URLs are sourceforge.net links. I'm currently getting about 2.5 pages per second, and it seems that increasing or decreasing the number of URLs in my list doesn't have much effect on the speed. Is there any strategy to increase the number of pages I can scrape? Are there any other solutions that are more suitable for this kind of project?
You could have the threads that run in parallel simply retrieve the web content. Once an HTML page is retrieved, pass it into a queue that has multiple workers, each parsing a single HTML page. Now you've essentially pipelined your workflow: instead of having each thread do multiple steps (retrieve page, scrape, store), each of your parallel threads simply retrieves a page and then passes the task into a queue that processes these tasks in a round-robin approach.
Please let me know if you have any questions!
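A minimal sketch of that pipeline using the standard library queue and threading modules; the URLs, worker count, and the parsing step are placeholders.
import queue
import threading

import requests
from bs4 import BeautifulSoup

page_q = queue.Queue()

def fetcher(url):
    # Download only; parsing happens in the worker threads
    page_q.put((url, requests.get(url, timeout=10).text))

def parser():
    while True:
        url, html = page_q.get()
        title = BeautifulSoup(html, "html.parser").title  # placeholder "scrape" step
        print(url, title.string if title else "")
        page_q.task_done()

# A few parser workers consuming from the queue
for _ in range(4):
    threading.Thread(target=parser, daemon=True).start()

# Fetchers run in parallel and only retrieve content
urls = [f"https://sourceforge.net/projects/example-{i}/" for i in range(15)]  # placeholder URLs
fetch_threads = [threading.Thread(target=fetcher, args=(u,)) for u in urls]
for t in fetch_threads:
    t.start()
for t in fetch_threads:
    t.join()

page_q.join()  # wait until every downloaded page has been parsed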

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need a list of all the pages (dynamic or static) within that domain; your Python program cannot know that automatically. You must obtain this knowledge from elsewhere, either by following links or by looking at the website's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to pass a regular expression to the web server. Web servers don't support that kind of pattern, so the request fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
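A bare-bones sketch of such a spider, staying on one domain and keeping a visited set to avoid loops (the start URL is a placeholder, and a real crawl would also want politeness delays and robots.txt handling):
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start = "http://www.xyz.com/"          # placeholder domain
domain = urlparse(start).netloc
to_visit, seen = [start], set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # ... extract whatever data you need from html here ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == domain:   # stay inside the domain
            to_visit.append(link)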
In addition to zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in; there's no need to recursively collect links yourself, as it asynchronously handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html
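For example, a minimal CrawlSpider along those lines (the domain and the parsing callback are placeholders) could look like this, and can be run with scrapy runspider spider.py -o pages.json:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiteSpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.com"]          # placeholder domain
    start_urls = ["http://www.xyz.com/"]

    # Follow every internal link and hand each page to parse_page
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}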
