I'm trying to improve the speed of my web scraper. I have thousands of sites to get info from: I want the ratings and number of ratings that show up in Google search result pages for each site's Facebook and Yelp listings. Normally I would just use an API, but because I have a huge list of sites to search for and time is of the essence, Facebook's small hourly request limits make their Graph API infeasible (I've tried...). My sites are all in Google search pages. Here is what I have so far (I have provided 8 sample sites for reproducibility):
from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup
pools = Pool(8) #My computer has 8 cores
proxies = MY_PROXIES
#How I set up my urls for requests on Google searches.
#Since each item has a "+" in between in a Google search, I have to format
#my urls to copy it.
site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building', 'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery', 'The Bellagio Hotel']
urls = list(map(lambda x: "+".join(x.split(" ")), site_list))
def scrape_google(url_list):
    info = []
    for i in url_list:
        reviews = {'FB Rating': None,
                   'FB Reviews': None,
                   'Yelp Rating': None,
                   'Yelp Reviews': None}
        request = requests.get(i, proxies=proxies, verify=False).text
        search = BeautifulSoup(request, 'lxml')  # parse the page we just fetched
        results = search.find_all('div', {'class': 's'})  # where the ratings roughly are
        for j in results:
            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                # Had to brute-force the ratings out of the text this way
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
        info.append(reviews)
    return info
results = pools.map(scrape_google, urls)
I tried something similar to this, but I think I'm getting way too many duplicated results. Will multithreading make this run more quickly? I profiled my code to see which parts took the most time, and by far the biggest cost was making the requests.
EDIT: I just tried this out, and I get the following error:
Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
I don't understand what the problem is, because if I try my scrape_google function without multithreading, it works just fine (albeit very very slowly), so url validity should not be an issue.
Yes, multithreading will probably make it run more quickly.
As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.
In fact, that's the main example for ThreadPoolExecutor in the docs.
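For example, a rough sketch of that approach (assuming scrape_google has been rewritten to take a single URL, as explained further down, and that urls is your list of full search URLs):

from concurrent.futures import ThreadPoolExecutor

# 8 worker threads; each one pulls the next URL off the queue as soon as
# it finishes the previous request
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(scrape_google, urls))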
(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O bound. If you do 12 requests in parallel, or even 500 of them, at any given moment, almost all of your threads are waiting on a socket.recv or similar call somewhere, blocking until the server responds, so they aren't using your CPU.)
However:
I think I'm getting way too many duplicated results
Fixing this may help far more than threading. Although, of course, you can do both.
I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
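For example, a minimal sketch of that idea, applied to the urls list before any work is submitted:

seen = set()          # every URL we've already queued or fetched
unique_urls = []
for url in urls:
    if url in seen:
        continue      # duplicate: throw it away instead of requesting it again
    seen.add(url)
    unique_urls.append(url)
# hand unique_urls (not urls) to the pool/executor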
Finally:
I would just use an API normally, but because I have a huge list of sites to search for and time is of the essence, Facebook's small request limits per hour make this not feasible
If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.¹
In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine—but you're getting the arguments wrong.
Your function takes a list of urls and loops over them:
def scrape_google(url_list):
# ...
for i in url_list:
But then you call it with a single URL at a time:
results = pools.map(scrape_google, urls)
This is similar to using the builtin map, or a list comprehension:
results = map(scrape_google, urls)
results = [scrape_google(url) for url in urls]
What happens if you get a single URL instead of a list of them, but try to use it as a list? A string is a sequence of its characters, so you loop over the characters of the URL one by one, trying to download each character as if it were a URL.
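You can see this quickly in the interpreter; the very first "URL" the loop sees is just the letter h:

>>> list("http://example.com")[:4]
['h', 't', 't', 'p']

which is exactly where the Invalid URL 'h' error comes from.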
So, you want to change your function, like this:
def scrape_google(url):
reviews = # …
request = requests.get(url, proxies=proxies, verify=False).text
# …
return reviews
Now it takes a single URL, and returns a dict of reviews for that URL. The pools.map will call it with each URL, and give you back an iterable of review dicts, one per URL.
¹ Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seemed to have been specifically crafted to waste exponential CPU time in a typical scraper's regex…
I'm writing a Python script that keeps track of my playtime. Every 30 seconds I 'refresh' the data I get from the Steam Web API. But after a certain number of calls (around 60), the response is totally empty.
I am aware of the maximum of 100,000 API calls per day, but it doesn't seem like I'm being rate-limited, because I also tested refreshing every 60 seconds, and even every 5 minutes, but the response is always empty after around 60 calls.
Here's some of the code:
from steam.webapi import WebAPI
from time import sleep
api = WebAPI(API_KEY, raw=False, format='json', https=True, http_timeout=10)
def game_data(steamids):
data = api.call('IPlayerService.GetOwnedGames', appids_filter=None, include_appinfo=True, include_free_sub=True, include_played_free_games=True, steamid=steamids)
return data['response']['games']
while True:
g_data = game_data(steamids)
playtime = []
for i in g_data:
playtime.append(i['playtime_forever'])
print(playtime)
sleep(30)
Output
{"response":{}}
I'm using the steam library, which works basically the same as requesting the data with the requests library. It seems like the problem is only with the IPlayerService interface.
I counted the number of requests, and the 60th request is the one that failed. It raises a KeyError exception, because the response is empty.
Please let me know if you need any other information, and hopefully someone knows how to fix this.
Thanks in advance!
So I just found a fix: instead of GetOwnedGames, use GetRecentlyPlayedGames.
This one doesn't seem to have the limit of around 60 calls that the other one has. It's basically the exact same response, except it only returns games that have been played in the past 2 weeks, which is totally fine for my use.
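A rough sketch of the swap, reusing the api object and the game_data() pattern from the question (the count parameter is my assumption; per the Steam docs, 0 should mean "return all recently played games"):

def recent_game_data(steamids):
    # Same idea as game_data(), but hitting the endpoint that doesn't
    # appear to run into the ~60-call cap. count=0 is assumed to mean "all".
    data = api.call('IPlayerService.GetRecentlyPlayedGames',
                    steamid=steamids, count=0)
    return data['response']['games']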
I would like to get information about all the reviews from my server. Here is the code I used to try to achieve that:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected far more). What's even stranger, I tried something a bit different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I extract the rest of my reviews? I checked the official documentation but haven't found anything related to my problem.
According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is just a Boolean flag that does the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
But what you could do is provide it with status, so:
count = root.get_review_requests(counts_only=True, status='all')
should return you all the requests.
Keep in mind that I didn't test this part of the code locally. I referred to their repo test example -> https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643 and the documentation link posted above.
You have to use pagination (unfortunately I can't provide exact code without being able to reproduce your setup):
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get 200 results you may set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, HERE is a good example of how to iterate over the results.
Also, I don't recommend getting all 17164 results in one request even if it were possible, because the total response would be huge (say each result is 10 KB; the total would then be more than 171 MB).
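For reference, a rough sketch of paging through everything, assuming the list resource exposes get_next() (which, per the RBTools API docs, raises StopIteration once there is no next link):

all_requests = []
page = root.get_review_requests(max_results=200, status='all')
while True:
    all_requests.extend(page)      # each page is iterable over its review requests
    try:
        page = page.get_next()     # follows the "next" pagination link
    except StopIteration:
        break                      # no more pages
print(len(all_requests))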
This is my first time using a CKAN Data API. I am trying to download public road accident data from a government website, but it only shows the first 100 rows. The CKAN documentation says the default limit on the number of rows it returns is 100. I am pretty sure you can append a CKAN expression to the end of the URL to get the maximum number of rows, but I am not sure how to write it. Please see the Python code below for what I have so far. Is this possible? Thanks.
Is there any way I can write code similar to the pseudo-CKAN request below?
url='https://data.gov.au/data/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb&limit=MAX_ROWS'
CKAN Documentation reference: http://docs.ckan.org/en/latest/maintaining/datastore.html
There are several interesting fields in the documentation for ckanext.datastore.logic.action.datastore_search(), but the ones that pop out are limit and offset.
limit seems to have an absolute maximum of 32000 so depending on the amount of data you might still hit this limit.
offset seems to be the way to go. You keep calling the API with the offset increasing by a set amount until you have all the data. See the code below.
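For completeness, a rough sketch of that offset-based loop (same resource_id as in the question; the next-link approach used in the full code further down turned out to be nicer):

import requests

url = "https://data.gov.au/data/api/3/action/datastore_search"
params = {"resource_id": "d54f7465-74b8-4fff-8653-37e724d0ebbb",
          "limit": 10000, "offset": 0}
records = []
while True:
    batch = requests.get(url, params=params).json()["result"]["records"]
    if not batch:          # empty batch: we've walked past the last record
        break
    records.extend(batch)
    params["offset"] += params["limit"]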
But actually calling the API revealed something interesting: it generates a next URL which you can call to get the next batch of results, and it automagically updates the offset based on the limit used (while maintaining the limit set on the initial call).
Some testing showed that it will keep going past the maximum though, so you need to check whether the number of returned records is lower than the limit you used.
import requests
BASE_URL = "https://data.gov.au/data"
INITIAL_URL = "/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000
def get_all() -> list:
result = []
resp = requests.get(f"{BASE_URL}{INITIAL_URL}&limit={LIMIT}")
js = resp.json()["result"]
result.extend(js["records"])
while "_links" in js and "next" in js["_links"]:
resp = requests.get(BASE_URL + js["_links"]["next"])
js = resp.json()["result"]
result.extend(js["records"])
print(js["_links"]["next"]) # just so you know it's actually doing stuff
if len(js["records"]) < LIMIT:
# if it returned less records than the limit, the end has been reached
break
return result
print(len(get_all()))
Note, when exploring an API, it helps to check what exactly is returned. I used the simple code below to check what was returned, which made exploring the API a lot easier. Also, reading the docs helps, like the one I linked above.
from pprint import pprint
pprint(requests.get(BASE_URL+INITIAL_URL+"&limit=1").json()["result"])
Hello everyone and thanks a lot for your time.
I'm facing a really weird problem. There is an organization that has an API service set up for us to retrieve data from them. It is a simple URL that returns a JSON with exactly 100 records.
So I created a Python script to retrieve this data and store it in our local database. Every time we run the code we get 100 records, until the organization's API runs out and has nothing left to return until the next day. So just to be clear: if the organization wants us to import 360 records, we have to run the API GET call 4 times: 3 times to get batches of 100 records and a fourth time to retrieve the last 60 records. After that, if I run it a 5th time, the response tells me there are no more records for the day.
So my problem starts here: I wanted to run the API GET call inside a while loop to retrieve all the JSONs and store them in a list. But inside the while loop, whenever we run the API GET call again, its response is exactly the same as the previous one. The data doesn't change at all, and the API on the organization's side doesn't send any more batches of records, as if it never received new requests from us. Let me show you how it looks.
import requests
listOfResponses = []
tempResponseList = []
while True:
tempResponseList = requests.get(url = apiURL, headers = headers, params = params).json()
if tempResponseList:
listOfResponses.append(tempResponseList)
tempResponseList = []
else:
print('There are no more records')
break
I have read several articles suggesting that the problem may be the keep-alive behavior of the requests library in Python, but no matter what I try, it won't reset the connection or make the API GET call retrieve new data. I'm stuck having to run the program as many times as needed to retrieve all the data from the API.
I tried adding a { 'Connection': 'close' } header to the request, and it closed the connection, but still no new data.
I tried using requests.Session() and closing the session, but still no solution:
s = requests.Session()
#all the code above executed but instead of requests.get, I used s.get
#and then it was followed by this
s.close()
I also tried a solution posted here on the forum that suggested adding this code after the s.close():
s.mount('http://', requests.adapters.HTTPAdapter())
s.mount('https://', requests.adapters.HTTPAdapter())
I'm a little bit confused with this so any help, observations or suggestions are greatly appreciated.
I am trying to write code that gets the first 1000 URLs of http pages from a Google search for some word. I used this code in Python to get the first 1000 URLs:
import GoogleScraper
import urllib
urls = GoogleScraper.scrape('english teachers', number_pages=2)
for url in urls:
print(urllib.parse.unquote(url.geturl()))
print('[!] Received %d results by asking %d pages with %d results per page' %
      (len(urls), 2, 100))
But this code returns 0 results.
Is there another way to get a lot of URLs from a Google search in a convenient way?
I also tried the xgoogle and pygoogle modules, but they can only handle a small number of page requests.
Google has a Custom Search API which allows you to make 100 queries a day for free. Given that each query returns 10 results per page, you can barely fit 1000 results into a day. xgoogle and pygoogle are just wrappers around this API, so I don't think you'll be able to get more results by using them.
If you do need more, consider creating another Google account with another API key which will effectively double your limit. If you're okay with slightly inferior results, you can try out Bing's Search API (they offer 5000 requests a month).
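If you go the Custom Search route, here is a rough sketch of calling the Custom Search JSON API directly with requests (API_KEY and CSE_ID are placeholders for the key and custom search engine ID you create in the Google console; each call returns at most 10 results, so you page with the start parameter):

import requests

API_KEY = "YOUR_API_KEY"   # placeholder
CSE_ID = "YOUR_CSE_ID"     # placeholder

def search_page(query, start):
    # One page (up to 10 results) of a Custom Search query
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CSE_ID, "q": query, "start": start},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

urls = []
for start in range(1, 100, 10):    # result pages 1..10 of one query
    urls.extend(search_page("english teachers", start))
print(len(urls))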