I'm writing a Python script that keeps track of my playtime. Every 30 seconds I 'refresh' the data I get from the Steam Web API, but after a certain number of calls (around 60), the response is completely empty.
I am aware of the limit of 100,000 API calls per day, but it doesn't seem like I'm getting rate-limited: I also tested refreshing every 60 seconds, and even every 5 minutes, and the response still comes back empty after around 60 calls.
Here's some of the code:
from steam.webapi import WebAPI
from time import sleep
api = WebAPI(API_KEY, raw=False, format='json', https=True, http_timeout=10)
def game_data(steamids):
    data = api.call('IPlayerService.GetOwnedGames', appids_filter=None, include_appinfo=True, include_free_sub=True, include_played_free_games=True, steamid=steamids)
    return data['response']['games']

while True:
    g_data = game_data(steamids)
    playtime = []
    for i in g_data:
        playtime.append(i['playtime_forever'])
    print(playtime)
    sleep(30)
Output
{"response":{}}
I'm using the steam library, which works basically the same as requesting the data directly with the requests library. The problem seems to be limited to the IPlayerService interface.
Counting the requests, it's the 60th one that fails: it raises a KeyError because the response is empty.
Please let me know if you need any other information, and hopefully someone knows how to fix this.
Thanks in advance!
So I just found a fix: instead of GetOwnedGames, use GetRecentlyPlayedGames.
That one doesn't seem to have the ~60-call limit that GetOwnedGames has. The response is basically identical, except it only returns games that have been played in the past 2 weeks, which is totally fine for my use case.
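For reference, here is a minimal sketch of the swap against the same WebAPI object as above (passing count=0 to mean "no cap" is my assumption; check the IPlayerService docs for your use case):
def game_data(steamids):
    # Same idea as before, but via GetRecentlyPlayedGames, which in my testing
    # is not affected by the ~60-call cutoff. It still returns playtime_forever.
    data = api.call('IPlayerService.GetRecentlyPlayedGames',
                    steamid=steamids, count=0)  # count=0: return all recent games (assumption)
    return data['response']['games']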
Related
This is my first time using a CKAN Data API. I am trying to download public road accident data from a government website, but it only returns the first 100 rows. The CKAN documentation says the default row limit per request is 100. I am pretty sure you can append a CKAN expression to the end of the URL to request the maximum number of rows, but I am not sure how to write it. Please see the Python code below for what I have so far. Is it possible? Thanks
Is there any way I can write code similar to the pseudo CKAN request below?
url='https://data.gov.au/data/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb&limit=MAX_ROWS'
CKAN Documentation reference: http://docs.ckan.org/en/latest/maintaining/datastore.html
There are several interesting fields in the documentation for ckanext.datastore.logic.action.datastore_search(), but the ones that pop out are limit and offset.
limit seems to have an absolute maximum of 32000 so depending on the amount of data you might still hit this limit.
offset seems to be the way to go. You keep calling the API with the offset increasing by a set amount until you have all the data. See the code below.
But actually calling the API revealed something interesting: it returns a next URL which you can call, and which automatically updates the offset based on the limit used (while maintaining the limit set on the initial call).
You can call this URL to get the next batch of results.
Some testing showed that it will go past the maximum though, so you need to check whether the number of returned records is lower than the limit you use.
import requests
BASE_URL = "https://data.gov.au/data"
INITIAL_URL = "/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000
def get_all() -> list:
    result = []
    resp = requests.get(f"{BASE_URL}{INITIAL_URL}&limit={LIMIT}")
    js = resp.json()["result"]
    result.extend(js["records"])
    while "_links" in js and "next" in js["_links"]:
        resp = requests.get(BASE_URL + js["_links"]["next"])
        js = resp.json()["result"]
        result.extend(js["records"])
        print(js["_links"]["next"])  # just so you know it's actually doing stuff
        if len(js["records"]) < LIMIT:
            # if it returned less records than the limit, the end has been reached
            break
    return result

print(len(get_all()))
Note: when exploring an API, it helps to check what exactly is returned. I used the simple snippet below to inspect the response, which made exploring the API a lot easier. Reading the docs helps too, like the one linked above.
from pprint import pprint
pprint(requests.get(BASE_URL+INITIAL_URL+"&limit=1").json()["result"])
Hello everyone and thanks a lot for your time.
I'm facing a really weird problem. There is an organization that provides an API service for us to retrieve data from them. It is a simple URL that returns a JSON with exactly 100 records.
So I created a Python script to retrieve this data and store it in our local database. Every time we run it, we get 100 records, until the organization's API is exhausted and there is nothing left to return until the next day. To be clear: if the organization wants us to import 360 records, we have to run the GET call 4 times, 3 times to get batches of 100 records and a fourth time to retrieve the last 60. If I run it a 5th time, the response tells me there are no more records for the day.
My problem starts here. I wanted to run the GET call inside a while loop to retrieve all the JSONs and store them in a list. But inside the while loop, every repeated GET call returns exactly the same response as the previous one. The data doesn't change at all, and the API on the organization's side doesn't send any more batches of records, as if it never received new requests from us. Let me show you how it looks.
import requests

listOfResponses = []
tempResponseList = []

while True:
    tempResponseList = requests.get(url=apiURL, headers=headers, params=params).json()
    if tempResponseList:
        listOfResponses.append(tempResponseList)
        tempResponseList = []
    else:
        print('There are no more records')
        break
I have read several articles suggesting the problem may be the keep-alive behaviour of the requests library in Python, but no matter what I try, I can't reset the connection or get the GET call to retrieve new data. I'm stuck having to run the program as many times as needed to retrieve all the data from the API.
I tried adding the {'Connection': 'close'} entry to the headers of the request, and it closed the connection, but there was still no new data.
I tried using requests.Session() and closing the session, but still no solution:
s = requests.Session()
#all the code above executed but instead of requests.get, I used s.get
#and then it was followed by this
s.close()
I also even tried a solution posted here at the forum that suggested adding this code after the s.close():
s.mount('http://', requests.adapters.HTTPAdapter())
s.mount('https://', requests.adapters.HTTPAdapter())
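Putting those fragments together, the combined attempt looks roughly like this (a sketch; apiURL, headers and params are the same ones from the first snippet, and none of it changed the responses I got back):
import requests

s = requests.Session()
headers['Connection'] = 'close'   # ask the server not to keep the connection alive

listOfResponses = []
while True:
    batch = s.get(url=apiURL, headers=headers, params=params).json()
    if not batch:
        print('There are no more records')
        break
    listOfResponses.append(batch)
    # attempts to force a completely fresh connection for the next call
    s.close()
    s.mount('http://', requests.adapters.HTTPAdapter())
    s.mount('https://', requests.adapters.HTTPAdapter())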
I'm a little bit confused with this so any help, observations or suggestions are greatly appreciated.
I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I only get a few tweets (500-600 or so) instead of all of them... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but still not all of them. Do you know why?
Three things for the first issue you encounter:
first of all, every API has its limits, and one like Twitter's is expected to monitor its use and eventually stop a user from retrieving data if they ask for more than the limits allow. Trying to overcome the limitations of the API might not be the best idea and might result in being banned from the site, among other things (I'm guessing here, as I don't know what Twitter's policy is on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only sent 180 Requests every 15 minutes. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwith and the number of instances of TwitterScraper you are willing to start.
then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing you can try is changing that argument and seeing whether you get what you want or not (see the sketch after this list).
finally, if the above does not work, you could split your time range into two, three or more subsets, collect the data separately and merge it together afterwards (also sketched below).
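A rough sketch of both ideas (the limit argument is the one documented for query_tweets_from_user; the date splitting uses query_tweets with a from: query, and the exact begindate/enddate parameter names are my assumption from the twitterscraper docs):
import datetime as dt
from twitterscraper import query_tweets, query_tweets_from_user as qtfu

# Option 2: raise the limit explicitly instead of relying on the default
tweets = qtfu(user='matteosalvinimi', limit=10000)

# Option 3: collect 2018 in monthly chunks and merge them afterwards
chunks = []
for month in range(1, 13):
    begin = dt.date(2018, month, 1)
    end = dt.date(2018, month + 1, 1) if month < 12 else dt.date(2019, 1, 1)
    chunks.extend(query_tweets('from:matteosalvinimi', begindate=begin, enddate=end))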
The second issue you mention might be due to many different things, so I'll just take a broad guess here. Either pages=100 is too high and for one reason or another the program or the API is unable to retrieve the data, or you're asking for a hundred pages when fewer than a hundred actually exist, which results in the program trying to parse an empty document.
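If it's the latter, one defensive workaround (my own suggestion, not something from the twitter_scraper docs) is to simply stop at the page that comes back empty and keep whatever was collected before it:
from twitter_scraper import get_tweets

tweets = []
try:
    for tweet in get_tweets('matteosalvinimi', pages=100):
        tweets.append(tweet)
except Exception:   # e.g. the ParserError raised when a page comes back empty
    pass            # keep the tweets gathered before the failure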
I'm trying to improve the speed of my web scraper; I have thousands of sites I need to get info from. I'm trying to get the Facebook and Yelp ratings and review counts that show up in Google search pages for each site. I would normally just use an API, but because I have a huge list of sites to search and time is of the essence, Facebook's small hourly request limits make their Graph API infeasible (I've tried...). My sites are all in Google search pages. Here is what I have so far (I have provided 8 sample sites for reproducibility):
from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup

pools = Pool(8)  # My computer has 8 cores
proxies = MY_PROXIES

# How I set up my urls for requests on Google searches.
# Since each item has a "+" in between in a Google search, I have to format
# my urls to copy it.
site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building', 'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery', 'The Bellagio Hotel']
urls = list(map(lambda x: "+".join(x.split(" ")), site_list))

def scrape_google(url_list):
    info = []
    for i in url_list:
        reviews = {'FB Rating': None,
                   'FB Reviews': None,
                   'Yelp Rating': None,
                   'Yelp Reviews': None}
        request = requests.get(i, proxies=proxies, verify=False).text
        search = BeautifulSoup(request, 'lxml')
        results = search.find_all('div', {'class': 's'})  # Where the ratings roughly are
        for j in results:
            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]  # Had to brute-force get the ratings this way.
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
        info.append(reviews)
    return info

results = pools.map(scrape_google, urls)
I tried something similar to this, but I think I'm getting way too many duplicated results. Will multithreading make this run more quickly? I did diagnostics on my code to see which parts took up the most time, and by far getting the requests was the rate-limiting factor.
EDIT: I just tried this out, and I get the following error:
Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
I don't understand what the problem is, because if I try my scrape_google function without multithreading, it works just fine (albeit very very slowly), so url validity should not be an issue.
Yes, multithreading will probably make it run more quickly.
As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.
In fact, that's the main example for ThreadPoolExecutor in the docs.
(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O bound. If you do 12 requests in parallel, or even 500 of them, at any given moment, almost all of your threads are waiting on a socket.recv or similar call somewhere, blocking until the server responds, so they aren't using your CPU.)
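A minimal sketch of that pattern, with fetch_one standing in for whatever per-URL work you actually do (the name and body here are illustrative, not from your code):
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_one(url):
    # stand-in for the real per-URL scraping logic
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=8) as executor:
    # results come back in the same order as the input urls
    pages = list(executor.map(fetch_one, urls))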
However:
I think I'm getting way too many duplicated results
Fixing this may help far more than threading. Although, of course, you can do both.
I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
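A sketch of that dedup, assuming the URLs themselves are what's being duplicated:
seen = set()
unique_urls = []
for url in urls:
    if url in seen:
        continue          # already requested (or queued): throw it away
    seen.add(url)
    unique_urls.append(url)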
Finally:
I would just use an API normally, but because I have a huge list of sites to search for and time is of the essence, Facebook's small request limits per hour make this not feasible
If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.1
In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine—but you're getting the arguments wrong.
Your function takes a list of urls and loops over them:
def scrape_google(url_list):
# ...
for i in url_list:
But then you call it with a single URL at a time:
results = pools.map(scrape_google, urls)
This is similar to using the builtin map, or a list comprehension:
results = map(scrape_google, urls)
results = [scrape_google(url) for url in urls]
What happens if you get a single URL instead of a list of them, but try to use it as a list? A string is a sequence of its characters, so you loop over the characters of the URL one by one, trying to download each character as if it were a URL.
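You can see the effect directly, which is also where the Invalid URL 'h' message comes from:
url = "http://example.com"
for ch in url:
    # requests.get(ch) would then fail with "No schema supplied. Perhaps you meant http://h?"
    print(ch)   # prints 'h', 't', 't', 'p', ... one character per line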
So, you want to change your function, like this:
def scrape_google(url):
    reviews = # …
    request = requests.get(url, proxies=proxies, verify=False).text
    # …
    return reviews
Now it takes a single URL, and returns a set of reviews for that URL. The pools.map will call it with each URL, and give you back an iterable of reviews, one per URL.
1. Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seem to have been specifically crafted to waste exponential CPU for a typical scraper regex…
I am trying to retrieve a user's friend network using the python-twitter API. I am using the GetFriendIDs() method, which retrieves the IDs of all the accounts a particular Twitter user is following. The following is a small snippet of my test code:
for item in IdList:
    aDict[item] = api.GetFriendIDs(user_id=item, count=4999)
    print "sleeping 60"
    time.sleep(66)
    print str(api.MaximumHitFrequency()) + " The maximum hit frequency"
    print api.GetRateLimitStatus()['resources']['friends']['/friends/ids']['remaining']
There are 35 IDs (of Twitter user accounts) in IdList, and for each item I retrieve up to 4999 IDs that the user with ID 'item' is following. I am aware of Twitter's new rate limiting, in which the rate-limit window was changed from 60 minutes to 15 minutes, and of the fact that they advise you not to make more than one request to the server per minute (api.MaximumHitFrequency()), so basically 15 requests in 15 minutes. That is exactly what I'm doing; in fact, I'm making a request to the server every 66 seconds rather than every 60, but I still get a rate-limit error after 6 requests. I can't figure out why this is happening. Please let me know if anyone else has had this problem.
Have a look at https://github.com/bear/python-twitter/wiki/Rate-Limited-API---How-to-deal-with.
Also, it might help to use a newer version of the python-twitter code. The MaximumHitFrequency and GetRateLimitStatus methods have been modified with https://github.com/bear/python-twitter/commit/25cccb81fbeb4c630a0024981bc98f7fb41f3933.
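Also, if you are on a reasonably recent python-twitter, it may be easier to let the library do the waiting for you; newer versions accept a sleep_on_rate_limit flag when constructing the Api (I'm going from memory here, so check your version's docs):
import twitter

# Credentials are placeholders; sleep_on_rate_limit makes calls block
# until the rate-limit window resets instead of raising an error.
api = twitter.Api(consumer_key=CONSUMER_KEY,
                  consumer_secret=CONSUMER_SECRET,
                  access_token_key=ACCESS_TOKEN_KEY,
                  access_token_secret=ACCESS_TOKEN_SECRET,
                  sleep_on_rate_limit=True)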