Let python sleep 60 secs after it has crawled every 20 pages - python

I am trying to collect retweet data from the Chinese microblog Sina Weibo; you can see the code below. However, I keep running into the IP request limit.
To work around this, I set time.sleep() in the code. You can see I added the line 'time.sleep(10) # to suppress the ip request limit', so Python sleeps 10 secs after crawling each page of retweets (one page contains 200 retweets).
However, this is still not enough to deal with the IP limit.
So I would like to do this more systematically and make Python sleep 60 secs after it has crawled every 20 pages. Your ideas will be appreciated.
ids=[3388154704688495, 3388154704688494, 3388154704688492]
addressForSavingData= "C:/Python27/weibo/Weibo_repost/repostOwsSave1.csv"
file = open(addressForSavingData,'wb') # save to csv file
for id in ids:
    if api.rate_limit_status().remaining_hits >= 205:
        for object in api.counts(ids=id):
            repost_count = object.__getattribute__('rt')
            print id, repost_count
            pages = repost_count/200 + 2  # why should it be 2? cuz python starts from 0
            for page in range(1, pages):
                time.sleep(10)  # to suppress the ip request limit
                for object in api.repost_timeline(id=id, count=200, page=page):  # get the repost_timeline of a weibo
                    """1.1 reposts"""
                    mid = object.__getattribute__("id")
                    text = object.__getattribute__("text").encode('gb18030')  # add encode here
                    """1.2 reposts.user"""
                    user = object.__getattribute__("user")  # for object in user
                    user_id = user.id
                    """2.1 retweeted_status"""
                    rts = object.__getattribute__("retweeted_status")
                    rts_mid = rts.id  # the id of weibo
                    """2.2 retweeted_status.user"""
                    rtsuser_id = rts.user[u'id']
                    try:
                        w = csv.writer(file, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                        w.writerow((mid,
                                    user_id, rts_mid,
                                    rtsuser_id, text))  # write it out
                    except:  # Exception of UnicodeEncodeError
                        pass
    elif api.rate_limit_status().remaining_hits < 205:
        sleep_time = api.rate_limit_status().reset_time_in_seconds  # time.time()
        print sleep_time, api.rate_limit_status().reset_time
        time.sleep(sleep_time + 2)
file.close()
pass

Can you not just pace the script instead?
I suggest making your script sleep between each request instead of firing all the requests at the same time, say spread out over a minute. This way you also avoid any flooding bans, and it is considered good behaviour.
Pacing your requests may even let you finish more quickly overall, because the server will not time you out for sending too many requests.
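For example, a minimal sketch of that pacing (fetch_page is a hypothetical stand-in for whatever request you make; 20 requests spread over a minute works out to a 3-second gap):

import time

delay = 60.0 / 20  # spread 20 requests over roughly a minute

for page in range(1, pages):
    fetch_page(page)   # hypothetical stand-in for the real request, e.g. api.repost_timeline()
    time.sleep(delay)  # pause between requests instead of bursting them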
If there is a limit on the IP, sometimes there are no great and easy solutions. For example, if you run Apache, http://opensource.adnovum.ch/mod_qos/ limits bandwidth and connections; specifically it limits:
The maximum number of concurrent requests
Limitation of the bandwidth, such as the maximum allowed number of requests per second to a URL or the maximum/minimum of downloaded kbytes per second
The number of request events per second
Generic request line and header filters to deny unauthorized operations
Request body data limitation and filtering
The maximum number of allowed connections from a single IP source address, or dynamic keep-alive control
You may want to start with these. You could send referrer URLs with your requests and make only single connections, not multiple connections.
You could also refer to this question
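If you go the referrer-header route, a minimal sketch with the requests library (the header value and URL are just placeholders) might be:

import requests

session = requests.Session()  # a session reuses a single connection instead of opening many
session.headers.update({"Referer": "http://example.com/"})  # placeholder referrer value

response = session.get("http://example.com/api", timeout=10)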

I figured out a solution:
First, set an integer, e.g. 0:
i = 0
Second, in the for page loop, add the following code:
for page in range(1, 300):
    i += 1
    if (i % 25 == 0):
        print i, "find i which could be exactly divided by 25"
        time.sleep(60)  # sleep 60 secs once every 25 pages

Related

Simple function to respect Twitter's V2 API rate limits?

Problem:
Often we'd like to pull much more data than Twitter would like us to at one time. In between each query it would be wonderful if there was a simple function to call that checks if you need to wait.
Question:
What is a simple function for respecting Twitter's API limits and ensuring that any long-running-query will complete successfully without harassing Twitter and ensure the querying user does not get banned?
Ideal Answer:
The most ideal answer would be a portable function that should work in all situations. That is, finish (properly) no matter what, and respect Twitter's API rate limit rules.
Caveat
I have submitted a working answer of my own but I am unsure if there is a way to improve it.
I am developing a Python package to utilize Twitter's new V2 API. I want to make sure that I am respecting Twitter's rate limits as best as I possibly can.
Below are the two functions used to wait when needed. They check the API call response headers for remaining queries and then also rely on Twitter's HTTP codes provided here as an ultimate backup. As far as I can tell, these three HTTP codes are the only time-related errors, and the others should raise issues for an API user to inform them of whatever they are doing incorrectly.
from datetime import datetime
from osometweet.utils import pause_until

def manage_rate_limits(response):
    """Manage Twitter V2 Rate Limits

    This method takes in a `requests` response object after querying
    Twitter and uses the headers["x-rate-limit-remaining"] and
    headers["x-rate-limit-reset"] headers objects to manage Twitter's
    most common, time-dependent HTTP errors.
    """
    while True:
        # Get number of requests left with our tokens
        remaining_requests = int(response.headers["x-rate-limit-remaining"])

        # If that number is one, we get the reset-time
        # and wait until then, plus 15 seconds.
        # The regular 429 exception is caught below as well,
        # however, we want to program defensively, where possible.
        if remaining_requests == 1:
            buffer_wait_time = 15
            resume_time = datetime.fromtimestamp(int(response.headers["x-rate-limit-reset"]) + buffer_wait_time)
            print(f"One request from being rate limited. Waiting on Twitter.\n\tResume Time: {resume_time}")
            pause_until(resume_time)

        # Explicitly checking for time dependent errors.
        # Most of these errors can be solved simply by waiting
        # a little while and pinging Twitter again - so that's what we do.
        if response.status_code != 200:

            # Too many requests error
            if response.status_code == 429:
                buffer_wait_time = 15
                resume_time = datetime.fromtimestamp(int(response.headers["x-rate-limit-reset"]) + buffer_wait_time)
                print(f"Too many requests. Waiting on Twitter.\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            # Twitter internal server error
            elif response.status_code == 500:
                # Twitter needs a break, so we wait 30 seconds
                resume_time = datetime.now().timestamp() + 30
                print(f"Internal server error @ Twitter. Giving Twitter a break...\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            # Twitter service unavailable error
            elif response.status_code == 503:
                # Twitter needs a break, so we wait 30 seconds
                resume_time = datetime.now().timestamp() + 30
                print(f"Twitter service unavailable. Giving Twitter a break...\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            # If we get this far, we've done something wrong and should exit
            raise Exception(
                "Request returned an error: {} {}".format(
                    response.status_code, response.text
                )
            )

        # Each time we get a 200 response, exit the function and return the response object
        if response.ok:
            return response
Here is the pause_until function.
import sys
import time as pytime
from time import sleep
from datetime import datetime, timezone

def pause_until(time):
    """Pause your program until a specific end time. 'time' is either
    a valid datetime object or unix timestamp in seconds (i.e. seconds
    since Unix epoch)."""
    end = time

    # Convert datetime to unix timestamp and adjust for locality
    if isinstance(time, datetime):
        # If we're on Python 3 and the user specified a timezone,
        # convert to UTC and get the timestamp.
        if sys.version_info[0] >= 3 and time.tzinfo:
            end = time.astimezone(timezone.utc).timestamp()
        else:
            zoneDiff = pytime.time() - (datetime.now() - datetime(1970, 1, 1)).total_seconds()
            end = (time - datetime(1970, 1, 1)).total_seconds() + zoneDiff

    # Type check
    if not isinstance(end, (int, float)):
        raise Exception('The time parameter is not a number or datetime object')

    # Now we wait
    while True:
        now = pytime.time()
        diff = end - now

        # Time is up!
        if diff <= 0:
            break
        else:
            # 'logarithmic' sleeping to minimize loop iterations
            sleep(diff / 2)
This seems to work quite nicely but I'm not sure if there are edge-cases that will break this or if there is simply a more elegant/simple way to do this.
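For context, a sketch of how these functions might be used per request (the endpoint URL and bearer token below are placeholders, not part of the package): manage_rate_limits returns the response once it is OK, pauses when close to or over the rate limit, and raises on other errors.

import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer <YOUR_BEARER_TOKEN>"}       # placeholder token

def search_recent(query):
    response = requests.get(SEARCH_URL, headers=HEADERS, params={"query": query})
    response = manage_rate_limits(response)  # waits if needed, raises on non-recoverable errors
    return response.json()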

I get "IndexError: list index out of range" error on server, but not on local machine

I made a Python script that crawls the web page at 'http://spys.one/en/socks-proxy-list/' and fetches all the IP addresses there, then checks if they're up and finally returns a list of all live IP addresses. Then there's a second script which connects to Telegram's bot API and uses the first script to show the user a list of recently working SOCKS5 servers.
I'm an amateur programmer and new to the Python programming language. I made these scripts for practice. Feel free to point out my mistakes and show me ways I can improve my code. Thanks in advance!
import requests as req
import re
import socket

def is_open(ip, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((ip, int(port)))
        s.shutdown(2)
        return True
    except:
        return False

# Initial settings:
url = 'http://spys.one/en/socks-proxy-list/'
regex = '\d{1,4}\.\d{1,4}\.\d{1,4}\.\d{1,4}'

# Request URL
response = req.get(url).text

# Extract IP and port from source
p = re.compile(regex)
results = p.findall(response)

# Fetch and check the first 20 IPs
alive = []
for i in range(0, 20):
    if is_open(results[i], '1080'):
        alive.append(results[i])

def gimmeprox():
    links = []
    for x in range(0, len(alive)):
        links.append('https://t.me/proxy?server=' + alive[int(x)] + '&port=1080')
    payload = '\n\n'.join(links)
    return payload
When I run this code and the other (bot) script locally, everything works fine, but as soon as I put it on the web (Heroku, etc.) it crashes on line 30:
line 30, in <module>
if is_open(results[i], '1080'):
with the error "IndexError: list index out of range".
Short answer: "results" does not always have 20 items, so you're basically asking for something that doesn't exist.
You should always check the length before iterating; or, in scenarios like this where you don't need the index, simply iterate over the actual items rather than the indices.
When you run
for i in range(20):
    if is_open(results[i], '1080'):
        alive.append(results[i])
and len(results) is less than 20, you will eventually try to access results[len(results)], which is out of range, resulting in an IndexError. To prevent this, choose the lower of len(results) and 20 as your argument for range, i.e. min(len(results), 20).
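Applied to the loop above, that looks like:

for i in range(min(len(results), 20)):
    if is_open(results[i], '1080'):
        alive.append(results[i])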
An alternative is to loop through all values of results and break when you have 20.
for r in results:
    if is_open(r, '1080'):
        alive.append(r)
    if len(alive) >= 20:  # shouldn't actually get over 20, just a precaution
        break
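A third option, not from either answer above but a common idiom, is to slice the list, which also copes gracefully with fewer than 20 results:

alive = [r for r in results[:20] if is_open(r, '1080')]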

how to send multiple requests and make sure the response comes back within a second in python

I am trying to validate the throttle limit for an endpoint using Python code.
Basically, the throttle limit I have set on the endpoint I am testing is 3 calls/sec. The test makes 4 calls and checks the status codes for at least one 429 response.
The validation sometimes fails because the responses appear to take more than a second to come back. The two approaches I tried are:
Method1:
request = requests.Request(method='GET', url=GLOBALS["url"], params=context.payload, headers=context.headers)
context.upperlimit = int(GLOBALS["ThrottleLimit"]) + 1
reqs = [request for i in range(0, context.upperlimit)]
with BaseThrottler(name='base-throttler', reqs_over_time=(context.upperlimit, 1)) as bt:
    throttled_requests = bt.multi_submit(reqs)
context.responses = [tr.response for tr in throttled_requests]
assert(429 in [i.status_code for i in context.responses])
Method2:
request = requests.get(url=GLOBALS["url"], params=context.payload, headers=context.headers)
url = request.url
urls = set([])
for i in range(0, context.upperlimit):
    urls.add(grequests.get(url))
context.responses = grequests.map(urls)
assert(429 in [i.status_code for i in context.responses])
Is there a way to make sure all the responses came back within the same second, and if not, to try again before failing the test?
I suppose you are using the requests and grequests libraries. You can set a timeout as explained in the docs, and also for grequests.
Plain requests
requests.get(url, timeout=1)
Using grequests
grequests.get(url, timeout=1)
The timeout value is the number of seconds.
Using timeout won't necessarily ensure the condition that you are looking for, which is that all 4 requests were received by the endpoint within one second (not that each individual response was received within one second of sending the request).
One quick and dirty way to solve this is to simply time the execution of the code, and ensure that all responses were received in less than a second (using the timeit module)
import timeit

start_time = timeit.default_timer()
context.responses = grequests.map(urls)
elapsed = timeit.default_timer() - start_time
if elapsed < 1:
    assert(429 in [i.status_code for i in context.responses])
This is crude because it is checking round trip time, but will ensure that all requests were received within a second. If you need more specificity, or find that the condition is not met often enough, you could add a header to the response with the exact time the request was received by the endpoint, and then verify that all requests hit the endpoint within one second of each other.
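For example, assuming the endpoint were modified to echo back a hypothetical X-Received-At header containing the unix time (in seconds) at which it received each request, the stricter check could look like:

received = [float(r.headers["X-Received-At"]) for r in context.responses]  # hypothetical header
if max(received) - min(received) < 1:  # all requests hit the endpoint within one second of each other
    assert(429 in [r.status_code for r in context.responses])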

Correct greenlet termination

I am using gevent to download some html pages.
Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent's Timeout.
timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download site's url one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'
But the problem with it is that I lose all the state when the exception fires.
Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decide to switch the webserver for maintenance. In such a case I will lose the information about the crawled pages when the exception fires.
The question is - how do I save state and process the data even if the Timeout happens?
Why not try something like:
timeout = Timeout(10)

def downloadSite(url):
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]

workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail to a different list, to retry later.
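A minimal sketch of that bookkeeping (a status dict plus a retry list, reusing the same downloadUrl helper) might look like:

status = {}   # url -> True/False
failed = []   # urls to retry later

def downloadSite(url):
    try:
        with Timeout(10):
            downloadUrl(url)
        status[url] = True
    except Timeout:
        status[url] = False
        failed.append(url)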
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout

gevent.monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    # Timeout(2, False) silently interrupts the block on timeout,
    # so data stays None instead of the greenlet dying with an exception
    with Timeout(2, False):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]
It implements one timeout for each request. In this example, contents contains the HTML source of google.com ten times, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5 and wait until they have returned. Then you can possibly wait for a given amount of time and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
whereas N should not be too high.
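Wrapped in a loop, that repeated call might look like the following sketch (all_urls, the batch size, and the pause between batches are illustrative choices, not part of the original answer):

batch_size = 5
all_contents = []
for start in xrange(0, len(all_urls), batch_size):
    batch = all_urls[start:start + batch_size]
    getlets = [gevent.spawn(get_source, url) for url in batch]
    gevent.joinall(getlets)
    all_contents.extend(g.get() for g in getlets)
    gevent.sleep(1)  # optional pause between batches to throttle traffic further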

Python-ldap search: Size Limit Exceeded

I'm using the python-ldap library to connect to our LDAP server and run queries. The issue I'm running into is that despite setting a size limit on the search, I keep getting SIZELIMIT_EXCEEDED errors on any query that would return too many results. I know that the query itself is working because I will get a result if the query returns a small subset of users. Even if I set the size limit to something absurd, like 1, I'll still get a SIZELIMIT_EXCEEDED on those bigger queries. I've pasted a generic version of my query below. Any ideas as to what I'm doing wrong here?
result = self.ldap.search_ext_s(self.base, self.scope, '(personFirstMiddle=<value>*)', sizelimit=5)
When the LDAP client requests a size-limit, that is called a 'client-requested' size limit. A client-requested size limit cannot override the size-limit set by the server. The server may set a size-limit for the server as a whole, for a particular authorization identity, or for other reasons - whichever the case, the client may not override the server size limit. The search request may have to be issued in multiple parts using the simple paged results control or the virtual list view control.
Here's a Python3 implementation that I came up with after heavily editing what I found here and in the official documentation. At the time of writing this it works with the pip3 package python-ldap version 3.2.0.
import ldap
from ldap.controls import SimplePagedResultsControl

def get_list_of_ldap_users():
    hostname = "google.com"
    username = "username_here"
    password = "password_here"
    base = "dc=google,dc=com"

    print(f"Connecting to the LDAP server at '{hostname}'...")
    connect = ldap.initialize(f"ldap://{hostname}")
    connect.set_option(ldap.OPT_REFERRALS, 0)
    connect.simple_bind_s(username, password)

    search_flt = "(personFirstMiddle=<value>*)"  # get all users with a specific middle name
    page_size = 500  # how many users to search for in each page, this depends on the server maximum setting (default is 1000)
    searchreq_attrlist = ["cn", "sn", "name", "userPrincipalName"]  # change these to the attributes you care about

    req_ctrl = SimplePagedResultsControl(criticality=True, size=page_size, cookie='')
    msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])

    total_results = []
    pages = 0
    while True:  # loop over all of the pages using the same cookie, otherwise the search will fail
        pages += 1
        rtype, rdata, rmsgid, serverctrls = connect.result3(msgid)
        for user in rdata:
            total_results.append(user)

        pctrls = [c for c in serverctrls if c.controlType == SimplePagedResultsControl.controlType]
        if pctrls:
            if pctrls[0].cookie:  # Copy cookie from response control to request control
                req_ctrl.cookie = pctrls[0].cookie
                msgid = connect.search_ext(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
            else:
                break
        else:
            break
    return total_results
This will return a list of all matching users, but you can edit it as required to return what you want without hitting the SIZELIMIT_EXCEEDED issue :)
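A small usage sketch (result entries are (dn, attrs) tuples; attribute values come back as lists of bytes, and some servers return referral entries with a None dn, which are skipped here):

users = get_list_of_ldap_users()
print(f"Retrieved {len(users)} entries")
for dn, attrs in users:
    if dn is None:  # skip referral entries
        continue
    print(dn, attrs.get("cn", [b""])[0].decode())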
