I am trying to scrape one site; the idea is simple:
I make a GET request to the user page www.link/username. Depending on whether the response (HTML text) contains a certain element or not, I perform an action.
But I need to check a large number of usernames (~3000) in parallel, and as often as possible.
I have a list of proxies (good proxies, not public ones), and I set the headers with a user agent and referer.
What I do:
Create a thread for each username; 50 usernames share one proxy.
Each thread checks its username once per iteration, sleeping for a randomly chosen interval in between.
At the beginning everything is okay and I get the right responses, but after a few iterations the responses start coming back wrong and my program no longer does what it should.
Can you please help me figure out how to make that many requests at the same time using Python requests?
Some code:
import random
import time

import requests

def check_username(username, proxy=''):
    responses[username] = 'tgme_page_extra'
    try:
        # Get the response for this username through the given proxy
        responses[username] = requests.get(URL + username, headers=HEADERS, proxies=proxy)
    except Exception as e:
        print(e)
        time.sleep(7)
    if "tgme_page_extra" not in responses[username].text:  # If username seems unclaimable
        pass  # action
    else:
        pass  # another action

def username_monit(username, proxy):
    while True:  # Check username and sleep for an interval
        check_username(username, proxy)
        time.sleep(random.choice(config.CHECK_INTERVAL))
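Putting the pieces together, the overall setup looks roughly like this (a simplified sketch; URL, HEADERS and config come from my code above, while the username and proxy lists here are placeholders):

import threading

usernames = [...]   # ~3000 usernames to check (placeholder)
proxy_urls = [...]  # private proxy URLs, e.g. 'http://user:pass@host:port' (placeholder)

threads = []
for i, username in enumerate(usernames):
    # 50 usernames share one proxy
    proxy_url = proxy_urls[(i // 50) % len(proxy_urls)]
    proxy = {'http': proxy_url, 'https': proxy_url}
    t = threading.Thread(target=username_monit, args=(username, proxy), daemon=True)
    t.start()
    threads.append(t)

for t in threads:
    t.join()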
I'm trying to get the response of an API request made with ajax.ajax(). The response is stored into ['apiResponse'] in HTML5 Local Storage, but the rest of the Python function carries on without waiting for it to be put into localStorage.
Because of this I need to wait for it before reading the response, and I thought the code below would make the program wait before proceeding.
Unfortunately, the browser seems to freeze every time I use a while loop...
If someone knows how to stop Brython and the browser from freezing, or another way to do what I want to do...
(It would really help me, as it's the only step left before I succeed in getting the Spotify API response.)
from browser import ajax  # to make requests
from browser.local_storage import storage as localStorage  # to access HTML5 Local Storage
import json  # to convert a JSON string into a Python dict

# Request to the API
def on_complete(req):
    if req.status == 200 or req.status == 0:
        localStorage['apiResponse'] = req.text
    else:
        print("An error occurred while asking Spotify for data")

def apiRequest(requestUrl, requestMethod):
    req = ajax.ajax()
    req.bind('complete', on_complete)
    req.open(requestMethod, requestUrl, True)
    req.set_header('Authorization', localStorage['header'])
    req.send()

def response():
    while localStorage['apiResponse'] == '':
        continue
    print('done')
    return json.loads(localStorage['apiResponse'])
Thanks in advance!
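For what it's worth, one way to avoid the busy-wait entirely is to do the processing inside the completion callback itself instead of polling localStorage; below is a rough sketch along the lines of the code above, where handle_response is a hypothetical stand-in for whatever needs to happen with the data:

from browser import ajax
from browser.local_storage import storage as localStorage
import json

def handle_response(data):
    # Hypothetical placeholder: process the parsed Spotify response here
    print('done', data)

def on_complete(req):
    if req.status == 200 or req.status == 0:
        # Parse and hand off the response right here; polling localStorage in
        # a while loop blocks the browser's single thread and freezes the page
        handle_response(json.loads(req.text))
    else:
        print("An error occurred while asking Spotify for data")

def apiRequest(requestUrl, requestMethod):
    req = ajax.ajax()
    req.bind('complete', on_complete)
    req.open(requestMethod, requestUrl, True)
    req.set_header('Authorization', localStorage['header'])
    req.send()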
I am trying to read Twitter feeds using a URL. Yesterday I was able to pull some 80K tweets with this code, but due to some updates on my machine, my Mac terminal stopped responding before the Python code completed.
Today the same code is not returning any JSON data; it gives me empty results, while if I type the same URL into a browser I get a JSON file full of data.
Here is my code:
Method 1:
try:
    urllib.request.urlcleanup()
    response = urllib.request.urlopen(url)
    print('URL to used: ', url)
    testURL = response.geturl()
    print('URL you used: ', testURL)
    jsonResponse = response.read()
    jsonResponse = urllib.request.urlopen(url).read()
except Exception as e:
    print(e)
This printed:
URL to used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
URL you used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
json: {'items_html': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n', 'focused_refresh_interval': 30000, 'has_more_items': False, 'min_position': 'TWEET--', 'new_latent_count': 0}
Method 2:
try:
    request = urllib.request.Request(url, headers=headers)
except:
    print("Thats the problem here:")

try:
    response = urllib.request.urlopen(request)
except:
    print("Exception while fetching response")

testURL = response.geturl()
print('URL you used: ', testURL)

try:
    jsonResponse = response.read()
except:
    print("Exception while reading response")
Same results in both cases.
Kindly help.
Based on my testing this behavior has nothing to do with urllib. The same thing happens with the requests library for example.
It appears Twitter detects automated scraping via repeated hits against the search URL, based on your IP address and user agent (UA) string. At some point, subsequent hits return empty results. This seems to happen after a day or so, probably as a result of delayed analysis on Twitter's part.
If you change the UA string in the search URL request header, you should once again receive valid results in the response. Twitter will probably block you again after a while, so you'll need to change your UA string frequently.
I assume Twitter expires these blocks after some timeout, but I don't know how long that will take.
By way of reference, the twitter-past-crawler project demonstrates using a semi-random UA string taken from a file containing multiple UA strings.
Also, the Twitter-Search-API-Python project uses a hardcoded UA string, which stopped working a day or so after my first test. Changing the string in the code (adding random characters) resulted in a resumption of prior functionality.
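For illustration, here is a minimal sketch of rotating the UA string with requests; the UA strings below are made-up examples, and in practice you would load a larger list from a file, as twitter-past-crawler does:

import random
import requests

# Small example pool of UA strings; in practice load many from a file
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/603.1.30',
    'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0',
]

def fetch(url):
    # Pick a different UA string for each request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)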
I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and it comes back up at a certain interval. During the time the content is available, everyone tries to make their 'get' requests, so mine just hangs until the load clears, by which time the content has disappeared again. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can then think about repeating the request, for instance:
while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the code above, and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get, for instance:
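A short, self-contained sketch of that variant (same dummy URL as above):

from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
urls = ['http://dummysite.ca'] * 10

# pool.map runs requests.get on each URL concurrently and returns
# the responses in order; a handler could then post-process each one
responses = pool.map(requests.get, urls)
texts = [r.text for r in responses]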
See gevent documentation for more information.
In this situation, concurrency will not help much, since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, retry the request after a few seconds, gradually increasing the time between retries until you get the data you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
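For reference, a tiny sketch of that non-blocking delay in plain Twisted (outside Scrapy; try_again is a hypothetical stand-in for the retry logic):

from twisted.internet import reactor

def try_again():
    # Hypothetical retry function; in Scrapy this would typically be a request callback
    print('retrying...')
    reactor.stop()

# Schedule try_again to run 10 seconds from now without blocking the event loop
reactor.callLater(10, try_again)
reactor.run()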
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests
#Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
#Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. But 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()
I'm using an API, called via HTTP requests, that returns JSON. Calling the API, however, requires a start page and an end page to be indicated, such as this:
import requests

def API_request(URL):
    while True:
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return Data['data']
        except Exception as APIError:
            print(APIError)
            continue

def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return API_request(APILink)
The only way to know that you're no longer on an existing page is when the returned JSON is null, like this.
If I wanted to run multiple build_orglist calls, going over every single page asynchronously until I reach the end (null JSON), how could I do so?
I went with a mix of #LukasGraf's answer, using sessions to unify all of my HTTP connections into a single session, and grequests for making the group of HTTP requests in parallel.
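Roughly, that combination looks like the sketch below; org_url mirrors the API link from build_orglist, the batch size is arbitrary, and it assumes grequests passes the session keyword through to the underlying requests:

import grequests
import requests

session = requests.Session()  # one session so connections are reused

def org_url(page):
    # Same API link as in build_orglist, one page per request
    return ("http://sc-api.com/?api_source=live&system=organizations&action="
            "all_organizations&source=rsi&start_page={0}&end_page={0}&items_"
            "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
            "ormat=json".format(page))

def fetch_pages(start, end):
    # Build one request per page and send them all in parallel,
    # reusing the shared session for every request
    reqs = (grequests.get(org_url(p), session=session) for p in range(start, end + 1))
    return grequests.map(reqs)

# Keep requesting batches of pages until every page in a batch comes back null
page = 1
batch = 10
while True:
    responses = fetch_pages(page, page + batch - 1)
    datas = [r.json().get('data') if r is not None else None for r in responses]
    if all(d is None for d in datas):
        break  # past the last existing page
    page += batch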
In my Python application I have to read many web pages to collect data. To decrease the number of HTTP calls I would like to fetch only pages that have changed. My problem is that my code always tells me that the pages have changed (code 200), but in reality they have not.
This is my code:
from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # these are some of the urls:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...
    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date == None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        request.add_header("If-Modified-Since", url.last_date)
        try:
            response = urllib2.urlopen(request)  # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code
                pass
I do not understand what has gone wrong. Can anyone help me?
Web servers aren't required to respond with a 304 when you send an 'If-Modified-Since' header. They're free to send an HTTP 200 and the entire page again.
Sending an 'If-Modified-Since' or 'If-None-Match' header tells the server that you'd like a cached response if one is available. It's like sending an 'Accept-Encoding: gzip, deflate' header: you're just telling the server what you'll accept, not requiring it.
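For what it's worth, here is a small requests-based sketch of a conditional GET that checks whether the server actually honoured the header (the URL and date below are placeholders):

import requests

url = 'http://www.example.com/'                 # placeholder URL
last_seen = 'Sat, 12 Aug 2017 00:00:00 GMT'     # HTTP-date of your cached copy

resp = requests.get(url, headers={'If-Modified-Since': last_seen})
if resp.status_code == 304:
    print('not modified, reuse the cached copy')
else:
    # The server chose to send the full page again (200), even if it is unchanged
    print('got', resp.status_code, len(resp.content), 'bytes')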
A good way to check whether a site returns 304 is to use Google Chrome's dev tools. For example, open the Network panel on a site such as the BLS website: keep refreshing and you will see that the server keeps returning 304. If you force a refresh with Ctrl+F5 (Windows), you will see that it instead returns status code 200.
You can use this technique on your example to find out whether the server is simply not returning 304, or whether you have incorrectly formatted your request headers somehow. Sometimes a webpage imports a resource that does not respect the If-* headers and so returns 200 whatever you do (if any resource on the page does not return 304, the whole page will return 200), but sometimes you are only interested in a specific part of a website, and you can cheat by loading that resource directly and bypassing the whole document.