pool.map list index out of range python - python

there is about 70% chance shows error:
res=pool.map(feng,urls)
File "c:\Python27\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "c:\Python27\lib\multiprocessing\pool.py", line 567, in get
raise self._value
IndexError: list index out of range
don't know why,if data less then 100,only 5%chance show that message.any one have idea how to improve?
#coding:utf-8
import multiprocessing
import requests
import bs4
import re
import string
root_url = 'http://www.haoshiwen.org'
#index_url = root_url+'/type.php?c=1'
def xianqin_url():
f = 0
h = 0
x = 0
y = 0
b = []
l=[]
for i in range(1,64):#页数
index_url=root_url+'/type.php?c=1'+'&page='+"%s" % i
response = requests.get(index_url)
soup = bs4.BeautifulSoup(response.text,"html.parser")
x = [a.attrs.get('href') for a in soup.select('div.sons a[href^=/]')]#取出每一页的div是sons的链接
c=len(x)#一共c个链接
j=0
for j in range(c):
url = root_url+x[j]
us = str(url)
print "收集到%s" % us
l.append(url) #pool = multiprocessing.Pool(8)
return l
def feng (url) :
response = requests.get(url)
response.encoding='utf-8'
#print response.text
soup = bs4.BeautifulSoup(response.text, "html.parser")
#content = soup.select('div.shileft')
qq=str(soup)
soupout = re.findall(r"原文(.+?)</div>",qq,re.S)#以“原文”开头<div>结尾的字段
#print soupout[1]
content=str(soupout[1])
b="风"
cc=content.count(b,0,len(content))
return cc
def start_process():
print 'Starting',multiprocessing.current_process().name
def feng (url) :
response = requests.get(url)
response.encoding='utf-8'
#print response.text
soup = bs4.BeautifulSoup(response.text, "html.parser")
#content = soup.select('div.shileft')
qq=str(soup)
soupout = re.findall(r"原文(.+?)</div>",qq,re.S)#以“原文”开头<div>结尾的字段
#print soupout[1]
content=str(soupout[1])
b="风"
c="花"
d="雪"
e="月"
f=content.count(b,0,len(content))
h=content.count(c,0,len(content))
x=content.count(d,0,len(content))
y=content.count(e,0,len(content))
return f,h,x,y
def find(urls):
r= [0,0,0,0]
pool=multiprocessing.Pool()
res=pool.map4(feng,urls)
for i in range(len(res)):
r=map(lambda (a,b):a+b, zip(r,res[i]))
return r
if __name__=="__main__":
print "开始收集网址"
qurls=xianqin_url()
print "收集到%s个链接" % len(qurls)
print "开始匹配先秦诗文"
find(qurls)
print '''
%s篇先秦文章中:
---------------------------
风有:%s
花有:%s
雪有:%s
月有:%s
数据来源:%s
''' % (len(qurls),find(qurls)[0],find(qurls)[1],find(qurls)[2],find(qurls)[3],root_url)
stackoverflow :Body cannot contain "`pool ma p".
changed it as res=pool.map4(feng,urls)
i'm trying to get some sub string from this website,with multiprocessing.

Indeed, multiprocessing makes it a bit hard to debug as you don't see where the index out of bound error occurred (the error message makes it appear as if it happened internally in the multiprocessing module).
In some cases this line:
content=str(soupout[1])
raises an index out of bound, because soupout is an empty list. If you change it to
if len(soupout) == 0:
return None
and then remove the None that were returned by changing
res=pool.map(feng,urls)
into
res = pool.map(feng,urls)
res = [r for r in res if r is not None]
then you can avoid the error. That said. You probably want to find out the root cause why re.findall returned an empty list. It is certainly a better idea to select the node with beatifulsoup than with regex, as generally matching with bs4 is more stable, especially if the website slightly changes their markup (e.g. whitespaces, etc.)
Update:
why is soupout is an empty list? When I didn't use pool.map never I have this error message shown
This is probably because you hammer the web server too fast. In a comment you mention that you sometimes get 504 in response.status_code. 504 means Gateway Time-out: The server was acting as a gateway or proxy and did not receive a timely response from the upstream server
This is because haoshiwen.org seems to be powered by kangle which is a reverse proxy. Now the reverse proxy handles back all the requests you send him to the web server behind, and if you now start too many processes at once the poor web server cannot handle the flood. Kangle has a default timeout of 60s so as soon as he doesn't get an answer back from the web server within 60s he shows the error you posted.
How do you fix that?
you could limit the number of processes: pool=multiprocessing.Pool(2), you'd need to play around with a good number of processes
at the top of feng(url) you could add a time.sleep(5) so each process waits 5 seconds between each request. Also here you'd need to play around with the sleep time.

Related

Changing variable for parameters for http json get requests

Pretty new to python so go easy on me :). This code works below but I was wondering if there is a way to change the indcode parameter by doing a loop so I do not have to repeat the requests.get.
paraD = dict()
paraD["area"] = "123"
paraD["periodtype"] = "2"
paraD["indcode"] = "722"
paraD["$limit"]=1000
#Open URL and get data for business indcode 722
document_1 = requests.get(dataURL, params=paraD)
bizdata_1 = document_1.json()
#Open URL and get data for business indcode 445
paraD["indcode"] = "445"
document_2 = requests.get(dataURL, params=paraD)
bizdata_2 = document_2.json()
#Open URL and get data for business indcode 311
paraD["indcode"] = "311"
document_3 = requests.get(dataURL, params=paraD)
bizdata_3 = document_3.json()
#Combine the three lists
output = bizdata_1 + bizdata_2 + bizdata_3
Since indcode is the only parameter that changes for each request, we will put that in a list and make the web requests inside a loop.
data_url = ""
post_params = dict()
post_params["area"] = "123"
post_params["periodtype"] = "2"
post_params["$limit"]=1000
# The list of indcode values
ind_codes = ["722", "445", "311"]
output = []
# Loop on indcode values
for code in ind_codes:
# Change indcode parameter value in the loop
post_params["indcode"] = code
try:
response = requests.get(data_url, params=post_params)
data1 = response.json()
output.append(data1)
except:
print("web request failed")
# More error handling / retry if required
print(output)
Assuming you're using Python 3.9+ you can combine dictionaries using the | operator. However, you need to be sure that you understand exactly what this will do. It's more likely that the code to combine dictionaries will be more complex.
When using the requests module it is very important to check the HTTP status code returned from the function (HTTP verb) you're calling.
Here's an approach to the stated problem that may work (depending on how the dictionary merge is effected).
from urllib.error import HTTPError
from requests import get as GET
from requests.exceptions import Timeout, TooManyRedirects, RequestException
from sys import stderr
# base parameters
params = {'area': '123', 'periodtype': '2', '$limit': 1000}
# the indcodes
indcodes = ('722', '445', '311')
# gets the JSON response (as a Python dictionary)
def getjson(url, params):
try:
(r := GET(url, params, timeout=1.0)).raise_for_status()
return r.json() # all good
# if we get any of these exceptions, report to stderr and return an empty dictionary
except (HTTPError, ConnectionError, Timeout, TooManyRedirects, RequestException) as e:
print(e, file=stderr)
return {}
# any exception here is not associated with requests/urllib. Report and raise
except Exception as f:
print(f, file=stderr)
raise
# an empty dictionary
target = {}
# build the target dictionary
# May not produce desired results depending on how the dictionary merge should be carried out
for indcode in indcodes:
target |= getjson('https://httpbin.org/json', params | {'indcode' : indcode})

Threading using Python limiting the number of threads and passing list of different values as arguments

I am here basically accessing the api call with various values coming from the list list_of_string_ids
I am expecting to create 20 threads, tell them to do something, write the values to DB and then have them all returning zero and going again to take the next data etc.
I have problem getting this to work using threading. Below is a code which is working correctly as expected, however it is taking very long to finish execration (around 45 minutes or more). The website I am getting the data from allows Async I/O using rate of 20 requests.
I assume this can make my code 20x faster but not really sure how to implement it.
import requests
import json
import time
import threading
import queue
headers = {'Content-Type': 'application/json',
'Authorization': 'Bearer TOKEN'}
start = time.perf_counter()
project_id_number = 123
project_id_string = 'pjiji4533'
name = "Assignment"
list_of_string_ids = [132,123,5345,123,213,213,...,n] # Len of list is 20000
def construct_url_threaded(project_id_number, id_string):
url = f"https://api.test.com/{}/{}".format(project_id_number,id_string)
r = requests.get(url , headers=headers) # Max rate allowed is 20 requests at once.
json_text = r.json()
comments = json.dumps(json_text, indent=2)
for item in json_text['data']:
# DO STUFF
for string_id in all_string_ids_list:
construct_url_threaded(project_id_number=project_id_number, id_string=string_id)
My trial is below
def main():
q = queue.Queue()
threads = [threading.Thread(target=create_url_threaded, args=(project_id_number,string_id, q)) for i in range(5) ] #5 is for testing
for th in threads:
th.daemon = True
th.start()
result1 = q.get()
result2 = q.get()

pycurl.multicurl - Cant get my the returned results, only the object memory address

I have been trying to get this example code working https://fragmentsofcode.wordpress.com/2011/01/22/pycurl-curlmulti-example/ for a bit tonight and it has me stumped (not difficult).
m = pycurl.CurlMulti()
for url in urllist:
response = io.StringIO()
handle = pycurl.Curl()
handle.setopt(pycurl.URL, url)
handle.setopt(pycurl.WRITEDATA, response)
req = (url, response)
m.add_handle(handle)
reqs.append(req)
# Perform multi-request.
# This code copied from pycurl docs, modified to explicitly
# set num_handles before the outer while loop.
SELECT_TIMEOUT = 1.0
num_handles = len(reqs)
while num_handles:
ret = m.select(SELECT_TIMEOUT)
if ret == -1:
continue
while 1:
ret, num_handles = m.perform()
if ret != pycurl.E_CALL_MULTI_PERFORM:
break
for req in reqs:
# req[1].getvalue() contains response content
print(req[1].getvalue())
I have tried the following prints
req[1]
req[1].getvalue()
req[1].getvalue().decode('ISO-8859-1')
The closest I have come is printing something like the following for all fetched URL's
_io.StringIO object at 0x7f19f931d9d8
I have also tried using io.BytesIO with the same results.
Comments in the code are from the original author, and I have changed some minor things that I think are version dependent, and removed the comments related.
How can I print the contents of the object at that memory address rather than just the address?
EDIT:
print(req[1].getvalue()) raises the following error
print(req[1].getalue())
AttributeError: '_io.BytesIO' object has no attribute 'getalue
so I squished it into variable like so
temp = req[1].getvalue()
print(temp)
and it just returned a b with single quotes like so:
b''

How to rotate proxies on a Python requests

I'm trying to do some scraping, but I get blocked every 4 requests. I have tried to change proxies but the error is the same. What should I do to change it properly?
Here is some code where I try it. First I get proxies from a free web. Then I go do the request with the new proxy but it doesn't work because I get blocked.
from fake_useragent import UserAgent
import requests
def get_player(id,proxy):
ua=UserAgent()
headers = {'User-Agent':ua.random}
url='https://www.transfermarkt.es/jadon-sancho/profil/spieler/'+str(id)
try:
print(proxy)
r=requests.get(u,headers=headers,proxies=proxy)
execpt:
....
code to manage the data
....
Getting proxies
def get_proxies():
ua=UserAgent()
headers = {'User-Agent':ua.random}
url='https://free-proxy-list.net/'
r=requests.get(url,headers=headers)
page = BeautifulSoup(r.text, 'html.parser')
proxies=[]
for proxy in page.find_all('tr'):
i=ip=port=0
for data in proxy.find_all('td'):
if i==0:
ip=data.get_text()
if i==1:
port=data.get_text()
i+=1
if ip!=0 and port!=0:
proxies+=[{'http':'http://'+ip+':'+port}]
return proxies
Calling functions
proxies=get_proxies()
for i in range(1,100):
player=get_player(i,proxies[i//4])
....
code to manage the data
....
I know that proxies scrape is well because when i print then I see something like:
{'http': 'http://88.12.48.61:42365'}
I would like to don't get blocked.
I recently had this same issue, but using proxy servers online as recommended in other answers is always risky (from privacy standpoint), slow, or unreliable.
Instead, you can use the requests-ip-rotator python library to proxy traffic through AWS API Gateway, which gives you a new IP each time:
pip install requests-ip-rotator
This can be used as follows (for your site specifically):
import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS
gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()
session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)
response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)
# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown()
Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.
The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
import requests
from itertools import cycle
list_proxy = ['socks5://Username:Password#IP1:20000',
'socks5://Username:Password#IP2:20000',
'socks5://Username:Password#IP3:20000',
'socks5://Username:Password#IP4:20000',
]
proxy_cycle = cycle(list_proxy)
# Prime the pump
proxy = next(proxy_cycle)
for i in range(1, 10):
proxy = next(proxy_cycle)
print(proxy)
proxies = {
"http": proxy,
"https":proxy
}
r = requests.get(url='https://ident.me/', proxies=proxies)
print(r.text)
The problem with using free proxies from sites like this is
websites know about these and may block just because you're using one of them
you don't know that other people haven't gotten them blacklisted by doing bad things with them
the site is likely using some form of other identifier to track you across proxies based on other characteristics (device fingerprinting, proxy-piercing, etc)
Unfortunately, there's not a lot you can do other than be more sophisticated (distribute across multiple devices, use VPN/TOR, etc) and risk your IP being blocked for attempting DDOS-like traffic or, preferably, see if the site has an API for access
Presumably you have your own pool of proxies - what is the best way to rotate them?
First, blindly picking random proxy we risk of repeating connection from the same proxy multiple times in a row. To add, most connection pattern based blocking is using proxy subnet (3rd number) rather than host - it's best to prevent repeats at subnet level.
It's also a good idea to track proxy performance as not all proxies are equal - we want to use our better performing proxies more often and let dead proxies cooldown.
All of this can be done with weighted randomization which is implemented by Python's random.choices() function:
import random
from time import time
from typing import List, Literal
class Proxy:
"""container for a proxy"""
def __init__(self, ip, type_="datacenter") -> None:
self.ip: str = ip
self.type: Literal["datacenter", "residential"] = type_
_, _, self.subnet, self.host = ip.split(":")[0].split('.')
self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
self.last_used: int = None
def __repr__(self) -> str:
return self.ip
def __str__(self) -> str:
return self.ip
class Rotator:
"""weighted random proxy rotator"""
def __init__(self, proxies: List[Proxy]):
self.proxies = proxies
self._last_subnet = None
def weigh_proxy(self, proxy: Proxy):
weight = 1_000
if proxy.subnet == self._last_subnet:
weight -= 500
if proxy.status == "dead":
weight -= 500
if proxy.status == "unchecked":
weight += 250
if proxy.type == "residential":
weight += 250
if proxy.last_used:
_seconds_since_last_use = time() - proxy.last_used
weight += _seconds_since_last_use
return weight
def get(self):
proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
proxy = random.choices(
self.proxies,
weights=proxy_weights,
k=1,
)[0]
proxy.last_used = time()
self.last_subnet = proxy.subnet
return proxy
If we mock run this Rotator we can see how weighted randoms distribute our connections:
from collections import Counter
if __name__ == "__main__":
proxies = [
# these will be used more often
Proxy("xx.xx.121.1", "residential"),
Proxy("xx.xx.121.2", "residential"),
Proxy("xx.xx.121.3", "residential"),
# these will be used less often
Proxy("xx.xx.122.1"),
Proxy("xx.xx.122.2"),
Proxy("xx.xx.123.1"),
Proxy("xx.xx.123.2"),
]
rotator = Rotator(proxies)
# let's mock some runs:
_used = Counter()
_failed = Counter()
def mock_scrape():
proxy = rotator.get()
_used[proxy.ip] += 1
if proxy.host == "1": # simulate proxies with .1 being significantly worse
_fail_rate = 60
else:
_fail_rate = 20
if random.randint(0, 100) < _fail_rate: # simulate some failure
_failed[proxy.ip] += 1
proxy.status = "dead"
mock_scrape()
else:
proxy.status = "alive"
return
for i in range(10_000):
mock_scrape()
for proxy, count in _used.most_common():
print(f"{proxy} was used {count:>5} times")
print(f" failed {_failed[proxy]:>5} times")
# will print:
# xx.xx.121.2 was used 2629 times
# failed 522 times
# xx.xx.121.3 was used 2603 times
# failed 508 times
# xx.xx.123.2 was used 2321 times
# failed 471 times
# xx.xx.122.2 was used 2302 times
# failed 433 times
# xx.xx.121.1 was used 1941 times
# failed 1187 times
# xx.xx.122.1 was used 1629 times
# failed 937 times
# xx.xx.123.1 was used 1572 times
# failed 939 times
By using weighted randoms we can create a connection pattern that appears random but smart. We can apply generic patterns like not proxies from the same IP family in a row as well as custom per-target logic like priotizing North American IPs for NA targets etc.
For more on this see my blog How to Rotate Proxies in Web Scraping

Python: wait for requests_futures.sessions to finish before continuing with the code flow

My current code as it stands prints an empty list, how do I wait for all requests and callbacks to finish before continuing with the code flow?
from requests_futures.sessions import FuturesSession
from time import sleep
session = FuturesSession(max_workers=100)
i = 1884001540 - 100
list = []
def testas(session, resp):
print(resp)
resp = resp.json()
print(resp['participants'][0]['stats']['kills'])
list.append(resp['participants'][0]['stats']['kills'])
while i < 1884001540:
url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
temp = session.get(url, background_callback=testas)
i += 1
print(list)
From looking at session.py in requests-futures-0.9.5.tar.gz its necesssary to create a future in order to wait for its result as shown in this code:
from requests_futures import FuturesSession
session = FuturesSession()
# request is run in the background
future = session.get('http://httpbin.org/get')
# ... do other stuff ...
# wait for the request to complete, if it hasn't already
response = future.result()
print('response status: {0}'.format(response.status_code))
print(response.content)
As shown in the README.rst a future can and should be created for every session.get() and waited on to complete.
This might be applied in your code as follows starting just before the while loop:
future = []
while i < 1884001540:
url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
future.append(session.get(url, background_callback=testas)
i += 1
for f in future:
response = f.result()
# the following print statements may be useful for debugging
# print('response status: {0}'.format(response.status_code))
# print(response.content, "\n")
print(list)
I'm not sure how your system will respond to a large number (1884001440) of futures and another way to do it is by processing them in smaller groups say 100 or 1000 at a time. It might be wise to test the script with a relatively small number of them at the beginning to find out how fast they return results.
from here https://pypi.python.org/pypi/requests-futures it says
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# first request is started in background
future_one = session.get('http://httpbin.org/get')
# second requests is started immediately
future_two = session.get('http://httpbin.org/get?foo=bar')
# wait for the first request to complete, if it hasn't already
response_one = future_one.result()
so it seems that .result() is what you are looking for

Categories

Resources