(Sorry for my English.)
I want to write a parser for a website that uses a lot of JavaScript; for this I use selenium + phantomjs + lxml. The parser needs to be fast, at least 1000 links per hour. For that I use multiprocessing (not threading, because of the GIL) together with the concurrent.futures module and its ProcessPoolExecutor.
The problem is this: when I feed it a list of 10 links with 5 workers, some links are lost by the end of the run. It can be 1 link or more (up to 6 so far, but that is rare), which is of course a bad result. There is also a pattern: the more processes, the more links are lost.
First I traced where the program breaks (asserts don't work correctly under multiprocessing). I found that it breaks right after the line "browser.get(l)". Then I added time.sleep(x) to give the page some time to load; that gave no result. Then I tried to study .get() in selenium.webdriver...remote.webdriver.py, but it just delegates to .execute(), and that function takes so many parameters that digging into it is too long and difficult for me. At the same time I ran the program with a single process and still lost 1 link, so I thought the problem might not be in selenium and PhantomJS. I then replaced concurrent.futures.ProcessPoolExecutor with multiprocessing.Pool; the links stopped being lost, and with up to 4 processes it works almost fine, but once the number of processes reaches 4 or more, a new error sometimes appears:
"""
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "interface.py", line 34, in hotline_to_mysql
w = Parse_hotline().browser_manipulation(link)
File "/home/water/work/parsing/class_parser/parsing_classes.py", line 352, in browser_manipulation
browser.get(l)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 247, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 471, in _request
resp = opener.open(request, timeout=self._timeout)
File "/usr/lib/python3.4/urllib/request.py", line 463, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1185, in do_open
r = h.getresponse()
File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse
response.begin()
File "/usr/lib/python3.4/http/client.py", line 351, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "interface.py", line 69, in <module>
main()
File "interface.py", line 63, in main
executor.map(hotline_to_mysql, link_list)
File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
http.client.BadStatusLine: ''
"""
import random
import sys
import time

import lxml.html as lh
from multiprocessing import Pool
from concurrent.futures import Future, ProcessPoolExecutor, ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys

AMOUNT_PROCESS = 5

def parse(h) -> list:
    # h - str, HTML of the page
    lxml_ = lh.document_fromstring(h)
    name = lxml_.xpath('/html/body/div[2]/div[7]/div[6]/ul/li[1]/a/@title')
    prices_ = (price.text_content().strip().replace('\xa0', ' ')
               for price in lxml_.xpath('//*[@id="gotoshop-price"]'))
    markets_ = (market.text_content().strip()
                for market in lxml_.find_class('cell shop-title'))
    wares = [[name[0], market, price]
             for (market, price) in zip(markets_, prices_)]
    return wares

def browser_manipulation(l):
    #options = []
    #options.append('--load-images=false')
    #options.append('--proxy={}:{}'.format(host, port))
    #options.append('--proxy-type=http')
    #options.append('--user-agent={}'.format(user_agent))  # headers are chosen randomly here
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # USER_AGENT is a list of user-agent strings imported from my config.py
    dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #print(browser)
    #print('~~~~~~', l)
    #browser.implicitly_wait(20)
    #browser.set_page_load_timeout(80)
    #time.sleep(2)
    browser.get(l)
    time.sleep(20)
    result = parse(browser.page_source)
    #print('++++++', result[0][0])
    browser.quit()
    return result

def main():
    # open a file with links
    with open(sys.argv[1], 'r') as f:
        link_list = [i.replace('\n', '') for i in f]
    with Pool(AMOUNT_PROCESS) as executor:
        executor.map(browser_manipulation, link_list)

if __name__ == '__main__':
    main()
Where is the problem (selenium + PhantomJS, ProcessPoolExecutor, or my code)? Why are links lost?
How can I increase the parsing speed?
Finally, is there perhaps an alternative way to parse dynamic websites in Python without selenium + phantomjs? Of course, parsing speed matters.
Thanks for your answers.
I tried ThreadPoolExecutor instead of ProcessPoolExecutor, and the link loss stopped. In the threaded case the speed is approximately equal to the process case.
The question is still open; if you have any information about this, please write. Thanks.
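For reference, the thread-based variant is essentially the minimal sketch below. It reuses browser_manipulation and sys from the code above; the worker count is an arbitrary example, not a tuned value.

from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of the thread-based run: browser_manipulation is the same
# function as above, and each call still starts its own PhantomJS process,
# so page loads happen outside the GIL anyway.
AMOUNT_WORKERS = 5  # arbitrary example value

def main_threaded(link_file):
    with open(link_file, 'r') as f:
        link_list = [line.strip() for line in f]
    with ThreadPoolExecutor(max_workers=AMOUNT_WORKERS) as executor:
        return list(executor.map(browser_manipulation, link_list))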
This is my code.
import requests
from bs4 import BeautifulSoup
import time
import datetime
from selenium import webdriver
import io
from pyvirtualdisplay import Display

neighbours = []
link_list = []  # collected hrefs (was missing from the snippet)

with io.open('cntr_london.txt', "r", encoding="utf-8") as f:
    for q in f:
        neighbours.append(q.replace('neighborhoods%5B%5D=', '').replace('\n', ''))

#url = 'https://www.airbnb.com/s/paris/homes?room_types%5B%5D=Entire%20home%2Fapt&room_types%5B%5D=Private%20room&price_max=' + str(price_max) + '&price_min=' + str(price_min)

def scroll_through_bottom():
    s = 0
    while s <= 4000:
        s = s + 200
        browser.execute_script('window.scrollTo(0, ' + str(s) + ');')

def get_links():
    link_data = browser.find_elements_by_class_name('_1szwzht')
    for link in link_data:
        link_tag = link.find_elements_by_tag_name('a')
        for l in link_tag:
            link_list.append(l.get_attribute("href"))
    length = len(link_list)
    print length

with Display():
    browser = webdriver.Firefox()
    try:
        browser.get('http://airbnb.com')
    finally:
        browser.quit()
Every other URL works fine, but when I try to get Airbnb, it gives me this error:
Traceback (most recent call last):
File "airbnb_new.py", line 43, in <module>
browser.get('http://airbnb.com')
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 248, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 433, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 408, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
On the other hand, when I try to run my code in Python 3, it tells me there is no module named pyvirtualdisplay, even though I installed it with pip.
Can someone please help me with this problem? I would highly appreciate it.
Airbnb has identified your scraper and, as a preventive measure, is rejecting requests from your spider. There is not much you can do: you could change your IP and system info (user agent and the like) and check whether that works, or wait a couple of hours and see whether Airbnb's system has released the lock and is accepting requests from your machine again.
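If you want to try the "change system info" route, a minimal sketch is to override the browser's user agent before crawling. The user-agent string below is only an example, and this alone will not help if the block is IP-based.

from selenium import webdriver

# Sketch: make Firefox report a different user agent. The string here is
# only an example; an IP-level block will not be affected by this.
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override",
                       "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) "
                       "Gecko/20100101 Firefox/50.0")

browser = webdriver.Firefox(firefox_profile=profile)
try:
    browser.get('http://airbnb.com')
finally:
    browser.quit()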
I am trying to use urllib.request.urlretrieve along with the multiprocessing module to download some files and do some processing on them. However, each time I try to run my program, it gives me the error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "./thumb.py", line 13, in download_and_convert
filename, headers = urlretrieve(url)
File "/usr/lib/python3.4/urllib/request.py", line 186, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 463, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 486, in _open
'unknown_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1252, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: http>
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./thumb.py", line 27, in <module>
pool.map(download_and_convert, enumerate(csvr))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
urllib.error.URLError: <urlopen error unknown url type: http>
The url that it seems to choke on is http://phytoimages.siu.edu/users/vitt/10_27_06_2/Equisetumarvense.JPG. Here is my code:
#!/usr/bin/env python3
from subprocess import Popen
from sys import argv, stdin
import csv
from multiprocessing import Pool
from urllib.request import urlretrieve

def download_and_convert(args):
    num, url_list = args
    url = url_list[0]
    try:
        filename, headers = urlretrieve(url)
    except:
        print(url)
        raise
    Popen(["convert", filename, "-resize", "250x250",
           str(num) + '.' + url.split('.')[-1]])

if __name__ == "__main__":
    csvr = csv.reader(open(argv[1]))
    if len(argv) > 2:
        nprocs = int(argv[2])  # Pool expects an int, not a string
    else:
        nprocs = None
    pool = Pool(processes=nprocs)
    pool.map(download_and_convert, enumerate(csvr))
I have no idea why this error is occurring. Could it be because I am using multiprocessing? If anyone could help me, it would be much appreciated.
Edit: This is the first URL it tries to process, and the error doesn't change if I change the URL.
Check this code snippet
>>> import urllib.parse
>>> urllib.parse.quote(':')
'%3A'
As you can see, urllib treats the ':' character specially and percent-encodes it, which is coincidentally where your program chokes.
Try urllib.parse.urlencode(); that should put you on the right track.
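A minimal sketch of that suggestion, using quote() with a safe set rather than urlencode() since there is no query string here, so the scheme separator and path slashes stay intact:

from urllib.parse import quote
from urllib.request import urlretrieve

url = 'http://phytoimages.siu.edu/users/vitt/10_27_06_2/Equisetumarvense.JPG'

# Percent-encode anything unusual in the URL while leaving the ':' after the
# scheme and the path slashes alone, then download it as before.
safe_url = quote(url, safe=':/')
filename, headers = urlretrieve(safe_url)
print(filename)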
After some help from the comments I found the solution. It appears the problem was the csv module tripping over the byte order mark (BOM). I was able to fix it by opening the file with encoding='utf-8-sig', as suggested here.
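In the script above that boils down to opening the CSV with the BOM-aware codec before handing it to csv.reader, roughly like this:

import csv
from sys import argv

# 'utf-8-sig' strips a leading byte order mark, so the first URL in the file
# no longer starts with an invisible '\ufeff' character.
with open(argv[1], newline='', encoding='utf-8-sig') as f:
    for num, row in enumerate(csv.reader(f)):
        print(num, row[0])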
I want to click on load-more until it disappears from the page.
I have tried, but my approach only works some of the time and otherwise throws an error; it is not a proper solution.
I may have multiple URLs in a list and want to hit them one by one, clicking load-more on each page until it disappears.
Thanks in advance for helping.
Code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
url = ["https://www.zomato.com/HauzKhasSocial",
       "https://www.zomato.com/ncr/wendys-sector-29-gurgaon",
       "https://www.zomato.com/vaultcafecp"]

for load in url:
    driver.get(load)
    xpath_content = '//div[@class = "load-more"]'
    temp_xpath = "true"
    while temp_xpath:
        try:
            #driver.implicitly.wait(15)
            #WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, xpath_content)))
            WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, xpath_content)))
            #urls = driver.find_element_by_xpath(xpath_content)
            urls = driver.find_element_by_xpath(xpath_content)
            text = urls.text
            if text:
                temp_xpath = text
                print "XPATH=", temp_xpath
            driver.find_element_by_xpath(xpath_content).click()
            #driver.execute_script('$("div.load-more").click();')
        except TimeoutException:
            print driver.title, "no xpath of pagination"
            temp_xpath = ""
            continue
Most of the time I get the following error while running my program:
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 380, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
You probably get the BadStatusLine error because of a bug that has been fixed in the latest versions of the Selenium webdrivers. I ran into a similar situation recently, and here is the discussion thread with the developers that helped me out:
https://bugs.chromium.org/p/chromedriver/issues/detail?id=1548
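If you are unsure which Selenium bindings version you are running before upgrading, you can check it from Python:

import selenium

# Prints the installed version of the Python bindings, e.g. '2.53.6' or '3.x'.
print(selenium.__version__)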
I have a Python script that downloads product feeds from multiple affiliates in different ways. This didn't give me any problems until last Wednesday, when it started throwing all kinds of timeout exceptions from different locations.
Examples: here I connect to an FTP service:
ftp = FTP(host=self.host)
threw:
Exception in thread Thread-7:
Traceback (most recent call last):
File "C:\Python27\Lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\Lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\LDLC.py", line 23, in main
ftp = FTP(host=self.host)
File "C:\Python27\Lib\ftplib.py", line 120, in __init__
self.connect(host)
File "C:\Python27\Lib\ftplib.py", line 138, in connect
self.welcome = self.getresp()
File "C:\Python27\Lib\ftplib.py", line 215, in getresp
resp = self.getmultiline()
File "C:\Python27\Lib\ftplib.py", line 201, in getmultiline
line = self.getline()
File "C:\Python27\Lib\ftplib.py", line 186, in getline
line = self.file.readline(self.maxline + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
Or here I download an XML file:
xmlFile = urllib.URLopener()
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
xmlFile.close()
throws:
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\FeedCrawler.py", line 106, in save
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
File "C:\Python27\Lib\urllib.py", line 240, in retrieve
fp = self.open(url, data)
File "C:\Python27\Lib\urllib.py", line 208, in open
return getattr(self, name)(url)
File "C:\Python27\Lib\urllib.py", line 346, in open_http
errcode, errmsg, headers = h.getreply()
File "C:\Python27\Lib\httplib.py", line 1139, in getreply
response = self._conn.getresponse()
File "C:\Python27\Lib\httplib.py", line 1067, in getresponse
response.begin()
File "C:\Python27\Lib\httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "C:\Python27\Lib\httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
IOError: [Errno socket error] timed out
These are just two examples, but there are other methods, such as authenticate or other API-specific calls, where my script throws these timeout errors. It never showed this behavior until Wednesday. It also starts throwing them at random times: sometimes at the beginning of the crawl, sometimes later on. My script behaves this way on both my server and my local machine. I've been struggling with it for two days now but can't seem to figure it out.
This is what I know might have caused this:
On Wednesday one affiliate script broke down with the following error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>
I didn't change anything in my script, but it suddenly stopped crawling that affiliate and threw that error every time it tried to authenticate. I looked it up and found that it was due to an OpenSSL error (where did that come from?). I fixed it by adding the following before the authenticate method:
import ssl

if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context
Little did I know, this was just the start of my problems... At the same time, I switched from Python 2.7.8 to Python 2.7.9. That seems to be the moment everything broke down and started throwing timeouts.
I tried changing my script in endless ways, but nothing worked, and as I said, it's not just one method that throws the error. I also switched back to Python 2.7.8, but that didn't do the trick either. Basically, everything that makes a request to an external source can throw a timeout error.
Final note: my script is multithreaded. It downloads product feeds from different affiliates at the same time. It used to run 10 threads per affiliate without a problem. I have now tried lowering it to 3 per affiliate, but it still throws these errors. Setting it to 1 is not an option because that would take ages. I don't think the thread count is the problem anyway, because it used to work fine.
What could be wrong?
Base script:
from sanction import Client

# client_id & client_secret are omitted but are valid
client_pin = input('Enter PIN:')

access_token_url = 'https://api.home.nest.com/oauth2/access_token'
c = Client(
    token_endpoint=access_token_url,
    client_id=client_id,
    client_secret=client_secret)
c.request_token(code=client_pin)
Running c.request('/devices') returned:
Traceback (most recent call last):
File "C:\py\nest_testing_sanction.py", line 36, in <module>
c.request("/devices")
File "C:\Python34\lib\site-packages\sanction-0.4.1-py3.4.egg\sanction\__init__.py", line 169, in request
File "C:\Python34\lib\site-packages\sanction-0.4.1-py3.4.egg\sanction\__init__.py", line 211, in transport_query
File "C:\Python34\lib\urllib\request.py", line 258, in __init__
self.full_url = url
File "C:\Python34\lib\urllib\request.py", line 284, in full_url
self._parse()
File "C:\Python34\lib\urllib\request.py", line 313, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'None/devices?access_token=c.[some long session token]'
Given the output, it seems I need to pass a full URL, so I tried c.request('wss://developer-api.nest.com'):
Traceback (most recent call last):
File "C:\py\nest_testing_sanction.py", line 36, in <module>
data = c.request(query_url)
File "C:\Python34\lib\site-packages\sanction-0.4.1-py3.4.egg\sanction\__init__.py", line 171, in request
File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 455, in open
response = self._open(req, data)
File "C:\Python34\lib\urllib\request.py", line 478, in _open
'unknown_open', req)
File "C:\Python34\lib\urllib\request.py", line 433, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 1257, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: nonewss>
I also tried an https URL, with the same result.
By contrast, this works (for a firebase.io virtual device):
firebase = firebase.FirebaseApplication('https://nesttest.firebaseio.com', None)
thermostat_result = firebase.get('/devices', 'thermostats')
In Python I would use something like sanction to keep things simple. You should be able to get it working with the Nest API using code like the following (untested, using the token flow rather than the PIN flow):
from sanction.client import Client

# instantiating a client to get the auth URI
c = Client(auth_endpoint="https://home.nest.com/login/oauth2",
           client_id=config["nest.client_id"])

# instantiating a client to process OAuth2 response
c = Client(token_endpoint="https://api.home.nest.com/oauth2/access_token",
           client_id=config["nest.client_id"],
           client_secret=config["nest.client_secret"])
The library is well documented, so you should be able to figure it out from here if something is missing.
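A side note on the 'None/devices' error in the question: it looks like the client has no resource endpoint configured. Assuming your version of sanction accepts a resource_endpoint argument (worth checking against its docs), a sketch like this should make c.request('/devices') expand to a full URL:

from sanction import Client

# Assumption: resource_endpoint is the base URL prepended to request() paths,
# so request('/devices') becomes https://developer-api.nest.com/devices?...
# client_id, client_secret and client_pin are defined as in the question.
c = Client(
    token_endpoint="https://api.home.nest.com/oauth2/access_token",
    resource_endpoint="https://developer-api.nest.com",
    client_id=client_id,
    client_secret=client_secret)

c.request_token(code=client_pin)
data = c.request("/devices")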
This is more of a comment, but the system does not let me comment just yet.
To your question about where to put the web PIN: simply pass code=pin to the request_token call.
c.request_token(code=nest_client_pin)
This still does not fully solve the issue, as I can only use a PIN once. After I have used it once, every subsequent call fails again, as you describe. Still researching that.
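One way around the one-time PIN, assuming the client stores the token as c.access_token after request_token (worth verifying for your sanction version), is to cache the token after the first exchange and reuse it on later runs. TOKEN_FILE below is just a hypothetical cache path for this sketch:

import os
from sanction import Client

TOKEN_FILE = "nest_token.txt"  # hypothetical cache path for this sketch

# client_id, client_secret and client_pin are defined as in the question.
c = Client(token_endpoint="https://api.home.nest.com/oauth2/access_token",
           client_id=client_id,
           client_secret=client_secret)

if os.path.exists(TOKEN_FILE):
    # Later runs: reuse the token obtained from the first PIN exchange.
    with open(TOKEN_FILE) as f:
        c.access_token = f.read().strip()
else:
    # First run: exchange the one-time PIN, then persist the resulting token.
    c.request_token(code=client_pin)
    with open(TOKEN_FILE, "w") as f:
        f.write(c.access_token)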