I want to click the load-more button on a page until it disappears. I have tried, but my code only sometimes works and otherwise raises an error; it is not a reliable solution. I have multiple URLs in a list and want to visit them one by one, clicking load-more on each page until it disappears.
Thanks in advance for helping.
Code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
url = ["https://www.zomato.com/HauzKhasSocial","https://www.zomato.com/ncr/wendys-sector-29-gurgaon","https://www.zomato.com/vaultcafecp"]
for load in url:
    driver.get(load)
    xpath_content = '//div[@class = "load-more"]'
    temp_xpath = "true"
    while temp_xpath:
        try:
            # also tried EC.visibility_of_element_located here
            WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.XPATH, xpath_content)))
            urls = driver.find_element_by_xpath(xpath_content)
            text = urls.text
            if text:
                temp_xpath = text
                print "XPATH=", temp_xpath
            driver.find_element_by_xpath(xpath_content).click()
            #driver.execute_script('$("div.load-more").click();')
        except TimeoutException:
            print driver.title, "no xpath of pagination"
            temp_xpath = ""
            continue
Most of the time I get the following error while running my program.
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 380, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
You probably get the BadStatusLine error because of a bug that has been fixed in the latest versions of the Selenium webdrivers. I ran into a similar situation recently, and here's the thread of discussion with the developers that helped me out:
https://bugs.chromium.org/p/chromedriver/issues/detail?id=1548
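Independent of that bug, the click loop itself can be made more defensive: wait for the button to be clickable rather than merely present, and treat a timeout as the signal that the button is gone. A minimal sketch in Python 2 to match the question's environment, assuming the same load-more markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

driver = webdriver.Firefox()
url = ["https://www.zomato.com/HauzKhasSocial"]  # one of the URLs from the question
xpath_content = '//div[@class = "load-more"]'
for load in url:
    driver.get(load)
    while True:
        try:
            # Wait until the button is clickable, not merely present.
            button = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.XPATH, xpath_content)))
            button.click()
        except TimeoutException:
            # The button never became clickable: assume it is gone and move on.
            print driver.title, "no more load-more button"
            break
        except StaleElementReferenceException:
            # The page re-rendered between the wait and the click; retry.
            continue
driver.quit()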
Related
I am writing a web-scraping program that goes through each URL in a list, opens the page at that URL, and extracts some information from the soup. Most of the time it works fine, but occasionally the program stops advancing through the list without terminating, raising warnings or exceptions, or otherwise showing any sign of error. My code, stripped down to the relevant parts, looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

# some code...

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...
When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:
File "/Path/to/my/file.py", line 48, in <module>
soup = bs(page, features="html.parser")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
markup = markup.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
return self._readall_chunked()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
return self.read(nbytes, buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
return self._sslobj.read(len, buffer)
KeyboardInterrupt
Here's what I have tried and learned:
- Added a check to make sure the page status is always 200 when making the soup. The fail condition never occurs.
- Added a print statement after the soup is created. This print statement does not trigger after stalling.
- The URLs are always valid. This is confirmed by the fact that the program does not stall on the same URL every time, and double-confirmed by a similar program of mine with nearly identical code that shows the same behavior on a different set of URLs.
- I have tried stepping through this with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.
- The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup.
What could cause this behavior?
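One plausible culprit, given that the traceback ends inside a blocking socket read, is a server that stops sending data mid-response: urlopen has no timeout by default, so the read never returns. A minimal sketch of a workaround, assuming the url_list from the question (the 30-second value is arbitrary):

import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # The timeout applies to the underlying socket, so it also covers
        # the read that bs4 performs when building the soup.
        page = urlopen(req, timeout=30)
        soup = bs(page, features="html.parser")
    except socket.timeout:
        print("timed out, skipping:", url)
        continue
    # do some stuff with the soup...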
This is my code.
import requests
from bs4 import BeautifulSoup
import time
import datetime
from selenium import webdriver
import io
from pyvirtualdisplay import Display

neighbours = []
link_list = []  # was missing; get_links() appends to it

with io.open('cntr_london.txt', "r", encoding="utf-8") as f:
    for q in f:
        neighbours.append(q.replace('neighborhoods%5B%5D=', '').replace('\n', ''))

#url = 'https://www.airbnb.com/s/paris/homes?room_types%5B%5D=Entire%20home%2Fapt&room_types%5B%5D=Private%20room&price_max=' + str(price_max) + '&price_min=' + str(price_min)

def scroll_through_bottom():
    # scroll down the page in 200px steps so lazy-loaded listings render
    s = 0
    while s <= 4000:
        s = s + 200
        browser.execute_script('window.scrollTo(0, ' + str(s) + ');')

def get_links():
    link_data = browser.find_elements_by_class_name('_1szwzht')
    for link in link_data:
        link_tag = link.find_elements_by_tag_name('a')
        for l in link_tag:
            link_list.append(l.get_attribute("href"))
    length = len(link_list)
    print length

with Display():
    browser = webdriver.Firefox()
    try:
        browser.get('http://airbnb.com')
    finally:
        browser.quit()
Every other URL works. But when I try to get Airbnb, it gives me this error:
Traceback (most recent call last):
File "airbnb_new.py", line 43, in <module>
browser.get('http://airbnb.com')
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 248, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 433, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 408, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
On the other hand, when I try to run my code in Python 3, it gives me "no module named pyvirtualdisplay" even though I installed it with pip.
Can someone please help me with this problem? I would highly appreciate it.
Airbnb has most likely identified your scraper and is rejecting requests from it as a preventive measure. There is not much you can do directly: you can change your IP address and system info and see whether that works, or wait a couple of hours and check whether Airbnb has released the lock and is accepting requests from your system again.
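If you want to try changing the system info, one knob is the browser's user agent. A minimal sketch that overrides it through a Firefox profile preference; the UA string is just an example, and this won't help if the block is IP-based:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
# Example UA string only; substitute whatever you want to present.
profile.set_preference("general.useragent.override",
                       "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0")
browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://airbnb.com')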
When creating a new chromedriver instance (in Python) with webdriver.Chrome("./venv/selenium/webdriver/chromedriver"), I get the error http.client.BadStatusLine: ''. I am not navigating to a site or using a server, just creating a new chromedriver. I am in a virtualenv that has the most recent versions of Selenium (3.0.1) and chromedriver (2.24.1). This was working fine a few days ago, and I didn't change any code. I'm not really sure where to begin debugging this. My first step was to run pip install --upgrade -r requirements.txt to make sure all packages were up to date. My only remaining idea is that Selenium is failing to handle the default start page (URL data:,) because there is no response. However, as that is the default behavior, I would be surprised if Selenium could not handle its own default behavior. Any help would be much appreciated!
When the code is run (via python from the bash terminal), a new chromedriver instance is successfully created, but the error http.client.BadStatusLine: '' is thrown, and the Python session loses its connection to the chromedriver.
Full code:
import pythonscripts
# Creates a new webdriver
driver = pythonscripts.md()
# Never gets here, attempts to use driver get NameError: name 'driver' is not defined
Pythonscripts md method:
def md():
    return webdriver.Chrome("./venv/selenium/webdriver/chromedriver")
Full error output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/brydenr/server_scripts/cad_tests/pythonscripts.py", line 65, in md
return webdriver.Chrome("./venv/selenium/webdriver/chromedriver")
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
desired_capabilities=desired_capabilities)
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 92, in __init__
self.start_session(desired_capabilities, browser_profile)
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 179, in start_session
response = self.execute(Command.NEW_SESSION, capabilities)
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
response = self.command_executor.execute(driver_command, params)
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 407, in execute
return self._request(command_info[0], url, body=data)
File "/Users/brydenr/server_scripts/venv/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 439, in _request
resp = self._conn.getresponse()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1171, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 351, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 321, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
I tried doing:
try:
    webdriver.Chrome("./venv/selenium/webdriver/chromedriver")
except Exception:
    webdriver.Chrome("./venv/selenium/webdriver/chromedriver")
The result is two of the same traceback as before, and two chromedriver instances. It seems like this question points to an error in urllib, but it is for a slightly different situation.
This happened to me after I updated chrome to the latest version.
I just updated chromedriver to 2.25 and it works again.
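If you are not sure which binary Selenium is actually launching, a quick sanity check is to ask that exact path for its version and compare it against your installed Chrome. A minimal sketch, using the path from the question:

import subprocess

# Path copied from the question; adjust if your layout differs.
path = "./venv/selenium/webdriver/chromedriver"
print(subprocess.check_output([path, "--version"]).decode().strip())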
(Sorry for my English.)
I want to write a parser for a website that includes a lot of JS scripts, and for this I use selenium + phantomjs + lxml. The parser needs to work fast, at least 1000 links per hour. For this purpose I use multiprocessing (not threading, because of the GIL) with the concurrent.futures module and ProcessPoolExecutor.
The problem is the following: when I give it an input list of 10 links and 5 workers, some links are lost after execution. It can be 1 link or more (up to 6, but that is rare), which is of course a bad result. There is also a dependency: as the number of processes increases, the number of lost links increases.
First of all I traced where the program breaks (asserts don't work correctly under multiprocessing). I found that it breaks after the line browser.get(l). Then I added time.sleep(x) to give the page some time to download; that gave no result. Then I tried to study .get() in selenium.webdriver....remote.webdriver.py, but it just delegates to .execute(), and that function takes so many parameters that exploring it was too long and difficult for me. At the same time I tried running the program with 1 process, and I still lost 1 link. I thought the problem might not be in selenium and PhantomJS, so I replaced concurrent.futures.ProcessPoolExecutor with multiprocessing.Pool. The link loss stopped, and with 4 or fewer processes it works almost well, but sometimes new errors appear (these errors appear when the number of processes is 4 or more):
"""
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "interface.py", line 34, in hotline_to_mysql
w = Parse_hotline().browser_manipulation(link)
File "/home/water/work/parsing/class_parser/parsing_classes.py", line 352, in browser_manipulation
browser.get(l)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 247, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 471, in _request
resp = opener.open(request, timeout=self._timeout)
File "/usr/lib/python3.4/urllib/request.py", line 463, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1185, in do_open
r = h.getresponse()
File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse
response.begin()
File "/usr/lib/python3.4/http/client.py", line 351, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "interface.py", line 69, in <module>
main()
File "interface.py", line 63, in main
executor.map(hotline_to_mysql, link_list)
File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
http.client.BadStatusLine: ''
"""
import random
import sys  # needed for sys.argv in main()
import time
import lxml.html as lh
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from multiprocessing import Pool
from selenium.webdriver.common.keys import Keys
from concurrent.futures import Future, ProcessPoolExecutor, ThreadPoolExecutor

AMOUNT_PROCESS = 5
def parse(h) -> list:
    # h - str, html of page
    lxml_ = lh.document_fromstring(h)
    name = lxml_.xpath('/html/body/div[2]/div[7]/div[6]/ul/li[1]/a/@title')
    prices_ = (price.text_content().strip().replace('\xa0', ' ')
               for price in lxml_.xpath('//*[@id="gotoshop-price"]'))
    markets_ = (market.text_content().strip() for market in
                lxml_.find_class('cell shop-title'))
    wares = [[name[0], market, price] for (market, price)
             in zip(markets_, prices_)]
    return wares
def browser_manipulation(l):
    #options = []
    #options.append('--load-images=false')
    #options.append('--proxy={}:{}'.format(host, port))
    #options.append('--proxy-type=http')
    #options.append('--user-agent={}'.format(user_agent))  # headers are randomized here
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # USER_AGENT is a list imported from my config.py
    dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser.implicitly_wait(20)
    #browser.set_page_load_timeout(80)
    browser.get(l)
    time.sleep(20)
    result = parse(browser.page_source)
    browser.quit()
    return result
def main():
    # open some file with links, one per line
    with open(sys.argv[1], 'r') as f:
        link_list = [i.replace('\n', '') for i in f]
    with Pool(AMOUNT_PROCESS) as executor:
        executor.map(browser_manipulation, link_list)

if __name__ == '__main__':
    main()
Where is the problem (selenium + phantomjs, ProcessPoolExecutor, or my code)? Why are links lost?
How can I increase the speed of parsing?
Finally, is there an alternative way to parse a dynamic website without selenium + phantomjs in Python? Speed of parsing is, of course, important.
Thanks for your answers.
I tried ThreadPoolExecutor instead of ProcessPoolExecutor, and the link loss stopped. In the thread case, the speed is approximately equal to the process case.
The question is still open; if you have any information about this, please write. Thanks.
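For reference, a minimal sketch of the ThreadPoolExecutor variant described above, assuming browser_manipulation, AMOUNT_PROCESS, and the link file from the earlier code:

import sys
from concurrent.futures import ThreadPoolExecutor

def main():
    with open(sys.argv[1], 'r') as f:
        link_list = [line.strip() for line in f]
    with ThreadPoolExecutor(max_workers=AMOUNT_PROCESS) as executor:
        # Draining the generator here re-raises any worker exception.
        results = list(executor.map(browser_manipulation, link_list))
    print(len(results), 'pages parsed')

if __name__ == '__main__':
    main()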
I have been trying to run PhantomJS via Selenium for the past 3 days with no success.
So far I have tried installing PhantomJS via npm, building it from source, installing it via apt-get, and downloading a prebuilt executable and placing it in /usr/bin/phantomjs.
Every time, I was able to run this example script, loadspeed.js:
var page = require('webpage').create(),
    system = require('system'),
    t, address;

if (system.args.length === 1) {
    console.log('Usage: loadspeed.js <some URL>');
    phantom.exit();
}

t = Date.now();
address = system.args[1];
page.open(address, function (status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        t = Date.now() - t;
        console.log('Loading time ' + t + ' msec');
    }
    phantom.exit();
});
and ran it with 'phantomjs test.js http://google.com'; it worked just as it should.
but running PhantomJS via selenium in this small python script produces errors:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get('http://seleniumhq.org')
python test.py
Traceback (most recent call last):
File "test.py", line 4, in <module>
browser.get('http://seleniumhq.org/')
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 176, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 350, in execute
return self._request(url, method=command_info[0], data=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 382, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
Replacing the second line with browser = webdriver.Firefox() works fine.
I am on Ubuntu 13.10 desktop, and the same error occurs on Ubuntu 13.04 as well.
Python: 2.7
PhantomJS: 1.9.2
What am I doing wrong here?
There seems to be some issue introduced in newer Selenium; see
http://code.google.com/p/selenium/issues/detail?id=6690
I got a bit further by using
pip install selenium==2.37
which avoids the stack trace above. I'm still having problems with driver.save_screenshot('foo.png') resulting in an empty file, though.