I've written a Python script that uses Selenium with proxies to get the text of the different links populated upon navigating to a URL, as in this one. What I want to parse from there is the visible text connected to each link.
The script I've tried so far can produce a new proxy each time its start_script() function is called. The problem is that the very URL leads me to this redirected link. I can only get rid of this redirection by retrying until the URL accepts a proxy. My current script can try only twice, with two new proxies.
How can I use a loop within the get_texts() function so that it keeps trying with new proxies until it parses the required content?
My attempt so far:
import requests
import random
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'http://www.google.com/search?q=python'

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
               for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxies

def start_script():
    proxies = get_proxies()
    random.shuffle(proxies)
    proxy = next(cycle(proxies))
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={proxy}')
    driver = webdriver.Chrome(chrome_options=chrome_options)
    return driver

def get_texts(url):
    driver = start_script()
    driver.get(url)
    if "index?continue" not in driver.current_url:
        for item in [items.text for items in driver.find_elements_by_tag_name("h3")]:
            print(item)
    else:
        get_texts(url)

if __name__ == '__main__':
    get_texts(link)
The code below works well for me, although it can't do anything about bad proxies themselves. It loops through the list of proxies and tries each one until a request succeeds or the list runs out.
It prints which proxy it is using, so you can see that it tries more than once.
However, as https://www.us-proxy.org/ points out:
What is Google proxy? Proxies that support searching on Google are called Google proxy. Some programs need them to make a large number of queries on Google. Since the year 2016, all the Google proxies are dead. Read that article for more information.
Article:
Google Blocks Proxy in 2016: Google shows a page to verify that you are a human instead of a robot if a proxy is detected. Before the year 2016, Google allowed using that proxy for some time if you could pass this human verification.
from contextlib import contextmanager
import random

from bs4 import BeautifulSoup
import requests
from selenium import webdriver

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
               for item in soup.select("table.table tbody tr") if "yes" in item.text]
    random.shuffle(proxies)
    return proxies

# Only need to fetch the proxies once
PROXIES = get_proxies()

@contextmanager
def proxy_driver():
    try:
        proxy = PROXIES.pop()
        print(f'Running with proxy {proxy}')
        chrome_options = webdriver.ChromeOptions()
        # chrome_options.add_argument("--headless")
        chrome_options.add_argument(f'--proxy-server={proxy}')
        driver = webdriver.Chrome(options=chrome_options)
        yield driver
    finally:
        driver.close()

def get_texts(url):
    with proxy_driver() as driver:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            return [items.text for items in driver.find_elements_by_tag_name("h3")]
        print('recaptcha')

if __name__ == '__main__':
    link = 'http://www.google.com/search?q=python'
    while True:
        links = get_texts(link)
        if links:
            break
    print(links)
while True:
    driver = start_script()
    driver.get(url)
    if "index?continue" in driver.current_url:
        continue
    else:
        break
This will loop until index?continue is no longer in the URL, and then break out of the loop.
This answer only addresses your specific question; it doesn't address the problem that you may be creating a large number of web drivers without ever destroying the unused/failed ones. Hint: you should.
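For example, something along these lines (just a sketch, reusing start_script() and url from your question) would quit each failed driver before retrying:
# Sketch only: retry with fresh proxies, quitting each failed driver.
# Assumes start_script() and url are defined as in the question above.
while True:
    driver = start_script()
    try:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            break              # good proxy: keep this driver and stop retrying
        driver.quit()          # recaptcha redirect: discard this driver and retry
    except Exception:
        driver.quit()          # proxy/driver error: clean up, then retry
That way only the one driver that actually succeeded stays alive.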
I am having trouble getting the page source HTML out of a site with Selenium through a proxy. Here is my code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username + ":" + proxy_password + "@" + hostname + ":" + port  # user:pass@host:port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
f = codecs.open('dummy.html', "w", "utf-8")
f.write(html)
driver.close()
This results in very incomplete HTML, showing only the outer tags of head and body:
html
Out[3]: '<html><head></head><body></body></html>'
Also, the dummy.html file written to disk does not show any content other than what is displayed in the line above.
I am lost. Here is what I tried:
It does work when I run it without the options.add_argument('--proxy-server=%s' %PROXY) line, so I am sure it is the proxy. But the proxy connection itself seems to be OK: I do not get any proxy connection errors, and I do get the outer frame from the website, right? So the driver request gets through and back to me.
Different URLs: not only whatismyip.com fails; any other page does too. I tried different news outlets such as CNN and even Google, and virtually nothing comes back from any website except the head and body tags. It cannot be a javascript/iframe issue, right?
Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds. Plus, my connection is super fast; <1 second should be enough (in the browser).
What am I getting wrong about the connection?
driver.page_source does not always return what you expect via Selenium. It is likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.:
https://stackoverflow.com/a/45247539/1387701
Selenium makes a best effort to provide the page source as it is fetched; on highly dynamic pages, what it returns can often be limited.
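If you only need the markup that has actually rendered, one workaround (just a sketch, not from the linked answer) is to ask the browser for the live DOM via execute_script instead of relying on page_source:
# Sketch: pull the rendered DOM directly from the browser.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.whatismyip.com")
# outerHTML of the <html> node reflects the DOM as currently rendered;
# dynamic content may still need a wait or delay before this call.
html = driver.execute_script("return document.documentElement.outerHTML;")
driver.quit()
This does not fix a broken proxy connection, but it avoids surprises from page_source on dynamic pages.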
I'm new to Python and I'm trying to crawl a whole website recursively with Selenium.
I would like to do this with Selenium because I want to get all the cookies the website uses. I know other tools can crawl a website more easily and faster, but they can't give me all the cookies (first and third party).
Here is my code:
from selenium import webdriver
import os, shutil

url = "http://example.com/"
links = set()

def crawl(start_link):
    driver.get(start_link)
    elements = driver.find_elements_by_tag_name("a")
    urls_to_visit = set()
    for el in elements:
        urls_to_visit.add(el.get_attribute('href'))
    for el in urls_to_visit:
        if url in el:
            if el not in links:
                links.add(el)
                crawl(el)
            else:
                return

dir_name = "userdir"
if os.path.isdir(dir_name):
    shutil.rmtree(dir_name)

co = webdriver.ChromeOptions()
co.add_argument("--user-data-dir=userdir")
driver = webdriver.Chrome(options=co)

crawl(url)
print(links)
driver.close()
My problem is that the crawl function apparently does not open all the pages of the website. On some websites I can navigate by hand to pages that the function never reaches. Why?
One thing I have noticed while using webdriver is that it needs time to load the page; the elements are not instantly available, as they seem to be in a regular browser.
You may want to add some delays, or a loop that checks for some kind of footer to indicate that the page is loaded and you can start crawling.
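For example (just a sketch; the readyState check is an assumption, so swap in whatever reliably signals "loaded" on your site):
import time

def wait_for_page(driver, timeout=10):
    # Poll until the browser reports the document fully loaded, or give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(0.5)
    return False
Calling wait_for_page(driver) right after driver.get(start_link) in crawl() makes the link collection less dependent on timing.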
I'm using Python requests and BeautifulSoup to verify an HTML document. However, the server for the landing page has some backend code that delays several seconds before presenting the final HTML document. I've tried the redirect=true approach, but I end up with the original document. When loading the URL in a browser, there is a 2-3 second delay while the page is created by the server. I've tried various samples like url2.geturl() after page load, but all of these return the original URL (and do so well before the 2-3 seconds elapse). I need something that emulates a browser and grabs the final document.
Btw, I am able to view the correct DOM elements in Chrome, just not programmatically in Python.
Figured this out after a few cycles. It requires a combination of two pieces (the Python selenium package and time.sleep): set the background Chrome process to run headless, get the URL, wait for the server-side code to complete, then load the document. Here, I'm using BeautifulSoup to parse the DOM.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def run():
    url = "http://192.168.1.55"
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome(chrome_options=options)
    browser.get(url)
    time.sleep(5)
    bs = BeautifulSoup(browser.page_source, 'html.parser')
    data = bs.find_all('h3')

if __name__ == "__main__":
    run()
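If the fixed time.sleep(5) ever proves flaky, one alternative (a sketch of the same flow, not the original code) is an explicit wait for the <h3> elements the script actually parses:
# Sketch: wait for the content itself instead of sleeping a fixed time.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def run():
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome(options=options)
    browser.get("http://192.168.1.55")
    # Wait up to 15 seconds for at least one <h3> to be present in the DOM.
    WebDriverWait(browser, 15).until(
        EC.presence_of_element_located((By.TAG_NAME, "h3"))
    )
    data = BeautifulSoup(browser.page_source, 'html.parser').find_all('h3')
    browser.quit()
    return data

if __name__ == "__main__":
    run()
The explicit wait returns as soon as the content appears and fails fast if it never does.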
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_page_load_timeout(7)

def urlOpen(url):
    try:
        driver.get(url)
        print(driver.current_url)
    except:
        return
Then I have a list of URLs and call the above method on each one.
if __name__ == "__main__":
    urls = ['http://motahari.ir/', 'http://facebook.com', 'http://google.com']
    # It doesn't print anything
    # urls = ['http://facebook.com', 'http://google.com', 'http://motahari.ir/']
    # This prints https://www.facebook.com/ https://www.google.co.kr/?gfe_rd=cr&dcr=0&ei=3bfdWdzWAYvR8geelrqQAw&gws_rd=ssl
    for url in urls:
        urlOpen(url)
The problem is that when the website 'http://motahari.ir/' throws a TimeoutException, 'http://facebook.com' and 'http://google.com' then always throw a TimeoutException as well.
The browser keeps waiting for 'motahari.ir/' to load, but the loop just goes on (it doesn't open 'facebook.com'; it keeps waiting for 'motahari.ir/') and keeps throwing timeout exceptions.
Initializing a webdriver instance takes long, so I pulled that out of the method, and I think that caused the problem. So, should I reinitialize the webdriver instance whenever there's a timeout exception? And how? (Since I initialized the driver outside of the function, I can't reinitialize it in the except block.)
You will just need to clear the browser's cookies before continuing. (Sorry, I missed seeing this in your previous code)
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_page_load_timeout(7)

def urlOpen(url):
    try:
        driver.get(url)
        print(driver.current_url)
    except:
        driver.delete_all_cookies()
        print("Failed")
        return

urls = ['http://motahari.ir/', 'https://facebook.com', 'https://google.com']
for url in urls:
    urlOpen(url)
Output:
Failed
https://www.facebook.com/
https://www.google.com/?gfe_rd=cr&dcr=0&ei=o73dWfnsO-vs8wfc5pZI
P.S. It is not very wise to use try...except without a specific exception type, as this might mask different unexpected errors.
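Building on that point, here is a minimal sketch (my own code, not part of the answer above) that catches TimeoutException specifically and rebuilds the driver after a timeout, which also covers the "should I reinitialize?" question:
# Sketch: catch the specific exception and recreate the driver on timeout.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def make_driver(timeout=7):
    d = webdriver.Chrome()
    d.set_page_load_timeout(timeout)
    return d

driver = make_driver()
urls = ['http://motahari.ir/', 'https://facebook.com', 'https://google.com']
for url in urls:
    try:
        driver.get(url)
        print(driver.current_url)
    except TimeoutException:
        print("Timed out:", url)
        driver.quit()           # drop the stuck session
        driver = make_driver()  # start a fresh one for the next URL
driver.quit()
This keeps one stuck URL from poisoning the rest of the run.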
The funny thing is that I'm not getting any errors running this code, but I do believe that the script isn't using a proxy when it reloads the page. Here's the script:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chop = webdriver.ChromeOptions()
proxy_list = input("Name of proxy list file?: ")
proxy_file = open(proxy_list, 'r')

print('Enter url')
url = input()
driver = webdriver.Chrome(chrome_options=chop)
driver.get(url)

import time
for x in range(0, 10):
    import urllib.request
    import time
    proxies = []
    for line in proxy_file:
        proxies.append(line)
    proxies = [w.replace('\n', '') for w in proxies]
    while True:
        for i in range(len(proxies)):
            proxy = proxies[i]
            proxy2 = {"http": "http://%s" % proxy}
            proxy_support = urllib.request.ProxyHandler(proxy2)
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
            urllib.request.urlopen(url).read()
            time.sleep(5)
            driver.get(url)
            time.sleep(5)
I'm just wondering how I can use a proxy list with this script and have it work properly.
As far as I know, it's almost impossible to make Selenium work with a proxy on Firefox and Chrome. I tried for a day and got nothing. I didn't try Opera, but it will possibly be the same.
I also saw a freelancer request along the lines of: "I have a problem with a proxy on Selenium + Chrome. I'm a developer and lost 2 days trying to make it work, with nothing to show for it. If you don't know for sure how to make it work well, please don't disturb me; it's harder than just copy-pasting from the internet."
But it's easy to do with PhantomJS; proxy support works well there.
So try doing the needed actions via PhantomJS.
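For what it's worth, here is a minimal sketch of that PhantomJS route (the proxy address is a placeholder; PhantomJS and its Selenium bindings are unmaintained nowadays, so treat this as historical):
# Sketch: drive PhantomJS through a proxy via its service arguments.
from selenium import webdriver

service_args = [
    '--proxy=127.0.0.1:8080',    # placeholder host:port from your proxy list
    '--proxy-type=http',         # or 'socks5', depending on the proxy
    # '--proxy-auth=user:pass',  # only if the proxy requires credentials
]
driver = webdriver.PhantomJS(service_args=service_args)
driver.get('https://www.whatismyip.com')
print(driver.page_source)
driver.quit()
Whether this still works today depends on having a PhantomJS binary and an older Selenium release available.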