I am trying to build a program that scraps information from a website and stores it into a csv file. I am facing an issue with my "try" and "except" statement that is supposed to change headers and proxies when I get a 403 response.
To bypass securities I am adding to the url:
headers from my headers.yml script (Chrome, Firefox and others made up headers)
proxies that I get from https://free-proxy-list.net/, test them and store the good ones in a dictionary (5 of them in my code)
I then loop through these headers and proxies to build the URL and get the information (request).
If response 200 : the proxies and headers are supposed to change every 2 pages with the "for" loop and I'm able to retrieve all data.
If response 403 : I wanted the program to directly change headers and proxies each time it faced an error (403 code) thanks to a "try" and "except" statement.
However in this case, the program still runs normally when it faces 403 and no change is made to the URL.
Here's a sample of my code, could you please help with this issue?
for browser, headers in browser_headers.items():
print(f"\n\nUsing {browser} headers\n")
for proxy_url in good_proxies:
proxies = proxies = {
"http": proxy_url,
"https": proxy_url,
}
try:
for i in range(1,3):
response = requests.get(url + str(i+y), headers=headers, proxies=proxies, timeout=7)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
Tab = soup.find_all('div', class_= "announceDtl")
for Zone in Tab:
P = Zone.find('span', class_= "announceDtlPrice").text
S = Zone.find('span', class_= "announceDtlInfosArea").text
Pi = Zone.find('span', class_="announceDtlInfosNbRooms").text
A = Zone.find('div', class_= "announcePropertyLocation").text
info = [P, S, Pi, A]
thewriter.writerow(info)
Time.sleep(3)
print('Page Done', y+i)
y = y + 2
if y >= int(limit):
print("== ALL DATA RETRIEVED ===")
exit()
except Exception:
print(f"Proxy {proxy_url} failed, trying another one")
I have a .txt file that contains a list of URLs. The structure of the URLs varies - some may begin with https, some with http, others with just www and others with just the domain name (stackoverflow.com). So an example of the .txt file content is:-
www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com
What I want to do is parse through the list and check if the URLs are live. In order to do that, the stucture of the URL must be correct otherwise the request will fail. Here's my code so far:-
import requests
with open('urls.txt', 'r') as f:
urls = f.readlines()
for url in urls:
url = url.replace('\n', '')
if not url.startswith('http'): #This is to handle just domain names and those that begin with 'www'
url = 'http://' + url
if url.startswith('http:'):
print("trying url {}".format(url))
response = requests.get(url, timeout=10)
status_code = response.status_code
if status_code == 200:
continue
else:
print("URL {} has a response code of {}".format(url, status_code))
print("encountered error. Now trying with https")
url = url.replace('http://', 'https://')
print("Now replacing http with https and trying again")
response = requests.get(url, timeout=10)
status_code = response.status_code
print("URL {} has a response code of {}".format(url, status_code))
else:
response = requests.get(url, timeout=10)
status_code = response.status_code
print("URL {} has a response code of {}".format(url, status_code))
I feel like I've overcomplicated this somewhat and there must be an easier way of trying variants (ie. domain name, domain with 'www' at the beginning, with 'http' at the beginning and with 'https://' at the beginning, until a site is identified as being live or not (ie. all variables have been exhausted).
Any suggestions on my code or a better way to approach this? In essence, I want to handle the formatting of the URL to ensure that I then attempt to check the status of the URL.
Thanks in advance
This is a little too long for a comment, but, yes, it can be simplified, starting from, and replacing, the startswith part:
if not '//' in url:
url = 'http://' + url
response = requests.get(url, timeout=10)
etc.
I am trying to build a simple webbot in Python, on Windows, using MechanicalSoup. Unfortunately, I am sitting behind a (company-enforced) proxy. I could not find a way to provide a proxy to MechanicalSoup. Is there such an option at all? If not, what are my alternatives?
EDIT: Following Eytan's hint, I added proxies and verify to my code, which got me a step further, but I still cannot submit a form:
import mechanicalsoup
proxies = {
'https': 'my.https.proxy:8080',
'http': 'my.http.proxy:8080'
}
url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
front_page = browser.open(url, proxies=proxies, verify=False)
form = browser.select_form('form[action="/search"]')
form.print_summary()
form["q"] = "MechanicalSoup"
form.print_summary()
browser.submit(form, url=url)
The code hangs in the last line, and submitdoesn't accept proxies as an argument.
It seems that proxies have to be specified on the session level. Then they are not required in browser.open and submitting the form also works:
import mechanicalsoup
proxies = {
'https': 'my.https.proxy:8080',
'http': 'my.http.proxy:8080'
}
url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies # THIS IS THE SOLUTION!
front_page = browser.open(url, verify=False)
form = browser.select_form('form[action="/search"]')
form["q"] = "MechanicalSoup"
result = browser.submit(form, url=url)
result.status_code
returns 200 (i.e. "OK").
According to their doc, this should work:
browser.get(url, proxies=proxy)
Try passing the 'proxies' argument to your requests.
I am trying to send GET request through a proxy with authentification.
I have the following existing code:
import httplib
username = 'myname'
password = '1234'
proxyserver = "136.137.138.139"
url = "http://google.com"
c = httplib.HTTPConnection(proxyserver, 83, timeout = 30)
c.connect()
c.request("GET", url)
resp = c.getresponse()
data = resp.read()
print data
when running this code, I get an answer from the proxy saying that I must provide authentification, which is correct.
In my code, I don't use login and password. My problem is that i don't know how to use them !
Any idea ?
You can refer this code if you specifically want to use httplib.
https://gist.github.com/beugley/13dd4cba88a19169bcb0
But you could also use the easier requests module.
import requests
proxies = {
"http": "http://username:password#proxyserver:port/",
# "https": "https://username:password#proxyserver:port/",
}
url = 'http://google.com'
data = requests.get(url, proxies=proxies)
Just a short, simple one about the excellent Requests module for Python.
I can't seem to find in the documentation what the variable 'proxies' should contain. When I send it a dict with a standard "IP:PORT" value it rejected it asking for 2 values.
So, I guess (because this doesn't seem to be covered in the docs) that the first value is the ip and the second the port?
The docs mention this only:
proxies – (optional) Dictionary mapping protocol to the URL of the proxy.
So I tried this... what should I be doing?
proxy = { ip: port}
and should I convert these to some type before putting them in the dict?
r = requests.get(url,headers=headers,proxies=proxy)
The proxies' dict syntax is {"protocol": "scheme://ip:port", ...}. With it you can specify different (or the same) proxie(s) for requests using http, https, and ftp protocols:
http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"
proxies = {
"http" : http_proxy,
"https" : https_proxy,
"ftp" : ftp_proxy
}
r = requests.get(url, headers=headers, proxies=proxies)
Deduced from the requests documentation:
Parameters:
method – method for the new Request object.
url – URL for the new Request object.
...
proxies – (optional) Dictionary mapping protocol to the URL of the proxy.
...
On linux you can also do this via the HTTP_PROXY, HTTPS_PROXY, and FTP_PROXY environment variables:
export HTTP_PROXY=10.10.1.10:3128
export HTTPS_PROXY=10.10.1.11:1080
export FTP_PROXY=10.10.1.10:3128
On Windows:
set http_proxy=10.10.1.10:3128
set https_proxy=10.10.1.11:1080
set ftp_proxy=10.10.1.10:3128
You can refer to the proxy documentation here.
If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:
import requests
proxies = {
"http": "http://10.10.1.10:3128",
"https": "https://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
To use HTTP Basic Auth with your proxy, use the http://user:password#host.com/ syntax:
proxies = {
"http": "http://user:pass#10.10.1.10:3128/"
}
I have found that urllib has some really good code to pick up the system's proxy settings and they happen to be in the correct form to use directly. You can use this like:
import urllib
...
r = requests.get('http://example.org', proxies=urllib.request.getproxies())
It works really well and urllib knows about getting Mac OS X and Windows settings as well.
The accepted answer was a good start for me, but I kept getting the following error:
AssertionError: Not supported proxy scheme None
Fix to this was to specify the http:// in the proxy url thus:
http_proxy = "http://194.62.145.248:8080"
https_proxy = "https://194.62.145.248:8080"
ftp_proxy = "10.10.1.10:3128"
proxyDict = {
"http" : http_proxy,
"https" : https_proxy,
"ftp" : ftp_proxy
}
I'd be interested as to why the original works for some people but not me.
Edit: I see the main answer is now updated to reflect this :)
If you'd like to persisist cookies and session data, you'd best do it like this:
import requests
proxies = {
'http': 'http://user:pass#10.10.1.0:3128',
'https': 'https://user:pass#10.10.1.0:3128',
}
# Create the session and set the proxies.
s = requests.Session()
s.proxies = proxies
# Make the HTTP request through the session.
r = s.get('http://www.showmemyip.com/')
8 years late. But I like:
import os
import requests
os.environ['HTTP_PROXY'] = os.environ['http_proxy'] = 'http://http-connect-proxy:3128/'
os.environ['HTTPS_PROXY'] = os.environ['https_proxy'] = 'http://http-connect-proxy:3128/'
os.environ['NO_PROXY'] = os.environ['no_proxy'] = '127.0.0.1,localhost,.local'
r = requests.get('https://example.com') # , verify=False
The documentation
gives a very clear example of the proxies usage
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
What isn't documented, however, is the fact that you can even configure proxies for individual urls even if the schema is the same!
This comes in handy when you want to use different proxies for different websites you wish to scrape.
proxies = {
'http://example.org': 'http://10.10.1.10:3128',
'http://something.test': 'http://10.10.1.10:1080',
}
requests.get('http://something.test/some/url', proxies=proxies)
Additionally, requests.get essentially uses the requests.Session under the hood, so if you need more control, use it directly
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
session = requests.Session()
session.proxies.update(proxies)
session.get('http://example.org')
I use it to set a fallback (a default proxy) that handles all traffic that doesn't match the schemas/urls specified in the dictionary
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
session = requests.Session()
session.proxies.setdefault('http', 'http://127.0.0.1:9009')
session.proxies.update(proxies)
session.get('http://example.org')
i just made a proxy graber and also can connect with same grabed proxy without any input
here is :
#Import Modules
from termcolor import colored
from selenium import webdriver
import requests
import os
import sys
import time
#Proxy Grab
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.sslproxies.org/")
tbody = driver.find_element_by_tag_name("tbody")
cell = tbody.find_elements_by_tag_name("tr")
for column in cell:
column = column.text.split(" ")
print(colored(column[0]+":"+column[1],'yellow'))
driver.quit()
print("")
os.system('clear')
os.system('cls')
#Proxy Connection
print(colored('Getting Proxies from graber...','green'))
time.sleep(2)
os.system('clear')
os.system('cls')
proxy = {"http": "http://"+ column[0]+":"+column[1]}
url = 'https://mobile.facebook.com/login'
r = requests.get(url, proxies=proxy)
print("")
print(colored('Connecting using proxy' ,'green'))
print("")
sts = r.status_code
here is my basic class in python for the requests module with some proxy configs and stopwatch !
import requests
import time
class BaseCheck():
def __init__(self, url):
self.http_proxy = "http://user:pw#proxy:8080"
self.https_proxy = "http://user:pw#proxy:8080"
self.ftp_proxy = "http://user:pw#proxy:8080"
self.proxyDict = {
"http" : self.http_proxy,
"https" : self.https_proxy,
"ftp" : self.ftp_proxy
}
self.url = url
def makearr(tsteps):
global stemps
global steps
stemps = {}
for step in tsteps:
stemps[step] = { 'start': 0, 'end': 0 }
steps = tsteps
makearr(['init','check'])
def starttime(typ = ""):
for stemp in stemps:
if typ == "":
stemps[stemp]['start'] = time.time()
else:
stemps[stemp][typ] = time.time()
starttime()
def __str__(self):
return str(self.url)
def getrequests(self):
g=requests.get(self.url,proxies=self.proxyDict)
print g.status_code
print g.content
print self.url
stemps['init']['end'] = time.time()
#print stemps['init']['end'] - stemps['init']['start']
x= stemps['init']['end'] - stemps['init']['start']
print x
test=BaseCheck(url='http://google.com')
test.getrequests()
It’s a bit late but here is a wrapper class that simplifies scraping proxies and then making an http POST or GET:
ProxyRequests
https://github.com/rootVIII/proxy_requests
Already tested, the following code works. Need to use HTTPProxyAuth.
import requests
from requests.auth import HTTPProxyAuth
USE_PROXY = True
proxy_user = "aaa"
proxy_password = "bbb"
http_proxy = "http://your_proxy_server:8080"
https_proxy = "http://your_proxy_server:8080"
proxies = {
"http": http_proxy,
"https": https_proxy
}
def test(name):
print(f'Hi, {name}') # Press Ctrl+F8 to toggle the breakpoint.
# Create the session and set the proxies.
session = requests.Session()
if USE_PROXY:
session.trust_env = False
session.proxies = proxies
session.auth = HTTPProxyAuth(proxy_user, proxy_password)
r = session.get('https://www.stackoverflow.com')
print(r.status_code)
if __name__ == '__main__':
test('aaa')
I share some code how to fetch proxies from the site "https://free-proxy-list.net" and store data to a file compatible with tools like "Elite Proxy Switcher"(format IP:PORT):
##PROXY_UPDATER - get free proxies from https://free-proxy-list.net/
from lxml.html import fromstring
import requests
from itertools import cycle
import traceback
import re
######################FIND PROXIES#########################################
def get_proxies():
url = 'https://free-proxy-list.net/'
response = requests.get(url)
parser = fromstring(response.text)
proxies = set()
for i in parser.xpath('//tbody/tr')[:299]: #299 proxies max
proxy = ":".join([i.xpath('.//td[1]/text()')
[0],i.xpath('.//td[2]/text()')[0]])
proxies.add(proxy)
return proxies
######################write to file in format IP:PORT######################
try:
proxies = get_proxies()
f=open('proxy_list.txt','w')
for proxy in proxies:
f.write(proxy+'\n')
f.close()
print ("DONE")
except:
print ("MAJOR ERROR")