I'm currently trying to learn Python by doing small, silly projects to get my head around certain concepts, but I have hit a bit of a brick wall. I want to make something that visits a page using a proxy list I have in a .txt file: load the web page with the first proxy in the file, then load the page with the second proxy, and so on. However, I keep getting this error:
Traceback (most recent call last):
  File "c:\Users\Admin\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "c:\Users\Admin\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 434, in main
    run()
  File "c:\Users\Admin\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 312, in run_file
    runpy.run_path(target, run_name='__main__')
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Admin\Documents\PythonScripts\ebay-traffic.py", line 10, in <module>
    r = requests.get(url, proxies = line)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 524, in request
    prep.url, proxies, stream, verify, cert
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 699, in merge_environment_settings
    no_proxy = proxies.get('no_proxy') if proxies is not None else None
AttributeError: 'str' object has no attribute 'get'
The proxy file contains one proxy address per line. I've tried various silly things, like wrapping the proxy in int(), but that obviously doesn't work (I was trying a lot of silly things). Here is my code:
import requests

proxyList = 'proxies.txt'
file = open(proxyList, "r")
url = input('Website: ')

for line in file:
    print(line, end="")
    r = requests.get(url, proxies=line)

print('Finished.')
input()
I expect it to print each line of the proxy file as it loads the page through that proxy.
You need to pass proxies as a dict, not a string:
import requests

proxyList = 'proxies.txt'
file = open(proxyList, "r")
url = input('Website: ')

for line in file:
    print(line, end="")
    proxies = {'http': line.strip(), 'https': line.strip()}
    r = requests.get(url, proxies=proxies)

print('Finished.')
input()
You need to provide the proxies as a dict to Python requests, i.e.:
import requests

url = input('Website:\n')

with open('proxies.txt') as f:
    proxies = [x.strip() for x in f]

for p in proxies:
    r = requests.get(url, proxies={'http': p, 'https': p})
    print(r.text)
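Both answers assume proxies.txt holds one proxy URL per line, for example (hypothetical addresses):

http://203.0.113.10:8080
http://203.0.113.11:3128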
I have a config.ini file that looks like this:
[REDDIT]
client_id = 'myclientid23jd934g'
client_secret = 'myclientsecretjf30gj5g'
password = 'mypassword'
user_agent = 'myuseragent'
username = 'myusername'
When I try to use Reddit's API via praw like this:
import configparser
import praw


class redditImageScraper:
    def __init__(self, sub, limit):
        config = configparser.ConfigParser()
        config.read('config.ini')
        self.sub = sub
        self.limit = limit
        self.reddit = praw.Reddit(client_id=config.get('REDDIT', 'client_id'),
                                  client_secret=config.get('REDDIT', 'client_secret'),
                                  password=config.get('REDDIT', 'password'),
                                  user_agent=config.get('REDDIT', 'user_agent'),
                                  username=config.get('REDDIT', 'username'))

    def get_content(self):
        submissions = self.reddit.subreddit(self.sub).hot(limit=self.limit)
        for submission in submissions:
            print(submission.id)


def main():
    scraper = redditImageScraper('aww', 25)
    scraper.get_content()


if __name__ == '__main__':
    main()
I get this traceback:
Traceback (most recent call last):
  File "config.py", line 30, in <module>
    main()
  File "config.py", line 27, in main
    scraper.get_content()
  File "config.py", line 22, in get_content
    for submission in submissions:
  File "C:\Users\Evan\Anaconda3\lib\site-packages\praw\models\listing\generator.py", line 61, in __next__
    self._next_batch()
  File "C:\Users\Evan\Anaconda3\lib\site-packages\praw\models\listing\generator.py", line 71, in _next_batch
    self._listing = self._reddit.get(self.url, params=self.params)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\praw\reddit.py", line 454, in get
    data = self.request("GET", path, params=params)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\praw\reddit.py", line 627, in request
    method, path, data=data, files=files, params=params
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\sessions.py", line 185, in request
    params=params, url=url)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\sessions.py", line 116, in _request_with_retries
    data, files, json, method, params, retries, url)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\sessions.py", line 101, in _make_request
    params=params)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\rate_limit.py", line 35, in call
    kwargs['headers'] = set_header_callback()
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\sessions.py", line 145, in _set_header_callback
    self._authorizer.refresh()
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\auth.py", line 328, in refresh
    password=self._password)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\auth.py", line 138, in _request_token
    response = self._authenticator._post(url, **data)
  File "C:\Users\Evan\Anaconda3\lib\site-packages\prawcore\auth.py", line 31, in _post
    raise ResponseException(response)
prawcore.exceptions.ResponseException: received 401 HTTP response
However, when I manually insert the credentials, my code runs exactly as expected. Also, if I run the line
print(config.get('REDDIT', 'client_id'))
I get the output 'myclientid23jd934g' as expected.
Is there some reason that praw won't allow me to pass my credentials using configparser?
Double-check what your inputs to praw.Reddit actually are; if the printed values still contain the quotation marks from config.ini, that's the problem, since configparser does not strip quotes:
kwargs = dict(client_id=config.get('REDDIT', 'client_id'),
              client_secret=config.get('REDDIT', 'client_secret'),
              password=config.get('REDDIT', 'password'),
              user_agent=config.get('REDDIT', 'user_agent'),
              username=config.get('REDDIT', 'username'))
print(kwargs)
praw.Reddit(**kwargs)
You're overcomplicating configuration here — PRAW will take care of this for you.
If you rename config.ini to praw.ini, you can replace your whole initialization with just
self.reddit = praw.Reddit('REDDIT')
This is because PRAW will look for a praw.ini file and parse it for you. If you want to give the section a more descriptive name, make sure to update it in the praw.ini as well as in the single parameter passed to Reddit (which specifies the section of the file to use).
See https://praw.readthedocs.io/en/latest/getting_started/configuration/prawini.html.
As this page notes, values like username and password should not have quotation marks around them. For example,
password=mypassword
is correct, but
password="mypassword"
is incorrect.
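Concretely, the praw.ini matching the question's config.ini would be the same file with the quotes removed:

[REDDIT]
client_id=myclientid23jd934g
client_secret=myclientsecretjf30gj5g
password=mypassword
user_agent=myuseragent
username=myusername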
Hi, I am trying to hit an API using the requests module of Python. The API has to be hit 20000 times, as the number of pages is around 20000. Each hit returns about 10 MB of data, and by the end of the process it creates a JSON file of around 100 GB. Here is the code I have written:
import json

import requests

# url, headers and the initial next_page_cursor are defined earlier in the script
with open('file.json', 'wb', buffering=100 * 1048567) as f:
    while next_page_cursor != "":
        with requests.get(url, headers=headers) as response:
            json_response = json.loads(response.content.decode('utf-8'))
            """
            The JSON response looks something like this:
            {
                "content": [{}, {}, {} ........ 50 dictionaries],
                "next_page_cursor": "abcd"
            }
            """
            next_page_cursor = json_response['next_page_cursor']
            for data in json_response['content']:
                f.write((json.dumps(data) + "\n").encode())
But after running successfully for a few pages, the code fails with the error below:
Traceback (most recent call last):
  File "<command-1206920060120926>", line 65, in <module>
    with requests.get(data_url, headers = headers) as response:
  File "/databricks/python/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/databricks/python/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/databricks/python/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/databricks/python/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
    r.content
  File "/databricks/python/lib/python3.7/site-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/databricks/python/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You need to use response.iter_content to read the body in chunks instead of relying on response.content:
https://2.python-requests.org/en/master/api/#requests.Response.iter_content
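A minimal sketch of that advice applied to the question's loop body; the url and headers values are placeholders standing in for the question's variables, and stream=True defers the download until iter_content consumes it:

import json

import requests

url = 'https://api.example.com/page'  # placeholder; use the question's url
headers = {}                          # placeholder; use the question's headers

with requests.get(url, headers=headers, stream=True, timeout=120) as response:
    response.raise_for_status()
    # read the ~10 MB body in 1 MiB chunks instead of a single .content call
    raw = b''.join(response.iter_content(chunk_size=1024 * 1024))

json_response = json.loads(raw.decode('utf-8'))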
This is my code:
import requests
from bs4 import BeautifulSoup
import time
import datetime
from selenium import webdriver
import io
from pyvirtualdisplay import Display

neighbours = []
link_list = []  # was missing in the snippet; get_links() appends to it

with io.open('cntr_london.txt', "r", encoding="utf-8") as f:
    for q in f:
        neighbours.append(q.replace('neighborhoods%5B%5D=', '').replace('\n', ''))

#url = 'https://www.airbnb.com/s/paris/homes?room_types%5B%5D=Entire%20home%2Fapt&room_types%5B%5D=Private%20room&price_max=' + str(price_max) + '&price_min=' + str(price_min)


def scroll_through_bottom():
    s = 0
    while s <= 4000:
        s = s + 200
        browser.execute_script('window.scrollTo(0, ' + str(s) + ');')


def get_links():
    link_data = browser.find_elements_by_class_name('_1szwzht')
    for link in link_data:
        link_tag = link.find_elements_by_tag_name('a')
        for l in link_tag:
            link_list.append(l.get_attribute("href"))
    length = len(link_list)
    print length


with Display():
    browser = webdriver.Firefox()
    try:
        browser.get('http://airbnb.com')
    finally:
        browser.quit()
Every other URL works, but when I try to load Airbnb, it gives me this error:
Traceback (most recent call last):
  File "airbnb_new.py", line 43, in <module>
    browser.get('http://airbnb.com')
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 248, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 433, in _request
    resp = self._conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 408, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''
On the other hand, when I try to run my code in Python 3, it complains there is no module named pyvirtualdisplay, even though I installed it with pip.
Can someone please help me with this problem? I would highly appreciate it.
Airbnb has identified your scraper, and as a preventive measure they are rejecting requests from your spider. There is not much you can do about this directly: you can change your IP address and system info and check whether it works, or you can wait a couple of hours and check whether Airbnb's system has released the lock and accepts requests from your machine again.
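If you want to try the "change system info" route from the posted setup, one easy knob is the browser's user agent. A hedged sketch for the Selenium/Firefox combination used in the question; the UA string is made up, and Airbnb may still block you:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
# override Firefox's default user agent (example string, pick your own)
profile.set_preference('general.useragent.override',
                       'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0')
browser = webdriver.Firefox(firefox_profile=profile)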
macOS 10.12.3, Python 2.7.13, requests 2.13.0
I use the requests package to send a POST request. The request requires a login before posting data, so I use requests.Session() and load a cookie saved from an earlier login.
Then I use this session to send the POST data in a loop.
This code used to run without errors on Windows and Linux.
Simplified code:
import cookielib  # Python 2 standard-library cookie jar

import requests

s = requests.Session()
s.cookies = cookielib.LWPCookieJar('cookise')
s.cookies.load(ignore_discard=True)

for user_id in range(100, 200):
    url = 'http://xxxx'
    data = {'user': user_id, 'content': '123'}
    r = s.post(url, data)
    ...
But the program frequently crashes (at roughly regular intervals) with AttributeError: 'module' object has no attribute 'kqueue':
Traceback (most recent call last):
  File "/Users/gasxia/Dev/Projects/TgbookSpider/kfz_send_msg.py", line 90, in send_msg
    r = requests.post(url, data)  # catch error if user isn't exist
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 535, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 588, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 241, in _get_conn
    if conn and is_connection_dropped(conn):
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/connection.py", line 27, in is_connection_dropped
    return bool(wait_for_read(sock, timeout=0.0))
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/wait.py", line 33, in wait_for_read
    return _wait_for_io_events(socks, EVENT_READ, timeout)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/wait.py", line 22, in _wait_for_io_events
    with DefaultSelector() as selector:
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/selectors.py", line 431, in __init__
    self._kqueue = select.kqueue()
AttributeError: 'module' object has no attribute 'kqueue'
This looks like a problem that commonly arises if you're using something like eventlet or gevent, both of which monkeypatch the select module. If you're using those to achieve asynchrony, you will need to ensure that those monkeypatches are applied before importing requests. This is a known bug, tracked in an upstream urllib3 issue.
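For example, with gevent the patch must run before requests is imported. A minimal ordering sketch (eventlet's monkey_patch() works the same way); example.com is just a placeholder URL:

from gevent import monkey
monkey.patch_all()  # patch select/socket before anything else imports them

import requests  # imported only after the monkeypatch is applied

r = requests.get('http://example.com')
print(r.status_code)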
I access the net through a proxy server.
self.wp = site if site else mwclient.Site(self.url)
When the above line is executed, the following errors are shown:
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\client.py", line 92, in __init__
self.site_init()
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\client.py", line 100, in site_init
siprop = 'general|namespaces', uiprop = 'groups|rights')
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\client.py", line 165, in api
info = self.raw_api(action, **kwargs)
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\client.py", line 248, in raw_api
json_data = self.raw_call('api', data).read()
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\client.py", line 223, in raw_call
url, data = data, headers = headers)
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\http.py", line 225, in post
return self.find_connection(host).post(host,
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\http.py", line 218, in find_connection
conn = cls(host, self)
File "C:\Python27\lib\site-packages\mwclient-0.6.5-py2.7.egg\mwclient\http.py", line 62, in __init__
self._conn.connect()
File "C:\Python27\lib\httplib.py", line 757, in connect
self.timeout, self.source_address)
File "C:\Python27\lib\socket.py", line 571, in create_connection
raise err
error: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
I tried setting the proxy using urllib2 with the following steps, but it didn't help:
>>> import urllib2
>>> auth = 'http://xxxxx:xxxx@10.1.9.30:8080'
>>> handler = urllib2.ProxyHandler({'http':auth})
>>> opener = urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
This is a bit old, but I faced the same problem yesterday and I'm posting the solution here, as it may help other people.
I managed to sort it out by changing the file mwclient/http.py. (Installing a urllib2 opener doesn't help, because mwclient 0.6.5 talks to httplib directly, as the traceback shows.) Basically, I check whether the environment variable http_proxy exists and, if so, connect through that proxy rather than directly.
In the class HTTPPersistentConnection(object): I added a variable usesProxy = False. Around line 61, I replaced self._conn = self.http_class(host) with:
# needs "import os" and "import urlparse" at the top of http.py
http_proxy_env = os.environ.get('http_proxy')
if http_proxy_env is not None:
    try:
        # connect via the proxy taken from the environment
        http_proxy_url = urlparse.urlparse(http_proxy_env)
        http_proxy_host, http_proxy_port = http_proxy_url.netloc.split(':')
        self._conn = self.http_class(http_proxy_host, int(http_proxy_port))
        self.usesProxy = True
    except:
        self._conn = self.http_class(host)
else:
    self._conn = self.http_class(host)
Then I substituted the next two occurrences of self._conn.request(method, path, ....). At line 94, with:
if self.usesProxy is False:
    self._conn.request(method, path, headers = headers)
else:
    self._conn.request(method, "http://" + host + path, headers = headers)
and at line 107, with:
if self.usesProxy is False:
    self._conn.request(method, path, data, headers)
else:
    self._conn.request(method, "http://" + host + path, data, headers)
It should do the job!
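With that patch in place, using a proxy is just a matter of setting http_proxy before the connection is opened. A sketch; the proxy address is the one from the question, and en.wikipedia.org stands in for your wiki:

import os

# must be set before mwclient opens its HTTP connection
os.environ['http_proxy'] = 'http://10.1.9.30:8080'

import mwclient
site = mwclient.Site('en.wikipedia.org')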