Scraping - Change Proxy and Headers for each 403 response - python

I am trying to build a program that scrapes information from a website and stores it in a CSV file. I am facing an issue with my "try"/"except" statement, which is supposed to change headers and proxies when I get a 403 response.
To get past the site's protections I am adding to each request:
headers from my headers.yml file (Chrome, Firefox and other made-up headers)
proxies that I get from https://free-proxy-list.net/, which I test, storing the good ones in a dictionary (5 of them in my code)
I then loop through these headers and proxies to build the request and get the information.
If the response is 200: the proxies and headers are supposed to change every 2 pages thanks to the "for" loop, and I am able to retrieve all the data.
If the response is 403: I wanted the program to immediately change headers and proxies each time it hit that error (403 code), thanks to the "try"/"except" statement.
However, in that case the program still runs normally when it faces a 403, and no change is made to the request.
Here's a sample of my code; could you please help with this issue?
for browser, headers in browser_headers.items():
    print(f"\n\nUsing {browser} headers\n")
    for proxy_url in good_proxies:
        proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        try:
            for i in range(1, 3):
                response = requests.get(url + str(i + y), headers=headers, proxies=proxies, timeout=7)
                print(response)
                soup = BeautifulSoup(response.content, 'html.parser')
                Tab = soup.find_all('div', class_="announceDtl")
                for Zone in Tab:
                    P = Zone.find('span', class_="announceDtlPrice").text
                    S = Zone.find('span', class_="announceDtlInfosArea").text
                    Pi = Zone.find('span', class_="announceDtlInfosNbRooms").text
                    A = Zone.find('div', class_="announcePropertyLocation").text
                    info = [P, S, Pi, A]
                    thewriter.writerow(info)
                Time.sleep(3)
                print('Page Done', y + i)
            y = y + 2
            if y >= int(limit):
                print("== ALL DATA RETRIEVED ===")
                exit()
        except Exception:
            print(f"Proxy {proxy_url} failed, trying another one")

Related

Python: request to website doesn't give the HTML I need in all cases

Based on my question here, I have a further question about requests on the website finance.yahoo.com.
My request without a User-Agent header gives me the HTML code from which I want to collect some data.
The call with 'ALB' as parameter works fine, and I get the requested data:
import bs4 as bs
import requests

def yahoo_summary_stats(stock):
    response = requests.get(f"https://finance.yahoo.com/quote/{stock}")
    #response = requests.get(f"https://finance.yahoo.com/quote/{stock}", headers={'User-Agent': 'Custom user agent'})
    soup = bs.BeautifulSoup(response.text, 'lxml')
    table = soup.find('p', {'class': 'D(ib) Va(t)'})
    sector = table.findAll('span')[1].text
    industry = table.findAll('span')[3].text
    print(f"{stock}: {sector}, {industry}")
    return sector, industry

web.yahoo_summary_stats('ALB')
Output:
ALB: Basic Materials, Specialty Chemicals
The call yahoo_summary_stats('AEE') doesn't work this way, so I need to activate headers to request the site successfully.
But now, with the parameter headers={'User-Agent': 'Custom user agent'}, the code doesn't work and it cannot find the paragraph p with class 'D(ib) Va(t)'.
How can I solve this problem?
I think you are fetching the wrong URL:
response = requests.get(f"https://finance.yahoo.com/quote/{stock}/profile?p={stock}", headers={'User-Agent': 'Custom user agent'})
Changing to the above URL, along with the user-agent, should solve it.
This page uses JavaScript to display information, but requests/BeautifulSoup can't run JavaScript.
However, checking the page in a web browser with JavaScript disabled, I can see this information on the Profile subpage:
"https://finance.yahoo.com/quote/{stock}/profile?p={stock}"
The code can get it for both stocks from this page, but it needs a User-Agent from a real browser (or at least the short version 'Mozilla/5.0'):
import bs4 as bs
import requests

def yahoo_summary_stats(stock):
    url = f"https://finance.yahoo.com/quote/{stock}/profile?p={stock}"
    headers = {'User-Agent': 'Mozilla/5.0'}

    print('url:', url)
    response = requests.get(url, headers=headers)

    soup = bs.BeautifulSoup(response.text, 'lxml')
    table = soup.find('p', {'class': 'D(ib) Va(t)'})
    sector = table.findAll('span')[1].text
    industry = table.findAll('span')[3].text
    print(f"{stock}: {sector}, {industry}")
    return sector, industry

# --- main ---

result = yahoo_summary_stats('ALB')
print('result:', result)

result = yahoo_summary_stats('AEE')
print('result:', result)
Result:
url: https://finance.yahoo.com/quote/ALB/profile?p=ALB
ALB: Basic Materials, Specialty Chemicals
result: ('Basic Materials', 'Specialty Chemicals')
url: https://finance.yahoo.com/quote/AEE/profile?p=AEE
AEE: Utilities, Utilities—Regulated Electric
result: ('Utilities', 'Utilities—Regulated Electric')

Python requests, just confirm status code and not download body

I am looking at doing a "proof of life" test for some sites my team is developing by just confirming the status code; I do not actually need the document body. From what the Python Requests documentation says, stream is False by default and the headers AND body are pulled down. However, by setting stream to True, only the headers are grabbed. My concern is the possibility of false positives.
I am trying something like the following:
url = random.choice(app.conf['TEST_SITES'])
ua = random.choice(app.conf['USER_AGENTS'])

proxies = {
    'http': '{0}:{1}@{2}:{3}'.format(proxy_user, proxy_pass, proxy_ip, proxy_port),
    'https': '{0}:{1}@{2}:{3}'.format(proxy_user, proxy_pass, proxy_ip, proxy_port),
}
headers = {'user-agent': ua}

proxy_session = requests.Session()
proxy_session.max_redirects = app.conf['MAX_REDIRECTS']

response = proxy_session.get(url, headers=headers, proxies=proxies, stream=True, timeout=5)
ret_code = response.status_code
response.close()

# Do stuff based on status code #
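If the body is the main concern, one option worth considering (a sketch, not a drop-in fix; the helper name check_alive is made up for illustration) is to issue a HEAD request, which asks the server for headers only. Not every server handles HEAD correctly, so falling back to a streamed GET that is closed before the body is read is a common pattern:
import requests

def check_alive(url, headers=None, proxies=None, timeout=5):
    """Return the status code without downloading the document body."""
    try:
        # HEAD asks for headers only; some servers reject or mishandle it.
        response = requests.head(url, headers=headers, proxies=proxies,
                                 timeout=timeout, allow_redirects=True)
        if response.status_code < 400:
            return response.status_code
    except requests.exceptions.RequestException:
        pass
    # Fallback: streamed GET, closed (by the with-block) before the body is read.
    with requests.get(url, headers=headers, proxies=proxies,
                      stream=True, timeout=timeout) as response:
        return response.status_code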

Using Requests for mixture of URLs in Python 3.x

I have a .txt file that contains a list of URLs. The structure of the URLs varies - some may begin with https, some with http, others with just www and others with just the domain name (stackoverflow.com). So an example of the .txt file content is:-
www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com
What I want to do is parse through the list and check if the URLs are live. In order to do that, the structure of the URL must be correct, otherwise the request will fail. Here's my code so far:-
import requests

with open('urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    url = url.replace('\n', '')

    if not url.startswith('http'):  # This is to handle just domain names and those that begin with 'www'
        url = 'http://' + url

    if url.startswith('http:'):
        print("trying url {}".format(url))
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        if status_code == 200:
            continue
        else:
            print("URL {} has a response code of {}".format(url, status_code))
            print("encountered error. Now trying with https")
            url = url.replace('http://', 'https://')
            print("Now replacing http with https and trying again")
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url, status_code))
    else:
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        print("URL {} has a response code of {}".format(url, status_code))
I feel like I've overcomplicated this somewhat and that there must be an easier way of trying the variants (i.e. domain name, domain with 'www' at the beginning, with 'http://' at the beginning and with 'https://' at the beginning) until a site is identified as being live or not (i.e. all variants have been exhausted).
Any suggestions on my code or a better way to approach this? In essence, I want to handle the formatting of the URL so that I can then check the status of the URL.
Thanks in advance
This is a little too long for a comment, but, yes, it can be simplified, starting from, and replacing, the startswith part:
if not '//' in url:
    url = 'http://' + url

response = requests.get(url, timeout=10)
etc.
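Building on that suggestion, here is a minimal sketch of the whole loop; the check() helper, the http-then-https fallback order and the exception handling are assumptions added for illustration, not part of the original answer:
import requests

def check(url):
    """Return the status code, or None if the request failed outright."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.exceptions.RequestException:
        return None

with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    if '//' not in url:
        url = 'http://' + url

    status = check(url)
    if status is None and url.startswith('http://'):
        # Assumption: retry over https if the plain-http attempt failed completely.
        url = url.replace('http://', 'https://', 1)
        status = check(url)

    if status is not None:
        print("URL {} has a response code of {}".format(url, status))
    else:
        print("URL {} appears to be down".format(url))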

Python request not using proxy

I wrote a simple Python script to make a request to the website http://www.lagado.com/proxy-test using the requests module.
This website essentially tells you whether the request is using a proxy or not. According to the website, my request is not going through a proxy and is in fact coming from my own IP address.
Here is the code:
proxiesLocal = {
    'https': proxy
}
headers = RandomHeaders.LoadHeader()
url = "http://www.lagado.com/proxy-test"

res = ''
while (res == ''):
    try:
        res = requests.get(url, headers=headers, proxies=proxiesLocal)
        proxyTest = bs4.BeautifulSoup(res.text, "lxml")
        items = proxyTest.find_all("p")
        print(len(items))
        for item in items:
            print(item.text)
        quit()
    except:
        print('sleeping')
        time.sleep(5)
        continue
Assuming that proxy is a variable of type string that stores the address of the proxy, what am I doing wrong?
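One likely explanation, offered as an observation rather than a confirmed answer: the proxies dictionary only has an 'https' key, while the URL being fetched is plain http://, and requests chooses the proxy entry by the URL scheme, so an http request will not use an 'https'-only mapping. A minimal sketch covering both schemes:
proxiesLocal = {
    'http': proxy,   # used for http:// URLs such as the lagado test page
    'https': proxy,  # used for https:// URLs
}
res = requests.get(url, headers=headers, proxies=proxiesLocal)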

How to log in to a website that has a password + captcha

I am writing a script that should look for new video releases from a private torrent tracker.
The app is done, but I now need a way to get past the login screen, which has a captcha, and I have no idea how to do that.
Is there a way to use cookies from my own browser to get past the login on the site, since I have credentials saved in my browser (Firefox)?
Edit:
I am now trying to bypass the captcha completely by using cookies. I have an account on the site I'm trying to get into, and I read that it should be possible to bypass the login and access a site by using cookies.
I found an example but I cannot get it to work. Here is the bit I'm trying to get to work:
cookies = {'uid': 'uid_here', 'pass': 'passkey', '__cfduid': 'cfduid'}
try:
    page = requests.get(url, params=params, cookies=cookies).content
The cookie info I have copied from my own browser, but I cannot get this to work by myself.
The full bit of code I'm using as a reference is here: https://github.com/Flexget/Flexget/blob/97bcb6e10f654fbc5a3efa0bc00af6769d73ff69/flexget/plugins/sites/torrentday.py
Edit 2: here's what I have so far, but it's not working:
def get_torrent(show_list):
    print('Starting torrent search...')
    new_eps = show_list
    file_name = "C:/Users/secret/Desktop/tv_torrents/ "
    start_url = "https://www.secretsite.com/browse.php?search="
    end_url = "&cata=yes"

    for line in new_eps:
        # search for *** releases for all series
        line += ' XAD'
        s_string = start_url + line + end_url
        cookies = {'site_cookie': 'ASDDA124fc96fb6776364asdA69c2f5ADAD921514234104'}
        try:
            read = requests.get(s_string, cookies=cookies).content
            soup = BeautifulSoup(read, 'lxml')
            links = soup.findAll('a')
            print(soup)
            torrent_links = ['https://www.secretsite.com/browse.php?search='
                             + link['href'] for link in links if link['href'].endswith('torrent')]
        except RequestException as e:
            raise print('Could not connect to secretsite: %s' % e)
        else:
            try:
                for links in torrent_links:
                    r = request.urlretrieve(links, file_name)
                    print('Success!' + line + ' downloaded')
            except:
                print('failed to dl torrent for ' + line)
                pass
The documentation is not clear on how to "use" cookies, or I don't understand it:
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
I figured it out; it was just a formatting error.
The correct format for sending cookie data is below:
cookies = {'uid': '232323', 'pass': '32323232323232323232323',
           '__cfduid': '2323232323adasdasdasdas78d6asdasjdgawi8d67as'}
try:
    page = requests.get(url, cookies=cookies).content
    soup = BeautifulSoup(page, 'lxml')
This lets me get past the captcha and "log in" to the site using cookies from my own browser, where I have already logged into the site.
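For what it's worth, the same idea can also be written with requests.Session so the cookies are attached to every request automatically; the cookie names and values below are placeholders copied by hand from the browser, as above:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Placeholder values: copy the real ones from the browser's cookie storage.
session.cookies.update({'uid': 'uid_here', 'pass': 'passkey_here', '__cfduid': 'cfduid_here'})

page = session.get(url).content   # `url` is the tracker search URL from the snippet above
soup = BeautifulSoup(page, 'lxml')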
For the captcha, there is a Python package that deals with it called captcha2upload. You will need a captcha-solver account to use it (usually it's very cheap); just search Google for "captcha solver".
I am not sure how you could use cookies for that matter...
