Cache Access Denied. Authentication Required in requests module - python

I am trying to make a basic web crawler. My internet connection goes through a proxy, so I used the solution given here. But I am still getting an error while running the code.
My code is:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

proxies = {
    "http": r"http://usr:pass#202.141.80.22:3128",
    "https": r"http://usr:pass#202.141.80.22:3128",
}

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url, proxies=proxies)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
But while running it in a terminal on Ubuntu 14.04 I get the following error:
The following error was encountered while trying to retrieve the URL: http://www.santabanta.com/wallpapers/gauhar-khan/? Cache Access Denied. Sorry, you are not currently allowed to request http://www.santabanta.com/wallpapers/gauhar-khan/? from this cache until you have authenticated yourself.
The URL posted by me is: http://www.santabanta.com/wallpapers/gauhar-khan/
Please help me.

Open the URL.
Hit F12 (Chrome users).
Now go to "Network" in the menu below.
Hit F5 to reload the page so that Chrome records all the data received from the server.
Open any of the received files and go down to "Request Headers".
Pass all of those headers to requests.get().
[Here is an image to help you][1]
[1]: http://i.stack.imgur.com/zUEBE.png
Make the headers as follows:
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
    'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
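For completeness, here is a minimal sketch of how those headers would be passed along with the proxies dict from the question (adjust the values to whatever your own Network tab shows):

# Sketch: send the captured browser headers (including Proxy-Authorization)
# together with the proxies on every request.
source_code = requests.get(url, proxies=proxies, headers=headers)
print(source_code.status_code)
print(source_code.text[:500])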

There is another way to solve this problem.
You can let your Python script use the proxy defined in your environment variables.
Open a terminal (CTRL + ALT + T) and run:
export http_proxy="http://usr:pass#proxy:port"
export https_proxy="https://usr:pass#proxy:port"
Then remove the proxy lines from your code.
Here is the changed code:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
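If you prefer not to export anything in the shell, the same effect can be had by setting those variables from inside the script before the first request is made, since requests reads the standard proxy environment variables at request time. A minimal sketch, with placeholder credentials and proxy address in the standard user:pass@host:port form:

import os

# Placeholder proxy credentials/host: replace with your own details.
os.environ["http_proxy"] = "http://usr:pass@proxy:port"
os.environ["https_proxy"] = "http://usr:pass@proxy:port"

import requests
response = requests.get("http://www.santabanta.com/wallpapers/gauhar-khan/")
print(response.status_code)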


How to Read JS generated Page in Python

Please note: this problem can be solved easily by using the Selenium library, but I don't want to use Selenium since the host doesn't have a browser installed and I'm not willing to install one.
Important: I know that render() will download Chromium the first time and I'm OK with that.
Q: How can I get the page source when it's generated by JS code? For example this HP printer:
220.116.57.59
Someone posted online and suggested using:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
base_url = r.url
r.html.render()
But printing r.text doesn't print the full page source, and it indicates that JS is disabled:
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
Original Answer: https://stackoverflow.com/a/50612469/19278887 (last part)
Grab the config endpoints and then parse the XML for the data you want.
For example:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    soup = (
        BeautifulSoup(
            s.get(
                "http://220.116.57.59/IoMgmt/Adapters",
                headers=headers,
            ).text,
            features="xml",
        ).find_all("io:HardwareConfig")
    )

print("\n".join(c.find("MacAddress").getText() for c in soup if c.find("MacAddress") is not None))
Output:
E4E749735068
E4E74973506B
E6E74973D06B

Get a lot of HTML pages with Python using requests

I am trying to download a large number of HTML pages from a certain website with the following Python code, using the "requests" package:
FROM = 547495
TO = 570000
for page_number in range(FROM, TO):
    url = DEFAULT_URL + str(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        with open(str(page_number) + ".html", "wb") as file:
            file.write(response.content)
    time.sleep(0.5)
I put a sleep(0.5) call in so that the web server will not think it is a DDoS attack.
After about 20,000 pages I started getting only 403 Forbidden HTTP status codes, and I can't download pages anymore.
But if I try to open the same pages in my browser they open fine, so I guess the web server did not block me.
Does someone have an idea what caused this, and how can I handle it?
Thank you.
Make it look like it's your browser by using headers, and set a cookie ID if it requires a session; here is an example. You can retrieve the header values by inspecting the "Network" tab in your browser when visiting the pages.
with requests.session() as sess:
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    sess.get(url)  # first request establishes the session cookie
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))
    for page_number in range(FROM, TO):
        response = sess.get(DEFAULT_URL + str(page_number))  # fetch each page through the session
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as file:
                file.write(response.content)
        time.sleep(0.5)

Scraping JSON from AJAX calls

Background
Consider this URL:
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
I want to make the AJAX call for the telephone number:
ajax_url = "https://www.olx.bg/ajax/misc/contact/phone/7XarI/?pt=e3375d9a134f05bbef9e4ad4f2f6d2f3ad704a55f7955c8e3193a1acde6ca02197caf76ffb56977ce61976790a940332147d11808f5f8d9271015c318a9ae729"
Wanted results
If I press the button on the site in my Chrome browser, I get the wanted result in the console:
{"value":"088 *****"}
Debugging
If I open a new tab and paste the ajax_url, I always get empty values:
{"value":"000 000 000"}
If I try something like:
Bash:
wget $ajax_url
Python:
import requests
json_response = requests.get(ajax_url)
I just receive the HTML of the site's error-handling page.
Ideas
The browser must be sending something more when it makes the request. What more does it have? Maybe a cookie?
How do I get the wanted result with Bash/Python?
Edit
The status code of the HTML response is 200.
I have tried with curl and I get the same HTML problem.
Kind of a fix.
I have noticed that if I copy the cookie from the browser and make a request with all the headers INCLUDING the browser's cookie, I get the correct result:
# I think the most important header is the cookie
headers = DICT_WITH_HEADERS_FROM_BROWSER
json_response = requests.get(next_url,
                             headers=headers,
                             )
Final question
The only question left is: how can I generate a cookie through a Python script?
First you should create a requests Session to store cookies.
Then send an HTTP GET request to the page that actually makes the AJAX call. If any cookie is created by the website, it is sent in the GET response and your session stores it.
Then you can easily use the session to call the AJAX API.
Important Note 1:
The AJAX URL you are calling in the original website is an HTTP POST request! You should not send a GET request to that URL.
Important Note 2:
You also must extract phoneToken from the website's JS code, where it is stored in a variable like var phoneToken = 'here is the pt';
Sample code:
import re
import requests

my_session = requests.Session()

# call html website
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
base_response = my_session.get(url=base_url)
assert base_response.status_code == 200

# extract phone token from base url response
phone_token = re.findall(r'phoneToken\s=\s\'(.+)\';', base_response.text)[0]

# call ajax api
ajax_path = "/ajax/misc/contact/phone/81i3H/?pt=" + phone_token
ajax_url = "https://www.olx.bg" + ajax_path
ajax_headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,fa;q=0.8',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'Referer': 'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
ajax_response = my_session.post(url=ajax_url, headers=ajax_headers)
print(ajax_response.text)
When you run the code above, the result below is displayed:
{"value":"088 558 9937"}
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get(
    'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html')
number = driver.find_element_by_xpath(
    "/html/body/div[3]/section/div[3]/div/div[1]/div[2]/div/ul[1]/li[2]/div/strong").click()
time.sleep(2)
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
phone = soup.find("strong", {'class': 'xx-large'}).text
print(phone)
Output:
088 558 9937

Web Scraping with urllib and fixing 403: Forbidden

I used this code to get a web page and it worked well, but now it isn't working.
I have tried many headers but I am still getting a 403 error.
This code works for most sites, but I can't get, for example, this page:
import urllib.request

def get_page(addr):
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
    req = urllib.request.Request(addr, headers=headers)
    html = urllib.request.urlopen(req).read()
    return str(html)
Try Selenium:
from selenium import webdriver
import os

# initialise browser
browser = webdriver.Chrome(os.getcwd() + '/chromedriver')
browser.get('https://www.fragrantica.com/perfume/Victorio-Lucchino/No-4-Evasion-Exotica-50418.html')

# get page html
html = browser.page_source

Scraping Aliexpress with Python - Login required

I am trying to create a web scraper in Python that goes through all products of an Aliexpress supplier.
My problem is that when I am doing it without logging in, I am eventually redirected to the login web page. I added a login section to my code but it does not help. I will appreciate all suggestions.
My code:
import requests
from bs4 import BeautifulSoup
import re
import sys
from lxml import html

def go_through_paginator(link):
    source_code = requests.get(link, data=payload, headers=dict(referer=link))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print(soup)
    for page in soup.findAll('div', {'class': 'ui-pagination-navi util-left'}):
        for next_page in page.findAll('a', {'class': 'ui-pagination-next'}):
            next_page_link = "https:" + next_page.get('href')
            print(next_page_link)
            gather_all_products(next_page_link)

def gather_all_products(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item in soup.findAll('a', {'class': 'pic-rind'}):
        product_link = item.get('href')
        go_through_paginator(url)

payload = {
    "loginId": "EMAIL",
    "password": "LOGIN",
}

LOGIN_URL = 'https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84'

session_requests = requests.session()
source_code = session_requests.get(LOGIN_URL)
source_code = session_requests.post(LOGIN_URL, data=payload)

URL = 'https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'
source_code = requests.get(URL, data=payload, headers=dict(referer=URL))
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

for L1 in soup.findAll('li', {'id': 'product-nav'}):
    for L1_link in L1.findAll('a', {'class': 'nav-link'}):
        link = "https:" + L1_link.get('href')
        gather_all_products(link)
And this is the Aliexpress login URL:
https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84
Try setting the cookie values for xman_t and intl_common_forever from the response cookies.
I tried grabbing all the product information directly. Before I set xman_t and intl_common_forever, Aliexpress only allowed me to grab 7 products. After I set xman_t and intl_common_forever, I successfully grabbed 50 products.
Hopefully this helps you to scrape their products.
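A minimal sketch of that idea, assuming you have copied working xman_t and intl_common_forever values from your browser's developer tools (the cookie values below are placeholders):

import requests

session = requests.Session()

# Placeholder values: paste the real xman_t and intl_common_forever cookies
# copied from your browser after visiting the store page.
session.cookies.set("xman_t", "PASTE_XMAN_T_VALUE_HERE", domain=".aliexpress.com")
session.cookies.set("intl_common_forever", "PASTE_VALUE_HERE", domain=".aliexpress.com")

response = session.get("https://www.aliexpress.com/store/1816376")
print(response.status_code)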
Your problem is a bit too complex for Stack Overflow, but let's take a look at a few tips that can make a huge difference.
What is happening here is that Aliexpress suspects you of being a robot and asks you to log in to proceed.
To avoid that, you need to fortify your scraper a bit. Let's take a look at some common tips for AliExpress.
First, you need to sanitize your URLs and remove tracking parameters:
URL = 'https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'
#                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#        we don't need this parameter, it just helps AliExpress to identify us
Then, we should fortify our requests Session with browser-like headers:
BASE_HEADERS = {
    # let's use headers of Chrome 96 on a Windows operating system to blend in:
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

# requests.Session() takes no constructor arguments; apply the headers to the session instead
session = requests.Session()
session.headers.update(BASE_HEADERS)
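As a quick usage sketch, the fortified session can then fetch the cleaned store URL directly (the store id is the one from the question):

# Fetch the store page with the sanitized URL through the fortified session.
response = session.get("https://www.aliexpress.com/store/1816376")
print(response.status_code)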
These two tips will decrease login redirects significantly!
Though that's just the tip of the iceberg - for more, see the blog tutorial I wrote: How to Scrape AliExpress.
