Web Scraping with urllib and fixing 403: Forbidden - python

I used this code to get a web page and it used to work well, but now it doesn't.
I have tried many different headers, but I still get a 403 error. The code works for most sites, but I can't fetch this page, for example:
import urllib.request

def get_page(addr):
    headers = {'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"}
    req = urllib.request.Request(addr, headers=headers)
    html = urllib.request.urlopen(req).read()
    return str(html)

Try Selenium:
from selenium import webdriver
import os
# initialise browser
browser = webdriver.Chrome(os.getcwd() + '/chromedriver')
browser.get('https://www.fragrantica.com/perfume/Victorio-Lucchino/No-4-Evasion-Exotica-50418.html')
# get page html
html = browser.page_source

Related

Requests returns a status code of 429 for URL https://www.instagram.com/google

I'm trying to code an Instagram web scraper in Python that returns values like a person's follower count, number of posts, etc.
Let's just take Google's Instagram-account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, when I run the code, it prints link.status_code = 429. It should be 200; for any other website it prints 200.
Also, when it prints soup, it doesn't show what I actually want: instead of the HTML for the account, it shows the HTML for the Instagram error page.
Why does requests open the instagram error page, not the account from the link provided?
To get the correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
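If you only need the counts, you can pull them out of that meta-description string with a small helper. This is a sketch: the regex is an assumption based on the "<N> Followers, <N> Following, <N> Posts" format shown in the output above, and short-form numbers like "12.5m" come back as strings:

```python
import re

# Assumed helper: parses the "<followers> Followers, <following> Following,
# <posts> Posts" prefix of Instagram's meta description.
def parse_profile_stats(description):
    match = re.search(
        r"([\d.,]+[kmb]?)\s+Followers,\s+([\d.,]+[kmb]?)\s+Following,\s+([\d.,]+[kmb]?)\s+Posts",
        description,
        re.IGNORECASE,
    )
    if match is None:
        return None
    followers, following, posts = match.groups()
    return {"followers": followers, "following": following, "posts": posts}

stats = parse_profile_stats(
    "12.5m Followers, 33 Following, 1,642 Posts - "
    "See Instagram photos and videos from Google (@google)"
)
print(stats)
# {'followers': '12.5m', 'following': '33', 'posts': '1,642'}
```

You would feed it the `content` attribute selected with soup.select_one above.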

Python Request library not helping in Getting the correct webpage

I have this website:
https://xueqiu.com/hq#exchange=CN&firstName=1&secondName=1_0
I am trying to get this webpage through a Python GET request. I have also tried changing the "user-agent", but I am still not able to get the page. I am new to parsing.
import requests

url = 'https://xueqiu.com/hq#exchange=CN&firstName=1&secondName=1_0'
with requests.Session() as session:
    response = session.get(url)
Could someone please help me how to extract it?
Your data is loaded from the URL below in JSON format, so you can extract it directly from the JSON response.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()
        for data in mydata['data']['list']:
            print(data, sep='*')

url = 'https://xueqiu.com/service/v5/stock/screener/quote/list?page=1&size=30&order=desc&orderby=percent&order_by=percent&market=CN&type=sh_sz&_=1606221698728'
scrape(url)
Hope it helps you.
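If you want a couple of columns rather than the full row dicts, you can filter each entry. A minimal sketch: the field names below ('symbol', 'percent') are assumptions based on the query string (order_by=percent), so check the real JSON and adjust:

```python
# Assumed field names based on the query string (order_by=percent);
# inspect the actual JSON response and adjust as needed.
def extract_rows(payload, fields=("symbol", "percent")):
    rows = []
    for item in payload.get("data", {}).get("list", []):
        rows.append({field: item.get(field) for field in fields})
    return rows

# Small sample in the same shape as mydata['data']['list'] above:
sample = {"data": {"list": [{"symbol": "SH600000", "percent": 3.2, "name": "..."}]}}
print(extract_rows(sample))
# [{'symbol': 'SH600000', 'percent': 3.2}]
```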

urllib.request.urlopen not working for a specific website

I used urllib.request.Request with the URL of a memidex.com page, but the urllib.request.urlopen(url) line fails to open it.
import urllib.request
from bs4 import BeautifulSoup

url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request in order to get past the error. BTW, 'HTTP Error 403: Forbidden' was your error, right?
Hope the code below helps you.
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
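If the site still refuses some requests, it can help to wrap the call so a 403 doesn't crash the script. A sketch along those lines (build_request and fetch are hypothetical helper names, not part of urllib):

```python
import urllib.request
import urllib.error

USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"

# Hypothetical helper: attaches a browser-like User-Agent to the request.
def build_request(url, user_agent=USER_AGENT):
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Hypothetical helper: returns the page bytes, or None on an HTTP error
# such as 403, instead of raising.
def fetch(url):
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        print("HTTP Error", err.code, "for", url)
        return None
```

fetch("http://www.memidex.com/" + term) then behaves like the answer above, but reports the status code rather than aborting.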

Problems with user agent on urllib

I wanted to get the HTML of a web site, but I can't get it, I suppose because of the user agent: when I call uClient = ureq(my_url) I get an error like this: urllib.error.HTTPError: HTTP Error 403: Forbidden
This is the code:
from urllib.request import urlopen as ureq, Request
from bs4 import BeautifulSoup as soup
my_url= 'https://hsreplay.net/meta/#tab=matchups&sortBy=winrate'
ureq(Request(my_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}))
uClient=ureq(my_url)
page_html=uClient.read()
uClient.close()
html=soup(page_html,"html.parser")
I have tried other methods of changing the user agent, and other user agents, but it doesn't work.
I'm pretty sure you can help. Thanks!!
What you did above is clearly a mess; that code should not run at all. Try the approach below instead.
from bs4 import BeautifulSoup
from urllib.request import Request,urlopen
URL = "https://hsreplay.net/meta/#tab=matchups&sortBy=winrate"
req = Request(URL,headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res,"lxml")
name = soup.find("h1").text
print(name)
Output:
HSReplay.net
Btw, you can scrape the few items on that page that are not JavaScript-generated. However, the core content of the page is generated dynamically, so you can't grab it using urllib and BeautifulSoup. To get it you need to use a browser simulator like selenium etc.

Cache Access Denied. Authentication Required in requests module

I am trying to make a basic web crawler. My internet connection goes through a proxy, so I used the solution given here. But I still get the error when running the code.
My code is:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup

proxies = {
    "http": r"http://usr:pass@202.141.80.22:3128",
    "https": r"http://usr:pass@202.141.80.22:3128",
}

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url, proxies=proxies)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
But while running on terminal in ubuntu 14.04 I am getting the following error:
The following error was encountered while trying to retrieve the URL: http://www.santabanta.com/wallpapers/gauhar-khan/? Cache Access Denied. Sorry, you are not currently allowed to request http://www.santabanta.com/wallpapers/gauhar-khan/? from this cache until you have authenticated yourself.
The URL posted by me is: http://www.santabanta.com/wallpapers/gauhar-khan/
Please help me
Open the URL.
Hit F12 (Chrome users).
Now go to "Network" in the panel below.
Hit F5 to reload the page so that Chrome records all the data received from the server.
Open any of the received files and scroll down to "Request Headers".
Pass all of those headers to requests.get().
Here is an image to help you: http://i.stack.imgur.com/zUEBE.png
Make the header as follows:
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
    'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
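With the headers assembled, you can attach them (and the proxies) to a session so that every request carries them. A minimal sketch, reusing a subset of the headers above and the placeholder credentials from the question (the actual GET is commented out so the sketch runs offline):

```python
import requests

# Subset of the headers captured from the browser (see the dict above).
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
    "Proxy-Authorization": "Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=",
}

session = requests.Session()
session.headers.update(headers)
session.proxies.update({
    "http": "http://usr:pass@202.141.80.22:3128",   # placeholder credentials
    "https": "http://usr:pass@202.141.80.22:3128",
})
# response = session.get("http://www.santabanta.com/wallpapers/gauhar-khan/")
print(session.headers["User-Agent"])
```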
There is another way to solve this problem.
You can let your Python script use the proxy defined in your environment variables.
Open a terminal (CTRL + ALT + T) and run:
export http_proxy="http://usr:pass@proxy:port"
export https_proxy="https://usr:pass@proxy:port"
Then remove the proxy lines from your code.
Here is the changed code:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
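You can check which proxies Python will pick up from the environment with the standard library. A quick sketch, using the same placeholder credentials as the export commands above:

```python
import os
import urllib.request

# Placeholder credentials, as in the export commands above.
os.environ["http_proxy"] = "http://usr:pass@proxy:port"

# getproxies() reads http_proxy/https_proxy from the environment,
# which is also where requests picks them up by default.
proxies = urllib.request.getproxies()
print(proxies["http"])
# http://usr:pass@proxy:port
```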
