I am trying to download a large number of HTML pages from a certain website, with the following Python code using the "requests" package:
import time
import requests

FROM = 547495
TO = 570000

# DEFAULT_URL is the site's base URL; the page number is appended to it
for page_number in range(FROM, TO):
    url = DEFAULT_URL + str(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        with open(str(page_number) + ".html", "wb") as file:
            file.write(response.content)
    time.sleep(0.5)
I put in the sleep(0.5) call so that the web server will not mistake the traffic for a DDoS attack.
After about 20,000 pages, I started getting only 403 Forbidden HTTP status codes, and I can no longer download pages.
But if I try to open the same pages in my browser, they open fine, so I guess the web server did not block me entirely.
Does someone have an idea what caused it, and how can I handle it?
Thank you.
Make the request look like it comes from your browser by using headers, and set a cookie ID if the site requires a session; here is an example. You can retrieve the header values by inspecting the "Network" tab in your browser's developer tools when visiting the pages.
with requests.session() as sess:
    # pretend to be a regular browser
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    # make an initial request so the server hands out a session cookie
    sess.get(DEFAULT_URL + str(FROM))
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))
    for page_number in range(FROM, TO):
        url = DEFAULT_URL + str(page_number)
        response = sess.get(url)
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as file:
                file.write(response.content)
        time.sleep(0.5)
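If the User-Agent alone is not enough, you can copy more of the headers your browser sends (taken from the same "Network" tab) into the session before the loop starts. A sketch with typical values, which you should replace with whatever your own browser actually sends:

sess.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": DEFAULT_URL,  # assumed: the site's own base URL as the referer
})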
Related
Recently, I have been trying to download an image from a website.
I searched for the displayed image element inside the HTML.
Then I opened the image URL in a new tab, but it returned a 403 Forbidden page.
I copied the string and inserted it into another page's HTML, and the image displayed successfully.
I want to ask about the reason for this, and what I can do to download the image.
(I am trying to download it through Python's requests.get().)
Thank you.
This web server checks the Referer header when you request the image. To successfully download the image, the Referer must be the page the image is on. It doesn't care about the User-Agent. I assume the image showed up when you put it in another page because your browser cached the image, and did not actually request it from the server again.
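As a minimal illustration of that Referer check (using the image URL and referer that appear later in this answer), plain requests is enough for the download step itself once you already have the direct image URL:

import requests

# direct image URL and the page it is embedded on (values from the example below)
image_url = "https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw"
referer = "https://tw.manhuagui.com/comic/35275/481200.html"

response = requests.get(image_url, headers={"Referer": referer})
print(response.status_code, len(response.content))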
By using your browser's network monitor tool, you can see how your browser got the image's URL. In this case, the URL wasn't a part of the original html document. Your browser executed some JavaScript that unpacked the URL and inserted an img element into the div element with id="mangaBox". Because of this, you can't use vanilla requests, as it doesn't execute JavaScript. I used Requests-HTML.
The code below downloads the image from the link you gave in your comment, and saves it to disk:
import os
import urllib.parse
from requests_html import HTMLSession

session = HTMLSession()
session.headers.update({
    "User-Agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer": r"https://tw.manhuagui.com/comic/35275/481200.html",
})

url = r"https://tw.manhuagui.com/comic/35275/481200.html"
response = session.get(url)
print(response, len(response.content))

# render the page so the JavaScript that inserts the img element runs
response.html.render()
img = response.html.find("img#mangaFile", first=True)
print("img element:", img)

url = img.attrs["src"]
print("image url:", url)
response = session.get(url)
print(response, len(response.content))

filename = os.path.basename(urllib.parse.urlsplit(url).path)
print("filename:", filename)
with open(filename, "wb") as f:
    f.write(response.content)
Output:
<Response [200]> 6715
img element: <Element 'img' alt='在地下城寻找邂逅难道有错吗? 第00话' id='mangaFile' src='https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw' class=('mangaFile',) data-tag='mangaFile' style='display: block; transform: rotate(0deg); transform-origin: 50% 50% 0px;' imgw='907'>
image url: https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw
<Response [200]> 186386
filename: P0018.jpg.webp
For what it's worth, a whole heap of image URLs, in addition to the main image of the current page, are packed in the last script element of the original html document.
<script type="text/javascript">window["\x65\x76\x61\x6c"](function(p,a,c,k,e,d)...
Some websites block requests that don't include a User-Agent header; try this:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
requests.get(url, headers=headers)
Reference: Python requests. 403 Forbidden
I am trying to make a basic web crawler. My internet connection goes through a proxy, so I used the solution given here. But while running the code I still get an error.
My code is:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

proxies = {
    "http": r"http://usr:pass@202.141.80.22:3128",
    "https": r"http://usr:pass@202.141.80.22:3128",
}

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        source_code = requests.get(url, proxies=proxies)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
But when running it in a terminal on Ubuntu 14.04, I get the following error:
The following error was encountered while trying to retrieve the URL: http://www.santabanta.com/wallpapers/gauhar-khan/? Cache Access Denied. Sorry, you are not currently allowed to request http://www.santabanta.com/wallpapers/gauhar-khan/? from this cache until you have authenticated yourself.
The URL posted by me is: http://www.santabanta.com/wallpapers/gauhar-khan/
Please help me.
Open the URL.
Hit F12 (Chrome users).
Now go to the "Network" tab in the panel below.
Hit F5 to reload the page so that Chrome records all the data received from the server.
Open any of the received files and scroll down to "Request Headers".
Pass all of those headers to requests.get().
Here is an image to help you: http://i.stack.imgur.com/zUEBE.png
Make the header as follows:
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
    'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
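Putting those headers together with the proxy settings from the question, a sketch of the full request (assuming a plain-HTTP URL so the Proxy-Authorization header actually reaches the proxy; substitute your own credentials in the Basic value):

import requests

proxies = {
    "http": "http://202.141.80.22:3128",
    "https": "http://202.141.80.22:3128",
}

url = "http://www.santabanta.com/wallpapers/gauhar-khan/"
response = requests.get(url, headers=headers, proxies=proxies)
print(response.status_code)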
There is another way to solve this problem.
What you can do is let your Python script use the proxy defined in your environment variables.
Open a terminal (CTRL + ALT + T) and run:
export http_proxy="http://usr:pass@proxy:port"
export https_proxy="https://usr:pass@proxy:port"
Then remove the proxies lines from your code.
Here is the changed code:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

url = input("Ask user for something")

def santabanta(max_pages, url):
    page = 1
    while page <= max_pages:
        # no proxies argument: requests uses http_proxy/https_proxy from the environment
        source_code = requests.get(url)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1

santabanta(1, url)
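If you would rather not rely on the shell, you can also set these environment variables from inside the script before the first request is made, since requests reads them at request time. A sketch (usr:pass@host:port are placeholders for your own proxy credentials):

import os

# requests picks these up automatically when the request is made
os.environ["http_proxy"] = "http://usr:pass@202.141.80.22:3128"
os.environ["https_proxy"] = "http://usr:pass@202.141.80.22:3128"

santabanta(1, url)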
I was trying to figure out how to download files from a web hosting site like Zippyshare. I saw this post, How to download in bash from zippyshare?, which shows how to use wget and manually add the cookie from the browser to the header in wget. That works. But I want to use Python to get the cookie and then execute wget, so that I can do this programmatically (for example, scraping a bunch of download links).
I came up with this hacky script to get the cookie and execute the wget command, but it seems the cookie is not good, because I get a 302 redirect:
import urllib2, os

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

url = "http://www67.zippyshare.com/d/64003087/2432/Alex%20Henning%2c%20Laurie%20Webb%20-%20In%20Your%20Arms%20%28Joy%20Kitikonti%20Remix%29%20%5bquality-dance-music.com%5d.mp3"
referer = "http://www67.zippyshare.com/v/64003087/file.html"

response = urllib2.urlopen(HeadRequest(url))
headers = response.info()
jcookieString = headers['Set-Cookie'].split(';')[0]  # [11:]
# print headers
print "jcookie string " + jcookieString

wgetString = "wget " + url + " --referer=" + referer + " --cookies=off --header \"Cookie: " + jcookieString + "\"" + " --user-agent=\"Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1\""
os.system(wgetString)
I also tried using Python's cookielib, but got the same 302-redirect behavior. Thanks.
EDIT: Using requests, here is the code now. It persists the cookie that comes from the referer request, because I am using a session to make the requests... yet still no go:
Looking at response.history shows that the 302 redirect is still happening for some reason.
import requests

downloadUrl = "http://www67.zippyshare.com/d/3278160/42939/Andre%20Nazareth%20-%20Bella%20Notte%20%28Original%20Mix%29%20%5bquality-dance-music.com%5d.mp3"
referer = "http://www67.zippyshare.com/v/3278160/file.html"
header = {"user-agent": "Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1",
          "referer": referer}

refererSession = requests.Session()
refererSession.get(referer)

downloadResponse = refererSession.get(downloadUrl, headers=header)
print downloadResponse.request.headers
print downloadResponse.status_code
if downloadResponse.status_code == 200:
    mp3Name = "song2.mp3"
    song = open(mp3Name, "wb")
    song.write(downloadResponse.content)
    song.close()
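To see where that 302 is actually pointing, one option is to turn off automatic redirects and inspect the Location header, or to walk response.history after letting it follow. A quick sketch built on the same session and variables as above:

# ask requests not to follow the redirect, so we can look at it directly
probe = refererSession.get(downloadUrl, headers=header, allow_redirects=False)
print probe.status_code
print probe.headers.get("Location")

# or let it follow and walk the recorded redirect chain
followed = refererSession.get(downloadUrl, headers=header)
for hop in followed.history:
    print hop.status_code, hop.url
print followed.status_code, followed.url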
Using a system call from within Python should really be left for situations where there is no other choice. Use the requests library, like so:
import requests

header = {"user-agent": "Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1",
          "referer": referer}
cookies = dict(cookie_name='cookie_text')
r = requests.get(url, headers=header, cookies=cookies)
If it doesn't work, maybe the settings themselves aren't suitable for what you are trying to do. I am perplexed as to why you both set the cookie and pass --cookies=off in the wget command.
I am trying to learn Python, but I have no knowledge of HTTP. I read some posts here about how to use requests to log in to a website, but it doesn't work. My simple code is here (not the real number and password):
#!/usr/bin/env python3
import requests

login_data = {'txtDID': '111111111',
              'txtPswd': 'mypassword'}

with requests.Session() as c:
    c.post('http://phone.ipkall.com/login.asp', data=login_data)
    r = c.get('http://phone.ipkall.com/update.asp')
    print(r.text)
    print("Done")
But I can't get my personal information, which should be shown after login. Can anyone give me a hint or point me in the right direction? I have no idea what's going wrong.
Servers don't like bots (scripts) for security reasons, so your script has to behave like a human using a real browser. First use get() to obtain the session cookies, and set the User-Agent in the headers to a real one. Use http://httpbin.org/headers to see what User-Agent is sent by your browser.
Always check the results r.status_code and r.url.
So you can start with this:
(I don't have an account on this server, so I can't test it.)
#!/usr/bin/env python3
import requests

s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0",
})

# --------
# to get cookies, session ID, etc.
r = s.get('http://phone.ipkall.com/login.asp')
print(r.status_code, r.url)

# --------
login_data = {
    'txtDID': '111111111',
    'txtPswd': 'mypassword',
    'submit1': 'Submit'
}

r = s.post('http://phone.ipkall.com/process.asp?action=verify', data=login_data)
print(r.status_code, r.url)

# --------
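Once the login POST succeeds, fetch the page you actually want with the same session so the cookies are carried along (untested, continuing the code above):

# the session still holds the login cookies, so this should return your data
r = s.get('http://phone.ipkall.com/update.asp')
print(r.status_code, r.url)
print(r.text)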
BTW: If the page uses JavaScript, you have a problem, because requests can't run JavaScript on the page.
I want to send a POST request to the page after opening it using Python (using urllib2.urlopen). The webpage is http://wireless.walmart.com/content/shop-plans/?r=wm
The code I am using right now is:
import urllib
import urllib2

url = 'http://wireless.walmart.com/content/shop-plans/?r=wm'
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
values = {'carrierID': '68',
          'conditionToType': '1',
          'cssPrepend': 'wm20',
          'partnerID': '36575'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()

walmart = open('Walmart_ContractPlans_ATT.html', 'wb')
walmart.write(page)
walmart.close()
This is giving me the page that opens by default. After inspecting the page using Firebug, I came to know that carrierID: 68 is sent when I click the button that triggers this POST request.
I want to simulate this browser behaviour.
Please help me in resolving this.
For web scraping I prefer to use requests and pyquery. First you download the data:
import requests
from pyquery import PyQuery as pq
url = 'http://wireless.walmart.com/content/getRatePlanInfo'
payload = {'carrierID':68, 'conditionToType':1, 'cssPrepend':'wm20'}
r = requests.post(url, data=payload)
d = pq(r.text)
After this you proceed to parse the elements, for example to extract all plans:
plans = []
plans_selector = '.wm20_planspage_planDetails_sub_detailsDiv_ul_li'
d(plans_selector).each(lambda i, n: plans.append(pq(n).text()))
Result:
['Basic 200',
'Simply Everything',
'Everything Data 900',
'Everything Data 450',
'Talk 450',
...
I recommend looking at a browser emulator like mechanize, rather than trying to do this with raw HTTP requests.
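For completeness, a rough sketch of what the mechanize approach could look like (untested; the form index and field name are assumptions you would need to confirm with Firebug):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]

br.open('http://wireless.walmart.com/content/shop-plans/?r=wm')
br.select_form(nr=0)           # assumed: the plan-selection form is the first form on the page
br.form['carrierID'] = '68'    # assumed field name, taken from the POST data seen in Firebug
response = br.submit()

with open('Walmart_ContractPlans_ATT.html', 'wb') as f:
    f.write(response.read())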