I'm trying to recursively download all of the directories and files of a website, starting from its root. I tried writing the code for this, but I am unsure how to get Python to create the appropriate directories and files. I understand how to use requests to make GET requests to sites, but I don't know how to actually write the needed files to my system. Here's my code:
import os
import requests
import getpass

class Fetch:
    url = ''

    def __init__(self, url):
        self.url = url
        user = getpass.getuser()
        os.chdir('/home/' + user)

    def download(self):
        r = requests.get(self.url, stream=True)
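For illustration, a minimal sketch of the missing piece (the helper name save_url, the local_path argument, and the chunk size are assumptions, not part of the original code): stream the response and write it chunk by chunk, creating any parent directories first.

import os
import requests

def save_url(url, local_path):
    # create any missing parent directories (exist_ok requires Python 3)
    os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(local_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)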
I am downloading some files from the FAO GAEZ database, which uses an HTTP POST-based login form.
I am thus using the requests module. Here is my code:
my_user = "blabla"
my_pass = "bleble"
site_url = "http://www.gaez.iiasa.ac.at/w/ctrl?_flow=Vwr&_view=Welcome&fieldmain=main_lr_lco_cult&idPS=0&idAS=0&idFS=0"
file_url = "http://www.gaez.iiasa.ac.at/w/ctrl?_flow=VwrServ&_view=AAGrid&idR=m1ed3ed864793f16e83ba9a5a975066adaa6bf1b0"
with requests.Session() as s:
s.get(site_url)
s.post(site_url, data={'_username': 'my_user', '_password': 'my_pass'})
r = s.get(file_url)
if r.ok:
with open(my_path + "\\My file.zip", "wb") as c:
c.write(r.content)
However, with this procedure I download the HTML of the page.
I suspect that to solve the problem I have to add the name of the zip file to the url, i.e. new_file_url = file_url + "/file_name.zip". The problem is that I don't know the "file_name". I've tried with the name of the file which I obtain when I download it manually, but it does not work.
Any idea how to solve this? If you need more details on the GAEZ website, see also: Python - Login and download specific file from website
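One way to narrow this down (a diagnostic sketch only, reusing site_url, file_url, my_user and my_pass from the code above; it is not a confirmed fix for the GAEZ site): inspect the response headers before writing anything. A real zip download is normally served with an application/zip Content-Type and a Content-Disposition header carrying the file name, while a login or landing page comes back as text/html.

import requests

with requests.Session() as s:
    s.get(site_url)
    s.post(site_url, data={'_username': my_user, '_password': my_pass})
    r = s.get(file_url)
    print(r.status_code)
    print(r.headers.get("Content-Type"))         # text/html means you got a page, not the file
    print(r.headers.get("Content-Disposition"))  # a real download usually names the file here
    if r.content[:2] == b"PK":                   # zip archives start with the bytes "PK"
        with open("My file.zip", "wb") as c:
            c.write(r.content)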
I copied some Python code in order to download data from a website. Here is my specific website:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1
Here is the code which I copied:
import requests
from bs4 import BeautifulSoup

def _getUrls_(res):
    hrefs = []
    soup = BeautifulSoup(res.text, 'lxml')
    main_content = soup.find('div', {'id': 'content-core'})
    table = main_content.find("table")
    for a in table.findAll('a', href=True):
        hrefs.append(a['href'])
    return hrefs

bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)

def _getPdfs_(hrefs, basedir):
    for i in range(len(hrefs)):
        print(hrefs[i])
        respdf = requests.get(hrefs[i])
        pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
        try:
            with open(pdffile, 'wb') as p:
                p.write(respdf.content)
        except FileNotFoundError:
            print("No PDF produced")

basedir = "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)
The code runs without errors, but it did not download anything at all, even though there was obviously no FileNotFoundError.
I tried the following two URLs:
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125
However, both of these URLs just print No PDF produced.
The thing is that the code ran and downloaded successfully for other people, but not for me.
Your code works; I just tested it. You need to make sure that basedir exists; you want to add this to your code:

import os

if not os.path.exists(basedir):
    os.makedirs(basedir)
I used this exact code, but replaced basedir with my own directory, and it worked only after I made sure that the path actually exists. The code does not create the folder in case it does not exist.
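For completeness, a sketch of how the download function could look with the directory check folded in (this is not the original answer's code; note that the question's script writes into the pdf_dot subfolder of basedir, so that is the path that actually has to exist):

import os
import requests

def _getPdfs_(hrefs, basedir):
    outdir = os.path.join(basedir, "pdf_dot")
    os.makedirs(outdir, exist_ok=True)  # create Desktop/pdf_dot if it is missing
    for href in hrefs:
        respdf = requests.get(href)
        pdffile = os.path.join(outdir, href.split("/")[-1] + ".pdf")
        with open(pdffile, 'wb') as p:
            p.write(respdf.content)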
As others have pointed out, you need to create basedir beforehand. The user running the script may not have the directory created. Make sure you insert this code at the beginning of the script, before the main logic.
Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the user's %USERPROFILE% environment variable:

from os import environ
from os.path import join

basedir = join(environ["USERPROFILE"], "Desktop", "pdf_dot")

Which would be the same as C:\Users\blah\Desktop\pdf_dot.
However, the above environment variable only works on Windows. If you want it to work on Linux, you will have to use os.environ["HOME"] instead.
If you need to transfer between both systems, then you can use os.name:
from os import name, environ
from os.path import join

# Windows
if name == 'nt':
    basedir = join(environ["USERPROFILE"], "Desktop", "pdf_dot")
# Linux
elif name == 'posix':
    basedir = join(environ["HOME"], "Desktop", "pdf_dot")
You don't need to specify the directory or create any folder manually. All you need to do is run the following script. When the execution is done, you should get a folder named pdf_dot on your desktop containing the pdf files you wish to grab.
import requests
from bs4 import BeautifulSoup
import os

URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'

dirf = os.path.join(os.environ['USERPROFILE'], 'Desktop', 'pdf_dot')
if not os.path.exists(dirf):
    os.makedirs(dirf)
os.chdir(dirf)

res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a", {"data-linktype": "internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
    filename = f'{pdflink.split("/")[-1]}.pdf'
    with open(filename, 'wb') as f:
        f.write(requests.get(pdflink).content)
I have a web application with a REST API, and that application contains some video files. Now I am creating an intermediate server for it. I access my web content through the API, but I need to check for updates at regular intervals, and if there are new updates I need to download them.
The files are video files, and I'm using Flask.
I tried this, but I'm not getting it to work:
from flask import Flask, render_template, json, jsonify
from pathlib import Path
import multiprocessing
import sys
import schedule, wget, requests, json, os, errno, shutil, time, config

def get_videos():
    response = requests.get('my api here')
    data = response.json()
    files = list()  # collecting my list of video files
    l = len(data)
    for i in range(l):
        files.append(data[i]['filename'])
    return files

def checkfor_file(myfiles):
    for i in range(len(myfiles)):
        url = 'http://website.com/static/Video/' + myfiles[i]  # checking whether the file already exists in my folder
        if url:
            os.remove(url)
        else:
            pass

def get_newfiles(myfiles):
    for i in range(len(myfiles)):
        url = config.videos + myfiles[i]
        filename = wget.download(url)  # downloading video files here

def move_files(myfiles):
    for i in range(len(myfiles)):
        file = myfiles[i]
        shutil.move(config.source_files + file, config.destinatin)  # moving the downloaded files to another folder

def videos():
    files = set(get_videos())  # keeping unique files only
    myfiles = list(files)
    checkfor_file(myfiles)
    get_newfiles(myfiles)
    move_files(myfiles)

def job():
    videos()

schedule.every(10).minutes.do(job)  # running every ten minutes

while True:
    schedule.run_pending()
    time.sleep(1)

pi = Flask(__name__)

@pi.route('/')
def index():
    response = requests.get('myapi')
    data = response.json()
    return render_template('main.html', data=data)
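One structural issue worth pointing out (a hedged observation, not a verified fix): the while True loop above runs before pi = Flask(__name__) is ever reached, so the route is never registered and the app never starts. A minimal sketch of how the tail of the script could be restructured, assuming the functions above stay as they are and the module-level schedule.every(...) / while True lines are replaced by this:

import threading

def run_scheduler():
    # register the job and poll it off the main thread so Flask is not blocked
    schedule.every(10).minutes.do(job)
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == '__main__':
    threading.Thread(target=run_scheduler, daemon=True).start()
    pi.run()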
I want to have a user input a file URL and then have my django app download the file from the internet.
My first instinct was to call wget inside my django app, but then I thought there may be another way to get this done. I couldn't find anything when I searched. Is there a more django way to do this?
You are not really dependent on Django for this.
I happen to like using the requests library.
Here is an example:
import requests

def download(url, path, chunk_size=2048):
    req = requests.get(url, stream=True)
    if req.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in req.iter_content(chunk_size):
                f.write(chunk)
        return path
    raise Exception('Given url returned status code: {}'.format(req.status_code))
Place this in a file and import it into your module whenever you need it.
Of course this is very minimal, but it will get you started.
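For context, a hypothetical way this could be wired into a Django view (the view name fetch_file, the url query parameter, and the MEDIA_ROOT destination are illustrative assumptions, not part of the original answer):

import os
from django.conf import settings
from django.http import HttpResponse

def fetch_file(request):
    # download ?url=... into MEDIA_ROOT using the download() helper above
    url = request.GET.get('url')
    dest = os.path.join(settings.MEDIA_ROOT, os.path.basename(url))
    download(url, dest)
    return HttpResponse('Saved to {}'.format(dest))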
You can use urlopen from urllib2 like in this example:
import urllib2

pdf_file = urllib2.urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf', 'wb') as output:
    output.write(pdf_file.read())
For more information, read the urllib2 docs.
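Side note (not from the original answer): urllib2 only exists on Python 2; on Python 3 the same call lives in urllib.request.

from urllib.request import urlopen

pdf_file = urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf', 'wb') as output:
    output.write(pdf_file.read())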
I wrote the following Python code to crawl the images from the website www.style.com
import urllib2, urllib, random, threading
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

class Images(threading.Thread):
    def __init__(self, lock, src):
        threading.Thread.__init__(self)
        self.src = src
        self.lock = lock

    def run(self):
        self.lock.acquire()
        urllib.urlretrieve(self.src, './img/' + str(random.choice(range(9999))))
        print self.src + 'get'
        self.lock.release()

def imgGreb():
    lock = threading.Lock()
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        print i.get('src')
        Images(lock, i.get('src')).start()

if __name__ == '__main__':
    imgGreb()
But I got this error:
IOError: [Errno 2] No such file or directory: '/images/homepage-2013-october/header/logo.png'
How can it be solved?
Also can this recursively find all the images in the website? I mean other images that are not on the homepage.
Thanks!
You are using the relative path, without the domain, when you try to retrieve the URL.
Some of the images are JavaScript based and their src comes back as javascript:void(0);, for which you will never get a page. I added the try/except to get around that error. Alternatively, you can detect whether the URL ends with jpg/gif/png or not; I will leave that work to you :)
BTW, not all the images are included in the HTML; some of the pictures (the beautiful ones) are loaded using JavaScript, and there is nothing we can do about those using urllib and BeautifulSoup alone. If you really want to challenge yourself, maybe you can try learning Selenium, which is a more powerful tool.
Try the code below directly:
import urllib2
from bs4 import BeautifulSoup
import sys
from urllib import urlretrieve

reload(sys)

def imgGreb():
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        try:
            # build the complete URL using the domain and the relative url you scraped
            url = site_url + i.get('src')
            # get the file name
            name = "result_" + url.split('/')[-1]
            # detect if that is a type of picture you want
            type = name.split('.')[-1]
            if type in ['jpg', 'png', 'gif']:
                # if so, retrieve the picture
                urlretrieve(url, name)
        except:
            pass

if __name__ == '__main__':
    imgGreb()
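A small aside that is not in the original answer: instead of concatenating site_url and the scraped src by hand, urljoin copes with relative, absolute, and protocol-relative src values alike.

# urljoin lives in urlparse on Python 2 and in urllib.parse on Python 3
from urlparse import urljoin

print urljoin("http://www.style.com", "/images/logo.png")              # -> http://www.style.com/images/logo.png
print urljoin("http://www.style.com", "http://cdn.example.com/a.png")  # absolute src values are left untouched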