I was messing around in Python 3.x yesterday, and I wanted to scrape all of the images off an HTTPS website. This is the code I have so far:
import urllib
import urllib.request
import urllib.error

idnum = 190154
ur = 'https://skystorage.iscorp.com/pictures/IL/Lincolnway//%d' % idnum
url = ur + '.JPG?rev=0'
filename = str(idnum) + '.JPG'
idnum = idnum + 1

try:
    urllib.request.urlretrieve(url, filename)
except urllib.error.URLError as e:
    print(e.reason)
This, however, is not working at all as planned, as the URL is HTTPS and urllib does not seem to support this. How would I be able to do something similar to scrape the images?
There is a fair amount of work to do, but I would like to help.
The first thing to know is that, given an HTML page, you must first build a list of the URLs of the images you want to download. For that it helps to know what a regular expression is and how to use Python's re library.
With re you can search the HTML source for the image URLs.
Then write a method that saves to your computer every image in the list you built.
I hope this was helpful.
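For example, here is a minimal Python 3 sketch of that two-step approach. The page URL below is just a placeholder, and the regex only catches simple <img src="..."> tags:

import os
import re
import urllib.error
import urllib.parse
import urllib.request

page_url = 'https://example.com/gallery'  # placeholder: the page that lists the images

# 1. fetch the HTML and collect the image URLs with a regular expression
html = urllib.request.urlopen(page_url).read().decode('utf-8', errors='replace')
img_urls = re.findall(r'<img[^>]+src="([^"]+)"', html)

# 2. download every image in the list
for img_url in img_urls:
    img_url = urllib.parse.urljoin(page_url, img_url)          # resolve relative paths
    filename = os.path.basename(urllib.parse.urlsplit(img_url).path)
    try:
        urllib.request.urlretrieve(img_url, filename)
        print('saved', filename)
    except urllib.error.URLError as e:
        print('failed', img_url, e.reason)

Note that urllib.request in Python 3 does handle HTTPS (as long as Python was built with SSL support), so the scheme itself is unlikely to be what is failing in your original code.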
Related
I'm trying to scrape this image using urllib.urlretrieve.
>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
path) # path was previously defined
This code successfully saves the file in the given path. However, when I try to open the file, I get:
Could not load image 'imagename.jpg':
Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)
When I do file imagename.jpg in my bash terminal, I get imagename.jpg: HTML document, ASCII text.
So how do I scrape this image as a JPEG file?
It's because the owner of the server hosting that image is deliberately blocking access from Python's urllib. That's why it's working with requests. You can also do it with pure Python, but you'll have to give it an HTTP User-Agent header that makes it look like something other than urllib. For example:
import urllib2
req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()
with open(path, 'wb') as outfile:
    outfile.write(imgdata)
So it's a little more involved to get around, but still not too bad.
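For reference, a rough Python 3 equivalent of the same trick uses urllib.request; the User-Agent string is arbitrary and the output file name here is just an example:

import urllib.request

url = 'http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# fetch the image bytes with the custom header, then write them to disk
with urllib.request.urlopen(req) as resp:
    imgdata = resp.read()

with open('one-piece-1668214.jpg', 'wb') as outfile:
    outfile.write(imgdata)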
Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.
I'm trying to download fanart images from the fanart.tv API, so I wrote a script to build the API call and collect the URLs. The code might need some cleanup, but I guess it is functional for now:
APICaller.py
My problem now is to save the images with the generic filename that is given within the URL.
For example
I call my script with these args:
python APICaller.py -a "Madonna" -p "C:/temp" -n "Madonna - Hung up"
As a result I receive:
'http://assets.fanart.tv/fanart/music/79239441-bfd5-4981-a70c-55c3f15c1287/artistbackground/madonna-4fe25d4f1b951.jpg', 'http://assets.fanart.tv/fanart/music/79239441-bfd5-4981-a70c-55c3f15c1287/artistbackground/madonna-4fe2766aac587.jpg',
... and so on
Now I want to save all images to /extrafanart/madonna-4fe25d4f1b951.jpg ...
What is the best way to handle it? urlparse, split or parsing with regex maybe?
Please help, this is very frustrating :(
import os
import urllib

# urllist: the list of image URLs collected by APICaller.py
for url in urllist:
    # last path segment of the URL, e.g. madonna-4fe25d4f1b951.jpg
    filename = url.rstrip('/').rsplit('/', 1)[-1]
    path = os.path.join(os.path.join(os.path.sep, 'extrafanart'), filename)
    urllib.urlretrieve(url, path)
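Since you mentioned urlparse: an equivalent sketch that parses the URL first, which also keeps any query string out of the filename. Python 3 names are shown; in Python 2 the same functions live in urlparse and urllib:

import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve

# URLs as returned by APICaller.py (taken from the question)
urllist = [
    'http://assets.fanart.tv/fanart/music/79239441-bfd5-4981-a70c-55c3f15c1287/artistbackground/madonna-4fe25d4f1b951.jpg',
    'http://assets.fanart.tv/fanart/music/79239441-bfd5-4981-a70c-55c3f15c1287/artistbackground/madonna-4fe2766aac587.jpg',
]

target_dir = os.path.join(os.path.sep, 'extrafanart')     # -> /extrafanart

for url in urllist:
    # urlsplit(url).path is the path component; basename strips the directories
    filename = os.path.basename(urlsplit(url).path)        # e.g. madonna-4fe25d4f1b951.jpg
    urlretrieve(url, os.path.join(target_dir, filename))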
I want to save all images from a site. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com, since in the image folder it just drops HTML files, and nothing has an extension.
I found a python script, the usage is like this:
[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize
Finally I got the script running after some BeautifulSoup problems.
However, I can't find the files anywhere. I also tried "/" as the output dir in the hope the images would end up at the root of my HD, but no luck. Can someone either help me simplify the script so it outputs to the directory I cd'd to in the terminal, or give me a command that should work? I have zero Python experience and I don't really want to learn Python for a two-year-old script that maybe doesn't even work the way I want.
Also, how can I pass an array of websites? With a lot of scrapers I only get the first few results of the page. Tumblr loads on scroll, but that has no effect here, so I would like to add /page1 etc.
thanks in advance
# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
#from BeautifulSoup import BeautifulSoup # for HTML parsing
import bs4
from bs4 import BeautifulSoup

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList: # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))

    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        # download only the proper image files
        if imgUrl.lower().endswith('.jpeg') or \
           imgUrl.lower().endswith('.jpg') or \
           imgUrl.lower().endswith('.gif') or \
           imgUrl.lower().endswith('.png') or \
           imgUrl.lower().endswith('.bmp'):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minFileSize:
                    print " " + imgUrl
                    fileName = basename(urlsplit(imgUrl)[2])
                    output = open(fileName, 'wb')
                    output.write(imgData)
                    output.close()
            except:
                pass
    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1, minFileSize)
                except:
                    pass

# main
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')

global website
website = netloc[-2] + netloc[-1]

downloadImages(rootUrl, 1, 50000)
As Frxstream has commented, this program creates the files in the current directory (i.e. where you run it). After running the program, run ls -l (or dir) to find the files it has created.
If it seemingly hasn't created any files, then it most probably really hasn't, because an exception was raised and your except: pass hid it. To see what went wrong, replace try: ... except: pass with just the body of the try block, and rerun the program. (If you can't understand and fix the resulting error, ask a separate Stack Overflow question.)
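For example, here is a standalone sketch of that pattern. The fetch function is a made-up stand-in for the urllib2 calls, just to show the failure being reported instead of swallowed:

import traceback

def fetch(url):
    # stand-in for urllib2.urlopen(url).read(); fails on purpose here
    raise IOError('cannot reach %s' % url)

for url in ['http://example.com/a.jpg', 'http://example.com/b.jpg']:
    try:
        fetch(url)
    except Exception:
        # print the full error instead of hiding it with a bare "except: pass"
        traceback.print_exc()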
It's hard to tell without looking at the errors (+1 for turning off your try/except block so you can see the exceptions), but I do see one typo here:
fileName = basename(urlsplit(imgUrl)[2])
You didn't do "from urlparse import urlsplit"; you have "import urlparse", so you need to refer to it as urlparse.urlsplit(), as you do in other places. It should be like this:
fileName = basename(urlparse.urlsplit(imgUrl)[2])
Hi, I searched a lot and ended up with no relevant results on how to save a webpage using Python 2.6 and rename it while saving.
Better, use the requests library:
import requests

pagelink = "http://www.example.com"
page = requests.get(pagelink)

with open('/path/to/file/example.html', "w") as file:
    file.write(page.text)
You may want to use the urllib(2) package to access the webpage, and then save the file object to the desired location (os.path).
It should look something like this:
import urllib2, os

pagelink = "http://www.example.com"
page = urllib2.urlopen(pagelink)

# save the page body under a chosen filename in the target directory
with open(os.path.join('/(full)path/to/Documents', 'example.html'), "w") as file:
    file.write(page.read())
How can I modify the code below to capture all e-mails instead of images:
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "URL WITH IMAGES"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass
I need to get a directory from an array of websites. I'm using C++ to create code for Unix by calling the .py file multiple times and then appending its output to an existing file each time.
Parsing/validating an email address requires a strong regex; you can look those up on Google. Here is a simple email-address-parsing regex:
emails = re.findall(r'([a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,3})', urlContent)
This is just a rudimentary example; you need to use a more robust one for real validation.
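Plugged into the loop from the question, a minimal sketch might look like this (Python 2, to match the question's code; the URL placeholder and the emails.txt file name are just examples, and the addresses are written to a file rather than "downloaded"):

import re
import urllib2

url = "URL WITH EMAIL ADDRESSES"   # placeholder, like "URL WITH IMAGES" above
urlContent = urllib2.urlopen(url).read()

# same rudimentary pattern as above; swap in a stronger one for real use
emails = re.findall(r'([a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,3})', urlContent)

# collect the addresses in one text file instead of downloading image files
with open('emails.txt', 'w') as output:
    for email in emails:
        output.write(email + '\n')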