I'm working on a program that uses Beautiful Soup to scrape a website, and then urllib to retrieve images found on the website (using each image's direct URL). The website I'm scraping isn't the original host of the image, but it does link to the original image. The problem I've run into is that for certain websites, retrieving www.example.com/images/foobar.jpg redirects me to the homepage www.example.com and produces an empty (0 KB) image. In fact, going to www.example.com/images/foobar.jpg directly redirects as well. Interestingly, on the website I'm scraping, the image shows up normally.
I've seen some examples on SO, but they all explain how to capture cookies, headers, and other similar data from websites while getting around the redirect, and I was unable to get them to work for me. Is there a way to prevent a redirect and get the image stored at www.example.com/images/foobar.jpg?
This is the block of code that saves the image:
import os
from urllib import urlretrieve
...
ctr = 0
for imData in imList:
    imurl = imData['imurl']                  # direct URL of the image
    fName = os.path.basename(URL)            # base the file name on the scraped page's URL
    fName, ext = os.path.splitext(fName)
    fName += "_%02d" % (ctr,) + ext          # append a counter before the extension
    urlretrieve(imurl, fName)                # download the image to fName
    ctr += 1
The code that handles all the scraping is too long to reasonably put here. But I have verified that imData['imurl'] holds the correct URL for the image, for example http://upload.wikimedia.org/wikipedia/commons/9/95/Brown_Bear_cub_in_river_1.jpg. However, certain images redirect, like: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/brown-bear-in-dog-salmon-creek.jpg.
The website you are attempting to download the image from may have extra checks in place to limit screen scraping. A common check is the Referer header, which you can try adding to the urllib request:
req = urllib2.Request('<img url>')
req.add_header('Referer', '<page url / domain>')
For example, the request my browser used for an alpaca image from the website you referenced includes a Referer header:
Request URL:http://www.public-domain-image.com/cache/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos_w725_h544.jpg
Request Method:GET
....
Referer:http://www.public-domain-image.com/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos.jpg.html
User-Agent:Mozilla/5.0
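Putting that together, here is a minimal sketch using urllib2 (matching the snippet above); the image URL, page URL, and output filename are placeholders you would fill in from your scrape:

import urllib2

img_url = '<img url>'    # direct URL of the image found while scraping
page_url = '<page url>'  # the page on which the image appears

req = urllib2.Request(img_url)
req.add_header('Referer', page_url)         # make the request look like it came from the page
req.add_header('User-Agent', 'Mozilla/5.0')

data = urllib2.urlopen(req).read()
with open('image.jpg', 'wb') as f:          # output filename is just an example
    f.write(data)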
I'm attempting to pull JSON from my Tumblr page so that I can display an image in a Jupyter notebook globally, rather than just pasting locally saved images. Ideally, I'd GET the JSON, extract the PNG URL, and then use BytesIO and PIL to display the image.
However, when I send a get request to the server:
import json
import requests
url = 'https://www.tumblr.com/blog/ims4jupyter'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
I get a JSONDecodeError. Typing,
r.content
into the terminal returns an HTML-formatted webpage. I think this means that Tumblr refuses to return JSON, but other websites (such as YouTube, for example) won't return JSON either.
The solution to this question was that the website I was trying to request JSON from did not accept those requests. If you copy my code, but instead use https://api.my-ip.io/ip.json (as an example), the original code will run.
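For instance, this variation of the code above runs cleanly against an endpoint that actually serves JSON (using the example URL mentioned above):

import requests

url = 'https://api.my-ip.io/ip.json'   # an endpoint that does return JSON
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())                        # parses without a JSONDecodeError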
I am currently scraping the amazon.com website with requests and proxies.
I'm using the requests module with Python 3.
The issue I am facing is that Amazon returns results based on the geolocation of the proxy rather than the results amazon.com shows for the US.
How can I make sure that I only get the US results from amazon.com?
Here is my code:
import requests
url = 'https://www.amazon.com/s?k=a%20circuit'
proxy_dict = # Proxy details in the dictionary form
response = requests.get(url, proxies=proxy_dict, timeout=(3, 3))
htmlText = response.text
# I then save the details in a text file.
The code works fine, but it returns the results for the location of the proxy.
How can I make sure that the results will be only for the US?
Should I perhaps add something to the URL or to the headers?
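For reference, this is the shape of proxy_dict that requests expects; the proxy address below is a placeholder, not a working proxy:

proxy_dict = {
    'http': 'http://user:pass@proxy.example.com:8080',    # placeholder proxy address
    'https': 'http://user:pass@proxy.example.com:8080',
}
response = requests.get(url, proxies=proxy_dict, timeout=(3, 3))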
I'm trying to scrape multiple financial websites (Wells Fargo, etc.) to pull my transaction history for data analysis purposes. I can do the scraping part once I get to the page I need; the problem I'm having is getting there. I don't know how to pass my username and password and then navigate from there. I would like to do this without actually opening a browser.
I found Michael Foord's article "HOWTO Fetch Internet Resources Using The urllib Package" and tried to adapt one of the examples to meet my needs but can't get it to work (I've tried adapting to several other search results as well). Here's my code:
import bs4
import urllib.request
import urllib.parse
##Navigate to the website.
url = 'https://www.wellsfargo.com/'
values = {'j_username':'USERNAME', 'j_password':'PASSWORD'}
data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
soup = bs4.BeautifulSoup(the_page,"html.parser")
The 'j_username' and 'j_password' both come from inspecting the text boxes on the login page.
I just don't think I'm pointing to the right place or passing my credentials correctly. The URL I'm using is just the login page, so is it actually logging me in? When I print the URL from the response it returns https://wellsfargo.com/. If I'm ever able to successfully log in, it just takes me to a summary page of my accounts. I would then need to follow another link to my checking, savings, etc.
I really appreciate any help you can offer.
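For what it's worth, the usual urllib approach is to POST to the form's actual action URL (found by inspecting the <form> tag, not just the login page URL) and to keep cookies between requests so the session persists. A rough sketch, with a hypothetical action URL standing in for the real endpoint:

import http.cookiejar
import urllib.parse
import urllib.request

# Keep cookies between requests so the login session persists
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Hypothetical action URL -- take the real one from the login form's action attribute
login_url = 'https://www.example.com/login/submit'
values = {'j_username': 'USERNAME', 'j_password': 'PASSWORD'}
data = urllib.parse.urlencode(values).encode('ascii')

with opener.open(login_url, data) as response:
    summary_page = response.read()   # the account summary page, if the login succeeded

# Follow-up requests through the same opener reuse the session cookies,
# e.g. opener.open('https://www.example.com/accounts/checking')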
I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is I can't use response.read() with BeautifulSoup or lxml.
#BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source) #source.txt doesn't work either
for link in soup.findAll('a', {'class':'someClass'}):
    some_list.add(link)
This doesn't work; it doesn't actually find any tags. It works well when I use requests.get(url).
#lxml->html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source) #source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]') #/text() doesn't work either
print like_pages
It doesn't print anything. I know it has a problem with the return type of the response, since it works well with requests.get(). What can I do? Could you please provide sample code where response.read() is used in HTML parsing?
By the way, what is the difference between the response and request objects?
Thank you!
I found the solution. It is because mechanize.Browser is an emulated browser and only gets the raw HTML. The page I wanted to scrape adds the class to the tag with the help of JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
list = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: You need to have Firefox installed. Or you can choose a different webdriver according to the browser you want to use.
A request is what a web client sends to a server, with details about what URL the client wants, what http verb to use (get / post, etc), and if you are submitting a form the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates if the request was successful (code 200 usually if there were no problems, or an error code like 404 or 500). The response usually contains data, like the html in a page, or the binary data in a jpeg. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header which says what format the data is in).
Quote from @davidbuxton's answer on this link.
Good luck!
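As a concrete illustration of that exchange, using the requests library (the URL is just an example):

import requests

r = requests.get('https://httpbin.org/get')   # send a GET request to an example URL
print(r.status_code)                  # e.g. 200 if the request succeeded
print(r.headers['Content-Type'])      # what format the response data is in
print(r.text[:200])                   # the start of the response body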
I have a url that, when opened, does nothing but initiate a download and immediately close the page. I need to capture this download (a PNG) with Python and save it to my own directory. I have tried all the usual urllib and urllib2 methods and even tried mechanize, but it's not working.
The url automatically starting a download and then closing is definitely causing some problems.
UPDATE: Specifically it is using Nginx to serve up the file with a X-Accel-Mapping header.
There's nothing particularly special about the X-Accel-Mapping header. Perhaps the page makes the HTTP request with AJAX and uses the X-Accel-Mapping header value to trigger the download?
Here's how I'd do it with urllib2:
response = urllib2.urlopen(url_to_get_x_accel_mapping_header)
download_url = response.headers['X-Accel-Mapping']
download_contents = urllib2.urlopen(download_url).read()
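If that works, saving the file to your own directory is just a matter of writing the bytes out (the filename below is an assumption):

# Write the downloaded bytes to a local file; the name/path is just an example
with open('download.png', 'wb') as f:
    f.write(download_contents)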
import urllib

URL = YOUR_URL                    # direct URL of the image
IMAGE = URL.rsplit('/', 1)[1]     # use the last path segment as the filename
urllib.urlretrieve(URL, IMAGE)    # download the image to the current directory
For details on dynamically downloading images from a URL list, visit here.