I'm trying to scrape image URLs from a website and display the images on another site (using BeautifulSoup), but the website in question (yupoo.com) has some sort of protection against loading images from their server if you don't browse their site.
How to reproduce my problem:
You can't load this image:
https://photo.yupoo.com/0832club_v/0058b582/96f83ddb.jpeg
Now visit this site: https://0832club.x.yupoo.com/29611853?uid=1
Now open the link above
"https://photo.yupoo.com/0832club_v/0058b582/96f83ddb.jpeg" and for
some reason it now works...
I checked for cookies and such, but I seriously don't understand how they protect their images.
You have to send the Referer header; the server then thinks the image is being loaded from the page https://0832club.x.yupoo.com/29611853?uid=1:
import requests

url = 'https://photo.yupoo.com/0832club_v/0058b582/96f83ddb.jpeg'
headers = {
    'referer': 'https://0832club.x.yupoo.com/29611853?uid=1'  # just the URL, no "Referer: " prefix
}
r = requests.get(url, headers=headers)
print(r.content[:100])  # you can see the string `JFIF` or `GIF` in the content

with open('output.jpg', 'wb') as f:
    f.write(r.content)
With the Referer header I see the string JFIF in the content, so the server sends the JPEG. Without it I see the string GIF, so the server sends a GIF instead.
You can also check
print(r.headers['Content-Type'])
With the Referer it returns image/jpeg; without it, it returns image/gif.
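A quick side-by-side check of that claim, as a minimal sketch (same image URL and referring page as above):

import requests

url = 'https://photo.yupoo.com/0832club_v/0058b582/96f83ddb.jpeg'
referer = 'https://0832club.x.yupoo.com/29611853?uid=1'

# with the Referer header the server returns the real JPEG
with_ref = requests.get(url, headers={'referer': referer})
print(with_ref.headers['Content-Type'])     # image/jpeg

# without it the server substitutes a GIF
without_ref = requests.get(url)
print(without_ref.headers['Content-Type'])  # image/gif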
Related
I'm trying to download some videos from a website using Selenium.
Unfortunately I can't download them from the source, because the video is stored in a directory with restricted access; trying to retrieve them using urllib, requests, or ffmpeg returns a 403 Forbidden error, even after injecting my user data into the website.
I was thinking of playing the video in its entirety and storing the media file from the cache.
Would that be a possibility? Where can I find the cache folder in a custom profile? How do I discriminate among the files in the cache?
EDIT: This is what I attempted to do using requests
import requests

def main():
    s = requests.Session()
    login_page = '<<login_page>>'
    login_data = dict()
    login_data['username'] = '<<username>>'
    login_data['password'] = '<<psw>>'
    login_r = s.post(login_page, data=login_data)  # send the credentials with the POST
    video_src = '<<video_src>>'
    cookies = dict(login_r.cookies)  # contains the session cookie
    # static cookies for every session
    cookies['_fbp'] = 'fb.1.1630500067415.734723547'
    cookies['_ga'] = 'GA1.2.823223936.1630500067'
    cookies['_gat'] = '1'
    cookies['_gid'] = 'GA1.2.1293544716.1631011551'
    cookies['user'] = '66051'
    video_r = s.get(video_src, cookies=cookies)
    print(video_r.status_code)

if __name__ == '__main__':
    main()
The print() function returns:
403
This is the network tab for the video (screenshot not reproduced here).
Regarding video_r = s.get(video_src, cookies=cookies): have you tried streaming the response? That sends the correct byte-range headers to download the video; most websites prevent downloading the file as one block.
with open('...', 'wb') as f:
    response = s.get(url=link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
You can send a HEAD request first if you want; that way you can build a progress bar, since you will retrieve the full content length from the headers.
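A minimal sketch of that idea, assuming s is the logged-in session and link is the video URL (video.mp4 is a placeholder output name):

# a HEAD request returns only the headers, including Content-Length
total = int(s.head(link).headers.get('Content-Length', 0))

downloaded = 0
with open('video.mp4', 'wb') as f:
    response = s.get(link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)
            downloaded += len(chunk)
            if total:
                # crude progress readout based on the HEAD content length
                print('\r%5.1f%%' % (100.0 * downloaded / total), end='')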
Also, a 403 is commonly used by anti-bot systems; maybe your Selenium is being detected.
You are blocked because you forgot about headers.
You must use:
s.get('https://httpbin.org/headers', headers={'user-agent': <the User-Agent value (for example: the last line of your uploaded image)>})
or:
s.headers.update({'user-agent': <the User-Agent value (for example: the last line of your uploaded image)>})
before sending a request.
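For example, on the session from your snippet (the User-Agent string below is only an illustrative value; copy the real one your browser sends):

# set a browser-like User-Agent once; every request on the session will use it
s.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/92.0.4515.159 Safari/537.36'
})
video_r = s.get(video_src, cookies=cookies)
print(video_r.status_code)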
I'm trying to access a protected image on a website. If I use my browser and log in, I can easily see the image. Now I'm trying to do the same using Python.
I came across this question which seemed like the same issue that I have. With that I wrote the following program:
import requests
session = requests.Session()
session.get("https://www.page.com/login")
response = session.post("https://www.page.com/login", data = {"cmd": "login", "username": "myusername", "password": "mypass"})
print(response.text)
Output here is:
{"status":"ok","message":"Successful login.","referer":"https:\/\/page.com\/"}
I've also saved the response.text into a .html file, which confirmed that I've logged in. If at this point I open any subpage on page.com, I'm still logged in.
Then I try to access the protected image at url:
with open('image.html', 'wb') as file:
    response = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
    print(response.status_code)
    file.write(response.content)  # write the raw bytes; the file is opened in binary mode
Which gives the output of 403. If I open the image.html file there's an "Item not available".
I've tried doing all of this using mechanize and cookielib instead of requests.Session(), where I created a cookie jar, created a Browser() object, and successfully logged in, but when I tried accessing that image I got the same 403 error. Therefore any ideas will be greatly appreciated.
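Given the Referer trick from the yupoo answer above, one thing worth trying is to also send a Referer header pointing at the page that normally embeds the image. A hedged sketch (the page URL is a placeholder, and this assumes the site checks Referer the same way):

response = session.get(url, headers={
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.page.com/",  # hypothetical: the page that displays the image
})
print(response.status_code)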
I tried to get the HTML source from a site named dcinside in Korea. I am using requests but cannot get the HTML,
and this is my code:
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print(req)
print(req.content)
but the result was: (output not reproduced here)
Why can't I get the HTML even when using requests?
Most likely they are detecting that you are trying to crawl data dynamically and are not giving any content as a response. Try pretending to be a browser by passing a User-Agent header.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
# use authentic mozilla or chrome user-agent strings if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
Like the guy said in the aforementioned post, you should use urllib2, which will allow you to easily obtain web resources.
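A minimal urllib2 sketch of that suggestion (Python 2; the User-Agent value is illustrative):

import urllib2

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
# pass a browser-like User-Agent so the site serves the page content
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
print(html[:200])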
I am trying to get my image hosted online and for that I am using Python:
import requests
url = 'http://imgup.net/'
data = {'image[image][]':'http://www.webhost-resources.com/wp-content/uploads/2015/01/dedicated-hosting-server.jpg'}
r = requests.post(url, files=data)
I am not able to get the response URL of the hosted image from the response.
Please help!
The files parameter of requests.post needs a:
Dictionary of 'name': file-like-objects (or {'name': ('filename', fileobj)}) for multipart encoding upload.
There's more data you'll need to send than just the file, most importantly the "authenticity token". If you look at the source code of the page, it'll show you all other parameters as <input type="hidden"> tags.
The upload URL is http://imgup.net/upload, as you can see from the action attribute of <form>.
So what you need to do is:
Download the image you want to upload (I'll call it dhs.jpg).
Do a GET request of the main page, extracting the authenticity_token (a sketch of this step follows the code below).
Once you have that, send the request with files= and data=:
url = "http://imgup.net/upload"
data = {'utf8': '✓', 'authenticity_token': '<put your scraped token here>', '_method': 'put'}
f = open("dhs.jpg", "rb") # open in binary mode
files = {'image[image][]': f}
r = requests.post(url, files=files, data=data)
f.close()
print(r.json()["image_link"]
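For step 2, a minimal sketch of scraping the token with BeautifulSoup (assuming the hidden input is named authenticity_token, matching the hidden tags described above):

import requests
from bs4 import BeautifulSoup

main_page = requests.get("http://imgup.net/")
soup = BeautifulSoup(main_page.text, "html.parser")
# the token sits in a hidden <input name="authenticity_token"> tag
token = soup.find("input", {"name": "authenticity_token"})["value"]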
Final note: While I couldn't find any rule against this behaviour in their T&C, the presence of an authenticity token makes it seem likely that imgup doesn't really want you to do this automatically.
I'm working on a program that uses Beautiful Soup to scrape a website, and then urllib to retrieve images found on the website (using the image's direct URL). The website I'm scraping isn't the original host of the image, but does link to the original image. The problem I've run into is that for certain websites, retrieving www.example.com/images/foobar.jpg redirects me to the homepage www.example.com and produces an empty (0 KB) image. In fact, going to www.example.com/images/foobar.jpg redirects as well. Interestingly, on the website I'm scraping, the image shows up normally.
I've seen some examples on SO, but they all explain how to capture cookies, headers, and other similar data from websites while getting around the redirect, and I was unable to get them to work for me. Is there a way to prevent a redirect and get the image stored at www.example.com/images/foobar.jpg?
This is the block of code that saves the image:
from urllib import urlretrieve
...
# os, URL, imList, and ctr come from the scraping code elided above
for imData in imList:
    imurl = imData['imurl']
    fName = os.path.basename(URL)
    fName, ext = os.path.splitext(fName)
    fName += "_%02d" % (ctr,) + ext
    urlretrieve(imurl, fName)
    ctr += 1
The code that handles all the scraping is too long to reasonably put here. But I have verified that imData['imurl'] holds the accurate URL for the image, for example http://upload.wikimedia.org/wikipedia/commons/9/95/Brown_Bear_cub_in_river_1.jpg. However, certain images redirect, like: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/brown-bear-in-dog-salmon-creek.jpg.
The website you are attempting to download the image from may have extra checks to limit the amount of screen scraping. A common check is the Referer header, which you can try adding to the urllib2 request:
req = urllib2.Request('<img url>')
req.add_header('Referer', '<page url / domain>')
For example, the request my browser used for an alpaca image from the website you referenced includes a Referer header:
Request URL:http://www.public-domain-image.com/cache/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos_w725_h544.jpg
Request Method:GET
....
Referer:http://www.public-domain-image.com/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos.jpg.html
User-Agent:Mozilla/5.0
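Putting it together, a minimal sketch (Python 2 urllib2; the image URL is the redirecting example from the question, and the Referer value and output filename are assumptions):

import urllib2

img_url = ('http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/'
           'fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/'
           'brown-bear-in-dog-salmon-creek.jpg')

req = urllib2.Request(img_url)
# pretend the request originates from a page on the same site
req.add_header('Referer', 'http://www.public-domain-image.com/')
req.add_header('User-Agent', 'Mozilla/5.0')

response = urllib2.urlopen(req)
with open('brown-bear.jpg', 'wb') as f:
    f.write(response.read())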