"403 Forbidden" when use python urlib package to download the image - python

I am new to the urllib package.
I am trying to download all the images on the website "http://www.girl-atlas.com/album/576545de58e039318beb37f6".
The question is: when I copy the URL of an image and paste it into a browser, I get a "403 Forbidden" error. However, when I right-click an image in the browser and choose to open it in a new window, the image loads fine in the new window.
The problem is: how can urllib simulate the second way?

The server forbids using these image URLs outside a browser. To enforce this, it checks the Referer header, which browsers always send to identify the page from which the image is being loaded. If a browser were written in Python, the request would look like this:
import urllib.request

opener = urllib.request.URLopener()  # legacy opener (deprecated in Python 3)
# Pretend the request comes from the album page by sending a Referer header.
opener.addheader('Referer', 'http://www.girl-atlas.com/album/576545de58e039318beb37f6')
image = opener.open('http://girlatlas.b0.upaiyun.com/41/20121222/234720feaa1fc912ba4e.jpg!lrg')
data = image.read()
image.close()
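Since URLopener is deprecated (and removed in recent Python releases), here is a minimal sketch of the same Referer trick using the non-deprecated urllib.request.Request API, reusing the album and image URLs from above; the local filename 'image.jpg' is an arbitrary choice:
import urllib.request

# Send a Referer header so the image host believes the request
# was triggered from the album page rather than from a bare URL.
req = urllib.request.Request(
    'http://girlatlas.b0.upaiyun.com/41/20121222/234720feaa1fc912ba4e.jpg!lrg',
    headers={'Referer': 'http://www.girl-atlas.com/album/576545de58e039318beb37f6'})
with urllib.request.urlopen(req) as image:
    data = image.read()
with open('image.jpg', 'wb') as f:
    f.write(data)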

Related

Get redirected URL with Python to get access code

This question has been asked several times, and I didn't find any answer that works for me. I am using the requests library to get the redirect URL, but my code returns the original URL. If I click on the link, it takes a few seconds before I get the redirect URL and can manually extract the code, but I need to get this information with Python.
Here is my code. I have tried response.history, but it returns an empty list.
import requests
response = requests.get("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
print(response)
print('-------------------')
print(response.url)
I am trying to get the code by following this Microsoft documentation: "https://learn.microsoft.com/en-us/graph/auth-v2-user".
Here are the links that I found on Stack Overflow that didn't solve my issue:
To get redirected URL with requests, How to get redirect url code with Python? (this is probably very close to my situation), how to get redirect url using python requests, and Python Requests library redirect new url.
I didn't have any luck getting the redirected URL back using requests as mentioned in previous posts. But I was able to work around this using the webbrowser library and then reading the browser history with sqlite3, which gave me the result I was looking for.
I had to go through Postman and add the Postman URL into my app registration for using the Graph API, but if you simply want to get the redirected URL you can follow the same code and you should get it.
Let me know if there are better solutions.
Here is my code:
import webbrowser
import sqlite3
import time
import pandas as pd
import shutil
webbrowser.open("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
# source_file is where your browser history is saved; I was using Chrome,
# but it should be the same process if you are using a different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not connect to the history file directly as it was locked,
# so I had to make a copy of it in a different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30)  # the history file takes a moment to update; a 30-second wait makes sure the last URL gets logged
shutil.copy(source_file, destination_file)  # copying the file
con = sqlite3.connect(destination_file)  # connecting to the copied browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls,columns=names)
last_url = df_history.loc[len(df_history)-1,'url']
print(last_url)
>>https://oauth.pstmn.io/v1/browser-callback?code={code}&state=12345&session_state={session_state}#
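If all you need is the most recent entry, you can also let SQLite do the sorting instead of loading the whole table into pandas. A small variation, assuming Chrome's standard urls table with its last_visit_time column:
import sqlite3

con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')
# Chrome keeps one row per known URL; ordering by last_visit_time
# descending puts the most recently visited URL first.
row = con.execute(
    "SELECT url FROM urls ORDER BY last_visit_time DESC LIMIT 1").fetchone()
con.close()
print(row[0])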

Scraping a secure website requiring clicks on JavaScript links

I have a daily task at work to download some files from an internal company website. The site requires a login. The main URL is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
My task normally is to open this site, log in, click some links back and forth, and download some files. This takes me 10 minutes every day, but I want to automate it using Python. Using my basic knowledge I have written the code below:
import http.cookiejar
import urllib3
import requests
from bs4 import BeautifulSoup

url = "https://abcd.com"
redirectURL = requests.get(url).url  # follow the redirect to the real login page
jar = http.cookiejar.CookieJar(policy=None)
pool = urllib3.PoolManager()  # renamed so it does not shadow the http module
acc_pwd = {'datasouce': 'Data1', 'user': 'xxxx', 'password': 'xxxx'}
response = pool.request('GET', redirectURL)
soup = BeautifulSoup(response.data, 'html.parser')
r = requests.get(redirectURL, cookies=jar)
r = requests.post(redirectURL, cookies=jar, data=acc_pwd)
print("RData %s" % r.text)
This shows that I am able to log in successfully. The next step is where I am stuck. On the page after login there are some links on the left side, and I need to click one of them. When I inspect them in Chrome, I see:
href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span> </div></a>
This is a JavaScript link. I need to click it, then on the new page another link, then another to download a file, and then go back to the main page and do this all over again to download different files.
I would be grateful to anyone who can help or suggest something.
Thanks to chris, I was able to complete this.
First, using the requests library, I got the redirect URL:
redirectURL = requests.get(url).url
After that I used Scrapy and Selenium to click the links and download the files.
By adding Selenium to the browser as an add-in/plugin, it was quite simple.
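For anyone facing the same postback links: with Selenium the click usually comes down to locating the anchor, or firing the ASP.NET __doPostBack call directly. A minimal sketch, where the 'myAppControl$menu_itm_proj11' target comes from the markup quoted above and everything else is a placeholder:
import requests
from selenium import webdriver

redirectURL = requests.get("https://abcd.com").url  # resolve the redirect first
driver = webdriver.Firefox()
driver.get(redirectURL)
# ... log in here as in the code above ...
# Option 1: click the anchor that wraps the "Projects" menu cell.
driver.find_element_by_xpath(
    "//a[contains(@href, 'myAppControl$menu_itm_proj11')]").click()
# Option 2: trigger the ASP.NET postback directly.
driver.execute_script("__doPostBack('myAppControl$menu_itm_proj11', '');")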

How to fix HTML downloading instead of image file

I'm trying to download a file from a link using urllib in Python 3.7, and it downloads the HTML file and not the image file.
I'm trying to receive information from a Google Form; the information is sent to a Google Sheet. I'm able to receive the information in the sheet no problem. However, the Form requires an image submission, which appears in the sheet as a URL. (Example: https://drive.google.com/open?id=1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX)
This is my code:
import urllib.request
import random
Then I create a download function:
def downloader(image_url):
    file_name = random.randrange(1, 10000)  # random name to avoid collisions
    full_file_name = str(file_name) + '.png'
    print(full_file_name)
    urllib.request.urlretrieve(image_url, full_file_name)
I get the URL and isolate the ID of the image:
ImgId="https://drive.google.com/open?id=1Mp5XYoyyEfWJryz8ojLbHuZ6V0IzERIV"
ImgId=ImgId[33:]
Then I put the ID in a download link:
ImgId="https://drive.google.com/uc?authuser=0&id="+ImgId+"&export=download"
Which results in (in the above example) "https://drive.google.com/uc?authuser=0&id=1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX&export=download".
Next I run the download function:
downloader(ImgId)
After this I expected the PNG file to be downloaded into the program's folder; however, it downloaded an HTML file of the Google Drive log-in page instead of an image file, or even an HTML file of the image. Noting that viewing or downloading the image requires being signed in to Google in the browser, could authorization be the issue?
(Note: if I manually paste the download link generated by the program into my browser, it downloads the image correctly.)
(P.S. I'm an absolute noob, so yeah.)
(Thanks in advance for any answers.)
Instead of using urllib for downloading, use requests: fetch the page contents with a GET call, convert the response content to soup with BeautifulSoup, and then point to the content you want to download. The download function inside the HTML will have a download link associated with it, so send a GET request again for that link.
import requests
import bs4
response = requests.get(<your_url>)
soup = bs4.BeautifulSoup(response.content, 'html5lib')
# Get the download link and supply all the necessary values to the link
# Initiate Requests again
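To make that concrete, here is a hedged completion of the sketch. It assumes the interstitial page Drive returns contains a direct download anchor; the 'uc-download-link' id is what Drive's download-warning page used at one point, so verify it against the actual HTML you get back:
import requests
import bs4

url = ('https://drive.google.com/uc?authuser=0'
       '&id=1YCBmEOz6_l7WDQw5t6AYBSb9B5XXKTuX&export=download')
session = requests.Session()  # a session keeps cookies between the two requests
response = session.get(url)
soup = bs4.BeautifulSoup(response.content, 'html5lib')
anchor = soup.find('a', id='uc-download-link')  # assumption: verify this id
if anchor is not None:
    download = session.get('https://drive.google.com' + anchor['href'])
    with open('image.png', 'wb') as f:
        f.write(download.content)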

How to transfer data from one web page to another

I'm using Selenium WebDriver with Python 2.7.14 on the Firefox browser. I'm trying to get the text of the .json file located at this URL: http://a360ci.s3.amazonaws.com/Jmx/einat_world_bank.json and insert all the data into the main area on this URL: http://jsonviewer.stack.hu/
This is my code:
driver = self.driver
driver.get('http://a360ci.s3.amazonaws.com/Jmx/einat_world_bank.json')
RawData = driver.find_element_by_id("tab-1")
RawData.click()
self.driver.implicitly_wait(2)
content = driver.find_element_by_class_name("data").text
driver.get('http://jsonviewer.stack.hu/')
MainField = driver.find_element_by_id("edit")
MainField.send_keys(content)
*I moved to the Raw Data tab because on Firefox the JSON does not parse well.
*After the second URL opens, the program gets stuck and nothing happens. What can be the problem and how can it be solved? Thanks.
The root cause is not clear. I suggest you fetch the JSON with an HTTP client library rather than opening it in a browser and reading it from the DOM; otherwise your code can't run cross-browser. Different browsers display JSON content with different DOM trees; I think there is no 'tab-1' element when the file is opened in Chrome. Another possibility is that Firefox reports a parsing failure while Chrome has no such issue.
Because the JSON content is too long, I suggest you not use send_keys(); you can try:
driver.execute_script('arguments[0].value=arguments[1]', textbox, jsonstring)
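Putting both suggestions together: fetch the JSON with an HTTP client instead of scraping Firefox's viewer tab, then set the textarea value through JavaScript. A minimal sketch, assuming the jsonviewer textarea keeps the id 'edit' used in the code above:
import requests
from selenium import webdriver

# Fetch the raw JSON over HTTP rather than reading it out of a browser tab.
jsonstring = requests.get(
    'http://a360ci.s3.amazonaws.com/Jmx/einat_world_bank.json').text

driver = webdriver.Firefox()
driver.get('http://jsonviewer.stack.hu/')
textbox = driver.find_element_by_id('edit')
# Set the value directly; send_keys() is too slow for a payload this large.
driver.execute_script('arguments[0].value = arguments[1];', textbox, jsonstring)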

How to download a file using python that is sent after some delay by server?

I have to download a large number of files from a local server. When opening the URL in the browser [Firefox], a page opens with the content "File being generated.. Wait.." and then a popup comes up with the option to save the required .xlsx file.
I tried to save the page object using urllib, but it saves the .html file with the content "File being generated.. Wait..". I used the code described here (using urllib2):
How do I download a file over HTTP using Python?
I don't know how to download the file that the server sends later. It works fine in the browser; how do I emulate it using Python?
First of all, you have to know the exact URL where the document is generated. You can use Firefox with the Live HTTP Headers add-on to capture it.
Then use Python to "simulate" the same request.
I hope that helps.
P.S.: or share the URL of the site and then I could help you better.
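As a sketch of that idea: once Live HTTP Headers shows you the request that actually returns the .xlsx, replay it with requests. The URL and header values below are placeholders for whatever the addon captured:
import requests

url = 'http://local-server/reports/generate?report=monthly'  # placeholder
headers = {
    'Referer': 'http://local-server/reports/',  # placeholder
    'User-Agent': 'Mozilla/5.0',
}
response = requests.get(url, headers=headers)
with open('report.xlsx', 'wb') as f:
    f.write(response.content)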
import requests
url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
myfile = requests.get(url, allow_redirects=True)
open('c:/example.pdf', 'wb').write(myfile.content)
A bit old, but I faced the same problem.
The key to the solution is allow_redirects=True.
Is it as simple as
import urllib2
import time
response = urllib2.urlopen('http://www.example.com/')
time.sleep(10) # Or however long you need.
html = response.read()
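If the server really does serve the placeholder page first and the spreadsheet only once it is ready, a polling loop is the more reliable pattern. A hedged sketch that re-requests the URL until the response stops looking like HTML (the URL is a placeholder):
import time
import requests

url = 'http://www.example.com/report'  # placeholder
for attempt in range(10):
    response = requests.get(url, allow_redirects=True)
    # The "File being generated.. Wait.." page is HTML; the finished
    # file comes back with a non-HTML content type.
    if 'text/html' not in response.headers.get('Content-Type', ''):
        with open('report.xlsx', 'wb') as f:
            f.write(response.content)
        break
    time.sleep(10)  # give the server time to generate the file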
