Python download href, got the source code instead of a pdf file - python

I'm trying to download a PDF file from the following href (I changed some values because the PDF contains personal information):
https://clients.direct-energie.com/grandcompte/factures/consulter-votre-facture/?tx_defacturation%5BdoId%5D=857AD9348B0007984D4B128F1E8BE&cHash=7b3a9f6d109dde87bd1d95b80ca1d
When I paste this href into my browser, the PDF file is downloaded directly, but when I use requests in my Python code, it only downloads the source code of
https://clients.direct-energie.com/grandcompte/factures/consulter-votre-facture/
Here is my code. I use Selenium to find the href on the website:
from requests import get
fact = driver.find_element_by_xpath(url)
href = fact.get_attribute('href')
print(href)  # href is correct here
reply = get(href, stream=True)
print(reply)  # I got the source code
Here is the HTML found by Selenium:
I hope this is enough information to help. Thanks.

I can't use your link because it requires authentication, so I found another example of a redirecting PDF download. The Chrome setting that downloads the PDF instead of displaying it is taken from this StackOverflow answer.
import selenium.webdriver
url = "https://readthedocs.org/projects/selenium-python/downloads/pdf/latest/"
download_dir = 'C:/Dev'
profile = {
"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
"download.default_directory": download_dir ,
"download.extensions_to_open": "applications/pdf"
}
options = selenium.webdriver.ChromeOptions()
options.add_experimental_option("prefs", profile)
driver = selenium.webdriver.Chrome(options=options)
driver.get(url)
According to the docs, the driver.get method doesn't return anything; it only tells the webdriver to navigate to a page. If you want to handle the PDF in Python before saving it to a file, consider using Requests or RoboBrowser.
The stream=True option isn't available for webdriver.Chrome, so I'm not sure whether this is the method you were using, but the above should do what you want.
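If you do want to fetch the PDF from Python rather than through the browser, one likely reason for getting the HTML page back is that the plain request does not carry the login cookies held by the Selenium session. Here is a minimal sketch under that assumption. The helper names build_authenticated_request and download are made up for illustration, and the standard library's urllib.request stands in for Requests to keep the example dependency-free.

```python
import urllib.request


def build_authenticated_request(url, selenium_cookies):
    """Return a urllib Request that carries the browser's cookies.

    selenium_cookies is the list of dicts returned by driver.get_cookies().
    """
    cookie_header = "; ".join(
        f"{cookie['name']}={cookie['value']}" for cookie in selenium_cookies
    )
    request = urllib.request.Request(url)
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    return request


def download(request, path, chunk_size=8192):
    """Stream the response body to disk in chunks."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)
```

With a live session this would be called as download(build_authenticated_request(href, driver.get_cookies()), "invoice.pdf"), where "invoice.pdf" is just a placeholder file name. If the site also checks headers such as User-Agent or Referer, those would need to be added to the request as well.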

Related

Why is the html content I got from inspector different from what I got from Request?

Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That is the code I have so far.
It seems like the HTML content I got from the inspector is different from what I got from BeautifulSoup.
My guess is that they are preventing me from getting their data as they detected I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = r"C:\Program Files (x86)\chromedriver.exe"
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2:(loaded with devtool)
Any content that is loaded after the initial page load is unavailable to BS4 with your current method. This is because the content is loaded by an AJAX call via JavaScript, and the requests library cannot parse or run JS code.
To achieve this you will have to look at something like Selenium, which controls a browser from Python or other languages. There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
Download the correct driver for your version of Chrome.
Install selenium via pip.
Create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with BS4:
from selenium import webdriver
import time
# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
I'll turn my comment into an answer because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, look at the "Network" tab of your dev tools and filter for "fetch" requests.
This could be particularly interesting for you:
You don't need Selenium or BeautifulSoup at all; you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.
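For reference, the cURL command above can be translated to Python roughly as follows. This is only a sketch: it uses urllib.request from the standard library, build_request and fetch_products are illustrative names, and nothing is assumed about the structure of the JSON that comes back.

```python
import json
import urllib.request

# URL and header value copied from the cURL command above.
API_URL = "https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines"


def build_request(url=API_URL, tenant="one-stop-wine-shop"):
    """Build the GET request, including the required tenant header."""
    return urllib.request.Request(url, headers={"tenant": tenant})


def fetch_products(request):
    """Send the request and decode the JSON body into Python objects."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_products(build_request()) should then return the decoded JSON, which you can inspect with print or json.dumps(..., indent=2) to find the fields you need.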

How to automatically find link of download button and download corresponding file with Python?

I have permission to download some weather data from the following website:
https://www.meteobridel.com/messnetz/index3.php#
I was wondering whether it is possible to automatically find the download URL behind the 'CSV' button and then download that CSV file with Python.
I tried this, but it didn't work:
from selenium import webdriver
browser = webdriver.Safari()
url = 'https://meteobridel.lu/?page_id=5'
browser.get(url)
browser.find_element_by_xpath('//*[@id="CSV"]').click()
browser.close()
Thanks already!
Try
from selenium import webdriver
browser = webdriver.Safari()
url = 'https://meteobridel.lu/?page_id=5'
browser.get(url)
browser.find_element_by_xpath("//body/div[@id='main']/div[1]/div[1]/div[1]/a[4]").click()
browser.close()
Checking the page you provided, I can't find a "CSV" ID.
Maybe try getting the button by class:
browser.find_element_by_xpath(r"//a[contains(@class, 'buttons-csv')]").click()
The element is inside an iframe, so you have to switch to it first. As the id of the frame is unique, you can switch like this:
browser.switch_to.frame("iframe")
browser.find_element_by_xpath('//span[contains(text(),"CSV")]/..').click()

Selenium Firefox browser is stuck after downloading pdf

Was hoping someone could help me understand what's going on:
I'm using Selenium with Firefox browser to download a pdf (need Selenium to login to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the pdf, but after that just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang, but opens pdf instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with the selenium cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using the most recent version of Firefox, it'll most likely render the PDF via pdf.js so you can view it inline. This isn't ideal because now we can't download the file.
You can disable pdf.js via Selenium options but this will likely lead to the issue in this question where the browser gets stuck. This might be because of an unknown MIME-Type but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver
pdf_url = "/url/to/some/file.pdf"
# setup driver with options
driver = webdriver.Firefox()  # pass options=... here if needed
# do whatever you need to do to auth/login/click/etc.
# navigate to the PDF URL in case the PDF link issues a
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)
# get the URL from Selenium
current_pdf_url = driver.current_url
# create a requests session
session = requests.session()
# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])
# Note: If headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium
# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)
# access the bytes response from the session
pdf_bytes = pdf_response.content
I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you return headers, wait for requests to finish, use proxies, and much more.
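Before writing pdf_bytes to disk, it can be worth checking that the server really returned a PDF and not a login or error page. A small sketch: save_pdf is a hypothetical helper, and it relies on the fact that PDF files normally begin with the bytes %PDF-.

```python
from pathlib import Path


def save_pdf(pdf_bytes, path):
    """Write pdf_bytes to path, refusing anything that does not look like a PDF."""
    if not pdf_bytes.startswith(b"%PDF-"):
        # Show the start of the body; it is usually an HTML login or error page.
        preview = pdf_bytes[:60].decode("utf-8", errors="replace")
        raise ValueError(f"response is not a PDF; it starts with: {preview!r}")
    target = Path(path)
    target.write_bytes(pdf_bytes)
    return target
```

If this raises, the cookies (or headers) copied from Selenium were probably not enough to authenticate the request.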

Python Web Scraping saving Tik Tok video from url

I am trying to save videos from this url:
Original:
https://api2.musical.ly/aweme/v1/play/?video_id=v09044a20000beeff4c108gs7sflfdug
Link changes to this:
http://v16.muscdn.com/3d238aa3e1c34000ce53792155cd0e15/5bcf3070/video/tos/maliva/tos-maliva-v-0068/e5a1ab74d0b54f97b3578924a428e58d/
The video is from TikTok. When you go to the url, it instantly redirects you to another url. The other url is the one I want in order to save the video. However, the url it directs you to does not have a "view html source" option. I can inspect the element and that shows it has a video tag, but I cannot find a way to save the url between the tag. I am using python and beautifulsoup. I tried to do this with selenium, but to no effect.
Edit:
The link that it redirects to changes all the time! As of 27/08/2019, the link below works.
If you get "Access denied", check the link again.
I think you should use other libraries for saving videos.
For example (in Python 3+):
import urllib.request
vid_url = "http://v19.muscdn.com/21b98c731608b8aa296ec31468c26dd1/5d652a88/video/tos/maliva/tos-maliva-v-0068/e5a1ab74d0b54f97b3578924a428e58d/?rc=amdvdnY7NDdpaDMzNTczM0ApdSlINzU2NTM0MzM2MzM1MzQ1b2k5ZmU5Z2c1ZGY5ZmQzPGZAaUBoNnYpQGczdilAZjY1QHJjYzRkLWBjYl8tLV4xNnNzOmk0NTU1LjQtLi4uMTQ0NTYtOiM2MDAtXjQzXzMxMTFeMWEzYSNvIzphLW8jOmAtbyMwLl4%3D"
urllib.request.urlretrieve(vid_url, "your_video_name.mp4")
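Since the redirect target keeps changing, another option is to request the original URL and let urllib follow the redirect for you, which it does by default for HTTP 3xx responses. This is only a sketch: download_final_target is a made-up helper name, and it is an assumption that the server accepts a plain request without extra headers or cookies.

```python
import shutil
import urllib.request


def download_final_target(url, path):
    """Follow any HTTP redirects from url and save the final response body.

    Returns the URL that was actually served after redirection.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
        return response.geturl()
```

The returned value is the URL you ended up at, so you can log it or compare it with the link you see in the browser.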
If you insist on using selenium you can add options like this:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": r"C:\Users\xxx\downloads\Test",
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
})
driver = webdriver.Chrome(options=options)
Hope this helps you!

error while parsing url using python

I am working with a URL in Python.
If I click the URL, I get the Excel file,
but if I run the following code, it gives me weird output.
>>> import urllib2
>>> urllib2.urlopen('http://intranet.stats.gov.my/trade/download.php?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls').read()
output :
"<script language=javascript>window.location='2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>"
Why is urllib2 not able to read the content?
Take a look with an HTTP listener (or even the Google Chrome Developer Tools): there's a JavaScript redirect when you get to the page.
You will need to access the initial URL, parse the result, and then fetch the actual URL.
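The parsing step can be sketched without a browser. The function below is a hypothetical helper (not part of any library) that pulls the target out of a window.location='...' assignment like the one in the output above, undoes the JavaScript backslash escapes, percent-encodes the result, and resolves it against the page URL. It assumes the redirect is a simple string literal; anything computed by the script would still need a real browser.

```python
import re
from urllib.parse import quote, urljoin

# Matches window.location='...' or window.location.href="...",
# allowing backslash-escaped quotes inside the target.
_JS_REDIRECT = re.compile(
    r"""window\.location(?:\.href)?\s*=\s*(['"])((?:\\.|(?!\1).)*)\1"""
)


def extract_js_redirect(html, base_url):
    """Return the absolute URL of a JavaScript redirect, or None if absent."""
    match = _JS_REDIRECT.search(html)
    if match is None:
        return None
    # Undo JavaScript backslash escapes such as \' -> '
    target = re.sub(r"\\(.)", r"\1", match.group(2))
    # Percent-encode spaces and quotes, then resolve against the page URL.
    return urljoin(base_url, quote(target, safe="/:?=&%"))
```

The returned URL can then be passed to urlopen (urllib2 in Python 2, urllib.request in Python 3) to fetch the actual file.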
@Kai in this question seems to have found an answer to JavaScript redirects using the Selenium module:
import time
from selenium import webdriver
driver = webdriver.Firefox()
link = "http://yourlink.com"
driver.get(link)
# this waits for the new page to load
while link == driver.current_url:
    time.sleep(1)
redirected_url = driver.current_url
