Download the file associated with an export button on a webpage from the terminal - python

I would like to download the file that is produced by clicking the "EXCEL Document" button in the bottom right of this page, from the terminal.
Is it possible to do that from a Unix bash shell?
A solution in R or Python would also be fine.
http://www.vivc.de/index.php?r=eva-analysis-mikrosatelliten-vivc%2Fresultmsatvar&EvaAnalysisMikrosatellitenVivcSearch%5Bleitname_list%5D=&EvaAnalysisMikrosatellitenVivcSearch%5Bleitname_list%5D%5B%5D=ABADI&EvaAnalysisMikrosatellitenVivcSearch%5BName_in_bibliography%5D=
Thanks

The requests library may be what you're looking for. You'd need the URL to pass to requests.get:
import requests
r = requests.get('http://google.com')
r.raise_for_status() # Will error out if there's an issue with the get request
print(r.content)
outputs
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><scr...

Related

Download Excel File via button on website with Python

I am currently working on code that downloads an Excel file from a website. The file is hidden behind an Export button (see website: https://www.amundietf.co.uk/retail/product/view/LU1437018838). However, I have already identified the link behind it, which is the following: https://www.amundietf.co.uk/retail/ezaap/service/ExportDataProductSheet/Fwd/en-GB/718/LU1437018838/object/export/repartition?idFonctionnel=export-repartition-pie-etf&exportIndex=3&hasDisclaimer=1. Since the link does not lead directly to the file but rather executes some Java widget, I am not able to download the file via Python. I have tried the following code:
import re
import requests
link = 'https://www.amundietf.co.uk/retail/ezaap/service/ExportDataProductSheet/Fwd/en-GB/718/LU1437018838/object/export/repartition?idFonctionnel=export-repartition-pie-etf&exportIndex=3&hasDisclaimer=1'
r = requests.get(link, verify=False)
However, I am not able to connect to the file. Does somebody have an idea how to do this?
I would recommend using HTML:
<html lang=en>
<body>
<a href="https://example.com/sample.xlsx" target="_blank">Click here to download</a>
</body>
</html>
In the href attribute of the <a> tag, you can put the path to your own Excel file. I used an external link to an example file I found on Google. To open it in a new tab, use target="_blank" as an attribute of <a>.
Hope it works!
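If the goal is a programmatic download rather than a download link for visitors, the export endpoint may simply be rejecting non-browser clients. One thing worth trying (a sketch, not verified against this site) is sending a browser-like User-Agent header before saving the response:

```python
import requests

# URL taken from the question; the User-Agent value is an illustrative choice.
EXPORT_URL = ("https://www.amundietf.co.uk/retail/ezaap/service/"
              "ExportDataProductSheet/Fwd/en-GB/718/LU1437018838/object/export/"
              "repartition?idFonctionnel=export-repartition-pie-etf"
              "&exportIndex=3&hasDisclaimer=1")

def fetch_export(url=EXPORT_URL, out_path="repartition.xlsx"):
    """Fetch the export endpoint with browser-like headers, save to disk."""
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(r.content)
    return out_path
```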

trying to download full HTML pages

I am trying to download a few hundred HTML pages in order to parse them and calculate some measures.
I tried it with Linux wget, and with a loop of the following code in Python:
import urllib.request

url = "https://www.camoni.co.il/411788/168022"
html = urllib.request.urlopen(url).read()
But the HTML file I get doesn't contain all the content I see in the browser on the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
The problem: I need a big amount of pages and cannot do it by hand.
URL example - https://www.camoni.co.il/411788/168022 - the last number changes
Thank you
That's because that site is not static. It uses JavaScript (in this example the jQuery library) to fetch additional data from the server and paste it into the page.
So instead of trying to GET the raw HTML, you should inspect the requests in your browser's developer tools. There's a POST request to https://www.camoni.co.il/ajax/tabberChangeTab with form data like this:
tab_name=tab_about
memberAlias=ד-ר-דינה-ראלט-PhD
currentURL=/411788/ד-ר-דינה-ראלט-PhD
The response is HTML that is then pasted into the page.
So instead of just downloading the page, inspect the page and its requests to get the data, or use a headless browser such as Google Chrome to emulate the 'Save As' action and save the data.
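Replaying that POST with requests might look like the following sketch (the endpoint and form fields are the ones observed in developer tools; the function name is mine):

```python
import requests

def fetch_tab(member_alias, current_url, tab_name="tab_about"):
    """Replay the AJAX call the page makes when a tab is opened."""
    data = {
        "tab_name": tab_name,
        "memberAlias": member_alias,
        "currentURL": current_url,
    }
    r = requests.post("https://www.camoni.co.il/ajax/tabberChangeTab",
                      data=data, timeout=30)
    r.raise_for_status()
    return r.text  # the HTML fragment the page would insert
```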

Python Selenium with BeautifulSoup: PHP redirect removes useful information from the URL. How to fix?

I am trying to use Python Selenium with BeautifulSoup to scrape data off a PHP-enabled website.
But the site does an immediate redirect:
<html>
<head>
<meta content="0;url=index.php" http-equiv="refresh"/>
</head>
<body>
<p>Redirecting to TestRail ..</p>
</body>
</html>
... when I just give the URL "https://mysite.thing.com"
When I change it to: "https://mysite.thing.com/index.php" ... I get a 404 error.
How to get around this? Any suggestions appreciated!
I think it's because PHP-generated webpages are created on the fly with a randomly generated token, so going directly to index.php takes you nowhere because your 'token' is nil. I would go through the motions in Selenium and navigate the page as if you were doing it by hand, instead of trying to skip ahead.
I could be totally wrong about the PHP thing BTW, it's a vague memory...
It worked to use this simpler code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import ssl

# Disable certificate verification for HTTPS requests made by Python
ssl._create_default_https_context = ssl._create_unverified_context

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

urlpage = "https://my.site.com"
print(urlpage)
driver.get(urlpage)
html = driver.page_source
print(html)
This follows the redirect and does what I expect.

Selenium raw page source

I am trying to get the source code of a particular site with the help of Selenium with:
Python code:
driver.page_source
But it returns it after it has been encoded.
The raw file:
<html>
<head>
<title>AAAAAAAA</title>
</head>
<body>
</body>
</html>
When I press 'View page source' in Chrome, I see the correct raw source without encoding.
How can this be achieved?
You can try using JavaScript instead of the Python built-in code to get the page source:
javascriptPageSource = driver.execute_script("return document.body.outerHTML;")
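Note that document.body.outerHTML only returns the <body> element; if you want the whole document, document.documentElement.outerHTML should do it. A small helper, assuming an already-open WebDriver:

```python
def get_full_source(driver):
    """Return the entire rendered DOM (the <html> element), not just <body>."""
    return driver.execute_script("return document.documentElement.outerHTML;")
```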

How to get all visible text in a web page (not html source)?

For example, I'd want to get the text shown at "www.google.com" just as if I opened it in Chrome and pressed Ctrl+A and Ctrl+C:
..
Google PrivacyTermsSettingsAdvertisingBusinessAboutHow Search works
instead of:
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>Google</title><script nonce="kYKSVIWLPxNkDhoVCq276A==">(function(){window.google={kEI:'ZqUZXruXDNfT-
...
I have tried the requests_html module like below:
import requests_html
s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
print(page.html.text)
but it still shows the HTML like below:
Google
(function(){window.google={kEI:'y6cZXu3LJ8SkwAPWz6KIBA',kEXPI:'31',authuser:0,kGL:'ZZ',kBL:'JGpW'};google.sn='webhp';google.kHL='en';google.jsfs='Ffpdje';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};
...
So, how could I get all the text shown on the page, as if pressing Ctrl+A and Ctrl+C?
Thanks.
There are several ways of doing this, but the one I usually use is:
from bs4 import BeautifulSoup as bs
import requests_html

s = requests_html.HTMLSession()
page = s.get('https://www.google.com')
soup = bs(page.text, 'lxml')
print(soup.get_text())
Output:
About Store GmailImagesSign in Remove Report inappropriate predictions PrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch HelpSend feedbackAdvertisingBusiness How Search works
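If stray JavaScript still leaks into the text (as in the question's output), that's because get_text() also returns the contents of <script> tags. A variant that removes non-visible tags first; it uses the stdlib html.parser so no extra parser is required:

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Extract human-visible text: drop script/style, collapse whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove the tag and its contents in place
    return " ".join(soup.get_text(separator=" ").split())
```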
