Extract URL from webpage and save to disk

Extract URL from webpage and save to disk - python

I am trying to write a script to automaotmcally query sci-hub.io with an article's title and save a PDF copy of the articles full text to my computer with a specific file name.
To do this I have written the following code:
url = "http://sci-hub.io/"
data = read_csv("C:\\Users\\Sangeeta's\\Downloads\\distillersr_export (1).csv")
for index, row in data.iterrows():
try:
print('http://sci-hub.io/' + str(row['DOI']))
res = requests.get('http://sci-hub.io/' + str(row['DOI']))
print(res.content)
except:
print('NO DOI: ' + str(row['ref']))
This opens a CSV file with a list of DOI's and names of the file to be saved. For each DOI, it then queries sci-hub.io for the full-text. The presented page embeds the PDF in however I am now unsure how to extract the URL for the PDF and save it to disk.
An example of the page can be seen in the image below:
In this image, the desired URL is http://dacemirror.sci-hub.io/journal-article/3a257a9ec768d1c3d80c066186aba421/pajno2010.pdf.
How can I automatically extract this URL and then save the PDF file to disk?
When I print res.content, I get this:
b'<!DOCTYPE html>\n<html>\n <head>\n <title></title>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width">\n </head>\n <body>\n <style type = "text/css">\n body {background-color:#F0F0F0}\n div {overflow: hidden; position: absolute;}\n #top {top:0;left:0;width:100%;height:50px;font-size:14px} /* 40px */\n #content {top:50px;left:0;bottom:0;width:100%}\n p {margin:0;padding:10px}\n a {font-size:12px;font-family:sans-serif}\n a.target {font-weight:normal;color:green;margin-left:10px}\n a.reopen {font-weight:normal;color:blue;text-decoration:none;margin-left:10px}\n iframe {width:100%;height:100%}\n \n p.agitation {padding-top:5px;font-size:20px;text-align:center}\n p.agitation a {font-size:20px;text-decoration:none;color:green}\n\n .banner {position:absolute;z-index:9999;top:400px;left:0px;width:300px;height:225px;\n border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px}\n .banner img {border:0}\n \n p.donate {padding:0;margin:0;padding-top:5px;text-align:center;background:green;height:40px}\n p.donate a {color:white;font-weight:bold;text-decoration:none;font-size:20px}\n\n #save {position:absolute;z-index:9999;top:180px;left:8px;width:210px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n #save a {text-decoration:none;color:white;font-size:inherit;color:#666}\n\n #save p { margin: 0; padding: 0; margin-top: 8px}\n\n #reload {position:absolute;z-index:9999;top:240px;left:8px;width:210px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n #reload a {text-decoration:none;color:white;font-size:inherit;color:#666}\n\n #reload p { margin: 0; padding: 0; margin-top: 8px}\n\n\n #saveastro {position:absolute;z-index:9999;top:360px;left:8px;width:230px;height:70px;\n border-radius: 4px; border: solid 1px #ccc; background: white; text-align:center}\n #saveastro p { margin: 0; padding: 0; margin-top: 16px}\n \n \n #donate {position:absolute;z-index:9999;top:170px;right:16px;width:220px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:white;color:#333}\n \n #donate a {text-decoration:none;color:green;font-size:inherit}\n\n #donatein {position:absolute;z-index:9999;top:220px;right:16px;width:220px;height:36px;\n border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n text-align:center;font-size:18px;background:green;color:#333}\n\n #donatein a {text-decoration:none;color:white;font-size:inherit}\n \n #banner {position:absolute;z-index:9999;top:50%;left:45px;width:250px;height:250px; padding: 0; border: solid 1px white; border-radius: 4px}\n \n </style>\n \n \n \n <script type = "text/javascript">\n window.onload = function() {\n var url = document.getElementById(\'url\');\n if (url.innerHTML.length > 77)\n url.innerHTML = url.innerHTML.substring(0,77) + \'...\';\n };\n </script>\n <div id = "top">\n \n <p class="agitation" style = "padding-top:12px">\n \xd0\xa1\xd1\x82\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x87\xd0\xba\xd0\xb0 \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82\xd0\xb0 Sci-Hub \xd0\xb2 \xd1\x81\xd0\xbe\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd1\x85 \xd1\x81\xd0\xb5\xd1\x82\xd1\x8f\xd1\x85 \xe2\x86\x92 <a target="_blank" href="https://vk.com/sci_hub">vk.com/sci_hub</a>\n </p>\n \n </div>\n \n <div id = "content">\n <iframe src = "http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf" id = "pdf"></iframe>\n </div>\n \n <div id = "donate">\n <p><a target = "_blank" href = "//sci-hub.io/donate">\xd0\xbf\xd0\xbe\xd0\xb4\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb6\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82 →</a></p>\n </div>\n <div id = "donatein">\n <p><a target = "_blank" href = "//sci-hub.io/donate">support the project →</a></p>\n </div>\n <div id = "save">\n <p>\xe2\x87\xa3 \xd1\x81\xd0\xbe\xd1\x85\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c\xd1\x8e</p>\n </div>\n <div id = "reload">\n <p>↻ \xd1\x81\xd0\xba\xd0\xb0\xd1\x87\xd0\xb0\xd1\x82\xd1\x8c \xd0\xb7\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbe</p>\n </div>\n \n \n<!-- Yandex.Metrika counter --> <script type="text/javascript"> (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter10183018 = new Ya.Metrika({ id:10183018, clickmap:true, trackLinks:true, accurateTrackBounce:true, ut:"noindex" }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/10183018?ut=noindex" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->\n </body>\n</html>\n'
Which does include the URL, however I am unsure how to extract it.
Update:
I am now able to extract the URL but when I try to access the page with the PDF (through urllib.request) I get a 403 response even though the URL is valid. Any ideas on why and how to fix? (I am able to access through my browser so not IP blocked)

You can use urllib library to access the html of the page and even download files, and regex to find the url of the file you want to download.
import urllib
import re
site = urllib.urlopen(".../index.html")
data = site.read() # turns the contents of the site into a string
files = re.findall('(http|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?(.pdf)', data) # finds the url
for file in files:
urllib.urlretrieve(file, filepath) # "filepath" is where you want to save it

Here is the Solution:-
url = re.search('<iframe src = "\s*([^"]+)"', res.content)
url.group(1)
urllib.urlretrieve(url.group(1),'C:/.../Docs/test.pdf')
I ran it and it is working :)
For Python 3:
Change urrlib.urlretrive to urllib.request.urlretrieve

You can do it with a clunky code requiring selenium, requests and scrapy.
Use selenium to request either an article title or DOI.
>>> from selenium import webdriver
>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element_by_name('request')
>>> input_box.send_keys('amazing scientific results\n')
An article by the title 'amazing scientific results' doesn't seem to exist. As a result, the site returns a diagnostic page in the browser window which we can ignore. It also puts 'http://sci-hub.io/' in webdriver's current_url property. This is helpful because it's an indication that the requested result isn't available.
>>> driver.current_url
'http://sci-hub.io/'
Let's try again, looking for the item that you know exists.
>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element_by_name('request')
>>> input_box.send_keys('DOI: 10.1016/j.anai.2016.01.022\n')
>>> driver.current_url
'http://sci-hub.io/10.1016/j.anai.2016.01.022'
This time the site returns a distinctive url. Unfortunately, if we load this using selenium we will get the pdf and, unless you're more able than I am, you will find it difficult to download this to a file on your machine.
Instead, I download it using the requests library. Loaded in this form you will find that the url of the pdf becomes visible in the HTML.
>>> import requests
>>> r = requests.get(driver.current_url)
To ferret out the url I use scrapy.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=r.text)
>>> pdf_url = selector.xpath('.//iframe/#src')[0].extract()
Finally I use requests again to download the pdf so that I can save it to a conveniently named file on local storage.
>>> r = requests.get(pdf_url).content
>>> open('article_name', 'wb').write(r)
211853

I solved this using a combination of the answers above - namely SBO7 & Roxerg.
I use the following to extract the URL from the page and then download the PDF:
res = requests.get('http://sci-hub.io/' + str(row['DOI']))
useful = BeautifulSoup(res.content, "html5lib").find_all("iframe")
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(useful[0]))
response = requests.get(urls[0])
with open("C:\\Users\\Sangeeta's\\Downloads\\ref\\" + str(row['ref']) + '.pdf', 'wb') as fw:
fw.write(response.content)
Note: This will not work for all articles - some link to webpages (example) and this doesn't correctly work for those.

Related

Scrape "Button" tag with Selenium

import requests
from selenium import webdriver
import bs4
PATH = 'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
oLat = 33.8026087
oLong = -84.3369491999999
dLat = 33.79149
dLong = -84.32312
url = "https://ride.lyft.com/ridetype?origin=" + str(oLat) + "%2C" + str(oLong) + "&destination=" + str(dLat) + "%2C" + str(dLong) + "&ride_type=&offerProductId=standard"
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
print(soup)
print(url)
Here is my code currently. I am trying to scrape the lyft price estimate.
The data is in the "button" tag. This does not show up in the html from the code I provided above. How can I get this data to show up?
import requests
from selenium import webdriver
import bs4
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
PATH = 'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
oLat = 33.7885662
oLong = -84.326684
dLat = 33.4486296
dLong = -84.4550443
url = "https://ride.lyft.com/ridetype?origin=" + str(oLat) + "%2C" + str(oLong) + "&destination=" + str(dLat) + "%2C" + str(dLong) + "&ride_type=&offerProductId=standard"
driver.get(url)
spanThing = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR , "span.sc-7e9e68d9-0 lctkqn")))
print(spanThing)
driver.quit()
I tried this additional code, but it doesn't find the span and class for some reason. I'm not sure why

To extract the Page Source you need to induce WebDriverWait for the visibility_of_element_located() of a static element and you can use the following locator strategies:
oLat = 33.8026087
oLong = -84.3369491999999
dLat = 33.79149
dLong = -84.32312
url = "https://ride.lyft.com/ridetype?origin=" + str(oLat) + "%2C" + str(oLong) + "&destination=" + str(dLat) + "%2C" + str(dLong) + "&ride_type=&offerProductId=standard"
driver.get(url)
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[contains(., 'Sign up / Log in to request ride')]")))
print(driver.page_source)
driver.quit()
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
<html lang="en-US" class="js-focus-visible" data-js-focus-visible=""><head><meta name="viewport" content="width=device-width"><script type="module">
if (window.performance) {
const toSnake = (str) => str.replace(/([A-Z])/g, function($1) {return '_' + $1.toLowerCase();});
const measure = () => {
const { timing } = window.performance;
if (!timing.navigationStart) return;
const al = [
'event_name','sending_service','connection_end','connection_start','dom_complete',
'dom_content_loaded_event_end','dom_content_loaded_event_start','dom_interactive',
'dom_loading','domain_lookup_end','domain_lookup_start','fetch_start','load_event_end',
'load_event_start','navigation_start','redirect_end','redirect_start','request_start',
'response_end','response_start','secure_connection_start','unload_event_end',
'unload_event_start','connect_start','connect_end','ms_first_paint','source','uri_path',
'request_end','code','track_id','uri_href'
];
const { href = '', pathname = '' } = window.location;
const sE = { event_name: 'navigation_timing_absolute', uri_href: href, uri_path: pathname, sending_service: 'riderweb', source: 'riderweb' };
for (let eN in timing) {
const sEN = toSnake(eN);
if (al.includes(sEN)) { sE[sEN] = timing[eN]; }
}
// iOS 11 supports ES modules, but sendBeacon not available until 11.3.
if (navigator.sendBeacon) {
navigator.sendBeacon('https://www.lyft.com/api/track', JSON.stringify(sE));
}
};
try {
if (document.readyState === 'complete') {
measure();
} else {
window.addEventListener('load', measure);
}
} catch(e) {}
}
</script><script>
var _i18n_extends = Object.assign || function (target) { for (var i = 1; i < arguments.length; i++) { var source = arguments[i]; for (var key in source) { if (Object.prototype.hasOwnProperty.call(source, key)) { target[key] = source[key]; } } } return target; };
;if(!window.__TRANSLATIONS__) window.__TRANSLATIONS__ = {};
window.__TRANSLATIONS__.locale = "en-US";
window.__TRANSLATIONS__.bundleName = "common";
if (!window.__TRANSLATIONS__.data) window.__TRANSLATIONS__.data = {};
_i18n_extends(window.__TRANSLATIONS__.data, {"%;":{"s":"OK"},"#":{"s":"Sorry, we can't find that page"},"$":{"s":"Sorry, there was an error"},"%":{"s":"Back"},"A":{"s":"No tip"},"T":{"s":"Lyft: Request a ride on the web"},"p":{"s":"Current location"},"q":{"s":"You set your pickup as \"Your Location\"{originatingAppMsg}"},"r":{"s":" in Google Maps"},"s":{"s":"To use the same pickup location, Lyft needs access to your current location."},"t":{"s":"Share your location"},"u":{"s":"Location sharing is denied"},"w":{"s":"Submit"},"x":{"s":"Save"},"y":{"s":"Confirm"},"z":{"s":"Unknown error"},"{":{"s":"Close"},"|":{"s":"Cancel"},"}":{"s":"Edit"},"~":{"s":"Delete"},"! ":{"s":"Done"},"!!":{"s":"Log out"},"!#":{"s":"Are you sure you want to log out?"},"!%":{"s":"Payment defaults"},"!&":{"s":"Add a payment method to get started."},"!(":{"s":"Add new card"},"!)":{"s":"Could not update payment method"},"!*":{"s":"Payment"},"!+":{"s":"manage your payment methods"},"!,":{"s":"Payment method"},"!-":{"s":"Card failed!"},"!.":{"s":"Payment method not supported on ride.lyft.com."},"!\u002F":{"s":"Payment method updated across Lyft apps."},"!0":{"s":"You cannot delete your only valid payment method."},"!1":{"s":"Gift cards"},"!2":{"s":"redeem gift cards"},"!3":{"s":"This field is required"},"!4":{"s":"Something went wrong. Please try again."},"!5":{"s":"Click to log out or switch accounts"},"!6":{"s":"Go back"},"!Z":{"s":"Schedule"},"!k":{"s":"schedule a ride"},"(6":{"s":"Ride"},"(7":{"s":"Rent"},"(8":{"s":"Rent a car through Lyft or our partner Sixt"},"(9":{"s":"Help"},"(:":{"s":"Business"},"(;":{"s":"Upcoming rides"},"(\u003C":{"s":"Install on Phone"},"(=":{"s":"Sign up \u002F Log in"},"(\u003E":{"s":"Log in"},"(l":{"s":"Install app"},"(m":{"s":"Free"},")z":{"s":"Not now"},"){":{"s":"Get the Lyft app"},")|":{"s":"More travel options from the palm of your hand"},")}":{"s":"From bikes to rentals and everything in between. If it gets you there, it's in the app."},"*\u003E":{"s":"Install on Desktop"},"*?":{"s":"Install on Desktop. It's free and takes up no space on your device"},"*C":{"s":"Text me a link"},"*D":{"s":"We'll send you a text with a link to download the app."},"*E":{"s":"Enter mobile phone number"},"*F":{"s":"Phone invalid"},"*G":{"s":"Refresh"},"*H":{"s":"An update is available"},",+":{"s":"View profile"},",,":{"s":"Get a ride"},",-":{"s":"Rides"},",.":{"s":"Gift cards"},",\u002F":{"s":"Promos"},",0":{"s":"Donate"},",1":{"s":"Invite friends"},",2":{"s":"Help"},",3":{"s":"Settings"},",4":{"s":"Safety Tools"},",5":{"s":"Lyft Rentals"},")d":{"s":"Log in \u002F Sign up"},")e":{"s":"You will need to log in to {action}!"},")f":{"s":"Log in"},")g":{"s":"Cancel"},"a":{"s":"Lyft and OpenStreetMap watermark"},"#L":{"s":"add promotions"},"&^":{"s":"Just now"},"&`.zero":{"s":"{minutes} minutes ago"},"&_.one":{"s":"{minutes} minute ago"},"&`.two":{"s":"{minutes} minutes ago"},"&`.few":{"s":"{minutes} minutes ago"},"&`.many":{"s":"{minutes} minutes ago"},"&`.other":{"s":"{minutes} minutes ago"},"&b.zero":{"s":"{hours} hours ago"},"&a.one":{"s":"{hours} hour ago"},"&b.two":{"s":"{hours} hours ago"},"&b.few":{"s":"{hours} hours ago"},"&b.many":{"s":"{hours} hours ago"},"&b.other":{"s":"{hours} hours ago"},"&d.zero":{"s":"{days} days ago"},"&c.one":{"s":"{days} day ago"},"&d.two":{"s":"{days} days ago"},"&d.few":{"s":"{days} days ago"},"&d.many":{"s":"{days} days ago"},"&d.other":{"s":"{days} days ago"},"&e":{"s":"Less than a minute"},"&g.zero":{"s":"{minutes} Total minutes"},"&f.one":{"s":"{minutes} Total minute"},"&g.two":{"s":"{minutes} Total minutes"},"&g.few":{"s":"{minutes} Total minutes"},"&g.many":{"s":"{minutes} Total minutes"},"&g.other":{"s":"{minutes} Total minutes"},"&i.zero":{"s":"{hours} Total hours"},"&h.one":{"s":"{hours} Total hour"},"&i.two":{"s":"{hours} Total hours"},"&i.few":{"s":"{hours} Total hours"},"&i.many":{"s":"{hours} Total hours"},"&i.other":{"s":"{hours} Total hours"},"&k.zero":{"s":"{days} Total days"},"&j.one":{"s":"{days} Total day"},"&k.two":{"s":"{days} Total days"},"&k.few":{"s":"{days} Total days"},"&k.many":{"s":"{days} Total days"},"&k.other":{"s":"{days} Total days"},"(a":{"s":"Any fare exceeding your Lyft Cash balance will be charged to your default payment method."},"(|":{"s":"Total"},"(}":{"s":"You'll pay this price unless you add a stop, change your destination, or if credit expires."},"(~":{"s":"This is an estimated range for your trip."},") ":{"s":"\u003CLink\u003ELog in\u003C\u002FLink\u003E or sign up to lock in your price and request a ride."},")?":{"s":"Driver Name:"},")#":{"s":"Driver's car image"},")A":{"s":"License Plate Number:"},")B":{"s":"Pick up"},")C":{"s":"Picked up"},")D":{"s":"Drop-off"},")E":{"s":"Dropped off"},")F":{"s":"Current location"},")c":{"s":"Close banner"},")y":{"s":"Riders"},"*I":{"s":"Add card"},"*J":{"s":"Edit {cardLabel}"},"+2":{"s":"$10"},"+3":{"s":"$8"},"+4":{"s":"$10"},"+5":{"s":"Unlimited 180-min classic rides for 24 hours"},"+6":{"s":"$15"},"+7":{"s":"Unlimited 30-min classic rides for 24 hours"},"+8":{"s":"Your payment info will be stored securely."},",#":{"s":"Please follow \u003CSupportLink\u003Ethese instructions\u003C\u002FSupportLink\u003E to allow this site to show notifications."},",$":{"s":"Notifications are blocked"},",\u003C":{"s":"Session expired"},",=":{"s":"You have been logged out. Please log back in to continue."},"!u":{"s":"Click to edit your pickup location"},"%\u002F":{"s":"You must \u003CLink\u003Elog in\u003C\u002FLink\u003E to {action}."},"&J":{"s":"Something went wrong. Unable to load your referral history. Please try again."},"7f523512b795a02fd9b9b05a1e22ff9b":{"s":"Card number"},"3effb3a930ea2ce61705bffc624e19b6":{"s":"Expiration"},"755c8f863223ae3f7ac0ac1cfe8b3072":{"s":"Name on card"},"22b715147b81b76566fa183406659069":{"s":"Country"},"4b3d5e03b24b6bbc630d15ad2251755f":{"s":"Billing address"},"e0a8872668d31bb76156a8d80a5d7a6c":{"s":"City"},"f420cf2cf310bbff1ead064745e66ec1":{"s":"State"},"8e9d206ff46216065a42a3953a63bd9f":{"s":"Province \u002F Territory"},"9dca7ddd59d7aca64aae58c7a99e16ce":{"s":"State \u002F Province"},"50be4be10369e747d757e7b2db2c9ed3":{"s":"Zip code"},"11ceb56a912fd18cc9ea1054c5405c13":{"s":"Postal code"},"5a0a89ab4fd1ceebfd9f68b88d27e685":{"s":"Save"},"45c9b92858c6ce6b50c1967661063ae8":{"s":"Cancel"},"29fc403cabcebe790ddd09c592f7e7cd":{"s":"There was a problem reading your card details. Please try again."},"1ae24aeff3771f629b2f865074b68050":{"s":"You must be logged in to add a payment method."},"275c89584bcddfbf0019d8d5a2ce6128":{"s":"You must be logged in to edit a payment method."},"2a420e791e0ec6d47cb64d5fab8376a9":{"s":"Please fill out all required fields"},"a966a08942254351695c6993e781301e":{"s":"Something went wrong. Please check your information and try again"}});
</script><meta charset="utf-8"><meta content="IE=Edge" http-equiv="X-UA-Compatible"><meta name="google" content="notranslate"><meta http-equiv="Accept-CH" content="DPR, Viewport-Width, Width, Downlink, Save-Data, Content-DPR"><link rel="home" href="https://ride.lyft.com"><link rel="canonical" href="https://ride.lyft.com"><link rel="icon" href="https://cdn.lyft.com/static/www-meta-assets/favicon.ico"><link rel="shortcut icon" sizes="192x192" href="https://cdn.lyft.com/static/riderweb/images/icons/icon-192x192.png"><link rel="apple-touch-startup-image" href="https://cdn.lyft.com/static/riderweb/images/icons/icon-192x192.png"><link rel="apple-touch-icon" href="https://cdn.lyft.com/static/riderweb/images/icons/icon-192x192.png"><meta name="apple-mobile-web-app-capable" content="yes"><meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"><meta property="og:title" content="Lyft: Request a ride on the web"><meta property="og:url" content="https://ride.lyft.com"><meta name="twitter:card" content="summary_large_image"><meta name="twitter:site" content="#lyft"><meta name="msapplication-starturl" content="https://ride.lyft.com"><link rel="stylesheet" href="https://cdn.lyft.com/coreui/base.4.6.5.css"><meta name="google-site-verification" content="V9fk-oLTj9Ewu7Kc6Vetf94qp8HZ3gfjxFMkn8LmZ3Y"><link rel="manifest" href="/manifest.json" crossorigin="use-credentials"><meta name="theme-color" content="#FFFFFF"><meta name="description" content="Request a Lyft ride in a web browser on your phone, tablet, or laptop – no app download required. Get a ride from a friendly driver in minutes."><meta property="og:description" content="Request a Lyft ride in a web browser on your phone, tablet, or laptop – no app download required. Get a ride from a friendly driver in minutes."><meta property="og:image" content="/images/share.png">
.
,
<next-route-announcer><p aria-live="assertive" id="__next-route-announcer__" role="alert" style="border: 0px; clip: rect(0px, 0px, 0px, 0px); height: 1px; margin: -1px; overflow: hidden; padding: 0px; position: absolute; width: 1px; white-space: nowrap; overflow-wrap: normal;"></p></next-route-announcer><iframe name="__privateStripeMetricsController9540" frameborder="0" allowtransparency="true" scrolling="no" role="presentation" allow="payment *" src="https://js.stripe.com/v3/m-outer-93afeeb17bc37e711759584dbfc50d47.html#url=https%3A%2F%2Fride.lyft.com%2Fridetype%3Forigin%3D33.8026087%252C-84.3369491999999%26destination%3D33.79149%252C-84.32312%26ride_type%3D%26offerProductId%3Dstandard&title=Lyft%3A%20Price%20estimate&referrer=&muid=NA&sid=NA&version=6&preview=false" aria-hidden="true" tabindex="-1" style="border: none !important; margin: 0px !important; padding: 0px !important; width: 1px !important; min-width: 100% !important; overflow: hidden !important; display: block !important; visibility: hidden !important; position: fixed !important; height: 1px !important; pointer-events: none !important; user-select: none !important;"></iframe></body></html>

Selecting all values within tag from webpage during selector

I want to select all values that belong to a specific tag using the selector function, however it seems to return only the first value given that the tag is repeated more than once. Shouldn't it select all the values under all the tags when repeated?
For example:
url = 'https://findthatlocation.com/film-title/a-knights-tale'
r = requests.get(url[1])
soup = BeautifulSoup(r.content, 'lxml')
street = list(soup.select_one("div[style='color: #999; font-size: 12px; margin-bottom: 5px;']").stripped_strings)
#or
st = soup.find('div', {'style':'color: #999; font-size: 12px; margin-bottom: 5px;'})
Only returns:
#street.text.strip()
['Prague,']
st.text.strip()
Prague,
However, more than one of the tag appears in the webpage, so I was expecting something like this:
#when using street.text.strip()
['Prague,', 'Prague Castle, Prague']

Use .select, not .select_one:
import requests
from bs4 import BeautifulSoup
url = "https://findthatlocation.com/film-title/a-knights-tale"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
out = [d.get_text(strip=True) for d in soup.select("h3 + div")]
print(out)
Prints:
['Prague,', 'Prague Castle, Prague']

code:
url = 'https://findthatlocation.com/film-title/a-knights-tale'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# use select instead select_one
st = soup.select("div[style='color: #999; font-size: 12px; margin-bottom: 5px;']")
box=[]
for i in st:
box.append(i.text.strip())
box
return:
['Prague,', 'Prague Castle, Prague']

How to make a dataframe download through browser using python

I have a function, which generates a dataframe, which I am exporting as an excel sheet, at the end of the function.
df.to_excel('response.xlsx')
This excel file is being saved in my working directory.
Now I'm hosting this in Streamlit on heroku as a web app, but I want this excel file to be downloaded in user's local disk (a normal browser download) once this function is called. Is there a way to do it ?

Snehan Kekre, from streamlit, wrote the following solution in this thread.
streamlit as st
import pandas as pd
import io
import base64
import os
import json
import pickle
import uuid
import re
def download_button(object_to_download, download_filename, button_text, pickle_it=False):
"""
Generates a link to download the given object_to_download.
Params:
------
object_to_download: The object to be downloaded.
download_filename (str): filename and extension of file. e.g. mydata.csv,
some_txt_output.txt download_link_text (str): Text to display for download
link.
button_text (str): Text to display on download button (e.g. 'click here to download file')
pickle_it (bool): If True, pickle file.
Returns:
-------
(str): the anchor tag to download object_to_download
Examples:
--------
download_link(your_df, 'YOUR_DF.csv', 'Click to download data!')
download_link(your_str, 'YOUR_STRING.txt', 'Click to download text!')
"""
if pickle_it:
try:
object_to_download = pickle.dumps(object_to_download)
except pickle.PicklingError as e:
st.write(e)
return None
else:
if isinstance(object_to_download, bytes):
pass
elif isinstance(object_to_download, pd.DataFrame):
#object_to_download = object_to_download.to_csv(index=False)
towrite = io.BytesIO()
object_to_download = object_to_download.to_excel(towrite, encoding='utf-8', index=False, header=True)
towrite.seek(0)
# Try JSON encode for everything else
else:
object_to_download = json.dumps(object_to_download)
try:
# some strings <-> bytes conversions necessary here
b64 = base64.b64encode(object_to_download.encode()).decode()
except AttributeError as e:
b64 = base64.b64encode(towrite.read()).decode()
button_uuid = str(uuid.uuid4()).replace('-', '')
button_id = re.sub('\d+', '', button_uuid)
custom_css = f"""
<style>
#{button_id} {{
display: inline-flex;
align-items: center;
justify-content: center;
background-color: rgb(255, 255, 255);
color: rgb(38, 39, 48);
padding: .25rem .75rem;
position: relative;
text-decoration: none;
border-radius: 4px;
border-width: 1px;
border-style: solid;
border-color: rgb(230, 234, 241);
border-image: initial;
}}
#{button_id}:hover {{
border-color: rgb(246, 51, 102);
color: rgb(246, 51, 102);
}}
#{button_id}:active {{
box-shadow: none;
background-color: rgb(246, 51, 102);
color: white;
}}
</style> """
dl_link = custom_css + f'<a download="{download_filename}" id="{button_id}" href="data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,{b64}">{button_text}</a><br></br>'
return dl_link
vals= ['A','B','C']
df= pd.DataFrame(vals, columns=["Title"])
filename = 'my-dataframe.xlsx'
download_button_str = download_button(df, filename, f'Click here to download {filename}', pickle_it=False)
st.markdown(download_button_str, unsafe_allow_html=True)
I'd recommend searching the thread on that discussion forum. There seem to be at least 3-4 alternatives to this code.

Mark Madson has this workaround posted on github. I lifted it from the repo and am pasting here as an answer.
import base64
import pandas as pd
import streamlit as st
def st_csv_download_button(df):
csv = df.to_csv(index=False) #if no filename is given, a string is returned
b64 = base64.b64encode(csv.encode()).decode()
href = f'Download CSV File'
st.markdown(href, unsafe_allow_html=True)
usage:
st_csv_download_button(my_data_frame)
right click + save-as.
I think you can do the same by doing to_excel instead of to_csv.

Python sorting html table

I am looping through a list of servers and connecting with OpenSSL, to retrieve the SSL cert, and grabbing the server name, the date the cert expires, and calculating the number of days until cert expires. I am then building an html table with the data. The columns are Host, Hostname, Expiration Date, and Remaining Days. What is the best way to sort the table by the "Remaining Days" column?
# Update the hosts entry
ssl_results[str(ip)][0] = host
ssl_results[str(ip)][1] = server_name
ssl_results[str(ip)][2] = exp_date
ssl_results[str(ip)][3] = days_to_expire
# Loop through the ssl_results entries and generate a email + results file
try:
# variable to hold html for email
SSLCertificates = """<html>
<head>
<style>
table{width: 1024px;}
table, th, td {
border: 1px solid black;
border-collapse: collapse;
}
th, td {
padding: 5px;
text-align: left;
}
ul:before{
content:attr(data-header);
font-size:120%;
font-weight:bold;
margin-left:-15px;
}
</style>
</head>
<body>
<p><h2>Blah, </h2>
<h3>SSL Expiration Summary:</h3>
<span style="color:red;"><b>Blah Blah Blah.<b></span><br><br>
<table id=\"exp_ssls\"><tr><th>Host</th><th>Hostname</th><th>Expiration Date</th><th>Remaining Days</th></tr>
"""
for entries in ssl_results:
SSLCertificates += "<tr><td>" + str(entries) + "</td><td>" + str(ssl_results[entries][1]) + "</td><td>" + str(
ssl_results[entries][2]) + "</td><td>" + str(ssl_results[entries][3]) + "</td></tr>"
SSLCertificates += """</body>
</html>"""
f = open('SSLCertificates.html', 'w')
f.write(SSLCertificates)
f.close()
filename = 'SSLCertificates.html'
attachment = open(filename, 'rb')

Sort the dict before you form the html tags. then Iterate thru the dict and print it using html tags. Use sorted() to sort your dict before you iterate thru it.
import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(1))
sorted_x will be a list of tuples sorted by the second element in each tuple. dict(sorted_x) == x.

Error with encode / decode in html generated for outlook

I have 2 functions, the first prepares the html and writes to a .txt file. The second function opens this file and manages an email through Outlook.
In the body of the message, the content of this html will be placed with the correct formatting. Everything goes perfectly, the .txt comes with the html without any errors, but when Outlook is opening it, it is closed and the Error / Exception is generated below:
'ascii' codec can't encode character u'\xe7' in position 529: ordinal not in range(128)
I know that this \xe7 is ç, but I can not solve it, I've already tried to define it by .decode("utf-8") and .encode("utf-8"), in the variable email_html_leitura, but the error of codec persists. Follow the code of the 2 functions to see if I did something wrong:
Function 1:
import sys
import codecs
import os.path
def gerar_html_do_email(self):
texto_solic = u'Solicitação Grupo '
with codecs.open('html.txt', 'w+', encoding='utf8') as email_html:
try:
for k, v in self.dicionario.iteritems():
email_html.write('<h1>'+k+'</h1>'+'\n')
for v1 in v:
if (v1 in u'Consulte o documento de orientação.') or (v1 in u'Confira o documento de orientação.'):
for x, z in self.tit_nome_pdf.iteritems():
if x in k:
email_html.write('<a href='+'%s/%s'%(self.hiperlink,z+'>')+'Mais detalhes'+'</a>'+'\n')
else:
email_html.write('<p>'+v1+'</p>'+'\n')
email_html.write('<p><i>'+texto_solic+'</i></p>'+'\n')
email_html.close()
except Exception as erro:
self.log.write('gerar_html_para_o_email: \n%s\n'%erro)
Function 2:
def gerar_email(self):
import win32com.client as com
try:
outlook = com.Dispatch("Outlook.Application")
mail = outlook.CreateItem(0)
mail.To = u"Lista Liberação de Versões Sistema"
mail.CC = u"Lista GCO"
mail.Subject = u"Atualização Semanal Sistema Acrool"
with codecs.open('html.txt', 'r+', encoding='utf8') as email_html_leitura:
mail.HTMLBody = """
<html>
<head></head>
<body>
<style type=text/css>
h1{
text-align: center;
font-family: "Arial";
font-size: 1.1em;
font-weight: bold;
}
p{
text-align: justify;
font-family: "Arial";
font-size: 1.1em;
}
a{
font-family: "Arial";
font-size: 1.1em;
}
</style>
%s
</body>
</html>
"""%(email_html_leitura.read().decode("utf-8"))
email_html_leitura.close()
mail.BodyFormat = 2
mail.Display(True)
except Exception as erro:
self.log.write('gerar_email: \n%s\n'%erro)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract URL from webpage and save to disk - python

Here is the Solution:- url = re.search('<iframe src = "\s*([^"]+)"', res.content) url.group(1) urllib.urlretrieve(url.group(1),'C:/.../Docs/test.pdf') I ran it and it is working :) For Python 3: Change urrlib.urlretrive to urllib.request.urlretrieve

Related

Scrape "Button" tag with Selenium

Selecting all values within tag from webpage during selector

How to make a dataframe download through browser using python

Python sorting html table

Error with encode / decode in html generated for outlook

Categories

Resources