I'm trying to download files from a website using the Python requests module and beautifulsoup4, but the problem is that you have to wait five seconds before the download button appears.
I tried using requests.get('URL') to fetch the page and then parsing it with beautifulsoup4 to get the download link. The problem is that you have to wait five seconds (if you were to open the page in an actual browser) for the button to appear, so when I pass the URL to requests.get(), the initial response object doesn't contain the button element. I searched a lot on Google but couldn't find anything that helped.
Is there a way to "refresh" or "wait on" the response object, that is, to update its contents after five seconds as if the page had been opened in a browser?
I don't think this is possible with the requests module. What should I do?
I'm running Windows 10 64-bit.
I'm new so sorry if the formatting is bad. :(
HTTP is stateless; every new request is completely independent of the earlier one. State is typically implemented with cookies, browser storage, and so on. Being a plain HTTP client, requests has no way to "refresh" a response, and the next request will be a completely new request.
What you're looking for is a client that understands JavaScript and can handle page updates automatically. I suggest you look at Selenium, which does browser automation.
Try something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
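Once the wait succeeds (and before driver.quit() runs), you could pull the link out of the button and let requests do the actual download. A rough sketch, assuming the dynamic element is a link whose href points at the file (the filename below is a placeholder):
import requests

# inside the try block, after the wait has returned the element
download_url = element.get_attribute("href")  # assumes the button carries the file URL
r = requests.get(download_url)
with open("downloaded_file.bin", "wb") as f:  # placeholder filename
    f.write(r.content)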
I am trying to download a torrent file with this code:
import requests

url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
with open('test123.torrent', 'wb') as f:
    f.write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says "Unable to Load, Torrent Is Not Valid Bencoding".
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping. I'm sorry to say that bypassing Cloudflare is very hard if you only use requests, and the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript; if it doesn't, Cloudflare won't give you the bytes of the file. That's why your code doesn't work. (You can use r.text to see the response content: it is an HTML page, not a file.)
Under these circumstances, I think you should consider using Selenium.
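A minimal sketch of that route, assuming Chrome and a placeholder download directory (you may still need to wait for Cloudflare's JavaScript check to finish before the file lands):
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/torrents",  # placeholder path
    "download.prompt_for_download": False,
})
driver = webdriver.Chrome(options=options)
driver.get("https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent")
time.sleep(10)  # crude wait for the Cloudflare check and the download; adjust as needed
driver.quit()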
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it for you. Keep in mind that your code may still break in the future, because Cloudflare changes its techniques periodically; if you use a library, you will (hopefully) just need to update the library.
I have only used a similar library in NodeJS, but I see Python also has one: cloudscraper.
Example:
import cloudscraper

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper()  # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
Depending on your usage, you may also need to use proxies: Cloudflare can still block you if you send too many requests.
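Since CloudScraper inherits from requests.Session, proxies can be passed the same way as with requests. A small sketch, with a placeholder proxy address:
import cloudscraper

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",   # placeholder proxy
    "https": "http://user:pass@proxy.example.com:8080",
}
scraper = cloudscraper.create_scraper()
r = scraper.get("https://somesite.com", proxies=proxies)
print(r.status_code)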
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It's a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
You can do this by adding cookies to the request headers, but the cookies expire after some time.
Therefore the only lasting solution is to download the file by driving an actual browser.
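While the cookies are still valid, the approach looks roughly like this. A sketch: cf_clearance is Cloudflare's usual clearance cookie, and both its value and the User-Agent must be copied from the same real browser session:
import requests

cookies = {
    "cf_clearance": "<value copied from your browser>",  # assumed Cloudflare clearance cookie
}
headers = {
    "User-Agent": "<the same User-Agent string as the browser the cookie came from>",
}
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, cookies=cookies, headers=headers)
with open("test123.torrent", "wb") as f:
    f.write(r.content)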
I can get to the point where the video is right in front of me. I need to loop through the URLs and download all of these videos. The issue is that the request is stateful, and I cannot use urllib because authorization issues then occur. How do I target the three dots in the Chrome video viewer and download the file?
All I need now is to be able to download by clicking the download button. I don't know whether it can be done without specifying coordinates. As I said, the URLs follow a pattern and I can generate them; the only issue is the authorization. Please help me get the videos through Selenium.
Note that the video player is rendered by JavaScript, so I cannot easily target the three dots or the download button.
You can get the cookies from the driver and pass them to a requests Session, so you can do the actual download with the requests library.
import requests

# copy the Selenium session's cookies into a requests session
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])

response = s.get(urlDownload, stream=True)  # urlDownload is the video URL you generated
print(response.status_code)
with open(fileName, 'wb') as f:  # fileName is the name to save the video under
    f.write(response.content)
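If the server also checks that the User-Agent matches the browser session (an assumption, but common), you can copy that from the driver as well:
# make the requests session present the same User-Agent as the Selenium browser
s.headers['User-Agent'] = driver.execute_script("return navigator.userAgent")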
You can use Selenium for this (it works in Python 2 as well). Since you have only posted screenshots, I cannot give you exact code, but something like the following will help; you can find the XPath by inspecting the HTML.
import selenium

# replace the placeholder with the real XPath of the three-dots menu, found by inspecting the HTML
driver.find_element_by_xpath('xpath of 3 dots').click()
I ran into some trouble trying to build a Python bot to check the status of an application. For explanation, here is an example of the process:
1.) Visit the website (https://examplewebsite.com/checkinfo.do?a=sample)
2.) Assuming the query string is correct, the website will drop a cookie. This cookie is a 'session cookie' and therefore expires immediately upon closing or navigating away from the webpage.
3.) Once the cookie has been obtained, visit https://examplewebsite.com/step2.do?a=sample
4.) Parse the result of the request to https://examplewebsite.com/step2.do?a=sample
Although this may appear simple, I am having trouble sending the second request to https://examplewebsite.com/step2.do?a=sample and am routinely met with a page saying "session expired".
I have tried to reproduce this process using Python requests, Python requests.Session, as well as dryscrape, but I cannot get it to work. My gut feeling is that, because the cookie is a "session cookie", it is being invalidated immediately when the second request to https://examplewebsite.com/step2.do?a=sample is initiated, but before the page loads.
Perhaps a better way to explain the issue is with browser behavior. Using Firefox and IE, the website behaves as below:
1.) Visit website (https://examplewebsite.com/checkinfo.do?a=sample)
A cookie is successfully obtained
2.) Replace the URL with https://examplewebsite.com/step2.do?a=sample
A "session expired" warning is presented
HOWEVER, this does work:
1.) Visit website (link same as above)
A cookie is successfully obtained
2.) While keeping the first tab open, create a second tab and paste
https://examplewebsite.com/step2.do?a=sample into the browser
The information is correctly loaded and no "session expired" warning is presented.
So my question is: how do I reproduce the behavior of "creating a new tab" within Python, while keeping the session cookie shared between the requests?
Here is my attempt to do it in dryscrape:
import dryscrape
a = dryscrape.Session()
a.set_header("User-Agent", "Firefox")
a.visit('checkinfo.do URL')
b = dryscrape.Session()
b.set_header("User-Agent", "Firefox")
b.set_cookie(a.cookies())  # This is my attempt to share the session cookies in a separate dryscrape object, to simulate putting the URL in a second tab.
b.visit('step2.do URL')
Unfortunately, the above does not work, and a.cookies() does not match b.cookies(), both before and after the second request.
NOTE: The page does have code that ends the session when it is unloaded, so if dryscrape does anything equivalent to unloading the page, the session cookie would be marked invalid by the server.
I have also run into a similar problem.
The problem with your code is that a.cookies() returns a list of cookies. You need to set each cookie separately.
Try something like this:
b = dryscrape.Session()
b.set_header("User-Agent", "Firefox")
for cookie in a.cookies():
    b.set_cookie(cookie)
Hope it helps :)
I'm looking to embark on a personal project: creating an application with which I can save docs/texts/images from the site my browser is currently on. I have done a lot of research and concluded that either of two approaches is possible for now: using cookies, or using a packet sniffer to identify the IP address (the packet sniffer method being more relevant at the moment).
I would like to automate the application so that I don't have to copy the URL from my browser and paste it into a script that uses urllib.
Can any experienced network programmers suggest a process, or the modules and libraries I would need?
thanks so much
jonathan
If you want to download all images, docs, and text while you're actively browsing (which is probably a bad idea considering the sheer amount of bandwidth), then you'll want something more than urllib2. I assume you don't want to keep copying and pasting all the URLs into a script to download everything; if that is not the case, a simple urllib2 and BeautifulSoup filter would work wonders.
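For that simpler case, the filter could look roughly like this. A sketch (Python 2, since it uses urllib2 as mentioned above; the page URL is a placeholder, and it only grabs <img> tags):
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

page_url = "http://example.com/some-page"  # placeholder URL
soup = BeautifulSoup(urllib2.urlopen(page_url).read(), "html.parser")

# download every image referenced on the page
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue
    full_url = urljoin(page_url, src)
    data = urllib2.urlopen(full_url).read()
    filename = full_url.split("/")[-1] or "image"
    with open(filename, "wb") as f:
        f.write(data)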
However, if my assumption is correct, then you are probably going to want to investigate Selenium. You can launch a Selenium-controlled browser window (Firefox by default) and then do your browsing normally. The best option from there is to continually poll the current URL; when it changes, identify all the elements you want to download and then use urllib2 to download them. Since I don't know what you want to download, I can't really help you with that part. However, here is roughly what that would look like in Selenium:
from selenium import webdriver
from time import sleep

# Start up the web browser
browser = webdriver.Firefox()
current_url = browser.current_url

while True:
    try:
        # If the url has changed, identify and download your items
        if browser.current_url != current_url:
            # Download the stuff here
            current_url = browser.current_url
    except:
        # Triggered once you close the web browser
        break
    # Sleep for half a second to avoid demolishing your machine with constant polling
    sleep(0.5)
Once again I advise against doing this, as constantly downloading images, text, and documents would take up a huge amount of space.
I am trying to write a script to identify if potential new homes have Verizon FiOS service available.
Unfortunately, the site's extensive use of JavaScript has kept everything from working. I'm using Selenium (wrapped in the splinter module) to let the JavaScript execute, but I can't get past the second page.
Here is a simplified version of the script:
from splinter import Browser

browser = Browser()
browser.visit('https://www.verizon.com/FORYOURHOME/ORDERING/OrderNew/OrderAddressInfo.aspx')

nameAddress1 = "ctl00$body_content$txtAddress"
nameZip = "ctl00$body_content$txtZip"
formFill = {nameAddress1: '46 Demarest Ave',
            nameZip: '10956'}

browser.fill_form(formFill)
browser.find_by_id('btnContinue').first.click()

if browser.is_element_present_by_id('rdoAddressOption0', wait_time=10):
    browser.find_by_id('rdoAddressOption0').first.click()
    browser.find_by_id('body_content_btnContinue').first.click()
In this example, it chooses the first option when the site asks for confirmation of the address.
It errors out with an ElementNotVisibleException. If I remove the is_element_present check, it errors out because it cannot find the element. The element is visible and clickable in the live browser that selenium is controlling, but it seems like selenium is not seeing an updated version of the page HTML.
As an alternative, I thought I might be able to do the POST request and process the response with requests or mechanize, but there is some kind of funny redirect that I can't wrap my head around.
How do I either get selenium to behave, or bypass the javascript/ajax and do it with GET and POST requests?
The problem is that the input you are clicking on is actually hidden with a display: none style.
To work around it, execute JavaScript code to click the input and set its checked attribute:
browser.execute_script("""var element = document.getElementById('rdoAddressOption0');
element.click();
element.checked=true;""")
browser.find_by_id('body_content_btnContinue').first.click()
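If you want to confirm that the workaround actually selected the radio button before clicking Continue, you could read the checked property back out. A small sketch using splinter's evaluate_script (the element id is the one from the question):
# should print True once the hidden radio input has been checked
print(browser.evaluate_script("document.getElementById('rdoAddressOption0').checked"))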