Link Scraping with requests, bs4. Getting Warning: unresponsive script - python

I'm trying to collect all links from a webpage using requests, BeautifulSoup4 and SoupStrainer in Python 3.3. I write my code in Komodo Edit 8.0 and also run my scripts from within Komodo Edit. So far everything works fine, but on some webpages I get a popup with the following warning:
Warning unresponsive script
A script on this page may be busy, or it may have stopped responding. You can stop the script
now, or you can continue to see if the script will complete.
Script: viewbufferbase:797
Then I can choose whether to continue or stop the script.
Here is a little code snippet:
try:
    r = requests.get(adress, headers=headers)
    soup = BeautifulSoup(r.text, parse_only=SoupStrainer('a', href=True))
    for link in soup.find_all('a'):
        pass  # some code
except requests.exceptions.RequestException as e:
    print(e)
My question is: what is causing this error? Is it my Python script that is taking too long on a webpage, or is it a script on the webpage I'm scraping? I can't imagine it is the latter, because technically I'm not executing the scripts on the page, right?
Or could it be my bad internet connection?
Oh, and another little question: with the above code snippet, am I downloading pictures or just the plain HTML code? Sometimes when I look at my connection status, the amount of data I'm receiving seems far too large for plain HTML.
If so, how can I avoid downloading such stuff, and how is it possible in general to avoid downloads with requests? Sometimes my program ends up on a download page.
Many thanks!

The issue might be either long loading times of a site, or a cycle in your website's link graph, i.e. page1 (Main Page) has a link to page2 (Terms of Service), which in turn has a link back to page1. You could try this snippet to see how long it takes to get a response from a website (snippet usage included).
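If a cycle in the link graph is the cause, one common remedy is to remember which URLs have already been visited. The following is a minimal sketch, not the snippet linked above. It assumes a hypothetical `get_links(url)` callable that returns the links found on a page (in the question's setup it would wrap the `requests.get` and `BeautifulSoup` calls); `crawl` and `max_pages` are illustrative names as well.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that never visits the same URL twice.

    get_links is a hypothetical callable: url -> iterable of URLs on that page.
    """
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already handled, so a cycle cannot loop forever
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited


# Tiny in-memory "site" with a cycle: page1 -> page2 -> page1
site = {
    "page1": ["page2", "page3"],
    "page2": ["page1"],
    "page3": [],
}
print(sorted(crawl("page1", lambda url: site.get(url, []))))
```

For slow sites, passing a `timeout=` value to `requests.get` keeps a single unresponsive server from stalling the whole crawl.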
Regarding your last question:
I'm pretty sure requests doesn't parse your response's content (except for the .json() method). What you might be experiencing is a link to a resource, like Free Cookies!, which your script would visit. requests has a mechanism to counter such cases, see this for reference. Moreover, the aforementioned technique allows checking the Content-Type header to make sure you're downloading only the pages you're interested in.
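The linked reference is not preserved here. One mechanism in requests that matches the description is the `stream=True` argument, which postpones downloading the body until `.content`, `.text` or `iter_content()` is accessed, so the headers can be inspected first. Below is a minimal sketch under that assumption. `is_html` and `fetch_html_only` are hypothetical helper names, and `session` is expected to be a `requests.Session` (or any object with a compatible `get` method).

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html_only(session, url, timeout=10):
    """Return the page text for HTML responses, or None for anything else.

    With stream=True only the headers are fetched up front; the body is
    downloaded when .text is accessed below.
    """
    response = session.get(url, stream=True, timeout=timeout)
    try:
        if is_html(response.headers.get("Content-Type")):
            return response.text
        return None  # e.g. an image or archive: skip it without reading the body
    finally:
        response.close()
```

With requests installed, a call could look like `fetch_html_only(requests.Session(), adress)`, where `adress` is the variable from the question.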

Related

Getting very weird 400 errors (Imgur API)

I know how this may read to you but I am seriously out of ideas on this one.
I've written something in Python that can download stuff from Imgur using their API. I have an Authorization clientID and everything, and this thing works.
But sometimes I am getting an HTTP 400 response status code with empty text when requesting a direct link so I can save the file. According to the docs, this error comes back when you're missing a required parameter or when part of your request is incorrect; that feels like a weird error to get when other requests go through just fine. The weirder part is that I can send 100 requests in 100 seconds and all will come back with 400 None. But once I open the image in my browser and view it, the response suddenly changes to 200, everything works, and this link will never throw a 400 again.
My friend suggested that maybe it had something to do with my IP address (getting flagged as a spam bot or whatever), so I opened the URLs on my phone over cellular, but as soon as the image loaded on my phone, the requests were successful. Also, when I tried finding more links with this strange behaviour, I tested a bunch of images from the Imgur front page, and all worked fine. Combining this with the fact that I got the problematic links from very old reddit threads leaves me with the only idea that it has something to do with the age of the files, or rather their last view date.
The requests happened via a code similar to:
headers = {'Authorization': f"Client-ID <id>"}
url = f"https://api.imgur.com/3/image/{fileID}"
requests.get(url, headers=headers)
I'm asking this here mainly because the Imgur API page says the best way to get help with the API is to post the problem here on Stack Overflow. Maybe one of their engineers will see this and can answer my question, or maybe someone else has an idea what may be going on here. In any case, I would be grateful for any useful input ^^

cloudscraper exception but is not the captcha one

I had a problem downloading a file with the requests lib because of Cloudflare, so I wanted to use cloudscraper.
I tried different things but I always get the same result. The code that downloads the file is this:
import cloudscraper
scraper = cloudscraper.create_scraper()
response = scraper.get(Linkds)
with open("DownloadedFile.torrent", "wb") as f:
    f.write(response.content)  # write the raw bytes, not the decoded .text
print('pass download')
the error I get is this one:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
In my searches I found that people always talk about a similar problem, but their error mentions a version 2 Captcha challenge, which is not the error I get. Here is what I found:
https://github.com/ptrstn/dailyblink/issues/7
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge
https://www.reddit.com/r/webscraping/comments/xj9ykg/python_scraping_rent_properties_getting_blocked/
I also tried using the webbrowser lib, but it doesn't help much because I need to handle the file internally for automation purposes. (I think I could use an AHK script for that, but at that point it is overly complicated and much harder for me to set up.)
I tried tinkering with the requests lib, but I couldn't get it to work either (it only returned the Cloudflare protection page).
(Not sure if it matters, but the program is a Discord bot / tkinter UI downloader that saves the torrent file into a folder.)
The bot only needs to work with https://nhentai.net/. The idea is that you take a link like https://nhentai.net/g/111111/ and append download to the end, like this
(messageuwu is the link) https://nhentai.net/g/111111/
Linkds=messageuwu+'download'
and it downloads the torrent file. The problem is Cloudflare. The webbrowser lib kind of works, but it is not practical for automation. I don't know what else to try.

Downloading torrent file using get request (.torrent)

I am trying to download a torrent file with this code:
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says: Unable to Load, Torrent Is Not Valid Bencoding
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping. I am sorry to say that bypassing Cloudflare is very hard if you only use requests, and the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript; if it does not, the server won't give you the bytes of the file. That's why the file you saved is unusable. (You can inspect r.text to see the response content: it is an HTML page, not a torrent file.)
Under these circumstances, I think you should consider using selenium.
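A quick way to confirm this diagnosis is to inspect the bytes that were saved. A .torrent file is a bencoded dictionary, so it starts with the byte `d` and ends with `e`, while a Cloudflare challenge page is HTML. Here is a minimal sketch; `looks_like_torrent` is a hypothetical helper, and the check is only a heuristic, not a full bencode parser.

```python
def looks_like_torrent(data):
    """Heuristic: a bencoded dictionary starts with b'd' and ends with b'e'."""
    if data.lstrip()[:1] == b"<":
        return False  # almost certainly an HTML page, not a torrent
    return data[:1] == b"d" and data[-1:] == b"e"


# A tiny, made-up bencoded dictionary and the start of an HTML page
print(looks_like_torrent(b"d8:announce18:http://tracker/ann4:infod4:name4:testee"))
print(looks_like_torrent(b"<!DOCTYPE html><html><head><title>Just a moment...</title>"))
```

In the question's code, calling `looks_like_torrent(r.content)` before writing the file would show whether the server returned a torrent or a challenge page.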
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Please don't forget that your code may break in the future because Cloudflare changes their techniques periodically. Well, if you use the library, you will just need to update the library (at least you should hope for that).
I have only used a similar library in NodeJS, but I see Python also has something like that: cloudscraper
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text) # => "<!DOCTYPE html><html><head>..."
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It is a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
We can do this by adding cookies to the request headers.
But after some time the cookies expire.
Therefore the only reliable solution is to download the file by opening a browser.
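As a rough illustration of the cookie approach, the `Cookie` header value copied from the browser's developer tools can be turned into a dictionary and passed to requests through its `cookies=` parameter. This is a minimal sketch that uses only the standard library for the parsing. `parse_cookie_header` is a hypothetical helper, and the cookie names in the example are made up.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Convert 'name1=value1; name2=value2' into a plain dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = parse_cookie_header("session_id=abc123; clearance=xyz789")
print(cookies)
# With requests installed, the dict can then be sent along with a request:
#   requests.get(url, cookies=cookies)
```

As the answer notes, such cookies expire, so the copied value has to be refreshed from the browser from time to time.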

how to convert IP address into http for urllib

I'm looking to embark on a personal project: an application that can save docs/texts/images from the site my browser is currently on. After a lot of research I have concluded that two approaches are possible for now: using cookies, or using packet sniffers to identify the IP address (the packet sniffer method being more relevant at the moment).
I would like to automate the application so that I do not have to copy the URL from my browser and paste it into a script that uses urllib.
Are there any suggestions that experienced network programmers can provide with regards to the process or modules or libraries I need?
thanks so much
jonathan
If you want to download all images, docs, and text while you're actively browsing (which is probably a bad idea considering the sheer amount of bandwidth) then you'll want something more than urllib2. I assume you don't want to have to keep copying and pasting all the urls into a script to download everything, if that is not the case a simple urllib2 and beautifulsoup filter would do you wonders.
However if what I assume is correct then you are probably going to want to investigate selenium. From there you can launch a selenium window (defaults to Firefox) and then do your browsing normally. The best option from there is to continually poll the current url and if it is different identify all of the elements you want to download and then use urllib2 to download them. Since I don't know what you want to download I can't really help you on that part. However here is what something like that would look like in selenium:
from selenium import webdriver
from time import sleep

# Start up the web browser
browser = webdriver.Firefox()
current_url = browser.current_url
while True:
    try:
        # If the URL has changed, identify and download your items
        if browser.current_url != current_url:
            # Download the stuff here
            current_url = browser.current_url
    # Triggered once you close the web browser
    except Exception:
        break
    # Sleep for half a second to avoid demolishing your machine from constant polling
    sleep(0.5)
Once again I advise against doing this, as constantly downloading images, text, and documents would take up a huge amount of space.
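The "Download the stuff here" step in the loop above depends on what should be saved. As one illustration, the sketch below collects the image URLs of a page from its HTML, which selenium exposes as `browser.page_source`. It assumes Python 3 and uses only the standard library. `ImageCollector` and `find_images` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img src=...> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL
                self.image_urls.append(urljoin(self.base_url, src))


def find_images(page_source, base_url):
    collector = ImageCollector(base_url)
    collector.feed(page_source)
    return collector.image_urls


html = '<html><body><img src="/a.png"><img src="http://cdn.example.com/b.jpg"></body></html>'
print(find_images(html, "http://example.com/gallery/"))
```

Inside the polling loop this would be called as `find_images(browser.page_source, browser.current_url)`, and each returned URL could then be downloaded.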

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese tax payer, which allows us to download their content without infringing copyrights.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
# Open in binary mode: the video is not text
tofile=open("/tmp/xfgs.f4v","wb")
tofile.write(r)
tofile.close()
This is the result:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?
First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall somehow with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrunk by a factor of 5x.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
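The comparison described above can be partly automated. The following is a minimal sketch that assumes the browser's request headers were copied by hand into a dictionary; `diff_headers` is a hypothetical helper and all header values in the example are made up.

```python
def diff_headers(browser_headers, script_headers):
    """Report headers that are missing from, or different in, the script's request."""
    # Header names are case-insensitive, so compare them in lower case
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    missing = {k: v for k, v in browser.items() if k not in script}
    changed = {k: (browser[k], script[k])
               for k in browser if k in script and browser[k] != script[k]}
    return missing, changed


from_browser = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://example.com/watch",
    "Cookie": "sid=abc",
}
from_script = {"user-agent": "python-requests", "Cookie": "sid=abc"}

missing, changed = diff_headers(from_browser, from_script)
print("missing:", missing)
print("changed:", changed)
```

With python-requests, the headers that were actually sent are available as `response.request.headers`, which can serve as the `script_headers` argument.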
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
1. Investigate how state is being maintained
2. Pull initial page with python, grab any state info you need from it
3. Perform any tokenizing that may be required with that state info
4. Pull the video using the tokens from steps 2 and 3
5. If stuff blows up, output your request/response headers, cookies, query vars, post vars, and compare them to Chrome/Firebug
6. Return to step 1 until you find a solution
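Steps 2 to 4 could be sketched roughly as follows. This is an illustration under several assumptions, not a recipe for the site in the question: `session` is expected to be a `requests.Session` (which keeps cookies between requests), `find_video_url` is a hypothetical caller-supplied function that performs whatever token logic step 3 turns out to need, and sending a `Referer` header is a guess at what a server might check.

```python
def download_with_state(session, page_url, find_video_url, dest_path):
    """Load the page, derive the video URL, then fetch it in the same session."""
    # Step 2: pull the initial page; the session stores any cookies it sets
    page = session.get(page_url, timeout=30)
    page.raise_for_status()

    # Step 3: caller-supplied logic that extracts or computes the real video URL
    video_url = find_video_url(page.text)

    # Step 4: request the video with the same cookies, naming the page as referrer
    video = session.get(video_url, headers={"Referer": page_url},
                        stream=True, timeout=30)
    video.raise_for_status()
    with open(dest_path, "wb") as out:
        for chunk in video.iter_content(chunk_size=65536):
            out.write(chunk)
    return video_url
```

If the download still fails, step 5 applies: print `video.request.headers` and `video.headers` and compare them with what the browser sent and received.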
Edit:
You may also be getting redirected at either one of these requests (the html page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or like has been suggested by other responders, WireShark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box. It is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem
It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legit) browsing session is ongoing.
You can use selenium or watir to do all the stuff you need in a browser.
Since you don't want to see the browser, you can run selenium headless.
See also this answer.
Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work though, as Mechanize keeps state of cookies and redirects, but does not handle any real-time JavaScript changes to the HTML pages. To check if JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can still download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.
My usual approach to this kind of scraping is:
1. try wget or Python's urllib2
2. try mechanize
3. drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).
In order to clarify the "why" part of your question you can route your browser and your code's requests through a debug proxy. If you are using windows I suggest fiddler2. There exist other debug proxies for other platforms as well. But fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or more low level
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.
I think many sites use temporary links that only exist in your session. The code in the url is probably something like your session-id. That means the particular link will never work again.
You'll have to reopen the page that contains the link using some library that accommodates this session (as mentioned in other answers), then locate the link and use it only within that session.
While the current accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps - with a firefox extension called cliget that takes the request context with all the http headers and cookies and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of firebug and the chrome debugger - right click request, "copy as curl"
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating - I often got stuck thinking what header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting one curl/wget commandline to python code is just a matter of knowing how to add headers to an urllib request (or any http request library for that matter)
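A rough sketch of that conversion follows. It assumes Python 3's `urllib.request` (the thread itself uses Python 2's urllib2) and a simple command of the form `curl URL -H 'Name: value'`. `request_from_curl` is a hypothetical helper, the example command is made up, and flags other than `-H` that take their own argument are not handled.

```python
import shlex
import urllib.request


def request_from_curl(command):
    """Build a urllib Request from a simple `curl URL -H 'Name: value' ...` command."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")
    url = None
    headers = {}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif token.startswith("-"):
            i += 1  # ignore other flags (assumed here to take no argument)
        else:
            url = token
            i += 1
    if url is None:
        raise ValueError("no URL found in command")
    return urllib.request.Request(url, headers=headers)


cmd = "curl 'http://example.com/video.f4v' -H 'User-Agent: Mozilla/5.0' -H 'Cookie: sid=abc' --compressed"
req = request_from_curl(cmd)
print(req.full_url)
print(req.header_items())
```

The resulting object can be passed to `urllib.request.urlopen(req)` to send the request with those headers.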
There's an open source, Python library, named ghost, that wraps a headless, WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and while it's headless, so it doesn't render anything graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.
from urllib import urlopen  # Python 2; in Python 3 use urllib.request.urlopen
print urlopen(url).read()  # Python's built-in high-level interface for fetching any online resource; it handles some HTTP status codes (such as redirects) automatically
Did you try the requests module? It's much simpler to use than urllib2, pycurl, etc.,
yet it's powerful. It has the following features:
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.
You could use Internet Download Manager; it is able to capture and download streaming media from many websites.
