cloudscraper exception, but not the captcha one - python

I had a problem downloading a file with the requests lib because of Cloudflare, so I wanted to use cloudscraper.
I tried different things, but I always get the same result. The code that downloads the file is this one:
import cloudscraper

scraper = cloudscraper.create_scraper()
response = scraper.get(Linkds)  # keep the Response object so .content works below
open("DownloadedFile.torrent", "wb").write(response.content)
print('pass download')
The error I get is this one:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
In my searches I found that people always talk about a similar problem, but theirs says version 2 Captcha challenge, which is not quite the error I get. Here is what I found:
https://github.com/ptrstn/dailyblink/issues/7
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge
https://www.reddit.com/r/webscraping/comments/xj9ykg/python_scraping_rent_properties_getting_blocked/
I also tried using the webbrowser lib, but it doesn't help much because I need to handle the file internally for automation purposes. (I think I could use an AHK script to do that, but at that point it's overly complicated and much harder to set up for me.)
I tried tinkering with the requests lib, but I couldn't get it to work either (it only returned the Cloudflare protection page).
(Not sure if it matters, but the program is a Discord bot/Tkinter UI downloader that saves the torrent file into a folder.)
The bot only needs to work with https://nhentai.net/. The idea is that you take a link like https://nhentai.net/g/111111/ and append download to the end, like this
(messageuwu is the link) https://nhentai.net/g/111111/
Linkds = messageuwu + 'download'
and it downloads the torrent file. The problem is Cloudflare. With the webbrowser lib it kinda works (but because of the automation problem it's not practical). I don't know what else to try.

Related

how to download a video from a twitch clip in python [duplicate]

I have been trying to download past broadcasts for a streamer on Twitch using Python. I found this Python code online:
https://gist.github.com/baderj/8340312
However, when I try to call the functions I am getting errors giving me a status 400 message.
I'm unsure whether this is the right code for downloading the video (as an mp4), or how to use it properly.
And by video I mean something like this as an example: www(dot)twitch.tv/imaqtpie/v/108909385 // note: can't put more than 3 links since I don't have 10 reputation
Any tips on how I should go about doing this?
Here's an example of running it in cmd:
python twitch_past_broadcast_downloader.py 108909385
After running it, it gave me this:
Exception API returned 400
This is where I got the information on running it:
https://www.johannesbader.ch/2014/01/find-video-url-of-twitch-tv-live-streams-or-past-broadcasts/
Huh, it's not as easy as it seems... The code you found in this gist is quite old and Twitch has completely changed its API. Now you will need a Client ID to download videos, in order to limit the number of videos you're downloading.
If you want to correct this gist, here are simple steps you can do :
Register an application: everything is explained here! Register your app and keep your client ID close.
Change the API route: it is no longer '{base}/api/videos/a{id_}' but '{base}/kraken/videos/{id_}' (not sure about the last one). You will need to change it inside the Python code. The doc is here.
Add the client ID to the request: as said in the doc, you need to give a header to the request you make, so add a Client-ID: <client_id> header to the request, as sketched below.
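For illustration only, a rough sketch of steps 2 and 3 with python-requests; the base URL, client ID and video ID are placeholders/assumptions, and the Kraken route mentioned above may well have changed again since this answer was written:

import requests

CLIENT_ID = "your-client-id"   # placeholder: the ID you get from registering an app
video_id = "108909385"         # placeholder video ID

# Route from step 2 plus the Client-ID header from step 3
url = "https://api.twitch.tv/kraken/videos/{id_}".format(id_=video_id)
response = requests.get(url, headers={"Client-ID": CLIENT_ID})
print(response.status_code)
print(response.json())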
And now I think you will need to start debugging a bit, because it is old code :/
I will try to do it myself and I will edit this answer when I'm finished, but try it yourself :)
See ya !
EDIT: Mhhh... It doesn't seem to be possible to download a video with the API anyway :/ I thought only the API links had changed, but the chunks section of the response from the video URL has disappeared and Twitch is no longer giving access to raw videos :/
Really sorry I told you to do that; even with the API I think it is no longer possible :/
You can download past broadcasts of Twitch videos with the Python library streamlink.
You will need an OAuth token, which you can generate with the command
streamlink --twitch-oauth-authenticate
Download VODs with:
streamlink --twitch-oauth-token <your-oauth-token> https://www.twitch.tv/videos/<VideoID> best -o <your-output-folder>
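If you need to drive this from Python for automation rather than running it in a shell, one option is simply to call the streamlink CLI via subprocess; a minimal sketch, where the token, VOD URL and output path are placeholders:

import subprocess

oauth_token = "<your-oauth-token>"                    # from the authenticate command above
video_url = "https://www.twitch.tv/videos/<VideoID>"  # placeholder VOD URL
output_path = "vod.mp4"                               # placeholder output file

subprocess.check_call([
    "streamlink",
    "--twitch-oauth-token", oauth_token,
    video_url,
    "best",
    "-o", output_path,
])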

How do I get URLs that are being accessed in my browser in 'real time'?

I want to write a program that returns the current or most recently visited URL in my computer's browser (Windows 10). Is there any way I can get that URL?
I tried using Python and SQLite to access the Chrome history database at C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data\Default\History and it worked, but while I'm using the browser, the database gets locked.
I know that with Wireshark one can see the packets when accessing a URL, but I cannot find the complete URL in those packet fields, only the server name (i.e. stackoverflow.com).
I'd like to know whether there is a way I can see that information the way Wireshark does, but getting only the complete URL, nothing else. Thank you!
I found a solution to this by using mitmproxy: https://mitmproxy.org/. This video on YouTube helped me with the installation and setup process: https://www.youtube.com/watch?v=7BXsaU42yok. The video explains the installation on Mac, but it's not so different from Windows. Then you can use Python to capture and process the URLs contained within the HTTPS requests by using the flow.request.pretty_url property: https://docs.mitmproxy.org/stable/addons-scripting/.
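For example, here is a minimal addon script of the kind described in that documentation, run with mitmdump -s log_urls.py (the filename is just an example); it prints the full URL of every intercepted request:

# log_urls.py -- run with: mitmdump -s log_urls.py
def request(flow):
    # flow.request.pretty_url is the complete URL of the intercepted request
    print(flow.request.pretty_url)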

Use Webdav on Linux to get a file from SkyDrive *without* mounting

I want to download a file from SkyDrive programmatically using Python on Linux.
I can't use the API as it's a OneNote file and the API can't be used to download these.
My understanding is that SkyDrive supports WebDAV, and there are plenty of examples where people have mounted a SkyDrive folder using davfs2, but I just want to be able to grab a specific file without mounting.
I can use the API to get the document owner's cid, so I don't need to jump through any Windows-based hoops, but my (probably lame, I have not really researched WebDAV) efforts to download the file always result in an error.
For example using easywebdav:
import easywebdav
webdav = easywebdav.connect("d.docs.live.net/mycid")
webdav.download('me/skydrive/Documents/Getting Started', '/tmp/foo')
# this gives the 302 error mentioned in the comments at the end of the 'jumping through windows hoops' link I posted above.
Is there any workaround for the redirection problem I've seen mentioned?
Or do I have this wrong, and when accessing files on a WebDAV share does it make sense, and is it indeed essential, to mount it as a file system?
If you are downloading a specific file, and already know the exact path/URL to that file (as per your example), I'm not sure that you really need to worry about the DAV extensions. Have you tried downloading the file using a simple HTTP GET, through something like urllib2?
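For instance, a minimal sketch with urllib2; the URL below is just the question's WebDAV path guessed as a direct HTTPS URL, and authentication (probably the real obstacle) is not handled:

import urllib2

url = "https://d.docs.live.net/mycid/me/skydrive/Documents/Getting%20Started"  # placeholder URL
response = urllib2.urlopen(url)   # urllib2 follows a 302 redirect automatically
with open("/tmp/foo", "wb") as out:
    out.write(response.read())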

Link Scraping with requests, bs4. Getting Warning: unresponsive script

I'm trying to collect all links from a webpage using requests, BeautifulSoup4 and SoupStrainer in Python 3.3. For writing my code I'm using Komodo Edit 8.0, and I also let my scripts run in Komodo Edit. So far everything works fine, but on some webpages I get a popup with the following warning:
Warning unresponsive script
A script on this page may be busy, or it may have stopped responding. You can stop the script
now, or you can continue to see if the script will complete.
Script: viewbufferbase:797
Then I can choose whether I want to continue or stop the script.
Here is a little code snippet:
try:
    r = requests.get(adress, headers=headers)
    soup = BeautifulSoup(r.text, parse_only=SoupStrainer('a', href=True))
    for link in soup.find_all('a'):
        pass  # some code
except requests.exceptions.RequestException as e:
    print(e)
My question is: what is causing this warning? Is it my Python script that is taking too long on a webpage, or is it a script on the webpage I'm scraping? I can't see how it could be the latter, because technically I'm not executing the scripts on the page, right?
Or could it maybe be my bad internet connection?
Oh, and another little question: with the above code snippet, am I downloading pictures or just the plain HTML code? Sometimes when I look at my connection status it seems like far too much data for a request for plain HTML code.
If so, how can I avoid downloading such stuff, and how is it possible in general to avoid downloads with requests? Sometimes my program ends up on a download page.
Many Thanks!
The issue might be either long loading times of a site, or a cycle in your website links' graph - i.e. page1 (Main Page) has a link to page2 (Terms of Service), which in turn has a link to page1. You could try this snippet to see how long it takes to get a response from a website (snippet usage included).
Regarding your last question:
I'm pretty sure requests doesn't parse your response's content (except for the .json() method). What you might be experiencing is a link to a resource, like Free Cookies!, which your script would visit. requests has mechanics to counter such a case; see this for reference. Moreover, the aforementioned technique allows checking the Content-Type header to make sure you're downloading pages you're interested in.
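As an illustration of that technique, a small sketch (reusing the adress and headers names from your snippet) that checks the Content-Type header before committing to the body; with stream=True the body is only downloaded when you access .text or .content:

import requests

r = requests.get(adress, headers=headers, stream=True)
content_type = r.headers.get('Content-Type', '')

if 'text/html' in content_type:
    html = r.text    # only now is the body actually downloaded
else:
    r.close()        # skip non-HTML resources (images, archives, download pages, ...)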

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese taxpayer, which allows us to download their content without infringing copyright.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
tofile = open("/tmp/xfgs.f4v", "wb")  # binary mode for the video file
tofile.write(r)
tofile.close()
This is the result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?
First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your Python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL-fetching code shrank by a factor of 5.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
Investigate how state is being maintained
Pull initial page with python, grab any state info you need from it
Perform any tokenizing that may be required with that state info
Pull the video using the tokens from steps 2 and 3
If stuff blows up, output your request/response headers, cookies, query vars, and post vars, and compare them to Chrome/Firebug
Return to step 1 until you find a solution
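A minimal sketch of steps 2-4 with python-requests; the URLs, the token extraction and the query parameter name are all placeholders for illustration:

import requests

session = requests.Session()                    # keeps cookies between requests, like a browser
session.headers['User-Agent'] = 'Mozilla/5.0'   # send a browser-like User-Agent

# Step 2: pull the page that embeds the video so any cookies it sets are stored
page = session.get('http://example.com/page-with-video')   # placeholder URL

# Step 3: derive whatever token the page would normally generate
# (placeholder -- in reality you would parse page.text or replicate the page's JavaScript)
token = 'extracted-token'

# Step 4: request the video itself with the same cookies plus the token
video = session.get('http://example.com/video.f4v', params={'key': token})   # placeholders
with open('xfgs.f4v', 'wb') as f:
    f.write(video.content)

# Step 5: if stuff blows up, dump what was actually sent/received and compare with Firebug/Chrome
print(session.cookies.get_dict())
print(video.request.headers)
print(video.status_code, video.headers)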
Edit:
You may also be getting redirected on either one of these requests (the HTML page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or, as has been suggested by other responders, Wireshark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box: it is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem
It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legit) browsing session is ongoing.
You can use Selenium or Watir to do all the stuff you need in a browser.
Since you don't want to see the browser, you can run Selenium headless.
See also this answer.
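For instance, a sketch of driving headless Chrome with Selenium; the URL is a placeholder, and older setups used PhantomJS or a virtual display instead of the --headless flag:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')                  # no visible browser window

driver = webdriver.Chrome(options=options)
driver.get('http://example.com/page-with-video')    # placeholder URL
print(driver.current_url)                           # see where any redirects ended up
html = driver.page_source                           # rendered HTML after JavaScript has run
driver.quit()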
Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work though, as mechanize keeps state of cookies and redirects but does not handle any real-time JavaScript changes to the HTML pages. To check whether JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.
My usual approach to trying this kind of scraping is
try wget or Python's urllib2
try mechanize
drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).
In order to clarify the "why" part of your question, you can route your browser's and your code's requests through a debug proxy. If you are using Windows I suggest Fiddler2. There are debug proxies for other platforms as well, but Fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or more low level
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.
I think many sites use temporary links that only exist in your session. The code in the URL is probably something like your session ID. That means the particular link will never work again.
You'll have to reopen the page that contains the link using some library that accommodates this session (as mentioned in other answers), and then try to locate the link and use it only within that session.
While the currently accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps: a Firefox extension called cliget takes the request context with all the HTTP headers and cookies and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of Firebug and the Chrome debugger - right-click the request, "copy as cURL"
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating - I often got stuck thinking what header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting a curl/wget command line to Python code is just a matter of knowing how to add headers to a urllib request (or any HTTP request library, for that matter).
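For example, a made-up command like curl -H 'Cookie: sid=abc123' -H 'Referer: http://example.com/page' http://example.com/video.f4v translates roughly to the following (all header values and the URL are placeholders):

import urllib2

request = urllib2.Request(
    'http://example.com/video.f4v',          # placeholder URL
    headers={
        'Cookie': 'sid=abc123',              # values copied from the cliget/curl command
        'Referer': 'http://example.com/page',
        'User-Agent': 'Mozilla/5.0',
    },
)
data = urllib2.urlopen(request).read()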
There's an open-source Python library named Ghost that wraps a headless WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and although it's headless, so it doesn't render anything graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.
from urllib import urlopen  # Python 2 built-in high-level interface for fetching any online resource
print urlopen(url).read()   # auto-responds to HTTP error codes
Did you try the requests module? It's much simpler to use than urllib2, pycurl, etc., yet it's powerful. It has the following features (the link is here):
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.
You could use Internet Download Manager; it is able to capture and download any streaming media from any website.
