How to prevent downloading HTML/text pages as .png - python

http://puu.sh/3Krct.png
My program generates random links to a service that hosts images, and it grabs and downloads random images. The program makes a lot of requests, and so it has to go through proxies.
When the program starts, I give it the path to a fresh, large proxy list. However, sometimes a proxy will fail to connect to the website and return a custom HTML page instead, or the image service will return a page with the message "You don't have permission to view this image." In either case, the program still saves the response and downloads the page with a .png extension.
And so sometimes those HTML/text pages are saved as .png files:
http://puu.sh/3KrxM.png
http://puu.sh/3KrGN.png
Is there any way I can prevent the downloading of these pages, and only download the actual images?
Thank you.
if self.proxy != False:
    # make our requests go through the proxy
    self.opener.retrieve(url, filename)
else:
    urllib.request.urlretrieve(url, filename)

I think you should change the logic.
If a proxy fails to fetch the page you asked for, it normally responds with an HTTP status code other than 200.
You should then check, in order:
that the HTTP status code is 200;
that the Content-Type header is the image type you expect (e.g. image/png or image/jpeg).
For this kind of task I suggest using the requests module, as sketched below.
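A minimal sketch of that check with requests (the URL, proxy settings, and filename below are placeholders, and accepting any image/* type is an assumption):
import requests

def download_image(url, filename, proxies=None):
    # proxies is an optional dict such as {"http": "http://1.2.3.4:8080"}
    response = requests.get(url, proxies=proxies, timeout=30)
    # 1) reject anything that is not a successful response
    if response.status_code != 200:
        return False
    # 2) reject anything that is not actually an image
    content_type = response.headers.get("Content-Type", "")
    if not content_type.startswith("image/"):
        return False
    with open(filename, "wb") as f:
        f.write(response.content)
    return True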

Related

How to check HTTP status of a file online without fully downloading the file?

I have a database of thousands of files online, and I want to check what their status is (e.g. if the file exists, if it sends us to a 404, etc.) and update this in my database.
I've used urllib.request to download files in a Python script. However, obviously downloading terabytes of files is going to take a long time. Parallelizing the process would help, but ultimately I just don't want to download all the data, only check the status. Is there an ideal way to check (using urllib or another package) the HTTP response code of a certain URL?
Additionally, if I can get the file size from the server (which would be in the HTTP response), then I can also update this in my database.
If your web server is standards-based, you can use an HTTP HEAD request instead of a GET. It returns the same status line and headers without actually fetching the page body.
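A minimal sketch with requests (the URL is a placeholder, and Content-Length is only present if the server reports it):
import requests

url = "https://example.com/some/file.bin"  # placeholder URL
response = requests.head(url, allow_redirects=True, timeout=10)
print(response.status_code)                    # e.g. 200, 404, ...
print(response.headers.get("Content-Length"))  # file size in bytes, if the server sends it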
The requests module can check the status response of a request.
Just do:
import requests
url = 'https://www.google.com' # Change to your link
response = requests.get(url)
print(response.status_code)
This code prints 200, so the request was successful. (Note that requests.get fetches the whole response body; if you only need the status, requests.head avoids the download, as shown above.)

Downloading torrent file using get request (.torrent)

I am trying to download a torrent file with this code:
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says "Unable to Load, Torrent Is Not Valid Bencoding".
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping, and I am sorry to say that bypassing Cloudflare is very hard if you only use requests, because the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript; if it doesn't, the site won't give you the bytes of the file. That's why your code doesn't work. (You can use r.text to see the response content: it is an HTML page, not a file.)
In this situation, I think you should consider using Selenium.
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Don't forget that your code may still break in the future, because Cloudflare changes its techniques periodically; if you use a library, you will just need to update the library (at least that's the hope).
I have only used a similar library in NodeJS, but I see Python also has something like that: cloudscraper.
Example:
import cloudscraper

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper()  # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
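Applied to the torrent URL from the question, a minimal sketch might look like this (assuming cloudscraper can actually pass the challenge for this particular site):
import cloudscraper

scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
response = scraper.get(url, allow_redirects=True)
# save the raw bytes only if the request actually succeeded
if response.status_code == 200:
    with open("test123.torrent", "wb") as f:
        f.write(response.content)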
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It's a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
You can do this by adding cookies to the request headers, but the cookies expire after some time.
Therefore the only reliable solution is to download the file by opening it in a browser.

Capturing the video stream from a website into a file

For my image classification project I need to collect classified images, and a good source for me would be the various webcams around the world streaming video on the internet. Like this one:
https://www.skylinewebcams.com/en/webcam/espana/comunidad-valenciana/alicante/benidorm-playa-poniente.html
I don't really have any experience with video streaming or web scraping in general, so after searching for information on the internet, I came up with this naive code in Python:
import requests

url = 'https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99'
r1 = requests.get(url, stream=True)
filename = "stream.avi"
if r1.status_code == 200:
    with open(filename, 'w') as f:
        for chunk in r1.iter_content(chunk_size=1024):
            f.write(chunk)
else:
    print("Received unexpected status code {}".format(r1.status_code))
where the URL was taken from the source of the video element on the website:
<video data-html5-video=""
  poster="//static.skylinewebcams.com/_2933625150.jpg" preload="metadata"
  src="blob:https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99"></video>
but it does not work (the avi file is empty), even though the video stream plays fine in the browser. Can anybody explain to me how to capture this video stream into a file?
I've made some progress since then. Here is the code:
print ("Recording video...")
url='https://hddn01.skylinewebcams.com/02930601ENXS-1523680721427.ts'
r1 = requests.get(url, stream=True)
filename = "stream.avi"
num=0
if(r1.status_code == 200):
with open(filename,'wb') as f:
for chunk in r1.iter_content(chunk_size=1024):
num += 1
f.write(chunk)
if num>5000:
print('end')
break
else:
print("Received unexpected status code {}".format(r.status_code))
Now I can get a piece of the video written to the file. What I've changed is: 1) in open(filename, 'wb') I changed 'w' to 'wb' to write binary data, and, most importantly, 2) I changed the URL. I looked in the Network tab of Chrome DevTools to see which requests the browser sends to get the live stream, and just copied the most recent one; it requests a .ts file.
Next, I found out how to get the addresses of the .ts video files. One can use the m3u8 module (installable via pip) like this:
import m3u8

m3u8_obj = m3u8.load('https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3')
playlist = [el['uri'] for el in m3u8_obj.data['segments']]
The playlist of video files will then look something like this:
['https://hddn04.skylinewebcams.com/02930601ENXS-1523720836405.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720844347.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720852324.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720860239.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720868277.ts',
'https://hddn04.skylinewebcams.com/02930601ENXS-1523720876252.ts']
and I can download each of the video files from the list.
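For example, a minimal sketch that downloads the segments and appends them to one file (MPEG-TS segments can simply be concatenated; the playlist variable comes from the snippet above):
import requests

with open("stream.ts", "wb") as f:
    for segment_url in playlist:
        r = requests.get(segment_url, stream=True)
        if r.status_code != 200:
            print("Skipping {} (status {})".format(segment_url, r.status_code))
            continue
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)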
The only problem left is that in order to load the playlist I first need to open the webpage in a browser; otherwise the playlist comes back empty. Presumably opening the webpage initiates the streaming, which creates the m3u8 file on the server so it can be requested. I still don't know how to initiate the streaming from Python without opening the page in a browser.
The list turns out empty because you're making an HTTP request without headers (which means you're doing it programmatically for sure) and most sites just respond to those with 403 outright.
You should use a library like Requests or pycurl to add headers to your requests, and they should work fine. For an example request (complete with headers), you can open your web browser's developer console while watching the stream, find an HTTP request for the m3u8 URL, right-click on it, and choose "Copy as cURL". Note that site-specific, arbitrary headers may be required with each request.
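As a rough sketch, the playlist request from the question might look like this once headers are attached (the header values here are generic placeholders; the real ones, including any site-specific headers, should be copied from the browser's own request):
import requests
import m3u8

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder browser user agent
    "Referer": "https://www.skylinewebcams.com/",               # page that embeds the stream
}
playlist_url = "https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3"
response = requests.get(playlist_url, headers=headers, timeout=10)
if response.status_code == 200:
    m3u8_obj = m3u8.loads(response.text)  # parse the playlist text we fetched ourselves
    playlist = [el['uri'] for el in m3u8_obj.data['segments']]
    print(playlist)
else:
    print("Playlist request failed with status {}".format(response.status_code))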
If you want to scrape multiple sites with different headers, and/or want to future-proof your code for if they change the headers, addresses or formats, then you probably need something more advanced. Worst-case scenario, you might need to run a headless browser to open the site with WebDriver/Selenium and capture the requests it makes to generate your requests.
Keep in mind that you might have to read each site's ToS, or you might be performing illegal activities. Scraping while breaking the ToS is basically digital trespassing, and I think at least Craigslist has already won lawsuits on that basis.

Python - Capture auto-downloading file from aspx web page

I'm trying to export a CSV from this page via a Python script. The complicated part is that, after clicking the export button, a new page opens, starts the download, and closes again, rather than the file being hosted somewhere static. I've tried using the Requests library, among other things, but the file it returns is empty.
Here's what I've done:
from requests import get

url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=True&amp%3bexportFormat=CSV&amp%3bisExport=True%22+id%3d%22M_C_sCDTransactions_csfFilter_ExportDialog_hlAllCSV?exportAll=True&exportFormat=CSV&isExport=True'
with open('CD_Transactions_02-27-2017.CSV', "wb") as file:
    # send the GET request
    response = get(url)
    # write the response body to the file
    file.write(response.content)
I'm sure I'm missing something obvious, but I'm pulling my hair out.
It looks like the file is being generated on demand, and the URL stays valid only as long as the session lasts.
There are multiple requests from the browser to the webserver (including POST requests).
So to get those files via code, you would have to simulate the browser, possibly including session state etc. (and in this case also the __VIEWSTATE field).
To see the whole communication, you can use the developer tools in the browser (usually F12, then the Network tab to see the traffic), or use something like Wireshark.
In other words, this won't be an easy task.
If this is open government data, it might be better to just ask that government for the data or ask for possible direct links to the (unfiltered) files (sometimes there is a public ftp server for example) - or sometimes there is an API available.
The file is created on demand but you can download it anyway. Essentially you have to:
Establish a session to save cookies and viewstate
Submit a form in order to click the export button
Grab the link which lies behind the popped-up csv-button
Follow that link and download the file
You can find working code here (if you don't mind that it's written in R): Save response from web-scraping as csv file. A rough Python sketch of the same steps follows below.
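As a non-authoritative Python sketch of those four steps (the export button name and CSV link id below are assumptions; apart from the standard ASP.NET hidden fields such as __VIEWSTATE and __EVENTVALIDATION, every form field name must be read from the real page):
import requests
from bs4 import BeautifulSoup

page_url = "http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx"
with requests.Session() as session:                    # 1) the session keeps cookies and viewstate alive
    page = session.get(page_url)
    soup = BeautifulSoup(page.text, "html.parser")
    # collect the hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, ...) for the postback
    form_data = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", {"type": "hidden"})
    }
    form_data["ctl00$exportButton"] = "Export"         # 2) hypothetical button name: read it from the real form
    dialog = session.post(page_url, data=form_data)    # submit the form that opens the export dialog
    dialog_soup = BeautifulSoup(dialog.text, "html.parser")
    # 3) grab the link behind the popped-up CSV button (the id here is an assumption)
    csv_link = dialog_soup.find("a", id="hlAllCSV")["href"]
    # 4) follow that link within the same session and save the file
    csv_response = session.get(requests.compat.urljoin(page_url, csv_link))
    with open("CD_Transactions.csv", "wb") as f:
        f.write(csv_response.content)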

Python - Direct linking blocking via iFrames, can I still get the binaries?

I have a scraper script that pulls binary content off publishers' websites. It's built to replace the manual work of saving hundreds of individual PDF files that colleagues would otherwise have to undertake.
The websites are credential based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that has the pdf file inside an iFrame.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something, I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the referring URL, and saw these different URLs that point to the same article, collected over a day of testing (I have scrubbed identifiers from the URLs):
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something in the URL that is unique, and needs associating to something else to circumvent the direct link detector.
Any suggestions on how to get round this problem?
OK. The answer was cookies and headers. I collected the GET header info via HttpFox and made an identical header object in my script, and I grabbed the session ID from request.cookie and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.
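A minimal sketch of that approach with requests (the header values, landing-page URL, and PDF URL below are placeholders standing in for whatever the browser actually sends and for the scrubbed URLs above):
import requests

# headers copied from a working browser request (values here are placeholders)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://content_provider.com/elibrary/title/issue/",
    "Accept": "application/pdf,*/*",
}
with requests.Session() as session:
    session.headers.update(headers)
    # visit the page that embeds the iFrame first, so the server sets the session cookie
    session.get("http://content_provider.com/elibrary/title/issue/article")
    # now request the PDF URL extracted from the iFrame; the session cookie is sent automatically
    pdf = session.get("http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf")
    with open("article.pdf", "wb") as f:
        f.write(pdf.content)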
