I'm trying to build a sort-of client for a blog platform in my country, but the platform uses an in-house CAPTCHA generator.
The problem is that the CAPTCHA is built so that a new image is generated on every GET request. So suppose the captcha image URL is this: http://example.com/randomcaptcha.aspx?someparams-that-are-always-the-same
Even when I open the above link in Firefox (which shows only the JPG image), I'm presented with a different image every time I refresh.
The problem arises because when mechanize downloads the entire web page, it also requests the image (or rather, it follows the randomcaptcha.aspx link). So when I then try to download the image, I have to issue a second GET request to grab it - and by that point the image has already changed.
How would I solve this problem?
Thank you.
EDIT: the code is currently this:
browser.open("http://www.example.com/registration.aspx")  # this page contains the randomcaptcha.aspx URL in an img src
# then we have a regex to find the URL of the image; say the variable is url
with open("captcha.jpg", "wb") as file:
    file.write(browser.open_novisit(url).read())
At this point the downloaded captcha.jpg is already different from the one presented on the registration page. I used Fiddler to verify that - there are definitely two GET requests being issued for the randomcaptcha.aspx URL.
EDIT #2 Solved: My bad. The captcha URL was incorrect.
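For anyone landing here later: a minimal sketch of the single-GET flow, assuming the Python 3 fork of mechanize and a hypothetical page layout (the regex and hostnames are placeholders, not the real site):

import re
from urllib.parse import urljoin

import mechanize

browser = mechanize.Browser()
# Fetch the registration page once; mechanize does not follow img src links,
# so the captcha image is NOT downloaded here
html = browser.open("http://www.example.com/registration.aspx").read().decode("utf-8", "ignore")

# Pull the captcha URL out of the img src (the regex is an assumption about the markup)
match = re.search(r'<img[^>]+src="([^"]*randomcaptcha\.aspx[^"]*)"', html)
captcha_url = urljoin("http://www.example.com/registration.aspx", match.group(1))

# Exactly one GET for the image, so the saved bytes match what was fetched
with open("captcha.jpg", "wb") as f:
    f.write(browser.open_novisit(captcha_url).read())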
I am trying to upload an image to Slack and post it in an image block of a Slack message to a specific channel. My steps:
1. upload an image to Slack
2. make the image public with files.sharedPublicURL
3. check that the URL is public: public_url_shared being true
4. use the permalink_public I receive for the uploaded image for creating the Slack message (an image block)
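A minimal sketch of those four steps, assuming the slack_sdk WebClient (the token and file name are placeholders; files.sharedPublicURL requires a user token):

from slack_sdk import WebClient

client = WebClient(token="xoxp-your-user-token")

# 1. upload an image to Slack
upload = client.files_upload(file="webcam.jpg", title="webcam")
file_id = upload["file"]["id"]

# 2. make the image public
shared = client.files_sharedPublicURL(file=file_id)

# 3. check that it really is public
assert shared["file"]["public_url_shared"] is True

# 4. this is the link I then use in the image block
permalink_public = shared["file"]["permalink_public"]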
For debugging I am using Slack's Block Kit Builder. I am replacing the URL in the image_url example of the Block Kit demo with the one I received from Slack:
https://slack-files.com/T04AG7BVD-FLWHBHY86-1ba8263c00
or:
https://slack-files.com/T04AG7BVD-FLNJJURL1-7b17f26c80
The image should be shown. Instead I get this error, both in Slack's Block Kit Builder and with a direct Slack API call: Downloading image failed.
If I open the permalink_public in an incognito session, I can see the file, so it is public.
The reason the link from permalink_public does not work in your layout block is that it links to a public website showing the image; it is not a direct link to the image file (which is what you need, of course).
But you can construct a direct image link from the link to the website.
The website link you get from permalink_public has the format:
https://slack-files.com/{team_id}-{file_id}-{pub_secret}
The direct link to the image has the format:
https://files.slack.com/files-pri/{team_id}-{file_id}/{filename}?pub_secret={pub_secret}
So you just need to extract the pub_secret from permalink_public and you should be able to construct the direct link to the image. The other parameters you can get from your file object.
Example for your image:
https://files.slack.com/files-pri/T04AG7BVD-FLWHBHY86/no_image_found.png?pub_secret=1ba8263c00
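A small Python sketch of that extraction (the helper name is made up; it just assumes the two URL formats shown above):

import re

def direct_image_url(permalink_public, filename):
    # permalink_public format: https://slack-files.com/{team_id}-{file_id}-{pub_secret}
    match = re.match(
        r"https://slack-files\.com/([^-]+)-([^-]+)-(.+)",
        permalink_public,
    )
    team_id, file_id, pub_secret = match.groups()
    # direct-link format: https://files.slack.com/files-pri/{team_id}-{file_id}/{filename}?pub_secret={pub_secret}
    return "https://files.slack.com/files-pri/{0}-{1}/{2}?pub_secret={3}".format(
        team_id, file_id, filename, pub_secret
    )

# direct_image_url("https://slack-files.com/T04AG7BVD-FLWHBHY86-1ba8263c00", "no_image_found.png")
# -> https://files.slack.com/files-pri/T04AG7BVD-FLWHBHY86/no_image_found.png?pub_secret=1ba8263c00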
Note that this does not appear to be a documented approach, so like all undocumented approaches and hacks it is subject to change.
I am working on a Telegram bot that displays images from several webcams upon request. I fetch the images from URLs and then send them to the user (using bot.sendPhoto()). My problem is that for any given webcam the filename never changes, so the photo seems to be served from Telegram's cache and the user gets the image from the first time that webcam was requested.
I have thought about downloading the image from the URL and saving it under a varying name (like a name with a timestamp in it) before sending it to the chat, but this seems like an inelegant solution, and I was hoping for something better, like forcing the image not to be cached on the Telegram server.
I am using the python-telegram-bot wrapper, but I am not sure that it's specific to that.
Any ideas? I have tried searching but so far am turning up little.
Thanks in advance.
I had the same problem too, but I've found the simplest solution.
When you request the image, you just have to add a timestamp parameter to the image link.
Example:
http://www.example.com/img/img.jpg?a=TIMESTAMP
Where TIMESTAMP is the current timestamp, produced by whatever timestamp function your language provides.
Simple but tricky ;)
I think the best way is to do the same as we do in React, where calls to the same URL are likewise served from the cache first.
If you are using Python, the easiest way is:
import datetime

timestamp = datetime.datetime.now().isoformat()
# the statement above returns something like: '2013-11-18T08:18:31.809000'
pic_url = '{0}?a={1}'.format(img_url, timestamp)
Hope that helps!
I had the same problem. I wanted to create a bot which sends an image taken by a webcam of a ski slope (webcam.example.com/image.jpg). Unfortunately, the filename and thus the URL never change, and Telegram always sends the cached image. So I decided to alter the URL passed to the API.

To achieve this, I wrote a simple PHP page (example.com/photo.php) which redirects to the original URL of the photo. After that, I created a folder (example.com/getphoto/) on my webspace with a .htaccess file inside. The .htaccess redirects all requests in this folder to the photo.php page, which in turn redirects to the image (webcam.example.com/image.jpg). So you can append anything to the URL of the folder and still get the picture (e.g. example.com/getphoto/42 or example.com/getphoto/hrte8437g).

The Telegram API seems to cache photos by URL, so if you always append a different ending to the URL passed to the API, Telegram doesn't use the cached version and sends the current image instead. The easiest way to always change the URL is to add the current date to it.
example.com/photo.php
<?php
header("Location: http://webcam.example.com/image.jpg");
die();
?>
example.com/getphoto/.htaccess
RewriteEngine on
RewriteRule ^(.*)$ http://example.com/photo.php
in Python:
from time import gmtime, strftime

bot.sendPhoto(chat_id, 'http://example.com/getphoto/' + strftime("%Y-%m-%d_%H-%M-%S", gmtime()))
This workaround should also work in other languages like Java or PHP. You just need to change the way you get the current date.
I'm trying to export a CSV from this page via a Python script. The complicated part is that the page opens after clicking the export button on this page, begins the download, and closes again, rather than just hosting the file somewhere static. I've tried using the Requests library, among other things, but the file it returns is empty.
Here's what I've done:
url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=True&%3bexportFormat=CSV&%3bisExport=True%22+id%3d%22M_C_sCDTransactions_csfFilter_ExportDialog_hlAllCSV?exportAll=True&exportFormat=CSV&isExport=True'
from requests import get

with open('CD_Transactions_02-27-2017.CSV', "wb") as file:
    # issue the GET request
    response = get(url)
    # write the response body to the file
    file.write(response.content)
I'm sure I'm missing something obvious, but I'm pulling my hair out.
It looks like the file is being generated on demand, and the URL stays valid only as long as the session lasts.
There are multiple requests from the browser to the webserver (including POST requests).
So to get those files via code, you would have to simulate the browser, possibly including session state etc. (and in this case also __VIEWSTATE).
To see the whole communication, you can use the developer tools in the browser (usually F12, then select the Net/Network tab to see the traffic), or use something like Wireshark.
In other words, this won't be an easy task.
If this is open government data, it might be better to just ask that government for the data or ask for possible direct links to the (unfiltered) files (sometimes there is a public ftp server for example) - or sometimes there is an API available.
The file is created on demand, but you can download it anyway. Essentially you have to:
1. Establish a session to save cookies and viewstate
2. Submit a form in order to click the export button
3. Grab the link which lies behind the popped-up CSV button
4. Follow that link and download the file
You can find working code here (if you don't mind that it's written in R): Save response from web-scraping as csv file
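If you prefer Python, here is a rough sketch of those four steps with requests and BeautifulSoup; the hidden-field handling is standard for ASP.NET pages, but the exact __EVENTTARGET value for the export button is an assumption you would need to confirm in the browser's network tab:

import requests
from bs4 import BeautifulSoup

base = "http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx"

with requests.Session() as session:
    # 1. establish a session so cookies survive between requests
    page = session.get(base)
    soup = BeautifulSoup(page.text, "html.parser")

    # 2. replay the postback: copy every hidden field (__VIEWSTATE etc.)
    #    and name the export control (this value is hypothetical)
    form = {i["name"]: i.get("value", "") for i in soup.select("input[type=hidden]")}
    form["__EVENTTARGET"] = "M$C$sCDTransactions$csfFilter$ExportDialog$hlAllCSV"
    response = session.post(base, data=form)

    # 3./4. if the postback answers with the CSV itself, save it;
    #       otherwise the response contains the link to follow
    with open("CD_Transactions.csv", "wb") as f:
        f.write(response.content)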
How can I take a screenshot of a Flash website in Python 3.5.1? I tried something like this, but I can't see the video image.
from selenium import webdriver
def webshot(url, filename):
    browser = webdriver.PhantomJS()
    browser.get(url)
    browser.save_screenshot(filename)
    browser.quit()
webshot('https://www.youtube.com/watch?v=YQHsXMglC9A', 'screentest.png')
Short version: with YouTube's system, if you don't press the "play" button (initiate playback), no video is served. Loading the page in a browser is a form of initiating playback too; however, a webshot doesn't fulfill the YouTube server's requirements, so it won't work.
Long version:
How can I take a screenshot of a Flash website... I tried this but I
can't see video image.
webshot('https://www.youtube.com/watch?v=YQHsXMglC9A', 'screentest.png')
You cannot screenshot YouTube's video player content like this. The way YouTube works is that when the video page is ready, another server-side endpoint is queried to determine the actual video link (e.g. the correct file for the chosen quality settings). Basically, you have to appear to be a browser making an HTTP request to their servers. Their server hands out a temporary token to access the video link until the token expires, and so on. There are other issues, like CORS, to deal with. None of these things is done by your tool.
If only YouTube used a normal <video> tag with a simple MP4 link, your code would have worked.
The best you can get is something like below (see how there are no controls?) using:
webshot('https://www.youtube.com/embed/YQHsXMglC9A', 'screentest.png')
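PhantomJS is no longer maintained, so as a variation on the same idea, here is a hedged sketch with headless Chrome, which can actually initiate playback before the screenshot (the play-button selector is an assumption about YouTube's current player markup, and whether frames render also depends on the browser build's codec support):

from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
browser = webdriver.Chrome(options=options)
browser.get('https://www.youtube.com/watch?v=YQHsXMglC9A')

# press "play", then give the player a few seconds to render frames
browser.find_element(By.CSS_SELECTOR, 'button.ytp-large-play-button').click()
sleep(5)
browser.save_screenshot('screentest.png')
browser.quit()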
I have a scraper script that pulls binary content off publishers' websites. It's built to replace the manual action of saving hundreds of individual PDF files that colleagues would otherwise have to undertake.
The websites are credential-based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that serves the PDF file inside an iFrame.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something - I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the referring URL and saw these different URLs pointing to the same article, collected over a day of testing (I have scrubbed identifiers from the URLs):
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something unique in the URL that needs to be associated with something else to get around the direct-link detector.
Any suggestions on how to get round this problem?
OK. The answer was cookies and headers. I collected the GET header info via HttpFox, made an identical header object in my script, grabbed the session ID from the request cookie, and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.
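For anyone with the same problem, a sketch of that approach with requests (the header values and URLs are placeholders for what you would copy out of HttpFox or the browser's dev tools):

import requests

headers = {
    # a known working browser agent, in case the server checks it
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # hypothetical article page; the server may check the referer too
    "Referer": "http://content_provider.com/elibrary/title/issue/",
}

with requests.Session() as session:
    # visit the article page first so the server sets its session cookie
    session.get("http://content_provider.com/elibrary/title/issue/article", headers=headers)

    # the session replays that cookie, so this no longer looks like a direct file request
    pdf = session.get(
        "http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf",
        headers=headers,
    )
    with open("article.pdf", "wb") as f:
        f.write(pdf.content)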