Can a website stop a program from downloading files automatically? - python

I am writing a program in python that will automatically download pdf files from a website once a day.
While testing, I noticed that the downloaded files had the correct extension but were very small (<1 kB) compared to the normal size of about 100 kB when downloaded manually.
Can a website block a program from automatically downloading files?
Is there anything that can be done about this?

Yes. Services such as Cloudflare can block bots from downloading files. Blocking is usually done by detecting the user agent or by including JavaScript in the page. I would open the downloaded PDF in Notepad to see what it actually contains, and also try adding a User-Agent header in your Python code.
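As a rough sketch (the URL here is a placeholder, not the asker's actual site), adding the header with the requests library and checking what actually came back could look like this:

    import requests

    # Placeholder URL; replace with the real PDF link.
    url = "https://example.com/report.pdf"

    # Some sites serve a tiny HTML error or challenge page to clients without a
    # browser-like User-Agent, which would explain the sub-1 kB "PDF" files.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        )
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Sanity check: a real PDF starts with "%PDF"; an HTML block page will not.
    if response.content[:4] != b"%PDF":
        print("Got something other than a PDF:", response.content[:200])
    else:
        with open("report.pdf", "wb") as f:
            f.write(response.content)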

Related

GitHub Actions - Where are downloaded files saved?

I've seen plenty of questions and docs about how to download artifacts generated in a workflow to pass between jobs. However, I've only found one thread about persisting downloaded files between steps of the same job, and am hoping someone can help clarify how this should work, as the answer on that thread doesn't make sense to me.
I'm building a workflow that navigates a site using Selenium and exports data manually (sadly there is no API). When running this locally, I am able to navigate the site just fine and click a button that downloads a CSV. I can then re-import that CSV for further processing (ultimately, it's getting cleaned and sent to Redshift). However, when I run this in GitHub Actions, I am unclear where the file is downloaded to, and am therefore unable to re-import it. Some things I've tried:
1. Echoing the working directory when the workflow runs, and setting up my pandas.read_csv() call to import the file from that directory.
2. Downloading the file and then echoing os.listdir() to print the contents of the working directory. When I do this, the CSV file is not listed, which makes me believe it was not saved to the working directory as expected (which would explain why #1 doesn't work).
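For reference, the check described in the second attempt boils down to something like this (the CSV name used later is a placeholder):

    import os

    # Print where the job step is actually running and what that directory contains.
    print("Working directory:", os.getcwd())
    print("Contents:", os.listdir(os.getcwd()))

    # On the hosted runner the exported CSV never shows up in this listing,
    # so a later pandas.read_csv() pointed at this directory cannot find it.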
FWIW, the website in question does not give me the option to choose where the file downloads. When run locally, I hit the button on the site, and it automatically exports a CSV to my Downloads folder. So I'm at the mercy of wherever GitHub decides to save the file.
Last, because I feel like someone will suggest this - it is not an option for me to use read_html() to scrape the file from the page's HTML.
Thanks in advance!

Automatically organize downloads from Whatsapp Web with Python

Is it possible to write a Python program that automatically organizes downloads from WhatsApp Web?
By default, when you download an image (or file) from WhatsApp Web on Windows, it stays in the folder "C:\Users\Name_User\Downloads".
The purpose of the program is to dynamically change the default directory and to store each download according to the number (or name) of the contact from which the file comes.
Is this possible in Python?
Sure, you can manipulate and list files with the standard os module (copy, delete, move files, create directories, etc.). There is also a third-party module called watchdog that can monitor a directory, or even individual files, for changes.
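A minimal sketch of the watchdog idea, with placeholder folder names; how to map a downloaded file to a specific contact is deliberately left open, since the file alone does not say which chat it came from:

    import shutil
    import time
    from pathlib import Path

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    DOWNLOADS = Path.home() / "Downloads"        # default WhatsApp Web target
    SORTED = Path.home() / "WhatsAppSorted"      # placeholder destination

    class DownloadHandler(FileSystemEventHandler):
        def on_created(self, event):
            if event.is_directory:
                return
            src = Path(event.src_path)
            # Deciding which contact a file belongs to is up to you; here
            # everything goes into a single placeholder subfolder.
            dest_dir = SORTED / "unsorted"
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest_dir / src.name))

    if __name__ == "__main__":
        observer = Observer()
        observer.schedule(DownloadHandler(), str(DOWNLOADS), recursive=False)
        observer.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()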

Troubles with downloading

I have written a script for one site, but the download links on that site aren't direct links to the files; they call some PHP script instead. How can I download the files?
It is a script for a Minecraft mod site. I just want to download the files themselves, not the PHP script.
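In general, a link that calls a PHP script can still be fetched directly: the script's HTTP response (possibly after a redirect) is the file itself. A rough sketch with requests, where the URL and file name are pure placeholders:

    import requests

    # Placeholder URL pointing at the site's PHP download handler.
    url = "https://example-mods-site.example/download.php?file=1234"

    with requests.get(url, stream=True, allow_redirects=True, timeout=60) as r:
        r.raise_for_status()
        # If present, Content-Disposition usually carries the real file name.
        print(r.headers.get("Content-Disposition"))
        with open("mod.jar", "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)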

Python script to download directory from URL

I want to copy my own photos in a given web directory to my Raspberry so I can display them in a slideshow.
I'm looking for a "simple" script to download these files using python. I can then paste this code into my slideshow so that it refreshes the pics every day.
I suppose that the python wget utility would be the tool to use. However, I can only find examples on how to download a single file, not a whole directory.
Any ideas how to do this?
It depends on the server hosting the images and on whether the script can see a list of the images to download. If this list isn't available in some form, e.g. a web page listing, a JSON feed, or an XML feed, there is no way for a script to download the files, because the script doesn't "know" what is there dynamically.
Another option is for a Python script to SSH into the server, list the contents of the directory, and then download the files. This presumes you have programmatic access to the server.
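If SSH access is available, a sketch of that approach using the third-party paramiko library might look like this (host, credentials, paths, and extensions are all assumptions):

    import paramiko

    HOST, USER, KEYFILE = "photos.example.com", "pi", "/home/pi/.ssh/id_rsa"
    REMOTE_DIR, LOCAL_DIR = "/var/www/photos", "/home/pi/slideshow"

    # Connect over SSH and open an SFTP session.
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(HOST, username=USER, key_filename=KEYFILE)
    sftp = ssh.open_sftp()

    # List the remote directory and fetch anything that looks like a photo.
    for name in sftp.listdir(REMOTE_DIR):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            sftp.get(f"{REMOTE_DIR}/{name}", f"{LOCAL_DIR}/{name}")

    sftp.close()
    ssh.close()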
If server access is not an option and there is no dynamic list, the last option would be to go to the page where you know the photos are, scrape their paths, and download them. However, this may pick up unwanted data such as other images, icons, etc.
https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
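As a rough illustration of that last option, scraping photo links from a listing page with requests and BeautifulSoup (the URL and file extensions are assumptions) could look like this:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    # Placeholder: the page that lists (links to) the photos.
    PAGE_URL = "https://example.com/photos/"

    html = requests.get(PAGE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a"):
        href = link.get("href", "")
        if href.lower().endswith((".jpg", ".jpeg", ".png")):
            file_url = urljoin(PAGE_URL, href)
            name = href.rsplit("/", 1)[-1]
            with open(name, "wb") as f:
                f.write(requests.get(file_url, timeout=30).content)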

download large file from web via python cgi

I have a small headless LAMP web server running, and I also use it for downloading files from the internet. At the moment I have to log in via SSH and start the download with wget. Some of the files are really large (exceeding 4 GB).
A nice solution would be to use a Python CGI script to add a link to a queue and let Python do the rest. I already know how to download files from the net (like here: Download file via python) and I know how to write the Python CGI (or WSGI) part. The problem is that the download script needs to keep running, which would mean keeping the HTTP connection alive the whole time, which would be pretty useless. Therefore I think I need some kind of background solution.
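A rough sketch of that hand-off idea: the CGI script only records the requested URL and returns immediately, and a separate worker (a cron job running wget, for example) consumes the queue. All paths here are placeholders:

    #!/usr/bin/env python
    # Append the requested URL to a queue file and exit right away, so the
    # HTTP connection is not held open for the duration of the download.
    import cgi
    import html

    QUEUE_FILE = "/var/spool/download-queue/urls.txt"  # placeholder path

    form = cgi.FieldStorage()
    url = form.getfirst("url", "")

    if url:
        with open(QUEUE_FILE, "a") as queue:
            queue.write(url + "\n")

    print("Content-Type: text/html")
    print()
    print("<p>Queued: %s</p>" % html.escape(url))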
Help or hints would be much appreciated.
Thanks in advance & best regards!
