I need to write a Python program that downloads a file from an HTTP server while preserving the file's original timestamp.
Accordingly, two questions:
How to get file date from HTTP server using Python 3.7?
How to set this date for the downloaded file?
You could have a look at requests to download the file and get the modification date from the headers.
To set the dates, you can use os.utime, with email.utils.parsedate to parse the date from the headers (see this answer by tzot).
Here is an example:
import datetime
import os
import time
import requests
import email.utils as eut
url = 'http://www.hamsterdance.org/hamsterdance/index-Dateien/hamster.gif'
r = requests.get(url)
with open('output', 'wb') as f:
    f.write(r.content)

# Parse the Last-Modified header into an epoch timestamp
last_modified = r.headers['last-modified']
modified = time.mktime(datetime.datetime(*eut.parsedate(last_modified)[:6]).timetuple())

# Keep the access time as "now"; set the modification time from the header
now = time.mktime(datetime.datetime.today().timetuple())
os.utime('output', (now, modified))
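On Python 3.3+, email.utils.parsedate_to_datetime can shorten the conversion; here is a minimal variant of the example above (assuming the server actually sends a Last-Modified header):

import os
import requests
from email.utils import parsedate_to_datetime

url = 'http://www.hamsterdance.org/hamsterdance/index-Dateien/hamster.gif'
r = requests.get(url)
with open('output', 'wb') as f:
    f.write(r.content)

# parsedate_to_datetime handles the RFC 2822-style date format used by HTTP,
# including the timezone, so .timestamp() yields correct epoch seconds
modified = parsedate_to_datetime(r.headers['last-modified']).timestamp()
os.utime('output', (modified, modified))  # (atime, mtime)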
I have a Python 3.10 script to download a PDF from a URL. I get no errors, but when I run the code the PDF does not download. I've done a sanity check to ensure the PDF is actually at the URL (it is).
I'm not sure if this has something to do with HTTP/HTTPS? The site has an expired HTTPS certificate, but it is a government site and this is really for testing only, so I am not worried about that and can ignore the error.
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests
import urllib3
urllib3.disable_warnings()
resp = requests.get('http:// url domain .org', verify=False)
urllib.request.urlopen('http:// my url .pdf')
filename = datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p")
save_path = "C:/Users/bob/Desktop/folder"
Or maybe is the issue something to do with urllib3 ignoring the error and urllib downloading the file?
Redacted the specific URL here
The urllib.request.urlopen method doesn't save the remote URL to a file -- it returns a response object that can be treated as a file-like object. You could do something like:
response = urllib.request.urlopen('http:// my url .pdf')
with open('filename.pdf', 'wb') as fd:
    fd.write(response.read())
The urllib.request.urlretrieve method, on the other hand, will take care of writing the remote content to a local file. You would use it like this to write the PDF file to a local file named filename.pdf:
response = urllib.request.urlretrieve('http://my url .pdf',
                                      filename='filename.pdf')
See the documentation for information about the return value from the urlretrieve method.
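For instance, urlretrieve returns a (local filename, headers) tuple; a quick sketch (the URL here is a placeholder):

import urllib.request

# urlretrieve returns the local path and the response headers
path, headers = urllib.request.urlretrieve('https://example.org/report.pdf',
                                           filename='filename.pdf')
print(path)                      # filename.pdf
print(headers['Content-Type'])   # e.g. application/pdf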
I have a script to download a PDF from the internet and save it to a specific directory. How can I go about appending the date and time to the file name?
# Import all needed modules and tools
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests
# Disable SSL and HTTPS Certificate Warnings
import urllib3
urllib3.disable_warnings()
resp = requests.get('url.org', verify=False)
# Get current date and time
current_datetime = datetime.now()
print("Current date & time : ". current_datetime)
# Convert datetime obj to string
str_current_datetime = str(current_datetime)
# Download and name the PDF file from the URL
response = urllib.request.urlretrieve('url.pdf',
                                      filename=r'my directory\civil.pdf')
# Save to the preferred directory
with open(r"my directory\civil.pdf", 'wb') as f:
    f.write(resp.content)
Use f-strings:
open(f"file - {datetime.now().strftime('%Y-%m-%D')}.txt", "w")
# will create a new file with the title: "file - Year-Month-Date.txt"
# then you can do whatever you want with it
f-string docs
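Applied to the PDF download above, a sketch (the URL is a placeholder and the directory is the one from your script; substitute your own):

import urllib.request
from datetime import datetime

url = 'https://example.org/civil.pdf'  # placeholder URL
stamp = datetime.now().strftime('%Y_%m_%d-%I_%M_%S_%p')

# Build the target path with the timestamp appended to the file name
filename = f'C:/Users/bob/Desktop/folder/civil_{stamp}.pdf'
urllib.request.urlretrieve(url, filename=filename)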
I'm trying to log COVID data from a website and update it each day with new cases. So far I have managed to put the numbers of cases in the file through scraping, but each day I have to manually enter the dates and run the file to get the updated statistics. How would I go about writing a script that updates the CSV each day with new dates and the new number of cases, while keeping the old ones for future use? I wrote this and run it in Visual Studio Code.
import csv
import bs4
import urllib
from urllib.request import urlopen as uReq
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
#For sites that can't be opened due to Urllib blocker, use a Mozilla User agent to get access
pageRequest = Request('https://coronavirusbellcurve.com/', headers = {'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
specificDiv = page_soup.find("div", {"class": "table-responsive-xl"})
TbodyStats = specificDiv.table.tbody.tr.contents
TbodyDates = specificDiv.table.thead.tr.contents
def writeCSV():
    with open('CovidHTML.csv', 'w', newline='') as file:
        theWriter = csv.writer(file)
        theWriter.writerow(['5/8', ' 5/9', ' 5/10', ' 5/11', ' 5/12'])
        row = []
        for i in range(3, len(TbodyStats), 2):
            row.append([TbodyStats[i].text])
        theWriter.writerow(row)

writeCSV()
If you want to preserve the older contents of the CSV file, open the file in append mode (as correctly pointed out by @bfris):
with open('CovidHTML.csv','a', newline= '') as file:
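If you also want the header row written only once, here is a minimal sketch that writes it only when the file is first created (the column names are hypothetical):

import csv
import os
from datetime import date

def append_row(path, row):
    # Write the header only if the file doesn't exist yet
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(['date', 'cases'])  # hypothetical header
        writer.writerow(row)

append_row('CovidHTML.csv', [date.today().strftime('%m/%d'), 12345])  # placeholder count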
If you are using Linux, you can set up a cron job to invoke the python script every day at some specific time.
First, locate the path to python using the which command:
$ which python3
This gave me
/usr/bin/python3
Then the cron job will look like:
10 14 * * * /usr/bin/python3 /path/to/python/file.py
Add this line to the crontab file. It will run the Python script every day at 2:10 PM.
You can take a look here for details.
In case you are using Windows, you can take a look at this question.
I have a web link that downloads an Excel file directly. It opens a page saying "your file is downloading" and starts downloading the file.
Is there any way I can automate this using the requests module?
I am able to do it with Selenium, but I want it to run in the background, so I was wondering if I could use the requests module instead.
I have used requests.get, but it simply gives the text, i.e. "your file is downloading", and somehow I am not able to get the file.
This Python 3 code downloads a file from the web into memory:
import requests
from io import BytesIO

url = 'your.link/path'

def get_file_data(url):
    response = requests.get(url)
    f = BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
    f.seek(0)
    return f

data = get_file_data(url)
You can then use the following code to read the Excel file:
import pandas as pd
xlsx = pd.read_excel(data, skiprows=0)
print(xlsx)
It sounds like you don't actually have a direct URL to the file and instead need to engage with some JavaScript. Perhaps there is an underlying network call, visible when you inspect the page traffic in your browser, that exposes a direct URL for downloading the file. With that, you can read the Excel file URL directly with pandas:
import pandas as pd
url = "https://example.com/some_file.xlsx"
df = pd.read_excel(url)
print(df)
This is nice and tidy, but if you really want to use requests (or avoid pandas) you can download the raw file content as shown in this answer and then use the pyexcel_xlsx package's get_xlsx function to read it without any pandas involvement.
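For illustration, here is a sketch of that pandas-free approach using openpyxl in place of pyexcel_xlsx (the URL is a placeholder):

import requests
from io import BytesIO
from openpyxl import load_workbook

url = 'https://example.com/some_file.xlsx'  # placeholder direct URL
resp = requests.get(url)
resp.raise_for_status()

# load_workbook accepts any file-like object, so no temp file is needed
wb = load_workbook(BytesIO(resp.content), read_only=True)
for row in wb.active.iter_rows(values_only=True):
    print(row)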
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and give you a csv reader over whichever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# We create a StringIO object so that we can work on the results of the
# request (a string) as though it is a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded
print z.namelist()

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
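Note that the code above is Python 2 (urllib2, StringIO, print statements). A rough Python 3 equivalent of the same approach, using requests and io.BytesIO:

import csv
import io
import requests
from zipfile import ZipFile

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
resp = requests.get(baseUrl)
z = ZipFile(io.BytesIO(resp.content))
print(z.namelist())

# The position of the data file in the archive may vary
with z.open(z.namelist()[1]) as f:
    # ZipFile.open returns bytes, so wrap it for the text-based csv reader
    reader = csv.reader(io.TextIOWrapper(f, encoding='utf-8'))
    for row in reader:
        print(row)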
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)

for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the data and metadata
Extracting metadata and data
Converting to a Data Package
The script is written in Python 3 and has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank
This is more a suggestion than a solution. You can use pd.read_csv to read a CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')