I need to get file info (path, size, dates, etc.) and save it to a .txt file, but I don't know where or how to do it.
This is what I have:
ruta = "FolderPath"
os.listdir(path=ruta)
miArchivo = open("TxtPath","w")
def getListOfFiles(ruta):
listOfFile = os.listdir(ruta)
allFiles = list()
for entry in listOfFile:
fullPath = os.path.join(ruta, entry)
if os.path.isdir(fullPath):
allFiles = allFiles + getListOfFiles(fullPath)
else:
allFiles.append(fullPath)
return allFiles
listOfFiles = getListOfFiles(ruta)
for elem in listOfFiles:
print(elem)
print("\n")
miArchivo.write("%s\n" % (elem))
miArchivo.close()
The output is (only path, no other info):
What I want is:
V:\1111111\222222222\333333333\444444444\5555555555\66666666\Folder\File name -- size -- modification date and so on
I think that you may want to use scandir instead of listdir for this:
for item in os.scandir(my_path):
    print(item.name, item.path, item.stat().st_size, item.stat().st_atime)
You will also want to check the documentation linked below for more detailed information regarding the appropriate calls (for the time you are looking for and the size). Note that os.scandir was added in Python 3.5.
https://docs.python.org/2.7/library/os.path.html#module-os.path
os.path.getsize(path)   # size in bytes
os.path.getctime(path)  # time of last metadata change; it's a bit OS-specific
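Putting those calls together, here is a minimal sketch of what the question asks for, covering a single folder (my_path and info.txt are placeholder names; st_mtime is used because the question asks for the modification date):

import os
import datetime

my_path = "FolderPath"  # placeholder: the folder to scan
with open("info.txt", "w") as out:  # placeholder output file
    for item in os.scandir(my_path):
        if item.is_file():
            st = item.stat()
            mod_date = datetime.datetime.fromtimestamp(st.st_mtime)
            out.write("%s -- %s bytes -- %s\n" % (item.path, st.st_size, mod_date))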
Here's a rewrite of your program. I did this:
Reformatted with autopep8 for better readability. (That's something you can install to prettify your code. IDEs such as PyCharm Community Edition can help you do the same, in addition to helping you with code completion and a GUI debugger.)
Made your getListOfFiles() return a list of tuples. There are three elements in each one: the filename, the size, and the timestamp of the file, which appears to be what's known as an epoch time (time in seconds since 1970; you will have to go through the Python documentation on dates and times).
Each tuple is written to your text file in a CSV-style format (but note there are modules that do the same in a much better way; see the sketch below).
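For example, a minimal sketch with the standard csv module (assuming Python 3 and the same (path, size, ctime) tuples; the file name is a placeholder):

import csv

# rows shaped like getListOfFiles() output: (path, size, ctime)
list_of_files = [("FolderPath/dir1/file1", 1, 1583242490.17)]

with open("TxtPath", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "size", "ctime"])  # header row
    writer.writerows(list_of_files)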
Rewritten code:
import os


def getListOfFiles(ruta):
    listOfFile = os.listdir(ruta)
    allFiles = list()
    for entry in listOfFile:
        fullPath = os.path.join(ruta, entry)
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            print('getting size of fullPath: ' + fullPath)
            size = os.path.getsize(fullPath)
            ctime = os.path.getctime(fullPath)
            item = (fullPath, size, ctime)
            allFiles.append(item)
    return allFiles


ruta = "FolderPath"
miArchivo = open("TxtPath", "w")
listOfFiles = getListOfFiles(ruta)
for elem in listOfFiles:
    miArchivo.write("%s,%s,%s\n" % (elem[0], elem[1], elem[2]))
miArchivo.close()
Now it does this.
my-MBP:verynew macbookuser$ python verynew.py; cat TxtPath
getting size of fullPath: FolderPath/dir2/file2
getting size of fullPath: FolderPath/dir2/file1
getting size of fullPath: FolderPath/dir1/file1
FolderPath/dir2/file2,3,1583242888.4
FolderPath/dir2/file1,1,1583242490.17
FolderPath/dir1/file1,1,1583242490.17
my-MBP:verynew macbookuser$
To interpret the dates, use https://stackoverflow.com/a/52858040/11262633. Building on YamiOmar88's great answer:
import os
import datetime

def ts_to_dt(ts):
    return datetime.datetime.fromtimestamp(ts)

for item in os.scandir(my_path):
    # note: st_atime is the access time; use st_mtime if you want the modification date
    print(item.name, item.path, item.stat().st_size, ts_to_dt(item.stat().st_atime))
I am making code which generates a new text file with today's date each time it is run. For example, today's file name would be 2020-10-05. I would like to increment it so that if the program is run one or more times the same day it becomes 2020-10-05_1, _2, etc.
I have this code that I found in another question and I've tried tinkering with it, but I'm still stuck. The problem is that here they convert the file name to an int (1, 2, 3), and that way it works, but this isn't the result I want.
def incrementfile():
    todayday = datetime.datetime.today().date()
    output_folder = "//10.2.30.61/c$/Qlikview_Tropal/Raport/"
    highest_num = 0
    for f in os.listdir(output_folder):
        if os.path.isfile(os.path.join(output_folder, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print("The file name %s is not an integer. Skipping" % file_name)
    output_file = os.path.join(output_folder, str(highest_num + 1) + f"{todayday}" + ".txt")
    return output_file
How can I modify this code so that the output I get in the end is something like 2020-10-05_0, _1, _2, etc.?
Thanks !
I strongly recommend using pathlib instead of os.path.join. It is more convenient:
import datetime
import pathlib

def incrementfile():
    td = datetime.datetime.today().date()
    path = pathlib.Path("/tmp")  # set your output folder instead of /tmp
    inc = len(list(path.glob(f"{td}*"))) + 1
    outfile = path / f"{td}_{inc}.txt"
    return outfile
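One caveat with counting matching files: if a file from earlier in the day gets deleted, the count can collide with a name that still exists. A deletion-safe sketch under the same assumptions (the /tmp folder is a placeholder), which also produces the _0, _1, _2 sequence the question asks for:

import datetime
import pathlib

def incrementfile():
    td = datetime.datetime.today().date()
    path = pathlib.Path("/tmp")  # placeholder: set your output folder
    inc = 0
    # probe suffixes _0, _1, _2, ... until one is free
    while (path / f"{td}_{inc}.txt").exists():
        inc += 1
    return path / f"{td}_{inc}.txt"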
Not a direct answer to your question, but instead of using _1, _2, etc., you could use a full timestamp with date and current time, which would avoid duplication, e.g.:
from datetime import datetime
t = str(datetime.now()).replace(":", "-").replace(" ", "_")
print(t)
Example output:
2020-10-05_13-06-53.825870
I think this will work:
import os
import datetime

# assuming files will be in .txt format
def incrementfile():
    output_folder = "//10.2.30.61/c$/Qlikview_Tropal/Raport/"
    files = os.listdir(output_folder)
    today = datetime.date.today().strftime('%Y-%m-%d')

    def name_checker(name, files):
        return name + '.txt' in files

    # rebuild the candidate name on each pass so suffixes don't accumulate
    current_num = 0
    current_name = f'{today}_{current_num}'
    while name_checker(current_name, files):
        current_num += 1
        current_name = f'{today}_{current_num}'
    return current_name + '.txt'
I'm fairly new to using Python. I have been trying to set up a very basic web scraper to help speed up my workday; it is supposed to download images from a section of a website and save them.
I have a list of urls and I am trying to use urllib.request.urlretrieve to download all the images.
The output location (savepath) updates so it adds 1 to the current highest number in the folder.
I've tried a bunch of different ways but urlretrieve only saves the image from the last url in the list. Is there a way to download all the images in the url list?
to_download = ['url1', 'url2', 'url3', 'url4']
for t in to_download:
    urllib.request.urlretrieve(t, savepath)
This is the code I was trying to use to update the savepath every time
def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print('The file name "%s" is not an integer. Skipping' % file_name)
    output_file = os.path.join(photos, str(highest_num + 1))
    return output_file
As suggested by @vks, you need to update savepath (otherwise you save each URL onto the same file). One way to do so is to use enumerate:
from urllib import request

to_download = ['https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/']
for i, url in enumerate(to_download):
    save_path = f'website_{i}.txt'
    print(save_path)
    request.urlretrieve(url, save_path)
which you may want to contract into:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]
see:
Python3 doc: Python enumerate doc
Example of enumerate: enumerate example
Example of f' using a string with a {variable}': f string example
FOR SECOND PART OF THE QUESTION:
Not sure what you are trying to achieve but:
def getNextFilePath(photos):
    file_list = os.listdir(photos)
    file_list = [int(s) for s in file_list if s.isdigit()]
    print(file_list)
    max_id_file = max(file_list)
    print(f'max id:{max_id_file}')
    output_file = os.path.join(photos, str(max_id_file + 1))
    print(f'output file path:{output_file}')
    return output_file
This will hopefully find all files that are named with digits (IDs), find the highest ID, and return a new file name of max_id + 1.
I guess that this will replace the save_path in your example.
Quickly coding, and modifying the above function so that it returns the max_id and not the path, the code below should be a working example using the iterator:
import os
from urllib import request

photo_folder = os.path.curdir

def getNextFilePath(photos):
    file_list = os.listdir(photos)
    print(file_list)
    file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
    if not file_list:
        return 0
    print(file_list)
    max_id_file = max(file_list)
    # print(f'max id:{max_id_file}')
    # output_file = os.path.join(photo_folder, str(max_id_file + 1))
    # print(f'output file path:{output_file}')
    return max_id_file

def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)
    for i, url in enumerate(to_download):
        # +1 so the highest existing ID is not overwritten
        save_path = f'{i + start_id + 1}.png'
        output_file = os.path.join(photo_folder, save_path)
        print(output_file)
        request.urlretrieve(url, output_file)
You should add exception handling, etc., but this seems to be working, if I understood correctly.
Are you updating savepath? If you pass the same savepath to each loop iteration, it is likely just overwriting the same file over and over.
Hope that helps, happy coding!
I'm trying to extract files from a zip file using Python 2.7.1 (on Windows, fyi) and each of my attempts shows extracted files with Modified Date = time of extraction (which is incorrect).
import os, zipfile

outDirectory = 'C:\\_TEMP\\'
inFile = 'test.zip'
fh = open(os.path.join(outDirectory, inFile), 'rb')
z = zipfile.ZipFile(fh)
for name in z.namelist():
    z.extract(name, outDirectory)
fh.close()
I also tried using the .extractall method, with the same results.
import os, zipfile

outDirectory = 'C:\\_TEMP\\'
inFile = 'test.zip'
zFile = zipfile.ZipFile(os.path.join(outDirectory, inFile))
zFile.extractall(outDirectory)
Can anyone tell me what I'm doing wrong?
I'd like to think this is possible without having to post-correct the modified time per How do I change the file creation date of a Windows file?.
Well, it does take a little post-processing, but it's not that bad:
import os
import zipfile
import time

outDirectory = 'C:\\TEMP\\'
inFile = 'test.zip'
fh = open(os.path.join(outDirectory, inFile), 'rb')
z = zipfile.ZipFile(fh)
for f in z.infolist():
    name, date_time = f.filename, f.date_time
    name = os.path.join(outDirectory, name)
    with open(name, 'wb') as outFile:
        outFile.write(z.open(f).read())
    # ZipInfo.date_time is a 6-tuple; pad it to the 9 fields mktime() expects
    # (weekday and yearday are ignored, and -1 means "DST unknown")
    date_time = time.mktime(date_time + (0, 0, -1))
    os.utime(name, (date_time, date_time))
Okay, maybe it is that bad.
Based on Jia103's answer, I have developed a function (using Python 2.7.14) which preserves directory and file dates AFTER everything has been extracted. This isolates any ugliness in the function, and you can also use zipfile.ZipFile.extractall() or whatever zip extract method you want:
import time
import zipfile
import os


# Restores the timestamps of zipfile contents.
def RestoreTimestampsOfZipContents(zipname, extract_dir):
    for f in zipfile.ZipFile(zipname, 'r').infolist():
        # path to this extracted f-item
        fullpath = os.path.join(extract_dir, f.filename)
        # still need to adjust the dt, o/w item will have the current dt
        date_time = time.mktime(f.date_time + (0, 0, -1))
        # update dt
        os.utime(fullpath, (date_time, date_time))
To preserve dates, just call this function after your extract is done.
Here's an example, from a script I wrote to zip/unzip game save directories:
z = zipfile.ZipFile(zipname, 'r')
print 'I have opened zipfile %s, ready to extract into %s' \
    % (zipname, gamedir)
try: os.makedirs(gamedir)
except: pass  # Most of the time dir already exists
z.extractall(gamedir)
RestoreTimestampsOfZipContents(zipname, gamedir)  # <-- USED
print '%s zip extract done' % GameName[game]
Thanks everyone for your previous answers!
Based on Ethan Fuman's answer, I have developed this version (using Python 2.6.6) which is a little more concise:
import os
import time
from zipfile import ZipFile

zf = ZipFile('archive.zip', 'r')
for zi in zf.infolist():
    zf.extract(zi)
    date_time = time.mktime(zi.date_time + (0, 0, -1))
    os.utime(zi.filename, (date_time, date_time))
zf.close()
This extracts to the current working directory and uses the ZipFile.extract() method to write the data instead of creating the file itself.
Based on Ber's answer, I have developed this version (using Python 2.7.11), which also accounts for directory mod dates.
from os import path, utime
from sys import exit
from time import mktime
from zipfile import ZipFile


def unzip(zipfile, outDirectory):
    dirs = {}
    with ZipFile(zipfile, 'r') as z:
        for f in z.infolist():
            name, date_time = f.filename, f.date_time
            name = path.join(outDirectory, name)
            z.extract(f, outDirectory)
            # still need to adjust the dt, o/w item will have the current dt
            date_time = mktime(f.date_time + (0, 0, -1))
            if path.isdir(name):
                # changes to dir dt will have no effect right now since files are
                # being created inside of it; hold the dt and apply it later
                dirs[name] = date_time
            else:
                utime(name, (date_time, date_time))
    # done creating files, now update dir dt
    for name in dirs:
        date_time = dirs[name]
        utime(name, (date_time, date_time))


if __name__ == "__main__":
    unzip('archive.zip', 'out')
    exit(0)
Since directories are being modified as the extracted files are being created inside them, there appears to be no point in setting their dates with os.utime until after the extraction has completed, so this version caches the directory names and their timestamps till the very end.
I'm writing yet another Python purge script. This is replacing a very old bash script full of find -delete calls, which takes up to 9 hours to purge our video backend.
I know there are tons of these, either on Stack Overflow or right in Google, but the thing is I have a few more constraints, which left me writing what I find to be poor/inefficient code.
Consider the following dir structure:
/data/channel1/video_800/0001/somefile_800_001.ts
/data/channel1/video_800/0001/somefile_800_002.ts
/data/channel1/video_800/0002/somediffile_800_001.ts
/data/channel1/video_800/0002/somediffile_800_002.ts
/data/channel1/video_800.m3u8
/data/channel1/video_900/0001/someotherfile_900_001.ts
/data/channel1/video_900/0002/afile_900_001.ts
/data/channel1/video_900/0003/bfile_900_001.ts
/data/channel1/video_900/0003/cfile_900_001.ts
/data/channel1/video_900.m3u8
/data/channel2/video_800/0001/againsomefile_800_001.ts
/data/channel2/video_800/0001/againsomefile_800_001.ts
/data/channel2/video_800.m3u8
/data/sport_channel/video_1000/0001/somefile.ts
/data/sport_channel/video_1000/0001/somefile2.ts
The first thing that interests me is the channel name, since there is a rule for channel* and one for sport*.
The second thing is the end of the video dirs, which equals the bitrate (800, 900, 1000), since these can have different retention days.
Finally, I go through everything and remove files based on bitrate and extension.
The code below works but is overly complicated and, I'm sure, not very Pythonic. Since what I care about most in this case is performance, I'm sure there is a more efficient way to do this. Stacking for loop inside for loop is not only poor design but also gets me a "'find_files' is too complex [mccabe]" warning in pymode.
** I left the remove function out of the code example, but it's just a plain try/except using os.rmdir and os.remove; a sketch of it appears right after the code below.
I'm open to all suggestions to improving my code.
Thanks!
#!/usr/bin/python
import os
import time
import fnmatch

path = '/data'
debits_short = ['200', '700', '1000', '1300', '2500']
debits_long = ['400', '1800']


def find_files(chan_name, debits, duration):
    time_in_secs = time.time() - (duration * 24 * 60 * 60)
    # List channels
    for channel in os.listdir(path):
        # Match category channels
        if fnmatch.fnmatch(channel, chan_name):
            # Go through bitrates
            for debit in debits:
                # Channel path now the default search path
                channel_path = os.path.join(path, channel)
                # Walk through channel path to match bitrate files
                for root, dirs, files in os.walk(channel_path, topdown=False):
                    for filename in files:
                        # Remove files that contain _bitrate_ and end with .ts
                        if '_' + debit + '_' in filename:
                            if filename.endswith('.ts'):
                                if os.path.isfile(os.path.join(root, filename)):
                                    if os.stat(os.path.join(root, filename)).st_mtime <= time_in_secs:
                                        remove(os.path.join(root, filename))
                        # Remove playlist files that contain bitrate.m3u8
                        if filename.endswith(debit + '.m3u8'):
                            if os.path.isfile(os.path.join(root, filename)):
                                if os.stat(os.path.join(root, filename)).st_mtime <= time_in_secs:
                                    remove(os.path.join(root, filename))
                    # Remove empty dirs
                    for dir in dirs:
                        if not os.listdir(os.path.join(root, dir)):
                            remove(os.path.join(root, dir))


find_files('channel*', debits_long, 3)
find_files('sport*', debits_short, 7)
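For reference, a sketch of the remove() helper left out above, based on its description (a plain try/except around os.remove and os.rmdir; the exact error handling shown here is an assumption):

import os

def remove(target):
    # try to remove as a file first, then as an (empty) directory
    try:
        os.remove(target)
    except OSError:
        try:
            os.rmdir(target)
        except OSError:
            pass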
Here's a possible approach:
import os
import glob
import time


class Purge(object):

    removable_extensions = ['ts', 'm3u8']

    def __init__(self, basedir, channel_pattern, debits,
                 older_than_days, test_mode=False):
        self.basedir = basedir
        self.channel_pattern = channel_pattern
        self.debits = debits
        self.older_than_secs = time.time() - 24 * 60 * 60 * older_than_days
        self.test_mode = test_mode  # If `True`, do not delete files.

    def delete_file(self, filepath):
        try:
            os.remove(filepath)
        except OSError:
            pass

    def file_for_deletion(self, filepath):
        # Return `True` if a file meets all conditions for deletion.
        filename, ext = os.path.splitext(os.path.basename(filepath))
        condition_ext = ext[1:] in self.removable_extensions
        condition_old = os.stat(filepath).st_mtime <= self.older_than_secs
        condition_deb = any(
            '_{}_'.format(d) in filename or filename.endswith(d)
            for d in self.debits
        )
        return all((condition_ext, condition_old, condition_deb))

    def purge_channel(self, channel_dir):
        for root, dirs, files in os.walk(channel_dir):
            for name in files:
                filepath = os.path.join(root, name)
                if self.file_for_deletion(filepath):
                    print filepath
                    if not self.test_mode:
                        self.delete_file(filepath)
        # TODO: delete empty directories here.

    def purge(self):
        channels = glob.glob(os.path.join(self.basedir, self.channel_pattern))
        for channel_dir in channels:
            self.purge_channel(channel_dir)


if __name__ == '__main__':
    purge_job_info = dict(
        basedir=r'path/to/data',     # All channel folders live here.
        channel_pattern='channel*',  # `glob` pattern.
        debits=['400', '1800'],
        older_than_days=7,
    )
    p = Purge(**purge_job_info)
    p.test_mode = True
    p.purge()
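Since performance is the stated concern, one more note: from Python 3.5 on, os.walk is implemented on top of os.scandir and is considerably faster, and DirEntry objects cache file-type information (for a Python 2 codebase there is a scandir backport package on PyPI). A rough sketch of the deletion loop written directly against os.scandir (Python 3.5+; names and retention values are placeholders):

import os
import time

def purge_tree(top, debit, max_age_days):
    # recursively delete old .ts files for one bitrate
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            purge_tree(entry.path, debit, max_age_days)
        elif entry.name.endswith('.ts') and '_%s_' % debit in entry.name:
            if entry.stat().st_mtime <= cutoff:
                os.remove(entry.path)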
When a registered user uploads a file such as a PDF to MEDIA_ROOT (a directory named usermedia), the document is saved in that directory as 12345676542.pdf. This number is the user's OIB number, which is given when they register:
def handle_uploaded_file(f, wusr):
    nname = "%s.%s" % (str(wusr.oib), f.name.split(".")[1])
    print nname
    destination = open('%s/%s' % (MEDIA_ROOT, nname), 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
But when the user uploads another document, it overwrites the previous one.
How can I arrange for the next uploaded file to get a name like 12345676542-1.pdf?
You either need to maintain a data store recording the last index used by that user, or search the file system for that user's existing files and find the first unused (or last used) index, then create your new file with that.
Here's an example of a solution. Keep in mind I haven't tested this, so there might be syntax errors. Treat this as a suggestion.
import os.path


def handle_uploaded_file(f, wusr):
    nname = "%s.%s" % (str(wusr.oib), f.name.split(".")[1])
    # build the full path before the uniqueness check, so the check looks
    # in MEDIA_ROOT rather than the current working directory
    nname = unique('%s/%s' % (MEDIA_ROOT, nname))
    destination = open(nname, 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()


# Return a unique file name in the format <filename>-<num>.<ext>
def unique(path):
    num = 0
    newpath = path

    def fileExists(path):
        return os.path.isfile(path)

    # Keep incrementing until a unique filename is reached
    while fileExists(newpath):
        num += 1
        pieces = path.rsplit('.', 1)
        newpath = "%s-%d.%s" % (pieces[0], num, pieces[1])
    return newpath
The unique function would generate a new file name guaranteed to be unique. This particular solution, which checks the disk on every iteration, could be problematic once you reach a high number of identically named uploads. If the speed of this solution turns out to be a problem, just list all files in the directory once to begin with and perform the above operations on that list instead. That will reduce the number of disk operations from x to 1, as sketched below.
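A sketch of that single-listing variant, under the same assumptions (the directory argument is a placeholder for MEDIA_ROOT):

import os


def unique(directory, filename):
    # one disk operation: snapshot the directory contents
    existing = set(os.listdir(directory))
    if filename not in existing:
        return filename
    base, ext = filename.rsplit('.', 1)
    num = 1
    # probe in memory until an unused name is found
    while "%s-%d.%s" % (base, num, ext) in existing:
        num += 1
    return "%s-%d.%s" % (base, num, ext)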
Your code needs to check for existing files until it finds an appropriate unused filename. Something like this:
import os

filename = base_filename = '123456765432'
ext = '.pdf'
suffix = 0
while os.path.exists(filename + ext):
    suffix += 1
    filename = '%s-%d' % (base_filename, suffix)
# filename + ext now refers to a name that does not exist yet