List of filenames modified within 1 week - python

I've got a segment of my script which will create a list of files to scan through for keywords.
The problem is, the log files collectively are around 11 GB. When I use grep in the shell to search through them, it takes around 4 or 5 minutes. When I do it with my Python script, it hangs the server to the extent that I need to reboot it.
It doesn't seem right that it would bring down the whole server, and in reality I don't need it to go through all the files, just those modified within the last week.
I've got this so far:
logs = [log for log in glob('/var/opt/cray/log/p0-current/*') if not os.path.isdir(log)]
I assume I will need to add something prior to this to initially filter out the wrong files?
I've been playing with os.path.getmtime in this format:
logs = [log for log in glob('/var/opt/cray/log/p0-current/*') if not os.path.isdir(log)]

for log in logs:
    mtime = os.path.getmtime(log)
    if mtime < "604800":
        do-stuff (create a new list? Or update logs?)
That's kind of where I am now. It doesn't work, but I was hoping there was something more elegant I could do inline with the list?

Depending on how many filenames and how little memory (512MB VPS?), it's possible you're running out of memory creating two lists of all the filenames (one from glob and one from your list comprehension). Not necessarily the case, but it's all I have to go on.
Try switching to iglob (which uses os.scandir under the hood and returns an iterator) and using a generator expression, and see if that helps.
Also, getmtime returns an absolute timestamp (seconds since the epoch), not an interval measured from now.
import os
import glob
import time

week_ago = time.time() - 7 * 24 * 60 * 60

log_files = (
    x for x in glob.iglob('/var/opt/cray/log/p0-current/*')
    if not os.path.isdir(x)
    and os.path.getmtime(x) > week_ago
)

for filename in log_files:
    pass  # do something
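If the keyword search itself was reading whole files into memory, that alone could explain the hang on 11 GB of logs. In place of the placeholder loop above, a line-by-line scan keeps memory use flat no matter how large each log is; a minimal sketch (the keywords tuple here is hypothetical, substitute your own):
keywords = ("ERROR", "timeout")  # hypothetical keywords; substitute your own

for filename in log_files:
    with open(filename, errors="replace") as f:
        for line in f:  # read one line at a time so memory stays flat
            if any(k in line for k in keywords):
                print(filename, line.rstrip(), sep=": ")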

Related

Is there a way to get N oldest filenames in a directory without fetching all filenames (preferred in python or shell)

I am using the code below to get the JSON filenames in a directory.
import glob
jsonFiles = glob.glob(folderPath+"*.json")
Many new JSON files get created in the directory every second (say 100/s). Usually it works fine, but it gets stuck when the number of files is large (~150,000) and takes a long time (3-4 minutes) to retrieve the filenames. This might be because of the high incoming rate (not sure).
Is there any alternative approach to get the filenames efficiently, using Python or a Linux command?
Getting the oldest 1000 filenames would work too; I don't need all the filenames at once.
I came across following shell command:
ls -Art | head -n 1000
Will it help? Does it list all the filenames first and then retrieve the 1000 oldest records? Thanks in advance.
Found scandir to be useful.
# Python version 2.x
import scandir

ds = scandir.scandir('./files/')
fileNames = []
count = 0
for file in ds:
    count += 1
    fileNames.append(file.name)
    if count == 1000:
        break

# Python version 3.x
import os

ds = os.scandir('./files/')
...
This gives 1000 filenames from the directory, in arbitrary order, without listing all of them. If we don't break out of the loop, it will keep providing filenames in that arbitrary order (a filename, once given, won't be repeated).
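If you specifically need the oldest 1000 rather than an arbitrary 1000, one option is to feed the scandir entries through heapq.nsmallest keyed on modification time. It still has to walk every directory entry once, but it only ever keeps 1000 of them in memory. A minimal sketch (the function name is my own, and it assumes mtime is what you mean by "oldest"):
import heapq
import os

def oldest_files(path, n=1000):
    # Stream the entries; nsmallest keeps only the n entries with the
    # smallest modification times, so the full listing is never stored.
    with os.scandir(path) as entries:
        return [entry.name for entry in
                heapq.nsmallest(n, entries, key=lambda e: e.stat().st_mtime)]

oldest = oldest_files('./files/', 1000)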

Compare one directory at time 1 to same directory at time 2

My goal: compare the content of one directory (including sub-directories and files) at time 1 to the content of the same directory at time 2 (e.g. 6 months later). "Content" means: the number and names of the subdirectories, plus the number, names, and sizes of the files. The main intended outcome is being sure that no files were destroyed or corrupted in the meantime.
I did not find any existing tool, although I was wondering whether folderstats (https://github.com/njanakiev/folderstats) could help.
Would you have any suggestions for modules, or anything to get started well? If you know of an existing tool for this, I would also be interested.
Thanks.
Here's some code that should help to get you started. It defines a function that builds a data structure of nested dictionaries corresponding to the contents of the starting root directory and everything below it in the filesystem. Each item dictionary whose 'type' key has the value 'file' will also have a 'stat' key that can contain whatever file metadata you want or need, such as time of creation, last modification time, length in bytes, etc.
You can use it to obtain "before" and "after" snapshots of the directory you're tracking and use them for comparison purposes. I've left the latter (the comparing) out since I'm not sure exactly what you're interested in.
Note that when I actually went about implementing this, I found it simpler to write a recursive function than to use os.walk(), as I suggested in a comment.
The following implements a version of the function and prints out the data structure of nested dictionaries it returns.
import os
from pathlib import PurePath

def path_to_dict(path):
    result = {}
    result['full_path'] = PurePath(path).as_posix()
    if os.path.isdir(path):
        result['type'] = 'dir'
        result['items'] = {filename: path_to_dict(os.path.join(path, filename))
                           for filename in os.listdir(path)}
    else:
        result['type'] = 'file'
        result['stat'] = os.stat(path)  # Preserve any needed metadata.
    return result

root = './folder'  # Change as desired.
before = path_to_dict(root)

# Pretty-print the data structure created.
from pprint import pprint
pprint(before, sort_dicts=False)
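Since the answer leaves the comparison step out, here is one possible way to diff two such snapshots, reporting added and removed items and files whose size changed. The diff_snapshots name and the use of st_size are my own choices, not part of the answer above:
def diff_snapshots(before, after, path=''):
    # Recursively compare two path_to_dict() snapshots.
    if before['type'] != after['type']:
        print('changed type:', path)
    elif before['type'] == 'file':
        if before['stat'].st_size != after['stat'].st_size:
            print('size changed:', path)
    else:
        before_items, after_items = before['items'], after['items']
        for name in before_items.keys() - after_items.keys():
            print('removed:', os.path.join(path, name))
        for name in after_items.keys() - before_items.keys():
            print('added:', os.path.join(path, name))
        for name in before_items.keys() & after_items.keys():
            diff_snapshots(before_items[name], after_items[name],
                           os.path.join(path, name))

# after = path_to_dict(root)   # snapshot taken at time 2
# diff_snapshots(before, after, root)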

Find out differences between directory listings on time A and time B on FTP

I want to build a script which finds out which files on an FTP server are new and which are already processed.
For each file on the FTP server we read out the information, parse it, and write the information we need from it to our database. The files are XML files, but they have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory, and there will be more every day.
Instead of comparing this list with an older list that I saved in a text file, I would like to know if there are better options.
Because this task has to run "live", it would end up in a cronjob running every 1 or 2 minutes. If this method takes too long, that won't work.
The solution should be either in PHP or Python.
def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    list = ftp.mlsd("...")
    for item in list:
        print(item[0] + " => " + item[1]['modify'])
This code example already takes 4 minutes to run.
I have always tried to avoid browsing a folder to find what could have changed, and preferred setting up a dedicated workflow instead. When files can only be added (or new versions of existing files), I try to use a workflow where files are added to one directory and then move on to other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or when they are copied/moved from one folder to another.
As a slight goody, I also use a copy/rename pattern (sketched below): the files are first copied using a temporary name (for example a .t prefix or suffix) and renamed once the copy has finished. This prevents trying to process a file which is not fully copied. OK, it used to be more important when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less.
I'm unsure whether this is really relevant here because it could require some refactoring, but it gives bulletproof solutions.
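A minimal ftplib sketch of that copy/rename pattern on the uploading side (the host, credentials, and filename here are hypothetical placeholders, not taken from the question):
from ftplib import FTP_TLS

ftp = FTP_TLS(host="ftp.example.com")   # hypothetical host
ftp.login("user", "passwd")             # hypothetical credentials
ftp.prot_p()

# Upload under a temporary name, then rename on the server, so the
# consumer never sees a half-written file.
with open("report.xml", "rb") as f:
    ftp.storbinary("STOR report.xml.t", f)
ftp.rename("report.xml.t", "report.xml")
ftp.quit()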
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.
And that assumes what takes long is the download of the file list itself (not the initiation of the download). In that case you can request the sorted list, but download only the leading new files, aborting the listing once you find the first already-processed file.
For an example, how to abort download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()

How to increment number with leading zeros in filename

I have a set of files, not necessarily of the same extension. These files were created by a Python script that iterates over some files (say x.jpg, y.jpg, and z.jpg), and then numbers them with zero-padding so that the numbers are of length 7 characters (in this example, the filenames become 0000001 - x.jpg, 0000002 - y.jpg, and 0000003 - z.jpg).
I now need a script (any language is fine, but Bash/zsh is preferred) that will increment these numbers by an argument, thereby renaming all the files in the directory. For example, I'd like to call the program like this (assuming a shell script):
./rename.sh 5
The numbers in the final filenames should be padded to length 7, and it's guaranteed that there is initially no file whose number is 9999999. So the resulting files should be 0000006 - x.jpg, 0000007 - y.jpg, and 0000008 - z.jpg. It's guaranteed that the files' numbers are initially consecutive; that is, there are no gaps in the numbers.
I can't seem to do this easily at all in Bash, and it seems kind of like a chore even in Python. What's the best way to do this?
Edit: Okay so here are my efforts so far. I think the leading 0s are a problem, so I removed them using rename:
rename 's/^0*//' *
Now with just the numbers left, I'd ideally use a loop, something like this, but I'm not exactly familiar with the syntax and why this is wrong:
for file in "(*) - (*)" ; do mv "${file}" "$(($1+5)) - ${2}" ; done
The 5 there is just hard-coded, but I guess changing that to the first argument shouldn't be too big a deal. I can then use another loop to add the 0s back.
Here's one way in Python, processing the files in reverse sorted order so that (with fixed-width zero-padded names) a rename never overwrites a file that hasn't been renamed yet:
import sys, glob, re, os

# Get the offset from the first command-line argument
offset = int(sys.argv[1])

# Go through the files in reverse sorted order, so the highest numbers
# are renamed first and a new name never collides with a file that
# hasn't been renamed yet
for name in reversed(sorted(glob.glob('*.jpg'))):
    # Extract the number and the rest of the name
    i, rest = re.findall(r"^(\d+)(.+)", name)[0]
    # Construct the new file name
    new_name = "{:07d}{}".format(int(i) + offset, rest)
    # Rename
    os.rename(name, new_name)

Python "if modified since" detection?

I wrote some code to tackle a work-related problem. The idea is that the program runs in an infinite loop and every 10 minutes it checks a certain folder for new files; if there are any, it copies them to another folder. My code reads all the files in the folder and drops the list into a txt file. After 10 minutes it checks the folder again and compares the old list with a new list. If they are identical, it doesn't do anything. This is where I stopped, because my idea is iffy, and I started to think of better ways. Here's the code:
import time
import os, os.path

i = 1
while i == 1:
    folder = "folderlocation"

    def getfiles(dirpat):
        filelist = [s for s in os.listdir(dirpat)
                    if os.path.isfile(os.path.join(dirpat, s))]
        filelist.sort(key=lambda s: os.path.getmtime(os.path.join(dirpat, s)))
        return filelist

    newlist = getfiles(folder)
    outfile = 'c://old.txt'
    last = newlist

    text_file = open(outfile, "r")
    oldlist = text_file.read()
    text_file.close()

    if str(last) == str(oldlist):
        i == i
        print "s"
    else:
        with open(outfile, 'w') as fileHandle:
            # this part is virtually useless
            diff = list(set(lines) - set(last))
            print diff
            fileHandle.write(str(getfiles(folder)))
    time.sleep(600)
This implementation has some bugs in it and doesn't work as I would like.
Is there another way of tackling it? Is it possible for it to just remember the latest modification date and, if after 10 minutes there are files that are newer, point them out? I don't need the copying part just yet, but I really need the checking part. Does anyone have some ideas or some similar code?
The cleanest solution is file alteration monitoring. See this question for information for both *ix and Windows. To summarize, on Linux, you could use libfam, and Windows has FindFirstChangeNotification and related functions.
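If you want to stay in Python, the third-party watchdog package wraps those OS facilities behind a single cross-platform API. This is a suggestion of mine rather than part of the answer above; a minimal sketch, assuming watchdog is installed and using the folder path from the question:
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Fires as soon as a file appears; no fixed 10-minute polling needed.
        if not event.is_directory:
            print("new file:", event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), "folderlocation", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()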
There are some clear problems with your script, such as NOP lines like i==i. You also don't need to convert to a string before comparing.
A technique I've been using (I was actually working on generalising it two hours ago) keeps a {path: os.stat(path).st_mtime} dict, polling in a while True loop with time.sleep and updating the entries and checking if they've changed. Simple and cross-platform.
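A minimal sketch of that mtime-dict polling technique (the folder path and 600-second interval are taken from the question; the snapshot helper name is my own):
import os
import time

def snapshot(folder):
    # Map each file path to its last modification time.
    result = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            result[path] = os.path.getmtime(path)
    return result

folder = "folderlocation"
seen = snapshot(folder)
while True:
    time.sleep(600)
    current = snapshot(folder)
    for path, mtime in current.items():
        if path not in seen:
            print("new file:", path)
        elif mtime != seen[path]:
            print("modified:", path)
    seen = current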
Another option for solving this problem would be to run rsync periodically, for example from cron.
