Recursive directory size includes symlinks twice - python

I took code from a question on Stack Overflow that's supposed to measure a directory's size:
import os

def dirSize(directory):
    totalSize = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            totalSize += os.path.getsize(fp)
    return totalSize
But if I have this directory:
ls -l
-rw-r--r-- 1 lucas lucas 5120000 Oct 18 17:36 x
lrwxrwxrwx 1 lucas lucas 1 Oct 18 17:34 y -> x
And I run that function on it, I get this:
10240000
It seems to count symlinks as the size of the file they link to, not 4KB as they actually are. How can I fix this?

How about:
totalSize += os.path.getsize(fp) if not os.path.islink(fp) else 4096

Just pass the argument followlinks=False to os.walk (it is actually the default). Note, though, that this only stops the walk from descending into symlinked directories; os.path.getsize on a symlink to a regular file still reports the target's size. See the documentation for more information.
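To avoid counting a file symlink as the size of its target, one option is to stat the entry without following links. A minimal sketch using os.lstat (a symlink then contributes only its own size, which on Linux is the length of the path it stores, not 4 KB):

```python
import os

def dir_size(directory):
    """Total size in bytes; symlinks count as their own size, not the target's."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # os.lstat does not follow symlinks, so a link contributes
            # only the length of the path stored in it
            total += os.lstat(fp).st_size
    return total
```

With the directory from the question (a 5120000-byte file x and a symlink y -> x), this returns 5120000 plus one byte for the link, rather than 10240000.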

Related

find common files between two directories - exclude file extension

I have two directories with files that end in two different extensions:
Folder A called profile (1204 FILES)
file.fasta.profile
file1.fasta.profile
file2.fasta.profile
Folder B called dssp (1348 FILES)
file.dssp
file1.dssp
file2.dssp
file3.dssp #<-- odd one out
I have some files in folder B that are not found in folder A and should be removed; for example, file3.dssp would be deleted as it has no counterpart in folder A. I just want to retain the files whose names (excluding the extension) are common to both, so that I end up with 1204 files in each folder.
I saw some bash one-liners using diff, but they don't handle this case, where the files I want to remove are the ones with no counterpart in the other folder.
Try this Shellcheck-clean Bash program:
#! /bin/bash -p

folder_a=PATH_TO_FOLDER_A
folder_b=PATH_TO_FOLDER_B

shopt -s nullglob
for ppath in "$folder_a"/*.profile; do
    pfile=${ppath##*/}
    dfile=${pfile%.fasta.profile}.dssp
    dpath=$folder_b/$dfile
    [[ -f $dpath ]] || echo rm -v -- "$ppath"
done
It currently just prints what it would do. Remove the echo once you are sure that it will do what you want.
shopt -s nullglob makes globs expand to nothing when nothing matches (otherwise they expand to the glob pattern itself, which is almost never useful in programs).
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for information about the string manipulation mechanisms used (e.g. ${ppath##*/}).
With find:
find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'! [ -f "folder B/$(basename -s .fasta.profile "$1").dssp" ]' _ {} \; -print
Replace -print by -delete once you are convinced that it does what you want.
Or, maybe a bit faster:
find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'for f in "$@"; do [ -f "folder B/$(basename -s .fasta.profile "$f").dssp" ] || echo rm "$f"; done' _ {} +
Remove echo once you are convinced that it does what you want.
Here is a way to do it:
for both the A and B directories, list the files under each directory, without the extension.
compare both lists and show only the files that do not appear in both.
Code:
#!/bin/bash

>a.list
>b.list

for file in A/*
do
    name=$(basename "$file")
    echo "${name%%.*}" >>a.list
done

for file in B/*
do
    name=$(basename "$file")
    echo "${name%%.*}" >>b.list
done

comm -23 <(sort a.list) <(sort b.list) >delete.list

while IFS= read -r line; do
    rm -v A/"$line".*
done < "delete.list"

# cleanup
rm -f a.list b.list delete.list

"${name%%.*}" removes everything from the first dot on, so multi-part extensions like .fasta.profile are handled
basename removes the path
comm -23 ... shows only the lines that appear only in a.list
EDIT May 10th: my initial code listed the file, but did not delete it. Now it does.
Python version:
EDIT: now supports multiple extensions
#!/usr/bin/python3
import glob, os

def removeext(filename):
    index = filename.find(".")
    return filename[:index] if index != -1 else filename

setA = set(map(removeext, os.listdir('A')))
print("Files in directory A: " + str(setA))
setB = set(map(removeext, os.listdir('B')))
print("Files in directory B: " + str(setB))

setDiff = setA.difference(setB)
print("Files only in directory A: " + str(setDiff))

for filename in setDiff:
    file_path = "A/" + filename + ".*"
    for file in glob.glob(file_path):
        print("file=" + file)
        os.remove(file)
Does pretty much the same as my bash version above.
list files in A
list files in B
get the list of differences
delete the differences from A
Test output, done on Linux Mint, bash 4.4.20
mint:~/SO$ l
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 A/
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 B/
mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:36 file4.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
mint:~/SO$ l B
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file1.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file3.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file.dssp
mint:~/SO$ ./so.py
Files in directory A: {'file1', 'file', 'file3', 'file2', 'file4'}
Files in directory B: {'file1', 'file', 'file3', 'file2'}
Files only in directory A: {'file4'}
file=A/file4.fasta.profile
mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile

Python split image into frames, messes with colors

I'm working on a script that loops through a few directories, finds all the gifs in that directory, splits them into individual png frame images, and then writes them to a directory.
The actual splitting works just fine; however, all frames other than the first come out messed up. Based on my research, I understand that some GIFs, rather than storing every frame in full, store only what changes between frames. I'm guessing this is one of those cases, based on the output I'm getting.
Here is the code I'm using:
from PIL import Image
from os import listdir
import os
from os.path import isfile, join

characterDirs = ["01_mario"]
print(characterDirs)

# Loop through characterDirs
for char in characterDirs:
    # Save each gif into a directory for easy access
    moves = [f for f in listdir("./media/gifs/" + char)]
    # Make a directory for that character
    os.mkdir("./media/frames/" + char)
    # Loop through all moves in the array
    for move in moves:
        print(move)
        i = 0
        # Open the gif
        gif = Image.open("./media/gifs/" + char + "/" + move)
        # Make a directory for the move
        os.mkdir("./media/frames/" + char + "/" + move[0:-4])
        # Keep going until there are no remaining frames of the gif
        while True:
            try:
                # Save the frame
                gif.save("./media/frames/" + char + "/" + move[0:-4] + "/" + str(i+1) + ".png")
                # Increment to next frame
                gif.seek(gif.tell() + 1)
                i += 1
            except EOFError:
                break
Here are a few of the frames I'm getting:
https://ultimate-hitboxes.s3.amazonaws.com/stackoverflow/10.png
(You can change the 10 in the URL to any number from 1 to 33 to see each broken frame.)
Here's the full gif:
https://ultimate-hitboxes.s3.amazonaws.com/stackoverflow/MarioBAir.gif
Thanks in advance!
Try just using ImageMagick in Terminal like this:
convert MarioBAir.gif -coalesce frame-%02d.png
That will give you 33 separate frames:
Filenames are:
-rw-r--r-- 1 root staff 30873 18 Mar 17:45 frame-00.png
-rw-r--r-- 1 root staff 31971 18 Mar 17:45 frame-01.png
-rw-r--r-- 1 root staff 75743 18 Mar 17:45 frame-02.png
-rw-r--r-- 1 root staff 73075 18 Mar 17:45 frame-03.png
-rw-r--r-- 1 root staff 34927 18 Mar 17:45 frame-04.png
-rw-r--r-- 1 root staff 35757 18 Mar 17:45 frame-05.png
...
...
-rw-r--r-- 1 root staff 72723 18 Mar 17:45 frame-31.png
-rw-r--r-- 1 root staff 72103 18 Mar 17:45 frame-32.png
If you use v7 ImageMagick, the command becomes:
magick MarioBAir.gif -coalesce frame-%02d.png
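If you would rather stay in Python, here is a hedged sketch using Pillow's ImageSequence, assuming a reasonably recent Pillow that composites each frame according to the GIF's disposal method when seeking; converting every frame to RGBA flattens the per-frame palettes that cause the broken colours (the function name split_gif is mine):

```python
from PIL import Image, ImageSequence

def split_gif(gif_path, out_dir):
    """Save every composited frame of a GIF as a numbered PNG; returns the count."""
    gif = Image.open(gif_path)
    count = 0
    for i, frame in enumerate(ImageSequence.Iterator(gif), start=1):
        # convert() resolves the (possibly per-frame) palette to true colour
        frame.convert("RGBA").save("%s/%d.png" % (out_dir, i))
        count = i
    return count
```

This is a sketch, not a drop-in replacement for the -coalesce behaviour of ImageMagick; if your Pillow version mishandles a particular GIF's disposal method, the external convert/magick command above remains the safer route.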

How do I get the "biggest" path?

I need to write some Python code to get the latest version of Android from a path. For example:
$ ls -l android_tools/sdk/platforms/
total 8
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-18
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-19
$
In this case I'd like to have android_tools/sdk/platforms/android-19.
The max function can take a key=myfunc parameter to specify a function that will return a comparison value. So you could do something like:
import os, re

dirname = 'android_tools/sdk/platforms'
files = os.listdir(dirname)

def mykeyfunc(fname):
    digits = re.search(r'\d+$', fname).group()
    return int(digits)

print max(files, key=mykeyfunc)
Adjust that regular expression as needed for the actual files you're dealing with, and that should get you started.
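Putting it together, here is a sketch that returns the full path the question asks for (the function name latest_platform is mine, and it assumes directory names of the form android-N). The numeric key matters: a plain string max() would rank android-9 above android-19:

```python
import os
import re

def latest_platform(platforms_dir):
    """Full path of the entry with the highest trailing number."""
    def version_key(name):
        m = re.search(r'\d+$', name)
        # entries without a trailing number sort below all numbered ones
        return int(m.group()) if m else -1
    names = os.listdir(platforms_dir)
    return os.path.join(platforms_dir, max(names, key=version_key))
```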

Get the file hierarchy of a directory by the file path with Python

Here is my question: I used os.walk to get all the file paths under a specific directory and stored the paths in a file like this:
/indexes/attachment/CCTBAU/CCTBAU-13/87009
/indexes/attachment/CCTBAU/CCTBAU-19/91961
/indexes/attachment/CCTBAU/CCTBAU-19/thumbs/_thumb_91961.png
/indexes/attachment/CCTBAU/CCTBAU-11/86413
/indexes/attachment/CCTBAU/CCTBAU-11/thumbs/_thumb_86412.png
/indexes/attachment/CCTBAU/CCTBAU-11/thumbs/_thumb_86413.png
/indexes/attachment/CCTBAU/CCTBAU-12/86614
/indexes/attachment/CCTBAU/CCTBAU-16/90240
/indexes/attachment/CCTBAU/CCTBAU-17/90241
/indexes/attachment/ACD/ACD-200/91345
/indexes/attachment/ACD/ACD-200/96305
/indexes/attachment/ACD/ACD-200/99169
/indexes/attachment/ACD/ACD-201/91344
/indexes/attachment/ACD/ACD-202/91346
/indexes/attachment/ACD/ACD-197/88916
/indexes/attachment/ACD/ACD-189/73799
/indexes/attachment/ACD/ACD-38/60709
/indexes/attachment/ACD/ACD-198/88918
Now I want to reconstruct the file hierarchy by reading all the paths from the file, so that I end up with:
index
|--attachment
|-----ACD
| |---ACD-200
| |---...
|
|-----CCTBAU
|----CCTBAU-13
|----...
Can anyone help me out with this? Thanks in advance!
I use os.listdir; the code is below:
import os

def PrintDir(dir, depth, prefix=' '):
    contents = os.listdir(dir)
    paths = filter(lambda x: os.path.isdir(os.path.join(dir, x)), contents)
    files = [x for x in contents if x not in paths]
    if not paths and not files:
        return

    print depth * prefix + '|----' + os.path.basename(dir) \
        if depth != 0 else os.path.basename(dir)
    for subdir in paths:
        PrintDir(os.path.join(dir, subdir), depth + 1, prefix)
    for filename in files:
        print depth * prefix + '|----' + filename

PrintDir(os.path.expanduser('~/testdir'), 0)
You can also use os.walk to get what you want, as os.walk yields tuples of root, dirs, files.
The test case is:
testdir/a/aa/aaa
testdir/b/bb/bbb
testdir/b/bb.txt
where aaa, bbb and bb.txt are files. The output is:
testdir
|----a
|----aa
|----aaa
|----b
|----bb
|----bbb
|----bb.txt
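Since the paths are already collected in a file, you can also build the hierarchy purely from the path strings, without touching the filesystem at all. A minimal sketch (the function names are mine; the printing style mimics the |---- output above):

```python
def build_tree(paths):
    """Nest 'a/b/c' path strings into a dict of dicts."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            # descend, creating intermediate nodes as needed
            node = node.setdefault(part, {})
    return tree

def print_tree(node, depth=0):
    for name in sorted(node):
        print(" " * depth + "|----" + name)
        print_tree(node[name], depth + 4)
```

Feed build_tree the lines of the file, then print_tree the result; leaves (the attachment IDs) simply end up as keys with empty dicts.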

Using Python's ftplib to get a directory listing, portably

You can use ftplib for full FTP support in Python. However the preferred way of getting a directory listing is:
# File: ftplib-example-1.py
import ftplib
ftp = ftplib.FTP("www.python.org")
ftp.login("anonymous", "ftplib-example-1")
data = []
ftp.dir(data.append)
ftp.quit()
for line in data:
    print "-", line
Which yields:
$ python ftplib-example-1.py
- total 34
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 .
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 ..
- drwxrwxr-x 2 root 4127 512 Sep 13 15:18 RCS
- lrwxrwxrwx 1 root bin 11 Jun 29 14:34 README -> welcome.msg
- drwxr-xr-x 3 root wheel 512 May 19 1998 bin
- drwxr-sr-x 3 root 1400 512 Jun 9 1997 dev
- drwxrwxr-- 2 root 4127 512 Feb 8 1998 dup
- drwxr-xr-x 3 root wheel 512 May 19 1998 etc
...
I guess the idea is to parse the results to get the directory listing. However this listing is directly dependent on the FTP server's way of formatting the list. It would be very messy to write code for this having to anticipate all the different ways FTP servers might format this list.
Is there a portable way to get an array filled with the directory listing?
(The array should only have the folder names.)
Try using ftp.nlst(dir).
However, note that if the folder is empty, it might throw an error:
files = []

try:
    files = ftp.nlst()
except ftplib.error_perm as resp:
    if str(resp) == "550 No files found":
        print "No files in this directory"
    else:
        raise

for f in files:
    print f
The reliable/standardized way to parse an FTP directory listing is to use the MLSD command, which by now should be supported by all recent/decent FTP servers.
import ftplib
f = ftplib.FTP()
f.connect("localhost")
f.login()
ls = []
f.retrlines('MLSD', ls.append)
for entry in ls:
    print entry
The code above will print:
modify=20110723201710;perm=el;size=4096;type=dir;unique=807g4e5a5; tests
modify=20111206092323;perm=el;size=4096;type=dir;unique=807g1008e0; .xchat2
modify=20111022125631;perm=el;size=4096;type=dir;unique=807g10001a; .gconfd
modify=20110808185618;perm=el;size=4096;type=dir;unique=807g160f9a; .skychart
...
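If you need to split such MLSD lines yourself, the format is simple: semicolon-separated facts, then a single space, then the filename (per RFC 3659). A minimal sketch (the function name is mine):

```python
def parse_mlsd_line(line):
    """Split one MLSD response line into (facts_dict, filename)."""
    # facts end at the first space; everything after it is the name,
    # so filenames containing spaces are handled correctly
    facts_part, _, name = line.partition(" ")
    facts = {}
    for fact in facts_part.split(";"):
        if fact:
            key, _, value = fact.partition("=")
            facts[key.lower()] = value
    return facts, name
```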
Starting from Python 3.3, ftplib provides a specific method (FTP.mlsd) to do this:
http://bugs.python.org/issue11072
http://hg.python.org/cpython/file/67053b135ed9/Lib/ftplib.py#l535
I found my way here while trying to get filenames, last-modified stamps, file sizes etc., and wanted to add my code. It only took a few minutes to write a loop that parses the output of ftp.dir(dir_list.append), using standard library string methods like strip() (to clean up each line of text) and split() (to create an array).
from ftplib import FTP

ftp = FTP('sick.domain.bro')
ftp.login()
ftp.cwd('path/to/data')

dir_list = []
ftp.dir(dir_list.append)

# main thing is identifying which char marks start of good stuff
# '-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO
#                            ^ (that is line[29])
for line in dir_list:
    print line[29:].strip().split(' ')  # got yerself an array there bud!
    # EX ['545498', 'Jul', '23', '12:07', 'FILENAME.FOO']
There's no standard for the layout of the LIST response. You'd have to write code to handle the most popular layouts. I'd start with Linux ls and Windows Server DIR formats. There's a lot of variety out there, though.
Fall back to the nlst method (returning the result of the NLST command) if you can't parse the longer list. For bonus points, cheat: perhaps the longest number in the line containing a known file name is its length.
I happen to be stuck with an FTP server (Rackspace Cloud Sites virtual server) that doesn't seem to support MLSD. Yet I need several fields of file information, such as size and timestamp, not just the filename, so I have to use the DIR command. On this server, the output of DIR looks very much like the OP's. In case it helps anyone, here's a little Python class that parses a line of such output to obtain the filename, size and timestamp.
import datetime

class FtpDir:
    def parse_dir_line(self, line):
        words = line.split()
        self.filename = words[8]
        self.size = int(words[4])
        t = words[7].split(':')
        ts = words[5] + '-' + words[6] + '-' + datetime.datetime.now().strftime('%Y') + ' ' + t[0] + ':' + t[1]
        self.timestamp = datetime.datetime.strptime(ts, '%b-%d-%Y %H:%M')
Not very portable, I know, but easy to extend or modify to deal with various different FTP servers.
This is from Python docs
>>> from ftplib import FTP_TLS
>>> ftps = FTP_TLS('ftp.python.org')
>>> ftps.login()            # login anonymously before securing control channel
>>> ftps.prot_p() # switch to secure data connection
>>> ftps.retrlines('LIST') # list directory content securely
total 9
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 .
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 ..
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 bin
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 etc
d-wxrwxr-x 2 ftp wheel 1024 Sep 5 13:43 incoming
drwxr-xr-x 2 root wheel 1024 Nov 17 1993 lib
drwxr-xr-x 6 1094 wheel 1024 Sep 13 19:07 pub
drwxr-xr-x 3 root wheel 1024 Jan 3 1994 usr
-rw-r--r-- 1 root root 312 Aug 1 1994 welcome.msg
That helped me with my code. I wanted to filter for only certain types of files and show them on screen, so I added a condition that tests each line, like this:
elif command == 'ls':
    print("directory of ", ftp.pwd())
    data = []
    ftp.dir(data.append)
    for line in data:
        x = line.split(".")
        formats = ["gz", "zip", "rar", "tar", "bz2", "xz"]
        if x[-1] in formats:
            print("-", line)
