find common files between two directories - exclude file extension - python

I have two directories with files that end in two different extensions:
Folder A called profile (1204 FILES)
file.fasta.profile
file1.fasta.profile
file2.fasta.profile
Folder B called dssp (1348 FILES)
file.dssp
file1.dssp
file2.dssp
file3.dssp #<-- odd one out
I have some files in folder B that are not found in folder A and should be removed; for example, file3.dssp would be deleted as it has no counterpart in folder A. I just want to retain the files whose names match once the extension is excluded, to end up with 1204 files in both folders.
I saw some bash lines using diff, but they do not cover this case, where the files I want to remove are those that are not found in the other folder.

Try this Shellcheck-clean Bash program:
#! /bin/bash -p
folder_a=PATH_TO_FOLDER_A
folder_b=PATH_TO_FOLDER_B
shopt -s nullglob
for ppath in "$folder_a"/*.fasta.profile; do
    pfile=${ppath##*/}
    # file.fasta.profile pairs with file.dssp, so strip the whole suffix
    dfile=${pfile%.fasta.profile}.dssp
    dpath=$folder_b/$dfile
    [[ -f $dpath ]] || echo rm -v -- "$ppath"
done
It currently just prints what it would do. Remove the echo once you are sure that it will do what you want.
shopt -s nullglob makes globs expand to nothing when nothing matches (otherwise they expand to the glob pattern itself, which is almost never useful in programs).
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for information about the string manipulation mechanisms used (e.g. ${ppath##*/}).

With find:
find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'! [ -f "folder B/$(basename -s .fasta.profile "$1").dssp" ]' _ {} \; -print
Replace -print with -delete once you are convinced that it does what you want.
Or, maybe a bit faster:
find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'for f in "$@"; do [ -f "folder B/$(basename -s .fasta.profile "$f").dssp" ] || echo rm "$f"; done' _ {} +
Remove echo once you are convinced that it does what you want.

Here is a way to do it:
for both A and B directories, list the files under each directory, without the extension.
compare both lists, show only the file that does not appear in both.
Code:
#!/bin/bash
>a.list
>b.list
for file in A/*
do
    name=${file##*/}
    printf '%s\n' "${name%%.*}" >>a.list
done
for file in B/*
do
    name=${file##*/}
    printf '%s\n' "${name%%.*}" >>b.list
done
comm -23 <(sort a.list) <(sort b.list) >delete.list
while IFS= read -r line; do
    rm -v A/"$line".*
done < "delete.list"
# cleanup
rm -f a.list b.list delete.list
"${file##*/}" removes the path
"${name%%.*}" removes every extension (everything from the first dot onwards), so file.fasta.profile and file.dssp both become file
comm -23 ... shows only the lines that appear only in a.list
EDIT May 10th: my initial code listed the file, but did not delete it. Now it does.

Python version:
EDIT: now supports multiple extensions
#!/usr/bin/python3
import glob, os

def removeext(filename):
    # keep everything before the first dot, or the whole name if there is none
    return filename.split(".", 1)[0]

setA = set(map(removeext, os.listdir('A')))
print("Files in directory A: " + str(setA))
setB = set(map(removeext, os.listdir('B')))
print("Files in directory B: " + str(setB))
setDiff = setA.difference(setB)
print("Files only in directory A: " + str(setDiff))
for filename in setDiff:
    file_path = "A/" + filename + ".*"
    for file in glob.glob(file_path):
        print("file=" + file)
        os.remove(file)
Does pretty much the same as my bash version above.
list files in A
list files in B
get the list of differences
delete the differences from A
Test output, done on Linux Mint, bash 4.4.20
mint:~/SO$ l
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 A/
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 B/
mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:36 file4.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
mint:~/SO$ l B
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file1.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file3.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file.dssp
mint:~/SO$ ./so.py
Files in directory A: {'file1', 'file', 'file3', 'file2', 'file4'}
Files in directory B: {'file1', 'file', 'file3', 'file2'}
Files only in directory A: {'file4'}
file=A/file4.fasta.profile
mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
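The question also asks for the opposite direction (removing the odd files from the dssp folder). A minimal sketch of that, assuming the part of a file name before its first dot is what identifies a pair; the helper names stem and unmatched are made up for this illustration, and nothing is deleted by the function itself:

```python
from pathlib import Path

def stem(name):
    """Return the part of a file name before its first dot."""
    return name.split(".", 1)[0]

def unmatched(keep_dir, prune_dir):
    """List files in prune_dir whose stem has no counterpart in keep_dir."""
    wanted = {stem(p.name) for p in Path(keep_dir).iterdir() if p.is_file()}
    return sorted(p for p in Path(prune_dir).iterdir()
                  if p.is_file() and stem(p.name) not in wanted)
```

Printing the result of unmatched("profile", "dssp") first works as a dry run; calling p.unlink() on each returned path would then do the actual removal.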

Related

Subprocess does not store output in variable

I am trying to capture the output of this command:
ls -l /sys/class/net/e*/device/virtfn*
in my python script using the subprocess library.
The output of this command is:
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn0 -> ../0000:01:10.0
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn1 -> ../0000:01:10.2
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn2 -> ../0000:01:10.4
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn3 -> ../0000:01:10.6
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn4 -> ../0000:01:11.0
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn5 -> ../0000:01:11.2
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn6 -> ../0000:01:11.4
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f0/device/virtfn7 -> ../0000:01:11.6
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn0 -> ../0000:01:10.1
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn1 -> ../0000:01:10.3
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn2 -> ../0000:01:10.5
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn3 -> ../0000:01:10.7
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn4 -> ../0000:01:11.1
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn5 -> ../0000:01:11.3
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn6 -> ../0000:01:11.5
lrwxrwxrwx. 1 root root 0 Sep 7 14:52 /sys/class/net/enp1s0f1/device/virtfn7 -> ../0000:01:11.7
The code in my script:
def getMacOfBusSlotFunction(self, slotbus, slotslot, slotfunction):
    myParentDevicesProcess = subprocess.Popen(['ls','-l','/sys/class/net/e*/device/virtfn*'])
    stdout , stderr = myParentDevicesProcess.communicate()
    print(stdout.decode("utf-8"))
I used Retrieving the output of subprocess.call() as a basis.
I added the .decode("utf-8") part as I thought maybe the output was being returned as bytes. Including it and excluding it still give the same result...
The actual output I get from running this is a blank line (\n).
I expect the output to be the actual output of the command.
You forgot to tell Popen to capture the output/stderr; by default it just lets it go to the terminal. To fix, you just need to tell it to capture them via pipes to the parent process:
myParentDevicesProcess = subprocess.Popen(['ls','-l','/sys/class/net/e*/device/virtfn*'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Of course, that still won't work because glob expansion is a property of the shell, not the ls command, and you're not running your command through a shell (nor should you). You could have Python do the glob expansion for you to roughly match the shell with (putting import glob at the top of the file):
myParentDevicesProcess = subprocess.Popen(['ls','-l'] + glob.glob('/sys/class/net/e*/device/virtfn*'), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Popen (or any subprocess call), without shell=True, cannot use shell features like wildcard expansion. If you had examined standard error too, you would have discovered an error message that ls could not find a file named literally /sys/class/net/e*/device/virtfn*.
The trivial fix is to use a shell (and probably to avoid bare Popen in favor of a higher-level function).
listing = subprocess.check_output('ls -l /sys/class/net/e*/device/virtfn*', shell=True)
In some ways, a better solution would be to use Python's native functions to extract the information you need rather than attempt to parse ls output, but since we don't know what information you want, that's a bit hard to pin down. If your ultimate goal is to resolve the symlinks, try
import glob
import os
for symlink in glob.glob('/sys/class/net/e*/device/virtfn*'):
    print(os.readlink(symlink))
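Putting the two points from the answers together (capture through pipes, expand the wildcard in Python), a minimal sketch could look like this. The function name list_matching is hypothetical, and it assumes an ls binary is on the PATH:

```python
import glob
import subprocess

def list_matching(pattern):
    """Run `ls -l` on every path matching pattern and return its output as text."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        # A bare `ls -l` would list the current directory instead.
        return ""
    result = subprocess.run(["ls", "-l", "--"] + paths,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout
```

With universal_newlines=True the output is already a str, so no .decode("utf-8") is needed, and check=True raises CalledProcessError if ls exits with an error.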

Strange behavior of TextIOWrapper.tell() with Python 3.6.9 in context of 0D/0A

ENVIRONMENT:
Intel/88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020
Python 3.6.9
GIVEN:
A tiny program stored in test.py which shows input position and input character code for the consecutive reading of single characters.
fh = open("tmp.txt", "r")
while 1 + 1 == 2:
    tmp = fh.read(1)
    if not tmp: break
    print(fh.tell(), "%x" % ord(tmp))
Fill a tmp.txt in bash to contain some data
echo -e "\x41\x42\x3b\x0d\x0a\x0d\x0a" > tmp.txt
OUTPUT:
Running python3 test.py delivers
1 41
2 42
18446744073709551620 3b
5 a
7 a
8 a
QUESTION:
Where does the excessively high value 18446744073709551620 for fh.tell() come from? Interestingly,
this does not happen in the following cases.
echo -e "\x41\x42\x3b\x0d\x0a" > tmp.txt # only one 0x0d/0x0a
echo -e "\x42\x3b\x0d\x0a\x0d\x0a" > tmp.txt # no 'A' at the beginning of the file
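The Python documentation describes the value returned by tell() on a text stream as an opaque number that does not usually represent a byte count. The sketch below is only a way to poke at that, not an explanation of the internals: it recreates the eight bytes from the question, reads them in binary mode (where tell() is a plain byte offset), and splits the reported value at 2**64. That the extra high bit carries newline-decoder state is an assumption here, not something the question confirms.

```python
import os
import tempfile

# The bytes written by the echo command, including echo's own trailing newline.
data = b"\x41\x42\x3b\x0d\x0a\x0d\x0a\x0a"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tmp.txt")
    with open(path, "wb") as fh:
        fh.write(data)

    # In binary mode every tell() result is simply the number of bytes read so far.
    offsets = []
    with open(path, "rb") as fh:
        while fh.read(1):
            offsets.append(fh.tell())

print(offsets)

# Split the surprising value into the part above and below bit 64.
high, low = divmod(18446744073709551620, 2**64)
print(high, low)
```

The binary-mode offsets run from 1 to 8, and the large value turns out to be exactly 2**64 + 4, i.e. a small offset with one additional bit set far above it.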

How do I get the "biggest" path?

I need to write some Python code to get the latest version of Android from a path. For example:
$ ls -l android_tools/sdk/platforms/
total 8
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-18
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-19
$
In this case I'd like to have android_tools/sdk/platforms/android-19.
The max function can take a key=myfunc parameter to specify a function that will return a comparison value. So you could do something like:
import os, re
dirname = 'android_tools/sdk/platforms'
files = os.listdir(dirname)
def mykeyfunc(fname):
    digits = re.search(r'\d+$', fname).group()
    return int(digits)
print(os.path.join(dirname, max(files, key=mykeyfunc)))
Adjust that regular expression as needed for the actual files you're dealing with, and that should get you started.
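As a slightly more defensive sketch of the same max-with-key idea, the version below skips entries that do not end in a number instead of failing on them. The function name latest_platform is made up for this example:

```python
import os
import re

def latest_platform(platforms_dir):
    """Return the path of the entry with the highest trailing number, or None."""
    def version(name):
        # Only called on names already known to end in digits.
        return int(re.search(r"\d+$", name).group())

    names = [n for n in os.listdir(platforms_dir) if re.search(r"\d+$", n)]
    if not names:
        return None
    return os.path.join(platforms_dir, max(names, key=version))
```

Because the key converts the digits to an int, android-9 sorts below android-19, which a plain string comparison would get wrong.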

Recursive directory size includes symlinks twice

I took code from a question on Stack Overflow that's supposed to measure a directory's size:
import os

def dirSize(directory):
    totalSize = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            totalSize += os.path.getsize(fp)
    return totalSize
But if I have this directory:
ls -l
-rw-r--r-- 1 lucas lucas 5120000 Oct 18 17:36 x
lrwxrwxrwx 1 lucas lucas 1 Oct 18 17:34 y -> x
And I run that function on it, I get this:
10240000
It seems to count symlinks as the size of the file they link to, not 4KB as they actually are. How can I fix this?
how about
totalSize += os.path.getsize(fp) if not os.path.islink(fp) else 4096
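Note that passing followlinks=False to os.walk will not help here: it is already the default, and it only controls whether os.walk descends into symlinked directories. The doubled size comes from os.path.getsize, which follows the link; os.lstat(fp).st_size reports the size of the link itself. See the documentation for more information.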
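A minimal sketch of a size function that counts each symlink as the link itself rather than as its target, using os.lstat (which does not follow links). The name dir_size is only for illustration:

```python
import os

def dir_size(directory):
    """Sum file sizes under directory, counting a symlink as the link itself."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            # lstat reports on the link, whereas os.path.getsize follows it.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total
```

On the example directory this would give the size of x plus the few bytes the link y occupies, instead of counting x twice.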

Using Python's ftplib to get a directory listing, portably

You can use ftplib for full FTP support in Python. However the preferred way of getting a directory listing is:
# File: ftplib-example-1.py
import ftplib
ftp = ftplib.FTP("www.python.org")
ftp.login("anonymous", "ftplib-example-1")
data = []
ftp.dir(data.append)
ftp.quit()
for line in data:
    print("- " + line)
Which yields:
$ python ftplib-example-1.py
- total 34
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 .
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 ..
- drwxrwxr-x 2 root 4127 512 Sep 13 15:18 RCS
- lrwxrwxrwx 1 root bin 11 Jun 29 14:34 README -> welcome.msg
- drwxr-xr-x 3 root wheel 512 May 19 1998 bin
- drwxr-sr-x 3 root 1400 512 Jun 9 1997 dev
- drwxrwxr-- 2 root 4127 512 Feb 8 1998 dup
- drwxr-xr-x 3 root wheel 512 May 19 1998 etc
...
I guess the idea is to parse the results to get the directory listing. However this listing is directly dependent on the FTP server's way of formatting the list. It would be very messy to write code for this having to anticipate all the different ways FTP servers might format this list.
Is there a portable way to get an array filled with the directory listing?
(The array should only have the folder names.)
Try using ftp.nlst(dir).
However, note that if the folder is empty, it might throw an error:
files = []
try:
    files = ftp.nlst()
except ftplib.error_perm as resp:
    if str(resp) == "550 No files found":
        print("No files in this directory")
    else:
        raise
for f in files:
    print(f)
The reliable, standardized way to parse an FTP directory listing is to use the MLSD command, which by now should be supported by all recent/decent FTP servers.
import ftplib
f = ftplib.FTP()
f.connect("localhost")
f.login()
ls = []
f.retrlines('MLSD', ls.append)
for entry in ls:
    print(entry)
The code above will print:
modify=20110723201710;perm=el;size=4096;type=dir;unique=807g4e5a5; tests
modify=20111206092323;perm=el;size=4096;type=dir;unique=807g1008e0; .xchat2
modify=20111022125631;perm=el;size=4096;type=dir;unique=807g10001a; .gconfd
modify=20110808185618;perm=el;size=4096;type=dir;unique=807g160f9a; .skychart
...
Starting with Python 3.3, ftplib provides a dedicated method for this, FTP.mlsd():
http://bugs.python.org/issue11072
http://hg.python.org/cpython/file/67053b135ed9/Lib/ftplib.py#l535
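On Python 3.3 and later, FTP.mlsd() yields (name, facts) tuples directly. For older versions, or to see what the raw lines shown above contain, here is a rough sketch that splits an MLSD line by hand and keeps only directory names. Both function names are invented for this example, and it assumes the usual layout of semicolon-terminated facts, one space, then the name:

```python
def parse_mlsd_line(line):
    """Split one MLSD response line into (name, facts)."""
    facts_part, _, name = line.partition(" ")
    facts = {}
    for fact in facts_part.split(";"):
        if fact:
            key, _, value = fact.partition("=")
            facts[key.lower()] = value
    return name, facts

def directory_names(lines):
    """Return the names of entries whose type fact is 'dir'."""
    parsed = (parse_mlsd_line(line) for line in lines)
    return [name for name, facts in parsed
            if facts.get("type", "").lower() == "dir"]
```

Passing the list collected by f.retrlines('MLSD', ls.append) to directory_names would give the folder-only array the question asks for, without depending on how the server formats LIST output.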
I found my way here while trying to get filenames, last modified stamps, file sizes etc and wanted to add my code. It only took a few minutes to write a loop to parse the ftp.dir(dir_list.append) making use of python std lib stuff like strip() (to clean up the line of text) and split() to create an array.
from ftplib import FTP

ftp = FTP('sick.domain.bro')
ftp.login()
ftp.cwd('path/to/data')
dir_list = []
ftp.dir(dir_list.append)
# main thing is identifying which char marks start of good stuff
# '-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO
# ^ (that is line[29])
for line in dir_list:
    print(line[29:].strip().split(' '))  # got yerself an array there bud!
# EX ['545498', 'Jul', '23', '12:07', 'FILENAME.FOO']
There's no standard for the layout of the LIST response. You'd have to write code to handle the most popular layouts. I'd start with Linux ls and Windows Server DIR formats. There's a lot of variety out there, though.
Fall back to the nlst method (returning the result of the NLST command) if you can't parse the longer list. For bonus points, cheat: perhaps the longest number in the line containing a known file name is its length.
I happen to be stuck with an FTP server (Rackspace Cloud Sites virtual server) that doesn't seem to support MLSD. Yet I need several fields of file information, such as size and timestamp, not just the filename, so I have to use the DIR command. On this server, the output of DIR looks very much like the OP's. In case it helps anyone, here's a little Python class that parses a line of such output to obtain the filename, size and timestamp.
import datetime

class FtpDir:
    def parse_dir_line(self, line):
        words = line.split()
        self.filename = words[8]
        self.size = int(words[4])
        t = words[7].split(':')
        # the listing carries no year, so the current year is assumed
        ts = words[5] + '-' + words[6] + '-' + datetime.datetime.now().strftime('%Y') + ' ' + t[0] + ':' + t[1]
        self.timestamp = datetime.datetime.strptime(ts, '%b-%d-%Y %H:%M')
Not very portable, I know, but easy to extend or modify to deal with various different FTP servers.
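One possible extension, sketched under the assumption that the server sends Unix ls-style lines like the ones quoted in this thread: limiting the split keeps file names with spaces intact, and entries that show a year instead of a time (as in the older listings above) are handled too. The function name and the MONTHS table are made up for this example:

```python
import datetime

# English month abbreviations, so parsing does not depend on the locale.
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

def parse_unix_list_line(line, year=None):
    """Parse one Unix-style LIST line into (filename, size, timestamp)."""
    # At most 8 splits, so a file name containing spaces stays in one piece.
    words = line.split(None, 8)
    size = int(words[4])
    month, day, clock = MONTHS[words[5]], int(words[6]), words[7]
    if ":" in clock:
        hour, minute = (int(part) for part in clock.split(":"))
        # Recent entries omit the year; fall back to the current one.
        year = year or datetime.date.today().year
    else:
        # Older entries show the year in place of the time.
        hour, minute, year = 0, 0, int(clock)
    return words[8], size, datetime.datetime(year, month, day, hour, minute)
```

Symlink entries such as "README -> welcome.msg" would still come back with the arrow and target in the name, so that case needs separate handling if it matters.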
This is from Python docs
>>> from ftplib import FTP_TLS
>>> ftps = FTP_TLS('ftp.python.org')
>>> ftps.login() # login anonymously before securing control channel
>>> ftps.prot_p() # switch to secure data connection
>>> ftps.retrlines('LIST') # list directory content securely
total 9
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 .
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 ..
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 bin
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 etc
d-wxrwxr-x 2 ftp wheel 1024 Sep 5 13:43 incoming
drwxr-xr-x 2 root wheel 1024 Nov 17 1993 lib
drwxr-xr-x 6 1094 wheel 1024 Sep 13 19:07 pub
drwxr-xr-x 3 root wheel 1024 Jan 3 1994 usr
-rw-r--r-- 1 root root 312 Aug 1 1994 welcome.msg
That helped me with my code when I tried filtering for only certain types of files and showing them on screen, by adding a condition that is tested on each line.
Like this
elif command == 'ls':
    print("directory of ", ftp.pwd())
    data = []
    ftp.dir(data.append)
    formats = ["gz", "zip", "rar", "tar", "bz2", "xz"]
    for line in data:
        x = line.split(".")
        if x[-1] in formats:
            print("-", line)
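The same filter can be pulled out into a small helper, sketched here with str.endswith and a tuple of suffixes. The names ARCHIVE_SUFFIXES and archive_lines are hypothetical; the suffix list is the one from the snippet above:

```python
ARCHIVE_SUFFIXES = (".gz", ".zip", ".rar", ".tar", ".bz2", ".xz")

def archive_lines(lines):
    """Keep only listing lines whose file name ends in a known archive suffix."""
    return [line for line in lines
            if line.rstrip().lower().endswith(ARCHIVE_SUFFIXES)]
```

Lower-casing the line first means names such as BACKUP.ZIP are matched as well, which the split-based check would miss.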
