I need to iterate through a folder and find every instance where the filenames are identical (except for extension) and then zip (preferably using tarfile) each of these into one file.
So I have 5 files named "example1", each with a different file extension. I need to zip them up together and output them as "example1.tar" or something similar.
This would be easy enough with a simple for loop such as:
tar = tarfile.open('example1.tar', 'w')
for output in glob('example1*'):
    tar.add(output)
tar.close()
However, there are 300 "example" files, and I need to iterate through each one and its associated 5 files to make this work. This is way over my head. Any advice is greatly appreciated.
The pattern you're describing generalizes to MapReduce. I found a simple implementation of MapReduce online, and an even simpler version is:
def map_reduce(data, mapper, reducer):
    d = {}
    for elem in data:
        key, value = mapper(elem)
        d.setdefault(key, []).append(value)
    for key, grp in d.items():
        d[key] = reducer(key, grp)
    return d
You want to group all files by their name without the extension, which you can get from os.path.splitext(fname)[0]. Then, you want to make a tarball out of each group by using the tarfile module. In code, that is:
import os
import tarfile

def make_tar(basename, files):
    tar = tarfile.open(basename + '.tar', 'w')
    for f in files:
        tar.add(f)
    tar.close()

map_reduce(os.listdir('.'),
           lambda x: (os.path.splitext(x)[0], x),
           make_tar)
Edit: If you want to group files in different ways, you just need to modify the second argument to map_reduce. The code above groups files that have the same value for the expression os.path.splitext(x)[0]. So to group by the base file name with all the extensions stripped off, you could replace that expression with strip_all_ext(x) and add:
def strip_all_ext(path):
    head, tail = os.path.split(path)
    basename = tail.split(os.extsep)[0]
    return os.path.join(head, basename)
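For instance, on a POSIX system strip_all_ext maps both of the following names to the same key, so a .tar and a .tar.gz of the same base name would end up in one group:

>>> strip_all_ext('data/example1.tar.gz')
'data/example1'
>>> strip_all_ext('data/example1.tar')
'data/example1'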
You could do this:
1. List all files in the directory.
2. Create a dictionary where the basename is the key and all the extensions are values.
3. Then tar all the files by dictionary key.
Something like this:
import os
import tarfile
from collections import defaultdict
myfiles = os.listdir(".") # List of all files
totar = defaultdict(list)
# now fill the defaultdict with entries; basename as keys, extensions as values
for name in myfiles:
    base, ext = os.path.splitext(name)
    totar[base].append(ext)

# iterate through all the basenames
for base in totar:
    files = [base + ext for ext in totar[base]]
    # now tar all the files in the list "files"
    tar = tarfile.open(base + ".tar", "w")
    for item in files:
        tar.add(item)
    tar.close()
You have two problems. Solve them separately.
1. Finding matching names. Use a collections.defaultdict.
2. Creating tar files after you find the matching names. You've got that pretty well covered.
So. Solve problem 1 first.
Use glob to get all the names. Use os.path.basename to split the path and basename. Use os.path.splitext to split the name and extension.
A dictionary of lists can be used to save all files that have the same name.
Is that what you're doing in part 1?
Part 2 is putting the files into tar archives. For that, you've got most of the code you need.
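A minimal sketch of part 1 along those lines (assuming the files live in the current directory):

import os
from glob import glob
from collections import defaultdict

groups = defaultdict(list)
for path in glob('*'):
    name = os.path.basename(path)        # drop the directory part
    base, ext = os.path.splitext(name)   # split name and extension
    groups[base].append(path)
# groups now maps each base name to the list of files sharing it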
Try using the glob module: http://docs.python.org/library/glob.html
#! /usr/bin/env python
import os
import tarfile

tarfiles = {}
for f in os.listdir('files'):
    prefix = f[:f.rfind('.')]
    tarfiles.setdefault(prefix, []).append(f)

for k, v in tarfiles.items():
    tf = tarfile.open('%s.tar.gz' % k, 'w:gz')
    for f in v:
        # tf.add() records the correct file size; building a TarInfo
        # by hand would write zero-length entries
        tf.add(os.path.join('files', f), arcname=f)
    tf.close()
import os
import tarfile

allfiles = {}
for filename in os.listdir("."):
    basename = '.'.join(filename.split(".")[:-1])
    if basename not in allfiles:
        allfiles[basename] = [filename]
    else:
        allfiles[basename].append(filename)

for basename, filenames in allfiles.items():
    # skip names that only occur once; no point tarring a single file
    if len(filenames) < 2:
        continue
    tardata = tarfile.open(basename + ".tar", "w")
    for filename in filenames:
        tardata.add(filename)
    tardata.close()
I wrote a dataframe to a CSV in PySpark, and I got the following output files in the directory:
._SUCCESS.crc
.part-00000-6cbfdcfd-afff-4ded-802c-6ccd67f3804a-c000.csv.crc
part-00000-6cbfdcfd-afff-4ded-802c-6ccd67f3804a-c000.csv
How do I keep only the CSV file in the directory and delete the rest of the files, using Python?
import os
directory = "/path/to/directory/with/files"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if not file.endswith(".csv")]
for file in filtered_files:
    path_to_file = os.path.join(directory, file)
    os.remove(path_to_file)
First, you list all files in the directory. Then you keep only those that don't end with .csv. And then you remove all the files that are left.
Try iterating over the files in the directory, and then os.remove only those files that do not end with .csv.
import os
dir_path = "path/to/the/directory/containing/files"
dir_list = os.listdir(dir_path)
for item in dir_list:
    if not item.endswith(".csv"):
        os.remove(os.path.join(dir_path, item))
You can also have fun with list comprehension for doing this:
import os
dir_path = 'output/'
[os.remove(os.path.join(dir_path, item)) for item in os.listdir(dir_path) if not item.endswith('.csv')]
I would recommend using pathlib (Python >= 3.4) and the built-in type set() to subtract all CSV filenames from the list of all files. I would argue this is easy to read, fast to process, and a good pythonic solution.
>>> from pathlib import Path
>>> p = Path('/path/to/directory/with/files')
>>> # Get all file names
>>> # https://stackoverflow.com/a/65025567/4865723
>>> set_all_files = set(filter(Path.is_file, p.glob('**/*')))
>>> # Get all csv filenames (BUT ONLY with lower case suffix!)
>>> set_csv_files = set(filter(Path.is_file, p.glob('**/*.csv')))
>>> # Create a file list without csv files
>>> set_files_to_delete = set_all_files - set_csv_files
>>> # Iterate over that list and delete the files
>>> for file_name in set_files_to_delete:
... Path(file_name).unlink()
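If you also need to catch files with an upper-case suffix such as .CSV, a case-insensitive variant of the csv set (a sketch) could replace the glob-based line:

>>> set_csv_files = {f for f in set_all_files if f.suffix.lower() == '.csv'}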
for root, dirs, files in os.walk('Test', topdown=True):
    for name in files:
        fp = os.path.join(root, name)
        if not name.endswith(".csv"):
            os.remove(fp)
What's the advantage of os.walk? It also visits all the subdirectories of the directory you mention.
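Each iteration of os.walk yields a (root, dirs, files) triple, so you always know which directory a file came from; a quick way to see what it visits (example output hypothetical):

import os

for root, dirs, files in os.walk('Test'):
    print(root, '->', files)  # e.g. Test/sub -> ['a.csv', 'b.txt']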
I want to put the files from multiple zip files that share a common filename prefix into a single zip file.
I have a folder "temp" containing some .zip files and some other files:
filename1_160645.zip
filename1_165056.zip
filename1_195326.zip
filename2_120528.zip
filename2_125518.zip
filename3_171518.zip
test.xlsx
filename19_161518.zip
I have the following dataframe df_filenames containing the filename prefixes:
filename_prefix
filename1
filename2
filename3
If there are multiple .zip files in the temp folder with the same prefix that exists in the dataframe df_filenames, I want to merge the contents of those files.
For example, filename1_160645.zip contains the following contents:
1a.csv
1b.csv
and filename1_165056.zip contains the following contents:
1d.csv
and filename1_195326.zip contains the following contents:
1f.csv
After merging the contents of the above two files into filename1_160645.zip,
the contents of filename1_160645.zip will be:
1a.csv
1b.csv
1d.csv
1f.csv
At the end, only the following files will remain in the temp folder:
filename1_160645.zip
filename2_120528.zip
filename3_171518.zip
test.xlsx
filename19_161518.zip
I have written the following code, but it's not working:
import os
import zipfile as zf
import pandas as pd

df_filenames = pd.read_excel('filename_prefix.xlsx')

# Get the list of all the filenames in the temp folder
lst_fnames = os.listdir(r'C:\Users\XYZ\Downloads\temp')
# take only .zip files
lst_fnames = [fname for fname in lst_fnames if fname.endswith('.zip')]
# take distinct prefixes in the dataframe
df_prefixes = df_filenames['filename_prefix'].unique()

for prefix in df_prefixes:
    # this list will contain zip files with the same prefixes
    lst = []
    # total count of files in the lst
    count = 0
    for fname in lst_fnames:
        if prefix in fname:
            #print(prefix)
            lst.append(fname)
    #print(lst)
    # if the list has more than 1 zip file, merge them
    if len(lst) > 1:
        print(lst)
        with zf.ZipFile(lst[0], 'a') as f1:
            print(f1.filename)
            for f in lst[1:]:
                with zf.ZipFile(path + '\\' + f, 'r') as f:
                    print(f.filename)  # getting entire path of the file here, not just filename
                    [f1.writestr(t[0], t[1].read()) for t in ((n, f.open(n)) for n in f.namelist())]
            print(f1.namelist())
After merging the contents of the files whose names contain filename1 into filename1_160645.zip,
the contents of filename1_160645.zip should be:
1a.csv
1b.csv
1d.csv
1f.csv
but nothing has changed when I double click filename1_160645.zip
Basically, 1a.csv, 1b.csv, 1d.csv, and 1f.csv are not part of filename1_160645.zip.
I would use shutil for a higher-level way of dealing with archive files. Additionally, pathlib gives nice methods/attributes for a given filepath. Combined with a groupby, we can easily extract the target files that are related to each other.
import itertools
import shutil
from pathlib import Path

import pandas as pd

filenames = pd.read_excel('filename_prefix.xlsx')
prefixes = filenames['filename_prefix'].unique()

path = Path.cwd()  # or change to Path('path/to/desired/dir/')
zip_files = (file for file in path.iterdir() if file.suffix == '.zip')
target_files = sorted(file for file in zip_files
                      if any(file.stem.startswith(pre) for pre in prefixes))
file_groups = itertools.groupby(target_files, key=lambda x: x.stem.split('_')[0])

for _, group in file_groups:
    first, *rest = group
    if not rest:
        continue
    temp_dir = path / first.stem
    temp_dir.mkdir()
    shutil.unpack_archive(first, extract_dir=temp_dir)
    for item in rest:
        shutil.unpack_archive(item, extract_dir=temp_dir)
        item.unlink()
    shutil.make_archive(temp_dir, 'zip', temp_dir)
    shutil.rmtree(temp_dir)
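As a quick sanity check afterwards, you could list the members of one of the merged archives with the standard zipfile module (a sketch):

import zipfile

with zipfile.ZipFile('filename1_160645.zip') as zf:
    print(zf.namelist())  # should contain 1a.csv, 1b.csv, 1d.csv and 1f.csv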
I have the following script, which works well, but now I want to add the name of the folder to the output file. The script walks with os.walk through the subdirectories of the current working folder, and I want to add the subdirectory folder names to the output. Preferably, I want to add only the folder name (and not the whole path) to the first line of the output file. Can anyone help me edit the script?
Thanks in advance!
import os
import csv
from itertools import chain
from collections import defaultdict
def get_file_values(find_files, output_name):
    for root, dirs, files in os.walk(os.getcwd()):
        if all(x in files for x in find_files):
            outputs = []
            for f in find_files:
                d = {}
                with open(os.path.join(root, f), 'r') as f1:
                    for line in f1:
                        ta = line.split()
                        d[ta[1]] = int(ta[0])
                outputs.append(d)
            d3 = defaultdict(list)
            for k, v in chain(*(d.items() for d in outputs)):
                d3[k].append(v)
            with open(os.path.join(root, output_name), 'w+') as fnew:
                writer = csv.writer(fnew)
                for k, v in d3.items():
                    writer.writerow([k] + v)

get_file_values(['genes.gff.genespercontig.csv', 'hmmer.analyze.txt.results.txt'], 'output_contigsvsgenes.csv')
You can use os.listdir(path) to get all subdirectories and filenames.
See: https://docs.python.org/2/library/os.html
os.listdir('.') for the current directory and os.listdir('../') for the directory one level up, for example.
os.path.relpath('.', '..') gives you the folder name of the current directory.
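Putting that together, one way to write only the folder name as the first line is os.path.basename on the root that os.walk yields; a minimal sketch of the write step (the stand-in values mimic what the loop in the script produces):

import os
import csv

root = os.getcwd()                    # the directory currently visited by os.walk
output_name = 'output_contigsvsgenes.csv'
d3 = {'contig1': [12, 7]}             # hypothetical merged data from the script

folder_name = os.path.basename(root)  # just the folder name, not the whole path
with open(os.path.join(root, output_name), 'w') as fnew:
    writer = csv.writer(fnew)
    writer.writerow([folder_name])    # first line: the folder name only
    for k, v in d3.items():
        writer.writerow([k] + v)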
My goal is to concatenate files in a folder based on a string in the middle of the filename, ideally using python or bash. To simplify the question, here is an example:
P16C-X128-22MB-LL_merged_trimmed.fastq
P16C-X128-27MB-LR_merged_trimmed.fastq
P16C-X1324-14DL-UL_merged_trimmed.fastq
P16C-X1324-21DL-LL_merged_trimmed.fastq
I would like to concatenate based on the value after the first dash but before the second (e.g. X128 or X1324), so that I am left with (in this example) two additional files that contain the concatenated contents of the individual files:
P16C-X128-Concat.fastq (concat of 2 files with X128)
P16C-X1324-Concat.fastq (concat of 2 files with X1324)
Any help would be appreciated.
For simple string manipulations, I prefer to avoid the use of regular expressions. I think that str.split() is enough in this case. Besides, for simple file name matching, the library fnmatch provides enough functionality.
import fnmatch
import os
from itertools import groupby

path = '/full/path/to/files/'
ext = ".fastq"
files = fnmatch.filter(os.listdir(path), '*' + ext)

def by(fname): return fname.split('-')[1]  # e.g. X128

# You said:
#   I would like to concatenate based on the value after the first dash
#   but before the second (e.g. X128 or X1324)
# If you want to keep both parts together, uncomment the following:
# def by(fname): return '-'.join(fname.split('-')[:2])  # e.g. P16C-X128

for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        for fname in g:
            with open(os.path.join(path, fname), 'r') as srcf:
                dstf.write(srcf.read())
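One detail worth calling out: itertools.groupby only groups consecutive items, which is why the files are sorted with the same key first. A quick illustration with made-up names:

from itertools import groupby

names = ['X128-a', 'X1324-b', 'X128-c']
key = lambda s: s.split('-')[0]
print([k for k, _ in groupby(names, key)])                   # ['X128', 'X1324', 'X128']
print([k for k, _ in groupby(sorted(names, key=key), key)])  # ['X128', 'X1324']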
Instead of doing the read/write in Python, you could also delegate the concatenation to the OS. You would normally use a bash command like this:
cat *-X128-*.fastq > X128.fastq
Using the subprocess library:
import subprocess

for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        command = ['cat']                              # +++
        for fname in g:
            command.append(os.path.join(path, fname))  # +++
        subprocess.run(command, stdout=dstf)           # +++
Also, for a batch job like this one, you should consider placing the concatenated files in a separate directory, but that is easily done by changing the dst filename.
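For example, a sketch that collects the output in a subdirectory named concat (the name is arbitrary; path, k and ext as in the loop above):

import os

out_dir = os.path.join(path, 'concat')
os.makedirs(out_dir, exist_ok=True)  # create the output directory once
dst = os.path.join(out_dir, str(k) + '-Concat' + ext)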
You can use open to read and write (create) files, os.listdir to get all files (and directories) in a certain directory and re to match file name as needed.
Use a dictionary to store contents by filename prefix (the file's name up until the second hyphen) and concatenate the contents together.
import os
import re

contents = {}
file_extension = "fastq"

# Get all files and directories that are in current working directory
for file_name in os.listdir('./'):
    # Use '.' so it doesn't match directories
    if file_name.endswith('.' + file_extension):
        # Match the first 2 hyphen-separated values from file name
        prefix_match = re.match(r"^([^-]+-[^-]+)", file_name)
        file_prefix = prefix_match.group(1)
        # Read the file and concatenate contents with previous contents
        contents[file_prefix] = contents.get(file_prefix, '')
        with open(file_name, 'r') as the_file:
            contents[file_prefix] += the_file.read() + '\n'

# Create new file for each file prefix and write contents to it
for file_prefix in contents:
    file_contents = contents[file_prefix]
    with open(file_prefix + '-Concat.' + file_extension, 'w') as the_file:
        the_file.write(file_contents)
I compare two text files and print out the results to a 3rd file. I am trying to make it so the script I'm running will iterate over all of the folders that have two text files in them, in the CWD of the script.
What I have so far:
import os
import glob
path = './'
for infile in glob.glob(os.path.join(path, '*.*')):
    print('current file is: ' + infile)
    with open(f1 + '.txt', 'r') as fin1, open(f2 + '.txt', 'r') as fin2:
Would this be a good way to start the iteration process?
It's not the clearest code, but it gets the job done. However, I'm pretty sure I need to take the logic out of the read/write methods, and I'm not sure where to start.
What I'm basically trying to do is have a script iterate over all of the folders in its CWD, open each folder, compare the two text files inside, write a 3rd text file to the same folder, then move on to the next.
Another method I have tried is as follows:
import os
rootDir = 'C:\\Python27\\test'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
And this outputs the following (to give you a better example of the file structure):
Found directory: C:\Python27\test
test.py
Found directory: C:\Python27\test\asdd
asd1.txt
asd2.txt
Found directory: C:\Python27\test\chro
ch1.txt
ch2.txt
Found directory: C:\Python27\test\hway
hw1.txt
hw2.txt
Would it be wise to put the compare logic under the for fname in fileList? How do I make sure it compares the two text files inside the specific folder and not with other fnames in the fileList?
This is the full code that I am trying to add this functionality to. I apologize for the Frankenstein nature of it; I am still working on a refined version, but it does not work yet.
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from collections import deque
import os
class avs_auto:

    def load_and_compare(self, input_file1, input_file2, output_file1, output_file2, result_file):
        self.load(input_file1, input_file2, output_file1, output_file2)
        self.compare(output_file1, output_file2)
        self.final(result_file)

    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        with open(fileIn1 + '.txt') as fin1, open(fileIn2 + '.txt') as fin2:
            frame_rects = defaultdict(list)
            for row in (map(str, line.split()) for line in fin1):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
            frame_rects2 = defaultdict(list)
            for row in (map(str, line.split()) for line in fin2):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects2[frame].append(id)
                frame_rects2[frame].append(rect)
        with open(fileOut1 + '.txt', 'w') as fout1, open(fileOut2 + '.txt', 'w') as fout2:
            for frame, rects in sorted(frame_rects.iteritems()):
                fout1.write('{{{}:{}}}\n'.format(frame, rects))
            for frame, rects in sorted(frame_rects2.iteritems()):
                fout2.write('{{{}:{}}}\n'.format(frame, rects))

    def compare(self, fileOut1, fileOut2):
        with open(fileOut1 + '.txt', 'r') as fin1:
            with open(fileOut2 + '.txt', 'r') as fin2:
                lines1 = fin1.readlines()
                lines2 = fin2.readlines()
        diff_lines = [l.strip() for l in lines1 if l not in lines2]
        diffs = defaultdict(list)
        with open(fileOut1 + 'x' + fileOut2 + '.txt', 'w') as result_file:
            for line in diff_lines:
                d = eval(line)
                for k in d:
                    list_ids = d[k]
                    for i in range(0, len(d[k]), 2):
                        diffs[d[k][i]].append(k)
            for id_ in diffs:
                diffs[id_].sort()
                for k, g in groupby(enumerate(diffs[id_]), lambda (i, x): i - x):
                    group = map(itemgetter(1), g)
                    result_file.write('{0} {1} {2}\n'.format(id_, group[0], group[-1]))

    def final(self, result_file):
        with open(result_file + '.txt', 'r') as fin:
            lines = (line.split() for line in fin)
            for k, g in groupby(lines, itemgetter(0)):
                fst = next(g)
                lst = next(iter(deque(g, 1)), fst)
                with open('final/{}.avs'.format(k), 'w') as fout:
                    fout.write('video0=ImageSource("old\%06d.jpeg", {}-3, {}+3, 15)\n'.format(fst[1], lst[2]))
                    fout.write('video1=ImageSource("new\%06d.jpeg", {}-3, {}+3, 15)\n'.format(fst[1], lst[2]))
                    fout.write('video0=BilinearResize(video0,640,480)\n')
                    fout.write('video1=BilinearResize(video1,640,480)\n')
                    fout.write('StackHorizontal(video0,video1)\n')
                    fout.write('Subtitle("ID: {}", font="arial", size=30, align=8)'.format(k))
Using the load_and_compare() function, I define two input text files, two output text files, a file for the comparison results, and a final phase that writes many files for all of the differences.
What I am trying to do is have this whole class run on the current working directory and go through every subfolder, compare the two text files, and write everything into the same folder, specifically the final() results.
You can indeed use os.walk(), since that already separates the directories from the files. You only need the directories it returns, because that's where you're looking for your 2 specific files.
You could also use os.listdir(), but that returns directories as well as files in the same list, so you would have to check for directories yourself.
Either way, once you have the directories, you iterate over them (for subdir in dirnames) and join the various path components you have: The dirpath, the subdir name that you got from iterating over the list and your filename.
Assuming there are also some directories that don't have the specific 2 files, it's a good idea to wrap the open() calls in a try..except block and thus ignore the directories where one of the files (or both of them) doesn't exist.
Finally, if you used os.walk(), you can easily choose if you only want to go into directories one level deep or walk the whole depth of the tree. In the former case, you just clear the dirnames list by dirnames[:] = []. Note that dirnames = [] wouldn't work, since that would just create a new empty list and put that reference into the variable instead of clearing the old list.
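The difference is easy to demonstrate outside of os.walk (a minimal sketch):

def prune(lst):
    lst[:] = []  # mutates the list object the caller passed in

def rebind(lst):
    lst = []     # only rebinds the local name; the caller's list is untouched

d = ['a', 'b']
rebind(d)
print(d)  # ['a', 'b'] -- unchanged
prune(d)
print(d)  # [] -- cleared, so os.walk would not descend any further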
Replace the print("do something ...") with your program logic.
#!/usr/bin/env python
import errno
import os

f1 = "test1"
f2 = "test2"
path = "."

for dirpath, dirnames, _ in os.walk(path):
    for subdir in dirnames:
        filepath1, filepath2 = [os.path.join(dirpath, subdir, f + ".txt") for f in (f1, f2)]
        try:
            with open(filepath1, 'r') as fin1, open(filepath2, 'r') as fin2:
                print("do something with " + str(fin1) + " and " + str(fin2))
        except IOError as e:
            # ignore directories that don't contain the 2 files
            if e.errno != errno.ENOENT:
                # reraise exception if different from "file or directory doesn't exist"
                raise
    # comment the next line out if you want to traverse all subsubdirectories
    dirnames[:] = []
Edit:
Based on your comments, I hope I understand your question better now.
Try the following code snippet instead. The overall structure stays the same, only now I'm using the filenames returned by os.walk(). Unfortunately, that also makes it harder to do something like "go only into the subdirectories 1 level deep", so I hope walking the whole tree recursively is fine with you. If not, I'll have to add a little code later.
#!/usr/bin/env python
import fnmatch
import os

filter_pattern = "*.txt"
path = "."

for dirpath, dirnames, filenames in os.walk(path):
    # comment this out if you don't want to filter
    filenames = [fn for fn in filenames if fnmatch.fnmatch(fn, filter_pattern)]
    if len(filenames) == 2:
        # comment this out if you don't want the 2 filenames to be sorted
        filenames.sort(key=str.lower)
        filepath1, filepath2 = [os.path.join(dirpath, fn) for fn in filenames]
        with open(filepath1, 'r') as fin1, open(filepath2, 'r') as fin2:
            print("do something with " + str(fin1) + " and " + str(fin2))
I'm still not really sure what your program logic does, so you will have to interface the two yourself.
However, I noticed that you're adding the ".txt" extension to the file name explicitly all over your code, so depending on how you are going to use the snippet, you might or might not need to remove the ".txt" extension first before handing the filenames over. That would be achieved by inserting the following line after or before the sort:
filenames = [os.path.splitext(fn)[0] for fn in filenames]
Also, I still don't understand why you're using eval(). Do the text files contain Python code? In any case, eval() should be avoided and replaced by code that's more specific to the task at hand:
If it's a list of comma separated strings, use line.split(",") instead.
If there might be whitespace before or after the comma, use [word.strip() for word in line.split(",")] instead.
If it's a list of comma separated integers, use [int(num) for num in line.split(",")] instead - for floats it works analogously.
etc.
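In this particular case, the lines written by compare() look like dict literals, so ast.literal_eval from the standard library would be a safer drop-in than eval() (a sketch with a hypothetical line):

import ast

line = "{'42': ['7', ['0', '0', '10', '10']]}"  # hypothetical line from the diff file
d = ast.literal_eval(line)  # parses literals only; never executes code
print(d)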