I have csv files among other files, either uncompressed or compressed with gz, bz2, or another format. All compressed files keep their original extension in their name, so the compression-specific extension is appended to the original filename.
The list of possible compression formats is given through a list, for example:
z_types = [ '.gz', '.bz2' ] # could be many more than two types
I would like to make a list of the csv files regardless of whether they are compressed or not. For uncompressed csv files I usually do the following:
import os
[ file_ for file_ in os.listdir(path_to_files) if file_.endswith('.csv') ]
For the case where I want compressed files as well, I would do:
import os
acsv_files_ = []
for file_ in os.listdir(path_to_files):
    for ztype_ in z_types + [ '' ]:
        if file_.endswith('.csv' + ztype_):
            acsv_files_.append(file_)
Though this would work, is there a more concise and efficient way of doing this? For example, using an 'or' operator within .endswith()?
Yes, that is possible. See str.endswith:
Return True if the string ends with the specified suffix, otherwise return False. suffix can also be a tuple of suffixes to look for. With optional start, test beginning at that position. With optional end, stop comparing at that position.
In [10]: "foo".endswith(("o", "r"))
Out[10]: True
In [11]: "bar".endswith(("o", "r"))
Out[11]: True
In [12]: "baz".endswith(("o", "r"))
Out[12]: False
So you could use
[file_ for file_ in os.listdir(path_to_files) if file_.endswith(tuple('.csv' + z for z in z_types + ['']))]
If your file names all end in '.csv' or '.csv.some_compressed_ext' you could use the following:
import os
csvfiles = [f for f in os.listdir(path) if '.csv' in f]
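Note that the substring test is loose: '.csv' in f also matches names like report.csvx or data.csv.bak. A stricter check is sketched below using pathlib's suffixes, assuming the z_types list and path_to_files from the question:
import os
from pathlib import Path

z_types = ['.gz', '.bz2']  # compression extensions, as in the question

def is_csv(name):
    # Accept 'x.csv', or 'x.csv.<ext>' where <ext> is a known compression suffix.
    suffixes = Path(name).suffixes
    if suffixes and suffixes[-1] == '.csv':
        return True
    return len(suffixes) >= 2 and suffixes[-1] in z_types and suffixes[-2] == '.csv'

csv_files = [f for f in os.listdir(path_to_files) if is_csv(f)]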
You can do this in one line as:
import os
exts = ['', '.gz', '.bz2', '.tar']  # includes '' as the null extension
# this creates the list
files_to_process = [_file for _file in os.listdir(path_to_files) if not _file.endswith('.not_to_process') and _file.endswith(tuple('.csv' + ext for ext in exts))]
Broken down:
files_to_process = [
    _file
    for _file in os.listdir(path_to_files)
    if not _file.endswith('.not_to_process')  # checks against files you have marked as bad
    and _file.endswith(  # checks whether any of the entries in the tuple ends the _file name
        tuple(  # builds a tuple from the given generator expression
            '.csv' + ext for ext in exts  # all the variations: .csv, .csv.gz, .csv.bz2, etc.
        )
    )
]
EDIT
For an even more general solution:
import os
def validate_file(f):
    # Run any tests on the file that you need to determine whether it is
    # valid for processing.
    exts = ['', '.gz', '.bz2']
    if f.endswith('.some_extension_name_you_made_to_mark_bad_files'):
        return False
    return f.endswith(tuple('.csv' + ext for ext in exts))

files_to_process = [f for f in os.listdir(path_to_files) if validate_file(f)]
You could of course replace the code in validate_file with whatever testing you wish to do on the file. You could even use this approach to validate file contents too, for example:
def validate_file(f):
    # Read the file and accept it only if its contents mention 'apple'.
    with open(f) as fh:
        content = fh.read()
    return 'apple' in content
Related
I have a list that contains different paths to files. I want to sort the elements in this list by matching the stem of these paths (i.e. the file name) with a column in a csv file that contains the file names. This is to make sure that the list displays its elements in the order of the file names contained in the csv. The csv is similar to the one shown below:
I have done the following:
from pathlib import Path

file_list = ['C:\\Example\\SS\\e342-SFA.jpg', 'C:\\Example\\DF\\j541-DFS.jpg', 'C:\\Example\\SD\\p162-YSA.jpg']

for f in file_list:
    x = Path(f).stem  # grabs the file name from file_list without .jpg
    for line in csv_file:
        IL = line.replace(":", "").replace("\n", "").replace("(", "").replace(")", "")
        columns = IL.split(",")
        if columns[3] == x:  # columns[3] = file name in the csv
            pass  # [do the sorting]
I'm not sure how to proceed further from here.
I'll assume you already know how to open and parse a csv file, and hence you already have the list ['p162-YSA', 'e342-SFA', 'j541-DFS'].
from ntpath import basename, splitext
order_list = ['p162-YSA', 'e342-SFA', 'j541-DFS']
file_list = ['C:\\Example\\SS\\e342-SFA.jpg', 'C:\\Example\\DF\\j541-DFS.jpg', 'C:\\Example\\SD\\p162-YSA.jpg']
order_dict = {}
for i, w in enumerate(order_list):
    order_dict[w] = i
# {'p162-YSA': 0, 'e342-SFA': 1, 'j541-DFS': 2}
sorted_file_list = [None] * len(file_list)
for name in file_list:
    sorted_file_list[order_dict[splitext(basename(name))[0]]] = name
print(sorted_file_list)
# ['C:\\Example\\SD\\p162-YSA.jpg', 'C:\\Example\\SS\\e342-SFA.jpg', 'C:\\Example\\DF\\j541-DFS.jpg']
Note: I chose to import basename and splitext from ntpath rather than from os.path so that this code can run on my linux machine. See this related question: Get basename of a Windows path in Linux.
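For completeness, here is one minimal way order_list could be built with the csv module; a sketch that assumes a plain comma-separated file (the name names.csv is hypothetical) whose fourth column holds the file names, as in the question:
import csv

with open('names.csv', newline='') as fh:  # hypothetical csv path
    order_list = [row[3] for row in csv.reader(fh)]  # column index 3 = file name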
I'm working on a project to check for copies between two drives, and I got stuck on sorting.
The output I have now is [Filename, Hash, Location] in two lists called drive1 and drive2.
The output I'd like to end up with is two text files, each listing the files that aren't on the other drive.
import os
import os.path
import hashlib
from os import path
drive1 = []
drive2 = []
file1 = input("Directory 1 location : ")
file2 = input("Directory 2 location : ")
AFile = open('skrar.txt', 'w')
AFile.close()
def hash_file(filename):
    # skip anything that isn't a regular file
    if not path.isfile(filename):
        return ''
    # make a hash object
    md5_h = hashlib.md5()
    # open file for reading in binary mode
    with open(filename, 'rb') as file:
        # read file in chunks and update hash
        chunk = file.read(1024)
        while chunk != b'':
            md5_h.update(chunk)
            chunk = file.read(1024)
    # return the hex digest
    return md5_h.hexdigest()
with open('Drive1.txt', 'w') as AFile:
    AFile.write(hashlib.sha224(b"FILENAME").hexdigest() + '\n')
    for folderName, subfolders, filenames in os.walk(file1):
        os.chdir(folderName)
        for filename in filenames:
            AFile.write(filename + ";" + hash_file(filename) + ";" + os.getcwd() + ";" + os.path.join(os.getcwd(), filename) + '\n')

with open('Drive2.txt', 'w') as AFile:
    AFile.write(hashlib.sha224(b"FILENAME").hexdigest() + '\n')
    for folderName, subfolders, filenames in os.walk(file2):
        os.chdir(folderName)
        for filename in filenames:
            AFile.write(filename + ";" + hash_file(filename) + ";" + os.getcwd() + ";" + os.path.join(os.getcwd(), filename) + '\n')
with open('Drive1.txt', 'r') as file:
    for line in file:
        drive1.append(line.split(";"))

with open('Drive2.txt', 'r') as file:
    for line in file:
        drive2.append(line.split(";"))
I'm not sure how to go about this. Maybe I should use dictionaries?
As I understand it, both drive1 and drive2 are lists of lists of length 3. The simplest approach would be the following:
# filter() keeps the entries that are not in the other drive;
# list() materializes the result (filter returns an iterator in Python 3)
files_only_in_drive1 = list(filter(lambda x: x not in drive2, drive1))
files_only_in_drive2 = list(filter(lambda x: x not in drive1, drive2))
This isn't the fastest solution (since searching an unordered list takes linear time). A more performant solution would take advantage of hashing and the set difference operator:
# Use tuple() for hashability.
drive1_file_set = {tuple(file) for file in drive1}
drive2_file_set = {tuple(file) for file in drive2}

# Remove files that are in the other drive using the set difference operator.
# The list comprehensions turn the 3-tuples back into lists and cast the sets back into lists.
files_only_in_drive_1 = [list(file) for file in drive1_file_set.difference(drive2_file_set)]
files_only_in_drive_2 = [list(file) for file in drive2_file_set.difference(drive1_file_set)]
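One caveat: the location element will differ between the two drives by construction, so a difference over whole 3-element entries may report every file as unique. If the goal is to find true copies, you could key the comparison on the hash alone; a sketch under that assumption:
# Collect each drive's hashes (the second element of every entry).
hashes_in_drive1 = {entry[1] for entry in drive1}
hashes_in_drive2 = {entry[1] for entry in drive2}

# Keep the entries whose hash never appears on the other drive.
files_only_in_drive_1 = [entry for entry in drive1 if entry[1] not in hashes_in_drive2]
files_only_in_drive_2 = [entry for entry in drive2 if entry[1] not in hashes_in_drive1]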
I have several files in a directory with the following names
example1.txt
example2.txt
...
example10.txt
and a bunch of other files.
I'm trying to write a script that can get all the files with a file name like <name><digit>.txt, then find the one with the highest digit (in this case example10.txt), and then write a new file where we add +1 to the digit, that is example11.txt.
Right now I'm stuck at the part of selecting the .txt files and getting the last one.
Here is the code:
import glob
from natsort import natsorted
files = natsorted(glob.glob('*[0-9].txt'))
last_file = files[-1]
print(files)
print(last_file)
You can use a regular expression to split the file name into its text and number parts, increment the number, and join everything back together to get your new file name:
import re
import glob
from natsort import natsorted
files = natsorted(glob.glob('*[0-9].txt'))
last_file = files[-1]
base_name, digits = re.match(r'([a-zA-Z]+)([0-9]+)\.txt', last_file).groups()
next_number = int(digits) + 1
next_file_name = f'{base_name}{next_number}.txt'
print(files)
print(last_file)
print(next_file_name)
Note that the regex assumes that the base name of the file has only alpha characters, with no spaces or _, etc. The regex can be extended if needed, as in the sketch below.
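For instance, a more permissive pattern that accepts any base name (spaces, underscores, even inner digits) is sketched here; the lazy (.*?) leaves the trailing digit run to the second group:
import re

# Lazy prefix plus trailing digit run; works for names like 'my file_2 10.txt'.
m = re.match(r'(.*?)(\d+)\.txt$', 'my file_2 10.txt')
base_name, digits = m.groups()  # ('my file_2 ', '10')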
You can use this script; I think it will work well for your purpose.
import os

def get_last_file():
    files = os.listdir('./files')
    for index, file in enumerate(files):
        filename = str(file)[0:str(file).find('.')]
        digit = int(''.join([char for char in filename if char.isdigit()]))
        files[index] = digit
    files.sort()
    return files[-1]

def add_file(file_name, extension):
    last_digit = get_last_file() + 1
    with open('./files/' + file_name + str(last_digit) + '.' + extension, 'w') as f:
        f.write('0')

# call this to create a new incremental file
add_file('example', 'txt')
Here's a simple solution.
files = ["example1.txt", "example2.txt", "example3.txt", "example10.txt"]
highestFileNumber = max(int(file[7:-4]) for file in files)
fileToBeCreated = f"example{highestFileNumber+1}.txt"
print(fileToBeCreated)
output:
example11.txt
.txt and example are constants, so there's no sense in looking for patterns. Just trim the example prefix and the .txt suffix.
My goal is to concatenate files in a folder based on a string in the middle of the filename, ideally using python or bash. To simplify the question, here is an example:
P16C-X128-22MB-LL_merged_trimmed.fastq
P16C-X128-27MB-LR_merged_trimmed.fastq
P16C-X1324-14DL-UL_merged_trimmed.fastq
P16C-X1324-21DL-LL_merged_trimmed.fastq
I would like to concatenate based on the value after the first dash but before the second (e.g. X128 or X1324), so that I am left with (in this example) two additional files that contain the concatenated contents of the individual files:
P16C-X128-Concat.fastq (concat of 2 files with X128)
P16C-X1324-Concat.fastq (concat of 2 files with X1324)
Any help would be appreciated.
For simple string manipulations, I prefer to avoid the use of regular expressions. I think that str.split() is enough in this case. Besides, for simple file name matching, the library fnmatch provides enough functionality.
import fnmatch
import os
from itertools import groupby
path = '/full/path/to/files/'
ext = ".fastq"
files = fnmatch.filter(os.listdir(path), '*' + ext)
def by(fname): return fname.split('-')[1]  # e.g. X128
# You said:
# I would like to concatenate based on the value after the first dash
# but before the second (e.g. X128 or X1324)
# If you want to keep both parts together, uncomment the following:
# def by(fname): return '-'.join(fname.split('-')[:2])  # e.g. P16C-X128
for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        for fname in g:
            with open(os.path.join(path, fname), 'r') as srcf:
                dstf.write(srcf.read())
Instead of the read/write in Python, you could also delegate the concatenation to the OS. You would normally use a bash command like this:
cat *-X128-*.fastq > X128.fastq
Using the subprocess library:
import subprocess
for k, g in groupby(sorted(files, key=by), key=by):
    dst = str(k) + '-Concat' + ext
    with open(os.path.join(path, dst), 'w') as dstf:
        command = ['cat']  # +++
        for fname in g:
            command.append(os.path.join(path, fname))  # +++
        subprocess.run(command, stdout=dstf)  # +++
Also, for a batch job like this one, you should consider placing the concatenated files in a separate directory; that is easily done by changing the dst filename, as in the sketch below.
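A minimal sketch of that change, assuming a hypothetical concatenated/ subdirectory under path:
import os

out_dir = os.path.join(path, 'concatenated')  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)           # create it if it doesn't exist yet

# ...and inside the loop, write there instead:
# with open(os.path.join(out_dir, dst), 'w') as dstf: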
You can use open to read and write (create) files, os.listdir to get all files (and directories) in a certain directory and re to match file name as needed.
Use a dictionary to store contents keyed by filename prefix (the file's name up until the second hyphen -) and concatenate the contents together.
import os
import re

contents = {}
file_extension = "fastq"

# Get all files and directories that are in the current working directory
for file_name in os.listdir('./'):
    # Use '.' so it doesn't match directories
    if file_name.endswith('.' + file_extension):
        # Match the first 2 hyphen-separated values from the file name
        prefix_match = re.match(r"^([^-]+-[^-]+)", file_name)
        file_prefix = prefix_match.group(1)
        # Read the file and concatenate its contents with previous contents
        contents[file_prefix] = contents.get(file_prefix, '')
        with open(file_name, 'r') as the_file:
            contents[file_prefix] += the_file.read() + '\n'

# Create a new file for each prefix and write the concatenated contents to it
for file_prefix in contents:
    file_contents = contents[file_prefix]
    with open(file_prefix + '-Concat.' + file_extension, 'w') as the_file:
        the_file.write(file_contents)
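As a small design note, collections.defaultdict would remove the need for the contents.get(file_prefix, '') initialization; a sketch:
from collections import defaultdict

contents = defaultdict(str)  # missing keys start out as ''

# The loop body then reduces to:
# with open(file_name, 'r') as the_file:
#     contents[file_prefix] += the_file.read() + '\n'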
I have different csv files in different directories, so I want to find specific cells in different columns that correspond to a specific date from my Input.txt file.
Here is what I have so far:
import glob, os, csv, numpy
import re

if __name__ == '__main__':
    Input = open('Input.txt', 'r')
    output = []
    for i, line in enumerate(Input):
        if i == 0:
            header_Input = Input.readline().replace('\n', '').split(',')
        else:
            date_input = Input.readline().replace('\n', '').split(',')

    a = os.walk("path to the directory")
    [x[0] for x in os.walk("path to the directory")]
    print(a)
    b = next(os.walk('.'))[1]  # immediate child directories
    for dirname, dirnames, filenames in os.walk('.'):
        # print path to all subdirectories first
        for subdirname in dirnames:
            print(os.path.join(dirname, subdirname))
        # print path to all filenames
        for filename in filenames:
            # print(os.path.join(dirname, filename))
            csvfile = 'csv_file'
            if csvfile in filename:
                print(os.path.join(dirname, filename))
Now I have the csv files, so I need to find date_input in every file and print the line that contains all the information. Or, if possible, get only the cells that are in the columns whose header == header_Input.
This is not intended to be a full answer to your question. But you may want to consider replacing
for i, line in enumerate(Input):
    if i == 0:
        header_Input = Input.readline().replace('\n', '').split(',')
    else:
        date_input = Input.readline().replace('\n', '').split(',')
with
header_Input = Input.readline().strip().split(',')
date_input = Input.readline().strip().split(',')
The enumerate(Input) expression reads lines from the file, and so do the calls to readline() in the loop body. This will most likely lead to unwanted behavior, such as reading alternating lines from the file.
The strip() method removes whitespace from the start and end of the line. Alternatively you may want to know that s[:-1] strips off the last character of s.
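For the part of the question that remains open, here is a minimal, hedged sketch of scanning one csv file for rows matching a date. It assumes the date sits in a known column (index 0 here) and that date_input holds plain date strings; adjust the column index to your files:
import csv

def find_rows(csv_path, date):
    # Yield every row whose first cell equals the given date string.
    with open(csv_path, newline='') as fh:
        for row in csv.reader(fh):
            if row and row[0] == date:
                yield row

# Hypothetical usage, once the csv paths have been collected:
# for row in find_rows('some_file.csv', date_input[0]):
#     print(row)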