I wrote a python script for a friend that:
takes a CSV of photos she's been cataloging that has the name of the photos in an ordered list
finds the image files on the filesystem
matches the files in the csv with files on the system
copies the images on the filesystem to a folder with a figure name in the order the files appear in the CSV
So essentially, it does:
INPUT: myphoto1.tiff, mypainting.jpeg, myphoto9.jpg, orderedlist.csv
OUTPUT: fig001.jpg, fig002.tiff, fig003.jpeg
This code is going to run on a mac. This works fine except we ran into an issue where some of the files (all by the same photographer) have 1 bracket in them, e.g.
myphoto[fromitaly.jpg
This seems to break my regular expression search:
The relevant code:
orderedpaths = [path for item in target for path in filenames if re.search(item, path)]
Where filenames is a list of the photo files on the system and target is the list from the CSV. This code is supposed to match each CSV file name (and its position in the list) to the filename on disk, giving an ordered list of the filenames on the system.
The error:
Traceback (most recent call last):
File "renameimages.py", line 43, in <module>
orderedpaths = [path for item in target for path in filenames if re.search(item, path)]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression
I tried or considered:
Changing the filenames/csv, but this isn't scalable and ideally her
department will be using this script more in the future
Investigating treating the files as "raw" -- but it didn't seem like
that was possible for input from CSV
Deleting the [ character from the input, but the problem is that
then the input won't match the actual files on the system.
I suppose I should mention I only suspect this was the issue: by printing out the progress of the code, it appears as if the code gets to the CSV item with the bracket and errors.
The relevant code is the part where you build a regular expression from user input without sanitizing it. You should not do that.
I believe you don't need to use regular expressions at all. You can find a matching string using if item in path or path.endswith(item) or something like that.
The best option is to use the standard library:
from os.path import basename
orderedpaths = [ ... if basename(path) == item]
If you insist on using REs, you should escape your input using re.escape():
orderedpaths = [path for item in target for path in filenames
if re.search(re.escape(item), path)]
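To see why the unescaped name fails and how both fixes behave, here is a minimal sketch. The contents of target and filenames below are made up for illustration, and the snippet targets Python 3 rather than the 2.7 interpreter from the traceback.

```python
import re
from os.path import basename

# Hypothetical data standing in for the CSV entries and the files on disk.
target = ["myphoto[fromitaly.jpg", "mypainting.jpeg"]
filenames = ["/photos/mypainting.jpeg", "/photos/myphoto[fromitaly.jpg"]

# An unescaped "[" opens a character class that is never closed,
# so compiling the name as a pattern raises re.error.
try:
    re.search(target[0], filenames[1])
    raw_pattern_ok = True
except re.error:
    raw_pattern_ok = False

# Fix 1: escape the name so every character is matched literally.
escaped = [path for item in target for path in filenames
           if re.search(re.escape(item), path)]

# Fix 2: skip regular expressions and compare base names directly.
by_basename = [path for item in target for path in filenames
               if basename(path) == item]

print(raw_pattern_ok)
print(escaped)
print(by_basename)
```

Both fixes keep the CSV order, because the outer loop still runs over target.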
I wrote a small Python script to concatenate some lines from different files into one file, but the function I wrote doesn't print what I expect. I tried to spot the problem, but after one evening and one morning I still can't find it. Could somebody help me please? Thanks a lot!
I have a folder containing thousands of .fa files. From each .fa file, I would like to extract the lines starting with ">" and modify them to keep only the information I need. In the end, I would like to combine all the information extracted from one file into one line, and then concatenate the lines for all the .fa files into one .txt file.
So the folder:
% ls
EstimatedSpeciesTree.nwk HOG9998.fa concatenate_gene_list_HOG.py
HOG9997.fa HOG9999.fa output
One .fa file for example
>BnaCnng71140D [BRANA]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF
>Cadbaweo98702.t [THATH]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF
What I would like to have is one file like this
HOG9997.fa BnaCnng71140D:BRANA Cadbaweo98702.t:THATH
HOG9998.fa Bkjfnglks098:BSFFE dsgdrgg09769.t
HOG9999.fa Dsdfdfgs1937:XDGBG Cadbaweo23425.t:THATH Dkkjewe098.t:THUGB
# NOTE: the number of lines in each .fa file varies. Also, some header lines have [ ], but some do not.
So my code is
#!/usr/bin/env python3
import os
import re
import csv

def concat_hogs(a_file):
    output = []
    for row in a_file:  # select the gene names in each HOG fasta file
        if row.startswith(">"):
            trans = row.partition(" ")[0].partition(">")[2]
            if row.partition(" ")[2].startswith("["):
                species = re.search(r"\[(.*?)\]", row).group(1)
            else:
                species = ""
            output.append(trans + ":" + species)
    return '\t'.join(output)

folder = "/tmp/Fasta/"
print("Concatenate names in " + folder)
for file in os.listdir(folder):
    if file.endswith('.fa'):
        reader = csv.reader(file, delimiter="\t")
        print(file + concat_hogs(reader))
But the output only prints the file name, without the part that should be generated by the function concat_hogs(file). I don't understand why.
The error comes from you passing the name of the file to your concat_hogs function instead of an iterable file handle. You are missing the actual opening of the file for reading purposes.
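As a minimal sketch of that fix, the loop below opens each file by its full path and passes the handle to the function. The temporary folder and sample data are assumptions for illustration only, and the function is lightly adapted from the question (a strip() removes the trailing newline on headers that have no species).

```python
import os
import re
import tempfile

def concat_hogs(a_file):
    """Collect 'name:species' for every header line of an open FASTA file."""
    output = []
    for row in a_file:
        if row.startswith(">"):
            trans = row.partition(" ")[0].partition(">")[2].strip()
            found = re.search(r"\[(.*?)\]", row)
            species = found.group(1) if found else ""
            output.append(trans + ":" + species)
    return "\t".join(output)

# Build a throwaway folder with one sample file (illustrative data only).
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "HOG9997.fa"), "w") as handle:
    handle.write(">BnaCnng71140D [BRANA]\nMTSSFKLSDLEE\n>Cadbaweo98702.t\nGGPITKF\n")

lines = []
for name in sorted(os.listdir(folder)):
    if name.endswith(".fa"):
        # Open the file by its full path and pass the handle, not the name.
        with open(os.path.join(folder, name)) as fasta:
            lines.append(name + "\t" + concat_hogs(fasta))

print("\n".join(lines))
```

Note that os.listdir returns bare names, so the folder has to be joined back on before opening.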
I agree with Jay M that your code can be simplified drastically, not least by using regular expressions more efficiently. Also pathlib is awesome.
But I think it can be even more concise and expressive. Here is my suggestion:
#!/usr/bin/env python3
import re
from pathlib import Path

GENE_PATTERN = re.compile(
    r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?"
)

def extract_gene(string: str) -> str:
    match = re.search(GENE_PATTERN, string)
    return ":".join(match.groups(default=""))

def concat_hogs(file_path: Path) -> str:
    with file_path.open("r") as file:
        return '\t'.join(
            extract_gene(row)
            for row in file
            if row.startswith(">")
        )

def main() -> None:
    folder = Path("/tmp/Fasta/")
    print("Concatenate names in", folder)
    for element in folder.iterdir():
        if element.is_file() and element.suffix == ".fa":
            print(element.name, concat_hogs(element))

if __name__ == '__main__':
    main()
I am using named capturing groups for the regular expression because I prefer it for readability and usability later on.
Also I assume that the first group can only contain letters, digits and dots. Adjust the pattern, if there are more options.
PS
Just to add a few additional explanations:
The pathlib module is a great tool for any basic filesystem-related task. Among a few other useful methods you can look up there, I use the Path.iterdir method, which just iterates over elements in that directory instead of creating an entire list of them in memory first the way os.listdir does.
The RegEx Match.groups method returns a tuple of the matched groups, the default parameter allows setting the value when a group was not matched. I put an empty string there, so that I can always simply str.join the groups, even if the species-group was not found. Note that this .groups call will result in an AttributeError, if no match was found because then the match variable will be set to None. It may or may not be useful for you to catch this error.
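One way to handle the no-match case is to test for None before calling .groups, as in this small sketch. The name extract_gene_or_none is a hypothetical variant of extract_gene, not part of the suggestion above, and the sample strings are taken from the example file in the question.

```python
import re

GENE_PATTERN = re.compile(r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?")

def extract_gene_or_none(string):
    """Hypothetical variant of extract_gene: returns None when nothing matches."""
    match = GENE_PATTERN.search(string)
    if match is None:
        return None
    # default="" replaces the unmatched species group with an empty string.
    return ":".join(match.groups(default=""))

with_species = extract_gene_or_none(">BnaCnng71140D [BRANA]\n")
without_species = extract_gene_or_none(">Cadbaweo98702.t\n")
not_a_header = extract_gene_or_none("MTSSFKLSDLEEVTTNAEK\n")

print(with_species, without_species, not_a_header)
```

The pattern requires at least one whitespace character after the name, so a header without a species only matches because of its trailing newline.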
For a few additional pointers about using regular expressions in Python, there is a great How-To-Guide in the docs. In addition I can only agree with Jay M about how useful regex101.com is, regardless of language specifics. Also, I think I would recommend using his approach of reading the entire file into memory as a single string first and then using re.findall on it to grab all matches at once. That is probably much more efficient than going line-by-line, unless you are dealing with gigantic files.
In concat_hogs I pass a generator expression to str.join. This is more efficient than first creating a list and passing that to join because no additional memory needs to be allocated. This is possible because str.join accepts any iterable of strings and that generator expression (... for ... in ...) returns a Generator, which inherits from Iterator and thus from Iterable. For more insight about the container inheritance structures I always refer to the collections.abc docs.
Use standard Python libraries
In this case
regex (use a site such as regex101 to test your regex)
pathlib to encapsulate paths in a platform independent way
collections.namedtuple to make data more structured
A breakdown of the regex used here:
>([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)
> The start-of-block character
(regex1) First matching block
\s* Any amount of whitespace (zero is fine)
(regex2|regex3) A choice between two possible regexes
regex1: [a-z0-9A-Z\.]+? = One or more characters from the class: a to z, A to Z, 0 to 9 or a dot
regex2: \n = A newline that immediately follows the whitespace
regex3: \[([A-Z]+)\]?\n = One or more upper-case letters inside square brackets (the closing bracket is optional), followed by a newline
Note1: The parentheses create capture groups, which we later use to split out the fields.
Note2: The regex allows zero or more whitespace characters between the first and second part of the text line, which makes it more resilient.
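To make the breakdown concrete, here is a small sketch that runs the same pattern over an in-memory sample. The sample text is shortened from the question's example file, and the variable names are illustrative.

```python
import re

regex = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

# Two records: one with a species tag in brackets, one without.
sample = ">BnaCnng71140D [BRANA]\nMTSS\n>Cadbaweo98702.t\nGGP\n"

# findall returns one tuple per match, with one entry per capture group;
# a group that did not take part in the match shows up as an empty string.
matches = regex.findall(sample)

# The first group is the name, the last (innermost) group is the species.
pairs = [(m[0], m[-1]) for m in matches]
print(pairs)
```

The middle group holds the whole alternative that matched, which is why the code picks the first and last entries of each tuple.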
import re
from collections import namedtuple
from pathlib import Path
import os

class HOG(namedtuple('HOG', ['filepath', 'filename', 'key', 'text'], defaults=[None])):
    __slots__ = ()

    def __str__(self):
        return f"{self.key}:{self.text}"

regex = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")
path = Path(os.path.abspath("."))
wildcard = "*.fa"
files = list(path.glob("*.fa"))
print(f"Searching {path}/{wildcard} => found {len(files)} files")

data = {}
for file in files:
    print(f"Processing {file}")
    with file.open() as hf:
        text = hf.read(-1)
    matches = regex.findall(text)
    for match in matches:
        key = match[0].strip()
        text = match[-1].strip()
        if file.name not in data:
            data[file.name] = []
        data[file.name].append(HOG(path, file.name, key, text))

print("Now you have the data you can process it as you like")
for file, entries in data.items():
    line = "\t".join(list(str(e) for e in entries))
    print(file, line)

# e.g. Write the output as desired
with Path("output").open("w") as fh:
    for file, entries in data.items():
        line = "\t".join(list(str(e) for e in entries))
        fh.write(f"{file}\t{line}\n")
In Python, it's the vagueness that I struggle with.
Let's start with what I know. I know that I want to search a specific directory for a file. And I want to search that file for a specific line that contains a specific string, and return only that line.
Which brings me to what I don't know. I have a vague description of the specific filename:
some_file_{variable}_{somedatestamp}_{sometimestamp}.log
So I know the file will start with some_file followed by a known variable and ending in .log. I don't know the date stamp or the time stamp. And to top it off, the file might not still exist at the time of searching, so I need to cover for that eventuality.
To better describe the problem, I have the line of the BASH script that accomplishes this:
ls -1tr /dir/some_file_${VARIABLE}_*.log | tail -2 | xargs -I % grep "SEARCH STRING" %
So basically, I want to recreate that line of code from the BASH script in Python, and throw a message in the case that the search returns no files.
Some variant of this will work. It handles an arbitrary folder structure and searches all subfolders. results will hold the directory path, the filename, the line number (1-based), and the text of each matching line.
Some key pieces:
os.walk is beautiful for searching directory trees. The top= can be relative or absolute
use a context manager with ... as ... to open files as it closes them automatically
python iterates over text-based files in line-by line format
Code
from os import walk, path

magic_text = 'lucky charms'
results = []
for dirpath, _, filenames in walk(top='.'):
    for f in filenames:
        # check the name. can use string methods or indexing...
        if f.startswith('some_name') and f[-4:] == '.log':
            # read the lines and look for magic text
            with open(path.join(dirpath, f), 'r') as src:
                # if you iterate over a file, it returns line-by-line
                for idx, line in enumerate(src):
                    # strings support the "in" operator...
                    if magic_text in line:
                        results.append((dirpath, f, idx+1, line))

for item in results:
    print(f'on path {item[0]} in file {item[1]} on line {item[2]} found: {item[3]}')
In a trivial folder tree, I placed the magic words in one file and got this result:
on path ./subfolder/subsub in file some_name_331a.log on line 2 found: lucky charms are delicious
See if this works for you:
from glob import glob
import os

log_dir = 'C:\\Apps\\MyApp\\logs\\'
log_variable = input("Enter variable:")
# The question's names look like some_file_{variable}_..., so keep the underscore
filename = "some_file_"+log_variable

# Option 1
searched_files1 = glob(log_dir+filename+'*.log')
print(f'Total files found = {len(searched_files1)}')
print(searched_files1)

# Option 2
searched_files2 = []
for object in os.listdir(log_dir):
    if (os.path.isfile(os.path.join(log_dir,object)) and object.startswith(filename) and object.endswith('.log')):
        searched_files2.append(object)
print(f'Total files found = {len(searched_files2)}')
print(searched_files2)
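For a closer match to the original BASH line (sort by modification time, keep the two newest files, print the matching lines, and report when nothing is found), a sketch along these lines may help. The function name grep_latest_logs, its parameters, and the demo files are assumptions made for illustration.

```python
import glob
import os
import tempfile

def grep_latest_logs(log_dir, variable, needle, count=2):
    """Return (paths, matching lines) for the `count` most recently modified logs."""
    pattern = os.path.join(glob.escape(log_dir),
                           f"some_file_{glob.escape(variable)}_*.log")
    # Oldest first, like `ls -tr`; the slice then keeps the newest `count`.
    paths = sorted(glob.glob(pattern), key=os.path.getmtime)[-count:]
    hits = []
    for path in paths:
        try:
            with open(path) as log:
                hits.extend(line.rstrip("\n") for line in log if needle in line)
        except FileNotFoundError:
            # The file vanished between listing and opening; skip it.
            continue
    return paths, hits

# Demo with throwaway files (illustrative names and contents).
log_dir = tempfile.mkdtemp()
samples = [
    ("some_file_abc_20200101_0100.log", "old SEARCH STRING\n"),
    ("some_file_abc_20200102_0100.log", "nothing here\n"),
    ("some_file_abc_20200103_0100.log", "new SEARCH STRING\n"),
]
for offset, (name, text) in enumerate(samples):
    path = os.path.join(log_dir, name)
    with open(path, "w") as log:
        log.write(text)
    # Force distinct, increasing modification times.
    os.utime(path, (1_000_000 + offset, 1_000_000 + offset))

paths, hits = grep_latest_logs(log_dir, "abc", "SEARCH STRING")
if not paths:
    print("No matching log files found")
print(hits)
```

Returning the list of paths alongside the hits lets the caller tell "no files" apart from "files found, but no matching lines".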
So I've got the code below, and when I run tests to spit out all the files in A1_dir and A2_list, all of the files show up, but when I try to get the fnmatch to work, I get no results.
For background in case it's helpful: I am trying to comb through a directory of files and take an action (duplicate the file) only IF it matches a file name on the newoutput.txt list. I'm sure there's a better way to do all of this lol, so if you have that I'd love to hear it too!
import fnmatch
import os

A1_dir = ('C:/users/alexd/kobe')
A2_list = open('C:/users/alexd/kobe/newoutput.txt')
Lines = A2_list.readlines()
A2_list.close()

for file in (os.listdir(A1_dir)):
    for line in Lines:
        if fnmatch.fnmatch(file, line):
            print("got one:{file}")
readline returns a single line and readlines returns all the lines as a list (doc). However, in both cases, the lines always have a trailing \n i.e. the newline character.
A simple fix here would be to change
Lines = A2_list.readlines()
to
Lines = [i.strip() for i in A2_list.readlines()]
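The effect of the trailing newline can be checked in isolation. In this small sketch the file name is made up, and the two strings stand in for what readlines() and os.listdir() return.

```python
import fnmatch

line = "report.txt\n"   # what readlines() hands back (hypothetical name)
name = "report.txt"     # what os.listdir() hands back

# The newline is part of the pattern, so the bare name cannot match it.
matches_raw = fnmatch.fnmatch(name, line)

# After stripping, the pattern and the name are identical.
matches_stripped = fnmatch.fnmatch(name, line.strip())

print(matches_raw, matches_stripped)
```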
Since you asked for a better way, you could take a look at set operations.
Since the lines are exactly what you want the file names to be (and not patterns), save A2_list as a set instead of a list.
Next, save all the files from os.listdir also as a set.
Finally, perform a set intersection
import os

with open('C:/users/alexd/kobe/newoutput.txt') as fp:
    myfiles = set(i.strip() for i in fp.readlines())

all_files = set(os.listdir('C:/users/alexd/kobe'))
for f in all_files.intersection(myfiles):
    print(f"got one:{f}")
fnmatch.fnmatch is not meant for comparing two file names. It accepts exactly two parameters, a filename and a pattern, in that order, as described in the official documentation.
Possible Solution:
I don't think you need any function to compare two strings. Both os.listdir() and .readlines() return lists of strings.
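A plain membership test is enough once the newlines are stripped, as in this minimal sketch. The two lists are hypothetical stand-ins for os.listdir(A1_dir) and the lines read from newoutput.txt.

```python
# Hypothetical stand-ins for os.listdir(A1_dir) and the lines of newoutput.txt.
dir_entries = ["a.txt", "b.txt", "newoutput.txt"]
lines = ["b.txt\n", "c.txt\n"]

# Strip the trailing newlines, then compare the strings directly.
wanted = [line.strip() for line in lines]
found = [name for name in dir_entries if name in wanted]

for name in found:
    print(f"got one:{name}")
```

Unlike a set intersection, this keeps the order in which the directory entries were listed.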
I want to rename filenames of the form xyz.ogg.mp3 to xyz.mp3.
I have a regex that looks for .ogg in every file then it replaces the .ogg with an empty string but I get the following error:
Traceback (most recent call last):
File ".\New Text Document.py", line 7, in <module>
os.rename(files, '')
TypeError: rename() argument 1 must be string, not _sre.SRE_Match
Here is what I tried:
for file in os.listdir("./"):
    if file.endswith(".mp3"):
        files = re.search('.ogg', file)
        os.rename(files, '')
How can I make this loop look for every .ogg in each file then replace it with an empty string?
The file structure looks like this: audiofile.ogg.mp3
You can do something like this:
for file in os.listdir("./"):
    if file.endswith(".mp3") and '.ogg' in file:
        os.rename(file, file.replace('.ogg',''))
It would be far quicker to do this from the command line:
rename 's/\.ogg//' *.ogg.mp3
(perl's rename)
An example using Python 3's pathlib (but not regular expressions, as it's kind of overkill for the stated problem):
from pathlib import Path

for path in Path('.').glob('*.mp3'):
    if '.ogg' in path.stem:
        new_name = path.name.replace('.ogg', '')
        path.rename(path.with_name(new_name))
A few notes:
Path('.') gives you a Path object pointing to the current working directory
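Path.glob() matches the pattern relative to that directory, and the * there is a wildcard (so you get anything ending in .mp3). It does not descend into subdirectories unless the pattern uses ** (or you call Path.rglob())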
Path.stem gives you the file name minus the extension (so if your path were /foo/bar/baz.bang, the stem would be baz)
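If a regular expression is still preferred, re.sub is the function to reach for: it returns the new string, whereas re.search returns a match object, which is what triggered the TypeError in the question. The sketch below works in a temporary folder with made-up file names so it can be run safely.

```python
import os
import re
import tempfile

# Throwaway folder with illustrative file names.
folder = tempfile.mkdtemp()
for name in ("audiofile.ogg.mp3", "plain.mp3", "notes.ogg.txt"):
    open(os.path.join(folder, name), "w").close()

for name in os.listdir(folder):
    # Remove ".ogg" only when it sits directly before a final ".mp3".
    new_name = re.sub(r"\.ogg(?=\.mp3$)", "", name)
    if new_name != name:
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))

result = sorted(os.listdir(folder))
print(result)
```

The escaped dot and the lookahead keep the pattern from touching names such as notes.ogg.txt.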
I am trying to automate some plotting using python and fortran together.
I am very close to getting it to work, but I'm having problems getting the result from a glob search to feed into my python function.
I have a .py script that says
import glob
run=glob.glob('JUN*.aijE*.nc')
from plot_check import plot_check
plot_check(run)
But I am getting this error
plot_check(run)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "plot_check.py", line 7, in plot_check
ncfile=Dataset(run,'r')
File "netCDF4.pyx", line 1328, in netCDF4.Dataset.__init__ (netCDF4.c:6336)
RuntimeError: No such file or directory
I checked that the glob is doing its job and it is, but I think it's the format of my variable "run" that's screwing me up.
In python:
>>> run
['JUN3103.aijE01Ccek0kA.nc']
>>> type(run)
<type 'list'>
So my glob is finding the file name of the file I want to put into my function, but something isn't quite working when I try to input the variable "run" in to my function "plot_check".
I think it might be something to do with the format of my variable "run", but I'm not quite sure how to fix it.
Any help would be greatly appreciated!
glob.glob returns a list of all matching filenames. If you know there's always going to be exactly one file, you can just grab the first element:
filenames = glob.glob('JUN*.aijE*.nc')
plot_check(filenames[0])
Or, if it might match more than one file, then iterate over the results:
filenames = glob.glob('JUN*.aijE*.nc')
for filename in filenames:
    plot_check(filename)
Perhaps Dataset expects to be passed a single string filename, rather than a list with one element?
Try using run[0] instead (though you may want to check to make sure your glob actually matches a file before you do that).
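The check mentioned above could look like this sketch. The helper name first_match is hypothetical, and plot_check is assumed to be the function from the question.

```python
import glob

def first_match(pattern):
    """Return the first path matching `pattern`, or raise if there is none."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matches {pattern!r}")
    return matches[0]

# Usage, with plot_check taken from the question:
# plot_check(first_match('JUN*.aijE*.nc'))
```

Sorting the matches makes the choice deterministic when more than one file fits the pattern.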