I have many folders that contain several versioned files. Here are some example files:
Cat_Setup_v01.mb
Cat_Setup_v18.mb
The version number is padded to two characters, so I can easily sort the files using:
import glob
listFiles = glob.glob(myPath + "*.m*")  # retrieve the files in my folder
listFiles.sort()
Unfortunately, some files have more than a hundred versions, so my sorting method breaks: the three-digit versions (v100 and up) get sorted in among the two-digit v1X files instead of after v99.
Is there an efficient way to sort my files correctly without having to rename them all and change their padding?
sorted(versionNumber, key=int) combined with some string splitting looks like a promising lead, but I'm afraid it will be too cumbersome.
I don't know Python well, and since it seems to be an interesting language with a lot of possibilities, I'm pretty sure there is a more efficient way.
Cheers
A regular expression may help you.
import re

files = ["Cat_Setup_v91.mb", "Cat_Setup_v01.mb", "Cat_Setup_v119.mb"]
print(sorted(files, key=lambda x: int(re.findall(r"(?<=v)\d+", x)[0])))
which gives the output:
['Cat_Setup_v01.mb', 'Cat_Setup_v91.mb', 'Cat_Setup_v119.mb']
Updated: changed "(?<=v)\w*" to "(?<=v)\d+" per @Rawing's comment.
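Applied back to the question's setup, a minimal sketch might look like this (myPath is a placeholder for the folder from the question):

import glob
import os
import re

myPath = "/path/to/my/folder"  # placeholder for the folder from the question
listFiles = glob.glob(os.path.join(myPath, "*.m*"))
# sort by the numeric value of the version instead of by string order
listFiles.sort(key=lambda x: int(re.findall(r"(?<=v)\d+", x)[0]))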
I'm working on a small project that requires me to use Python to uppercase all the file names in a certain directory (e.g. input: Brandy.jpg, output: BRANDY.jpg).
The thing is, I've never done this on multiple files before. What I've done so far is the following:
universe = os.listdir('parallel_universe/')
universe = [os.path.splitext(x)[0].upper() for x in universe]
But this only uppercased the names in the list, not the files in the directory itself. The output was like the following:
['ADAM SANDLER','ANGELINA JULIE','ARIANA GRANDE','BEN AFFLECK','BEN STILLER','BILL GATES', 'BRAD PITT','BRITNEY SPEARS','BRUCE LEE','CAMERON DIAZ','DWAYNE JOHNSON','ELON MUSK','ELTON JOHN','JACK BLACK','JACKIE CHAN','JAMIE FOXX','JASON SEGEL', 'JASON STATHAM']
What am I missing here? Since I don't have much experience with Python, I'd love it if your answers included explanations for each step. Thanks in advance.
Right now, you are converting the strings to uppercase, but that's it. There is no actual renaming being done. In order to rename, you need to use os.rename.
If you were to wrap your code with os.rename, it should solve your problem, like so:
[os.rename("parallel_universe/" + x, "parallel_universe/" + os.path.splitext(x)[0].upper() + os.path.splitext(x)[1]) for x in universe]
I have removed the universe = assignment because this expression no longer returns a useful list; you would just get a bunch of None objects.
Docs for os.rename: https://docs.python.org/3/library/os.html#os.rename
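If you prefer a plain loop over a comprehension with side effects, a minimal sketch (assuming the files live in parallel_universe/) could be:

import os

directory = "parallel_universe"
for name in os.listdir(directory):
    base, ext = os.path.splitext(name)                       # ("Brandy", ".jpg")
    os.rename(os.path.join(directory, name),                 # old path
              os.path.join(directory, base.upper() + ext))   # new path, e.g. "BRANDY.jpg"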
Someone has challenged me to create a program that sorts their pictures into folders based on the month they were taken, and I want to do it in one line (I know, it's inefficient and unreadable, but I still want to do it because one-liners are cool)
I needed a for loop to accomplish this, and the only way I know to get a for loop onto one line is a list comprehension, so that's what I did. But it creates an empty list and doesn't print anything.
What I'm doing is renaming the file to be the month created + original filename (ex: bacon.jpg --> May\bacon.jpg)
Here is my code (Python 3.7.3):
import time
import os.path
[os.rename(str(os.fspath(f)), str(time.ctime(os.path.getctime(str(os.fspath(f))))).split()[1] + '\\' + str(os.fspath(f))) for f in os.listdir() if f.endswith('.jpg')]
and the more readable, non-list-comprehension version:
import time
import os.path
for f in os.listdir():
    fn = str(os.fspath(f))
    dateCreated = str(time.ctime(os.path.getctime(fn)))
    monthCreated = dateCreated.split()[1]
    os.rename(fn, monthCreated + '\\' + fn)
Is list comprehension a bad way to do it? Also, is there a reason why, when I print the list, it's [] instead of [None, None, None, ...] (one None for every image moved)?
Please note: I understand that it's inefficient and bad practice. If I were doing this for purposes other than just for fun to see if I could do it, I would obviously not try to do it in one line.
This is bad in two immediate respects:
You're using a list comprehension when you're not actually interested in constructing a list -- you ignore the object you just constructed.
Your construction has an ugly side effect in the OS.
Your purpose appears to be renaming a sequence of files, not constructing a list. The Python facility you want is, I believe, the map function. Write a function to change one file name, and then use map on a list of file names -- or tuples of old, new file names -- to run through the sequence of desired changes.
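A minimal sketch of that map-based approach, reusing the month-from-creation-time logic from the question (the helper name is mine):

import os
import time

def move_to_month(filename):
    # derive the month name (e.g. "May") from the file's creation time
    month = time.ctime(os.path.getctime(filename)).split()[1]
    os.makedirs(month, exist_ok=True)          # the folder must exist before renaming
    os.rename(filename, os.path.join(month, filename))

# map() is lazy in Python 3, so wrap it in list() to actually perform the renames
list(map(move_to_month, [f for f in os.listdir() if f.endswith('.jpg')]))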
Is list comprehension a bad way to do it?
YES. But if you want to do it in one line, it is either that or using ";". For instance:
for x in range(5): print(x);print(x+2)
And, by the way, just renaming a file to a path that includes a slash will not create the folder. You have to create it first, with os.mkdir('foldername').
In the end, if you really want to do this, I would recommend writing it normally across several lines and then joining the statements with semicolons on a single line.
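As a sketch of that last suggestion, keeping the .jpg filter and adding the folder creation the original one-liner was missing:

import os, time

# the readable loop from the question, mechanically collapsed onto one line with semicolons
for f in [n for n in os.listdir() if n.endswith('.jpg')]: month = time.ctime(os.path.getctime(f)).split()[1]; os.makedirs(month, exist_ok=True); os.rename(f, os.path.join(month, f))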
I'm storing/caching the filesystem (filenames only) in memory to be able to do fast searches à la Everything, so I don't want to use the OS's built-in file search GUI.
I do it with:
import os
L = []
for root, dirs, files in os.walk(PATH):
    L.append([root, files])
and the result is like this:
[['D:\\', ['a.jpg', 'b.jpg']],
...
['D:\\Temp12', ['test.txt', 'test2.txt']]]
The problem is that a search takes too much time once L contains millions of elements:
query = 'test2'  # searching for filenames containing this text
for root, files in L:
    for f in files:
        if query in f:
            print('%s found: %s' % (query, os.path.join(root, f)))
Indeed, this is a very naive search, because it has to walk the whole list to find matches.
How can I make the queries faster?
It seems that a list might not be the right data structure for this kind of search; is there a tree-like structure I should use instead?
Lookups in a list are O(n); lookups in a dictionary are amortized O(1). If you don't need to associate values, use sets.
If you want to read more about this: https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
In your case, I would use sets. It will make your queries a lot faster.
EDIT:
The way you are doing it, checking every filename for a match, can't be made quicker just by changing the container: even with a dict you would still have to check every filename for a substring match.
New idea:
You can create a dict with the filenames as keys and the root directory of each as its value. This way you can recreate the full path later.
The next idea is to build a tree (a trie) where each node is a letter and where the paths between nodes spell out words (the filenames). It could be difficult to implement, and the result may not be faster, depending on how you construct the tree.
Remember that you want to check each and every filename, and using a list or a dict won't change that. The tree/graph is the only alternative I can think of.
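A minimal sketch of that dict idea (filename as key, containing directory as value); note that a substring query such as 'test2' still has to scan every key:

import os

index = {}
for root, dirs, files in os.walk(PATH):   # PATH as in the question
    for f in files:
        index[f] = root                   # a filename present in several directories keeps only the last root

query = 'test2'
matches = [os.path.join(root, name) for name, root in index.items() if query in name]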
Could you consider using a database for this?
SQLite offers a :memory: option, which creates the database in memory only. Of course you can optimise your algorithm and data structure as pointed out in other answers and comments, but databases are generally already very good at this thanks to their indexing, so you would not need to design something similar yourself.
Your table(s) could be as simple as a single table with fields full_path and filename; if you indexed it by filename, lookups would be fast. That would store a lot of redundant information, since every file would carry its full path in full_path. A better solution would be a table for directories and another for files, with each file referencing its directory, so you can reconstruct the full path of a match.
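A minimal sketch of the single-table variant, assuming L is the [root, files] list built in the question (the table and column names are illustrative):

import os
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE files (full_path TEXT, filename TEXT)')
con.execute('CREATE INDEX idx_filename ON files (filename)')
con.executemany('INSERT INTO files VALUES (?, ?)',
                ((os.path.join(root, f), f) for root, files in L for f in files))

# substring search; note that a LIKE pattern with a leading % cannot use the index,
# so this is still a scan, just a fast one inside SQLite
for full_path, filename in con.execute(
        'SELECT full_path, filename FROM files WHERE filename LIKE ?', ('%test2%',)):
    print(full_path)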
Just a thought.
Hannu
I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot number (a unique protein identifier), and MutationOutput is self-explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
Here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory holds about 8,116 files, so not that many.
Thank you for your help!
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
... pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
... return glob.glob(pattern)[0]
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    d[uniprot] = os.path.join('mutation_directory', f)  # store the path so it can be used directly
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
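For instance, in CPython glob delegates the matching to the fnmatch module, and fnmatch.translate shows the regular expression a wildcard pattern turns into:

import fnmatch

print(fnmatch.translate('*_A8K2U0_MutationOutput.txt'))
# something like '(?s:.*_A8K2U0_MutationOutput\\.txt)\\Z' (the exact form varies by Python version)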
You can use glob
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']
I need some help sorting two lists: one with file listings and one with directory listings.
These lists are generated through another part in a much larger script that I cannot put on here.
filelist = ['CB088_EFH_030_comp_v011.mov', 'CB086_EHA_010_comp_v031.mov', 'CB083_WDA_400_comp_v021.mov', 'CB086_EHA_020_comp_v010.mov', 'CB083_WDA_450_comp_v012.mov']
folderlist = ['[CB086_EHA_010_comp_v031]', '[CB083_WDA_400_comp_v021]', '[CB086_EHA_020_comp_v010]', '[CB083_WDA_450_comp_v012]']
Using .sort() I can get the data to output like this:
[CB083_WDA_400_comp_v021]
[CB083_WDA_450_comp_v012]
[CB086_EHA_010_comp_v031]
[CB086_EHA_020_comp_v010]
CB083_WDA_400_comp_v021.mov
CB083_WDA_450_comp_v012.mov
CB086_EHA_010_comp_v031.mov
CB086_EHA_020_comp_v010.mov
CB088_EFH_030_comp_v011.mov
But I need it to output like this
[CB083_WDA_400_comp_v021]
CB083_WDA_400_comp_v021.mov
[CB083_WDA_450_comp_v012]
CB083_WDA_450_comp_v012.mov
[CB086_EHA_010_comp_v031]
CB086_EHA_010_comp_v031.mov
[CB086_EHA_020_comp_v010]
CB086_EHA_020_comp_v010.mov
CB088_EFH_030_comp_v011.mov
How can I go about sorting it but ignoring the [] during the sort?
Or what would I do to get the second output?
I'm kind of stumped on what I should do.
Any tips or suggestions?
....sort(key=lambda x: x.strip('[]'))
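For example, to get the second output, you could sort both lists together on the bare shot name, placing each [folder] right before its matching file (a sketch, assuming the folder and file names correspond exactly as in the example):

import os

# sort folders and files together; the second key element puts a folder before
# the file with the same name, and files without a folder just fall into place
combined = sorted(folderlist + filelist,
                  key=lambda x: (os.path.splitext(x.strip('[]'))[0], x in filelist))
for entry in combined:
    print(entry)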