How to find a specific file in Python

I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot number (a unique protein identifier), and MutationOutput is self-explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
Here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory is about 8,116 files -- so not that many.
Thank you for your help!

What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
...     pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
...     return glob.glob(pattern)[0]
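Since what you ultimately want is the gene name, you can go one small step further and split the matched filename (a sketch building on find_gene above; the helper name is just for illustration):
import os

def find_gene_name(uniprot):
    # find_gene() returns e.g. 'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt';
    # the gene name is everything before the first underscore in the basename
    path = find_gene(uniprot)
    return os.path.basename(path).split('_')[0]
So find_gene_name('A8K2U0') would give you 'A2ML1' to use in your write() call.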
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    d[uniprot] = os.path.join('mutation_directory', f)  # store the path so lookups return it directly
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
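If you're curious, you can see that for yourself with fnmatch.translate, which is what glob uses under the hood to turn the wildcard into a regular expression (purely an illustration, assuming standard-library behaviour; you don't need this in your code):
>>> import fnmatch, re
>>> pattern = fnmatch.translate('*_A8K2U0_MutationOutput.txt')
>>> bool(re.match(pattern, 'A2ML1_A8K2U0_MutationOutput.txt'))
True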

You can use glob
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']

Related

Find most matched string from pandas dataframe object

I have created a pandas dataframe object with all the Files, FilePaths and FileDirectory names that are in a specific folder. Now I am reading filenames from a JSON file and want to find the exact location of each file by searching 'FilePaths' or 'FileDirectory' in the dataframe/pickle (as it is much faster to search).
What I am trying, for example:
>> dcm_sure_full_path = '/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'
>> set(df[df['FileDirectory'].str.contains(os.path.basename(dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
This gives me three different paths, which means the basename matches files in three different locations.
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_VascuCAP CT Development_SE/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_WP A.2 Development_images_noTruth/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712'}
But as you can see, I needed exactly the first one, which matches the desired path most closely. I then tried to get the best match with the following code:
>> set(df[np.char.find(df['FileDirectory'].values.astype(str), dcm_sure_full_path) > -1]['FileDirectory'])#.iloc[rn])
or simply remove the os.path.basename call from the previous one:
set(df[df['FileDirectory'].str.contains((dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
which gives the desired path and discards the other two:
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'}
My question is: are there better, smarter ways to do this sort of search with more accuracy, so that I don't miss any file directory?
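For reference, a hedged sketch of what the working approach boils down to: match on the full (normalized) path rather than the basename, or anchor the comparison to the end of the stored path. This assumes df and dcm_sure_full_path are the objects from the question and that FileDirectory holds absolute paths:
import os

# exact suffix match instead of a substring search, so partial basename
# matches in other folders are not picked up
target = os.path.normpath(dcm_sure_full_path)
matches = set(df.loc[df['FileDirectory'].str.endswith(target), 'FileDirectory'])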

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list which always has at least one entry (the full string), whereas using index may throw an exception.
You can also limit the number of splits if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
OK, notice that you can use indexing for strings just like you do for lists, i.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get exactly what you want that way. For exactly that, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
If we are talking about a file containing multiple lines like that, you can do it for each line: split each line at ',' and put everything in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        csv_line = line[:line.index(".")]
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col
How one would do this
Notice that most people would not want to do this by hand. Working with tabular data is a common task, and there is a library called pandas for that. It might be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though; I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.

Extract list of variables with start attribute from Modelica model

Is there an easy way to extract a list of all variables with a start attribute from a Modelica model? The ultimate goal is to run a simulation until it reaches steady state, then run a Python script that compares the start values against the steady-state values, so that I can identify start values that were chosen badly.
In the Dymola Python interface I could not find such functionality. Another approach could be to generate the modelDescription.xml and parse it; I assume the information is available somewhere in there, but for that approach I also feel I need help getting started.
Similar to this answer, you can extract that info easily from the modelDescription.xml inside an FMU with FMPy.
Here is a small runnable example:
from fmpy import read_model_description
from fmpy.util import download_test_file
from pprint import pprint
fmu_filename = 'CoupledClutches.fmu'
download_test_file('2.0', 'CoSimulation', 'MapleSim', '2016.2', 'CoupledClutches', fmu_filename)
model_description = read_model_description(fmu_filename)
start_vars = [v for v in model_description.modelVariables if v.start and v.causality == 'local']
pprint(start_vars)
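From there, printing the start values for a quick comparison against your steady-state results is straightforward (a small usage sketch using the attributes FMPy exposes on its variable objects):
for v in start_vars:
    print(f"{v.name}: start = {v.start}")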
The files dsin.txt and dsfinal.txt might help you with this. They have the same structure, with values at the start and at the end of the simulation; by renaming dsfinal.txt to dsin.txt you can start your simulation from the (e.g. steady-state) values you computed in a previous run.
It might be worth working with these two files if you already plan to use such values for running other simulations.
They also give you information about solver/simulation settings that you won't find in the .mat result files (if that is of any interest for your case).
However, if it is only a comparison between start and final values of variables that are present in the result files anyway, a better choice might be to use Python and a library to read the result .mat file (DyMat, ModelicaRes, etc.). It is then a matter of comparing the start and end values of the signals of interest.
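As a rough sketch of that last suggestion, assuming the DyMat package and a Dymola result file called result.mat (the variable names here are only placeholders):
import DyMat

res = DyMat.DyMatFile('result.mat')          # Dymola result file
for name in ['body.v', 'spring.f']:          # hypothetical signals of interest
    values = res.data(name)                  # trajectory of the signal
    print(name, 'initial:', values[0], 'final:', values[-1])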
After some trial and error, I came up with this python code snippet to get that information from modelDescription.xml:
import xml.etree.ElementTree as ET

root = ET.parse('modelDescription.xml').getroot()
for ScalarVariable in root.findall('ModelVariables/ScalarVariable'):
    # the start attribute sits on the type element (Real, Integer, ...) inside each ScalarVariable
    varStart = ScalarVariable.find('*[@start]')
    if varStart is not None:
        name = ScalarVariable.get('name')
        value = varStart.get('start')
        print(f"{name} = {value};")
To generate the modelDescription.xml file, run Dymola translation with the flag
Advanced.FMI.GenerateModelDescriptionInterface2 = true;
Python standard library has several modules for processing XML:
https://docs.python.org/3/library/xml.html
This snippet uses ElementTree.
This is just a first step, not sure if I missed something basic.

Data structure for filesystem

I'm storing / caching the filesystem (filenames only) in memory to be able to do fast searches à la Everything. Thus I don't want to use the OS's built-in file-search GUI.
I do it with:
import os

L = []
for root, dirs, files in os.walk(PATH):
    L.append([root, files])
and the result is like this:
[['D:\\', ['a.jpg', 'b.jpg']],
...
['D:\\Temp12', ['test.txt', 'test2.txt']]]
The problem is that searching takes too much time when L contains millions of elements:
query = 'test2'  # searching for filenames containing this text
for dir in L:
    for f in dir[1]:
        if query in f:
            print '%s found: %s' % (query, os.path.join(dir[0], f))
Indeed, this is a very naive search because it requires browsing the whole list to find items.
How can I make the queries faster?
It seems that a list may not be the right data structure for this kind of search; is there a tree-like structure I should use instead?
Searching a list is O(n); searching a dictionary is amortized O(1). If you don't need to associate values, use sets.
If you want to read more about this: https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
In your case, I would use sets. It will make your queries a lot faster.
EDIT:
The way you are doing it, checking every filename for a match, cannot be made quicker just by switching containers: even with a dict, you would still check every filename for a match.
New idea:
You can create a dict with all filenames as keys and the root as the value for each; this way you can recreate the full path later (see the sketch below).
The other idea is to create a tree where each node is a letter and where the paths between nodes form words (the filenames), i.e. a trie. It could be difficult to implement and the result may not be faster, depending on how you construct the tree.
You have to remember that you want to check each and every filename, and using a list or a dict won't change that. The tree/graph is the only other approach I can think of.
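For exact filename lookups, that dict idea could look something like this (a sketch; it only helps when you query complete filenames, not substrings):
import os

index = {}
for root, dirs, files in os.walk(PATH):
    for f in files:
        # map each filename to every directory that contains it
        index.setdefault(f, []).append(root)

# O(1) lookup by exact filename
for root in index.get('test2.txt', []):
    print(os.path.join(root, 'test2.txt'))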
Could you consider using a database for this?
SQLite offers a :memory: option, which creates your database in memory only. Of course you can optimise your algorithm and data structure as pointed out in other answers and comments, but databases are generally already very good at this with their indexing, and you would not need to design something similar yourself.
Your schema could be simply one table with fields full_path and filename; if you indexed it by filename, lookups would be fast. This would store a lot of redundant information, as every file would have the full path in full_path. A better solution would be to have one table for directories and another for files, with files referencing directories, so you can reconstruct the full path of a match.
Just a thought.
Hannu
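A minimal sketch of the single-table idea (storing the directory rather than the full path, to stay close to the os.walk output from the question):
import os
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE files (dir TEXT, name TEXT)')
conn.execute('CREATE INDEX idx_name ON files (name)')

with conn:
    for root, dirs, files in os.walk(PATH):
        conn.executemany('INSERT INTO files VALUES (?, ?)',
                         ((root, f) for f in files))

# exact-name lookups use the index; substring (LIKE) queries still scan,
# but the scan happens inside SQLite rather than in a Python loop
for d, name in conn.execute("SELECT dir, name FROM files WHERE name LIKE ?",
                            ('%test2%',)):
    print(os.path.join(d, name))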

Python - sorting incremented files

I have many folders that contain several versioned files. Here are example files:
Cat_Setup_v01.mb
Cat_Setup_v18.mb
The version number is padded to two characters. This way, I can easily sort files using:
listFiles = glob.glob( myPath + "*.m*") # Retrieve files in my folder
listFiles.sort()
Unfortunately, I have some files with more than a hundred versions, so my sorting method is broken: v1XX files sort lexicographically between v10 and v11 instead of at the end.
Is there an efficient way I can sort my files in the right way without having to rename them all and change their padding?
sorted(versionNumber, key=int) combined with some string-splitting operations could be an interesting lead, but I'm afraid it will be too cumbersome.
I don't know Python well, and as it seems to be an interesting language with a lot of possibilities, I'm pretty sure there is a more elegant way.
Cheers
Regular expressions may help you.
import re
files = ["Cat_Setup_v91.mb", "Cat_Setup_v01.mb", "Cat_Setup_v119.mb"]
print sorted(files, key=lambda x: int(re.findall(r"(?<=v)\d+", x)[0]))
give the output:
['Cat_Setup_v01.mb', 'Cat_Setup_v91.mb', 'Cat_Setup_v119.mb']
Updated: changed "(?<=v)\w*" to "(?<=v)\d+" according to @Rawing's comment.
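Applied to the glob listing from the question, the same key function sorts in place without renaming anything (a sketch; it assumes every filename contains a vNN version tag):
import glob
import re

listFiles = glob.glob(myPath + "*.m*")   # retrieve files in my folder
listFiles.sort(key=lambda x: int(re.findall(r"(?<=v)\d+", x)[0]))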
