I have created a pandas DataFrame with the Files, FilePaths and FileDirectory names of everything in a specific folder. Now I am reading filenames from a JSON file and want to find the exact location of each file by searching the 'FilePaths' or 'FileDirectory' column of the DataFrame/pickle (as it is much faster to search).
For example, here is what I am trying:
>> dcm_sure_full_path = '/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'
>> set(df[df['FileDirectory'].str.contains(os.path.basename(dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
This gives me three different paths, which means part of the path matches three different locations:
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_VascuCAP CT Development_SE/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_WP A.2 Development_images_noTruth/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712'}
You can clearly see that I needed exactly the first one, which matches the desired path best. I then tried to get the best match with the following code:
>> set(df[np.char.find(df['FileDirectory'].values.astype(str), dcm_sure_full_path) > -1]['FileDirectory'])#.iloc[rn])
or simply drop the os.path.basename() call from the previous attempt and search for the full path:
set(df[df['FileDirectory'].str.contains((dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
which returns the desired path and discards the other two:
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'}
My question is: are there better, smarter ways to do this sort of search with more accuracy, so that I don't miss or mismatch any file directory?
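One way to tighten this kind of lookup (a minimal sketch, assuming the df and column names above) is to normalise both sides and anchor the comparison at the end of each stored path with str.endswith, so shorter partial matches elsewhere in the tree are excluded:
import os

# normalise both sides so trailing slashes or redundant separators
# don't break the comparison
target = os.path.normpath(dcm_sure_full_path)
matches = df[df['FileDirectory'].map(os.path.normpath).str.endswith(target)]
print(set(matches['FileDirectory']))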
Related
I have a path such as '/img/dataset/A/B/3/5/img1.png'.
How can I extract the '/A/B/3/5/img1.png' part out of it?
I tried
path = '/img/dataset/A/B/3/5/img1.png'
extracted = path.partition('/')
This gave me the whole path as the output.
This should work:
path[len('/img/dataset'):]
Strings share some properties with lists, indexing and len() being among them.
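On Python 3.9+ there is also str.removeprefix, and pathlib's relative_to raises a clear error when the path doesn't actually start with the given prefix. A small sketch of both, using the same example path:
from pathlib import PurePosixPath

path = '/img/dataset/A/B/3/5/img1.png'

extracted = path.removeprefix('/img/dataset')   # Python 3.9+
print(extracted)                                # /A/B/3/5/img1.png

extracted = '/' + str(PurePosixPath(path).relative_to('/img/dataset'))
print(extracted)                                # /A/B/3/5/img1.png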
I'm extremely new to Python, so I'm sorry if my question is rather stupid.
I have a bunch of long strings in one list and shortened parts of them in another list.
Like:
A long string is D:\\Python\\Songs\\testsong.mp3 and the matching short string is \\testsong.mp3.
I want to get a list containing 'Songs' (basically the name of the folder containing the mp3). Since I have multiple folders and multiple songs, I tried re.findall, but it only accepts fixed patterns and my patterns change with the different song names.
from pathlib import Path
directories = [
    "D:/Python/Songs/testsong.mp3",
    "C:/users/xyz/desktop/anothersong.mp3"
]
song_folder_names = set(Path(directory).parts[-2] for directory in directories)
print(song_folder_names)
Output:
{'desktop', 'Songs'}
>>>
Notice that the order of the folder names is not preserved, because I'm using a set.
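If the original order matters, a small variation (a sketch, assuming the same directories list) is to deduplicate with dict.fromkeys, which keeps first-seen order:
from pathlib import Path

directories = [
    "D:/Python/Songs/testsong.mp3",
    "C:/users/xyz/desktop/anothersong.mp3"
]

# dict keys keep insertion order (Python 3.7+), so duplicates are dropped
# without shuffling the folder names
song_folder_names = list(dict.fromkeys(Path(d).parts[-2] for d in directories))
print(song_folder_names)  # ['Songs', 'desktop']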
I am working on a Python function which parses a file containing a list of strings.
It is basically a walked folder structure dumped to a txt file, so I don't have to work on the real RAID while in production. Working from a txt file containing a list of paths is also a requirement.
lpaths = [
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1025.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1042.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1016.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v1.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v02.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/workspace.cfg',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/SC11_1_Shot004_v01.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/Shot004_camera_v01.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1112.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1034.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1116.exr',
    '/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1126.exr'
]
This is a partial list of the cleaned version I've already worked out, which works fine.
The real problem: I need to parse all frames in a folder into a list so it holds a proper sequence.
There could be 1 frame or 1000, and there are multiple sequences in the same folder, as seen in the list.
My goal is to have a list for each sequence in a folder, so I can push them ahead to do more work down the road.
Code:
import itertools, pprint as pp

groups = [list(group) for key, group in itertools.groupby(sorted(lpaths), len)]
pp.pprint(groups)
Since you seem to have differing naming conventions, you need to write a function that takes a single string and, possibly using regular expressions, returns an unambiguous key for you to sort and group on. Say your names are critically identified by the shot number, which can be matched with r".*[Ss]hot_?(\d+).*\.ext"; you could return the match as a base-10 integer, which discards any leading 0s.
Since you may also have a version number, you could do a similar operation to get an unambiguous version number (and possibly only process the latest version of a given shot).
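A minimal sketch of that approach, using the lpaths list from the question (the regex below is an assumption based on the sample names, so a real naming convention may need a different pattern):
import re
import itertools

def sequence_key(path):
    # parse (shot number, version number) out of a path; returns None when
    # the path doesn't look like part of a versioned sequence (e.g. workspace.cfg)
    m = re.search(r'[Ss]hot_?(\d+).*?[Vv](\d+)', path)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))  # base-10 ints discard leading zeros

# keep only paths that yield a key, sort so equal keys sit next to each other,
# then group into one list per (shot, version) sequence
keyed = sorted((p for p in lpaths if sequence_key(p) is not None),
               key=lambda p: (sequence_key(p), p))
groups = [list(group) for _, group in itertools.groupby(keyed, key=sequence_key)]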
I am conducting a big data analysis using PySpark. I am able to import all CSV files, stored in a particular folder of a particular bucket, using the following command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')
(where * acts like a wildcard)
The issues I have are the following:
What if I want to do my analysis on both the 2014 and 2015 data, i.e. file 1 is .load('file:///home/path/SFweather/data2014/*.csv'), file 2 is .load('file:///home/path/SFweather/data2015/*.csv'), file 3 is .load('file:///home/path/NYCweather/data2014/*.csv') and file 4 is .load('file:///home/path/NYCweather/data2015/*.csv')? How do I import multiple paths at the same time to get one dataframe? Do I need to store them all individually as dataframes and then combine them within PySpark? (You may assume all the CSVs have the same schema.)
Suppose it is November 2014 now. What if I want to run the analysis again later, but on the "most recent data" run, e.g. dec14 when it is December 2014? For example, in December 2014 I would want to load file 2, .load('file:///home/path/datafolder/data2014/dec14/*.csv'), while using .load('file:///home/path/datafolder/data2014/nov14/*.csv') for the original analysis. Is there a way to schedule the Jupyter notebook (or similar) to update the load path and import the latest run (in this case 'nov14' would be replaced by 'dec14', then 'jan15', etc.)?
I had a look through the previous questions but was unable to find an answer, given that this is specific to the AWS/PySpark integration.
Thank you in advance for the help!
[Background: I have been given access to many S3 buckets from various teams containing various big data sets. Copying the data over to my own S3 bucket and then building a Jupyter notebook seems like a lot more work than just pulling the data directly from their buckets, building a model/table/etc. on top of it, and saving the processed output into a database. Hence I am posting the questions above. If my thinking is completely wrong, please stop me! :)]
You can read in multiple paths with wildcards as long as the files are all in the same format.
In your example:
.load('file:///home/path/SFweather/data2014/*.csv')
.load('file:///home/path/SFweather/data2015/*.csv')
.load('file:///home/path/NYCweather/data2014/*.csv')
.load('file:///home/path/NYCweather/data2015/*.csv')
You could replace the four load statements above with the following path to read all the CSVs in at once into one dataframe:
.load('file:///home/path/*/*/*.csv')
If you want to be more specific in order to avoid reading in certain files/folders, you can do the following:
.load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')
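Put together (a sketch reusing the reader call and options from the question), the four loads collapse into a single statement:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')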
You can load multiple paths at once using lists of pattern strings. The pyspark.sql.DataFrameReader.load method accepts a list of path strings, which is especially helpful if you can't express all of the paths you want to load using a single Hadoop glob pattern. For reference, Hadoop glob patterns support the following:
?
Matches any single character.
*
Matches zero or more characters.
[abc]
Matches a single character from character set {a,b,c}.
[a-b]
Matches a single character from the character range {a...b}.
Note that character a must be lexicographically less than or
equal to character b.
[^a]
Matches a single character that is not from character set or
range {a}. Note that the ^ character must occur immediately
to the right of the opening bracket.
\c
Removes (escapes) any special meaning of character c.
{ab,cd}
Matches a string from the string set {ab, cd}
{ab,c{de,fh}}
Matches a string from the string set {ab, cde, cfh}
For example, if you want to load the following paths:
[
    's3a://bucket/prefix/key=1/year=2010/*.csv',
    's3a://bucket/prefix/key=1/year=2011/*.csv',
    's3a://bucket/prefix/key=2/year=2020/*.csv',
    's3a://bucket/prefix/key=2/year=2021/*.csv',
]
You could reduce these to two path patterns,
s3a://bucket/prefix/key=1/year=201[0-1]/*.csv and
s3a://bucket/prefix/key=2/year=202[0-1]/*.csv,
and call load() twice. You could go further and reduce these to a single pattern string using {ab,cd} alternation, but I think the most readable way to express paths like these using glob patterns with a single call to load() is to pass a list of path patterns:
spark.read.format('csv').load(
    [
        's3a://bucket/prefix/key=1/year=201[0-1]/*.csv',
        's3a://bucket/prefix/key=2/year=202[0-1]/*.csv',
    ]
)
For the paths you listed in your issue № 1, you can express all four with a single pattern string:
'file:///home/path/{NY,SF}weather/data201[45]/*.csv'
For your issue № 2, you can write logic to construct the paths you want to load.
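A small sketch of that idea (the 'nov14'/'dec14' folder-name format comes from the question; how those names are derived from the current date is an assumption):
import datetime

def latest_run_path(base='file:///home/path/datafolder'):
    # build e.g. 'dec14' from today's date and splice it into the load path
    today = datetime.date.today()
    month_folder = today.strftime('%b').lower() + today.strftime('%y')  # 'dec14', 'jan15', ...
    return '{}/data{}/{}/*.csv'.format(base, today.year, month_folder)

df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load(latest_run_path())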
I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot number (a unique protein identifier), and MutationOutput is self-explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
Here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory contains about 8,116 files, so not that many.
Thank you for your help!
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
... pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
... return glob.glob(pattern)[0]
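For example (assuming a file with that Uniprot number exists, as in the directory listing from the question):
>>> find_gene('A8K2U0')
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
>>> find_gene('A8K2U0').split('/')[-1].split('_')[0]  # just the gene name
'A2ML1'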
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    d[uniprot] = os.path.join('mutation_directory', f)  # store the full relative path
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
You can use glob:
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']
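To pull just the gene name out of that match (based on the naming convention described in the question):
In [7]: files[0].split('_')[0]
Out[7]: 'A4GNT'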