How to use variables in re.findall? - python

I'm extremely new to Python, so I'm sorry if my question is rather stupid.
I have a bunch of long strings in one list and shortened parts of them in another list.
Like:
A long string is D:\\Python\\Songs\\testsong.mp3 and the short string is \\testsong.mp3.
I want to get a list containing 'Songs' (basically the name of the folder containing the mp3). But I have multiple folders and multiple songs, so I tried re.findall; the problem is that it only accepts fixed patterns, and my patterns change because the song names differ.

from pathlib import Path
directories = [
"D:/Python/Songs/testsong.mp3",
"C:/users/xyz/desktop/anothersong.mp3"
]
song_folder_names = set(Path(directory).parts[-2] for directory in directories)
print(song_folder_names)
Output:
{'desktop', 'Songs'}
Notice that the order of the folder names is not preserved, because I'm using a set.
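If the order matters, one option (a small sketch, not part of the original answer) is to deduplicate with dict.fromkeys instead of a set, since dict keys preserve insertion order in Python 3.7+:

from pathlib import Path

directories = [
    "D:/Python/Songs/testsong.mp3",
    "C:/users/xyz/desktop/anothersong.mp3",
]

# dict.fromkeys keeps the first occurrence of each folder name, in input order
song_folder_names = list(dict.fromkeys(Path(d).parts[-2] for d in directories))
print(song_folder_names)  # ['Songs', 'desktop']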

Related

Uppercase the names of multiple files in a directory in Python

I'm working on a small project that requires that I use Python to uppercase all the names of files in a certain directory "ex: input: Brandy.jpg , output: BRANDY.jpg".
The thing is, I've never done this on multiple files before; what I've done was the following:
universe = os.listdir('parallel_universe/')
universe = [os.path.splitext(x)[0].upper() for x in universe]
But what I've done only capitalized the names in the list, not the files in the directory itself; the output was like the following:
['ADAM SANDLER','ANGELINA JULIE','ARIANA GRANDE','BEN AFFLECK','BEN STILLER','BILL GATES', 'BRAD PITT','BRITNEY SPEARS','BRUCE LEE','CAMERON DIAZ','DWAYNE JOHNSON','ELON MUSK','ELTON JOHN','JACK BLACK','JACKIE CHAN','JAMIE FOXX','JASON SEGEL', 'JASON STATHAM']
What am I missing here? And since I don't have much experience in Python, I'd love if your answers include explanations for each step, and thanks in advance.
Right now, you are converting the strings to uppercase, but that's it. There is no actual renaming being done. In order to rename, you need to use os.rename.
If you were to wrap your code with os.rename, it should solve your problem, like so:
[os.rename("parallel_universe/" + x, "parallel_universe/" + os.path.splitext(x)[0].upper() + os.path.splitext(x)[1]) for x in universe]
I have removed the assignment universe = because this line no longer returns the new names; you would instead get a list full of None objects (os.rename returns None).
Docs for os.rename: https://docs.python.org/3/library/os.html#os.rename
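If the side-effect list comprehension reads awkwardly, a plain loop does the same job; this is just a sketch assuming the same parallel_universe/ folder and that every entry in it is a file you want renamed:

import os

folder = 'parallel_universe/'
for name in os.listdir(folder):
    stem, ext = os.path.splitext(name)
    # e.g. 'Brandy.jpg' -> 'BRANDY.jpg'; the extension is kept as-is
    os.rename(os.path.join(folder, name), os.path.join(folder, stem.upper() + ext))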

I want to be able to group same file sequence from a list python

I'm working on a Python function which parses a file containing a list of strings.
It's basically a walked folder structure dumped to a txt file, so I don't have to work on the real RAID while in production. That is also a requirement: to work from a txt file containing a list of paths.
lpaths =[
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1025.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1042.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1016.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v1.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v02.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/workspace.cfg',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/SC11_1_Shot004_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/Shot004_camera_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1112.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1034.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1116.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1126.exr'
]
This is a partial version of the cleaned-up list I've already worked out, and it works fine.
The real problem: I need to parse all the frames from a folder into a list so that it holds a proper sequence.
There could be 1 frame or 1000, and there are multiple sequences in the same folder, as seen in the list.
My goal is to have a list for each sequence in a folder, so I can push them ahead to do more work down the road.
Code:
import itertools
import pprint as pp

groups = [list(group) for key, group in itertools.groupby(sorted(lpaths), len)]
pp.pprint(groups)
Since you seem to have differing naming conventions, you need to write a function that takes a single string and, possibly using regular expressions, returns an unambiguous key for you to sort on. Let's say your names are critically identified by the shot number, which can be picked out with r".*[Ss]hot_?(\d+).*\.ext"; you could return the match as a base-10 integer, discarding any leading zeros.
Since you may also have a version number, you could do a similar operation to get an unambiguous version number (and possibly only process the latest version of a given shot).
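A rough sketch of that approach; the regexes and the key function here are hypothetical, so adapt them to your real naming conventions:

import itertools
import re

shot_re = re.compile(r'[Ss]hot_?(\d+)')
version_re = re.compile(r'[._]v(\d+)')

def sequence_key(path):
    # Return (shot, version) as base-10 ints so leading zeros are discarded;
    # -1 means no recognisable shot or version number in the path.
    shot = shot_re.search(path)
    version = version_re.search(path)
    return (int(shot.group(1)) if shot else -1,
            int(version.group(1)) if version else -1)

groups = [list(group) for _, group in
          itertools.groupby(sorted(lpaths, key=sequence_key), key=sequence_key)]

With the lpaths list above this puts each shot/version combination into its own sub-list; whether frames of the same shot but different versions should stay together is a policy decision left to you.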

Reading Multiple S3 Folders / Paths Into PySpark

I am conducting a big data analysis using PySpark. I am able to import all CSV files, stored in a particular folder of a particular bucket, using the following command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')
(where * acts like a wildcard)
The issues I have are the following:
What if I want to do my analysis on the 2014 and 2015 data, i.e. file 1 is .load('file:///home/path/SFweather/data2014/*.csv'), file 2 is .load('file:///home/path/SFweather/data2015/*.csv'), file 3 is .load('file:///home/path/NYCweather/data2014/*.csv') and file 4 is .load('file:///home/path/NYCweather/data2015/*.csv')? How do I import multiple paths at the same time to get one dataframe? Do I need to store them all individually as dataframes and then join them together within PySpark? (You may assume all the CSVs have the same schema.)
Suppose it is November 2014 now. What if I want to run the analysis again later on the most recent data? For example, in December 2014 I would want to load .load('file:///home/path/datafolder/data2014/dec14/*.csv'), whereas the original analysis used .load('file:///home/path/datafolder/data2014/nov14/*.csv'). Is there a way to schedule the Jupyter notebook (or similar) to update the load path and import the latest run (in this case 'nov14' would be replaced by 'dec14', then 'jan15', etc.)?
I had a look through the previous questions but was unable to find an answer given this is AWS / PySpark integration specific.
Thank you in advance for the help!
[Background: I have been given access to many S3 buckets from various teams containing various big data sets. Copying it all over to my own S3 bucket and then building a Jupyter notebook seems like a lot more work than just pulling in the data directly from their buckets, building a model / table / etc. on top of it, and saving the processed output into a database. Hence the questions above. If my thinking is completely wrong, please stop me! :)]
You can read in multiple paths with wildcards as long as the files are all in the same format.
In your example:
.load('file:///home/path/SFweather/data2014/*.csv')
.load('file:///home/path/SFweather/data2015/*.csv')
.load('file:///home/path/NYCweather/data2014/*.csv')
.load('file:///home/path/NYCweather/data2015/*.csv')
You could replace the 4 load statements above with the following path to read all CSVs in at once into one dataframe:
.load('file:///home/path/*/*/*.csv')
If you want to be more specific in order to avoid reading in certain files/folders, you can use brace alternation and a character class:
.load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')
You can load multiple paths at once using lists of pattern strings. The pyspark.sql.DataFrameReader.load method accepts a list of path strings, which is especially helpful if you can't express all of the paths you want to load using a single Hadoop glob pattern:
?              Matches any single character.
*              Matches zero or more characters.
[abc]          Matches a single character from character set {a,b,c}.
[a-b]          Matches a single character from the character range {a...b}. Note that character a must be lexicographically less than or equal to character b.
[^a]           Matches a single character that is not from character set or range {a}. Note that the ^ character must occur immediately to the right of the opening bracket.
\c             Removes (escapes) any special meaning of character c.
{ab,cd}        Matches a string from the string set {ab, cd}.
{ab,c{de,fh}}  Matches a string from the string set {ab, cde, cfh}.
For example, if you want to load the following paths:
[
    's3a://bucket/prefix/key=1/year=2010/*.csv',
    's3a://bucket/prefix/key=1/year=2011/*.csv',
    's3a://bucket/prefix/key=2/year=2020/*.csv',
    's3a://bucket/prefix/key=2/year=2021/*.csv',
]
You could reduce these to two path patterns,
s3a://bucket/prefix/key=1/year=201[0-1]/*.csv and
s3a://bucket/prefix/key=2/year=202[0-1]/*.csv,
and call load() twice. You could go further and reduce these to a single pattern string using {ab,cd} alternation, but the most readable way to express paths like these with a single call to load() is to pass a list of path patterns:
spark.read.format('csv').load(
    [
        's3a://bucket/prefix/key=1/year=201[0-1]/*.csv',
        's3a://bucket/prefix/key=2/year=202[0-1]/*.csv',
    ]
)
For the paths you listed in your issue № 1, you can express all four with a single pattern string:
'file:///home/path/{NYC,SF}weather/data201[45]/*.csv'
For your issue № 2, you can write logic to construct the paths you want to load.
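For issue № 2, a hedged example of that logic, assuming the run folders really are named like nov14 / dec14 / jan15 and that spark is your active SparkSession:

from datetime import date

today = date.today()
run_folder = today.strftime('%b%y').lower()   # e.g. 'dec14' when run in December 2014
latest_path = 'file:///home/path/datafolder/data{}/{}/*.csv'.format(today.year, run_folder)
df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load(latest_path)

Running the notebook on a schedule (cron, a workflow tool, etc.) then picks up the newest month automatically without editing the path by hand.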

How to find a specific file in Python

I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot Number (a unique protein identifier) and MutationOutput is self explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory is about 8,116 files -- so not that many.
Thank you for your help!
What I need to do is correctly identify the gene name. So somehow I need to quickly search over that directory, find the file that has the line[i+1] value in its Uniprot field, and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
...     pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
...     return glob.glob(pattern)[0]
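For example, find_gene('A8K2U0') would return 'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'; if nothing matches, glob.glob returns an empty list and the [0] raises an IndexError, so you may want to handle that case.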
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    # keep the directory prefix so the lookup below returns a usable path
    d[uniprot] = os.path.join('mutation_directory', f)
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
You can use glob
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']

Find list of common parent path strings in a list of path strings using python

What is the most effective way to find a list of the longest common parent path strings in a list of path strings using Python?
Additional note: where there are two or more matches, I would like to descend as necessary so as to create as few redundant paths as possible.
Input list
input_paths = [
'/project/path/to/a/directory/of/files',
'/project/path/to/a/directory/full/of/files',
'/project/path/to/some/more/files',
'/project/path/to/some/more/directories/of/files',
'/project/path/to/another/file',
'/project/mount/another/path/of/files',
'/project/mount/another/path/of/test/stuff',
'/project/mount/another/path/of/files/etc',
'/project/mount/another/drive/of/things',
'/project/local/folder/of/documents'
]
filter_path = '/project'
Output list
common_prefix_list = [
'path/to/a/directory',
'path/to/some/more',
'path/to/another',
'mount/another/path/of',
'mount/another/drive/of',
'local/folder/of'
]
My rudimentary guess is to split into lists on os.sep and then use set intersection, but I believe there are more robust algorithms for what is essentially a longest common substring problem. I'm sure this has been done a million times before, so please offer up your elegant solution.
My end task is to collect assets for a project that are scattered across disparate paths into one common folder, with a structure that neither creates conflicts between individual assets nor produces overly redundant paths (hence the filter_path).
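One building block for this: os.path.commonpath already finds the longest common parent of a group of paths, comparing whole path components rather than characters. A minimal sketch (not a full solution, since it assumes the grouping is already known) using two of the paths above:

import os.path

group = [
    '/project/path/to/a/directory/of/files',
    '/project/path/to/a/directory/full/of/files',
]

common = os.path.commonpath(group)            # '/project/path/to/a/directory'
print(os.path.relpath(common, '/project'))    # 'path/to/a/directory'

Deciding which paths belong in the same group (the clustering step that produces the six entries in the output list) is the part that still needs its own logic, for example a trie of path components.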
