Edit file names in Python according to certain rules - python

I have a great number of files whose names are structured as follows:
this_is_a_file.extension
I need to strip from each name everything from the last underscore onward (underscore included), preserve the extension, and save the file under the new name in another directory.
Note that these names have variable length, so I cannot leverage single characters' position.
Also, the number of underscores differs from name to name; otherwise I would have applied something similar to the approach in "split a file name".
How can I do it?

You could create a function that splits the original filename along underscores, and splits the last segment along periods. Then you can join it all back together again like so:
def myJoin(filename):
    splitFilename = filename.split('_')
    extension = splitFilename[-1].split('.')
    splitFilename.pop(-1)
    return '_'.join(splitFilename) + '.' + extension[-1]
Some examples to show it working:
>>> p="this_is_a_file.extension"
>>> myJoin(p)
'this_is_a.extension'
>>> q="this_is_a_file_with_more_segments.extension"
>>> myJoin(q)
'this_is_a_file_with_more.extension'
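The answer above covers the renaming rule but not the "save into another directory" part of the question. A minimal sketch of both steps follows, assuming the files sit directly in one source directory and should be copied rather than moved. The names strip_last_segment and copy_renamed are hypothetical helpers, and the use of pathlib and shutil is a choice made here, not something the original answer prescribes.

```python
import shutil
from pathlib import Path


def strip_last_segment(filename: str) -> str:
    """Drop everything from the last underscore of the stem, keep the extension."""
    path = Path(filename)
    stem, sep, _ = path.stem.rpartition("_")
    if not sep:  # no underscore in the stem: leave the name unchanged
        return path.name
    return stem + path.suffix


def copy_renamed(src_dir: str, dst_dir: str) -> None:
    """Copy every file in src_dir into dst_dir under its shortened name."""
    target = Path(dst_dir)
    target.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, target / strip_last_segment(src.name))
```

Note that two different source names can shorten to the same target name (for example a_b_1.txt and a_b_2.txt), in which case the later copy would overwrite the earlier one; a real script would probably want to check for that.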

Related

Get File name from a dash delimited string in python lambda

I have a bunch of file names like this in S3:
1623130500-1623130500-Photo-verified-20210631-0-22.csv.gz
1623130500-1623130500-Add-to-cart-20210631-0-4.csv.gz
In Python code running in Lambda, can I extract only Photo-verified / Add-to-cart from the above?
I need a solution that gives me that name at runtime from this kind of string.
I think you are asking how to extract either Photo-verified or Add-to-cart from the above strings.
You can split on - and then extract the portion you want. Basically, you don't want the first two parts or the last 3 parts, so use:
filename.split('-')[2:-3]
That will return a list with:
['Photo', 'verified']
You could then join() them together using:
'-'.join(filename.split('-')[2:-3])
This would give:
Photo-verified
On the second string, it would give:
Add-to-cart
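Wrapped in a function, the slice-and-join approach might look like the sketch below. The function name extract_event_name is hypothetical, and the step that drops a leading key prefix is an assumption added here in case the S3 key contains folders; the question only shows bare file names.

```python
def extract_event_name(key: str) -> str:
    """Return the middle part of a name like
    1623130500-1623130500-Photo-verified-20210631-0-22.csv.gz."""
    # Assumption: the key may carry a folder prefix, so keep only the last component.
    filename = key.rsplit("/", 1)[-1]
    parts = filename.split("-")
    # Skip the two leading timestamps and the three trailing fields.
    return "-".join(parts[2:-3])


print(extract_event_name("1623130500-1623130500-Photo-verified-20210631-0-22.csv.gz"))
print(extract_event_name("1623130500-1623130500-Add-to-cart-20210631-0-4.csv.gz"))
```

This relies on the name always starting with exactly two dash-separated fields and ending with exactly three; a name with a different layout would yield the wrong slice or an empty string.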

How to extract specific characters from a string that can vary

I'm trying to extract a specific part of a file name that can contain a varying number of '_' characters. I previously used partition/rpartition to strip everything before and after the underscores, but I didn't take into account that the number of underscores can differ.
The purpose of the code is to extract specific characters in between underscore bars.
filename = os.path.basename(files).partition('_')[2].rpartition('_')[0].rpartition('_')[0].rpartition('_')[0]
The above is my current code. A typical name of the file looks like:
P0_G12_190325184517_t20190325_5
or it can also have
P0_G12_190325184517_5
From what I understand, my current code's rpartition needs to match the number of underscore bars of the file for the first file, but the same code doesn't work for the second file obviously.
I want to extract
G12
This part can also be just two characters, like G1, so I need two to three characters from filenames of the above types.
You can use:
os.path.basename(files).split('_')[1]
You could either use split to create a list with the separate parts, like this:
files.split('_')
Or you could use regex:
https://regex101.com/r/jiUNLV/1
And do like this:
import re
pattern = r'.*_(\w{2,3})_\d+.*'
match = re.match(pattern, files)
if match:
    print(match.group(1))
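Since the wanted part is always the second underscore-separated field in both sample names, the split approach does not depend on how many underscores follow. A small sketch, assuming that position holds for every file; second_field is a hypothetical helper name:

```python
import os


def second_field(path: str) -> str:
    """Return the second underscore-separated field of a file's base name."""
    name = os.path.basename(path)
    parts = name.split("_")
    if len(parts) < 2:
        raise ValueError(f"no underscore-separated field in {name!r}")
    return parts[1]


print(second_field("P0_G12_190325184517_t20190325_5"))
print(second_field("P0_G12_190325184517_5"))
```

Raising on names without an underscore is a design choice made for this sketch; returning None instead would work equally well if such files should simply be skipped.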

Parsing full names from a list of names

I am using namesparser to extract full names from a list of names.
from namesparser import HumanNames
names = HumanNames('Randy Heimerman, James Durham, Nate Green')
print(names.human_names[0])
Namesparser works well in most cases, but the above example is getting hung up. I believe it is because the name "Randy" includes "and", which namesparser is treating as a separator.
When I move Randy's name to the end of the string, the correct name is printed (James Durham). If I try to print either of the 2 other names, though, the wrong strings are returned.
Any ideas on how I can resolve this?
I think you should use the comma , as your delimiter.
def print_names(name_string):
    return (name.strip() for name in name_string.split(","))
This splits your string on the commas and strips leading and trailing whitespace from each piece, returning a generator of names.
Now that you have a generator of names, you can pass it into other things for example:
humans = [HumanName(name) for name in print_names(name_string)]
But then again, I don't know what your class HumanNames / HumanName really does, and you didn't include a class definition.
If you are looking at this module: https://pypi.org/project/nameparser/ which takes a string containing a single name, the above will still work.
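To see that splitting on commas avoids the "and" inside "Randy", here is a library-free sketch of the same idea that returns a list rather than a generator. The name split_names is hypothetical, and skipping empty pieces is an extra assumption for inputs with a trailing comma.

```python
def split_names(name_string: str) -> list:
    """Split a comma-separated string of names and trim surrounding whitespace."""
    return [name.strip() for name in name_string.split(",") if name.strip()]


names = split_names("Randy Heimerman, James Durham, Nate Green")
print(names[0])
```

Each resulting string could then be handed to whatever single-name parser is in use.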

Splitting a file with multiple but not all delimiters in Python

I know there's been several answers to questions regarding multiple delimiters, but my issue involves needing to delimit by multiple delimiters but not all of them. I have a file that contains the following:
((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;
It is newick format so all information is in one long line. What I would like to do is isolate all the numbers that follow another number. So for example the first number I would like to isolate is 0.001405. I would like to put that in a list with all the other numbers that follow a number (not a name etc).
I tried to use the following code:
with open("file.nh", "r") as f:
    for line in f:
        data = line
        z = re.findall(r"[\w']+", data)
The issue here is that this splits the list using "." as well as the other delimiters and this is a problem because all the numbers I require have decimal points.
I considered going along with this and converting the numbers in the list to ints and then removing all non-int values and 0 values. However, some of the files contain 0 as a value that needs to be kept.
So is there a way of choosing which delimiters to use and which to avoid when multiple delimiters are required?
It's not necessary to split by multiple but not all delimiters if you set up your regex to catch the wanted parts. By your definition, you could use every number after ):. Using the re module a possible solution is this:
import re

with open("file.nh", "r") as f:
    for line in f:
        # capture the number that directly follows each "):"
        z = re.findall(r"\):([0-9.]+)", line)
        print(z)
The result is:
['0.001405', '0.002085', '0', '0.003211', '0.068265', '0.023746', '1.101881']
r"\):([0-9.]+)" searches for ): followed by a run of digits and decimal points. That run is the part you want, so it is placed inside parentheses as a capture group, and findall returns only the captured text.
As Alex Hall mentioned, in most cases it's not a good idea to use regex if the data is well structured. Look for libraries that work with the given data structure instead.
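The same pattern can be tried without a file. The sketch below runs it on a short, made-up Newick string (not the data from the question) and converts the matches to floats, so that a 0 value is kept as 0.0 rather than discarded. The names SAMPLE and internal_branch_lengths are hypothetical.

```python
import re

# Made-up sample tree, much shorter than the one in the question.
SAMPLE = "((A:0.1,B:0.2):0.05,C:0.3):0;"

# A number that directly follows a closing parenthesis and a colon.
AFTER_PAREN = re.compile(r"\):([0-9.]+)")


def internal_branch_lengths(newick: str) -> list:
    """Return the numbers that follow '):' as floats, zeros included."""
    return [float(value) for value in AFTER_PAREN.findall(newick)]


print(internal_branch_lengths(SAMPLE))
```

The character class [0-9.] does not handle scientific notation or negative values; if the files can contain those, the pattern would need widening, or a dedicated Newick parser would be the safer route.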

How to add a comma to the end of a list efficiently?

I have a list of horizontal names that is too long to open in Excel. It's 90,000 names long. I need to add a comma after each name to put it into my program. I tried find/replace, but it freezes up my computer and crashes. Is there a clever way I can get a comma at the end of each name? My options to work with are Python and Excel. Thanks.
If you actually had a Python list, say names, then ','.join(names) would make it into a string with a comma between each name and the following one (if you need one at the end as well, just use + ',' to append one more comma to the result).
Even though you say you have "a list" I suspect you actually have a string instead, for example in a file, where the names are separated by...? You don't tell us, and therefore force us to guess. For example, if they're separated by line-ends (one name per line), your life is easiest:
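with open('yourfile.txt') as f:
    # strip each line's trailing newline so it doesn't end up inside the result
    result = ','.join(line.rstrip('\n') for line in f)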
(again, supplement this with a + ',' after the join if you need that, of course). That's because separation by line-ends is the normal default behavior for a text file, of course.
If the separator is something different, you'll have to read the file's contents as a string (with f.read()) and split it up appropriately then join it up again with commas.
For example, if the separator is a tab character:
with open('yourfile.txt') as f:
    result = ','.join(f.read().split('\t'))
As you see, it's not so much worse;-).
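Both cases can be folded into one small helper that works on text already read from the file. This is only a sketch: join_with_commas is a hypothetical name, the default separator of a newline is an assumption, and blank entries are dropped so that a trailing newline in the file does not produce an empty name.

```python
def join_with_commas(text, sep="\n", trailing=True):
    """Split text on sep, trim each name, and join the names with commas."""
    names = [name.strip() for name in text.split(sep) if name.strip()]
    result = ",".join(names)
    # Append the final comma only when asked for and when there is at least one name.
    return result + "," if trailing and names else result


print(join_with_commas("Ann\nBob\nCy\n"))
print(join_with_commas("Ann\tBob\tCy", sep="\t", trailing=False))
```

With 90,000 names this still runs in a fraction of a second, since split and join each make a single pass over the text.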
