How to extract first part of a file name? - python

Newbie to Python here. I've been trying to iterate through filenames in a loop and grab the first part of the file name with Python.
My file names are structured as such: "Pitt_0050003_rest.nii.gz". I only want the "Pitt_0050003" part (keep in mind, the file names are various lengths).
Here's the code I've been trying:
fileid = []
for f in dataset:
    #print(f)
    comp=f.split('/')
    fs = (comp[-1]) #get the file name without nii.gz extension
    res = re.findall("_rest.nii(\d-)", f) #get the file name without _rest?
    if not res: continue
    fileid.append(res)
print (fileid)
Any tips?

If all your files have '_rest' in the name, you can slice the string up to that marker:
string = "Pitt_0050003_rest.nii.gz"
string = string[:string.index('_rest')]
# string is now 'Pitt_0050003'
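One caveat with str.index: it raises ValueError when the marker is missing. Here is a minimal sketch of a more forgiving variant, assuming the '_rest' marker from the question. The helper name strip_rest_suffix is made up for illustration.

```python
def strip_rest_suffix(filename, marker="_rest"):
    """Return the part of filename before the first marker, or filename unchanged."""
    # str.partition never raises: if marker is absent, sep is an empty string
    head, sep, _tail = filename.partition(marker)
    return head if sep else filename

print(strip_rest_suffix("Pitt_0050003_rest.nii.gz"))  # Pitt_0050003
print(strip_rest_suffix("no_marker_here.nii.gz"))     # no_marker_here.nii.gz
```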

You can also split on underscores and drop the last piece, provided the naming convention is the same for all of your filenames.
> myfile = "Pitt_0050003_rest.nii.gz"
> first_name = myfile.split('_')
> first_name
['Pitt', '0050003', 'rest.nii.gz']
> first_name.pop()
'rest.nii.gz'
>
> first_name
['Pitt', '0050003']
>
> '_'.join(first_name)
'Pitt_0050003'
>
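Since the loop in the question starts from full paths, the two steps (drop the directory, drop the trailing part) can be combined. This is only a sketch, assuming every name ends in one underscore-separated suffix such as '_rest.nii.gz'. The dataset list and its paths are hypothetical stand-ins for the asker's data.

```python
import os

# Hypothetical stand-in for the asker's list of paths
dataset = [
    "/data/abide/Pitt_0050003_rest.nii.gz",
    "/data/abide/NYU_0051234_rest.nii.gz",
]

fileid = []
for f in dataset:
    base = os.path.basename(f)  # e.g. "Pitt_0050003_rest.nii.gz"
    # rsplit with maxsplit=1 cuts only at the last underscore
    fileid.append(base.rsplit("_", 1)[0])

print(fileid)  # ['Pitt_0050003', 'NYU_0051234']
```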

Related

Find the email address that occurs the most in a txt file

I have to go through a txt file which contains all manner of info and pull the email address that occurs the most therewithin.
My code is as follows, but it does not work. It prints no output and I am not sure why. Here is the code:
name = input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
names = handle.readlines()
count = dict()
for name in names:
    name = name.split()
    for letters in name:
        if '#' not in letters:
            name.remove(letters)
        else:
            continue
    name = str(name)
    if name not in count:
        count[name] = 1
    else:
        count[name] = count[name]+ 1
print(max(count, key=count.get(1)))
As I understand it, this code works as follows:
we first open the file, then we read the lines, then we create an empty dict
Then in the first for loop, we split the txt file into a list based on each line.
Then, in the second for loop, for each item in each line, if there is no #, then it is removed.
We then return to the original for loop, where, if the name is not a key in the dict, it is added with a value of 1; otherwise one is added to its value.
Finally, we print the max key & value.
Where did I go wrong???
Thank you for your help in advance.
You need to change the last line to:
print(max(count, key=count.get))
EDIT
To explain further:
With key=count.get(1) you were not passing a function to max() at all.
count.get(1) is evaluated first: it looks up the key 1, which is not in the dictionary, so it returns the default value, None.
With key=None, max() compares the keys themselves and returns the largest string key in your dictionary (as long as all your keys are strings and the dictionary is not empty).
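To see the difference concretely, here is a small sketch with a made-up dictionary (the addresses and counts are invented for illustration):

```python
count = {"zed#example.com": 1, "amy#example.com": 3}

# count.get(1) looks up the key 1, finds nothing, and returns None,
# so max() falls back to comparing the keys themselves (alphabetically)
by_key = max(count, key=count.get(1))

# Passing the method itself makes max() compare the stored counts
by_count = max(count, key=count.get)

print(by_key)    # zed#example.com
print(by_count)  # amy#example.com
```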
Please use the following code:
names = '''hola#hola.com
whatsap#hola.com
hola#hola.com
hola#hola.com
klk#klk.com
klk#klk.com
klk#klk.com
klk#klk.com
klk#klk.com
whatsap#hola.com'''
count = list(names.split("\n"))
sett = set(names.split("\n"))
highest = count.count(count[0])
theone = count[0]
for i in sett:
    l = count.count(i)
    if l > highest:
        highest = l
        theone = i
print(theone)
Output:
klk#klk.com
Import the regular expressions module (re), as it helps with extracting the emails.
import re
name = input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
names = handle.read()
email_ids = re.findall(r"[0-9a-zA-Z._+%]+#[0-9a-zA-Z._+%]+[.][0-9a-zA-Z.]+", names)
# sorted() returns the sorted list; list.sort() sorts in place and returns None
counted = sorted(set((email_ids.count(email_id), email_id) for email_id in email_ids), reverse=True)
# a list keeps the order; a set would lose it
email_ids = [email_id for _, email_id in counted]
In the variable email_ids you will get a list of the unique emails arranged by number of occurrences, in descending order.
I know that the code is lengthy and has a few redundant lines, but they are there to make the code self-explanatory.
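The counting and sorting can also be left to collections.Counter from the standard library. This is a sketch, assuming the same '#'-separated address format as above and a simplified pattern. The sample text is invented and stands in for the contents of the file.

```python
import re
from collections import Counter

# Invented sample text standing in for the contents of the file
text = """From hola#hola.com Sat Jan 5
From klk#klk.com Sat Jan 5
Cc: klk#klk.com, whatsap#hola.com
From klk#klk.com Sun Jan 6
"""

# Simplified pattern: local part, '#', then a dotted domain
emails = re.findall(r"[\w.+%-]+#[\w-]+(?:\.[\w-]+)+", text)
counts = Counter(emails)

# most_common() lists (address, count) pairs, highest count first
print(counts.most_common(1))  # [('klk#klk.com', 3)]
```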

Find specific substring while iterating through multiple file names

I need to find the identification number of a large number of files while iterating through them.
The file names are loaded onto a list and look like:
ID322198.nii
ID9828731.nii
ID23890.nii
FILEID988312.nii
So the best way to approach this would be to find the number that sits between ID and .nii.
Because the number of digits varies, I can't simply select [-10:-4] of the file name. Any ideas?
You can use a regex:
import re
files = ['ID322198.nii','ID9828731.nii','ID23890.nii','FILEID988312.nii']
[re.findall(r'ID(\d+)\.nii', file)[0] for file in files]
Returns:
['322198', '9828731', '23890', '988312']
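Note that re.findall(...)[0] raises IndexError for a name that does not match the pattern. Here is a sketch that skips such names instead, using re.search. The extra entry 'notes.txt' is made up to show the behaviour.

```python
import re

files = ['ID322198.nii', 'ID9828731.nii', 'notes.txt', 'FILEID988312.nii']
id_pattern = re.compile(r'ID(\d+)\.nii$')

ids = []
for file in files:
    match = id_pattern.search(file)
    if match:  # search() returns None when nothing matches
        ids.append(match.group(1))

print(ids)  # ['322198', '9828731', '988312']
```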
To find the positions of ID and .nii, you can use Python's index() method:
for line in file:
    idpos = line.index("ID")
    nilpos = line.index(".nii")
    data = line[idpos+2:nilpos]
or as a list of ints:
[ int(line[line.index("ID")+2:line.index(".nii")]) for line in file ]
Using rindex:
s = 'ID322198.nii'
s = s[s.rindex('D')+1 : s.rindex('.')]
print(s)
Returns:
322198
Then apply this syntax to each string in the list.
It seems like you could filter the digits out, like this:
digits = ''.join(d for d in filename if d.isdigit())
That will work nicely as long as there are no other digits in the filename (e.g., backups with a .1 suffix or something).
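The caveat is easy to reproduce. In this sketch the second filename, with a '.1' backup suffix, is invented for illustration, and the helper name digits_only is hypothetical.

```python
def digits_only(filename):
    # Keep every digit character, wherever it appears in the name
    return ''.join(d for d in filename if d.isdigit())

print(digits_only('ID23890.nii'))    # 23890
print(digits_only('ID23890.nii.1'))  # 238901, the stray digit is appended
```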
for name in files:
    name = name.replace('.nii', '')
    id_num = name.replace(name.rstrip('0123456789'), '')
How this works:
# example
name = 'ID322198.nii'
# remove '.nii'. -> name1 = 'ID322198'
name1 = name.replace('.nii', '')
# strip all digits from the end. -> name2 = 'ID'
name2 = name1.rstrip('0123456789')
# remove 'ID' from 'ID322198'. -> id_num = '322198'
id_num = name1.replace(name2, '')

How to get part of filename into a variable?

I have a lot of .csv files and I'd like to parse the file names.
The file names are in this format:
name.surname.csv
How can I write a function that populates two variables with the components of the file name?
A = name
B = surname
Use str.split and unpack the result in A, B and another "anonymous" variable to store (and ignore) the extension.
filename = 'name.surname.csv'
A, B, _ = filename.split('.')
Try this, the name is split by . and stored in A and B
a="name.surname.csv"
A,B,C=a.split('.')
Of course, this assumes that your file name is in the form first.second.csv
If the file names always have the exact same form, with exactly two periods, then you can do:
>>> name, surname, ext = "john.doe.csv".split(".")
>>> name
'john'
>>> surname
'doe'
>>> ext
'csv'
>>>
Simply use the str.split() method inside a function like this:
def split_names(filename: str):
    splitted = filename.split(".")
    return splitted[0], splitted[1]
A, B = split_names("name.surname.csv")
First find all the files in your directory with the extension '.csv', then split each name by '.'
import os
for file in os.listdir("/mydir"):
    if file.endswith(".csv"):
        # print the file name
        print(os.path.join("/mydir", file))
        # split the file name by '.'
        name, surname, ext = file.split(".")
        # print or append or whatever you will do with the result here
If the file is saved at a specific location on the system, you first have to extract just the file name:
# if filename is already just "name.surname.csv", skip the first two lines
filename = "C://CSVFolder//name.surname.csv"
absfilename = filename.split('//')[-1]
# unpack the three parts into separate variables
A,B,ext = absfilename.split('.')
Otherwise you can split the name directly:
A,B,ext = "name.surname.csv".split('.')
print(A, B, ext)
Happy coding :)
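If the paths come from the file system, pathlib can take care of the directory part, and a bounded split keeps a surname that itself contains a period in one piece. This is a sketch under those assumptions. The function name split_name and the sample path are hypothetical.

```python
from pathlib import PurePosixPath

def split_name(path):
    """Return (name, surname) from a path like 'dir/name.surname.csv'."""
    stem = PurePosixPath(path).stem  # drops the directory and the final '.csv'
    # maxsplit=1: only the first period separates name from surname
    name, surname = stem.split('.', 1)
    return name, surname

A, B = split_name('/mydir/john.doe.csv')
print(A, B)  # john doe
```

A stem without any period makes the unpacking raise ValueError, which may or may not be what you want for malformed names.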

Python: Retrieving and renaming indexed files in a directory

I created a script to rename indexed files in a given directory.
For example, if the directory has the following files: (bar001.txt, bar004.txt, bar007.txt, foo2.txt, foo5.txt, morty.dat, rick.py), my script should rename only the indexed files and close the gaps like this: (bar001.txt, bar002.txt, bar003.txt, foo1.txt, foo2.txt...).
The full script is below, but it doesn't work. It seems to be a logic error: no error messages are given, but the files in the directory remain unchanged.
#! python3
import os, re
working_dir = os.path.abspath('.')
# A regex pattern that matches files with prefix,numbering and then extension
pattern = re.compile(r'''
    ^(.*?) # text before the file number
    (\d+) # file index
    (\.([a-z]+))$ # file extension
    ''',re.VERBOSE)
# Method that renames the items of an array
def rename(array):
    for i in range(len(array)):
        matchObj = pattern.search(array[i])
        temp = list(matchObj.group(2))
        temp[-1] = str(i+1)
        index = ''.join(temp)
        array[i] = matchObj.group(1) + index + matchObj.group(3)
    return(array)
array = []
directory = sorted(os.listdir('.'))
for item in directory:
    matchObj = pattern.search(item)
    if not matchObj:
        continue
    if len(array) == 0 or matchObj.group(1) in array[0]:
        array.append(item)
    else:
        temp = array
        newNames = rename(temp)
        for i in range(len(temp)):
            os.rename(os.path.join(working_dir,temp[i]),
                      os.path.join(working_dir,newNames[i]))
        array.clear() #reset array for other files
        array.append(item)
To summarise, you want to find every file whose name ends with a number (before the extension) and fill in the gaps for every set of files that have the same name, save for the number suffix. You don't want to create any new files; rather, the ones with the highest numbers should be used to fill the gaps.
Since this summary translates rather nicely into code, I will write it from scratch rather than working from your code.
import re
import os
from os import path
folder = 'path/to/folder/'
pattern = re.compile(r'(.*?)(\d+)(\.[a-z]+)$')
summary = {}
for fn in os.listdir(folder):
    m = pattern.match(fn)
    if m and path.isfile(path.join(folder, fn)):
        # Create a key if there isn't one, add the 'index' to the set
        # The first item in the tuple - len(n) - tells us how the numbers should be formatted later on
        name, n, ext = m.groups()
        summary.setdefault((name, ext), (len(n), set()))[1].add(int(n))
for (name, ext), (n, current) in summary.items():
    required = set(range(1, len(current)+1)) # You want these
    gaps = required - current # You're missing these
    superfluous = current - required # You don't need these, so they should be renamed to fill the gaps
    assert(len(gaps) == len(superfluous)), 'Something has gone wrong'
    for old, new in zip(superfluous, gaps):
        oldname = '{name}{n:>0{pad}}{ext}'.format(pad=n, name=name, n=old, ext=ext)
        newname = '{name}{n:>0{pad}}{ext}'.format(pad=n, name=name, n=new, ext=ext)
        print('{old} should be replaced with {new}'.format(old=oldname, new=newname))
That about covers it I think.
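The same idea can be packed into a function that works on a plain list of names, which makes it easy to check the plan before touching any files. This is a sketch, not a drop-in replacement. The function name plan_renames is hypothetical, and the two sets are sorted so that the pairing of old and new numbers is deterministic.

```python
import re
from collections import defaultdict

pattern = re.compile(r'(.*?)(\d+)(\.[a-z]+)$')

def plan_renames(filenames):
    """Return (old, new) pairs that close the numbering gaps in each group."""
    groups = defaultdict(list)
    for fn in filenames:
        m = pattern.match(fn)
        if m:
            name, n, ext = m.groups()
            groups[(name, ext)].append((int(n), len(n)))

    plan = []
    for (name, ext), entries in groups.items():
        pad = entries[0][1]  # digit width of the first number seen in the group
        current = {n for n, _ in entries}
        required = set(range(1, len(current) + 1))
        gaps = sorted(required - current)
        superfluous = sorted(current - required)
        for old, new in zip(superfluous, gaps):
            plan.append((f'{name}{old:0{pad}d}{ext}', f'{name}{new:0{pad}d}{ext}'))
    return plan

files = ['bar001.txt', 'bar004.txt', 'bar007.txt', 'foo2.txt', 'foo5.txt', 'rick.py']
print(plan_renames(files))
# [('bar004.txt', 'bar002.txt'), ('bar007.txt', 'bar003.txt'), ('foo5.txt', 'foo1.txt')]
```

Each pair could then be passed to os.rename together with the folder path. Note that, as in the answer above, the highest-numbered files fill the gaps, so the relative order of the files is not preserved (foo5.txt becomes foo1.txt while foo2.txt stays).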

How to save regular expression objects in a dictionary?

For a first-semester task I'm supposed to write a script that finds first and last names in a file and displays them in the following order (last name, first name) next to the original entry (first name, last name).
The file has one entry per line which looks as follows: "Srđa Slobodan ĐINIC POPOVIC".
My questions are probably basic but I'm stuck:
How can I save all the entries of the file in a hash (multi-part first names/multi-part last names)? With re.compile() and re.search() I only manage to get one result. With re.findall() I get all of them, but I can't call .group() on them and I get encoding errors.
How can I connect the original name entry (first name/last name) to the new entry (last name/first name)?
import re, codecs
file = codecs.open('FILE.tsv', encoding='utf-8')
test = file.read()
list0 = test.rstrip()
for word in list0:
    p = re.compile('(([A-Z]+\s\-?)+)')
    u = re.compile('((\(?[A-Z][a-z]+\)?\s?-?\.?)+)')
    hash1 = {}
    hash1[p.search(test).group()] = u.search(test).group()
    hash2 = {}
    hash2[u.search(test).group()] = p.search(test).group()
    print hash1,'\t',hash2
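One possible direction, sketched in Python 3 without regular expressions: process the file line by line and store each original entry as a dictionary key with the reordered name as its value. It assumes that last names are exactly the tokens written entirely in capitals, as in the sample entry, so an initial such as "J." would be misclassified. The function name reorder and the second sample line are invented for illustration.

```python
def reorder(entry):
    """Return 'LAST NAMES, First names' for one line of the file."""
    tokens = entry.split()
    # Assumption: last names are the tokens written entirely in capitals
    last = [t for t in tokens if t.isupper()]
    first = [t for t in tokens if not t.isupper()]
    return f"{' '.join(last)}, {' '.join(first)}"

# Invented stand-in for the lines of FILE.tsv
lines = ["Srđa Slobodan ĐINIC POPOVIC", "Ana-Marija KOVAČ"]

# Map each original entry to its reordered form
names = {line: reorder(line) for line in lines}
for original, swapped in names.items():
    print(original, swapped, sep='\t')
```

Python 3 strings are Unicode, so str.isupper() handles letters such as Đ and Č without any extra encoding work once the file has been opened with encoding='utf-8'.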
