I have some code that writes out files with names like this:
body00123.txt
body00124.txt
body00125.txt
body-1-2126.txt
body-1-2127.txt
body-1-2128.txt
body-3-3129.txt
body-3-3130.txt
body-3-3131.txt
The first two numbers in the file name can be 'negative', but the last three digits never are.
I have a list such as this:
123
127
129
And I want to remove all the files that don't end with one of these numbers. An example of the desired leftover files would be like this:
body00123.txt
body-1-2127.txt
body-3-3129.txt
My code is running in python, so I have tried:
if i not in myList:
    os.system('rm body*' + str(i) + '.txt')
And this resulted in every file being deleted.
The solution should be such that any .txt file that ends with a number contained by myList should be kept. All other .txt files should be deleted. This is why I'm trying to use a wildcard in my attempt.
Because all the files end in .txt, you can cut that part out and use the str.endswith() function. str.endswith() accepts a tuple of strings, and sees if your string ends in any of them. As a result, you can do something like this:
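all_file_list = [...]  # every candidate file name
keep_list = [...]      # the numbers to keep, e.g. 123, 127, 129
files_to_remove = []
# str.endswith() needs a tuple of strings, so convert the numbers first
keep_tup = tuple(str(n) for n in keep_list)
for name in all_file_list:
    # name[:-4] drops '.txt'; remove the file only if no kept number matches
    if not name[:-4].endswith(keep_tup):
        files_to_remove.append(name)
        # or os.remove(name)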
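A minimal sketch of the same idea as a reusable function, run against the file names from the question. The helper name `select_files_to_remove` and its `extension` parameter are assumptions made for illustration; it only returns the names and leaves the actual deletion to the caller.

```python
def select_files_to_remove(file_names, keep_numbers, extension=".txt"):
    """Return the names that do NOT end with one of keep_numbers (hypothetical helper)."""
    # str.endswith() accepts a tuple of strings, so build one suffix per number.
    keep_suffixes = tuple(str(n) + extension for n in keep_numbers)
    return [
        name for name in file_names
        if name.endswith(extension) and not name.endswith(keep_suffixes)
    ]


all_file_list = [
    "body00123.txt", "body00124.txt", "body00125.txt",
    "body-1-2126.txt", "body-1-2127.txt", "body-1-2128.txt",
    "body-3-3129.txt", "body-3-3130.txt", "body-3-3131.txt",
]
keep_list = [123, 127, 129]

files_to_remove = select_files_to_remove(all_file_list, keep_list)
print(files_to_remove)
# To actually delete them (after "import os"):
# for name in files_to_remove:
#     os.remove(name)
```

Note that this is a plain suffix match, so a kept number such as 123 would also protect a file ending in 1123; that is fine as long as only the last three digits matter, as described in the question.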
I have a list of paths that look like this (see below). As you can see, file-naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions, however, with this inconsistent file-naming case it seems trickier.
The list of files looks something like this (but assume there are thousands of paths and words that aren't part of the full names e.g. cv, curriculum vitae etc.):
all_files =
['cv_bob_johnson.pdf',
'bob_johnson_cv.pdf',
'curriculum_vitae_bob_johnson.pdf',
'cv_lara_kroft_cv.pdf',
'cv_lara_kroft.pdf' ]
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names are somewhat in a written pattern most of the time (e.g. first name precedes last name), I assume there has to be a way of getting a unique set of the paths if the names are repeated?
If you want to keep your algorithm relatively simple (i.e., not using ML etc), you'll need to have some idea about the typical substrings that you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
unique = []
for file in all_files:
    # strip a suffix, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:
        # no '.' in the file name, so there is no suffix to split off
        name, suffix = file, None
    # remove the excess parts:
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)
# remove duplicates:
unique = list(set(unique))
print(unique)
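The loop above yields the cleaned names (such as 'bob_johnson.pdf') rather than the original paths. If the original file names should be kept, as in the desired output, one possible variation is to use the cleaned name only as a dictionary key. This is a sketch under the assumption that keeping the first file seen for each person is acceptable; the names `pattern` and `first_seen` are made up for the example.

```python
import re

all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']

# One alternation of all substrings, longest first, escaped so that they
# are matched literally rather than as regex syntax.
pattern = re.compile('|'.join(
    re.escape(r) for r in sorted(remove, key=len, reverse=True)))

first_seen = {}  # cleaned name -> first original file with that name
for file in all_files:
    stem = file.rsplit('.', 1)[0]   # drop the extension, if any
    key = pattern.sub('', stem)     # e.g. 'cv_bob_johnson' -> 'bob_johnson'
    first_seen.setdefault(key, file)

unique_files = list(first_seen.values())
print(unique_files)
```

Which file survives for each person depends on the order of `all_files`; sorting the list first (for example by length) would be one way to prefer the shortest name.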
I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
import os
# `directory` is a placeholder for the path of the folder with the html files
nums = [f.split('_')[2] for f in os.listdir(directory) if f.endswith('.html')]
You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you get not only the specific number you want, but also the other numbers in the file name.
This will work if the names of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, you can use the list's append method to add it to any list you want.
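As a rough sketch of how the regex approach could be applied to a whole list of file names: the pattern below is an assumption based on the single example name in the question, and `FILENAME_RE`, `extract_ids` and the two extra sample names are invented for illustration.

```python
import re

# Assumed pattern: <text>_<digits>_<digits>_<8-digit date>.html
FILENAME_RE = re.compile(r'[^_]+_\d+_(\d+)_\d{8}\.html')


def extract_ids(filenames):
    """Collect the third underscore-separated field of matching names."""
    ids = []
    for name in filenames:
        match = FILENAME_RE.fullmatch(name)
        if match:  # skip files that do not follow the pattern
            ids.append(match.group(1))
    return ids


sample = ['Prod224_0055_00007464_20170930.html', 'notes.txt', 'index.html']
print(extract_ids(sample))
```

For a real directory, the list could come from `os.listdir('.')`. Matching the whole name has the side effect of skipping unrelated files instead of raising an IndexError on them.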
This is my code:
files = open('clean.txt').readlines()
print files
finallist = []
for items in files:
    new = items.split()
    new.append(finallist)
And since the file of text is too huge, here is an example of "print files":
files = ['chemistry leads outstanding another story \n', 'rhapsodic moments blow narrative prevent bohemian rhapsody']
I need each line (separated by '\n') to be split into words and placed in a list of lists, in the format below:
outcome = [['chemistry','leads','outstanding', 'another', 'story'],['rhapsodic','moments','blow', 'narrative', 'prevent', 'bohemian', 'rhapsody']]
I've tried methods just like the first code given and it returns an empty list. Please help! Thanks in advance.
The last line of your code is backwards, it seems. Instead of
new.append(finallist)
it should be
finallist.append(new)
I changed the last line to the version above, and the result was a list (finallist) containing 2 sub-lists. Here is the code that seems to work:
files = open('clean.txt').readlines()
print(files)
finallist = []
for items in files:
    new = items.split()
    finallist.append(new)
Use a list comprehension to shorten the loop to one line:
finallist = [i.split() for i in files]
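A self-contained sketch of the comprehension, using the two sample lines from the question. `io.StringIO` is used here only as a stand-in for `open('clean.txt')` so the example runs without the file; with a real file the `with` block would wrap the `open()` call instead.

```python
import io

# io.StringIO stands in for open('clean.txt') so the example is self-contained.
sample = io.StringIO(
    'chemistry leads outstanding another story \n'
    'rhapsodic moments blow narrative prevent bohemian rhapsody'
)

with sample as handle:
    # Iterating over a file yields one line at a time; split() without
    # arguments also discards the trailing newline and extra spaces.
    finallist = [line.split() for line in handle]

print(finallist)
```

Iterating over the file object directly avoids loading every line into memory first with readlines(), which may matter for a very large text file.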
I'm learning Python, and I'm working on a project that consists of transferring some numbers from PDF files to an xlsx file and placing them in their corresponding columns, with the rows determined by the row heading.
The idea I came up with is to convert the PDF files to txt and build a dictionary from the txt files, whose key is a part of the file name (because it contains a part of the row header) and whose values are the numbers I need.
I have already managed to convert the files to txt; now I'm dealing with the script that builds the dictionary. At the moment it looks like this:
import os
import re
p = re.compile(r'\w+\f+')
'''
I'm not entirely sure at the moment how the .compile of regular expressions works, but I know I'm missing something to indicate that what I want is immediately to the right, I'm also not sure if the keywords will be ignored, I just want take out the numbers
'''
m = p.match('Theese are the keywords' or 'That are immediately to the left' or 'The numbers I want')
def IsinDict(txtDir):
    ToData = ()
    if txtDir == "": txtDir = os.getcwd() + "\\"
    for txt in os.listdir(txtDir):
        ToKey = txt[9:21]
        if ToKey == (r"\w+"):
            Data = open(txt, "r")
            for string in Data:
                ToData += m.group()
    Diccionary = dict.fromkeys(ToKey, ToData)
    return Diccionary
txtDir = "Absolute/Path/OfTheText/Files"
IsinDict(txtDir)
Any contribution is welcome, thanks for your attention.
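One possible way to build such a dictionary, offered as a sketch rather than a definitive answer. It assumes that a "number" is an optionally signed integer or decimal and that every number in a file is wanted; `NUMBER_RE`, `build_number_dict` and `key_slice` are hypothetical names, and the slice 9:21 is taken from the code in the question.

```python
import os
import re

# Assumption: a "number" is an optionally signed integer or decimal.
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def build_number_dict(txt_dir, key_slice=slice(9, 21)):
    """Map part of each .txt file name to the numbers found in that file."""
    result = {}
    for file_name in sorted(os.listdir(txt_dir)):
        if not file_name.endswith('.txt'):
            continue
        key = file_name[key_slice]  # same slice as txt[9:21] in the question
        # os.listdir() returns bare names, so join them with the directory.
        path = os.path.join(txt_dir, file_name)
        with open(path, encoding='utf-8') as handle:
            # findall() returns every match as a string, in file order.
            result[key] = NUMBER_RE.findall(handle.read())
    return result
```

In the code from the question, `ToKey == (r"\w+")` compares the key with the literal text `\w+` instead of testing a pattern, and `open(txt, "r")` is given the bare file name without the directory; the sketch avoids both. The numbers are returned as strings and would still need converting with `float()` before writing them to a spreadsheet.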
I want to remove the last component of each path in a list, i.e. the library name (the part after the final '/'). The text string that I have contains the paths of libraries used at compilation time. These paths are delimited by spaces. I want to retain each path only up to the directory that contains the library, without the library name itself.
Example:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
I want my output to be like -
Output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib,
/opt/cray/atp/1.7.1/lib]
and finally I want to remove the duplicates in the output_list so that the list looks like.
New_output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib]
I am getting the results using split() function but I am struggling to discard the library names from the path.
Any help would be appreciated.
You seem to want (don't try and do string operations with paths, it's bound to end badly):
import os
New_output_List = list(set(os.path.dirname(pt) for pt in text.split()))
os.path.dirname gets the directory name from a path. We do this for every item in the text, which is split into a list on whitespace.
To remove the duplicates, we just convert it to a set and then finally to a list.
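A set does not preserve order, so the directories may come back in a different order than they appear in the text. If the original order matters, a variation (a sketch, assuming Python 3.7 or later where dictionaries keep insertion order) is to deduplicate with dict.fromkeys instead. The text below is a shortened version of the one in the question, and `ordered_dirs` is a name chosen for the example.

```python
import os

# A shortened version of the text from the question.
text = (" /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o"
        " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o"
        " /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o"
        " /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a ")

# dict.fromkeys() drops duplicates but keeps the first-seen order.
ordered_dirs = list(dict.fromkeys(os.path.dirname(p) for p in text.split()))
print(ordered_dirs)
```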
Try this:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
New_output_List = []
for x in text.split():
    # drop the last component (the library name) and rebuild the path
    directory = "/".join(x.split("/")[:-1])
    if directory not in New_output_List:
        New_output_List.append(directory)