Exclude the last delimited item from my list - python

I want to remove the last component of each path in a string, i.e. the library name (delimited by '/'). The string contains the paths of the libraries used at compile time, separated by spaces. I want to keep each path up to, but not including, the library name, i.e. only its parent directory.
Example:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
I want my output to be like -
Output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib,
/opt/cray/atp/1.7.1/lib]
Finally, I want to remove the duplicates from Output_list so that the list looks like this:
New_output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib]
I can split the string using the split() function, but I am struggling to discard the library names from the paths.
Any help would be appreciated.

You seem to want (don't try and do string operations with paths, it's bound to end badly):
import os
New_output_List = list(set(os.path.dirname(pt) for pt in text.split()))
os.path.dirname gets the directory name from a path. We do this for every item in the text, which is split into a list on whitespace.
To remove the duplicates, we just convert the result to a set and then finally to a list. Note that a set does not preserve the original order of the paths.
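If the order of the directories matters, the set-based one-liner can be swapped for an order-preserving variant. This is a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order); the helper name unique_dirs is made up for illustration and the text is a shortened version of the one in the question:

```python
import os

def unique_dirs(text):
    """Return the parent directory of each path in text, first occurrence only."""
    dirs = (os.path.dirname(p) for p in text.split())
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(dirs))

text = (" /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o"
        " /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a"
        " /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a"
        " /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a ")
print(unique_dirs(text))
```

Calling text.split() without an argument also discards the leading and trailing spaces, so no empty strings end up in the result.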

Try this:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
New_output_list = []
for x in text.split():
    # drop the last '/'-delimited item (the library name), skipping duplicates
    directory = "/".join(x.split("/")[:-1])
    if directory not in New_output_list:
        New_output_list.append(directory)

Related

How can I remove files that have an unknown number in them?

I have some code that writes out files with names like this:
body00123.txt
body00124.txt
body00125.txt
body-1-2126.txt
body-1-2127.txt
body-1-2128.txt
body-3-3129.txt
body-3-3130.txt
body-3-3131.txt
The first two numbers in the file name can be 'negative', but the last three digits cannot.
I have a list such as this:
123
127
129
And I want to remove all the files that don't end with one of these numbers. An example of the desired leftover files would be like this:
body00123.txt
body-1-2127.txt
body-3-3129.txt
My code is running in python, so I have tried:
if i not in myList:
    os.system('rm body*' + str(i) + '.txt')
And this resulted in every file being deleted.
The solution should be such that any .txt file that ends with a number contained by myList should be kept. All other .txt files should be deleted. This is why I'm trying to use a wildcard in my attempt.
Because all the files end in .txt, you can cut that part out and use the str.endswith() function. str.endswith() accepts a tuple of strings, and sees if your string ends in any of them. As a result, you can do something like this:
all_file_list = [...]
keep_list = [...]
files_to_remove = []
# str.endswith() needs a tuple of strings, so convert the numbers to keep
keep_tup = tuple(str(n) for n in keep_list)
for name in all_file_list:
    # name[:-4] drops the '.txt' extension
    if not name[:-4].endswith(keep_tup):
        files_to_remove.append(name)
        # or os.remove(name)
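To check the selection before deleting anything, the same idea can be wrapped in a function that only returns the names to remove. This is a sketch under assumptions: the function name files_to_delete is hypothetical, and the file list is the one from the question rather than a real directory listing:

```python
import os

def files_to_delete(all_files, keep_numbers):
    """Return the .txt files whose stem does not end with any number to keep."""
    keep = tuple(str(n) for n in keep_numbers)
    doomed = []
    for name in all_files:
        stem, ext = os.path.splitext(name)
        # only consider .txt files; keep those ending in a listed number
        if ext == ".txt" and not stem.endswith(keep):
            doomed.append(name)
    return doomed

all_files = ["body00123.txt", "body00124.txt", "body00125.txt",
             "body-1-2126.txt", "body-1-2127.txt", "body-1-2128.txt",
             "body-3-3129.txt", "body-3-3130.txt", "body-3-3131.txt"]
print(files_to_delete(all_files, [123, 127, 129]))
```

Once the printed list looks right, each name can be passed to os.remove(). In a real run, all_files could come from os.listdir().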

How to delete paths that contain the same names?

I have a list of paths that look like this (see below). As you can see, file-naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions, however, with this inconsistent file-naming case it seems trickier.
The list of files looks something like this (but assume there are thousands of paths and words that aren't part of the full names e.g. cv, curriculum vitae etc.):
all_files =
['cv_bob_johnson.pdf',
'bob_johnson_cv.pdf',
'curriculum_vitae_bob_johnson.pdf',
'cv_lara_kroft_cv.pdf',
'cv_lara_kroft.pdf' ]
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names are somewhat in a written pattern most of the time (e.g. first name precedes last name), I assume there has to be a way of getting a unique set of the paths if the names are repeated?
If you want to keep your algorithm relatively simple (i.e., not using ML etc), you'll need to have some idea about the typical substrings that you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
unique = []
for file in all_files:
    # strip a suffix, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:
        name, suffix = file, None
    # remove the excess parts:
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)
# remove duplicates:
unique = list(set(unique))
print(unique)
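The code above returns the cleaned names (for example 'bob_johnson.pdf') rather than one of the original paths. If an original file name should be kept per person, as in the desired output, one possible sketch is to group the paths by their cleaned name and keep the shortest path in each group. The function name unique_by_person, the noise-word set, and the "shortest wins" rule are all assumptions made for illustration:

```python
def unique_by_person(paths, noise_words=frozenset({"cv", "curriculum", "vitae"})):
    """Keep one original path per person, preferring the shortest file name."""
    best = {}
    for path in paths:
        stem = path.rsplit(".", 1)[0]
        # the key is the name with the noise words dropped, e.g. ('bob', 'johnson')
        key = tuple(t for t in stem.lower().split("_") if t not in noise_words)
        # keep the first path seen for a key unless a strictly shorter one turns up
        if key not in best or len(path) < len(best[key]):
            best[key] = path
    return list(best.values())

all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
print(unique_by_person(all_files))
```

Comparing whole '_'-separated tokens avoids removing 'cv' from the middle of a name, which a plain substring replacement would do.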

Python: split hard coded path

I need to split a path up in python and then remove the last two levels.
Here is an example, the path I want to parse. I want to parse it to level 6.
C:\Users\Me\level1\level2\level3\level4\level5\level6\level7\level8
Below is what I want the output to be. Currently, I can only go one level up.
C:\Users\Me\level1\level2\level3\level4\level5\level6\
a ="C:\Users\Me\level1\level2\level3\level4\level5\level6\level7\level8"
split_path=os.path.split(a)
print split_path
Output:
('C:\Users\Me\level1\level2\level3\level4\level5\level6\level7','level8')
Split the path into all its parts, then join all the parts, except the last two.
import os
separator = os.path.sep
parts = a.split(separator)
# rejoin with the separator; os.path.join would drop the backslash after 'C:'
output = separator.join(parts[:-2])
You can either use the split function twice:
os.path.split(os.path.split(a)[0])[0]
This works since os.path.split() returns a tuple with two items, head and tail, and by taking [0] of that we'll get the head. Then just split again and take the first item again with [0].
Or join your path with the parent directory twice:
os.path.abspath(os.path.join(a, '..', '..'))
You can easily create a function that will step back as many steps as you want:
def path_split(path, steps):
    for i in range(steps):
        path = os.path.split(path)[0]
    return path
So
>>> path_split(r"C:\Users\Me\level1\level2\level3\level4\level5\level6\level7\level8", 2)
'C:\\Users\\Me\\level1\\level2\\level3\\level4\\level5\\level6'
os.path.split(path) returns a tuple containing everything except the last component, followed by the last component. So if you want to remove the last two components:
os.path.split(os.path.split(your_path)[0])[0]
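On Python 3.4+, pathlib offers another way to step up a fixed number of levels. The sketch below uses PureWindowsPath so that it treats backslashes as separators on any operating system; the helper name strip_levels is made up for illustration and it assumes levels is at least 1:

```python
from pathlib import PureWindowsPath

def strip_levels(path, levels):
    """Drop the last `levels` components of a Windows-style path (levels >= 1)."""
    # parents[0] is the immediate parent, parents[1] the grandparent, and so on
    return str(PureWindowsPath(path).parents[levels - 1])

a = r"C:\Users\Me\level1\level2\level3\level4\level5\level6\level7\level8"
print(strip_levels(a, 2))
```

The raw string literal (r"...") matters here: in Python 3 a plain "C:\Users..." literal is a syntax error, because \U starts a Unicode escape.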

Looping a write command to output many different indices from a list separately in Python

I'm trying to get an output like:
KPLR003222854-2009131105131
in a text file. The way I am attempting to derive that output is as such:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = []
    for line in file_P:
        splt_file_P = line.split()
        nameData.append(splt_file_P[0])
    for key in nameData:
        namelist.write('\n' 'KPLR00' + "".join(str(w) for w in nameData) + '-2009131105131')
However, I am having an issue: all the numbers in the nameData array appear at once in the output. Instead of using one ID cleanly as shown above, the output is something like this:
KPLR00322285472138721382172198371823798123781923781237819237894676472634973256279234987-2009131105131
So my question is how do I loop the write command in a way that will allow me to get each separate ID (each has a specific index value, but there are over 150) to be properly outputted.
EDIT:
Also, some of the IDs in the list are not the same length, so I wanted to add 0s to the front of the 'key' to make them all 9 digits long. I cheated by adding the 0s to the quoted KPLR prefix, but not all of the IDs need exactly two 0s. The question is, could I add 0s between KPLR and the key in some way to match the 9-digit format?
Your code looks like it's working as one would expect: "".join(str(w) for w in nameData) makes a string composed of the concatenation of every item in nameData.
Chances are you want:
for key in nameData:
    namelist.write('\n' 'KPLR00' + key + '-2009131105131')
Or even better:
for key in nameData:
    namelist.write('\nKPLR%09i-2009131105131' % int(key))  # no string concatenation
String concatenation tends to be slower, and if you're not only operating on strings, will involve explicit calls to str. Here's a pair of ideone snippets showing the difference: http://ideone.com/RR5RnL and http://ideone.com/VH2gzx
Also, the above form with the format string '%09i' will pad with 0s to make the number up to 9 digits. Because the format is '%i', I've added an explicit conversion to int. See here for full details: http://docs.python.org/2/library/stdtypes.html#string-formatting-operations
Finally, here's a single line version (excepting the with statement, which you should of course keep):
namelist.write("\n".join("KPLR%09i-2009131105131"%int(line.split()[0]) for line in file_P))
You can change this:
"".join(str(w) for w in nameData)
to this:
",".join(str(w) for w in nameData)
Basically, the "," will comma delimit the elements in your nameData list. If you use "", then there will be nothing to separate the elements, so they appear all at once. You can change the delimiter to suit your needs.
Just for kicks:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
    nameData = [line.split()[0] for line in file_P]
    namelist.write("\n".join("KPLR00" + str(key) + '-2009131105131' for key in nameData))
I think that will work, but I haven't tested it. You can make it even smaller/uglier by not using nameData at all, and just use that list comprehension right in its place.
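For the zero-padding part of the question, str.format can pad each ID to 9 digits regardless of its original length. This is a minimal sketch that works on a list of lines instead of the real files; the function name kplr_names and the sample lines are invented for illustration:

```python
def kplr_names(lines, suffix="-2009131105131"):
    """Build one 'KPLR<9-digit id><suffix>' name per non-empty input line."""
    names = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # {:09d} left-pads the integer with zeros to a width of 9
        names.append("KPLR{:09d}{}".format(int(fields[0]), suffix))
    return names

sample = ["3222854 0.1 0.2", "12345 0.3 0.4"]
print("\n".join(kplr_names(sample)))
```

With the real data, the open file object file_P can be passed in place of sample, and the joined string written to namelist.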

Split specific items in list into two

I'm building an XML parser in python for an SVG file. It will eventually become specific instructions for stepper motors.
SVG files contain commands such as 'M', 'C' and 'L.' The path data might look like this:
[M199.66, 0.50C199.6, 0.50...0.50Z]
When I extracted the path data, it's a list of one item (which is a string). I split the long string into multiple strings:
[u'M199.6', u'0.50C199.66', u'0.50']
The 'M, C and L' commands are important - I'm having difficulty splitting '0.5C199.6' into '0.5' and 'C199.6' because it only exists for certain items in the list, and I'd like to retain the C and not discard it. This is what I have so far:
for item in path_strings[0]:
    s=string.split(path_strings[0], ',')
    print s
    break
for i in range(len(s)):
    coordinates=string.split(s[i],'C')
    print coordinates
    break
You could try breaking it into substrings like this:
whole = "0.5C199.66"
start = whole[0:whole.find("C")]
end = whole[whole.find("C"):]
That should give you start == "0.5" and end == "C199.66"
Alternatively you could use the index function instead of find: index raises a ValueError when the substring can't be found, whereas find returns -1. That makes it easy to tell that no 'C' command is present in the current string.
http://docs.python.org/2/library/string.html#string-functions
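A related option that needs no index arithmetic is str.partition, which also reports cleanly when the command letter is missing. The following is a sketch only; the helper name split_at_command is hypothetical:

```python
def split_at_command(item, command="C"):
    """Split item in front of the first `command` letter, keeping the letter."""
    head, sep, tail = item.partition(command)
    if not sep:
        # the command letter does not occur: leave the item untouched
        return [item]
    # avoid an empty first element when the item starts with the command
    return [head, sep + tail] if head else [sep + tail]

print(split_at_command("0.5C199.66"))
print(split_at_command("0.50"))
```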
Use a regex to search for the commands ([MCL]).
import re
lst = [u'M199.6', u'0.50C199.66', u'0.50']
for i, j in enumerate(lst):
    m = re.search('(.+?)([MCL].+)', j)
    if m:
        print [m.group(1), m.group(2)] # = coordinates from your example
        lst[i:i+1] = [m.group(1), m.group(2)] # replace the item in lst with the split parts
        # or do something else with the coordinates, whatever you want.
print lst
This splits your list into:
[u'M199.6', u'0.50', u'C199.66', u'0.50']
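The same split can also be written without modifying the list while iterating over it. Below is a Python 3 sketch using re.findall; the function name split_commands is made up, and it assumes that only the letters M, C and L mark commands, as in the question:

```python
import re

# an optional command letter followed by everything up to the next command letter
TOKEN = re.compile(r"[MCL]?[^MCL]+")

def split_commands(items):
    """Return a new list in which every command letter starts its own item."""
    result = []
    for item in items:
        result.extend(TOKEN.findall(item))
    return result

lst = ['M199.6', '0.50C199.66', '0.50']
print(split_commands(lst))
```

Building a new list keeps the input unchanged, which can make the parsing step easier to test.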
