I am writing a small script that lists the currently connected hard disks on my machine. I only need the disk identifier (disk0), not the partition IDs (disk0s1, disk0s2, etc.).
How can I iterate through an array that contains both disk IDs and partition IDs and remove the partition ID entries? Here's what I'm trying so far:
import os

allDrives = os.listdir("/dev/")
parsedDrives = []

def parseAllDrives():
    parsedDrives = []
    matching = []
    for driveName in allDrives:
        if 'disk' in driveName:
            parsedDrives.append(driveName)
        else:
            continue
    for itemName in parsedDrives:
        if len(parsedDrives) != 0:
            if 'rdisk' in itemName:
                parsedDrives.remove(itemName)
            else:
                continue
        else:
            continue
    #### this is where the problem starts: #####
    # iterate through possible partition identifiers
    for i in range(5):
        # create a string for the partitionID
        systemPostfix = 's' + str(i)
        matching.append(filter(lambda x: systemPostfix in x, parsedDrives))
    for match in matching:
        if match in parsedDrives:
            parsedDrives.remove(match)
            print("found a match and removed it")
    print("matched: %s" % matching)
    print(parsedDrives)

parseAllDrives()
That last bit is just the most recent thing I've tried. Definitely open to going a different route.
try beginning with
allDrives = os.listdir("/dev/")
disks = [drive for drive in allDrives if ('disk' in drive)]
then, given that disk IDs are only 5 characters long:
short_disks = [disk[:5] for disk in disks]
unique_short_disks = list(set(short_disks))
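Alternatively (a sketch, assuming BSD-style device names like disk0, disk0s1, and rdisk0), a small regex keeps only whole-disk entries regardless of how many digits the ID has:

```python
import re

def whole_disks(entries):
    """Keep whole-disk names like 'disk0' or 'disk12', dropping
    partitions ('disk0s1') and raw devices ('rdisk0')."""
    pattern = re.compile(r'^disk\d+$')  # 'disk', digits, then end of string
    return sorted({e for e in entries if pattern.match(e)})

# hypothetical /dev listing
print(whole_disks(['disk0', 'disk0s1', 'disk0s2', 'rdisk0', 'disk1', 'tty.usb']))
# → ['disk0', 'disk1']
```

This also avoids the fixed-length assumption, so 'disk10' survives intact.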
I have a regex pattern that returns a list of all the start and stop indices of an occurring string, and I want to be able to highlight each occurrence. It's extremely slow with my current setup: on a 133,000-line file it takes about 8 minutes to highlight all occurrences.
Here's my current solution:
if IPv == 4:
    v4FoundUnique = v4FoundUnique + 1
    # highlight all regions found
    for j in range(qty):
        v4Found = v4Found + 1
        # don't highlight if they set the checkbox not to
        if highlightText:
            # get row.column coordinates of start and end of match
            # very slow
            startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
            # compute end based on start, assuming IP addresses won't span
            # lines; drastically faster than computing from the raw index
            endIndex = "{}.{}".format(startIndex.split(".")[0],
                                      int(startIndex.split(".")[1]) + stops[j] - starts[j])
            # apply tag
            textField.tag_add("{}v4".format("public" if isPublic else "private"),
                              startIndex, endIndex)
So, Tkinter has a pretty slow implementation of converting an "absolute location" to its row.column format:
startIndex = textField.index('1.0 + {} chars'.format(starts[j]))
it's actually faster to do it like this:
for address in v4check.finditer(filetxt):
    # address.group() returns the matching text
    # address.span() returns the indices (start, stop)
    start, stop = address.span()
    ip = address.group()
    srow = filetxt.count("\n", 0, start) + 1
    scol = start - filetxt.rfind("\n", 0, start) - 1
    start = "{}.{}".format(srow, scol)
    stop = "{}.{}".format(srow, scol + len(ip))
which takes the regex results and the input file to get the data we need (row.column).
There could be a faster way of doing this, but this is the solution I found that works!
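For anyone wanting to sanity-check the row/column arithmetic on its own, here is a minimal standalone sketch of the same offset-to-row.column conversion (no Tkinter needed; the sample text and helper name are made up):

```python
import re

def offset_to_index(text, start, length):
    """Convert a flat character offset and match length into Tkinter-style
    'row.column' start/end strings (rows are 1-based, columns 0-based)."""
    srow = text.count("\n", 0, start) + 1
    scol = start - text.rfind("\n", 0, start) - 1
    return "{}.{}".format(srow, scol), "{}.{}".format(srow, scol + length)

# made-up sample text and pattern
filetxt = "first line\nip 10.0.0.1 here\n"
m = re.search(r'\d+\.\d+\.\d+\.\d+', filetxt)
print(offset_to_index(filetxt, m.start(), len(m.group())))
# → ('2.3', '2.11')
```

Because this only scans the raw string, it sidesteps the per-match `textField.index()` calls entirely.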
I need to look for similar items in a list using Python (e.g. 'Limits' is similar to 'Limit', or 'Download ICD file' is similar to 'Download ICD zip file').
I want the matches to be based on the characters, not on digits (e.g. I do not want 'Angle 1' reported as similar to 'Angle 2'). All the strings in my list end with a '\0'.
What I am trying to do is split every item at blanks and check whether any part consists of a digit, but somehow it is not working as I want it to.
Here is my code example:
for k in range(len(split)):  # split already holds the splitted list entry
    replace = split[k].replace("\\0", "")  # strip the \0 at every line ending so only a digit remains
    is_num = lambda q: q.replace(".", "", 1).isdigit()  # lambda I found somewhere on the internet
    check = is_num(replace)
    if check == True:  # break if it is a digit and split the next entry of the list
        break
    elif check == False:  # I know, else would be fine too
        seq = difflib.SequenceMatcher(a=List[i].lower(), b=List[j].lower())
        if seq.ratio() > 0.9:
            print(Element1, "is similar to", Element2, "\t")
            break
Try this; it uses get_close_matches from difflib instead of SequenceMatcher.
from difflib import get_close_matches

a = ["abc/0", "efg/0", "bc/0"]
b = []
for i in a:
    x = i.rstrip("/0")
    b.append(x)
for i in range(len(b)):
    print(get_close_matches(b[i], b))
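get_close_matches also takes n and cutoff parameters, which help avoid matching an item with itself. A small sketch (the sample strings are made up, and rstrip("\0") assumes a real NUL terminator rather than a literal backslash-zero):

```python
from difflib import get_close_matches

items = ["Limits\0", "Limit\0", "Download ICD file\0", "Download ICD zip file\0"]
cleaned = [s.rstrip("\0") for s in items]  # drop the trailing NUL first

pairs = {}
for i, s in enumerate(cleaned):
    others = cleaned[:i] + cleaned[i + 1:]  # exclude the item itself
    matches = get_close_matches(s, others, n=1, cutoff=0.8)
    if matches:
        pairs[s] = matches[0]
        print(s, "is similar to", matches[0])
```

Raising or lowering cutoff (default 0.6) controls how strict "similar" is.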
The files are read from a folder with os.listdir. I then wrote a regex for one file pattern, r'^[1-9\w]{2}_[1-9\w]{4}[1][7][\d\w]+\.[\d\w]+', and a similar one for the other, r'^[1-9\w]{2}_[1-9\w]{4}[1][8]+'. The comparison condition is: when the first seven symbols match, then os.remove(os.path.join(dir_name, each)). A few example names: bh_txbh171002.xml, bh_txbh180101.xml, ce_txce170101.xml...
As I understand it, I can't use match because there is no single string to compare against and it returns None; moreover, it only compares a file name with the regex. I am thinking of a condition like if folder.itself(file) and file.startswith("......."): but I can't figure out how to point at the first seven symbols of the file names that should be compared.
Honestly, I posted a worse version of the code in that request, and since then I have learnt a little bit more (see the link).
Regex is the wrong tool here. I do not have your files, so I create randomized demo data:
import random
import string

random.seed(42)  # make random repeatable

def generateFileNames(amount):
    """Generate 2*amount names XX_XXXX with X in [a-z0-9], with duplicates in it"""
    def rndName():
        """generate one random name XX_XXXX with X in [a-z0-9]"""
        characters = string.ascii_lowercase + string.digits
        return random.choices(characters, k=2) + ['_'] + random.choices(characters, k=4)
    for _ in range(amount):  # create 2*amount names, some duplicates
        name = rndName()
        yield ''.join(name)  # yield name once
        if random.randint(1, 10) > 3:  # more likely to get the same name twice
            yield ''.join(name)  # same name twice
        else:
            yield ''.join(rndName())  # different 2nd name

def generateNumberParts(amount):
    """Generate 2*amount 6-digit strings, some with 17/18 as starting numbers"""
    def rndNums(nr):
        """Generate nr digits as a string list"""
        return random.choices(string.digits, k=nr)
    for _ in range(amount):
        choi = rndNums(4)
        # I am yielding 18 first to demonstrate that the sorting later works
        yield ''.join(['18'] + choi)  # 18xxxx numbers
        if random.randint(1, 10) > 5:
            yield ''.join(['17'] + choi)  # 17xxxx
        else:
            yield ''.join(rndNums(6))  # make it something else

# half the amount of files generated
m = 10

# generate filenames
filenames = [''.join(x) + '.xml' for x in zip(generateFileNames(m),
                                              generateNumberParts(m))]
Now I have my names as list and can start to find out which are dupes with newer timestamps:
# make a dict out of your filenames, using the first 7 chars as key,
# with a list of the files starting with this key as values:
fileDict = {}
for names in filenames:
    fileDict.setdefault(names[0:7], []).append(names)  # create key=[] or/and append names

for k, v in fileDict.items():
    print(k, " ", v)

# get files to delete (all the lower nr of the value list if multiple in it)
filesToDelete = []
for k, v in fileDict.items():
    if len(v) == 1:  # nothing to do, it's only 1 file
        continue
    print(v, " to ", end="")  # debugging output
    v.sort(key=lambda x: int(x[7:9]))  # sort by a lambda that integerfies 17/18
    print(v)  # debugging output
    filesToDelete.extend(v[:-1])  # add all but the last file to the delete list

print("")
print(filesToDelete)
Output:
# the created filenames in your dict by "key [values]"
xa_ji0y ['xa_ji0y188040.xml', 'xa_ji0y501652.xml']
v3_a3zm ['v3_a3zm181930.xml']
mm_jbqe ['mm_jbqe171930.xml']
ck_w5ng ['ck_w5ng180679.xml', 'ck_w5ng348136.xml']
zy_cwti ['zy_cwti184296.xml', 'zy_cwti174296.xml']
41_iblj ['41_iblj182983.xml', '41_iblj172983.xml']
5x_ff0t ['5x_ff0t187453.xml']
sd_bdw2 ['sd_bdw2177453.xml']
vn_vqjt ['vn_vqjt189618.xml', 'vn_vqjt179618.xml']
ep_q85j ['ep_q85j185198.xml', 'ep_q85j175198.xml']
vf_1t2t ['vf_1t2t180309.xml', 'vf_1t2t089040.xml']
11_ertj ['11_ertj188425.xml', '11_ertj363842.xml']
# sorting the names by its integer at 8/9 position of name
['xa_ji0y188040.xml','xa_ji0y501652.xml'] to ['xa_ji0y188040.xml','xa_ji0y501652.xml']
['ck_w5ng180679.xml','ck_w5ng348136.xml'] to ['ck_w5ng180679.xml','ck_w5ng348136.xml']
['zy_cwti184296.xml','zy_cwti174296.xml'] to ['zy_cwti174296.xml','zy_cwti184296.xml']
['41_iblj182983.xml','41_iblj172983.xml'] to ['41_iblj172983.xml','41_iblj182983.xml']
['vn_vqjt189618.xml','vn_vqjt179618.xml'] to ['vn_vqjt179618.xml','vn_vqjt189618.xml']
['ep_q85j185198.xml','ep_q85j175198.xml'] to ['ep_q85j175198.xml','ep_q85j185198.xml']
['vf_1t2t180309.xml','vf_1t2t089040.xml'] to ['vf_1t2t089040.xml','vf_1t2t180309.xml']
['11_ertj188425.xml','11_ertj363842.xml'] to ['11_ertj188425.xml','11_ertj363842.xml']
# list of files to delete
['xa_ji0y188040.xml', 'ck_w5ng180679.xml', 'zy_cwti174296.xml', '41_iblj172983.xml',
'vn_vqjt179618.xml', 'ep_q85j175198.xml', 'vf_1t2t089040.xml', '11_ertj188425.xml']
I can't understand what's wrong with my code. There I defined the list from a certain folder, so that I could work on the strings of each file name, right? Then I applied the conditions for filtering and for choosing which file to delete.
import os

dir_name = "/Python/Test_folder/Schems"
filenames = os.listdir(dir_name)

for names in filenames:
    filenames.setdefault(names[0:7], []).append(names)  # create key=[] or/and append names

for k, v in filenames.items():
    filesToDelete = []  # there's a syntax mistake. But I can't get it - there's the list or not?
for k, v in filenames.items():
    if len(v) == 1:
        continue
    v.sort(key=lambda x: int(x[7:9]))
    filesToDelete.extend(v[:-1])
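os.listdir returns a plain list, and lists have no setdefault; the grouping needs its own dict, and filesToDelete should be created once, outside any loop. A minimal sketch of the corrected structure, using a hypothetical name list in place of the os.listdir call:

```python
# hypothetical stand-in for os.listdir(dir_name)
filenames = ['zy_cwti184296.xml', 'zy_cwti174296.xml', 'v3_a3zm181930.xml']

fileDict = {}  # a separate dict, not the filenames list itself
for name in filenames:
    fileDict.setdefault(name[0:7], []).append(name)

filesToDelete = []  # define the list once, before the loop
for key, group in fileDict.items():
    if len(group) == 1:  # unique prefix, nothing to delete
        continue
    group.sort(key=lambda x: int(x[7:9]))  # sort by the 17/18 part
    filesToDelete.extend(group[:-1])  # keep only the newest file

print(filesToDelete)
# → ['zy_cwti174296.xml']
```

From there, each entry in filesToDelete can be passed to os.remove(os.path.join(dir_name, entry)).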
I have some code where I am trying to decompress text. The compression part all works, but when I go on to decompress it there is a ValueError:
List.append(dic[int(bob)])
ValueError: invalid literal for int() with base 10: '1,2,3,4,5,6,7,8,9,'
This is the code...
def menu():
    print("..........................................................")
    para = input("Please enter a paragraph.")
    print()
    s = para.split()  # splits sentence
    another = [0]  # will gradually hold all of the numbers, repeated or not
    index = []  # empty index
    word_dictionary = []  # names word_dictionary variable
    for count, i in enumerate(s):  # has a count and an index for enumerating the split sentence
        if s.count(i) < 2:  # if it is not repeated
            another.append(max(another) + 1)  # adds the next count to another
        else:  # if it has been repeated
            another.append(s.index(i) + 1)  # adds the index of i to another
    new = " "  # added because otherwise the numbers would be 01234567891011121341513161718192320214
    another.remove(0)  # takes away the 0 so that it doesn't have a 0 at the start
    for word in s:  # for every word in the list
        if word not in word_dictionary:  # if it's not in word_dictionary
            word_dictionary.append(word)  # adds it to the dictionary
        else:  # otherwise
            s.remove(word)  # it will remove the word
    fo = open("indx.txt", "w+")  # opens file
    for index in another:  # for each i in another
        index = str(index)  # it will turn it into a string
        fo.write(index)  # adds the index to the file
        fo.write(new)  # adds a space
    fo.close()  # closes file
    fo = open("words.txt", "w+")  # names a file sentence
    for word in word_dictionary:
        fo.write(str(word))  # adds the word to the file
        fo.write(new)
    fo.close()  # closes file

menu()
index = open("indx.txt", "r+").read()
dic = open("words.txt", "r+").read()
index = index.split()
dic = dic.split()

Num = 0
List = []
while Num != len(index):
    bob = index[Num]
    List.append(dic[int(bob)])
    Num += 1
print(List)
The problem is down on line 50, with List.append(dic[int(bob)]).
Is there a way to stop the error message from popping up and have the code output the sentence as it was input above?
The latest error message that has occurred:
List.append(dic[int(bob)])
IndexError: list index out of range
When I run the code, I input "This is a sentence. This is another sentence, with commas."
The issue is that index = index.split() by default splits on spaces, and, as the exception shows, your numbers are separated by commas.
Without seeing index.txt I can't be certain if it will fix all of your indexes, but for the issue in OP, you can fix it by specifying what to split on, namely a comma:
index= index.split(',')
To your second issue, List.append(dic[int(bob)]) IndexError: list index out of range: your indexes start at 1, not 0, so you are off by one when reconstituting your array.
This can be fixed with:
List.append(dic[int(bob) - 1])
Additionally you're doing a lot more work than you need to. This:
fo = open("indx.txt","w+") # opens file
for index in another: # for each i in another
index= str(index) # it will turn it into a string
fo.write(index) # adds the index to the file
fo.write(new) # adds a space
fo.close() # closes file
is equivalent to:
with open("indx.txt", "w") as fo:
    for index in another:
        fo.write(str(index) + new)
and this:
Num = 0
List = []
while Num != len(index):
    bob = index[Num]
    List.append(dic[int(bob)])
    Num += 1
is equivalent to
List = []
for item in index:
    List.append(dic[int(item)])
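Assuming the off-by-one fix from above, the whole loop even collapses to a single list comprehension (shown here with tiny inline sample data instead of the files):

```python
# sample data standing in for the contents of indx.txt and words.txt
index = "1 2 3 2".split()
dic = "This is a sentence.".split()

# -1 compensates for the 1-based indexes written during compression
List = [dic[int(item) - 1] for item in index]
print(List)
# → ['This', 'is', 'a', 'is']
```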
Also, take a moment to review PEP-8 and try to follow those standards. Your code is very difficult to read because it doesn't follow them. I fixed the formatting on your comments so StackOverflow's parser could parse your code, but most of them only add clutter.
For now I have tried to define and document my own function to do it, but I am having trouble testing the code and I actually have no idea if it is correct. I found some solutions with BioPython, re, or others, but I really want to make this work with yield.
# generator for GenBank to FASTA
def parse_GB_to_FASTA(lines):
    # set default label
    curr_label = None
    # set default sequence
    curr_seq = ""
    for line in lines:
        # if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            # if the label has already been changed
            if curr_label is not None:
                # output the label and sequence
                yield curr_label, curr_seq
            ''' if the label starts with ACCESSION, immediately replace the current label with
            the next ACCESSION number and continue with the next check'''
            # strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        # check for the organism column
        elif line.startswith('  ORGANISM'):
            # add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        # check if the region of the sequence starts
        elif line.startswith('ORIGIN'):
            # until the end of the sequence is reached
            while line.startswith('//') is False:
                # get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    # if no more lines, then give the last label and sequence
    yield curr_label, curr_seq
I often work with very large GenBank files and found (years ago) that the BioPython parsers were too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.
I wrote a pure Python (2) function to return the next whole record from an open file, reading in 1k chunks and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses this function, and a GenBank Record class which has a fasta(self) method to get a FASTA version.
YMMV, but the function that gets the next record is here, and should be pluggable into any iterator scheme you want to use. As far as converting to FASTA goes, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of sections (like ORIGIN) using:
sectionTitle = 'ORIGIN'
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
Subsections like ORGANISM require a left-side pad of 5 spaces.
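For illustration, here is how that section-grabbing regex behaves on a tiny made-up GenBank fragment (the record text is hypothetical):

```python
import re

# hypothetical minimal record text
gbrText = ("LOCUS       EXAMPLE\n"
           "ORIGIN\n"
           "        1 acgtacgtac\n"
           "//\n")

sectionTitle = 'ORIGIN'
# MULTILINE lets ^ match at each line start; DOTALL lets . cross newlines;
# the lazy .+? stops at the first line that begins with a non-space (here '//')
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
print(repr(sectionText))
# → 'ORIGIN\n        1 acgtacgtac\n'
```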
Here's my solution to the main issue:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize = 1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1, 1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr = None
    while True:
        cPos = fHandle.tell()
        c = fHandle.read(cSize)
        if c == '':
            return None
        if not recFound:
            locusPos = c.find('\nLOCUS')
            if sPos == 0 and c.startswith('LOCUS'):
                locusPos = 0
            elif locusPos == -1:
                continue
            if locusPos > 0:
                locusPos += 1
            c = c[locusPos:]
            recFound = True
        else:
            locusPos = 0
        if (len(recChunks) > 0 and
            ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
             or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
             or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
            )):
            eorPos = 0
        else:
            eorPos = c.find('\n//\n', locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos + 4)])
            gbrText = ''.join(recChunks)
            fHandle.seek(cPos - locusPos + eorPos)
            return gbrText