Mapping modified string indices to original string indices in Python

I'm relatively new to programming and wanted to get some help with a problem I have. I need to figure out a way to map the indices of a string back to an original string after removing certain positions. For example, say I had a string:
original_string = 'abcdefgh'
And I removed a few elements to get:
new_string = 'acfh'
I need a way to get the "true" indices of new_string. In other words, I want the indices of the positions I've kept as they were in original_string. Thus returning:
original_indices_of_new_string = [0,2,5,7]
My general approach has been something like this:
I find the positions I've removed in the original_string to get:
removed_positions = [1,3,4,6]
Then given the indices of new_string:
new_string_indices = [0,1,2,3]
Then I think I should be able to do something like this:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
This doesn't really work because the offset is reset to 0 after every loop, which I only want to happen if the corrected_value is in removed_positions (ie. I want to offset 2 for removed_positions 3 and 4 but only 1 if consecutive positions weren't removed).
I need to do this based off positions I've removed rather than those I've kept because further down the line I'll be removing more positions and I'd like to just have an easy function to map those back to the original each time. I also can't just search for the parts I've removed because the real string isn't unique enough to guarantee that the correct portion gets found.
Any help would be much appreciated. I've been using stack overflow for a while now and have always found the question I've had in a previous thread but couldn't find something this time so I decided to post a question myself! Let me know if anything needs clarification.
*Letters in the string are not unique

Given your string original_string = 'abcdefgh' you can create a tuple of the index, and character of each:
>>> li=[(i, c) for i, c in enumerate(original_string)]
>>> li
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h')]
Then remove your desired characters:
>>> new_li=[t for t in li if t[1] not in 'bdeg']
>>> new_li
[(0, 'a'), (2, 'c'), (5, 'f'), (7, 'h')]
Then rejoin that into a string:
>>> ''.join([t[1] for t in new_li])
'acfh'
Your 'answer' is the method used to create new_li and referring to the index there:
>>> ', '.join(map(str, (t[0] for t in new_li)))
0, 2, 5, 7

You can create a new class to deal with this stuff
class String:
    def __init__(self, myString):
        self.myString = myString
        self.myMap = {}
        self.__createMapping(self.myString)
    def __createMapping(self, myString):
        index = 0
        for character in myString:
            # If the character already exists in the map, append the index to the list
            if character in self.myMap:
                self.myMap[character].append(index)
            else:
                self.myMap[character] = [index,]
            index += 1
    def removeCharacters(self, myList):
        for character in self.myString:
            if character in myList:
                self.myString = self.myString.replace(character, '')
                # pop() instead of del: a repeated character would otherwise raise KeyError
                self.myMap.pop(character, None)
        return self.myString
    def getIndeces(self):
        return self.myMap
if __name__ == '__main__':
    myString = String('abcdef')
    print(myString.removeCharacters(['a', 'b'])) # Prints cdef
    print(myString.getIndeces()) # Prints each remaining character and a list of the indices it occurs at
This will give a mapping of the characters and a list of the indices that they occur at. You can add more functionality if you want a single list returned, etc. Hopefully this gives you an idea of how to start.

If removing by index, you simply need to start with a list of all indexes, e.g.: [0, 1, 2, 3, 4] and then, as you remove at each index, remove it from that list. For example, if you're removing indexes 1 and 3, you'll do:
idxlst.remove(1)
idxlst.remove(3)
idxlst # => [0, 2, 4]
[update]: if not removing by index, it's probably easiest to find the index first and then proceed with the above solution, e.g. if removing 'c' from 'abc', do:
i = mystr.index('c')
# remove 'c'
idxlst.remove(i)
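A minimal sketch of this bookkeeping idea, assuming each round of removals is given as positions in the current (already shortened) string. The helper name remove_positions and the index_map list are illustrative and not part of the question:

```python
def remove_positions(text, index_map, positions):
    """Drop `positions` (indices into `text`) and keep `index_map` aligned.

    index_map[k] is the index in the ORIGINAL string of the character text[k].
    """
    drop = set(positions)
    kept = [k for k in range(len(text)) if k not in drop]
    new_text = "".join(text[k] for k in kept)
    new_map = [index_map[k] for k in kept]
    return new_text, new_map


original_string = "abcdefgh"
index_map = list(range(len(original_string)))

# First round: remove positions 1, 3, 4, 6 of the original string.
new_string, index_map = remove_positions(original_string, index_map, [1, 3, 4, 6])
print(new_string, index_map)  # acfh [0, 2, 5, 7]

# Second round: remove position 1 of the shortened string ('c').
newer_string, index_map = remove_positions(new_string, index_map, [1])
print(newer_string, index_map)  # afh [0, 5, 7]
```

Because the map travels with the string, it never relies on the characters being unique.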

Trying to stay as close as possible to what you were originally trying to accomplish, this code should work:
big = 'abcdefgh'
small = 'acfh'
l = []
current = 0
while len(small) > 0:
    if big[current] == small[0]:
        l.append(current)
        small = small[1:]
    # always advance, so a matched position in big is never used twice
    current += 1
print(l)
The idea is working from the front so you don't need to worry about offset.
A precondition is of course that small actually is obtained by removing a few indices from big. Otherwise, an IndexError is thrown. If you need the code to be more robust, just catch the exception at the very end and return an empty list or something. Otherwise the code should work fine.
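The same front-to-back matching can be sketched as a function that reports a failed precondition explicitly. The name subsequence_indices is made up for this example. It always takes the earliest possible match, so with repeated letters the result may differ from the positions that were actually kept:

```python
def subsequence_indices(big, small):
    """Return indices in `big` matching the characters of `small`, in order."""
    indices = []
    pos = 0
    for ch in small:
        # Search only at or after the position following the previous match.
        pos = big.find(ch, pos)
        if pos == -1:
            raise ValueError("small cannot be obtained by deleting characters from big")
        indices.append(pos)
        pos += 1
    return indices


print(subsequence_indices("abcdefgh", "acfh"))  # [0, 2, 5, 7]
```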

Assuming the characters in your input string are unique, this is what is happening with your code:
original_indices_of_new_string = []
for i in new_string_indices:
    offset = 0
    corrected_value = i + offset
    if corrected_value in removed_positions:
        #somehow offset to correct value
        offset+=1
    else:
        original_indices_of_new_string.append(corrected_value)
Setting offset to 0 every time in the loop is as good as having it preset to 0 outside the loop. And if you are adding 0 every time to i in the loop, you might as well use i. That boils down your code to:
original_indices_of_new_string = []
for i in new_string_indices:
    if i in removed_positions:
        #somehow offset to correct value
        pass
    else:
        original_indices_of_new_string.append(i)
This code gives the output [0, 2], and the logic is right (again assuming the characters in the input are unique). What you should be doing is running the loop over the length of original_string. That will give you what you want. Like this:
original_indices_of_new_string = []
for i in range(len(original_string)):
    if i in removed_positions:
        #somehow offset to correct value
        pass
    else:
        original_indices_of_new_string.append(i)
print(original_indices_of_new_string)
This prints:
[0, 2, 5, 7]
A simpler one-liner to achieve the same (only when the characters are unique, since index() returns the first occurrence) would be:
original_indices_of_new_string = [original_string.index(c) for c in new_string]
Hope this helps.

It may help to map the characters in the new string with their positions in the original string in a dictionary and recover the new string like this:
import operator
chars = {'a':0, 'c':2, 'f':5, 'h':7}
sorted_chars = sorted(chars.items(), key=operator.itemgetter(1))
new_string = ''.join([char for char, pos in sorted_chars]) # 'acfh'

Related

Issues removing words from a list in Python

I'm building a Wordle solver. Basically removing words from a list, if they don't have specific characters, or don't have them at specific locations. I'm not concerned about the statistics for optimal choices yet.
When I run the below code (I think all relevant sections are included), my output is clear that it found a letter matching position to the 'word of the day'. But then the next iteration, it will choose a word that doesn't have that letter, when it should only select from remaining words.
Are words not actually being removed? Or is there something shadowing a scope I can't find?
I've rewritten whole sections, with the exact same problem happening.
#Some imports and reading the word list here.
def word_compare(word_of_the_day, choice_word):
    results = []
    index = 0
    letters[:] = choice_word
    for letter in letters:
        if letter is word_of_the_day[index]:
            results.append((letter, 2, index))
        elif letter in word_of_the_day:
            results.append((letter, 1, index))
        else:
            results.append((letter, 0, index))
        index += 1
    print("\nIteration %s\nWord of the Day: %s,\nChoice Word: %s,\nResults: %s" % (
        iteration, word_of_the_day, choice_word, results))
    return results
def remove_wrong_words():
    for item in results:
        if item[1] == 0:
            for word in words:
                if item[0] in word:
                    words.remove(word)
    for item in results:
        if item[1] == 2:
            for word in words:
                if word[item[2]] != item[0]:
                    words.remove(word)
    print("Words Remaining: %s" % len(words))
    return words
words, letters = prep([])
# choice_word = best_word_choice()
choice_word = "crane"
iteration = 1
word_of_the_day = random.choice(words)
while True:
    if choice_word == word_of_the_day:
        break
    else:
        words.remove(choice_word)
        results = word_compare(word_of_the_day, choice_word)
        words = remove_wrong_words()
        if len(words) < 10:
            print(words)
        choice_word = random.choice(words)
        iteration += 1
Output I'm getting:
Iteration 1
Word of the Day: stake,
Choice Word: crane,
Results: [('c', 0, 0), ('r', 0, 1), ('a', 2, 2), ('n', 0, 3), ('e', 2, 4)]
Words Remaining: 386
Iteration 2
Word of the Day: stake,
Choice Word: lease,
Results: [('l', 0, 0), ('e', 1, 1), ('a', 2, 2), ('s', 1, 3), ('e', 2, 4)]
Words Remaining: 112
Iteration 3
Word of the Day: stake,
Choice Word: paste,
Results: [('p', 0, 0), ('a', 1, 1), ('s', 1, 2), ('t', 1, 3), ('e', 2, 4)]
Words Remaining: 81
Iteration 4
Word of the Day: stake,
Choice Word: spite,
... This continues for a while until solved. In this output, 'a' is found to be in the correct place (value of 2 in the tuple) on the second iteration. This should remove all words from the list that don't have 'a' as the third character. Instead 'paste' and 'spite' are chosen for later iterations from that same list, instead of having been removed.

Your issue has to do with removing an item from a list while you iterate over it. This often results in skipping later values, as the list iteration is being handled by index, under the covers.
Specifically, the problem is here (and probably in the other loop too):
for word in words:
    if item[0] in word:
        words.remove(word)
If the if condition is true for the first word in the words list, the second word will not be checked. That's because when the for loop asks the list iterator for the next value, it's going to yield the second value of the list as it now stands, which is going to be the third value from the original list (since the first one is gone).
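A small demonstration of the skipping, using a made-up word list rather than the real one:

```python
words = ["aa", "ab", "bc", "ad"]

# Intention: remove every word containing "a".
for word in words:
    if "a" in word:
        words.remove(word)

# "ab" slid into index 0 after "aa" was removed, so the loop never saw it.
print(words)  # ['ab', 'bc']
```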
There are a few ways you could avoid this problem.
One approach is to iterate on a copy of the list you're going to modify. This means that the iterator won't ever skip over anything, since the copied list is not having anything removed from it as you go (only the original list is changing). A common way to make the copy is with a slice:
for word in words[:]: # iterate on a copy of the list
    if item[0] in word:
        words.remove(word) # modify the original list here
Another option is to build a new list full of the valid values from the original list, rather than removing the invalid ones. A list comprehension is often good enough for this:
words = [word for word in words if item[0] not in word]
This may be slightly complicated in your example because you're using global variables. You would either need to change that design (e.g. accept a list as an argument and return the new version), or add a global words statement to let the function's code rebind the global variable (rather than modifying it in place).
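A sketch of the "accept a list and return the new version" design, assuming the (letter, score, index) tuples from the question and the same two filtering rules. The function name filter_words and the short word list are placeholders:

```python
def filter_words(words, results):
    """Return a new list with the words ruled out by `results` removed."""
    for letter, score, index in results:
        if score == 0:
            # Letter not in the answer: keep only words without it.
            words = [w for w in words if letter not in w]
        elif score == 2:
            # Letter confirmed at this position: keep only matching words.
            words = [w for w in words if w[index] == letter]
    return words


candidates = ["stake", "paste", "spite", "shake", "crane"]
results = [("l", 0, 0), ("e", 1, 1), ("a", 2, 2), ("s", 1, 3), ("e", 2, 4)]
print(filter_words(candidates, results))  # ['stake', 'shake', 'crane']
```

Each comprehension builds a fresh list, so nothing is mutated while it is being iterated.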

I think one of your issues is the following line: if letter is word_of_the_day[index]:. This should be == not is, as the latter checks whether the two objects being compared are the same object (i.e. have the same id()), not whether they have the same value. It happens to work here because CPython caches single-character strings, which is why your output still contains tuples with a value of 2, but that is an implementation detail you shouldn't rely on. There may be more going on but I'd like a concrete example to run before digging in further.

How to use enumerate in a list comprehension with two lists?

I just started to use list comprehensions and I'm struggling with them. In this case, I need to get the index n in each list (sequence_0 and sequence_1) that the iteration is at each time. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) shared by the two sequences. Once a matching pair is found, the program should continue to the next nucleotides of the sequences, checking if they are also equal and then elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue to the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly the reason for the x and y between (), it would be good to understand that too :)
Just to explain, the lists contain DNA sequences, so each one is basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
data = fastaread(arq)
seqs = [list(sequence) for sequence in data.values()]
motifs = [[]]
i = 0
sequence_0, sequence_1 = seqs[0], seqs[1] # just to simplify
for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
if x == y:
print(f'Pairs {"".join(x)} and {"".join(y)} match!')
motifs[i].append(x[0]), motifs[i].append(x[1])
k = sequence_0.index(x[0]) + 2 # IS NOT RETURNING THE RIGHT NUMBER
u = sequence_1.index(y[0]) + 2
print(k, u)
# Determines if the rest of the sequence is compatible
print(f'Starting to elongate the motif {x}...')
for j, m in enumerate(sequence_1[u::]):
try:
# Checks if the nucleotide is equal for both of the sequences
print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
if m == sequence_0[k + j]:
motifs[i].append(m)
print(f'The pair {sequence_0[k + j]}, {m} is equal!')
# Stop in the first nonequal residue
else:
print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
break
except IndexError:
print('IndexError, end of the string')
else:
i += 1
motifs.append([])
return motifs
...

One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will bring something like this to you:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a,b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
It is important to notice that, in the case of two lists with different sizes, the zip function will bring a list of tuples with the size matching the smaller of the lists. The extra elements of the other list will be ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
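As a rough illustration of tracking positions in both sequences at once, here is a brute-force sketch that tries every pair of start positions and measures how far the two sequences agree from there. It is not the asker's algorithm. The names shared_run and longest_shared_motif are invented for this example, and the sequences are assumed to be plain strings:

```python
def shared_run(seq_a, seq_b, start_a, start_b):
    """Count how many consecutive items match from the two start positions."""
    length = 0
    for x, y in zip(seq_a[start_a:], seq_b[start_b:]):
        if x != y:
            break
        length += 1
    return length


def longest_shared_motif(seq_a, seq_b):
    """Return the longest run of items shared by both sequences (first found)."""
    best = ""
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            n = shared_run(seq_a, seq_b, i, j)
            if n > len(best):
                best = seq_a[i:i + n]
    return best


print(longest_shared_motif("ATCAC", "GATCCT"))  # ATC
```

Using range() over both sequences gives the positions directly, which avoids the problem of index() always returning the first occurrence of a nucleotide. The nested loops are slow for long sequences, so this is only meant to show the indexing.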

Comparing set components

I want to make a function which takes two sets A and B and returns True if A is less than B.
def set_less_than(A, B):
    Alist = list(A)
    Blist = list(B)
    ans = []
    for index in range(len(A)):
        try:
            ans.append(Alist[index] < Blist[index])
        except TypeError:
            ans.append("impossible to compare")
    return ans
However, how can I handle this case,
for example:
A = { 0, 'b', (0, 'a') } and B = { 1, 'a', (2, -3) }
I want a output like
[True, False, (True, 'impossible')]

This is a tough question to answer because, as Scott points out, the answer will be nonsensical, as sets are unordered. That is: the items cannot be compared in order. So attempting to grab each item, in order, by index, subsequently won't work. Furthermore, it does not make sense to say that an alphabetic character is less than another one. Same thing applies to tuples.
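If the inputs can be ordered containers (lists instead of sets), the output asked for in the question can be produced element by element. The following is only a sketch under that assumption, for Python 3, where comparing a str with an int raises TypeError. The helper name less_than is made up:

```python
def less_than(a, b):
    """Compare a < b, descending into tuples and flagging incomparable pairs."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        return tuple(less_than(x, y) for x, y in zip(a, b))
    try:
        return a < b
    except TypeError:
        return "impossible"


A = [0, "b", (0, "a")]
B = [1, "a", (2, -3)]
print([less_than(x, y) for x, y in zip(A, B)])
# [True, False, (True, 'impossible')]
```

Note that Python does define < for two strings (lexicographic order), which is why 'b' < 'a' yields False rather than an error.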
But if this is indeed really what you want to do, this is how I'd do it:
See if both items are integers (isinstance()).
If so, compare them and append the True or False value.
If not, append "impossible".
def set_less_than(set_a, set_b):
    list_a = list(set_a)
    list_b = list(set_b)
    ans = []
    for index in range(len(list_a)):
        if isinstance(list_a[index], int) and isinstance(list_b[index], int):
            ans.append(list_a[index] < list_b[index])
        else:
            ans.append("impossible")
    return ans
a = set([0, 'b', (0, 'a')])
b = set([1, 'a', (2, -3)])
answer = set_less_than(a, b)
for item in answer:
    print(item)
In my case, 0 is compared to 'a', etc. Every item compared returns
"impossible".

Replace one item in a string with one item from a list

I have a string and a list:
seq = '01202112'
l = [(0,1,0),(1,1,0)]
I would like a pythonic way of replacing each '2' with the value at the corresponding index in the list l such that I obtain two new strings:
list_seq = ['01001110', '01101110']
By using .replace(), I could iterate through l, but I wondered is there a more pythonic way to get list_seq?

I might do something like this:
out = [''.join(c if c != '2' else str(next(f, c)) for c in seq) for f in map(iter, l)]
The basic idea is that we call iter to turn the tuples in l into iterators. At that point every time we call next on them, we get the next element we need to use instead of the '2'.
If this is too compact, the logic might be easier to read as a function:
def replace(seq, to_replace, fill):
    fill = iter(fill)
    for element in seq:
        if element != to_replace:
            yield element
        else:
            yield next(fill, element)
giving
In [32]: list(replace([1,2,3,2,2,3,1,2,4,2], to_replace=2, fill="apple"))
Out[32]: [1, 'a', 3, 'p', 'p', 3, 1, 'l', 4, 'e']
Thanks to @DanD in the comments for noting that I had assumed I'd always have enough characters to fill from! We'll follow his suggestion to keep the original characters if we run out, but modifying this approach to behave differently is straightforward and left as an exercise for the reader. :-)
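Applied to the data from the question, the generator can be used roughly like this. The function is repeated so the snippet runs on its own, and the str() call is needed because the fill values are integers:

```python
def replace(seq, to_replace, fill):
    """Yield items of seq, swapping each to_replace for the next fill value."""
    fill = iter(fill)
    for element in seq:
        if element != to_replace:
            yield element
        else:
            # Fall back to the original element if fill runs out.
            yield next(fill, element)


seq = '01202112'
l = [(0, 1, 0), (1, 1, 0)]
list_seq = [''.join(str(x) for x in replace(seq, '2', fill)) for fill in l]
print(list_seq)  # ['01001110', '01101110']
```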

[''.join([str(next(digit, 0)) if x == '2' else x for x in seq])
 for digit in map(iter, l)]

I don't know if this solution is 'more pythonic' but:
def my_replace(s, c=None, *other):
    return s if c is None else my_replace(s.replace('2', str(c), 1), *other)
seq = '01202112'
l = [(0,1,0),(1,1,0)]
list_req = [my_replace(seq, *x) for x in l]

seq = '01202112'
li = [(0,1,0),(1,1,0)]
def grunch(s, tu):
    it = iter(map(str, tu))  # iter() so that next() also works where map returns a list
    return ''.join(next(it) if c=='2' else c for c in s)
list_seq = [grunch(seq,tu) for tu in li]
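Another possible variation on the same iterator idea uses re.sub with a function as the replacement, so the string is never walked by hand. The name fill_twos is hypothetical, and running out of digits leaves the remaining '2' characters unchanged:

```python
import re


def fill_twos(seq, digits):
    """Replace each '2' in seq with the next value from digits."""
    it = iter(digits)
    # The replacement function is called once per match, left to right.
    return re.sub('2', lambda m: str(next(it, m.group())), seq)


seq = '01202112'
li = [(0, 1, 0), (1, 1, 0)]
print([fill_twos(seq, tu) for tu in li])  # ['01001110', '01101110']
```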

python - assign letter of the alphabet to each value in a list

I have a list of values in a for loop. e.g. myList = [1,5,7,3] which I am using to create a bar chart (using google charts)
I want to label each value with a letter of the alphabet (A-Z) e.g. A = 1, B = 5, C = 7, D = 3
What is the best way to do this while running through a for loop?
e.g.
for x in myList:
    x.label = LETTER OF THE ALPHABET
The list can be any length, so it won't always be just A to D.
EDIT
myList is a list of objects, not numbers as in my example above. Each object has a title attached to it (could be numbers, text, letters etc.), but these titles are quite long, so they mess things up when displayed on Google charts. Therefore I was going to label the chart with letters going A, B, C, ... up to the length of myList, then have a key on the chart cross-referencing the letters with the actual titles. The length of myList is more than likely going to be less than 10, so there would be no worries about running out of letters.
Hope this clears things up a little

If you want to go on like ..., Y, Z, AA, AB ,... you can use itertools.product:
import string
import itertools
def product_gen(n):
    for r in itertools.count(1):
        for i in itertools.product(n, repeat=r):
            yield "".join(i)
mylist=list(range(35))
for value, label in zip(mylist, product_gen(string.ascii_uppercase)):
    print(value, label)
    # value.label = label
Part of output:
23 X
24 Y
25 Z
26 AA
27 AB
28 AC
29 AD
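If a label is needed for one arbitrary position without generating all the earlier ones, the same A, ..., Z, AA, AB, ... sequence can be computed directly from the index. This is a sketch and the function name column_label is invented here:

```python
def column_label(index):
    """Return the spreadsheet-style label for a zero-based index."""
    label = ""
    index += 1  # work with a one-based count
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


for i in (0, 25, 26, 27, 701, 702):
    print(i, column_label(i))
# 0 A
# 25 Z
# 26 AA
# 27 AB
# 701 ZZ
# 702 AAA
```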

import string
for i, x in enumerate(myList):
    x.label = string.ascii_uppercase[i]
This will of course fail if len(myList) > 26.

import string
myList = [1, 5, 7, 3]
labels = [string.ascii_uppercase[i] for i in range(len(myList))]
# ['A', 'B', 'C', 'D']

for i in range(len(myList)):
    myList[i].label = chr(i+65)
chr() returns the character for a given code point; 65 is 'A'.

charValue = 65
for x in myList:
    x.label = chr(charValue)
    charValue += 1
Be careful if your list is longer than 26 items.

First, if myList is a list of integers, then,
for x in myList:
    x.label = LETTER OF THE ALPHABET
won't work, since int has no attribute label. You could loop over myList and store the labels in a list (here: pairs):
import string
pairs = []
for i, x in enumerate(myList):
    label = string.ascii_letters[i] # will work for i < 52 !!
    pairs.append( (label, x) )
# pairs is now a list of (label, value) pairs
If you need more than 52 labels, you can use some random string generating function, like this one:
import random
import string
def rstring(length=4):
    return ''.join([ random.choice(string.ascii_uppercase) for x in range(length) ])

Since I like list comprehensions, I'd do it like this:
[(i, chr(x+65)) for x, i in enumerate([1, 5, 7, 3])]
Which results in:
[(1, 'A'), (5, 'B'), (7, 'C'), (3, 'D')]

import string
for val in zip(myList, string.ascii_uppercase):
    val[0].label = val[1]

You can also use something like this:
from string import ascii_uppercase as uppercase
res = ((x , uppercase[i%26]*(i//26+1)) for i,x in enumerate(inputList))
Or you can use something like this - note that this is just an idea for how to deal with long lists, not a complete solution:
from string import ascii_uppercase as uppercase
res = ((x , uppercase[i%26] + uppercase[i//26]) for i,x in enumerate(inputList))

Are you looking for a dictionary, where each of your values are keyed to a letter of the alphabet? In that case, you can do:
from string import ascii_lowercase as letters
values = [1, 23, 3544, 23]
mydict = {}
for (let, val) in zip(letters, values):
    mydict[let] = val
# mydict == {'a': 1, 'b': 23, 'c': 3544, 'd': 23}
# mydict['a'] == 1
You'll have to add additional logic if you need to handle lists longer than the alphabet.
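For the chart key described in the question's edit, the same zip idea can map each letter to a long title. The titles below are placeholders standing in for the real objects' titles:

```python
import string

# Placeholder titles; in the question these would come from the objects in myList.
titles = ["Quarterly revenue by region", "Average response time", "Open support tickets"]

# zip stops at the shorter input, so at most 26 labels are produced.
legend = dict(zip(string.ascii_uppercase, titles))

for label, title in legend.items():
    print(label, "=", title)
```

The letters go on the chart and legend supplies the cross-reference for the key.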
