I am working with a data frame which consists of a column with numbers in the format:
[[45, 45, 'D'],[46, 49, 'C'],[50, 66, 'S'],[67, 101, 'C'],[102, 103, 'S'],[104, 106, 'C'],[107, 108, 'S'],[109, 120, 'C'],[121, 121, 'S'],[122, 123, 'C'],[124, 140, 'S'],[141, 149, 'C'],[150, 176, 'S'],[177, 178, 'C'],[179, 181, 'S'],[182, 194, 'C'],[195, 213, 'S'],[214, 217, 'C']]
These numbers correspond to the positions of characters in a string: i.e. the string:
'MGILSFLPVLATESDWADCKSPQPWGHMLLWTAVLFLAPVAGTPAAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAPSSSPMGIIVAVVTGIAVAAIVAAVVALIYCRKKRISALPGYPECREMGETLPEKPANPTNPDEADKVGAENTITYSLLMHPDALEEPDDQNRI'
As you can see, some positions in the string are not covered by any range in the number list (for example, the positions before 45 are missing, because the first range starts at 45). The characters at those uncovered positions have to be removed to create a shorter sequence of letters.
I am able to do this for one line, but I am struggling to do it for every line in the data frame.
This is the code for doing it for one line:
new_s = ''
for item in res:
    new_s += strSeq[item[0]-1:item[1]]
print(len(new_s), new_s)
And this is what I have been trying in order to do it for all lines:
shortenedSeq_list =[]
counter=0
stringstring=[]
for rows in df.itertuples():
    strSeq2 = [rows.sequence]
    strremove2 = [rows.shortened_mobidb_consensus]
    for item in strremove2:
        res = ast.literal_eval(item)
        for item in res:
            stringstring.append(strSeq2[item[0]-1:item[1]])
stringstring
But this results in the output :
[],
[],
[],
[],
[],
[],
[],
[],
[],
['MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS'],
[],
[],
Whereas I want each line in the list to be the sequence which has been shortened.
I ultimately want to add this list as a column in the dataframe.
UPDATE
The numbers are outputted as a string rather than a list, so res is the numbers as a list, and this is the working code output:
173 AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAP
Here 173 is the length of the shortened sequence, followed by the sequence itself.
df sample:
shortened_mobidb_consensus;sequence
[[45, 45, 'D'], [46, 49, 'C'], [50, 66, 'S'], [67, 101, 'C'], [102, 103, 'S'], [104, 106, 'C'], [107, 108, 'S'], [109, 120, 'C'], [121, 121, 'S'], [122, 123, 'C'], [124, 140, 'S'], [141, 149, 'C'], [150, 176, 'S'], [177, 178, 'C'], [179, 181, 'S'], [182, 194, 'C'], [195, 213, 'S'], [214, 217, 'C']];MGILSFLPVLATESDWADCKSPQPWGHMLLWTAVLFLAPVAGTPAAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAPSSSPMGIIVAVVTGIAVAAIVAAVVALIYCRKKRISALPGYPECREMGETLPEKPANPTNPDEADKVGAENTITYSLLMHPDALEEPDDQNRI
[[1, 1, 'D'], [2, 143, 'S'], [144, 145, 'C']];MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS
[[1, 145, 'S']];MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS
[[1, 1, 'D'], [2, 2, 'C'], [3, 37, 'S'], [38, 39, 'C'], [40, 40, 'S'], [41, 41, 'C'], [42, 62, 'S'], [63, 65, 'C'], [66, 231, 'S']];MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTIKDSGEEEIKSVIEKINSKSIKVDTFVCAAGGWSGGNASSDEFLKSVKGMIDMNLYSAFASAHIGAKLLNQGGLFVLTGASAALNRTSGMIAYGATKAATHHIIKDLASENGGLPAGSTSLGILPVTLDTPTNRKYMSDANFDDWTPLSEVAEKLFEWSTNSDSRPTNGSLVKFETKSKVTTWTNL
[[24, 29, 'D'], [30, 91, 'S'], [92, 92, 'D']];MKVSTTALAVLLCTMTLCNQVFSAPYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRNRQICADSKETWVQEYITDLELNA
Solution 1:
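import ast
import pandas as pd
# Parse the first column into real lists while reading the file
df = pd.read_csv('stringsample.txt',sep=';',converters={0:ast.literal_eval})
for index, row in df.iterrows():
    new_s = ''
    res = row.shortened_mobidb_consensus
    for item in res:
        # Positions are 1-based and inclusive, hence the -1 on the start
        new_s += row.sequence[item[0]-1:item[1]]
    df.loc[index,'output'] = new_s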
df['output']
0 AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNL...
1 MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGS...
2 MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGS...
3 MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTI...
4 APYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRN...
Name: output, dtype: object
Solution 2: (Fixing your code)
df = pd.read_csv('stringsample.txt',sep=';')
shortenedSeq_list =[]
counter=0
stringstring=[]
for rows in df.itertuples():
    strSeq2 = rows.sequence
    strremove2 = rows.shortened_mobidb_consensus
    res = ast.literal_eval(strremove2)
    new_s = ''
    for item in res:
        new_s += strSeq2[item[0]-1:item[1]]
    stringstring.append(new_s)
stringstring
['AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAP',
'MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS',
'MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS',
'MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTIKDSGEEEIKSVIEKINSKSIKVDTFVCAAGGWSGGNASSDEFLKSVKGMIDMNLYSAFASAHIGAKLLNQGGLFVLTGASAALNRTSGMIAYGATKAATHHIIKDLASENGGLPAGSTSLGILPVTLDTPTNRKYMSDANFDDWTPLSEVAEKLFEWSTNSDSRPTNGSLVKFETKSKVTTWTNL',
'APYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRNRQICADSKETWVQEYITDLELNA']
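Both solutions repeat the same slicing loop, so it may be convenient to pull it into a small helper. The following is only a sketch: the name `shorten` is made up for illustration, and it assumes the positions are 1-based and inclusive, as in the code above. It accepts the ranges either as a list or as the string form stored in the file.

```python
import ast

def shorten(sequence, regions):
    """Keep only the characters of `sequence` covered by `regions`.

    `regions` is a list like [[start, end, label], ...] with 1-based,
    inclusive positions, or the string representation of such a list.
    """
    if isinstance(regions, str):
        regions = ast.literal_eval(regions)
    # start - 1 converts the 1-based start to a 0-based slice index;
    # `end` stays as-is because Python slices exclude the stop index.
    return "".join(sequence[start - 1:end] for start, end, _label in regions)

# Toy example: keep positions 3-4 and 6 of a ten-letter string.
print(shorten("ABCDEFGHIJ", "[[3, 4, 'D'], [6, 6, 'S']]"))  # CDF
```

With the dataframe from the question, the new column could then be built in one line, for example `df['output'] = [shorten(s, r) for s, r in zip(df['sequence'], df['shortened_mobidb_consensus'])]`.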
Related
How can I take a string like this
string = "image1 [{'box': [35, 0, 112, 36], 'score': 0.8626706004142761, 'label': 'FACE_F'}, {'box': [71, 80, 149, 149], 'score': 0.8010843992233276, 'label': 'FACE_F'}, {'box': [0, 81, 80, 149], 'score': 0.7892318964004517, 'label': 'FACE_F'}]"
and turn it into variables like this?
filename = "image1"
box = [35, 0, 112, 36]
score = 0.8010843992233276
label = "FACE_F"
or, if there is more than one box, score, or label,
filename = "image1"
box = [[71, 80, 149, 149], [35, 0, 112, 36], [0, 81, 80, 149]]
score = [0.8010843992233276, 0.8626706004142761, 0.7892318964004517]
label = ["FACE_F", "FACE_F", "FACE_F"]
This is how far I've gotten:
log = open(r'C:\Users\15868\Desktop\python\log.txt', "r")
data = log.readline()
log.close()
print(data)
filename = data.split(" ")[0]
info = data.rsplit(" ")[1]
print(filename)
print(info)
output
[{'box':
image1
Here is how I would do it:
import ast
string = "image1 [{'box': [35, 0, 112, 36], 'score': 0.8626706004142761, 'label': 'FACE_F'}, {'box': [71, 80, 149, 149], 'score': 0.8010843992233276, 'label': 'FACE_F'}, {'box': [0, 81, 80, 149], 'score': 0.7892318964004517, 'label': 'FACE_F'}]"
filename, data = string.split(' ', 1)
data = ast.literal_eval(data)
print(filename)
print(data)
Output:
image1
[{'box': [35, 0, 112, 36], 'score': 0.8626706004142761, 'label': 'FACE_F'}, {'box': [71, 80, 149, 149], 'score': 0.8010843992233276, 'label': 'FACE_F'}, {'box': [0, 81, 80, 149], 'score': 0.7892318964004517, 'label': 'FACE_F'}]
(updated to follow your example of combining the keys):
From there I'd just write some simple code, something like:
box = []
score = []
label = []
for row in data:
    box.append(row['box'])
    score.append(row['score'])
    label.append(row['label'])
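That loop is the most straightforward way to unpack the data, but there are fancier ways. For example, the following uses a generator expression (not a set, which would not preserve order) and relies on every dict listing its keys in the order box, score, label:
box, score, label = zip(*(x.values() for x in data))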
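Putting the pieces together, the whole task can be wrapped in one function. This is a minimal sketch, assuming each line has the form `name [list of dicts]` with exactly one space before the list; the name `parse_line` is hypothetical, and the scores in the sample line are shortened for readability.

```python
import ast

def parse_line(line):
    """Split 'name [{...}, ...]' into the name and one list per key."""
    # Split only on the first space so the list literal stays intact.
    filename, raw = line.split(" ", 1)
    records = ast.literal_eval(raw.strip())
    boxes = [rec["box"] for rec in records]
    scores = [rec["score"] for rec in records]
    labels = [rec["label"] for rec in records]
    return filename, boxes, scores, labels

line = ("image1 [{'box': [35, 0, 112, 36], 'score': 0.86, 'label': 'FACE_F'}, "
        "{'box': [71, 80, 149, 149], 'score': 0.80, 'label': 'FACE_F'}]")
filename, boxes, scores, labels = parse_line(line)
print(filename)  # image1
print(boxes)     # [[35, 0, 112, 36], [71, 80, 149, 149]]
```

Looking keys up by name, rather than relying on `values()`, keeps the result correct even if the dicts list their keys in a different order.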
I have this dict:
{'q1': [5, 6, 90, 91, 119, 144, 181, 399],
'q2': [236, 166],
'q3': [552, 401, 1297, 1296],
}
And I'd like to prepend a 'd' to every element within each value's list:
{'q1': ['d5', 'd6', 'd90', 'd91', 'd119', 'd144', 'd181', 'd399'],
'q2': ['d236', 'd166'],
'q3': ['d552', 'd401', 'd1297', 'd1296'],
}
I have tried out = {k: 'd'+str(v) for k,v in out.items()} but this only adds the 'd' to the outside of each value's list:
{'q1': d[5, 6, 90, 91, 119, 144, 181, 399],
'q2': d[236, 166],
'q3': d[552, 401, 1297, 1296],
}
I imagine I have to do a sort of list comprehension within the dict comprehension, but I am not sure how to implement.
Try:
dct = {
"q1": [5, 6, 90, 91, 119, 144, 181, 399],
"q2": [236, 166],
"q3": [552, 401, 1297, 1296],
}
for v in dct.values():
    v[:] = (f"d{i}" for i in v)
print(dct)
Prints:
{
"q1": ["d5", "d6", "d90", "d91", "d119", "d144", "d181", "d399"],
"q2": ["d236", "d166"],
"q3": ["d552", "d401", "d1297", "d1296"],
}
Try using an f-string in a nested comprehension if you desire to keep your original dict unchanged:
>>> d = {'q1': [5, 6, 90, 91, 119, 144, 181, 399], 'q2': [236, 166], 'q3': [552, 401, 1297, 1296]}
>>> {k: [f'd{x}' for x in v] for k, v in d.items()}
{'q1': ['d5', 'd6', 'd90', 'd91', 'd119', 'd144', 'd181', 'd399'], 'q2': ['d236', 'd166'], 'q3': ['d552', 'd401', 'd1297', 'd1296']}
Basically I have something like this :
letters = "ABNJDSJHIUOIUIYEIUWEYIUJHAJHSGJHASNMVFDJHKIUYEIUWYEWUIEYUIUYIEJSGCDJHDS"
And I want to find the indices of the letters that come after, let's say, M. I want to do something like :
import numpy as np
letters = "ABNJDSJHIUOIUIYEIUWEYIUJHAJHSGJHASNMVFDJHKIUYEIUWYEWUIEYUIUYIEJSGCDJHDS"
# - test
np_array = np.array(np.where(letters > chr(77))[0])
Is this possible? Or do I have to do something like letters not in ...?
Convert letters to a character array:
>>> ar = np.array(list(letters))
>>> ar
array(['A', 'B', 'N', 'J', 'D', 'S', 'J', 'H', 'I', 'U', 'O', 'I', 'U',
'I', 'Y', 'E', 'I', 'U', 'W', 'E', 'Y', 'I', 'U', 'J', 'H', 'A',
'J', 'H', 'S', 'G', 'J', 'H', 'A', 'S', 'N', 'M', 'V', 'F', 'D',
'J', 'H', 'K', 'I', 'U', 'Y', 'E', 'I', 'U', 'W', 'Y', 'E', 'W',
'U', 'I', 'E', 'Y', 'U', 'I', 'U', 'Y', 'I', 'E', 'J', 'S', 'G',
'C', 'D', 'J', 'H', 'D', 'S'], dtype='<U1')
>>> np.where(ar > 'M')[0]
array([ 2, 5, 9, 10, 12, 14, 17, 18, 20, 22, 28, 33, 34, 36, 43, 44, 47,
48, 49, 51, 52, 55, 56, 58, 59, 63, 70], dtype=int64)
Byte arrays can also be used:
>>> ar = np.array(bytearray(letters.encode()))
>>> ar
array([65, 66, 78, 74, 68, 83, 74, 72, 73, 85, 79, 73, 85, 73, 89, 69, 73,
85, 87, 69, 89, 73, 85, 74, 72, 65, 74, 72, 83, 71, 74, 72, 65, 83,
78, 77, 86, 70, 68, 74, 72, 75, 73, 85, 89, 69, 73, 85, 87, 89, 69,
87, 85, 73, 69, 89, 85, 73, 85, 89, 73, 69, 74, 83, 71, 67, 68, 74,
72, 68, 83], dtype=uint8)
>>> np.where(ar > ord('M'))[0]
array([ 2, 5, 9, 10, 12, 14, 17, 18, 20, 22, 28, 33, 34, 36, 43, 44, 47,
48, 49, 51, 52, 55, 56, 58, 59, 63, 70], dtype=int64)
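If NumPy is not otherwise needed, the same indices can be found with plain Python. This is just an alternative sketch, not part of the answer above; it compares single characters directly, which works because Python orders strings by code point.

```python
letters = "ABNJDSJHIUOIUIYEIUWEYIUJHAJHSGJHASNMVFDJHKIUYEIUWYEWUIEYUIUYIEJSGCDJHDS"

# Collect the position of every character that sorts after 'M'.
indices = [i for i, ch in enumerate(letters) if ch > "M"]

print(indices[:5])   # [2, 5, 9, 10, 12]
print(len(indices))  # 27
```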
I am trying to write a program that checks if smaller words are found within a larger word. For example, the word "computer" contains the words "put", "rum", "cut", etc. To perform the check I am trying to code each word as a product of prime numbers, that way the smaller words will all be factors of the larger word. I have a list of letters and a list of primes and have assigned (I think) an integer value to each letter:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
61, 67, 71, 73, 79, 83, 89, 97, 101]
index = 0
while index <= len(letters)-1:
    letters[index] = primes[index]
    index += 1
The problem I am having now is how to get the integer code for a given word and be able to create the codes for a whole list of words. For example, I want to be able to input the word "cab," and have the code generate its integer value of 5*2*3 = 30.
Any help would be much appreciated.
from functools import reduce # only needed for Python 3.x
from operator import mul
primes = [
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
]
lookup = dict(zip("abcdefghijklmnopqrstuvwxyz", primes))
def encode(s):
    # The initial value 1 makes encode("") return 1 instead of raising TypeError
    return reduce(mul, (lookup.get(ch, 1) for ch in s.lower()), 1)
then
encode("cat") # => 710
encode("act") # => 710
Edit: more to the point,
def is_anagram(s1, s2):
    """
    s1 consists of the same letters as s2, rearranged
    """
    return encode(s1) == encode(s2)
def is_subset(s1, s2):
    """
    s1 consists of some letters from s2, rearranged
    """
    return encode(s2) % encode(s1) == 0
then
is_anagram("cat", "act") # => True
is_subset("cat", "tactful") # => True
I would use a dict here to look-up the prime for a given letter:
In [1]: letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
In [2]: primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
61, 67, 71, 73, 79, 83, 89, 97, 101]
In [3]: lookup = dict(zip(letters, primes))
In [4]: lookup['a']
Out[4]: 2
This will let you easily determine the list of primes for a given word:
In [5]: [lookup[letter] for letter in "computer"]
Out[5]: [5, 47, 41, 53, 73, 71, 11, 61]
To find the product of those primes:
In [6]: import operator
In [7]: reduce(operator.mul, [lookup[letter] for letter in "cab"])
Out[7]: 30
You've got your two lists set up, so now you just need to iterate over each character in a word and determine what value that letter gives you.
Something like
total = 1
for letter in word:
    index = letters.index(letter)
    total *= primes[index]
Or whichever operation you decide to use.
You would generalize that to a list of words.
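As a rough sketch of that generalization, the snippet below encodes a whole list of words and then uses divisibility to pick out the words contained in a larger one. It assumes Python 3.8+ for `math.prod`; the helper names `encode_all` and `words_within` are made up for illustration, and characters outside a-z are simply skipped.

```python
from math import prod  # Python 3.8+
from string import ascii_lowercase

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
LOOKUP = dict(zip(ascii_lowercase, PRIMES))

def encode(word):
    """Product of the primes assigned to the letters of `word`."""
    return prod(LOOKUP[ch] for ch in word.lower() if ch in LOOKUP)

def encode_all(words):
    """Map every word in `words` to its integer code."""
    return {word: encode(word) for word in words}

def words_within(big, candidates):
    """Candidates whose letters all occur (with multiplicity) in `big`."""
    big_code = encode(big)
    return [w for w in candidates if big_code % encode(w) == 0]

codes = encode_all(["cab", "cat", "computer"])
print(codes["cab"])  # 30
print(words_within("computer", ["put", "rum", "cut", "cab"]))  # ['put', 'rum', 'cut']
```

Because a repeated letter contributes its prime more than once, the divisibility test also respects letter counts: a word with two t's only divides a code that contains 71 at least twice.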
Hmmmm... It isn't very clear how this code is supposed to run. If it is built to find words in the English dictionary, think about using PyEnchant, a module for checking if words are in the dictionary. Something you could try is this:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
word = raw_input('What is your word? ')
word = list(word)
total = 1
nums = []
for k in word:
    nums.append(primes[letters.index(k)])
for k in nums:
    total = total*k
print total
This will output as:
>>> What is your word? cat
710
>>>
This is correct, as 5*2*71 equals 710
Reference: Is there a faster way of converting a number to a name?
In the question referenced above, a solution was found for turning a number into a name. This question asks just the opposite. How can you convert a name back into a number? So far, this is what I have:
>>> import string
>>> HEAD_CHAR = ''.join(sorted(string.ascii_letters + '_'))
>>> TAIL_CHAR = ''.join(sorted(string.digits + HEAD_CHAR))
>>> HEAD_BASE, TAIL_BASE = len(HEAD_CHAR), len(TAIL_CHAR)
>>> def number_to_name(number):
    "Convert a number into a valid identifier."
    if number < HEAD_BASE:
        return HEAD_CHAR[number]
    q, r = divmod(number - HEAD_BASE, TAIL_BASE)
    return number_to_name(q) + TAIL_CHAR[r]
>>> [number_to_name(n) for n in range(117)]
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'AA', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN', 'AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW', 'AX', 'AY', 'AZ', 'A_', 'Aa', 'Ab', 'Ac', 'Ad', 'Ae', 'Af', 'Ag', 'Ah', 'Ai', 'Aj', 'Ak', 'Al', 'Am', 'An', 'Ao', 'Ap', 'Aq', 'Ar', 'As', 'At', 'Au', 'Av', 'Aw', 'Ax', 'Ay', 'Az', 'B0']
>>> def name_to_number(name):
    assert name, 'Name must exist!'
    head, *tail = name
    number = HEAD_CHAR.index(head)
    for position, char in enumerate(tail):
        if position:
            number *= TAIL_BASE
        else:
            number += HEAD_BASE
        number += TAIL_CHAR.index(char)
    return number
>>> [name_to_number(number_to_name(n)) for n in range(117)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 54]
The function number_to_name works perfectly, and name_to_number works up until it gets to number 116. At that point, the function returns 54 instead. Does anyone see the code's problem?
Solution based on recursive's answer:
import string
HEAD_CHAR = ''.join(sorted(string.ascii_letters + '_'))
TAIL_CHAR = ''.join(sorted(string.digits + HEAD_CHAR))
HEAD_BASE, TAIL_BASE = len(HEAD_CHAR), len(TAIL_CHAR)
def name_to_number(name):
    if not name.isidentifier():
        raise ValueError('Name must be a Python identifier!')
    head, *tail = name
    number = HEAD_CHAR.index(head)
    for char in tail:
        number *= TAIL_BASE
        number += TAIL_CHAR.index(char)
    return number + sum(HEAD_BASE * TAIL_BASE ** p for p in range(len(tail)))
Unfortunately, these identifiers don't yield to traditional constant base encoding techniques. For example "A" acts like a zero, but leading "A"s change the value. In normal number systems leading zeroes do not. There could be multiple approaches, but I settled on one that calculates the total number of identifiers with fewer digits, and starts from that.
from functools import reduce  # reduce is not a builtin in Python 3
def name_to_number(name):
    assert name, 'Name must exist!'
    # Number of identifiers that are shorter than this name
    skipped = sum(HEAD_BASE * TAIL_BASE ** i for i in range(len(name) - 1))
    val = reduce(
        lambda a,b: a * TAIL_BASE + TAIL_CHAR.index(b),
        name[1:],
        HEAD_CHAR.index(name[0]))
    return val + skipped
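One way to gain confidence in any of these inverses is a round-trip check against `number_to_name`. The sketch below is not taken from either answer: it uses an equivalent formulation that undoes the recursion one character at a time (each tail character multiplies by `TAIL_BASE` and adds back the `HEAD_BASE` offset that `number_to_name` subtracted), and then verifies it over an arbitrarily chosen range.

```python
import string

HEAD_CHAR = "".join(sorted(string.ascii_letters + "_"))
TAIL_CHAR = "".join(sorted(string.digits + HEAD_CHAR))
HEAD_BASE, TAIL_BASE = len(HEAD_CHAR), len(TAIL_CHAR)

def number_to_name(number):
    """Convert a number into a valid identifier (from the question)."""
    if number < HEAD_BASE:
        return HEAD_CHAR[number]
    q, r = divmod(number - HEAD_BASE, TAIL_BASE)
    return number_to_name(q) + TAIL_CHAR[r]

def name_to_number(name):
    """Invert number_to_name by undoing one recursion step per tail char."""
    head, *tail = name
    number = HEAD_CHAR.index(head)
    for char in tail:
        # Reverse of: q, r = divmod(number - HEAD_BASE, TAIL_BASE)
        number = number * TAIL_BASE + HEAD_BASE + TAIL_CHAR.index(char)
    return number

# Round-trip check over the first 10000 numbers.
assert all(name_to_number(number_to_name(n)) == n for n in range(10000))
print(name_to_number("B0"))  # 116
```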