split int and string to tuple from string python - python

i have a list of string with names and numbers like :
["mike5","john","sara2","bob","nick6"]
and i want to create from each string a tuple (name,age) like this :
[('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 5)]
so if a string doesn't contain a number the age is 0
what is the simplest way to do it?
i tried to use :
temp = re.compile("([a-zA-Z]+)([0-9]+)")
res = temp.match(type).group()
but it fails

You can use the following regex to find the name and the number ([a-z]+)(\d+)?, along with .groups(0) as default value (see match.groups())
def split_vals(word):
name, number = re.search(r"([a-z]+)(\d+)?", word).groups(0)
return name, int(number)
values = ["mike5", "john", "sara2", "bob", "nick6"]
values = [split_vals(value) for value in values]
# [('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 6)]

If fails because your match doesn't return anything:
temp.match('john') is None
True
You need to change your regex to:
# The * means 0 or more. Otherwise, you've required a number to be present
temp = re.compile("([a-zA-Z]+)([0-9]*)")
temp.match('john')
<re.Match object; span=(0, 4), match='john'>
Last, if you want tuples, use groups(), not group()
[temp.match(item).groups() for item in x]
[('mike', '5'), ('john', ''), ('sara', '2'), ('bob', ''), ('nick', '6')]

A couple of things:
The regex is correct up to [0-9]+. This means you MUST match 1 or more digits. However, not all your strings will have a digit present such as john, so I would suggest using * which matches zero or more digits.
You are using the syntax pattern.match(string) which will throw an error. You need to use the syntax match(pattern, string) (see below for further clarification).
In addition, using groups() instead of group() will return a tuple of all the captured matches within your regex (again see below).
Using a loop to iterate over your items and an if statement you should be able to achieve your desired result:
lst=["mike5","john","sara2","bob","nick6"]
pattern = re.compile("([a-zA-Z]+)([0-9]*)")
name_age = []
for value in lst:
name,age = re.match(pattern,value).groups()
if not age: age = 0
name_age.append((name,age))
print(name_age)

import re
inArr = ["mike5","john","sara2","bob","nick6"]
outArr = []
for item in inArr:
regexResult = re.search('([a-z]+)(\d?)', item, re.IGNORECASE)
if regexResult:
name = regexResult.group(1)
age = regexResult.group(2) or 0
outArr.append((name, int(age))
print(outArr) # [('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 6)]

Related

Walking throughout syntax tree recursively

I have a sentence which is syntactically parsed. For example, "My mom wants to cook". Parsing is [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]. The numbers mean the indexes of items the words depend on: 'mom' depends on 'wants' and 'wants' is the second element of the array (starting from zero as usual). 'Wants' has '-1' because that is the core of sentence, it doesn't depend on anything else. I need to GET the SUBJECT which is 'my mom' here. How can I do this?
To this moment, I have only tried writing loops which work not in every case. The deal is that the subject may consist of more than 2 words, and that number is undefined. Something like this...
'Values' is [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
for indx, value in enumerate(values):
m = morph.parse(value[0])
if isinstance(m, list):
m = m[0]
if 'NOUN' in m.tag:
if value[1] == str(index[0]): #it checks if the word (part of the subject) depends on the verb
terms.append([value[0], indx])
if len(terms) > 0:
term = terms[0][0]
t = []
for indx, value in enumerate(values):
if value[1] == str(terms[0][1]): #it checks if the word depend on the found part of the subject
m = morph.parse(value[0])
if isinstance(m, list):
m = m[0]
if 'NOUN' in m.tag:
t.append([value[0], terms[0][0]])
The algorithm should work like this: walks the whole array and stops when it finds all the dependencies of the given word and all the dependencies of these dependencies. (In the example, all the dependencies of 'mom'). Please, help!
Sorry that I took so long to get back to you.
I finally figured it out. Code is at the bottom, but I'll explain how it works first:
when parse(values) is called it iterates through the sentence in values and calls the recursive getDepth for each word. ```getDepth`` computes how many word relations are between a given word and the verb.
E.g. for "The", depth is 1, because it directly calls the verb.
for "King", it is 2, because "King" calls "The" and "The" calls the verb (Where call = have a pointer that points to a certain word)
Once all depths are computed, parse finds the word with the highest depth ("France") and uses recursive traceFrom() to string together the subject.
All you really care a about is parse() which takes the preparsed string like [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)] and spits out the complete subject. Works for both examples, but you should test some more.
values = [('The', 4), ('King', 0), ('of', 1), ('France', 2), ('died', -1)]
#values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
def getDepth(i, s, n = 0):
print('D is %d' % n)
if s[i][1] == -1:
return n
else:
return getDepth(s[i][1], s, n+1)
def traceFrom(m, s, dt=[]):
print('Got n:%d, s:' % m, s)
if s[m][1] == -1:
d = []
n = 0
for i in range(len(s)):
if i in dt:
d.append(s[i][0])
return " ".join(d)
else:
dt.append(m)
return traceFrom(s[m][1], s, dt)
def parse(sentence):
d = []
for i in range(len(sentence)):
d.append(getDepth(i, sentence))
m = d.index(max(d))
print('Largest is ' , d[m], ' of ', d)
return traceFrom(m, sentence)
print('Subject :', parse(values))
Given your preparsed array, this is quite easy:
values = [('My', 1), ('mom', 2), ('wants', -1), ('to', 2), ('cook', 3)]
def parseFrom(n, data):
if values[n][1] != -1:
#print('Stepping (%s,%d)' % (values[n][0], values[n][1]))
data.append(values[n][0])
return parseFrom(values[n][1], data)
else:
#print('At verb')
return data
subject = ' '.join(parseFrom(0, []))
print('Sentence has subject:', subject)
The function is recursive if the current word is not the verb, otherwise returns the subject as an array. Sorry if it doesn't work on all sentences

How to iterate over all overlapping matches in string?

I want to iterate over all overlapping matches of terms in a sentence/string. If it is not possible to solve the problem with just a single regex, it is fine to go with more than one -- but the number of expressions should be independent of the number of terms.
>>> import re
>>> text = 'X has effect on Y.'
>>> terms = ['has effect', 'effect', 'effect on']
>>> pattern = r"(?=\b(" + "|".join(re.escape(term) for term in terms) + r")\b)"
>>> print pattern
(?=\b(has\ effect|effect|effect\ on)\b)
>>> [(m.span(1), text[m.start(1):m.end(1)]) for m in re.finditer(pattern, text)]
[((2, 12), 'has effect'), ((6, 12), 'effect')]
I'm still not able to extract 'effect on' term -- I think I stuck with 'lookbehind assertion'. Thanks!
EDIT #1:
Problem description in terms of input/output specs with example:
Input:
a sentence, e.g. 'X has effect on Y.'
a list of terms, e.g. ['has effect', 'effect', 'effect on']
Output:
a list of mentions (((start, end), term) tuples), e.g. [((2, 12), 'has effect'), ((6, 12), 'effect'), ((6, 15), 'effect on')]

How to count number of occurrences of a chracter in a string (list)

I'm trying to count the occurrence of each character for any given string input, the occurrences must be output in ascending order( includes numbers and exclamation marks)
I have this for my code so far, i am aware of the Counter function, but it does not output the answer in the format I'd like it to, and I do not know how to format Counter. Instead I'm trying to find away to use the count() to count each character. I've also seen the dictionary function , but I'd be hoping that there is a easier way to do it with count()
from collections import Counter
sentence=input("Enter a sentence b'y: ")
lowercase=sentence.lower()
list1=list(lowercase)
list1.sort()
length=len(list1)
list2=list1.count(list1)
print(list2)
p=Counter(list1)
print(p)
collections.Counter objects provide a most_common() method that returns a list of tuples in decreasing frequency. So, if you want it in ascending frequency, reverse the list:
from collections import Counter
sentence = input("Enter a sentence: ")
c = Counter(sentence.lower())
result = reversed(c.most_common())
print(list(result))
Demo run
Enter a sentence: Here are 3 sentences. This is the first one. Here is the second. The end!
[('a', 1), ('!', 1), ('3', 1), ('f', 1), ('d', 2), ('o', 2), ('c', 2), ('.', 3), ('r', 4), ('i', 4), ('n', 5), ('t', 6), ('h', 6), ('s', 7), (' ', 14), ('e', 14)]
Just call .most_common and reverse the output with reversed to get the output from least to most common:
from collections import Counter
sentence= "foobar bar"
lowercase = sentence.lower()
for k, count in reversed(Counter(lowercase).most_common()):
print(k,count)
If you just want to format the Counter output differently:
for key, value in Counter(list1).items():
print('%s: %s' % (key, value))
Your best bet is to use Counter (which does work on a string) and then sort on it's output.
from collections import Counter
sentence = input("Enter a sentence b'y: ")
lowercase = sentence.lower()
# Counter will work on strings
p = Counter(lowercase)
count = Counter.items()
# count is now (more or less) equivalent to
# [('a', 1), ('r', 1), ('b', 1), ('o', 2), ('f', 1)]
# And now you can run your sort
sorted_count = sorted(count)
# Which will sort by the letter. If you wanted to
# sort by quantity, tell the sort to use the
# second element of the tuple by setting key:
# sorted_count = sorted(count, key=lambda x:x[1])
for letter, count in sorted_count:
# will cycle through in order of letters.
# format as you wish
print(letter, count)
Another way to avoid using Counter.
sentence = 'abc 11 222 a AAnn zzz?? !'
list1 = list(sentence.lower())
#If you want to remove the spaces.
#list1 = list(sentence.replace(" ", ""))
#Removing duplicate characters from the string
sentence = ''.join(set(list1))
dict = {}
for char in sentence:
dict[char] = list1.count(char)
for item in sorted(dict.items(), key=lambda x: x[1]):
print 'Number of Occurences of %s is %d.' % (item[0], item[1])
Output:
Number of Occurences of c is 1.
Number of Occurences of b is 1.
Number of Occurences of ! is 1.
Number of Occurences of n is 2.
Number of Occurences of 1 is 2.
Number of Occurences of ? is 2.
Number of Occurences of 2 is 3.
Number of Occurences of z is 3.
Number of Occurences of a is 4.
Number of Occurences of is 6.
One way to do this would be by removing instances of your sub string and looking at the length...
def nofsub(s,ss):
return((len(s)-len(s.replace(ss,"")))/len(ss))
alternatively you could use re or regular expressions,
from re import *
def nofsub(s,ss):
return(len(findall(compile(ss), s)))
finally you could count them manually,
def nofsub(s,ss):
return(len([k for n,k in enumerate(s) if s[n:n+len(ss)]==ss]))
Test any of the three with...
>>> nofsub("asdfasdfasdfasdfasdf",'asdf')
5
Now that you can count any given character you can iterate through your string's unique characters and apply a counter for each unique character you find. Then sort and print the result.
def countChars(s):
s = s.lower()
d = {}
for k in set(s):
d[k]=nofsub(s,k)
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
print "%s: %s" % (key, value)
You could use the list function to break the words apart`from collections
from collections import Counter
sentence=raw_input("Enter a sentence b'y: ")
lowercase=sentence.lower()
list1=list(lowercase)
list(list1)
length=len(list1)
list2=list1.count(list1)
print(list2)
p=Counter(list1)
print(p)

A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.
For example, find all the repetitive motifs in
string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
Here the repetitive motifs:
'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
So, the output should be something like this:
output = {'ACGT': {'repeat': 2,
'region': (5,13)},
'GT': {'repeat': 3,
'region': (19,24)},
'TATACG': {'repeat': 2,
'region': (29,40)}}
This example comes from a typical biological phenomena termed microsatellite which is present into the DNA.
UPDATE 1: Asterisks were removed from the string variable. It was a mistake.
UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.
you can use a recursion function as following :
Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)
import re
def finder(st,past_ind=0,result=[]):
m=re.search(r'(.+)\1+',st)
if m:
i,j=m.span()
sub=st[i:j]
ind = (sub+sub).find(sub, 1)
sub=sub[:ind]
if len(sub)>1:
result.append([sub,(i+past_ind+1,j+past_ind+1)])
past_ind+=j
return finder(st[j:],past_ind)
else:
return result
s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)
result:
[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
answer to previous question for following string :
s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
You can use both answers from mentioned question and some extra recipes :
First you can split the string with ** then create a new list contain the repeated strings with r'(.+)\1+' regex :
So the result will be :
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
Note about 'ACGTACGT' that missed the A at the end!
Then you can use principal_period's function to get the repeated sub strings :
def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]
>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...
So you will have the repeated strings in l and main strings in sub :
>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
Then you need a the region that you can do it with span method :
>>> for t in sub:
... regons.append(re.search(t,s).span())
>>> regons
[(6, 14), (24, 30), (38, 50)]
And at last you can zip the 3 list regon,sub,l and use a dict comprehension to create the expected result :
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
The main code :
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...
>>> for t in sub:
... regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
If you can bound your query then you can use a single pass of the string. The number of comparisons will be length of string * (max_length - min_length) so will scale linearly.
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
def find_repeats(s, max_length, min_length=2):
for i in xrange(len(s)):
for j in xrange(min_length, max_length+1):
count = 1
while s[i:i+j] == s[i+j*count:i+j*count+j]: count += 1
if count > 1:
yield s[i:i+j], i, count
for pattern, position, count in find_repeats(s, 6, 2):
print "%6s at region (%d, %d), %d repeats" % (pattern, position, position + count*len(pattern), count)
Output:
AC at region (2, 6), 2 repeats
ACGT at region (4, 12), 2 repeats
CGTA at region (5, 13), 2 repeats
GT at region (18, 24), 3 repeats
TG at region (19, 23), 2 repeats
GT at region (20, 24), 2 repeats
CC at region (24, 28), 2 repeats
TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
TA at region (34, 38), 2 repeats
Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?
Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.
This simple while loop detects all repeated patterns:
def expand():
global hi
hi += 1
def shrink():
global lo
lo += 1
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
motifs = set()
lo = 0
hi = 0
f = expand
while hi <= len(s):
sub = s[lo : hi+1]
if s.count(sub) > 1:
motifs.add(sub)
if lo==hi: f = expand
f()
else:
f = shrink if lo<=hi else expand
f()
At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:
minlen = 3
for m in filter(lambda m: len(m)>=minlen and s.count(2*m)>=1, motifs):
print(m)
'''
ATACGT
ACGT
TATACG
CGTA
'''
You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:
>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']
Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can only get a maximum one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we can use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.
Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:
(?=((.{2,})\2+))
In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:
def repeated_motifs(s):
import re
from collections import defaultdict
rmdict = defaultdict(list)
for match in re.finditer(r'(?=((.{2,})\2+))', s):
motif = match.group(2)
span1, span2 = match.span(1), match.span(2)
startindex = span1[0]
repetitions = (span1[1] - startindex) // (span2[1] - startindex)
others = rmdict[motif]
if not others or startindex > others[-1]['region'][1]:
others.append({'repeat': repetitions, 'region': span1})
return rmdict
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found sorted by first region
print(sorted(list(d.keys()), key=lambda k: d[k][0]['region']))
Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.

how to replace the alphabetically smallest letter by 1, the next smallest by 2 but do not discard multiple occurrences of a letter?

I am using Python 3 and I want to write a function that takes a string of all capital letters, so suppose s = 'VENEER', and gives me the following output '614235'.
The function I have so far is:
def key2(s):
new=''
for ch in s:
acc=0
for temp in s:
if temp<=ch:
acc+=1
new+=str(acc)
return(new)
If s == 'VENEER' then new == '634335'. If s contains no duplicates, the code works perfectly.
I am stuck on how to edit the code to get the output stated in the beginning.
Note that the built-in method for replacing characters within a string, str.replace, takes a third argument; count. You can use this to your advantage, replacing only the first appearance of each letter (obviously once you replace the first 'E', the second one will become the first appearance, and so on):
def process(s):
for i, c in enumerate(sorted(s), 1):
## print s # uncomment to see process
s = s.replace(c, str(i), 1)
return s
I have used the built-in functions sorted and enumerate to get the appropriate numbers to replace the characters:
1 2 3 4 5 6 # 'enumerate' from 1 -> 'i'
E E E N R V # 'sorted' input 's' -> 'c'
Example usage:
>>> process("VENEER")
'614235'
One way would be to use numpy.argsort to find the order, then find the ranks, and join them:
>>> s = 'VENEER'
>>> order = np.argsort(list(s))
>>> rank = np.argsort(order) + 1
>>> ''.join(map(str, rank))
'614235'
You can use a regex:
import re
s="VENEER"
for n, c in enumerate(sorted(s), 1):
s=re.sub('%c' % c, '%i' % n, s, count=1)
print s
# 614235
You can also use several nested generators:
def indexes(seq):
for v, i in sorted((v, i) for (i, v) in enumerate(seq)):
yield i
print ''.join('%i' % (e+1) for e in indexes(indexes(s)))
# 614235
From your title, you may want to do like this?
>>> from collections import OrderedDict
>>> s='VENEER'
>>> d = {k: n for n, k in enumerate(OrderedDict.fromkeys(sorted(s)), 1)}
>>> "".join(map(lambda k: str(d[k]), s))
'412113'
As #jonrsharpe commented I didn't need to use OrderedDict.
def caps_to_nums(in_string):
indexed_replaced_string = [(idx, val) for val, (idx, ch) in enumerate(sorted(enumerate(in_string), key=lambda x: x[1]), 1)]
return ''.join(map(lambda x: str(x[1]), sorted(indexed_replaced_string)))
First we run enumerate to be able to save the natural sort order
enumerate("VENEER") -> [(0, 'V'), (1, 'E'), (2, 'N'), (3, 'E'), (4, 'E'), (5, 'R')]
# this gives us somewhere to RETURN to later.
Then we sort that according to its second element, which is alphabetical, and run enumerate again with a start value of 1 to get the replacement value. We throw away the alpha value, since it's not needed anymore.
[(idx, val) for val, (idx, ch) in enumerate(sorted([(0, 'V'), (1, 'E'), ...], key = lambda x: x[1]), start=1)]
# [(1, 1), (3, 2), (4, 3), (2, 4), (5, 5), (0, 6)]
Then map the second element (our value) sorting by the first element (the original index)
map(lambda x: str(x[1]), sorted(replacement_values)
and str.join it
''.join(that_mapping)
Ta-da!

Categories

Resources