Extracting data from match objects - Python

I have a long sequence in which a specific string (say, 'GAATTC') is repeated at random positions. I'm currently using the match object's .span() method to get the indices where the pattern 'GAATTC' is found. Now I want to use those indices to cut the sequence between the G and the A of each match (i.e. 'G|AATTC').
How do I use the data from the match object to slice those out?

If I understand you correctly, you have the string and an index where the sequence GAATTC starts. Is this what you need (i here is m.start() of the match)?
>>> seq = "GAATTC"
>>> s = "AATCCTGAGAATTCAAC"
>>> i = 8 # the index where seq starts in s
>>> s[i:]
'GAATTCAAC'
>>> s[i:i+len(seq)]
'GAATTC'
That extracts it. You can also slice the original sequence at the G like this:
>>> s[:i+1]
'AATCCTGAG'
>>> s[i+1:]
'AATTCAAC'
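To work from the match objects directly, as the question asks, one possible sketch is to collect m.start() for every match returned by re.finditer and cut one character after it. The helper names cut_sites and digest and the offset parameter are made up for this illustration:

```python
import re

def cut_sites(sequence, pattern="GAATTC", offset=1):
    """Return the cut indices: `offset` characters into each match."""
    return [m.start() + offset for m in re.finditer(pattern, sequence)]

def digest(sequence, pattern="GAATTC", offset=1):
    """Split `sequence` at every cut site and return the fragments."""
    cuts = cut_sites(sequence, pattern, offset)
    # Pair each boundary with the next one: (0, c1), (c1, c2), ..., (cn, end).
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

s = "AATCCTGAGAATTCAAC"
print(cut_sites(s))  # [9]
print(digest(s))     # ['AATCCTGAG', 'AATTCAAC']
```

Because the cuts come from m.start(), the same code handles any number of matches in the sequence.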

If what you want is to replace 'GAATTC' with 'G|AATTC' (I'm not sure what you want to do in the end), you can manage this without regex:
>>> string = 'GAATTCAAGAATTCTTGAATTCGAATTCAATATATA'
>>> string.replace('GAATTC', 'G|AATTC')
'G|AATTCAAG|AATTCTTG|AATTCG|AATTCAATATATA'
EDIT: OK, this approach can be adapted to suit what you want to do:
>>> groups = string.replace('GAATTC', 'G|AATTC').split('|')
>>> groups
['G', 'AATTCAAG', 'AATTCTTG', 'AATTCG', 'AATTCAATATATA']
>>> list(map(len, groups))
[1, 8, 8, 6, 13]

How do I use regular expressions to find a character more than 1 time?

import re
palabra="hola mundo"
caracter="o"
buscacaracter=(re.search(caracter,palabra))
print(buscacaracter)
So this code searches until the first "o" is found, but it doesn't find the last "o".
Is there a way to find every "o"?
The same happens with these methods:
print(buscacaracter.start())
print(buscacaracter.end())
print(buscacaracter.span())
And when I use this method:
(re.findall(caracter,palabra))
it doesn't return the positions of the "o", only the matched characters themselves.
If you just want the index of each character, it is faster and easier to use enumerate:
palabra="hola mundo"
caracter="o"
>>> [i for i,c in enumerate(palabra) if c==caracter]
[1, 9]
If you want a regex, use finditer, which yields match objects that carry the index:
>>> [m.start() for m in re.finditer(re.escape(caracter), palabra)]
[1, 9]
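If the text you search for may contain regex metacharacters such as . or +, escaping it first keeps the search literal. A small sketch, where the function name positions is only illustrative:

```python
import re

def positions(needle, haystack):
    """Return the start index of every literal occurrence of `needle`."""
    # re.escape makes characters like "." match themselves.
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

print(positions("o", "hola mundo"))  # [1, 9]
print(positions(".", "a.b.c"))       # [1, 3]
```

Without re.escape, the pattern "." would match every character of "a.b.c".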

Efficient way to reverse a large iterator

This may sound a bit insane but I have an iterator with N = 10**409 elements. Is there a way to get items from the end of this "list"? I.e. when I call next(iterator) it gives me what I want to be the last thing, but to get to what I want to be first thing I would need to call next(iterator) N times.
If I do something like list(iterator).reverse() it will of course crash due to lack of memory.
Edit: how the iterator is being used with a simplified example:
import itertools
# prints all possible alphabetical character combinations that can fit in a tweet
chars = "abcdefghijklmnopqrstuvwxyz "
cproduct = itertools.product(chars, repeat=250)
for subset in cproduct:
    print(''.join(subset))
# will start with `aaaaaaaa...aaa`
# but I want it to start with `zzz...zzz`
For some problems, you can compute the elements in reverse order directly. For the example you provide, you can simply reverse the sequence you are taking the product of.
In this example, we reverse the symbols before taking the product to get the "reverse iterator":
>>> import itertools
>>> symbols = "abc"
>>> perms = itertools.product(symbols, repeat=5)
>>> perms = ["".join(x) for x in perms]
>>> perms
['aaaaa', 'aaaab', 'aaaac', 'aaaba', 'aaabb',
...,
'cccbb', 'cccbc', 'cccca', 'ccccb', 'ccccc']
>>> perms_rev = itertools.product(symbols[::-1], repeat=5)
>>> perms_rev = ["".join(x) for x in perms_rev]
>>> perms_rev
['ccccc', 'ccccb', 'cccca', 'cccbc', 'cccbb',
...,
'aaabb', 'aaaba', 'aaaac', 'aaaab', 'aaaaa']
>>> perms_rev == perms[::-1]
True
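The lists above are only for demonstration; with repeat=250 nothing can be materialized. Here is a sketch of the same idea kept lazy, using itertools.islice to peek at the first few items. Note that in the chars string from the question the last symbol is the space, so the reversed product starts with 250 spaces rather than 'zzz...zzz':

```python
import itertools

chars = "abcdefghijklmnopqrstuvwxyz "
# Reverse the alphabet, not the (astronomically large) product itself.
reversed_product = itertools.product(chars[::-1], repeat=250)
# islice pulls only the requested items; nothing else is generated.
first_three = ["".join(t) for t in itertools.islice(reversed_product, 3)]
# Show only the last three characters of each 250-character string.
print([s[-3:] for s in first_three])  # ['   ', '  z', '  y']
```

To start at 'zzz...zzz', the space would have to be removed from chars or moved to the front before reversing.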

python split and remove duplicates

I have the following output with print var:
test.qa.home-page.website.com-3412-jan
test.qa.home-page.website.net-5132-mar
test.qa.home-page.website.com-8422-aug
test.qa.home-page.website.net-9111-jan
I'm trying to find the correct split call to produce the following:
test.qa.home-page.website.com
test.qa.home-page.website.net
test.qa.home-page.website.com
test.qa.home-page.website.net
...as well as remove duplicates:
test.qa.home-page.website.com
test.qa.home-page.website.net
The numeric values after "com-" or "net-" are random so I think my struggle is finding out how to rsplit ("-" + [CHECK_FOR_ANY_NUMBER])[0] . Any suggestions would be great, thanks in advance!
How about:
import re
output = [
    "test.qa.home-page.website.com-3412-jan",
    "test.qa.home-page.website.net-5132-mar",
    "test.qa.home-page.website.com-8422-aug",
    "test.qa.home-page.website.net-9111-jan"
]
# split at the first hyphen that is followed by a digit and keep the left part
trimmed = {re.split(r"-[0-9]", item)[0] for item in output}
print(trimmed)
# out (set order may vary): {'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
If you have a list of values and you want to remove duplicates, you can use set.
>>> l = [1,2,3,1,2,3]
>>> l
[1, 2, 3, 1, 2, 3]
>>> set(l)
{1, 2, 3}
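You can get a list to feed into set by applying rsplit('-', 2)[0] to every value. A plain split('-')[0] would cut too early here, because the hostname itself contains a hyphen (home-page).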
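A short sketch of that idea on the sample lines, comparing a split from the left with a split from the right (the variable names are illustrative):

```python
lines = [
    "test.qa.home-page.website.com-3412-jan",
    "test.qa.home-page.website.net-5132-mar",
    "test.qa.home-page.website.com-8422-aug",
    "test.qa.home-page.website.net-9111-jan",
]

# Splitting from the left stops at the hyphen inside "home-page".
print(lines[0].split("-")[0])      # test.qa.home
# Splitting at most twice from the right removes only "-3412-jan".
print(lines[0].rsplit("-", 2)[0])  # test.qa.home-page.website.com

unique_hosts = set(line.rsplit("-", 2)[0] for line in lines)
print(sorted(unique_hosts))  # sorted only to get a stable display order
```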
You could use a regex to parse the individual lines and a set comprehension to deduplicate:
txt='''\
test.qa.home-page.website.com-3412-jan
test.qa.home-page.website.net-5132-mar
test.qa.home-page.website.com-8422-aug
test.qa.home-page.website.net-9111-jan'''
import re
>>> {re.sub(r'^(.*\.(?:com|net)).*', r'\1', s) for s in txt.split() }
{'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
Or just use the same regex with set and re.findall with the re.M flag:
>>> set(re.findall(r'^(.*\.(?:com|net))', txt, flags=re.M))
{'test.qa.home-page.website.net', 'test.qa.home-page.website.com'}
If you want to maintain order, use {}.fromkeys() (dicts keep insertion order; this is guaranteed since Python 3.7 and an implementation detail of CPython 3.6):
>>> list({}.fromkeys(re.findall(r'^(.*\.(?:com|net))', txt, flags=re.M)).keys())
['test.qa.home-page.website.com', 'test.qa.home-page.website.net']
Or, if you know the part you want always ends at the second hyphen from the end, just use .rsplit() with maxsplit=2:
>>> {s.rsplit('-',maxsplit=2)[0] for s in txt.splitlines()}
{'test.qa.home-page.website.com', 'test.qa.home-page.website.net'}

searching a list of strings for integers

Given the following list of strings:
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n', ... ]
how can I turn the entries into a list of tuples of integers:
[ (123, 321), (223, 32221), (19823, 328771), ... ]
In my poor attempt below, I managed to extract the digits, but I ran into a problem: the element placeholder also contains a number, which this method includes. It also produces a flat list rather than tuples.
numbers = list()
for s in my_list:
    for x in s:
        if x.isdigit():
            numbers.append((x))
numbers
We can first build a regex that identifies non-negative integers:
import re
INTEGER_REGEX = re.compile(r'\b\d+\b')
Here \d stands for a digit (so 0, 1, etc.), + for one or more, and \b for a word boundary.
We can then use INTEGER_REGEX.findall(some_string) to find all such integers in the input. Now the only thing left to do is to iterate through the elements of the list and convert the output of INTEGER_REGEX.findall(..) to a tuple. We can do this with:
output = [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
For your given sample data, this will produce:
>>> [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
[('123', '321'), ('223', '32221'), ('19823', '328771')]
Note that digits that are not separate words will not be matched. For instance, the 8 in 'see you l8er' will not be matched, since it is not a word on its own. This is also why the 0 in 'element0' is skipped.
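The tuples above hold strings. If integers are wanted, as in the question, each match can be passed through int. A minimal sketch reusing the same pattern on the three sample entries:

```python
import re

INTEGER_REGEX = re.compile(r'\b\d+\b')
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']

# Convert every matched digit run to int before building the tuple.
output = [tuple(int(n) for n in INTEGER_REGEX.findall(line)) for line in my_list]
print(output)  # [(123, 321), (223, 32221), (19823, 328771)]
```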
Your attempt iterates over each character of the string. You have to split the string on whitespace instead, a task that str.split handles well.
Also, numbers.append((x)) is the same as numbers.append(x). To make a one-element tuple, add a comma before the closing parenthesis: (x,). That would not solve the problem either, though.
Now, each entry seems to contain an id (to be skipped) followed by two integers as strings, so why not split, drop the first token, and convert the rest to a tuple of integers?
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']
result = [tuple(map(int,x.split()[1:])) for x in my_list]
print(result)
gives:
[(123, 321), (223, 32221), (19823, 328771)]

Cut character string every two commas

I would like to split my string at every second comma, but I can't get it to work. Can you help me?
This is what I want: ['nb1,nb2','nb3,nb4','nb5,nb6']
Here is what I did:
a = 'nb1,nb2,nb3,nb4,nb5,nb6'
compteur = 0
for i in a:
    if i == ',':
        compteur += 1
        if compteur % 2 == 0:
            print(compteur)
            # the second argument is True, i.e. maxsplit=1
            test = a.split(',', compteur % 2 == 0)
print(a)
print(test)
The result:
2
4
nb1,nb2,nb3,nb4,nb5,nb6
['nb1', 'nb2,nb3,nb4,nb5,nb6']
Thanks in advance for your answers.
You can use a regex (after import re):
In [12]: re.findall(r'([\w]+,[\w]+)', 'nb1,nb2,nb3,nb4,nb5,nb6')
Out[12]: ['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
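One caveat: this pattern silently drops a trailing element when the number of items is odd. A possible variant, shown here as a sketch, makes the second item optional. The pattern [^,]+(?:,[^,]+)? is an assumption of this sketch and expects items that are non-empty and contain no commas:

```python
import re

# One item, optionally followed by a comma and a second item.
PAIR = re.compile(r'[^,]+(?:,[^,]+)?')

print(re.findall(r'\w+,\w+', 'nb1,nb2,nb3'))    # ['nb1,nb2']  (nb3 is lost)
print(PAIR.findall('nb1,nb2,nb3'))              # ['nb1,nb2', 'nb3']
print(PAIR.findall('nb1,nb2,nb3,nb4,nb5,nb6'))  # ['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
```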
A quick fix could be to first split the string on commas and then join the elements back together in pairs. Like:
sub_result = a.split(',')
result = [','.join(sub_result[i:i+2]) for i in range(0,len(sub_result),2)]
This gives:
>>> result
['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
This will also work if the number of elements is odd. For example:
>>> a = 'nb1,nb2,nb3,nb4,nb5,nb6,nb7'
>>> sub_result = a.split(',')
>>> result = [','.join(sub_result[i:i+2]) for i in range(0,len(sub_result),2)]
>>> result
['nb1,nb2', 'nb3,nb4', 'nb5,nb6', 'nb7']
You can zip the list with itself, shifted by one, to create pairs:
a = 'nb1,nb2,nb3,nb4,nb5,nb6'
parts = a.split(',')
# parts = ['nb1', 'nb2', 'nb3', 'nb4', 'nb5', 'nb6']
pairs = list(zip(parts, parts[1:]))
# pairs = [('nb1', 'nb2'), ('nb2', 'nb3'), ('nb3', 'nb4'), ('nb4', 'nb5'), ('nb5', 'nb6')]
Now you can simply join every other pair again for your output:
list(map(','.join, pairs[::2]))
# ['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
Split the string by comma first, then apply the common idiom to partition an iterable into sub-sequences of length n (where n is 2 in your case) with zip:
>>> s = 'nb1,nb2,nb3,nb4,nb5,nb6'
>>> [','.join(x) for x in zip(*[iter(s.split(','))]*2)]
['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
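The idiom works because zip receives the same iterator object twice, so each tuple consumes two consecutive items. Plain zip drops an unpaired last item; a sketch using itertools.zip_longest keeps it. The function name join_in_groups is hypothetical:

```python
from itertools import zip_longest

def join_in_groups(text, n=2, sep=","):
    """Split `text` on `sep` and re-join the pieces in groups of `n`."""
    it = iter(text.split(sep))
    # [it] * n is n references to ONE iterator, so each output tuple
    # takes n consecutive items; zip_longest pads the last with None.
    groups = zip_longest(*[it] * n)
    return [sep.join(p for p in group if p is not None) for group in groups]

print(join_in_groups('nb1,nb2,nb3,nb4,nb5,nb6'))  # ['nb1,nb2', 'nb3,nb4', 'nb5,nb6']
print(join_in_groups('nb1,nb2,nb3'))              # ['nb1,nb2', 'nb3']
```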
