Here is my list:
liPos = [(2,5),(8,9),(18,22)]
The first item of each tuple is the starting position and the second is the ending position.
Then I have a string like this:
s = "I hope that I will find an answer to my question!"
Now, considering my liPos list, I want to format the string by removing the chars between each starting and ending position (and including the surrounding numbers) provided in the tuples. Here is the result that I want:
"I tt I will an answer to my question!"
So basically, I want to remove the chars between 2 and 5 (including 2 and 5), then between 8,9 (including 8 and 9) and finally between 18,22 (including 18 and 22).
Any suggestion?
This assumes that liPos is already sorted; if it is not, use sorted(liPos, reverse=True) in the for loop.
liPos = [(2,5),(8,9),(18,22)]
s = "I hope that I will find an answer to my question!"
for begin, end in reversed(liPos):
    s = s[:begin] + s[end+1:]
print s
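To see why the loop walks the tuples from last to first, here is a small Python 3 sketch comparing both directions. The function names are made up for illustration, and the tuples are assumed to be inclusive (start, end) pairs as in the question. Cutting from the right leaves the earlier positions untouched, while cutting from the left shifts every later index.

```python
def remove_right_to_left(text, spans):
    # Process the right-most span first so earlier indices stay valid.
    for begin, end in sorted(spans, reverse=True):
        text = text[:begin] + text[end + 1:]
    return text


def remove_left_to_right(text, spans):
    # Deliberately naive: after the first cut, the remaining spans
    # point at the wrong characters because the string got shorter.
    for begin, end in sorted(spans):
        text = text[:begin] + text[end + 1:]
    return text


liPos = [(2, 5), (8, 9), (18, 22)]
s = "I hope that I will find an answer to my question!"

print(repr(remove_right_to_left(s, liPos)))
print(repr(remove_left_to_right(s, liPos)))  # differs from the line above
```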
Here is an alternative method that constructs a new list of slice tuples to include, and then joins the string from only those included portions.
from itertools import chain, izip_longest
# second slice index needs to be increased by one, do that when creating liPos
liPos = [(a, b+1) for a, b in liPos]
result = "".join(s[b:e] for b, e in izip_longest(*[iter(chain([0], *liPos))]*2))
To make this slightly easier to understand, here are the slices generated by izip_longest:
>>> list(izip_longest(*[iter(chain([0], *liPos))]*2))
[(0, 2), (6, 8), (10, 18), (23, None)]
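On Python 3, where izip_longest was renamed to zip_longest, the same keep-the-gaps idea can also be written as a plain loop. This is only a sketch: remove_spans is a hypothetical helper name, and it assumes inclusive (start, end) tuples as in the question. Unlike the one-liner, it also tolerates unsorted or overlapping spans.

```python
def remove_spans(text, spans):
    """Return text with every inclusive (start, end) span removed."""
    kept = []
    cursor = 0  # first index that has not been handled yet
    for start, end in sorted(spans):
        if start > cursor:
            kept.append(text[cursor:start])  # the gap before this span
        cursor = max(cursor, end + 1)        # skip past the span
    kept.append(text[cursor:])               # whatever follows the last span
    return "".join(kept)


liPos = [(2, 5), (8, 9), (18, 22)]
s = "I hope that I will find an answer to my question!"
print(repr(remove_spans(s, liPos)))
```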
liPos = [(2,5),(8,9),(18,22)]
s = "I hope that I will find an answer to my question!"
exclusions = set().union(* (set(range(t[0], t[1]+1)) for t in liPos) )
pruned = ''.join(c for i,c in enumerate(s) if i not in exclusions)
print pruned
Here is one compact possibility:
"".join(s[i] for i in range(len(s)) if not any(start <= i <= end for start, end in liPos))
This ... is a quick stab at the problem. There may be a better way, but it's a start at least.
>>> liPos = [(2,5),(8,9),(18,22)]
>>>
>>> toRemove = [i for x, y in liPos for i in range(x, y + 1)]
>>>
>>> toRemove
[2, 3, 4, 5, 8, 9, 18, 19, 20, 21, 22]
>>>
>>> s = "I hope that I will find an answer to my question!"
>>>
>>> s2 = ''.join([c for i, c in enumerate(s) if i not in toRemove])
>>>
>>> s2
'I tt I will an answer to my question!'
I know there are plenty of topics about finding indices of given keywords in strings, but my case is a bit different
I have 2 inputs, one is a string and another is a mapping list (or whatever you wanna call it)
s = "I am awesome and I love you"
mapping_list = "1 1 2 3 1 2 3"
each word will always map onto a digit in the mapping list. Now I want to find all indices of a given number, say 1, when matching the string.
In the above case, it will return [0, 2, 17] (Thanks #rahlf23)
My current approach would be zipping each word with a digit by doing
zip(mapping_list.split(' '), s.split(' '))
which gives me
('1', 'I')
('1', 'am')
('2', 'awesome')
('3', 'and')
('1', 'I')
('2', 'love')
('3', 'you')
and then iterate through the list, find "1", use the word to generate a regex, and then search for indices and append it to a list or something. Rinse and repeat.
However this seems really inefficient especially if the s gets really long
I'm wondering if there's a better way to deal with it.
You could map the words to their len and use itertools.accumulate, although you have to add 1 to each length (for the spaces) and add an initial 0 for the start of the first word.
>>> import collections
>>> import itertools
>>> words = "I am awesome and I love you".split()
>>> mapping = list(map(int, "1 1 2 3 1 2 3".split()))
>>> start_indices = list(itertools.accumulate([0] + [len(w)+1 for w in words]))
>>> start_indices
[0, 2, 5, 13, 17, 19, 24, 28]
The last element is not used. Then, zip and iterate the pairs and collect them in a dictionary.
>>> d = collections.defaultdict(list)
>>> for x, y in zip(mapping, start_indices):
...     d[x].append(y)
...
>>> dict(d)
{1: [0, 2, 17], 2: [5, 19], 3: [13, 24]}
Alternatively, you could also use a regular expression like \b\w (word-boundary followed by word-character) to find each position a word starts, then proceed as above.
>>> import re
>>> s = "I am awesome and I love you"
>>> [m.start() for m in re.finditer(r"\b\w", s)]
[0, 2, 5, 13, 17, 19, 24]
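Combining the regex variant with the grouping step might look roughly like the sketch below. The helper name starts_by_label is made up, and I swapped \b\w for \S+ (one match per whitespace-separated word) so that words containing apostrophes or hyphens still count once; that is my assumption about what "word" means here.

```python
import re
from collections import defaultdict


def starts_by_label(sentence, mapping_list):
    """Group the start index of each word under its label from mapping_list."""
    # One match per run of non-whitespace characters, i.e. per word.
    starts = [m.start() for m in re.finditer(r"\S+", sentence)]
    labels = mapping_list.split()
    grouped = defaultdict(list)
    for label, start in zip(labels, starts):
        grouped[label].append(start)
    return dict(grouped)


s = "I am awesome and I love you"
mapping_list = "1 1 2 3 1 2 3"
by_label = starts_by_label(s, mapping_list)
print(by_label["1"])
```

The string is scanned only once, and a lookup for any digit afterwards is a plain dictionary access.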
import re

# Find the indices of all the word starts
word_starts = [0] + [m.start()+1 for m in re.finditer(' ', s)]
# Break the mapping list into an actual list
mapping = mapping_list.split(' ')
# Find the indices in the mapping list we care about
word_indices = [i for i, e in enumerate(mapping) if e == '1']
# Map those indices onto the word start indices
word_starts_at_indices = [word_starts[i] for i in word_indices]
# Or you can do the last line the fancy way:
# word_starts_at_indices = operator.itemgetter(*word_indices)(word_starts)
I have some basic knowledge of Python, but I have no idea what's going on in the code below. Can someone explain it or 'translate' it into a more normal/common expression?
steps = len(t)
sa = [i for i in range(steps)]
sa.sort(key = lambda i: t[i:i + steps])#I know that sa is a list
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
't' is actually a string
Thank you.
What I don't understand is the code: sa.sort(key = lambda i: t[i:i + steps])
sa.sort(key = lambda i: t[i:i + steps])
It sorts sa according to the natural ordering of the substrings t[i:i+len(t)]. Since i + steps is always greater than or equal to steps (which is len(t)), the slice always runs to the end of the string, so it could be written t[i:] instead (which makes the code simpler to understand).
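As a rough Python 3 sketch of that simplification (suffix_order is just an illustrative name, not something from the original code), the whole sort can be written as:

```python
def suffix_order(t):
    """Start indices of the suffixes of t, ordered by the suffix text."""
    return sorted(range(len(t)), key=lambda i: t[i:])


word = "azerty"
for i in suffix_order(word):
    # Each index is printed next to the suffix that served as its sort key.
    print(i, word[i:])
```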
You will better understand using the decorate/sort/undecorate pattern:
>>> t = "azerty"
>>> sa = range(len(t))
>>> print sa
[0, 1, 2, 3, 4, 5]
>>> decorated = [(t[i:], i) for i in sa]
>>> print decorated
[('azerty', 0), ('zerty', 1), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5)]
>>> decorated.sort()
>>> print decorated
[('azerty', 0), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5), ('zerty', 1)]
>>> sa = [i for (_dummy, i) in decorated]
>>> print sa
[0, 2, 3, 4, 5, 1]
and sf = t[sa[i] : sa[i] + steps]
This could also be written more simply:
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
=>
for x in sa:
    sf = t[x:]
    print sf
which yields:
azerty
erty
rty
ty
y
zerty
You'll notice that this is exactly the keys used (and then discarded)
in the decorate/sort/undecorate example above, so the whole thing could be rewritten as:
def foo(t):
    decorated = sorted((t[i:], i) for i in range(len(t)))
    for sf, index in decorated:
        print sf
        # do something with sf here
As to what all this is supposed to do, I'm totally at a loss, but at least you now have a much more pythonic (readable...) version of this code ;)
The lambda in sort defines the criteria according to which the list is going to be sorted.
In other words, the list will not be sorted simply according to its values, but according to the function applied to the values.
Have a look here for more details.
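A tiny Python 3 illustration of a key function, unrelated to the asker's data (the word list is made up): the elements themselves end up in the result, but their order is decided by whatever the key function returns for each of them.

```python
words = ["pear", "fig", "banana", "kiwi"]

# Sort the words by their length instead of alphabetically.
by_length = sorted(words, key=len)

# Sort positions by the word found at each position, which is the
# same trick as sorting indices by a slice of a string.
positions = list(range(len(words)))
positions.sort(key=lambda i: words[i])

print(by_length)
print(positions)
```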
It looks like what you are doing is sorting the list according to the alphabetical ordering of the substrings of the input string t.
Here is what is happening:
t = 'hello' # EXAMPLE
steps = len(t)
sa = [i for i in range(steps)]
sort_func = lambda i: t[i:i + steps]
for el in sa:
    print sort_func(el)
#hello
#ello
#llo
#lo
#o
So these are the values that determine the sorting of the list.
transf_list = [sort_func(el) for el in sa]
sorted(transf_list)
# ['ello', 'hello', 'llo', 'lo', 'o']
Hence:
sa.sort(key = sort_func)#I know that sa is a list
# [1, 0, 2, 3, 4]
Problem
Background:
I have a list of ~10,000 lists containing irregular data which needs to be transformed to a specific format. This data will be ingested into a pandas dataframe after transformation.
TL;DR: How can I elegantly transform the strings in a list that match the following regex?
Regex
'\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}'
Example:
'1-3,6-8' to '1 2 3 6 7 8'
Current Solution:
Using list comprehensions required multiple type casts to transform the string and is unfit to be a lasting solution.
pat = re.compile('\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}')
row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7','234','1,5']
lst = [((str(list(range(int(x.split(',')[0].split('-')[0]),
                        int(x.split(',')[0].split('-')[1]) + 1))).strip('[]').replace(',', '') + ' '
         + str(list(range(int(x.split(',')[1].split('-')[0]),
                          int(x.split(',')[1].split('-')[1]) + 1))).strip('[]').replace(',', '')))
       if pat.match(str(x)) else x for x in row]
Result
['sss-www,ddd-eee', '1 2 3 6 7 8', 'XXXX', '0 1 2 3 4 5 6 7', '234', '1,5']
Capture the groups; it's easier.
Then you convert the group list to integers, and process them 2 by 2 in a list comprehension, chained with itertools.chain
import re,itertools
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
z='1-3,6-8'
groups = [int(x) for x in pat.match(z).groups()]
print(list(itertools.chain(*(list(range(groups[i],groups[i+1]+1)) for i in range(0,len(groups),2)))))
result:
[1, 2, 3, 6, 7, 8]
Not sure you'd call that "elegant", though. It remains complicated, mostly because most objects return generators that need to be converted to lists explicitly.
Several ways to do this, here is mine:
import re
txt = '1-3,6-8'
# Safer to use a raw string
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
m = pat.match(txt)
if m:
    start1, end1, start2, end2 = m.groups()
    result = [i for i in range(int(start1), int(end1)+1)]
    result += [i for i in range(int(start2), int(end2)+1)]
    print(result)
Gives:
[1, 2, 3, 6, 7, 8]
I'm assuming Python 3 here (as stated in the question).
Python 2 could use:
result = range(int(start1), int(end1)+1)
result += range(int(start2), int(end2)+1)
I assume that you want to handle longer sequences as well, like 1-10,15,23-25? You don't really need regular expressions for this, regular string processing functions will work well.
def parse_sequence(seq):
    result = []
    for part in seq.split(','):
        points = [int(s) for s in part.split('-')]
        if len(points) == 2:
            result.extend(range(points[0], points[1]+1))
        elif len(points) == 1:
            result.append(points[0])
        else:
            raise ValueError('invalid sequence')
    return result
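To get back to the list from the question, where only matching entries should be rewritten and the result should be a space-separated string, something along these lines might work. It is a sketch under assumptions: expand_ranges and RANGE_LIST are names I made up, and the pattern accepts one or more comma-separated a-b ranges (the question's regex requires exactly two).

```python
import re

# One or more "a-b" ranges separated by commas (assumed generalisation).
RANGE_LIST = re.compile(r"\d{1,3}-\d{1,3}(?:,\d{1,3}-\d{1,3})*")


def expand_ranges(item):
    """Expand '1-3,6-8' to '1 2 3 6 7 8'; return anything else unchanged."""
    if not isinstance(item, str) or not RANGE_LIST.fullmatch(item):
        return item
    numbers = []
    for part in item.split(","):
        start, end = (int(p) for p in part.split("-"))
        numbers.extend(range(start, end + 1))  # the end is inclusive
    return " ".join(map(str, numbers))


row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7', '234', '1,5']
converted = [expand_ranges(x) for x in row]
print(converted)
```

Using fullmatch rather than match means entries that merely start with a valid range are left alone instead of being partially parsed.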
Here is my solution:
import re
from itertools import chain
s = '1-3, 6 - 8, 12-14, 20 -22'
rslt = list(chain(*[range(int(tup[0]), int(tup[1]) + 1)
for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]))
Output:
In [43]: rslt
Out[43]: [1, 2, 3, 6, 7, 8, 12, 13, 14, 20, 21, 22]
Step by step:
In [44]: re.findall(r'(\d+)\s*?-\s*?(\d+)', s)
Out[44]: [('1', '3'), ('6', '8'), ('12', '14'), ('20', '22')]
In [45]: [range(int(tup[0]),int(tup[1])+1) for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]
Out[45]: [range(1, 4), range(6, 9), range(12, 15), range(20, 23)]
Depends on exactly what data you're expecting to see. In general the best way to do this is going to be to write a function that parses the string in chunks. Something like:
import re

def parse(string):
    chunks = string.split(',')
    for chunk in chunks:
        match = re.match(r'(\d+)-(\d+)', chunk)
        if match:
            start = int(match.group(1))
            end = int(match.group(2))
            yield range(start, end+1)
        else:
            yield int(chunk)
s_tmp = s.split(",")
[n for x in s_tmp for n in range(int(x.split("-")[0]), int(x.split("-")[1]) + 1)]
Apologies if there are syntax errors; I'm typing this from my phone. Basically: split by ",", then split by "-", then unpack the entries from range.
I recently started using Python and wrote some simple scripts
Now I have this question:
I have this string:
mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
and I have these following strings:
String_A = 'BB'
String_B = 'CC'
I would like to get all possible combinations of strings starting with String_A and ending with String_B (kind of vague so below is the desired output)
output:
BBAAABBAAAACCAAAACC
BBAAABBAAAACC
BBAAAACCAAAACC
BBAAAACC
I am able to count the number of occurences of String_A and String_B in mystring using
mystring.count()
And I am able to print out one specific output (the one with the first occurence of String_A and the first occurence of String_B), by doing the following:
if String_A in mystring:
    String_B_End = mystring.index(String_B) + len(String_B)
    output = mystring[mystring.index(String_A):String_B_End]
    print(output)
This works perfectly but only gives me the following output:
BBAAABBAAAACC
How can I get all the specified output strings from mystring?
thanx in advance!
If I understand the intention of your question correctly you can use the following code:
>>> import re
>>> mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
>>> String_A = 'BB'
>>> String_B = 'CC'
>>> def find_occurrences(s, a, b):
...     a_is = [m.start() for m in re.finditer(re.escape(a), s)]  # All indexes of a in s
...     b_is = [m.start() for m in re.finditer(re.escape(b), s)]  # All indexes of b in s
...     result = [s[i:j+len(b)] for i in a_is for j in b_is if j>i]
...     return result
...
>>> find_occurrences(mystring, String_A, String_B)
['BBAAABBAAAACC', 'BBAAABBAAAACCAAAACC', 'BBAAAACC', 'BBAAAACCAAAACC']
This uses the find all occurrences of a substring code from this answer
In its current form the code does not handle overlapping substrings: if mystring = 'BBB' and you look for the substring 'BB', it only returns index 0. To account for overlapping substrings, change the lines that collect the indexes to a_is = [m.start() for m in re.finditer("(?={})".format(re.escape(a)), s)] (and likewise for b_is).
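A quick way to check the lookahead variant on its own (all_starts is only an illustrative helper name): because (?=...) matches without consuming any characters, the search can succeed again at the very next position.

```python
import re


def all_starts(text, sub):
    """Start index of every occurrence of sub in text, overlaps included."""
    pattern = "(?={})".format(re.escape(sub))  # zero-width lookahead
    return [m.start() for m in re.finditer(pattern, text)]


print(all_starts("BBB", "BB"))
print(all_starts("AAAABBAAABBAAAACCAAAACCAAAA", "BB"))
```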
Well, first you need to get the indexes of String_A and String_B in the text. See this:
s = mystring
[i for i in range(len(s)-len(String_A)+1) if s[i:i+len(String_A)]==String_A]
it returns [4, 9], i.e. the indexes of 'BB' in mystring. You do similarly for String_B for which the answer would be [15, 21].
Then you do this:
[(i, j) for i in [4, 9] for j in [15, 21] if i < j]
This line combines each starting location with each ending location and ensures that the starting location occurs before the ending location. The i < j would not be essential for this particular example, but in general you should have it. The result is [(4, 15), (4, 21), (9, 15), (9, 21)].
Then you just convert the start and end indices to substrings:
[s[a:b+len(String_B)] for a, b in [(4, 15), (4, 21), (9, 15), (9, 21)]]
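The three steps can be folded into a pair of small functions so the indexes are not hard-coded. This is a sketch in Python 3 with made-up names (find_all, spans_between); it uses str.find instead of the slicing comparison, which should give the same indexes.

```python
def find_all(text, sub):
    """Every index at which sub starts in text (overlaps included)."""
    starts = []
    pos = text.find(sub)
    while pos != -1:
        starts.append(pos)
        pos = text.find(sub, pos + 1)  # resume one character further on
    return starts


def spans_between(text, head, tail):
    """All substrings that start at a head and end with a later tail."""
    heads = find_all(text, head)
    tails = find_all(text, tail)
    return [text[i:j + len(tail)] for i in heads for j in tails if i < j]


mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
for piece in spans_between(mystring, 'BB', 'CC'):
    print(piece)
```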
If I have a list
lst = [1, 2, 3, 4, 5]
and I want to show that two items exist one of which is larger than the other by 1, can I do this without specifying which items in the list?
ie. without having to do something like:
lst[1] - lst[0] == 1
a general code that works for any int items in the lst
You can pair the numbers if the one less than the number is in the list:
new = [(i, i - 1) for i in lst if i - 1 in lst]
This one makes a set from the list for faster membership checks, then uses a short-circuiting check of whether i + 1 exists in that set for each i in the list (I iterate over the list instead of the newly created set because it should be slightly faster). As soon as any i + 1 is found in the set, the function returns True; otherwise it returns False.
def has_n_and_n_plus_1(lst):
    lset = set(lst)
    return any(i + 1 in lset for i in lst)
Testing:
>>> has_n_and_n_plus_1([6,2,7,11,42])
True
>>> has_n_and_n_plus_1([6,2,9,11,42])
False
The all-tricks-in-one-basket brain-teaser version:
from operator import sub
from itertools import starmap, tee
a, b = tee(sorted(lst))
next(b, None)
exists = 1 in starmap(sub, zip(b, a))
What this code does: it sorts the list in increasing order, then iterates pairwise over a, b = lst[i], lst[i + 1]; starmap passes each (b, a) pair to the sub operator, yielding b - a; finally the in operator checks whether the resulting iterator contains a 1.
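The same pairwise idea, unrolled a little so each step is visible. It is only a sketch and has_adjacent_values is a made-up name; it should behave like the brain-teaser version for lists of ints.

```python
from itertools import tee


def has_adjacent_values(values):
    """True if some value n and n + 1 are both present in values."""
    a, b = tee(sorted(values))  # two independent iterators over the sorted data
    next(b, None)               # advance b by one so zip pairs neighbours
    return any(y - x == 1 for x, y in zip(a, b))


print(has_adjacent_values([6, 2, 7, 11, 42]))
print(has_adjacent_values([6, 2, 9, 11, 42]))
```

Sorting first is what makes neighbouring pairs sufficient: if n and n + 1 are both present, the last n sits directly before the first n + 1.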
You could zip the list with itself shifted by one.
>>> lst = [1,2,3,4,5]
>>> zip(lst, lst[1:])
[(1, 2), (2, 3), (3, 4), (4, 5)]
This assumes that the list is ordered. If it is not, then you could sort it first and then filter it to exclude non matches (perhaps including the indexes in the original list if that is important). So if it's a more complex list of integers this should work:
>>> lst = [99,12,13,44,15,16,45,200]
>>> lst.sort()
>>> [(x,y) for (x,y) in zip(lst, lst[1:]) if x + 1 == y]
[(12, 13), (15, 16), (44, 45)]
The following is the equivalent using functions. The use of izip from itertools ensures the list is only iterated over once when we are looking for matches with the filter function:
>>> from itertools import izip
>>> lst = [99,12,13,44,15,16,45,200]
>>> lst.sort()
>>> filter(lambda (x,y): x+1==y, izip(lst, lst[1:]))
[(12, 13), (15, 16), (44, 45)]
The same could be written using for comprehensions, but personally I prefer using functions.
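On Python 3 the snippets above need small changes: izip is gone (the built-in zip is already lazy), zip and filter return iterators rather than lists, and lambda (x, y): ... is a syntax error because tuple parameter unpacking was removed. A possible Python 3 sketch, with consecutive_pairs as an illustrative name:

```python
def consecutive_pairs(values):
    """Pairs (n, n + 1) found next to each other once values are sorted."""
    ordered = sorted(values)  # sorted() leaves the caller's list untouched
    return [(x, y) for x, y in zip(ordered, ordered[1:]) if x + 1 == y]


lst = [99, 12, 13, 44, 15, 16, 45, 200]
print(consecutive_pairs(lst))
```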