Regular expression group capture with multiple matches

Quick regular expression question.
I'm trying to capture multiple instances of a capture group in Python (I don't think it's Python-specific), but each subsequent capture seems to overwrite the previous one.
In this over-simplified example, I'm essentially trying to split a string:
import re
x = 'abcdef'
r = re.compile(r'(\w){6}')
m = r.match(x)
m.groups()  # = ('f',) ?!?
I want to get ('a', 'b', 'c', 'd', 'e', 'f'), but because regex overwrites subsequent captures, I get ('f',)
Is this how regex is supposed to behave? Is there a way to do what I want without having to repeat the syntax six times?
Thanks in advance!
Andrew

You can't use groups for this, I'm afraid. With Python's re module a repeated group does match on every repetition, but it keeps only the last thing it matched; most regex engines behave the same way. A possible solution is to use findall() or similar.
r = re.compile(r'\w')
r.findall(x)
# ['a', 'b', 'c', 'd', 'e', 'f']
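To make the difference concrete, here is a small sketch (Python 3, standard re module only; the variable names are just for illustration) that puts the repeated group next to findall and finditer:

```python
import re

x = 'abcdef'

# The repeated group matches six times, but only the last capture is kept.
m = re.match(r'(\w){6}', x)
last_only = m.groups()

# findall returns every non-overlapping match of the pattern instead.
all_chars = re.findall(r'\w', x)

# finditer yields match objects, which also expose the match positions.
positions = [(mo.group(), mo.start()) for mo in re.finditer(r'\w', x)]

print(last_only)      # ('f',)
print(all_chars)      # ['a', 'b', 'c', 'd', 'e', 'f']
print(positions[:2])  # [('a', 0), ('b', 1)]
```

finditer is the one to reach for if you also need to know where each match was found.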

The regex module can do this.
>>> import regex
>>> m = regex.match(r'(\w){6}', "abcdef")
>>> m.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
Also works with named captures:
>>> m = regex.match(r'(?P<letter>\w){6}', "abcdef")
>>> m.capturesdict()
{'letter': ['a', 'b', 'c', 'd', 'e', 'f']}
The regex module is a third-party package (it is not part of the standard library and has to be installed separately). It is intended to be backwards-compatible with the 're' module while adding many more features and capabilities.

To find all matches in a given string use re.findall(regex, string). Also, if you want to obtain every letter here, your regex should be either '(\w){1}' or just '(\w)'.
See:
r = re.compile(r'(\w)')
l = re.findall(r, x)
l == ['a', 'b', 'c', 'd', 'e', 'f']

I suppose your question is a simplified version of what you actually need.
So here is a slightly more complex example:
import re
pat = re.compile('[UI][bd][ae]')
ch = 'UbaUdeIbaIbeIdaIdeUdeUdaUdeUbeIda'
print([mat.group() for mat in pat.finditer(ch)])
Result:
['Uba', 'Ude', 'Iba', 'Ibe', 'Ida', 'Ide', 'Ude', 'Uda', 'Ude', 'Ube', 'Ida']

Related

Python: remove initial number and underscore from strings in a line (but keep other underscores)

I have a split line (split using .split()) that looks like this:
['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
Each string has a variable number of underscores and different combinations of letters/numbers after the first underscore. For any string with a number followed by an underscore, I want to drop the initial number/underscore to get this result:
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']
This is similar to this question, but I have multiple underscores for some strings in the split line.

You can use re.sub:
import re
d = ['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
new_d = [re.sub(r'^\d+_', '', i) for i in d]
Output:
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']

>>> l=['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
>>> l=["_".join(i.split("_")[1:]) if "_" in i else i for i in l]
>>> l
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']
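The two answers above agree on the sample data, but they are not equivalent in general. The sketch below runs both on a few extra strings ('x_1' and '_lead' are made-up inputs, not from the question) to show where they diverge:

```python
import re

items = ['a', '1_a23_4', '300_235_2_2', 'x_1', '_lead']

# Regex version: strips only a leading run of digits followed by "_".
by_regex = [re.sub(r'^\d+_', '', s) for s in items]

# Split version: drops everything up to the first "_", digits or not.
by_split = ['_'.join(s.split('_')[1:]) if '_' in s else s for s in items]

print(by_regex)  # ['a', 'a23_4', '235_2_2', 'x_1', '_lead']
print(by_split)  # ['a', 'a23_4', '235_2_2', '1', 'lead']
```

If a string can start with something other than a number before its first underscore, the re.sub version is the one that matches the stated requirement.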

How can I get the list to split how i want automatically?

I have some code here:
lsp_rows = ['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'a', 'c',
'd', 'e', 'a', 'b', 'd', 'e', 'a', 'b', 'c', 'e', 'a',
'b', 'c', 'd']
n = int(width/length)
x = [a+b+c+d+e for a,b,c,d,e in zip(*[iter(lsp_rows)]*n)]
Currently, this always splits my list "lsp_rows" into groups of 5, because n = 5. But I need it to split differently depending on n, which changes with the values of width and length.
So if n is 4, I need the list to split into groups of 4.
I can see that the problem is the hard-coded "a+b+c+d+e for a,b,c,d,e", but I don't know how to make it adapt without editing it by hand. Is there a way to solve this?
If you could explain as thoroughly as possible I'd really appreciate it, as I'm pretty new to Python. Thanks in advance!

With strings only you can:
[''.join(t) for t in zip(*[iter(lsp_rows)]*n)]
Or slightly more succinctly (on Python 3, map returns a lazy iterator, so wrap it in list() if you need an actual list):
map(''.join, zip(*[iter(lsp_rows)]*n))
The answer provided by @hpaulj is more useful in the general case.
And, on the off-chance that you're just trying to generate the cycles of a string, the following will produce the same output.
s = 'abcde'
[s[i:] + s[:i] for i in range(len(s))]

I believe this will generalize your expression to n items:
import functools
import operator
[functools.reduce(operator.add,abc) for abc in zip(*[iter(x)]*n)]
though I'd still like to see a test case.
For example, if x is a list of lists, each group of n sublists is concatenated into one flat list.
A list of numbers or a string looks better:
In [394]: [functools.reduce(operator.add,abc) for abc in zip(*[iter('abcdefghij')]*4)]
Out[394]: ['abcd', 'efgh']
In [395]: [functools.reduce(operator.add,abc) for abc in zip(*[iter('abcdefghij')]*5)]
Out[395]: ['abcde', 'fghij']
In [396]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(range(20))]*5)]
Out[396]: [10, 35, 60, 85]
with your list of characters
In [400]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(lsp_rows)]*5)]
Out[400]: ['abcde', 'bcdea', 'cdeab', 'deabc', 'eabcd']
In [401]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(lsp_rows)]*6)]
Out[401]: ['abcdeb', 'cdeacd', 'eabdea', 'bceabc']
If the items are strings, ''.join can replace functools.reduce and operator.add, so neither import is needed.
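One thing to be aware of: the zip(*[iter(...)]*n) idiom silently drops a trailing group that has fewer than n items. If you want to keep it, plain slicing is an alternative. This is only a sketch; chunks is a hypothetical helper name, and n = 4 stands in for int(width / length) from the question:

```python
def chunks(seq, n):
    """Return consecutive slices of seq of length n (the last may be shorter)."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

# Same 25 letters as lsp_rows in the question.
lsp_rows = list('abcdebcdeacdeabdeabceabcd')
n = 4  # hypothetical value; the question computes it from width and length

joined = [''.join(group) for group in chunks(lsp_rows, n)]
print(joined)  # ['abcd', 'ebcd', 'eacd', 'eabd', 'eabc', 'eabc', 'd']

# The zip idiom gives the same groups minus the incomplete last one.
zipped = [''.join(t) for t in zip(*[iter(lsp_rows)] * n)]
print(zipped)  # ['abcd', 'ebcd', 'eacd', 'eabd', 'eabc', 'eabc']
```

Whether the leftover 'd' should be kept or dropped depends on what the rows represent, so pick whichever behaviour fits.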

python: compare lists in a sequence using nested for loops

so I have two lists where I compare a person's answers to the correct answers:
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
I need to compare the two of them (without using sets, if that's even possible) and keep track of how many of the person's answers are wrong - in this case, 3
I tried using the following for loops to count how many were correct:
correct = 0
for i in correct_answers:
    for j in user_answers:
        if i == j:
            correct += 1
print(correct)
but this doesn't work and I'm not sure what I need to change to make it work.

Just count them:
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = sum(1 if correct != user else 0
                for correct, user in zip(correct_answers, user_answers))
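In case it helps to see why the nested loops in the question give the wrong number: they compare every correct answer with every user answer (5 * 5 pairs), not just the answers at the same position. A quick sketch with the same data (pairs_equal is just an illustrative name):

```python
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']

# Nested loops: counts every (correct, user) pair that happens to be equal.
pairs_equal = sum(1 for c in correct_answers for u in user_answers if c == u)

# zip: pairs the answers position by position, which is what grading needs.
incorrect = sum(1 for c, u in zip(correct_answers, user_answers) if c != u)

print(pairs_equal)  # 6
print(incorrect)    # 3
```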

I blame @alecxe for convincing me to post this, the ultra-efficient solution:
# Python 2 only: from future_builtins import map  (lazy map, avoids an intermediate list; on Python 3, map is already lazy)
from operator import ne
numincorrect = sum(map(ne, correct_answers, user_answers))
This pushes all the work to the C layer in CPython (making it crazy fast, modulo the initial cost of setting it all up; no byte code is executed if the values processed are Python built-in types, which removes a lot of overhead), and one-lines it without getting too cryptic.

The less pythonic, more generic (and readable) solution is pretty simple too.
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = 0
for i in range(len(correct_answers)):
    if correct_answers[i] != user_answers[i]:
        incorrect += 1
This assumes your lists are the same length. If you need to validate that, you can do it before running this code.
EDIT: The following code does the same thing, provided you are familiar with zip
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = 0
for answer_tuple in zip(correct_answers, user_answers):
    if answer_tuple[0] != answer_tuple[1]:
        incorrect += 1
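If the lists might differ in length, note that zip stops at the shorter one without complaining, so the check has to be explicit. A minimal sketch, assuming you want an error in that case (count_incorrect is a made-up helper name):

```python
def count_incorrect(correct_answers, user_answers):
    """Count the positions where the two answer lists differ."""
    if len(correct_answers) != len(user_answers):
        raise ValueError("answer lists must have the same length")
    # True counts as 1 when summed, so this adds up the mismatches.
    return sum(c != u for c, u in zip(correct_answers, user_answers))

print(count_incorrect(['A', 'C', 'A', 'B', 'D'],
                      ['B', 'A', 'C', 'B', 'D']))  # 3
```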

find all characters NOT in regex pattern

Let's say I have a regex of legal characters
legals = re.compile("[abc]")
I can return a list of legal characters in a string like this:
finder = re.finditer(legals, "abcdefg")
[match.group() for match in finder]
['a', 'b', 'c']
How can I use regex to find a list of the characters NOT in the regex? I.e., in my case it would return
['d','e','f','g']
Edit: To clarify, I'm hoping to find a way to do this without modifying the regex itself.

Negate the character class:
>>> illegals = re.compile("[^abc]")
>>> finder = re.finditer(illegals, "abcdefg")
>>> [match.group() for match in finder]
['d', 'e', 'f', 'g']
If you can't do that (and you're only dealing with one-character matches), you could remove the legal characters and keep whatever remains:
>>> legals = re.compile("[abc]")
>>> remains = legals.sub("", "abcdefg")
>>> [char for char in remains]
['d', 'e', 'f', 'g']
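Another way to leave the regex untouched, again assuming it matches single characters only, is to test each character against it and keep the ones that fail. A sketch using Pattern.fullmatch (Python 3.4+); text and illegal_chars are illustrative names:

```python
import re

legals = re.compile("[abc]")
text = "abcdefg"

# Keep each character that the unmodified pattern does not match.
illegal_chars = [ch for ch in text if not legals.fullmatch(ch)]
print(illegal_chars)  # ['d', 'e', 'f', 'g']
```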

How to unpack a list?

When extracting data from a list this way
line[0:3], line[3][:2], line[3][2:]
I get a list followed by two strings, as expected:
(['a', 'b', 'c'], 'd', 'e')
I need to manipulate the list so the end result is
('a', 'b', 'c', 'd', 'e')
How? Thank you.
P.S. Yes, I know that I can write down the first element as line[0], line[1], line[2], but I think that's a pretty awkward solution.

from itertools import chain
print(tuple(chain(['a', 'b', 'c'], 'd', 'e')))
Output:
('a', 'b', 'c', 'd', 'e')

Try this.
line = ['a', 'b', 'c', 'de']
tuple(line[0:3] + [line[3][:1]] + [line[3][1:]])
('a', 'b', 'c', 'd', 'e')
NOTE:
I think there is some funny business in your slicing logic.
If [2:] returns any characters, [:2] must return 2 characters.
Please provide your input line.

Obvious answer: Instead of your first line, do:
line[0:3] + [line[3][:2], line[3][2:]]
That works assuming that line[0:3] is a list. Otherwise, you may need to make some minor adjustments.
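On Python 3.5 or newer, star-unpacking inside a tuple is another option. This sketch assumes an input like the one guessed in an earlier answer (the question never shows the real line), and uses [:1] / [1:] so the last item splits into 'd' and 'e':

```python
line = ['a', 'b', 'c', 'de']  # hypothetical input

# The * spreads the slice's items into the surrounding tuple.
result = (*line[0:3], line[3][:1], line[3][1:])
print(result)  # ('a', 'b', 'c', 'd', 'e')
```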

This function flattens one level of nesting:
def merge(seq):
    merged = []
    for s in seq:
        for x in s:
            merged.append(x)
    return merged
source: http://www.testingreflections.com/node/view/4930

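def is_iterable(i):
    # Strings are iterable too, but here they should count as single items
    # (on Python 3, recursing into a string would never terminate).
    return hasattr(i, '__iter__') and not isinstance(i, (str, bytes))
def iterative_flatten(List):
    for item in List:
        if is_iterable(item):
            for sub_item in iterative_flatten(item):
                yield sub_item
        else:
            yield item
def flatten_iterable(to_flatten):
    return tuple(iterative_flatten(to_flatten))
This should work for any level of nesting.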
