Python group repeating characters in list together - python

I want to split a string of 1s and 0s into a list of repeated groups of these characters.
Example:
m = '00001100011111000'
Goes to:
m = ['0000', '11', '000', '11111', '000']
How would I do this in efficient concise code? Would regex work for this?

Yes, you can use re. For example:
import re
m = "00001100011111000"
print(["".join(v) for v, _ in re.findall(r"((.)\2*)", m)])
Prints:
['0000', '11', '000', '11111', '000']
Other regex:
print(re.findall(r"0+|1+", m))
Prints:
['0000', '11', '000', '11111', '000']

You can use itertools.groupby:
from itertools import groupby
m = '00001100011111000'
["".join(g) for _, g in groupby(m)]

Another method using itertools.groupby:
from itertools import groupby
["".join(g) for k, g in groupby(m)]
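To see what groupby is doing in these answers, the following sketch prints each key together with the length of its run. The names `runs` and `chunks` are only for illustration.

```python
from itertools import groupby

m = '00001100011111000'

# groupby yields (key, group_iterator) pairs, one pair per run of equal characters
runs = [(key, len(list(group))) for key, group in groupby(m)]
print(runs)
# [('0', 4), ('1', 2), ('0', 3), ('1', 5), ('0', 3)]

# Rebuilding each run from its key and length gives the same list as the join-based answers
chunks = [key * length for key, length in runs]
print(chunks)
# ['0000', '11', '000', '11111', '000']
```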

You can make use of re.finditer():
import re
m = "00001100011111000"
[s.group(0) for s in re.finditer(r"(\d)\1*", m)]
Output:
['0000', '11', '000', '11111', '000']

Related

How to split a text to line by line in python [duplicate]

Is it possible to split a string every nth character?
For example, suppose I have a string containing the following:
'1234567890'
How can I get it to look like this:
['12','34','56','78','90']
For the same question with a list, see "How do I split a list into equally-sized chunks?". The same techniques generally apply, though there are some variations.
>>> line = '1234567890'
>>> n = 2
>>> [line[i:i+n] for i in range(0, len(line), n)]
['12', '34', '56', '78', '90']
Just to be complete, you can do this with a regex:
>>> import re
>>> re.findall('..','1234567890')
['12', '34', '56', '78', '90']
For odd number of chars you can do this:
>>> import re
>>> re.findall('..?', '123456789')
['12', '34', '56', '78', '9']
You can also do the following, to simplify the regex for longer chunks:
>>> import re
>>> re.findall('.{1,2}', '123456789')
['12', '34', '56', '78', '9']
And you can use re.finditer if the string is long to generate chunk by chunk.
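As a rough sketch of that re.finditer idea, the generator below yields one chunk at a time instead of building the whole list. The helper name `iter_chunks` is hypothetical, and the re.DOTALL flag is an added assumption so that newlines are chunked like any other character.

```python
import re

def iter_chunks(text, n):
    """Lazily yield successive chunks of up to n characters (hypothetical helper)."""
    # re.DOTALL lets '.' match newlines too, so no character is skipped
    pattern = re.compile(r'.{1,%d}' % n, re.DOTALL)
    for match in pattern.finditer(text):
        yield match.group(0)

for chunk in iter_chunks('123456789', 2):
    print(chunk)
```

Each chunk is produced only when the loop asks for it, which may matter for very long strings.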
There is already an inbuilt function in python for this.
>>> from textwrap import wrap
>>> s = '1234567890'
>>> wrap(s, 2)
['12', '34', '56', '78', '90']
This is what the docstring for wrap says:
>>> help(wrap)
'''
Help on function wrap in module textwrap:
wrap(text, width=70, **kwargs)
Wrap a single paragraph of text, returning a list of wrapped lines.
Reformat the single paragraph in 'text' so it fits in lines of no
more than 'width' columns, and return a list of wrapped lines. By
default, tabs in 'text' are expanded with string.expandtabs(), and
all other whitespace characters (including newline) are converted to
space. See TextWrapper class for available keyword args to customize
wrapping behaviour.
'''
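One thing the docstring hints at: wrap is designed for paragraphs, so it treats whitespace specially. The sketch below compares it with plain slicing; `slice_chunks` is a hypothetical helper written just for this comparison.

```python
from textwrap import wrap

def slice_chunks(text, n):
    # Plain slicing keeps every character, including whitespace
    return [text[i:i + n] for i in range(0, len(text), n)]

no_spaces = '1234567890'
with_spaces = 'a b c d'

print(wrap(no_spaces, 2) == slice_chunks(no_spaces, 2))      # True
print(wrap(with_spaces, 2) == slice_chunks(with_spaces, 2))  # False
print(slice_chunks(with_spaces, 2))  # ['a ', 'b ', 'c ', 'd']
```

So wrap matches slicing for strings without whitespace, but it is probably not what you want when spaces must be preserved exactly.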
Another common way of grouping elements into n-length groups:
>>> s = '1234567890'
>>> list(map(''.join, zip(*[iter(s)]*2)))  # list() is needed on Python 3, where map is lazy
['12', '34', '56', '78', '90']
This method comes straight from the docs for zip().
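The idiom is terse, so here is a step-by-step sketch of why it works (Python 3; the intermediate variable names are only for illustration).

```python
s = '1234567890'
n = 2

it = iter(s)
# [it] * n is a list holding the SAME iterator n times, not n copies
same_iterator_n_times = [it] * n

# zip pulls one item from each argument in turn; since they are all the same
# iterator, each output tuple consumes n consecutive characters
pairs = list(zip(*same_iterator_n_times))
print(pairs[:2])   # [('1', '2'), ('3', '4')]

chunks = [''.join(p) for p in pairs]
print(chunks)      # ['12', '34', '56', '78', '90']

# Caveat: zip stops at the shortest input, so a trailing partial chunk is dropped
odd = [''.join(p) for p in zip(*[iter('12345')] * 2)]
print(odd)         # ['12', '34']
```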
I think this is shorter and more readable than the itertools version:
def split_by_n(seq, n):
    '''A generator to divide a sequence into chunks of n units.'''
    while seq:
        yield seq[:n]
        seq = seq[n:]
print(list(split_by_n('1234567890', 2)))
Using more-itertools from PyPI:
>>> from more_itertools import sliced
>>> list(sliced('1234567890', 2))
['12', '34', '56', '78', '90']
I like this solution:
s = '1234567890'
o = []
while s:
    o.append(s[:2])
    s = s[2:]
You could use the grouper() recipe from itertools:
Python 2.x:
from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Python 3.x:
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
These functions are memory-efficient and work with any iterables.
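Note that grouper yields tuples, not strings. A minimal sketch of applying the Python 3 version to the string case, assuming an empty-string fillvalue is acceptable for the last chunk:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# An empty-string fillvalue pads the last tuple without adding visible characters
chunks = [''.join(group) for group in grouper('1234567', 2, fillvalue='')]
print(chunks)  # ['12', '34', '56', '7']
```

With the default fillvalue of None, the join would fail on an odd-length string, so some string fillvalue is needed here.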
This can be achieved by a simple for loop.
a = '1234567890a'
result = []
for i in range(0, len(a), 2):
    result.append(a[i : i + 2])
print(result)
The output looks like
['12', '34', '56', '78', '90', 'a']
I was stuck in the same scenario.
This worked for me:
x = "1234567890"
n = 2
my_list = []
for i in range(0, len(x), n):
    my_list.append(x[i:i+n])
print(my_list)
Output:
['12', '34', '56', '78', '90']
Try the following code:
from itertools import islice
def split_every(n, iterable):
    i = iter(iterable)
    piece = list(islice(i, n))
    while piece:
        yield piece
        piece = list(islice(i, n))
s = '1234567890'
# Each piece is a list of characters, e.g. ['1', '2']
print(list(split_every(2, list(s))))
Try this:
s='1234567890'
print([s[idx:idx+2] for idx,val in enumerate(s) if idx%2 == 0])
Output:
['12', '34', '56', '78', '90']
>>> from functools import reduce
>>> from operator import add
>>> # On Python 3 the built-in zip is already lazy; on Python 2 use itertools.izip instead
>>> x = iter('1234567890')
>>> [reduce(add, tup) for tup in zip(x, x)]
['12', '34', '56', '78', '90']
>>> x = iter('1234567890')
>>> [reduce(add, tup) for tup in zip(x, x, x)]
['123', '456', '789']
As always, for those who love one liners
n = 2
line = "this is a line split into n characters"
line = [line[i * n:i * n+n] for i,blah in enumerate(line[::n])]
more_itertools.sliced has been mentioned before. Here are four more options from the more_itertools library:
import more_itertools as mit
s = "1234567890"
["".join(c) for c in mit.grouper(s, 2)]  # recent more_itertools releases take the iterable first
["".join(c) for c in mit.chunked(s, 2)]
["".join(c) for c in mit.windowed(s, 2, step=2)]
["".join(c) for c in mit.split_after(s, lambda x: int(x) % 2 == 0)]
Each of the latter options produces the following output:
['12', '34', '56', '78', '90']
Documentation for discussed options: grouper, chunked, windowed, split_after
A simple recursive solution for short string:
def split(s, n):
    if len(s) < n:
        return []
    else:
        return [s[:n]] + split(s[n:], n)
print(split('1234567890', 2))
Or in such a form:
def split(s, n):
    if len(s) < n:
        return []
    elif len(s) == n:
        return [s]
    else:
        return split(s[:n], n) + split(s[n:], n)
The second form illustrates the typical divide-and-conquer pattern of a recursive approach more explicitly, though in practice it is not necessary to write it this way. Note that both versions drop a trailing chunk shorter than n.
A solution with groupby:
from itertools import groupby, chain, repeat, cycle
text = "wwworldggggreattecchemggpwwwzaz"
n = 3
c = cycle(chain(repeat(0, n), repeat(1, n)))
res = ["".join(g) for _, g in groupby(text, lambda x: next(c))]
print(res)
Output:
['www', 'orl', 'dgg', 'ggr', 'eat', 'tec', 'che', 'mgg', 'pww', 'wza', 'z']
These answers are all nice and working and all, but the syntax is so cryptic... Why not write a simple function?
def SplitEvery(string, length):
    if len(string) <= length:
        return [string]
    # Ceiling division: gives range() an int on Python 3 and keeps a shorter final chunk
    sections = -(-len(string) // length)
    lines = []
    start = 0
    for i in range(sections):
        line = string[start:start+length]
        lines.append(line)
        start += length
    return lines
And call it simply:
text = '1234567890'
lines = SplitEvery(text, 2)
print(lines)
# output: ['12', '34', '56', '78', '90']
Another solution using groupby and index//n as the key to group the letters:
from itertools import groupby
text = "abcdefghij"
n = 3
result = []
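# Pair each character with its position so the key can be position // n
for idx, chunk in groupby(enumerate(text), key=lambda pair: pair[0] // n):
    result.append("".join(char for _, char in chunk))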
# result = ['abc', 'def', 'ghi', 'j']

Break a binary string down to segments

The task here is to break down a string 110011110110000 into a list:
['11', '00', '1111', '0', '11', '0000']
My solution is
str1='110011110110000'
seg = []
a0=str1[0]
seg0=''
for a in str1:
    print('a=',a)
    if a==a0:
        seg0=seg0+a
    else:
        print('seg0=',seg0)
        seg.append(seg0)
        seg0=a
        a0=a
seg.append(seg0)
seg
It's ugly and I am sure you guys out there have a one-liner for this. Maybe regex?
You can use itertools.groupby (doc):
str1='110011110110000'
from itertools import groupby
l = [v * len([*g]) for v, g in groupby(str1)]
print(l)
Prints:
['11', '00', '1111', '0', '11', '0000']
EDIT: version with regex:
str1='110011110110000'
import re
print([g[0] for g in re.findall(r'((\d)\2*)', str1)])
Here is a regex solution:
result = [x[0] for x in re.findall(r'(([10])\2*)', str1)]
The regex is (([10])\2*), find a 0 or 1, then keep looking for that same thing. Since findall returns all groups in the match, we need to map it to the first group (Group 2 is the ([10]) bit).
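To make the "map it to the first group" step concrete, this sketch prints what findall returns before the mapping. The names `raw`, `whole` and `_first_char` are only for illustration.

```python
import re

str1 = '110011110110000'

# With two capturing groups, findall returns one (group 1, group 2) tuple per match
raw = re.findall(r'(([10])\2*)', str1)
print(raw[:3])
# [('11', '1'), ('00', '0'), ('1111', '1')]

# Group 1 is the whole run; group 2 is just its first character
result = [whole for whole, _first_char in raw]
print(result)
# ['11', '00', '1111', '0', '11', '0000']
```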
Here is an iterative regex approach, using the simple pattern 1+|0+:
import re
str1 = "110011110110000"
pattern = re.compile(r'(1+|0+)')
result = []
for m in re.finditer(pattern, str1):
    result.append(m.group(0))
print(result)
This prints:
['11', '00', '1111', '0', '11', '0000']
Note that we might instead want to use re.split here. Before Python 3.7, re.split could not split on a pattern that matches the empty string, which rules out splitting on lookarounds; from Python 3.7 on it can. In other languages, such as Java, we could split on this pattern:
(?<=0)(?=1)|(?<=1)(?=0)
This would nicely generate the array/list we expect.
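On Python 3.7 and later, re.split does accept patterns that can match the empty string, so the lookaround pattern above can be tried directly. A minimal sketch, assuming such a Python version:

```python
import re

str1 = '110011110110000'

# Split at every position where a 0 is followed by a 1, or a 1 by a 0.
# Requires Python 3.7+, where re.split can split on empty matches.
segments = re.split(r'(?<=0)(?=1)|(?<=1)(?=0)', str1)
print(segments)
# ['11', '00', '1111', '0', '11', '0000']
```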
One-line solution using groupby:
from itertools import groupby
text='1100111101100001'
sol = [''.join(group) for key, group in groupby(text)]
print(sol)
output
['11', '00', '1111', '0', '11', '0000', '1']
Not a regex solution, but an improvement on your code:
str1='110011110110000'
def func(string):
    tmp = string[0]
    res =[]
    # Start from the second character; the first one is already in tmp
    for v in string[1:]:
        if v==tmp[-1]:
            tmp+=v
        else:
            res.append(tmp)
            tmp=v
    res.append(tmp)
    return res
print(func(str1))
output
['11', '00', '1111', '0', '11', '0000']
You can use the general regex (.)\1*
(.) - match any single character and store it in the first capturing group
\1* - repeat what was captured in the first capturing group zero or more times
The collection of full matches will be your desired result.
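One practical detail in Python: with a single capturing group, re.findall returns the group rather than the full match, so re.finditer is the easier way to get the whole runs. A short sketch with an arbitrary sample string:

```python
import re

text = 'aaabccdddd'

# With a capturing group, re.findall returns only the group (one character per run)
first_chars = re.findall(r'(.)\1*', text)
print(first_chars)   # ['a', 'b', 'c', 'd']

# re.finditer exposes the full match through group(0)
runs = [m.group(0) for m in re.finditer(r'(.)\1*', text)]
print(runs)          # ['aaa', 'b', 'cc', 'dddd']
```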

How to read 1st column from csv and separate into multidimensional array

I am trying to separate a column that I read from a .csv file into a multidimensional array. So, if the first column is read into a single array and looks like this:
t = ['90-0066', '24', '33', '34', '91-0495', '22', '33', '92-6676', '23', '32']
How do I write the code in python for every value like '90-0066' the following numbers are put into an array until the next - value? So I would like the array to look like:
t = [['24', '33', '34'], ['22', '33'], ['23', '32']]
Thanks!
You can use itertools.groupby in a list comprehension:
from itertools import groupby
t = [list(g) for k, g in groupby(t, key=str.isdigit) if k]
t becomes:
[['24', '33', '34'], ['22', '33'], ['23', '32']]
If the numbers are possibly floating points, you can use regex instead:
import re
t = [list(g) for k, g in groupby(t, key=lambda s: bool(re.match(r'\d+(?:\.\d+)?$', s))) if k]
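To see why the `if k` filter works, it can help to look at the raw (key, group) pairs that groupby produces for this list. The names `pairs` and `numeric_runs` are only for illustration.

```python
from itertools import groupby

t = ['90-0066', '24', '33', '34', '91-0495', '22', '33', '92-6676', '23', '32']

# The key is True for all-digit strings and False for the ID-like entries
pairs = [(k, list(g)) for k, g in groupby(t, key=str.isdigit)]
for k, g in pairs:
    print(k, g)
# False ['90-0066']
# True ['24', '33', '34']
# False ['91-0495']
# True ['22', '33']
# False ['92-6676']
# True ['23', '32']

# Keeping only the True groups leaves the runs of numbers
numeric_runs = [g for k, g in pairs if k]
print(numeric_runs)
```

One consequence of this approach: two ID-like entries in a row would fall into a single False group, so no empty list would be produced between them.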
Or zip longest with two list comprehensions:
>>> from itertools import zip_longest
>>> l=[i for i,v in enumerate(t) if not v.isdigit()]
>>> [t[x+1:y] for x,y in zip_longest(l,l[1:])]
[['24', '33', '34'], ['22', '33'], ['23', '32']]

Python Regex capture multiple sections within string

I have string that are always of the format track-a-b where a and b are integers.
For example:
track-12-29
track-1-210
track-56-1
How do I extract a and b from such strings in python?
If it's just a single string, I would approach this using split:
>>> s = 'track-12-29'
>>> s.split('-')[1:]
['12', '29']
If it is a multi-line string, I would use the same approach ...
>>> s = 'track-12-29\ntrack-1-210\ntrack-56-1'
>>> results = [x.split('-')[1:] for x in s.splitlines()]
>>> results
[['12', '29'], ['1', '210'], ['56', '1']]
You'll want to use re.findall() with capturing groups (here data is assumed to be an iterable of such strings):
import re
results = [re.findall(r'track-(\d+)-(\d+)', datum) for datum in data]
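If the goal is a and b as integers, a sketch along these lines may be easier to work with. It uses re.fullmatch instead of findall, and the sample `data` list is an assumption built from the strings in the question.

```python
import re

# Hypothetical input: the example strings from the question
data = ['track-12-29', 'track-1-210', 'track-56-1']

pattern = re.compile(r'track-(\d+)-(\d+)')

pairs = []
for datum in data:
    match = pattern.fullmatch(datum)
    if match:  # skip strings that do not follow the track-a-b format
        a, b = map(int, match.groups())
        pairs.append((a, b))

print(pairs)  # [(12, 29), (1, 210), (56, 1)]
```

Using fullmatch means a string with extra text before or after the pattern is rejected rather than partially matched.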

Splitting a string into a list (but not separating adjacent numbers) in Python

For example, I have:
string = "123ab4 5"
I want to be able to get the following list:
["123","ab","4","5"]
rather than list(string) giving me:
["1","2","3","a","b","4"," ","5"]
Find one or more adjacent digits (\d+), or if that fails find non-digit, non-space characters ([^\d\s]+).
>>> string = '123ab4 5'
>>> import re
>>> re.findall(r'\d+|[^\d\s]+', string)
['123', 'ab', '4', '5']
If you don't want the letters joined together, try this:
>>> re.findall(r'\d+|\S', string)
['123', 'a', 'b', '4', '5']
The other solutions are definitely easier. If you want something far less straightforward, you could try something like this:
>>> import string
>>> from itertools import groupby
>>> s = "123ab4 5"
>>> result = [''.join(list(v)) for _, v in groupby(s, key=lambda x: x.isdigit())]
>>> result = [x for x in result if x not in string.whitespace]
>>> result
['123', 'ab', '4', '5']
You could do:
>>> [el for el in re.split(r'(\d+)', string) if el.strip()]
['123', 'ab', '4', '5']
This will give the split you want:
re.findall(r'\d+|[a-zA-Z]+', "123ab4 5")
['123', 'ab', '4', '5']
You can do a few things here:
1. Iterate over the string and build groups of digits as you go, appending them to your results list.
Not a great solution.
2. Use regular expressions.
Implementation of 2:
>>> import re
>>> s = "123ab4 5"
>>> re.findall(r'\d+|[^\d]', s)
['123', 'a', 'b', '4', ' ', '5']
This grabs any run of one or more digits (\d+), or any other single character.
Edit:
John beat me to the correct solution, and it's a wonderful one.
I will leave this here, though, because someone else might read the question the way I did and look for this answer. I was under the impression the OP wanted to capture only groups of digits and leave everything else as individual characters.
