How can I group a list of objects by continuity?


Given a very large (gigabytes) list of arbitrary objects, can I easily group it into sublists by equivalence, either in place or with a generator that consumes the original list? (I've seen a similar solution to this for ints.)
l0 = [A,B, A,B,B, A,B,B,B,B, A, A, A,B] #spaces for clarity
Desired result:
[['A', 'B'], ['A', 'B', 'B'], ['A', 'B', 'B', 'B', 'B'], ['A'], ['A'], ['A', 'B']]
I wrote a looping version like so:
# find boundaries
b0 = []
group = A
for idx, elem in enumerate(l0):
    if elem == group:
        b0.append(idx)
b0.append(len(l0)-1)
# slice between consecutive boundaries
l1 = []
for idx, b in enumerate(b0):
    try:
        c = b0[idx+1]
    except IndexError:
        break
    if c == len(l0)-1:
        l1.append(l0[b:])
    else:
        l1.append(l0[b:c])
Can this be done as a generator gen(l) that will work like this?
for g in gen(l0):
    print g
....
['A', 'B']
['A', 'B', 'B']
['A', 'B', 'B', 'B', 'B']
....
EDIT: using python 2.6 or 2.7
EDIT: preferred solution, mostly based on the accepted answer:
def gen_group(f, items):
    out = [items[0]]
    while items:
        for elem in items[1:]:
            if f(elem, out[0]):
                break
            else:
                out.append(elem)
        for _i in out:
            items.pop(0)
        yield out
        if items:
            out = [items[0]]
g = gen_group(lambda x, y: x == y, l0)
for out in g:
    print out

Maybe something like this:
def subListGenerator(f,items):
    i = 0
    n = len(items)
    while i < n:
        sublist = [items[i]]
        i += 1
        while i < n and not f(items[i]):
            sublist.append(items[i])
            i += 1
        yield sublist
Used like:
>>> items = ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'B']
>>> g = subListGenerator(lambda x: x == 'A',items)
>>> for x in g: print(x)
['A', 'B']
['A', 'B', 'B']
['A', 'B', 'B', 'B', 'B']
['A']
['A']
['A', 'B']

I assume that A is your breakpoint.
>>> A, B = 'A', 'B'
>>> x = [A,B, A,B,B, A,B,B,B,B, A, A, A,B]
>>> map(lambda arr: [i for i in arr[0]], map(lambda e: ['A'+e], ''.join(x).split('A')[1:]))
[['A', 'B'], ['A', 'B', 'B'], ['A', 'B', 'B', 'B', 'B'], ['A'], ['A'], ['A', 'B']]

Here's a simple generator to perform your task:
def gen_group(L):
    DELIMETER = "A"
    out = [DELIMETER]
    while L:
        for ind, elem in enumerate(L[1:]):
            if elem == DELIMETER :
                break
            else:
                out.append(elem)
        for i in range(ind + 1):
            L.pop(0)
        yield out
        out = [DELIMETER ]
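The idea is to cut down the list and yield the sublists until there is nothing left. This assumes the list starts with "A" (the DELIMETER variable). Note that, as written, it yields a spurious trailing ['A'] for this input (see the last line of the sample output), because the final group is not terminated by a delimiter.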
Sample output:
for out in gen_group(l0):
    print out
Produces
['A', 'B']
['A', 'B', 'B']
['A', 'B', 'B', 'B', 'B']
['A']
['A']
['A', 'B']
['A']
Comparative Timings:
timeit.timeit(s, number=100000) is used to test each of the current answers, where s is the multiline string of the code (listed below):
                          Trial 1   Trial 2   Trial 3   Trial 4 | Avg
This answer (s1):         0.08247   0.07968   0.08635   0.07133 | 0.07995
Dilara Ismailova (s2):    0.77282   0.72337   0.73829   0.70574 | 0.73506
John Coleman (s3):        0.08119   0.09625   0.08405   0.08419 | 0.08642
This answer is the fastest, but only narrowly ahead of John Coleman's. I suspect the difference is the additional argument and anonymous function in John Coleman's answer. Note that s1 and s3 only create the generator and never iterate over it, so their figures measure setup cost rather than the grouping work itself.
s1="""l0 = ["A","B", "A","B","B", "A","B","B","B","B", "A", "A", "A","B"]
def gen_group(L):
    out = ["A"]
    while L:
        for ind, elem in enumerate(L[1:]):
            if elem == "A":
                break
            else:
                out.append(elem)
        for i in range(ind + 1):
            L.pop(0)
        yield out
        out = ["A"]
out =gen_group(l0)"""
s2 = """A, B = 'A', 'B'
x = [A,B, A,B,B, A,B,B,B,B, A, A, A,B]
map(lambda arr: [i for i in arr[0]], map(lambda e: ['A'+e], ''.join(x).split('A')[1:]))"""
s3 = """def subListGenerator(f,items):
    i = 0
    n = len(items)
    while i < n:
        sublist = [items[i]]
        i += 1
        while i < n and not f(items[i]):
            sublist.append(items[i])
            i += 1
        yield sublist
items = ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'B']
g = subListGenerator(lambda x: x == 'A',items)"""

The following works in this case. You could change the l[0] != boundary condition to whatever you need; passing the boundary as an argument lets you reuse the function somewhere else.
def gen(l_arg, boundary):
    l = list(l_arg)  # copy so the caller's list is left intact; use l = l_arg to save memory
    while l:
        sub_list = [l.pop(0)]
        while l and l[0] != boundary:  # here boundary = 'A'
            sub_list.append(l.pop(0))
        yield sub_list
It assumes that there is an 'A' at the beginning of your list. It also copies the list, which isn't great when the list is in the gigabyte range; you can remove the copy to save memory if you don't care about keeping the original list.
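Both the copy and the repeated pop(0) calls (each of which shifts every remaining element) can get expensive on a list this large. A minimal sketch of an alternative, assuming the hypothetical names group_at_boundary and is_boundary (neither comes from the answers above): walk the input once and start a new sublist whenever the predicate matches. The input is only read, so it is neither copied nor consumed.

```python
def group_at_boundary(items, is_boundary):
    """Yield sublists of items, opening a new one at each boundary element.

    group_at_boundary and is_boundary are illustrative names only.
    Works on any iterable; the input is never copied or mutated.
    """
    group = []
    for elem in items:
        # A boundary element closes the group collected so far (if any) ...
        if is_boundary(elem) and group:
            yield group
            group = []
        # ... and every element, boundary or not, joins the open group.
        group.append(elem)
    if group:
        yield group


l0 = ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'B']
for g in group_at_boundary(l0, lambda x: x == 'A'):
    print(g)
```

Unlike the popping versions, this does not release memory from the original list as it goes; if that matters, the caller has to drop the list itself.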

Related

Variable value is changing between print and append

The value that gets printed differs from the value that ends up stored in the list.
If I run this code:
def swap(string,x,y):
    string[y], string[x] = string[x], string[y]
def permutations(string ,i=0):
    if i == len(string):
        yield string
    for x in range(i, len(string)):
        perm = string
        swap(perm,x,i)
        yield from permutations(perm, i+1)
        swap(perm,i,x)
result = []
test = permutations(['a','b','c'])
for x in test:
    print(x)
    result.append(x)
print(result)
It prints this, and I don't know why:
['a', 'b', 'c']
['a', 'c', 'b']
['b', 'a', 'c']
['b', 'c', 'a']
['c', 'b', 'a']
['c', 'a', 'b']
[['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

You're mutating the same x in place, so only the final version of it is printed after the loop.
result.append(x) does not copy the object (x in this case), it just places a reference to it into the result list.
Do e.g. result.append(x[:]) or result.append(list(x)) to put copies of x into the result list.
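A small sketch of the difference between appending a reference and appending a copy. The variable names (row, by_reference, by_copy) are made up for this illustration and do not appear in the question.

```python
row = ['a', 'b']
by_reference = []
by_copy = []

by_reference.append(row)   # stores a reference to the same list object
by_copy.append(row[:])     # stores an independent shallow copy

row[0], row[1] = row[1], row[0]   # mutate the original list in place

print(by_reference)   # [['b', 'a']] -- follows the mutation
print(by_copy)        # [['a', 'b']] -- snapshot taken at append time
```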

The generator yields the same list object every time, so every entry in result refers to that one list, and each later swap changes all of them. The quick fix is to yield a copy of the list:
def swap(string,x,y):
    string[y], string[x] = string[x], string[y]
def permutations(string ,i=0):
    if i == len(string):
        yield string.copy()
    for x in range(i, len(string)):
        perm = string
        swap(perm,x,i)
        yield from permutations(perm, i+1)
        swap(perm,i,x)
result = []
test = permutations(['a','b','c'])
for x in test:
    print(x)
    result.append(x)
print(result)

Is there better way to check value changed in sequence?

I have a list like below:
list = ['A', 'A', 'B', 'A', 'B', 'A', 'B']
I want to count how many times the first value (in the case above, 'A') occurs consecutively before a different value ('B') appears.
So I wrote code like this:
history = list[0]
number_A = 0
number_B = 0
for i in list:
    if history != i:
        break
    if i == 'A':
        number_A += 1
        history = 'A'
    else:
        number_B += 1
        history = 'B'
However, I think this is very untidy.
Is there a simpler way to do this?
Thank you for reading.

Using groupby with the default key function, you can count the number of items in the first grouper:
from itertools import groupby
def count_first(lst):
    if not lst:
        return 0
    _, grouper = next(groupby(lst))
    return sum(1 for _ in grouper)
print(count_first(['A', 'A', 'B', 'A', 'B', 'A', 'B']))
# 2

There is no reason for the "else" clause, you are not going to count 'B's since you are going to break before you get there.
lst = ['A', 'A', 'B', 'A', 'B', 'A', 'B']
count = 0
for i in lst:
    if i != lst[0]:
        break
    count += 1
print("list starts with %d of %s's" % (count, lst[0]))

You could use takewhile:
from itertools import takewhile
my_list = ['A', 'A', 'B', 'A', 'B', 'A', 'B']
res = takewhile(lambda x: x == my_list[0], my_list)
print(len(list(res)))
OUT: 2

I renamed your list to lst in order to not override the builtin name list.
>>> lst = ['A', 'A', 'B', 'A', 'B', 'A', 'B']
>>> string = ''.join(lst)
>>> len(string) - len(string.lstrip('A'))
2

complete list if the first and last element is equal

I have a problem trying to transform a list.
The original list is like this:
[['a','b','c',''],['c','e','f'],['c','g','h']]
now I want to make the output like this:
[['a','b','c','e','f'],['a','b','c','g','h']]
When a blank ( '' ) is found at the end of a list, the three lists should be merged into two.
I need to write a function to do this for me.
Here is what I tried:
for x in mylist:
    if x[len(x) - 1] == '':
        m = x[len(x) - 2]
        for y in mylist:
            if y[0] == m:
                combine(x, y)
def combine(x, y):
    for m in y:
        if not m in x:
            x.append(m)
    return(x)
but it's not working the way I want.

Try this:
mylist = [['a','b','c',''],['c','e','f'],['c','g','h']]
def combine(x, y):
    for m in y:
        if not m in x:
            x.append(m)
    return(x)
result = []
for x in mylist:
    if x[len(x) - 1] == '':
        m = x[len(x) - 2]
        for y in mylist:
            if y[0] == m:
                result.append(combine(x[0:len(x)-2], y))
print(result)
Your problem was with the call to combine. It needs to be
combine(x[0:len(x)-2], y)
so that the trailing '' (and the linking element, which y supplies again) is dropped, and the returned list has to be stored somewhere.
Output:
[['a', 'b', 'c', 'e', 'f'], ['a', 'b', 'c', 'g', 'h']]

So you basically want to merge two lists? If so, there are two ways: use the + operator, or use the extend() method.
Then put that into a function.
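A short sketch of the two options, using made-up names (head, tail, merged). Note that neither of them removes the '' placeholder or the duplicated linking element from the question, so this covers only the merging step.

```python
head = ['a', 'b', 'c']
tail = ['e', 'f']

merged = head + tail   # builds a new list; head and tail stay unchanged
print(merged)          # ['a', 'b', 'c', 'e', 'f']

head.extend(tail)      # appends tail's items to head in place and returns None
print(head)            # ['a', 'b', 'c', 'e', 'f']
```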

I wrote this using only the standard library and added comments; please take a look.
mylist = [['a','b','c',''],['c','e','f'],['c','g','h']]
# I can't be sure whether there is only one list ending in '',
# so this finds all of them.
# It also shows how to get the last value of a list with [-1].
xlist = [x for x in mylist if x[-1] == '']
ylist = [x for x in mylist if x[-1] != '']
result = []
# combine every x with every y
for x in xlist:
    for y in ylist:
        c = x + y # merge
        c = [i for i in c if i] # drop ''
        c = list(set(c)) # drop duplicates
        c.sort() # sort
        result.append(c) # add to result
print (result)
The result is
[['a', 'b', 'c', 'e', 'f'], ['a', 'b', 'c', 'g', 'h']]

Your code almost works, except you never do anything with the result of combine (print it, or add it to some result list), and you do not remove the '' element. However, for a longer list, this might be a bit slow, as it has quadratic complexity O(n²).
Instead, you can use a dictionary to map first elements to the remaining elements of the lists. Then you can use a loop or list comprehension to combine the lists with the right suffixes:
lst = [['a','b','c',''],['c','e','f'],['c','g','h']]
import collections
replacements = collections.defaultdict(list)
for first, *rest in lst:
replacements[first].append(rest)
result = [l[:-2] + c for l in lst if l[-1] == "" for c in replacements[l[-2]]]
# [['a', 'b', 'c', 'e', 'f'], ['a', 'b', 'c', 'g', 'h']]
If the list can have more than one placeholder '', and if those can appear in the middle of the list, then things get a bit more complicated. You could make this a recursive function. (This could be made more efficient by using an index instead of repeatedly slicing the list.)
def replace(lst, last=None):
    if lst:
        first, *rest = lst
        if first == "":
            for repl in replacements[last]:
                yield from replace(repl + rest)
        else:
            for res in replace(rest, first):
                yield [first] + res
    else:
        yield []
for l in lst:
    for x in replace(l):
        print(x)
Output for lst = [['a','b','c','','b',''],['c','b','','e','f'],['c','g','b',''],['b','x','y']]:
['a', 'b', 'c', 'b', 'x', 'y', 'e', 'f', 'b', 'x', 'y']
['a', 'b', 'c', 'g', 'b', 'x', 'y', 'b', 'x', 'y']
['c', 'b', 'x', 'y', 'e', 'f']
['c', 'g', 'b', 'x', 'y']
['b', 'x', 'y']

Try my solution. It will change the order of the list, but the code is quite simple:
lst = [['a', 'b', 'c', ''], ['c', 'e', 'f'], ['c', 'g', 'h']]
lst[0].pop(-1)
print([list(set(lst[0]+lst[1])), list(set(lst[0]+lst[2]))])

Find all substrings in a string in python 3 with brute-force

I want to find all substrings from 'A' to 'B' in L = ['C', 'A', 'B', 'A', 'A', 'X', 'B', 'Y', 'A'] with brute force. This is what I've done:
def find_substring(L):
    t = 0
    s = []
    for i in range(len(L) - 1):
        l = []
        if ord(L[i]) == 65:
            for j in range(i, len(L)):
                l.append(L[j])
                if ord(L[j]) == 66:
                    t = t + 1
                    s.append(l)
    return s, t
Now I want the output:
[['A','B'], ['A','B','A','A','X','B'], ['A','A','X','B'], ['A','X','B']]
But I get:
[['A','B','A','A','X','B','Y','A'],['A','B','A','A','X','B','Y','A'],['A','A','X','B','Y','A'],['A','X','B','Y','A']]
Can someone tell me what I'm doing wrong?

The problem is that the list s holds references to the l lists.
So even though you are appending the correct l lists to s, they are changed after being appended, because later iterations of the j loop keep modifying them.
You can fix this by appending a copy of the l list: l[:].
Also, you can compare strings directly; there is no need to convert to ASCII.
def find_substring(L):
    s = []
    for i in range(len(L) - 1):
        l = []
        if L[i] == 'A':
            for j in range(i, len(L)):
                l.append(L[j])
                if L[j] == 'B':
                    s.append(l[:])
    return s
which now works:
>>> find_substring(['C', 'A', 'B', 'A', 'A', 'X', 'B', 'Y', 'A'])
[['A', 'B'], ['A', 'B', 'A', 'A', 'X', 'B'], ['A', 'A', 'X', 'B'], ['A', 'X', 'B']]

When you append l to s, you are adding a reference to a list which you then continue to grow. You want to append a copy of the l list's contents at the time when you append, to keep it static.
s.append(l[:])
This is a common FAQ; this question should probably be closed as a duplicate.

You would be better first finding all indices of 'A' and 'B', then iterating over those, avoiding brute force.
def find_substrings(lst):
    idx_A = [i for i, c in enumerate(lst) if c == 'A']
    idx_B = [i for i, c in enumerate(lst) if c == 'B']
    return [lst[i:j+1] for i in idx_A for j in idx_B if j > i]

You can also rebind l to a copy of itself right after it is appended: put l = l[:] directly after the s.append(l) call.
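A minimal sketch of that variant, assuming the hypothetical function name find_substring_rebind. After the append, l is rebound to a fresh copy, so later appends no longer touch the list already stored in s.

```python
def find_substring_rebind(L):
    # Hypothetical name; same brute-force search as in the question.
    s = []
    for i in range(len(L)):
        l = []
        if L[i] == 'A':
            for j in range(i, len(L)):
                l.append(L[j])
                if L[j] == 'B':
                    s.append(l)   # store the current list ...
                    l = l[:]      # ... and keep growing a copy instead
    return s


print(find_substring_rebind(['C', 'A', 'B', 'A', 'A', 'X', 'B', 'Y', 'A']))
# [['A', 'B'], ['A', 'B', 'A', 'A', 'X', 'B'], ['A', 'A', 'X', 'B'], ['A', 'X', 'B']]
```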

So, you want all the substrings that start with 'A' and end with 'B'?
When you use @Joeidden's code, you can change for i in range(len(L) - 1): to for i in range(len(L)):, because only sublists that end with 'B' are appended to s.
def find_substring(L):
    s = []
    for i in range(len(L)):
        l = []
        if L[i] == 'A':
            for j in range(i, len(L)):
                l.append(L[j])
                if L[j] == 'B':
                    s.append(l[:])
    return s

Another slightly different approach would be this:
L = ['C', 'A', 'B', 'A', 'A', 'X', 'B', 'Y', 'A']
def find_substring(L):
    output = []
    # Start searching for A.
    for i in range(len(L)):
        # If you found one start searching all B's until you reach the end.
        if L[i]=='A':
            for j in range(i,len(L),1):
                # If you found a B, append the sublist from i index to j+1 index (positions of A and B respectively).
                if L[j]=='B':
                    output.append(L[i:j+1])
    return output
result = find_substring(L)
print(result)
Output:
[['A', 'B'], ['A', 'B', 'A', 'A', 'X', 'B'], ['A', 'A', 'X', 'B'], ['A', 'X', 'B']]
In case you need a list comprehension of the above:
def find_substring(L):
    output = [L[i:j+1] for i in range(len(L)) for j in range(i,len(L),1) if L[i]=='A' and L[j]=='B']
    return output

Python: Keep first occurrence of an item in a list

How can I remove all occurrences of a specific value in a list except for the first occurrence?
E.g. I have a list:
letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a', 'c']
And I need a function that looks something like this:
preserve_first(letters, 'c')
And returns this:
['a', 'b', 'c', 'd', 'a', 'a']
Removing all but the first occurrence of the given value while otherwise preserving the order. If there is a way to do this with a pandas.Series that would be even better.

You want to remove duplicates of 'c' only. So you want to filter where the series is either not duplicated at all or it isn't equal to 'c'. I like to use pd.Series.ne in place of pd.Series != because the reduction in wrapping parenthesis adds to readability (my opinion).
import pandas as pd
s = pd.Series(letters)
s[s.ne('c') | ~s.duplicated()]
0 a
1 b
2 c
5 d
7 a
8 a
dtype: object
To do exactly what was asked for.
def preserve_first(letters, letter):
    s = pd.Series(letters)
    return s[s.ne(letter) | ~s.duplicated()].tolist()
preserve_first(letters, 'c')
['a', 'b', 'c', 'd', 'a', 'a']

A general Python solution:
def keep_first(iterable, value):
    it = iter(iterable)
    for val in it:
        yield val
        if val == value:
            yield from (el for el in it if el != value)
This yields all items up to and including the first value if found, then yields the rest of the iterable filtering out items matching the value.
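A brief usage sketch for the generator described above. The function is repeated only so the example runs on its own, and the second call (a string iterator with an absent value) is an added illustration, not part of the original answer.

```python
def keep_first(iterable, value):
    # Same generator as in the answer, repeated to keep this example self-contained.
    it = iter(iterable)
    for val in it:
        yield val
        if val == value:
            # Drain the rest of the shared iterator, dropping further matches.
            yield from (el for el in it if el != value)


letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a', 'c']
print(list(keep_first(letters, 'c')))   # ['a', 'b', 'c', 'd', 'a', 'a']

# Any iterable works, and nothing is removed when the value never occurs.
print(list(keep_first(iter('abcabc'), 'z')))
```

Because the result is a generator, wrap it in list() when a list is needed.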

You can try this using generators:
def conserve_first(l, s):
    last_seen = False
    for i in l:
        if i == s and not last_seen:
            last_seen = True
            yield i
        elif i != s:
            yield i
letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a', 'c']
print(list(conserve_first(letters, "c")))
Output:
['a', 'b', 'c', 'd', 'a', 'a']

Late to the party, but
letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a', 'c']
def preserve_first(data, letter):
    new = []
    count = 0
    for i in data:
        if i not in new:
            if i == letter and count == 0:
                new.append(i)
                count+=1
            elif i == letter and count == 1:
                continue
            else:
                new.append(i)
        else:
            if i == letter and count == 1:
                continue
            else:
                new.append(i)
    return new  # without this the function returns None
l = preserve_first(letters, "c")

You can use a list filter and slices:
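def preserve_first(letters, elem):
    if elem not in letters:
        return letters[:]  # nothing to remove
    index = letters.index(elem)
    # compare against elem, not a hard-coded 'c'; list() is needed on Python 3
    return letters[:index + 1] + list(filter(lambda a: a != elem, letters[index + 1:]))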

Doesn't use pandas but this is a simple algorithm to do the job.
def preserve_firsts(letters, target):
    firsts = []
    seen = False
    for letter in letters:
        if letter == target:
            if not seen:
                firsts.append(letter)
                seen = True
        else:
            firsts.append(letter)
    return firsts
> letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a']
> preserve_firsts(letters, 'c')
['a', 'b', 'c', 'd', 'a', 'a']

Simplest solution I could come up with.
letters = ['a', 'b', 'c', 'c', 'c', 'd', 'c', 'a', 'a', 'c']
key = 'c'
def preserve_first(letters, key):
    first_occurrence = letters.index(key)
    return [item for i, item in enumerate(letters) if i == first_occurrence or item != key]
