Python - Create set from string - python

I have strings in the format "1-3 6:10-11 7-9" and from them I want to create number sets as follows {1,2,3,6,10,11,7,8,9}.
For creating the set from the range of numbers, I have the following code:
def create_set(src):
    lset = []
    if len(src) > 0:
        pos = src.find('-')
        if pos != -1:
            first = int(src[:pos])
            last = int(src[pos+1:])
        else:
            return [int(src)] # Only one number
        for j in range (first, last+1):
            lset.append(j)
    return set(lset)
But I cannot figure out how to correctly treat the ':' when it appears in the string. Can someone help me?
Thanks in advance!
EDIT: By the way, is there a more compact way of parsing such strings, perhaps using regular expressions?

Something like this might work for you:
s = '1-3 6:10-11 7-9'
s = s.replace(':', ' ')
lset = set()
fs = s.split()
for f in fs:
    r = f.split('-')
    if len(r)==1:
        # add a single number
        lset.add(int(r[0]))
    else:
        # add a range of numbers (inclusive of the endpoints)
        lset |= set(range(int(r[0]), int(r[1])+1))
print(lset)

EDIT: By the way, is there a more compact way of parsing such strings,
perhaps using regular expressions?
Perhaps a cleaner (and slightly more efficient) way:
import re
import itertools
allGroups = re.findall(r"(\d+)(?:-(\d+)|:)", s)
expanded = [range(int(x), (int(x) if y == '' else int(y)) + 1) for x, y in allGroups]
print {x for x in itertools.chain.from_iterable(expanded)}
Explanations:
Match all strings like 'a-b' or 'a:' and return a list of (a, b) and (a, '') pairs respectively:
allGroups = re.findall(r"(\d+)(?:-(\d+)|:)", s)
This produces:
[('1', '3'), ('6', ''), ('10', '11'), ('7', '9')]
Using list comprehension expand all pairs of (x, y) into the full list of numbers in the range (x, y + 1), taking care to handle the (x, '') case as (x, x+1):
expanded = [range(int(x), (int(x) if y == '' else int(y)) + 1) for x, y in allGroups]
This produces:
[[1, 2, 3], [6], [10, 11], [7, 8, 9]]
Use itertools.chain.from_iterable() to transform the list of lists into a single iterable which is iterated by a set comprehension into the final set:
print {x for x in itertools.chain.from_iterable(expanded)}
This produces:
set([1, 2, 3, 6, 7, 8, 9, 10, 11])
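One caveat about the regular expression above: it only matches a number that is followed by '-' or ':', so a lone number such as the '5' in '1-3 5' would be skipped. A minimal Python 3 sketch that avoids this, assuming tokens are separated by whitespace or ':' (the function name parse_number_set is made up for this illustration), could look like this:

```python
import re

def parse_number_set(text):
    """Expand tokens such as '1-3' or '6' (separated by spaces or ':') into a set of ints."""
    numbers = set()
    # Split on runs of whitespace and colons; skip empty tokens.
    for token in re.split(r"[\s:]+", text.strip()):
        if not token:
            continue
        # 'a-b' gives first='a', last='b'; a lone 'a' gives last=''.
        first, _, last = token.partition("-")
        start = int(first)
        end = int(last) if last else start
        numbers.update(range(start, end + 1))
    return numbers

print(sorted(parse_number_set("1-3 6:10-11 7-9")))
# [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

Malformed tokens such as 'a-b' raise ValueError from int(), which may or may not be what you want.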

Related

How to merge an array with its array elements in Python?

I have an array like below;
constants = ['(1,2)', '(1,5,1)', '1']
I would like to transform the array into like below;
constants = [(1,2), 1, 2, 3, 4, 5, 1]
For doing this, i tried some operations;
from ast import literal_eval
import numpy as np
constants = literal_eval(str(constants).replace("'",""))
constants = [(np.arange(*i) if len(i)==3 else i) if isinstance(i, tuple) else i for i in constants]
And the output was;
constants = [(1, 2), array([1, 2, 3, 4]), 1]
So this is not the expected result, and I'm stuck at this step. The question is: how can I merge the generated array into its parent array?
This is one approach.
Demo:
from ast import literal_eval
constants = ['(1,2)', '(1,5,1)', '1']
res = []
for i in constants:
    val = literal_eval(i) #Convert to python object
    if isinstance(val, tuple): #Check if element is tuple
        if len(val) == 3: #Check if no of elements in tuple == 3
            val = list(val)
            val[1]+=1
            res.extend(range(*val))
            continue
    res.append(val)
print(res)
Output:
[(1, 2), 1, 2, 3, 4, 5, 1]
I'm going to assume that this question is very literal, and that you always want to transform this:
constants = ['(a, b)', '(x, y, z)', 'i']
into this:
transformed = [(a,b), x, x+z, x+2*z, ..., y, i]
such that the second tuple is a range from x to y with step z. So your final transformed array is the first element, then the range defined by your second element, and then your last element. The easiest way to do this is simply step-by-step:
constants = ['(a, b)', '(x, y, z)', 'i']
literals = [eval(k) for k in constants] # get rid of the strings
part1 = [literals[0]] # individually make each of the three parts of your list
part2 = [k for k in range(literals[1][0], literals[1][1] + 1, literals[1][2])] # or, if you don't need to include y, you could just do range(*literals[1])
part3 = [literals[2]]
transformed = part1 + part2 + part3
I propose the following:
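from ast import literal_eval
res = []
for cst in map(literal_eval, constants):  # parse the strings first
    if isinstance(cst, tuple) and (len(cst) == 3):
        #add the range to the list, including the end value
        res.extend(range(cst[0], cst[1] + 1, cst[2]))
    else:
        res.append(cst)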
res has the result you want.
There may be a more elegant way to solve it.
Please use code below to resolve parsing described above:
from ast import literal_eval
constants = ['(1,2)', '(1,5,1)', '1']
processed = []
for index, c in enumerate(constants):
    parsed = literal_eval(c)
    if isinstance(parsed, (tuple, list)) and index != 0:
        processed.extend(range(1, max(parsed) + 1))
    else:
        processed.append(parsed)
print(processed)  # [(1, 2), 1, 2, 3, 4, 5, 1]
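The answers above differ mainly in how they decide which tuple is a range. A small sketch that makes the rule explicit, assuming every 3-tuple means (start, stop, step) with an inclusive stop and everything else is kept as-is (expand_constants is a name invented here, and the rule itself is an assumption drawn from the question's example):

```python
from ast import literal_eval

def expand_constants(constants):
    """Parse each string; replace (start, stop, step) triples by an inclusive range."""
    result = []
    for text in constants:
        value = literal_eval(text)
        if isinstance(value, tuple) and len(value) == 3:
            start, stop, step = value
            # stop is treated as inclusive, matching the expected output in the question
            result.extend(range(start, stop + 1, step))
        else:
            result.append(value)
    return result

print(expand_constants(['(1,2)', '(1,5,1)', '1']))
# [(1, 2), 1, 2, 3, 4, 5, 1]
```

Note that stop + 1 only makes the end inclusive for positive steps; a negative step would need stop - 1 instead.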

Don't understand Python Expression

I have some basic knowledge of Python, but I have no idea what's going on in the code below. Can someone help explain it, or 'translate' it into a more normal/common expression?
steps = len(t)
sa = [i for i in range(steps)]
sa.sort(key = lambda i: t[i:i + steps])#I know that sa is a list
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
't' is actually a string
Thank you.
What I don't understand is the code: sa.sort(key = lambda i: t[i:i + steps])
sa.sort(key = lambda i: t[i:i + steps])
It sorts sa according to the natural ordering of the substrings t[i:i+len(t)]. Since i + steps is always greater than or equal to steps (which is len(t)), the slice always runs to the end of the string, so the key could be written t[i:] instead, which makes the code simpler to understand.
You will understand it better using the decorate/sort/undecorate pattern:
>>> t = "azerty"
>>> sa = range(len(t))
>>> print sa
[0, 1, 2, 3, 4, 5]
>>> decorated = [(t[i:], i) for i in sa]
>>> print decorated
[('azerty', 0), ('zerty', 1), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5)]
>>> decorated.sort()
>>> print decorated
[('azerty', 0), ('erty', 2), ('rty', 3), ('ty', 4), ('y', 5), ('zerty', 1)]
>>> sa = [i for (_dummy, i) in decorated]
>>> print sa
[0, 2, 3, 4, 5, 1]
and sf = t[sa[i] : sa[i] + steps]
This could also be written more simply:
for i in range(len(sa)):
    sf = t[sa[i] : sa[i] + steps]
=>
for x in sa:
    sf = t[x:]
    print sf
which yields:
azerty
erty
rty
ty
y
zerty
You'll notice that this is exactly the keys used (and then discarded)
in the decorate/sort/undecorate example above, so the whole thing could be rewritten as:
def foo(t):
    decorated = sorted((t[i:], i) for i in range(len(t)))
    for sf, index in decorated:
        print sf
        # do something with sf here
As to what all this is supposed to do, I'm totally at a loss, but at least you now have a much more pythonic (readable...) version of this code ;)
The lambda in sort defines the criteria according to which the list is going to be sorted.
In other words, the list will not be sorted simply according to its values, but according to the function applied to the values.
Have a look here for more details.
It looks like what you are doing is sorting the list according to the alphabetical ordering of the substrings of the input string t.
Here is what is happening:
t = 'hello' # EXAMPLE
steps = len(t)
sa = [i for i in range(steps)]
sort_func = lambda i: t[i:i + steps]
for el in sa:
    print sort_func(el)
#hello
#ello
#llo
#lo
#o
So these are the values that determine the sorting of the list.
transf_list = [sort_func(el) for el in sa]
sorted(transf_list)
# ['ello', 'hello', 'llo', 'lo', 'o']
Hence:
sa.sort(key = sort_func)#I know that sa is a list
# [1, 0, 2, 3, 4]
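Putting both answers together: the list of start indices ordered by their suffixes is what is usually called a suffix array, which the variable name sa hints at, although the question does not say so. A compact Python 3 sketch (the function name suffix_array is chosen here for illustration):

```python
def suffix_array(t):
    """Return the start indices of the suffixes of t, ordered by the suffix text."""
    # t[i:] is the suffix starting at index i; it is used only as the sort key.
    return sorted(range(len(t)), key=lambda i: t[i:])

t = "azerty"
sa = suffix_array(t)
print(sa)                   # [0, 2, 3, 4, 5, 1]
print([t[i:] for i in sa])  # ['azerty', 'erty', 'rty', 'ty', 'y', 'zerty']
```

This builds every suffix as a separate string, so it is fine for short inputs but gets slow and memory-hungry for long ones.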

How to elegantly transform '1-3,6-8' to '1 2 3 6 7 8' within a list?

Problem
Background:
I have a list of ~10,000 lists containing irregular data which needs to be transformed to a specific format. This data will be ingested into a pandas dataframe after transformation.
TL/DR; How to elegantly transform matched strings of the following regex in a list?
Regex
'\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}'
Example:
'1-3,6-8' to '1 2 3 6 7 8'
Current Solution:
Using list comprehensions required multiple type casts to transform the string and is unfit to be a lasting solution.
import re
pat = re.compile(r'\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}')
row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7','234','1,5']
lst = [((str(list(range(int(x.split(',')[0].split('-')[0]),
                        int(x.split(',')[0].split('-')[1])+1))).strip('[]').replace(',', '')+' '
        +str(list(range(int(x.split(',')[1].split('-')[0]),
                        int(x.split(',')[1].split('-')[1]) + 1))).strip('[]').replace(',', '')))
       if pat.match(str(x)) else x for x in row]
Result
['sss-www,ddd-eee', '1 2 3 6 7 8', 'XXXX', '0 1 2 3 4 5 6 7', '234', '1,5']
It's easier if you capture the groups.
You then convert the captured groups to integers and process them two at a time in a generator expression, chaining the resulting ranges with itertools.chain:
import re,itertools
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
z='1-3,6-8'
groups = [int(x) for x in pat.match(z).groups()]
print(list(itertools.chain(*(list(range(groups[i],groups[i+1]+1)) for i in range(0,len(groups),2)))))
result:
[1, 2, 3, 6, 7, 8]
Not sure you'd call that "elegant", though. It remains complicated, mostly because range and itertools.chain return lazy objects that need converting to a list explicitly.
Several ways to do this, here is mine:
import re
txt = '1-3,6-8'
# Safer to use a raw string
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
m = pat.match(txt)
if m:
    start1, end1, start2, end2 = m.groups()
    result = [i for i in range(int(start1), int(end1)+1)]
    result += [i for i in range(int(start2), int(end2)+1)]
    print(result)
Gives:
[1, 2, 3, 6, 7, 8]
I'm assuming Python 3 here (as stated in the question).
Python 2 could use:
result = range(int(start1), int(end1)+1)
result += range(int(start2), int(end2)+1)
I assume that you want to handle longer sequences as well, like 1-10,15,23-25? You don't really need regular expressions for this, regular string processing functions will work well.
def parse_sequence(seq):
    result = []
    for part in seq.split(','):
        points = [int(s) for s in part.split('-')]
        if len(points) == 2:
            result.extend(range(points[0], points[1]+1))
        elif len(points) == 1:
            result.append(points[0])
        else:
            raise ValueError('invalid sequence')
    return result
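To tie this back to the original list of mixed cells, here is a sketch that applies the same split-based parsing to each cell and leaves anything non-numeric untouched. The helper name expand_ranges is an assumption for this example, and so is the choice to return space-separated strings as in the question.

```python
def expand_ranges(cell):
    """Expand 'a-b,c-d,...' to space-separated integers, or return the cell unchanged."""
    numbers = []
    try:
        for part in cell.split(','):
            points = [int(p) for p in part.split('-')]
            if len(points) == 2:
                numbers.extend(range(points[0], points[1] + 1))
            elif len(points) == 1:
                numbers.append(points[0])
            else:
                raise ValueError('invalid sequence')
    except ValueError:
        return cell  # not a numeric sequence, leave it untouched
    return ' '.join(str(n) for n in numbers)

row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7', '234', '1,5']
print([expand_ranges(x) for x in row])
# ['sss-www,ddd-eee', '1 2 3 6 7 8', 'XXXX', '0 1 2 3 4 5 6 7', '234', '1 5']
```

One difference from the regex approach in the question: '1,5' is also rewritten (to '1 5'), because it is a valid sequence of single numbers. If such cells must stay unchanged, an extra check for '-' in the cell would be needed.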
Here is my solution:
import re
from itertools import chain
s = '1-3, 6 - 8, 12-14, 20 -22'
rslt = list(chain(*[range(int(tup[0]), int(tup[1]) + 1)
for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]))
Output:
In [43]: rslt
Out[43]: [1, 2, 3, 6, 7, 8, 12, 13, 14, 20, 21, 22]
Step by step:
In [44]: re.findall(r'(\d+)\s*?-\s*?(\d+)', s)
Out[44]: [('1', '3'), ('6', '8'), ('12', '14'), ('20', '22')]
In [45]: [range(int(tup[0]),int(tup[1])+1) for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]
Out[45]: [range(1, 4), range(6, 9), range(12, 15), range(20, 23)]
Depends on exactly what data you're expecting to see. In general the best way to do this is going to be to write a function that parses the string in chunks. Something like:
import re

def parse(string):
    chunks = string.split(',')
    for chunk in chunks:
        match = re.match(r'(\d+)-(\d+)', chunk)
        if match:
            start = int(match.group(1))
            end = int(match.group(2))
            # yield each number of the inclusive range in turn
            for n in range(start, end + 1):
                yield n
        else:
            yield int(chunk)
s_tmp = s.split(",")
[n for x in s_tmp for n in range(int(x.split("-")[0]), int(x.split("-")[1]) + 1)]
Apologies if there are syntax errors; I'm typing this from my phone. Basically: split by ",", then split by "-", then unpack the entries from range.

Duplicates counting with order preserving in Python lists

suppose the list
[7,7,7,7,3,1,5,5,1,4]
I would like to remove duplicates and get them counted while preserving the order of the list. To preserve the order of the list while removing duplicates I use the function
def unique(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
which gives me the output
[7,3,1,5,1,4]
but the output I would like (if such a final list can exist) is:
[7,3,3,1,5,2,4]
7 is written because it is the first item in the list; then each following item is checked to see whether it differs from the previous one. If it does, count the occurrences of that item until a new one is found, then repeat the procedure. Could anyone more skilled than me give me a hint on how to get the desired output listed above? Thank you in advance.
Perhaps something like this?
>>> from itertools import groupby
>>> seen = set()
>>> out = []
>>> for k, g in groupby(lst):
...     if k not in seen:
...         length = sum(1 for _ in g)
...         if length > 1:
...             out.extend([k, length])
...         else:
...             out.append(k)
...         seen.add(k)
...
>>> out
[7, 4, 3, 1, 5, 2, 4]
Update:
As per your comment I guess you wanted something like this:
>>> out = []
>>> for k, g in groupby(lst):
...     length = sum(1 for _ in g)
...     if length > 1:
...         out.extend([k, length])
...     else:
...         out.append(k)
...
>>> out
[7, 4, 3, 1, 5, 2, 1, 4]
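Mixing values and counts in one flat list makes the result ambiguous (is a 4 a value or a count?). If the format is not fixed, a sketch that keeps each run as an explicit (value, count) pair may be easier to work with; run_lengths is a name chosen here, not something from the question:

```python
from itertools import groupby

def run_lengths(items):
    """Return (value, count) pairs for each run of equal adjacent items."""
    # groupby groups only adjacent equal items, so order is preserved.
    return [(value, sum(1 for _ in group)) for value, group in groupby(items)]

lst = [7, 7, 7, 7, 3, 1, 5, 5, 1, 4]
print(run_lengths(lst))
# [(7, 4), (3, 1), (1, 1), (5, 2), (1, 1), (4, 1)]
```

Because only adjacent items are grouped, the two separate 1s appear as two runs rather than one entry with count 2.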
Try this
import collections as c
lst = [7,7,7,7,3,1,5,5,1,4]
result = c.OrderedDict()
for el in lst:
    if el not in result.keys():
        result[el] = 1
    else:
        result[el] = result[el] + 1
print result
prints out: OrderedDict([(7, 4), (3, 1), (1, 2), (5, 2), (4, 1)])
It gives a dictionary though. For a list, use:
lstresult = []
for el in result:
    # print k, v
    lstresult.append(el)
    if result[el] > 1:
        lstresult.append(result[el] - 1)
It doesn't match your desired output, but your desired output also seems like a mangling of what it is trying to represent.

find the "overlap" between 2 python lists

Given 2 lists:
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
I want to find the "overlap":
c = [3,4,5,5,6]
I'd also like to extract the "remainder": the parts of a and b that are not in c.
a_remainder = [5,]
b_remainder = [1,4,7,]
Note:
a has three 5's in it and b has two.
b has two 4's in it and a has one.
The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).
This gives me what i want, but I can't help but think there's a much better way.
import copy
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))
# now a and b both contain the "remainders" and c contains the "overlap"
On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?
collections.Counter, available since Python 2.7, can be used to implement multisets that do exactly what you want.
import collections
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)
overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())
print overlap, a_remainder, b_remainder
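For readers on Python 3, the same multiset idea can be wrapped in a small function. This is only a sketch: multiset_split is a made-up name, and the results are sorted purely to make the output deterministic.

```python
from collections import Counter

def multiset_split(a, b):
    """Return (overlap, a_remainder, b_remainder), treating a and b as multisets."""
    a_count, b_count = Counter(a), Counter(b)
    # & keeps the minimum count per element; - drops counts that reach zero or below.
    overlap = sorted((a_count & b_count).elements())
    a_rest = sorted((a_count - b_count).elements())
    b_rest = sorted((b_count - a_count).elements())
    return overlap, a_rest, b_rest

print(multiset_split([3, 4, 5, 5, 5, 6], [1, 3, 4, 4, 5, 5, 6, 7]))
# ([3, 4, 5, 5, 6], [5], [1, 4, 7])
```

Counter requires hashable elements, and the original order of a and b is not preserved.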
Use python set
intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)
In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If you had distinct items, you wouldn't have to do these operations yourself, check out http://docs.python.org/library/sets.html if you're interested.
Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary for each list that maps each number to how many times it occurs in that list (e.g., in a, 3->1, 4->1, 5->3, etc.). You would then iterate through a, check whether each number still has a positive count in b's dictionary, and if so decrement that count and add the number to the new list.
Untested code, but this is the idea
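def add_or_update(counts, value):
    # count how many times each value occurs
    if value in counts:
        counts[value] += 1
    else:
        counts[value] = 1

b_dict = dict()
for b_elem in b:
    add_or_update(b_dict, b_elem)

intersect = []
diff = []
for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem] > 0:
        b_dict[a_elem] -= 1  # consume one matching element of b
        intersect.append(a_elem)
    else:
        diff.append(a_elem)  # remainder of a

# whatever is still counted in b_dict is the remainder of b
b_diff = []
for k, v in b_dict.items():
    for i in range(v):
        b_diff.append(k)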
OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):
import itertools as it
flatten = it.chain.from_iterable
sorted(
v for u,v in
set(flatten(enumerate(g)
for k, g in it.groupby(a))).intersection(
set(flatten(enumerate(g)
for k, g in it.groupby(b))))
)
The basic idea is to make each of the lists into a new list which attaches a counter to each object, numbered to account for duplicates -- so that then you can then use set operations on these tuples after all.
To be slightly less verbose:
aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
# aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
# bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])
cc = aa.intersection(bb)
# cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
c = sorted(v for u,v in cc)
# c = [3, 4, 5, 5, 6]
groupby -- produces a list of lists containing identical elements
(but because of the syntax needs the g for k,g in it.groupby(a) to extract each list)
enumerate -- appends a counter to each element of each sublist
flatten -- create a single list
set -- convert to a set
intersection -- find the common elements
sorted(v for u,v in cc) -- get rid of the counters and sort the result
Finally, I'm not sure what you mean by the remainders; it seems like it ought to be my aa-cc and bb-cc but I don't know where you get a_remainder = [4]:
sorted(v for u,v in aa-cc)
# [5]
sorted(v for u,v in bb-cc)
# [1, 4, 7]
A response from kerio in #python on freenode:
[ i for i in itertools.chain.from_iterable([k] * v for k, v in \
(Counter(a) & Counter(b)).iteritems())
]
Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...
A quick try:
import difflib
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])
And now...
>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]
Aset = set(a)
Bset = set(b)
a_remainder = Aset.difference(Bset)
b_remainder = Bset.difference(Aset)
c = Aset.intersection(Bset)
But if you need c to have duplicates, and order is important for you,
you may want to look at the longest common subsequence problem.
I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y)) # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y)) # [5]
b_remainder = get_result(lambda x, y: y.difference(x)) # [1, 7, 4]
I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.
Here are some of the intermediate values used in the calculation in case anyone wants to understand this:
dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]
