Python package for converting finite regex to a text array? [duplicate]

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string, e.g. STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING_C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
    parenBegin = trend.index('(') + 1
    parenEnd = trend.index(')')
    orExpression = trend[parenBegin:parenEnd]
    originalTrend = trend[0:parenBegin - 1]
    expandedOrList = []
    for oe in orExpression.split("|"):
        expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?

Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
    import re
    from itertools import product
    pat = r"\(([^)]*)\)"
    pieces = re.split(pat, s)
    pieces = [piece.split("|") for piece in pieces]
    for p in product(*pieces):
        yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
          '(|A_)STRING',
          'STRING_(A|B)_STRING_(C|D)'):
    print(s, "->")
    for t in expander(s):
        print("  ", t)
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
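To see how expander works, it can help to look at the intermediate values. Here is a minimal sketch using the same pattern; the names options and expanded are only for illustration. Because the pattern contains a capturing group, re.split keeps the text inside each pair of parentheses as its own list item:

```python
import re
from itertools import product

s = 'STRING_(A|B)_STRING_(C|D)'

# The capturing group makes re.split return the group contents
# in between the literal pieces.
pieces = re.split(r"\(([^)]*)\)", s)
# pieces == ['STRING_', 'A|B', '_STRING_', 'C|D', '']

# Literal pieces become one-element lists, groups become their alternatives.
options = [piece.split("|") for piece in pieces]
# options == [['STRING_'], ['A', 'B'], ['_STRING_'], ['C', 'D'], ['']]

# The Cartesian product picks one alternative per position.
expanded = ["".join(p) for p in product(*options)]
print(expanded)
```

One caveat of this approach: a literal "|" outside parentheses would also be split, so it assumes alternation only appears inside groups.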

import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']

I would do this to extract the groups:
def extract_groups(trend):
    l_parens = [i for i, c in enumerate(trend) if c == '(']
    r_parens = [i for i, c in enumerate(trend) if c == ')']
    assert len(l_parens) == len(r_parens)
    return [trend[l+1:r].split('|') for l, r in zip(l_parens, r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
# Python 3.3+
import re

def _gen(it):
    yield from it

p = re.compile(r'\(.*?\)')
for tup in product(*extract_groups(expr)):
    gen = _gen(tup)
    # each match is replaced by the next item of the current combination
    print(p.sub(lambda x: next(gen), expr))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
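One possibly more readable variant, as a sketch only: a plain iter() over each combination does the same job as the _gen helper, and everything can live in one function. The name expand is made up for this example, and it assumes groups are not nested.

```python
import re
from itertools import product

def expand(expr):
    """Return every string matched by expr (flat, non-nested groups only)."""
    # Alternatives inside each (...) group, in order of appearance.
    groups = [g.split('|') for g in re.findall(r'\(([^)]*)\)', expr)]
    results = []
    for combo in product(*groups):
        choices = iter(combo)
        # re.sub calls the function once per group, left to right,
        # so next(choices) hands out the picks in the right order.
        results.append(re.sub(r'\([^)]*\)', lambda m: next(choices), expr))
    return results

print(expand('STRING_(A|B)_STRING_(C|D)'))
```

It also covers the empty alternative: expand('(|A_)STRING') gives ['STRING', 'A_STRING'].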

This is easy to achieve with the sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.

Related

How to extract each word consecutive to its own previous number in a string and sorting the result in Python

Input : x3b4U5i2
Output : bbbbiiUUUUUxxx
How can I solve this problem in Python? I have to print each letter as many times as the number that follows it, and then sort the result.

It wasn't clear if multiple digit counts or groups of letters should be handled. Here's a solution that does all of that:
import re
def main(inp):
    parts = re.split(r"(\d+)", inp)
    parts_map = {parts[i]: int(parts[i+1]) for i in range(0, len(parts)-1, 2)}
    print(''.join([c*parts_map[c] for c in sorted(parts_map.keys(), key=str.lower)]))
main("x3b4U5i2")
main("x3brx4U5i2")
main("x23b4U35i2")
Result:
bbbbiiUUUUUxxx
brxbrxbrxbrxiiUUUUUxxx
bbbbiiUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUxxxxxxxxxxxxxxxxxxxxxxx

I'm assuming the formatting will always be <char><int> with <int> being in between 1 and 9...
input_ = "x3b4U5i2"
result_list = [input_[i]*int(input_[i+1]) for i in range(0, len(input_), 2)]
result_list.sort(key=str.lower)
result = ''.join(result_list)
There's probably a much more performance-oriented approach to solving this, it's just the first solution that came into my limited mind.
Edit
After the feedback in the comments I tried to improve performance by sorting first, but the following implementation is actually slower:
input_ = "x3b4U5i2"
def sort_first(value):
    return value[0].lower()
tuple_construct = [(input_[i], int(input_[i+1])) for i in range(0, len(input_), 2)]
tuple_construct.sort(key=sort_first)
result = ''.join([tc[0] * tc[1] for tc in tuple_construct])
Execution time for 100,000 iterations on it:
1) The execution time is: 0.353036
2) The execution time is: 0.4361724
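For anyone wanting to reproduce the comparison, a rough sketch using timeit might look like this. The function names expand_then_sort and sort_then_expand are made up here, the input is still assumed to be single letters followed by a single digit, and the timings will differ by machine:

```python
import timeit

input_ = "x3b4U5i2"

def expand_then_sort(text):
    # Approach 1: build the repeated strings first, then sort them.
    parts = [text[i] * int(text[i + 1]) for i in range(0, len(text), 2)]
    parts.sort(key=str.lower)
    return ''.join(parts)

def sort_then_expand(text):
    # Approach 2: sort (letter, count) tuples first, then expand.
    pairs = [(text[i], int(text[i + 1])) for i in range(0, len(text), 2)]
    pairs.sort(key=lambda pair: pair[0].lower())
    return ''.join(letter * count for letter, count in pairs)

if __name__ == "__main__":
    for fn in (expand_then_sort, sort_then_expand):
        elapsed = timeit.timeit(lambda: fn(input_), number=100_000)
        print(fn.__name__, elapsed)
```

Both functions return the same string, so any difference is purely in running time.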

One option, extract the character/digit(s) pairs with a regex, sort them by letter (ignoring case), multiply the letter by the number of repeats, join:
s = 'x3b4U5i2'
import re
out = ''.join([c*int(i) for c, i in
               sorted(re.findall(r'(\D)(\d+)', s),
                      key=lambda x: x[0].casefold())
               ])
print(out)
Output: bbbbiiUUUUUxxx
If you want to handle multiple characters, you can use r'(\D+)(\d+)' instead.

No list comprehensions or generator expressions in sight. Just using re.sub with a lambda to expand the run-length encoding, then sorting that, and then joining it back into a string.
import re
s = "x3b4U5i2"
''.join(sorted(re.sub(r"(\D+)(\d+)",
                      lambda m: m.group(1)*int(m.group(2)),
                      s),
               key=lambda x: x[0].casefold()))
# 'bbbbiiUUUUUxxx'
If we use re.findall to extract a list of pairs of strings and multipliers:
import re
s = 'x3b4U5i2'
pairs = re.findall(r"(\D+)(\d+)", s)
Then we can use some functional style to sort that list before expanding it.
from operator import itemgetter
def compose(f, g):
    return lambda x: f(g(x))
sorted(pairs, key=compose(str.lower, itemgetter(0)))
# [('b', '4'), ('i', '2'), ('U', '5'), ('x', '3')]
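The sorted pairs still need to be expanded. A minimal sketch of that last step, reusing compose from above (expand_sorted is a name invented for this example):

```python
import re
from operator import itemgetter

def compose(f, g):
    return lambda x: f(g(x))

def expand_sorted(s):
    # Extract (text, count) pairs, e.g. ('x', '3').
    pairs = re.findall(r"(\D+)(\d+)", s)
    # Sort case-insensitively on the text part only.
    ordered = sorted(pairs, key=compose(str.lower, itemgetter(0)))
    # Repeat each text count times and glue everything together.
    return ''.join(text * int(count) for text, count in ordered)

print(expand_sorted('x3b4U5i2'))
```

Sorting the short list of pairs before expanding means the sort works on a handful of tuples rather than on every output character.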

Regex: Split characters with "/"

I have these strings, for example:
['2300LO/LCE','2302KO/KCE']
I want to have output like this:
['2300LO','2300LCE','2302KO','2302KCE']
How can I do it with Regex in Python?
Thanks!

You can make a simple generator that yields the pairs for each string. Then you can flatten them into a single list with itertools.chain()
import re
from itertools import product, chain

def getCombos(s):
    nums, code = re.match(r'(\d+)(.*)', s).groups()
    for pair in product([nums], code.split("/")):
        yield ''.join(pair)
a = ['2300LO/LCE','2302KO/KCE']
list(chain.from_iterable(map(getCombos, a)))
# ['2300LO', '2300LCE', '2302KO', '2302KCE']
This has the added side benefit of working with strings like '2300LO/LCE/XX/CC' which will give you ['2300LO', '2300LCE', '2300XX', '2300CC',...]

You can try something like this:
import re

list1 = ['2300LO/LCE','2302KO/KCE']
list2 = []
for x in list1:
    a = x.split('/')
    tmp = re.findall(r'\d+', a[0])  # extract the leading digits
    list2.append(a[0])
    list2.append(tmp[0] + a[1])
print(list2)

This can be implemented with simple string splits.
Since you asked for a regex solution, here is one.
list1 = ['2300LO/LCE','2302KO/KCE']
import re
r = re.compile(r"([0-9]{1,4})([a-zA-Z].*)/([a-zA-Z].*)")
out = []
for s in list1:
    items = r.findall(s)[0]
    out.append(items[0] + items[1])
    out.append(items[0] + items[2])  # prefix the second code with the number too
print(out)
Explanation of the regex: (a number of one to four digits), followed by (a letter and any characters), followed by a / and (the rest of the characters).
The parts are grouped with (), so that findall returns them as individual elements.
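As a sketch of what the three groups contain, the same pattern can be wrapped in a small helper. split_codes is a hypothetical name, and it assumes every input has exactly one "/":

```python
import re

pattern = re.compile(r"([0-9]{1,4})([a-zA-Z].*)/([a-zA-Z].*)")

def split_codes(item):
    # findall returns one tuple per match: (number, first code, second code)
    number, first, second = pattern.findall(item)[0]
    return [number + first, number + second]

out = []
for s in ['2300LO/LCE', '2302KO/KCE']:
    out.extend(split_codes(s))
print(out)
```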

pattern match get list and dict from string

I have the string below, and I want to extract the lists, dicts, and variables from it.
How can I split this string into that format?
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
import re
m1 = re.findall(r'(?=.*,)(.*?=\[.+?\],?)', s)
for i in m1:
    print('m1:', i)
Only the first result is correct.
Does anyone know how to do this?
m1: list_c=[1,2],
m1: a=3,b=1.3,c=abch,list_a=[1,2],

Split on '=' instead; then you can work with each variable name and its value.
You still need to handle type casting for the values (regex, split, or try/except with casting may help).
Also, as others commented, using a dict may be easier to handle.
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
al = s.split('=')
var_l = [al[0]]
value_l = []
for a in al[1:-1]:
    var_l.append(a.split(',')[-1])
    value_l.append(','.join(a.split(',')[:-1]))
value_l.append(al[-1])
output = dict(zip(var_l, value_l))
print(output)
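For the type casting mentioned above, one possible sketch uses ast.literal_eval and falls back to the raw string when the value is not a Python literal. cast_value is a made-up helper name, and leaving things like abch or {a:2,b:3} as plain strings is an assumption about what you want:

```python
import ast

def cast_value(text):
    """Convert '3' -> 3, '1.3' -> 1.3, '[1,2]' -> [1, 2]; leave the rest as str."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Bare names such as abch, or dicts with unquoted keys, are not literals.
        return text

raw = {'list_c': '[1,2]', 'a': '3', 'b': '1.3', 'c': 'abch', 'dict_a': '{a:2,b:3}'}
typed = {name: cast_value(value) for name, value in raw.items()}
print(typed)
```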

You may have better luck if you more or less explicitly describe the right-hand side expressions: numbers, lists, dictionaries, and identifiers:
re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
# [('list_c', '[1,2]'), ('a', '3'), ('b', '1.3'), ('c', 'abch'),
# ('list_a', '[1,2]'), ('dict_a', '{a:2,b:3}')]

The answer is as follows:
import re
from pprint import pprint
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1],Save,Record,dict_a={a:2,b:3}'
m1 = re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
temp_d = {}
for i, j in m1:
    temp = i.strip(',').split(',')
    if len(temp) > 1:
        for k in temp[:-1]:
            temp_d[k] = ''
        temp_d[temp[-1]] = j
    else:
        temp_d[temp[0]] = j
pprint(temp_d)
The output is:
{'Record': '',
'Save': '',
'a': '3',
'b': '1.3',
'c': 'abch',
'dict_a': '{a:2,b:3}',
'list_a': '[1]',
'list_c': '[1,2]'}

Instead of picking out the types, you can start by capturing the identifiers. Here's a regex that captures all the identifiers in the string (for lowercase only, but see note):
regex = re.compile(r'([a-z]|_)+=')
#note if you want all valid variable names: r'([a-z]|[A-Z]|[0-9]|_)+'
cases = [x.group() for x in re.finditer(regex, s)]
This gives a list of all the identifiers in the string:
['list_c=', 'a=', 'b=', 'c=', 'list_a=', 'dict_a=']
We can now define a function that uses the above list to partition s sequentially:
def chop(mystr, mylist):
    temp = mystr.partition(mylist[0])[2]
    cut = temp.find(mylist[1])  # strip leading bits
    return mystr.partition(mylist[0])[2][cut:], mylist[1:]
mystr = s[:]
temp = [mystr]
mylist = cases[:]
while len(mylist) > 1:
    mystr, mylist = chop(mystr, mylist)
    temp.append(mystr)
This (convoluted) slicing operation gives this list of strings:
['list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'list_a=[1,2],dict_a={a:2,b:3}',
'dict_a={a:2,b:3}']
Now cut off the ends using each successive entry:
result = []
for x in range(len(temp) - 1):
    cut = temp[x].find(temp[x+1]) - 1  # -1 to remove commas
    result.append(temp[x][:cut])
result.append(temp.pop()) #get the last item
Now we have the full list:
['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']
Each element is easily parsable into key:value pairs (and elements whose right-hand side is a Python literal, such as a=3 or list_a=[1,2], are also executable via exec).
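As a sketch of that last step, splitting each element on its first '=' is enough to build a dict. The values stay as strings here, and the variable name pairs is just for illustration:

```python
result = ['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch',
          'list_a=[1,2]', 'dict_a={a:2,b:3}']

# maxsplit=1 keeps any later '=' characters inside the value.
pairs = dict(item.split('=', 1) for item in result)
print(pairs)
```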

Transforming pairwise string into tuples

I have the following string:
my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
What I want to do is transform it into this list of tuples:
[ ('StemCells', 16.530000000000001),
('Bcells', 13.59),
('Monocytes', 11.58),
('abTcells', 10.050000000000001),
('Macrophages', 9.6899999999999995),
('gdTCells', 9.4900000000000002),
('StromalCells', 9.3599999999999994),
('DendriticCells', 9.1999999999999993),
('NKCells', 7.8099999999999996),
('Neutrophils', 2.71)]
How can I do that conveniently in Python?

my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
import re
stuff = re.findall(r'(\w+)\(([0-9.]+)\)',my_str)
stuff
Out[4]:
[('StemCells', '16.53'),
('Bcells', '13.59'),
('Monocytes', '11.58'),
('abTcells', '10.05'),
('Macrophages', '9.69'),
('gdTCells', '9.49'),
('StromalCells', '9.36'),
('DendriticCells', '9.20'),
('NKCells', '7.81'),
('Neutrophils', '2.71')]
That gets you most of the way there, then it's just a bit of type conversion
[(s,float(f)) for s,f in stuff]
Out[7]:
[('StemCells', 16.53),
('Bcells', 13.59),
('Monocytes', 11.58),
('abTcells', 10.05),
('Macrophages', 9.69),
('gdTCells', 9.49),
('StromalCells', 9.36),
('DendriticCells', 9.2),
('NKCells', 7.81),
('Neutrophils', 2.71)]

The simplest solution, without using regex:
my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
that_str = map(lambda s: s.rstrip(')').split('('), my_str.split(','))
that_str = list(map(lambda s: (s[0], float(s[1])), that_str))  # list() is needed on Python 3
>>> that_str
[('StemCells', 16.53), ('Bcells', 13.59), ('Monocytes', 11.58), ('abTcells', 10.05), ('Macrophages', 9.69), (' gdTCells', 9.49), ('StromalCells', 9.36), ('DendriticCells', 9.2), ('NKCells', 7.81), ('Neutrophils', 2.71)]
You could do the job in one pass, using a named function instead of a lambda:
def build_tuple(s):
    t = s.rstrip(')').split('(')
    return (t[0], float(t[1]))
that_str = list(map(build_tuple, my_str.split(',')))
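Note that the input has a space before gdTCells, which survives into the result above. A small sketch of a variant that strips it, assuming every chunk has the form name(number); parse_pairs is a made-up name and the sample string is shortened:

```python
def parse_pairs(text):
    pairs = []
    for chunk in text.split(','):
        # strip() removes stray spaces around each "name(value)" chunk
        name, _, value = chunk.strip().rstrip(')').partition('(')
        pairs.append((name, float(value)))
    return pairs

sample = "StemCells(16.53),Macrophages(9.69), gdTCells(9.49)"
print(parse_pairs(sample))
```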

How about:
>>> list(zip(re.findall(r'([a-zA-Z]+)',my_str), map(float, re.findall(r'([0-9.]+)',my_str))))

Inserting a character at regular intervals in a list

I am trying to convert 10000000C9ABCDEF to 10:00:00:00:c9:ab:cd:ef
This is needed because the 10000000C9ABCDEF format is how I see HBAs (host bus adapters) when I log in to my storage arrays, but the SAN switches understand the 10:00:00:00:c9:ab:cd:ef notation.
So far I have only managed the following:
#script to convert WWNs to lowercase and add the :.
def wwn_convert():
    while True:
        wwn = (input('Enter the WWN or q to quit- '))
        list_wwn = list(wwn)
        list_wwn = [x.lower() for x in list_wwn]
        lower_wwn = ''.join(list_wwn)
        print(lower_wwn)
        if wwn == 'q':
            break
wwn_convert()
I tried ':'.join, but that inserts : after each character, so I get 1:0:0:0:0:0:0:0:c:9:a:b:c:d:e:f
I want the .join to go through a loop where I can say something like for i in range (0, 15, 2) so that it inserts the : after two characters, but not quite sure how to go about it. (Good that Python offers me to loop in steps of 2 or any number that I want.)
Additionally, I would be thankful if someone could point me to resources on how to script this better...
Please help.
I am using Python Version 3.2.2 on Windows 7 (64 Bit)

Here is another option:
>>> s = '10000000c9abcdef'
>>> ':'.join(a + b for a, b in zip(*[iter(s)]*2))
'10:00:00:00:c9:ab:cd:ef'
Or even more concise:
>>> import re
>>> ':'.join(re.findall('..', s))
'10:00:00:00:c9:ab:cd:ef'
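The zip(*[iter(s)]*2) idiom can look cryptic. As a sketch of what it does: the list holds the same iterator twice, so zip pulls two consecutive characters for every pair. Written out with an explicit variable (it and pairs are names chosen for this example):

```python
s = '10000000c9abcdef'

it = iter(s)
# zip(it, it) takes one character for a, then the next one for b,
# from the same iterator, so the string is consumed two at a time.
pairs = list(zip(it, it))
# pairs starts with ('1', '0'), ('0', '0'), ...

joined = ':'.join(a + b for a, b in pairs)
print(joined)
```

With an odd number of characters, zip silently drops the last one, which is worth keeping in mind for malformed input.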

>>> s = '10000000C9ABCDEF'
>>> ':'.join([s[x:x+2] for x in range(0, len(s)-1, 2)])
'10:00:00:00:C9:AB:CD:EF'
Explanation:
':'.join(...) returns a new string inserting ':' between the parts of the iterable
s[x:x+2] returns a substring of length 2 starting at x from s
range(0, len(s) - 1, 2) returns a list of integers with a step of 2
so the list comprehension would split the string s in substrings of length 2, then the join would put them back together but inserting ':' between them.
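Putting the explanation together with the lowercase conversion from the question, a minimal sketch could look like this. format_wwn is a hypothetical helper name, and it steps to len(wwn) rather than len(wwn) - 1 so that an odd-length input would keep its last character:

```python
def format_wwn(wwn):
    """Lowercase a WWN and insert ':' after every two characters."""
    wwn = wwn.strip().lower()
    return ':'.join(wwn[i:i + 2] for i in range(0, len(wwn), 2))

print(format_wwn('10000000C9ABCDEF'))
```

The function can then be called from the input loop in the question, after checking for 'q'.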

>>> s='10000000C9ABCDEF'
>>> si=iter(s)
>>> ':'.join(c.lower()+next(si).lower() for c in si)
'10:00:00:00:c9:ab:cd:ef'
In lambda form:
>>> (lambda x: ':'.join(c.lower()+next(x).lower() for c in x))(iter(s))
'10:00:00:00:c9:ab:cd:ef'

I think what would help you out the most is a construction in Python called a slice. Slices work on any sequence, including strings, which makes them quite useful and generally well worth knowing.
>>> s = '10000000C9ABCDEF'
>>> [s.lower()[i:i+2] for i in range(0, len(s)-1, 2)]
['10', '00', '00', '00', 'c9', 'ab', 'cd', 'ef']
>>> ':'.join([s.lower()[i:i+2] for i in range(0, len(s)-1, 2)])
'10:00:00:00:c9:ab:cd:ef'
If you'd like to read some more about slices, they're explained very nicely in this question, as well as a part of the actual python documentation.

It can be done using the grouper recipe from the itertools documentation.
from itertools import zip_longest  # named izip_longest in Python 2
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)
Using this function, the code will look like:
def join(it):
    for el in it:
        yield ''.join(el)
':'.join(join(grouper(2, s)))
It works this way:
grouper(2,s) returns tuples '1234...' -> ('1','2'), ('3','4') ...
def join(it) does this: ('1','2'), ('3','4') ... -> '12', '34' ...
':'.join(...) creates a string from iterator: '12', '34' ... -> '12:34...'
Also, it may be rewritten as:
':'.join(''.join(el) for el in grouper(2, s))

Here is my simple, straightforward solution:
s = '10000000c9abcdef'
new_s = str()
for i in range(0, len(s)-1, 2):
    new_s += s[i:i+2]
    if i+2 < len(s):
        new_s += ':'
>>> new_s
'10:00:00:00:c9:ab:cd:ef'
