How to efficiently extract substrings from a string (with - or _)

I have a list of wall names of a building, and it looks like below:
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
I want to split each name into its three parts (for example W1, 1F, 12F) so that I can use the wall names and the floor information in another process.
wall_name = [W1, W2, W3...]
Floor_from = [1F, 1F, 10F...]
Floor_to = [12F, 9F, 12F...]
This is the result I want to get in the end.
I think the efficient way would be to read the text before and after the _ and - characters, if such a method exists.

You can use the regex version of the split function with a simple pattern:
import re
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
for s in wall_list:
    print(re.split('[_-]', s))
Which will give:
['W1', '1F', '12F']
['W2', '1F', '9F']
['W3', '10F', '12F']
To collect the parts into separate sequences, pass the split results to zip:
import re
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
wall, floor_from, floor_to = zip(*(re.split('[_-]', s) for s in wall_list))
print(wall, floor_from, floor_to, sep='\n')
Which will now give:
('W1', 'W2', 'W3')
('1F', '1F', '10F')
('12F', '9F', '12F')

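import re
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
def extract_components(wall):
    # Raw string so that \d reaches the regex engine unchanged.
    match = re.match(r"^(W\d+)_(\d+F)-(\d+F)", wall)
    if match is None:
        raise ValueError("unexpected wall name: %r" % wall)
    return match.groups()
def extract(walls):
    # Transpose the (name, from, to) tuples into three columns.
    return list(zip(*[extract_components(wall) for wall in walls]))
wall_name, floor_from, floor_to = extract(wall_list)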
Results:
>>> wall_name
('W1', 'W2', 'W3')
>>> floor_from
('1F', '1F', '10F')
>>> floor_to
('12F', '9F', '12F')

wall_list = ["W1_1F-12F", "W2_1F-9F", "W3_10F-12F"]
wall_name = [] #[W1, W2, W3...]
Floor_from = [] #[1F, 1F, 10F...]
Floor_to = [] #[12F, 9F, 12F...]
for i in wall_list:
    wall_name.append(i.split("_")[0])
    Floor_from.append(i.split("_")[1].split("-")[0])
    Floor_to.append(i.split("_")[1].split("-")[1])
print(wall_name,Floor_from,Floor_to)

Try this
wallList = ["W1_1F-12F", "W2_1F-9F", "W3_10F-12F"]
wallName = []
floorFrom = []
floorTo = []
for element in wallList:
    wallName.append( element.split("_")[0] )
    floorFrom.append( element.split("-")[0].split("_")[1] )
    floorTo.append( element.split("-")[1] )
print(wallName)
print(floorFrom)
print(floorTo)

You could convert the underscores to dashes and then use a simple split:
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
wall_name, Floor_from, Floor_to = map(list, zip(*(w.replace('_', '-').split('-')
                                                  for w in wall_list)))
print(wall_name) # ['W1', 'W2', 'W3']
print(Floor_from) # ['1F', '1F', '10F']
print(Floor_to) # ['12F', '9F', '12F']
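
For readers who want the "read the text before and after _ and -" idea from the question without regular expressions, here is a minimal sketch built on str.partition. The helper name split_wall is an assumption made for illustration, and the sketch assumes every name has exactly the NAME_FROM-TO shape shown in the question.

```python
def split_wall(label):
    """Split 'NAME_FROM-TO' into (name, floor_from, floor_to)."""
    name, underscore, floors = label.partition("_")
    floor_from, dash, floor_to = floors.partition("-")
    if not (underscore and dash):
        # partition returns an empty separator when it is not found
        raise ValueError("expected NAME_FROM-TO, got %r" % label)
    return name, floor_from, floor_to

wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
# zip(*...) transposes the per-wall tuples into three columns
wall_name, floor_from, floor_to = (list(col) for col in zip(*map(split_wall, wall_list)))
print(wall_name)
print(floor_from)
print(floor_to)
```

Unlike chained split calls, this raises a clear error on a malformed name instead of an IndexError.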

Related

Python split list values based on condition

Given a python list split values based on certain criteria:
list = ['(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))']
Now list[0] would be
(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))
I want to split such that upon iterating it would give me:
#expected output
value(sam) = literal(abc)
value(like) = literal(music)
This should apply only to expressions that start with value and literal. At first I thought of splitting on and / or, but that won't work because the and / or is sometimes missing.
I tried :
for i in list:
    print(i.split())
#output ['((', 'value(abc)', '=', 'literal(12)', 'or' ....
I am open to regex-based suggestions as well, but I know little about regular expressions and would prefer to avoid them.

So to avoid so much clutter, I'm going to explain the solution in this comment. I hope that's okay.
Given your comment above which I couldn't quite understand, is this what you want? I changed the list to add in the other values you mentioned:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''',
'''(value(PICK_SKU1) = propval(._sku)''', '''propval(._amEntitled) > literal(0))''']
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([\w\.]+(?:\()[\w\.]+(?:\))[\s=<>(?:in)]+[\w\.]+(?:\()[\w\.]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(PICK_SKU1) = propval(._sku)', 'propval(._amEntitled) > literal(0)']
Explanation:
Pre-note: I changed [a-zA-Z0-9\._]+ to [\w\.]+ because they mean essentially the same thing and the second is more concise. The next step explains which characters it covers.
With ([\w\.]+, noting that the group is "unclosed" (I am priming the regex to capture everything in the pattern that follows), I am telling it to begin by capturing word characters, meaning a-z, A-Z, 0-9 and _, plus an escaped period (.).
With (?:\() I am saying the captured text should next contain an escaped "opening" parenthesis.
With [\w\.]+(?:\)) I'm saying follow that parenthesis with the same word characters described above, and then, through (?:\)), with an escaped "closing" parenthesis.
This [\s=<>(?:in)]+ is kind of reckless. Inside square brackets, (?:in) is not a group: the class matches any run of whitespace, =, <, >, and the single characters (, ?, :, i, n and ). Assuming your strings remain relatively consistent, that is enough to cover =, <, > and the word in after the "closing parenthesis", but it will also match things like << <, = in > =, etc. Making it more specific could easily result in a loss of captures, though.
With [\w\.]+(?:\()[\w\.]+(?:\)) I'm saying, once again, find the word characters described above, followed by an "opening parenthesis", followed again by word characters, followed by a "closing parenthesis".
With the final ) I am closing the "unclosed" capture group (remember that the first capture group above started as "unclosed"), to tell the regex engine to capture the entire pattern I have outlined.
Hope this helps
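
If the looseness of the operator character class is a concern, one possible tightening is to spell the operators out as an alternation. The sketch below is not the pattern from the answer above: the names PAIR and find_pairs are made up for illustration, and it assumes that exactly one operator (=, <, > or the word in) sits between the two name(arg) parts.

```python
import re

# Hypothetical stricter pattern: name(arg) OPERATOR name(arg), where
# OPERATOR is exactly one of =, <, > or the whole word "in".
PAIR = re.compile(r"[\w.]+\([\w.]+\)\s*(?:=|<|>|\bin\b)\s*[\w.]+\([\w.]+\)")

def find_pairs(text):
    # No capturing groups, so findall returns the full matched strings.
    return PAIR.findall(text)

sample = '''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))'''
print(find_pairs(sample))
```

The trade-off is the one mentioned above: a stricter pattern silently skips anything that uses an operator it does not list.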

@Duck_dragon
Your strings in your list in the opening post were formatted in such a way that they cause a syntax error in Python. In the example I give below, I edited it to use '''
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Simple findall without setting it equal to a variable so it returns a list of separate strings but which you can't use
#You can also use the *MORE SIMPLE* but less flexible regex: '([a-zA-Z]+\([a-zA-Z]+\)[\s=]+[a-zA-Z]+\([a-zA-Z]+\))'
>>> for item in list:
...     re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item)
...
['value(name) = literal(luke)', 'value(like) = literal(music)']
['value(sam) = literal(abc)', 'value(like) = literal(music)']
To take this a step further and give you an array you can work with:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(sam) = literal(abc)', 'value(like) = literal(music)']
Edit: Or is this what you want?
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=<>(?:in)]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)']
Let me know if you need an explanation.
@Fyodor Kutsepin
In your example take out your_list_ and replace it with OP's list to avoid confusion. Secondly, your for loop lacks a : producing syntax errors

First, I would suggest that you avoid naming your variables after built-in functions.
Second, you don't need a regex to get the output you mentioned.
For example:
# your_list_ stands for the list from the question
first, rest = your_list_[1].split(') and')
for item in first[2:].split('or'):
    print(item.strip())

Not saying you should, but you definitely could use a PEG parser here:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
data = ['(( value(name) = literal(luke) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))']
grammar = Grammar(
r"""
expr = term (operator term)*
term = lpar* factor (operator needle)* rpar*
factor = needle operator needle
needle = word lpar word rpar
operator = ws? ("=" / "or" / "and" / "in") ws?
word = ~"\w+"
lpar = "(" ws?
rpar = ws? ")"
ws = ~r"\s*"
"""
)
class HorribleStuff(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return node.text or visited_children
    def visit_factor(self, node, children):
        output, equal = [], False
        for child in node.children:
            if (child.expr.name == 'needle'):
                output.append(child.text)
            elif (child.expr.name == 'operator' and child.text.strip() == '='):
                equal = True
        if equal:
            print(output)
for d in data:
    tree = grammar.parse(d)
    hs = HorribleStuff()
    hs.visit(tree)
This yields
['value(name)', 'literal(luke)']
['value(sam)', 'literal(abc)']

Finding a term in a list

I have the following code to find the keywords in a user's profile:
profile_text = self.text.lower()
term_string = ''
TERMS = ['spring', 'java', 'angular', 'elastic', 'css']
for term in TERMS:
    if term in profile_text:
        term_string += term.strip() + ', '
return term_string.strip(' ,')
This will return something like:
"spring, angular, css"
However, it will also return "java" if the user has a word such as "javascript". What would be a good pattern to prevent that?

You should use regular expressions.
You could do something like:
import re
TERMS = ['spring', 'java', 'angular', 'elastic', 'css']
matched_terms = []
for term in TERMS:
    # \b anchors the term at word boundaries, so 'java' no longer matches inside 'javascript'
    if re.search(r'\b{}\b'.format(re.escape(term)), profile_text):
        matched_terms.append(term)
return ', '.join(matched_terms)
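
The snippet above is a fragment of a method body. As a self-contained sketch of the same word-boundary technique, the function below can be run on its own. The name find_terms and the sample sentence are assumptions for illustration, not part of the original code.

```python
import re

TERMS = ['spring', 'java', 'angular', 'elastic', 'css']

def find_terms(profile_text, terms=TERMS):
    """Return the terms that occur as whole words, joined by ', '."""
    text = profile_text.lower()
    found = [
        term for term in terms
        # re.escape keeps terms such as 'c++' literal; \b requires word boundaries
        if re.search(r'\b' + re.escape(term) + r'\b', text)
    ]
    return ', '.join(found)

print(find_terms("I use JavaScript, Spring and CSS."))
```

Note that \b only works as expected for terms that begin and end with a word character. A term like 'c++' would need a different boundary check.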

Python Case Insensitive Replace All of multiple strings

I want to replace all occurrences of a set of strings in a text line. I came up with this approach, but I am sure there is a better way of doing this:
myDict = {}
test = re.compile(re.escape('pig'), re.IGNORECASE)
myDict['car'] = test
test = re.compile(re.escape('horse'), re.IGNORECASE)
myDict['airplane'] = test
test = re.compile(re.escape('cow'), re.IGNORECASE)
myDict['bus'] = test
mystring = 'I have this Pig and that pig with a hOrse and coW'
for key in myDict:
    regex_obj = myDict[key]
    mystring = regex_obj.sub(key, mystring)
print(mystring)
I have this car and that car with a airplane and bus
Based on @Paul Rooney's answer below, ideally I would do this:
def init_regex():
    rd = {'pig': 'car', 'horse':'airplane', 'cow':'bus'}
    myDict = {}
    for key,value in rd.iteritems():
        pattern = re.compile(re.escape(key), re.IGNORECASE)
        myDict[value] = pattern
    return myDict
def strrep(mystring, patternDict):
    for key in patternDict:
        regex_obj = patternDict[key]
        mystring = regex_obj.sub(key, mystring)
    return mystring

Try
import re
mystring = 'I have this Pig and that pig with a hOrse and coW'
rd = {'pig': 'car', 'horse':'airplane', 'cow':'bus'}
cachedict = {}
def strrep(orig, repdict):
    for k,v in repdict.iteritems():
        if k in cachedict:
            pattern = cachedict[k]
        else:
            # re.escape keeps keys containing regex metacharacters literal
            pattern = re.compile(re.escape(k), re.IGNORECASE)
            cachedict[k] = pattern
        orig = pattern.sub(v, orig)
    return orig
print(strrep(mystring, rd))
This answer was originally written for Python 2; for Python 3, use repdict.items() instead of repdict.iteritems().
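
A different technique from the loop above is to combine all keys into one alternation and let re.sub call a function for each match, so the text is scanned once and a replacement can never be replaced again by a later key. This is a Python 3 sketch under that assumption; make_replacer is a hypothetical helper name.

```python
import re

def make_replacer(mapping):
    """Build a function that replaces every key of mapping, ignoring case."""
    lowered = {key.lower(): value for key, value in mapping.items()}
    # Longest keys first, so a key that is a prefix of another cannot win.
    keys = sorted(lowered, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys), re.IGNORECASE)

    def replace(text):
        # The callback looks up the lower-cased match in the mapping.
        return pattern.sub(lambda m: lowered[m.group(0).lower()], text)

    return replace

strrep = make_replacer({'pig': 'car', 'horse': 'airplane', 'cow': 'bus'})
print(strrep('I have this Pig and that pig with a hOrse and coW'))
```

The compiled pattern is built once and reused, which replaces the manual cache dictionary.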

How to replace text in curly brackets with another text based on comparisons using Python Regex

I am quite new to regular expressions. I have a string that looks like this:
str = "abc/def/([default], [testing])"
and a dictionary
dict = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
Using Python's re module, I want to look up each bracketed element of str in dict so that str ends up in this form:
str = "abc/def/(2.7, 2.1)"
Any help on how to do this with Python's re module?
P.S. It's not part of any assignment; it is part of my project at work, and I have spent many hours trying to figure out a solution, but in vain.

import re
st = "abc/def/([default], [testing], [something])"
dic = {'abc/def/[default]' : '2.7',
'abc/def/[testing]' : '2.1',
'bcd/xed/[something]' : '3.1'}
prefix_regex = r"^[\w*/]*"
tag_regex = r"\[\w*\]"
prefix = re.findall(prefix_regex, st)[0]
tags = re.findall(tag_regex, st)
for key in dic:
    key_prefix = re.findall(prefix_regex, key)[0]
    key_tag = re.findall(tag_regex, key)[0]
    if prefix == key_prefix:
        for tag in tags:
            if tag == key_tag:
                st = st.replace(tag, dic[key])
print(st)
OUTPUT:
abc/def/(2.7, 2.1, [something])

Here is a solution using re module.
Hypotheses :
there is a dictionary whose keys are composed of a prefix and a variable part, the variable part is enclosed in brackets ([])
the values are strings by which the variable parts are to be replaced in the string
the string is composed by a prefix, a (, a list of variable parts and a )
the variable parts in the string are enclosed in []
the variable parts in the string are separated by a comma followed by optional spaces
Python code :
import re
class splitter:
    pref = re.compile(r"[^(]+")
    iden = re.compile(r"\[[^]]*\]")
    def __init__(self, d):
        self.d = d
    def split(self, s):
        m = self.pref.match(s)
        if m is not None:
            p = m.group(0)
            elts = self.iden.findall(s, m.span()[1])
            return p, elts
        return None
    def convert(self, s):
        p, elts = self.split(s)
        return p + "(" + ", ".join((self.d[p + elt] for elt in elts)) + ")"
Usage :
s = "abc/def/([default], [testing])"
d = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
sp = splitter(d)
print(sp.convert(s))
output :
abc/def/(2.7, 2.1)

Regex is probably not required here. Hope this helps
lhs,rhs = str.split("/(")
rhs1,rhs2 = rhs.strip(")").split(", ")
lhs+="/"
print("{0}({1},{2})".format(lhs,dict[lhs+rhs1],dict[lhs+rhs2]))
output
abc/def/(2.7,2.1)
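
For any number of bracketed tags, another option is re.sub with a replacement function. The following is a minimal sketch, assuming the prefix is everything before the first ( as in the question. The names fill_tags, text and values are made up here, partly to avoid shadowing the built-ins str and dict.

```python
import re

def fill_tags(text, values):
    """Replace each [tag] in text with values[prefix + '[tag]'] when present."""
    prefix = text.split("(", 1)[0]  # e.g. 'abc/def/'

    def lookup(match):
        tag = match.group(0)  # e.g. '[default]'
        # Unknown tags are left untouched rather than raising KeyError.
        return values.get(prefix + tag, tag)

    return re.sub(r"\[[^\]]*\]", lookup, text)

values = {'abc/def/[default]': '2.7', 'abc/def/[testing]': '2.1'}
print(fill_tags("abc/def/([default], [testing])", values))
```

Leaving unknown tags in place is a design choice of this sketch. Raise an error instead if a missing key should be treated as a bug.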

Python regexp matching or tokenizing

I have a dump of a data structure which I'm trying to convert into XML. The structure has a number of nested structures within it, so I'm kind of lost on how to start, because all the regular expressions that I can think of will not work on nested expressions.
For example, let's say there is a structure dump like this:
abc = (
bcd = (efg = 0, ghr = 5, lmn = 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
and I want to produce output like this:
< abc >
< bcd >
< efg >0< /efg >
< ghr >5< /ghr >
< lmn >10< /lmn >
< /bcd >
.....
< /abc >
So what would be a good approach to this? Tokenizing the expression, a clever regex, or using a stack?

Use pyparsing.
$ cat parsing.py
from pyparsing import nestedExpr
abc = """(
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""
print(nestedExpr().parseString(abc).asList())
$ python parsing.py
[['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]

Here is an alternate answer that uses pyparsing more idiomatically. Because it provides a detailed grammar for what inputs may be seen and what results should be returned, parsed data is not "messy." Thus toXML() needn't work as hard nor do any real cleanup.
print("\n----- ORIGINAL -----\n")
dump = """
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
""".strip()
print(dump)
print("\n----- PARSED INTO LIST -----\n")
from pyparsing import Word, alphas, nums, Optional, Forward, delimitedList, Group, Suppress
def Syntax():
    """Define grammar and parser."""
    # building blocks
    name = Word(alphas)
    number = Word(nums)
    _equals = Optional(Suppress('='))
    _lpar = Suppress('(')
    _rpar = Suppress(')')
    # larger constructs
    expr = Forward()
    value = number | Group( _lpar + delimitedList(expr) + _rpar )
    expr << name + _equals + value
    return expr
parsed = Syntax().parseString(dump)
print(parsed)
print("\n----- SERIALIZED INTO XML ----\n")
def toXML(part, level=0):
    xml = ""
    indent = " " * level
    while part:
        tag = part.pop(0)
        payload = part.pop(0)
        insides = payload if isinstance(payload, str) \
            else "\n" + toXML(payload, level+1) + indent
        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())
    return xml
print(toXML(parsed))
The input and XML output is the same as my other answer. The data returned by parseString() is the only real change:
----- PARSED INTO LIST -----
['abc', ['bcd', ['efg', '0', 'ghr', '5', 'lmn', '10'], 'ghd', '5', 'zde',
['dfs', '10', 'fge', '20', 'dfg', ['sdf', '3', 'ert', '5'], 'juh', '0']]]

I don't think regexps are the best approach here, but for those curious it can be done like this:
import re
def expr(m):
    out = []
    for item in m.group(1).split(','):
        a, b = map(str.strip, item.split('='))
        out.append('<%s>%s</%s>' % (a, b, a))
    return '\n'.join(out)
rr = r'\(([^()]*)\)'
# data holds the dump from the question
while re.search(rr, data):
    data = re.sub(rr, expr, data)
Basically, we repeatedly replace the innermost parenthesized groups (those containing no nested parentheses) with chunks of XML until no parentheses remain. For simplicity, I also wrapped the main expression in parentheses; if yours is not, just do data='(%s)' % data before parsing.

I like Igor Chubin's "use pyparsing" answer, because in general, regexps handle nested structures very poorly (though thg435's iterative replacement solution is a clever workaround).
But once pyparsing's done its thing, you then need a routine to walk the list and emit XML. It needs to be intelligent about the imperfections of pyparsing's results. For example, fge =20, doesn't yield the ['fge', '=', '20'] you'd like, but ['fge', '=20,']. Commas are sometimes also added in places that are unhelpful. Here's how I did it:
from pyparsing import nestedExpr
dump = """
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
"""
dump = dump.strip()
print("\n----- ORIGINAL -----\n")
print(dump)
wrapped = dump if dump.startswith('(') else "({})".format(dump)
parsed = nestedExpr().parseString(wrapped).asList()
print("\n----- PARSED INTO LIST -----\n")
print(parsed)
def toXML(part, level=0):
    def grab_tag():
        return part.pop(0).lstrip(",")
    def grab_payload():
        payload = part.pop(0)
        if isinstance(payload, str):
            payload = payload.lstrip("=").rstrip(",")
        return payload
    xml = ""
    indent = " " * level
    while part:
        tag = grab_tag() or grab_tag()
        payload = grab_payload() or grab_payload()
        # grab twice, possibly, if '=' or ',' is in the way of what you're grabbing
        insides = payload if isinstance(payload, str) \
            else "\n" + toXML(payload, level+1) + indent
        xml += "{indent}<{tag}>{insides}</{tag}>\n".format(**locals())
    return xml
print("\n----- SERIALIZED INTO XML ----\n")
print(toXML(parsed[0]))
Resulting in:
----- ORIGINAL -----
abc = (
bcd = (efg = 0, ghr = 5, lmn 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))
----- PARSED INTO LIST -----
[['abc', '=', ['bcd', '=', ['efg', '=', '0,', 'ghr', '=', '5,', 'lmn', '10'], ',', 'ghd', '=', '5,', 'zde', '=', ['dfs', '=', '10,', 'fge', '=20,', 'dfg', '=', ['sdf', '=', '3,', 'ert', '=', '5'], ',', 'juh', '=', '0']]]]
----- SERIALIZED INTO XML ----
<abc>
<bcd>
<efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn>
</bcd>
<ghd>5</ghd>
<zde>
<dfs>10</dfs>
<fge>20</fge>
<dfg>
<sdf>3</sdf>
<ert>5</ert>
</dfg>
<juh>0</juh>
</zde>
</abc>

You can use the re module to parse nested expressions (though it is not recommended):
import re
def repl_flat(m):
    return "\n".join("<{0}>{1}</{0}>".format(*map(str.strip, s.partition('=')[::2]))
                     for s in m.group(1).split(','))
def eval_nested(expr):
    val, n = re.subn(r"\(([^)(]+)\)", repl_flat, expr)
    return val if n == 0 else eval_nested(val)
Example
print(eval_nested("(%s)" % (data,)))
Output
<abc><bcd><efg>0</efg>
<ghr>5</ghr>
<lmn>10</lmn></bcd>
<ghd>5</ghd>
<zde><dfs>10</dfs>
<fge>20</fge>
<dfg><sdf>3</sdf>
<ert>5</ert></dfg>
<juh>0</juh></zde></abc>
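
The question also asks about tokenizing or using a stack. As a dependency-free sketch of that route, the code below tokenizes the dump with a small regex and then parses it by recursive descent, so the Python call stack plays the role of the explicit stack. All function names are made up for illustration, and it assumes every entry has the name = value form from the question (the lmn 10 variant without = is rejected).

```python
import re

# One token: an identifier, a number, or one of the symbols = ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[=(),])")

def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def expect(tokens, pos, wanted):
    if pos >= len(tokens) or tokens[pos] != wanted:
        raise ValueError("expected %r at token %d" % (wanted, pos))
    return pos + 1

def parse_items(tokens, pos):
    """Parse 'name = value, ...' from pos; return (items, next_pos)."""
    items = []
    while True:
        name = tokens[pos]
        pos = expect(tokens, pos + 1, "=")
        if pos < len(tokens) and tokens[pos] == "(":
            # Nested group: recurse, then require the closing parenthesis.
            value, pos = parse_items(tokens, pos + 1)
            pos = expect(tokens, pos, ")")
        else:
            value, pos = tokens[pos], pos + 1
        items.append((name, value))
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
        else:
            return items, pos

def to_xml(items, level=0):
    pad, lines = "  " * level, []
    for name, value in items:
        if isinstance(value, list):
            lines.append("%s<%s>" % (pad, name))
            lines.extend(to_xml(value, level + 1))
            lines.append("%s</%s>" % (pad, name))
        else:
            lines.append("%s<%s>%s</%s>" % (pad, name, value, name))
    return lines

def dump_to_xml(text):
    tokens = tokenize(text)
    items, pos = parse_items(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return "\n".join(to_xml(items))

print(dump_to_xml("""abc = (
bcd = (efg = 0, ghr = 5, lmn = 10),
ghd = 5,
zde = (dfs = 10, fge =20, dfg = (sdf = 3, ert = 5), juh = 0))"""))
```

Keeping tokenizing, parsing and serializing as separate steps means each one can be tested or replaced on its own, for example by swapping to_xml for xml.etree.ElementTree if escaping matters.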
