Python split list values based on condition - python

Given a python list split values based on certain criteria:
list = ['(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))']
Now list[0] would be
(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))
I want to split such that upon iterating it would give me:
#expected output
value(sam) = literal(abc)
value(like) = literal(music)
I only want the pairs that start with value and are followed by literal. At first I thought of splitting on and/or, but that won't work because the and/or is sometimes missing.
I tried:
for i in list:
    print(i.split())
#output ['((', 'value(abc)', '=', 'literal(12)', 'or' ....
I am open to suggestions based on regex too, but I know little about it, so I would prefer not to use it.

So to avoid so much clutter, I'm going to explain the solution in this comment. I hope that's okay.
Given your comment above which I couldn't quite understand, is this what you want? I changed the list to add in the other values you mentioned:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''',
'''(value(PICK_SKU1) = propval(._sku)''', '''propval(._amEntitled) > literal(0))''']
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([\w\.]+(?:\()[\w\.]+(?:\))[\s=<>(?:in)]+[\w\.]+(?:\()[\w\.]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(PICK_SKU1) = propval(._sku)', 'propval(._amEntitled) > literal(0)']
Explanation:
Pre-note: I changed [a-zA-Z0-9\._]+ to [\w\.]+ because they mean essentially the same thing but the latter is more concise. I explain which characters those classes cover in the next step.
With ([\w\.]+, the opening ( starts a capture group that stays "unclosed" until the very end of the pattern, so everything that follows is captured. [\w\.]+ then matches one or more word characters (a-z, A-Z, 0-9, and _) or literal periods (.).
With (?:\() I am saying the match must continue with an escaped "opening" parenthesis, (. The non-capturing group around it is optional; a plain \( behaves the same.
With [\w\.]+(?:\)) I'm saying follow that parenthesis with the same characters outlined in the previous step, and then, through (?:\)), with an escaped "closing" parenthesis, ).
This [\s=<>(?:in)]+ is kind of reckless, but for the sake of readability and assuming that your strings will remain relatively consistent, it says that the "closing parenthesis" should be followed by a run of whitespace, =, <, > and the letters of in. Because it is a character class, the (?:in) inside it is not a group: the class simply accepts any of the characters (, ?, :, i, n and ) as well, in any order and any number of times. It is reckless because it will also match things like << <, = in > =, etc. Making it more specific could easily result in a loss of captures though.
With [\w\.]+(?:\()[\w\.]+(?:\)) I'm saying once again, find the word characters from the first step, followed by an "opening parenthesis", followed again by the word characters, followed by a "closing parenthesis".
With the final ) I am closing the "unclosed" capture group (remember the first capture group above started as "unclosed"), to tell the regex engine to capture the entire query I have outlined.
Hope this helps
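For comparison, here is a minimal sketch of a stricter variant that spells the operator out as an alternation instead of a character class. It is not the pattern used above; the names OPERAND, COMPARISON and conditions are made up for this example, and it assumes the only operators are =, <, > and in.

```python
import re

# Hypothetical sample data, shaped like the strings in the question.
conditions = [
    "(( value(name) = literal(luke) or value(like) = literal(music) )\n"
    "and (value(PRICELIST) in propval(valid))",
    "(value(PICK_SKU1) = propval(._sku)",
    "propval(._amEntitled) > literal(0))",
]

# One operand looks like name(argument); both parts may contain
# word characters and periods.
OPERAND = r"[\w.]+\([\w.]+\)"

# The operator is an alternation, so "in" has to appear as a whole word
# and runs such as "<< <" are rejected.
COMPARISON = re.compile(rf"{OPERAND}\s*(?:=|<|>|\bin\b)\s*{OPERAND}")

# The pattern has no capture groups, so findall returns the full matches.
found = [m for text in conditions for m in COMPARISON.findall(text)]
print(found)
```

The trade-off is the one mentioned above: any operator that is not listed in the alternation (for example >= or !=) is silently skipped, so the list has to be extended to fit the real data.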

#Duck_dragon
Your strings in your list in the opening post were formatted in such a way that they cause a syntax error in Python. In the example I give below, I edited it to use '''
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Simple findall without setting it equal to a variable so it returns a list of separate strings but which you can't use
#You can also use the *MORE SIMPLE* but less flexible regex: '([a-zA-Z]+\([a-zA-Z]+\)[\s=]+[a-zA-Z]+\([a-zA-Z]+\))'
>>> for item in list:
...     re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item)
['value(name) = literal(luke)', 'value(like) = literal(music)']
['value(sam) = literal(abc)', 'value(like) = literal(music)']
To take this a step further and give you an array you can work with:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(sam) = literal(abc)', 'value(like) = literal(music)']
Given your comment below which I couldn't quite understand, is this what you want? I changed the list to add in the other values you mentioned:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''',
'''(value(PICK_SKU1) = propval(._sku)''', '''propval(._amEntitled) > literal(0))''']
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([\w\.]+(?:\()[\w\.]+(?:\))[\s=<>(?:in)]+[\w\.]+(?:\()[\w\.]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(PICK_SKU1) = propval(._sku)', 'propval(._amEntitled) > literal(0)']
Edit: Or is this what you want?
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=<>(?:in)]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)']
Let me know if you need an explanation.
#Fyodor Kutsepin
In your example take out your_list_ and replace it with OP's list to avoid confusion. Secondly, your for loop lacks a : producing syntax errors

First, I would suggest that you avoid naming your variables after built-in functions.
Second, you don't need a regex if you want to get the mentioned output.
For example:
# your_list_ is the list from the question
first, rest = your_list_[1].split(') and')
for item in first[2:].split(' or '):
    print(item.strip())

Not saying you should, but you definitely could use a PEG parser here:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
data = ['(( value(name) = literal(luke) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))']
grammar = Grammar(
r"""
expr = term (operator term)*
term = lpar* factor (operator needle)* rpar*
factor = needle operator needle
needle = word lpar word rpar
operator = ws? ("=" / "or" / "and" / "in") ws?
word = ~"\w+"
lpar = "(" ws?
rpar = ws? ")"
ws = ~r"\s*"
"""
)
class HorribleStuff(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return node.text or visited_children
    def visit_factor(self, node, children):
        output, equal = [], False
        for child in node.children:
            if (child.expr.name == 'needle'):
                output.append(child.text)
            elif (child.expr.name == 'operator' and child.text.strip() == '='):
                equal = True
        if equal:
            print(output)
for d in data:
    tree = grammar.parse(d)
    hs = HorribleStuff()
    hs.visit(tree)
This yields
['value(name)', 'literal(luke)']
['value(sam)', 'literal(abc)']

Related

How to efficiently extract substrings from a string (with - or _)

I have a list of wall names of a building, and it looks like below:
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
I want to separate them into three or extract each of the elements (like W1, 1F, 12F) so that I can use wall names or floor information in another process.
wall_name = [W1, W2, W3...]
Floor_from = [1F, 1F, 10F...]
Floor_to = [12F, 9F, 12F...]
This is the result I want to get in the end.
I think it will be efficient to solve this problem by reading strings before or after _ and -, if this kind of method exists.
You can use the regex version of the split function with a simple pattern:
import re
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
for s in wall_list:
    print(re.split('[_-]', s))
Which will give:
['W1', '1F', '12F']
['W2', '1F', '9F']
['W3', '10F', '12F']
And to separate them to elements just put the result into zip:
import re
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
wall, floor_from, floor_to = zip(*(re.split('[_-]', s) for s in wall_list))
print(wall, floor_from, floor_to, sep='\n')
Which will now give:
('W1', 'W2', 'W3')
('1F', '1F', '10F')
('12F', '9F', '12F')
import re
def extract_components(wall):
    match = re.match(r"^(W\d+)_(\d+F)-(\d+F)", wall)
    return match.groups()
def extract(walls):
    return list(zip(*[extract_components(wall) for wall in walls]))
wall_name, floor_from, floor_to = extract(wall_list)
Results:
>>> wall_name
('W1', 'W2', 'W3')
>>> floor_from
('1F', '1F', '10F')
>>> floor_to
('12F', '9F', '12F')
wall_list = ["W1_1F-12F", "W2_1F-9F", "W3_10F-12F"]
wall_name = [] #[W1, W2, W3...]
Floor_from = [] #[1F, 1F, 10F...]
Floor_to = [] #[12F, 9F, 12F...]
for i in wall_list:
    wall_name.append(i.split("_")[0])
    Floor_from.append(i.split("_")[1].split("-")[0])
    Floor_to.append(i.split("_")[1].split("-")[1])
print(wall_name, Floor_from, Floor_to)
Try this
wallList = ["W1_1F-12F", "W2_1F-9F", "W3_10F-12F"]
wallName = []
floorFrom = []
floorTo = []
for element in wallList:
    wallName.append(element.split("_")[0])
    floorFrom.append(element.split("-")[0].split("_")[1])
    floorTo.append(element.split("-")[1])
print(wallName)
print(floorFrom)
print(floorTo)
You could convert the underscores to dashes and then use a simple split:
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F']
wall_name,Floor_from,Floor_to = map(list,zip(*(w.replace('_','-').split('-')
for w in wall_list)))
print(wall_name) # ['W1', 'W2', 'W3']
print(Floor_from) # ['1F', '1F', '10F']
print(Floor_to) # ['12F', '9F', '12F']
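If some labels might not follow the name_from-to shape, a variant that validates each label can help. The following is only a sketch under that assumption: WALL, split_walls and the malformed sample entry 'W4-3F' are made up for this example and do not come from the answers above.

```python
import re

# Assumed label format: <name>_<from>-<to>, e.g. "W1_1F-12F".
WALL = re.compile(r"(?P<name>[^_-]+)_(?P<floor_from>[^_-]+)-(?P<floor_to>[^_-]+)")

def split_walls(walls):
    """Return (names, floors_from, floors_to, rejected) for a list of labels."""
    names, floors_from, floors_to, rejected = [], [], [], []
    for label in walls:
        match = WALL.fullmatch(label)
        if match is None:
            rejected.append(label)  # collect malformed labels instead of raising
            continue
        names.append(match["name"])
        floors_from.append(match["floor_from"])
        floors_to.append(match["floor_to"])
    return names, floors_from, floors_to, rejected

# The last entry is deliberately malformed to show the rejected list.
wall_list = ['W1_1F-12F', 'W2_1F-9F', 'W3_10F-12F', 'W4-3F']
wall_name, floor_from, floor_to, rejected = split_walls(wall_list)
print(wall_name, floor_from, floor_to, rejected, sep='\n')
```

The plain split-based answers are shorter; this form is mainly useful when the input is not guaranteed to be clean.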

How to get the minimum and maximum values in a string?

files = ['foo.0001.jpg', 'test2.0003.jpg', 'foo.0004.jpg', 'tmp.txt',
'foo.0003.jpg', 'test2.0002.jpg', 'test2.0004.jpg', 'test.0002.jpg',
'foo.0002.jpg', 'foo.0005.jpg', 'test.0001.jpg']
I have the list above and, for each of foo.####.jpg, test.####.jpg and test2.####.jpg, I want to print the minimum and maximum frame number.
def get_frame_number(files):
    for c in foo:
        value = files.get(c)
        for i in value:
            num = i.split(".")[1]
            num_list.append(int(num))
        print str(min(num_list)) + "-" + str(max(num_list))
I have this function, but I couldn't get it to work.
You can use re to try to pull the number out of your file name. Then use this function as the key argument to max and min respectively.
import re
def get_frame_number(file):
    match = re.match(r'[\w\d]+\.(\d+)\.jpg', file)
    if match:
        return int(match.group(1))
    else:
        return float('nan')
>>> max(files, key=get_frame_number)
'foo.0005.jpg'
>>> min(files, key=get_frame_number)
'foo.0001.jpg'
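The calls above return the overall maximum and minimum across every file. To get one range per sequence, as the question asks, the names can be grouped by prefix first. This is a rough sketch, assuming every frame file is named prefix.number.jpg; FRAME and frame_ranges are names invented for this example.

```python
import re
from collections import defaultdict

files = ['foo.0001.jpg', 'test2.0003.jpg', 'foo.0004.jpg', 'tmp.txt',
         'foo.0003.jpg', 'test2.0002.jpg', 'test2.0004.jpg', 'test.0002.jpg',
         'foo.0002.jpg', 'foo.0005.jpg', 'test.0001.jpg']

# Assumed naming scheme: <prefix>.<frame number>.jpg
FRAME = re.compile(r'(\w+)\.(\d+)\.jpg')

def frame_ranges(names):
    """Map each sequence prefix to its (lowest, highest) frame number."""
    frames = defaultdict(list)
    for name in names:
        match = FRAME.fullmatch(name)
        if match:  # names such as tmp.txt do not match and are skipped
            frames[match.group(1)].append(int(match.group(2)))
    return {prefix: (min(nums), max(nums)) for prefix, nums in frames.items()}

ranges = frame_ranges(files)
for prefix, (low, high) in sorted(ranges.items()):
    print(f"{prefix}.####.jpg {low}-{high}")
```

Skipping non-matching names also avoids relying on how max and min treat a NaN key.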
An option would be using key arg (with lambda function) of max() and min() built-in functions like this:
for fn in ('foo', 'test', 'test2'):
    fn_max = max(
        (name for name in files if name.startswith('{}.'.format(fn))),
        key=lambda name: int(name.split('.')[1]))
    fn_min = min(
        (name for name in files if name.startswith('{}.'.format(fn))),
        key=lambda name: int(name.split('.')[1]))
    print(fn, fn_max, fn_min)
Output:
('foo', 'foo.0005.jpg', 'foo.0001.jpg')
('test', 'test.0002.jpg', 'test.0001.jpg')
('test2', 'test2.0004.jpg', 'test2.0002.jpg')
import re
foo = re.findall( r'(foo\.\d+.jpg)','|'.join( sorted(files) ) )
foo[0], foo[-1]
Output :
('foo.0001.jpg', 'foo.0005.jpg')
Similarly you can check for min, max of other files:
test = re.findall( r'(test\.\d+.jpg)','|'.join( sorted(files) ) )
test[0], test[-1]
test2 = re.findall( r'(test2\.\d+.jpg)','|'.join( sorted(files) ) )
test2[0], test2[-1]
Putting all together in one liner:
[ ( i[0], i[-1] ) for i in [ re.findall( r'('+ j + '\.\d+.jpg)','|'.join( sorted(files) ) ) for j in ['foo','test','test2'] ] ]
Output:
[('foo.0001.jpg', 'foo.0005.jpg'),
('test.0001.jpg', 'test.0002.jpg'),
('test2.0002.jpg', 'test2.0004.jpg')]
def get_frame_number(files, name):
    nums = []
    for each in files:
        parts = each.strip().split('.')
        if parts[0] == name:
            nums.append(int(parts[1]))
        else:
            print("Ignoring", each)
    return (min(nums), max(nums))
Try this with:
print(get_frame_number(files, "test"))
print(get_frame_number(files, "test2"))
print(get_frame_number(files, "foo"))

Extracting data from string with specific format using Python

I am a novice with Python and I am currently trying to use it to parse a custom-formatted output string. The format contains named lists of floats and lists of tuples of floats. I wrote a function, but it looks excessive. How can this be done in a more Pythonic way?
import re
def extract_line(line):
    line = line.lstrip('0123456789# ')
    measurement_list = list(filter(None, re.split(r'\s*;\s*', line)))
    measurement = {}
    for elem in measurement_list:
        elem_list = list(filter(None, re.split(r'\s*=\s*', elem)))
        name = elem_list[0]
        if name == 'points':
            points = list(filter(None, re.split(r'\s*\(\s*|\s*\)\s*', elem_list[1].strip(' {}'))))
            for point in points:
                p = re.match(r'\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*', point).groups()
                if 'points' not in measurement.keys():
                    measurement['points'] = []
                measurement['points'].append(tuple(map(float, p)))
        else:
            values = list(filter(None, elem_list[1].strip(' {}').split(' ')))
            for value in values:
                if name not in measurement.keys():
                    measurement[name] = []
                measurement[name].append(float(value))
    return measurement
to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
print(extract_line(to_parse))
You can do it using re.findall:
import re
to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
m_list = re.findall(r'(\w+)\s*=\s*{([^}]*)}', to_parse)
measurements = {}
for k, v in m_list:
    if k == 'points':
        elts = re.findall(r'([0-9.]+)\s*,\s*([0-9.]+)', v)
        measurements[k] = [tuple(map(float, elt)) for elt in elts]
    else:
        measurements[k] = [float(x) for x in v.split()]
print(measurements)
Feel free to put it in a function and to check whether a key already exists before assigning to it.
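One possible way to do both, as a sketch: parse_measurements is a name chosen for this example, and appending the values of a repeated name (rather than overwriting them) is an assumption about the desired behaviour.

```python
import re

def parse_measurements(line):
    """Parse 'name = { ... } ;' groups into a dict of lists."""
    measurements = {}
    for key, body in re.findall(r'(\w+)\s*=\s*{([^}]*)}', line):
        if key == 'points':
            pairs = re.findall(r'([0-9.]+)\s*,\s*([0-9.]+)', body)
            values = [tuple(map(float, pair)) for pair in pairs]
        else:
            values = [float(x) for x in body.split()]
        # setdefault + extend keeps earlier values if a name occurs twice
        measurements.setdefault(key, []).extend(values)
    return measurements

to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
result = parse_measurements(to_parse)
print(result)
```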
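This:
import re
a = re.findall(r'(?<= )[\d\.eE-]+(?= )', to_parse)
list(map(float, a))
>> [2.96296, 0.822213, 3.7037, 0.902167, 5.20086, 3.14815, 3.51852]
will give you your list of numbers. The lookarounds check for the surrounding spaces without consuming them, so two numbers separated by a single space are both found. Is that what you are looking for?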

Is it possible to use regular expressions with pdfquery?

Can we use regex to detect text within a pdf (using pdfquery or another tool)?
I know we can do this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % \
(left_corner, bottom_corner-30, \
left_corner+150, bottom_corner)).text()
print cash
'179,000.00'
But we need something like this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
cash = str(label.attr('x0'))
print cash
'179,000.00'
This is not exactly a lookup for a regex, but it works to format/filter the possible extractions:
import re
def regex_function(pattern, match):
    re_obj = re.search(pattern, match)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pattern = ''
# SOME_PATTERN_HERE is a placeholder for your own pattern with one capture group
pdf.extract([
    ('with_parent', 'LTPage[pageid=1]'),
    ('with_formatter', 'text'),
    ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
     lambda match: regex_function(SOME_PATTERN_HERE, match))
])
I didn't test this next one, but it might work also:
def some_regex_function_feature():
    # here you could use some regex.
    return float(this.get('width',0)) * float(this.get('height',0)) > 40000
pdf.pq('LTPage[page_index="1"] *').filter(regex_function_filter_here)
[<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
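Another option that does not depend on any selector support is to extract the text first and apply the regex in plain Python afterwards. The sketch below only shows that second step: the lines list is made-up sample data, not real pdfquery output, and how the strings are obtained from the PDF is left open. The pattern is the one from the question.

```python
import re

# Hypothetical sample: the text of each line as it might come out of a PDF.
lines = ["Form 1040A (2007)", "Cash", "179,000.00", "Total 1,250 units"]

# Amount pattern taken from the question.
AMOUNT = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

# fullmatch keeps only lines that consist of an amount and nothing else.
amounts = [line for line in lines if AMOUNT.fullmatch(line.strip())]
print(amounts)
```

Using search instead of fullmatch would also pick up amounts embedded in longer lines, at the cost of matching any stray number.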

How to replace text in curly brackets with another text based on comparisons using Python Regex

I am quite new to regular expressions. I have a string that looks like this:
str = "abc/def/([default], [testing])"
and a dictionary
dict = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
and using Python RE, I want str in this form, after comparisons of each element in dict to str:
str = "abc/def/(2.7, 2.1)"
Any help how to do it using Python RE?
P.S. This is not part of any assignment; it is part of my project at work, and I have spent many hours trying to figure out a solution, but in vain.
import re
st = "abc/def/([default], [testing], [something])"
dic = {'abc/def/[default]' : '2.7',
'abc/def/[testing]' : '2.1',
'bcd/xed/[something]' : '3.1'}
prefix_regex = r"^[\w*/]*"
tag_regex = r"\[\w*\]"
prefix = re.findall(prefix_regex, st)[0]
tags = re.findall(tag_regex, st)
for key in dic:
    key_prefix = re.findall(prefix_regex, key)[0]
    key_tag = re.findall(tag_regex, key)[0]
    if prefix == key_prefix:
        for tag in tags:
            if tag == key_tag:
                st = st.replace(tag, dic[key])
print(st)
OUTPUT:
abc/def/(2.7, 2.1, [something])
Here is a solution using re module.
Hypotheses :
there is a dictionary whose keys are composed of a prefix and a variable part, the variable part is enclosed in brackets ([])
the values are strings by which the variable parts are to be replaced in the string
the string is composed by a prefix, a (, a list of variable parts and a )
the variable parts in the string are enclosed in []
the variable parts in the string are separated by a comma followed by optional spaces
Python code :
import re
class splitter:
    pref = re.compile(r"[^(]+")
    iden = re.compile(r"\[[^]]*\]")
    def __init__(self, d):
        self.d = d
    def split(self, s):
        m = self.pref.match(s)
        if m is not None:
            p = m.group(0)
            elts = self.iden.findall(s, m.span()[1])
            return p, elts
        return None
    def convert(self, s):
        p, elts = self.split(s)
        return p + "(" + ", ".join((self.d[p + elt] for elt in elts)) + ")"
Usage :
s = "abc/def/([default], [testing])"
d = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
sp = splitter(d)
print(sp.convert(s))
output :
abc/def/(2.7, 2.1)
Regex is probably not required here. Hope this helps
lhs,rhs = str.split("/(")
rhs1,rhs2 = rhs.strip(")").split(", ")
lhs+="/"
print "{0}({1},{2})".format(lhs,dict[lhs+rhs1],dict[lhs+rhs2])
output
abc/def/(2.7,2.1)
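If a regex is preferred after all, re.sub accepts a function as the replacement, which allows a dictionary lookup per match. This is a small sketch rather than a drop-in answer: fill_tags is a name made up here, and it assumes the prefix is everything before the first ( and that tags without a dictionary entry should be left unchanged.

```python
import re

def fill_tags(text, values):
    """Replace each [tag] with values[prefix + '[tag]'] when that key exists."""
    prefix = text.split("(", 1)[0]  # "abc/def/" for the sample string

    def lookup(match):
        tag = match.group(0)
        # Fall back to the original tag when the dictionary has no entry.
        return values.get(prefix + tag, tag)

    return re.sub(r"\[[^\]]*\]", lookup, text)

s = "abc/def/([default], [testing], [something])"
d = {'abc/def/[default]': '2.7',
     'abc/def/[testing]': '2.1',
     'bcd/xed/[something]': '3.1'}
result = fill_tags(s, d)
print(result)
```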
