How to get the minimum and maximum values in a string? - python

files = ['foo.0001.jpg', 'test2.0003.jpg', 'foo.0004.jpg', 'tmp.txt',
'foo.0003.jpg', 'test2.0002.jpg', 'test2.0004.jpg', 'test.0002.jpg',
'foo.0002.jpg', 'foo.0005.jpg', 'test.0001.jpg']
For each prefix I want to print the name pattern together with its minimum and maximum frame numbers:
foo.####.jpg and its min, max
test.####.jpg and its min, max
test2.####.jpg and its min, max
def get_frame_number(files):
    for c in foo:
        value = files.get(c)
        for i in value:
            num = i.split(".")[1]
            num_list.append(int(num))
    print(str(min(num_list)) + "-" + str(max(num_list)))
I have this function, but I couldn't figure out how to make it work.

You can use re to try to pull the number out of your file name. Then use this function as the key argument to max and min respectively.
import re
def get_frame_number(file):
    match = re.match(r'[\w\d]+\.(\d+)\.jpg', file)
    if match:
        return int(match.group(1))
    else:
        return float('nan')
>>> max(files, key=get_frame_number)
'foo.0005.jpg'
>>> min(files, key=get_frame_number)
'foo.0001.jpg'

An option would be using key arg (with lambda function) of max() and min() built-in functions like this:
for fn in ('foo', 'test', 'test2'):
    fn_max = max(
        (name for name in files if name.startswith('{}.'.format(fn))),
        key=lambda name: int(name.split('.')[1]))
    fn_min = min(
        (name for name in files if name.startswith('{}.'.format(fn))),
        key=lambda name: int(name.split('.')[1]))
    print(fn, fn_max, fn_min)
Output (under Python 2, where print with parentheses prints a tuple):
('foo', 'foo.0005.jpg', 'foo.0001.jpg')
('test', 'test.0002.jpg', 'test.0001.jpg')
('test2', 'test2.0004.jpg', 'test2.0002.jpg')

import re
foo = re.findall(r'(foo\.\d+\.jpg)', '|'.join(sorted(files)))
foo[0], foo[-1]
Output :
('foo.0001.jpg', 'foo.0005.jpg')
Similarly you can check for min, max of other files:
test = re.findall(r'(test\.\d+\.jpg)', '|'.join(sorted(files)))
test[0], test[-1]
test2 = re.findall(r'(test2\.\d+\.jpg)', '|'.join(sorted(files)))
test2[0], test2[-1]
Putting all together in one liner:
[(i[0], i[-1]) for i in [re.findall(r'(' + j + r'\.\d+\.jpg)', '|'.join(sorted(files))) for j in ['foo', 'test', 'test2']]]
Output:
[('foo.0001.jpg', 'foo.0005.jpg'),
('test.0001.jpg', 'test.0002.jpg'),
('test2.0002.jpg', 'test2.0004.jpg')]

def get_frame_number(files, name):
    nums = []
    for each in files:
        parts = each.strip().split('.')
        if parts[0] == name:
            nums.append(int(parts[1]))
        else:
            print("Ignoring", each)
    return (min(nums), max(nums))
Try this with:
print(get_frame_number(files, "test"))
print(get_frame_number(files, "test2"))
print(get_frame_number(files, "foo"))
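The answers above hard-code the prefixes foo, test and test2. As a minimal sketch, assuming every frame file is named prefix.NNNN.jpg, the prefixes could instead be discovered while scanning the list. The names FRAME_RE and frame_ranges are made up for this illustration and do not come from any of the answers.

```python
import re
from collections import defaultdict

files = ['foo.0001.jpg', 'test2.0003.jpg', 'foo.0004.jpg', 'tmp.txt',
         'foo.0003.jpg', 'test2.0002.jpg', 'test2.0004.jpg', 'test.0002.jpg',
         'foo.0002.jpg', 'foo.0005.jpg', 'test.0001.jpg']

# Assumed naming scheme: <prefix>.<digits>.jpg; anything else is skipped.
FRAME_RE = re.compile(r'^(\w+)\.(\d+)\.jpg$')

def frame_ranges(names):
    """Map each prefix to a (min_frame, max_frame) tuple."""
    frames = defaultdict(list)
    for name in names:
        match = FRAME_RE.match(name)
        if match:
            frames[match.group(1)].append(int(match.group(2)))
    return {prefix: (min(nums), max(nums)) for prefix, nums in frames.items()}

for prefix, (low, high) in sorted(frame_ranges(files).items()):
    print('{}.####.jpg {}-{}'.format(prefix, low, high))
# foo.####.jpg 1-5
# test.####.jpg 1-2
# test2.####.jpg 2-4
```

Files that do not fit the pattern, such as tmp.txt, are silently ignored here; whether that is the right behaviour depends on your data.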

Related

Python split list values based on condition

Given a python list split values based on certain criteria:
list = ['(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))']
Now list[0] would be
(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))
I want to split such that upon iterating it would give me:
#expected output
value(sam) = literal(abc)
value(like) = literal(music)
Only the parts that start with value and literal should be returned. At first I thought of splitting on and / or, but that won't work because sometimes the and / or could be missing.
I tried :
for i in list:
    i.split()
    print(i)
#output ['((', 'value(abc)', '=', 'literal(12)', 'or' ....
I am open to regex-based suggestions as well, but I know little about regex and would prefer not to use it.
So to avoid so much clutter, I'm going to explain the solution in this comment. I hope that's okay.
Given your comment above which I couldn't quite understand, is this what you want? I changed the list to add in the other values you mentioned:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''',
'''(value(PICK_SKU1) = propval(._sku)''', '''propval(._amEntitled) > literal(0))''']
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([\w\.]+(?:\()[\w\.]+(?:\))[\s=<>(?:in)]+[\w\.]+(?:\()[\w\.]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(PICK_SKU1) = propval(._sku)', 'propval(._amEntitled) > literal(0)']
Explanation:
Pre-Note - I changed [a-zA-Z0-9\._]+ to [\w\.]+ because they mean essentially the same thing but one is more concise. I explain what characters are covered by those queries in the next step
With ([\w\.]+, noting that the group is "unclosed", meaning I am priming the regex to capture everything in the following query, I am telling it to begin by capturing one or more word characters (a-z, A-Z, 0-9 and _) or escaped periods (.)
With (?:\() I am saying the captured query should contain an escaped "opening" parenthesis
With [\w\.]+(?:\)) I'm saying follow that parenthesis again with the word characters outlined in the first step, but this time through (?:\)) I'm saying follow them with an escaped "closing" parenthesis
This [\s=<>(?:in)]+ is kind of reckless. It is a character class, so (?:in) is not treated as a group inside it: the class matches any run made up of whitespace, =, <, >, (, ?, :, i, n and ). For the sake of readability, and assuming that your strings will remain relatively consistent, this is enough to say that the "closing parenthesis" should be followed by whitespace, a =, a <, a >, or the word in. It is reckless because it will also match things like << <, = in > =, ni, etc. Making it more specific could easily result in a loss of captures, though
With [\w\.]+(?:\()[\w\.]+(?:\)) I'm saying once again, find the word characters from step 1, followed by an "opening parenthesis," followed again by the word characters, followed by a "closing parenthesis"
With the ) I am closing the "unclosed" capture group (remember the first capture group above started as "unclosed"), to tell the regex engine to capture the entire query I have outlined
Hope this helps
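To make the point about the operator part concrete, here is a small sketch comparing the character class from the answer with an explicit alternation. The sample text and the names loose and strict are invented for the demonstration; the stricter pattern is only one possible tightening, not something the answer proposed.

```python
import re

text = 'value(PRICELIST) in propval(valid) and value(a) nini propval(b)'

# Inside [...] the sequence (?:in) is not a group: it just adds the literal
# characters '(', '?', ':', 'i', 'n' and ')' to the class.
loose = re.compile(r'[\w.]+\([\w.]+\)[\s=<>(?:in)]+[\w.]+\([\w.]+\)')

# An explicit alternation accepts exactly one real operator.
strict = re.compile(r'[\w.]+\([\w.]+\)\s*(?:=|<|>|\bin\b)\s*[\w.]+\([\w.]+\)')

print(loose.findall(text))
# ['value(PRICELIST) in propval(valid)', 'value(a) nini propval(b)']
print(strict.findall(text))
# ['value(PRICELIST) in propval(valid)']
```

The loose version accepts the nonsense operator nini because every one of its characters is in the class; the strict version rejects it.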
@Duck_dragon
Your strings in your list in the opening post were formatted in such a way that they cause a syntax error in Python. In the example I give below, I edited it to use ''' instead.
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Simple findall without assigning the result to a variable, so it just echoes a list of separate strings that you can't reuse
#You can also use the *MORE SIMPLE* but less flexible regex: '([a-zA-Z]+\([a-zA-Z]+\)[\s=]+[a-zA-Z]+\([a-zA-Z]+\))'
>>> for item in list:
...     re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item)
['value(name) = literal(luke)', 'value(like) = literal(music)']
['value(sam) = literal(abc)', 'value(like) = literal(music)']
To take this a step further and give you an array you can work with:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(sam) = literal(abc)', 'value(like) = literal(music)']
Given your comment below which I couldn't quite understand, is this what you want? I changed the list to add in the other values you mentioned:
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''',
'''(value(PICK_SKU1) = propval(._sku)''', '''propval(._amEntitled) > literal(0))''']
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([\w\.]+(?:\()[\w\.]+(?:\))[\s=<>(?:in)]+[\w\.]+(?:\()[\w\.]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(PICK_SKU1) = propval(._sku)', 'propval(._amEntitled) > literal(0)']
Edit: Or is this what you want?
>>> import re
>>> list = ['''(( value(name) = literal(luke) or value(like) = literal(music) )
and (value(PRICELIST) in propval(valid))''',
'''(( value(sam) = literal(abc) or value(like) = literal(music) ) and
(value(PRICELIST) in propval(valid))''']
#Declaring blank array found_list which you can use to call the individual items
>>> found_list = []
>>> for item in list:
...     for element in re.findall(r'([a-zA-Z]+(?:\()[a-zA-Z]+(?:\))[\s=<>(?:in)]+[a-zA-Z]+(?:\()[a-zA-Z]+(?:\)))', item):
...         found_list.append(element)
>>> found_list
['value(name) = literal(luke)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)', 'value(sam) = literal(abc)', 'value(like) = literal(music)', 'value(PRICELIST) in propval(valid)']
Let me know if you need an explanation.
@Fyodor Kutsepin
In your example take out your_list_ and replace it with OP's list to avoid confusion. Secondly, your for loop lacks a : producing syntax errors
First, I would suggest that you avoid naming your variables after built-in functions.
Second, you don't need a regex to get the output you mentioned.
For example:
first, rest = your_list_[1].split(') and')
for item in first[2:].split('or'):
    print(item.strip())
Not saying you should, but you definitely could use a PEG parser here:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
data = ['(( value(name) = literal(luke) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))',
'(( value(sam) = literal(abc) or value(like) = literal(music) ) and (value(PRICELIST) in propval(valid))']
grammar = Grammar(
r"""
expr = term (operator term)*
term = lpar* factor (operator needle)* rpar*
factor = needle operator needle
needle = word lpar word rpar
operator = ws? ("=" / "or" / "and" / "in") ws?
word = ~"\w+"
lpar = "(" ws?
rpar = ws? ")"
ws = ~r"\s*"
"""
)
class HorribleStuff(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return node.text or visited_children
    def visit_factor(self, node, children):
        output, equal = [], False
        for child in node.children:
            if (child.expr.name == 'needle'):
                output.append(child.text)
            elif (child.expr.name == 'operator' and child.text.strip() == '='):
                equal = True
        if equal:
            print(output)
for d in data:
    tree = grammar.parse(d)
    hs = HorribleStuff()
    hs.visit(tree)
This yields
['value(name)', 'literal(luke)']
['value(sam)', 'literal(abc)']

Extracting data from string with specific format using Python

I am a novice with Python and am currently trying to use it to parse a custom-formatted output string. The format contains named lists of floats and lists of tuples of floats. I wrote a function, but it looks excessive. How can this be done in a more Pythonic way?
import re
def extract_line(line):
    line = line.lstrip('0123456789# ')
    measurement_list = list(filter(None, re.split(r'\s*;\s*', line)))
    measurement = {}
    for elem in measurement_list:
        elem_list = list(filter(None, re.split(r'\s*=\s*', elem)))
        name = elem_list[0]
        if name == 'points':
            points = list(filter(None, re.split(r'\s*\(\s*|\s*\)\s*',elem_list[1].strip(' {}'))))
            for point in points:
                p = re.match(r'\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*', point).groups()
                if 'points' not in measurement.keys():
                    measurement['points'] = []
                measurement['points'].append(tuple(map(float,p)))
        else:
            values = list(filter(None, elem_list[1].strip(' {}').split(' ')))
            for value in values:
                if name not in measurement.keys():
                    measurement[name] = []
                measurement[name].append(float(value))
    return measurement
to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
print(extract_line(to_parse))
You can do it using re.findall:
import re
to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
m_list = re.findall(r'(\w+)\s*=\s*{([^}]*)}', to_parse)
measurements = {}
for k,v in m_list:
    if k == 'points':
        elts = re.findall(r'([0-9.]+)\s*,\s*([0-9.]+)', v)
        measurements[k] = [tuple(map(float, elt)) for elt in elts]
    else:
        measurements[k] = [float(x) for x in v.split()]
print(measurements)
Feel free to put it in a function and to check whether a key already exists before assigning to it.
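Following that suggestion, here is one way the findall approach might look as a function. This is only a sketch: the name parse_measurements is made up, and extending the list when a name repeats is my assumption about what "checking the keys" should do.

```python
import re

def parse_measurements(line):
    """Return {name: [values]} for a line of 'name = { ... } ;' entries."""
    measurements = {}
    for name, body in re.findall(r'(\w+)\s*=\s*\{([^}]*)\}', line):
        if name == 'points':
            pairs = re.findall(r'([0-9.]+)\s*,\s*([0-9.]+)', body)
            values = [tuple(map(float, pair)) for pair in pairs]
        else:
            values = [float(x) for x in body.split()]
        # Extend instead of overwrite, in case a name occurs twice on a line.
        measurements.setdefault(name, []).extend(values)
    return measurements

to_parse = '#10 points = { ( 2.96296 , 0.822213 ) ( 3.7037 , 0.902167 ) } ; L = { 5.20086 } ; P = { 3.14815 3.51852 } ;'
print(parse_measurements(to_parse))
```

The leading '#10' needs no special handling because the outer pattern only looks for name = { ... } groups.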
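This:
import re
a = re.findall(r' ([\d\.eE-]*) ', to_parse)
list(map(float, a))
>> [2.96296, 0.822213, 3.7037, 0.902167, 5.20086, 3.14815]
will give you a flat list of the numbers. Note that it misses 3.51852, because the space in front of it has already been consumed by the previous match. Is that what you are looking for?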

Is it possible to use regular expressions with pdfquery?

Can we use regex to detect text within a pdf (using pdfquery or another tool)?
I know we can do this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % \
(left_corner, bottom_corner-30, \
left_corner+150, bottom_corner)).text()
print cash
'179,000.00'
But we need something like this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
cash = str(label.attr('x0'))
print cash
'179,000.00'
This is not exactly a lookup for a regex, but it works to format/filter the possible extractions:
import re
import pdfquery
def regex_function(pattern, match):
    re_obj = re.search(pattern, match)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None
SOME_PATTERN_HERE = r''  # put your regex here; it needs one capture group
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
pdf.extract([
    ('with_parent', 'LTPage[pageid=1]'),
    ('with_formatter', 'text'),
    ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
     lambda match: regex_function(SOME_PATTERN_HERE, match))
])
I didn't test this next one, but it might also work:
def regex_function_filter_here():
    # here you could use some regex.
    # pyquery makes the current element available as `this` inside filter callbacks
    return float(this.get('width',0)) * float(this.get('height',0)) > 40000
pdf.pq('LTPage[page_index="1"] *').filter(regex_function_filter_here)
[<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
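If the goal is simply "find the text that looks like a money amount", another option is to pull the text out first and apply the regex in plain Python. The sketch below only shows the filtering step on a stand-in list of strings; how you collect the strings from pdfquery is an assumption on my part (see the comment), and the names MONEY_RE and matching_texts are invented.

```python
import re

# Money pattern taken from the question.
MONEY_RE = re.compile(r'\d{1,3}(?:,\d{3})*(?:\.\d{2})?')

def matching_texts(texts, pattern=MONEY_RE):
    """Return the stripped strings that consist entirely of a pattern match."""
    return [t.strip() for t in texts if t and pattern.fullmatch(t.strip())]

# Stand-in for text pulled out of the PDF. With pdfquery this might be
# something like [el.text for el in pdf.pq('LTTextLineHorizontal')],
# which is an untested assumption here.
sample = ['Cash', '179,000.00 ', 'Form 1040A (2007)', '12', None]
print(matching_texts(sample))
# ['179,000.00', '12']
```

Using fullmatch keeps lines such as 'Form 1040A (2007)' out of the result even though they contain digits.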

Create a sublist by datedelta in Python

I have a list of data points that contains a measurement every 5 minutes for 24 hours. I need to create a new list with the average of that measurement for each hour in the list. What's the best way to accomplish that?
Date Amount
2015-03-14T00:00:00.000-04:00 12545.869
2015-03-14T00:05:00.000-04:00 12467.326
2015-03-14T00:10:00.000-04:00 12416.948
2015-03-14T00:15:00.000-04:00 12315.698
2015-03-14T00:20:00.000-04:00 12276.38
2015-03-14T00:25:00.000-04:00 12498.696
2015-03-14T00:30:00.000-04:00 12426.145
2015-03-14T00:35:00.000-04:00 12368.659
2015-03-14T00:40:00.000-04:00 12322.785
2015-03-14T00:45:00.000-04:00 12292.719
2015-03-14T00:50:00.000-04:00 12257.965
2015-03-14T00:55:00.000-04:00 12221.375
2015-03-14T01:00:00.000-04:00 12393.725
2015-03-14T01:05:00.000-04:00 12366.674
2015-03-14T01:10:00.000-04:00 12378.578
2015-03-14T01:15:00.000-04:00 12340.754
2015-03-14T01:20:00.000-04:00 12288.511
2015-03-14T01:25:00.000-04:00 12266.136
2015-03-14T01:30:00.000-04:00 12236.639
2015-03-14T01:35:00.000-04:00 12181.668
2015-03-14T01:40:00.000-04:00 12171.992
2015-03-14T01:45:00.000-04:00 12164.298
2015-03-14T01:50:00.000-04:00 12137.282
2015-03-14T01:55:00.000-04:00 12116.486
2015-03-14T02:00:02.000-04:00 12090.439
2015-03-14T02:05:00.000-04:00 12085.924
2015-03-14T02:10:00.000-04:00 12034.78
2015-03-14T02:15:00.000-04:00 12037.367
2015-03-14T02:20:00.000-04:00 12006.649
2015-03-14T02:25:00.000-04:00 11985.588
2015-03-14T02:30:00.000-04:00 11999.41
2015-03-14T02:35:00.000-04:00 11943.121
2015-03-14T02:40:00.000-04:00 11934.346
2015-03-14T02:45:00.000-04:00 11928.568
2015-03-14T02:50:00.000-04:00 11918.63
2015-03-14T02:55:00.000-04:00 11885.698
2015-03-14T03:00:00.000-04:00 11863.065
2015-03-14T03:05:00.000-04:00 11883.256
2015-03-14T03:10:00.000-04:00 11870.095
2015-03-14T03:15:00.000-04:00 11849.104
2015-03-14T03:20:00.000-04:00 11849.18
2015-03-14T03:25:00.000-04:00 11834.229
2015-03-14T03:30:00.000-04:00 11826.603
2015-03-14T03:35:00.000-04:00 11823.516
2015-03-14T03:40:00.000-04:00 11849.386
2015-03-14T03:45:00.000-04:00 11832.385
2015-03-14T03:50:00.000-04:00 11847.059
2015-03-14T03:55:00.000-04:00 11831.807
2015-03-14T04:00:00.000-04:00 11844.027
2015-03-14T04:05:00.000-04:00 11873.114
2015-03-14T04:10:00.000-04:00 11904.105
2015-03-14T04:15:00.000-04:00 11879.018
2015-03-14T04:20:00.000-04:00 11899.658
2015-03-14T04:25:00.000-04:00 11887.808
2015-03-14T04:30:00.000-04:00 11879.875
2015-03-14T04:35:00.000-04:00 11924.149
2015-03-14T04:40:00.000-04:00 11929.499
2015-03-14T04:45:00.000-04:00 11932.086
2015-03-14T04:50:00.000-04:00 11989.847
2015-03-14T04:55:00.000-04:00 12000.971
This is a beautiful use of itertools.groupby because you can actually take advantage of the generators it returns instead of instantly making them lists or something:
import itertools, pprint
d = {}
for (key,gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    d[key] = sum(v for (d,v) in gen)
pprint.pprint(d)
And for average instead of sum:
import itertools, pprint
def avg(gf):
    _sum = 0
    for (i,e) in enumerate(gf): _sum += e
    return float(_sum) / (i+1)
d = {}
for (key,gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    #d[key] = sum(v for (d,v) in gen)
    d[key] = avg(v for (d,v) in gen)
pprint.pprint(d)
Output (of the sum version; the averages are these values divided by 12):
{0: 148410.565,
1: 147042.743,
2: 143850.52000000002,
3: 142159.685,
4: 142944.15699999998}
Where the key of the dictionary ([0,1,2,3,4]) corresponds to the hour of the timestamp.
Input:
lst = [
['2015-03-14T00:00:00.000-04:00', 12545.869 ],
['2015-03-14T00:05:00.000-04:00', 12467.326],
['2015-03-14T00:10:00.000-04:00', 12416.948],
['2015-03-14T00:15:00.000-04:00', 12315.698],
['2015-03-14T00:20:00.000-04:00', 12276.38],
['2015-03-14T00:25:00.000-04:00', 12498.696],
['2015-03-14T00:30:00.000-04:00', 12426.145],
['2015-03-14T00:35:00.000-04:00', 12368.659],
['2015-03-14T00:40:00.000-04:00', 12322.785],
['2015-03-14T00:45:00.000-04:00', 12292.719],
['2015-03-14T00:50:00.000-04:00', 12257.965],
['2015-03-14T00:55:00.000-04:00', 12221.375],
['2015-03-14T01:00:00.000-04:00', 12393.725],
['2015-03-14T01:05:00.000-04:00', 12366.674],
['2015-03-14T01:10:00.000-04:00', 12378.578],
['2015-03-14T01:15:00.000-04:00', 12340.754],
['2015-03-14T01:20:00.000-04:00', 12288.511],
['2015-03-14T01:25:00.000-04:00', 12266.136],
['2015-03-14T01:30:00.000-04:00', 12236.639],
['2015-03-14T01:35:00.000-04:00', 12181.668],
['2015-03-14T01:40:00.000-04:00', 12171.992],
['2015-03-14T01:45:00.000-04:00', 12164.298],
['2015-03-14T01:50:00.000-04:00', 12137.282],
['2015-03-14T01:55:00.000-04:00', 12116.486],
['2015-03-14T02:00:02.000-04:00', 12090.439],
['2015-03-14T02:05:00.000-04:00', 12085.924],
['2015-03-14T02:10:00.000-04:00', 12034.78],
['2015-03-14T02:15:00.000-04:00', 12037.367],
['2015-03-14T02:20:00.000-04:00', 12006.649],
['2015-03-14T02:25:00.000-04:00', 11985.588],
['2015-03-14T02:30:00.000-04:00', 11999.41],
['2015-03-14T02:35:00.000-04:00', 11943.121],
['2015-03-14T02:40:00.000-04:00', 11934.346],
['2015-03-14T02:45:00.000-04:00', 11928.568],
['2015-03-14T02:50:00.000-04:00', 11918.63],
['2015-03-14T02:55:00.000-04:00', 11885.698],
['2015-03-14T03:00:00.000-04:00', 11863.065],
['2015-03-14T03:05:00.000-04:00', 11883.256],
['2015-03-14T03:10:00.000-04:00', 11870.095],
['2015-03-14T03:15:00.000-04:00', 11849.104],
['2015-03-14T03:20:00.000-04:00', 11849.18],
['2015-03-14T03:25:00.000-04:00', 11834.229],
['2015-03-14T03:30:00.000-04:00', 11826.603],
['2015-03-14T03:35:00.000-04:00', 11823.516],
['2015-03-14T03:40:00.000-04:00', 11849.386],
['2015-03-14T03:45:00.000-04:00', 11832.385],
['2015-03-14T03:50:00.000-04:00', 11847.059],
['2015-03-14T03:55:00.000-04:00', 11831.807],
['2015-03-14T04:00:00.000-04:00', 11844.027],
['2015-03-14T04:05:00.000-04:00', 11873.114],
['2015-03-14T04:10:00.000-04:00', 11904.105],
['2015-03-14T04:15:00.000-04:00', 11879.018],
['2015-03-14T04:20:00.000-04:00', 11899.658],
['2015-03-14T04:25:00.000-04:00', 11887.808],
['2015-03-14T04:30:00.000-04:00', 11879.875],
['2015-03-14T04:35:00.000-04:00', 11924.149],
['2015-03-14T04:40:00.000-04:00', 11929.499],
['2015-03-14T04:45:00.000-04:00', 11932.086],
['2015-03-14T04:50:00.000-04:00', 11989.847],
['2015-03-14T04:55:00.000-04:00', 12000.971],
]
Edit: per discussion in comments, what about:
import itertools, pprint
def avg(gf):
    _sum = 0
    for (i,e) in enumerate(gf): _sum += e
    return float(_sum) / (i+1)
d = {}
for (key,gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    vals = list(gen) # Unpack generator
    key = vals[0][0][:13]
    d[key] = avg(v for (d,v) in vals)
pprint.pprint(d)
You can do this pretty easily using a variety of tools, but I'll use a simple loop for simplicity's sake:
>>> with open("listfile.txt", "r") as e:
...     list_ = e.read().splitlines()
>>> list_ = list_[1:] # Grab all but the first line
>>> dateValue = dict()
>>> for row in list_:
...     date, value = row.split()
...     if ":00:" in date:
...         # Start new value
...         amount = float(value)
...     elif ":55:" in date:
...         # End new value
...         amount += float(value)
...         date = date.split(':')[0] # Grab only date and hour info
...         dateValue[date] = amount / 12. # 12 five-minute readings per hour
...         del amount # Just in case the data isn't uniform, so it raises an error
...     else:
...         amount += float(value)
If you want to export it to lists, just do:
>>> listDate = list()
>>> listAmount = list()
>>> for k in sorted(dateValue.keys()):
...     v = dateValue.get(k)
...     listDate.append(k)
...     listAmount.append(v)
A quick and dirty way:
reads= [
'2015-03-14T00:00:00.000-04:00 12545.869',
'2015-03-14T00:05:00.000-04:00 12467.326',
'2015-03-14T00:10:00.000-04:00 12416.948',
'2015-03-14T00:15:00.000-04:00 12315.698',
'2015-03-14T00:20:00.000-04:00 12276.38',
'2015-03-14T00:25:00.000-04:00 12498.696',
'2015-03-14T00:30:00.000-04:00 12426.145',
'2015-03-14T00:35:00.000-04:00 12368.659',
'2015-03-14T00:40:00.000-04:00 12322.785',
'2015-03-14T00:45:00.000-04:00 12292.719',
'2015-03-14T00:50:00.000-04:00 12257.965',
'2015-03-14T00:55:00.000-04:00 12221.375',
'2015-03-14T01:00:00.000-04:00 12393.725',
'2015-03-14T01:05:00.000-04:00 12366.674',
'2015-03-14T01:10:00.000-04:00 12378.578',
'2015-03-14T01:15:00.000-04:00 12340.754',
'2015-03-14T01:20:00.000-04:00 12288.511',
'2015-03-14T01:25:00.000-04:00 12266.136',
'2015-03-14T01:30:00.000-04:00 12236.639',
'2015-03-14T01:35:00.000-04:00 12181.668',
'2015-03-14T01:40:00.000-04:00 12171.992',
'2015-03-14T01:45:00.000-04:00 12164.298',
'2015-03-14T01:50:00.000-04:00 12137.282',
'2015-03-14T01:55:00.000-04:00 12116.486'
]
sums = {}
for read in reads:
    hour = read.split(':')[0]
    value = float(read.split().pop())
    if hour in sums:
        sums[hour] += value
    else:
        sums[hour] = value
avg = {}
for s in sums:
    avg[s] = sums[s]/12
print(avg)
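The loop-based answers divide by a fixed 12, which only holds if every hour really has twelve readings. As a hedged alternative, this sketch counts the readings per hour instead. It assumes the rows are (timestamp, value) pairs like lst above, and the name hourly_averages is made up.

```python
from collections import defaultdict

def hourly_averages(rows):
    """Average the values of (timestamp, value) rows per hour."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        # The first 13 characters, e.g. '2015-03-14T00', are date plus hour.
        buckets[timestamp[:13]].append(float(value))
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

rows = [
    ('2015-03-14T00:00:00.000-04:00', 12545.869),
    ('2015-03-14T00:05:00.000-04:00', 12467.326),
    ('2015-03-14T01:00:00.000-04:00', 12393.725),
]
averages = hourly_averages(rows)
for hour in sorted(averages):
    print(hour, averages[hour])
```

Unlike itertools.groupby, this does not need the rows to be sorted by time, and it keeps the date in the key so that two days do not collapse into the same hour.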

regular expression for a string format

I have a string as
(device
(vfb
(xxxxxxxx)
(xxxxxxxx)
(location 0.0.0.0:5900)
)
)
(device
(console
(xxxxxxxx)
(xxxxxxxx)
(location 80)
)
)
I need to read the location line from "vfb" portion of the string. I have tried to use regular expression like
import re
re.findall(r'device.*?\vfb.*?\(.*?(.*?).*(.*?\))
But it doesn't give me the required output.
It's better to use a parser for problems like this. Fortunately, a parser would be rather trivial in your case:
import re
def parse(source):
    def expr(tokens):
        t = tokens.pop(0)
        if t != '(':
            return {'value': t}
        key, val = tokens.pop(0), {}
        while tokens[0] != ')':
            val.update(expr(tokens))
        tokens.pop(0)
        return {key:val}
    tokens = re.findall(r'\(|\)|[^\s()]+', source)
    lst = []
    while tokens:
        lst.append(expr(tokens))
    return lst
Given the above snippet, this creates a structure like:
[{'device': {'vfb': {'location': {'value': '0.0.0.0:5900'}, 'xxxxxxxx': {}}}},
{'device': {'console': {'location': {'value': '80'}, 'xxxxxxxx': {}}}}]
Now you can iterate it and fetch whatever you need:
for item in parse(source):
    try:
        location = item['device']['vfb']['location']['value']
    except KeyError:
        pass
With that intro from Martijn Pieters, here is a pyparsing approach:
inputdata = """(device
(vfb
(xxxxxxxx)
(xxxxxxxx)
(location 0.0.0.0:5900)
)
)
(device
(console
(xxxxxxxx)
(xxxxxxxx)
(location 80)
)
)"""
from pyparsing import OneOrMore, nestedExpr
# a nestedExpr defaults to reading space-separated words within nested parentheses
data = OneOrMore(nestedExpr()).parseString(inputdata)
print (data.asList())
# recursive search to walk parsed data to find desired entry
def findPath(seq, path):
    for s in seq:
        if s[0] == path[0]:
            if len(path) == 1:
                return s[1]
            else:
                ret = findPath(s[1:], path[1:])
                if ret is not None:
                    return ret
    return None
print(findPath(data, "device/vfb/location".split('/')))
prints:
[['device', ['vfb', ['xxxxxxxx'], ['xxxxxxxx'], ['location', '0.0.0.0:5900']]],
['device', ['console', ['xxxxxxxx'], ['xxxxxxxx'], ['location', '80']]]]
0.0.0.0:5900
Maybe this gets you started:
In [84]: data = '(device(vfb(xxxxxxxx)(xxxxxxxx)(location 0.0.0.0:5900)))'
In [85]: m = re.search(r"""
.....: vfb
.....: .*
.....: \(
.....: location
.....: \s+
.....: (
.....: [^\)]+
.....: )
.....: \)""", data, flags=re.X)
In [86]: m.group(1)
Out[86]: '0.0.0.0:5900'
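The session above runs on a single-line string. For the multi-line input from the question, something along these lines might work; it is a sketch that assumes the vfb block always contains a location entry, and the variable names are my own.

```python
import re

source = """(device
    (vfb
        (xxxxxxxx)
        (xxxxxxxx)
        (location 0.0.0.0:5900)
    )
)
(device
    (console
        (xxxxxxxx)
        (xxxxxxxx)
        (location 80)
    )
)"""

# re.DOTALL lets '.' cross newlines; the lazy '.*?' stops at the first
# '(location ...)' that follows '(vfb'.
match = re.search(r'\(vfb\b.*?\(location\s+([^)\s]+)\s*\)', source, flags=re.DOTALL)
location = match.group(1) if match else None
print(location)
# 0.0.0.0:5900
```

Be aware that if the vfb block had no location line, the lazy match would run on into the next device and return that one instead, which is one reason the parser-based answers are more robust.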
