Parse line data until keyword with pyparsing

I'm trying to parse line data and then group the lines into lists.
Here is my script:
from pyparsing import *
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL
line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))
start.setDebug()
end.setDebug()
line.setDebug()
result = lines.parseString(data)
results_list = result.asList()
print(results_list)
This code was inspired by another stackoverflow question:
Matching nonempty lines with pyparsing
What I need is to parse everything from START to END line by line and save it to one list per group (everything from a START to its matching END is one group). However, this script puts every line in a new group of its own.
This is the result:
[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]
And I want it to be:
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
It also parses an empty string at the end.
I'm a pyparsing beginner so I ask you for your help.
Thanks

You could use a nestedExpr to find the text delimited by START and END.
If you use
In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]:
[[['line', '2', 'line', '3', 'line', '4']],
[['line', 'a', 'line', 'b', 'line', 'c']]]
then the text is split on whitespace. (Notice we have 'line', '2' above where we want 'line 2' instead). We'd rather it just split only on '\n'. So to fix this we can use the pp.nestedExpr function's content parameter which allows us to control what is considered an item inside the nested list.
The source code for nestedExpr defines
content = (Combine(OneOrMore(~ignoreExpr +
~Literal(opener) + ~Literal(closer) +
CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1))
).setParseAction(lambda t:t[0].strip()))
by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is
In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'
This is what causes nestedExpr to split on all whitespace.
So if we reduce that to simply '\n', then nestedExpr splits the content by
lines instead of by all whitespace.
import pyparsing as pp
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
opener = 'START'
closer = 'END'
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
+ ~pp.Literal(closer)
+ pp.CharsNotIn('\n',exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)
result = [item[0] for item in expr.searchString(data).asList()]
print(result)
yields
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
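As a cross-check on the expected grouping, here is a minimal sketch of the same idea in plain Python, without pyparsing. The name group_between is a hypothetical helper, not part of pyparsing, and unlike nestedExpr it assumes the START/END pairs are never nested.

```python
def group_between(text, opener="START", closer="END"):
    """Collect the lines between each opener/closer pair, one list per pair."""
    groups = []
    current = None  # None means we are outside a START...END block
    for raw in text.splitlines():
        line = raw.strip()
        if line == opener:
            current = []            # begin a new group
        elif line == closer:
            if current is not None:
                groups.append(current)
            current = None          # back outside any block
        elif current is not None and line:
            current.append(line)    # keep non-empty lines inside a block
    return groups


data = "START\nline 2\nline 3\nline 4\nEND\nSTART\nline a\nline b\nline c\nEND\n"
print(group_between(data))
```

Comparing this against the pyparsing result is a quick way to confirm that the grammar groups the lines as intended.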

Related

How to get correct output from regex.split()?

import re
number_with_both_parantheses = r"(\(*([\d+\.]+)\))"
def process_numerals(text):
    k = re.split(number_with_both_parantheses, text)
    k = list(filter(None, k))
    for elem in k:
        print(elem)
INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
expected_output = ['Statement 1', '(1)' , 'Statement 2', '(1.1)', 'Statement 3']
current_output = ['Statement 1', '(1)' , '1', 'Statement 2', '(1.1)', '1.1' , 'Statement 3']
My input is INPUT. I get current_output when I call process_numerals with the input text. How do I get expected_output instead?

Your regex seems off. You realize that \(* checks for zero or more left parentheses?
>>> import re
>>> INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
>>> re.split(r'\((\d+(?:\.\d+)?)\)', INPUT)
['Statement 1 ', '1', ' Statement 2 ', '1.1', ' Statement 3']
If you really want the literal parentheses to be included, put them inside the capturing parentheses.
The non-capturing parentheses (?:...) allow you to group without capturing. I guess that's what you are mainly looking for.
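To make that concrete, here is a small sketch that moves the literal parentheses inside the single capturing group. The surrounding \s* and the final filtering step are my additions (assumptions about the desired output), used so the pieces come back without stray spaces or empty strings.

```python
import re

INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'

# One capturing group that includes the literal parentheses, so re.split
# keeps '(1)' and '(1.1)' in the result. The inner (?:...) group only
# groups the optional decimal part and captures nothing.
pattern = re.compile(r'\s*(\(\d+(?:\.\d+)?\))\s*')

# Drop any empty strings that appear when a marker sits at either end.
parts = [piece for piece in pattern.split(INPUT) if piece]
print(parts)
# ['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']
```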

How do I split a multi-line string into multiple lines in python?

I have a multi-line string:
inputString = "Line 1\nLine 2\nLine 3"
I want a list in which each element holds at most 2 lines, as below:
outputStringList = ["Line 1\nLine 2", "Line 3"]
Can I convert inputString to outputStringList in Python? Any help will be appreciated.

You could try to match two lines (with a lookahead so the trailing linefeed is not captured) or, failing that, a single line (to handle a last, odd line). I expanded your example to show that it works for more than 3 lines, with a little "cheat": appending a newline at the end so that all cases are handled:
import re
s = "Line 1\nLine 2\nLine 3\nline4\nline5"
result = re.findall(r'(.+?\n.+?(?=\n)|.+)', s+"\n")
print(result)
result:
['Line 1\nLine 2', 'Line 3\nline4', 'line5']
The "add newline" cheat also lets an even number of lines be processed properly:
s = "Line 1\nLine 2\nLine 3\nline4\nline5\nline6"
result:
['Line 1\nLine 2', 'Line 3\nline4', 'line5\nline6']

Here is an alternative using the grouper itertools recipe to group any number of lines together.
Note: you can implement this recipe by hand, or you can optionally install a third-party library that implements this recipe for you, i.e. pip install more_itertools.
Code
from more_itertools import grouper
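def group_lines(iterable, n=2):
    # Current more_itertools releases take grouper(iterable, n);
    # older releases took grouper(n, iterable).
    return ["\n".join(line for line in lines if line)
            for lines in grouper(iterable.split("\n"), n, fillvalue="")]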
Demo
s1 = "Line 1\nLine 2\nLine 3"
s2 = "Line 1\nLine 2\nLine 3\nLine4\nLine5"
group_lines(s1)
# ['Line 1\nLine 2', 'Line 3']
group_lines(s2)
# ['Line 1\nLine 2', 'Line 3\nLine4', 'Line5']
group_lines(s2, n=3)
# ['Line 1\nLine 2\nLine 3', 'Line4\nLine5']
Details
group_lines() splits the string into lines and then groups the lines by n via grouper.
list(grouper(s1.split("\n"), 2, fillvalue=""))
[('Line 1', 'Line 2'), ('Line 3', '')]
Finally, for each group of lines, only non-empty strings are rejoined with a newline character.
See more_itertools docs for more details on grouper.

I hope I understand your logic correctly. If you want a list of strings, each containing at most one newline delimiter, then the following code snippet will work:
# Newline-delimited string
a = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7"
# Resulting list
b = []
# First split the string into "1-line-long" pieces
a = a.split("\n")
for i in range(1, len(a), 2):
    # Then join the pieces by 2's and append to the resulting list
    b.append(a[i - 1] + "\n" + a[i])
    # Account for the possibility of an odd-sized list
    if i == len(a) - 2:
        b.append(a[i + 1])
print(b)
>>> ['Line 1\nLine 2', 'Line 3\nLine 4', 'Line 5\nLine 6', 'Line 7']
Although this solution isn't the fastest nor the best, it's easy to understand and it does not involve extra libraries.
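Another library-free option is to slice the list of lines in steps of n. This is only a sketch; chunk_lines is a name I made up for illustration. Slicing past the end of a list is safe in Python, so an odd final chunk needs no special case.

```python
def chunk_lines(text, n=2):
    """Split text into strings of at most n lines each."""
    lines = text.splitlines()
    # Step through the list n lines at a time; the last slice may be shorter.
    return ["\n".join(lines[i:i + n]) for i in range(0, len(lines), n)]


print(chunk_lines("Line 1\nLine 2\nLine 3"))
# ['Line 1\nLine 2', 'Line 3']
print(chunk_lines("Line 1\nLine 2\nLine 3\nLine 4\nLine 5", n=3))
# ['Line 1\nLine 2\nLine 3', 'Line 4\nLine 5']
```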

I wanted to post the grouper recipe from the itertools docs as well, but PyToolz' partition_all is actually a bit nicer.
from toolz import partition_all
s = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
result = ['\n'.join(tup) for tup in partition_all(2, s.splitlines())]
# ['Line 1\nLine 2', 'Line 3\nLine 4', 'Line 5']
Here's the grouper solution for the sake of completeness:
from itertools import zip_longest
# Recipe from the itertools docs.
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
# Group the lines, not the characters of s; a missing partner is None
result = ['\n'.join((a, b)) if b else a for a, b in grouper(s.splitlines(), 2)]

Use str.splitlines() to split the full input into lines:
>>> inputString = "Line 1\nLine 2\nLine 3"
>>> outputStringList = inputString.splitlines()
>>> print(outputStringList)
['Line 1', 'Line 2', 'Line 3']
Then, join the first lines to obtain the desired result:
>>> result = ['\n'.join(outputStringList[:-1])] + outputStringList[-1:]
>>> print(result)
['Line 1\nLine 2', 'Line 3']
Bonus: write a function that does the same for any number of desired lines:
def split_to_max_lines(inputStr, n):
    lines = inputStr.splitlines()
    # This defines which element in the list becomes the 2nd in the
    # final result. For n = 2, index = -1, for n = 4, index = -3, etc.
    split_index = -(n - 1)
    result = ['\n'.join(lines[:split_index])]
    result += lines[split_index:]
    return result
print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 2))
print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 4))
print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 5))
Returns:
['Line 1\nLine 2\nLine 3\nline 4\nLine 5', 'Line 6']
['Line 1\nLine 2\nLine 3', 'line 4', 'Line 5', 'Line 6']
['Line 1\nLine 2', 'Line 3', 'line 4', 'Line 5', 'Line 6']

I'm not sure what you mean by "a maximum of 2 lines" and how you'd hope to achieve that. However, splitting on newlines is fairly simple.
'Line 1\nLine 2\nLine 3'.split('\n')
This will result in:
['Line 1', 'Line 2', 'Line 3']
To get the weird allowance for "some" line splitting, you'll have to write your own logic for that.

b = "a\nb\nc\nd".split("\n", 3)
c = ["\n".join(b[:-1]), b[-1]]
print c
gives
['a\nb\nc', 'd']

Heapsort not working in Python for list of strings using heapq module

I was reading the python 2.7 documentation when I came across the heapq module. I was interested in the heapify() and the heappop() methods. So, I decided to write a simple heapsort program for integers:
from heapq import heapify, heappop
user_input = raw_input("Enter numbers to be sorted: ")
data = map (int, user_input.split(","))
new_data = []
for i in range(len(data)):
    heapify(data)
    new_data.append(heappop(data))
print new_data
This worked like a charm.
To make it more interesting, I thought I would take away the integer conversion and leave it as a string. Logically, it should make no difference and the code should work as it did for integers:
from heapq import heapify, heappop
user_input = raw_input("Enter numbers to be sorted: ")
data = user_input.split(",")
new_data = []
for i in range(len(data)):
    heapify(data)
    print data
    new_data.append(heappop(data))
print new_data
Note: I added a print statement in the for loop to see the heapified list.
Here's the output when I ran the script:
$ python heapsort.py
Enter numbers to be sorted: 4, 3, 1, 9, 6, 2
[' 1', ' 3', ' 2', ' 9', ' 6', '4']
[' 2', ' 3', '4', ' 9', ' 6']
[' 3', ' 6', '4', ' 9']
[' 6', ' 9', '4']
[' 9', '4']
['4']
[' 1', ' 2', ' 3', ' 6', ' 9', '4']
The reasoning I applied was that since the strings are being compared, the tree should be the same if they were numbers. As is evident, the heapify didn't work correctly after the third iteration. Could someone help me figure out if I am missing something here? I'm running Python 2.4.5 on RedHat 3.4.6-9.
Thanks,
VSN

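You should strip the spaces. They are the reason for this strange sorting. Strings are compared character by character using their ASCII codes, and the space character (ASCII 32) sorts before every digit (ASCII 48 to 57), so ' 9' compares as less than '4'.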
So try:
from heapq import heapify, heappop
user_input = raw_input("Enter numbers to be sorted: ")
data = user_input.split(",")
data = map(str.strip, data)
new_data = []
heapify(data)
for i in range(len(data)):
    print(data)
    new_data.append(heappop(data))
print(new_data)
Sorting
This question is rather about sorting than about heapq. And sorting itself in this context is essentially only about how < and <= works.
Sorting numbers works intuitively, but strings are different. Strings are usually compared character by character, using each character's numeric code (its ASCII code for plain ASCII text). This is the reason for the following behaviour:
>>> sorted("abcABC")
['A', 'B', 'C', 'a', 'b', 'c']
The ASCII code for A is 65 and for a it's 97 (see ASCII Table).
The comparison is done character by character. If a string a is a proper prefix of another string b, then a < b always holds.
>>> sorted(["123", "1", "2", "3", "12", "15", "20"])
['1', '12', '123', '15', '2', '20', '3']
What you want is called "natural sorting". See natsort for that.
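If installing natsort is not an option, a rough approximation can be written with the standard library. The sketch below is not natsort's implementation; natural_key is a hypothetical helper that splits each string into text and digit runs and compares the digit runs as integers.

```python
import re


def natural_key(s):
    # re.split with a capturing group returns alternating pieces:
    # even indices hold text, odd indices hold the captured digit runs.
    pieces = re.split(r'(\d+)', s)
    return [int(tok) if i % 2 else tok for i, tok in enumerate(pieces)]


values = ["123", "1", "2", "3", "12", "15", "20"]
print(sorted(values, key=natural_key))
# ['1', '2', '3', '12', '15', '20', '123']
```

Because text and numbers always sit at matching positions in the key, integers are only ever compared with integers and strings with strings.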

Get rid of all white spaces in between lines

So say i had some text like this:
line num 1
line num 2
line num 3
line num 4
I am trying to get rid of all the extra newlines between lines 2 and 3 and between lines 3 and 4, while keeping each line num on its own line. How would I accomplish this? I have already tried putting them into a list, then looping through it and taking out all of the lone '\n' entries.
ex:
obj=['line num 1','line num 2','\n','line num 3','\n','\n','line num4']
a=-1
for i in obj:
    a+=1
    if i=='\n':
        print('yes')
        del obj[a]
print(obj)
output:
['line num 1', 'line num 2', 'line num 3', '\n', 'line num4']
It catches some but not all.

In short: don't delete elements from a list while iterating over it.
Here you will find a lot of ways to do this: Remove items from a list while iterating
Note: this is probably the shortest (on Python 3, filter returns an iterator, so wrap it in list()):
list(filter(lambda x: x != '\n', obj))
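To see why the original loop misses some entries, here is a short sketch that reproduces the problem on a copy of the list and then shows one safe alternative. The variable names broken and fixed are mine.

```python
obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']

# Deleting during iteration: after each del, the following element slides
# into the current index, and the loop then steps past it unexamined.
broken = list(obj)
for index, item in enumerate(broken):
    if item == '\n':
        del broken[index]
print(broken)
# ['line num 1', 'line num 2', 'line num 3', '\n', 'line num4']

# Safe alternative: build the filtered list first, then assign it back
# through a slice so the same list object is updated in place.
fixed = list(obj)
fixed[:] = [item for item in fixed if item != '\n']
print(fixed)
# ['line num 1', 'line num 2', 'line num 3', 'line num4']
```

The skipped element is always the one directly after a deleted one, which is why two consecutive '\n' entries leave one behind.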

I'd simply use regex on the whole file content:
>>> s = """line num 1
line num 2
line num 3
line num 4"""
>>> import re
>>> print re.sub('\n+', '\n', s)
line num 1
line num 2
line num 3
line num 4
P.S. You should never change a list while iterating over it.

Maybe if not item.isspace() gives you something more readable:
>>> obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']
>>> [item for item in obj if not item.isspace()]
['line num 1', 'line num 2', 'line num 3', 'line num4']
>>>

def remove_new_line(obj):
    # Remove one "\n" entry per call and recurse until none are left
    if "\n" in obj:
        obj.remove("\n")
        remove_new_line(obj)
    return obj
obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']
print(remove_new_line(obj))

You can also try this:
import re
# Read the whole file, then split on runs of one or more newlines
with open("your file.txt", 'r') as f:
    values = f.read()
val = re.split(r"\n+", values)
print(val)
output = ['line num 1', 'line num 2', 'line num 3', 'line num 4']

How to find unique starts of strings?

If I have a list of strings (e.g. 'blah 1', 'blah 2', 'xyz fg', 'xyz penguin'), what would be the best way of finding the unique starts of strings ('xyz' and 'blah' in this case)? The starts of strings can be multiple words.

Your question is confusing, as it is not clear what you really want. So I'll give three answers and hope that one of them at least partially answers your question.
To get all unique prefixes of a given list of string, you can do:
>>> l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> set(s[:i] for s in l for i in range(len(s) + 1))
{'', 'xyz pe', 'xyz penguin', 'b', 'xyz fg', 'xyz peng', 'xyz pengui', 'bl', 'blah 2', 'blah 1', 'blah', 'xyz f', 'xy', 'xyz pengu', 'xyz p', 'x', 'blah ', 'xyz pen', 'bla', 'xyz', 'xyz '}
This code generates all initial slices of every string in the list and passes these to a set to remove duplicates.
To get all largest initial word sequences smaller than the full string, you could go with:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(s.rsplit(' ', 1)[0] for s in l)
{'a', 'a b', 'b'}
This code creates a set by splitting each string at its rightmost space, if there is one (otherwise the whole string is returned).
On the other hand, to get all unique initial word sequences without considering full strings, you could go for:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(' '.join(w[:i]) for s in l for w in (s.split(),) for i in range(len(w)))
{'', 'a', 'b', 'a b'}
This code splits each string at any whitespace and joins every initial slice of the resulting word list, except the full one. It has a pitfall: because the words are rejoined with single spaces, tabs and runs of whitespace are converted to single spaces. This may or may not be an issue in your case.

If you mean unique first words of strings (words being separated by space), this would be:
arr=['blah 1', 'blah 2', 'xyz fg','xyz penguin']
unique=list(set([x.split(' ')[0] for x in arr]))
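One thing to keep in mind is that a set has no defined order. If the first-seen order of the prefixes matters, a possible variation (a sketch, assuming Python 3.7+ where dicts keep insertion order) is to use dict.fromkeys instead of set:

```python
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']

# dict keys are unique and remember the order in which they were first added
unique = list(dict.fromkeys(s.split(' ', 1)[0] for s in arr))
print(unique)
# ['blah', 'xyz']
```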
