How to get correct output from regex.split()? - python

import re

number_with_both_parantheses = "(\(*([\d+\.]+)\))"

def process_numerals(text):
    k = re.split(number_with_both_parantheses, text)
    k = list(filter(None, k))
    for elem in k:
        print(elem)

INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
expected_output = ['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']
current_output = ['Statement 1', '(1)', '1', 'Statement 2', '(1.1)', '1.1', 'Statement 3']
My input is INPUT. I get current_output when I call process_numerals on that text. How do I get the expected output instead?

Your regex seems off. Do you realize that \(* matches zero or more left parentheses?
>>> import re
>>> INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
>>> re.split('\((\d+(?:\.\d+)?)\)', INPUT)
['Statement 1 ', '1', ' Statement 2 ', '1.1', ' Statement 3']
If you really want the literal parentheses to be included, put them inside the capturing parentheses.
The non-capturing parentheses (?:...) allow you to group without capturing. I guess that's what you are mainly looking for.
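Note that re.split returns the text of every capturing group in the pattern, so your nested inner group is what produces the duplicated bare numbers. For example, keeping a single capturing group and moving the literal parentheses inside it gets you most of the way there (a sketch; stripping the surrounding whitespace is an extra step not in the answer above):
>>> re.split(r'(\(\d+(?:\.\d+)?\))', INPUT)
['Statement 1 ', '(1)', ' Statement 2 ', '(1.1)', ' Statement 3']
>>> [part.strip() for part in re.split(r'(\(\d+(?:\.\d+)?\))', INPUT)]
['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']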

How to build regex for finding words that start with `\n` and letter and end with digit OR word?

Here's an example string; the spacing after a digit could vary.
product_list = 'Buy:\n Milk \nYoughurt 4 \nBread \nSausages 4 \nBanana '
I want to build a regexp with the following output:
import re
re.findall(r'some pattern', product_list)
['Milk', 'Youghurt 4', 'Bread', 'Sausages 4', 'Banana']
This is what I thought it should look like. However, it returns an empty list:
re.findall(r'\n(\w+\w$)', product_list)
I would suggest a non-regex approach (a regex seems like overkill here), if you can guarantee a similarly shaped input:
list(map(lambda x: x.strip(), product_list.split('\n')))[1:]
Code:
product_list = 'Buy:\n Milk \nYoughurt 4 \nBread \nSausages 4 \nBanana '
print(list(map(lambda x: x.strip(), product_list.split('\n')))[1:])
# ['Milk', 'Youghurt 4', 'Bread', 'Sausages 4', 'Banana']
The approach of the script below is to first strip off the leading term (Buy:\n in this case). Then we use re.findall with the following pattern to find all matches:
(.+?)\s*(?:\n|$)
This lazily captures anything up to optional trailing whitespace that is followed by a newline or the end of the string.
import re

product_list = 'Buy:\n Milk \nYoughurt 4 \nBread \nSausages 4 \nBanana '
product_list = re.sub(r'^[^\s]*\s+', '', product_list)
matches = re.findall(r'(.+?)\s*(?:\n|$)', product_list)
print(matches)
['Milk', 'Youghurt 4', 'Bread', 'Sausages 4', 'Banana']
This example can also be done without a regex: split on ':' and then on '\n'.
actual_list = 'Buy:\n Milk \nYoughurt 4 \nBread \nSausages 4 \nBanana '
product_list = actual_list.split(':')[1]
processed_list = [product.strip() for product in product_list.split('\n') if product.strip() != '']
print(processed_list)
#['Milk', 'Youghurt 4', 'Bread', 'Sausages 4', 'Banana']

Sorting a list by conditional criteria

I know how to sort a list in Python using the sort() method and an appropriate lambda rule. However, I don't know how to deal with the following situation:
I have a list of strings that either contain only letters or contain a specific keyword and a number. I want to sort the list so that the elements with the keyword come last, and those elements are themselves sorted by the number they contain.
e.g. my list could be: mylist = ['abc','xyz','keyword 2','def','keyword 1'] and I want it sorted to ['abc','def','xyz','keyword 1','keyword 2'].
I already have something like
mylist.sort(key=lambda x: x.split("keyword")[0],reverse=True)
which produces only
['xyz', 'def', 'abc', 'keyword 2', 'keyword 1']
One liner solution:
mylist.sort(key=lambda x: (len(x.split())>1, x if len(x.split())==1 else int(x.split()[-1]) ) )
Explanation:
The first element of the key, len(x.split()) > 1, makes sure that multi-word strings (the ones carrying numbers) go behind single-word strings. Because of that first element, ties can only occur between two single-word strings or between two multi-word strings, never between a single-word and a multi-word string. So for a multi-word string I return its trailing integer, and otherwise I return the string itself.
Example:
['xyz', 'keyword 1000', 'def', 'abc', 'keyword 2', 'keyword 1']
Results :
>>> mylist=['xyz', 'keyword 1000', 'def', 'abc', 'keyword 2', 'keyword 1']
>>> mylist.sort(key=lambda x: (len(x.split())>1, x if len(x.split())==1 else int(x.split()[-1]) ) )
>>> mylist
['abc', 'def', 'xyz', 'keyword 1', 'keyword 2', 'keyword 1000']
You can use the "last" element that doesn't contain your keyword as a barrier to sort first the words without the keyword and then the words with the keyword:
barrier = max(filter(lambda x: 'keyword' not in x, mylist))
# 'xyz'
mylist_barriered = [barrier + x if 'keyword' in x else x for x in mylist]
# ['abc', 'xyz', 'xyzkeyword 2', 'def', 'xyzkeyword 1']
res = sorted(mylist_barriered)
# ['abc', 'def', 'xyz', 'xyzkeyword 1', 'xyzkeyword 2']
# Be sure not to replace the barrier itself, `x != barrier`
res = [x.replace(barrier, '') if barrier in x and x != barrier else x for x in res]
res is now:
['abc', 'def', 'xyz', 'keyword 1', 'keyword 2']
The benefit of this non-hard-coded approach (aside from 'keyword' itself, obviously) is that your keyword can occur anywhere in the string and the method will still work. Try the above code with ['abc', 'def', '1 keyword 2', 'xyz', '1 keyword 4'], as in the sketch below, to see what I mean.
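For example, re-running the exact snippet above on such a list (a quick check; the result shown is what that code prints):
mylist = ['abc', 'def', '1 keyword 2', 'xyz', '1 keyword 4']
barrier = max(filter(lambda x: 'keyword' not in x, mylist))
# 'xyz'
mylist_barriered = [barrier + x if 'keyword' in x else x for x in mylist]
res = sorted(mylist_barriered)
res = [x.replace(barrier, '') if barrier in x and x != barrier else x for x in res]
print(res)
# ['abc', 'def', 'xyz', '1 keyword 2', '1 keyword 4']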
Another easy way to do this, with a divide-and-conquer approach:
precedes = [x for x in mylist if 'keyword' not in x]
sort_precedes = sorted(precedes)
follows = [x for x in mylist if 'keyword' in x]
sort_follows = sorted(follows)
together = sort_precedes + sort_follows
together
['abc', 'def', 'xyz', 'keyword 1', 'keyword 2']
Sort with a tuple key, first checking whether the item starts with the keyword. If it does, set the first item of the tuple to 1 and the second to the number following the keyword. For non-keyword items, set the first tuple item to 0 (so they always come before keyword items) and use the string itself as the second item for a lexicographical sort:
def func(x):
    if x.startswith('keyword'):
        return 1, int(x.split()[-1])
    return 0, x

mylist.sort(key=func)
print(mylist)
# ['abc', 'def', 'xyz', 'keyword 1', 'keyword 2']
I am prefixing the strings that do not contain "keyword" with a character that sorts before lowercase letters ('\127' is an octal escape for 'W'), so the keyword strings end up at the end when evaluated by the built-in sort function. https://repl.it/H66r/1
mylist.sort(key=lambda x: x if (x.find("keyword", 0) != -1) else '\127' + x)
EDIT:
This wasn't sorting the keyword strings according to their numbers.
Using the tuple approach we can come up with this: https://repl.it/H66r/8
The first value of the tuple is very low if the string doesn't contain "keyword", and the keyword's number otherwise, so the sort keeps all the non-keyword strings in front.
import sys
mylist.sort(key=lambda x: (-sys.maxsize, x) if x.find("keyword", 0) == -1 else (int(x.split(" ")[1]), x))
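With the question's example list this gives the expected order (a quick check; the output shown is what the code produces):
>>> import sys
>>> mylist = ['abc', 'xyz', 'keyword 2', 'def', 'keyword 1']
>>> mylist.sort(key=lambda x: (-sys.maxsize, x) if x.find("keyword", 0) == -1 else (int(x.split(" ")[1]), x))
>>> mylist
['abc', 'def', 'xyz', 'keyword 1', 'keyword 2']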

Parse line data until keyword with pyparsing

I'm trying to parse line data and then group it into lists.
Here is my script:
from pyparsing import *
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL
line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))
start.setDebug()
end.setDebug()
line.setDebug()
result = lines.parseString(data)
results_list = result.asList()
print(results_list)
This code was inspired by another stackoverflow question:
Matching nonempty lines with pyparsing
What I need is to parse everything from START to END line by line and save it to a list per group (everything from START to the matching END is one group). However, this script puts every line in a new group.
This is the result:
[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]
And I want it to be:
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
Also, it parses an empty string at the end.
I'm a pyparsing beginner, so I'm asking for your help.
Thanks
You could use a nestedExpr to find the text delimited by START and END.
If you use
In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]:
[[['line', '2', 'line', '3', 'line', '4']],
[['line', 'a', 'line', 'b', 'line', 'c']]]
then the text is split on whitespace. (Notice we have 'line', '2' above where we want 'line 2' instead). We'd rather it just split only on '\n'. So to fix this we can use the pp.nestedExpr function's content parameter which allows us to control what is considered an item inside the nested list.
The source code for nestedExpr defines
content = (Combine(OneOrMore(~ignoreExpr +
                             ~Literal(opener) + ~Literal(closer) +
                             CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS, exact=1))
           ).setParseAction(lambda t: t[0].strip()))
by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is
In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'
This is what causes nestedExpr to split on all whitespace.
So if we reduce that to simply '\n', then nestedExpr splits the content by
lines instead of by all whitespace.
import pyparsing as pp
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
opener = 'START'
closer = 'END'
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)
result = [item[0] for item in expr.searchString(data).asList()]
print(result)
yields
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
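Note that searchString wraps each match in its own one-element result (the same double nesting visible in the earlier Out[322] output), which is why the comprehension takes item[0] from every match. To see the raw structure (an extra check, not part of the original answer):
print(expr.searchString(data).asList())
# [[['line 2', 'line 3', 'line 4']], [['line a', 'line b', 'line c']]]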

Heapsort not working in Python for list of strings using heapq module

I was reading the python 2.7 documentation when I came across the heapq module. I was interested in the heapify() and the heappop() methods. So, I decided to write a simple heapsort program for integers:
from heapq import heapify, heappop

user_input = raw_input("Enter numbers to be sorted: ")
data = map(int, user_input.split(","))
new_data = []
for i in range(len(data)):
    heapify(data)
    new_data.append(heappop(data))
print new_data
This worked like a charm.
To make it more interesting, I thought I would take away the integer conversion and leave it as a string. Logically, it should make no difference and the code should work as it did for integers:
from heapq import heapify, heappop

user_input = raw_input("Enter numbers to be sorted: ")
data = user_input.split(",")
new_data = []
for i in range(len(data)):
    heapify(data)
    print data
    new_data.append(heappop(data))
print new_data
Note: I added a print statement in the for loop to see the heapified list.
Here's the output when I ran the script:
$ python heapsort.py
Enter numbers to be sorted: 4, 3, 1, 9, 6, 2
[' 1', ' 3', ' 2', ' 9', ' 6', '4']
[' 2', ' 3', '4', ' 9', ' 6']
[' 3', ' 6', '4', ' 9']
[' 6', ' 9', '4']
[' 9', '4']
['4']
[' 1', ' 2', ' 3', ' 6', ' 9', '4']
The reasoning I applied was that since the strings are being compared, the tree should be the same as if they were numbers. As is evident, the heapify didn't work correctly after the third iteration. Could someone help me figure out what I am missing here? I'm running Python 2.4.5 on RedHat 3.4.6-9.
Thanks,
VSN
You should strip the spaces; they are the reason for this strange ordering. Strings are compared character by character by their ASCII codes, and a space (32) sorts before every digit, which is why, for example, ' 9' < '4'.
So try:
from heapq import heapify, heappop

user_input = raw_input("Enter numbers to be sorted: ")
data = user_input.split(",")
data = map(str.strip, data)
new_data = []
heapify(data)
for i in range(len(data)):
    print(data)
    new_data.append(heappop(data))
print(new_data)
Sorting
This question is more about sorting than about heapq, and sorting in this context is essentially just a matter of how < and <= work.
Sorting numbers works intuitively, but strings are different. Usually strings are sorted character by character by the bitpattern they use. This is the reason for the following behaviour
>>> sorted("abcABC")
['A', 'B', 'C', 'a', 'b', 'c']
The ASCII code for A is 65 and for a it's 97 (see ASCII Table).
The sorting is done character by character. If a string a is a prefix of another string b, it's always a < b.
>>> sorted(["123", "1", "2", "3", "12", "15", "20"])
['1', '12', '123', '15', '2', '20', '3']
What you want is called "natural sorting". See natsort for that.
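For example (a minimal sketch, assuming the third-party natsort package is installed):
>>> from natsort import natsorted
>>> natsorted(["123", "1", "2", "3", "12", "15", "20"])
['1', '2', '3', '12', '15', '20', '123']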

How to find unique starts of strings?

If I have a list of strings (e.g. 'blah 1', 'blah 2', 'xyz fg', 'xyz penguin'), what would be the best way of finding the unique starts of the strings ('xyz' and 'blah' in this case)? The starts of the strings can be multiple words.
Your question is confusing, as it is not clear what you really want. So I'll give three answers and hope that one of them at least partially answers your question.
To get all unique prefixes of a given list of string, you can do:
>>> l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> set(s[:i] for s in l for i in range(len(s) + 1))
{'', 'xyz pe', 'xyz penguin', 'b', 'xyz fg', 'xyz peng', 'xyz pengui', 'bl', 'blah 2', 'blah 1', 'blah', 'xyz f', 'xy', 'xyz pengu', 'xyz p', 'x', 'blah ', 'xyz pen', 'bla', 'xyz', 'xyz '}
This code generates all initial slices of every string in the list and passes these to a set to remove duplicates.
To get, for each string, the largest initial word sequence that is shorter than the full string, you could go with:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(s.rsplit(' ', 1)[0] for s in l)
{'a', 'a b', 'b'}
This code creates a set by splitting each string at its rightmost space, if there is one (otherwise the whole string is returned).
On the other hand, to get all unique initial word sequences without considering full strings, you could go for:
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(' '.join(w[:i]) for s in l for w in (s.split(),) for i in range(len(w)))
{'', 'a', 'b', 'a b'}
This code splits each string at any whitespace and joins all initial slices of the resulting word list, except the longest one. It has a pitfall: it will e.g. convert tabs to spaces. This may or may not be an issue in your case.
If you mean unique first words of strings (words being separated by space), this would be:
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
unique = list(set([x.split(' ')[0] for x in arr]))
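For the list above this gives, for example (set order may vary):
print(unique)
# ['blah', 'xyz']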
