Why isn't this regex parsing the whole string?

Why isn't this regex parsing the whole string? - python

Writing a simple script to parse a large text file into words, their parent sentences, and some metadata (are they within a quote, etc.). Trying to get the regex to function properly and running into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The white space is intentional, but I can't understand why the last 'word' is not parsing. It is not preceded by any problematic characters (at least as far as I can tell using repr) and when I run parse() on just the problem 'word' it returns the expected array of single words and spaces.
Code:
def parse(new_line):
new_line = new_line.rstrip()
word_array = re.split('([\.\?\!\ ])',new_line,re.M)
print(word_array)
x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']

re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.

Related

Split a string in python with side by side delimiters

The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY dont get affected. Next, split each string along the & delimiter knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
# Constants
banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
# Loop through each list of strings
for i in range(len(lolon)):
# Loop through each delimiter and replace it with ' & '
for word in banned_list:
lolon[i] = lolon[i].replace(word, ' & ')
# Split the string along the ' & ' delimiter
lolon[i] = lolon[i].split('&')
return lolon
The problem is that often side by side delimiters get replaced in a way that leaves an empty string in the middle. Also certain combinations of delimiters dont get removed. This is because when the 'replace' method reads ' OR OR OR ', it will replace the first ' OR ' (since it matches) but wont replace the second because it reads it as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> ['Mario ', ' Luigi ', ' ', ' Peach'])
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> ['Mario ',' OR ', ' Luigi '], ['Testing', 'AND PlsWork]]
The work around to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '} forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!

This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub('(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
a lookbehind or lookahead requires a fixed-width pattern in Python, so something like (?<=\s|^) won't work, i.e. the OR at the end causes an empty string to be found at the end;
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.

Python splitting text with line breaks into a list

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
exerpt from text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. <the&lt I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) \
for k in content.split(" ")]
print(text_list)
This code is leaving in spaces and combining words in each item of the list like below
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the' 'I']
Please help me resolve this issue.
Thanks

Since it looks like you're parsing html text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. <the< I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'St Petersburg' likely got matched together because the character between the 't' and 'P' probably isn't a space, but a non-breaking space. If this were just html, I'd expect there to be or something of the sort, but it's possible that in your case there's some UTF non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
In case the &lt is not your mistake, but in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].

Here is what can can be done. You just need to replace any known code of syntax in advance
import re
# define some special syntax that want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg|)[; ]"
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip() \
# Here I remove the syntax before split them and substitue special char again
for k in re.sub(special_syntax, ' ', content).split(" ")]
# remove empty string from the list
filter_object = filter(lambda x: x != "", text_list)
list(filter_object)
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']

How to split at spaces and commas in Python?

I've been looking around here, but I didn't find anything that was close to my problem. I'm using Python3.
I want to split a string at every whitespace and at commas. Here is what I got now, but I am getting some weird output:
(Don't worry, the sentence is translated from German)
import re
sentence = "We eat, Granny"
split = re.split(r'(\s|\,)', sentence.strip())
print (split)
>>>['We', ' ', 'eat', ',', '', ' ', 'Granny']
What I actually want to have is:
>>>['We', ' ', 'eat', ',', ' ', 'Granny']

I'd go for findall instead of split and just match all the desired contents, like
import re
sentence = "We eat, Granny"
print(re.findall(r'\s|,|[^,\s]+', sentence))

This should work for you:
import re
sentence = "We eat, Granny"
split = list(filter(None, re.split(r'(\s|\,)', sentence.strip())))
print (split)

Alternate way:
import re
sentence = "We eat, Granny"
split = [a for a in re.split(r'(\s|\,)', sentence.strip()) if a]
Output:
['We', ' ', 'eat', ',', ' ', 'Granny']
Works with both python 2.7 and 3

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need a Python 2.7 code that can replace all full-stops \. with newline \n unless it's one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.

This should do the trick, here we use a list comprehension with a conditional statement to concatenate the words with a \n if they contain a full-stop, and are not in the list of key words. Otherwise just concatenate a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'] )
s = 'Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road
map for introduction of GST in India. The bill is set to pass.'
def split_at_period(input_str, keywords):
final = []
split_l = input_str.split(' ')
for word in split_l:
if '.' in word and word not in keywords:
final.append(word + '\n')
continue
final.append(word + ' ')
return ''.join(final).rstrip()
print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How it works?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr.
', 'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces, and newlines where they need to be! Now, we can rebuild the string. Notice however, that the the last element in the list also happens to contain a \n, we can clean that up with calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, I've included a new more robust solution below:
import re
def format_string(input_string, keywords):
regexes = '|'.join(keywords) # Combine all keywords into a regex.
split_list = re.split(regexes, input_string) # Split on keys.
removed = re.findall(regexes, input_string) # Find removed keys.
newly_joined = split_list + removed # Interleave removed and split.
newly_joined[::2] = split_list
newly_joined[1::2] = removed
space_regex = '\.\s*'
for index, section in enumerate(newly_joined):
if '.' in section and section not in removed:
newly_joined[index] = re.sub(space_regex, '.\n', section)
return ''.join(newly_joined).strip()

convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
s = m.group(1)
if s==".":
rval=".\n"
else:
rval = s
return rval
z = re.sub("("+l+")",do_repl,e)
# bonus: leading blanks should be stripped even that's not the question
z= re.sub(r"\s*\n\s*","\n",z,re.DOTALL)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.

Preserve whitespaces when using split() and join() in python

I have a data file with columns like
BBP1 0.000000 -0.150000 2.033000 0.00 -0.150 1.77
and the individual columns are separated by a varying number of whitespaces.
My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.
My approach would be reading in the lines as items of a list. And then I would use split() on those rows I am interested in, which will give me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.
The problem is that I have those varying amount of whitespaces. I don't know how to introduce them back in the same way I read them in. The only way I could think of is to count characters in the line before I split them, which would be very tedious. Does someone have a better idea to tackle this problem?

You want to use re.split() in that case, with a group:
re.split(r'(\s+)', line)
would return both the columns and the whitespace so you can rejoin the line later with the same amount of whitespace included.
Example:
>>> re.split(r'(\s+)', line)
['BBP1', ' ', '0.000000', ' ', '-0.150000', ' ', '2.033000', ' ', '0.00', ' ', '-0.150', ' ', '1.77']
You probably do want to remove the newline from the end.

Other way to do this is:
s = 'BBP1 0.000000 -0.150000 2.033000 0.00 -0.150 1.77'
s.split(' ')
>>> ['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']
If we specify space character argument in split function, it creates list without eating successive space characters. So, original numbers of space characters are restored after 'join' function.

For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+) to split at non-whitespace characters:
import re
line1 = ' 4 426.2 orange\n'
line2 = '12 82.1 apple\n'
re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1) # [' ', '4', ' ', '426.2', ' ', 'orange', '\n']
print(items2) # ['', '12', ' ', '82.1', ' ', 'apple', '\n']
These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace strings. These lines can be reconstituted using a join with a zero-length string:
print(repr(''.join(items1))) # ' 4 426.2 orange\n'
print(repr(''.join(items2))) # '12 82.1 apple\n'
To contrast the example with a similar pattern (\s+) (lower-case) used in the other answer here, each line splits with different result lengths and positions of the items:
re_s = re.compile(r'(\s+)')
print(re_s.split(line1)) # ['', ' ', '4', ' ', '20.0', ' ', 'orange', '\n', '']
print(re_s.split(line2)) # ['12', ' ', '82.1', ' ', 'apple', '\n', '']
As you can see, this would be a bit more difficult to process in a consistent manner.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why isn't this regex parsing the whole string? - python

re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.

Related

Split a string in python with side by side delimiters

Python splitting text with line breaks into a list

How to split at spaces and commas in Python?

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

Preserve whitespaces when using split() and join() in python

Categories

Resources