Split a string in python with side by side delimiters - python

The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY dont get affected. Next, split each string along the & delimiter knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
# Constants
banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
# Loop through each list of strings
for i in range(len(lolon)):
# Loop through each delimiter and replace it with ' & '
for word in banned_list:
lolon[i] = lolon[i].replace(word, ' & ')
# Split the string along the ' & ' delimiter
lolon[i] = lolon[i].split('&')
return lolon
The problem is that often side by side delimiters get replaced in a way that leaves an empty string in the middle. Also certain combinations of delimiters dont get removed. This is because when the 'replace' method reads ' OR OR OR ', it will replace the first ' OR ' (since it matches) but wont replace the second because it reads it as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> ['Mario ', ' Luigi ', ' ', ' Peach'])
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> ['Mario ',' OR ', ' Luigi '], ['Testing', 'AND PlsWork]]
The work around to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '} forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!

This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub('(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
a lookbehind or lookahead requires a fixed-width pattern in Python, so something like (?<=\s|^) won't work, i.e. the OR at the end causes an empty string to be found at the end;
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.

Related

Regex - Matching String values and Date together

I have a string that looks like the following and am supposed to extract the key : value and i am using Regex for the same.
line ="Date : 20/20/20 Date1 : 15/15/15 Name : Hello World Day : Month Weekday : Monday"
1) Extracting the key or attributes only.
re.findall(r'\w+\s?(?=:)',line)
#['Date ', 'Date1 ', 'Name ', 'Day ', 'Weekday ']
2)Extracting the dates only
re.findall(r'(?<=:)\s?\d{2}/\d{2}/\d{2}',line)
#[' 20/20/20', ' 15/15/15']
3)Extracting the strings perfectly but also some wrong format dates.
re.findall(r'(?<=:)\s?\w+\s?\w+',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
But when I try to use the OR operator to pull both the strings and dates I get wrong output. I believe the piping has not worked properly.
re.findall(r'(?<=:)\s?\w+\s?\w+|\s?\d{2}/\d{2}/\d{2}',line)
# [' 20', ' 15', ' Hello World', ' Month', ' Monday']
Any help on the above command to extract both the dates (dd/mm/yy) format and the string values will be highly appreciated.
You need to flip it around.
\s?\d{2}/\d{2}/\d{2}|(?<=:)\s?\w+\s?\w+
Live preview
Regex will first try and match the first part. If it succeeds it will not try the next part. The reason it then breaks is because \w results in the first number of the date being matched. Since / isn't a \w (word character) it stops at that point.
Flipping it around makes it first try matching the date. If it doesn't match then it tries matching an attribute. Thus avoiding the problem.

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need a Python 2.7 code that can replace all full-stops \. with newline \n unless it's one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
This should do the trick, here we use a list comprehension with a conditional statement to concatenate the words with a \n if they contain a full-stop, and are not in the list of key words. Otherwise just concatenate a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'] )
s = 'Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road
map for introduction of GST in India. The bill is set to pass.'
def split_at_period(input_str, keywords):
final = []
split_l = input_str.split(' ')
for word in split_l:
if '.' in word and word not in keywords:
final.append(word + '\n')
continue
final.append(word + ' ')
return ''.join(final).rstrip()
print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How it works?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr.
', 'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces, and newlines where they need to be! Now, we can rebuild the string. Notice however, that the the last element in the list also happens to contain a \n, we can clean that up with calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, I've included a new more robust solution below:
import re
def format_string(input_string, keywords):
regexes = '|'.join(keywords) # Combine all keywords into a regex.
split_list = re.split(regexes, input_string) # Split on keys.
removed = re.findall(regexes, input_string) # Find removed keys.
newly_joined = split_list + removed # Interleave removed and split.
newly_joined[::2] = split_list
newly_joined[1::2] = removed
space_regex = '\.\s*'
for index, section in enumerate(newly_joined):
if '.' in section and section not in removed:
newly_joined[index] = re.sub(space_regex, '.\n', section)
return ''.join(newly_joined).strip()
convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
s = m.group(1)
if s==".":
rval=".\n"
else:
rval = s
return rval
z = re.sub("("+l+")",do_repl,e)
# bonus: leading blanks should be stripped even that's not the question
z= re.sub(r"\s*\n\s*","\n",z,re.DOTALL)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.

Why isn't this regex parsing the whole string?

Writing a simple script to parse a large text file into words, their parent sentences, and some metadata (are they within a quote, etc.). Trying to get the regex to function properly and running into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The white space is intentional, but I can't understand why the last 'word' is not parsing. It is not preceded by any problematic characters (at least as far as I can tell using repr) and when I run parse() on just the problem 'word' it returns the expected array of single words and spaces.
Code:
def parse(new_line):
new_line = new_line.rstrip()
word_array = re.split('([\.\?\!\ ])',new_line,re.M)
print(word_array)
x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']
re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.

Python split string with space unless a \ is in front

Sorry if this post is a bit confusing to read this is my first post on this site and this is a hard question to ask, I have tried my best. I have also tried googling and i can not find anything.
I am trying to make my own command line like application in python and i would like to know how to split a string if a "\" is not in front of a space and to delete the backslash.
This is what i mean.
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I\\', 'Want\\', 'To\\', 'Break\\', 'Free']
When I split c with a space it keeps the backslash however it removes the space.
This is how I want it to be like:
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
If someone can help me that would be great!
Also if it needs Regular expressions could you please explain it more because I have never used them before.
Edit:
Now this has been solved i forgot to ask is there a way on how to detect if the backslash has been escaped or not too?
It looks like you're writing a commandline parser. If that's the case, may I recommend shlex.split? It properly splits a command string according to shell lexing rules, and handles escapes properly. Example:
>>> import shlex
>>> shlex.split('play song I\ Want\ To\ Break\ Free')
['play', 'song', 'I Want To Break Free']
Just split on the space, then replace any string ending with a backslash with with one ending in a space instead:
[s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
This is a list comprehension; c is split on spaces, and each resulting string is examined for a trailing \ backslash at the end; if so, the last character is removed and a space is added.
One slight disadvantage: if the original string ends with a backslash (no space), that last backslash is also replaced by a space.
Demo:
>>> c = r"play song I\ Want\ To\ Break\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
To handle escaped backslashes, you'd count the number of backslashes. An even number means the backslash is escaped:
[s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
for s in c.split(' ')]
Demo:
>>> c = r"play song I\ Want\ To\ Break\\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
... for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break\\\\', 'Free']

Preserve whitespaces when using split() and join() in python

I have a data file with columns like
BBP1 0.000000 -0.150000 2.033000 0.00 -0.150 1.77
and the individual columns are separated by a varying number of whitespaces.
My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.
My approach would be reading in the lines as items of a list. And then I would use split() on those rows I am interested in, which will give me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.
The problem is that I have those varying amount of whitespaces. I don't know how to introduce them back in the same way I read them in. The only way I could think of is to count characters in the line before I split them, which would be very tedious. Does someone have a better idea to tackle this problem?
You want to use re.split() in that case, with a group:
re.split(r'(\s+)', line)
would return both the columns and the whitespace so you can rejoin the line later with the same amount of whitespace included.
Example:
>>> re.split(r'(\s+)', line)
['BBP1', ' ', '0.000000', ' ', '-0.150000', ' ', '2.033000', ' ', '0.00', ' ', '-0.150', ' ', '1.77']
You probably do want to remove the newline from the end.
Other way to do this is:
s = 'BBP1 0.000000 -0.150000 2.033000 0.00 -0.150 1.77'
s.split(' ')
>>> ['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']
If we specify space character argument in split function, it creates list without eating successive space characters. So, original numbers of space characters are restored after 'join' function.
For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+) to split at non-whitespace characters:
import re
line1 = ' 4 426.2 orange\n'
line2 = '12 82.1 apple\n'
re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1) # [' ', '4', ' ', '426.2', ' ', 'orange', '\n']
print(items2) # ['', '12', ' ', '82.1', ' ', 'apple', '\n']
These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace strings. These lines can be reconstituted using a join with a zero-length string:
print(repr(''.join(items1))) # ' 4 426.2 orange\n'
print(repr(''.join(items2))) # '12 82.1 apple\n'
To contrast the example with a similar pattern (\s+) (lower-case) used in the other answer here, each line splits with different result lengths and positions of the items:
re_s = re.compile(r'(\s+)')
print(re_s.split(line1)) # ['', ' ', '4', ' ', '20.0', ' ', 'orange', '\n', '']
print(re_s.split(line2)) # ['12', ' ', '82.1', ' ', 'apple', '\n', '']
As you can see, this would be a bit more difficult to process in a consistent manner.

Categories

Resources