Python split string with space unless a \ is in front - python

Sorry if this post is a bit confusing to read this is my first post on this site and this is a hard question to ask, I have tried my best. I have also tried googling and i can not find anything.
I am trying to make my own command line like application in python and i would like to know how to split a string if a "\" is not in front of a space and to delete the backslash.
This is what i mean.
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I\\', 'Want\\', 'To\\', 'Break\\', 'Free']
When I split c with a space it keeps the backslash however it removes the space.
This is how I want it to be like:
>>> c = "play song I\ Want\ To\ Break\ Free"
>>> print c.split(" ")
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
If someone can help me that would be great!
Also if it needs Regular expressions could you please explain it more because I have never used them before.
Edit:
Now this has been solved i forgot to ask is there a way on how to detect if the backslash has been escaped or not too?

It looks like you're writing a commandline parser. If that's the case, may I recommend shlex.split? It properly splits a command string according to shell lexing rules, and handles escapes properly. Example:
>>> import shlex
>>> shlex.split('play song I\ Want\ To\ Break\ Free')
['play', 'song', 'I Want To Break Free']

Just split on the space, then replace any string ending with a backslash with with one ending in a space instead:
[s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
This is a list comprehension; c is split on spaces, and each resulting string is examined for a trailing \ backslash at the end; if so, the last character is removed and a space is added.
One slight disadvantage: if the original string ends with a backslash (no space), that last backslash is also replaced by a space.
Demo:
>>> c = r"play song I\ Want\ To\ Break\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') else s for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break ', 'Free']
To handle escaped backslashes, you'd count the number of backslashes. An even number means the backslash is escaped:
[s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
for s in c.split(' ')]
Demo:
>>> c = r"play song I\ Want\ To\ Break\\ Free"
>>> [s[:-1] + ' ' if s.endswith('\\') and (len(s) - len(s.rstrip('\\'))) % 2 == 1 else s
... for s in c.split(' ')]
['play', 'song', 'I ', 'Want ', 'To ', 'Break\\\\', 'Free']

Related

Eliminating a white spaces from a string except for end of the string

I want to eliminate white spaces in a string except for end of the string
code:
sentence = ['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
pattern = "\s+^[\s+$]"
res = [re.sub(pattern,', ', line) for line in sentence]
print(res)
But...
output is same input list.
['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
Can anyone suggest the right solution.
code:
sentence = ['He must be having a great time ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
pattern = "\s+^[\s+$]"
res = [re.sub(pattern,', ', line) for line in sentence]
print(res)
But...
output is same input list.
['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
expected output:
['He,must,be,having,a,great,time', 'It,is,fun,to,play,chess', 'Sometimes,TT,is,better,than,Badminton ']
We can first strip off leading/trailing whitespace, then do a basic replacement of space to comma:
import re
sentence = ['He must be having a great time\n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
output = [re.sub(r'\s+', ',', x.strip()) for x in sentence]
print(output)
This prints:
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
You can use a simpler split/join method (timeit: 1.48 µs ± 74 ns).
str.split() will split on groups of whitespace characters (space or newline for instance).
str.join(iter) will join the elements of iter with the str it is used on.
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[",".join(s.split()) for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
Second method, strip/replace (timeit: 1.56 µs ± 107 ns).
str.strip() removes all whitespace characters at the beginning and then end of str.
str.replace(old, new) replaces all occurences of old in str with new (works because you have single spaces between words in your strings).
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[s.strip().replace(" ", ",") for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
def eliminating_white_spaces(list):
for string in range(0,len(list)):
if ' ' in list[string] and string+1==len(list):
pass
else:
list[string]=str(list[string]).replace(' ',',')
return list

Split a string in python with side by side delimiters

The Question:
Given a list of strings create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY dont get affected. Next, split each string along the & delimiter knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
# Constants
banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
# Loop through each list of strings
for i in range(len(lolon)):
# Loop through each delimiter and replace it with ' & '
for word in banned_list:
lolon[i] = lolon[i].replace(word, ' & ')
# Split the string along the ' & ' delimiter
lolon[i] = lolon[i].split('&')
return lolon
The problem is that often side by side delimiters get replaced in a way that leaves an empty string in the middle. Also certain combinations of delimiters dont get removed. This is because when the 'replace' method reads ' OR OR OR ', it will replace the first ' OR ' (since it matches) but wont replace the second because it reads it as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> ['Mario ', ' Luigi ', ' ', ' Peach'])
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> ['Mario ',' OR ', ' Luigi '], ['Testing', 'AND PlsWork]]
The work around to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '} forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back to back delimiters don't get caught
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!
This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub('(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
a lookbehind or lookahead requires a fixed-width pattern in Python, so something like (?<=\s|^) won't work, i.e. the OR at the end causes an empty string to be found at the end;
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.

How to split at spaces and commas in Python?

I've been looking around here, but I didn't find anything that was close to my problem. I'm using Python3.
I want to split a string at every whitespace and at commas. Here is what I got now, but I am getting some weird output:
(Don't worry, the sentence is translated from German)
import re
sentence = "We eat, Granny"
split = re.split(r'(\s|\,)', sentence.strip())
print (split)
>>>['We', ' ', 'eat', ',', '', ' ', 'Granny']
What I actually want to have is:
>>>['We', ' ', 'eat', ',', ' ', 'Granny']
I'd go for findall instead of split and just match all the desired contents, like
import re
sentence = "We eat, Granny"
print(re.findall(r'\s|,|[^,\s]+', sentence))
This should work for you:
import re
sentence = "We eat, Granny"
split = list(filter(None, re.split(r'(\s|\,)', sentence.strip())))
print (split)
Alternate way:
import re
sentence = "We eat, Granny"
split = [a for a in re.split(r'(\s|\,)', sentence.strip()) if a]
Output:
['We', ' ', 'eat', ',', ' ', 'Granny']
Works with both python 2.7 and 3

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need a Python 2.7 code that can replace all full-stops \. with newline \n unless it's one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
This should do the trick, here we use a list comprehension with a conditional statement to concatenate the words with a \n if they contain a full-stop, and are not in the list of key words. Otherwise just concatenate a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'] )
s = 'Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road
map for introduction of GST in India. The bill is set to pass.'
def split_at_period(input_str, keywords):
final = []
split_l = input_str.split(' ')
for word in split_l:
if '.' in word and word not in keywords:
final.append(word + '\n')
continue
final.append(word + ' ')
return ''.join(final).rstrip()
print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How it works?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr.
', 'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces, and newlines where they need to be! Now, we can rebuild the string. Notice however, that the the last element in the list also happens to contain a \n, we can clean that up with calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, I've included a new more robust solution below:
import re
def format_string(input_string, keywords):
regexes = '|'.join(keywords) # Combine all keywords into a regex.
split_list = re.split(regexes, input_string) # Split on keys.
removed = re.findall(regexes, input_string) # Find removed keys.
newly_joined = split_list + removed # Interleave removed and split.
newly_joined[::2] = split_list
newly_joined[1::2] = removed
space_regex = '\.\s*'
for index, section in enumerate(newly_joined):
if '.' in section and section not in removed:
newly_joined[index] = re.sub(space_regex, '.\n', section)
return ''.join(newly_joined).strip()
convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
s = m.group(1)
if s==".":
rval=".\n"
else:
rval = s
return rval
z = re.sub("("+l+")",do_repl,e)
# bonus: leading blanks should be stripped even that's not the question
z= re.sub(r"\s*\n\s*","\n",z,re.DOTALL)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.

Why isn't this regex parsing the whole string?

Writing a simple script to parse a large text file into words, their parent sentences, and some metadata (are they within a quote, etc.). Trying to get the regex to function properly and running into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The white space is intentional, but I can't understand why the last 'word' is not parsing. It is not preceded by any problematic characters (at least as far as I can tell using repr) and when I run parse() on just the problem 'word' it returns the expected array of single words and spaces.
Code:
def parse(new_line):
new_line = new_line.rstrip()
word_array = re.split('([\.\?\!\ ])',new_line,re.M)
print(word_array)
x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']
re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.

Categories

Resources