Splitting a string in a special way - python

I have a string such as s = "3 (15 ounce) cans black beans". I want to split it into pieces around the parenthesized term, keeping the parentheses. The result should look like:
['3', '(15 ounce)', 'cans black beans']
How can I achieve this goal using a regular expression in Python?

Try using re.split() with [()] as the regular expression.
>>> import re
>>> s = "3 (15 ounce) cans black beans"
>>> re.split(r'[()]', s)
['3 ', '15 ounce', ' cans black beans']
>>>
>>> help(re.split)
EDIT:
To keep the parentheses, you could do the following:
>>> re.search(r'(.*)(\(.*\))(.*)', s).groups()
('3 ', '(15 ounce)', ' cans black beans')
>>>

OK, as Anubhava suggested, the solution is to use re.findall(r'\([^)]*\)|[^()]+', line):
line = '3 (15 ounce) cans black beans, drained and rinsed'
a = re.findall(r'\([^)]*\)|[^()]+', line)
print(a) gives
['3 ', '(15 ounce)', ' cans black beans, drained and rinsed']
This is exactly what I wanted. Thanks to everyone who tried to help me :)
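Note that the pieces returned by findall still carry the surrounding spaces ('3 ' and ' cans black beans'), while the question asked for '3' and 'cans black beans'. A minimal sketch of two ways to trim them follows; the variable names pieces and split_pieces are just illustrative, and this assumes the parentheses are never nested.

```python
import re

line = "3 (15 ounce) cans black beans"

# Option 1: findall as above, then strip whitespace and drop empty pieces.
pieces = [p.strip() for p in re.findall(r'\([^)]*\)|[^()]+', line)]
pieces = [p for p in pieces if p]
print(pieces)

# Option 2: re.split keeps the separator when it is wrapped in a capturing
# group; the \s* on both sides swallows the surrounding spaces.
split_pieces = [p for p in re.split(r'\s*(\([^)]*\))\s*', line) if p]
print(split_pieces)
```

Both print ['3', '(15 ounce)', 'cans black beans'] for this input.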

Related

Eliminating white spaces from a string except at the end of the string

I want to eliminate white spaces in a string except at the end of the string.
code:
sentence = ['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
pattern = "\s+^[\s+$]"
res = [re.sub(pattern,', ', line) for line in sentence]
print(res)
But the output is the same as the input list:
['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
Can anyone suggest the right solution?
expected output:
['He,must,be,having,a,great,time', 'It,is,fun,to,play,chess', 'Sometimes,TT,is,better,than,Badminton ']
We can first strip off leading/trailing whitespace, then do a basic replacement of space to comma:
import re
sentence = ['He must be having a great time\n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
output = [re.sub(r'\s+', ',', x.strip()) for x in sentence]
print(output)
This prints:
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
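As for why the pattern in the question changes nothing: \s+ has to consume at least one whitespace character, and only then does ^ demand the start of the string, which is impossible without the re.MULTILINE flag. In addition, [\s+$] is a character class (whitespace, a literal + or a literal $), not "whitespace at the end". A small sketch that checks this, using a shortened copy of the input list; the names broken, unchanged and fixed are only illustrative:

```python
import re

sentence = ['He must be having a great time\n ', 'It is fun to play chess ']

# Pattern from the question, written as a raw string.
broken = r"\s+^[\s+$]"

# Nothing matches, so re.sub returns every string untouched.
unchanged = [re.sub(broken, ', ', line) for line in sentence]
print(unchanged == sentence)

# Strip first, then turn each run of whitespace into a single comma.
fixed = [re.sub(r"\s+", ",", line.strip()) for line in sentence]
print(fixed)
```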
You can use a simpler split/join method (timeit: 1.48 µs ± 74 ns).
str.split() will split on groups of whitespace characters (space or newline for instance).
str.join(iter) will join the elements of iter with the str it is used on.
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[",".join(s.split()) for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
Second method, strip/replace (timeit: 1.56 µs ± 107 ns).
str.strip() removes all whitespace characters at the beginning and the end of str.
str.replace(old, new) replaces all occurrences of old in str with new (works because you have single spaces between words in your strings).
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[s.strip().replace(" ", ",") for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
def eliminating_white_spaces(items):
    # Replace spaces with commas in every element except the last one.
    for i in range(len(items)):
        if ' ' in items[i] and i + 1 == len(items):
            pass  # leave the last element untouched
        else:
            items[i] = str(items[i]).replace(' ', ',')
    return items

Python Regex Findall non-greedy

I am relatively new to regex and I seem to be struggling to understand the greedy vs non-greedy search (if that is indeed the issue here). Let's say I have a simple text such as this:
# numbers: 4 A 3 B
My goal would be to run a findall to get something like the following output:
['# numbers:', '4 A 3 B', ' 4 A', ' 3 B']
So if I use the following regex with findall, I would expect it to work:
matches = re.findall(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", "# numbers: 4 A 3 B")
However, the actual output is this:
[('# numbers:', ' 4 A 3 B', ' 3 B')]
Can someone explain why the group ((?:\s\d)(?:\s\D)) only matches ' 3 B' and not also ' 4 A'? I assume it has something to do with the greedy vs. non-greedy behavior of *. Is that true? And if so, could you explain how to solve this issue?
Thanks in advance!
I would use re.findall here, twice. First, extract the digit/non digit text series, then use re.findall a second time to find the tuples:
import re
inp = "# numbers: 4 A 3 B"
text = re.findall(r'^# numbers:\s+(.*)$', inp)[0]
matches = re.findall(r'(\d+)\s+(\D+)', text)
print(matches) # [('4', 'A '), ('3', 'B')]
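To address the "why" part of the question: this is not really about greedy vs. non-greedy matching. When a capturing group sits inside a repeated part of the pattern, each repetition overwrites the previous capture, so only the last one is reported. A minimal sketch that shows this with the original pattern and then recovers every pair with a second pass (the names m and pairs are illustrative):

```python
import re

text = "# numbers: 4 A 3 B"

# Pattern from the question: group 3 sits inside the repeated part, so each
# repetition overwrites it and only the last one (' 3 B') survives.
m = re.match(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", text)
print(m.groups())

# Second pass over the full repeated span (group 2) to recover every pair.
pairs = re.findall(r"\s\d\s\D", m.group(2))
print(pairs)
```

The first print shows ('# numbers:', ' 4 A 3 B', ' 3 B') and the second shows [' 4 A', ' 3 B'].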

Remove numbers from list but not those in a string

I have a list of strings as follows:
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]
Use lookarounds to check if digits are not enclosed with letters or digits or underscores:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for l in list_1:
    print(re.sub(r'(?<!\w)\d+(?!\w)', '', l))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach
Regex demo
One approach would be to use try and except:
def is_intable(x):
    try:
        int(x)
        return True
    except ValueError:
        return False

[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Here is a working example.
Some Python code may look like this:
import re
for item in list_1:
    new_item = re.sub(r'\b(\d+)\b', ' ', item)
    print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
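One possible way to handle the spaces, as a sketch rather than the definitive answer: make the trailing whitespace optional with \s? and replace the whole match with an empty string. This assumes numbers are separated from the next word by a single space; the name cleaned is illustrative.

```python
import re

list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
          '2 days ago I saw it', 'every day is a blessing',
          ' 345000 people have eaten here at the beach']

# \b\d+\b matches only standalone numbers; \s? also removes one following
# whitespace character so no double space is left behind.
cleaned = [re.sub(r'\b\d+\b\s?', '', item) for item in list_1]

for item in cleaned:
    print(repr(item))
```

For this input the result matches the list requested in the question, including the leading space of the last string.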
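You can use str.isdigit() for a shorter way to do it. Note that isinstance(word, int) would not work here, because every item returned by split() is a str, so that check is always False and nothing gets removed. You could try something like this:
[' '.join([word for word in expression.split() if not word.isdigit()]) for expression in list_1]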
>>>['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]

Split a python string by particular identifications [duplicate]

I am trying to split a python string when a particular character appears.
For example:
mystring="I want to eat an apple. \n 12345 \n 12 34 56"
The output I want is a list like:
[["I want to eat an apple"], [12345], [12, 34, 56]]
>>> mystring.split(" \n ")
['I want to eat an apple.', '12345', '12 34 56']
If you specifically want each string inside its own list:
>>> [[s] for s in mystring.split(" \n ")]
[['I want to eat an apple.'], ['12345'], ['12 34 56']]
mystring = "I want to eat an apple. \n 12345 \n 12 34 56"
# split on newlines and strip each line, in case not all separators have the exact format ' \n '
split_list = [line.strip() for line in mystring.split('\n')]  # use [line.strip()] instead to make each element a list
print(split_list)
Output:
['I want to eat an apple.', '12345', '12 34 56']
Use split(), strip() and re for this question.
First split the string on newlines, then strip each piece, then extract the numbers from each piece with re. If a piece contains more than one number, replace it with the list of numbers:
import re
mystring="I want to eat an apple. \n 12345 \n 12 34 56"
l = [i.strip() for i in mystring.split("\n")]
for idx, i in enumerate(l):
    if len(re.findall(r'\d+', i)) > 1:
        l[idx] = re.findall(r'\d+', i)
print(l)
#['I want to eat an apple.', '12345', ['12', '34', '56']]
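If the goal is the nested structure from the question, with every line in its own list and the numbers converted to int, a sketch along these lines may help. The helper parse_line is hypothetical, and it assumes a line counts as numeric only when every whitespace-separated token consists of digits. Unlike the output shown in the question, it keeps the trailing period of the sentence.

```python
mystring = "I want to eat an apple. \n 12345 \n 12 34 56"

def parse_line(line):
    """Return a list of ints for an all-digit line, else [stripped text]."""
    tokens = line.split()
    if tokens and all(t.isdigit() for t in tokens):
        return [int(t) for t in tokens]
    return [line.strip()]

result = [parse_line(line) for line in mystring.split("\n")]
print(result)
```

This prints [['I want to eat an apple.'], [12345], [12, 34, 56]].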

Python - Don't Understand The Returned Results of This Concatenated Regex Pattern

I am a Python newb trying to get more understanding of regex. Just when I think I got a good grasp of the basics, something throws me - such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe joined list of nouns' + 'whitespace' - how come:
' eggs' was found with a space before it and not after it?
'bacon' was found with no spaces either side of it?
'donkey' was found with no spaces either side of it and the fact there is no whitespace after it?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joined list of nouns, so the first \s is part of the eggs option, the bacon and donkey options have no spaces, and the dog option includes the final \s metacharacter.
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
Demo:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.
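To see the three variants side by side, here is a small sketch comparing the ungrouped pattern, the non-capturing group and a capturing group. The variable names are illustrative, and re.escape is added only as a precaution in case a noun ever contains regex metacharacters.

```python
import re

text = "Some nouns like eggs egg bacon what a lovely donkey"
noun_list = ['eggs', 'bacon', 'donkey', 'dog']
alternation = '|'.join(re.escape(noun) for noun in noun_list)

# No group: | splits the whole pattern into '\seggs', 'bacon', 'donkey', 'dog\s'.
ungrouped = re.findall(r'\s' + alternation + r'\s', text)

# Non-capturing group: \s applies on both sides of every noun, and findall
# returns the whole match, spaces included.
non_capturing = re.findall(r'\s(?:' + alternation + r')\s', text)

# Capturing group: same matches, but findall returns only the group contents.
capturing = re.findall(r'\s(' + alternation + r')\s', text)

print(ungrouped)
print(non_capturing)
print(capturing)
```

The three lines printed are [' eggs', 'bacon', 'donkey'], [' eggs ', ' bacon '] and ['eggs', 'bacon']. In the grouped variants 'donkey' is not found because no whitespace follows it at the end of the string.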
