Split a python string by particular identifications [duplicate] - python

I am trying to split a python string when a particular character appears.
For example:
mystring="I want to eat an apple. \n 12345 \n 12 34 56"
The output I want is a nested list:
[["I want to eat an apple"], [12345], [12, 34, 56]]

>>> mystring.split(" \n ")
['I want to eat an apple.', '12345', '12 34 56']
If you specifically want each string inside its own list:
>>> [[s] for s in mystring.split(" \n ")]
[['I want to eat an apple.'], ['12345'], ['12 34 56']]

mystring = "I want to eat an apple. \n 12345 \n 12 34 56"
# split on newlines and strip each line, in case not every separator has the exact form ' \n '
split_list = [line.strip() for line in mystring.split('\n')]  # use [line.strip()] instead to make each element its own list
print(split_list)
Output:
['I want to eat an apple.', '12345', '12 34 56']

Use split(), strip() and re for this.
First split the string on newlines and strip each piece, then extract the numbers from each piece with re. If a piece contains more than one number, replace it with the list of those numbers.
import re
mystring = "I want to eat an apple. \n 12345 \n 12 34 56"
l = [i.strip() for i in mystring.split("\n")]
for idx, i in enumerate(l):
    numbers = re.findall(r'\d+', i)
    # only replace items that hold more than one number
    if len(numbers) > 1:
        l[idx] = numbers
print(l)
# ['I want to eat an apple.', '12345', ['12', '34', '56']]
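None of the answers above produces the nested shape with integers that the question asks for. A minimal sketch of one way to get it, assuming a line counts as numeric only when every whitespace-separated token is made of digits (the helper name split_lines is hypothetical):

```python
def split_lines(text):
    """Split text on newlines; numeric lines become lists of ints,
    any other line becomes a one-element list holding the stripped text."""
    result = []
    for line in text.split("\n"):
        tokens = line.split()
        # assumption: a line is "numeric" only if all of its tokens are digits
        if tokens and all(tok.isdigit() for tok in tokens):
            result.append([int(tok) for tok in tokens])
        else:
            result.append([line.strip()])
    return result


mystring = "I want to eat an apple. \n 12345 \n 12 34 56"
print(split_lines(mystring))
# [['I want to eat an apple.'], [12345], [12, 34, 56]]
```

Note that the trailing period of the sentence is kept, since only whitespace is stripped.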

Related

Change whitespace to underscore at specific positions

I have string like this:
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
What I want is to replace the whitespace inside each cat breed name with an underscore, while keeping the whitespace between .jpg and the first word of the breed, and the whitespace around the numbers.
Expected output:
['pic1.jpg siberian_cat 24 25', 'pic2.jpg siemese_cat 14 32', 'pic3.jpg american_bobtail cat 8 13', 'pic4.jpg cat 9 1']
I tried to construct patterns as follows:
[re.sub(r'(?<!jpg\s)([a-z])\s([a-z])\s([a-z])', r'\1_\2_\3', x) for x in strings ]
However, it adds an underscore between .jpg and the next word.
The problem is that "cat" is not always put at the end of the word combination.
Here is one approach using re.sub with a callback function:
import re
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
# the callback replaces every space inside the matched breed name with an underscore
output = [re.sub(r'(?<!\S)\w+(?: \w+)* cat\b', lambda m: m.group().replace(' ', '_'), x) for x in strings]
print(output)
This prints:
['pic1.jpg siberian_cat 24 25',
'pic2.jpg siemese_cat 14 32',
'pic3.jpg american_bobtail_cat 8 13',
'pic4.jpg cat 9 1']
Here is an explanation of the regex pattern used:
(?<!\S)     assert that what precedes the first word is either whitespace or the start of the string
\w+         match a word, which is then followed by
(?: \w+)*   a space and another word, zero or more times
[ ]         match a single space (written as a literal space in the pattern)
cat\b       followed by 'cat' and a word boundary
In other words, taking the third list element as an example, the regex pattern matches american bobtail cat, then replaces all spaces by underscore in the lambda callback function.
Try this:
[re.sub(r'jpg\s((\S+\s)+)cat', "jpg " + "_".join(x.split('jpg')[1].split('cat')[0].strip().split()) + "_cat", x) for x in strings ]
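If every string really has the layout shown in the question (a file name, then the breed words, then exactly two numbers), a regex is not strictly needed. A minimal sketch under that assumption, with a hypothetical helper name join_breed:

```python
def join_breed(entry):
    """Join the breed words with underscores.

    Assumption: the entry is '<file name> <breed words...> <number> <number>'.
    """
    parts = entry.split()
    filename, breed, numbers = parts[0], parts[1:-2], parts[-2:]
    return ' '.join([filename, '_'.join(breed)] + numbers)


strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32',
           'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
print([join_breed(s) for s in strings])
# ['pic1.jpg siberian_cat 24 25', 'pic2.jpg siemese_cat 14 32',
#  'pic3.jpg american_bobtail_cat 8 13', 'pic4.jpg cat 9 1']
```

Unlike the regex answers, this does not depend on the word "cat" appearing at all, but it breaks if the number of trailing numeric fields varies.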

Remove numbers from list but not those in a string

I have a list of strings as follows:
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
I want to remove 3, but not 5th or 5x35omega44. All the solutions I have searched for and tried end up removing numbers in an alphanumeric string, but I want those to remain as is. I want my list to look as follows:
list_1 = ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing',
' people have eaten here at the beach']
I am trying the following:
[' '.join(s for s in words.split() if not any(c.isdigit() for c in s)) for words in list_1]
Use lookarounds to check if digits are not enclosed with letters or digits or underscores:
import re
list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
'2 days ago I saw it', 'every day is a blessing',
' 345000 people have eaten here at the beach']
for l in list_1:
    print(re.sub(r'(?<!\w)\d+(?!\w)', '', l))
Output:
what are you guys doing there on 5th avenue
my password is 5x35omega44
days ago I saw it
every day is a blessing
people have eaten here at the beach
One approach would be to use try and except:
def is_intable(x):
    # True if the word can be parsed as an integer, False otherwise
    try:
        int(x)
        return True
    except ValueError:
        return False
[' '.join([word for word in sentence.split() if not is_intable(word)]) for sentence in list_1]
It sounds like you should be using regex. This will match numbers separated by word boundaries:
\b(\d+)\b
Some Python code may look like this:
import re
for item in list_1:
    new_item = re.sub(r'\b(\d+)\b', ' ', item)
    print(new_item)
I am not sure what the best way to handle spaces would be for your project. You may want to put \s at the end of the expression, making it \b(\d+)\b\s or you may wish to handle this some other way.
You can use the str.isdigit() method to get a shorter way to do it (isinstance(word, int) would not work here, because every item returned by split() is a str). You could try something like this:
[' '.join([word for word in expression.split() if not word.isdigit()]) for expression in list_1]
>>>['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
Combining the very helpful regex solutions provided, in a list comprehension format that I wanted, I was able to arrive at the following:
[' '.join([re.sub(r'\b(\d+)\b', '', item) for item in expression.split()]) for expression in list_1]
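The regex substitutions above leave a doubled or leading space wherever a number was removed. A minimal sketch that combines the lookaround pattern from the first answer with a split/join pass to normalize whitespace (the helper name drop_standalone_numbers is hypothetical, and collapsing all whitespace runs is an assumption about what is wanted):

```python
import re

# digits that are not touching a letter, digit or underscore on either side
STANDALONE_NUMBER = re.compile(r'(?<!\w)\d+(?!\w)')


def drop_standalone_numbers(sentence):
    """Remove standalone numbers, then collapse the leftover whitespace."""
    without_numbers = STANDALONE_NUMBER.sub('', sentence)
    # split() with no argument drops leading, trailing and repeated whitespace
    return ' '.join(without_numbers.split())


list_1 = ['what are you 3 guys doing there on 5th avenue', 'my password is 5x35omega44',
          '2 days ago I saw it', 'every day is a blessing',
          ' 345000 people have eaten here at the beach']
print([drop_standalone_numbers(s) for s in list_1])
# ['what are you guys doing there on 5th avenue', 'my password is 5x35omega44',
#  'days ago I saw it', 'every day is a blessing', 'people have eaten here at the beach']
```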

Splitting a string in a special way

I have a string like str = "3 (15 ounce) cans black beans". I want to split it into several pieces, split by the parenthesis term. The result should be like:
['3', '(15 ounce)', 'cans black beans'] keeping the parenthesis.
How can I achieve this goal using a regular expression in Python?
Try using re.split() with [()] as the regular expression.
>>> import re
>>> s = "3 (15 ounce) cans black beans"
>>> re.split(r'[()]', s)
['3 ', '15 ounce', ' cans black beans']
>>>
>>> help(re.split)
EDIT:
To keep the parentheses, you could do the following:
>>> re.search(r'(.*)(\(.*\))(.*)', s).groups()
('3 ', '(15 ounce)', ' cans black beans')
>>>
OK, as Anubhava suggested, the solution is to use re.findall(r'\([^)]*\)|[^()]+', line):
line = '3 (15 ounce) cans black beans, drained and rinsed'
a = re.findall(r'\([^)]*\)|[^()]+', line)
print(a) gives:
['3 ', '(15 ounce)', ' cans black beans, drained and rinsed']
This is exactly what I wanted. Thanks to those who tried to help me :)
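The results above still carry the spaces around the parenthesized term, while the question asks for '3' and 'cans black beans' without them. A minimal sketch using re.split with a capturing group, which keeps the delimiter in the result, followed by a strip of each piece (the helper name split_on_parens is hypothetical, and it assumes the parentheses are not nested):

```python
import re


def split_on_parens(text):
    """Split around '(...)' terms, keeping them, and strip each piece."""
    # the capturing group makes re.split return the parenthesized term too
    pieces = re.split(r'(\([^)]*\))', text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_on_parens("3 (15 ounce) cans black beans"))
# ['3', '(15 ounce)', 'cans black beans']
```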

python regex add space whenever a number is adjacent to a non-number

I am trying to separate non-numbers from numbers in a Python string. Numbers can include floats.
Examples
Original String    Desired String
'4x5x6'            '4 x 5 x 6'
'7.2volt'          '7.2 volt'
'60BTU'            '60 BTU'
'20v'              '20 v'
'4*5'              '4 * 5'
'24in'             '24 in'
Here is a very good thread on how to achieve just that in PHP:
Regex: Add space if letter is adjacent to a number
I would like to manipulate the strings above in Python.
The following piece of code works for the first example, but not for the others:
new_element = []
result = [re.split(r'(\d+)', s) for s in (unit)]
for elements in result:
    for element in elements:
        if element != '':
            new_element.append(element)
    new_element = ' '.join(new_element)
    break
Easy! Replace each number with itself surrounded by spaces, using a regex backreference. Don't forget to strip the surrounding whitespace.
Please try this code:
import re
the_str = "4x5x6"
print(re.sub(r"([0-9]+(\.[0-9]+)?)", r" \1 ", the_str).strip())  # \1 refers to the first capture group
I used split, like you did, but modified it like this:
>>> tcs = ['123', 'abc', '4x5x6', '7.2volt', '60BTU', '20v', '4*5', '24in', 'google.com-1.2', '1.2.3']
>>> pattern = r'(-?[0-9]+\.?[0-9]*)'
>>> for test in tcs: print(repr(test), repr(' '.join(segment for segment in re.split(pattern, test) if segment)))
'123' '123'
'abc' 'abc'
'4x5x6' '4 x 5 x 6'
'7.2volt' '7.2 volt'
'60BTU' '60 BTU'
'20v' '20 v'
'4*5' '4 * 5'
'24in' '24 in'
'google.com-1.2' 'google.com -1.2'
'1.2.3' '1.2 . 3'
Seems to have the desired behavior.
Note that you have to remove empty strings from the beginning/end of the array before joining the string. See this question for an explanation.
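Another option is to insert a space at every boundary between a digit and a non-digit using zero-width lookarounds, which avoids the empty-string cleanup. This is a minimal sketch, assuming a period next to a digit belongs to the number and that no space should be added next to existing whitespace (the function name space_numbers is hypothetical):

```python
import re

# boundary with a digit on the left and a non-digit on the right, or the reverse;
# '.' and whitespace are excluded so floats and existing spaces are left alone
BOUNDARY = re.compile(r'(?<=\d)(?=[^\d.\s])|(?<=[^\d.\s])(?=\d)')


def space_numbers(text):
    """Insert a single space wherever a number touches a non-number."""
    return BOUNDARY.sub(' ', text)


for sample in ['4x5x6', '7.2volt', '60BTU', '20v', '4*5', '24in']:
    print(repr(sample), repr(space_numbers(sample)))
```

Because '.' is never treated as a boundary, inputs such as '1.2.3' are left unchanged by this sketch, which differs from the split-based answer above.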

Writing an If condition to filter out the first word

I have a string:
Father ate a banana and slept on a feather
Part of my code shown below:
...
if word.endswith(('ther')):
    print(word)
This prints feather and also Father
But I want to modify this if condition so that it does not apply to the first word of the sentence. The result should only print feather.
I tried adding an and condition, but it didn't work:
...
if word.endswith(('ther')) and word[0].endswith(('ther')):
    print(word)
This doesn't work. HELP
You can use a slice to skip the first word and apply endswith() to the rest of the words, like:
s = 'Father ate a banana and slept on a feather'
[w for w in s.split()[1:] if w.endswith('ther')]
You can build a regex and drop the first match (this works here only because the first word itself ends in 'ther'):
import re
re.findall(r'(\w*ther)',s)[1:]
['feather']
If I understand your question correctly, you don't want it to print the word if it's the first word in the string. So, you could copy the string and drop the first word.
I'll walk you through it. Say you have this string:
s = "Father ate a banana and slept on a feather"
You can split it by running s.split() and catching that output:
['Father', 'ate', 'a', 'banana', 'and', 'slept', 'on', 'a', 'feather']
So if you want all the words, except the first, you can use the index [1:]. And you can combine the list of words by joining with a space.
s1 = "Father ate a banana and slept on a feather"
s2 = " ".join(s1.split()[1:])
The string s2 will now be the following:
ate a banana and slept on a feather
You can use that string and iterate over the words like you did above.
If you want to avoid making a temporary string:
[w for i, w in enumerate(s.split()) if w.endswith('ther') and i]
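The same idea can be written as an explicit loop, which is closer to the if condition in the question. A minimal sketch, assuming words are separated by whitespace (the function name words_ending_with is hypothetical):

```python
def words_ending_with(sentence, suffix):
    """Return the words that end with suffix, ignoring the first word."""
    matches = []
    for index, word in enumerate(sentence.split()):
        # index 0 is the first word of the sentence, so skip it
        if index > 0 and word.endswith(suffix):
            matches.append(word)
    return matches


s = 'Father ate a banana and slept on a feather'
print(words_ending_with(s, 'ther'))
# ['feather']
```

The original attempt failed because word[0] is the first character of the current word, not the first word of the sentence, so the position has to come from the loop index instead.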
