My goal is to split the string below only on double white spaces. See the example string below and an attempt using the regular split function.
My attempt:
>>> _str = 'The lorry ran  into the mad man  before turning over'
>>> _str.split()
['The', 'lorry', 'ran', 'into', 'the', 'mad', 'man', 'before', 'turning', 'over']
Ideal result:
['The lorry ran', 'into the mad man', 'before turning over']
Any suggestions on how to arrive at the ideal result? Thanks.
split accepts a separator argument; pass it the double space:
>>> _str = 'The lorry ran  into the mad man  before turning over'
>>> _str.split('  ')
['The lorry ran', 'into the mad man', 'before turning over']
From the docs:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']). The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
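A quick sketch of the two behaviors the docs describe (maxsplit, and consecutive delimiters yielding empty strings), using made-up example strings:

```python
# maxsplit limits the number of splits performed
s = 'a b c d'
print(s.split(' ', 2))      # ['a', 'b', 'c d']

# With an explicit sep, consecutive delimiters are NOT grouped,
# so empty strings appear in the result
print('1,,2'.split(','))    # ['1', '', '2']

# With no sep at all, runs of whitespace are grouped and stripped
print('  a   b  '.split())  # ['a', 'b']
```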
split takes a separator argument. Just pass '  ' (two spaces) to it:
>>> _str = 'The lorry ran  into the mad man  before turning over'
>>> _str.split('  ')
['The lorry ran', 'into the mad man', 'before turning over']
>>>
Give your split() a double space as an argument.
>>> _str = 'The lorry ran  into the mad man  before turning over'
>>> _str.split("  ")
['The lorry ran', 'into the mad man', 'before turning over']
>>>
Use the re module:
>>> import re
>>> example = 'The lorry ran  into the mad man  before turning over'
>>> re.split(r'\s{2}', example)
['The lorry ran', 'into the mad man', 'before turning over']
Since you need to split on 2 or more spaces, you can do:
>>> import re
>>> _str = 'The lorry ran  into the mad man  before turning over'
>>> re.split(r"\s{2,}", _str)
['The lorry ran', 'into the mad man', 'before turning over']
>>> _str = 'The lorry ran   into the mad man    before turning over'
>>> re.split(r"\s{2,}", _str)
['The lorry ran', 'into the mad man', 'before turning over']
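One caveat worth knowing: \s matches any whitespace character, not just spaces, so a pattern like \s{2,} also splits on double tabs or a tab-plus-space. A small sketch (the example string is made up):

```python
import re

text = 'The lorry ran  into the mad man\t\tbefore turning over'

# \s{2,} matches any run of 2+ whitespace chars, so the double tab splits too
print(re.split(r'\s{2,}', text))

# To split only on runs of two or more literal spaces, use ' {2,}' instead
print(re.split(r' {2,}', text))
```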
I'm trying to tokenize the text below with stopwords ('is', 'the', 'was') as delimiters.
The expected output is this:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I am using to try to produce the above output:
import nltk

stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
My code output is this:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I make the expected output?
So the problem involves both stopwords and line delimiters. Assuming that we can define a line boundary by the symbol ., you can split on multiple delimiters at once with re.split().
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
Because we split on both a bare . and . followed by a whitespace, the result contains an additional trailing ''. Assuming that this sentence structure is consistent, you can slice the result to get your expected output.
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
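If you'd rather not rely on the sentence structure, a variation on the same re.split() call is to filter out whatever empty strings the split leaves behind, instead of slicing:

```python
import re

s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

# Split on the stopword delimiters and sentence-ending periods,
# then drop the empty strings produced by trailing/adjacent delimiters
parts = [p for p in re.split(r" was | is | the |\. |\.", s) if p]
print(parts)
```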
I have a list of strings which needs to be transformed into a smaller list of strings, depending on whether two consecutive elements belong to the same phrase.
At the moment this happens if the last character of the i-th string is lowercase and the first character of the (i+1)-th string is also lowercase, but more complex conditions should be checked in the future.
For example this very profound text:
['I am a boy',
'and like to play'
'My friends also'
'like to play'
'Cats and dogs are '
'nice pets, and'
'we like to play with them'
]
should become:
['I am a boy and like to play',
'My friends also like to play',
'Cats and dogs are nice pets, and we like to play with them'
]
My python solution
I think the data you have posted is comma separated. If it is, please find below a simple loop solution.
data=['I am a boy',
'and like to play',
'My friends also',
'like to play',
'Cats and dogs are ',
'nice pets, and',
'we like to play with them'
]
required_list = []
for j, i in enumerate(data):
    print(i, j)
    if j == 0:
        req = i
    else:
        if i[0].isupper():
            required_list.append(req)
            req = i
        else:
            req = req + " " + i
required_list.append(req)
print(required_list)
Here is your code, check it:
data = ['I am a boy',
'and like to play'
'My friends also'
'like to play'
'Cats and dogs are '
'nice pets, and'
'we like to play with them'
]
joined_string = ",".join(data).replace(',',' ')
import re
values = re.findall('[A-Z][^A-Z]*', joined_string)
print(values)
Since you want to do it recursively, you can try something like this:
def join_text(text, new_text):
    if not text:
        return
    if not new_text:
        new_text.append(text.pop(0))
        return join_text(text, new_text)
    phrase = text.pop(0)
    if phrase[0].islower():  # you can add more complicated logic here
        new_text[-1] += ' ' + phrase
    else:
        new_text.append(phrase)
    return join_text(text, new_text)
phrases = [
'I am a boy',
'and like to play',
'My friends also',
'like to play',
'Cats and dogs are ',
'nice pets, and',
'we like to play with them'
]
joined_phrases = []
join_text(phrases, joined_phrases)
print(joined_phrases)
My solution has some problems with whitespace, but I hope you get the idea.
Hope it helps!
I've been looking around here, but I didn't find anything that was close to my problem. I'm using Python3.
I want to split a string at every whitespace and at commas. Here is what I have now, but I am getting some weird output:
(Don't worry, the sentence is translated from German)
import re
sentence = "We eat, Granny"
split = re.split(r'(\s|\,)', sentence.strip())
print (split)
>>>['We', ' ', 'eat', ',', '', ' ', 'Granny']
What I actually want to have is:
>>>['We', ' ', 'eat', ',', ' ', 'Granny']
I'd go for findall instead of split and just match all the desired contents, like
import re
sentence = "We eat, Granny"
print(re.findall(r'\s|,|[^,\s]+', sentence))
This should work for you:
import re
sentence = "We eat, Granny"
split = list(filter(None, re.split(r'(\s|\,)', sentence.strip())))
print (split)
Alternate way:
import re
sentence = "We eat, Granny"
split = [a for a in re.split(r'(\s|\,)', sentence.strip()) if a]
Output:
['We', ' ', 'eat', ',', ' ', 'Granny']
Works with both python 2.7 and 3
I'm trying to split a string in Python using a regex pattern but it's not working correctly.
Example text:
"The quick {brown fox} jumped over the {lazy} dog"
Code:
"The quick {brown fox} jumped over the {lazy} dog".split(r'({.*?})')
I'm using a capture group so that the split delimiters are retained in the array.
Desired result:
['The quick', '{brown fox}', 'jumped over the', '{lazy}', 'dog']
Actual result:
['The quick {brown fox} jumped over the {lazy} dog']
As you can see there is clearly not a match as it doesn't split the string. Can anyone let me know where I'm going wrong? Thanks.
You're calling the string's split method, not re's:
>>> re.split(r'({.*?})', "The quick {brown fox} jumped over the {lazy} dog")
['The quick ', '{brown fox}', ' jumped over the ', '{lazy}', ' dog']
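Note that re.split keeps the spaces around the text pieces ('The quick ' rather than 'The quick'). If you want exactly the desired result from the question, one way is to strip each piece afterwards:

```python
import re

text = "The quick {brown fox} jumped over the {lazy} dog"

# The capture group keeps the {...} delimiters in the result;
# strip() removes the leftover spaces around the text pieces
parts = [p.strip() for p in re.split(r'({.*?})', text)]
print(parts)
```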
park = "a park.shp"
road = "the roads.shp"
school = "a school.shp"
train = "the train"
bus = "the bus.shp"
mall = "a mall"
ferry = "the ferry"
viaduct = "a viaduct"
dataList = [park, road, school, train, bus, mall, ferry, viaduct]
print dataList
for a in dataList:
    print a
    #if a.endswith(".shp"):
    #    dataList.remove(a)
print dataList
gives the following output (so the loop is working and reading everything correctly):
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
a park.shp
the roads.shp
a school.shp
the train
the bus.shp
a mall
the ferry
a viaduct
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
but when I remove the # marks to run the if statement, which should remove the strings ending in .shp, the string road remains in the list:
['a park.shp', 'the roads.shp', 'a school.shp', 'the train', 'the bus.shp', 'a mall', 'the ferry', 'a viaduct']
a park.shp
a school.shp
the bus.shp
the ferry
a viaduct
['the roads.shp', 'the train', 'a mall', 'the ferry', 'a viaduct']
Something else I noticed: it doesn't print all the strings, even though it's clearly a for loop that should go through each one. Can someone please explain what's going wrong, and why the loop keeps the string road but finds the other strings ending with .shp and removes them correctly?
Thanks,
C
(FYI, this is on Python 2.6.6, because of Arc 10.0)
You are mutating the list and causing the index to skip.
Use a list comprehension like this:
[d for d in dataList if not d.endswith('.shp')]
and then get:
['the train', 'a mall', 'the ferry', 'a viaduct']
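To see the skipping concretely: when you remove the current element, everything after it shifts left one position, and the loop's internal index then steps past the shifted-in neighbor. A small sketch with a made-up list (print() syntax works in both Python 2 and 3 here):

```python
data = ['a.shp', 'b.shp', 'c', 'd.shp']
for item in data:
    if item.endswith('.shp'):
        data.remove(item)

# Removing 'a.shp' shifts 'b.shp' into index 0, which the loop
# never revisits, so 'b.shp' survives
print(data)  # ['b.shp', 'c']
```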
Removing items from the same list you're iterating over almost always causes problems. Make a copy of the original list and iterate over that instead; that way you don't skip anything.
for a in dataList[:]:  # Iterate over a copy of the list
    print a
    if a.endswith(".shp"):
        dataList.remove(a)  # Remove items from the original, not the copy
Of course, if this loop has no purpose other than creating a list with no .shp files, you can just use one list comprehension and skip the whole mess.
no_shp_files = [a for a in dataList if not a.endswith('.shp')]