How to replace a spaced comma in a string? - python

I have the following string, which I want to split into words:
text = "If it is hot , don’t touch"
What I've tried so far:
import string
text = "If it is hot , don’t touch"
words = [word.replace(',', '') for word in text.split()]
print(words)
However, I get the following result:
['If', 'it', 'is', 'hot', '', 'don’t', 'touch']
What I want as a result:
['If', 'it', 'is', 'hot', 'don’t', 'touch']

text = "If it is hot , don’t touch"
newtext = text.replace(",", "")
words = newtext.split()
print(words)

Instead of replacing ',' with '', you can filter the words:
keep only the words that are not equal to ','
like this:
words = [word for word in text.split() if word!=',']
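A small sketch of why the original attempt leaves an empty string behind, and two ways around it. The variable names are only illustrative; the second option is an assumption about what you might want if a comma is attached to a word (as in "hot,").

```python
text = "If it is hot , don’t touch"

# split() yields the lone "," as its own token; replacing inside that
# token turns it into an empty string instead of removing it.
tokens = text.split()
replaced = [word.replace(',', '') for word in tokens]

# Option 1: drop the comma token itself.
filtered = [word for word in tokens if word != ',']

# Option 2: strip commas from every token, then drop tokens that became empty.
# Unlike option 1, this also cleans attached commas such as "hot,".
stripped = [w for w in (word.replace(',', '') for word in tokens) if w]

print(replaced)
print(filtered)
print(stripped)
```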

Related

Search for a word in a list

I want to search for the existence of the word hi.
import re
word = 'hi?'
cleanString = re.sub('\W+',' ', word)
print(cleanString.lower())
GREETING_INPUTS = ("hello", 'hi', 'hii', "hey")
if cleanString.lower() in GREETING_INPUTS:
    print('yes')
else:
    print('no')
When word = 'hi', it prints yes. But for word = 'hi?', it prints no. Why is that? Please suggest a solution.
Replace this line:
cleanString = re.sub('\W+',' ', word)
With:
cleanString = re.sub('\W+','', word)
Because you're replacing every match of '\W+' with a space ' ', the string becomes 'hi ', which is not in the tuple. Replace the matches with the empty string '' instead, and the string becomes 'hi'.
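A minimal sketch of the fix wrapped in a function. The name is_greeting is hypothetical, not from the question.

```python
import re

GREETING_INPUTS = ("hello", "hi", "hii", "hey")

def is_greeting(word):
    # Delete every run of non-word characters instead of turning it into a space.
    clean = re.sub(r'\W+', '', word).lower()
    return clean in GREETING_INPUTS

print(is_greeting('hi'))     # True
print(is_greeting('hi?'))    # True
print(is_greeting('Hey!!'))  # True
print(is_greeting('bye'))    # False
```

If you would rather keep the space replacement, calling .strip() on the result before the lookup should also work for a single word.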

Python split string exactly on one space. if double space make " word" not "word"

I have the following string.
words = "this is a  book and i  like it"
What I want is that when I split it on a single space, I get the following:
wordList = words.split(" ")
print(wordList)
# wanted: ['this','is','a',' book','and','i',' like','it']
A plain words.split(" ") splits the string, but in case of a double space I lose both spaces and end up with 'book' and 'like'. What I need is ' book' and ' like', keeping the extra spaces intact in the split output for double, triple, ... n spaces.
You can split on a space that is preceded by a non-whitespace character, using the look-behind (?<=...) syntax:
import re
re.split("(?<=\\S) ", words)
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
Or similarly, use negative look behind:
re.split("(?<!\\s) ", words)
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
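The two look-behind patterns agree on runs of spaces inside the string, but they differ when the string starts with a space. A small sketch with made-up inputs:

```python
import re

# A run of three spaces: only the first space is preceded by a non-space.
triple_pos = re.split(r"(?<=\S) ", "a   b")
triple_neg = re.split(r"(?<!\s) ", "a   b")

# A leading space: the negative look-behind also succeeds at position 0,
# because there is no character (hence no whitespace) before it.
lead_pos = re.split(r"(?<=\S) ", " a b")
lead_neg = re.split(r"(?<!\s) ", " a b")

print(triple_pos)  # ['a', '  b']
print(triple_neg)  # ['a', '  b']
print(lead_pos)    # [' a', 'b']
print(lead_neg)    # ['', 'a', 'b']
```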
Just another regex solution: if you need to split with a single left-most whitespace char, use \s? to match one or zero whitespaces, and then capture 0+ remaining whitespaces and the subsequent non-whitespace chars.
One very important step: run rstrip on the input string before running the regex to remove all the trailing whitespace, since otherwise, its performance will decrease greatly.
import re
words = "this is a  book and i  like it"
print(re.findall(r'\s?(\s*\S+)', words.rstrip()))
# => ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
re.findall returns just the captured substrings, and since there is only one capturing group, the result is a list of those captures.
Pattern details:
\s? - 1 or 0 (due to ? quantifier) whitespaces
(\s*\S+) - Capturing group #1 matching
\s* - zero or more (due to the * quantifier) whitespace
\S+ - 1 or more (due to + quantifier) non-whitespace symbols.
If you don't feel like using a regex and want to keep something close to your own code, you could use something like this:
words = "this is a  book and i  like it"
wordList = words.split(" ")
# An empty string marks an extra space: prepend it to the next item.
for i in range(len(wordList) - 1):
    if wordList[i] == '':
        wordList[i+1] = ' ' + wordList[i+1]
wordList = [x for x in wordList if x != '']
print(wordList)
# Outputs: ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']
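The loop above is written with a double space in mind. A sketch of the same idea for any run of spaces, without regex; the function name split_on_single_space is hypothetical, and dropping trailing spaces is an assumption about the desired behaviour.

```python
def split_on_single_space(words):
    result = []
    pending = ''  # extra spaces waiting to be attached to the next word
    for part in words.split(' '):
        if part == '':
            # Each empty string from split(' ') stands for one extra space.
            pending += ' '
        else:
            result.append(pending + part)
            pending = ''
    return result  # spaces after the last word are dropped

print(split_on_single_space("this is a  book and i  like it"))
print(split_on_single_space("a   b"))
```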
An alternative using a list comprehension:
word_list = iter(words.split(" "))
["".join([" ", next(word_list)]) if not w else w for w in word_list]
# ['this', 'is', 'a', ' book', 'and', 'i', ' like', 'it']

how to split a text file into multiple list based on whitespacing in python?

Hi, I'm new to Python programming. Please help me create a function that takes a text file as an argument and creates a list of words, removing all punctuation, where the list "splits" on double spaces. What I mean is that the function should create a sublist at every double-space occurrence within the text file.
This is my function:
def tokenize(document):
    file = open("document.txt","r+").read()
    print(re.findall(r'\w+', file))
Input text file has a string as follows:
What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!
Note: There's a double spacing after warden? and before His
My function gives me an output like this
['what','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']
Desired output :
[['what','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]
First split the whole text on double spaces and then pass each item to regex as:
>>> import re
>>> text = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = text.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
...     res.append(re.findall(r'\w+', sen))
...
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Here's a reasonable all-RE's approach:
import re

def tokenize(document):
    # document is the path of the text file to read
    with open(document) as f:
        text = f.read()
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]
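To try the same logic without a file on disk, here is a sketch that works on a string. The helper name tokenize_text is hypothetical; the strip() call is an added assumption so that leading or trailing whitespace does not produce an empty block.

```python
import re

def tokenize_text(text):
    # Split into blocks wherever two or more whitespace characters meet,
    # then pull the word characters out of each block.
    blocks = re.split(r'\s\s+', text.strip())
    return [re.findall(r'\w+', b) for b in blocks]

sample = ("What's did the little boy tell the game warden?  "
          "His dad was in the kitchen poaching eggs!")
result = tokenize_text(sample)
print(result)
```

Note that r'\s\s+' matches any two whitespace characters, so a blank line or a space followed by a newline also starts a new block.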
The built-in split method can also split on a multi-character separator such as a double space.
This:
a = "hello world.  How are you?"
b = a.split('  ')
c = [x.split(' ') for x in b]
Yields:
c = [['hello', 'world.'], ['How', 'are', 'you?']]
If you want to remove the punctuation too, apply regex to elements in 'b' or to 'x' in the third statement.
First split the text on punctuation, then in a second pass split the resulting strings on whitespace.
import re

def splitByPunct(s):
    return (x.group(0) for x in re.finditer(r'[^\.\,\?\!]+', s) if x and x.group(0))

[x.split() for x in splitByPunct("some string, another string! The phrase")]
This yields
[['some', 'string'], ['another', 'string'], ['The', 'phrase']]

Convert a list of string sentences to words

I'm essentially trying to take a list of strings containing sentences such as:
sentence = ['Here is an example of what I am working with', 'But I need to change the format', 'to something more useable']
and convert it into the following:
word_list = ['Here', 'is', 'an', 'example', 'of', 'what', 'I', 'am',
'working', 'with', 'But', 'I', 'need', 'to', 'change', 'the', 'format',
'to', 'something', 'more', 'useable']
I tried using this:
for item in sentence:
    for word in item:
        word_list.append(word)
I thought it would take each string and append each item of that string to word_list, however the output is something along the lines of:
word_list = ['H', 'e', 'r', 'e', ' ', 'i', 's' .....etc]
I know I am making a stupid mistake but I can't figure out why, can anyone help?
You need str.split() to split each string into words:
word_list = [word for line in sentence for word in line.split()]
Just .split and .join:
word_list = ' '.join(sentence).split(' ')
You haven't told it how to distinguish a word. By default, iterating through a string simply iterates through the characters.
You can use .split(' ') to split a string by spaces. So this would work:
for item in sentence:
    for word in item.split(' '):
        word_list.append(word)
Or with split() and no argument, which splits on any run of whitespace:
for item in sentence:
    for word in item.split():
        word_list.append(word)
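A short sketch of the difference between iterating a string directly, split(' ') and split(). The sample sentences are made up, and one of them contains a double space on purpose.

```python
sentence = ['Here is an  example', 'to something more useable']

# Iterating a string yields its characters, not its words.
chars = [ch for item in sentence for ch in item]

# split(' ') cuts at every single space, so a double space leaves an empty string.
words_single_space = [w for item in sentence for w in item.split(' ')]

# split() with no argument treats any run of whitespace as one separator.
words_any_whitespace = [w for item in sentence for w in item.split()]

print(chars[:4])
print(words_single_space)
print(words_any_whitespace)
```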
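Split each sentence into words (sentence is a list here, so call rsplit() on each string rather than on the list; without arguments it behaves like split()):
print([word for s in sentence for word in s.rsplit()])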

Converting a String to a List of Words?

I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert to something like this :
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
I think this is the simplest way for anyone else stumbling on this post given the late response:
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
Try this:
import re
mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
How it works:
From the docs :
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case, the pattern [^\w] matches any character that is not a word character.
\w matches letters, digits and the underscore. For ASCII-only text (or with the re.ASCII flag) it is equal to the character set
[a-zA-Z0-9_]
that is, a to z, A to Z, 0 to 9 and underscore. In Python 3, \w in a str pattern also matches non-ASCII letters and digits by default.
We replace every non-word character with a space, and then split() splits the string on whitespace and returns a list.
So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
Let me know if any doubts come up.
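Whether \w is limited to [a-zA-Z0-9_] depends on the flags. A small sketch on Python 3 with a made-up sample string (written with escapes so the accented letters are unambiguous):

```python
import re

text = 'caf\u00e9 na\u00efve, r\u00e9sum\u00e9!'  # "café naïve, résumé!"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.sub(r'[^\w]', ' ', text).split()

# With re.ASCII, \w is exactly [a-zA-Z0-9_], so accented letters act as separators.
ascii_words = re.sub(r'[^\w]', ' ', text, flags=re.ASCII).split()

print(unicode_words)
print(ascii_words)
```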
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
The most simple way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
Using string.punctuation for completeness:
import re
import string
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()  # s is the input string
This handles newlines as well.
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that both string and list are names of builtin types, so you probably don't want to use those as your variable names.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string
def extract_words(s):
    return [re.sub('^[{0}]+|[{0}]+$'.format(string.punctuation), '', w) for w in s.split()]
>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']
>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
Personally, I think this is slightly cleaner than the answers provided:
import re

def split_to_words(sentence):
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))  # Use sentence.lower(), if needed
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
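One possible shape for such a pattern, as a sketch only: it treats an apostrophe (straight or curly) or a hyphen as part of a word when it sits between alphanumeric characters. The names WORD_RE and find_words are hypothetical, and restricting words to ASCII letters and digits is an assumption made to keep the example short.

```python
import re

# One or more alphanumerics, optionally followed by groups of
# (apostrophe or hyphen) + alphanumerics, e.g. "I'm", "custom-built".
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")

def find_words(text):
    return WORD_RE.findall(text)

print(find_words('''I'm a custom-built sentence with "tricky" words!'''))
print(find_words("If it is hot , don’t touch"))
```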
list=mystr.split(" ",mystr.count(" "))
This way you eliminate every special char outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
def split_string(string):
    return string.split()
This function will return the list of words of a given string.
In this case, if we call the function as follows,
string = 'This is a string, with words!'
split_string(string)
The return output of the function would be
['This', 'is', 'a', 'string,', 'with', 'words!']
This is from my attempt on a coding challenge that can't use regex,
outputList = "".join((c if c.isalnum() or c=="'" else ' ') for c in inputStr).split()
The role of apostrophe seems interesting.
Probably not very elegant, but at least you know what's going on.
my_str = "Simple sample, test! is, olny".lower()
my_lst = []
temp = ""
len_my_str = len(my_str)
list_words_number = 0
for number_letter_in_data in range(0, len_my_str, 1):
    if my_str[number_letter_in_data] in [',', '.', '!', '(', ')', ':', ';', '-']:
        pass  # skip punctuation
    elif my_str[number_letter_in_data] in [' ']:
        # keep only words longer than 3 characters
        if len(temp) > 3:
            list_words_number += 1
            my_lst.append(temp)
        temp = ""  # reset after every word, kept or not
    else:
        temp = temp + my_str[number_letter_in_data]
# the last word has no trailing space, so handle it after the loop
if len(temp) > 3:
    my_lst.append(temp)
print(my_lst)
You can try and do this:
tryTrans = str.maketrans(",!", "  ")  # Python 3; both arguments must have the same length
text = "This is a string, with words!"
text = text.translate(tryTrans)
listOfWords = text.split()
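The same translate idea can be extended to every ASCII punctuation character by building the table from string.punctuation. A sketch for Python 3; the function name words_without_punctuation is hypothetical.

```python
import string

# Map each ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def words_without_punctuation(text):
    return text.translate(PUNCT_TO_SPACE).split()

print(words_without_punctuation("This is a string, with words!"))
print(words_without_punctuation("I'm custom-built"))
```

Because apostrophes and hyphens are punctuation too, "I'm" comes out as two tokens; whether that is acceptable depends on the use case.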
