I have a list such as this:
list = ['Hi', ',', 'my', 'name', 'is', 'Bob', '!']
I wanted to convert this to a string, and I originally found on Stack Overflow that .join() could be used. So I did:
x = ' '.join(list)
print(x)
which prints
"Hi , my name is Bob !"
when what I want printed is:
"Hi, my name is Bob!"
How do I avoid adding spaces before periods and exclamation points? I want a more general solution, so that I can, for example, read in a text file as a list and convert it to a string.
Thanks!
To solve this in the general case, use NLTK's "moses" detokenizer:
In [1]: l = ["Hi", ",", "my", "name", "is", "Bob", "!"]
In [2]: from nltk.tokenize.moses import MosesDetokenizer
In [3]: detokenizer = MosesDetokenizer()
In [4]: detokenizer.detokenize(l, return_str=True)
Out[4]: u'Hi, my name is Bob!'
The detokenizer is not yet part of a stable nltk release. To use it now, install nltk directly from GitHub.
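If your NLTK version no longer ships the moses detokenizer (it was later split out into the separate sacremoses package), the TreebankWordDetokenizer from the stable NLTK release should give a similar result; a minimal sketch, assuming a recent NLTK:

from nltk.tokenize.treebank import TreebankWordDetokenizer

l = ["Hi", ",", "my", "name", "is", "Bob", "!"]
print(TreebankWordDetokenizer().detokenize(l))  # should print: Hi, my name is Bob!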
How about this, using a simple regex?
import re
list = ['Hi', ',', 'my', 'name', 'is', 'Bob', '!']
x = re.sub(r' (\W)', r'\1', ' '.join(list))
print(x)
Hi, my name is Bob!
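One caveat: \W matches any non-word character, so this also removes the space before opening quotes or parentheses. If that matters, a variant restricted to sentence punctuation (the exact character set here is my own assumption) is safer:

import re

tokens = ['Hi', ',', 'my', 'name', 'is', 'Bob', '!']
# collapse the space only before closing punctuation marks
x = re.sub(r' ([,.!?;:])', r'\1', ' '.join(tokens))
print(x)  # Hi, my name is Bob!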
A little different solution:
>>> from string import punctuation
>>> lis = ["Hi", ",", "my", "name", "is", "Bob", "!"]
>>> string = ''
>>> for i, x in enumerate(lis):
...     if x not in punctuation and i != 0:
...         string += ' ' + x
...     elif x not in punctuation and i == 0:
...         string += x
...     else:
...         string += x
...
>>> print(string)
Hi, my name is Bob!
I was trying to create a program that removes all sorts of punctuation from a given input sentence. The code looked somewhat like this
from string import punctuation
sent = str(input())
def rempunc(string):
    for i in string:
        word = ''
        list = [0]
        if i in punctuation:
            x = string.index(i)
            word += string[list[-1]:x] + ' '
            list.append(x)
    list_2 = word.split(' ')
    return list_2
print(rempunc(sent))
However, for the input
This state ment has # 1 ! punc.
the output comes out as follows:
['This', 'state', 'ment', 'has', '#', '1', '!', 'punc', '']
Why isn't the punctuation being removed entirely? Am I missing something in the code?
I tried changing x to x-1 in the slicing line (word += string[list[-1]:x] + ' '), but it did not help. Now I'm stuck and don't know what else to try.
Repeated string slicing isn't necessary here, and it is also where the code goes wrong: word and list are reset on every character, and string.index(i) always returns the first occurrence of i, so the slice boundaries don't track your actual position in the string.
I would suggest using filter() to filter out the undesired characters for each word, and then reading that result into a list comprehension. From there, you can use a second filter() operation to remove the empty strings:
from string import punctuation

def remove_punctuation(s):
    cleaned_words = [''.join(filter(lambda x: x not in punctuation, word))
                     for word in s.split()]
    return list(filter(lambda x: x != "", cleaned_words))

print(remove_punctuation(input()))
This outputs:
['This', 'state', 'ment', 'has', '1', 'punc']
I want an efficient way to split a list of strings using a list of words as the delimiters. The output is another list of strings.
I tried chaining multiple .split calls in a single line, which does not work because the first .split returns a list and the subsequent .split calls require a string.
Here is the input:
words = ["hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ['my', 'is']
I want the output to be
final_list = ["hello ", " name ", " jolloopp", " jolloopp name ", " hello"]
Note the spaces.
It is also possible to have something like
draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]
which can be flattened using something like numpy reshape(-1,1) to get final_list, but the ideal case is
ideal_list = ["hello", "name", "jolloopp", "jolloopp name", "hello"]
where the spaces have been stripped, which is similar to using .strip().
EDIT 1:
Using re.split doesn't fully work if the word delimiters are part of other words.
words = ["hellois my name is myjolloopp", "my isjolloopp name is myhello"]
splitters = ['my', 'is']
then the output would be
['hello', '', 'name', '', 'jolloopp', '', 'jolloopp name', '', 'hello']
when it should be
['hellois', 'name', 'myjolloopp', 'isjolloopp name', 'myhello']
This is a known issue with re.split solutions that match the delimiters without word boundaries.
EDIT 2:
[x.strip() for x in re.split(' | '.join(splitters), ''.join(words))]
does not work properly when the input is
words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]
The output becomes
['hello worldhello', 'name', 'jolloopp', 'jolloopp name', 'hello']
when the output should be
['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']
You could use re like this.
Updated to use the better way suggested by @pault: word boundaries (\b) instead of spaces:
>>> import re
>>> words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
# Iterate over the list of words, then use re to split each string,
>>> [z for y in (re.split('|'.join(r'\b{}\b'.format(x) for x in splitters), word) for word in words) for z in y]
['hello world', 'hello ', ' name ', ' jolloopp', '', ' jolloopp name ', ' hello']
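To get from there to the ideal_list in the question, with the whitespace stripped and the empty strings dropped, you can post-process the split results; a small sketch along those lines:

import re

words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
splitters = ['my', 'is']

pattern = '|'.join(r'\b{}\b'.format(x) for x in splitters)
parts = (z for word in words for z in re.split(pattern, word))
# strip the leftover spaces and drop empty pieces
ideal = [p.strip() for p in parts if p.strip()]
print(ideal)  # ['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']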
Hi, I'm new to Python programming. Please help me create a function that takes a text file as an argument and creates a list of words, removing all punctuation and "splitting" the list on double spaces. What I mean to say is that the list should contain sublists, one for every double-space occurrence within the text file.
This is my function:
import re

def tokenize(document):
    file = open("document.txt", "r+").read()
    print re.findall(r'\w+', file)
The input text file has a string as follows:
What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!
Note: there is a double space after "warden?" and before "His".
My function gives me an output like this
['what','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']
Desired output :
[['what','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]
First split the whole text on double spaces, and then pass each item to the regex:
>>> import re
>>> file = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = file.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
... res.append(re.findall(r'\w+', sen))
...
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Here's a reasonable all-REs approach:
import re

def tokenize(document):
    with open(document) as f:
        text = f.read()
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]
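A quick usage sketch, assuming the sample text from the question has been saved to document.txt:

with open("document.txt", "w") as f:
    f.write("What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!")

print(tokenize("document.txt"))
# [['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'],
#  ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]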
The builtin split function allows splitting on a multi-character separator, such as a double space.
This:
a = "hello world.  How are you"
b = a.split('  ')
c = [x.split(' ') for x in b]
Yields:
c = [['hello', 'world.'], ['How', 'are', 'you']]
If you want to remove the punctuation too, apply a regex to the elements of b, or to x in the third statement.
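For instance, a sketch combining the double-space split with re.findall from the earlier answer:

import re

a = "hello world.  How are you"
c = [re.findall(r'\w+', x) for x in a.split('  ')]
print(c)  # [['hello', 'world'], ['How', 'are', 'you']]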
First split the file by punctuation, and then on a second pass split the resulting strings by spaces.
import re

def splitByPunct(s):
    return (x.group(0) for x in re.finditer(r'[^\.\,\?\!]+', s) if x and x.group(0))

[x.split() for x in splitByPunct("some string, another string! The phrase")]
This yields:
[['some', 'string'], ['another', 'string'], ['The', 'phrase']]
I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert it to something like this:
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
Given the late response, I think this is the simplest way for anyone else stumbling on this post (note that it keeps the punctuation attached to the words):
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
Try this:
import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
How it works:
From the docs:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case:
The pattern [^\w] matches any non-alphanumeric character. [\w] means any alphanumeric character and is equivalent to the character set [a-zA-Z0-9_]: a to z, A to Z, 0 to 9, and the underscore. So we match any non-alphanumeric character and replace it with a space.
Then we split() the result, which splits the string on spaces and converts it to a list. So 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().
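A quick sketch of those two steps on the 'hello-world' example:

import re

print(re.sub(r"[^\w]", " ", "hello-world"))          # hello world
print(re.sub(r"[^\w]", " ", "hello-world").split())  # ['hello', 'world']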
Let me know if any doubts come up.
Doing this properly is quite complex. For your research, the problem is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
... nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
The simplest way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
Using string.punctuation for completeness:
import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters like ] and \ from breaking the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()
This handles newlines as well.
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that both string and list are names of builtin types, so you probably don't want to use those as your variable names.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string

def extract_words(s):
    return [re.sub('^[{0}]+|[{0}]+$'.format(string.punctuation), '', w) for w in s.split()]
>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']
>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
Personally, I think this is slightly cleaner than the answers provided:
import re

def split_to_words(sentence):
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))  # use sentence.lower() if needed
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
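For example, one possible pattern (a sketch only; the exact character set to keep inside words is an assumption) that preserves internal apostrophes and dashes:

import re

s = '''I'm a custom-built sentence with "tricky" words!'''
print(re.findall(r"\w+(?:['-]\w+)*", s))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words']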
list = mystr.split(" ", mystr.count(" "))
This way you eliminate every character outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
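As a side note, a more compact sketch of the same idea using the stdlib constant string.ascii_letters (equivalent to abc + ABC above):

from string import ascii_letters

def words_to_list(strn):
    cleaned = (''.join(c for c in e if c in ascii_letters) for e in strn.split())
    return [w for w in cleaned if w]

print(words_to_list('She loves you, yea yea yea! '))  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']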
def split_string(string):
    return string.split()
This function will return the list of words of a given string.
In this case, if we call the function as follows,
string = 'This is a string, with words!'
split_string(string)
The function would then return:
['This', 'is', 'a', 'string,', 'with', 'words!']
This is from my attempt at a coding challenge that can't use regex:
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split(' ')
The role of the apostrophe seems interesting.
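A quick demonstration with a hypothetical inputStr (my own example, not from the challenge):

inputStr = "This is a string, with words! And I'm one."
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split(' ')
print(outputList)
# ['This', 'is', 'a', 'string', '', 'with', 'words', '', 'And', "I'm", 'one', '']

The empty strings come from punctuation that turned into doubled spaces; calling .split() with no argument instead would drop them.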
Probably not very elegant, but at least you know what's going on.
my_str = "Simple sample, test! is, olny".lower()
my_lst =[]
temp=""
len_my_str = len(my_str)
number_letter_in_data=0
list_words_number=0
for number_letter_in_data in range(0, len_my_str, 1):
if my_str[number_letter_in_data] in [',', '.', '!', '(', ')', ':', ';', '-']:
pass
else:
if my_str[number_letter_in_data] in [' ']:
#if you want longer than 3 char words
if len(temp)>3:
list_words_number +=1
my_lst.append(temp)
temp=""
else:
pass
else:
temp = temp+my_str[number_letter_in_data]
my_lst.append(temp)
print(my_lst)
You can try and do this:
import string

tryTrans = string.maketrans(",!", "  ")  # both arguments must be the same length
str = "This is a string, with words!"
str = str.translate(tryTrans)
listOfWords = str.split()
I have a list like
['hello', '...', 'h3.a', 'ds4,']
this should turn into
['hello', 'h3a', 'ds4']
and I want to remove only the punctuation, leaving the letters and numbers intact.
Punctuation is anything in the string.punctuation constant.
I know that this is going to be simple, but I'm kind of a noobie at Python, so...
Thanks,
giodamelio
Assuming that your initial list is stored in a variable x, you can use this:
>>> import string
>>> x = [''.join(c for c in s if c not in string.punctuation) for s in x]
>>> print(x)
['hello', '', 'h3a', 'ds4']
To remove the empty strings:
>>> x = [s for s in x if s]
>>> print(x)
['hello', 'h3a', 'ds4']
Use string.translate:
>>> import string
>>> test_case = ['hello', '...', 'h3.a', 'ds4,']
>>> [s.translate(None, string.punctuation) for s in test_case]
['hello', '', 'h3a', 'ds4']
For the documentation of translate, see http://docs.python.org/library/string.html
In Python 3+, use this instead:
import string
s = s.translate(str.maketrans('','',string.punctuation))
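Applied to the list from the question, a small Python 3 sketch:

import string

test_case = ['hello', '...', 'h3.a', 'ds4,']
table = str.maketrans('', '', string.punctuation)
cleaned = [s.translate(table) for s in test_case]
print(cleaned)                      # ['hello', '', 'h3a', 'ds4']
print([s for s in cleaned if s])    # ['hello', 'h3a', 'ds4']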
import string
print ''.join((x for x in st if x not in string.punctuation))
P.S. st is the string. For a list it's the same:
[''.join(x for x in par if x not in string.punctuation) for par in alist]
I think it works well. Look at string.punctuation:
>>> print string.punctuation
!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~
To make a new list:
import re

[re.sub(r'[^A-Za-z0-9]+', '', x) for x in list_of_strings]
Just be aware that string.punctuation covers English punctuation; it may not include punctuation marks used in other languages.
You could put any language-specific marks in a string LIST_OF_LANGUAGE_SPECIFIC_PUNCTUATION and concatenate it to string.punctuation to get a fuller set of punctuation:
punctuation = string.punctuation + LIST_OF_LANGUAGE_SPECIFIC_PUNCTUATION
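For instance, a hypothetical sketch with Spanish-style marks (the extra characters are my own assumption):

import string

LIST_OF_LANGUAGE_SPECIFIC_PUNCTUATION = '¡¿«»'  # hypothetical extras
punctuation = string.punctuation + LIST_OF_LANGUAGE_SPECIFIC_PUNCTUATION
print(''.join(c for c in '¡Hola, mundo!' if c not in punctuation))  # Hola mundo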