Converting a string to a list of words - Python

I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = '"This","is","a","string","with","words!"'
Then convert to something like this :
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

This can quite literally be interpreted as Python code using ast.literal_eval. It produces a tuple, so just turn that into a list.
>>> import ast
>>> list(ast.literal_eval(string.replace('!', '')))
['This', 'is', 'a', 'string', 'with', 'words']
Or use a list comprehension:
>>> [s.strip('"') for s in string.replace('!','').split(',')]
['This', 'is', 'a', 'string', 'with', 'words']

Assuming the string is always in the format "A","B",..., this should work (it strips the surrounding quotes but keeps punctuation inside them, such as the trailing '!'):
words = list(map(lambda x: x[1:-1], string.split(',')))
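Since the input looks like one line of quoted, comma-separated fields, the standard csv module is another option. This is a minimal sketch that assumes the input is always a single line in that format; the names raw, row and words are illustrative only.

```python
import csv
import string

raw = '"This","is","a","string","with","words!"'

# csv.reader accepts any iterable of lines; a one-element list yields one row.
row = next(csv.reader([raw]))

# Strip leading and trailing punctuation from each field.
words = [field.strip(string.punctuation) for field in row]
print(words)
```

Unlike slicing with x[1:-1], this also copes with commas inside a quoted field.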

string = 'This is a string, with words!'
string_to_array = string.split()
print(string_to_array)

Here are some Python built-in features that can help you achieve your desired output:
stringO = '"This","is","a","string","with","words!"'
punctuations = r'''!()-[]{};:'"\<>./?@#$%^&*_~'''
string = ""
# Keep every character that is not punctuation (commas are kept for the split below)
for char in stringO:
    if char not in punctuations:
        string = string + char
string = string.split(',')
print(string)

Related

python: how to replace a string in a list of string with a list of strings?

okay here is the example:
data = ['This', 'is', 'a', 'test', 'of', 'the', 'list']
replaceText = 'test'
replaceData =['new', 'test']
I tried data.replace(replaceText, replaceData) but it doesn't work. How do I replace a string in a list of strings with a list of strings? Any help will be appreciated.
Edit:
The exact condition is to replace or split the words that contain "s" so I put a loop in it. So the end result will print
data = ['Thi', 'i', 'a', 'te','t', 'of', 'the', 'li','t']
In a list, find the position of text with .index(), then replace by using slice assignment:
pos = data.index(replaceText)
data[pos:pos+1] = replaceData
This will replace only one occurrence of replaceText at a time. Demo:
>>> data = ['This', 'is', 'a', 'test', 'of', 'the', 'list']
>>> replaceText = 'test'
>>> replaceData =['new', 'test']
>>> pos = data.index(replaceText)
>>> data[pos:pos+1] = replaceData
To replace all occurrences, advance pos by the length of replaceData so the next search starts past the previous match:
pos = 0
while True:
    try:
        pos = data.index(replaceText, pos)
    except ValueError:
        break
    data[pos:pos+1] = replaceData
    pos += len(replaceData)
If you need to loop over data while modifying it, use a copy instead:
for n in data[:]:
    # manipulate data
    pass
You can use list's index() method to find the position p of replaceText:
p = data.index(replaceText)
and then use the construct
data[start:end] = another_list
to replace elements from p to p+1 (end is not inclusive) with replaceData:
data[p:p+1] = replaceData
Note that index() throws ValueError if replaceText does not exist in data:
try:
    p = data.index(replaceText)
    data[p:p+1] = replaceData
except ValueError:
    # replaceText is not present in data, handle appropriately.
    pass
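When every occurrence should be replaced and building a new list is acceptable, a single pass avoids the index bookkeeping. This is only a sketch; replace_all is a hypothetical helper name, not something from the answers above.

```python
def replace_all(items, target, replacement):
    """Return a new list where each item equal to target is replaced
    by all the elements of replacement."""
    result = []
    for item in items:
        if item == target:
            result.extend(replacement)
        else:
            result.append(item)
    return result


data = ['This', 'is', 'a', 'test', 'of', 'the', 'test']
print(replace_all(data, 'test', ['new', 'test']))
```

Because the result is a fresh list, the replacement may contain the target itself without causing an endless loop.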
Yes, the actual condition requires me to replace or split any string
that contains the character 's'; say, 'test' will be replaced by 'te'
and 't' in the list.
>>> from itertools import chain
>>> data = ['This', 'is', 'a', 'test', 'of', 'the', 'list']
>>> list(filter(None, chain.from_iterable(el.split('s') for el in data)))
['Thi', 'i', 'a', 'te', 't', 'of', 'the', 'li', 't']

How to split string into words that do not contain whitespaces in python?

My string is:
"This is a string"
I want to turn it into a list:
["This", "is", "a", "string"]
I use the split(" ") method, but it adds whitespaces as list elements. Please help,
Best Regards
>>> v="This is a string"
>>> v.split()
['This', 'is', 'a', 'string']
just use split().
It won't add whitespace as elements if you just use .split(), instead of .split(' ')
>>> "This is a string".split()
['This', 'is', 'a', 'string']
Like the docs say, don't pass an argument.
>>> "This is a string".split()
['This', 'is', 'a', 'string']
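The difference between the two calls shows up as soon as the input has repeated or trailing spaces. A small sketch with a made-up input string:

```python
text = "This  is a   string "

# With an explicit separator, every single space is a delimiter,
# so runs of spaces produce empty strings in the result.
with_sep = text.split(" ")

# With no argument, a run of whitespace counts as one separator and
# leading/trailing whitespace is ignored.
no_sep = text.split()

print(with_sep)
print(no_sep)
```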

Converting a String to a List of Words?

I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert to something like this :
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
I think this is the simplest way for anyone else stumbling on this post given the late response:
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
Try this:
import re
mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
How it works:
From the docs :
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case:
pattern is any non-word character.
[\w] matches any word character: letters, digits and underscore. For ASCII text (or with the re.ASCII flag) it is equal to the character set
[a-zA-Z0-9_]
that is, a to z, A to Z, 0 to 9 and underscore. For str patterns in Python 3 it also matches Unicode letters and digits.
So we match any non-word character and replace it with a space,
and then we split() the result, which splits the string on whitespace and returns a list.
So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
Let me know if any doubts come up.
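The two steps, and the effect of the re.ASCII flag on \w, can be sketched as follows. The sample string is made up for illustration, and the accented letter is written as an escape so the example does not depend on file encoding.

```python
import re

text = "hello-world, caf\u00e9"  # "café" with a precomposed e-acute

# Replace every non-word character with a space, then split on whitespace.
unicode_words = re.sub(r"[^\w]", " ", text).split()

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so the accented
# letter is treated as a separator too.
ascii_words = re.sub(r"[^\w]", " ", text, flags=re.ASCII).split()

print(unicode_words)
print(ascii_words)
```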
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
The most simple way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
Using string.punctuation for completeness:
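import re
import string
# s is the input string; re.escape keeps characters such as ] and \ literal inside the class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()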
This handles newlines as well.
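The same idea works without a regular expression by using str.translate with a deletion table. A minimal sketch, assuming Python 3; the names table, text and words are illustrative.

```python
import string

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)

text = "This is a string,\nwith words!"
words = text.translate(table).split()
print(words)
```

Note that deleting punctuation outright glues hyphenated words together ("custom-built" becomes "custombuilt"), which may or may not be what you want.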
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that both string and list are names of builtin types, so you probably don't want to use those as your variable names.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string
def extract_words(s):
    # Strip punctuation only at the start and end of each whitespace-separated token
    punct = re.escape(string.punctuation)
    return [re.sub('^[{0}]+|[{0}]+$'.format(punct), '', w) for w in s.split()]
>>> text = 'This is a string, with words!'
>>> extract_words(text)
['This', 'is', 'a', 'string', 'with', 'words']
>>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(text)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
Personally, I think this is slightly cleaner than the other answers provided:
import re
def split_to_words(sentence):
    # Use sentence.lower() if needed
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
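One possible way to make that choice explicit is a pattern that allows apostrophes and hyphens only inside a word. This is a sketch of one convention among many, not a complete tokenizer; WORD_RE and find_words are hypothetical names.

```python
import re

# A run of word characters, optionally followed by groups made of an
# apostrophe or hyphen plus more word characters.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")


def find_words(text):
    """Return words, keeping inner apostrophes and hyphens."""
    return WORD_RE.findall(text)


print(find_words("I'm a custom-built sentence - with 'quoted' words!"))
```

Quotes and dashes that surround a word are dropped, because the pattern requires a word character on both sides of each apostrophe or hyphen.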
list=mystr.split(" ",mystr.count(" "))
This way you eliminate every special char outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        # keep only ASCII letters from each whitespace-separated chunk
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL
s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L) # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
def split_string(string):
    return string.split()
This function will return the list of words of a given string.
In this case, if we call the function as follows,
string = 'This is a string, with words!'
split_string(string)
The return output of the function would be
['This', 'is', 'a', 'string,', 'with', 'words!']
This is from my attempt at a coding challenge that can't use regex:
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()
The role of apostrophe seems interesting.
Probably not very elegant, but at least you know what's going on.
my_str = "Simple sample, test! is, olny".lower()
my_lst = []
temp = ""
list_words_number = 0
for char in my_str:
    if char in [',', '.', '!', '(', ')', ':', ';', '-']:
        # skip punctuation
        continue
    if char == ' ':
        # keep only words longer than 3 characters
        if len(temp) > 3:
            list_words_number += 1
            my_lst.append(temp)
        # reset the buffer either way so short words are not glued to the next one
        temp = ""
    else:
        temp = temp + char
# append whatever is left after the last space
my_lst.append(temp)
print(my_lst)
You can try this (in Python 3, str.maketrans needs both arguments to have the same length):
tryTrans = str.maketrans(",!", "  ")
text = "This is a string, with words!"
text = text.translate(tryTrans)
listOfWords = text.split()

Python regex string to list of words (including words with hyphens)

I would like to parse a string to obtain a list including all words (hyphenated words, too). Current code is:
s = '-this is. A - sentence;one-word'
re.compile(r"\W+", re.UNICODE).split(s)
returns:
['', 'this', 'is', 'A', 'sentence', 'one', 'word']
and I would like it to return:
['', 'this', 'is', 'A', 'sentence', 'one-word']
If you don't need the leading empty string, you could use the pattern \w(?:[-\w]*\w)? for matching:
>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']
Note that it won't match words with apostrophes like won't.
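If apostrophes should be kept as well, one possible variation is to add the apostrophe to the inner character class. This is a sketch of that variation, not part of the original answer; rx_apos is an illustrative name.

```python
import re

rx = re.compile(r"\w(?:[-\w]*\w)?")        # pattern from the answer
rx_apos = re.compile(r"\w(?:[-'\w]*\w)?")  # variant that also allows inner apostrophes

print(rx.findall("won't stop"))
print(rx_apos.findall("won't stop"))
print(rx_apos.findall("-this is. A - sentence;one-word"))
```

Both patterns still require a word character at each end of a match, so leading or trailing hyphens and quotes are not included.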
Here is my traditional "why use a regexp language when you can use Python" alternative:
import string
s = "-this is. A - sentence;one-word what's"
s = list(filter(None, [word.strip(string.punctuation)
                       for word in s.replace(';', '; ').split()
                       ]))
print(s)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""
You could use "[^\w-]+" instead.
import re
s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)
result:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
Make sure you notice that the correct ordering is to look for hyphenated words first!
You can try with the NLTK library (the groups are non-capturing, since NLTK's regexp tokenizer expects a pattern without capturing parentheses):
>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> hyphen = r'(?:\w+\-\s?\w+)'
>>> wordr = r'(?:\w+)'
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']
I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html Hope it helps

Is there a function in Python to split a string without ignoring the spaces?

Is there a function in Python to split a string without ignoring the spaces in the resulting list?
E.g:
s="This is the string I want to split".split()
gives me
>>> s
['This', 'is', 'the', 'string', 'I', 'want', 'to', 'split']
I want something like
['This',' ','is',' ', 'the',' ','string', ' ', .....]
>>> import re
>>> re.split(r"(\s+)", "This is the string I want to split")
['This', ' ', 'is', ' ', 'the', ' ', 'string', ' ', 'I', ' ', 'want', ' ', 'to', ' ', 'split']
Using the capturing parentheses in re.split() causes the function to return the separators as well.
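A side-by-side sketch of the same split with and without the capturing group, using a made-up input that contains a double space:

```python
import re

text = "This is  the string"

# Without a group, the separators are discarded.
without_group = re.split(r"\s+", text)

# With a capturing group, each separator is returned as its own item,
# so joining the pieces reproduces the original string.
with_group = re.split(r"(\s+)", text)

print(without_group)
print(with_group)
print("".join(with_group) == text)
```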
I don't think there is a function in the standard library that does that by itself, but "partition" comes close
The best way is probably to use regular expressions (which is how I'd do this in any language!)
import re
print(re.split(r"(\s+)", "Your string here"))
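For comparison, the str.partition route mentioned above could be sketched like this. It assumes single spaces are the only separators of interest; split_keep_spaces is a hypothetical helper name.

```python
def split_keep_spaces(text):
    """Split on single spaces, keeping each space as its own list item."""
    parts = []
    rest = text
    while rest:
        # partition returns (before, separator, after); the separator is
        # an empty string when no space is left.
        head, sep, rest = rest.partition(" ")
        if head:
            parts.append(head)
        if sep:
            parts.append(sep)
    return parts


print(split_keep_spaces("This is the string"))
```

Unlike the regular expression with \s+, this version returns consecutive spaces as separate items rather than one combined item.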
Silly answer just for the heck of it:
mystring.replace(" ","! !").split("!")
The hard part with what you're trying to do is that you aren't giving it a character to split on. split() explodes a string on the character you provide to it, and removes that character.
Perhaps this may help:
s = "String to split"
mylist = []
for item in s.split():
    mylist.append(item)
    mylist.append(' ')
mylist = mylist[:-1]
Messy, but it'll do the trick for you...
