How to turn a string of words into a list - Python

I have turned a list of words into a string.
Now I want to turn it back into a list, but I don't know how. Please help.
temp = ['hello', 'how', 'is', 'your', 'day']
temp_string = str(temp)
temp_string will then be "['hello', 'how', 'is', 'your', 'day']"
I want to turn this back into a list, but when I do list(temp_string), this happens:
['[', "'", 'h', 'e', 'l', 'l', 'o', "'", ',', ' ', "'", 'h', 'o', 'w', "'", ',', ' ', "'", 'i', 's', "'", ',', ' ', "'", 'y', 'o', 'u', 'r', "'", ',', ' ', "'", 'd', 'a', 'y', "'", ']']

You can do this easily by evaluating the string. That's not something I'd normally suggest but, assuming you control the input, it's quite safe:
>>> temp = ['hello', 'how', 'is', 'your', 'day'] ; type(temp) ; temp
<class 'list'>
['hello', 'how', 'is', 'your', 'day']
>>> tempstr = str(temp) ; type(tempstr) ; tempstr
<class 'str'>
"['hello', 'how', 'is', 'your', 'day']"
>>> temp2 = eval(tempstr) ; type(temp2) ; temp2
<class 'list'>
['hello', 'how', 'is', 'your', 'day']
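If you don't fully control the input, ast.literal_eval from the standard library is a safer drop-in: it parses only Python literals and raises on anything else. A minimal sketch of the same round trip:
>>> import ast
>>> temp2 = ast.literal_eval("['hello', 'how', 'is', 'your', 'day']")
>>> type(temp2) ; temp2
<class 'list'>
['hello', 'how', 'is', 'your', 'day']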

Duplicate question? Converting a String to a List of Words?
Working code below (Python 3):
import re

sentence_list = ['hello', 'how', 'are', 'you']
sentence = ""
for word in sentence_list:
    sentence += word + " "
print(sentence)
# output: "hello how are you "
word_list = re.sub(r"[^\w]", " ", sentence).split()
print(word_list)
# output: ['hello', 'how', 'are', 'you']
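If you build the string yourself with a known separator, as above, the round trip is simpler and needs no regex; a sketch assuming space-separated words:
sentence = ' '.join(sentence_list)  # 'hello how are you'
word_list = sentence.split()        # ['hello', 'how', 'are', 'you']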

You can strip the brackets and quotes and split on the commas:
temp = ['hello', 'how', 'is', 'your', 'day']
temp_string = str(temp)
temp_new = [w.strip(" '") for w in temp_string.strip('[]').split(',')]
strip('[]') removes the surrounding brackets, split(',') breaks the string at each comma, and strip(" '") removes the leftover spaces and quote characters from each piece, leaving the original list of words.


How do I make a list of strings without invalid characters in any of the strings?

For example, if I had this list of invalid characters:
invalid_char_list = [',', '.', '!']
And this list of strings:
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
I would want to get this new list:
new_string_list = ['Hello', 'world', 'I', 'am', 'a', 'programmer']
without , or . or ! in any of the strings in the list, because those are the characters in my list of invalid characters.
You can use a regex: build the pattern [,.!] from the list and replace each match with ''.
import re
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
# re.escape guards against characters that are special inside a character class.
re_invalid = re.compile(f"[{re.escape(''.join(invalid_char_list))}]")
# re_invalid <-> re.compile(r'[,.!]', re.UNICODE)
new_string_list = [re_invalid.sub('', s) for s in string_list]
print(new_string_list)
Output:
['Hello', 'world', 'I', 'am', 'a', 'programmer']
[,.!] matches any one of the characters (',', '.', '!') in the set.
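An alternative without regex is str.translate with a deletion table built by str.maketrans; a minimal sketch using the same lists:
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', ''.join(invalid_char_list))
new_string_list = [s.translate(table) for s in string_list]
print(new_string_list)
# ['Hello', 'world', 'I', 'am', 'a', 'programmer']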
You can try looping through the string_list and replacing each invalid char with an empty string.
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
for invalid_char in invalid_char_list:
    string_list = [x.replace(invalid_char, '') for x in string_list]
print(string_list)
Output:
['Hello', 'world', 'I', 'am', 'a', 'programmer']
We can loop over each string in string_list and each invalid character and use String.replace to replace any invalid characters with '' (nothing).
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
formatted_string_list = []
for string in string_list:
    for invalid in invalid_char_list:
        string = string.replace(invalid, '')
    formatted_string_list.append(string)
You can use strip():
string_list = ['Hello,', ',world.?', 'I', 'am?', '!a,', 'programmer!!?']
new_string_list = [c.strip(',.!?') for c in string_list]
print(new_string_list)
#['Hello', 'world', 'I', 'am', 'a', 'programmer']
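Note that strip() only removes characters from the ends of each string, which is enough for this data but leaves interior punctuation alone; a quick check:
>>> 'pro!grammer!'.strip(',.!?')
'pro!grammer'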

How can I split a string into multiple words, spaces and tabs in Python?

For example:
String1='Hi what are you doing?'
should be split like:
List1=['Hi','\s','what','\s','are','\s','you','\s','doing','\s','?']
If you just want to split the string:
String1='Hi what are you doing ?'
print(String1.split())
output:
['Hi', 'what', 'are', 'you', 'doing', '?']
If you want the output shown in your example:
print(String1.replace(" "," \s ").split())
output:
['Hi', '\\s', 'what', '\\s', 'are', '\\s', 'you', '\\s', 'doing', '\\s', '?']
import re
s = "your string here \nhello"
re.split(r'\s+', s)
Another method, through the re module:
>>> import re
>>> s = "your string here \nhello \thi"
>>> re.findall(r'\S+', s)
['your', 'string', 'here', 'hello', 'hi']
This matches one or more non-whitespace characters.
Try this:
s ='Hi what are you doing?'
import re
re.findall(r'[a-zA-Z]+|[^a-zA-Z]+', s)
output:
['Hi', ' ', 'what', ' ', 'are', ' ', 'you', ' ', 'doing', '?']
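If you want the whitespace runs themselves in the result, re.split with a capturing group also works, because captured delimiters are kept in the output list (punctuation stays attached to its word here):
>>> import re
>>> re.split(r'(\s+)', 'Hi what\tare you doing?')
['Hi', ' ', 'what', '\t', 'are', ' ', 'you', ' ', 'doing?']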

Split an existing list based on a repeated word

I am trying to split a list into several new strings. Here's the initial list:
initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
            'title', 'PTE427', 'how', 'are', 'you']
I want to split the list on the PTExyz items, so the new list looks like:
newList = ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
How should I develop a proper algorithm for the general case with the repeated item PTExyz?
Thank you!
The algorithm will be something like this.
Iterate over the list and look for strings that start with PTE. Keep a temp string, initialized as empty. When you hit a PTE string, append the current temp (if non-empty) to the result list and start a new temp from that string; otherwise, append the current string to temp.
ls = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
result = []
temp = ''
for s in ls:
    if s.startswith('PTE'):
        if temp != '':
            result.append(temp)
        temp = s
    else:
        if temp == '' or s == '':
            continue  # skip empty items and anything before the first marker
        temp += ' ' + s
result.append(temp)
print(result)
Edit
For handling the pattern PTExyz you can use a regular expression: replace the check s.startswith('PTE') with
re.match(r'PTE\w{3}$', s)
(and add import re at the top).
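A quick check that this pattern accepts exactly the PTExyz markers and nothing else:
>>> import re
>>> [bool(re.match(r'PTE\w{3}$', s)) for s in ['PTE123', 'PTE4278', 'programmer']]
[True, False, False]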
I think this will work:
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
resultlist = []
s = ' '.join(l)
parts = s.split('PTE')
for i in parts:
    resultlist.append('PTE' + i)
resultlist.remove('PTE')
print(resultlist)
This works with a regular expression for PTExyz:
import re
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
     'title', 'PTE427', 'how', 'are', 'you']
pattern = re.compile(r'PTE\d{3}')
k = []
for i in l:
    if pattern.match(i) is not None:
        k.append(i)
s = ' '.join(l)
parts = re.split(pattern, s)
parts.remove('')
for i in range(len(k)):
    parts[i] = k[i] + parts[i]
print(parts)
>>> lst = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
>>> index_list = [lst.index(item) for item in lst if "PTE" in item]
>>> index_list.append(len(lst))
>>> index_list
[0, 5, 9, 13]
>>> [' '.join(w for w in lst[index_list[i-1]:index_list[i]] if w) for i, item in enumerate(index_list) if item > 0]
Output
['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
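For the general case, a small helper that groups every word under the most recent PTExyz marker may read more clearly; a sketch (group_by_marker is a hypothetical name, and \w{3} assumes x, y, z are word characters):
import re

def group_by_marker(items, marker=re.compile(r'PTE\w{3}$')):
    # Start a new group at each marker; attach other non-empty items to the current group.
    groups = []
    for item in items:
        if marker.match(item):
            groups.append([item])
        elif item and groups:
            groups[-1].append(item)
    return [' '.join(g) for g in groups]

initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
            'title', 'PTE427', 'how', 'are', 'you']
print(group_by_marker(initList))
# ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']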

WordNetLemmatizer error - every character is lemmatized

I am trying to lemmatize my dataset for sentiment analysis. What should I do to get the expected output rather than the current output? The input file is a CSV, stored as a DataFrame object.
dataset = pd.read_csv('xyz.csv')
Here is my code
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list1_ = []
for file_ in dataset:
    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    list1_.append(result1)
dataset = pd.concat(list1_, ignore_index=True)
Expected
>> lemmatizer.lemmatize('cats')
>> [cat]
Current output
>> lemmatizer.lemmatize('cats')
>> [c,a,t,s]
TL;DR
result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x.split()])
The lemmatizer takes any string as input.
If the dataset['Content'] column holds strings, iterating through a string iterates through its characters, not its "words", e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> x = 'this is a foo bar sentence, that is of type str'
>>> [wnl.lemmatize(ch) for ch in x]
['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'o', 'o', ' ', 'b', 'a', 'r', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ',', ' ', 't', 'h', 'a', 't', ' ', 'i', 's', ' ', 'o', 'f', ' ', 't', 'y', 'p', 'e', ' ', 's', 't', 'r']
So you would have to first word tokenize your sentence string, e.g.:
>>> from nltk import word_tokenize
>>> [wnl.lemmatize(word) for word in x.split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', 'is', 'of', 'type', 'str']
>>> [wnl.lemmatize(ch) for ch in word_tokenize(x)]
['this', 'is', 'a', 'foo', 'bar', 'sentence', ',', 'that', 'is', 'of', 'type', 'str']
Another example:
>>> from nltk import word_tokenize
>>> x = 'the geese ran through the parks'
>>> [wnl.lemmatize(word) for word in x.split()]
['the', u'goose', 'ran', 'through', 'the', u'park']
>>> [wnl.lemmatize(ch) for ch in word_tokenize(x)]
['the', u'goose', 'ran', 'through', 'the', u'park']
But to get a more accurate lemmatization, you should get the sentence word tokenized and pos-tagged, see https://github.com/alvations/earthy/blob/master/FAQ.md#how-to-use-default-nltk-functions-in-earthy
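A sketch of that approach, assuming NLTK's default pos_tag and a small mapping from Penn Treebank tags to WordNet POS constants (penn_to_wordnet is a hypothetical helper name):
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map Penn Treebank tags to WordNet POS constants; default to noun.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

x = 'the geese ran through the parks'
print([wnl.lemmatize(word, penn_to_wordnet(tag))
       for word, tag in pos_tag(word_tokenize(x))])
# ['the', 'goose', 'run', 'through', 'the', 'park']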

How do I select the first elements of each list in a list of lists?

I am trying to isolate the first words in a series of sentences using Python/NLTK.
I created a simple series of sentences (the_text), and while I am able to divide it into tokenized sentences, I cannot successfully separate just the first words of each sentence into a list (first_words). This is what sents_words looks like:
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!
There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
Here's an example of how to access items in lists and lists of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterate through the sentences.
... for sent in tokenized_text:
...     print sent
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First words in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print word0
...
...
Here
There
I
There
Here
I
I
Look
How
>>> print first_words
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
As one-liners with list comprehensions:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text= [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])
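For completeness, the same extraction can be written with operator.itemgetter, which is equivalent to the comprehension above:
from operator import itemgetter
first_words = list(map(itemgetter(0), sents_words))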
