I wanted to know how to iterate through a string word by word.
string = "this is a string"
for word in string:
    print(word)
The above gives an output:
t
h
i
s
i
s
a
s
t
r
i
n
g
But I am looking for the following output:
this
is
a
string
When you write
for word in string:
you are not iterating through the words in the string; you are iterating through its characters. To iterate through the words, first split the string into words using str.split(), and then iterate through the resulting list. Example:
my_string = "this is a string"
for word in my_string.split():
    print(word)
Please note that str.split(), when called without arguments, splits on any run of whitespace (a space, multiple spaces, tabs, newlines, etc.).
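To illustrate the note about whitespace, here is a small sketch comparing split() with no arguments to split(" "). The sample text is made up for the illustration and is not from the question.

```python
# Hypothetical sample text with a double space, a tab and a newline.
text = "this  is\ta\nstring"

# No argument: any run of whitespace acts as a single separator.
words_default = text.split()

# Explicit single space: only ' ' separates, and two consecutive
# spaces produce an empty string between them.
words_space = text.split(" ")

print(words_default)  # ['this', 'is', 'a', 'string']
print(words_space)    # ['this', '', 'is\ta\nstring']
```

For word-by-word iteration over ordinary text, the no-argument form is usually the one you want.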
This is one way to do it:
string = "this is a string"
ssplit = string.split()
for word in ssplit:
    print(word)
Output:
this
is
a
string
for word in string.split():
    print(word)
Using nltk:
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize("This is a string.")
# word_tokenize expects a single string, so tokenize each sentence in turn
words_in_each_sentence = [word_tokenize(sentence) for sentence in sentences]
You can use TweetTokenizer to parse casual text with emoticons and the like.
One way to do this is with a dictionary. The problem with the code above is that it iterates over each letter in the string instead of each word. To solve this, first turn the string into a list of words with the split() method, then count each item in that list. The code below returns the number of times each word appears in the string, in the form of a dictionary.
s = input('Enter a string to see if strings are repeated: ')
d = dict()
p = s.split()
for word in p:
    if word not in d:
        d[word] = 1
    else:
        d[word] += 1
print(d)
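As an alternative to the manual dictionary, the same count can be sketched with collections.Counter from the standard library. The sample sentence is an assumption, used in place of input() so that the snippet runs unattended.

```python
from collections import Counter

# Hypothetical sample sentence (stands in for the user's input).
s = "the cat and the hat and the bat"

# Counter counts how often each item of the word list occurs.
counts = Counter(s.split())

print(counts["the"])          # 3
print(counts.most_common(1))  # [('the', 3)]
```

Counter is a dict subclass, so it can be used wherever the hand-built dictionary would be.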
s = 'hi how are you'
words = s.split()  # split() already returns a list, so no map() is needed
print(words)
Output: ['hi', 'how', 'are', 'you']
You can also try this method:
sentence_1 = "This is a string"
words = sentence_1.split()  # avoid naming a variable 'list'; it shadows the built-in
for word in words:
    print(word)
Related
I have 2 scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
I want it to be split as
'hello', 'getting', 'good.<li>hii' (Scenario 1)
'hello', 'getting', 'good', 'li', 'hi' (Scenario 2)
Any ideas, please?
Something like this should work:
>>> import re
>>> s = "##$hello?? getting good.<li>hii"
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscores), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what is in the brackets one or more times, as many as possible: in this case, the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
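Since the question is about Python, here is a rough Python counterpart of the two JavaScript snippets, using re.findall. It is a sketch of the same two patterns rather than a translation of the original answer's exact code.

```python
import re

value = "##$hello?? getting good.<li>hii777bloop"

# \w+ matches runs of letters, digits and underscores.
word_runs = re.findall(r"\w+", value)

# [A-Za-z]+ matches runs of ASCII letters only, so the digits split the run.
letter_runs = re.findall(r"[A-Za-z]+", value)

print(word_runs)    # ['hello', 'getting', 'good', 'li', 'hii777bloop']
print(letter_runs)  # ['hello', 'getting', 'good', 'li', 'hii', 'bloop']
```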
For the first scenario, just use a regex to find all runs of word characters together with <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to collapse the repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for a solution without regex: string.punctuation gives you a string of all ASCII punctuation characters.
Use it with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: below are step-by-step instructions on how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
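The same punctuation-to-space idea can also be sketched with str.maketrans and str.translate, which avoids building an intermediate list of characters. This is a swapped-in variant of the approach above, not the answer's own code.

```python
import string

my_string = '##$hello?? getting good.<li>hii'

# Build a translation table that maps every punctuation character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Apply the table, then split on whitespace as before.
words = my_string.translate(table).split()

print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```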
What is the easiest way in Python to replace the nth word in a string, assuming each word is separated by a space?
For example, I want to replace the tenth word of a string and get the resulting string.
I guess you could do something like this:
nreplace = 1  # zero-based index of the word to replace
my_string = "hello my friend"
words = my_string.split(" ")
words[nreplace] = "your"
" ".join(words)
Here is another way of doing the replacement:
nreplace=1
words=my_string.split(" ")
" ".join([words[word_index] if word_index != nreplace else "your" for word_index in range(len(words))])
Let's say your string is:
my_string = "This is my test string."
You can split the string up using split():
my_list = my_string.split()
Which will set my_list to
['This', 'is', 'my', 'test', 'string.']
You can replace the 4th list item using
my_list[3] = "new"
And then put it back together with
my_new_string = " ".join(my_list)
Giving you
"This is my new string."
A solution involving a list comprehension:
text = "To be or not to be, that is the question"
replace = 6
replacement = 'it'
print(' '.join([x if index != replace else replacement for index, x in enumerate(text.split())]))
The above produces:
To be or not to be, it is the question
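The split, replace, and join steps used in the answers above can be wrapped in a small helper. The function name replace_nth_word and the choice of a 1-based position are assumptions made for this sketch.

```python
def replace_nth_word(text, n, new_word):
    """Return text with its n-th word (1-based) replaced by new_word."""
    words = text.split()
    if not 1 <= n <= len(words):
        raise IndexError("text has %d words, cannot replace word %d" % (len(words), n))
    words[n - 1] = new_word  # convert the 1-based position to a list index
    return " ".join(words)


result = replace_nth_word("This is my test string.", 4, "new")
print(result)  # This is my new string.
```

Note that split() followed by " ".join() collapses any run of whitespace into a single space, so this only round-trips cleanly for single-spaced text.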
You could use a generator expression and the string join() method:
my_string = "hello my friend"
nth = 0
new_word = 'goodbye'
print(' '.join(word if i != nth else new_word
               for i, word in enumerate(my_string.split(' '))))
Output:
goodbye my friend
Through re.sub.
>>> import re
>>> my_string = "hello my friend"
>>> new_word = 'goodbye'
>>> re.sub(r'^(\s*(?:\S+\s+){0})\S+', r'\1'+new_word, my_string)
'goodbye my friend'
>>> re.sub(r'^(\s*(?:\S+\s+){1})\S+', r'\1'+new_word, my_string)
'hello goodbye friend'
>>> re.sub(r'^(\s*(?:\S+\s+){2})\S+', r'\1'+new_word, my_string)
'hello my goodbye'
Just replace the number within the curly braces with the position of the word you want to replace minus 1. That is, to replace the first word the number would be 0, for the second word it would be 1, and so on.
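Rather than editing the number by hand, the pattern can be built from a variable. The helper name replace_word_at is hypothetical; the sketch uses a function as the replacement so that backslashes or digits in the new word are not interpreted as group references.

```python
import re


def replace_word_at(text, index, new_word):
    """Replace the word at zero-based position index, keeping the original spacing."""
    # Skip exactly `index` words (each followed by whitespace), then match the next word.
    pattern = r'^(\s*(?:\S+\s+){%d})\S+' % index
    # A function replacement inserts new_word literally after the skipped prefix.
    return re.sub(pattern, lambda m: m.group(1) + new_word, text)


print(replace_word_at("hello my friend", 1, "goodbye"))  # hello goodbye friend
```

Unlike the split/join approaches, this keeps the original whitespace intact, and it returns the text unchanged if there is no word at that position.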
Say I have the following dictionary:
d = {"word1":0, "word2":0}
For this regex I need to verify that a word in the string isn't a key in that dictionary.
Is it possible to set a variable to anything not in a dictionary, for the purposes of a regex?
Forget about regex in this case:
test = "word1 word2 word3"  # your string
words = test.split(' ')  # words in your string
d = {"word1": 0, "word2": 0}  # your dict (avoid the name 'dict'; it shadows the built-in)
for word in words:
    if word in d:
        print(word, "is a key in d")
    else:
        print(word, "isn't a key in d")
>>> d = {"foo":0, "spam":0}
>>> test = "This is a string with many words, including foo and bar"
>>> any(word in d for word in test.split())
True
If punctuation is a problem (for example, "This is foo." would not find foo with this approach), and since you said all your words are alphanumeric, you could also use
>>> import re
>>> test = "This is foo."
>>> any(word in d for word in re.findall("[A-Za-z0-9]+", test))
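Since the question asks about words that are not keys in the dictionary, the same membership test can be turned around. This is a minimal sketch reusing the question's dictionary with a made-up test string.

```python
d = {"word1": 0, "word2": 0}
test = "word1 word2 word3"

# Collect the words of the string that are not keys in the dictionary.
unknown = [word for word in test.split() if word not in d]

# True only if every word in the string is a key.
all_known = not unknown

print(unknown)    # ['word3']
print(all_known)  # False
```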
I want to check if a word is in a list of words.
word = "with"
word_list = ["without", "bla", "foo", "bar"]
I tried if word in set(list), but it is not yielding the wanted result because in seems to match on substrings rather than on whole items. That is to say, "with" is a substring of one of the words in word_list, and if "with" in set(list) still evaluates to True.
What is a simpler way to do this check than manually iterating over the list?
You could do:
found = any(word in item for item in word_list)
It checks whether word is a substring of each item and returns True if any item contains it.
in is working as expected for an exact match:
>>> word = "with"
>>> mylist = ["without", "bla", "foo", "bar"]
>>> word in mylist
False
>>>
You can also use:
mylist.index(word) # raises ValueError if the word is not in the list (use in a try/except)
or
mylist.count(word) # gives a number > 0 if the word is in the list.
However, if you are looking for a substring, then:
for item in mylist:
    if word in item:
        print('found')
        break
By the way, don't use list as the name of a variable.
You could also create a single search string by concatenating all of the words in word_list into a single string:
word = "with"
word_list = ' '.join(["without", "bla", "foo", "bar"])
Then a simple in test does the job. Note that this is a substring test, so "with" is also found inside "without":
word in word_list
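To make the difference between the answers in this thread concrete, here is a short sketch that runs the exact-match and substring checks side by side on the question's data.

```python
word = "with"
word_list = ["without", "bla", "foo", "bar"]

# List (or set) membership compares whole items.
exact_match = word in word_list

# Substring check against each item individually.
substring_match = any(word in item for item in word_list)

# Substring check against the items joined into one string.
joined_match = word in " ".join(word_list)

print(exact_match, substring_match, joined_match)  # False True True
```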
I am looking for an expression to match strings against a list of words like ["xxx", "yyy", "zzz"]. The strings need to contain all three words but they do not need to be in the same order.
E.g., the following strings should be matched:
'"yyy" string of words and than “zzz" string of words “xxx"'
or
'string of words “yyy””xxx””zzz” string of words'
Simple string operation:
mywords = ("xxx", "yyy", "zzz")
all(x in mystring for x in mywords)
If word boundaries are relevant (i.e. you want to match zzz but not Ozzzy):
import re
all(re.search(r"\b" + re.escape(word) + r"\b", mystring) for word in mywords)
I'd use all and re.search to find the matches.
>>> import re
>>> words = ('xxx', 'yyy' ,'zzz')
>>> text = "sdfjhgdsf zzz sdfkjsldjfds yyy dfgdfgfd xxx"
>>> all([re.search(w, text) for w in words])
True
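Putting the pieces from these answers together, a small helper might look like the sketch below. The name contains_all_words is hypothetical; re.escape is used so that words containing regex metacharacters are matched literally.

```python
import re


def contains_all_words(text, words):
    """Return True if every word occurs in text as a whole word, in any order."""
    return all(
        re.search(r"\b" + re.escape(word) + r"\b", text) is not None
        for word in words
    )


words = ("xxx", "yyy", "zzz")
print(contains_all_words("sdfjhgdsf zzz sdfkjsldjfds yyy dfgdfgfd xxx", words))  # True
print(contains_all_words("xxx yyy Ozzzy", words))                                # False
```

If whole-word matching is not needed, the plain all(x in mystring for x in mywords) form shown earlier is simpler and avoids regular expressions altogether.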