Python : find words in string without white space

I'm trying to write a function that finds the words in a string that has no white space, such as 'Daysaregood'.
I iterate over the letters one at a time and check whether the letters read so far form a word, by comparing them against the list of suggestions returned by the enchant module.
This is what I tried:
import enchant
import time
fulltext =[]
def work(out):
    if len(out)>0:
        word = ''
        wd = ""
        # iterate for every Letter
        for i in out:
            word = word + i
            print word
            d = enchant.Dict('en_US')
            # a list of words to compare to
            list = d.suggest(word.title())
            print list
            #check if word exists
            if word.title() in list :
                print 'Word found'
                wd = word
            else:
                print 'Word not found'
        print '\n'+wd
        fulltext.append(str(wd))
        time.sleep(2)
        work(out[len(wd):])
    else:
        print '\n fulltext : '
        print fulltext
word="Daysaregood"
work(word)
For this text the script runs the way I want, and I get a list like this:
['Days', 'are', 'good'].
But when I try something like 'spaceshuttle', the function gets confused by 'space' and steals the 's' from 'shuttle', so I get this:
['spaces', 'hut', 't', 'l', 'e'].
My goal is to return every word by itself and store them in a list.
Any help is appreciated.

The issue with your task is that the desired output doesn't follow strict rules. If you were to input 'pineapple', would you expect ['pine', 'apple'] or ['pineapple']? Both are valid splits, so it would be difficult, if not impossible, for a program to predict which one you want.
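To make the ambiguity concrete, here is a minimal sketch that lists every possible split instead of guessing one. It assumes a small hand-written vocabulary set in place of the enchant dictionary; the function name segmentations and the vocabulary contents are made up for illustration.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words taken from vocabulary."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical vocabulary, standing in for a real dictionary.
vocabulary = {"pine", "apple", "pineapple", "space", "spaces", "shuttle", "hut"}
print(segmentations("pineapple", vocabulary))
print(segmentations("spaceshuttle", vocabulary))
```

With this vocabulary, 'pineapple' has two valid splits, while 'spaceshuttle' has only one, because the greedy choice 'spaces' leaves 'huttle', which cannot be completed. Choosing among several valid splits still needs some extra rule, such as preferring fewer words.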

Related

(python) I keep getting an IndexError: string index out of range

I am trying to solve this question:
Implement the mapper, mapFileToCount, which takes a string (text from a file) and returns the number of capitalized words in that string. A word is defined as a series of characters separated from other words by either a space or a newline. A word is capitalized if its first letter is capitalized (A vs a).
My Python code currently reads:
def mapFileToCount(s):
    lines = (str(s)).splitlines()
    words = (str(lines)).split(" ")
    up = 0
    for word in words:
        if word[0].isupper() == True:
            up = up + 1
    return up
However, I keep getting the error IndexError: string index out of range.
Please help.

With your current code, given Hi huy \n hi you there:
lines will be ['Hi huy ', ' hi you there']
words will be ["['Hi", 'huy', "',", "'", 'hi', 'you', "there']"], because you split str(lines), the string representation of the list, rather than the text itself.
I'd suggest splitting on any run of whitespace at once with words = re.split(r"\s+", s).
The IndexError then comes from inputs like Hi where are you__ (where _ is a space): the split leaves an empty string at the end, and you can't access the first character of an empty string. So add a condition to the if:
word, because an empty string is falsy and any other string is truthy
word[0].isupper(), for your test
import re
def mapFileToCount(s):
    # raw string, so that \s reaches the regex engine unchanged
    words = re.split(r"\s+", s)
    up = 0
    for word in words:
        # skip empty strings before indexing
        if word and word[0].isupper():
            up = up + 1
    return up
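As an alternative sketch, not taken from the answer above: str.split() called with no arguments splits on any run of whitespace and never returns empty strings, so the emptiness check is not needed. The function name count_capitalized is made up for this example.

```python
def count_capitalized(text):
    """Count the words whose first character is an uppercase letter."""
    # split() with no arguments splits on any whitespace run and
    # never returns empty strings, so word[0] is always safe.
    return sum(1 for word in text.split() if word[0].isupper())


print(count_capitalized("Hi huy \n hi you there  "))  # prints 1
```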

"string index out of range" means that the index you are trying to access does not exist in the string: you are asking for the character at a position that lies beyond the string's end.
In your code, the culprit is word[0].
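A short sketch of how the empty strings arise, assuming an input with two trailing spaces (the sample sentence is only an illustration): splitting on a single space keeps an empty string for each extra space, and indexing an empty string raises the error.

```python
words = "Hi where are you  ".split(" ")
print(words)  # the two trailing spaces produce two empty strings

for word in words:
    try:
        word[0]
    except IndexError:
        # Reached only for the empty strings.
        print("cannot index into", repr(word))
```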

Print words in between two keywords

I am trying to write code in Python that prints all the words in between two keywords.
scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
go = False
start = "Python"
end = "words"
for line in scenario:
    if start in line: go = True
    elif end in line:
        go = False
        continue
    if go: print(line)
I want it to print "to print out all the".

A slightly different approach: create a list in which each element is a word of the sentence, then use list.index() to find the positions at which the start and end words first occur. We can then take the words in the list between those indices. Since we want a string back rather than a list, we join them with a space.
# list of words ['This', 'is', 'a', 'test', ...]
words = scenario.split()
# list of words between start and end ['to', 'print', ..., 'the']
matching_words = words[words.index(start)+1:words.index(end)]
# join back to one string with spaces between
' '.join(matching_words)
Result:
to print out all the
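One caveat: list.index() returns the first occurrence, so if the end word also appeared before the start word, the slice would come out empty. A minimal sketch that guards against this by searching for the end word only after the start word; the function name words_between is made up for this example.

```python
def words_between(sentence, start, end):
    """Return the words after the first start word, up to the next end word."""
    words = sentence.split()
    start_idx = words.index(start)  # raises ValueError if start is missing
    # Begin the search for end just past start, so an earlier
    # occurrence of end cannot produce an empty or reversed slice.
    end_idx = words.index(end, start_idx + 1)
    return " ".join(words[start_idx + 1:end_idx])


scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
print(words_between(scenario, "Python", "words"))  # to print out all the
```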

Your initial problem is that you're iterating over scenario, the string, character by character instead of splitting it into separate words (use scenario.split()). There are also other issues around switching to searching for the end word once the start has been found. Instead, you might like to use index to find the two strings and then slice the string:
scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
start = "Python"
end = "words"
start_idx = scenario.index(start)
end_idx = scenario.index(end)
print(scenario[start_idx + len(start):end_idx].strip())

You can accomplish this with a simple regex:
import re
txt = "This is a test to see if I can get Python to print out all the words in between Python and words"
x = re.search(r"(?<=Python\s).*?(?=\s+words)", txt)
print(x.group())  # to print out all the
Here is the regex in action --> REGEX101
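If the two keywords are variables rather than fixed text, the pattern can be built at run time. This is a sketch under that assumption; the function name text_between is made up, and it uses a capture group instead of the lookarounds above. re.escape keeps any regex metacharacters in the keywords literal.

```python
import re


def text_between(text, start, end):
    """Return the text between the first start and the next end, or None."""
    # Lazy (.*?) stops at the first end keyword that follows start.
    pattern = re.escape(start) + r"\s+(.*?)\s+" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None


txt = "This is a test to see if I can get Python to print out all the words in between Python and words"
print(text_between(txt, "Python", "words"))  # to print out all the
```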

Split the string and go over it word by word to find the index at which the two keywords occur. Once you have those two indices, combine the list between those indices into a string.
scenario = 'This is a test to see if I can get Python to print out all the words in between Python and words'
start_word = 'Python'
end_word = 'words'
# Split the string into a list of words
# (named `words` so the built-in `list` is not shadowed)
words = scenario.split()
# Find start and end indices
start = words.index(start_word) + 1
end = words.index(end_word)
# Construct a string from the words at indices between `start` and `end`
result = ' '.join(words[start:end])
# Print the result
print(result)

In Python, how to delete some words in a string according to a list?

This is what I came up with before getting stuck (NB: the source of the text is The Economist):
import random
import re
text = 'One calculation by a film consultant implies that half of Hollywood productions with budgets over one hundred million dollars lose money.'
nbofwords = len(text.split())
words = text.split()
randomword = random.choice(words)
randomwordstr = str(randomword)
Step 1 works: delete the random word from the original text
replaced1 = re.sub(randomwordstr, '', text)
replaced2 = re.sub('  ', ' ', replaced1)
Step 2 works: select a defined number of random words
nbofsamples = 3
randomitems = random.choices(population=words, k=nbofsamples)
gives, e.g., ['over', 'consultant', 'One']
Step 3 works: delete from the original text one element of that list of random words, using its index
replaced3 = re.sub(randomitems[1], '', text)
replaced4 = re.sub('  ', ' ', replaced3)
deletes the word 'consultant'
Step 4 fails: delete from the original text all the elements of that list of random words, using their indices
The best I can figure out is:
replaced5 = re.sub(randomitems[0],'',text)
replaced6 = re.sub(randomitems[1],'',replaced5)
replaced7 = re.sub(randomitems[2],'',replaced6)
replaced8 = re.sub('  ', ' ', replaced7)
print(replaced8)
It works (all 3 words have been deleted), but it is clumsy and inefficient (I would have to rewrite it if I changed the nbofsamples variable).
How can I iterate over my list of random words (step 2) to delete those words from the original text?
Thanks in advance.

To delete the words in a list from a string, just use a for loop. It iterates through each item in the list, assigning the item's value to a variable of your choice (here I used "i", but it can be any valid variable name), and executes the code in the loop body until there are no more items in the list. Here's the bare-bones version of a for loop:
items = []
for i in items:
    print(i)
In your case you want to remove the words in the list from a string, so just plug the variable "i" into the same method you've been using to remove the words. You also need a variable that is updated on every pass; otherwise the loop would only remove the last word in the list from the string. After that you can print the output. This code will work with a list of any length.
r = text  # start from the original text
for i in randomitems:
    replaced4 = re.sub(i, '', r)
    r = replaced4
print(replaced4)

Note that as long as you are not using regular expressions but just replacing simple strings with others (or with nothing), you don't need re:
for r in randomitems:
    text = text.replace(r, '')
print(text)
To replace only the first occurrence, pass the desired number of replacements to the replace function:
text = text.replace(r, '', 1)
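Both re.sub and str.replace, as used above, also match inside longer words: removing 'one' turns 'money' into 'my'. Here is a sketch of a whole-word variant, assuming the words to remove contain only letters; the function name remove_words is made up for this example. A token that ends in punctuation, such as 'money.', would not match, because \b needs a word character next to it.

```python
import re


def remove_words(text, words_to_remove):
    """Delete whole-word occurrences of each word, then tidy the spaces."""
    # \b anchors keep 'one' from matching inside 'money';
    # re.escape protects words that contain regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words_to_remove) + r")\b"
    stripped = re.sub(pattern, "", text)
    # Collapse the runs of spaces left behind by the deletions.
    return re.sub(r" {2,}", " ", stripped).strip()


text = 'One calculation by a film consultant implies that half of Hollywood productions with budgets over one hundred million dollars lose money.'
print(remove_words(text, ['over', 'consultant', 'One']))
```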

Python: how to skip the part in a string marked by certain symbols?

I'm trying to reconstruct a sentence by matching the words in a word list one-to-one against the sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i)
        text=final
print(final)
The expected output is:
a cat is an animal
If I run my code, the 'a' and 'an' inside 'animal' inevitably get separated too.
So I want to sort the word list by length and search for the long words first.
words.sort(key=len)
words=words[::-1]
Then I would like to mark the long words with special symbols and have the program skip the parts I marked. For example:
acatisan%animal&
Finally I would erase the symbols. But I'm stuck here: I don't know how to make the program skip the parts between '%' and '&'. Can anyone help me? Or are there better ways to solve the spacing problem? Many thanks!
For another case, what if the text includes words that are not in the word list? How could I handle this?
text='wowwwwacatisananimal'

A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:
def compose(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            return ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result+[word]))
>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'
If there are potentially multiple possible sentence compositions, it would be trivial to turn this into a generator by replacing return with yield, so that it yields all matching sentence compositions.
Contrived example (just replace return with yield):
>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
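For reference, a sketch of the generator variant described above, written out in full. The name compose_all is made up to keep it apart from compose; the only changes are yield in place of return and a continue to move on to the next candidate.

```python
def compose_all(letters, words):
    """Yield every sentence that spells letters using only the given words."""
    stack = [(letters, [])]
    while stack:
        remaining, result = stack.pop()
        if not remaining:
            # All letters consumed: this is a complete composition.
            yield " ".join(result)
            continue
        for word in words:
            if remaining.startswith(word):
                stack.append((remaining[len(word):], result + [word]))


words = ['adult', 'sex', 'adults', 'exchange', 'change']
print(list(compose_all('adultsexchange', words)))
```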

Maybe you can replace each word with its index in the list, so that the intermediate string looks like 3 0 1 2 4, and then convert it back to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in sorted(words,key=len,reverse=True):
    if i in text:
        final=text.replace(i,' %s'%words.index(i))
        text=final
print(" ".join(words[int(i)] for i in final.split()))
Output:
a cat is an animal

You need a small modification in your code. Change the line
final=text.replace(i,' '+i)
to
final=text.replace(i,' '+i, 1)
This replaces only the first occurrence.
So the updated code would be:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i, 1)
        text=final
print(final)
Output is:
a cat is an animal

If you are stuck on the part about removing only the symbols, then a regex is what you are looking for. Import the module called re and do this:
import re
# ... your code here ...
print(re.sub(r'\W+', ' ', final))

I wouldn't recommend using different delimiters on either side of your matched words (% and & in your example).
It's easier to use the same delimiter on either side of each marked word and use Python's list slicing.
The solution below uses the [::n] syntax for getting every nth element of a list.
a[::2] gets even-numbered elements, a[1::2] gets the odd ones.
>>> fox = "the|quick|brown|fox|jumpsoverthelazydog"
Because they have | characters on either side, 'quick' and 'fox' are odd-numbered elements when you split the string on |:
>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']
and the rest are even:
>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']
So, by enclosing known words in | characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.
from itertools import chain
def flatten(list_of_lists):
    return chain.from_iterable(list_of_lists)
def parse(source_text, words):
    words.sort(key=len, reverse=True)
    texts = [source_text, ''] # even number of elements helps zip function
    for word in words:
        new_matches_and_text = []
        for text in texts[::2]:
            new_matches_and_text.append(text.replace(word, f"|{word}|"))
        previously_matched = texts[1::2]
        # merge new matches back in
        merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
        texts = merged.split('|')
    # remove blank words (matches at start or end of a string)
    texts = [text for text in texts if text]
    return ' '.join(texts)
>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an', 'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'
The merge code uses the zip and flatten functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
This approach leaves the unrecognised words in the text.
'beautiful' and 'a' are handled well because they're on their own (i.e. next to recognised words.)
'enormous' and 'scary' are not known and, as they're next to each other, they're left stuck together.
Here's how to list the unknown words:
>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']
I'm curious: is this a bioinformatics project?

A list comprehension is another way to do it:
result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])
So, I used zip to combine the words with their positions in the text, sorted the words by their position in the original text, and finally joined the result with ' '.
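The same idea can be written more compactly by passing text.find as the sort key. This is a sketch of an equivalent form, not the answer's own code; like the original, it relies on each word appearing once and on str.find returning the first occurrence, so a repeated word would be listed only once.

```python
text = "acatisananimal"
words = ["cat", "is", "an", "a", "animal"]

# str.find returns the index of the first occurrence of each word,
# so sorting by it puts the words in the order they first appear.
result = " ".join(sorted(words, key=text.find))
print(result)  # a cat is an animal
```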

How might I create an acronym by splitting a string at the spaces, taking the character indexed at 0, joining it together, and capitalizing it?

My code
beginning = input("What would you like to acronymize? : ")
second = beginning.upper()
third = second.split()
fourth = "".join(third[0])
print(fourth)
I can't seem to figure out what I'm missing. The code is supposed to take the phrase the user inputs, put it all in caps, split it into words, join the first character of each word together, and print it. I feel like there should be a loop somewhere, but I'm not entirely sure if that's right or where to put it.

Say the input is "Federal Bureau of Agencies".
Typing third[0] gives you the first element of the split, which is "FEDERAL" (the input was already uppercased). You want the first character of each element in the split. Use a generator expression or list comprehension to apply [0] to each item in the list:
val = input("What would you like to acronymize? ")
print("".join(word[0] for word in val.upper().split()))
In Python, it would not be idiomatic to use an explicit loop here. Generator expressions are shorter and easier to read, and do not require the use of an explicit accumulator variable.

When you run the code third[0], Python will index the variable third and give you the first part of it.
The results of .split() are a list of strings. Thus, third[0] is a single string, the first word (all capitalized).
You need some sort of loop to get the first letter of each word, or else you could do something with regular expressions. I'd suggest the loop.
Try this:
fourth = "".join(word[0] for word in third)
There is a little for loop inside the call to .join(). Python calls this a "generator expression". The variable word will be set to each word from third, in turn, and then word[0] gets you the char you want.

It works for me this way:
>>> a = "What would you like to acronymize?"
>>> a.split()
['What', 'would', 'you', 'like', 'to', 'acronymize?']
>>> ''.join([i[0] for i in a.split()]).upper()
'WWYLTA'
>>>

One intuitive approach would be:
get the sentence using input (or raw_input in python 2)
split the sentence into a list of words
get the first letter of each word
join the letters with a space string
Here is the code:
sentence = raw_input('What would you like to acronymize?: ')
words = sentence.split()  # split the sentence into words
just_first_letters = []  # a list containing just the first letter of each word
# traverse the list of words, adding the first letter of
# each word to just_first_letters
for word in words:
    just_first_letters.append(word[0])
result = " ".join(just_first_letters)  # join the list of first letters
print result
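For comparison, a Python 3 sketch of the same explicit-loop steps, under the assumption that the letters should be uppercased and run together (as in "FBOA") rather than separated by spaces. The function name acronym is made up for this example.

```python
def acronym(sentence):
    """Build an acronym from the first letter of each word."""
    letters = []
    for word in sentence.split():
        # Take the first character and capitalize it.
        letters.append(word[0].upper())
    # Join with an empty string so the letters run together.
    return "".join(letters)


print(acronym("Federal Bureau of Agencies"))  # FBOA
```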

# acronym2.py
# illustrating how to design an acronym
# Python 2 only: string.split and string.join were removed in Python 3
import string
def main():
    sent = raw_input("Enter the sentence: ")  # take input sentence with spaces
    # split the string so each word becomes a string
    for i in string.split(string.capwords(sent)):
        # loop through the split string(s) and concatenate the first
        # letter of each of the split strings to get your acronym
        print string.join(i[0]),
main()

name = input("Enter uppercase with lowercase name")
print('the original string = ' + name)
def uppercase(name):
    # keep only the characters that are uppercase letters
    res = [char for char in name if char.isupper()]
    print("The uppercase characters in string are : " + "".join(res))
uppercase(name)
