Pythonically remove the first X words of a string - python

How do I do it Pythonically?
I know how to delete the first word, but now I need to remove three.
Note that words can be delimited by any amount of whitespace, not just a single space (although I could enforce a single space if I must).
[Update] I mean any X words; I don't know what they are.
I am considering looping and repeatedly removing the first word, joining together again, rinsing and repeating.

s = "this is my long sentence"
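print(' '.join(s.split()[3:]))
This will print
"long sentence"
Which I think is what you need. Calling split() with no argument splits on any run of whitespace, so multiple spaces between words are handled (they are collapsed to single spaces in the result).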

Try:
import re
print(re.sub(r"\w+\s*", "", "a sentence is cool", count=3))
Prints cool
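A minimal sketch of the same regex idea wrapped in a function, assuming a "word" is any run of non-whitespace characters (so punctuation such as "don't" stays in one piece). The name drop_first_words is made up for illustration.

```python
import re

def drop_first_words(text, x):
    """Remove the first x whitespace-delimited words from text."""
    if x <= 0:
        # count=0 would tell re.sub to replace every match, so bail out early
        return text
    # Each match is one word plus the whitespace around it;
    # count=x limits the substitution to the first x words.
    return re.sub(r"\s*\S+\s*", "", text, count=x)

print(drop_first_words("a   sentence is\tcool", 3))  # cool
```

The guard for x <= 0 matters because re.sub treats count=0 as "no limit".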

This can be done simply with slicing:
In [7]: s = 'Hello, this is long string'
In [8]: s = s[3:]
In [9]: s
Out[9]: 'lo, this is long string'
Replace the 3 on line In [8] with your X. Note that this slice removes the first X characters, not the first X words.

You can use the split method to do this. Essentially, it splits the string into individual words (separated by whitespace by default). These words are stored in a list, and from that list you can access the words you want, just like you would with a list of any other data type. You can then join the desired words back into a string.
For example:
s = 'This is a bunch of words'
string_list = s.split()
# The string is now stored in a list that looks like:
# ['This', 'is', 'a', 'bunch', 'of', 'words']
new_string_list = string_list[3:]
# the list is now: ['bunch', 'of', 'words']
new_string = ' '.join(new_string_list)
# you now have the string 'bunch of words'
You can also do this in fewer lines, if desired (not sure if this is Pythonic though):
s = 'This is a bunch of words'
new_string = ' '.join(s.split()[3:])
print(new_string)
# output would be 'bunch of words'

You can use split:
>>> x = 3 # number of words to remove from beginning
>>> s = 'word1 word2 word3 word4'
>>> s = " ".join(s.split()) # remove multiple spacing
>>> s = s.split(" ", x)[x] # split and keep elements after index x
>>> s
'word4'
This will handle multiple spaces as well.
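If you would rather keep the spacing of the remaining text exactly as it was, here is a small sketch using the maxsplit argument of str.split. The function name remove_first_words is just an illustration, and it returns an empty string when there are not more than x words.

```python
def remove_first_words(s, x):
    """Drop the first x words; keep the rest of the string untouched."""
    # With sep=None, split() skips leading whitespace and treats any run of
    # whitespace as one delimiter; maxsplit=x leaves the remainder unsplit.
    parts = s.split(None, x)
    return parts[x] if len(parts) > x else ""

print(remove_first_words("word1   word2 word3\tword4  word5", 3))  # word4  word5
```

Unlike normalizing with " ".join(s.split()) first, this does not collapse the double space between word4 and word5.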

Related

change the first word of string to the first letter

I want to change the first word of a string to the first letter of that word. For organisms, you can write "Arabidopsis thaliana" or "A. thaliana".
Because the names are sometimes too long for my purpose, I want to shorten them this way.
I tried to find a similar question, but they are always about removing the first word, making the first letter uppercase, or replacing the first word with a specific character, never with the first character of the word itself.
Use replace():
>>> s = 'Arabidopsis thaliana'
>>> s.replace(s.split()[0], s[0])
'A thaliana'
In the rare case, pointed out by mrCarnivore, where the first word occurs multiple times, we can pass the optional count argument to limit the number of replacements:
>>> s = 'Arabidopsis Arabidopsis thaliana'
>>> s.replace(s.split()[0], s[0], 1)
'A Arabidopsis thaliana'
This works:
s = 'Arabidopsis thaliana bologna'
l = s.split()
s2 = l[0][0] + '. ' + ' '.join(l[1:])
print(s2)
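The split-and-join answer can be wrapped into a small helper that also copes with empty or one-word input. This is only a sketch; the name abbreviate_first_word is invented here, and it assumes the abbreviation should always be the first character followed by a period.

```python
def abbreviate_first_word(name):
    """Replace the first word of name with its initial followed by a period."""
    # Split off only the first word; the rest of the string stays as it is.
    parts = name.split(None, 1)
    if not parts:
        return name  # empty or whitespace-only input: nothing to abbreviate
    initial = parts[0][0] + "."
    return initial if len(parts) == 1 else initial + " " + parts[1]

print(abbreviate_first_word("Arabidopsis thaliana"))  # A. thaliana
```

Because only the first word is touched, a repeated genus name later in the string is left alone.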

Python How to skip the part in a string marked by certain symbols?

I'm trying to reconstruct a sentence by matching the words in a word list one-to-one against the text:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i)
        text=final
print(final)
the expected output will be like:
a cat is an animal
If I run my code, the 'a' and 'an' in 'animal' will be unavoidably separated too.
So I want to sort the word list by the length, and search for the long words first.
words.sort(key=len)
words=words[::-1]
Then I would like to mark the long words with special symbols, and have the program skip the parts I marked. For example:
acatisan%animal&
And finally I will erase the symbols. But I'm stuck here. I don't know how to make the program skip the parts between '%' and '&'. Can anyone help me? Or are there better ways to solve the spacing problem? Many thanks!
For another case, what if the text includes words that are not in the word list? How could I handle this?
text='wowwwwacatisananimal'
A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:
def compose(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            return ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result+[word]))
>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'
If there are potentially multiple possible sentence compositions, it would be trivial to turn this into a generator by replacing return with yield, so that it yields all matching sentence compositions.
Contrived example (just replace return with yield):
>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
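For reference, a sketch of what the generator variant described above might look like. The name compose_all is made up so it does not clash with compose, and the explicit continue is an addition for clarity rather than something the original needs.

```python
def compose_all(letters, words):
    """Yield every way to split letters into a sequence of known words."""
    stack = [(letters, [])]  # (text still to split, words found so far)
    while stack:
        remaining, result = stack.pop()
        if not remaining:
            # Everything was consumed: this is one complete composition.
            yield " ".join(result)
            continue
        for word in words:
            if remaining.startswith(word):
                stack.append((remaining[len(word):], result + [word]))

words = ['adult', 'sex', 'adults', 'exchange', 'change']
print(list(compose_all('adultsexchange', words)))
# ['adults exchange', 'adult sex change']
```

If the text contains letters that no word in the list covers (as in 'wowwwwacatisananimal'), this version simply yields nothing, so unknown words would need separate handling.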
Maybe you can replace each word with its index, so that the intermediate string looks like 3 0 1 2 4, and then convert it back to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in sorted(words,key=len,reverse=True):
    if i in text:
        final=text.replace(i,' %s'%words.index(i))
        text=final
print(" ".join(words[int(i)] for i in final.split()))
Output:
a cat is an animal
You need a small modification in your code. Change the line
final=text.replace(i,' '+i)
to
final=text.replace(i,' '+i, 1)
This will replace only the first occurrence.
So the updated code would be
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i, 1)
        text=final
print(final)
Output is:
a cat is an animal
If you are stuck on the part about removing only the symbols, then regex is what you are looking for. Import the re module and do this:
import re
# ... your code here ...
print(re.sub(r'\W+', ' ', final))
I wouldn't recommend using different delimiters on either side of your matched words (% and & in your example).
It's easier to use the same delimiter on either side of your marked word and use Python's list slicing.
The solution below uses the [::n] syntax for getting every nth element of a list.
a[::2] gets even-numbered elements, a[1::2] gets the odd ones.
>>> fox = "the|quick|brown|fox|jumpsoverthelazydog"
Because they have | characters on either side, 'quick' and 'fox' are odd-numbered elements when you split the string on |:
>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']
and the rest are even:
>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']
So, by enclosing known words in | characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.
from itertools import chain

def flatten(list_of_lists):
    return chain.from_iterable(list_of_lists)

def parse(source_text, words):
    words.sort(key=len, reverse=True)
    texts = [source_text, '']  # even number of elements helps zip function
    for word in words:
        new_matches_and_text = []
        for text in texts[::2]:
            new_matches_and_text.append(text.replace(word, f"|{word}|"))
        previously_matched = texts[1::2]
        # merge new matches back in
        merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
        texts = merged.split('|')
    # remove blank words (matches at start or end of a string)
    texts = [text for text in texts if text]
    return ' '.join(texts)
>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an', 'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'
The merge code uses the zip and flatten functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
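To see the merge step in isolation, here is a tiny sketch using the fox example. The two lists are written out by hand for illustration, with an empty string padding the odd-numbered list so both have the same length, which is the reason parse keeps an even number of elements.

```python
from itertools import chain

unmatched = ["the", "brown", "jumpsoverthelazydog"]  # even-numbered elements
matched = ["quick", "fox", ""]                       # odd-numbered elements, padded

# zip pairs each unmatched chunk with the matched word that follows it;
# chain.from_iterable flattens the pairs back into one list.
merged = list(chain.from_iterable(zip(unmatched, matched)))
print(merged)
# ['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog', '']
```

Note that zip stops at the shorter input, so without the padding the last unmatched chunk would be dropped.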
This approach leaves the unrecognised words in the text.
'beautiful' and 'a' are handled well because they're on their own (i.e. next to recognised words.)
'enormous' and 'scary' are not known and, as they're next to each other, they're left stuck together.
Here's how to list the unknown words:
>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']
I'm curious: is this a bioinformatics project?
List and dict comprehension is another way to do it:
result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])
So, I used zip to combine words and their position in text, sorted the words by their position in original text and finally joined the result with ' '.
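Another option, not taken from the answers above, is a regular expression that tries the longest words first. This is a rough sketch: split_known_words is a made-up name, the matching is greedy (it never reconsiders an earlier choice, so some texts that need a shorter word first will not split correctly), and any letters not covered by the word list are silently dropped.

```python
import re

def split_known_words(text, words):
    """Greedily pick known words from text, preferring longer words."""
    if not words:
        return ""
    # Longest first, so 'animal' is tried before 'an' and 'a' at each position.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return " ".join(pattern.findall(text))

print(split_known_words('acatisananimal', ['cat', 'is', 'an', 'a', 'animal']))
# a cat is an animal
```

With 'wowwwwacatisananimal' the leading 'wowwww' matches nothing and is skipped, so the result is the same sentence.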

Capitalize first character of a word in a string

How is one of the following versions different from the other?
The following code capitalizes the first letter of each word in the string:
s = ' '.join(i[0].upper() + i[1:] for i in s.split())
The following code prints only the last word, with every character separated by a space:
for i in s.split():
    s=' '.join(i[0].upper()+i[1:])
print(s)
For completeness and for people who find this question via a search engine, the proper way to capitalize the first letter of every word in a string is to use the title method.
>>> capitalize_me = 'hello stackoverlow, how are you?'
>>> capitalize_me.title()
'Hello Stackoverlow, How Are You?'
for i in s.split():
At this point i is a word.
s = ' '.join(i[0].upper() + i[1:])
Here, i[0] is the first character of the word, and i[1:] is the rest of the word, so i[0].upper() + i[1:] is a single string: the capitalized word. The line is therefore a shortcut for s = ' '.join(capitalized_s), where capitalized_s is that one word.
The str.join() method takes a single iterable as its argument. In this case the iterable is a string, but that makes no difference: for something such as ' '.join("this"), str.join() iterates through each element of the iterable (each character of the string) and puts a space between each one. Result: t h i s
There is, however, an easier way to do what you want: s = s.title()
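One thing worth knowing before settling on title(): it treats every run of letters as a word and lowercases the rest of each word. A short comparison, using a sample string chosen here for illustration:

```python
import string

s = "they're here, hello WORLD"

# title() restarts capitalization after the apostrophe and lowercases "WORLD"
print(s.title())           # They'Re Here, Hello World

# string.capwords() splits on whitespace and capitalizes each piece
print(string.capwords(s))  # They're Here, Hello World

# The generator expression from the question leaves the rest of each word alone
print(" ".join(w[0].upper() + w[1:] for w in s.split()))  # They're Here, Hello WORLD
```

Which of the three is right depends on whether apostrophes and existing capitals in the input matter to you.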

how do i make a bunch of words into a list?

I've got a random bunch of words and I need to make them into a list, but there is a problem: I must take the words as they are and convert them into a list in the program itself.
For example, I got this raw input:
hello,mike,cat,dog,burger
Now how do I take these 5 words and have my program turn each one into a proper string, like so: "hello","mike","cat","dog","burger"
You can use the split method
>>> s = "hello,mike,cat,dog,burger"
>>> l = s.split(',')
>>> l
['hello', 'mike', 'cat', 'dog', 'burger']
You're looking for str.split
the_string = '"hello","mike","cat","dog","burger"'
the_list = the_string.split(",") # split on a literal comma
Note that this requires that the user properly format the string, and that there are no leading or trailing spaces around the items (e.g. Hello, dog becomes ["Hello", " dog"]). Consider building some sanity checks for the string, and possibly mapping each item through str.strip.
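A possible sketch of that clean-up step, assuming you want to trim spaces, drop surrounding double quotes, and ignore empty items. The helper name parse_words is hypothetical.

```python
def parse_words(raw):
    """Split a comma-separated string into a list of cleaned-up words."""
    # Strip whitespace first, then any double quotes wrapped around the item.
    items = (item.strip().strip('"') for item in raw.split(","))
    # Drop items that are empty after cleaning (e.g. from a doubled comma).
    return [item for item in items if item]

print(parse_words('hello, mike ,cat,,dog,burger'))
# ['hello', 'mike', 'cat', 'dog', 'burger']
```

For input where the items themselves may contain commas inside quotes, the csv module in the standard library is likely a better fit than a plain split.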

Removing list of words from a string

I have a list of stopwords. And I have a search string. I want to remove the words from the string.
As an example:
stopwords=['what','who','is','a','at','is','he']
query='What is hello'
Now the code should strip 'What' and 'is'. However in my case it strips 'a', as well as 'at'. I have given my code below. What could I be doing wrong?
for word in stopwords:
    if word in query:
        print(word)
        query=query.replace(word,"")
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?
This is one way to do it:
query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
I noticed that you want to also remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.
The accepted answer works when provided a list of words separated by spaces, but that's not the case in real life, where punctuation can also separate the words. In that case re.split is required.
Also, testing against stopwords as a set makes lookup faster (even if, with a small number of words, there's a tradeoff between string hashing and lookup).
My proposal:
import re
query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}
resultwords = [word for word in re.split(r"\W+",query) if word.lower() not in stopwords]
print(resultwords)
output (as list of words):
['hello','Says','']
There's an empty string at the end because re.split emits an empty field when the string ends with a separator, and that needs filtering out. Two solutions here:
resultwords = [word for word in re.split(r"\W+",query) if word and word.lower() not in stopwords] # filter out empty words
or add the empty string to the set of stopwords :)
stopwords = {'what','who','is','a','at','is','he',''}
now the code prints:
['hello','Says']
Building on what karthikr said, try
' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))
explanation:
query.split() #splits variable query on whitespace, i.e. "What is hello" -> ["What","is","hello"]
filter(func,iterable) #takes in a function and an iterable (list/string/etc..) and
# filters it based on the function which will take in one item at
# a time and return True/False
lambda x: x.lower() not in stopwords # anonymous function that takes in variable,
# converts it to lower case, and returns true if
# the word is not in the iterable stopwords
' '.join(iterable) #joins all items of the iterable (items must be strings/chars)
#using the string/char in front of the dot, i.e. ' ' as a joiner.
# i.e. ["What", "is","hello"] -> "What is hello"
Looking at the other answers to your question I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?
This happens because .replace() replaces the substring you give it exactly.
for example:
"My, my! Hello my friendly mystery".replace("my", "")
gives:
>>> "My, ! Hello friendly stery"
.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.
"hello".replace("he", "je")
is logically similar to:
"je".join("hello".split("he"))
If you were still wanting to use .replace to remove whole words you might think adding a space before and after would be enough, but this leaves out words at the beginning and end of the string as well as punctuated versions of the substring.
"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"
"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"
"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"
Additionally, adding spaces before and after will not catch consecutive duplicates: the space that ends the first match has already been consumed, so it cannot also start the second match:
"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
For these reasons the accepted answer by Robby Cornelissen is the recommended way to do what you want.
" ".join([x for x in query.split() if x.lower() not in stopwords])
stopwords=['for','or','to']
p='Asking for help, clarification, or responding to other answers.'
for i in stopwords:
    n=p.replace(i,'')
    p=n
print(p)
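Plain replace only works here because none of the stopwords happens to occur inside another word. If you want to stay with a substitution-based approach, a hedged alternative is a regex with \b word boundaries, so that only whole words are removed. The function name remove_stopwords is invented for this sketch, and it assumes the stopwords consist of ordinary word characters.

```python
import re

def remove_stopwords(query, stopwords):
    """Remove whole-word, case-insensitive occurrences of the stopwords."""
    # \b ... \b only matches at word boundaries, so 'he' will not match in 'hello'.
    alternatives = "|".join(map(re.escape, stopwords))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    # Collapse the leftover runs of spaces after the removal.
    return " ".join(pattern.sub("", query).split())

print(remove_stopwords('What is hello', ['what', 'who', 'is', 'a', 'at', 'is', 'he']))
# hello
```

Punctuation that was attached to a removed word stays behind, so you may still want to clean that up separately.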
