Python: 4 words before a substring if they exist

I want to get the 4 words before a particular word. If there are only 3 words before it, I want those 3 words to be printed.
Example:
input: There is a bad CAT sitting on the wall
output: is a bad
line holds the sentence.
if 'CAT' in line:
    print(line.split('CAT')[0].split()[len((line.split('CAT')[0]))-3): len(line.split('CAT')[0])])
Can you let me know if I am missing anything and whether there is a more efficient way?
I am planning to do line.split('CAT')[0] to get all the data before CAT.
Then, on that result, I want to get the output starting at len-3 up to len.
It's giving an error; am I missing anything?
Also, can I add a condition so that if there are only 2 words, it prints only 2?

You're on the right track. If you want to get the three words before D in a string S, falling back to fewer words if fewer than three are available, you can use this:
S.split(D)[0].split()[-3:]
Examples:
>>> S = 'There is a bad CAT sitting on the wall'
>>> S.split('CAT')[0].split()[-3:]
['is', 'a', 'bad']
>>> S = 'The bad CAT is sitting on the wall'
>>> S.split('CAT')[0].split()[-3:]
['The', 'bad']
Of course, if you wish to join this back into a string, you can use:
' '.join(S.split(D)[0].split()[-3:])
This can also be accomplished using regular expressions, but I doubt it would offer much better performance.
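For reference, a rough sketch of the regex version of the same idea (my own example; it captures up to three whitespace-separated words immediately before 'CAT' and ignores word-boundary subtleties):
import re

line = 'There is a bad CAT sitting on the wall'
# Greedily capture up to three non-space runs (each followed by whitespace) just before CAT.
m = re.search(r'((?:\S+\s+){0,3})CAT', line)
if m:
    print(m.group(1).split())   # ['is', 'a', 'bad']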

Split the line at the start, then find the index of the word you want in the resulting list. You can then slice the list (making sure that you don't start the slice at less than zero), and join it back together again. If there are fewer than 3 words preceding it, it will only show what is there.
line = "There is a bad CAT sitting on the wall"
sline = line.split(' ')
if 'CAT' in sline:
    pos = sline.index('CAT')
    print(' '.join(sline[max(0, pos-3):pos]))

IMO, trying to do all this in one line makes things too confusing. I recommend that you break it up into smaller parts.
if line.find('CAT') != -1:
    words = line.split('CAT')[0].strip().split(' ')
    print(words[max(len(words) - 3, 0):])
Explanation of some things:
Yes, some people will think 'CAT' in line is more Pythonic, but I prefer line.find('CAT') != -1, as it won't ignore some errors that might occur if line isn't a string. See the str.find() documentation for details about the function.
The strip() in the 2nd line assures that trailing spaces are removed.
The final line finds the position of the third word before 'CAT', if there is one, and then prints the appropriate words out as a list. As noted in other answers, you can use str.join() to put them back together as a string if you want.


Removing a word that contains symbols such as "#", "@", or ":" in Python

I have just started learning Python this semester and we were given some revision exercises. However, I am stuck on one of the questions. The text file given contains tweets from the 2016 US elections. Sample below:
I wish they would show out takes of Dick Cheney #GOPdebates
Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us.
It seems like Ben Carson REALLY doesn't want to be there. #GOPdebates
RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate
The question requires me to write a Python program that reads from the file tweets.txt. Remember that each line contains one tweet. For each tweet, your program should remove any word that is less than 8 characters long, and also any word that contains a hash (#), at (@), or colon (:) character. What I have now:
for line in open("tweets.txt"):
    aline = line.strip()
    words = aline.split()
    length = len(words)
    remove = ['#', '@', ':']
    for char in words:
        if "#" in char:
            char = ''
        if "@" in char:
            char = ''
        if ":" in char:
            char = ''
which did not work, and the resulting list still contains #, @, or :. Any help appreciated! Thank you!
Assigning char='' in the loop does not change or remove the actual char (actually a word) in the list; it just assigns a different value to the variable char.
Instead, you might use a list comprehension / generator expression for filtering the words that satisfy the conditions.
>>> tweet = "Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us."
>>> [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
['Candidates', 'remained']
Optionally, use ' '.join(...) to join the remaining words back to a "sentence", although that might not make too much sense.
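For instance, a rough sketch of applying that filter to every line of the file might look like this (assuming the tweets.txt file name from the question):
with open('tweets.txt') as f:
    for tweet in f:
        # keep words that are at least 8 characters long and contain none of #, @, :
        kept = [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
        print(' '.join(kept))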
Use this code.
import re
tweet = re.sub(r'#', '', tweet)
tweet = re.sub(r'@', '', tweet)
tweet = re.sub(r':', '', tweet)
The code below will open the file (it's usually better to use "with open" when working with files), loop through all the lines, and remove the '#@:' characters using translate. It then drops the words with fewer than 8 characters, giving you the output new_line.
with open('tweets.txt') as rf:
    for sentence in rf:
        line = sentence.strip()
        line = line.translate({ord(i): None for i in '#@:'})
        line = line.split()
        new_line = [word for word in line if len(word) >= 8]
        print(new_line)
It's not the most succinct way and there are definitely better ways to do it, but it's probably a bit easier to read and understand, seeing as you've just started learning, like me.

Python - Capture string with or without specific character

I am trying to capture the part of the sentence after a specific word. Each sentence in my code is different, and the sentences don't necessarily contain the specific word to split by. If the word doesn't appear, I just need a blank string or list.
Example 1: working
my_string="Python is a amazing programming language"
print(my_string.split("amazing",1)[1])
programming language
Example 2:
my_string="Java is also a programming language."
print(my_string.split("amazing",1)[1]) # amazing word doesn't appear in the sentence.
Error: IndexError: list index out of range
Output needed: an empty string or list, etc.
I tried something like this, but it still fails.
my_string.split("amazing",1)[1] if my_string.split("amazing",1)[1] == None else my_string.split("amazing",1)[1]
When you use .split(), you can specify what part of the resulting list you want with either integers or slices. If you want to check for a specific word in your string, you can do something like this:
my_str = "Python is cool"
my_str_list = my_str.split()
if 'cool' in my_str_list:
    print(my_str)
output:
"Python is cool"
Otherwise, you can run a for loop over a list of strings to check whether the word appears in each one.
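A minimal sketch of that loop idea, with made-up example sentences:
sentences = ["Python is a amazing programming language",
             "Java is also a programming language."]
for s in sentences:
    if 'amazing' in s.split():
        print(s)   # only prints sentences that contain the word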
You have some options here. You can split and check the result:
tmp = my_string.split("amazing", 1)
result = tmp[1] if len(tmp) > 1 else ''
Or you can check for containment up front:
result = my_string.split("amazing", 1)[1] if 'amazing' in my_string else ''
The first option is more efficient if most of the sentences have matches, the second one if most don't.
Another option similar to the first is
result = my_string.split("amazing", 1)[-1]
if result == my_string:
    result = ''
In all cases, consider doing something equivalent to
result = result.lstrip()
Instead of using index 1, use index -1. This gets the last item in the list, so it never raises an IndexError.
my_string = "Java is also a programming language."
print(my_string.split("amazing", 1)[-1])
This prints the whole string 'Java is also a programming language.', since "amazing" does not appear and the split produces a single-element list.

Capitalizing the beginning of sentences in Python

The following code is for an assignment that asks that a string of sentences be entered by the user and that the beginning of each sentence be capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalize(sentences)

#This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    new_sentences = []
    count = 0
    for count in range(len(sent_list)):
        new_sentences = sent_list[count]
        new_sentences = (new_sentences + '. ')
        print(new_sentences.capitalize())

main()
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?
The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full-stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
which normalizes the whitespace, which may or may not be what you want.
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalizeFunc(sentences)

def capitalizeFunc(user_sentences):
    sent_list = user_sentences.split('. ')
    print(".".join((i.capitalize() for i in sent_list)))

main()
Output:
Enter sentences with lowercase letters: "hello. these are sample sentences. there are three of them."
Hello.These are sample sentences.There are three of them.
I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
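A rough sketch of that enumerate approach (using the sample sentences; variable names are my own):
sent_list = ['hello', 'these are sample sentences', 'there are three of them.']
for index, sentence in enumerate(sent_list):
    if index != 0:
        print('. ', end='')              # separator before every sentence except the first
    print(sentence.capitalize(), end='')
print()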
However, there's a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
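For illustration, here is that print trick applied to the sample input (my own sketch):
sentences = [s.capitalize() for s in 'hello. these are sample sentences. there are three of them.'.split('. ')]
print(*sentences, sep='. ', end='')   # prints: Hello. These are sample sentences. There are three of them.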
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
    new_sentences = sent_list[count]
    new_sentences = (new_sentences + '. ')
    print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
    new_sentence = sentence + '. '
    print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it's easier to see at a quick glance that it does that.
(Of course you still need the fixes for your two problems.)
Use nltk.sent_tokenize to tokenize the string into sentences, capitalize each sentence, and join them again.
A sentence doesn't always end with a '.'; there can be other things too, like '?' or '!'. Also, three consecutive dots ('...') don't end a sentence. sent_tokenize will handle them all.
from nltk.tokenize import sent_tokenize

def capitalize(user_sentences):
    sents = sent_tokenize(user_sentences)
    capitalized_sents = [sent.capitalize() for sent in sents]
    joined_ = ' '.join(capitalized_sents)
    print(joined_)
The reason your sentences were being printed on separate lines was because print always ends its output with a newline. So, printing the sentences separately in a loop makes them print on separate lines. Instead, you should print them all at once, after joining them. Or, you can specify end='' in the print call, so it doesn't end the sentences with newline characters.
The second issue, the output ending with an extra period, is because you're appending '. ' to each sentence. The good thing about sent_tokenize is that it doesn't remove '.', '?', etc. from the ends of the sentences, so you don't have to append '. ' manually again. Instead, you can just join the sentences with a space character, and you'll be good to go.
If you get an error for nltk not being recognized, you can install it by running pip install nltk on the terminal/cmd.

How to select words of equal max length from a text document

I am trying to write a program to read a text document and output the longest word in the document. If there are multiple longest words (i.e., all of equal length) then I need to output them all in the same order in which they occur. For example, if the longest words were dog and cat your code should produce:
dog cat
I am having trouble finding out how to select numerous words of equal max length and print them. This is as far as I've gotten, I am just struggling to think of how to select all words with equal max length:
# open the file for reading
fh = open('poem.txt', 'r')
longestlist = []
longestword = ''
for line in fh:
    words = (line.strip().split(' '))
    for word in words:
        word = ''.join(c for c in word if c.isalpha())
        if len(word) > (longestword):
            longest.append(word)
for i in longestlist:
    print i
OK, first off, you should probably use a with ... as statement; it just simplifies things and makes sure you don't mess up. So
fh = open('poem.txt', 'r')
becomes
with open('poem.txt','r') as file:
and since you're just concerned with words, you might as well use a built-in from the start:
words = file.read().split()
Then you just set a counter of the max word length (initialized to 0), and an empty list. If the word has broken the max length, set a new maxlength and rewrite the list to include only that word. If it's equal to the maxlength, include it in the list. Then just print out the list members. If you want to include some checks like .isalpha() feel free to put it in the relevant portions of the code.
maxlength = 0
longestlist = []
for word in words:
    if len(word) > maxlength:
        maxlength = len(word)
        longestlist = [word]
    elif len(word) == maxlength:
        longestlist.append(word)
for item in longestlist:
    print item
-MLP
What you need to do is keep a list of all the longest words you've seen so far, along with the longest length. So for example, if the longest word so far has length 5, you will have a list of all words with 5 characters in it. As soon as you see a word with 6 or more characters, you clear that list, put only that one word in it, and update the longest length. If you see a word with the same length as the longest, you add it to the list.
P.S. I did not put the code so you can do it yourself.
TLDR
Showing the results for a file named poem.txt whose contents are:
a dog is by a cat to go hi
>>> with open('poem.txt', 'r') as file:
...     words = file.read().split()
...
>>> [this_word for this_word in words if len(this_word) == len(max(words,key=len))]
['dog', 'cat']
Explanation
You can also make this faster by using the fact that <file-handle>.read().split() returns a list object and the fact that Python's max function can take a function (as the keyword argument key). After that, you can use a list comprehension to find multiple longest words.
Let's clarify that. I'll start by making a file with the example properties you mentioned,
For example, if the longest words were dog and cat your code should produce:
dog cat
{If on Windows - here I specifically use cmd}
>echo a dog is by a cat to go hi > poem.txt
{If on a *NIX system - here I specifically use bash}
$ echo "a dog is by a cat to go hi" > poem.txt
Let's look at the result of the <file-handle>.read().split() call. Let's follow the advice of @MLP and use the with open ... as statement.
{Windows}
>python
or possibly (with conda, for example)
>py
{*NIX}
$ python3
From here, it's the same.
>>> with open('poem.txt', 'r') as file:
...     words = file.read().split()
...
>>> type(words)
<class 'list'>
From the Python documentation for max
max(iterable, *[, key, default])
max(arg1, arg2, *args[, key])
Return the largest item in an iterable or the largest of two or more arguments.
If one positional argument is provided, it should be an iterable. The largest item in the iterable is returned. If two or more positional arguments are provided, the largest of the positional arguments is returned.
There are two optional keyword-only arguments. The key argument specifies a one-argument ordering function like that used for list.sort(). The default argument specifies an object to return if the provided iterable is empty. If the iterable is empty and default is not provided, a ValueError is raised.
If multiple items are maximal, the function returns the first one encountered. This is consistent with other sort-stability preserving tools such as sorted(iterable, key=keyfunc, reverse=True)[0] and heapq.nlargest(1, iterable, key=keyfunc).
New in version 3.4: The default keyword-only argument.
Changed in version 3.8: The key can be None.
Let's use a quick, not-so-robust way to see if we meet the iterable requirement (this SO Q&A gives a variety of other ways).
>>> hasattr(words, '__iter__')
True
Armed with this knowledge, and remembering the caveat, "If multiple items are maximal, the function returns the first one encountered.", we can go about solving the problem. We'll use the len function (use >>> help(len) if you want to know more).
>>> max(words, key=len)
'dog'
Not quite there. We just have the word. Now, it's time to use list comprehension to find all words with that length. First getting that length
>>> max_word_length = len(max(words, key=len))
>>> max_word_length
3
Now for the kicker.
>>> [this_word for this_word in words if len(this_word) == len(max(words,key=len))]
['dog', 'cat']
or, using the commands from before, and making things a bit more readable
>>> [this_word for this_word in words if len(this_word) == max_word_length]
['dog', 'cat']
You can use whatever method you'd like if you don't want the list format, i.e. if you actually want
dog cat
but I need to go, so I'll leave it where it is.
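For what it's worth, joining the list with spaces gives exactly that form:
>>> print(' '.join([this_word for this_word in words if len(this_word) == max_word_length]))
dog cat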

Count all occurrences of elements with and without special characters in a list from a text file in python

I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall('\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print word
        cnt[word] += 1
print cnt
my output thus far looks like:
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})
It is counting my "the" strings with punctuation but not counting them as separate entries. I know it is because of the \w+. I am just not sure what the proper regex pattern to use here is, or whether I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that define this. Something like this
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I'm just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that strings like "the," and "the;" will also match the first option, i.e. "the", and that is what will be returned as the match. Even though this problem could be avoided by ordering the options more carefully, I think it is not really easier in general.
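A quick illustration of that pitfall (my own example):
>>> import re
>>> re.findall(r'the|the;|the,|the!', "the; the, the!")
['the', 'the', 'the']
None of the punctuated alternatives ever get a chance to match, because the plain "the" alternative matches first.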
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
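Putting that together, a sketch of how it might look (using the text.txt file and wanted list from the question; the other names are mine):
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]
wanted.sort(key=len, reverse=True)    # longer alternatives first, so 'the,' wins over plain 'the'
rr = '|'.join(map(re.escape, wanted))

with open('text.txt') as f:
    text = f.read()

print(Counter(re.findall(rr, text)))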
