Remove unmixed numbers from file


Say I have a file called input.txt that looks like this
I listened to 4 u2 albums today
meet me at 5
squad 4ever
I want to filter out the numbers that stand on their own, so "4" and "5" should go but "u2" and "4ever" should stay the same, i.e., the output should be
I listened to u2 albums today
meet me at
squad 4ever
I've been trying to use this code
import fileinput
for line in fileinput.input("input.txt", inplace=True):
    new_s = ""
    for word in line.split(' '):
        if not all(char.isdigit() for char in word):
            new_s += word
            new_s += ' '
    print(new_s, end='')
Which is pretty similar to the code I found here: Removing numbers mixed with letters from string
But instead of the wanted output I get
I listened to u2 albums today
meet me at 5
squad 4ever
As you can see, there are two problems here. First, only the first line loses the digit I want it to lose; "5" is still present in the second line. Second, there is extra white space at the beginning of each new line.
I've been playing around with the code for a while and browsing Stack Overflow, but can't find where the problem is coming from. Any insights?

str.split(' ') splits only on single spaces, so the trailing newline stays attached to the last word of each line. That explains your first problem: the '5' doesn't get removed because the word is actually '5\n', and the \n is not a digit.
The second problem is related. When you print the last word of each line, it contains that newline, plus you're adding a space on to the end. That space shows up as the first character of the next line.
The simplest solution is to change line.split(' ') to line.split(). Called without arguments, split() splits on any run of whitespace and discards leading and trailing whitespace, newlines included. You'll also need to remove the end='' from your print call so that a newline is added back at the end of each line.
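A minimal sketch of that fix, written as a standalone function so it can be tried without touching a file. The name strip_standalone_numbers and the sample list are illustrative assumptions, not part of the original code; the sketch also uses ' '.join so that no trailing space is left behind.

```python
def strip_standalone_numbers(line):
    """Drop whitespace-separated tokens that consist only of digits."""
    # split() with no argument splits on any run of whitespace, so the
    # trailing newline never ends up glued to the last word.
    kept = [word for word in line.split() if not word.isdigit()]
    return " ".join(kept)


# Hypothetical sample mirroring input.txt, one string per line.
sample = [
    "I listened to 4 u2 albums today\n",
    "meet me at 5\n",
    "squad 4ever\n",
]
cleaned = [strip_standalone_numbers(line) for line in sample]
for line in cleaned:
    print(line)
```

Inside the fileinput loop, the same idea would be print(strip_standalone_numbers(line)).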

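Just use a regular expression.
re.sub(r"\b\d+\b", "", text)
This matches any run of digits between word boundaries (text is the string read from the file; calling it input would shadow the built-in).
Or, to avoid double spaces:
re.sub(r"\s\d+\s", " ", text)
Note that \s also matches a newline, so a number at the end of a line takes the line break with it.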
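One possible variation, offered as a sketch rather than the answer's own pattern: use lookarounds so that only spaces and tabs in front of the number are consumed and newlines are left alone. The pattern name STANDALONE_NUMBER and the sample text are assumptions made for this example.

```python
import re

# Optional spaces/tabs, then a run of digits with whitespace (or the
# start/end of the string) on both sides. Newlines are never consumed.
STANDALONE_NUMBER = re.compile(r"[ \t]*(?<!\S)\d+(?!\S)")

text = "I listened to 4 u2 albums today\nmeet me at 5\nsquad 4ever\n"
cleaned = STANDALONE_NUMBER.sub("", text)
print(cleaned, end="")
```

A number at the very start of a line would leave a leading space behind with this pattern, so it may need adjusting for such input.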

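You can use regex:
import re
data = open('file.txt').read()
data = re.sub(r'(?<=\s)\d+(?=$)|(?<=^)\d+(?<=\s)|(\s\d+\s)', '', data)
Output:
I listened tou2 albums today
meet me at
squad 4ever
Note that the last alternative removes the whitespace on both sides of the number, which is why "to" and "u2" run together in the first line.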

Related

Removing a word that contains symbols such as "#", "@", or ":" in python

I have just started learning Python this semester and we were given some revision exercises. However, I am stuck on one of the questions. The text file provided contains tweets from the US elections in 2016. A sample is below:
I wish they would show out takes of Dick Cheney #GOPdebates
Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us.
It seems like Ben Carson REALLY doesn't want to be there. #GOPdebates
RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate
The question requires me to write a Python program that reads from the file tweets.txt. Remember that each line contains one tweet. For each tweet, the program should remove any word that is less than 8 characters long, and also any word that contains a hash (#), at (@), or colon (:) character. What I have now:
for line in open("tweets.txt"):
    aline = line.strip()
    words = aline.split()
    length = len(words)
    remove = ['#', '@', ':']
    for char in words:
        if "#" in char:
            char = ''
        if "@" in char:
            char = ''
        if ":" in char:
            char = ''
which did not work: the resulting list still contains words with #, @ or :. Any help appreciated! Thank you!

Assigning char='' in the loop does not change or remove the actual element (really a word, despite the name char) in the list; it just assigns a different value to the variable char.
Instead, you might use a list comprehension / generator expression to keep only the words that satisfy the conditions.
>>> tweet = "Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us."
>>> [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
['Candidates', 'remained']
Optionally, use ' '.join(...) to join the remaining words back into a "sentence", although that might not make too much sense.
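To make both points concrete, here is a small sketch. The helper name filter_tweet and its parameters are made up for illustration, and the tweets are taken from the sample in the question; in a real script they would come from iterating over open("tweets.txt").

```python
def filter_tweet(tweet, banned="#@:", min_length=8):
    """Keep words that are long enough and contain no banned character."""
    return [
        word
        for word in tweet.split()
        if len(word) >= min_length and not any(ch in word for ch in banned)
    ]


# Rebinding the loop variable leaves the list itself untouched.
words = ["#GOPdebates", "Candidates"]
for word in words:
    if "#" in word:
        word = ""  # only the name `word` changes, not the list element

tweets = [
    "I wish they would show out takes of Dick Cheney #GOPdebates",
    "Candidates went after #HillaryClinton 32 times in the #GOPdebate-but "
    "remained silent about the issues that affect us.",
]
filtered = [filter_tweet(tweet) for tweet in tweets]
for kept in filtered:
    print(" ".join(kept))
```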

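Use this code.
import re
tweet = re.sub(r'#', '', tweet)
tweet = re.sub(r'@', '', tweet)
tweet = re.sub(r':', '', tweet)
Note that this strips only the characters themselves; the words that contained them stay in the tweet.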

The code below opens the file (it's usually better to use "with open" when working with files), loops through all the lines and removes the characters '#@:' using translate. It then drops the words with fewer than 8 characters, giving you the output "new_line". Note that translate deletes only the characters, so a word such as #HillaryClinton is kept as HillaryClinton rather than removed.
with open('tweets.txt') as rf:
    for sentence in rf:
        line = sentence.strip()
        line = line.translate({ord(i): None for i in '#@:'})
        line = line.split()
        new_line = [word for word in line if len(word) >= 8]
        print(new_line)
It's not the most succinct way and there are definitely better ways to do it, but it's probably a bit easier to read and understand, seeing as you've just started learning, like me.

Python - string index out of range issue

This is the question I was given to solve:
Create a program that inputs a phrase (like a famous quotation) and prints all of the words that start with h-z.
I solved the problem, but the first two methods didn't work and I wanted to know why:
#1 string index out of range
quote = input("enter a 1 sentence quote, non-alpha separate words: ")
word = ""
for character in quote:
    if character.isalpha():
        word += character.upper()
    else:
        if word[0].lower() >= "h":
            print(word)
            word = ""
        else:
            word = ""
I get the IndexError: string index out of range message for any words after "g". Shouldn't the else statement catch it? I don't get why it doesn't, because if I remove the brackets [] from word[0], it works.
#2: last word not printing
quote = input("enter a 1 sentence quote, non-alpha separate words: ")
word = ""
for character in quote:
    if character.isalpha():
        word += character.upper()
    else:
        if word.lower() >= "h":
            print(word)
            word = ""
        else:
            word = ""
In this example, it works to a degree. It eliminates any words before 'h' and prints words after 'h', but for some reason doesn't print the last word. It doesn't matter what quote I use; it doesn't print the last word even if it's after 'h'. Why is that?

You're indexing word[0], which accesses the first character of the string word. If word is empty (that is, word == ""), there is no first character to access, so you get an IndexError. This happens whenever a non-alphabetic character is reached while word is still empty, for example when the quote starts with a digit or dash, or when two separators follow each other (such as a comma followed by a space). The else branch can't catch it, because the exception is raised while the if condition itself is being evaluated. Without the [0], an empty word is simply compared as "" >= "h", which is False, so no error occurs.
The second problem, your second snippet leaving off the last word, comes from the approach itself. You walk through the sentence character by character and decide whether to print a word only after you've read past it (which you know because you hit a non-letter such as a space). But the last character of your sentence isn't a space; it's just the last letter of the last word. So the else branch never runs for that final word, and it is never printed.
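If you want to keep the character-by-character approach, both problems can be patched. The following is only a sketch under two assumptions of mine: an extra space is appended as a sentinel so the last word gets flushed, and the words are collected in a list (the function name words_h_to_z is made up) instead of being printed directly.

```python
def words_h_to_z(quote):
    """Collect upper-cased words whose first letter is h through z."""
    found = []
    word = ""
    # The appended space acts as a sentinel so the final word is flushed too.
    for character in quote + " ":
        if character.isalpha():
            word += character.upper()
        else:
            # `word and ...` skips the index lookup when word is empty.
            if word and word[0].lower() >= "h":
                found.append(word)
            word = ""
    return found


result = words_h_to_z("Wheresoever you go, go with all your heart")
print(result)
```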
I'd recommend an entirely different approach, using the method str.split(). This method is built in to Python and transforms one string into a list of smaller strings, split at the character/substring you specify. So if I do
quote = "Hello this is a sentence"
words = quote.split(' ')
print(words)
you'll end up seeing this:
['Hello', 'this', 'is', 'a', 'sentence']
A couple of things to keep in mind on your next approach to this problem:
You need to account for empty words (like if I have two spaces in a row for some reason), and make sure they don't break the script.
You need to account for non-alphanumeric characters like numbers and dashes. You can either ignore them or handle them differently, but you have to have something in place.
You need to make sure that you handle the last word at some point, even if the sentence doesn't end in a space character.
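The three points above can be sketched with split() as follows. This is one possible reading, not the only one: it assumes that non-letters are simply dropped from each token (so "don't" would become "dont"), and the function name is hypothetical.

```python
def words_starting_h_to_z(quote):
    """Return the words of `quote` whose first letter is h through z."""
    selected = []
    # split() without an argument never yields empty strings, and the
    # last word is handled like any other.
    for token in quote.split():
        # Keep letters only, so "zebras!" becomes "zebras" and "42" becomes "".
        word = "".join(ch for ch in token if ch.isalpha())
        if word and word[0].lower() >= "h":
            selected.append(word)
    return selected


result = words_starting_h_to_z("Hello  this is a sentence -- 42 zebras!")
print(result)
```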
Good luck!

Instead of what you're doing, you can iterate over each word in the string and check which of them begin with those letters. Read about the method str.split(): the parameter is the separator, in this case ' ' since you want to separate the words, and it returns a list of strings. Iterate over that list in a loop and it should work.

Capitalizing the beginning of sentences in Python

The following code is for an assignment that asks that a string of sentences is entered from a user and that the beginning of each sentence is capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalize(sentences)
#This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    new_sentences = []
    count = 0
    for count in range(len(sent_list)):
        new_sentences = sent_list[count]
        new_sentences = (new_sentences +'. ')
        print(new_sentences.capitalize())
main()
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?

The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
This normalizes the whitespace, which may or may not be what you want.
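One caveat with str.capitalize(): it lower-cases everything after the first character, so names inside a sentence lose their capitals. A minimal sketch that upper-cases only the first character of each sentence, assuming sentences are separated by '. ' exactly as above (the function name capitalize_first is made up for illustration):

```python
def capitalize_first(text, separator=". "):
    """Upper-case the first character of each sentence, leave the rest as is."""
    sentences = text.split(separator)
    # s[:1] is safe on empty strings, unlike s[0].
    return separator.join(s[:1].upper() + s[1:] for s in sentences)


sample = "hello. i met Bob in Paris. it was fun."
fixed = capitalize_first(sample)
lowered = ". ".join(s.capitalize() for s in sample.split(". "))
print(fixed)    # names keep their capitals
print(lowered)  # str.capitalize() lower-cases "Bob" and "Paris"
```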

def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalizeFunc(sentences)
def capitalizeFunc(user_sentences):
    sent_list = user_sentences.split('. ')
    # Join with '. ' (the same separator used for the split) to keep the spaces
    print('. '.join(i.capitalize() for i in sent_list))
main()
Output:
Enter sentences with lowercase letters: hello. these are sample sentences. there are three of them.
Hello. These are sample sentences. There are three of them.

I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))

>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'

This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''.
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
However, there’s a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
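Both options side by side, as a small sketch using the sample input from the question (the variable names are illustrative):

```python
text = "hello. these are sample sentences. there are three of them."
sent_list = text.split(". ")
capitalized = [sentence.capitalize() for sentence in sent_list]

# Option 1: build one string with join, then print once.
joined = ". ".join(capitalized)
print(joined)

# Option 2: unpack the list and let print insert the separator.
print(*capitalized, sep=". ")
```

Both calls print the same line; the join version also leaves you with a string you can reuse or return.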
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
    new_sentences = sent_list[count]
    new_sentences = (new_sentences +'. ')
    print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
    new_sentence = sentence + '. '
    print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it’s easier to see what it does at a quick glance.
(Of course you still need the fixes for your two problems.)

Use nltk.sent_tokenize to tokenize the string into sentences. Then capitalize each sentence and join them again.
A sentence doesn't always end with a .; there can be other things too, like a ? or !. Also, three consecutive dots (...) don't end the sentence. sent_tokenize is designed to handle these cases.
from nltk.tokenize import sent_tokenize
def capitalize(user_sentences):
    sents = sent_tokenize(user_sentences)
    capitalized_sents = [sent.capitalize() for sent in sents]
    joined_ = ' '.join(capitalized_sents)
    print(joined_)
Your sentences were being printed on separate lines because print ends its output with a newline by default. So printing the sentences separately in a loop puts each one on its own line. You should either print them all at once, after joining them, or specify end='' in the print call so it doesn't end each sentence with a newline character.
The second thing, the output ending with an extra period, happens because you're appending '. ' to each sentence. The good thing about sent_tokenize is that it doesn't remove '.', '?', etc. from the end of the sentences, so you don't have to append '. ' manually again. Instead, you can just join the sentences with a space character, and you'll be good to go.
If you get an error for nltk not being recognized, you can install it by running pip install nltk in the terminal/cmd. sent_tokenize also relies on NLTK's Punkt tokenizer data, which can be fetched with nltk.download.
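If installing nltk is not an option, a rough standard-library approximation is possible with re.sub. To be clear, this is a swapped-in technique and not what sent_tokenize does: it assumes a sentence starts at the beginning of the text or after ., ! or ? followed by whitespace, so an ellipsis followed by a space would also trigger a capital. The function name is made up for this sketch.

```python
import re


def capitalize_after_punctuation(text):
    """Upper-case a lowercase letter at the start or after '.', '!' or '?'."""
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        text,
    )


result = capitalize_after_punctuation("hello. is this a question? yes! it is.")
print(result)
```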

How do I output the acronym on one line

I am following the hands-on Python tutorials from Loyola University, and for one exercise I am supposed to get a phrase from the user, capitalize the first letter of each word and print the acronym on one line.
I have figured out how to print the acronym but I can't figure out how to print all the letters on one line.
letters = []
line = input('?:')
letters.append(line)
for l in line.split():
    print(l[0].upper())

Pass end='' to your print function to suppress the newline character, viz:
for l in line.split():
    print(l[0].upper(), end='')
print()

Your question would be better if you shared the code you are using so far, I'm just guessing that you have saved the capital letters into a list.
You want the string method .join(), which takes a string separator before the . and then joins a list of items with that string separator between them. For an acronym you'd want empty quotes
e.g.
l = ['A','A','R','P']
acronym = ''.join(l)
print(acronym)
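Putting the pieces together, a minimal sketch that goes straight from the phrase to the acronym (the function name acronym and the sample phrase are assumptions for this example):

```python
def acronym(phrase):
    """Build an acronym from the first letter of each whitespace-separated word."""
    return "".join(word[0].upper() for word in phrase.split())


result = acronym("portable network graphics")
print(result)
```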

You could create an empty string variable at the beginning: string = "".
Then, instead of calling print(l[0].upper()) in the loop, append to the string: string += l[0].upper()
Lastly, print(string) after the loop.

Removing \n from myFile

I am trying to create a dictionary of lists where each key is a sorted-letter anagram key and the value (a list) contains all the possible words that can be made from those letters.
So my dict should contain something like this
{'aaelnprt': ['parental', 'paternal', 'prenatal'], 'ailrv': ['rival']}
The possible words are inside a .txt file, where every word is separated by a newline. Example
Sad
Dad
Fruit
Pizza
Which leads to a problem when I try to code it.
with open ("word_list.txt") as myFile:
    for word in myFile:
        if word[0] == "v": ##Interested in only word starting with "v"
            word_sorted = ''.join(sorted(word)) ##Get the anagram
            for keys in list(dictonary.keys()):
                if keys == word_sorted: ##Heres the problem, it doesn't get inside here as theres extra characters in <word_sorted> possible "\n" due to the linebreak of myfi
                    print(word_sorted)
                    dictonary[word_sorted].append(word)

If every word in "word_list.txt" is followed by '\n' then you can just use slicing to get rid of the last char of the word.
word_sorted = ''.join(sorted(word[:-1]))
But if the last word in "word_list.txt" isn't followed by '\n', then you should use rstrip().
word_sorted = ''.join(sorted(word.rstrip()))
The slice method is slightly more efficient, but for this application I doubt you'll notice the difference, so you might as well just play safe & use rstrip().
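The difference is easy to see on a tiny example. The list below stands in for the lines of a file whose last line has no trailing newline (an assumption made for this sketch):

```python
lines = ["Sad\n", "Dad"]  # the last line has no trailing newline

sliced = [word[:-1] for word in lines]            # blindly drops one character
stripped = [word.rstrip("\n") for word in lines]  # drops only trailing newlines

print(sliced)    # ['Sad', 'Da']
print(stripped)  # ['Sad', 'Dad']
```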

Use rstrip(), it removes the \n character.
...
...
keys == word_sorted.rstrip()
...

You should try using the .rstrip() method in your code; it will remove the "\n".
See the Python documentation for str.rstrip() for details.

strip() only removes characters from the beginning and end of a string.
Use rstrip() to remove the trailing \n character.
You can also use replace to substitute the newline with something else:
str2 = word.replace("\n", "")

So, I see a few problems here. How is anything getting into the dictionary? I see no assignments. Obviously you've only provided us with a snippet, so maybe that's elsewhere.
You're also using a loop over the keys when you could be using in, which is more efficient because a dict membership test is a hash lookup.
with open("word_list.txt") as myFile:
    for word in myFile:
        word = word.rstrip()  # drop the trailing newline before sorting or storing
        if word.startswith("v"):  # Interested only in words starting with "v"
            word_sorted = ''.join(sorted(word))  # Get the anagram key
            if word_sorted in dictionary:
                print(word_sorted)
                dictionary[word_sorted].append(word)
            else:
                # The case where we don't find an anagram in our dict
                dictionary[word_sorted] = [word]
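A self-contained sketch of the whole grouping step, for comparison. It swaps the explicit if/else for collections.defaultdict, drops the "v" filter, and takes a list of lines instead of a file so it can be run as is; the function name group_anagrams and the sample words are assumptions for this example.

```python
from collections import defaultdict


def group_anagrams(lines):
    """Map each sorted-letter key to the list of words that share it."""
    groups = defaultdict(list)
    for raw in lines:
        word = raw.strip()  # remove the trailing newline, if any
        if not word:
            continue  # skip blank lines
        key = "".join(sorted(word))
        groups[key].append(word)
    return dict(groups)


sample = ["parental\n", "paternal\n", "prenatal\n", "rival\n", "viral"]
anagrams = group_anagrams(sample)
print(anagrams)
```

With a real file, the same function could be called as group_anagrams(myFile) inside the with block, since iterating over a file yields its lines.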
