Stripping punctuation and finding unique words in Python

So my task is as such:
Write a program that displays a list of all the unique words found in the file uniq_words.txt.
Print your results in alphabetic order and lowercase. Hint: Store words as the elements of a set;
remove punctuations by using the string.punctuation from the string module.
Currently, the code that I have is:
def main():
    import string
    with open('uniq_words.txt') as content:
        new = sorted(set(content.read().split()))
        for i in new:
            while i in string.punctuation:
                new.discard(i)
                print(new)
main()
If I run the code as such, it goes into an infinite loop, printing the unique words over and over again. There are still words in my set that appear as, e.g., "value." or "never/have". How do I remove the punctuation with string.punctuation from the string module? Or am I approaching this from the wrong direction? Would appreciate any advice!
Edit: The link does not help me, in that the method given does not work in a list.

My solution:
import string
with open('sample_string.txt') as content:
    sample_string = content.read()
print(sample_string)
# Sample string: containing punctuation! As well as CAPITAL LETTERS and duplicates duplicates.
sample_string = sample_string.strip('\n')
sample_string = sample_string.translate(str.maketrans('', '', string.punctuation)).lower()
out = sorted(list(set(sample_string.split(" "))))
print(out)
# ['and', 'as', 'capital', 'containing', 'duplicates', 'letters', 'punctuation', 'sample', 'string', 'well']

This is actually two tasks, so let's split this into two questions. I will deal with your problem regarding stripping punctuation, because you have shown your own effort in this matter. For the problem of determining unique words, please open a new question (and also look for similar questions here on Stack Overflow before posting a new one; I am pretty sure you will find something useful!)
You correctly found out that you are ending up in an infinite loop. This is because your while loop condition is always true, once i is a punctuation character. Removing i from new does not change that. You avoid this by using a simple if-condition. Actually, your code is mixing up the concept of while and of if and your scenario is tailored for an if-statement. I think you thought you needed a while loop, because you had the concept of iteration in mind. But you are already iterating over content in the for loop. So, the bug fix would be:
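# assumes new is a set (call sorted() only when printing)
for i in list(new):  # iterate over a copy so the set can be modified
    if i in string.punctuation:
        new.discard(i)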
However, a different and more "pythonic" way would be to use a list comprehension instead of a for loop:
import string
with open("uniq_words.txt") as content:
    stripped_content = "".join([
        x
        for x in content.read()
        if x not in string.punctuation
    ])
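Putting the pieces together, a minimal sketch of the whole task might look like the following. It assumes Python 3, and the helper name unique_words is made up for illustration. Note that it swaps in a slightly different technique from the answers above: each punctuation character is replaced by a space rather than deleted, so that a token like "never/have" splits into two words instead of fusing into "neverhave".

```python
import string

# Map every punctuation character to a space (assumption: punctuation
# should separate words, so "never/have" becomes "never have").
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def unique_words(text):
    """Return the unique lowercase words in text, sorted alphabetically."""
    cleaned = text.translate(PUNCT_TO_SPACE).lower()
    return sorted(set(cleaned.split()))


sample = "Never/have a value. A VALUE, never!"
print(unique_words(sample))  # ['a', 'have', 'never', 'value']
```

To run it on the file from the question, pass content.read() to unique_words inside the with block. One trade-off of this approach is that contractions such as "don't" are split at the apostrophe.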

Related

Stuck with writing a function that returns a large list of nouns in a dictionary form (python)

Disclaimer: I apologise if I have not expressed my problem clearly. I am not looking for someone to code this for me, but rather for a lead on how to break this down into smaller tasks (it would be a good opportunity to learn how to break a problem down).
I have a file full of nouns (an estimated 500). My goal is to pluralize each word within the file. There are a few ways to pluralize a word depending on how the word ends: adding 's', 'es', 'ies' and so on.
The function I am aiming to write takes one argument.
def pluralize(word):
and outputs:
{'plural': word_in_plural, 'status' : x}
word_in_plural is the pluralized version of the input argument word and x is a string which can have one of values: 'empty_string', 'proper_noun', 'already_in_plural', 'success'.
I really have no clue how to break this down. I've been going through lecture videos for the last 3 hours and it's got me nowhere.
I would rather not waste your time with these queries and want to follow Stack Overflow etiquette. I apologise in advance if I'm asking incorrectly.
If anyone could provide a point in the right direction, I would be very grateful. All the best.
Looks like you want the python pattern lib:
import pattern.en
print (pattern.en.pluralize("house"))
Output:
houses
To install:
pip3 install pattern
I'll try to break the problem down into parts. I find that this helps me think about and simplify more complex code, and it lets me test as I go.
First, the main organization of the function. If pluralizing just meant adding an s every time, you might be able to do something like this:
def pluralize(word):
    output = dict()
    output["plural"] = word + "s" #replace this line later with better logic
    output["status"] = "success" #replace this line later with better logic
    return output
To pluralize based on the ending of the word, you could compare the ending to strings that you have already defined. If statements would work just fine:
if word[-1] in {'a','e','i','o','u'}:
    output["plural"] = word + 's'
elif word[-1] == 's':
    output["plural"] = word + "es"
#add more conditions here as needed
else:
    output["plural"] = word + 's'
Additional if statements could be added as needed.
For the status, you can add an if, elif, else at the top to check for what you need. For example:
if word == '':
    output["status"] = "empty_string"
#add other conditions as needed
I hope this gives you something to start with.
Python also requires that indentation lines up and tab characters are not the same as spaces.
To make a string lower case in python you can use word.lower()
Best of luck with your application.
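The fragments above could be combined along the following lines. This is only a sketch under simplifying assumptions: the ending rules are a small, incomplete subset of English pluralization, a capitalised first letter is treated as the sign of a proper noun, and the 'already_in_plural' status is left out because detecting it reliably would need a word list.

```python
def pluralize(word):
    """Return {'plural': ..., 'status': ...} using a few simplified rules."""
    if word == "":
        return {"plural": "", "status": "empty_string"}
    if word[0].isupper():
        # Assumption: a capitalised word is treated as a proper noun.
        return {"plural": word, "status": "proper_noun"}

    if word.endswith(("s", "x", "z", "ch", "sh")):
        plural = word + "es"
    elif word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        # consonant + y: city -> cities
        plural = word[:-1] + "ies"
    else:
        plural = word + "s"
    return {"plural": plural, "status": "success"}


for noun in ["house", "city", "bus", "day", "London", ""]:
    print(repr(noun), pluralize(noun))
```

Returning early for the special statuses keeps the ending rules in one place, so new rules can be added as further elif branches.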

How can I join different segments of a list?

I'm having trouble in a school project because I don't know how to join elements of a list in segments. Here's an example: Let's say I have the following list:
list = ["T","h","i","s","I","s","A","L","i","s","t",]
How could I join this list so that the program outputs the following?:
Output: ["This","Is","A","List"]
Assuming list is your input, and without giving you the answer outright since it's a school project you should do yourself, here are some hints.
You'll want to check if a character is uppercase to know when the start of a word is. With python, you can use isupper() (ex: 'C'.isupper() would return True).
Python strings are iterable.
You can add a character to the end of a string using += (ex: myWord += 'a')
You can add a string to a list using append (ex: myList.append(myWord))
Remember this is a learning experience and there's no real value to being given the answer outright, if that's what you were hoping for. Best of luck and welcome to StackOverflow.
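For readers who want to check their own attempt, the hints above can be combined roughly as follows. This is one possible sketch, not the only way to do it; the variable names letters, words and current are made up, and the input is renamed so that it does not shadow the built-in list.

```python
letters = ["T", "h", "i", "s", "I", "s", "A", "L", "i", "s", "t"]

words = []
current = ""
for char in letters:
    # An uppercase letter marks the start of a new word, so store the
    # word built so far (if any) before starting the next one.
    if char.isupper() and current:
        words.append(current)
        current = ""
    current += char
# Do not forget the last word, which is never followed by an uppercase letter.
if current:
    words.append(current)

print(words)  # ['This', 'Is', 'A', 'List']
```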
You can use regex for this
import re
list = ["T","h","i","s","I","s","A","L","i","s","t",]
sep=[s for s in re.split("([A-Z][^A-Z]*)", ''.join(list)) if s]
print(sep)

Change two characters into one symbol (Python)

I'm currently working on a file compression task for school, and I find myself unable to understand what's happening in this code (more specifically, what ISN'T happening and why).
So in this section of the code what I'm aiming to do is, in non-coding terms, change two adjacent letters which are the same into one symbol, therefore taking up less memory:
for i, word in enumerate(file_contents):
    #file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    for ind,letter in enumerate(word_contents[:-1]):
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
However, when I run the full code with a sample text file, it seemingly doesn't do what I told it to do. For instance, the word 'Sally' should be 'Sa★y' but instead stays the same.
Could anyone help me get on the right track?
EDIT: I missed out a pretty key detail. I want the compressed string to somehow appear back in the original file_contents list where there are double letters, as the purpose of the full compression algorithm is to return a compressed version of the text in an inputted file.
I would suggest using a regex that matches two identical adjacent characters.
Example:
import re
txt = 'sally and bobby'
print(re.sub(r"(.)\1", '*', txt))
# sa*y and bo*y
The inner loop and condition checks in your code are not required. Use the line below instead, where word is a single string from file_contents:
word_contents = re.sub(r"(.)\1", '*', word)
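To get the compressed words back into file_contents, as the EDIT in the question asks, one option is to rebuild the list with a list comprehension. A minimal sketch, assuming file_contents is a list of single words (the sample data here is made up):

```python
import re

file_contents = ["sally", "and", "bobby"]

# Replace each pair of identical adjacent characters with one symbol,
# building a new list that takes the place of the old one.
file_contents = [re.sub(r"(.)\1", "*", word) for word in file_contents]

print(file_contents)  # ['sa*y', 'and', 'bo*y']
```

Because "." matches any character, doubled digits or doubled spaces would be replaced as well; a pattern such as r"([A-Za-z])\1" would restrict the substitution to letters.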
There are a few things wrong with your code (I think).
1) split produces a list, not a str, so when you write enumerate(word_contents[:-1]) it looks like you're assuming you get a string. At any rate, I'm not sure whether that's what you intended.
But then:
2) With these lines:
if word_contents[ind] == word_contents[ind+1]:
    word_contents[ind] = ''
    word_contents[ind+1] = '★'
you're operating on your list again, where it looks pretty clear that you want to be operating on the string, or on a list of the characters in the word you're processing. At best this code will do nothing, and at worst you're corrupting the word_contents list.
So when you perform your modifications, you are modifying the word_contents list and not the slice [:-1] you are actually looping over. There are more issues, but I think that answers your question (I hope).
If you really want to understand what you're doing wrong, I recommend putting in print statements along the way. If you're looking for someone to do your homework for you, there is another answer which already gave you a solution, I guess.
Here is an example of how you could add logging to the function:
for i, word in enumerate(file_contents):
    #file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    # See what the word content list actually is
    print(word_contents)
    # See what your slice is actually returning
    print(word_contents[:-1])
    # Unless you have something modifying your list elsewhere you probably want to iterate over the words list generally and not just the slice of it as well.
    for ind,letter in enumerate(word_contents[:-1]):
        # See what your other test is testing
        print(word_contents[ind], word_contents[ind+1])
        # Here you probably actually want
        # word_contents[:-1][ind]
        # which is the list item you iterate over and then the actual string I suspect you get back
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
UPDATE: based on the follow up questions from the OP I've made a sample program annotated with descriptions. Note this isn't an optimal solution, but mainly an exercise in teaching flow control and using basic structures.
# define the initial data...
file = "sally was a quick brown fox and jumped over the lazy dog which we'll call billy"
file_contents = file.split()
# Enumerate isn't needed in your example unless you intend to use the index later (example below)
for list_index, word in enumerate(file_contents):
    # changing something you iterate over is dangerous and sometimes confusing like in your case you iterated over
    # word contents and then modified it. if you have to take
    # two characters you change the index and size of the structure making changes potentially invalid. So we'll create a new data structure to dump the results in
    compressed_word = []
    # since we have a list of strings we'll just iterate over each string (or word) individually
    for character in word:
        # Check to see if there is any data in the intermediate structure yet if not there are no duplicate chars yet
        if compressed_word:
            # if there are chars in new structure, test to see if we hit same character twice
            if character == compressed_word[-1]:
                # looks like we did, replace it with your star
                compressed_word[-1] = "*"
                # continue skips the rest of this iteration the loop
                continue
        # if we haven't seen the character before or it is the first character just add it to the list
        compressed_word.append(character)
    # I guess this is one reason why you may want enumerate, to update the list with the new item?
    # join() is just converting the list back to a string
    file_contents[list_index] = "".join(compressed_word)
# prints the new version of the original "file" string
print(" ".join(file_contents))
outputs: "sa*y was a quick brown fox and jumped over the lazy dog which we'* ca* bi*y"

Count occurrences of elements in string from a list?

I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:
speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")
So, in this case, I'd like to count four (4) contractions. I have a list of contractions, and here are some of the first few terms:
contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}
My code looks something like this, to begin with:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count
I'm not getting anywhere with this, however, as the code's iterating over every single letter, as opposed to whole words.
Use str.split() to split your string on whitespace:
for word in speech.split():
This will split on arbitrary whitespace; this means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:
from string import punctuation
count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
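A quick illustration of that behaviour, using a few made-up tokens: str.strip() only touches the two ends of the string, so the apostrophe inside a contraction survives, and so does punctuation in the middle of a token.

```python
from string import punctuation

for token in ["state.", "(you've)", "economy,", "never/have"]:
    # Only leading and trailing punctuation is removed.
    print(token, "->", token.strip(punctuation))

# state. -> state
# (you've) -> you've
# economy, -> economy
# never/have -> never/have
```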
You're iterating over a string, so the items are characters. To get the words from a string you can use a naive method like str.split(), which does this for you; you can then iterate over a list of strings (the words, split on the argument of str.split(), which defaults to whitespace). There is also re.split(), which is more powerful, but I don't think you need to split the text with regexes.
At a minimum you have to lowercase your string with str.lower(), or put all possible occurrences (including capitalised ones) in the dictionary. I strongly recommend the first alternative; the latter isn't really practicable. Removing the punctuation is also necessary. But this is still naive. If you need a more sophisticated method, you have to split the text with a word tokenizer. NLTK is a good starting point for that, see the nltk tokenizer. But I strongly feel that this is not your main problem and does not really affect solving your question. :)
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...
# with re you can define advanced regexes, but maybe
# from string import punctuation (suggestion from Martijn Pieters' answer
# is still enough for you)
import re
def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # what you want is a list of words. str.split() does this job for you.
    # " " is default and you can also omit this. But if you really need better
    # methods (see answer text above), you have to take a word tokenizer tool
    # or have to write your own.
    for word in input_text.split(" "):
        # and also clean word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control in what you want to remove. That means that you can add or
        # remove easily any punctuation mark. It could be very handy. It could be
        # also overpowered. If the latter is the case, just stick to Martijn Pieters'
        # solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count
print(abbreviation_counter(speech, contractions))
2 # yeah, it worked - I've included I've in your list :)
It's a little bit frustrating to give an answer at the same time as Martijn Pieters does ;), but I hope I have still added some value for you. That's why I've edited my answer to give you some hints for future work as well.
A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.
Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.
So your code should look like this:
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
speech.split(" ") works too, but it only splits on single space characters, not on tabs or newlines, and if there are double spaces you'd get empty elements in your resulting list.
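The difference between the two calls is easy to see on a small string (a made-up example, not taken from the question):

```python
text = "one  two\tthree\nfour"

# No argument: split on any run of whitespace.
print(text.split())     # ['one', 'two', 'three', 'four']

# Explicit " ": split on every single space only.
print(text.split(" "))  # ['one', '', 'two\tthree\nfour']
```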

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some code, they don't seem to work on multiple punctuation characters, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
I got it from:
Best way to strip punctuation from a string in Python. It was good, but I still encounter problems with double punctuation.
I added a while loop in the hope of iterating over each word to remove multiple punctuation characters, but that does not seem to work: it just gets stuck on the first item, "[, and does not exit.
Am I just missing some obvious piece that I am being oblivious to?
I solved the problem by adding a redundancy and double looping over my lists, but this takes an extremely long time (well into the minutes) due to fairly large sets.
I use Python 2.7
Your code doesn't work because regex.match only matches at the beginning of the string, so punctuation that appears later in a word is never seen.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place; you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print i
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
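A sketch of the version without the while loop, using the same sample words as above. regex.sub() replaces every match in a single pass, so one call per word is enough; print() with a single argument is used so the snippet runs on both Python 2.7 and Python 3.

```python
import re
import string

words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))

# sub() already removes all matches, wherever they are in the string.
cleaned = [regex.sub('', word) for word in words]
print(cleaned)  # ['abc', 'cddgffdfe']
```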
This removes everything that is not alphanumeric or a space:
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None,'~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using this str.translate function will eliminate the punctuation. I usually use this for eliminating numbers from DNA sequence reads.
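The call above relies on the Python 2 form of str.translate, which accepts a second argument of characters to delete. On Python 3, translate takes only a table, so a roughly equivalent sketch builds the table with str.maketrans first. The shortened sample string and the use of string.punctuation in place of the hand-written character list are choices made for this example.

```python
import string

text = "My&& Dog's {{{%!$L&&&ove Sal*mon"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))  # My Dogs Love Salmon
```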
