I've been trying to solve this problem for a few hours now and can't come up with the right solution. This is the question:
Write a loop that creates a new word list, using a
string method to strip the words from the list created in Problem 3
of all leading and trailing punctuation. Hint: the string library,
which is imported above, contains a constant named punctuation.
Three lines of code.
Here is my code:
import string
def litCricFriend(wordList, text):
    theList = text.lower().replace('-', ' ').split() #problem 3
    #problem below
    for word in theList:
        word.strip(string.punctuation)
    return theList
You've got a couple bits in your code that... well, I'm not really sure why they're there, to be honest, haha. Let's work through this together!
I'm assuming you have been given some text: text = "My hovercraft is full of eels!". Let's split this into words, make the words lowercase, and remove all punctuation. We know we need string.punctuation and str.split(), and you've also figured out that str.replace() is useful. So let's use these and get our result!
import string
def remove_punctuation(text):
    # First, let's remove the punctuation.
    # We do this by looping through each punctuation mark in the
    # `string.punctuation` list, and then replacing that mark with
    # the empty string and re-assigning that to the same variable.
    for punc in string.punctuation:
        text = text.replace(punc, '')
    # Now our text is all de-punctuated! So let's make a list of
    # the words, all lowercased, and return it in one go:
    return text.lower().split()
Looks to me like the function is only three lines, which is what you said you wanted!
For the advanced reader, you could also use functools and do it in one line (I split it into two for readability, but it's still "one line"):
import string
import functools
def remove_punctuation(text):
    return functools.reduce(lambda newtext, punc: newtext.replace(punc, ''),
                            string.punctuation, text).lower().split()
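If only leading and trailing punctuation should go, as the exercise asks, str.strip does the job, provided its return value is kept (strings are immutable, so calling word.strip(...) on its own changes nothing). A minimal sketch; the function name strip_words and the sample sentence are made up for illustration:

```python
import string

def strip_words(text):
    # Problem 3 from the question: lowercase, treat hyphens as spaces, split.
    words = text.lower().replace('-', ' ').split()
    stripped = []
    for word in words:
        # str.strip returns a new string, so the result has to be stored.
        stripped.append(word.strip(string.punctuation))
    return stripped

print(strip_words('"Hello," said the well-known eel!'))
# ['hello', 'said', 'the', 'well', 'known', 'eel']
```

Punctuation inside a word, such as the apostrophe in "don't", is left alone by this approach.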
I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must abbreviate word by word, not the sentence as a whole, and afterwards I need to join the words back together in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so " ".join() can't put them back.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
As its replacement argument, re.sub accepts either a string or a callable (such as a function or lambda). The callable receives the current match object and must return the string that replaces that match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
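One possible tweak for mixed-case input, as a sketch only: widen each character class to [A-Za-z]. The name abbreviate_any_case and the sample sentence are hypothetical, and contractions still get no special treatment here.

```python
import re

# Same shape as the pattern above, with uppercase letters allowed as well.
PATTERN = re.compile(r"\b[A-Za-z]([A-Za-z]{2,})[A-Za-z]\b")

def abbreviate_any_case(text):
    def replace(match):
        word = match.group(0)
        # First letter, count of inner letters, last letter.
        return f"{word[0]}{len(match.group(1))}{word[-1]}"
    return PATTERN.sub(replace, text)

print(abbreviate_any_case("Elephant-rides are REALLY fun!"))
# E6t-r3s are R4Y fun!
```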
Thank you for viewing my question. After a few more searches and some trial and error, I finally found a way to make my code work without changing it too much. I simply replaced re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and removed the space from " ".join(), since it was no longer needed.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
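The reason this works: when the pattern passed to re.split contains a capturing group, the matched separators are returned in the result list too, so nothing is lost when the pieces are joined back together. A small illustration with a made-up sample string:

```python
import re

sample = "elephant-rides are fun!"

# Without a capturing group the separators are dropped.
without_group = re.split(r"[\W\d_]", sample)
print(without_group)
# ['elephant', 'rides', 'are', 'fun', '']

# With a capturing group they are kept as separate list items.
with_group = re.split(r"([\W\d_])", sample)
print(with_group)
# ['elephant', '-', 'rides', ' ', 'are', ' ', 'fun', '!', '']
```

Joining the second list with "" reproduces the original string exactly, which is why the abbreviation can be applied item by item.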
Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something along the lines of:
def find_char(string):
    # str.find returns -1 when nothing is found, so compare explicitly.
    if string.find("some_char") != -1:
        pass  # do xyz with some_char
    elif string.find("another_char") != -1:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in Python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
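A short sketch of how this behaves on the cases from the question. The wrapper name split_first_special and the sample strings are made up; returning False when no delimiter is present mirrors the asker's function rather than anything re.split does itself.

```python
import re

def split_first_special(text):
    # Split once, at whichever listed character appears first in the string.
    parts = re.split(r'[,*;/]', text, maxsplit=1)
    # re.split returns a one-element list when nothing matched.
    return parts if len(parts) > 1 else False

print(split_first_special("a*b,c*d"))  # ['a', 'b,c*d']
print(split_first_special("a,b*c"))    # ['a', 'b*c']
print(split_first_special("abc"))      # False
```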
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can build the compiled pattern once inside a factory function instead of recompiling it on every call. The factory returns a separate function that holds on to the compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat a run of consecutive delimiters as one, if that is desired (thank you Jon Clements). If it is not, just use re.compile('[{}]'.format(''.join(chars))) above. Note that the choice still matters with maxsplit=1: 'a,,b' splits into ['a', 'b'] with the + and into ['a', ',b'] without it.
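One caveat when the delimiter list is arbitrary: characters such as ], ^ or - have a special meaning inside square brackets. A hedged variant of the factory that passes the characters through re.escape first; make_splitter is a hypothetical name and not part of the code above:

```python
import re

def make_splitter(chars, maxsplit=0):
    # re.escape keeps characters like ']', '^' and '-' literal inside [...].
    pattern = re.compile('[{}]'.format(re.escape(''.join(chars))))
    def split(text):
        return pattern.split(text, maxsplit=maxsplit)
    return split

split_once = make_splitter('-]^', maxsplit=1)
print(split_once('a]b-c'))  # ['a', 'b-c']
```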
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information-packed journey.
With python, I'm trying to take a text file, and then create one long list of words (with words in the order they appear in the document).
What I have so far goes through each line and then just basically adds the words to the long list.
It is supposed to lowercase each word, and remove any punctuation it finds.
wordstory=[a.lower().strip(string.punctuation) for b in [line.split() for line in open('alice.txt')] for a in b]
It seems that some punctuation isn't recognized by .strip(string.punctuation) for removal, and further, in some cases, the punctuation gets converted to odd codes.
I end up with situations like this with \xe2\x80\x94 not supposed to be there at all.
..
'she',
'spoke\xe2\x80\x94fancy',
'curtseying',
..
Also, when an apostrophe occurs next to a double quotation, the apostrophe isn't removed by .strip(string.punctuation). I end up with:
..
'she',
"couldn't",
'answer',
..
Can someone provide some code that will help, and/or point me to a resource that will help me understand what is going on?
I think you're having unicode problems, as well as being unnecessarily obfuscated with the list comprehension.
I'd recommend doing something like this:
# -*- coding: utf-8 -*-
import string
file = open("""text_file.txt""", "r")
raw_text = file.read()
# stripping punctuation
punctuation = set(string.punctuation)
trimmed_text = ''.join(char for char in raw_text if char not in punctuation)
# splitting into list
word_list = trimmed_text.split(" ")
# removing duplicates
unique_word_list = set(word_list)
# or if you're preserving the order, maybe try:
unique_word_list = []
for word in word_list:
    if word not in unique_word_list:
        unique_word_list.append(word)
print(unique_word_list)
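On the odd codes themselves: \xe2\x80\x94 is the UTF-8 encoding of an em dash (U+2014), which string.punctuation does not contain because it only lists ASCII characters. A rough sketch for Python 3, assuming the file is decoded as UTF-8 (for example with open('alice.txt', encoding='utf-8')); the helper names and the sample sentence are invented for illustration:

```python
import string
import unicodedata

def is_punct(ch):
    # ASCII punctuation, plus any character in a Unicode punctuation category.
    return ch in string.punctuation or unicodedata.category(ch).startswith('P')

def clean_words(text):
    # Dashes (category Pd, which covers the em dash and '-') become spaces,
    # so words joined by a dash are separated.
    spaced = ''.join(' ' if unicodedata.category(ch) == 'Pd' else ch
                     for ch in text)
    words = []
    for word in spaced.lower().split():
        # Trim punctuation from both ends only; inner apostrophes survive.
        while word and is_punct(word[0]):
            word = word[1:]
        while word and is_punct(word[-1]):
            word = word[:-1]
        if word:
            words.append(word)
    return words

sample = "She spoke\u2014fancy curtseying! \u201cCouldn't answer.\u201d"
print(clean_words(sample))
# ['she', 'spoke', 'fancy', 'curtseying', "couldn't", 'answer']
```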
If you want to remove all punctuation, use translate and string.maketrans (Python 2 syntax):
In [94]: import string
In [95]: a ="she's all foo!"
In [96]: a.lower().translate(string.maketrans("",""), string.punctuation)
Out[96]: 'shes all foo'
str.strip only removes chars from the end or start of a string.
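In Python 3, string.maketrans is gone and str.translate takes a single table, so the call above has to be written differently. A sketch of the equivalent, together with a check of the str.strip behaviour just mentioned:

```python
import string

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', string.punctuation)

a = "she's all foo!"
removed_all = a.lower().translate(table)
print(removed_all)  # shes all foo

# strip only trims the ends, so the inner apostrophe stays.
ends_only = "'couldn't'".strip(string.punctuation)
print(ends_only)  # couldn't
```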
I have defined the following code
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I have previously printed exclude and the quotation mark " is part of it. I expected answer to be [the]. However, when I printed answer, it shows up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?
When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of strings, passing None as the table and, as the second argument, the characters to remove (this two-argument form is Python 2 only).
You can then perform the lemmatization on the new list.
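The same point in Python 3 syntax, as a sketch: the set difference compares whole list elements against single characters, so nothing is removed, whereas a per-character filter does what was intended. The lemmatization step is left out so the example runs without nltk.

```python
import string

exclude = set(string.punctuation)
word_list = ['"the']

# '"the' is not itself a punctuation character, so the difference is a no-op.
unchanged = set(word_list) - exclude
print(unchanged)  # {'"the'}

# Checking each character individually removes the quotation mark.
cleaned = [''.join(ch for ch in word if ch not in exclude) for word in word_list]
print(cleaned)  # ['the']
```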
How do I make a function where it will filter out all the non-letters from the string? For example, letters("jajk24me") will return back "jajkme". (It needs to be a for loop) and will string.isalpha() function help me with this?
My attempt:
def letters(input):
    valids = []
    for character in input:
        if character in letters:
            valids.append( character)
    return (valids)
If it needs to be in that for loop, and a regular expression won't do, then this small modification of your loop will work:
def letters(input):
    valids = []
    for character in input:
        if character.isalpha():
            valids.append(character)
    return ''.join(valids)
(The ''.join(valids) at the end takes all of the characters that you have collected in a list, and joins them together into a string. Your original function returned that list of characters instead)
You can also filter out characters from a string:
def letters(input):
    return ''.join(filter(str.isalpha, input))
or with a list comprehension:
def letters(input):
    return ''.join([c for c in input if c.isalpha()])
or you could use a regular expression, as others have suggested.
import re
valids = re.sub(r"[^A-Za-z]+", '', my_string)
EDIT: If it needs to be a for loop, something like this should work:
output = ''
for character in input:
    if character.isalpha():
        output += character
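One difference between the two approaches worth knowing: str.isalpha() accepts any Unicode letter, while [^A-Za-z] keeps ASCII letters only. A small check with a made-up input (the function names are illustrative):

```python
import re

def letters_isalpha(text):
    output = ''
    for character in text:
        if character.isalpha():
            output += character
    return output

def letters_regex(text):
    return re.sub(r"[^A-Za-z]+", '', text)

print(letters_isalpha("caf\u00e924"))  # café
print(letters_regex("caf\u00e924"))    # caf
```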
See re.sub. For performance, consider using re.compile so the pattern is compiled only once.
Below is a short version that matches every run of characters outside the range A to Z and replaces it with the empty string. The re.I flag makes the match case-insensitive, so lowercase letters (a-z) are kept as well.
import re
def charFilter(myString):
    return re.sub('[^A-Z]+', '', myString, flags=re.I)
If you really need that loop, there are many answers explaining that specifically. However, you might want to give a reason why you need a loop.
If you want to operate on the number sequences and that's the reason for the loop, consider replacing the replacement string parameter with a function like:
import re
def numberPrinter(match):
    # re.sub passes a match object; group(0) is the matched text.
    print(match.group(0))
    return ''
def charFilter(myString):
    return re.sub('[^A-Z]+', numberPrinter, myString, flags=re.I)
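A variation on the same idea, as a sketch: instead of printing, the callback can collect the removed runs so they are available afterwards. The name filter_and_collect is made up for this example.

```python
import re

def filter_and_collect(text):
    removed = []

    def collect(match):
        # Called once per run of non-letters; remember it, then drop it.
        removed.append(match.group(0))
        return ''

    letters_only = re.sub(r'[^A-Za-z]+', collect, text)
    return letters_only, removed

print(filter_and_collect("jajk24me!"))  # ('jajkme', ['24', '!'])
```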
The method string.isalpha() checks whether string consists of alphabetic characters only. You can use it to check if any modification is needed.
As to the other part of the question, pst is just right. You can read about regular expressions in the python doc: http://docs.python.org/library/re.html
They might seem daunting but are really useful once you get the hang of them.
Of course you can use isalpha. Also, valids can be a string.
Here you go:
def letters(input):
    valids = ""
    for character in input:
        if character.isalpha():
            valids += character
    return valids
Not using a for-loop. But that's already been thoroughly covered.
Might be a little late, and I'm not sure about performance, but I just thought of this solution which seems pretty nifty:
set(x).intersection(y)
You could use it like:
from string import ascii_letters
def letters(string):
    return ''.join(set(string).intersection(ascii_letters))
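NOTE:
This will not preserve linear order, and because a set holds each element only once, repeated letters are collapsed too (letters("jajk24me") yields five letters, not "jajkme"). In my use case that is fine, but be warned.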
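If order and repeated letters do matter, one option (a sketch, not the approach above) is to keep the set purely as a membership test and walk the string in order:

```python
from string import ascii_letters

ALLOWED = set(ascii_letters)

def letters(text):
    # The set is only used for fast lookups; iteration follows the string.
    return ''.join(c for c in text if c in ALLOWED)

print(letters("jajk24me"))  # jajkme

# The intersection version keeps each distinct letter once.
distinct = set("jajk24me").intersection(ascii_letters)
print(len(distinct))  # 5
```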