Removing punctuation/numbers from text problem

I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list worked (the details aren't important). Now the punctuation isn't being removed, and quite frankly I'm stumped as to why.
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list
Any pointers on why it's not working would be great. I'm no expert in Python, so it's probably something ridiculously stupid. Thanks.

Change
for word in word_list:
    word = punctuation.sub("", word)
to
word_list = [punctuation.sub("", word) for word in word_list]
Assigning to word inside the for-loop only rebinds that temporary loop variable to a new string. It does not alter word_list.

You're not updating your word list. Try
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)
Remember that although word starts off as a reference to the string object in the word_list, assignment rebinds the name word to the new string object returned by the sub function. It doesn't change the originally referenced object.
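To see the difference between rebinding the loop variable and updating the list, here is a minimal sketch. It assumes Python 3 (the question uses Python 2), and the sample word_list is made up for illustration.

```python
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = ["hello,", "world!", "(42)", "ok"]  # made-up sample data

# Rebinding the loop variable: the list itself is never touched.
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)

# Fix 1: build a new list of cleaned words.
cleaned = [punctuation.sub("", word) for word in word_list]

# Fix 2: assign back into the existing list by index.
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

print(unchanged)  # ['hello,', 'world!', '(42)', 'ok']
print(cleaned)    # ['hello', 'world', '', 'ok']
print(word_list)  # ['hello', 'world', '', 'ok']
```

Note that a word made up entirely of removed characters, such as "(42)", becomes an empty string, so you may want to filter those out afterwards.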

Related

for loop iterable not overwriting original

I am trying to find all full stops in a body of text and insert a space in front of them. I am only allowed to use for loops; otherwise I would use a regular expression. I can't understand why the following for-loop does not replace the original letter with my new assignment. My code is as follows:
text = 'I have a ruler and bought for 25 gpb.'
text_split = text.split()
for word in text_split:
    for letter in word:
        if letter == '.':
            letter = ' .'
If anyone could help, it would be greatly appreciated.

letter = ' .' just rebinds the name letter to a different string. The original object bound to letter is unchanged (and can't be changed even in theory; str is an immutable type). A similar problem prevents changing the original str in text_split bound to word on each loop.
For this specific case, you just want:
text_split = [word.replace('.', ' .') for word in text_split]
or the slightly longer spelled out version (that modifies text_split in place instead of replacing it with a new list of modified str):
for i, word in enumerate(text_split):
    text_split[i] = word.replace('.', ' .')
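The difference between "replacing the list" and "modifying it in place" only shows up when something else still refers to the original list. A small sketch, assuming Python 3; the name alias is introduced here purely for illustration.

```python
text = 'I have a ruler and bought for 25 gpb.'

# Version 1: the comprehension builds a new list and rebinds the name.
text_split = text.split()
alias = text_split                     # second name for the same list object
text_split = [word.replace('.', ' .') for word in text_split]
old_last = alias[-1]                   # the old list is untouched: 'gpb.'
new_last = text_split[-1]              # the new list has the change: 'gpb .'

# Version 2: index assignment changes the one list both names refer to.
text_split = text.split()
alias = text_split
for i, word in enumerate(text_split):
    text_split[i] = word.replace('.', ' .')
same_object = alias is text_split      # True: still a single list
joined = ' '.join(text_split)

print(old_last, '|', new_last)
print(same_object, '|', joined)
```

If nothing else holds a reference to the list, the two versions are interchangeable and the comprehension is usually the more readable choice.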

Trying to make sure certain symbols aren't in a word

I currently have the following to filter out words containing square or round brackets, and I can't help but think there must be a tidier way to do this.
words = [word for word in random.choice(headlines).split(" ")[1:-1] if "[" not in word and "]" not in word and "(" not in word and ")" not in word]
I tried creating a list or tuple of symbols and doing
if symbol not in word
But it dies because I'm comparing a list with a string. I appreciate I could explode this out and do a compare like:
for word in random.choice(headlines).split(" ")[1:-1]:
    popIn = 1
    for symbol in symbols:
        if symbol in word:
            popIn = 0
    if popIn == 1:
        words.append(word)
But it seems like overkill in my head. I appreciate I'm a novice programmer so anything I can do to tidy either method up would be very helpful.

Use set intersection.
brackets = set("[]()")
words = [word for word in random.choice(headlines).split(" ")[1:-1] if not brackets.intersection(word)]
The intersection is empty if word does not contain any of the characters in brackets.
You might also consider using itertools instead of a list comprehension (ifilterfalse is the Python 2 name; in Python 3 it is itertools.filterfalse).
words = list(itertools.ifilterfalse(brackets.intersection,
                                    random.choice(headlines).split(" ")[1:-1]))
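Here is a runnable sketch of both variants, assuming Python 3 (hence itertools.filterfalse). The headline string is a made-up stand-in for random.choice(headlines).

```python
import itertools

brackets = set("[]()")

# Hypothetical headline standing in for random.choice(headlines).
headline = "BREAKING [video] markets rally (again) after rate cut NOW"
candidates = headline.split(" ")[1:-1]

# An empty intersection is falsy, so "not ..." keeps bracket-free words.
words = [word for word in candidates if not brackets.intersection(word)]

# Same filter: filterfalse keeps items for which the function result is falsy.
words_alt = list(itertools.filterfalse(brackets.intersection, candidates))

print(words)      # ['markets', 'rally', 'after', 'rate', 'cut']
print(words_alt)  # ['markets', 'rally', 'after', 'rate', 'cut']
```

set.intersection accepts any iterable, so passing a string compares the set against the string's individual characters.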

I'm not sure what you want to filter, but I advise you to use Python's regular expression module, re.
import re
r = re.compile(r"\w*[\[\]\(\)]+\w*")
test = ['foo', '[bar]', 'f(o)o']
result = [word for word in test if not r.match(word)]
print result
output is ['foo']
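One caveat with the pattern above: match() only tries at the start of the word, and \w* cannot skip over non-word characters such as a hyphen, so a word like 'a-[b]' slips through. A sketch of a simpler alternative, assuming Python 3 and using search() with a plain character class (the names has_bracket and the extra test word are illustrative):

```python
import re

# Character class of the four bracket characters; search() scans the whole word.
has_bracket = re.compile(r"[\[\]()]")

test = ['foo', '[bar]', 'f(o)o', 'a-[b]']
result = [word for word in test if not has_bracket.search(word)]

# For comparison: the match()-based pattern keeps 'a-[b]' by mistake,
# because '-' is not matched by \w and match() is anchored at position 0.
anchored = re.compile(r"\w*[\[\]\(\)]+\w*")
missed = [word for word in test if not anchored.match(word)]

print(result)  # ['foo']
print(missed)  # ['foo', 'a-[b]']
```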

Cannot prove that there are Words in my list for my WordSearch game

I have created a list of all possible lines for this specific word grid (rows, columns, diagonals, and all of their reverses).
I have called this allWords, but when I try to find specific words that I know are in allWords, the loop does not find the hidden words. I know what my problem is, but I do not know how to get around it (sorry for the terrible explanation; hopefully the example below shows it better).
An example follows. WordList is the list of words that I know are hidden somewhere in the word grid. allWords is a list of rows, columns and diagonals from the word grid:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
HAMMER is in allWords but 'cloaked' by the other characters around it, so I am unable to show that HAMMER is in the word grid.
length = len(allWords)
for i in range(length):
    word = allWords[i]
    if word in wordList:
        print("I have found", word)
It does not find the word HAMMER in allWords.
Any help towards solving this problem would be great.

You are not comparing each word in WordList against the strings in allWords. The line if word in wordList tests for an exact match,
i.e.
if word in wordList returns True only if the whole string taken from allWords (for example 'ARBHAMMERTYU') is itself an element of wordList.
To match substrings you need another loop:
for i in range(length):
    word = allWords[i]
    for w in WordList:
        if w in word:
            print("I have found", w, "in", word)
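The key point is that in means different things for lists and strings. A minimal sketch, assuming Python 3, with made-up grid lines:

```python
WordList = ['HAMMER', 'SAW']
allWords = ['ARBHAMMERTYU', 'QWASXZ']   # made-up grid lines

# List membership: is the whole string an element of the list?
exact = 'ARBHAMMERTYU' in WordList      # False

# String membership: is 'HAMMER' a substring of the line?
substring = 'HAMMER' in 'ARBHAMMERTYU'  # True

# Nested loop written as a comprehension: every (word, line) substring hit.
found = [(w, line) for line in allWords for w in WordList if w in line]

print(exact, substring)
print(found)  # [('HAMMER', 'ARBHAMMERTYU')]
```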

If I understand your problem correctly, you need a function that checks whether a token (e.g. 'HAMMER') is present in any of the entries in allWords. One option is to use regular expressions.
import re
def findWordInWordList(word, allWords):
    # re.escape keeps any special characters in word literal
    pattern = re.compile(re.escape(word))
    for item in allWords:
        match = pattern.search(item)
        if match:
            return match
    return None
This returns the match object for the first occurrence (or None if there is none); if you want more, it's easy to collect them in a list.
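As a sketch of the "collect them in a list" variant, assuming Python 3: the function name find_all_lines and the sample data are hypothetical, and it records where each hit was found rather than returning match objects.

```python
import re

def find_all_lines(word, all_words):
    """Return (line_index, start_offset) for every line that contains word."""
    pattern = re.compile(re.escape(word))
    hits = []
    for index, line in enumerate(all_words):
        match = pattern.search(line)
        if match:
            hits.append((index, match.start()))
    return hits

allWords = ['ARBHAMMERTYU', 'XXHAMMER', 'NOTHING']  # made-up grid lines
hits = find_all_lines('HAMMER', allWords)
print(hits)  # [(0, 3), (1, 2)]
```

For a plain substring test like this, str.find or the in operator would work just as well; the regular expression only pays off if the search pattern becomes more complex.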

You could try something like this:
for word in allWords:
    if word in WordList:
        print("I have found", word)
Ah, or maybe the error is that you wrote wordList and you really defined WordList. Hope this helps.

If I understand correctly, you are trying to find a match inside allWords and you want to iterate over WordList and determine if there is a substring match.
So, if that is correct, then your code is not exactly doing that. To go through your code step by step to correct what is happening:
length = len(allWords)
for i in range(length):
You do not want to loop over allWords here. You want to iterate over WordList and check whether each word appears inside any entry of allWords, so loop over WordList instead:
length = len(WordList)
for i in range(length):
With that in mind, that means now you want to reference WordList and not allWords, so you want to now change this:
word = allWords[i]
to this:
word = WordList[i]
Finally, here is a new bit of information for determining whether you have a substring match: the built-in function any. It returns True if at least one element of the iterable it is given is true. It looks like this:
any("something" in word for word in words)
It returns True if "something" is in at least one word, and False otherwise.
So, to put this all together, and run your code with your sample input, we get:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
length = len(WordList)
for i in range(length):
    word = WordList[i]
    if any(word in w for w in allWords):
        print("I have found", word)
Output:
I have found HAMMER

Store value of "if any(" search

I'm trying to store the value of word, is there a way to do this?
if any(word in currentFile for word in otherFile):

Don't use any if you want the words themselves:
words = [word for word in otherFile if word in currentFile]
Then you can truth-test directly (since an empty list is falsy):
if words:
    # do stuff
And also access the words that matched:
print words
EDIT: If you only want the first matching word, you can do that too:
word = next((word for word in otherFile if word in currentFile), None)
if word:
    # do stuff with word
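A small runnable sketch of both approaches, assuming Python 3 and made-up data (here currentFile is a string, so in performs a substring test):

```python
currentFile = "the quick brown fox"          # made-up file contents
otherFile = ["cat", "quick", "fox", "dog"]   # made-up candidate words

# All matches, in otherFile order.
words = [word for word in otherFile if word in currentFile]

# First match only; the generator stops as soon as one is found.
first = next((word for word in otherFile if word in currentFile), None)

# No match: the default None comes back instead of a StopIteration error.
none_found = next((word for word in ["emu"] if word in currentFile), None)

print(words)       # ['quick', 'fox']
print(first)       # quick
print(none_found)  # None
```

If an empty string could ever be a legitimate match, test with "is not None" rather than relying on truthiness.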

Just a little follow-up:
You should consider what the input to any() is here. The input is a generator expression, so let's break it down:
word in currentFile is a boolean expression - output value is True or False
for word in otherFile performs an iteration over otherFile
So the argument to any() is in fact a generator of boolean values. You can check this by executing [word in currentFile for word in otherFile]. Note that the brackets mean a list is created, with all values computed at once. A generator works the same way functionally (if all you do is loop over the values once) but uses less memory. The point is that what you feed to any() is a sequence of booleans. It has no knowledge of the actual words, so it cannot possibly output one.
So no, any() cannot give you the word. You'll have to write an explicit loop:
def find_first(currentFile, otherFile):
    for word in currentFile:
        if word in otherFile:
            return word
If no match is found, the function implicitly returns None, which the caller of find_first() can handle.
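To make "any() only sees booleans" concrete, here is a short sketch assuming Python 3. The data and the helper check are made up for illustration; check records which words were actually examined.

```python
currentFile = "the quick brown fox"   # made-up file contents
otherFile = ["cat", "quick", "fox"]   # made-up candidate words

# The list version: all values computed up front, and they are just booleans.
flags = [word in currentFile for word in otherFile]
result = any(flags)

# The generator version is evaluated lazily; any() stops at the first True.
examined = []

def check(word):
    examined.append(word)
    return word in currentFile

lazy_result = any(check(word) for word in otherFile)

print(flags)     # [False, True, True]
print(examined)  # ['cat', 'quick']
```

'fox' is never examined in the lazy version because any() returns as soon as 'quick' produces True.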

You're not going to be able to store this value directly from any. I'd recommend a for-loop
for word in otherFile:
    if word in currentFile:
        break
else:
    word = None
if word is not None:
    print word, "was found in the current file"
Note that this will store only the first relevant value of word. If you would like all relevant values of word, then this should do it:
words = [word for word in otherFile if word in currentFile]
for word in words:
    print word, "was found in the current file"

You can get the first word from otherFile that is also in currentFile by dropping all words from otherFile that are not in currentFile and then taking the next one:
from itertools import dropwhile
word = next(dropwhile(lambda word: word not in currentFile, otherFile))
If there is no such word, this raises StopIteration.
You can get all words from otherFile that are also in currentFile by using a list comprehension:
words = [word for word in otherFile if word in currentFile]
Or by using a set intersection:
words = list(set(otherFile) & set(currentFile))
Or by using the filter function:
words = filter(lambda word: word in currentFile, otherFile)
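These options are not fully interchangeable: the set intersection drops duplicates and ordering, and in Python 3 filter returns a lazy iterator rather than a list. A sketch assuming Python 3 and made-up word lists:

```python
from itertools import dropwhile

currentFile = ["fox", "the", "quick"]          # made-up word lists
otherFile = ["quick", "cat", "fox", "quick"]

# Keeps otherFile order and duplicates.
by_comprehension = [w for w in otherFile if w in currentFile]

# Sets lose order and duplicates; sorted() here only gives a stable result.
by_set = sorted(set(otherFile) & set(currentFile))

# In Python 3, filter is lazy, so wrap it in list() to materialise it.
by_filter = list(filter(lambda w: w in currentFile, otherFile))

# Passing a default to next() avoids StopIteration when nothing matches.
first = next(dropwhile(lambda w: w not in currentFile, otherFile), None)
missing = next(dropwhile(lambda w: w not in currentFile, ["cat"]), None)

print(by_comprehension)  # ['quick', 'fox', 'quick']
print(by_set)            # ['fox', 'quick']
print(first, missing)    # quick None
```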

Python NLTK not taking out punctuations correctly

I have defined the following code
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I have previously printed exclude, and the quotation mark " is part of it. I expected answer to be ['the']. However, when I print answer, it shows up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?

When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of strings, only passing in a second argument specifying which characters to remove.
You can then perform the lemmatization on the new list.
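The translate(None, ...) call above is Python 2 only. A sketch of the same idea for Python 3, where the deletion table is built with str.maketrans; the extra sample words are made up, and the NLTK lemmatization step is left out so the example runs with the standard library alone.

```python
import string

wordList = ['"the', 'end.', "don't"]   # first item from the question, rest made up

# Set difference compares whole strings, so nothing is removed here.
difference = set(wordList) - set(string.punctuation)

# str.maketrans('', '', chars) builds a table that deletes every char in chars.
table = str.maketrans('', '', string.punctuation)
cleaned = [word.translate(table) for word in wordList]

print(difference == set(wordList))  # True
print(cleaned)                      # ['the', 'end', 'dont']
```

Note that this also strips apostrophes inside words, which may or may not be what you want before lemmatizing.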
