I currently use the following to filter out words containing square or round brackets, and I can't help but think there must be a tidier way to do this.
words = [word for word in random.choice(headlines).split(" ")[1:-1] if "[" not in word and "]" not in word and "(" not in word and ")" not in word]
I tried creating a list or tuple of symbols and doing
if symbol not in word
But it dies because I'm comparing a list with a string. I appreciate I could explode this out and do a compare like:
for word in random.choice(headlines).split(" ")[1:-1]:
    popIn = 1
    for symbol in symbols:
        if symbol in word:
            popIn = 0
    if popIn == 1:
        words.append(word)
But it seems like overkill in my head. I appreciate I'm a novice programmer so anything I can do to tidy either method up would be very helpful.
Use set intersection.
brackets = set("[]()")
words = [word for word in random.choice(headlines).split(" ")[1:-1] if not brackets.intersection(word)]
The intersection is empty if word does not contain any of the characters in brackets.
You might also consider using itertools instead of a list comprehension.
# Python 2; in Python 3 the function is itertools.filterfalse
words = list(itertools.ifilterfalse(brackets.intersection,
                                    random.choice(headlines).split(" ")[1:-1]))
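The tuple-of-symbols idea from the question can also be made to work with the built-in any. A minimal sketch, written for Python 3; the name strip_bracketed and the sample headline are made up for illustration and are not from the original post:

```python
SYMBOLS = ("[", "]", "(", ")")

def strip_bracketed(words, symbols=SYMBOLS):
    """Return only the words that contain none of the given symbols."""
    # any() stops at the first symbol found in the word
    return [w for w in words if not any(s in w for s in symbols)]

# Hypothetical headline; [1:-1] drops the first and last word as in the question
headline = "BBC [video] Storm hits (live) coast today"
words = strip_bracketed(headline.split(" ")[1:-1])
print(words)
```

This keeps the list comprehension from the question but moves the four "not in" tests into a single any() call.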
I'm not sure exactly what you want to filter, but I advise you to use Python's regular expression module.
import re
r = re.compile(r"\w*[\[\]\(\)]+\w*")  # raw string so the backslashes reach the regex engine
test = ['foo', '[bar]', 'f(o)o']
result = [word for word in test if not r.match(word)]
print result
output is ['foo']
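One caveat with the pattern above: re.match anchors at the start of the word and \w* only skips letters, digits and underscores, so a word such as 'a-b(c)' would slip through the filter. A minimal sketch, assuming the goal is simply to drop any word that contains a bracket character anywhere; the name BRACKETS and the extra test word are illustrative only:

```python
import re

# A character class is enough; search() looks anywhere in the word
BRACKETS = re.compile(r"[\[\]()]")

test = ["foo", "[bar]", "f(o)o", "a-b(c)"]
result = [word for word in test if not BRACKETS.search(word)]
print(result)
```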
Given a string, I have to reverse every word, but keeping them in their places.
I tried:
def backward_string_by_word(text):
    for word in text.split():
        text = text.replace(word, word[::-1])
    return text
But if I have the string Ciao oaiC, then when it tries to reverse the second word, that word is identical to the first one after it has already been reversed, so it gets replaced again. How can I avoid this?
You can do this in one line using join with a generator expression:
text = "test abc 123"
text_reversed_words = " ".join(word[::-1] for word in text.split())
s.replace(x, y) is not the correct method to use here:
It does two things:
1. find x in s
2. replace it with y
But you do not really need to find anything here, since you already have the word you want to replace. The problem is that replace starts searching for x from the beginning of the string each time, not from the position you are currently at, so it finds the word you have already replaced, not the one you want to replace next.
The simplest solution is to collect the reversed words in a list, and then build a new string out of this list by concatenating all reversed words. You can concatenate a list of strings and separate them with spaces by using ' '.join().
def backward_string_by_word(text):
    reversed_words = []
    for word in text.split():
        reversed_words.append(word[::-1])
    return ' '.join(reversed_words)
If you have understood this, you can also write it more concisely by skipping the intermediate list with a generator expression:
def backward_string_by_word(text):
    return ' '.join(word[::-1] for word in text.split())
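To see the replace pitfall concretely, here is a small sketch that reproduces the question's behaviour on "Ciao oaiC". It also shows one possible variant, an assumption on my part rather than anything asked for, that keeps the original spacing by reversing each run of non-whitespace with re.sub. The function names are illustrative:

```python
import re

def buggy_reverse(text):
    # Same logic as the question: replace() searches the whole string
    # each time, so already-reversed words get reversed back.
    for word in text.split():
        text = text.replace(word, word[::-1])
    return text

def reverse_words_keep_spacing(text):
    # Reverse every run of non-whitespace characters in place,
    # leaving the whitespace between words untouched.
    return re.sub(r"\S+", lambda m: m.group(0)[::-1], text)

print(buggy_reverse("Ciao oaiC"))
print(reverse_words_keep_spacing("Ciao oaiC"))
```

Unlike ' '.join(text.split()), the re.sub variant does not collapse tabs or repeated spaces, which may or may not matter for your input.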
Splitting a string converts it to a list. You can just reassign each value of that list to the reverse of that item. See below:
text = "The cat tac in the hat"
def backwards(text):
    split_word = text.split()
    for i in range(len(split_word)):
        split_word[i] = split_word[i][::-1]
    return ' '.join(split_word)
print(backwards(text))
I have created a list of all possible lines for this specific word grid (diagonals, up, down, and all the reverses too).
I have called this allWords, but when I try to find specific words that I know are in allWords, the loop does not find the hidden words. I know what my problem is, but I do not know how to get around it (sorry for the terrible explanation; hopefully the example below shows it better).
An example follows. My wordList is the list of words that I know are hidden somewhere in the word grid. My allWords is a list of rows, columns and diagonals from the word grid:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
HAMMER is in allWords but "cloaked" by the other characters around it, so I am unable to show that HAMMER is in the word grid.
length = len(allWords)
for i in range(length):
    word = allWords[i]
    if word in wordList:
        print("I have found", word)
It does not find the word HAMMER in allWords.
Any help towards solving this problem would be great.
You are not comparing each word in WordList against the entries in allWords. The line if word in wordList tests for an exact match.
i.e.
if word in wordList will return True only if the entire entry from allWords (for example 'ARBHAMMERTYU') is itself an element of wordList.
To match a substring you need another loop:
for i in range(length):
    word = allWords[i]
    for w in WordList:
        if w in word:
            # w is the hidden word; word is the grid line that contains it
            print("I have found", w)
If I understand your problem correctly, you probably need to implement a function that checks whether a token (e.g. 'HAMMER') is present in any of the entries in allWords. My suggestion would be to use regular expressions.
import re
def findWordInWordList(word, allWords):
    pattern = re.compile(".*%s.*" % word)
    for item in allWords:
        match = pattern.search(item)
        if match:
            return match
This will return the first occurrence; if you want more, it is easy to collect them in a list.
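Collecting every match rather than only the first could look roughly like this. It is a sketch using the plain in operator instead of a regex, which should be enough when the hidden words are literal strings; the function name find_hidden_words and the second sample word are hypothetical:

```python
def find_hidden_words(word_list, all_words):
    """Map each hidden word to the grid lines that contain it."""
    found = {}
    for word in word_list:
        # Substring test against every row, column or diagonal
        lines = [line for line in all_words if word in line]
        if lines:
            found[word] = lines
    return found

# Sample data in the shape used by the question
WordList = ["HAMMER", "SAW"]
allWords = ["ARBHAMMERTYU", "QQQQ"]
print(find_hidden_words(WordList, allWords))
```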
You could try something like this:
for word in allWords:
    if word in WordList:
        print("I have found", word)
Ah, or maybe the error is that you wrote wordList and you really defined WordList. Hope this helps.
If I understand correctly, you are trying to find a match inside allWords and you want to iterate over WordList and determine if there is a substring match.
So, if that is correct, then your code is not exactly doing that. To go through your code step by step to correct what is happening:
length = len(allWords)
for i in range(length):
Here you do not want to loop over allWords. You want to iterate over WordList and check whether each word appears inside allWords, so do this instead:
length = len(WordList)
for i in range(length):
With that in mind, that means now you want to reference WordList and not allWords, so you want to now change this:
word = allWords[i]
to this:
word = WordList[i]
Finally, here is a new piece of information for determining whether you have a substring match: the built-in function any. It returns True if at least one element of the iterable it is given is true. It looks like this:
any("something" in word for word in words)
This returns True if "something" is in at least one word in words; otherwise it returns False.
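As a quick illustration of how any behaves on its own, here is a tiny sketch with made-up grid lines (the variable names are illustrative):

```python
lines = ["ARBHAMMERTYU", "XXSAWXX"]

# True as soon as one line contains the substring
has_hammer = any("HAMMER" in line for line in lines)

# False because no line contains it
has_drill = any("DRILL" in line for line in lines)

print(has_hammer, has_drill)
```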
So, to put this all together, and run your code with your sample input, we get:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
length = len(WordList)
for i in range(length):
    word = WordList[i]
    if any(word in w for w in allWords):
        print("I have found", word)
Output:
I have found HAMMER
I have a regular expression like this:
findthe = re.compile(r" the ")
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
What I am trying to do is to replace each occurrence with an associated replacement word from a list so that the end sentence would look like this:
>>> print sentence
This is firstthe first sentence in secondthe whole universe
I tried using re.sub inside a for loop enumerating over replacement, but it looks like re.sub replaces all occurrences at once. Can someone tell me how to do this efficiently?
If you are not required to use a regex, then you can try the following code:
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
words = sentence.split()
counter = 0
for i, word in enumerate(words):
    if word == 'the':
        words[i] = replacement[counter]
        counter += 1
sentence = ' '.join(words)
Or something like this will work too:
import re
findthe = re.compile(r"\b(the)\b")
print re.sub(findthe, replacement[1],re.sub(findthe, replacement[0],sentence, 1), 1)
And finally:
re.sub(findthe, lambda matchObj: replacement.pop(0),sentence)
Artsiom's last answer destroys the replacement variable. Here's a way to do it without emptying replacement:
re.sub(findthe, lambda m, r=iter(replacement): next(r), sentence)
You can use a callback function as the replace parameter, see how at:
http://docs.python.org/library/re.html#re.sub
Then use some counter and replace depending on the counter value.
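A sketch of the callback idea, using an iterator in place of an explicit counter. The function name replace_in_order is made up, and leaving extra matches unchanged once the replacements run out is my assumption about the desired behaviour:

```python
import re

def replace_in_order(pattern, replacements, text):
    """Replace the n-th match of pattern with the n-th replacement."""
    remaining = iter(replacements)

    def callback(match):
        # Fall back to the matched text when replacements are exhausted
        return next(remaining, match.group(0))

    return re.sub(pattern, callback, text)

replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"
print(replace_in_order(r"\bthe\b", replacement, sentence))
```

The replacement list is left intact, so the function can be called again with the same list.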
I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list worked (not particularly important). Anyway, now the punctuation isn't being removed and, quite frankly, I'm stumped as to why.
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list
Any pointers on why it's not working would be great, I'm no expert in python so it's probably something ridiculously stupid. Thanks.
Change
for word in word_list:
    word = punctuation.sub("", word)
to
word_list = [punctuation.sub("", word) for word in word_list]
Assignment to word in the for-loop above simply changes the value referenced by this temporary variable. It does not alter word_list.
You're not updating your word list. Try
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)
Remember that although word starts off as a reference to the string object in the word_list, assignment rebinds the name word to the new string object returned by the sub function. It doesn't change the originally referenced object.
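The difference between rebinding the loop variable and assigning into the list can be seen in a few lines. This is a minimal sketch with made-up sample words; the pattern is the one from the question:

```python
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')

word_list = ["hello,", "world!", "42nd"]

# Rebinding: 'word' points at a new string, the list is untouched
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)

# Index assignment: the list element itself is replaced
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

print(unchanged)
print(word_list)
```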
I have a list that contains many sentences. I want to iterate through the list, removing from all sentences words like "and", "the", "a", "are", etc.
I tried this:
def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    for i, j in articles.iteritems():
        text = text.replace(i, j)
    return text
As you can probably tell, however, this will remove "a" and "an" when they appear in the middle of a word. I need to remove only the instances of the words that are delimited by blank space, and not those within a word. What is the most efficient way of going about this?
I would go for a regex, something like:
import re

def removearticles(text):
    # raw strings keep \s and the \1, \3 backreferences intact
    return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)
or, if you want to remove the leading whitespace as well:
def removearticles(text):
    return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
This looks more like an NLP job than something you would do with a straight regex. I would check out NLTK (http://www.nltk.org/). IIRC, it comes with a corpus full of filler words (stop words) like the ones you're trying to get rid of.
def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    rest = []
    for word in text.split():
        if word not in articles:
            rest.append(word)
    return ' '.join(rest)
The in operator runs faster on a dict than on a list.
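Since the dict values are never used, a set gives the same fast membership test with less noise. Here is a sketch applied to a list of sentences, as described in the question; the name remove_articles, the sample sentences, and the lowercasing step are assumptions for illustration:

```python
ARTICLES = frozenset(["a", "an", "and", "the"])

def remove_articles(sentence, articles=ARTICLES):
    """Drop whole words that appear in the articles set."""
    # lower() is an assumption so that "The" is treated like "the"
    kept = [w for w in sentence.split() if w.lower() not in articles]
    return " ".join(kept)

sentences = ["The cat and a dog", "An apple a day"]
cleaned = [remove_articles(s) for s in sentences]
print(cleaned)
```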
Try something along the lines of
def removearticles(text):
    articles = ['and', 'a']
    newText = ''
    for word in text.split(' '):
        if word not in articles:
            newText += word + ' '
    return newText[:-1]
It can be done using a regex. Iterate through your strings (or join the list and send it as one string) and apply the following regex.
>>> import re
>>> rx = re.compile(r'\ban\b|\bthe\b|\band\b|\ba\b')
>>> rx.sub(' ','a line with lots of an the and a baad')
'  line with lots of         baad'
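Because each removed word is replaced by a space, the result keeps runs of extra whitespace. A sketch that tidies this up afterwards follows; the function name strip_articles and the case-insensitive flag are my additions, not part of the answer above:

```python
import re

# One shared pair of \b anchors around the alternation
ARTICLE_RX = re.compile(r"\b(?:a|an|and|the)\b", re.IGNORECASE)

def strip_articles(text):
    """Remove whole-word articles and collapse the leftover whitespace."""
    without = ARTICLE_RX.sub(" ", text)
    # split() with no argument drops leading, trailing and repeated spaces
    return " ".join(without.split())

print(strip_articles("a line with lots of an the and a baad"))
```

Note that \b also treats hyphens and apostrophes as word boundaries, so a token like "a-list" would lose its leading "a"; the split-based approaches do not have that issue.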