I'm sure I am missing something obvious here, but I have been staring at this code for a while and cannot find the root of the problem.
I want to search through many strings, find all the occurrences of certain keywords, and for each of these hits, to retrieve (and save) the two words immediately preceding and following the keywords.
So far the code I have finds those words, but when the keyword occurs more than once in a string, the code returns a separate list for each occurrence. How can I aggregate those lists at the observation/string level (so that I can match them back to string i)?
Here is a mock example of a sample and desired results. Keyword is "not":
review_list=['I like this book.', 'I do not like this novel, no, I do not.']
results= [[], ['I do not like this I do not']]
Current results (using code below) would be:
results = [[], ['I do not like this'], ['I do not']]
Here is the code (simplified version):
for i in review_list:
    if (" not " or " neither ") in i:
        z = i.split(' ')
        for x in [x for (x, y) in enumerate(z) if find_not in y]:
            neg_1=[(' '.join(z[max(x-numwords,0):x+numwords+1]))]
            neg1.append(neg_1)
    elif (" not " or " neither ") not in i:
        neg_1=[]
        neg1.append(neg_1)
Again, I am certain this is basic, but as a new Python user, any help will be greatly appreciated. Thanks!
To extract only the words (dropping punctuation), e.g. from a string such as
'I do not like this novel, no, I do not.'
I recommend regular expressions:
import re
words = re.findall(r'\w+', somestring)
To find all indices at which one word equals not:
indices = [i for i, w in enumerate(words) if w=='not']
To get the two preceding and two following words as well, I recommend a set to remove duplicates:
allindx = set()
for i in indices:
    for j in range(max(0, i-2), min(i+3, len(words))):
        allindx.add(j)
and finally to get all the words in question into a space-joined string:
result = ' '.join(words[i] for i in sorted(allindx))
Now of course we can put all these tidbits together into a function...:
import re
def twoeachside(somestring, keyword):
    # split into words, dropping punctuation
    words = re.findall(r'\w+', somestring)
    # positions of every occurrence of the keyword
    indices = [i for i, w in enumerate(words) if w == keyword]
    allindx = set()
    for i in indices:
        for j in range(max(0, i-2), min(i+3, len(words))):
            allindx.add(j)
    result = ' '.join(words[i] for i in sorted(allindx))
    return result
Of course, this function works on a single sentence. To make a list of results from a list of sentences:
review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [twoeachside(s, 'not') for s in review_list]
assert results == [[], ['I do not like this I do not']]
the last assert of course just being a check that the code works as you desire:-)
EDIT: actually judging from the example you somewhat absurdly require the results' items to be lists with a single string item if non-empty but empty lists if the string in them would be empty. This absolutely weird spec can of course also be met...:
results = [twoeachside(s, 'not') for s in review_list]
results = [[s] if s else [] for s in results]
it just makes no sense whatsoever, but hey!, it's your spec!-)
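For reference, here is a minimal, self-contained sketch that puts the pieces above together and produces the list-per-review shape from the question. The function name context_words and the width parameter are illustrative assumptions, not part of the original answer.

```python
import re

def context_words(text, keyword, width=2):
    """Return [joined words around every keyword hit], or [] when there is no hit."""
    words = re.findall(r'\w+', text)
    keep = set()
    for i, w in enumerate(words):
        if w == keyword:
            # keep the hit plus `width` words on each side, clipped to the sentence
            keep.update(range(max(0, i - width), min(i + width + 1, len(words))))
    joined = ' '.join(words[i] for i in sorted(keep))
    return [joined] if joined else []

review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [context_words(s, 'not') for s in review_list]
print(results)  # [[], ['I do not like this I do not']]
```

Because each review yields exactly one entry, results[i] always lines up with review_list[i].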
I am trying to get this loop to capitalize each word using the capitalize() function
Here is the code
def cap(words):
    for j in words:
        print(j)
        return j.capitalize()
s = "my name"
parts = s.split(" ")
print(parts)
it = iter(parts)
capped = cap(parts)
print(capped)
result = capped.join(" ")
print(capped)
Output:
['my', 'name']
my
My
My
I want it to return both words capitalized.
Why does the program fail?
Have a closer look at your cap() function.
def cap(words):
    for j in words:
        print(j)
        return j.capitalize()
return j.capitalize() exits the function on the first pass through the loop, so only the capitalized first word is returned.
Correction
The function must capitalize all the elements in the list.
def cap(words):
    capitalized = []
    for word in words:
        capitalized.append(word.capitalize())
    return capitalized
Now the final code should look like
def cap(words):
    capitalized = []
    for word in words:
        capitalized.append(word.capitalize())
    return capitalized
words = ["hello","foo","bar"]
capitalized = cap(words)
print(*capitalized)
Perhaps a more Pythonic way would be to use a list comprehension:
def cap(words):
    return [word.capitalize() for word in words]
words = ["hello","foo","bar"]
capitalized = cap(words)
print("Regular: ",*words) # printing un-packed version of words[]
print("Capitalized: ",*capitalized)
Output:
Regular: hello foo bar
Capitalized: Hello Foo Bar
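Another option, if you would rather keep the loop shape of the original function, is a generator. This is only a sketch of that alternative: yield hands back one value at a time instead of leaving the function the way return does.

```python
def cap(words):
    for word in words:
        # yield produces one capitalized word, then the loop continues
        yield word.capitalize()

capitalized = list(cap(["my", "name"]))
print(" ".join(capitalized))  # My Name
```

The caller collects the values with list() (or iterates over them directly), so nothing is lost after the first word.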
What about something as simple as:
[s.capitalize() for s in words]
What is happening here?
In short, code complexity optimisation. Rather than writing verbose, (potentially) inefficient code, part of writing quality code is efficiency and readability.
This loop is using a technique called list comprehension. This is a very useful technique when dealing with lists, or for something simple enough to not require a dedicated function.
As words is iterated, the .capitalize() function is called on each word with the results being appended to the returned list - without explicitly appending the list.
To make this into a function, you can use:
def cap(words) -> list:
    """Capitalise each word in a list.
    Args:
        words (list): List of words to be capitalised.
    Returns:
        A list of capitalised words.
    """
    return [s.capitalize() for s in words]
Output:
words = 'my name'.split(' ')
cap(words)
>>> ['My', 'Name']
I have "cleaned" your code (but kept the concept).
Note that it can be done in a much shorter way.
def cap(words):
    result = []
    for j in words:
        result.append(j.capitalize())
    return result
words = "my name"
words_lst = words.split(" ")
capped = cap(words_lst)
result = " ".join(capped)
print(result)
output
My Name
A shorter way :-)
print(' '.join([s.capitalize() for s in 'my name'.split(' ')]))
output
My Name
This is my code, but it doesn't work. It should read text from the console, split it into words and distribute them into 3 lists and use separators between them.
words = list(map(str, input().split(" ")))
lowercase_words = []
uppercase_words = []
mixedcase_words = []
def split_symbols(list):
    from operator import methodcaller
    list = words
    map(methodcaller(str,"split"," ",",",":",";",".","!","( )","","'","\\","/","[ ]","space"))
    return list
for word in words:
    if words[word] == word.lower():
        words[word] = lowercase_words
    elif words[word] == word.upper():
        words[word] = uppercase_words
    else:
        words[word] = mixedcase_words
print(f"Lower case: {split_symbols(lowercase_words)}")
print(f"Upper case: {split_symbols(uppercase_words)}")
print(f"Mixed case: {split_symbols(mixedcase_words)}")
There are several issues in your code.
1) words is a list and word is a string. You are trying to index the list with a string, which raises a TypeError. You must use an integer to index a list. In this case, you don't even need indexes.
2) To check for lower or upper case you can compare word == word.lower() or word == word.upper(). Another approach is to use the islower() or isupper() methods, which return a boolean (note that they return False for strings with no cased characters, such as "123").
3) You are trying to assign an empty list to that element of list. What you want is to append the word to that particular list. You want something like lowercase_words.append(word). Same for uppercase and mixedcase
So, to fix these three issues you can write the code like this -
for word in words:
    if word == word.lower(): # similar to word.islower()
        lowercase_words.append(word)
    elif word == word.upper(): # similar to word.isupper()
        uppercase_words.append(word)
    else:
        mixedcase_words.append(word)
My advice would be to refrain from naming variables things like list, since that shadows the built-in. Also, in split_symbols() you overwrite the parameter list with words. I think you meant it the other way around.
Now I am not sure about the "use separators between them" part of the question. But the line map(methodcaller(str,"split"," ",",",":",";",".","!","( )","","'","\\","/","[ ]","space")) is definitely wrong. map() takes a function and an iterable. In your code the iterable is missing, and I think this is where the input parameter list fits in. Also, methodcaller() takes the method name as its first argument, followed by the arguments for that method. So, it may be something like -
map(methodcaller("split"," "), list)
But then again, I am not sure what you are trying to achieve with that many separators.
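If the intent is to split the input on any of those punctuation characters before sorting the words into the three lists, one possible sketch uses re.split with a character class. The function name classify_words and the exact separator set are assumptions based on the characters listed in the question.

```python
import re

def classify_words(text):
    """Split text on the assumed separators and sort words by letter case."""
    # Assumed separators: space , : ; . ! ( ) ' \ / [ ]
    pieces = re.split(r"[ ,:;.!()'\\/\[\]]+", text)
    words = [w for w in pieces if w]  # drop empty strings left at the edges
    lowercase_words, uppercase_words, mixedcase_words = [], [], []
    for word in words:
        if word == word.lower():
            lowercase_words.append(word)
        elif word == word.upper():
            uppercase_words.append(word)
        else:
            mixedcase_words.append(word)
    return lowercase_words, uppercase_words, mixedcase_words

lower, upper, mixed = classify_words("Hello, world! THIS is; a Test")
print(lower, upper, mixed)
```

Returning the three lists, rather than mutating module-level ones, keeps the function easy to test.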
I have a set of fixed words of size 20. I have a large file of 20,000 records, where each record contains a string and I want to find if any word from the fixed set is present in a string and if present the index of the word.
example
import nltk
s1 = set(["barely", "rarely", "hardly"])  # (actual size 20)
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]  # (actual size 20,000)
def get_token_index(token, indx):
    if token in s1:
        return indx
    else:
        return -1
def find_word(text):
    tokens = nltk.word_tokenize(text)
    indexlist = []
    for i in range(0, len(tokens)):
        indexlist.append(i)
    word_indx = map(get_token_index, tokens, indexlist)
    for indx in word_indx:
        if indx != -1:
            pass  # Do something with tokens[indx]
I want to know if there is a better/faster way to do it.
This suggestion only removes some glaring inefficiencies; it won't affect the overall complexity of your solution:
def find_word(text, s1=s1): # micro-optimization, make s1 local
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            pass  # Do something with `word` and `i`
Essentially, you are slowing things down by using map when all you really need is a condition inside your loop body anyway... So basically, just get rid of get_token_index, it is over-engineered.
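As a runnable illustration of the same idea, here is a small sketch that returns the matching positions instead of acting on them in place. It assumes plain whitespace splitting as a stand-in for nltk.word_tokenize, and the name find_keywords is hypothetical.

```python
def find_keywords(text, keywords):
    """Return (index, token) pairs for every token of `text` found in `keywords`."""
    # Assumption: str.split() stands in for nltk.word_tokenize here
    return [(i, tok) for i, tok in enumerate(text.split()) if tok in keywords]

s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
hits = [find_keywords(record, s1) for record in l2]
print(hits)  # [[(1, 'hardly')], [], [(2, 'barely')]]
```

Membership tests against a set take constant time on average, so the work per record is proportional to its number of tokens.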
You can use list comprehension with a double for loop:
s1=set(["barely","rarely", "hardly"])
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
locations = [c for c, b in enumerate(l2) for a in s1 if a in b]
In this example, the output would be:
[0, 2]
However, if you would like a way of accessing the indexes at which a certain word appears:
from collections import defaultdict
d = defaultdict(list)
for word in s1:
    for index, sentence in enumerate(l2):
        if word in sentence:
            d[word].append(index)
This should work:
for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            print("%s at index %d" % (s, words.index(s)))
The easiest and slightly more efficient way would be to use a generator expression:
index_tuple = list(l2.index(i) for i in s1 if i in l2)
Note that i in l2 only matches when a keyword is identical to a whole record in l2, not when it appears inside one. You can time it to check how efficiently it works for your requirement.
I'm trying to create a basic program to pick out the positions of words in a quote. So far, I've got the following code:
print("Your word appears in your quote at position(s)", string.index(word))
However, this only prints the first position where the word is indexed, which is fine if the quote only contains the word once, but if the word appears multiple times, it will still only print the first position and none of the others.
How can I make it so that the program will print every position in succession?
Note: very confusingly, string here stores a list. The program is supposed to find the positions of words stored within this list.
It seems that you're trying to find occurrences of a word inside a string: the re library has a function called finditer that is ideal for this purpose. We can use this along with a list comprehension to make a list of the indexes of a word:
>>> import re
>>> word = "foo"
>>> string = "Bar foo lorem foo ipsum"
>>> [x.start() for x in re.finditer(word, string)]
[4, 14]
This function will find matches even if the word is inside another, like this:
>>> [x.start() for x in re.finditer("foo", "Lorem ipsum foobar")]
[12]
If you don't want this, wrap your word in word boundaries (\b) so that only whole words match and x.start() still points at the first letter of the word:
[x.start() for x in re.finditer(r"\b" + re.escape(word) + r"\b", string)]
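A small sketch of whole-word matching wrapped in a function, for illustration. The name word_positions is hypothetical; re.escape is used so that a word containing regex metacharacters is matched literally.

```python
import re

def word_positions(word, text):
    """Return the character index of every whole-word occurrence of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]

positions = word_positions("foo", "Bar foo lorem foo ipsum foobar")
print("Your word appears in your quote at position(s)", positions)
```

Here the two standalone occurrences are reported and the "foo" inside "foobar" is skipped.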
Probably not the fastest or best way, but it will work. I used in rather than == in case there are quotation marks or other unexpected punctuation attached to the word as well. Hope this helps!
def getWord(string, word):
    index = 0
    data = []
    for i in string.split(' '):
        # `in` lets "foo," or '"foo"' still count as a match for "foo"
        if word.lower() in i.lower():
            data.append(index)
        index += 1
    return data
Here is some code I quickly wrote that should work:
string = "Hello my name is Amit and I'm answering your question".split(' ')
indices = [index for (index, word) in enumerate(string) if word == "QUERY"]
That should work, although it returns the index of the word within the list. To get the index of the first letter instead, you could add up the lengths of all the words before that word (plus one for each separating space).
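The letter-index calculation mentioned above could be sketched as follows, assuming the quote is split on single spaces. The function name word_and_char_positions is hypothetical.

```python
def word_and_char_positions(quote, query):
    """Return (word_index, char_index) for every occurrence of `query` in `quote`."""
    positions = []
    offset = 0  # character index where the current word starts
    for index, word in enumerate(quote.split(' ')):
        if word == query:
            positions.append((index, offset))
        offset += len(word) + 1  # +1 for the space that followed the word
    return positions

found = word_and_char_positions("Bar foo lorem foo ipsum", "foo")
print(found)  # [(1, 4), (3, 14)]
```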
So my function should open a file and count the word length and give the output. For example,
many('sample.txt')
Words of length 1: 2
Words of length 2: 6
Words of length 3: 7
Words of length 4: 6
My sample.txt file contains:
This is a test file. How many words are of length one?
How many words are of length three? We should figure it out!
Can a function do this?
My coding so far,
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
    L.sort
    for item in L:
        if item == 1:
            print('Words of length 1:', L.count(item))
Can anyone tell me what I'm doing wrong? When I call the function, nothing happens. It's clearly because of my coding, but I don't know where to go from here. Any help would be nice, thanks.
You want to obtain a list of lengths (1, 2, 3, 4,... characters) and a number of occurrences of words with this length in the file.
So until L = text.split() it was a good approach. Now have a look at dictionaries in Python, that will allow you to store the data structure mentioned above and iterate over the list of words in the file. Just a hint...
Since this is homework, I'll post a short solution here, and leave it as exercise to figure out what it does and why it works :)
>>> from collections import Counter
>>> text = open("sample.txt").read()
>>> counts = Counter([len(word.strip('?!,.')) for word in text.split()])
>>> counts[3]
7
What do you expect here
if item == 1:
and here
L.count(item)
And what does actually happen? Use a debugger and have a look at the variable values or just print them to the screen.
Maybe also this:
>>> s
'This is a test file. How many words are of length one? How many words are of length three? We should figure it out! Can a function do this?'
>>> {x:[len([c for c in w ]) for w in s.split()].count(x) for x in [len([c for c in w ]) for w in s.split()] }
{1: 2, 2: 6, 3: 5, 4: 6, 5: 4, 6: 5, 8: 1}
Let's analyze your problem step-by-step.
You need to:
1. Retrieve all the words from a file
2. Iterate over all the words
3. Increment the counter N every time you find a word of length N
4. Output the result
You already did step 1:
def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
Then you (try to) sort the words, but L.sort without parentheses never calls the method, and sorting the words alphabetically would not help with this task anyway.
Instead, let's define a Python dictionary to hold the count of words
    lengths = dict()
@sukhbir correctly suggested in a comment to use the Counter class, and I encourage you to look it up, but I'll stick to traditional dictionaries in this example, as I find it important to become familiar with the basics of the language before exploring the library.
Let's go on with step 2:
    for word in L:
        length = len(word)
For each word in the list, we assign to the variable length the length of the current word. Let's check if the counter already has a slot for our length:
        if length not in lengths:
            lengths[length] = 0
If no word of length length was encountered, we allocate that slot and we set that to zero. We can finally execute step 3:
        lengths[length] += 1
Here we increment by one the counter for words of the current length.
At the end of the function, you'll find that lengths will contain a map of word length -> number of words of that length. Let's verify that by printing its contents (step 4):
    for length, counter in lengths.items():
        print("Words of length %d: %d" % (length, counter))
If you copy and paste the code I wrote (respecting the indentation!!) you will get the answers you need.
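Assembled in one place, the steps above might look like the sketch below. Two things here are assumptions that go beyond the walkthrough: the counting is split into a helper (count_lengths, a hypothetical name) so it can be tried on a string, and surrounding punctuation is stripped so that "one?" counts as three letters, which is what the expected output in the question implies.

```python
def count_lengths(text):
    """Map each word length to the number of words of that length."""
    lengths = dict()
    for word in text.split():
        # Assumption: ignore punctuation attached to a word, e.g. "one?" -> "one"
        length = len(word.strip('?!,.'))
        if length == 0:
            continue  # the token was punctuation only
        if length not in lengths:
            lengths[length] = 0
        lengths[length] += 1
    return lengths

def many(fname):
    with open(fname, 'r') as infile:
        text = infile.read()
    lengths = count_lengths(text)
    for length in sorted(lengths):
        print("Words of length %d: %d" % (length, lengths[length]))

sample = ("This is a test file. How many words are of length one?\n"
          "How many words are of length three? We should figure it out!\n"
          "Can a function do this?")
counts = count_lengths(sample)
print(counts[1], counts[2], counts[3], counts[4])  # 2 6 7 6
```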
I strongly suggest you go through the Python tutorial.
The regular expression library might also be helpful, if somewhat overkill. A simple word-matching regex might be something like:
import re
f = open("sample.txt")
text = f.read()
words = re.findall(r"\w+", text)
words is then a list of... words :)
This, however, will not properly match words like 'isn't' and 'I'm', as \w only matches alphanumerics and the underscore. In the spirit of this being homework I guess I'll leave that for the interested reader, but the Python regular expression documentation is a pretty good place to start.
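For readers who want a starting point, one possible pattern allows apostrophes inside a word. This is a sketch only: it assumes the plain ASCII apostrophe and does not handle typographic quotes or hyphenated words, and the names WORD_RE and find_words are hypothetical.

```python
import re

# One or more word characters, optionally followed by apostrophe + word characters
WORD_RE = re.compile(r"\w+(?:'\w+)*")

def find_words(text):
    """Return the words in `text`, keeping contractions such as "isn't" whole."""
    return WORD_RE.findall(text)

tokens = find_words("I'm sure it isn't broken.")
print(tokens)  # ["I'm", 'sure', 'it', "isn't", 'broken']
```

The group is non-capturing, (?:...), so findall returns the whole match rather than just the group.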
Then my approach for counting these words by length would be something like:
occurrence = dict()
for word in words:
    try:
        occurrence[len(word)] = occurrence[len(word)] + 1
    except KeyError:
        occurrence[len(word)] = 1
print(occurrence.items())
Where a dictionary (occurrence) is used to store the word lengths and their occurrence in your text. The try: and except: keywords deal with the first time we try and store a particular length of word in the dictionary, where in this case the dictionary is not happy at being asked to retrieve something that it has no knowledge of, and the except: picks up the exception that is thrown as a result and stores the first occurrence of that length of word. The last line prints everything in your dictionary.
Hope this helps :)