Find + Find next in Python

Let L be a list of strings.
Here is the code I use for finding a string texttofind in the list L.
texttofind = 'Bonjour'
for s in L:
    if texttofind in s:
        print('Found!')
        print(s)
        break
How would you implement a Find next feature? Do I need to store the index of the previously found string?

One approach for huge lists would be to use a generator. Suppose you do not know whether the user will need the next match.
def string_in_list(s, entities):
    """Return elements of entities that contain given string."""
    for e in entities:
        if s in e:
            yield e

huge_list = ['you', 'say', 'hello', 'I', 'say', 'goodbye'] # ...
matches = string_in_list('y', huge_list) # look for strings with letter 'y'
next(matches) # first match
next(matches) # second match
The other answers suggesting list comprehensions are great for short lists when you want all results immediately. The nice thing about this approach is that if you never need the third result no time is wasted finding it. Again, it would really only matter for big lists.
Update: If you want the cycle to restart at the first match, you could do something like this...
def string_in_list(s, entities):
    idx = 0
    # Caution: if entities is non-empty and nothing contains s, this never yields.
    while idx < len(entities):
        if s in entities[idx]:
            yield entities[idx]
        idx += 1
        if idx >= len(entities):
            # restart from the beginning
            idx = 0

huge_list = ['you', 'say', 'hello']
m = string_in_list('y', huge_list)
next(m) # you
next(m) # say
next(m) # you, again
See How to make a repeating generator for other ideas.
Another Update
It's been years since I first wrote this. Here's a better approach using itertools.cycle:
from itertools import cycle # will repeat after end
# look for s in items of huge_list
matches = cycle(i for i in huge_list if s in i)
next(matches)
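The snippet above leaves s and huge_list undefined, so here is a self-contained sketch of the same itertools.cycle idea. The sample list and the search strings are made up for illustration. It also shows what happens when nothing matches: cycle over an empty iterable is simply exhausted, so passing a default to next avoids a StopIteration.

```python
from itertools import cycle

huge_list = ['you', 'say', 'hello']  # sample data, assumed for illustration
s = 'y'

# cycle() replays the matches from the start once the generator is exhausted.
matches = cycle(item for item in huge_list if s in item)
first_three = [next(matches) for _ in range(3)]

# With no matches at all, the cycle is empty; a default avoids StopIteration.
no_matches = cycle(item for item in huge_list if 'z' in item)
fallback = next(no_matches, None)
```

Keep in mind that cycle saves a copy of every match it has produced, so memory use grows with the number of matches.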

To find all strings in L that have s as a substring:
[f for f in L if s in f]

If you want to find all indexes of strings in L which have s as a substring,
[i for i in range(0, len(L)) if L[i].find(s) >= 0]

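This finds the match and prints the element after it, if one exists. You can wrap it in a function and return None or an empty string when there is no next element.
L = ['Hello', 'Hola', 'Bonjour', 'Salam']
texttofind = 'Bonjour'
for i, l in enumerate(L):
    if l == texttofind:
        print(l)
        # Only print the following element if the match is not the last one.
        if i + 1 < len(L):
            print(L[i + 1])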
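To answer the original question directly: yes, storing the index of the previous match is one simple way to get Find next. Below is a minimal sketch, assuming a hypothetical helper find_next (not part of any library) that searches from a given start position and returns -1 when nothing more is found. The sample list is made up.

```python
def find_next(strings, text, start=0):
    """Return the index of the first element at or after start that contains text, or -1."""
    for i in range(start, len(strings)):
        if text in strings[i]:
            return i
    return -1

L = ['Bonjour tout le monde', 'Hello', 'Bonjour encore']  # sample data

first = find_next(L, 'Bonjour')              # "Find"
second = find_next(L, 'Bonjour', first + 1)  # "Find next", resuming after the last hit
third = find_next(L, 'Bonjour', second + 1)  # no further match
```

The caller keeps the last returned index and passes it plus one as start on the next call.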

Related

Fastest way to search list for an element that begins with a string?

What is the fastest way to check whether a list has an element that begins with a specified string, and to return the index of that element if it is found?
Something like:
mylist=['one','two','three','four']
mystring='thr'
It should return 2.
You can't get better than O(n) complexity here, but generally speaking if you are after pure speed then don't even use Python.
The canonical Python solution I would propose is to use a memory efficient generator and call next on it once.
>>> mylist = ['one','two','three','four']
>>> mystring = 'thr'
>>> next(index for index, item in enumerate(mylist) if item.startswith('thr'))
2
By default, this will give you a StopIteration exception if the condition is never satisfied. You can provide a second argument to next if you want a fallback-value.
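As a small sketch of the fallback value mentioned above: the generator expression needs its own parentheses once next takes a second argument. The wrapper name first_index_startswith and the choice of -1 as the default are assumptions for illustration.

```python
mylist = ['one', 'two', 'three', 'four']

def first_index_startswith(items, prefix, default=-1):
    """Return the index of the first item starting with prefix, or default."""
    return next(
        (i for i, item in enumerate(items) if item.startswith(prefix)),
        default,
    )

found = first_index_startswith(mylist, 'thr')
missing = first_index_startswith(mylist, 'xyz')
```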
indices = [i for i, s in enumerate(mylist) if s.startswith('thr')]
Enumerate is slightly faster
Just running a counter should do you fine
i = 0
mylist = ['one','two','three','four']
mystring = 'thr'
for x in mylist:
    if mystring in x:
        i = i + 1
        print (i)
    else:
        i = i + 1
Although this will print '3' and not '2', because the counter is incremented before it is printed.
I hope this helps.
If you are going to do more than a single search, you can organize the list in a way to get better than O(n) execution time for each search. Obviously if you're only doing a single search the overhead of reorganizing the list will be prohibitive.
import bisect
mylist.sort()
n = bisect.bisect_left(mylist, mystring)
if n >= len(mylist) or not mylist[n].startswith(mystring):
    print('not found')
If you need to preserve the original order it's only a little more complicated.
mysorted = sorted((s,i) for i,s in enumerate(mylist))
n = bisect.bisect_left(mysorted, (mystring, 0))
if n >= len(mysorted) or not mysorted[n][0].startswith(mystring):
    print('not found')
else:
    n = mysorted[n][1]
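One caveat with the sorted approach: the entry bisect lands on is the alphabetically smallest match, which is not necessarily the earliest one in the original list. Here is a rough sketch that builds the sorted index once and then scans the run of matching entries for the lowest original position. The helper names build_prefix_index and first_index_with_prefix are made up for this example.

```python
import bisect

def build_prefix_index(words):
    """Sort (word, original_index) pairs once; reuse the result for many searches."""
    return sorted((w, i) for i, w in enumerate(words))

def first_index_with_prefix(index, prefix):
    """Return the lowest original index of a word starting with prefix, or None."""
    # (prefix,) sorts before every (word, i) whose word starts with prefix.
    n = bisect.bisect_left(index, (prefix,))
    best = None
    # All words sharing the prefix are adjacent in the sorted list.
    while n < len(index) and index[n][0].startswith(prefix):
        position = index[n][1]
        if best is None or position < best:
            best = position
        n += 1
    return best

mylist = ['one', 'two', 'three', 'four']
index = build_prefix_index(mylist)
```

With this, first_index_with_prefix(index, 'thr') gives 2, and a prefix shared by several words returns the earliest of them in the original order.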
mystring='thr'
[n for (n,item) in enumerate(mylist) if item.startswith(mystring)][0]
Out: 2

fast way to search for a set of words in a list of words python

I have a set of fixed words of size 20. I have a large file of 20,000 records, where each record contains a string and I want to find if any word from the fixed set is present in a string and if present the index of the word.
example
import nltk

s1 = set(["barely", "rarely", "hardly"])  # (actual size 20)
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]  # (actual size 20,000)

def get_token_index(token, indx):
    if token in s1:
        return indx
    else:
        return -1

def find_word(text):
    tokens = nltk.word_tokenize(text)
    indexlist = []
    for i in range(0, len(tokens)):
        indexlist.append(i)
    word_indx = map(get_token_index, tokens, indexlist)
    for indx in word_indx:
        if indx != -1:
            pass  # Do Something with tokens[indx]
I want to know if there is a better/faster way to do it.
This suggestion only removes some glaring inefficiencies; it won't affect the overall complexity of your solution:
def find_word(text, s1=s1): # micro-optimization, make s1 local
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            pass  # Do something with `word` and `i`
Essentially, you are slowing things down by using map when all you really need is a condition inside your loop body anyway... So basically, just get rid of get_token_index, it is over-engineered.
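For reference, a self-contained sketch of the same idea applied to all records at once. It assumes plain whitespace splitting via str.split instead of nltk.word_tokenize, which is a simplification (punctuation stays attached to tokens). The function name find_words is made up for this example.

```python
s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]

def find_words(records, vocabulary):
    """Yield (record_index, token_index, token) for every token found in vocabulary."""
    for record_index, text in enumerate(records):
        # Assumption: whitespace tokenization is good enough for the data.
        for token_index, token in enumerate(text.split()):
            if token in vocabulary:  # set membership is O(1) on average
                yield record_index, token_index, token

hits = list(find_words(l2, s1))
```

Each token costs one set lookup, so the total work is proportional to the number of tokens across all records.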
You can use list comprehension with a double for loop:
s1=set(["barely","rarely", "hardly"])
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
locations = [c for c, b in enumerate(l2) for a in s1 if a in b]
In this example, the output would be:
[0, 2]
However, if you would like a way of accessing the indexes at which a certain word appears:
from collections import defaultdict
d = defaultdict(list)
for word in s1:
    for index, sentence in enumerate(l2):
        if word in sentence:
            d[word].append(index)
This should work:
for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            print("%s at index %d" % (s, words.index(s)))
The easiest and slightly more efficient way would be to use a generator expression:
index_tuple = list(l2.index(i) for i in s1 if i in l2)
Note that i in l2 only matches when a word equals a whole element of l2, so it finds nothing for the example sentences above. You can time it and check how well it meets your requirement.

Python string extraction from array of strings

I am having trouble figuring out the following:
Suppose I have a list of strings
strings = ["and","the","woah"]
I want the output to be a list of strings where the ith position of every string becomes a new string item in the array like so
["atw","nho","dea","h"]
I am playing with the following list comprehension
u = [[]]*4
c = [u[i].append(stuff[i]) for i in range(0,4) for stuff in strings]
but it's not working out. Can anyone help? I know you can use other tools to accomplish this, but I am particularly interested in making this happen with for loops and list comprehensions. This may be asking a lot; let me know if it is.
Using just list comprehensions and for loops you can:
strings = ["and","the","woah"]
# Build a list of empty strings, one per character position of the longest word
new = ["" for x in range(max([len(m) for m in strings]))]
# Cycle through new list
for index, item in enumerate(new):
    for w in strings:
        try:
            item += w[index]
            new[index] = item
        except IndexError:
            pass
print(new)
My idea would be to use itertools.izip_longest and a list comprehension.
>>> from itertools import izip_longest
>>> strings = ["and","the","woah"]
>>> [''.join(x) for x in izip_longest(*strings, fillvalue='')]
['atw', 'nho', 'dea', 'h']
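Note that izip_longest is the Python 2 name. On Python 3 the same function is itertools.zip_longest, so an equivalent sketch would be:

```python
from itertools import zip_longest

strings = ["and", "the", "woah"]

# zip_longest pads the shorter strings with fillvalue, so no characters are dropped.
columns = [''.join(chars) for chars in zip_longest(*strings, fillvalue='')]
```

Here columns comes out as ['atw', 'nho', 'dea', 'h'], matching the output above.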
Try
array = ["and","the","woah"]
array1 = []
longest_item = 0
for i in range(len(array)):
    if len(array[i]) > longest_item:
        longest_item = len(array[i]) # find longest string
for i in range(longest_item):
    column = ""
    for i1 in range(len(array)):
        # slicing past the end of a short string yields "", so no length check is needed
        column += array[i1][i:i+1]
    array1.append(column)
I didn't actually try this code out, I just improvised it. Please leave a comment ASAP if you find a bug.

Python - Searching index of lists in list containing one element

I have a list L of lists of length 4
L = [[1,2,12,13],[2,3,13,14],...]
and two integers a and b which appear many times in the sublists. What I want is to find the index of the sublists in L which contain a AND b.
I wrote a little code
l=[]
for i in range(len(L)):
    if L[i][0]==a or L[i][1]==a or L[i][2]==a or L[i][3]==a:
        l.append([i] + L[i]) # I put the index in the first position.
# Now l is a list of 5-length lists.
# I do the same loop on that list.
r=[]
for i in range(len(l)):
    if l[i][1]==b or l[i][2]==b or l[i][3]==b or l[i][4]==b:
        r.append(l[i][0]) # append the stored index into L, not the position in l
The indices I am looking for are in the list r. However, I am pretty sure there is another way to do it in Python, since I barely know this language. Maybe it would be easier or faster if my variable L were something other than a list of lists, because I will call this procedure a lot in my main program (len(L) is around 3000).
By the way, I know that the number of matching indices is between one and four inclusive, so I could add a break, but I don't know whether it would be faster.
---------------- EDIT 1 ----------------
Changed "a or b (or is inclusive)" to "a AND b" in the second sentence. I had misstated my goal.
You can do this:
r = [i for i,x in enumerate(L) if any(y in x for y in (a,b))]
enumerate gives you both indices and values in your list comprehension, and the any() call tells you whether either a or b is in x, which is a sublist of L.
Try with
r = []
for index, item in enumerate(L):
    if a in item or b in item:
        r.append(index)
Use any() to test the sublists:
if any(a in subl for subl in L):
This tests each subl but exits the generator expression loop early if a match is found.
This does not, however, return the specific sublist that matched. You could use next() with a generator expression to find the first match:
matched = next((subl for subl in L if a in subl), None)
if matched is not None:
    matched[1] += 1
where None is a default returned if the generator expression raises a StopIteration exception, or you can omit the default and use exception handling instead:
try:
    matched = next(subl for subl in L if a in subl)
    matched[1] += 1
except StopIteration:
    pass # no match found
This kind of thing is what list comprehension is made for.
If you really want inclusive or, then this is the list you want. Your code currently gives and.
result = [a_tuple for a_tuple in L if a in a_tuple or b in a_tuple]
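Since the question's edit asks for sublists containing both a and b, here is a sketch of the AND variant that returns indices. The sample data is made up (the original L is elided), and the set-based version is only a suggestion for the case where the lookup is repeated many times on the same L.

```python
# Hypothetical sample data; the real L has about 3000 sublists.
L = [[1, 2, 12, 13], [2, 3, 13, 14], [12, 2, 7, 8], [3, 4, 14, 15]]
a, b = 2, 12

# Indices of sublists that contain both a and b.
both = [i for i, sub in enumerate(L) if a in sub and b in sub]

# If the search runs often, convert each sublist to a set once and reuse it.
L_sets = [set(sub) for sub in L]
both_via_sets = [i for i, s in enumerate(L_sets) if {a, b} <= s]
```

For sublists of length 4 the difference is likely small, so it may be worth timing both before committing to one.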

Finding words that are inside successive words

def sucontain(A):
    C = A.split()
    def magic(x):
        B = [C[i]==C[i+1] for i in range(len(C)-1)]
        return any(B)
    N = [x for x in C if magic(x)]
    return N
Phrase = "So flee fleeting candy can and bandage"
print (sucontain(Phrase))
The goal of this function is to create a list of the words that are contained in the word that follows them. For example, the function would take the string "So flee fleeting candy can and bandage" as input and return ['flee', 'and'], because 'flee' is inside 'fleeting' (the next word) and 'and' is inside 'bandage'. If no such cases are found, an empty list [] should be returned. My code right now returns [] instead of ['flee', 'and']. Can someone point out what I'm doing wrong? Thank you.
Just pair the consecutive words, then it becomes an easy list comprehension…
>>> s = "So flee fleeting candy can and bandage"
>>> words = s.split()
>>> [i for i, k in zip(words, words[1:]) if i in k]
['flee', 'and']
There is definitely something wrong with your magic function. It accepts x as an argument but doesn't use it anywhere.
Here is an alternate version that doesn't use an additional function:
def sucontain(A):
    C = A.split()
    return [w for i, w in enumerate(C[:-1]) if w in C[i+1]]
The enumerate() function allows us to loop over the indices and the values together, which makes it very straightforward to perform the test. C[i+1] is the next value and w is the current value, so w in C[i+1] checks whether the current value is contained in the next value. We use C[:-1] to make sure that we stop one before the last item; otherwise C[i+1] would result in an IndexError.
Looking ahead can be problematic. Instead of testing whether the current word is in the next one, check to see whether the previous word is in the current one. This almost always makes things simpler.
Also, use descriptive variable names instead of C and A and x and B and N and magic.
def succotash(text): # okay, so that isn't very descriptive
    lastword = " " # space won't ever be in a word
    results = []
    for currentword in text.split():
        if lastword in currentword:
            results.append(lastword) # record the contained (previous) word
        lastword = currentword
    return results
print(succotash("So flee fleeting candy can and bandage"))
