I'm working on a pattern recognition function in Python that is supposed to return an array of patterns with a counter.
Let's imagine a list of strings:
m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
At a high level, what I would like to get back is:
Pattern | Count
----------------
AB | 6
ABB | 4
BC | 3
----------------
The problem: all I know is that a pattern is at least 2 characters long and is made of the leading characters of a string value (i.e. XXZZZ, XXXZZZ, where XX is the pattern I'm looking for). I would like to pass the minimal pattern length as a function parameter to optimize the run time.
PS: each item in the list is already a single word.
My problem is that I need to iterate over each letter starting from the threshold, and I'm getting stuck there.
I'd prefer to use startswith('AB').
First, let's define your list of strings:
>>> m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
Now, let's get a count of all leading strings of length 2 or 3:
>>> from collections import Counter
>>> c = Counter([s[:2] for s in m] + [s[:3] for s in m if len(s)>=3])
To compare with your table, here are the three most common leading strings:
>>> c.most_common(3)
[('AB', 6), ('ABB', 4), ('BC', 3)]
Update
To include all keys up to length len(max(m, key=len))-1:
>>> n = len(max(m, key=len))
>>> c = Counter(s[:i] for s in m for i in range(2, min(n, 1+len(s))))
Additional Test
To demonstrate that we are working correctly with longer strings, let's consider different input:
>>> m = ['ab', 'abc', 'abcdef']
>>> n = len(max(m, key=len))
>>> c = Counter(s[:i] for s in m for i in range(2, min(n, 1+len(s))))
>>> c.most_common()
[('ab', 3), ('abc', 2), ('abcd', 1), ('abcde', 1)]
Using collections.Counter:
import collections
counter = collections.Counter()
min_length = 2
max_length = len(max(m, key=len))
# count the prefixes of every length from min_length up to max_length - 1
for length in range(min_length, max_length):
    counter.update(word[:length] for word in m if len(word) >= length)
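Since the question asks for the minimal length as a parameter, the loop above could be wrapped in a function. This is only a sketch: the name `count_prefixes` and the optional `max_length` argument are hypothetical, and the default upper bound (one less than the longest word) simply mirrors the range used above.

```python
from collections import Counter

def count_prefixes(words, min_length=2, max_length=None):
    """Count leading substrings of every length in [min_length, max_length]."""
    if max_length is None:
        # assumption: stop one short of the longest word, as in the loop above
        max_length = max(map(len, words), default=0) - 1
    counter = Counter()
    for length in range(min_length, max_length + 1):
        counter.update(w[:length] for w in words if len(w) >= length)
    return counter

m = ['ABA', 'ABB', 'ABC', 'BCA', 'BCB', 'BCC', 'ABBC', 'ABBA', 'ABBC']
print(count_prefixes(m).most_common(3))
# [('AB', 6), ('ABB', 4), ('BC', 3)]
```

Raising `min_length` skips the shorter prefixes entirely, which is where the run-time saving would come from.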
You can use the function accumulate() to generate accumulated strings and the function islice() to get the strings with a minimal length:
from itertools import accumulate, islice
from collections import Counter
m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
c = Counter()
for i in map(accumulate, m):
    c.update(islice(i, 1, None))  # get strings with a minimal length of 2
print(c.most_common(3))
# [('AB', 6), ('ABB', 4), ('BC', 3)]
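For reference, `accumulate` applied to a string yields its successive prefixes, because the default operation is addition and adding strings concatenates them. The sketch below shows this and, as one guess at a parametrized version, passes a hypothetical `min_length` argument through to `islice` (the function name `count_prefixes_acc` is also made up). Note that this variant counts each full word as a prefix of itself.

```python
from itertools import accumulate, islice
from collections import Counter

print(list(accumulate('ABBC')))  # ['A', 'AB', 'ABB', 'ABBC']

def count_prefixes_acc(words, min_length=2):
    c = Counter()
    for word in words:
        # skip the first min_length - 1 prefixes, which are too short
        c.update(islice(accumulate(word), min_length - 1, None))
    return c

m = ['ABA', 'ABB', 'ABC', 'BCA', 'BCB', 'BCC', 'ABBC', 'ABBA', 'ABBC']
print(count_prefixes_acc(m).most_common(3))
# [('AB', 6), ('ABB', 4), ('BC', 3)]
```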
I am trying to create a function that accepts a string of any length and returns a string made of the first letter, then the letter one index after the first, then the letter two indices after the second, etc. Say I have a string:
my_string = "0123456789"
Expected output:
'0136'
or another example
my_string = "0123456789ABCDEFG"
Expected output:
'0136AF'
Things I have tried:
#Try 1
new_string = ""
for i in range(len(string)):
    new_string += string[:i+i]
print(new_string)
#Try 2
new_string = ""
for i in range(len(string)):
    new_string += string[:(i*(i+1))/2]
print(new_string)
You can do it with the following simple while-loop, maintaining index and increment:
string = "0123456789ABCDEFG"
new_string, ind, inc = "", 0, 0
while ind < len(string):
    new_string += string[ind]
    inc += 1
    ind += inc
new_string
# '0136AF'
Or use fancy itertools:
from itertools import accumulate, count, takewhile
string = "0123456789ABCDEFG"
''.join(string[i] for i in takewhile(lambda x: x < len(string), accumulate(count())))
# '0136AF'
You need to generate the triangle numbers (thanks #kevin) and then grab the indices, returning if we get an IndexError (we've reached the end of the string)
def triangles():
    # could be improved with itertools.count
    count = 1
    index = 0
    while True:
        yield index
        index += count
        count += 1
def get_tri_indices(s):
    res = []
    for index in triangles():
        try:
            res.append(s[index])
        except IndexError:  # we're out of range
            return ''.join(res)
output
get_tri_indices('0123456789abcdef') # --> 0136af
You can use accumulate from toolz together with the add operator:
from toolz import accumulate
from operator import add
my_string = "0123456789ABCDEFG"
''.join([my_string[i] for i in accumulate(add,range(len(my_string))) if i <len(my_string)])
Output
'0136AF'
After some experimentation, here's the definitive itertools-abuse answer.
>>> from itertools import accumulate, count, takewhile
>>> from operator import itemgetter
>>>
>>> def accuchars(s):
...     idx = takewhile(lambda x: x < len(s), accumulate(count()))
...     return ''.join(itemgetter(*idx)(s))
...
>>> my_string = "0123456789ABCDEFG"
>>> accuchars(my_string)
'0136AF'
The number sequence you want to generate is the sequence of triangular numbers, where the nth term is the sum of 0..n. To get the list of numbers, you can iterate and add them up, but it is also possible to generate it using a list comprehension. The formula n(n+1)/2 gives the nth triangular number, so
[n*(n+1)//2 for n in range(20)]
>>> [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190]
But before you can use this, you need to know the final number for your string length. You cannot plug in an arbitrarily large number for the range, because Python will then complain:
IndexError: string index out of range
So you need the reverse of the formula; this will give you the value m where Tri(m) ≤ len(string):
[x*(x+1)//2 for x in range(int(((1+8*len(my_string))**0.5)/2+.999))]
>>> [0, 1, 3, 6, 10, 15]
Now you have a reliable method to generate just the indexes you need, so you can grab the characters
[my_string[x*(x+1)//2] for x in range(int(((1+8*len(my_string))**0.5)/2+.999))]
>>> ['0', '1', '3', '6', 'A', 'F']
... and join them together in a single list comprehension:
print (''.join(my_string[x*(x+1)//2] for x in range(int(((1+8*len(my_string))**0.5)/2+.999))))
>>> 0136AF
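One edge case worth checking: when the length of the string is itself a triangular number (10, for example), the floating-point bound above comes out one too high and the last index falls outside the string. Below is a sketch of an integer-only bound. It assumes Python 3.8+ for `math.isqrt`, and the helper name `tri_chars` is hypothetical.

```python
from math import isqrt

def tri_chars(s):
    """Return the characters of s at the triangular indices 0, 1, 3, 6, ..."""
    if not s:
        return ''
    # largest n with n*(n+1)//2 <= len(s) - 1, solved with integer arithmetic
    n_max = (isqrt(8 * (len(s) - 1) + 1) - 1) // 2
    return ''.join(s[n * (n + 1) // 2] for n in range(n_max + 1))

print(tri_chars("0123456789ABCDEFG"))  # 0136AF
print(tri_chars("0123456789"))         # 0136
```

Staying in integer arithmetic avoids both the rounding constant and any floating-point error for very long strings.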
I have two very large lists (that's why I used ...), a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return, in a new list, the number of times (count) that any word in y occurs in each list of x. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
That is, [(1,count),...,(n,count)], where n is the number of the list and count is the number of times that any word from y appears in that list of x. Any idea how to approach this?
First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g:
import re
ppx = []
for subx in x:
    # each subx is a one-element list, so take its string and split it into words
    ppx.append(set(w.lower() for w in re.findall(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
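If the count is wanted per sublist of x instead, as the sample output in the question suggests, the same preprocessed sets can be intersected with y. This is a minimal sketch, assuming each word of y is counted at most once per sublist; the names `wanted` and `per_list` are illustrative only.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

# one set of lowercased words per sublist of x
ppx = [set(w.lower() for w in re.findall(r'\w+', subx[0])) for subx in x]

wanted = set(y)  # hypothetical name: the words to look for
# for each sublist, count how many distinct words of y it contains
per_list = [(n, len(words & wanted)) for n, words in enumerate(ppx, 1)]
print(per_list)  # [(1, 2), (2, 0), (3, 1), (4, 1)]
```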
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x)==len(y), "you have to make sure length of x equals y's"
num = []
for i in range(len(y)):
    # lower all the strings in x for comparison
    # find all matched patterns in x, count them, and store the result in num
    num.append(len(re.findall(y[i], x[i][0].lower())))
res = []
# use enumerate to give output in the format you want
for k, v in enumerate(num):
    res.append((k, v))
# here is what you want
print(res)
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re
s1 = set(y)
index = 0
result = []
for itr in x:
    # remove special chars and convert to lower case
    itr = re.sub('[!.?,]', '', itr[0].lower()).split(' ')
    s2 = set(itr)
    # find intersection of common strings
    intersection = s1 & s2
    num = len(intersection)
    result.append((index, num))
    index = index + 1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
You could also do it like this.
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i, j in enumerate(x):
...     c = 0
...     for w in y:
...         if re.search(r'(?i)\b' + w + r'\b', j[0]):
...             c += 1
...     l.append((i+1, c))
...
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) makes the match case-insensitive. \b is a word boundary, which matches between a word character and a non-word character.
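To illustrate why the word boundaries matter, here is a small sketch. The helper name `contains_word` is hypothetical, and `re.escape` is an addition that is not in the answer above; it guards against search words that contain regex metacharacters.

```python
import re

def contains_word(word, text):
    # (?i) ignores case; \b anchors the match at word boundaries;
    # re.escape treats the word literally even if it contains regex metacharacters
    return re.search(r'(?i)\b' + re.escape(word) + r'\b', text) is not None

print(contains_word('hi', 'I like stackoverflow. Hi ok!'))  # True
print(contains_word('hi', 'this is a great community'))     # False: 'hi' only occurs inside 'this'
```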
Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
...     l.append((i+1, w.count(str(y[i]))))
...
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]
You can make a dictionary whose keys are the items of the list y. Loop through the words in the nested list x and look each one up in the dictionary, incrementing its value every time you encounter it.
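A rough sketch of that idea follows. It assumes words are compared case-insensitively and split on non-word characters, neither of which the answer specifies. Note that it produces one total per word of y across all of x, not one count per sublist.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

counts = {word: 0 for word in y}  # one counter per word of y
for sublist in x:
    for word in re.findall(r'\w+', sublist[0].lower()):
        if word in counts:  # dictionary membership test, constant time on average
            counts[word] += 1
print(counts)  # {'hi': 1, 'nice': 0, 'ok': 2, 'frequency': 1}
```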
But I need the index of the second occurrence.
For example, I have a string "asd#1-2#qwe".
I can simply use the index method to find the index of the first #, which is 3.
But now I want to get the index of the second #, which should be 7.
Use enumerate and a list comprehension:
>>> s = "asd#1-2#qwe"
>>> [i for i, c in enumerate(s) if c=='#']
[3, 7]
Or, if the string contains just two '#' characters, use str.rfind:
>>> s.rfind('#')
7
Using regex: This will work for overlapping sub-strings as well:
>>> s = "asd##1-2####qwe"
>>> import re
#Find index of all '##' in s
>>> [m.start() for m in re.finditer(r'(?=##)', s)]
[3, 8, 9, 10]
Use this:
s = "asd#1-2#qwe"
try:
    # start the second search just after the first '#'
    idx = s.index('#', s.index('#') + 1)  # 7
except ValueError:
    print("not found")
Use the index method to get the first occurrence of #. If the index method allows a starting position, use the position of the first # + 1 for the start. If it doesn't, make a copy of the string starting at position of first # + 1 (possibly a copy and then a substring).
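The same idea generalizes to the n-th occurrence by repeatedly restarting the search one position past the previous hit. This is a minimal sketch with a hypothetical helper name `find_nth`; it uses `str.find`, which returns -1 instead of raising when nothing is found.

```python
def find_nth(s, sub, n):
    """Return the index of the n-th (1-based) occurrence of sub in s, or -1."""
    pos = -1
    for _ in range(n):
        # resume the search one character after the previous match
        pos = s.find(sub, pos + 1)
        if pos == -1:
            return -1
    return pos

s = "asd#1-2#qwe"
print(find_nth(s, '#', 1))  # 3
print(find_nth(s, '#', 2))  # 7
print(find_nth(s, '#', 3))  # -1
```

Because the search resumes only one character further on, overlapping occurrences of a longer substring are counted as well.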
a = "asd#1-2#qwe"
f = '#'
idx = []
for i, v in enumerate(a):
    if v == f:
        idx.append(i)
print(idx)
I have a list of lists of strings, like this:
l = [['apple','banana','kiwi'],['chair','table','spoon']]
Given a string, I want its index in l. Experimenting with numpy, this is what I ended up with:
import numpy as np
l = [['apple','banana','kiwi'],['chair','table','spoon']]
def ind(s):
    i = [i for i in range(len(l)) if np.argwhere(np.array(l[i]) == s)][0]
    j = np.argwhere(np.array(l[i]) == s)[0][0]
    return i, j
s = ['apple','banana','kiwi','chair','table','spoon']
for val in s:
    try:
        print val, ind(val)
    except IndexError:
        print 'oops'
This fails for apple and chair with an IndexError. Also, this just looks bad to me. Is there a better approach to doing this?
Returns a list of tuples containing (outer list index, inner list index), designed such that the item you're looking for can be in multiple inner lists:
l = [['apple','banana','kiwi'],['chair','table','spoon']]
def findItem(theList, item):
    return [(ind, theList[ind].index(item)) for ind in range(len(theList)) if item in theList[ind]]
findItem(l, 'apple') # [(0, 0)]
findItem(l, 'spoon') # [(1, 2)]
If you want to use numpy, you don't need to roll your own:
import numpy as np
l = np.array([['apple','banana','kiwi'],['chair','table','spoon']])
s = ['apple','banana','kiwi','chair','table','spoon']
for a in s:
    arg = np.argwhere(l == a)
    print(a, arg, tuple(arg[0]) if len(arg) else None)
l = [['apple','banana','kiwi'],['chair','table','spoon']]
def search(lst, item):
    for i in range(len(lst)):
        part = lst[i]
        for j in range(len(part)):
            if part[j] == item:
                return (i, j)
    return None
I'd create a dictionary to map the items to their indices:
>>> import numpy as np
>>> l = [['apple','banana','kiwi'],['chair','table','spoon']]
>>> a = np.array(l,dtype=object)
>>> a
array([[apple, banana, kiwi],
[chair, table, spoon]], dtype=object)
>>> d = {s:idx for (idx),s in np.ndenumerate(a)}
>>> d['apple']
(0, 0)
>>> d['chair']
(1, 0)
numpy + ndenumerate is nice for creating the index, but it's definitely not necessary. Of course, this is going to be most efficient if you can create the index once and then reuse it for subsequent searches.
One way is to make use of enumerate:
l = [['apple','banana','kiwi'],['chair','table','spoon']]
s = ['apple','banana','kiwi','chair','table','spoon']
for a in s:
    for i, ll in enumerate(l):
        for j, b in enumerate(ll):
            if a == b:
                print(a, i, j)
In your line that computes i, you already have the answer if you apply argwhere to the entire list, rather than each sublist. There is no need to search again for j.
def ind(s):
    # compare the whole array elementwise, not the list object itself
    match = np.argwhere(np.array(l) == s)
    if len(match):
        i, j = match[0]
        return i, j
    else:
        return -1, -1
This will return the indices of the first occurrence of the string you're searching for.
Also, you might consider how this method is impacted as the complexity of the problem increases. This method will iterate over every element of your list, so the runtime cost increases as the list becomes bigger. So, if the number of test strings you're trying to find in the list also increases, you might want to think about using a dictionary to create a lookup table once, then subsequent searches for test strings are cheaper.
def make_lookup(search_list):
    lookup_table = {}
    for i, sublist in enumerate(search_list):
        for j, word in enumerate(sublist):
            lookup_table[word] = (i, j)
    return lookup_table
lookup_table = make_lookup(l)
def ind(s):
    if s in lookup_table:
        return lookup_table[s]
    else:
        return -1, -1
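One caveat with a plain assignment inside the loop: if a word appears more than once, the table ends up holding its last position. If the first occurrence is wanted instead (to match the argwhere version), `dict.setdefault` is one option. This is a small sketch with the hypothetical name `make_first_lookup`.

```python
def make_first_lookup(search_list):
    lookup_table = {}
    for i, sublist in enumerate(search_list):
        for j, word in enumerate(sublist):
            # setdefault only stores the position if the word is not yet a key
            lookup_table.setdefault(word, (i, j))
    return lookup_table

l = [['apple', 'banana', 'kiwi', 'apple'], ['chair', 'table', 'spoon']]
table = make_first_lookup(l)
print(table['apple'])               # (0, 0)
print(table.get('pear', (-1, -1)))  # (-1, -1)
```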
To get the index of an item in a list of lists in Python:
theList = [[1,2,3], [4,5,6], [7,8,9]]
for i in range(len(theList)):
    if 5 in theList[i]:
        print("[{0}][{1}]".format(i, theList[i].index(5)))  # [1][1]
This solution will find all occurrences of the string you're searching for:
l = [['apple','banana','kiwi','apple'],['chair','table','spoon']]
def findItem(theList, item):
    return [(i, j) for i, line in enumerate(theList)
            for j, char in enumerate(line) if char == item]
findItem(l, 'apple') # [(0, 0), (0, 3)]
findItem(l, 'spoon') # [(1, 2)]