Grep multi-layered iterable for strings that match (Python) - python

Say that we have a multilayered iterable with some strings at the "final" level, yes strings are iterable, but I think that you get my meaning:
['something',
('Diff',
('diff', 'udiff'),
('*.diff', '*.patch'),
('text/x-diff', 'text/x-patch')),
('Delphi',
('delphi', 'pas', 'pascal', 'objectpascal'),
('*.pas',),
('text/x-pascal',['lets', 'put one here'], )),
('JavaScript+Mako',
('js+mako', 'javascript+mako'),
('application/x-javascript+mako',
'text/x-javascript+mako',
'text/javascript+mako')),
...
]
Is there any convenient way that I could implement a search that would give me the indices of the matching strings? I would like something that would act something like this (where the above list is data):
>>> grep('javascript', data)
and it would return [ (2,1,1), (2,2,0), (2,2,1), (2,2,2) ] perhaps. Maybe I'm missing a comparable solution that returns nothing of the sort but can help me find some strings within a multi-layered list of iterables of iterables of .... strings.
I wrote a little bit but it was seeming juvenile and inelegant so I thought I would ask here. I guess that I could just keep nesting the exception the way I started here to the number of levels that the function would then support, but I was hoping to get something neat, abstract, pythonic.
import re
def rgrep(s, data):
''' given a iterable of strings or an iterable of iterables of strings,
returns the index/indices of strings that contain the search string.
Args::
s - the string that you are searching for
data - the iterable of strings or iterable of iterables of strings
'''
results = []
expr = re.compile(s)
for item in data:
try:
match = expr.search(item)
if match != None:
results.append( data.index(item) )
except TypeError:
for t in item:
try:
m = expr.search(t)
if m != None:
results.append( (list.index(item), item.index(t)) )
except TypeError:
''' you can only go 2 deep! '''
pass
return results

I'd split recursive enumeration from grepping:
def enumerate_recursive(iter, base=()):
for index, item in enumerate(iter):
if isinstance(item, basestring):
yield (base + (index,)), item
else:
for pair in enumerate_recursive(item, (base + (index,))):
yield pair
def grep_index(filt, iter):
return (index for index, text in iter if filt in text)
This way you can do both non-recursive and recursive grepping:
l = list(grep_index('opt1', enumerate(sys.argv))) # non-recursive
r = list(grep_index('diff', enumerate_recursive(your_data))) # recursive
Also note that we're using iterators here, saving RAM for longer sequences if necessary.
Even more generic solution would be to give a callable instead of string to grep_index. But that might not be necessary for you.

Here is a grep that uses recursion to search the data structure.
Note that good data structures lead the way to elegant solutions.
Bad data structures make you bend over backwards to accomodate.
This feels to me like one of those cases where a bad data structure is obstructing
rather than helping you.
Having a simple data structure with a more uniform structure
(instead of using this grep) might be worth investigating.
#!/usr/bin/env python
data=['something',
('Diff',
('diff', 'udiff'),
('*.diff', '*.patch'),
('text/x-diff', 'text/x-patch',['find','java deep','down'])),
('Delphi',
('delphi', 'pas', 'pascal', 'objectpascal'),
('*.pas',),
('text/x-pascal',['lets', 'put one here'], )),
('JavaScript+Mako',
('js+mako', 'javascript+mako'),
('application/x-javascript+mako',
'text/x-javascript+mako',
'text/javascript+mako')),
]
def grep(astr,data,prefix=[]):
result=[]
for idx,elt in enumerate(data):
if isinstance(elt,basestring):
if astr in elt:
result.append(tuple(prefix+[idx]))
else:
result.extend(grep(astr,elt,prefix+[idx]))
return result
def pick(data,idx):
if idx:
return pick(data[idx[0]],idx[1:])
else:
return data
idxs=grep('java',data)
print(idxs)
for idx in idxs:
print('data[%s] = %s'%(idx,pick(data,idx)))

To get the position use enumerate()
>>> data = [('foo', 'bar', 'frrr', 'baz'), ('foo/bar', 'baz/foo')]
>>>
>>> for l1, v1 in enumerate(data):
... for l2, v2 in enumerate(v1):
... if 'f' in v2:
... print l1, l2, v2
...
0 0 foo
1 0 foo/bar
1 1 baz/foo
In this example I am using a simple match 'foo' in bar yet you probably use regex for the job.
Obviously, enumerate() can provide support in more than 2 levels as in your edited post.

Related

Python: Index slicing from a list for each index in for loop

I got stuck in slicing from a list of data inside a for loop.
list = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
what I am looking for is to print only key without it values (running/stopped)
Overall code,
for each in list:
print(each[:]) #not really sure what may work here
result expected:
init.svc.logd
anyone for a quick solution?
If you want print only the key, you could use the split function to take whatever is before : and then replace [ and ] with nothing if you don't want them:
list = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
for each in list:
print(each.split(":")[0].replace('[','').replace(']','')) #not really sure what may work here
which gives :
init.svc.logd
init.svc.logd-reinit
You should probably be using a regular expression. The concept of 'key' in the question is ambiguous as there are no data constructs shown that have keys - it's merely a list of strings. So...
import re
list_ = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
for e in list_:
if r := re.findall('\[(.*?)\]', e):
print(r[0])
Output:
init.svc.logd
init.svc.logd-reinit
Note:
This is more robust than string splitting solutions for cases where data are unexpectedly malformed

A better way to rewrite multiple appended replace methods using an input array of strings in python?

I have a really ugly command where I use many appended "replace()" methods to replace/substitute/scrub many different strings from an original string. For example:
newString = originalString.replace(' ', '').replace("\n", '').replace('()', '').replace('(Deployed)', '').replace('(BeingAssembled)', '').replace('ilo_', '').replace('ip_', '').replace('_ilop', '').replace('_ip', '').replace('backupnetwork', '').replace('_ilo', '').replace('prod-', '').replace('ilo-','').replace('(EndofLife)', '').replace('lctcvp0033-dup,', '').replace('newx-', '').replace('-ilo', '').replace('-prod', '').replace('na,', '')
As you can see, it's a very ugly statement and makes it very difficult to know what strings are in the long command. It also makes it hard to reuse.
What I'd like to do is define an input array of of many replacement pairs, where a replacement pair looks like [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]; where the greater array looks something like:
replacementArray = [
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]
]
AND, I'd like to pass that replacementArray, along with the original string that needs to be scrubbed to a function that has a structure something like:
def replaceAllSubStrings(originalString, replacementArray):
newString = ''
for each pair in replacementArray:
perform the substitution
return newString
MY QUESTION IS: What is the right way to write the function's code block to apply each pair in the replacementArray? Should I be using the "replace()" method? The "sub()" method? I'm confused as to how to restructure the original code into a nice clean function.
Thanks, in advance, for any help you can offer.
You have the right idea. Use sequence unpacking to iterate each pair of values:
def replaceAllSubStrings(originalString, replacementArray):
for in_rep, out_rep in replacementArray:
originalString = originalString.replace(in_rep, out_rep)
return originalString
How about using re?
import re
def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat
replaces = {
"a": "b",
"well": "hello"
}
replacer = make_xlat(replaces)
replacer("a well?")
# b hello?
You can add as many items in replaces as you want.

Sort list for date in Python

import re
arr1 = ['2018.07.17 11:30:00,-0.19', '2018.07.17 17:55:00,0.86']
arr2 = ['2018.07.17 11:34:00,-0.39', '2018.07.17 17:59:01,0.85']
def combine_strats_lambda(*strats):
"""
Takes *strats in date,return format
combines infinite amount of strats with date, return and packs them into
one
single sorted array
>> RETURN: combined list
"""
temp = []
# create combined list
for v in enumerate(strats):
i = 0
while i < len(v[1]):
temp.append(v[1][i])
#k = re.findall(r"[\w']+", temp)[:6]
i += 1
temp2 = sorted(timestamps, key=lambda d: tuple(map(int, re.findall(r"[\w']+", d[0]))))
return temp2
Hi,
I've been trying to finish this function, which should combine multiple lists of dates,percentage returns and sort them.
I've come across a solution with lambda but all I get is this message:
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
Do you know an easier solution to the problem or what the error is caused by? I can't seem to figure it out.
Anything appreciated :)
The very basic error in your code is in line:
for v in enumerate(strats):
You have apparently forgotten that enumerate(...) returns two
values: the index and the current value from the iterable.
So, as you used just single v, it gets the index, not the value.
Another important point is that if the datetime strings are written as
yyyy.MM.dd hh:mm:ss, you can sort them using just string sort.
So, to gather the strings, you need a list comprehension, with 2 nested
loops.
And to sort them, you should use sorted function, specifying as the sort
key the "initial" (date / time) part, before the comma.
To sum up, to get the sorted list of strings, taken from a couple of
arguments of your function, sorted on the date / time part,
you can use the following program, written using version 3.6 of Python:
arr1 = ['2018.07.17 11:30:00,-0.19', '2018.07.17 17:55:00,0.86']
arr2 = ['2018.07.17 11:34:00,-0.39', '2018.07.17 17:59:01,0.85']
def combine_strats_lambda(*strats):
temp = [ v2 for v1 in strats for v2 in v1 ]
return sorted(temp, key = lambda v: v.split(',')[0])
res = combine_strats_lambda(arr1, arr2)
for x in res:
parts = x.split(',')
print("{:20s} {:>6s}".format(parts[0], parts[1]))
It does not even use re module.

Python: argument conversion during string format error /w dictionary/list reads

new to these boards and understand there is protocol and any critique is appreciated. I have begun python programming a few days ago and am trying to play catch-up. The basis of the program is to read a file, convert a specific occurrence of a string into a dictionary of positions within the document. Issues abound, I'll take all responses.
Here is my code:
f = open('C:\CodeDoc\Mm9\sampleCpG.txt', 'r')
cpglist = f.read()
def buildcpg(cpg):
return "\t".join(["%d" % (k) for k in cpg.items()])
lookingFor = 'CG'
i = 0
index = 0
cpgdic = {}
try:
while i < len(cpglist):
index = cpglist.index(lookingFor, i)
i = index + 1
for index in range(len(cpglist)):
if index not in cpgdic:
cpgdic[index] = index
print (buildcpg(cpgdic))
except ValueError:
pass
f.close()
The cpgdic is supposed to act as a dictionary of the position reference obtained in the index. Each read of index should be entering cpgdic as a new value, and the print (buildcpg(cpgdic)) is my hunch of where the logic fails. I believe(??) it is passing cpgdic into the buildcpg function, where it should be returned as an output of all the positions of 'CG', however the error "TypeError:not all arguments converted during string formatting" shows up. Your turn!
ps. this destroys my 2GB memory; I need to improve with much more reading
cpg.items is yielding tuples. As such, k is a tuple (length 2) and then you're trying to format that as a single integer.
As a side note, you'll probably be a bit more memory efficient if you leave off the [ and ] in the join line. This will turn your list comprehension to a generator expression which is a bit nicer. If you're on python2.x, you could use cpg.iteritems() instead of cpg.items() as well to save a little memory.
It also makes little sense to store a dictionary where the keys and the values are the same. In this case, a simple list is probably more elegant. I would probably write the code this way:
with open('C:\CodeDoc\Mm9\sampleCpG.txt') as fin:
cpgtxt = fin.read()
indices = [i for i,_ in enumerate(cpgtxt) if cpgtxt[i:i+2] == 'CG']
print '\t'.join(indices)
Here it is in action:
>>> s = "CGFOOCGBARCGBAZ"
>>> indices = [i for i,_ in enumerate(s) if s[i:i+2] == 'CG']
>>> print indices
[0, 5, 10]
Note that
i for i,_ in enumerate(s)
is roughly the same thing as
i for i in range(len(s))
except that I don't like range(len(s)) and the former version will work with any iterable -- Not just sequences.

python trie implementation with substring matching

My problem:
Let say I have strings:
ali, aligator, aliance
because they have common prefix I want to store them in trie, like:
trie['ali'] = None
trie['aligator'] = None
trie['aliance'] = None
So far so good - i can use trie implementation from Biopython library.
But what I want to achive is abilitiy to find all keys in that trie that contain particular substring.
For example:
trie['ga'] would return 'aligator' and
trie['li'] would return ('ali','aligator','aliance').
Any suggestions?
Edit: I think you may be looking for a Suffix tree, particularly noting that "Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.".
Just noticed another SO question that seems very related: Finding longest common substring using Trie
I would do something like this:
class Trie(object):
def __init__(self,strings=None):
#set the inicial strings, if passed
if strings:
self.strings = strings
else:
self.strings =[]
def __getitem__(self, item):
#search for the partial string on the string list
for s in self.strings:
if item in s:
yield s
def __len__(self):
#just for fun
return len(self.strings)
def append(self,*args):
#append args to existing strings
for item in args:
if item not in self.strings:
self.strings.append(item)
Then:
t1 = Trie()
t1.append("ali","aligator","aliance")
print list(t1['ga'])
print list(t1['li'])
>>['aligator']
>>['ali', 'aligator', 'aliance']

Categories

Resources