Python string pattern recognition/compression


I can do basic regex alright, but this is slightly different, namely I don't know what the pattern is going to be.
For example, I have a list of similar strings:
lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
In this case the common pattern is two segments of common text: 'sometxt' and 'moretxt', starting and separated by something else that is variable in length.
The common strings and variable strings can of course occur in any order and any number of times.
What would be a good way to condense/compress the list of strings into their common parts and individual variations?
An example output might be:
c = ['sometxt', 'moretxt']
v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]

This solution finds the two longest common substrings and uses them to delimit the input strings:
def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))
find_delimiters and its helpers find the delimiters:
import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    # Take the three longest common substrings; the first two are the delimiters.
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    # Longest candidates first (use xrange instead of range on Python 2).
    for i in range(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring

def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]
split_strings splits the strings using the delimiters:
import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    # Escape the delimiters so regex metacharacters in them are matched literally.
    pattern = "|".join(re.escape(delimiter) for delimiter in delimiters)
    for item in lst:
        parts = re.split(pattern, item)
        yield tuple(part for part in parts if part != '')

Here is a scary one to get the ball rolling.
>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')
I hope its sheer ugliness will inspire better solutions :)

This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such the simplest, especially since it apparently is non-heuristic.
The first link contains detailed discussion and (pseudo) code examples. Assuming, of course, I'm not completely off the track here.
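To get a feel for the diff-style approach without implementing Hunt-McIlroy, here is a minimal sketch using the standard library's difflib.SequenceMatcher. Note the assumptions: difflib uses its own matching algorithm (not Hunt-McIlroy), it compares only two strings at a time, and common_blocks is a hypothetical helper name, not something from the answer above.

```python
from difflib import SequenceMatcher

def common_blocks(a, b):
    """Return the non-empty runs that a and b share, in order of appearance."""
    # autojunk=False turns off the popularity heuristic so every character counts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]

print(common_blocks('asometxt0moretxt', 'bsometxt1moretxt'))
# ['sometxt', 'moretxt']
```

For more than two strings you would have to fold the result over the list (for example, intersecting the blocks pair by pair), which this sketch does not attempt.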

This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there, which you may be able to adapt to your needs.
I assume you have no a priori knowledge of the substrings that repeat often.
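For reference, a minimal sketch of textbook LZW compression looks roughly like this. It assumes every input character has a code point below 256, and lzw_compress is just an illustrative name. It shows how the dictionary of repeated substrings is built up on the fly; adapting it to report the common parts of a list of strings is left open.

```python
def lzw_compress(text):
    """Encode text as a list of integer codes using textbook LZW."""
    # Start with one entry per single character (assumes code points < 256).
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            # Keep extending the phrase while it is already known.
            current = candidate
        else:
            # Emit the known phrase and remember the new, longer one.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes

print(lzw_compress("ABABABA"))
# [65, 66, 256, 258]
```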

I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.
I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
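As a small illustration of the counting step, here is the naive version (not a suffix tree) for one fixed substring length. It counts in how many strings each substring occurs, which is fine for short lists but is exactly the expensive approach the answer warns about for large data. The function name is made up for this sketch.

```python
from collections import Counter

def substring_document_frequency(strings, length):
    """Count, for each substring of the given length, how many strings contain it."""
    counts = Counter()
    for s in strings:
        # A set so that a substring is counted once per string.
        seen = {s[i:i + length] for i in range(len(s) - length + 1)}
        counts.update(seen)
    return counts

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
counts = substring_document_frequency(lst, 7)
in_all = sorted(sub for sub, n in counts.items() if n == len(lst))
print(in_all)
# ['moretxt', 'sometxt']
```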

How about subbing out the known text, and then splitting?
import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]

Related

Store delimiter and delimiter positions before replacing them in python

I'm working on a text pattern problem. I've the following input -
term = 'CG-14/0,2-L-0_2'
I need to remove all the possible punctuation (delimiters) from the input term. Basically I need the following output from the input term -
'CG1402L02'
I also need to store (in any format (object, dict, tuple etc.)) the delimiter and the position of the delimiter before removing the delimiters.
Example of the output (If returned as tuple) -
((-,2), (/,5), (,,7), (-,9), (-,11), (_,13))
I'm able to get the output using the following python code -
re.sub(r'[^\w]', '', term.replace('_', ''))
But how do I store the delimiter and delimiter position (in the most efficient way) before removing the delimiters?

You can simply walk through term once and collect all the necessary information on the way:
from string import ascii_letters,digits
term = 'CG-14/0,2-L-0_2'
# defined set of allowed characters a-zA-Z0-9
# set lookup is O(1) - fast
ok = set(digits +ascii_letters)
specials = {}
clean = []
for i,c in enumerate(term):
    if c in ok:
        clean.append(c)
    else:
        specials.setdefault(c,[])
        specials[c].append(i)
cleaned = ''.join(clean)
print(clean)
print(cleaned)
print(specials)
Output:
['C', 'G', '1', '4', '0', '2', 'L', '0', '2'] # list of characters in set ok
CG1402L02 # the ''.join()ed list
{'-': [2, 9, 11], '/': [5], ',': [7], '_': [13]} # dict of characters/positions not in ok
See:
string.ascii_letters
string.digits
You can use
specials = []
and inside the iteration:
    else:
        specials.append((c,i))
to get a list of tuples instead of the dictionary:
[('-', 2), ('/', 5), (',', 7), ('-', 9), ('-', 11), ('_', 13)]
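If you prefer to stay with regular expressions, a possible sketch is to collect the delimiters with re.finditer before removing them. It assumes, like the answer above, that anything other than ASCII letters and digits counts as a delimiter. strip_delimiters and restore are hypothetical helper names, and restore is only there to show that the stored positions are enough to rebuild the original term.

```python
import re

DELIMITER = re.compile(r'[^A-Za-z0-9]')

def strip_delimiters(term):
    """Return the cleaned term and a list of (delimiter, original_position) tuples."""
    found = [(m.group(), m.start()) for m in DELIMITER.finditer(term)]
    return DELIMITER.sub('', term), found

def restore(cleaned, found):
    """Re-insert the delimiters at their original positions."""
    chars = list(cleaned)
    # found is in ascending position order, so each insert lands at its original index.
    for delimiter, position in found:
        chars.insert(position, delimiter)
    return ''.join(chars)

cleaned, found = strip_delimiters('CG-14/0,2-L-0_2')
print(cleaned)  # CG1402L02
print(found)    # [('-', 2), ('/', 5), (',', 7), ('-', 9), ('-', 11), ('_', 13)]
```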

You could do something like this, adding whatever other delimiters you need to the list delims
term = 'CG-14/0,2-L-0_2'
delims = ['-','/',',','_']
locations = []
pos = 0
for c in term: ##iterate through the characters in the string
    if c in delims:
        locations.append([c,pos]) ##store the character and its original position
    pos+=1
And then run your re.sub command to remove them.

Convert 5A2B4C11G string to [(5,"A"),(2,"B"),(4,"C"),(11,"G")] in Python

The title pretty much says it all. I have a small run-length decoding script:
def RLdecode(characterList):
    decodedString = ""
    # Each item is a (count, character) pair, e.g. (5, "A")
    for count, character in characterList:
        decodedString += character.upper() * count
    return decodedString
That script requires a list (or whatever this is) that looks like:
[(5,"A"),(2,"B"),(4,"C"),(11,"G")]
But in order to make it more user-friendly, I want the user to be able to input a string like this:
"5A2B4C11G"
How would I convert a string like the one above into a list readable by my script? Also, sorry that the title of the question is very specific, but I don't know what the process is called :\

using itertools.groupby:
There's a nice way to do the letter/digit grouping using itertools.groupby:
import itertools
a="5A2B4C11G"
result = [("".join(v)) for k,v in itertools.groupby(a,str.isdigit)]
that returns ['5', 'A', '2', 'B', '4', 'C', '11', 'G']
Unfortunately, it flattens the number/letter tuples, so more work is required. Note that applying Kaushik's solution to that list gives the expected result, now that the number/letter grouping is done properly:
[(int(result[i]),result[i+1]) for i in range(0,len(result),2)]
result:
[(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]
using regexes:
Anyway, in that case, regular expressions are well suited to extract the patterns with the required hierarchy.
Just match the string using 1 or more digits + a letter, and convert the obtained tuples to match the (integer, string) format, using a list comprehension to do so, in one line.
import re
a="5A2B4C11G"
result = [(int(i),v) for i,v in re.findall(r'(\d+)([A-Z])',a)]
print(result)
gives:
[(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]
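Putting the regex parsing together with the decoder from the question could look like the sketch below. parse_rle and rl_decode are illustrative names, and the re.fullmatch check is an extra assumption (reject input that is not entirely count/letter pairs) rather than something the question asked for.

```python
import re

def parse_rle(encoded):
    """Turn '5A2B4C11G' into [(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]."""
    # Reject anything that is not a sequence of <digits><letter> pairs.
    if not re.fullmatch(r'(?:\d+[A-Za-z])+', encoded):
        raise ValueError("expected pairs like 5A2B, got %r" % encoded)
    return [(int(count), letter) for count, letter in re.findall(r'(\d+)([A-Za-z])', encoded)]

def rl_decode(pairs):
    """Expand (count, character) pairs into the decoded string."""
    return ''.join(letter.upper() * count for count, letter in pairs)

pairs = parse_rle("5A2B4C11G")
print(pairs)             # [(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]
print(rl_decode(pairs))  # AAAAABBCCCCGGGGGGGGGGG
```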

Using list comprehension :
#s is the string
[(int(s[i]),s[i+1]) for i in range(0,len(s),2)]
#driver values
IN : s="5A2B4C"
OUT : [(5, 'A'), (2, 'B'), (4, 'C')]
Here range(0,len(s),2) gives values as : [0, 2, 4] which we use to go through the string.
NOTE : this of course only works with strings of even length and with numbers below 10.
EDIT : As for numbers with double digits, the answer by Jean-François Fabre works well.

You can do this with regex if you want:
In one line
sorted_list=[i for i in re.findall(pattern, a, re.M)]
Same approach :
import re
a="5A2B4C"
pattern=r'(\d)(\w)'
pairs=[]  # named pairs rather than list, to avoid shadowing the built-in
art=re.findall(pattern,a,re.M)
for i in art:
    pairs.append(i)
print(pairs)
For your new edited problem here is my new solution :
import re
a = "5A2B4C11G"
pattern = r'([0-9]+)([a-zA-Z])'
pairs = []  # named pairs rather than list, to avoid shadowing the built-in
art = re.findall(pattern, a, re.M)
for i in art:
    pairs.append(i)
print(pairs)
Output:
[('5', 'A'), ('2', 'B'), ('4', 'C'), ('11', 'G')]

You have already got the answer from Jean-François Fabre.
The process is called run-length decoding.
The whole process can be done in one line with the following code.
from re import sub
text = "5A2B4C11G"
sub(r'(\d+)(\D)', lambda m: m.group(2) * int(m.group(1)),text)
OUTPUT : 'AAAAABBCCCCGGGGGGGGGGG'
NOTE: This is not the answer, just an optimization idea for the OP, as the answer is already present in Jean-François Fabre's post.
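For completeness, the opposite direction (run-length encoding) can be sketched with itertools.groupby. This is an addition for illustration only, the question did not ask for it, and rl_encode is a made-up name.

```python
from itertools import groupby

def rl_encode(text):
    """Collapse each run of identical characters into '<count><char>'."""
    return ''.join(str(len(list(run))) + char for char, run in groupby(text))

print(rl_encode('AAAAABBCCCCGGGGGGGGGGG'))
# 5A2B4C11G
```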

import re
text = "5A2B4C11G"  # named text rather than str, to avoid shadowing the built-in
pattern = r"(\d+)(\D)" # group1: digit(s), group2: non-digit
substitution = r"\1,\2 " # "digits,nondigit "
temp = re.sub(pattern, substitution, text) # gives "5,A 2,B 4,C 11,G "
temp = temp.split() # gives ['5,A', '2,B', '4,C', '11,G']
result = [el.split(",") for el in temp] # gives [['5', 'A'], ['2', 'B'],
# ['4', 'C'], ['11', 'G']] - see note
First we replace each sequence of digits followed by a symbol with something we can apply a 2-level split() to, choosing 2 different delimiters in the replacement string r"\1,\2 ":
a space for the 1st level (outer) split(), and
a comma (,) for the 2nd level (inner) one.
Then we apply those 2 splits.
Note: If you have a significant reason to obtain tuples (instead of good enough inner lists), simply apply the tuple() function in the last statement:
result = [tuple(el.split(",")) for el in temp]

Python, splitting strings on middle characters with overlapping matches using regex

In Python, I am using regular expressions to retrieve strings from a dictionary which show a specific pattern, such as some repetitions of characters, then a specific character, and then another repetitive part (e.g. ^(\w{0,2})o(\w{0,2})$).
This works as expected, but now I'd like to split the string into two substrings (one of which might be empty) using the central character as the delimiter. The issue I am having stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).
Neither re.search().groups() nor re.findall() solved this issue, and the docs on the re module seem to point out that overlapping matches are not returned by these methods.
Here is a snippet showing the undesired behaviour:
import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile(r'^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
    matches = regex.findall(word)
    if matches:
        halves.append(matches)

I am posting this as an answer mainly so as not to leave the question unanswered in case someone stumbles here in the future. Since I've managed to reach the desired behaviour, albeit probably not in a very pythonic way, this might be useful as a starting point for someone else. Notes on how to improve this answer (i.e. making it more "pythonic" or simply more efficient) would be very welcome.
The only way I found of getting all the possible splits of words whose length is in a certain range and that have a given character in a certain range of positions, using the character in the "legal" positions as the delimiter, with either the re module or the new regex module, involves using multiple regexes. This snippet builds an appropriate regex at runtime, given the length range of the word, the character to look for, and the range of possible positions of that character.
import re

dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
char = 'o'
word_len = (3, 6)
char_pos = (2, 3)
# Two lookaheads: the word length is in range, and char sits at an allowed position.
regex_str = (r'(?=^\w{' + str(word_len[0]) + ',' + str(word_len[1]) + r'}$)(?=\w{'
             + str(char_pos[0]-1) + ',' + str(char_pos[1]-1) + '}' + char + ')')
halves = []
for word in dictionary:
    matches = re.match(regex_str, word)
    if matches:
        matched_halves = []
        # range works on Python 2 and 3 (the original used xrange)
        for pos in range(char_pos[0]-1, char_pos[1]):
            split_regex_str = r'(?<=^\w{' + str(pos) + '})' + char
            split_word = re.split(split_regex_str, word)
            if len(split_word) == 2:
                matched_halves.append(split_word)
        halves.append(matched_halves)
The output is:
[[['r', 'om'], ['ro', 'm']], [['r', 'am']], [['fl', 'w']], [['d', 'or'], ['do', 'r']], [['f', 'r']], [['f', 'o'], ['fo', '']], [['s', 'rrow']]]
At this point I might start considering using a regex just to find the words to be split and then doing the splitting in a 'dumb' way, just checking whether the characters at the positions in range are equal to char. Anyhow, any remarks are greatly appreciated.
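The 'dumb' way mentioned above could be sketched as follows, using plain indexing and slicing instead of regexes. It assumes the words contain only word characters (which the regex version checked with \w), and split_on_char is a hypothetical helper name.

```python
def split_on_char(word, char, word_len=(3, 6), char_pos=(2, 3)):
    """Return every [left, right] split of word on char at an allowed 1-based position."""
    if not word_len[0] <= len(word) <= word_len[1]:
        return []
    splits = []
    # char_pos is 1-based and inclusive, so convert to 0-based indexes.
    for pos in range(char_pos[0] - 1, char_pos[1]):
        if pos < len(word) and word[pos] == char:
            splits.append([word[:pos], word[pos + 1:]])
    return splits

dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
halves = []
for word in dictionary:
    splits = split_on_char(word, 'o')
    if splits:
        halves.append(splits)
print(halves)
```

For the sample dictionary this produces the same list as the regex version.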

EDIT: Fixed.
Does a simple while loop work?
What you want is re.search, and then to loop, shifting the start position by 1 each time:
https://docs.python.org/2/library/re.html
>>> import re
>>> dictionary = ('room', 'door', 'window', 'desk', 'for')
>>> regex = re.compile(r'(\w{0,2})o(\w{0,2})')
>>> halves = []
>>> for word in dictionary:
...     start = 0
...     while start < len(word):
...         match = regex.search(word, start)
...         if match:
...             start = match.start() + 1
...             halves.append([match.group(1), match.group(2)])
...         else:
...             # no matches left
...             break
...
>>> print(halves)
[['ro', 'm'], ['o', 'm'], ['', 'm'], ['do', 'r'], ['o', 'r'], ['', 'r'], ['nd', 'w'], ['d', 'w'], ['', 'w'], ['f', 'r'], ['', 'r']]
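A related trick, offered here only as a sketch, is to wrap the whole pattern in a lookahead so that re.finditer tries every start position and reports overlapping matches without a manual loop. The capture groups inside the lookahead are still filled in. For the sample words it gives the same result as the while loop.

```python
import re

# The lookahead consumes nothing, so finditer advances one character at a time.
overlapping = re.compile(r'(?=(\w{0,2})o(\w{0,2}))')

dictionary = ('room', 'door', 'window', 'desk', 'for')
halves = []
for word in dictionary:
    for match in overlapping.finditer(word):
        halves.append([match.group(1), match.group(2)])
print(halves)
```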

The last list elements and conditional loops

mylist="'a','b','c'"
count=0
i=0
while count< len(mylist):
    if mylist[i]==mylist[i+1]:
        print mylist[i]
    count +=1
    i +=1
Error:
File "<string>", line 6, in <module>
IndexError: string index out of range
I'm assuming that when it gets to the last (nth) element it can't find an n+1 to compare it to, so it gives me an error.
Interestingly, I think that I've done this before and not had this problem on a larger list. Here is an example (with credit to Raymond Hettinger for fixing it up):
list=['a','a','x','c','e','e','f','f','f']
i=0
count = 0
while count < len(list)-2:
    if list[i] == list[i+1]:
        if list [i+1] != list [i+2]:
            print list[i]
            i+=1
            count +=1
        else:
            print "no"
            count += 1
    else:
        i +=1
        count += 1
For crawling through a list in the way I've attempted, is there any fix so that I don't go "out of range"? I plan to implement this on a very large list, where I'll have to check if "list[i]==list[i+16]", for example. In the future, I would like to add conditions like "if int(mylist[i+3])-int(mylist[i+7])>10: newerlist.append(mylist[i])". So it's important that I solve this problem.
I thought about inserting a break statement, but was unsuccessful.
I know this is not the most efficient, but I'm at the point where it's what I understand best.

So it sounds like you are trying to compare elements in your list at various fixed offsets. Perhaps something like this could help you:
for old, new in zip(lst, lst[n:]):
    if some_cond(old, new):
        do_work()
Explanation:
lst[n:] returns a copy of lst, starting from the nth (mind the 0-indexing) element
>>> lst = [1, 2, 2, 3]
>>> lst[1:]
[2, 2, 3]
zip(l1, l2) creates a new list of tuples, with one element from each list
>>> zip(lst, lst[1:])
[(1, 2), (2, 2), (2, 3)]
Note that it stops as soon as either list runs out; in this case, the offset list runs out first.
For a list of tuples, you can unpack directly into the loop variables, so
for old, new in zip(lst, lst[1:]):
loops through the elements you want (pairs of successive elements in your list).
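As a runnable sketch of the same idea on the sample data from the question (the variable names here are only for illustration):

```python
data = ['a', 'a', 'x', 'c', 'e', 'e', 'f', 'f', 'f']

# Pair every element with its successor; zip stops when the shorter slice ends,
# so there is never an out-of-range index.
repeats = [old for old, new in zip(data, data[1:]) if old == new]
print(repeats)
# ['a', 'e', 'f', 'f']

# The same pattern works for any offset n, e.g. comparing items three places apart.
n = 3
pairs_at_offset = list(zip(data, data[n:]))
```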

As a general idea, if you are trying to look ahead a certain number of places, you can do a few things:
In the loop check (i.e. count < length), account for the maximum offset. In your example, you wanted to look 16 places ahead, so you would need to check count < (length - 16). The downside is that the last 16 elements won't be iterated over.
Check inside the loop to make sure the index is in range. That is, start each if statement with: if i + 16 < length and logic_you_want_to_check. This lets you continue through the loop, and when the index would be out of bounds the condition simply fails instead of raising an error.
Note: this probably isn't what you want, but I'll add it for completeness. Wrap around. This only works if wrapping makes sense for your data. If you literally want to check the 16th index ahead of your current index (like a place in a line, perhaps), then wrapping around doesn't suit well. But if you don't need that, and want to model your values in a circular pattern, you can take the index modulo the length. That is: if array[i] == array[(i + 16) % len(array)] would check either 16 ahead or wrap around to the front of the array.
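The bounds check and the wrap-around variant can be sketched in one small function. matches_ahead and its wrap flag are names invented for this example.

```python
def matches_ahead(items, offset, wrap=False):
    """Return the indexes i where items[i] equals the item offset places ahead."""
    length = len(items)
    hits = []
    for i in range(length):
        j = i + offset
        if wrap:
            # Circular view: indexes past the end continue from the front.
            j %= length
        elif j >= length:
            # No element that far ahead, so stop instead of raising IndexError.
            break
        if items[i] == items[j]:
            hits.append(i)
    return hits

print(matches_ahead(['a', 'a', 'b'], 2))             # []
print(matches_ahead(['a', 'a', 'b'], 2, wrap=True))  # [1]
```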

Edit:
Right, with the new information in the OP, this becomes much simpler. Use the itertools grouper() recipe to group the data for each person into tuples:
import itertools
def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks"""
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

data = ['John', 'Sally', '5', '10', '11', '4', 'John', 'Sally', '3', '7', '7', '10', 'Bill', 'Hallie', '4', '6', '2', '1']
list(grouper(data, 6))  # grouper returns an iterator, so wrap it in list() to see the groups
Now your data looks like:
[
('John', 'Sally', '5', '10', '11', '4'),
('John', 'Sally', '3', '7', '7', '10'),
('Bill', 'Hallie', '4', '6', '2', '1')
]
Which should be easy to work with, by comparison.
Old Answer:
If you need to make more arbitrary links, rather than just checking continuous values:
def offset_iter(iterable, n):
    offset = iter(iterable)
    consume(offset, n)
    return offset

data = ['a', 'a', 'x', 'c', 'e', 'e', 'f', 'f', 'f']
offset_3 = offset_iter(data, 3)
for item, plus_3 in zip(data, offset_3): #Naturally, itertools.izip() in 2.x
    print(item, plus_3)                  #if memory usage is important.
Naturally, you would want to use semantically valid names. The advantage to this method is it works with arbitrary iterables, not just lists, and is efficient and readable, without any ugly, inefficient iteration by index. If you need to continue checking once the offset values have run out (for other conditions, say) then use itertools.zip_longest() (itertools.izip_longest() in 2.x).
Using the consume() recipe from itertools.
import itertools
import collections
def consume(iterator, n):
    """Advance the iterator n-steps ahead. If n is none, consume entirely."""
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)
I would, however, greatly question if you need to re-examine your data structure in this case.
Original Answer:
I'm not sure what your aim is, but from what I gather you probably want itertools.groupby():
>>> import itertools
>>> data = ['a', 'a', 'x', 'c', 'e', 'e', 'f', 'f', 'f']
>>> grouped = itertools.groupby(data)
>>> [(key, len(list(items))) for key, items in grouped]
[('a', 2), ('x', 1), ('c', 1), ('e', 2), ('f', 3)]
You can use this to work out when there are (arbitrarily large) runs of repeated items. It's worth noting you can provide itertools.groupby() with a key argument that will group them based on any factor you want, not just equality.
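As a small illustration of the key argument, here is a sketch that splits a flat list into alternating runs of names and numbers using str.isdigit as the key. The sample data is shortened from the question's list.

```python
from itertools import groupby

data = ['John', 'Sally', '5', '10', '11', '4', 'Bill', 'Hallie', '4', '6']

# Consecutive items with the same key value (True for digit strings) form one run.
runs = [(is_number, list(items)) for is_number, items in groupby(data, key=str.isdigit)]
print(runs)
# [(False, ['John', 'Sally']), (True, ['5', '10', '11', '4']),
#  (False, ['Bill', 'Hallie']), (True, ['4', '6'])]
```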

If you adhere to "Practicality beats purity":
for idx, element in enumerate(yourlist[n:]):
    if yourlist[idx] == element:  # element is yourlist[idx + n]
        ...
If you don't care about memory efficiency go for second's answer. If you want the purest answer then go for Lattyware's one.

Splitting a string into a list (but not separating adjacent numbers) in Python

For example, I have:
string = "123ab4 5"
I want to be able to get the following list:
["123","ab","4","5"]
rather than list(string) giving me:
["1","2","3","a","b","4"," ","5"]

Find one or more adjacent digits (\d+), or if that fails find non-digit, non-space characters ([^\d\s]+).
>>> string = '123ab4 5'
>>> import re
>>> re.findall(r'\d+|[^\d\s]+', string)
['123', 'ab', '4', '5']
If you don't want the letters joined together, try this:
>>> re.findall(r'\d+|\S', string)
['123', 'a', 'b', '4', '5']
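If you also need to know which kind of token each piece is, one possible extension (a sketch, with made-up group names number and text) is to use named groups and read match.lastgroup:

```python
import re

# Same alternation as above, but each branch is a named group.
TOKEN = re.compile(r'(?P<number>\d+)|(?P<text>[^\d\s]+)')

def tokenize(s):
    """Return (kind, value) pairs, where kind is 'number' or 'text'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(s)]

print(tokenize("123ab4 5"))
# [('number', '123'), ('text', 'ab'), ('number', '4'), ('number', '5')]
```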

The other solutions are definitely easier. If you want something far less straightforward, you could try something like this:
>>> import string
>>> from itertools import groupby
>>> s = "123ab4 5"
>>> result = [''.join(list(v)) for _, v in groupby(s, key=lambda x: x.isdigit())]
>>> result = [x for x in result if x not in string.whitespace]
>>> result
['123', 'ab', '4', '5']

You could do:
>>> import re
>>> [el for el in re.split(r'(\d+)', string) if el.strip()]
['123', 'ab', '4', '5']

This will give the split you want:
re.findall(r'\d+|[a-zA-Z]+', "123ab4 5")
['123', 'ab', '4', '5']

You can do a few things here. You can:
1. Iterate over the string and build groups of digits as you go, appending them to your results list.
Not a great solution.
2. Use regular expressions.
Implementation of 2:
>>> import re
>>> s = "123ab4 5"
>>> re.findall(r'\d+|[^\d]', s)
['123', 'a', 'b', '4', ' ', '5']
You want to grab any group which is at least one digit (\d+), or any other single character.
Edit:
John beat me to the correct solution, and it's a wonderful solution.
I will leave this here though, because someone else might misunderstand the question and look for an answer to what I thought was written. I was under the impression the OP wanted to capture only groups of digits, and leave everything else as individual characters.
