Python Look-And-Say Regex

I'm trying to write a regex for the Look-and-Say sequence in Python. The idea is to split a given string into runs of identical digits. By trial and error, I arrived at '((\d)\\2*)'.
For the input 11244455221116 this gives [('11', '1'), ('2', '2'), ('444', '4'), ('55', '5'), ('22', '2'), ('111', '1'), ('6', '6')] as expected. This works, but looks clumsy. Is there a cleaner way to do this, with or without regex?

You could use itertools.groupby:
import itertools as IT
text = '11244455221116'
print([(''.join(group), key) for key, group in IT.groupby(text)])
yields
[('11', '1'), ('2', '2'), ('444', '4'), ('55', '5'), ('22', '2'), ('111', '1'), ('6', '6')]
But re.findall is faster:
In [67]: %timeit [(''.join(group), key)for key, group in IT.groupby(text*100)]
1000 loops, best of 3: 528 us per loop
In [68]: %timeit re.findall(r'((\d)\2*)', text*100)
1000 loops, best of 3: 219 us per loop

Instead of splitting your string, you can do a replacement with a lambda function:
re.sub(r'(\d)\1*', lambda x: str(len(x.group(0)))+x.group(1), '112224355')
result: 2132141325
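As a minimal sketch of how that substitution can drive the whole sequence, the snippet below wraps it in a function and applies it repeatedly. The names look_and_say_step and look_and_say are made up for illustration; only the re.sub call comes from the answer above.

```python
import re

def look_and_say_step(term):
    """Return the next Look-and-Say term for a string of digits."""
    # Each match is one run of a repeated digit:
    # group(0) is the whole run, group(1) is the digit itself.
    return re.sub(r'(\d)\1*', lambda m: str(len(m.group(0))) + m.group(1), term)

def look_and_say(seed, count):
    """Return the first `count` terms of the sequence, starting from `seed`."""
    terms = [seed]
    for _ in range(count - 1):
        terms.append(look_and_say_step(terms[-1]))
    return terms

print(look_and_say('1', 6))
# ['1', '11', '21', '1211', '111221', '312211']
```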

Related

How to sort a list of tuples but with one specific tuple being the first?

I'm building an application to find the best path for a delivery.
The delivery person sends me their path:
[
('0', '1'),
('1', '2'),
('0', '2'),
('2', '0')
]
... where every pair of numbers is a location and smaller numbers are closer. They also send me their starting point. For example: 2.
I wrote a function to sort from lowest to highest:
def lowToHigh(trajet):
    trajet_opti = trajet
    print(sorted(trajet_opti))

lowToHigh([
    ('0', '1'),
    ('1', '2'),
    ('0', '2'),
    ('2', '0')
])
The output is like this:
[('0', '1'), ('0', '2'), ('1', '2'), ('2', '0')]
I need a function that puts the tuple with the starting number first:
def starting_tuple():
    starting_number = 2
    .
    .
    .
Which returns something like this:
[('2', '0'), ('0', '1'), ('0', '2'), ('1', '2')]
Sort with a key that adds another tuple element representing whether the list item equals the starting point.
>>> path = [
... ('0', '1'),
... ('1', '2'),
... ('0', '2'),
... ('2', '0')
... ]
>>> sorted(path, key=lambda c: (c[0] != '2', c))
[('2', '0'), ('0', '1'), ('0', '2'), ('1', '2')]
The expression c[0] != '2' will be False (0) for the starting point and True (1) for all others, which will force the starting point to come at the start of the list. If there are multiple starting points, they will be sorted normally relative to each other.
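The same key can be wrapped in a function so the starting point is a parameter rather than a hard-coded '2'. This is only a sketch; the name sort_with_start_first is hypothetical, and it assumes the starting point is passed as a string, matching the tuples.

```python
def sort_with_start_first(path, start):
    """Sort `path`, placing tuples whose first element equals `start` first."""
    # (False, c) sorts before (True, c), so matching tuples lead the result;
    # within each group the tuples keep their normal ordering.
    return sorted(path, key=lambda c: (c[0] != start, c))

path = [('0', '1'), ('1', '2'), ('0', '2'), ('2', '0')]
print(sort_with_start_first(path, '2'))
# [('2', '0'), ('0', '1'), ('0', '2'), ('1', '2')]
```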

How to create value pairs with lambda in pyspark?

I am trying to convert a pyspark rdd like this:
before:
[
[('169', '5'), ('2471', '6'), ('48516', '10')],
[('58', '7'), ('163', '7')],
[('172', '5'), ('186', '4'), ('236', '6')]
]
after:
[
[('169', '5'), ('2471', '6')],
[('169', '5'),('48516', '10')],
[('2471', '6'), ('48516', '10')],
[('58', '7'), ('163', '7')],
[('172', '5'), ('186', '4')],
[('172', '5'), ('236', '6')],
[('186', '4'), ('236', '6')]
]
The idea is to go through each line and create new lines pairwise. I tried to work out a solution myself from lambda tutorials, but without success. May I ask for some help? If this duplicates other questions, I apologize. Thanks!
I'd use flatMap with itertools.combinations:
from itertools import combinations
rdd.flatMap(lambda xs: combinations(xs, 2))
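To see what that flatMap does without a Spark cluster, here is a plain-Python sketch of the same idea: generate the pairs for each row and flatten them into a single list. The variable names rows and pairs are illustrative. Note that combinations yields tuples; if lists are needed in Spark, appending .map(list) to the flatMap call is one option.

```python
from itertools import chain, combinations

rows = [
    [('169', '5'), ('2471', '6'), ('48516', '10')],
    [('58', '7'), ('163', '7')],
    [('172', '5'), ('186', '4'), ('236', '6')],
]

# combinations(row, 2) yields every unordered pair from one row;
# chain.from_iterable flattens the per-row results, like flatMap does.
pairs = [list(pair) for pair in chain.from_iterable(combinations(row, 2) for row in rows)]

for pair in pairs:
    print(pair)
```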

Round off some values of a tuple

I have tuples like this (I'm not sure whether to call it a list of tuples or not):
ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0), ('2', 5.116279069767442), ('1', 1.8604651162790697)]
I want to round off (or truncate, it doesn't matter to me) the second value to 2 decimal places, like this:
[('5', 45.58), ('4', 27.44), ('3', 20.0), ('2', 5.11), ('1', 1.86)]
I tried something like this:
l = tuple([round(x,2) if isinstance(x, float) else x for x in ratings])
But this does not seem to work. What can I try?
Round the 2nd element of your tuples only:
ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0), ('2', 5.116279069767442), ('1', 1.8604651162790697)]
l = [(item[0],round(item[1],2)) for item in ratings]
# [('5', 45.58), ('4', 27.44), ('3', 20.0), ('2', 5.12), ('1', 1.86)]
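The original attempt leaves the data unchanged because it iterates over the tuples themselves, and a tuple is never a float. Also, round gives 5.12 where the question shows 5.11; if truncation is what is wanted, the sketch below is one way to do it. The helper name truncate is made up here, and multiplying by a power of ten can be off by one unit for values that floats cannot represent exactly.

```python
import math

ratings = [('5', 45.58139534883721), ('4', 27.44186046511628), ('3', 20.0),
           ('2', 5.116279069767442), ('1', 1.8604651162790697)]

# The attempt from the question: every x is a tuple, so nothing is rounded.
attempt = [round(x, 2) if isinstance(x, float) else x for x in ratings]
print(attempt == ratings)  # True

def truncate(value, places=2):
    """Drop digits beyond `places` decimal places instead of rounding."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor

truncated = [(label, truncate(score)) for label, score in ratings]
print(truncated)
# [('5', 45.58), ('4', 27.44), ('3', 20.0), ('2', 5.11), ('1', 1.86)]
```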

Create strings with all possible combinations

I am using an OCR algorithm (tesseract based) which has difficulties with recognizing certain characters. I have partially solved that by creating my own "post-processing hash-table" which includes pairs of characters. For example, since the text is just numbers, I have figured out that if there is a Q character inside the text, it should be 9 instead.
However I have a more serious problem with 6 and 8 characters since both of them are recognized as B. Now since I know what I am looking for (when I am translating the image to text) and the strings are fairly short (6~8 digits), I thought to create strings with all possible combinations of 6 and 8 and compare each one of them to the one I am looking for.
So for example, I have the following string recognized by the OCR:
L0B7B0B5
So each B here can be 6 or 8.
Now I want to generate a list like the below:
L0878085
L0878065
L0876085
L0876065
.
.
So it's a kind of binary table with 3 digits, and in this case there are 8 options. But the number of B characters in the string can be other than 3 (it can be any number).
I have tried to use the Python itertools module with something like this:
list(itertools.product(*["86"] * 3))
Which will provide the following result:
[('8', '8', '8'), ('8', '8', '6'), ('8', '6', '8'), ('8', '6', '6'), ('6', '8', '8'), ('6', '8', '6'), ('6', '6', '8'), ('6', '6', '6')]
which I assume I can then later use to swap B characters. However, for some reason I can't make itertools work in my environment. I assume it has something to do with the fact that I am using Jython and not pure Python.
I will be happy to hear any other ideas on how to complete this task. Maybe there is a simpler solution I didn't think of?
itertools.product accepts a repeat keyword that you can use:
In [92]: from itertools import product
In [93]: word = "L0B7B0B5"
In [94]: subs = product("68", repeat=word.count("B"))
In [95]: list(subs)
Out[95]:
[('6', '6', '6'),
('6', '6', '8'),
('6', '8', '6'),
('6', '8', '8'),
('8', '6', '6'),
('8', '6', '8'),
('8', '8', '6'),
('8', '8', '8')]
Then one fairly concise method to make the substitutions is to do a reduction operation with the string replace method:
In [97]: subs = product("68", repeat=word.count("B"))
In [98]: [reduce(lambda s, c: s.replace('B', c, 1), sub, word) for sub in subs]
Out[98]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
Another method, using a couple more functions from itertools:
In [90]: from itertools import chain, izip_longest
In [91]: subs = product("68", repeat=word.count("B"))
In [92]: [''.join(chain(*izip_longest(word.split('B'), sub, fillvalue=''))) for sub in subs]
Out[92]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
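The sessions above are Python 2: reduce is a builtin there, and izip_longest was renamed zip_longest in Python 3 (where reduce lives in functools). A sketch of the second method for Python 3 follows; the function name expand and its placeholder and choices parameters are assumptions made for illustration.

```python
from itertools import product, zip_longest

def expand(word, placeholder="B", choices="68"):
    """Return every string obtained by replacing each placeholder with one of choices."""
    parts = word.split(placeholder)        # the text between the placeholders
    results = []
    for sub in product(choices, repeat=word.count(placeholder)):
        # Interleave the fixed parts with the chosen digits; sub is one item
        # shorter than parts, so the last part is paired with ''.
        pieces = []
        for part, digit in zip_longest(parts, sub, fillvalue=""):
            pieces.append(part)
            pieces.append(digit)
        results.append("".join(pieces))
    return results

print(expand("L0B7B0B5"))
```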
Here is a simple recursive function for generating your strings (pseudocode):
permut(char[] original, char buff[], int i) {
    if (i < original.length) {
        if (original[i] == 'B') {
            buff[i] = '6'
            permut(original, buff, i+1)
            buff[i] = '8'
            permut(original, buff, i+1)
        }
        else if (original[i] == 'Q') {
            buff[i] = '9'
            permut(original, buff, i+1)
        }
        else {
            buff[i] = original[i]   // copy unchanged characters
            permut(original, buff, i+1)
        }
    }
    else {
        store buff[]
    }
}
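A possible Python rendering of that pseudocode is sketched below as a generator, so it does not depend on itertools at all. The name permut is taken from the pseudocode; the generator form and the default arguments are choices made here, not part of the original answer.

```python
def permut(original, i=0, buff=None):
    """Yield every reading of `original`, with B -> 6 or 8 and Q -> 9."""
    if buff is None:
        buff = []
    if i == len(original):
        # Every position has been decided: emit the finished string.
        yield "".join(buff)
        return
    ch = original[i]
    if ch == "B":
        options = "68"      # ambiguous character: try both digits
    elif ch == "Q":
        options = "9"       # known misread: always a 9
    else:
        options = ch        # any other character is kept as-is
    for option in options:
        buff.append(option)
        yield from permut(original, i + 1, buff)
        buff.pop()          # undo the choice before trying the next one

print(list(permut("L0B7B0B5")))
```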

Why am I not getting the result of sorted function in expected order?

print activities
activities = sorted(activities,key = lambda item:item[1])
print activities
activities in this case is a list of tuples like (start_number, finish_number). As I understand it, the output of the above code should be the list sorted in increasing order of finish_number. When I tried the above code in the shell I got the following output. I am not sure why the second list is not sorted in increasing order of finish_number. Please help me understand this.
[('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9'), ('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16')]
[('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16'), ('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9')]
You are sorting strings instead of integers: in that case, '10' is "smaller" than '4'. To sort on integers, change it to this:
activities = sorted(activities,key = lambda item:int(item[1]))
print activities
Results in:
[('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9'), ('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16')]
Your items are being compared as strings, not as numbers. Thus, since the 1 character comes before 4 lexicographically, it makes sense that 10 comes before 4.
You need to cast the value to an int first:
activities = sorted(activities,key = lambda item:int(item[1]))
You are sorting strings, not numbers. Strings get sorted character by character.
So, for example '40' is greater than '100' because character 4 is larger than 1.
You can fix this on the fly by simply casting the item as an integer.
activities = sorted(activities,key = lambda item: int(item[1]))
It's because you're not storing the number as a number, but as a string. The string '10' comes before the string '2'. Try:
activities = sorted(activities, key=lambda i: int(i[1]))
Look for a BROADER solution to your problem: Convert your data from str to int immediately on input, work with it as int (otherwise you'll continually be bumping into little problems like this), and format your data as str for output.
This principle applies generally, e.g. when working with non-ASCII string data, do UTF-8 -> unicode -> UTF-8; don't try to manipulate undecoded text.
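A small sketch of that advice, using the string-sorted list from the question as input: convert to int once, sort on the numeric finish value, and convert back to str only for output. The variable names raw, activities, and as_text are illustrative.

```python
raw = [('6', '10'), ('8', '11'), ('8', '12'), ('2', '14'), ('12', '16'),
       ('1', '4'), ('3', '5'), ('0', '6'), ('5', '7'), ('3', '9'), ('5', '9')]

# Convert on input: from here on every value is an int.
activities = [(int(start), int(finish)) for start, finish in raw]

# No int() is needed in the key any more; the comparison is numeric.
activities.sort(key=lambda item: item[1])

# Format as strings again only at the output boundary.
as_text = [(str(start), str(finish)) for start, finish in activities]
print(as_text)
```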
