Fastest way to extract and increase latest number from end of string - python

I have a list of strings that have numbers as suffixes. I'm trying to extract the highest number so I can increase it by 1. Here's what I came up with but I'm wondering if there's a faster way to do this:
data = ["object_1", "object_2", "object_3", "object_blah", "object_123asdfd"]
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
print sorted(numbers)[-1] + 1 # Output is 4
A few conditions:
It's very possible that the suffix is not a number at all, and should be skipped.
If no input is valid, then the output should be 1 (this is why I have or [0])
No Python 3 solutions, only 2.7.
Maybe some regex magic would be a faster way to find the highest number to increment? I don't like that I have to split twice.
Edit
I did some benchmarks on the current answers using 100 iterations on data that has 10000 items:
Alex Noname's method: 1.65s
Sushanth's method: 1.95s
Balaji Ambresh's method: 2.12s
My original method: 2.16s
I've accepted an answer for now, but feel free to contribute.

Using heapq.nlargest is a pretty efficient way. Maybe someone will compare it with other methods.
import heapq
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
Comparing with the original method (Python 3.8)
import heapq
import random
from time import time
data = []
for i in range(0, 1000000):
    data.append(f'object_{random.randrange(10000000)}')
begin = time()
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
print('nlargest method: ', time() - begin)
print(a)
begin = time()
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
a = sorted(numbers)[-1]
print('original method: ', time() - begin)
print(a)
nlargest method: 0.4306185245513916
9999995
original method: 0.8409149646759033
9999995
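For comparison, a plain max() over the same pipeline is a natural baseline (a sketch; unlike the or [0] guard in the original, max() raises ValueError if nothing survives the filter):
a = max(map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))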

Try this: a list comprehension pulls out the numeric suffixes (0 otherwise), and max returns the highest value.
max([
    int(x.split("_")[-1]) if x.split("_")[-1].isdigit() else 0 for x in data
]) + 1

Try:
import re
res = max([int((re.findall(r'_(\d+)$', item) or [0])[0]) for item in data]) + 1
Value:
4
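A variant with a precompiled pattern avoids re-parsing the expression on every item; a sketch with the same semantics (falls back to 0 when nothing matches), still Python 2.7-compatible:
import re

suffix = re.compile(r'_(\d+)$')
matches = (suffix.search(item) for item in data)
res = max([int(m.group(1)) for m in matches if m] or [0]) + 1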

Related

Python (lambda?) random number generation

I have a string where I want to output random ints of differing sizes using Python's built-in format function.
I.e. "{one_digit}:{two_digit}:{one_digit}"
Yields: "3:27:9"
I'm trying:
import random
"{one_digit}:{two_digit}:{one_digit}".format(one_digit=random.randint(1,9),two_digits=random.randint(10,99))
but this always outputs...
"{one_digit}:{two_digit}:{one_digit}".format(one_digit=random.randint(1,9),two_digit=random.randint(10,99))
>>>'4:22:4'
"{one_digit}:{two_digit}:{one_digit}".format(one_digit=random.randint(1,9),two_digit=random.randint(10,99))
>>>'7:48:7'
"{one_digit}:{two_digit}:{one_digit}".format(one_digit=random.randint(1,9),two_digit=random.randint(10,99))
>>>'2:28:2'
"{one_digit}:{two_digit}:{one_digit}".format(one_digit=random.randint(1,9),two_digit=random.randint(10,99))
>>>'1:12:1'
Which is as expected, since the numbers are evaluated beforehand. I'd like them all to be random, though. I tried using a lambda function but only got this:
"test{number}:{number}".format(number=lambda x: random.randint(1,10))
But that only yields
"test{number}:{number}".format(number=lambda x: random.randint(1,10))
>>>'test<function <lambda> at 0x10aa14e18>:<function <lambda> at 0x10aa14e18>'
First off: str.format is the wrong tool for the job, because it doesn't allow you to generate a different value for each replacement.
The correct solution is therefore to implement your own replacement function. We'll replace the {one_digit} and {two_digit} format specifiers with something more suitable: {1} and {2}, respectively.
format_string = "{1}:{2}:{1}"
Now we can use regex to substitute all of these markers with random numbers. Regex is handy because re.sub accepts a replacement function, which we can use to generate a new random number every time:
import random
import re

def repl(match):
    num_digits = int(match.group(1))
    lower_bound = 10 ** (num_digits - 1)
    upper_bound = 10 * lower_bound - 1
    random_number = random.randint(lower_bound, upper_bound)
    return str(random_number)
result = re.sub(r'{(\d+)}', repl, format_string)
print(result) # result: 5:56:1
How about this?
import random
r = [1,2,3,4,5]
','.join(map(str,(random.randint(-10**i,10**i) for i in r)))
The two arguments (-10**i, 10**i) are the lower and upper bounds of each random number.
Example output: '-8,45,-328,7634,51218'
Explanation:
It seems you are looking to join random numbers with ','. This can simply be done using ','.join([list of strings]); e.g. ','.join(['1','2']) returns '1,2'.
What about this?
'%s:%s:%s' % (random.randint(1,9),random.randint(10,99),random.randint(1,9))
EDIT: meeting the requirements.
a = [1,2,2,1,3,4,5,9,0]  # the pattern: entry j yields a number with j+1 digits
b = ''
for j in a:  # iterate over the values directly; enumerate(a) would pass tuples to 10**j
    x = random.randint(10**j, 10**(j+1)-1)
    b = b + '%s:' % x
print(b)
sample:
print (b)
31:107:715:76:2602:99021:357311:7593756971:1:

Prepare my bigdata with Spark via Python

My data, 100M in size and quantized:
(1424411938, [3885, 7898])
(3333333333, [3885, 7898])
Desired result:
(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])
So what I want is to transform the data so that 3885 (for example) is grouped with all the data[0] values that have it. Here is what I did in Python:
def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if(found == 0):
                result.append((c, []))
            for res in result:
                if c == res[0]:
                    res[1].append(point_id)
    return result
but when I mapPartitions()'ed the data RDD with prepare(), it seems to do what I want only within the current partition, and thus returns a bigger result than desired.
For example, if the first record was in the 1st partition and the second in the 2nd, then I would get as a result:
(3885, [3333333333])
(7898, [3333333333])
(3885, [1424411938])
(7898, [1424411938])
How to modify my prepare() to get the desired effect? Alternatively, how to process the result that prepare() produces, so that I can get the desired result?
As you may already have noticed from the code, I do not care about speed at all.
Here is a way to create the data:
data = []
from random import randint
for i in xrange(0, 10):
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)
You can use a bunch of basic pyspark transformations to achieve this.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to emit a key/value pair for every item in x[1], changing each record to the format (a, x[0]), where a is each item of x[1]. To understand flatMap better you can look at the documentation.
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We just grouped all key/value pairs by their keys and used the tuple function to convert each iterable of values into a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
Since you said you can use [:150] to keep the first 150 elements, this would be the proper usage:
r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
I tried to be as explanatory as possible. I hope this helps.
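If the grouped lists get large, a reduceByKey variant may also be worth trying, since it can combine values map-side before the shuffle; a sketch using the same rdd as above (the order of elements in each list may vary):
>>> r3 = rdd.flatMap(lambda x: ((a, [x[0]]) for a in x[1])).reduceByKey(lambda a, b: a + b)
>>> r3.collect()
[(3885, [1424411938, 3333333333]), (7898, [1424411938, 3333333333])]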

List Comparison Algorithm: How can it be made better?

Running on Python 3.3
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold. First, I cannot seem to find any algorithms online. Second, there should be a more efficient way.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach by:
Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).
Currently, the run time seems to average O(n log n). This can be seen by running the test cases listed after this block of code.
Right now, my code looks as such:
def compare(small,large,largeStart,largeEnd):
    for i in range(largeStart, largeEnd):
        if small==large[i]:
            return [1,i]
        if small<large[i]:
            if i!=0:
                return [0,i-1]
            else:
                return [0, i]
    return [0,largeStart]

def determineLongerList(aList, bList):
    if len(aList)>len(bList):
        return (aList, bList)
    elif len(aList)<len(bList):
        return (bList, aList)
    else:
        return (aList, bList)

def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex<smallLength):
        boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
        if boolean[0]==1:
            # `compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            # else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex += 1
        iterations = largeIndex-oldIndex+iterations+1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
    return sameItems
And here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
#testLargest()
'''
One rendition of testLargest:
******************************************
CREATING LISTS TOOK: 21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO: 9.99998e-07
COMPARING LISTS TOOK: 13.99990701675415
NUMBER OF SAME ITEMS: 632328
******************************************
'''
def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')

testLarge()
If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care about duplicate elements, then sets are fine.
a = set(listA)
print a.intersection(listB)
This will print all elements which are in both listA and listB (without duplicate output for duplicate input elements).
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print a & b
This will print which elements occur in both lists, and how often (the minimum of the two counts).
I didn't take any measurements, but I'm pretty sure these solutions are way faster than your hand-rolled attempt.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
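A quick sketch of that round trip (Python 2, matching the snippets above; element order in a Counter is not guaranteed):
import collections
a = collections.Counter([1, 2, 2, 5])
b = collections.Counter([2, 2, 3, 5])
common = a & b                 # intersection keeps the minimum count per element
print list(common.elements())  # e.g. [2, 2, 5]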
Using IPython's timeit magic to measure: the original function doesn't compare favourably with just a standard set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}

How to find Median [duplicate]

This question already has answers here:
Finding median of list in Python
(28 answers)
Closed 6 years ago.
I have data like this.
Ram,500
Sam,400
Test,100
Ram,800
Sam,700
Test,300
Ram,900
Sam,800
Test,400
What is the shortest way to find the "median" of the above data?
My result should be something like...
Median = 1/2(n+1), where n is the number of data values in the sample.
Test 500
Sam 700
Ram 800
Python 3.4 includes statistics built-in, so you can use the method statistics.median:
>>> from statistics import median
>>> median([1, 3, 5])
3
Use numpy's median function.
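For instance (a minimal sketch, assuming the values have already been pulled out into a plain list):
>>> import numpy as np
>>> np.median([500, 400, 100])
400.0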
It's a little unclear how your data is actually represented, so I've assumed it is a list of tuples:
data = [('Ram',500), ('Sam',400), ('Test',100), ('Ram',800), ('Sam',700),
        ('Test',300), ('Ram',900), ('Sam',800), ('Test',400)]
from collections import defaultdict

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]

data_dict = defaultdict(list)
for el in data:
    data_dict[el[0]].append(el[1])

print [(key,median(val)) for key, val in data_dict.items()]
print median([5,2,4,3,1])
print median([5,2,4,3,1,6])
#output:
[('Test', 300), ('Ram', 800), ('Sam', 700)]
3
3.5
The function median returns the median of a list. If there is an even number of entries, it takes the average of the middle two entries (this is standard).
I've used defaultdict to create a dict keyed by your data and their values, which is a more useful representation of your data.
Check this out:
def median(lst):
    even = (0 if len(lst) % 2 else 1) + 1
    half = (len(lst) - 1) / 2
    return sum(sorted(lst)[half:half + even]) / float(even)
Note:
sorted(lst) produces a sorted copy of lst;
sum([1]) == 1;
Easiest way to get the median of a list with integer data:
x = [1,3,2]
print "The median of x is:", sorted(x)[len(x)//2]
I started with user3100512's answer and quickly realized it doesn't work for an even number of items. I added some conditionals to it to compute the median.
def median(x):
    if len(x)%2 != 0:
        return sorted(x)[len(x)/2]
    else:
        midavg = (sorted(x)[len(x)/2] + sorted(x)[len(x)/2-1])/2.0
        return midavg

median([4,5,6,7])
should return 5.5

Longest common prefix using buffer?

If I have an input string and an array:
s = "to_be_or_not_to_be"
pos = [15, 2, 8]
I am trying to find the longest common prefix between the consecutive elements of the array pos referencing the original s. I am trying to get the following output:
longest = [3,1]
The way I obtained this is by computing the longest common prefix of the following pairs:
s[15:] which is _be and s[2:] which is _be_or_not_to_be giving 3 ( _be )
s[2:] which is _be_or_not_to_be and s[8:] which is _not_to_be giving 1 ( _ )
However, if s is huge, I don't want to create multiple copies when I do something like s[x:]. After hours of searching, I found the function buffer, which maintains only one copy of the input string, but I wasn't sure of the most efficient way to use it in this context. Any suggestions on how to achieve this?
Here is a method without buffer which doesn't copy, as it only looks at one character at a time:
from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]
length = len(s)

for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref
# prints 3 1
I use islice, izip, and xrange in case you're talking about potentially very long strings.
I also couldn't resist this "One Liner" which doesn't even require any indexing:
[next((i for i, (a, b) in
       enumerate(izip(islice(s, start1, None), islice(s, start2, None)))
       if a != b),
      length - max((start1, start2)))
 for start1, start2 in izip(pos, islice(pos, 1, None))]
One final method, using os.path.commonprefix:
[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]
>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'
Let Python manage memory for you. Don't optimize prematurely.
To get the exact output you could do (as #agf suggested):
print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]
I think your worrying about copies is unfounded. See below:
>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True
Unless you're copying the slices and leaving references to the copies hanging around, I wouldn't think it could cause any problem.
edit: There are technical details with string interning and stuff that I'm not really clear on myself. But I'm sure that a string slice is not always a copy:
>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True
I guess the answer I'm trying to give is: just let Python manage its memory itself to begin with; you can look at memory buffers and views later if needed. And if this is already a real problem for you, update your question with details of what the actual problem is.
One way of doing this using buffer is given below. However, there could be much faster ways.
s = "to_be_or_not_to_be"
pos = [15, 2, 8]
lcp = []
length = len(pos) - 1
for index in range(0, length):
pre = buffer(s, pos[index])
cur = buffer(s, pos[index+1], pos[index+1]+len(pre))
count = 0
shorter, longer = min(pre, cur), max(pre, cur)
for i, c in enumerate(shorter):
if c != longer[i]:
break
else:
count += 1
lcp.append(count)
print
print lcp
