Concatenating strings faster than appending to lists - python

I am trying to isolate specific items in a list (e.g. [0, 1, 1] will return [0, 1]). I managed to get through this, but I noticed something strange.
When I tried appending to a list it ran about 7 times slower than when I was concatenating strings and then splitting them.
This is my code:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_string = ""
for i in values:
    if not str(i) in distinct_string:
        distinct_string += str(i) + " "
print(distinct_string.split())
print(" --- %s sec --- " % (start - time.time()))
This finishes in about 5 seconds... Now for the lists:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_list = []
for i in values:
    if not i in distinct_list:
        distinct_list.append(i)
print(distinct_list)
print(" --- %s sec --- " % (start - time.time()))
Runs at around 40 seconds.
What makes string faster even though I am converting a lot of values to strings?

Note that it's generally better to use timeit to compare functions, which runs the same thing multiple times to get average performance, and to factor out repeated code to focus on the performance that matters. Here's my test script:
first = [x for x in range(999) if x % 2 == 0]
second = [x for x in range(999) if x % 4 == 0]
values = first + second
def str_method(values):
    distinct_string = ""
    for i in values:
        if not str(i) in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

def list_method(values):
    distinct_list = []
    for i in values:
        if not i in distinct_list:
            distinct_list.append(i)
    return distinct_list

def set_method(values):
    seen = set()
    return [val for val in values if val not in seen and seen.add(val) is None]

if __name__ == '__main__':
    assert str_method(values) == list_method(values) == set_method(values)
    import timeit
    funcs = [func.__name__ for func in (str_method, list_method, set_method)]
    setup = 'from __main__ import {}, values'.format(', '.join(funcs))
    for func in funcs:
        print(func)
        print(timeit.timeit(
            '{}(values)'.format(func),
            setup=setup,
            number=1000
        ))
I've added int conversion to make sure that the functions return the same thing, and get the following results:
str_method
1.1685157899992191
list_method
2.6124089090008056
set_method
0.09523714500392089
Note that, once you include the cost of converting the input, searching in a string is not actually faster than searching in a list:
>>> timeit.timeit('1 in l', setup='l = [9, 8, 7, 6, 5, 4, 3, 2, 1]')
0.15300405000016326
>>> timeit.timeit('str(1) in s', setup='s = "9 8 7 6 5 4 3 2 1"')
0.23205067300295923
Repeated appending to a list is not very efficient, as it means frequent resizing of the underlying object - the list comprehension, as shown in the set version, is more efficient.

Searching in strings:
if not str(i) in distinct_string:
is much faster than searching in lists:
if not i in distinct_list:
Here are line_profiler results for the string search in the OP's code:
Line # Hits Time Per Hit % Time Line Contents
17 75000 80366013 1071.5 92.7 if not str(i) in distinct_string:
18 50000 2473212 49.5 2.9 distinct_string += str(i) + " "
and for list search in OP code
39 75000 769795432 10263.9 99.1 if not i in distinct_list:
40 50000 2813804 56.3 0.4 distinct_list.append(i)

I think there is a flaw of logic that makes the string method seem much faster.
When matching substrings in a long string, the in operator returns early at the first substring containing the search item, even when it is not an exact entry. To prove this, I let the loop run backwards from the highest values down to the smallest, and it returned only 50% of the values of the original loop (I checked only the length of the result). If the matching were exact, there should be no difference whether you check the sequence from the start or from the end. I conclude that the string method short-cuts a lot of comparisons by matching near the start of the long string. The particular choice of duplicates unfortunately masks this.
In a second test, I let the string method search for " " + str(i) + " " to eliminate substring matches. Now it runs only about 2x faster than the list method (but still faster).
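The substring false positive is easy to demonstrate with a tiny input; a minimal sketch of the failure mode using the OP's loop:

```python
# The membership test matches substrings, not whole entries:
values = [21, 1]
distinct_string = ""
for i in values:
    if not str(i) in distinct_string:
        distinct_string += str(i) + " "

print(distinct_string.split())  # ['21'] -- 1 is wrongly treated as a duplicate
```

Here "1" is found inside "21 ", so 1 is silently dropped even though it never appeared in the input.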
@jonrsharpe: Regarding set_method, I cannot see why one would touch all set elements one by one rather than build the set in one statement like this:
def set_method(values):
    return list(set(values))
This produces exactly the same output and runs about 2.5x faster on my PC.
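One hedge on list(set(values)): a set does not guarantee the original element order, so the "same output" observation depends on these particular integer inputs. If order matters, an order-preserving dedup is just as short; a sketch (ordered_dedup is a hypothetical name, not from the original post):

```python
def ordered_dedup(values):
    # dicts preserve insertion order (Python 3.7+), so this keeps the
    # first occurrence of each value in its original position
    return list(dict.fromkeys(values))

print(ordered_dedup([3, 1, 3, 2, 1]))  # [3, 1, 2]
```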

Related

Fastest way to extract and increase latest number from end of string

I have a list of strings that have numbers as suffixes. I'm trying to extract the highest number so I can increase it by 1. Here's what I came up with but I'm wondering if there's a faster way to do this:
data = ["object_1", "object_2", "object_3", "object_blah", "object_123asdfd"]
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
print sorted(numbers)[-1] + 1 # Output is 4
A few conditions:
It's very possible that the suffix is not a number at all, and should be skipped.
If no input is valid, then the output should be 1 (this is why I have or [0])
No Python 3 solutions, only 2.7.
Maybe some regex magic would be faster to find the highest number to increment on? I don't like the fact that I have to split twice.
Edit
I did some benchmarks on the current answers using 100 iterations on data that has 10000 items:
Alex Noname's method: 1.65s
Sushanth's method: 1.95s
Balaji Ambresh method: 2.12s
My original method: 2.16s
I've accepted an answer for now, but feel free to contribute.
Using heapq.nlargest is a pretty efficient way. Maybe someone will compare it with other methods.
import heapq
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
Comparing with the original method (Python 3.8)
import heapq
import random
from time import time
data = []
for i in range(0, 1000000):
    data.append(f'object_{random.randrange(10000000)}')
begin = time()
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
print('nlargest method: ', time() - begin)
print(a)
begin = time()
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
a = sorted(numbers)[-1]
print('original method: ', time() - begin)
print(a)
nlargest method: 0.4306185245513916
9999995
original method: 0.8409149646759033
9999995
Try this, using a list comprehension to get all the numeric suffixes; max then returns the highest value:
max([
    int(x.split("_")[-1]) if x.split("_")[-1].isdigit() else 0 for x in data
]) + 1
Try:
import re
res = max([int((re.findall(r'_(\d+)$', item) or [0])[0]) for item in data]) + 1
Value:
4
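To address the "I have to split twice" complaint directly, here is a sketch that splits only once per item via rsplit (next_suffix is a hypothetical helper name); the logic otherwise mirrors the original method, including the or [0] default, and is written with print() so it runs on both 2.7 and 3:

```python
def next_suffix(data):
    # rsplit("_", 1) splits once, from the right, so each item is split a single time
    suffixes = (item.rsplit("_", 1)[-1] for item in data)
    numbers = [int(s) for s in suffixes if s.isdigit()]
    return max(numbers or [0]) + 1

data = ["object_1", "object_2", "object_3", "object_blah", "object_123asdfd"]
print(next_suffix(data))  # 4
```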

Longest subset of five-positions five-elements permutations, only one-element-position in common

I am trying to get the longest possible list of sequences of five ordered positions, each holding a value 1 to 5, satisfying the condition that any two members of the list cannot share more than one identical value at the same position (index). I.e., 11111 and 12222 are permitted (only the 1 at index 0 is shared), but 11111 and 11222 are not permitted (same value at index 0 and 1).
I have tried a brute-force attack, starting with the complete list of permutations, 3125 members, and walking through the list element by element, rejecting the ones that do not match the criteria, in several steps:
step one: testing elements 2 to 3125 against element 1, getting a new, shorter list L'
step two: testing elements 3 to N' against element 2', getting a shorter list yet, L'',
and so on.
I get a 17 members solution, perfectly valid. The problem is that:
I know there are, at least, two 25-member valid solution found by a matter of good luck,
The solution by this brute-force method depends strongly on the initial order of the 3125 members list, so I have been able to find from 12- to 21-member solutions, shuffling the L0 list, but I have never hit the 25-member solutions.
Could anyone please put light on the problem? Thank you.
This is my approach so far
import csv, random

maxv = 0
soln = 0
for p in range(0, 1):  # Intended to run multiple times
    z = -1
    while True:
        z = z + 1
        file1 = 'Step' + "%02d" % (z+0) + '.csv'
        file2 = 'Step' + "%02d" % (z+1) + '.csv'
        nextdata = []
        with open(file1, 'r') as csv_file:
            data = list(csv.reader(csv_file))
        #if file1 == 'Step00.csv':  # related to p loop
        #    random.shuffle(data)
        i = 0
        while i <= z:
            nextdata.append(data[i])
            i = i + 1
        for j in range(z, len(data)):
            sum = 0
            for k in range(0, 5):
                if (data[z][k] == data[j][k]):
                    sum = sum + 1
            if sum < 2:
                nextdata.append(data[j])
        ofile = open(file2, 'wb')
        writer = csv.writer(ofile)
        writer.writerows(nextdata)
        ofile.close()
        if (len(nextdata) < z + 1 + 1):
            if (z+1) >= maxv:
                maxv = z+1
                print maxv
            ofile = open("Solution" + "%02d" % soln + '.csv', 'wb')
            writer = csv.writer(ofile)
            writer.writerows(nextdata)
            ofile.close()
            soln = soln + 1
            break
Here is a Picat model for the problem (as I understand it): http://hakank.org/picat/longest_subset_of_five_positions.pi It uses constraint modelling and a SAT solver.
Edit: Here is a MiniZinc model: http://hakank.org/minizinc/longest_subset_of_five_positions.mzn
The model (predicate go/0) checks lengths from 2 to 100. All lengths between 2 and 25 have at least one solution (probably a lot more), so 25 is the longest subsequence. Here is one 25-length solution:
{1,1,1,3,4}
{1,2,5,1,5}
{1,3,4,4,1}
{1,4,2,2,2}
{1,5,3,5,3}
{2,1,3,2,1}
{2,2,4,5,4}
{2,3,2,1,3}
{2,4,1,4,5}
{2,5,5,3,2}
{3,1,2,5,5}
{3,2,3,4,2}
{3,3,5,2,4}
{3,4,4,3,3}
{3,5,1,1,1}
{4,1,4,1,2}
{4,2,1,2,3}
{4,3,3,3,5}
{4,4,5,5,1}
{4,5,2,4,4}
{5,1,5,4,3}
{5,2,2,3,1}
{5,3,1,5,2}
{5,4,3,1,4}
{5,5,4,2,5}
There are a lot of different 25-length solutions (the predicate go2/0 checks that).
Here is the complete model (edited from the file above):
import sat.

main => go.

%
% Test all lengths from 2..100.
% 25 is the longest.
%
go ?=>
  nolog,
  foreach(M in 2..100)
    println(check=M),
    if once(check(M,_X)) then
      println(M=ok)
    else
      println(M=not_ok)
    end,
    nl
  end,
  nl.
go => true.

%
% Check if there is a solution with M numbers
%
check(M, X) =>
  N = 5,
  X = new_array(M,N),
  X :: 1..5,
  foreach(I in 1..M, J in I+1..M)
    % at most 1 same number in the same position
    sum([X[I,K] #= X[J,K] : K in 1..N]) #<= 1,
    % symmetry breaking: sort the sub sequence
    lex_lt(X[I],X[J])
  end,
  solve([ff,split],X),
  foreach(Row in X)
    println(Row)
  end,
  nl.
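The pairwise condition itself is easy to verify independently of any solver; a minimal Python sketch (compatible and is_valid are hypothetical helper names), demonstrated on the question's own examples:

```python
from itertools import combinations

def compatible(u, v):
    # two sequences may agree in at most one position
    return sum(a == b for a, b in zip(u, v)) <= 1

def is_valid(words):
    # every pair in a candidate list must be compatible
    return all(compatible(u, v) for u, v in combinations(words, 2))

print(compatible((1,1,1,1,1), (1,2,2,2,2)))  # True: only index 0 is shared
print(compatible((1,1,1,1,1), (1,1,2,2,2)))  # False: indices 0 and 1 are shared
```

Running is_valid over the 25 rows above confirms the solver's output.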

Find two numbers from a list that add up to a specific number

This is super bad and messy, I am new to this, please help me.
Basically, I was trying to find two numbers from a list that add up to a target number.
I have set up an example with lst = [2, 4, 6, 10] and a target value of target = 8. The answer in this example would be (2, 6) and (6, 2).
Below is my code but it is long and ugly and I am sure there is a better way of doing it. Can you please see how I can improve from my code below?
from itertools import product, permutations
numbers = [2, 4, 6, 10]
target_number = 8
two_nums = (list(permutations(numbers, 2)))
print(two_nums)
result1 = (two_nums[0][0] + two_nums[0][1])
result2 = (two_nums[1][0] + two_nums[1][1])
result3 = (two_nums[2][0] + two_nums[2][1])
result4 = (two_nums[3][0] + two_nums[3][1])
result5 = (two_nums[4][0] + two_nums[4][1])
result6 = (two_nums[5][0] + two_nums[5][1])
result7 = (two_nums[6][0] + two_nums[6][1])
result8 = (two_nums[7][0] + two_nums[7][1])
result9 = (two_nums[8][0] + two_nums[8][1])
result10 = (two_nums[9][0] + two_nums[9][1])
my_list = (result1, result2, result3, result4, result5, result6, result7, result8, result9, result10)
print (my_list)
for i in my_list:
    if i == 8:
        print("Here it is:" + str(i))
For every number in the list, you can look for its complement (the number that, added to the current one, gives the required target sum). If it exists, report the pair and exit; otherwise move on.
This would look like the following:
numbers = [2, 4, 6, 10]
target_number = 8
for i, number in enumerate(numbers[:-1]):  # note 1
    complementary = target_number - number
    if complementary in numbers[i+1:]:  # note 2
        print("Solution Found: {} and {}".format(number, complementary))
        break
else:  # note 3
    print("No solutions exist")
which produces:
Solution Found: 2 and 6
Notes:
1. You do not have to check the last number; if there were a pair you would already have found it by then.
2. The membership check (which is quite costly in lists) is optimized because it considers only the slice numbers[i+1:]; the previous numbers have already been checked. A positive side-effect of the slicing is that a single 4 in the list does not yield a pair for a target value of 8.
3. This is an excellent setup to explain the misunderstood and often confusing use of else in for-loops: the else triggers only if the loop was not abruptly ended by a break.
If the e.g., 4 - 4 solution is acceptable to you even when having a single 4 in the list you can modify as follows:
numbers = [2, 4, 6, 10]
target_number = 8
for i, number in enumerate(numbers):
    complementary = target_number - number
    if complementary in numbers[i:]:
        print("Solution Found: {} and {}".format(number, complementary))
        break
else:
    print("No solutions exist")
A list comprehension will work well here. Try this:
from itertools import permutations
numbers = [2, 4, 6, 10]
target_number = 8
solutions = [pair for pair in permutations(numbers, 2) if sum(pair) == 8]
print('Solutions:', solutions)
Basically, this list comprehension looks at all the pairs that permutations(numbers, 2) returns, but only keeps the ones whose total sum equals 8.
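If the reversed duplicates such as (6, 2) are unwanted, combinations (which ignores order) is a possible variant of the same comprehension:

```python
from itertools import combinations

numbers = [2, 4, 6, 10]
target_number = 8
# combinations yields each unordered pair once, so (6, 2) never appears
solutions = [pair for pair in combinations(numbers, 2) if sum(pair) == target_number]
print('Solutions:', solutions)  # Solutions: [(2, 6)]
```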
The simplest general way to do this is to iterate over your list and for each item iterate over the rest of the list to see if it adds up to the target value. The downside of this is it is an O(n^2) operation. I don't know off the top of my head if there is a more efficient solution. I'm not 100% sure my syntax is correct, but it should look something like the following:
done = False
for i, val in enumerate(numbers):
    if val >= target_number:
        continue
    # slice the rest of the list; enumerate's second argument only offsets the index
    for j, val2 in enumerate(numbers[i+1:], i+1):
        if val + val2 == target_number:
            print("Here it is: " + str(i) + "," + str(j))
            done = True
            break
    if done:
        break
Of course you should create this as a function that returns your result instead of just printing it. That would remove the need for the "done" variable.
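There is in fact a more efficient approach than the nested O(n^2) loops: one pass with a set of the values seen so far runs in O(n) on average. A sketch (find_pair is a hypothetical helper name, not from the original answers):

```python
def find_pair(numbers, target):
    # Single pass: remember every value seen so far and check
    # whether the current value's complement has already appeared.
    seen = set()
    for val in numbers:
        complement = target - val
        if complement in seen:
            return (complement, val)  # first matching pair, in input order
        seen.add(val)
    return None

print(find_pair([2, 4, 6, 10], 8))  # (2, 6)
```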
If you are trying to find the answer for multiple integers with a long list that has duplicate values, I would recommend using frozenset. The accepted answer will only get the first pair and then stop.
import numpy as np
numbers = np.random.randint(0, 100, 1000)
target = 17
def adds_to_target(base_list, target):
    return_list = []
    for i in range(len(base_list)):
        return_list.extend([list((base_list[i], b)) for b in base_list if (base_list[i] + b) == target])
    return set(map(frozenset, return_list))
# sample output
{frozenset({7, 10}),
frozenset({4, 13}),
frozenset({8, 9}),
frozenset({5, 12}),
frozenset({2, 15}),
frozenset({3, 14}),
frozenset({0, 17}),
frozenset({1, 16}),
frozenset({6, 11})}
1) In the first for loop, lists containing two integers that sum to the target value are added to "return_list" i.e. a list of lists is created.
2) Then frozenset takes out all duplicate pairs.
%timeit adds_to_target(numbers, target)
# 312 ms ± 8.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can do it in one line with a list comprehension, like below:
from itertools import permutations
numbers = [2, 4, 6, 10]
target_number = 8
two_nums = (list(permutations(numbers, 2)))
result = [i for i in two_nums if i[0]+i[1] == target_number]
# [(2, 6), (6, 2)]
If you want a way to do this efficiently without itertools -
numbers = [1,3,4,5,6,2,3,4,1]
target = 5
number_dict = {}
pairs = []
for num in numbers:
    number_dict[num] = number_dict.get(num, 0) + 1
    complement = target - num
    if complement in number_dict.keys():
        pairs.append((num, complement))
        number_dict.pop(num)
        number_dict.pop(complement)
It is this simple :)
def func(array, target):
    flag = 0
    for x in array:
        for y in array:
            if (target - x) == y and x != y:
                print(x, y)
                flag = 1
                break
        if flag == 1:
            break
list_ = [1,2,4,6,8]
num = 10
for number in list_:
    num_add = number
    for number_ in list_:
        if number_ + num_add == num and number_ != num_add:
            print(number_, num_add)
n is the desired sum and L is the list. Enter the outer loop, and from that index to the end of the list iterate through the inner loop. If L[i] and L[j] add up to n and L[i] != L[j], print the pair.
numbers = [1,2,3,4,9,8,5,10,20,30,6]

def two_no_summer(n, L):
    for i in range(0, len(L)):
        for j in range(i, len(L)):
            if (L[i]+L[j] == n) & (L[i] != L[j]):
                print(L[i], L[j])
Execution: https://i.stack.imgur.com/Wu47x.jpg

ArcGIS:Python - Adding Commas to a String

In ArcGIS I have intersected a large number of zonal polygons with another set and recorded the original zone IDs and the data they are connected with. However, the strings that are created are one long run of numbers, ranging in length from 11 to 77 characters (each ID is 11 characters long). I am looking to add a "," between each ID, making it easier to read and later export as a .csv file. To do this I wrote this code:
def StringSplit(StrO, X):
    StrN = StrO  # Recording original string
    StrLen = len(StrN)
    BStr = StrLen/X  # How many segments are inside of one string
    StrC = BStr - 1  # How many times it should loop
    if StrC > 0:
        while StrC > 1:
            StrN = StrN[:((X * StrC) + 1)] + "," + StrN[(X * StrC):]
            StrC = StrC - 1
        while StrC == 1:
            StrN = StrN[:X+1] + "," + StrN[(X*StrC):]
            StrC = 0
        while StrC == 0:
            return StrN
    else:
        return StrN
The main issue is that it has to step through multiple rows (76) with various lengths (11 -> 77). I got the last parts to work, just not the internal loop, which returns an error or incorrect output for strings longer than 22 characters.
Thus right now:
1. 01234567890 returns 01234567890
2. 0123456789001234567890 returns 01234567890,01234567890
3. 012345678900123456789001234567890 returns either: Error or ,, or even ,,01234567890
I know it is probably something pretty simple I am missing, but I can't seem remember what it is...
It can be done easily with a regex.
Those ........... are 11 dots, to split off every 11-character segment.
You can use pandas to create a csv from the array output.
Code:
import re
x = re.findall('...........', '01234567890012345678900123456789001234567890')
print(x)
myString = ",".join(x)
print(myString)
output:
['01234567890', '01234567890', '01234567890', '01234567890']
01234567890,01234567890,01234567890,01234567890
For the sake of simplicity you can do this:
Code:
x = ",".join(re.findall('...........', '01234567890012345678900123456789001234567890'))
print(x)
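As a side note, the eleven dots can be written more readably as a counted quantifier, .{11}, which matches any run of exactly 11 characters:

```python
import re

s = '01234567890' * 4  # four 11-character IDs run together
# .{11} is equivalent to eleven dots, but the length is explicit
print(','.join(re.findall('.{11}', s)))
# 01234567890,01234567890,01234567890,01234567890
```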
Don't write the loops yourself; use Python libraries or builtins, it will be easier. For example:
def StringSplit(StrO, X):
    substring_starts = range(0, len(StrO), X)
    substrings = (StrO[start:start + X] for start in substring_starts)
    return ','.join(substrings)

string = '1234567890ABCDE'
print(StringSplit(string, 5))
# '12345,67890,ABCDE'

List Comparison Algorithm: How can it be made better?

Running on Python 3.3
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold: first, I cannot seem to find any algorithms online; second, there should be a more efficient way than what I have.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach by:
Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).
Currently, the run-time seems to average O(n log n). This can be seen by running the test cases listed after this block of code.
Right now, my code looks as such:
def compare(small,large,largeStart,largeEnd):
    for i in range(largeStart, largeEnd):
        if small==large[i]:
            return [1,i]
        if small<large[i]:
            if i!=0:
                return [0,i-1]
            else:
                return [0, i]
    return [0,largeStart]

def determineLongerList(aList, bList):
    if len(aList)>len(bList):
        return (aList, bList)
    elif len(aList)<len(bList):
        return (bList, aList)
    else:
        return (aList, bList)

def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex<smallLength):
        boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
        if boolean[0]==1:
            #`compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            #else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex+=1
        iterations = largeIndex-oldIndex+iterations+1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
    return sameItems
, and here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')

#testLargest()
'''
One rendition of testLargest:
******************************************
CREATING LISTS TOOK: 21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO: 9.99998e-07
COMPARING LISTS TOOK: 13.99990701675415
NUMBER OF SAME ITEMS: 632328
******************************************
'''
def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')

testLarge()
If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care for doubled elements, then sets would be fine.
a = set(listA)
print(a.intersection(listB))
This will print all elements which are in listA and in listB. (Without doubled output for doubled input elements.)
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print(a & b)
This will print how many elements are how often in both lists.
I didn't make any measuring but I'm pretty sure these solutions are way faster than your self-made attempts.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
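A small concrete example of the bag behaviour, showing the multiset intersection and elements() together:

```python
from collections import Counter

a = Counter([1, 1, 2, 3])
b = Counter([1, 1, 1, 3, 4])
common = a & b  # multiset intersection: minimum of the two counts per element
print(list(common.elements()))  # [1, 1, 3]
```

Note that 1 appears twice in the result because it occurs at least twice in both inputs, which a plain set intersection would not capture.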
Using IPython's timeit magic, the original function doesn't compare favourably with a plain set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}
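For reference, the sorted-list idea from the question can also be written as a plain two-pointer merge; a sketch (sorted_intersection is a hypothetical name), O(n log n) for the sorts plus O(n + m) for the walk:

```python
def sorted_intersection(a, b):
    # Walk both sorted lists with one index each, advancing the
    # pointer behind the smaller value; equal values are matches.
    a, b = sorted(a), sorted(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            if not result or result[-1] != a[i]:  # skip duplicate matches
                result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

print(sorted_intersection([3, 1, 2, 2], [2, 5, 3]))  # [2, 3]
```

This is what the OP's compare/compareElementsInLists pair is approximating, without the extra bookkeeping; the hash-based set intersection above is still faster in practice.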
