How to fix wrong calculations in Python Mode for Processing? - python

I am trying to implement a visual illustration of Pascal's triangle in Python Mode for Processing on Mac OS X. One of the necessary steps is, of course, calculating the binomial coefficients in each row of the triangle. I chose to do it recursively instead of computing factorials. My code works fine in Jupyter but produces different results in Processing. Does anybody know why, and how I can fix the problem?
rows = 301
pascal = [[1], [1, 1]]
for i in range(rows):
    last_row = pascal[len(pascal) - 1]
    next_row = [1] + [last_row[i] + last_row[i + 1] for i in range(len(last_row)) if i < len(last_row) - 1] + [1]
    pascal.append(next_row)
print(pascal[35][16])
The code produces the correct results when executed in Jupyter, but different results in Processing. The problems begin in row 35 of the triangle (counting starts from 0). The 16th element in this row should be 4059928950, but Processing calculates -235038346. From then on, the calculations in Processing are frequently wrong.

The wrong value is the classic signature of fixed-width integer overflow: -235038346 is exactly 4059928950 - 2^32, i.e. the sum has wrapped around a signed 32-bit integer, which suggests the Processing/Jython environment is storing these values in 32-bit ints rather than Python's arbitrary-precision integers. The most principled fix would be to find a big-integer library that you can call from Jython, but since all you need is addition, it is easy to write your own function which takes two base-10 string representations of positive integers and returns the string representation of their sum:
rows = 301

def add_nums(s1, s2):
    # reverse strings and 0-pad to be of the same length
    s1 = s1[::-1]
    s2 = s2[::-1]
    s1 += '0' * (max(len(s1), len(s2)) - len(s1))
    s2 += '0' * (max(len(s1), len(s2)) - len(s2))
    dsum = []
    c = 0  # carry
    for d1, d2 in zip(s1, s2):
        a, b = int(d1), int(d2)
        c, r = divmod(a + b + c, 10)
        dsum.append(str(r))
    if c > 0: dsum.append('1')
    return ''.join(reversed(dsum))
pascal = [['1'], ['1', '1']]
for i in range(rows):
    last_row = pascal[len(pascal) - 1]
    next_row = ['1'] + [add_nums(last_row[i], last_row[i + 1]) for i in range(len(last_row)) if i < len(last_row) - 1] + ['1']
    pascal.append(next_row)
print(pascal[35][16])  # prints 4059928950
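If you want to convince yourself that the wrong number really is a 32-bit wraparound of the right one, here is a quick sanity check you can run in ordinary CPython (the two ten-digit strings are the row-34 parents of pascal[35][16]):

true_value = 4059928950
wrapped = (true_value + 2**31) % 2**32 - 2**31   # simulate signed 32-bit overflow
print(wrapped)                                   # -235038346

# add_nums from above performs the same addition on strings, without overflowing:
print(add_nums('1855967520', '2203961430'))      # '4059928950'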

Related

How can I improve this Python code to calculate Information Gain from Gini impurity?

The following code is intended to calculate information gain from a dataset using Gini impurity. I thought the code I wrote was functional and should succeed in all cases, but it fails several hidden test cases on Sololearn.
My submission is below, but here is a link to the same at Sololearn:
https://code.sololearn.com/cQEDIvXRgL3e
The pedantic version of my code, with editable inputs and exhaustive outputs, is at:
https://code.sololearn.com/cO755SFZAUJ0
Is there an error or oversight in this code that I'm missing? There must be something wrong with it as it's failing in the hidden test cases, but I have no idea what that could be.
From what I can see in the visible test cases, Sololearn sends even-numbered sets of 1s and 0s to the code, which converts them into lists as per the lines below. In my test version these lines are swapped for empty lists, which I populate with 1s and 0s before running it. I've tried sets of both odd and even length, with the resulting splits being of equal or unequal length, and it doesn't seem to adversely affect the results.
s = [int(x) for x in input().split()]
a = [int(x) for x in input().split()]
b = [int(x) for x in input().split()]

# Function to get counts for set and splits, to be used in later formulae.
def setCount(n):
    return len(n)

Cs = setCount(s)
Ca = setCount(a)
Cb = setCount(b)

# Function to get sums of "True" values in each, for later formulae.
def tSum(x):
    sum = 0
    for n in x:
        if n == 1:
            sum += 1
    return sum

Ss = tSum(s)
Sa = tSum(a)
Sb = tSum(b)

# Function to get percentage of "True" values in each, for later formulae.
def getp(x, n):
    p = x / n
    return p

Ps = getp(Ss, Cs)
Pa = getp(Sa, Ca)
Pb = getp(Sb, Cb)

# Function to get Gini impurity for each, to be used in final formula.
def gimp(p):
    return 2 * p * (1 - p)

Hs = gimp(Ps)
Ha = gimp(Pa)
Hb = gimp(Pb)

# Final formula, intended to output information gain to five decimal places.
infoGain = round((Hs - (Sa/Ss) * Ha - (Sb/Ss) * Hb), 5)
print(infoGain)
This question was answered on Sololearn by Tibor Santa, a mentor there. Their code, which solved the test cases, gets much more directly to the point of the problem. It is pasted below, and can be found on Sololearn at: https://code.sololearn.com/cUoaMq6bzxP8/
The long and short of it is that, since the result is rounded to five decimals, different ways of writing the code can produce slightly different values. While my code wasn't "wrong," it wasn't the right approach to reproduce the exact values behind the hidden test cases; note in particular that the accepted code weights each split's impurity by its size (len(A) / len(S)) rather than by its count of positive labels (Sa / Ss) as mine does. My version is also unnecessarily verbose.
The code that solved the test cases:
# S, A and B are read the same way as s, a and b in my version above
S = [int(x) for x in input().split()]
A = [int(x) for x in input().split()]
B = [int(x) for x in input().split()]

def gini(p):
    return 2 * p * (1 - p)

def p(data):
    return sum(data) / len(data)

giniS = gini(p(S))
deltaA = gini(p(A)) * len(A) / len(S)
deltaB = gini(p(B)) * len(B) / len(S)
gain = giniS - deltaA - deltaB
print(round(gain, 5))
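As a quick illustration of what this formula computes, here is a made-up toy split (not one of the hidden test cases, and assuming the gini and p helpers defined just above): a parent set that is half 1s, split into a pure-1 subset and a pure-0 subset, gains the full parent impurity of 0.5.

S = [1, 1, 1, 0, 0, 0]   # parent: half 1s -> impurity 2 * 0.5 * 0.5 = 0.5
A = [1, 1, 1]            # pure split     -> impurity 0
B = [0, 0, 0]            # pure split     -> impurity 0
gain = gini(p(S)) - gini(p(A)) * len(A) / len(S) - gini(p(B)) * len(B) / len(S)
print(round(gain, 5))    # 0.5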

While loop with Cython? or Better way to remove the elements that fall into a given range

I am basically looking for a faster/more efficient way to perform part of my Python code.
Here goes a simpler version of my part of code.
import numpy as np
A = np.random.choice(100,80) # randomly select integers
A = np.sort(A) # sort it
B = np.unique(A) # drop the duplicate values
What I want to do with this vector B is remove the elements that fall within a given range of the previously kept value. For example, if I have a sorted vector B = [1,2,5,7,8,11,20,25,30] and the range value I'd like to use is 10, then my code should output C = [1,11,25]. (2, 5, 7 and 8 were removed because they are less than 10 away from the element 1. The next element kept is 11. 20 is removed because it is less than 10 away from the element 11. The next kept element is 25, so 30 is removed.) You get the idea.
I wrote the code as following:
def RemoveViolations(vec, L):
    S = []
    P = 0  # pointer
    C = 0  # counter
    while C < vec.size:
        S.append(vec[C])
        preC = np.where(vec > S[P] + L)[0]
        if preC.size:
            C = preC[0]
        else:
            C = vec.size + 1
        P = P + 1
    return np.asarray(S)
So now I can call C = RemoveViolations(B, 10), which works like a charm.
Now, the issue is that this Python code is very slow. I have a sorted vector with about 1 million elements, and it takes quite a while to finish. Is there a better way to do this task?
If I need to use Cython, how would I change the code to work in a C++ environment? I heard it's not really complicated, but a quick search didn't work out well.
Thank you!
The complexity of your algorithm is the problem: np.where scans the whole vector on every iteration, so the run time grows roughly quadratically with the input size, whereas a single pass is enough. Here is a solution in pure Python that executes in under 0.15 s on my 8-year-old laptop (your implementation needed 200 seconds, i.e. about a 1300x improvement for n = 1000000):
import random

def get_filtered_values(dist, seq):
    prev_val = seq[0]
    compare_to = prev_val + dist
    filtered = [prev_val]
    for elt in seq[1:]:
        if elt <= compare_to:  # <-- change to `<` to match desired results;
                               #     this (`<=`) matches the results of your implementation
            continue
        else:
            compare_to = elt + dist
            filtered.append(elt)
    return filtered
B = [1,2,5,7,8,11,20,25,30]
print(get_filtered_values(10, B))
n = 1000000
C = sorted(list(set([random.randint(0, n) for _ in range(n)])))
get_filtered_values(10, C)
You can cythonize this code, or numpyize it as you wish, but it probably will not be necessary.
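For reference, a small usage sketch (assuming get_filtered_values from above is defined): it accepts the NumPy vector from the question directly, and the result can be turned back into a NumPy array if you need one.

import numpy as np

B = np.array([1, 2, 5, 7, 8, 11, 20, 25, 30])
# As written (with `<=`) this reproduces RemoveViolations(B, 10);
# with `<` it produces the desired [1, 11, 25].
C = np.asarray(get_filtered_values(10, B))
print(C)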

Possible corner cases that result in Errors

Recently I was trying out some past-year AIO questions, and I couldn't solve this one.
The problem is as follows:
Safe Cracking
Input File: safein.txt
Output File: safeout.txt
Time Limit: 1 second
Over the centuries many wars have been won, not through a battle of strength but a battle of wits. Your sources have recently informed you that your mother has purchased a pre-release copy of the latest computer game, WheeZork, and is hiding it in the safe at home. To open the safe, one must enter a long sequence of numbers by rotating the circular dial to each number in the correct order.
If the correct sequence of numbers is entered into the safe, the safe opens and you can sneak your Christmas present out early. If you get the sequence wrong the alarm system is activated and you will be grounded for life! Luckily, you know that your mother has written the code down on a sheet of paper in her bedside drawer, which she likes to call the "code sheet".
She often boasts about how clever she is and explains to you that the numbers on the sheet are indeed the code to the safe, however each number in the sequence has been increased by a constant non-negative integer, k, which only she knows. Without this value the sheet is useless, and thus only she can open the safe ... or so she thinks!
In order to determine the correct code, you have spied on your mother when she unlocks the safe and you have managed to remember part of the code that she entered. You are not sure which part of the code this corresponds to, but it is definitely a consecutive sequence of numbers in the code. Armed with this knowledge, you set out to determine the full code to the safe.
Your task is, given the full list of numbers your mother has written down on her code sheet, and the shorter sequence that you know appears in the actual code, determine the entire sequence to unlock the safe.
For example, if the code to the safe was 7, 9, 4, 6, 8, 12, and your mother had incremented all numbers by 4, her code sheet would read 11, 13, 8, 10, 12, 16. This is because 7 + 4 = 11, giving the first number 11. The second number is obtained by adding 9 + 4 = 13. The third number is obtained by adding 4 + 4 = 8, and so forth. You may have caught a glimpse of her entering the numbers 4, 6, 8, in order. With this knowledge, you can determine the entire code.
Input
The first line of the input file will contain two integers, a and b, separated by a space. The integer a is the length of the sequence written on your mother's code sheet (2 <= a <= 100000). The integer b is the length of the sequence that you know is contained within the code to the safe (2 <= b <= 30).
Following this will be a lines, each containing a single integer between 1 and 1000000. These lines are the sequence written on your mother's code sheet, in the order they are entered into the safe.
Following this will be b lines, each containing a single integer, also between 1 and 1000000. These lines describe the glimpse of the actual code to the safe.
You are guaranteed that there will only be one possible solution for any given input scenario.
Output
Your output file should consist of a lines. Each of these lines should contain a single integer, representing the full sequence of numbers required to open the safe.
My code (that passes most test cases) is as follows:
'''Program implements a function diff which returns a list of the differences
between consecutive elements of the input list.
To solve the problem, the subset list is reduced until it consists of a single integer.
The index of this integer in the (similarly reduced) original list is then found,
before determining the non-negative integer which was used to encode the numbers.'''

def diff(lst):
    lst2 = []
    for i in range(len(lst) - 1):
        lst2.append(lst[i + 1] - lst[i])
    return lst2

infile = open("safein.txt", "r")
outfile = open("safeout.txt", "w")
a, b = map(int, infile.readline().split())
lst, sub = [], []
for i in range(a):
    lst.append(int(infile.readline()))
for j in range(b):
    sub.append(int(infile.readline()))

temp_sub = sub
temp_lst = lst
k = 0
while len(temp_sub) != 1:
    temp_sub = diff(temp_sub)
    k += 1
for x in range(k):
    temp_lst = diff(temp_lst)
n = temp_lst.index(temp_sub[0])
m = lst[n] - sub[0]
lst = [x - m for x in lst]
for i in lst:
    outfile.write(str(i) + "\n")
As this code passes most test cases, with the exception of some cases that give an error (I do not know what error it is), I was wondering if anyone could suggest some corner cases that would lead to this algorithm creating an error. So far all the cases that I have thought of have passed.
EDIT:
As niemmi has pointed out below, there are some edge cases which my algorithm above cannot handle. As such, I have rewritten another algorithm to solve it. This algorithm passes most test cases with no errors, except that the execution takes longer than 1 s. Could anyone help reduce the time complexity of this solution?
def subset(lst1, lst2):
    if lst2[0] in lst1:
        idx = lst1.index(lst2[0])
        for i in range(len(lst2)):
            if lst2[i] == lst1[idx + i]:
                continue
            else:
                return False
    else:
        return False
    return True

infile = open("safein.txt", "r")
outfile = open("safeout.txt", "w")
a, b = map(int, infile.readline().split())
lst, sub = [], []
for x in range(a):
    lst.append(int(infile.readline()))
for y in range(b):
    sub.append(int(infile.readline()))

if subset(lst, sub):
    for i in range(a):
        outfile.write(str(int(lst[i])) + "\n")
    infile.close()
    outfile.close()
    exit()

i = 1
while True:
    temp_sub = [x + i for x in sub]
    if subset(lst, temp_sub):
        lst = [x - i for x in lst]
        for j in range(a):
            outfile.write(str(int(lst[j])) + "\n")
        infile.close()
        outfile.close()
        exit()
    i += 1
EDIT:
Thanks to niemmi, who provided a solution below, which I edited slightly to pass a test case that was returning an error.
def diff(seq):
    return (seq[i - 1] - seq[i] for i in range(1, len(seq)))

with open('safein.txt') as in_file:
    a, b = (int(x) for x in in_file.readline().split())
    code = [int(in_file.readline()) for _ in range(a)]
    plain = [int(in_file.readline()) for _ in range(b)]

code_diff = tuple(diff(code))
plain_diff = tuple(diff(plain))
k = 0

def index(plain_diff, code_diff, plain, code, a, b, k):
    for i in range(k, a - b + 1):  # + 1 so the last possible start position is also checked
        for j, x in enumerate(plain_diff, i):
            if code_diff[j] != x:
                break
        else:
            k = code[i] - plain[0]
            break  # found match, break outer loop
    return k

k = index(plain_diff, code_diff, plain, code, a, b, k)

with open('safeout.txt', 'w') as out_file:
    out_file.write('\n'.join(str(x - k) for x in code))
Thanks!
The above implementation repeatedly calculates the differences of consecutive elements in the following lines:
while len(temp_sub) != 1:
    temp_sub = diff(temp_sub)
    k += 1
When run against the example input, temp_sub is [2, 2] after the first round and [0] after the second and final round. The implementation then applies the same reduction to temp_lst, which contains the incremented code, resulting in [-7, 7, 0, 2].
Then index is used to find the position of the value 0 in temp_lst, which is in turn used to deduce k. This approach obviously won't work if there's another 0 in temp_lst before the index you're trying to find. We can easily craft an input where this is the case, for example by adding 11 twice at the beginning of the code sheet, so that the full sheet becomes [11, 11, 11, 13, 8, 10, 12, 16].
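To see the failure concretely, here is a small sketch (using the diff function from the question) that reduces the crafted sheet and the glimpse the same way the original algorithm does:

sheet = [11, 11, 11, 13, 8, 10, 12, 16]   # code sheet with 11 added twice at the front
glimpse = [4, 6, 8]

reduced_glimpse = diff(diff(glimpse))      # [2, 2] -> [0]
reduced_sheet = diff(diff(sheet))          # [0, 2, -7, 7, 0, 2]

# index() finds the first 0 at position 0 instead of the correct position 4,
# so m is deduced as 11 - 4 = 7 instead of the true offset 4.
print(reduced_sheet.index(reduced_glimpse[0]))  # 0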
EDIT: Why not just use the initial approach of differences of subsequent numbers to find k? The code below loops over the code sheet and, for each position, checks whether the plain sequence can start there, i.e. whether the number is equal to or greater than the first number in the plain sequence, since k was defined to be a non-negative integer. It then loops over the next b - 1 numbers on both the code sheet and the plain sequence to see whether the differences match.
Worst-case time complexity is O(ab); if that's not good enough you could utilize KMP for faster matching (a hedged sketch of that idea follows the code below).
with open('safein.txt') as in_file:
    a, b = (int(x) for x in in_file.readline().split())
    code = [int(in_file.readline()) for _ in range(a)]
    plain = [int(in_file.readline()) for _ in range(b)]

for i in range(a):
    k = code[i] - plain[0]
    if k < 0:
        continue
    for j in range(1, b):
        if code[i] - code[i + j] != plain[0] - plain[j]:
            break
    else:
        break  # found match, break outer loop

with open('safeout.txt', 'w') as out_file:
    out_file.write('\n'.join(str(x - k) for x in code))
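If the O(ab) scan ever becomes the bottleneck, the same matching can be done in O(a + b) with KMP on the difference sequences, since the differences are unaffected by adding the constant k. Below is a sketch of the standard algorithm (not tested against the grader; you may still need to verify k >= 0 and keep searching, as in the loop above):

def diffs(seq):
    return [seq[i + 1] - seq[i] for i in range(len(seq) - 1)]

def kmp_find(pattern, text):
    """Index of the first occurrence of pattern in text, or -1."""
    failure = [0] * len(pattern)  # failure[i]: length of the longest proper prefix
    j = 0                         # of pattern[:i+1] that is also its suffix
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j
    j = 0
    for i, x in enumerate(text):
        while j and x != pattern[j]:
            j = failure[j - 1]
        if x == pattern[j]:
            j += 1
            if j == len(pattern):
                return i - j + 1
    return -1

# i = kmp_find(diffs(plain), diffs(code))
# k = code[i] - plain[0]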

Python: Performance when looping and manipulating a large array

My question is two-fold:
1. Is there a way to both efficiently loop over and manipulate an array (using enumerate, for example) and manipulate the loop at the same time?
2. Are there any memory-optimized versions of arrays in Python (like NumPy creating smaller arrays with a specified type)?
I have made an algorithm finding prime numbers in range (2 - rng) with the Sieve of Eratosthenes.
Note: the problem is nonexistent when searching for primes from 2 to 1,000,000 (under 1 s total runtime, too). In the tens and hundreds of millions it starts to hurt. So far, by changing the table from holding all natural numbers to just the odd ones, the rough maximum range I was able to search was 400 million (200 million odd numbers).
Using while loops instead of for loops decreases performance, at least with the current algorithm.
NumPy, while able to create smaller arrays through type conversion, actually takes roughly double the time to process with the same code, with
oddTable = np.int8(np.zeros(size))
in place of
oddTable = [0] * size
and using integers to assign the values "prime" and "not prime" to keep the array type.
Using pseudo-code, the algorithm would look like this:
oddTable = [0] * size  # Array representing odd numbers excluding 1, up to rng
for item in oddTable:
    if item == 0:  # Prime, since not a product of any previous prime
        set item to "prime"
        set every multiple of item in oddTable to "not prime"
Python is a neat language, particularly when looping over every item in a list, but since the index in, say,
for i in range(1000)
can't be manipulated while in the loop, I had to convert the range a few times to produce an iterable to use. In the code: "P" marks prime numbers, "_" marks non-primes, and 0 marks not yet checked.
num = 1  # Primes found (2 is prime)
size = int(rng / 2) - 1  # Size of table required to represent odd numbers
oddTable = [0] * size  # Array with odd numbers \ 1: [3, 5, 7, 9...]

new_rng = int((size - 1) / 3)  # To go through every 3rd item
for i in range(new_rng):  # Eliminate no % 3's
    oddTable[i * 3] = "_"
oddTable[0] = "P"  # Set 3 to prime
num += 1

def act(x):  # The actual integer index x in table refers to
    x = (x + 1) * 2 + 1
    return x

# Multiples of 2 and 3 eliminated, so all primes are 6k + 1 or 6k + 5
# In the oddTable: remaining primes are either 3*i + 1 or 3*i + 2
# new_rng to loop exactly 1/3 of the table length -> touch every item once
for i in range(new_rng):
    j = 3*i + 1  # 3*i + 1
    if oddTable[j] == 0:
        num += 1
        oddTable[j] = "P"
        k = act(j)
        multiple = j + k  # The odd multiple indexes of act(j)
        while multiple < size:
            oddTable[multiple] = "_"
            multiple += k
    j += 1  # 3*i + 2
    if oddTable[j] == 0:
        num += 1
        oddTable[j] = "P"
        k = act(j)
        multiple = j + k
        while multiple < size:
            oddTable[multiple] = "_"
            multiple += k
To make your code more pythonic, split your algorithm into smaller chunks (functions), so that each chunk can be grasped easily.
My second comment might astound you: Python comes with "batteries included". In order to program your sieve of Eratosthenes, why do you need to manipulate arrays explicitly and pollute your code with them? Why not create a function (e.g. is_prime) and use the standard memoize decorator provided for that purpose? (If you insist on using 2.7, see also the memoization library for Python 2.7.)
The result of the two pieces of advice above might not be the "most efficient", but it will (as I experienced with that exact problem) work well enough, while letting you quickly create sleek code that saves you programmer time (both for creation and maintenance).
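To illustrate that suggestion, here is a minimal sketch using functools.lru_cache as the memoize decorator (is_prime is plain trial division here, so this is about code clarity and caching repeated queries rather than beating a sieve on raw speed):

from functools import lru_cache

@lru_cache(maxsize=None)
def is_prime(n):
    # Trial-division primality test; the cache remembers every n already asked about.
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

print(sum(1 for n in range(2, 100000) if is_prime(n)))  # 9592 primes below 100,000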

Python - Streamlining sudoku solver code

I am writing a script to efficiently solve a sudoku puzzle, but there's one part of my code that I think is extremely ugly and want to streamline.
def square(cell):
    rows = 'ABCDEFGHI'
    cols = '123456789'
    cell_row = cell[0][0]
    cell_col = cell[0][1]
    if cell_row in rows[0:3]:
        x = 'A'
    if cell_row in rows[3:6]:
        x = 'B'
    if cell_row in rows[6:9]:
        x = 'C'
    if cell_col in cols[0:3]:
        y = 'a'
    if cell_col in cols[3:6]:
        y = 'b'
    if cell_col in cols[6:9]:
        y = 'c'
    return (['Aa', 'Ab', 'Ac', 'Ba', 'Bb', 'Bc', 'Ca', 'Cb', 'Cc'].index(x + y)) + 1
Given that a sudoku board is composed of nine 3x3 squares, the purpose of this function is to take the coordinates of a cell on the board and return the number of the 3x3 square the cell belongs to (where the square in the top left is number 1 and the bottom right is number 9). The input cell is of the form ['A5', 6], where A indicates the row, 5 the column, and 6 the value of the cell.
The code that I have works but there's got to be a much more efficient or presentable way of doing it. I would be grateful for any suggestions.
Personally, I don't think magic numbers like '65' and '97' make the solution more presentable! How about:
def square(cell):
    rows = 'ABCDEFGHI'
    cell_row = rows.index(cell[0][0])
    cell_col = int(cell[0][1]) - 1
    return 3 * (cell_row // 3) + cell_col // 3 + 1
I was able to make a greatly simplified version of your formula. I started by assigning both the row and column a 0-based index. Then I used integer division to keep only the information about which 3-block the square is in. Since moving down a 3-block of rows increases the index by 3 while moving right only increases it by 1, I multiply the row index by 3 after the division. Here's the finished function:
def square(cell):
    coords = (ord(cell[0][0]) - 65, int(cell[0][1]) - 1)
    return 3 * (coords[0] // 3) + coords[1] // 3 + 1
Edit: Fixed an off-by-one - even though I would rather start at 0, as you'll probably want to use the returned value as an index into another (sub-)array.
And as I cannot comment on other answers yet, just my 2 cents here:
cdlane's answer is slightly slower than the one presented here. If you get rid of the .lower() (I assume you don't care about fail-safes at this point) and use Brien's answer, you gain another slight performance boost. I don't know how often you'll evaluate square(), but maybe it's worth ditching readability for performance ;)
I think the attached snippet should do the trick.
def square(cell):
    # http://www.asciitable.com/
    # https://docs.python.org/3/library/functions.html#ord
    row = ord(cell[0][0].lower()) - 97
    column = int(cell[0][1]) - 1
    return 3 * (row // 3) + column // 3 + 1
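A quick usage check (the expected values follow from the question's definition; all three variants above agree on these cells):

print(square(['A5', 6]))  # 2 -- row A, column 5 is in the top-middle square
print(square(['I9', 3]))  # 9 -- bottom-right square
print(square(['E1', 7]))  # 4 -- middle band, leftmost square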
