How to import my data into Python

I'm currently working on Project Euler 18 which involves a triangle of numbers and finding the value of the maximum path from top to bottom. It says you can do this project either by brute forcing it or by figuring out a trick to it. I think I've figured out the trick, but I can't even begin to solve this because I don't know how to start manipulating this triangle in Python.
https://projecteuler.net/problem=18
Here's a smaller example triangle:
3
7 4
2 4 6
8 5 9 3
In this case, the maximum route would be 3 -> 7 -> 4 -> 9 for a value of 23.
Some approaches I considered:
I've used NumPy quite a lot for other tasks, so I wondered if an array would work. For that 4 number base triangle, I could maybe do a 4x4 array and fill up the rest with zeros, but aside from not knowing how to import the data in that way, it also doesn't seem very efficient. I also considered a list of lists, where each sublist was a row of the triangle, but I don't know how I'd separate out the terms without going through and adding commas after each term.
Just to emphasise, I'm not looking for a method or a solution to the problem, just a way I can start to manipulate the numbers of the triangle in Python.

Here is a little snippet that should help you with reading the data:
rows = []
with open('problem-18-data') as f:
    for line in f:
        rows.append([int(i) for i in line.rstrip('\n').split(" ")])
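For the small example triangle, rows ends up as [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]], which you can index by row and then by position. Here is a minimal self-contained sketch of the same idea, with the data embedded as a string purely so it runs without the data file:

# Same parsing idea as above, but with the example triangle embedded as text.
triangle_text = "3\n7 4\n2 4 6\n8 5 9 3"

rows = [[int(i) for i in line.split()] for line in triangle_text.splitlines()]

print(rows)        # [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
print(rows[3][2])  # 9 -- the third value on the bottom row
print(len(rows))   # 4 -- number of rows in the triangle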

Related

Efficient nearest neighbour search on streaming data

I'm looking to harness the magic of Python to perform a nearest neighbour (NN) search on a dataset that will grow over time, for example because new entries arrive through streaming. On static data, a NN search is doable as shown below. To start off the example, e and f are two sets of static data points. For each entry in e we need to know which one in f is nearest to it. It's simple enough to do:
import pandas as pd
from sklearn.neighbors import BallTree as bt

e = pd.DataFrame({'lbl': ['a', 'b', 'c'], 'x': [1.5, 2, 2.5], 'y': [1.5, 3, 2]})
f = pd.DataFrame({'x': [2, 2], 'y': [2, 1.5]})

def runtree():
    tree = bt(f[['x', 'y']], 2)
    dy, e['nearest'] = tree.query(e[['x', 'y']], 1)
    return e
runtree() returns e with the index of the nearest data point in the final column:
  lbl    x    y  nearest
0   a  1.5  1.5        1
1   b  2.0  3.0        0
2   c  2.5  2.0        0
Now, let's treat f as a dataframe that will grow over time, and add a new record to it:
f.loc[2]=[2.5]+[1.75]
When running runtree() again, the record with lbl=c is closer to the new entry (the bottom right entry shows index=2 now). Before the new entry was added, the same record was closest to index 0 (see above):
  lbl    x    y  nearest
0   a  1.5  1.5        1
1   b  2.0  3.0        0
2   c  2.5  2.0        2
The question is, is there a way to get this final result without rerunning runtree() for all the records in e, but instead refreshing only the ones that are relatively close to the new entry we've added in f? In the example it would be great to have a way to know that only the final row needs to be refreshed, without running all the rows to find out.
To put this into context: in the example for e above, we have three records in two dimensions. A real world example could have millions of records in more than two dimensions. It seems inefficient to rerun all the calculations every time that a new record arrives in f. A more efficient method might factor in that some of the entries in e are nowhere near the new one in f so they should not need updating.
It might be possible to delve into Euclidean distance maths, but my sense is that all the heavy lifting has already been done in packages like BallTree.
Does a package exist that can do what is needed here, on growing rather than static data, without lifting the bonnet on some serious math?
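Not a ready-made package, but one low-tech way to get the "only refresh what can change" behaviour is to remember each record's current nearest distance: when a new point arrives in f, a single vectorised pass compares the distance to the new point against the stored nearest distance and updates only the rows where the new point wins. This is a hedged sketch (plain Euclidean distance, no tree rebuild), not a streaming NN library:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

e = pd.DataFrame({'lbl': ['a', 'b', 'c'], 'x': [1.5, 2, 2.5], 'y': [1.5, 3, 2]})
f = pd.DataFrame({'x': [2, 2], 'y': [2, 1.5]})

# Initial full query: store both the index and the distance of the nearest point in f.
tree = BallTree(f[['x', 'y']])
dist, idx = tree.query(e[['x', 'y']], 1)
e['nearest'] = idx[:, 0]
e['nearest_dist'] = dist[:, 0]

def add_point(new_xy):
    """Append one point to f and refresh only the e rows it can possibly affect."""
    new_idx = len(f)
    f.loc[new_idx] = new_xy
    # Distance from every e row to the new point only -- no tree rebuild, no full re-query.
    d_new = np.hypot(e['x'] - new_xy[0], e['y'] - new_xy[1])
    closer = d_new < e['nearest_dist']
    e.loc[closer, 'nearest'] = new_idx
    e.loc[closer, 'nearest_dist'] = d_new[closer]

add_point([2.5, 1.75])
print(e)   # only the row with lbl == 'c' switches to the new index 2

Each insertion is still linear in the size of e, but it avoids rebuilding the tree and re-querying every record; for sub-linear updates you would also need a spatial index over e itself.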

Is it possible to break down a NumPy array to run through one different value in every iteration?

So, I have an excel spreadsheet which I want my programme to be able to access and get data from.
To do this I installed pandas and have managed to import the spreadsheet into the code.
import pandas as pd

Satellites_Path = r'C:\Users\camer\Documents\Satellite_info.xlsx'
df = pd.read_excel(Satellites_Path, engine='openpyxl')
So this all works.
The problem is that I want it to grab a piece of data, say the distance between two things, and run that number through a loop. Then it should move one row down in the spreadsheet and do the same for the new number, until it reaches the end of the column and stops.
The data file reads as:
   Number ObjectName  DistanceFromEarth(km)  Radius(km)  Mass(10^24kg)
0       0      Earth                    0.0      6378.1        5.97240
1       1       Moon               384400.0      1783.1        0.07346
I put the 'Number' column in because I thought I could loop while Number is less than a limit and run through these numbers, but I have found that the data frame doesn't behave like an array or an integer, so that doesn't work.
Since then, I have tried to get it into integer form by turning the columns into NumPy arrays:
import numpy as np

N = df.loc[:, 'Number']
D = np.array(df.loc[:, 'DistanceFromEarth(km)'])
R = np.array(df.loc[:, 'Radius(km)'])
However, the arrays are still problematic. I have tried to split them up like:
a = (np.array(N))
print(a)
newa = np.array_split(a,3)
and this sort of now works but as a test, I made this little bit and it repeats infinitely:
while True:
    if (newa[0]) < 1:
        print(newa)
If the 1 is changed to a 0, it prints once and then stops. I just want it to run a couple of times.
What I am getting at is: is it possible to read this file, grab a number from it, run through calculations using it, and then repeat that for the next satellite in the list? The reason I want to do it this way is that I am going to make quite a long list. I already have a working simulation of the local planets in the solar system, but I wanted to add more bodies, and doing it the way I was would make the code extremely long, very dense, and introduce more problems.
Reading from an Excel file would make my life a lot easier and more future-proof, but I don't know if it's possible and I can't find anything similar online.
pandas is absolutely a good choice here, but a lack of familiarity with it appears to be holding you back.
Here are a couple of examples that may be applicable to your situation:
Simple row by row calculations to make a new column:
df['Diameter(km)'] = df['Radius(km)']*2
print(df)
Output:
   Number ObjectName  DistanceFromEarth(km)  Radius(km)  Mass(10^24kg)  Diameter(km)
0       0      Earth                    0.0      6378.1        5.97240       12756.2
1       1       Moon               384400.0      1783.1        0.07346        3566.2
Running each row through a function:
def do_stuff(row):
    print(f"Number: {row['Number']}", f"Object Name: {row['ObjectName']}")

df.apply(do_stuff, axis=1)
Output:
Number: 0 Object Name: Earth
Number: 1 Object Name: Moon
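If you would rather keep an explicit loop that walks the sheet one satellite at a time (closer to the wording of the question), you can iterate over the columns directly. The volume formula below is just a placeholder calculation, and the hand-built DataFrame stands in for the one read from Excel:

import pandas as pd

# Stand-in for: df = pd.read_excel(Satellites_Path, engine='openpyxl')
df = pd.DataFrame({
    'Number': [0, 1],
    'ObjectName': ['Earth', 'Moon'],
    'DistanceFromEarth(km)': [0.0, 384400.0],
    'Radius(km)': [6378.1, 1783.1],
    'Mass(10^24kg)': [5.97240, 0.07346],
})

# One pass over the sheet: grab each row's numbers, run a calculation,
# then move on; the loop ends by itself after the last row.
for name, dist_km, radius_km in zip(df['ObjectName'],
                                    df['DistanceFromEarth(km)'],
                                    df['Radius(km)']):
    volume = 4 / 3 * 3.141592653589793 * radius_km ** 3   # placeholder per-satellite calculation
    print(f"{name}: {dist_km} km away, volume ~ {volume:.3e} km^3")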

Random Sudoku Generator

I'm trying to build a python script that generates a 9x9 block with numbers 1-9 that are unique along the rows, columns and within the 3x3 blocks - you know, Sudoku!
So, I thought I would start simple and get more complicated as I went. First I made it randomly populate each array value with a number 1-9. Then I made sure numbers along rows weren't replicated. Next, I wanted to do the same for both rows and columns. I think my code is OK - it's certainly not fast, but I don't know why it jams up.
import numpy as np
import random
#import pdb
#pdb.set_trace()

#Soduku solver!
#Number input
soduku = np.zeros(shape=(9,9))
for i in range(0,9,1):
    for j in range(0,9,1):
        while True:
            x = random.randint(1,9)
            if x not in soduku[i,:] and x not in soduku[:,j]:
                soduku[i,j] = x
                if j == 8: print(soduku[i,:])
                break
So it moves across the columns populating them with random ints, drops a row and repeats. The most the code should really need to do is generate 9 numbers for each square if it's really unlucky - I think if we worked it out it would be fewer than 9*9*9 generated values in total. Something is breaking it!
Any ideas?!
I think what's happening is that your code is getting stuck in your while-loop. You test for the condition if x not in soduku[i,:] and x not in soduku[:,j], but what happens if this condition is not met? It's very likely that your code is running into a dead-end sudoku board (can't be solved with any values), and it's getting stuck inside the while-loop because the condition to break can never be met.
Generating it like this is very unlikely to work. There are many ways you can generate 8 of the 9 3x3 squares that make it impossible to fill in the last square at all, making it hang forever.
Another approach would be to fill in the numbers one value at a time (so, all the 1s first, then all the 2s, etc.). It would be like the eight queens puzzle, but with 9 queens. And when you get to a position where it is impossible to place a number, restart.
Another approach would be to start all the squares at 9 and strategically decrement them somehow, e.g. first decrement all the ones that cannot be 9, excluding the 9s in the current row/column/square, then if they are all impossible or all possible, randomly decrement one.
You could also try to enumerate all sudoku boards and then invert the enumeration function on a random integer. I don't know how practical this is, but it is the only one of these methods where boards would be chosen with uniform randomness.
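To make the dead-end problem and the "restart" idea concrete, here is a minimal sketch (not your original code) that keeps the same row/column check but detects the stuck state instead of looping forever, and simply starts the whole board again. Each attempt is cheap, though it may take a fair number of restarts before one succeeds:

import numpy as np
import random

def try_fill():
    """One attempt at a 9x9 grid with no repeats in any row or column.
    Returns the grid, or None if it reaches a cell with no legal digit."""
    grid = np.zeros((9, 9), dtype=int)
    for i in range(9):
        for j in range(9):
            options = [x for x in range(1, 10)
                       if x not in grid[i, :] and x not in grid[:, j]]
            if not options:
                return None          # dead end -- abandon this attempt
            grid[i, j] = random.choice(options)
    return grid

attempts = 0
board = None
while board is None:
    attempts += 1
    board = try_fill()
print(f"Succeeded after {attempts} attempt(s):")
print(board)

Note that this only enforces the row and column constraints, like the code in the question; adding the 3x3-block test to the options list works the same way but typically needs more restarts.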
You are coming at the problem from a difficult direction. It is much easier to start with a valid Sudoku board and play with it to make a different valid Sudoku board.
An easy valid board is:
1 2 3 | 4 5 6 | 7 8 9
4 5 6 | 7 8 9 | 1 2 3
7 8 9 | 1 2 3 | 4 5 6
---------------------
2 3 4 | 5 6 7 | 8 9 1
5 6 7 | 8 9 1 | 2 3 4
8 9 1 | 2 3 4 | 5 6 7
---------------------
3 4 5 | 6 7 8 | 9 1 2
6 7 8 | 9 1 2 | 3 4 5
9 1 2 | 3 4 5 | 6 7 8
Having found a valid board you can make a new valid board by playing with your original.
You can swap any row of three 3x3 blocks with any other block row. You can swap any column of three 3x3 blocks with another block column. Within each block row you can swap single cell rows; within each block column you can swap single cell columns. Finally you can permute the digits so there are different digits in the cells as long as the permutation is consistent across the whole board.
None of these changes will make a valid board invalid.
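Here is a rough sketch of those moves in code, assuming you start from the easy valid board above (rebuilt here with a small formula so the snippet is self-contained); every board it prints should still be valid:

import numpy as np

# The easy valid board shown above, generated by shifting each row.
base = np.array([[(i * 3 + i // 3 + j) % 9 + 1 for j in range(9)]
                 for i in range(9)])

rng = np.random.default_rng()

def shuffled_board(board):
    """Apply the validity-preserving moves described above to get a new valid board."""
    b = board.copy()
    # Swap whole block rows (bands of three rows), then whole block columns.
    b = np.vstack([b[3 * k:3 * k + 3] for k in rng.permutation(3)])
    b = np.hstack([b[:, 3 * k:3 * k + 3] for k in rng.permutation(3)])
    # Swap single rows within each block row and single columns within each block column.
    for k in range(3):
        b[3 * k:3 * k + 3] = b[3 * k + rng.permutation(3)]
        b[:, 3 * k:3 * k + 3] = b[:, 3 * k + rng.permutation(3)]
    # Relabel the digits with one consistent permutation of 1..9 across the whole board.
    relabel = np.concatenate(([0], rng.permutation(np.arange(1, 10))))
    return relabel[b]

print(shuffled_board(base))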
I use permutations(range(1,10)) from itertools to create a list of all possible rows. Then I place rows into the sudoku from top to bottom, one by one. If a contradiction occurs, I use another row from the list. With this approach I can find valid completed sudoku boards in a short time; it keeps generating completed boards, each within a minute.
Then I remove numbers from the valid completed sudoku board one by one at random positions. After removing each number, I check whether the board still has a unique solution. If not, I restore the original number and move on to the next random position. Usually I can remove 55-60 numbers from the board. This also takes under a minute, so it is workable.
However, the first few completed boards generated this way all have 1,2,3,4,5,6,7,8,9 in the first row, so I shuffled the whole list. After shuffling the list, it becomes difficult to generate a completed sudoku board at all. Mission fails.
A better approach may be this: collect some sudokus from the internet and complete them so they can be used as seeds. Remove numbers from them as described above. That gives you some sudokus, and you can use them to generate more with any of the following methods:
swap row 1 and row 3, or row 4 and row 6, or row 7 and row 9
similar method for columns
swap 3x3 blocks 1,4,7 with 3,6,9 or 1,2,3 with 7,8,9 correspondingly.
mirror the sudoku vertical or horizontal
rotate 90, 180, 270 the sudoku
randomly permute the numbers on the board, for example 1->2, 2->3, ..., 8->9, 9->1; or just swap two of them, e.g. 1->2 and 2->1. This also works.

Python bug when creating matrices

I have written code in Python to create a transition probability matrix from my data, but I keep getting wrong values for two specific data points. I have spent several days trying to figure out the problem, with no success.
About the code: the input is 4 columns in a csv file. After preparing the data, the first two columns hold the new and old state values. I need to calculate how often each old state value transfers to a new one (basically, how often each pair (x,y) occurs in the first two columns of the data). The values in these columns are from 0 to 99. In the trans_pr matrix I want the count of how often a pair (x,y) occurs in the data, stored at the corresponding coordinates (x,y). Since the values are from 0 to 99 I can just add 1 to the matrix at these coordinates each time they occur in the data.
The problem: the code runs fine, but I always get zeros at coordinates (:,29), (:,58), (29,:) and (58,:) despite having observations there. It also sometimes seems to add the count for these coordinates to the previous line. Again, this doesn't make any sense to me.
I would be very grateful if anyone could help. (I am new to Python, therefore the code is probably inefficient, but only the bug is relevant.)
The code is as simple as it can be:
from numpy import *
import csv
my_data = genfromtxt('99c_test.csv', delimiter=',')
"""prepares data for further calculations"""
my_data1=zeros((len(my_data),4))
my_data1[1:,0]=100*my_data[1:,0]
my_data1[1:,1]=100*my_data[1:,3]
my_data1[1:,2]=my_data[1:,1]
my_data1[1:,3]=my_data[1:,2]
my_data2=my_data1
trans_pr=zeros((101,101))
print my_data2
"""fills the matrix with frequencies of observations"""
for i in range(len(my_data2)):
    trans_pr[my_data2[i,1], my_data2[i,0]] = trans_pr[my_data2[i,1], my_data2[i,0]] + 1
c = csv.writer(open("trpr1.csv", "wb"))
c.writerows(trans_pr)
You can test the code with this input (just save it as csv file):
p_cent,p_euro,p_euro_old,p_cent_old
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.57
0.01,1,1,0.58
0.01,1,1,0.59
0.01,1,1,0.6
This sounds very much like a rounding issue. I'd suppose that e.g. 100*0.29 (as a floating point number) is rounded downwards (i.e. truncated) and thus yields 28 instead of 29. Try rounding the numbers yourself (i.e. proper round-to-nearest) before using them as an array index.
Update: verified my conjecture by testing it; the numbers come out exactly as described above.
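A quick way to see the effect for yourself (a tiny sketch, assuming standard double-precision floats):

import numpy as np

x = 100 * 0.29
print(x == 29)            # False: the product lands just below 29
print(int(x))             # 28 -- plain truncation, so row/column 29 never gets hit
print(int(np.rint(x)))    # 29 -- round to the nearest integer before indexing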
You may find rint() useful, from NumPy. It rounds a value to its nearest integer (see the numpy.rint() doc). Have you tried the following:
for i in range(len(my_data2)):
    # round to the nearest integer (and cast, since newer NumPy versions reject float indices)
    row = int(rint(my_data2[i, 1]))
    col = int(rint(my_data2[i, 0]))
    trans_pr[row, col] = trans_pr[row, col] + 1

Counting problem: possible sudoku tables?

I'm working on a sudoku solver (Python). My method uses a game tree and explores the possible permutations for each set of digits with a DFS algorithm.
To analyse the problem, I want to know the count of possible valid and invalid sudoku tables,
that is, 9x9 tables that have nine 1s, nine 2s, ..., nine 9s.
(This isn't an exact duplicate of this question.)
My solution is:
1. First select 9 cells for the 1s: C(81,9) (*)
2. Do the same for the other digits (each time, 9 more cells are removed from the remaining available cells): C(81-9,9) * C(81-2*9,9) * ...
3. Finally multiply the result by 9! (the permutations of the 1s, 2s, ..., 9s in (*)).
This is not equal to the accepted answer of this question, but the problems are equivalent. What did I do wrong?
The number of valid Sudoku solution grids for the standard 9×9 grid was calculated by Bertram Felgenhauer and Frazer Jarvis in 2005 to be 6,670,903,752,021,072,936,960.
Source: Mathematics of Sudoku
I think the problem with your solution is that deleting 9 cells at a time from the available cells does not necessarily create a valid grid; just deleting 9 cells won't suffice.
That is why 81! / (9!)^9 is a much bigger number than the count of actual valid solutions.
EDIT:
Permutations with repeated elements
Your solution is almost correct if you want all the tables, not just the valid sudoku tables.
There is a formula:
(a+b+c+...)! / (a! b! c! ...)
Suppose there are 5 boys and 3 girls and we have 8 seats; then the number of different ways in which they can be seated (treating the boys as interchangeable and the girls as interchangeable) is
(5+3)! / (5! 3!)
Your problem is analogous to this one.
There are nine 1s, nine 2s, ..., nine 9s
and 81 places,
so the answer should be (9+9+...+9)! / (9!)^9 = 81! / (9!)^9.
Now if you multiply again by 9!, you add duplicate arrangements: the digits were already assigned when you chose the cells for each of them, so permuting them just recounts tables you already have.
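A quick numeric check of that argument (a sketch using only the standard library): the step-by-step product of binomial coefficients from the question already equals 81! / (9!)^9, so the extra factor of 9! is exactly the overcount:

from math import comb, factorial

# Product C(81,9) * C(72,9) * ... * C(9,9): choose the cells for the 1s, then the 2s, ...
step_by_step = 1
remaining = 81
for _ in range(9):
    step_by_step *= comb(remaining, 9)
    remaining -= 9

multinomial = factorial(81) // factorial(9) ** 9
print(step_by_step == multinomial)   # True -- the digit assignment is already counted
print(multinomial)                   # total number of tables (valid or not)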
According to this Wikipedia article (or this OEIS sequence), there are roughly 6.6 * 10^21 different sudoku squares.
What you did wrong was the last step: you shouldn't multiply the answer by 9!. You have already counted all possible squares.
This doesn't help you much when counting the possible Sudoku tables, though. One other thing you could do is count the tables where only the "row condition" holds: that is just (9!)^9, because you choose one permutation of 1..9 for every row.
Still closer to the Sudoku problem is counting Latin squares. A Latin square has to satisfy both the row condition and the column condition. That is already a difficult problem and no closed-form formula is known. A Sudoku grid is a Latin square with the additional subsquare condition.
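To get a feel for how quickly even the Latin-square count grows, here is a small brute-force counter (a sketch; it is only practical up to about n = 4, which is exactly why nobody counts 9x9 grids this way):

from itertools import permutations

def count_latin_squares(n):
    """Count n x n Latin squares by trying every permutation of 1..n for each row."""
    rows = list(permutations(range(1, n + 1)))

    def extend(grid):
        if len(grid) == n:
            return 1
        total = 0
        for row in rows:
            # Column condition: no value may repeat within any column.
            if all(row[c] != prev[c] for prev in grid for c in range(n)):
                total += extend(grid + [row])
        return total

    return extend([])

for n in range(1, 5):
    print(n, count_latin_squares(n))   # 1, 2, 12, 576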
