How to modify and access elements, with numpy arrays - python

Recently, I was working on a data mining project for school using Python, PyCharm, and NumPy arrays. My goal was to compute the covariance matrix without using .cov(). The data set is about 19,000 x 11; I used a 12 x 11 subset for testing. To center the data, I wrote a method, center(self, data). It is a for loop that takes a column slice (data[:, i]) of the 2-D array and iterates through it, replacing each value with that value minus the column mean (val = val - columnMean). This is the loop:
for i in range(len(data[0])):
    for j in range(len(data[:])):
        data[:, i][j] = data[:, i][j] - data[:, i].mean()
I have run this code and dozens of variations of it, literally, hundreds of times, but the assignment never happens. The best I can figure is that I'm not using a conda environment with pycharm. I downloaded anaconda3, but I can't find the conda.exe for the path, however I'm not even sure this is the problem.
These are the imports in the program:
#!/usr/bin/python
import os
import sys
import pandas as pd
import csv
import numpy as np
import random
This is the actual function:
class AssignmentThree:
    def __init__(self, file):
        self.data = -1

    def center(self, data):
        d = data
        for col in range(len(d[0])):
            mean = d[:, col].mean()
            for row in range(len(d[:, 0])):
                d[row][col] = d[row][col] - mean
                # Originally I used d[:, i][row] = d[:, i][row] - mean
This is a sample of the "data" in file "magic04.data":
28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g
51.624,21.1502,2.9085,0.242,0.134,50.8761,43.1887,9.8145,3.613,238.098,g
48.2468,17.3565,3.0332,0.2529,0.1515,8.573,38.0957,10.5868,4.792,219.087,g
The file name was passed as a command-line argument using sys, and the data was loaded by a separate function as follows:
Afile = open(file)
self.data = pd.read_csv(Afile, header=None, delimiter=',', usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
What I found is that I can assign a variable like "d[:, i] = d[:, i].mean()" without a problem, but:
"d[:, i][row] = d[:, i][row] - d[:, i].mean()" or
"d[row][col] = d[row][col] - mean"
never assigns anything to "d[:, i][row]" or "d[row][col]"; the value remains unchanged. To top it off, when I first ran the program the first "d[:, i].mean()" was equal to 0, which explained why the value never changed; however, I have since run the code with other hard-coded values and the behavior persists. The code never throws any warnings or errors.
If anyone has some insight it would be greatly appreciated.
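For comparison, the centering step described above can be written without element-wise loops. This is only a minimal sketch, assuming the data is available as (or can be converted to) a float NumPy array, for example via to_numpy() on the DataFrame returned by pd.read_csv. The function name center and the tiny sample array are illustrative and not taken from the assignment.

```python
import numpy as np

def center(data):
    """Return a column-centered float copy of a 2-D array-like."""
    arr = np.asarray(data, dtype=float)  # also accepts a pandas DataFrame
    # arr.mean(axis=0) has one mean per column; broadcasting subtracts
    # it from every row.
    return arr - arr.mean(axis=0)

# Hypothetical sample data, 3 rows x 2 columns
sample = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
centered = center(sample)

# Sample covariance matrix built from the centered data (no .cov() call)
cov = centered.T @ centered / (len(sample) - 1)
print(centered)
print(cov)
```

Because the function returns a new array instead of assigning element by element, the result does not depend on whether an intermediate slice was a view or a copy.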

Related

Python - Having Severe Issues with Timeit

I am attempting to use timeit to keep track of how long a sorting algorithm takes to finish. However, it seems I can't even find an answer online how to exactly run timeit with functions that were originally written in another module. I've tried various setups and string inputs but am finding myself lost on this.
I tried using "t = timeit.Timer('sort.bubble(temp_array)')," but printing out the timer objects only gives me the memory addresses and cannot be converted to an integer...
In this case, I am calling bubble sort from another module.
# This section is in the timetests.py file
import random
import sort
import timeit
test_array1 = [random.randint(0, 500) for i in range(10)]
arrays_to_sort = [test_array1]
bubble_times = []
for a in range(len(arrays_to_sort)):
    temp_array = arrays_to_sort[a]
    t = timeit(sort.bubble(temp_array))  # code is definitely not correct here
    bubble_times.append(t)
# This is the sort.py file
def bubble(list):
    for current_pass in range(len(list) - 1, 0, -1):
        for element in range(current_pass):
            # Swap the elements if the current one is larger than
            # the next one
            if list[element] > list[element + 1]:
                temp = list[element]
                list[element] = list[element + 1]
                list[element + 1] = temp
    return list
You need to make the function bubble and the variable temp_array available in the namespace that timeit executes in.
Try:
t = timeit.timeit("bubble(temp_array)", setup="from sort import bubble; from __main__ import temp_array")
bubble_times.append(t)
Explanation:
We call bubble() directly because the setup string imports the function itself with from sort import bubble.
You would write sort.bubble() if the setup imported the sort module rather than just the function from the module.
We also have to bring in temp_array (assuming timetests.py is run as the main module).
Using a lambda function
Another option is to use a lambda to create a zero-argument function and pass that to timeit. Because the lambda is defined in timetests.py, it can see sort and temp_array without a setup string. See also: how to pass parameters of a function when using timeit.Timer().
t = timeit.timeit(lambda: sort.bubble(temp_array))
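The lambda approach can be tried in a single file. The sketch below is an illustration, not the asker's exact setup: bubble is defined inline as a stand-in for sort.bubble, and number=1000 is an arbitrary choice (timeit.timeit defaults to 1,000,000 runs). Since bubble sorts in place, each run gets a fresh copy of the list; that copy is an addition of this sketch.

```python
import random
import timeit

def bubble(values):
    """In-place bubble sort, standing in for sort.bubble from the question."""
    for current_pass in range(len(values) - 1, 0, -1):
        for element in range(current_pass):
            if values[element] > values[element + 1]:
                values[element], values[element + 1] = (
                    values[element + 1],
                    values[element],
                )
    return values

temp_array = [random.randint(0, 500) for _ in range(10)]

# Sort a fresh copy on every run; otherwise every run after the first
# would time an already-sorted list.
elapsed = timeit.timeit(lambda: bubble(list(temp_array)), number=1000)
print(f"{elapsed:.6f} s for 1000 runs")
```

timeit.timeit returns the total elapsed time in seconds as a float, so it can be appended to bubble_times directly.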

Improving loop in loops with Numpy

I am using numpy arrays instead of pandas for speed. However, I have not managed to improve my code with broadcasting, indexing, etc. Instead, I am using nested loops as below. It works, but it seems ugly and inefficient to me.
Basically, I am imitating pandas' groupby at the step mydata[mydata[:,1]==i]; you may think of i as a firm id number. Then, for each entry of the lookup data, I check whether all of its values occur in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I said at the beginning, I am not comfortable with this.
out = []
for i in np.unique(mydata[:,1]):
    d = mydata[mydata[:,1]==i]
    for u in range(0,len(lookup)):
        control = all(np.isin(lookup[u],d[:,3]))
        if(control):
            out.append(d[np.isin(d[:,3],lookup[u])])
It takes about 0.27 seconds. However there must exist some clever alternatives.
I also tried Numba jit() but it does not work.
Could anyone help me about that?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0,60):
    lookup.append(np.random.randint(10,70,np.random.randint(3,6,1) ))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to 3 or 4 lines.
f = (
    lambda u: out.append(d[np.isin(d[:, 3], u)])
    if all(np.isin(u, d[:, 3]))
    else None
)
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list as your original code and runs almost twice as fast (at least on my machine).
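When refactoring a loop like this, it helps to check the new version against the original on the fake data. The sketch below is one possible variant, not the answerer's code: it swaps in a Python set for the "all values present" test, uses a seeded np.random.default_rng so the run is repeatable, and wraps both versions in hypothetical functions named original and compact. No timing claim is made; measure on your own data.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.repeat(np.arange(100) + 5000, rng.integers(50, 100, 100))
b = rng.integers(100, 200, len(a))
c = rng.integers(10, 70, len(a))
mydata = np.vstack((np.arange(len(a)), a, b, c)).T
lookup = [rng.integers(10, 70, int(rng.integers(3, 6))) for _ in range(60)]

def original(mydata, lookup):
    """The nested loop from the question."""
    out = []
    for i in np.unique(mydata[:, 1]):
        d = mydata[mydata[:, 1] == i]
        for u in lookup:
            if np.isin(u, d[:, 3]).all():
                out.append(d[np.isin(d[:, 3], u)])
    return out

def compact(mydata, lookup):
    """Same logic, with a set-based subset test per firm."""
    out = []
    for i in np.unique(mydata[:, 1]):
        d = mydata[mydata[:, 1] == i]
        present = set(d[:, 3].tolist())
        out.extend(
            d[np.isin(d[:, 3], u)] for u in lookup if set(u.tolist()) <= present
        )
    return out

res_a = original(mydata, lookup)
res_b = compact(mydata, lookup)
print(len(res_a), len(res_b))
```

Comparing the two result lists element by element with np.array_equal confirms that a rewrite has not changed the output before any speed comparison is made.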

Why can't I import my module into my main code?

I wrote some code (named exercise 2) where I define a function (named is_divisible) and it has worked perfectly.
Afterwards to learn how to import functions, I wrote the same code but without the defined function, and created a second module (named is_divisible). But whenever I import this module into the original "exercise 2" I get
No module named 'is_divisible'
I have checked that both Python files are in the same folder and that the file name is correct, and I know the code is well written because it has worked before and it comes from a lecturer of mine. I have also tried naming the module and the function differently and writing instead:
from divis import is_divisible
but this was also unsuccessful.
Where am I going wrong? I will leave the code below:
import random
import math
import numpy as np
random_list=[]
for i in range (0,5):
    r=random.randint(0,10)
    random_list.append(r)
print(random_list) #five numbers from 0 to 10 are chosen and appended to a list
new_result=[print('right') for x in random_list if round(np.cosh(x)**2 - np.sinh(x)**2,2) == 1]
#checking the numbers following a maths rule
import is_divisible #trying to import the function is_divisible
divisor=3
idx = is_divisible(random_list, divisor)
for i in idx:
    print(f'Value {random_list[i]} (at index {i}) is divisible by {divisor}')
the code for the function is_divisible is:
def is_divisible(x, n):
    """ Find the indices of x where the element is exactly divisible by n.
    Arguments:
    x - list of numbers to test
    n - single divisor
    Returns a list of the indices of x for which the value of the element is
    divisible by n (to a precision of 1e-6 in the case of floats).
    Example:
    >>> is_divisible([3, 1, 3.1415, 6, 7.5], 3)
    [0, 3]
    """
    r = []
    small = 1e-6
    for i, m in enumerate(x):
        if m % n < small:
            r.append(i)
    return r
I know this question has been answered multiple times, but none of the answers seem to work for me or maybe I am not doing it correctly.
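Generally, when you write import <module>, the module name is the name of the file without the .py extension. So if the function is_divisible is inside a Python file named a.py, you import it with from a import is_divisible. If you would instead like to import the whole module, write import a (not import a.py) and call the function as a.is_divisible(random_list, divisor). Note that in your code import is_divisible binds the module, so is_divisible(random_list, divisor) tries to call the module itself; use is_divisible.is_divisible(random_list, divisor) or from is_divisible import is_divisible instead.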
You should also make sure that both files are in the same folder.
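The naming rule can be tried in isolation. This is a self-contained sketch under a few assumptions: it writes a small module file named divis.py (the name used in the question's second attempt) into a temporary folder, adds that folder to sys.path by hand, and imports it with importlib. In a normal project the script's own folder is already on sys.path, so a plain import divis is enough. The one-line function body is a shortened stand-in for the lecturer's version.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Shortened stand-in for the is_divisible function shown in the question
module_source = '''
def is_divisible(x, n):
    small = 1e-6
    return [i for i, m in enumerate(x) if m % n < small]
'''

with tempfile.TemporaryDirectory() as folder:
    # The file is divis.py, so the module name is "divis" (no extension).
    Path(folder, "divis.py").write_text(module_source)
    sys.path.insert(0, folder)
    divis = importlib.import_module("divis")  # equivalent to: import divis
    sys.path.remove(folder)

# The function lives inside the module, so it is called through it.
result = divis.is_divisible([3, 1, 3.1415, 6, 7.5], 3)
print(result)  # [0, 3]
```

If the import still fails with "No module named ...", printing sys.path from the failing script shows which folders Python actually searches.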

What is the equivalent way of doing this type of pythonic vectorized assignment in MATLAB?

I'm trying to translate this line of code from Python to MATLAB:
new_img[M[0, :] - corners[0][0], M[1, :] - corners[1][0], :] = img[T[0, :], T[1, :], :]
So, naturally, I wrote something like this:
new_img(M(1,:)-corners(2,1),M(2,:)-corners(2,2),:) = img(T(1,:),T(2,:),:);
But it gives me the following error when it reaches that line:
Requested 106275x106275x3 (252.4GB) array exceeds maximum array size
preference. Creation of arrays greater than this limit may take a long
time and cause MATLAB to become unresponsive. See array size limit or
preference panel for more information.
This has made me believe that it is not assigning things correctly. Img is at most a 1000 × 1500 RGB image. The same code works in less than 5 seconds in Python. How can I do vector assignment like the code in the first line in MATLAB?
By the way, I didn't paste all lines of my code for this post not to get too long. If I need to add anything else, please let me know.
Edit:
Here's an explanation of what I want my code to do (basically, this is what the Python code does):
Consider this line of code. It's not a real MATLAB code, I'm just trying to explain what I want to do:
A([2 3 5], [1 3 5]) = B([1 2 3], [2 4 6])
It is interpreted like this:
A(2,1) = B(1,2)
A(3,1) = B(2,2)
A(5,1) = B(3,2)
A(2,3) = B(1,4)
A(3,3) = B(2,4)
A(5,3) = B(3,4)
...
...
...
Instead, I want it to be interpreted like this:
A(2,1) = B(1,2)
A(3,3) = B(2,4)
A(5,5) = B(3,6)
When you do A[vector1, vector2] in Python, you index the set:
A[vector1[0], vector2[0]]
A[vector1[1], vector2[1]]
A[vector1[2], vector2[2]]
A[vector1[3], vector2[3]]
...
In MATLAB, the similar-looking A(vector1, vector2) instead indexes the set:
A(vector1(1), vector2(1))
A(vector1(1), vector2(2))
A(vector1(1), vector2(3))
A(vector1(1), vector2(4))
...
A(vector1(2), vector2(1))
A(vector1(2), vector2(2))
A(vector1(2), vector2(3))
A(vector1(2), vector2(4))
...
That is, you get each combination of indices. You should think of it as a sub-array composed of the rows and columns specified in the two vectors.
To accomplish the same as the Python code, you need to use linear indexing:
index = sub2ind(size(A), vector1, vector2);
A(index)
Thus, your MATLAB code should do:
index1 = sub2ind(size(new_img), M(1,:)-corners(2,1), M(2,:)-corners(2,2));
index2 = sub2ind(size(img), T(1,:), T(2,:));
% these indices are for first 2 dims only, need to index in 3rd dim also:
offset1 = size(new_img,1) * size(new_img,2);
offset2 = size(img,1) * size(img,2);
index1 = index1.' + offset1 * (0:size(new_img,3)-1);
index2 = index2.' + offset2 * (0:size(new_img,3)-1);
new_img(index1) = img(index2);
What the middle block does here is add linear indexes for the same elements along the 3rd dimension. If ii is the linear index to an element in the first channel, then ii + offset1 is an index to the same element in the second channel, and ii + 2*offset1 is an index to the same element in the third channel, etc. So here we're generating indices to all those matrix elements. The + operation is doing implicit singleton expansion (what they call "broadcasting" in Python). If you have an older version of MATLAB (before R2016b) this will fail, and you need to replace that A+B with bsxfun(@plus,A,B).
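The two indexing behaviours described above can also be seen from the NumPy side. The sketch below uses a made-up 6 x 6 array and index vectors: A[rows, cols] pairs the indices element by element, np.ix_ reproduces the MATLAB-style "every combination" sub-array, and np.ravel_multi_index plays a role similar to sub2ind (with row-major order and 0-based indices, unlike MATLAB).

```python
import numpy as np

A = np.arange(36).reshape(6, 6)
rows = np.array([1, 2, 4])
cols = np.array([0, 2, 4])

# Paired indexing: picks elements (1,0), (2,2), (4,4)
paired = A[rows, cols]

# Outer-product indexing: 3x3 sub-array of every row/column combination,
# which is what MATLAB's A(vector1, vector2) does
grid = A[np.ix_(rows, cols)]

# Linear indices into the flattened array, analogous to sub2ind
flat = np.ravel_multi_index((rows, cols), A.shape)

print(paired)      # [ 6 14 28]
print(grid.shape)  # (3, 3)
```

The paired values sit on the diagonal of the outer-product sub-array, which is why the MATLAB translation needs linear indices rather than two index vectors.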

python sparse matrix creation parallelize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by feature IDs and their scores.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each pair after it, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json  # needed for json.dump below
import numpy as np
import tables as tb  # PyTables, needed for tb.open_file below
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? maybe using MapReduce? What is wrong in this function that makes it too slow?
The time goes into I/O, conversions (to and from str, in one case converting the same variable to str twice), splits, and explicit loops. There is also the csv module in the standard library, which may be able to parse your input file (I assume a space is the delimiter); you can experiment with it. I also see that you convert element[0] to both int and str, which is bad because it creates many temporary objects. If you call this function several times, you may be able to reuse some internal objects (an array?). You could also try another style, with map or a list comprehension, but that needs experimenting.
The general idea of Python code optimization is to avoid executing explicit Python byte-code and to prefer native/C functions wherever possible. And do try to get rid of so many conversions. Also, if the input file is yours, you can write it with fixed-length fields, which lets you avoid split/parse entirely and use plain string indexing.
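One way to act on the "fewer conversions" advice is to parse each line once and collect (row, column, value) triplets in plain lists. This is a rough sketch and not the asker's code: parse_films and selected_columns are hypothetical names, selected_columns is assumed to map an integer feature ID straight to its output column, and the input is assumed to contain no blank lines.

```python
def parse_films(lines, selected_columns):
    """Turn 'film_id feat:score ...' lines into COO-style triplets.

    selected_columns maps a feature ID (int) to its output column index.
    """
    film_ids, rows, cols, vals = [], [], [], []
    for row, line in enumerate(lines):
        film_id, *pairs = line.split()
        film_ids.append(film_id)
        for pair in pairs:
            feature, score = pair.split(":")
            # One int conversion and one dict lookup per pair
            col = selected_columns.get(int(feature))
            if col is not None:
                rows.append(row)
                cols.append(col)
                vals.append(float(score))
    return film_ids, rows, cols, vals

# The sample line from the question, with a made-up feature selection
lines = ["6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707"]
selected_columns = {4: 0, 9: 1, 24: 2}
film_ids, rows, cols, vals = parse_films(lines, selected_columns)
print(film_ids, rows, cols, vals)
```

The three lists can then be handed to scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_films, n_columns)) in a single call, instead of assigning into the matrix one element at a time.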
