Let's say I have some arrays/lists that contain a lot of values, so loading several of them into memory at once would eventually raise a memory error. One way to circumvent this is to load these arrays/lists into a generator and use them only when needed. However, with generators you don't have as much control as with arrays/lists, and that is my problem.
Let me explain.
As an example I have the following code, which produces a generator with some small lists. So yeah, this is not memory intensive at all, just an example:
import numpy as np
np.random.seed(10)
number_of_lists = range(0, 5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
If I iterate over this list I get the following:
for i in generator_list:
    print(i)
>> [9 4 0 1 9 0 1 8 9 0]
>> [8 6 4 3 0 4 6 8 1 8]
>> [4 1 3 6 5 3 9 6 9 1]
>> [9 4 2 6 7 8 8 9 2 0]
>> [6 7 8 1 7 1 4 0 8 5]
What I would like to do is sum element wise for all the lists (axis = 0). So the above should in turn result in:
[36, 22, 17, 17, 28, 16, 28, 31, 29, 14]
To do this I could use the following:
total = np.zeros(10, dtype=int)
for i in generator_list:
    total += i
where 10 is the length of one of the lists.
So far so good. I am not sure if there is a better/more optimized way of doing it, but it works.
My problem is that I would like to choose which lists in generator_list to use. For example, what if I wanted to sum the first list (index 0) twice, the third list once, and the last list twice, i.e.:
[9 4 0 1 9 0 1 8 9 0]
[9 4 0 1 9 0 1 8 9 0]
[4 1 3 6 5 3 9 6 9 1]
[6 7 8 1 7 1 4 0 8 5]
[6 7 8 1 7 1 4 0 8 5]
>> [34, 23, 19, 10, 37, 5, 19, 22, 43, 11]
How would I go about doing that?
And before any questions arise why I want to do it this way, the reason is that in my real case, getting the arrays into the generator takes some time. I could then in principle just generate a new generator where I put in the order of lists as seen in the new list, but again, that would mean I would have to wait to get them in a new generator. And if this is to happen thousands of times (as seen with bootstrapping), well, it would take some time. With the first generator I have ALL lists that are available. Now I just wish to use them selectively so I don't have to create a new generator every time I want to mix it up, and sum a new set of arrays/lists.
import numpy as np
np.random.seed(10)
number_of_lists = range(5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
indices = [0, 0, 2, 4, 4]
assert sorted(indices) == indices, "only works for sorted list"
# sum_ = [0] * 10
# I prefer this:
sum_ = np.zeros((10,), dtype=int)
generator_index = -1
for index in indices:
    while generator_index < index:
        vector = next(generator_list)
        generator_index += 1
    sum_ += vector
print(sum_)
outputs
[34 23 19 10 37 5 19 22 43 11]
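The loop above needs the indices to be sorted. As a possible alternative, here is a minimal sketch that counts how often each position is requested and weights each array by that count in a single pass, so the order of the indices no longer matters. The helper name weighted_sum and the hard-coded rows (copied from the printed output in the question) are assumptions for illustration only.

```python
from collections import Counter

import numpy as np


def weighted_sum(arrays, indices):
    """Sum arrays[i] once for every occurrence of i in indices (single pass)."""
    counts = Counter(indices)  # position -> how many times it is requested
    total = None
    for position, vector in enumerate(arrays):
        weight = counts.get(position, 0)
        if weight == 0:
            continue  # this array is not part of the selection
        contribution = weight * np.asarray(vector)
        total = contribution if total is None else total + contribution
    return total


# Rows copied from the question's printed output (illustrative data).
rows = [
    [9, 4, 0, 1, 9, 0, 1, 8, 9, 0],
    [8, 6, 4, 3, 0, 4, 6, 8, 1, 8],
    [4, 1, 3, 6, 5, 3, 9, 6, 9, 1],
    [9, 4, 2, 6, 7, 8, 8, 9, 2, 0],
    [6, 7, 8, 1, 7, 1, 4, 0, 8, 5],
]
generator_list = (np.array(row) for row in rows)

result = weighted_sum(generator_list, [4, 0, 2, 0, 4])  # unsorted on purpose
print(result)
```

A generator can only be consumed once, so each new selection still needs a fresh generator (or a function that creates one).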
Pretty new to Python. I'm trying to create a function which should look at a csv file, with an ID number, Name, and then N columns of numbers from different tests and then scale/round the numbers so they can be compared to the Danish grading system from [-3, 00, 02, 4, 7, 10, 12].
My script below does exactly that, but my function only returns the last result of the DF.
Here's the CSV, I use for testing:
StudentID,Name,Assignment1,Assignment2,Assignment3
s123456,Michael Andersen,7,5,4
s123789,Bettina Petersen,12,3,10
s123468,Thomas Nielsen,-3,7,2
s123579,Marie Hansen,10,12,12
s123579,Marie Hansen,10,12,12
s127848, Andreas Nielsen,2,2,2
s120799, Mads Westergaard,12,12,10
It's worth mentioning that I need these functions to be separate for my main script.
I've made a simple function which loads the file using pandas:
import pandas as pd
def dataLoad(filename):
    grades = pd.read_csv(filename)
    return grades
Then I've written this script for rounding the numbers:
# Importing modules
import pandas as pd
import numpy as np
# Loading in the function dataLoad
from dataLoad import dataLoad
# Defining my data with the function
grades = dataLoad('Karakterer.csv')
def roundGrade(grades):
    # Dropping the first two columns of the pd.DF
    grades = grades.drop(['StudentID', 'Name'], axis=1)
    # Making the pd.DF into a numpy array
    sample_grades = np.array(grades)
    # Setting the parameters of the scale to round to
    grade_Scale = np.array([-3, 0, 2, 4, 7, 10, 12])
    # Defining i, so i gets gradually bigger with each cycle
    i = 0
    # Making a for loop, which rounds every number in every row of the given array
    for i in range(0, len(grades)):
        grouped = [min(grade_Scale, key=lambda x: abs(grade - x)) for grade in sample_grades[i, :]]
        # Making i 1 bigger for each cycle
        i = i + 1
    return grouped
Tell me if you need more information about the script. Cheers, guys!
To improve performance, use numpy:
# assign the output to df instead of grades so the values can be assigned back in the last step
df = dataLoad('Karakterer.csv')
grades = df.drop(['StudentID','Name'],axis=1)
grade_Scale = np.array([-3,0,2,4,7,10,12])
print (grades)
   Assignment1  Assignment2  Assignment3
0            7            5            4
1           12            3           10
2           -3            7            2
3           10           12           12
4           10           12           12
5            2            2            2
6           12           12           10
arr = grades.values
a = grade_Scale[np.argmin(np.abs(arr[:,:, None] - grade_Scale[None,:]), axis=2)]
print (a)
[[ 7 4 4]
[12 2 10]
[-3 7 2]
[10 12 12]
[10 12 12]
[ 2 2 2]
[12 12 10]]
Finally, if you need to assign the output back to the columns:
df[grades.columns] = a
print (df)
  StudentID              Name  Assignment1  Assignment2  Assignment3
0   s123456  Michael Andersen            7            4            4
1   s123789  Bettina Petersen           12            2           10
2   s123468    Thomas Nielsen           -3            7            2
3   s123579      Marie Hansen           10           12           12
4   s123579      Marie Hansen           10           12           12
5   s127848   Andreas Nielsen            2            2            2
6   s120799  Mads Westergaard           12           12           10
Explanation:
The same solution is used here, but for multiple columns:
The idea is to compare the 2D array arr, created from all the grade columns of the DataFrame, with the array grade_Scale. Broadcasting lets you build a 3D array of the absolute differences between them:
print (np.abs(arr[:,:, None] - grade_Scale[None,:]))
[[[10 7 5 3 0 3 5]
[ 8 5 3 1 2 5 7]
[ 7 4 2 0 3 6 8]]
[[15 12 10 8 5 2 0]
[ 6 3 1 1 4 7 9]
[13 10 8 6 3 0 2]]
[[ 0 3 5 7 10 13 15]
[10 7 5 3 0 3 5]
[ 5 2 0 2 5 8 10]]
[[13 10 8 6 3 0 2]
[15 12 10 8 5 2 0]
[15 12 10 8 5 2 0]]
[[13 10 8 6 3 0 2]
[15 12 10 8 5 2 0]
[15 12 10 8 5 2 0]]
[[ 5 2 0 2 5 8 10]
[ 5 2 0 2 5 8 10]
[ 5 2 0 2 5 8 10]]
[[15 12 10 8 5 2 0]
[15 12 10 8 5 2 0]
[13 10 8 6 3 0 2]]]
Then get the positions of the minimal values with numpy.argmin along axis=2 (the third axis of the 3D array):
print (np.argmin(np.abs(arr[:,:, None] - grade_Scale[None,:]), axis=2))
[[4 3 3]
[6 2 5]
[0 4 2]
[5 6 6]
[5 6 6]
[2 2 2]
[6 6 5]]
Finally, index grade_Scale with those positions to get the values:
print (grade_Scale[np.argmin(np.abs(arr[:,:, None] - grade_Scale[None,:]), axis=2)])
[[ 7 4 4]
[12 2 10]
[-3 7 2]
[10 12 12]
[10 12 12]
[ 2 2 2]
[12 12 10]]
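The three steps above can be wrapped into one reusable function. This is only a sketch: the name round_to_scale and its default scale argument are assumptions, not part of the original script.

```python
import numpy as np


def round_to_scale(values, scale=(-3, 0, 2, 4, 7, 10, 12)):
    """Replace every entry of a 2D array by the nearest value in scale."""
    values = np.asarray(values)
    scale = np.asarray(scale)
    # Shape (rows, cols, len(scale)): distance of every grade to every scale value.
    distances = np.abs(values[:, :, None] - scale[None, None, :])
    # Position of the smallest distance along the last axis.
    # argmin returns the first minimum, so ties go to the earlier scale value.
    nearest = np.argmin(distances, axis=2)
    return scale[nearest]


rounded = round_to_scale([[7, 5, 4], [12, 3, 10]])
print(rounded)
```

Because the function takes and returns plain arrays, it can live in its own module next to dataLoad and be called on grades.values.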
You are re-assigning the new calculated value to grouped in every iteration. One way to handle that is to declare a variable and append,
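def roundGrade(grades):
    # Scale to round to, and the grades as a numpy array
    grade_Scale = [-3, 0, 2, 4, 7, 10, 12]
    sample_grades = np.array(grades)
    grouped = []
    for i in range(len(sample_grades)):
        # Append the rounded row instead of overwriting grouped
        grouped.append([min(grade_Scale, key=lambda x: abs(grade - x)) for grade in sample_grades[i, :]])
    return grouped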
Now call the function,
roundGrade(np.array([[ 7, 5, 4],
[12, 3, 10]]))
[[7, 4, 4], [12, 2, 10]]
Let's say I have this array:
np.arange(9)
[0 1 2 3 4 5 6 7 8]
I would like to shuffle the elements with np.random.shuffle but certain numbers have to be in the original order.
I want that 0, 1, 2 have the original order.
I want that 3, 4, 5 have the original order.
And I want that 6, 7, 8 have the original order.
The number of elements in the array would be multiple of 3.
For example, some possible outputs would be:
[ 3 4 5 0 1 2 6 7 8]
[ 0 1 2 6 7 8 3 4 5]
But this one:
[2 1 0 3 4 5 6 7 8]
Would not be valid because 0, 1, 2 are not in the original order
I think that maybe zip() could be useful here, but I'm not sure.
Short solution using numpy.random.shuffle and numpy.ndarray.flatten functions:
arr = np.arange(9)
arr_reshaped = arr.reshape((3,3)) # reshaping the input array to size 3x3
np.random.shuffle(arr_reshaped)
result = arr_reshaped.flatten()
print(result)
One possible random result:
[3 4 5 0 1 2 6 7 8]
Naive approach:
num_indices = len(array_to_shuffle) // 3  # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
shuffled_array = np.empty_like(array_to_shuffle)
cur_idx = 0
for idx in indices:
    shuffled_array[cur_idx:cur_idx+3] = array_to_shuffle[idx*3:(idx+1)*3]
    cur_idx += 3
Faster (and cleaner) option:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
tmp = array_to_shuffle.reshape([-1,3])
tmp = tmp[indices,:]
shuffled_array = tmp.reshape([-1])  # keep the result; reshape does not modify tmp in place
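With the newer Generator API the same block shuffle fits in one line. A minimal sketch, assuming the array length is a multiple of the block size; shuffle_blocks and its parameters are hypothetical names:

```python
import numpy as np


def shuffle_blocks(arr, block_size=3, rng=None):
    """Shuffle consecutive blocks of block_size elements, keeping order inside each block."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = np.asarray(arr).reshape(-1, block_size)  # one row per block
    # permutation() returns a shuffled copy along the first axis, so rows stay intact.
    return rng.permutation(blocks).ravel()


shuffled = shuffle_blocks(np.arange(9), rng=np.random.default_rng(0))
print(shuffled)
```

Passing a seeded generator, as in the call above, makes the result reproducible; leaving rng out gives a different order on each run.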
I have a very large dataframe
in>> all_data.shape
out>> (228714, 436)
What I would like to do efficiently is multiply many of the columns together. I started with a for loop and a list of columns; the most efficient way I have found is
from itertools import combinations
newcolnames=list(all_data.columns.values)
newcolnames=newcolnames[0:87]
#make cross products (the columns I want to operate on are the first 87)
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1,c2)] = all_data[c1] * all_data[c2]
The problem as one may guess is I have 87 columns which would give on the order of 3800 new columns (yes this is what I intended). Both my jupyter notebook and ipython shell choke on this calculation. I need to figure a better way to undertake this multiplication.
Is there a more efficient way to vectorize and/or process? Perhaps using a numpy array (my dataframe has been processed and now contains only numbers and NANs, it started with categorical variables).
As you have mentioned NumPy in the question, that might be a viable option here, especially because you might want to work in the 2D space of NumPy instead of 1D columnar processing with pandas. To start off, you can convert the dataframe to a NumPy array with a call to np.array, like so -
arr = np.array(df) # df is the input dataframe
Now, you can get the pairwise combinations of the column IDs, index into the columns, and perform the column-wise multiplications, all in a vectorized manner (newcolnames must hold integer column positions here, since a NumPy array cannot be indexed by column labels), like so -
idx = np.array(list(combinations(newcolnames, 2)))
out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
Sample run -
In [117]: arr = np.random.randint(0,9,(4,8))
     ...: newcolnames = [1,4,5,7]
     ...: for c1, c2 in combinations(newcolnames, 2):
     ...:     print(arr[:,c1] * arr[:,c2])
     ...:
[16 2 4 56]
[64 2 6 16]
[56 3 0 24]
[16 4 24 14]
[14 6 0 21]
[56 6 0 6]
In [118]: idx = np.array(list(combinations(newcolnames, 2)))
...: out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
...:
In [119]: out.T
Out[119]:
array([[16, 2, 4, 56],
[64, 2, 6, 16],
[56, 3, 0, 24],
[16, 4, 24, 14],
[14, 6, 0, 21],
[56, 6, 0, 6]])
Finally, you can create the output dataframe with proper column headers (if needed), like so -
>>> headers = ['{0}*{1}'.format(idx[i,0],idx[i,1]) for i in range(len(idx))]
>>> out_df = pd.DataFrame(out,columns = headers)
>>> df
0 1 2 3 4 5 6 7
0 6 1 1 6 1 5 6 3
1 6 1 2 6 4 3 8 8
2 5 1 4 1 0 6 5 3
3 7 2 0 3 7 0 5 7
>>> out_df
1*4 1*5 1*7 4*5 4*7 5*7
0 1 5 3 5 3 15
1 4 3 8 12 32 24
2 0 6 3 0 0 18
3 14 0 14 0 49 0
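One likely reason the notebook chokes is sheer size: 87 columns give 3741 pairs, and 228714 rows x 3741 columns of float64 come to roughly 6.8 GB. The sketch below is a variant of the approach above that builds the pair positions with np.triu_indices instead of itertools; pairwise_products and the sample names are assumptions for illustration.

```python
import numpy as np


def pairwise_products(arr, names):
    """Multiply every pair of columns of a 2D array; return products and headers."""
    n_cols = arr.shape[1]
    # Upper-triangle positions in row-major order: (0, 1), (0, 2), ..., (1, 2), ...
    # This is the same order itertools.combinations(range(n_cols), 2) would give.
    left, right = np.triu_indices(n_cols, k=1)
    products = arr[:, left] * arr[:, right]
    headers = ['{0}*{1}'.format(names[i], names[j]) for i, j in zip(left, right)]
    return products, headers


sample = np.array([[1, 2, 3],
                   [4, 5, 6]])
products, headers = pairwise_products(sample, ['a', 'b', 'c'])
print(headers)
print(products)
```

Casting the array to a smaller dtype such as float32 before multiplying would halve that footprint, at the cost of precision.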
You can try the df.eval() method:
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1,c2)] = all_data.eval('{} * {}'.format(c1, c2))
I have these lines of code that create a list (with a varying number of elements in it), and I want to write it to an output file. The problem is in
outfile.write('%i ?????' % (bn, crealines[bn]))
I don't know exactly how to write the format string, since the number of values in the output varies.
Is there any way of writing output with a varying number of columns?
*I looked at this: Increasing variables and numbers by one each time (python) ... but in my case they don't increase one by one.
Also, can I print a list without the square brackets?
The code is like this:
# In this case I am creating a "cube" (matrix) of 3x3x3
nx = ny = nz = 3
vec = []
crealines = []
outfile = open('test.txt', 'a')
for bn in arange(nx*ny*nz):
    vec = neighboringcubes(bn,nx,ny,nz) #this is a defined function to see which cubes are neighbors to the cube "bn"
    crealines.append(vec)
    print bn, crealines[bn]
    outfile.write('%i, %i ....' % (bn, crealines[bn]))
outfile.close()
using print it gives me this (which is correct):
0 0 0 <---- this is the output from function neighboringcubes() -which I don't need-
0 [1, 3, 9] <---- THIS IS WHAT I WANT WRITTEN IN THE OUTPUTFILE
1 0 0
1 [2, 0, 4, 10]
2 0 0
2 [1, 5, 11]
0 1 0
3 [4, 6, 0, 12]
1 1 0
4 [5, 3, 7, 1, 13] <--- BUT YOU CAN SEE IT CHANGES
2 1 0
5 [4, 8, 2, 14]
0 2 0
6 [7, 3, 15]
1 2 0
7 [8, 6, 4, 16]
2 2 0
8 [7, 5, 17]
0 0 1
9 [10, 12, 18, 0]
1 0 1
10 [11, 9, 13, 19, 1]
...
I want the outfile to have in the first column the number of the cube, and the following columns -from lower to higher- the neighbors; like this:
0 1 3 9
1 0 2 4 10
2 1 5 11
3 0 4 6 12
4 1 3 5 7 13
5 2 4 8 14
6 3 7 15
7 4 6 8 16
8 5 7 17
9 0 10 12 18
...
Your question isn't quite clear to me, but I believe you want to print the variable bn followed by its neighbors in sorted order. If so, this code snippet illustrates how to do that:
>>> bn = 5
>>> neighbors = [10, 12, 2, 4]
>>> print bn, ' '.join(map(str, sorted(neighbors)))
Which results in this output:
5 2 4 10 12
A few suggestions, depending on what exactly you want (here they all produce the same output, but they may behave differently depending on the data):
bn = 5
neighbours = [8, 12, -1, 4]
print "{} [{}]".format(bn, ', '.join(map(str, sorted(neighbours))))
print bn, repr(sorted(neighbours))
print bn, str(sorted(neighbours))
output:
5 [-1, 4, 8, 12]
5 [-1, 4, 8, 12]
5 [-1, 4, 8, 12]
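You could sort the list in place with crealines[bn].sort() before printing, or use print bn, sorted(crealines[bn]). Note that list.sort() returns None, so print bn, crealines[bn].sort() would not print the list. (I can't test your code. Where is the function arange imported from?)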
Thanks @jaime!
This solved the format problem:
print "{} {}".format(bn, ' '.join(map(str,crealines[bn])))
I sorted neighbors outside using vec=sorted(neighbors), then crealines[bn] is already sorted.
The output looks like this
5 2 4 8 14
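For reference, a Python 3 sketch of the same idea that writes the rows to a file. The dictionary below stands in for the output of neighboringcubes() (values copied from the printed output in the question), and format_row and the io.StringIO target are assumptions used to keep the example self-contained.

```python
import io


def format_row(bn, neighbors):
    """Return the cube number followed by its sorted neighbors, space separated."""
    return ' '.join(str(value) for value in [bn] + sorted(neighbors))


# Stand-in for neighboringcubes(): cube number -> neighbors (from the question's output).
crealines = {0: [1, 3, 9], 1: [2, 0, 4, 10], 4: [5, 3, 7, 1, 13]}

outfile = io.StringIO()  # swap for open('test.txt', 'a') to write a real file
for bn in sorted(crealines):
    outfile.write(format_row(bn, crealines[bn]) + '\n')

print(outfile.getvalue(), end='')
```

Joining the values as strings sidesteps the format-string problem entirely, because no fixed number of %i placeholders is needed.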
There are 2D arrays of numbers, produced as outputs of some numerical processes, with shapes 1x1, 3x3, 5x5, ... that correspond to different resolutions.
At one stage an average, i.e., a single 2D array of shape nxn, needs to be produced.
If the outputs all had the same shape, say 11x11, the solution would be obvious:
element_wise_mean_of_all_arrays.
For the problem in this post, however, the arrays have different shapes, so the obvious way does not work!
I thought the kron function might help, but it didn't. For example, if an array has shape 17x17, how can I make it 21x21? The same goes for all the others, from 1x1, 3x3, ..., to build an array of constant shape, say 21x21.
The arrays can also be either smaller or bigger than the target shape, e.g. a 31x31 array has to be shrunk to 21x21.
You can think of the problem as the very common task of shrinking or enlarging images.
What are possible efficient approaches to do the same job on 2D arrays, in Python, using numpy, scipy, etc.?
Updates:
Here is a slightly optimized version of the accepted answer below:
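def resize(X, shape=None):
    if shape is None:
        return X
    m, n = shape
    Y = np.zeros((m, n), dtype=X.dtype)
    k = len(X)  # assumes X is square
    p, q = k / m, k / n  # true division (Python 3)
    for i in range(m):
        # truncate the fractional positions to integer indices
        Y[i, :] = X[int(i * p), (np.arange(n) * q).astype(int)]
    return Y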
It works perfectly; however, do you all agree it is the best choice in terms of efficiency? If not, can it be improved?
# Expanding ---------------------------------
>>> X = np.array([[1,2,3],[4,5,6],[7,8,9]])
[[1 2 3]
[4 5 6]
[7 8 9]]
>>> resize(X,[7,11])
[[1 1 1 1 2 2 2 2 3 3 3]
[1 1 1 1 2 2 2 2 3 3 3]
[1 1 1 1 2 2 2 2 3 3 3]
[4 4 4 4 5 5 5 5 6 6 6]
[4 4 4 4 5 5 5 5 6 6 6]
[7 7 7 7 8 8 8 8 9 9 9]
[7 7 7 7 8 8 8 8 9 9 9]]
# Shrinking ---------------------------------
>>> X = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
>>> resize(X,(2,2))
[[ 1 3]
[ 9 11]]
Final note: the code above could easily be translated to Fortran for the highest possible performance.
I'm not sure I understand exactly what you are trying to do, but if it is what I think, the simplest way would be:
import numpy
wanted_size = 21
a = numpy.array([[1,2,3],[4,5,6],[7,8,9]])
b = numpy.zeros((wanted_size, wanted_size))
for i in range(wanted_size):
    for j in range(wanted_size):
        # integer division so the results are valid indices
        idx1 = i * len(a) // wanted_size
        idx2 = j * len(a) // wanted_size
        b[i][j] = a[idx1][idx2]
You could maybe replace the b[i][j] = a[idx1][idx2] with some custom function like the average of a 3x3 matrix centered in a[idx1][idx2] or some interpolation function.
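The double loop can be replaced by index arrays. A minimal sketch, assuming plain sampling without interpolation is acceptable (each target cell copies one source cell); resize_sample is a hypothetical name, and averaging the resized arrays with np.mean is one possible way to get the element-wise mean asked about in the question.

```python
import numpy as np


def resize_sample(X, shape):
    """Resize a 2D array to shape by copying source cells (no interpolation)."""
    m, n = shape
    # Map every target row/column to a source row/column using integer arithmetic.
    rows = np.arange(m) * X.shape[0] // m
    cols = np.arange(n) * X.shape[1] // n
    # np.ix_ builds the open mesh that selects all row/column combinations at once.
    return X[np.ix_(rows, cols)]


small = np.array([[2]])
big = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12],
                [13, 14, 15, 16]])

# Bring both arrays to a common 2x2 shape, then average element-wise.
resized = [resize_sample(a, (2, 2)) for a in (small, big)]
mean = np.mean(resized, axis=0)
print(mean)
```

Unlike the version that uses len(X) for both axes, this sketch reads the row and column counts separately, so it should also handle non-square inputs.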